Statistics in Musicology (Interdisciplinary Statistics)


Author: Jan Beran


Interdisciplinary Statistics

STATISTICS in MUSICOLOGY

Jan Beran

CHAPMAN & HALL/CRC A CRC Press Company Boca Raton London New York Washington, D.C. ©2004 CRC Press LLC


Library of Congress Cataloging-in-Publication Data
Beran, Jan, 1959-
Statistics in musicology / Jan Beran.
p. cm. -- (Interdisciplinary statistics series)
Includes bibliographical references (p. ) and indexes.
ISBN 1-58488-219-0 (alk. paper)
1. Musical analysis--Statistical methods. I. Title. II. Interdisciplinary statistics
MT6.B344 2003
781.2--dc21 2003048488

This book contains information obtained from authentic and highly regarded sources. Reprinted material is quoted with permission, and sources are indicated. A wide variety of references are listed. Reasonable efforts have been made to publish reliable data and information, but the author and the publisher cannot assume responsibility for the validity of all materials or for the consequences of their use. Neither this book nor any part may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, microfilming, and recording, or by any information storage or retrieval system, without prior permission in writing from the publisher. The consent of CRC Press LLC does not extend to copying for general distribution, for promotion, for creating new works, or for resale. Specific permission must be obtained in writing from CRC Press LLC for such copying. Direct all inquiries to CRC Press LLC, 2000 N.W. Corporate Blvd., Boca Raton, Florida 33431. Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation, without intent to infringe.

Visit the CRC Press Web site at www.crcpress.com © 2004 by Chapman & Hall/CRC No claim to original U.S. Government works International Standard Book Number 1-58488-219-0 Library of Congress Card Number 2003048488 Printed in the United States of America 1 2 3 4 5 6 7 8 9 0 Printed on acid-free paper



Contents

Preface

1 Some mathematical foundations of music
1.1 General background
1.2 Some elements of algebra
1.3 Specific applications in music

2 Exploratory data mining in musical spaces
2.1 Musical motivation
2.2 Some descriptive statistics and plots for univariate data
2.3 Specific applications in music – univariate
2.4 Some descriptive statistics and plots for bivariate data
2.5 Specific applications in music – bivariate
2.6 Some multivariate descriptive displays
2.7 Specific applications in music – multivariate

3 Global measures of structure and randomness
3.1 Musical motivation
3.2 Basic principles
3.3 Specific applications in music

4 Time series analysis
4.1 Musical motivation
4.2 Basic principles
4.3 Specific applications in music

5 Hierarchical methods
5.1 Musical motivation
5.2 Basic principles
5.3 Specific applications in music

6 Markov chains and hidden Markov models
6.1 Musical motivation
6.2 Basic principles
6.3 Specific applications in music

7 Circular statistics
7.1 Musical motivation
7.2 Basic principles
7.3 Specific applications in music

8 Principal component analysis
8.1 Musical motivation
8.2 Basic principles
8.3 Specific applications in music

9 Discriminant analysis
9.1 Musical motivation
9.2 Basic principles
9.3 Specific applications in music

10 Cluster analysis
10.1 Musical motivation
10.2 Basic principles
10.3 Specific applications in music

11 Multidimensional scaling
11.1 Musical motivation
11.2 Basic principles
11.3 Specific applications in music

List of figures
References


Preface An essential aspect of music is structure. It is therefore not surprising that a connection between music and mathematics was recognized long before our time. Perhaps best known among the ancient “quantitative musicologists” are the Pythagoreans, who found fundamental connections between musical intervals and mathematical ratios. An obvious reason why mathematics comes into play is that a musical performance results in sound waves that can be described by physical equations. Perhaps more interesting, however, is the intrinsic organization of these waves that distinguishes music from “ordinary noise”. Also, since music is intrinsically linked with human perception, emotion, and reflection as well as the human body, the scientific study of music goes far beyond physics. For a deeper understanding of music, a number of different sciences, such as psychology, physiology, history, physics, mathematics, statistics, computer science, semiotics, and of course musicology – to name only a few – need to be combined. This, together with the lack of available data, prevented, until recently, a systematic development of quantitative methods in musicology. In the last few years, the situation has changed dramatically. Collection of quantitative data is no longer a serious problem, and a number of mathematical and statistical methods have been developed that are suitable for analyzing such data. Statistics is likely to play an essential role in future developments of musicology, mainly for the following reasons: a) statistics is concerned with finding structure in data; b) statistical methods and structures are mathematical, and can often be carried over to various types of data – statistics is therefore an ideal interdisciplinary science that can link different scientific disciplines; and c) musical data are massive and complex – and therefore basically useless, unless suitable tools are applied to extract essential features. 
This book is addressed to anybody who is curious about how one may analyze music in a quantitative manner. Clearly, the question of how such an analysis may be done is very complex, and no ultimate answer can be given here. Instead, the book summarizes various ideas that have proven useful in musical analysis and may provide the reader with “food for thought” or inspiration to do his or her own analysis. Specifically, the methods and applications discussed here may be of interest to students and researchers in music, statistics, mathematics, computer science, communication, and engineering.

There is a large variety of statistical methods that can be applied in music. Selected topics are discussed in this book, ranging from simple descriptive statistics to formal modeling by parametric and nonparametric processes. The theoretical foundations of each method are discussed briefly, with references to more detailed literature. The emphasis is on examples that illustrate how to use the results in musical analysis. The methods can be divided into two groups: general classical methods and specific new methods developed to solve particular questions in music. On the one hand, examples illustrate how standard statistical methods can be used to obtain quantitative answers to musicological questions. On the other hand, the development of more specific methodology illustrates how one may design new statistical models to answer specific questions. The data examples are kept simple in order to be understandable without extended musicological terminology. This implies many simplifications from the point of view of music theory, and leaves scope for more sophisticated analysis that may be carried out in future research. Perhaps this book will inspire the reader to join the effort. Chapters are essentially independent to allow selective reading. Since the book describes a large variety of statistical methods in a nutshell, it can be used as a quick reference for applied statistics, with examples from musicology.

I would like to thank the following libraries, institutes, and museums for their permission to print various pictures, manuscripts, facsimiles, and photographs: Zentralbibliothek Zürich (Ruth Häusler, Handschriftenabteilung; Anikó Ladànyi and Michael Kotrba, Graphische Sammlung); Belmont Music Publishers (Anne Wirth); Philippe Gontier, Paris; Österreichische Post AG; Deutsche Post AG; Elisabeth von Janota-Bzowski, Düsseldorf; University Library Heidelberg; Galerie Neuer Meister, Dresden; Robert-Sterl-Haus (K.M. Mieth); Béla Bartók Memorial House (János Szirányi); Frank Martin Society (Maria Martin); Karadar-Bertoldi Ensemble (Prof. Francesco Bertoldi); col legno (Wulf Weinmann). Thanks also to B. Repp for providing us with the tempo data for Schumann’s Träumerei. I would also like to thank numerous colleagues from mathematics, statistics, and musicology who encouraged me to write this book. Finally, I would like to thank my wife and my daughter for their encouragement and support, without which this book could not have been written.

Jan Beran
Konstanz, March 2003


CHAPTER 1

Some mathematical foundations of music

1.1 General background

The study of music by means of mathematics goes back several thousand years. Well documented are, for instance, mathematical and philosophical studies by the Pythagorean school in ancient Greece (see e.g. van der Waerden 1979). Advances in mathematics, computer science, psychology, semiotics, and related fields, together with technological progress (in particular computer technology), led to a revival of quantitative thinking in music in the last two to three decades (see e.g. Archibald 1972, Solomon 1973, Schnitzler 1976, Balzano 1980, Götze and Wille 1985, Lewin 1987, Mazzola 1990a, 2002, Vuza 1991, 1992a,b, 1993, Keil 1991, Lendvai 1993, Lindley and Turner-Smith 1993, Genevois and Orlarey 1997, Johnson 1997; also see Hofstadter 1999, Andreatta et al. 2001, Leyton 2001, and Babbitt 1960, 1961, 1987, Forte 1964, 1973, 1989, Rahn 1980, Morris 1987, 1995, Andreatta 1997; for early accounts of mathematical analysis of music also see Graeser 1924, Perle 1955, Norden 1964). Many recent references can be found in specialized journals such as Computing in Musicology, Music Theory Online, Perspectives of New Music, Journal of New Music Research, Intégral, Music Perception, and Music Theory Spectrum, to name a few.

Music is, to a large extent, the result of a subconscious intuitive “process”. The basic question of quantitative musical analysis is to what extent music may nevertheless be described or explained, at least partially, in a quantitative manner. The German philosopher and mathematician Leibniz (1646-1716) (Figure 1.5) called music the “arithmetic of the soul”. This is a profound philosophical statement; however, the difficulty is to formulate what exactly it may mean. Some composers, notably in the 20th century, consciously used mathematical elements in their compositions. Typical examples are permutations, the golden section, transformations in two or higher-dimensional spaces, random numbers, and fractals (see e.g. Schönberg, Webern, Bartók, Xenakis, Cage, Lutosławski, Eimert, Kagel, Stockhausen, Boulez, Ligeti, Barlow; Figures 1.1, 1.4, 1.15). More generally, conscious “logical” construction is an inherent part of composition. For instance, the forms of sonata and symphony were developed based on reflections about well balanced proportions. The tormenting search for “logical perfection” is well


Figure 1.1 Quantitative analysis of music helps to understand creative processes. (Pierre Boulez, photograph courtesy of Philippe Gontier, Paris; and “Jim” by J.B.)

Figure 1.2 J.S. Bach (1685-1750). (Engraving by L. Sichling after a painting by Elias Gottlob Haussmann, 1746; courtesy of Zentralbibliothek Zürich.)


documented in Beethoven’s famous sketchbooks. Similarly, the art of counterpoint that culminated in J.S. Bach’s (Figure 1.2) work relies to a high degree on intrinsically mathematical principles. A rather peculiar early account of explicit applications of mathematics is the use of permutations in change ringing in English churches since the 10th century (Fletcher 1956, Price 1969, Stewart 1992, White 1983, 1985, 1987, Wilson 1965). More standard are simple symmetries, such as retrograde (e.g. Crab fugue, or Canon cancricans), inversion, arpeggio, or augmentation. A curious example of this sort is Mozart’s “Spiegel Duett” (or mirror duet, Figures 1.6, 1.7; the attribution to Mozart is actually uncertain). In the 20th century, composers such as Messiaen or Xenakis (Xenakis 1971; Figure 1.15) attempted to develop mathematical theories that would lead to new techniques of composition. From a strictly mathematical point of view, their derivations are not always exact. Nevertheless, their artistic contributions were very innovative and inspiring. More recent, mathematically stringent approaches to music theory, or certain aspects of it, are based on modern tools of abstract mathematics, such as algebra, algebraic geometry, and mathematical statistics (see e.g. Reiner 1985, Mazzola 1985, 1990a, 2002, Lewin 1987, Fripertinger 1991, 1999, 2001, Beran and Mazzola 1992, 1999a,b, 2000, Read 1997, Fleischer et al. 2000, Fleischer 2003).

The most obvious connection between music and mathematics is due to the fact that music is communicated in the form of sound waves. Musical sounds can therefore be studied by means of physical equations. Already in ancient Greece (around the 5th century BC), Pythagoreans found the relationship between certain musical intervals and numeric proportions, and calculated intervals of selected scales. These results were probably obtained by studying the vibration of strings. Similar studies were done in other cultures, but are mostly not well documented. In practical terms, these studies led to singling out specific frequencies (or frequency proportions) as “musically useful” and to the development of various scales and harmonic systems. A more systematic approach to the physics of musical sounds, music perception, and acoustics was initiated in the second half of the 19th century by path-breaking contributions by Helmholtz (1863) and other physicists (see e.g. Rayleigh 1896). Since then, a vast amount of knowledge has been accumulated in this field (see e.g. Backus 1969, 1977, Morse and Ingard 1968, 1986, Benade 1976, 1990, Rigden 1977, Yost 1977, Hall 1980, Berg and Stork 1995, Pierce 1983, Cremer 1984, Rossing 1984, 1990, 2000, Johnston 1989, Fletcher and Rossing 1991, Graff 1975, 1991, Roederer 1995, Rossing et al. 1995, Howard and Angus 1996, Beament 1997, Crocker 1998, Nederveen 1998, Orbach 1999, Kinsler et al. 2000, Raichel 2000). For a historical account of musical acoustics see e.g. Bailhache (2001).

It may appear at first that once we have mastered modeling musical sounds by physical equations, music is understood. This is, however, not so. Music is not just an arbitrary collection of sounds: music is “organized sound”.


Figure 1.3 Ludwig van Beethoven (1770-1827). (Drawing by E. Dürck after a painting by J.K. Stieler, 1819; courtesy of Zentralbibliothek Zürich.)

Figure 1.4 Anton Webern (1883-1945). (Courtesy of Österreichische Post AG.)


Figure 1.5 Gottfried Wilhelm Leibniz (1646-1716). (Courtesy of Deutsche Post AG and Elisabeth von Janota-Bzowski.)

Physical equations for sound waves only describe the propagation of air pressure. They do not provide, by themselves, an understanding of how and why certain sounds are connected, nor do they tell us anything (at least not directly) about the effect on the audience. As far as structure is concerned, one may even argue – for the sake of argument – that music does not necessarily need “physical realization” in form of a sound. Musicians are able to hear music just by looking at a score. Beethoven (Figures 1.3, 1.16) composed his ultimate masterpieces after he lost his hearing. Thus, on an abstract level, music can be considered as an organized structure that follows certain laws. This structure may or may not express feelings of the composer. Usually, the structure is communicated to the audience by means of physical sounds – which in turn trigger an emotional experience of the audience (not necessarily identical with the one intended by the composer). The structure itself can be analyzed, at least partially, using suitable mathematical structures. Note, however, that understanding the mathematical structure does not necessarily tell us anything about the effect on the audience. Moreover, any mathematical structure used for analyzing music describes certain selected aspects only. For instance, studying symmetries of motifs in a composition by purely algebraic means ignores psychological, historical, perceptual, and other important issues. Ideally, all relevant scientific disciplines would need to interact to gain a broad understanding. A further complication is that the existence of a unique “truth” is by no means certain (and is in fact rather unlikely). For instance, a composition may contain certain structures that are important for some listeners but are ignored by others. This problem became apparent in the early 20th century with the introduction of 12-tone music. 
The general public was not ready to perceive the complex structures of dodecaphonic music and was rather appalled by the seemingly chaotic noise, whereas a minority of “specialized” listeners was enthusiastic. Another example is the


comparison of performances. Which pianist is the best? This question has no unique answer, if any. There is no fixed gold standard and no unique solution that would represent the ultimate unchangeable truth. What one may hope for at most is a classification into types of performances that are characterized by certain quantifiable properties – without attaching a subjective judgment of “quality”. The main focus of this book is statistics. Statistics is essential for connecting theoretical mathematical concepts with observed “reality”, to find and explore structures empirically and to develop models that can be applied and tested in practice. Until recently, traditional musical analysis was mostly carried out in a purely qualitative, and at least partially subjective, manner. Applications of statistical methods to questions in musicology and performance research are very rare (for examples see Yaglom and Yaglom 1967, Repp 1992, de la Motte-Haber 1996, Steinberg 1995, Waugh 1996, Nettheim 1997, Widmer 2001, Stamatatos and Widmer 2002) and mostly consist of simple applications of standard statistical tools to confirm results or conjectures that had been known or “derived” before by musicological, historic, or psychological reasoning. An interesting overview of statistical applications in music, and many references, can be found in Nettheim (1997). The lack of quantitative analysis may be explained, in part, by the impossibility of collecting “objective” data. Meanwhile, however, due to modern computer technology, an increasing number of musical data are becoming available. An in-depth statistical analysis of music is therefore no longer unrealistic. On the theoretical side, the development of sophisticated mathematical tools such as algebra, algebraic geometry, mathematical statistics, and their adaptation to the specific needs of music theory, made it possible to pursue a more quantitative path. 
Because of the complex, highly organized nature of music, existing, mostly qualitative, knowledge about music must be incorporated into the process of mathematical and statistical modeling. The statistical methods that will be discussed in the subsequent chapters can be divided into two categories:

1. Classical methods of mathematical statistics and exploratory data analysis: many classical methods can be applied to analyze musical structures, provided that suitable data are available. A number of examples will be discussed. The examples are relatively simple from the point of view of musicology, the purpose being to illustrate how the appropriate use of statistics can yield interesting results, and to stimulate the reader to invent his or her own statistical methods that are appropriate for answering specific musicological questions.

2. New methods developed specifically to answer concrete questions in musicology: in the last few years, questions in music composition and performance have led to the development of new statistical methods that are specifically designed to solve questions such as classification of performance styles, identification and modeling of metric, melodic, and harmonic structures, quantification of similarities and differences between compositions and performance styles, automatic identification of musical events and structures from audio signals, etc. Some of these methods will be discussed in detail.

A mathematical discipline that is concerned specifically with abstract definitions of structures is algebra. Some elements of basic algebra are therefore discussed in the next section. Naturally, depending on the context, other mathematical disciplines play an equally important role in musical analysis, and will be discussed later where necessary. Readers who are familiar with modern algebra may skip the following section. A few examples that illustrate applications of algebraic structures to music are presented in Section 1.3. An extended account of mathematical approaches to music based on algebra and algebraic geometry is given, for instance, in Mazzola (1990a, 2002) (also see Lewin 1987 and Benson 1995-2002).

1.2 Some elements of algebra

1.2.1 Motivation

Algebraic considerations in music theory have gained increasing popularity in recent years. The reason is that there are striking similarities between musical and algebraic structures. Why this is so can be illustrated by a simple example: notes (or rather pitches) that differ by an octave can be considered equivalent with respect to their harmonic “meaning”. If an instrument is tuned according to equal temperament, then, from the harmonic perspective, there are only 12 different notes. These can be represented as integers modulo 12. Similarly, there are only 12 different intervals. This means that we are dealing with the set Z_12 = {0, 1, ..., 11}. The sum z = x + y of two elements x, y ∈ Z_12, computed modulo 12, is interpreted as the note/interval resulting from “increasing” the note/interval x by the interval y. The set Z_12 of notes (intervals) is then an additive group (see definition below).
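The pitch-class example can be sketched in a few lines of Python. This is a minimal illustration added here, not part of the original text: the note names, the convention C = 0, and the function names are assumptions. The snippet represents notes as integers modulo 12 and checks by brute force the group properties that are defined formally in the next section.

```python
from itertools import product

# Assumed labelling for illustration only: C = 0, C# = 1, ..., B = 11.
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
Z12 = range(12)


def add(x, y):
    """Raise the pitch class x by the interval y (addition modulo 12)."""
    return (x + y) % 12


def neg(x):
    """Additive inverse of x in Z_12."""
    return (-x) % 12


def is_additive_group():
    """Brute-force check of closure, associativity, zero, inverses, commutativity."""
    closed = all(add(a, b) in Z12 for a, b in product(Z12, repeat=2))
    assoc = all(
        add(add(a, b), c) == add(a, add(b, c)) for a, b, c in product(Z12, repeat=3)
    )
    zero = all(add(0, a) == a == add(a, 0) for a in Z12)
    inverses = all(add(a, neg(a)) == 0 for a in Z12)
    commutative = all(add(a, b) == add(b, a) for a, b in product(Z12, repeat=2))
    return closed and assoc and zero and inverses and commutative


# G (7) raised by a perfect fifth (7 semitones) gives D (2).
fifth_above_g = NOTE_NAMES[add(7, 7)]
```

Working modulo 12 is what identifies notes an octave apart: raising B (11) by one semitone gives C (0) again.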
1.2.2 Definitions and results

We discuss some important concepts of algebra that are useful for describing musical structures. A more comprehensive overview of modern algebra can be found in standard textbooks such as those by Albert (1956), Herstein (1975), Zassenhaus (1999), Gilbert (2002), and Rotman (2002). The most fundamental structures in algebra are group, ring, field, module, and vector space.

Definition 1 Let G be a nonempty set with a binary operation + such that a + b ∈ G for all a, b ∈ G and the following holds:
1. (a + b) + c = a + (b + c) for all a, b, c ∈ G (Associativity)
2. There exists a zero element 0 ∈ G such that 0 + a = a + 0 = a for all a ∈ G
3. For each a ∈ G, there exists an inverse element (−a) ∈ G such that (−a) + a = a + (−a) = 0
Then (G, +) is called a group. The group (G, +) is called commutative (or abelian) if a + b = b + a for all a, b ∈ G. The number of elements in G is called the order of the group and is denoted by o(G). If the order is finite, then G is called a finite group.

In multiplicative notation, this can be written as

Definition 2 Let G be a nonempty set with a binary operation · such that a · b ∈ G for all a, b ∈ G and the following holds:
1. (a · b) · c = a · (b · c) for all a, b, c ∈ G (Associativity)
2. There exists an identity element e ∈ G such that e · a = a · e = a for all a ∈ G
3. For each a ∈ G, there exists an inverse element a^{-1} ∈ G such that a^{-1} · a = a · a^{-1} = e
Then (G, ·) is called a group. The group (G, ·) is called commutative (or abelian) if a · b = b · a for all a, b ∈ G.

For subsets we have

Definition 3 Let (G, ·) and (H, ·) be groups and H ⊂ G. Then H is called a subgroup of G.

Some groups can be generated by a single element of the group:

Definition 4 Let (G, ·) be a group with n < ∞ elements denoted by a_i (i = 0, 1, ..., n − 1) and such that
1. a_0 = a_n = e
2. a_i a_j = a_{i+j} if i + j ≤ n and a_i a_j = a_{i+j−n} if i + j > n
Then G is called a cyclic group. Furthermore, if G = (a) = {a^i : i ∈ Z}, where a^i denotes the product of i factors all equal to a, then a is called a generator of G.

An important notion is given in the following

Definition 5 Let G be a group that “acts” on a set X by assigning to each x ∈ X and g ∈ G an element g(x) ∈ X. Then, for each x ∈ X, the set G(x) = {y : y = g(x), g ∈ G} is called the orbit of x.

Note that, given a group G that acts on X, the set X is partitioned into disjoint orbits. If there are two operations + and ·, then a ring is defined by

Definition 6 Let R be a nonempty set with two binary operations + and · such that the following holds:
1. (R, +) is an abelian group


2. a · b ∈ R for all a, b ∈ R
3. (a · b) · c = a · (b · c) (Associativity)
4. a · (b + c) = a · b + a · c and (b + c) · a = b · a + c · a (Distributive law)
Then (R, +, ·) is called an (associative) ring. If also a · b = b · a for all a, b ∈ R, then R is called a commutative ring.

Further useful definitions are:

Definition 7 Let R be a commutative ring and a ∈ R, a ≠ 0, such that there exists an element b ∈ R, b ≠ 0, with a · b = 0. Then a is called a zero-divisor. If R has no zero-divisors, then it is called an integral domain.

Definition 8 Let R be a ring such that (R \ {0}, ·) is a group. Then R is called a division ring. A commutative division ring is called a field.

A module is defined as follows:

Definition 9 Let (R, +, ·) be a ring and M a nonempty set with a binary operation +. Assume that
1. (M, +) is an abelian group
2. For every r ∈ R, a ∈ M, there exists an element r · a ∈ M
3. r · (a + b) = r · a + r · b for every r ∈ R, a, b ∈ M
4. r · (s · a) = (r · s) · a for every r, s ∈ R, a ∈ M
5. (r + s) · a = r · a + s · a for every r, s ∈ R, a ∈ M
Then M is called an R-module or module over R. If R has a unit element e and if e · a = a for all a ∈ M, then M is called a unital R-module. A unital R-module where R is a field is called a vector space over R.

There is an enormous amount of literature on groups, rings, modules, etc. Some of the standard results are summarized, for instance, in textbooks such as those given above. Here, we cite only a few theorems that are especially useful in music. We start with a few more definitions.

Definition 10 Let H ⊂ G be a subgroup of G such that a · H · a^{-1} ⊂ H for every a ∈ G. Then H is called a normal subgroup of G.

Definition 11 Let G be such that the only normal subgroups are H = G and H = {e}. Then G is called a simple group.

Definition 12 Let G be a group and H_1, ..., H_n normal subgroups such that
G = H_1 · H_2 ··· H_n    (1.1)
and any a ∈ G can be written uniquely as a product
a = b_1 · b_2 ··· b_n    (1.2)
with b_i ∈ H_i. Then G is said to be the (internal) direct product of H_1, ..., H_n.
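To make Definition 12 concrete, here is a small sketch (an illustration added here, not taken from the text) using the additive group Z_12 from Section 1.2.1. It checks that every element of Z_12 can be written in exactly one way as a sum of an element of H_1 = {0, 4, 8} and an element of H_2 = {0, 3, 6, 9}, so that Z_12 is the internal direct product of these two subgroups, written additively. The choice of subgroups and all names are assumptions for illustration; since Z_12 is abelian, every subgroup is normal.

```python
from itertools import product

N = 12
H1 = {0, 4, 8}      # subgroup of order 3 (pitch classes of an augmented triad on C)
H2 = {0, 3, 6, 9}   # subgroup of order 4 (pitch classes of a diminished seventh chord on C)


def is_subgroup(h, n=N):
    """Check that h contains 0 and is closed under addition and negation modulo n."""
    return 0 in h and all((a + b) % n in h and (-a) % n in h for a in h for b in h)


def decompositions(x, n=N):
    """All pairs (b1, b2) with b1 in H1, b2 in H2 and b1 + b2 = x (mod n)."""
    return [
        (b1, b2)
        for b1, b2 in product(sorted(H1), sorted(H2))
        if (b1 + b2) % n == x
    ]


# Uniqueness of the decomposition for every element of Z_12.
unique = all(len(decompositions(x)) == 1 for x in range(N))
```

For instance, `decompositions(7)` contains only the pair (4, 3): the pitch class 7 is reached by a major third followed by a minor third, and by no other combination drawn from the two subgroups.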


Definition 13 Let G_1 and G_2 be two groups, define G = G_1 × G_2 = {(a, b) : a ∈ G_1, b ∈ G_2} and the operation · by (a_1, b_1) · (a_2, b_2) = (a_1 · a_2, b_1 · b_2). Then the group (G, ·) is called the (external) direct product of G_1 and G_2.

Definition 14 Let M be an R-module and M_1, ..., M_n submodules such that every a ∈ M can be written uniquely as a sum
a = a_1 + a_2 + ... + a_n    (1.3)
with a_i ∈ M_i. Then M is said to be the direct sum of M_1, ..., M_n.

We now turn to the question of which subgroups of finite groups exist.

Theorem 1 Let H be a subgroup of a finite group G. Then o(H) is a divisor of o(G).

Theorem 2 (Sylow) Let G be a group and p a prime number such that p^m is a divisor of o(G). Then G has a subgroup H with o(H) = p^m.

Definition 15 A subgroup H ⊂ G with o(H) = p^m, where p^m is a divisor of o(G) but p^{m+1} is not a divisor, is called a p-Sylow subgroup.

The next theorems help to decide whether a ring is a field.

Theorem 3 Let R be a finite integral domain. Then R is a field.

Corollary 1 Let p be a prime number and R = Z_p = {x mod p : x ∈ N} be the set of integers modulo p (with the operations + and · defined accordingly). Then R is a field.

An essential way to compare algebraic structures is in terms of operation-preserving mappings. The following definitions are needed:

Definition 16 Let (G_1, ·) and (G_2, ·) be two groups. A mapping g : G_1 → G_2 such that
g(a · b) = g(a) · g(b)    (1.4)
is called a (group-)homomorphism. If g is a one-to-one (group-)homomorphism, then it is called an isomorphism (or group-isomorphism). Moreover, if G_1 = G_2, then g is called an automorphism (or group-automorphism).

Definition 17 Two groups G_1, G_2 are called isomorphic if there is an isomorphism g : G_1 → G_2.

Analogous definitions can be given for rings and modules:

Definition 18 Let R_1 and R_2 be two rings. A mapping g : R_1 → R_2 such that
g(a + b) = g(a) + g(b)    (1.5)
and
g(a · b) = g(a) · g(b)    (1.6)
is called a (ring-)homomorphism. If g is a one-to-one (ring-)homomorphism, then it is called an isomorphism (or ring-isomorphism). Furthermore, if R_1 = R_2, then g is called an automorphism (or ring-automorphism).
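As an illustration of Definitions 16 and 18 (a sketch added here, with the map chosen as an assumption for illustration), consider g(x) = 7x mod 12 on Z_12, which sends the chromatic scale to the circle of fifths. A brute-force check indicates that g is a one-to-one homomorphism of the additive group (Z_12, +), hence a group automorphism, but that it does not preserve multiplication and is therefore not a ring homomorphism.

```python
N = 12


def g(x):
    """Multiply a pitch class by 7 modulo 12 (chromatic scale -> circle of fifths)."""
    return (7 * x) % N


pairs = [(a, b) for a in range(N) for b in range(N)]

# Condition (1.5), i.e. (1.4) in additive notation: g(a + b) = g(a) + g(b).
preserves_addition = all(g((a + b) % N) == (g(a) + g(b)) % N for a, b in pairs)

# Condition (1.6): g(a * b) = g(a) * g(b).
preserves_multiplication = all(g((a * b) % N) == (g(a) * g(b)) % N for a, b in pairs)

# One-to-one: the 12 images are all different.
one_to_one = len({g(x) for x in range(N)}) == N

# Images of 0, 1, ..., 11: successive fifths C, G, D, A, ...
circle_of_fifths = [g(x) for x in range(N)]
```

The multiplicative condition already fails for a = b = 1, since g(1 · 1) = 7 whereas g(1) · g(1) = 49 = 1 (mod 12).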


Definition 19 Two rings R_1, R_2 are called isomorphic if there is an isomorphism g : R_1 → R_2.

Definition 20 Let M_1 and M_2 be two modules over R. A mapping g : M_1 → M_2 such that for every a, b ∈ M_1, r ∈ R,
g(a + b) = g(a) + g(b)    (1.7)
and
g(r · a) = r · g(a)    (1.8)
is called a (module-)homomorphism (or a linear transformation). If g is a one-to-one (module-)homomorphism, then it is called an isomorphism (or module-isomorphism). Furthermore, if M_1 = M_2, then g is called an automorphism (or module-automorphism).

Definition 21 Two modules M_1, M_2 are called isomorphic if there is an isomorphism g : M_1 → M_2.

Finally, a general family of transformations is defined by

Definition 22 Let g : M_1 → M_2 be a (module-)homomorphism. Then a mapping h : M_1 → M_2 defined by
h(a) = c + g(a)    (1.9)
with c ∈ M_2 is called an affine transformation. If M_1 = M_2 = M, then h is called a symmetry of M. Moreover, if h is invertible, then it is called an invertible symmetry of M.

Studying properties of groups is equivalent to studying groups of automorphisms:

Theorem 4 (Cayley’s theorem) Let G be a group. Then there is a set S such that G is isomorphic to a subgroup of A(S), where A(S) is the set of all one-to-one mappings of S onto itself.

Definition 23 Let S be a finite set with n elements. Then the group (A(S), ◦) (where a ◦ b denotes successive application of the functions a and b) is called the symmetric group of degree n, and is denoted by S_n.

Note that S_n is isomorphic to the group of permutations of the numbers 1, 2, ..., n, and has n! elements. Another important concept is motivated by representation in coordinates, as we are used to from Euclidean geometry. The representation follows since, up to isomorphism, the internal and external direct products can be shown to be equivalent:

Theorem 5 Let G = H_1 · H_2 ··· H_n be the internal direct product of H_1, ..., H_n and G* = H_1 × H_2 × ... × H_n the external direct product. Then G and G* are isomorphic, through the isomorphism g : G* → G defined by g(a_1, ..., a_n) = a_1 · a_2 · ... · a_n.

This theorem implies that one does not need to distinguish between the internal and external direct product. The analogous result holds for modules:


Theorem 6 Let M be a direct sum of M_1, ..., M_n. Then M is isomorphic to the module M* = {(a_1, a_2, ..., a_n) : a_i ∈ M_i} with the operations (a_1, a_2, ...) + (b_1, b_2, ...) = (a_1 + b_1, a_2 + b_2, ...) and r · (a_1, a_2, ...) = (r · a_1, r · a_2, ...).

Thus, a module M = M_1 + M_2 + ... + M_n can be described in terms of its coordinates with respect to M_i (i = 1, ..., n), and the structure of M is known as soon as we know the structure of M_i (i = 1, ..., n). Direct products can be used, in particular, to characterize the structure of finite abelian groups:

Theorem 7 Let (G, ·) be a finite commutative group. Then G is isomorphic to the direct product of its Sylow subgroups.

Theorem 8 Let (G, ·) be a finite commutative group. Then G is the direct product of cyclic groups.

Similar, but slightly more involved, results can be shown for modules, but will not be needed here.

1.3 Specific applications in music

In the following, the usefulness of algebraic structures in music is illustrated by a few selected examples. This is only a small selection from the extended literature on this topic. For further reading see e.g. Graeser (1924), Schönberg (1950), Perle (1955), Fletcher (1956), Babbitt (1960, 1961), Price (1969), Archibald (1972), Halsey and Hewitt (1978), Balzano (1980), Rahn (1980), Götze and Wille (1985), Reiner (1985), Berry (1987), Mazzola (1990a, 2002 and references therein), Vuza (1991, 1992a,b, 1993), Fripertinger (1991), Lendvai (1993), Benson (1995-2002), Read (1997), Noll (1997), Andreatta (1997), Stange-Elbe (2000), among others.

1.3.1 The Mathieu group

It can be shown that finite simple groups fall into families that can be described explicitly, except for 26 so-called sporadic groups. One such group is the so-called Mathieu group M_12, which was discovered by the French mathematician Mathieu in the 19th century (Mathieu 1861, 1873, also see e.g. Conway and Sloane 1988). In their study of probabilistic properties of (card) shuffling, Diaconis et al.
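(1983) show that M12 can be generated by two permutations (which they call Mongean shuffles), namely

\[
\pi_1 = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 & 11 & 12 \\ 7 & 6 & 8 & 5 & 9 & 4 & 10 & 3 & 11 & 2 & 12 & 1 \end{pmatrix} \tag{1.10}
\]

and

\[
\pi_2 = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 & 11 & 12 \\ 6 & 7 & 5 & 8 & 4 & 9 & 3 & 10 & 2 & 11 & 1 & 12 \end{pmatrix} \tag{1.11}
\]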
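The group generated by the two shuffles can be enumerated by brute force. The following is a minimal sketch, assuming the lower rows of (1.10) and (1.11) are 7 6 8 5 9 4 10 3 11 2 12 1 and 6 7 5 8 4 9 3 10 2 11 1 12; the helper names `compose` and `generated_group` are illustrative only. The size of the resulting set can be compared with the order of M12.

```python
from collections import deque

# Lower rows of (1.10) and (1.11): the images of 1, ..., 12 (assumed transcription).
PI1 = (7, 6, 8, 5, 9, 4, 10, 3, 11, 2, 12, 1)
PI2 = (6, 7, 5, 8, 4, 9, 3, 10, 2, 11, 1, 12)


def compose(p, q):
    """Return the permutation 'first q, then p' (both given as image tuples)."""
    return tuple(p[q[i] - 1] for i in range(len(q)))


def generated_group(generators):
    """Breadth-first closure of the generators under composition.

    In a finite group every inverse is a power of the element itself,
    so multiplying by the generators alone reaches the whole group.
    """
    identity = tuple(range(1, len(generators[0]) + 1))
    seen = {identity}
    queue = deque([identity])
    while queue:
        g = queue.popleft()
        for s in generators:
            h = compose(s, g)
            if h not in seen:
                seen.add(h)
                queue.append(h)
    return seen


group = generated_group([PI1, PI2])
print(len(group))  # number of distinct permutations generated
```

The enumeration stores each group element as a tuple, which is feasible here because the group is small compared with the 12! permutations of twelve cards.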

In (1.10) and (1.11), the lower rows denote the images of the numbers 1, ..., 12. The order of this group is o(M12) = 95040 (!). An interesting application of these permutations can be found in Ile de feu 2 by Olivier Messiaen (Berry 1987), where π1 and π2 are used to generate sequences of tones and durations.

1.3.2 Campanology

A rather peculiar example of group theory “in action” (though perhaps rather trivial mathematically) is campanology or change ringing (Fletcher 1956, Wilson 1965, Price 1969, White 1983, 1985, 1987, Stewart 1992). The art of change ringing started in England in the 10th century and is still performed today. The problem to be solved is as follows: there are k swinging bells in the church tower. One starts playing a melody that consists of a certain sequence in which the bells are played, each bell being played only once. Thus, the initial sequence is a permutation of the numbers 1, ..., k. Since it is not interesting to repeat the same melody over and over, the initial melody has to be varied. However, the bells are very heavy, so that it is not easy to change the timing of the bells. Each variation is therefore restricted, in that in each “round” only one pair of adjacent bells can exchange their position. Thus, for instance, if k = 4 and the previous sequence was (1, 2, 3, 4), then the only permissible permutations are (2, 1, 3, 4), (1, 3, 2, 4), and (1, 2, 4, 3). A further, mainly aesthetic restriction is that no sequence should be repeated, except that the last one is identical with the initial sequence. A typical solution to this problem is, for instance, the “Plain Bob” that starts with (1, 2, 3, 4), (2, 1, 4, 3), (2, 4, 1, 3), ... and continues until all permutations in S4 are visited.

1.3.3 Representation of music

Many aspects of music can be “embedded” in a suitable algebraic module (see e.g. Mazzola 1990a). Here are some examples:

1. Apart from glissando effects, the essential frequencies in most types of music are of the form

\[
\omega = \omega_o \prod_{i=1}^{K} p_i^{x_i} \tag{1.12}
\]

where K < ∞, ω_o is a fixed basic frequency, p_i are certain fixed prime numbers and x_i ∈ Q. Thus,

\[
\psi = \log \omega = \psi_o + \sum_{i=1}^{K} x_i \psi_i \tag{1.13}
\]

where ψ_o = log ω_o and ψ_i = log p_i (i ≥ 1). Let $\Psi = \{\psi : \psi = \sum_{i=1}^{K} x_i \psi_i, \; x_i \in Q\}$ be the set of all log-frequencies generated this way. Then Ψ is a module over Q. Two typical examples are:


(a) ω_o = 440 Hz, K = 3, p_1 = 2, p_2 = 3, p_3 = 5: This is the so-called Euler module in which most Western music operates. An important subset consists of frequencies of the just intonation with the pure intervals octave (ratio of frequencies 2), fifth (ratio of frequencies 3/2) and major third (ratio of frequencies 5/4):

\[
\psi = \log \omega = \log 440 + x_1 \log 2 + x_2 \log 3 + x_3 \log 5 \tag{1.14}
\]

(x_i ∈ Z). The notes (frequencies) ψ can then be represented by points in a three-dimensional space of integers Z^3. Note that, using the notation a = (a1, a2, a3) and b = (b1, b2, b3), the pitch obtained by addition c = a + b corresponds to the frequency $\omega_o 2^{a_1+b_1} 3^{a_2+b_2} 5^{a_3+b_3}$.

(b) ω_o = 440 Hz, K = 1, p_1 = 2, and x = p/12, where p ∈ Z: This corresponds to the well-tempered tuning where an octave is divided into 12 equal intervals. Thus, the ratio 2 is decomposed into 12 ratios $\sqrt[12]{2}$, so that

\[
\psi = \log 440 + \frac{p}{12} \log 2 \tag{1.15}
\]

If notes that differ by one or several octaves are considered equivalent, then we can identify the set of notes with the Z-module Z12 = {0, 1, ..., 11}.
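The two examples (a) and (b) can be made concrete with a few lines of code. This is only a sketch: the function names and the coordinate vectors chosen for the fifth and the major third are illustrative assumptions, with ω_o = 440 Hz and the primes 2, 3, 5 as in the text.

```python
from fractions import Fraction

BASE_HZ = 440          # omega_o in the text
PRIMES = (2, 3, 5)     # p_1, p_2, p_3 of the Euler module


def euler_frequency(x):
    """Frequency omega_o * 2^x1 * 3^x2 * 5^x3 for integer coordinates x, as in (1.14)."""
    freq = Fraction(BASE_HZ)
    for p, exponent in zip(PRIMES, x):
        freq *= Fraction(p) ** exponent
    return freq


def add_pitches(a, b):
    """Coordinate-wise addition c = a + b in Z^3."""
    return tuple(ai + bi for ai, bi in zip(a, b))


def tempered_frequency(p):
    """Frequency 440 * 2^(p/12) of the p-th tempered step, as in (1.15)."""
    return BASE_HZ * 2 ** (p / 12)


FIFTH = (-1, 1, 0)        # 2^-1 * 3 = 3/2
MAJOR_THIRD = (-2, 0, 1)  # 2^-2 * 5 = 5/4

print(euler_frequency(FIFTH))                            # pure fifth above 440 Hz
print(euler_frequency(add_pitches(FIFTH, MAJOR_THIRD)))  # fifth plus major third
print(tempered_frequency(7))                             # tempered fifth
```

Exact fractions are used for the Euler module so that just intervals stay rational, whereas the tempered frequencies are irrational and are therefore computed in floating point.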

2. Consider a finite module of notes (frequencies), such as for instance the well-tempered module M = Z12. Then a scale is an element of S = {(x1, ..., xk) : k ≤ |M|, xi ∈ M, xi ≠ xj (i ≠ j)}, the set of all finite vectors with different components.

1.3.4 Classification of circular chords and other musical objects

A central element of classical theory of harmony is the triad. An algebraic property that distinguishes harmonically important triads from other chords can be described as follows: let x1, x2, x3 ∈ Z12, such that (a) xi ≠ xj (i ≠ j) and (b) there is an “inner” symmetry g : Z12 → Z12 such that {y : y = g^k(x1), k ∈ N} = {x1, x2, x3}. It can be shown that all chords (x1, x2, x3) for which (a) and (b) hold are standard chords that are harmonically important in traditional theory of harmony. Consider for instance the major triad (c, e, g) = (0, 4, 7) and the minor triad (c, e♭, g) = (0, 3, 7). For the first triad, the symmetry g(x) = 3x + 7 yields the desired result: g(0) = 7 = g, g(7) = 4 = e and g(4) = 7 = g. For the minor triad the only inner symmetry is g(x) = 3x + 3 with g(7) = 0 = c, g(0) = 3 = e♭ and g(3) = 0 = c. This type of classification of chords can be carried over to more complicated configurations of notes (see e.g. Mazzola 1990a, 2002, Straub 1989). In particular, musical scales can be classified by comparing their inner symmetries.
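The two worked examples can be checked mechanically. The sketch below assumes that an inner symmetry is an affine map g(x) = ax + b on Z12, that the orbit includes the starting note itself (k = 0, 1, 2, ...), and that any chord note may serve as starting point; the function names are illustrative.

```python
def orbit(a, b, start, modulus=12):
    """Set {g^k(start) : k = 0, 1, 2, ...} for g(x) = a*x + b (mod modulus)."""
    seen = set()
    x = start % modulus
    while x not in seen:
        seen.add(x)
        x = (a * x + b) % modulus
    return seen


def inner_symmetries(chord, modulus=12):
    """All (a, b, start) for which the orbit of `start` under x -> a*x + b equals the chord."""
    target = set(chord)
    return [
        (a, b, start)
        for a in range(modulus)
        for b in range(modulus)
        for start in chord
        if orbit(a, b, start, modulus) == target
    ]


print(orbit(3, 7, 0))  # major triad example: g(x) = 3x + 7 started at c = 0
print(orbit(3, 3, 7))  # minor triad example: g(x) = 3x + 3 started at g = 7
```

Calling `inner_symmetries((0, 4, 7))` lists every affine map and starting note that reproduces the major triad under these assumptions, which gives a quick way to compare chords by their symmetries.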


1.3.5 Torus of thirds

Consider the group G = (Z12, +) of pitches modulo octave. Then G is isomorphic to the direct sum of the Sylow groups Z3 and Z4 by applying the isomorphism

\[
g : Z_{12} \to Z_3 + Z_4, \tag{1.16}
\]

\[
x \mapsto y = (y_1, y_2) = (x \bmod 3, \; -x \bmod 4) \tag{1.17}
\]
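A short sketch of the map (1.17) and of a step-counting distance on Z3 + Z4 is given here. The function names are illustrative, and the distance is an assumption: it counts unit steps on the two circles, each taken in the shorter direction.

```python
def to_torus(x):
    """Map a pitch class x in Z12 to (x mod 3, -x mod 4) as in (1.17)."""
    return (x % 3, (-x) % 4)


def circular_distance(u, v, n):
    """Number of unit steps between u and v on a circle with n points."""
    d = (u - v) % n
    return min(d, n - d)


def torus_distance(x, y):
    """Steps on the Z3 circle plus steps on the Z4 circle."""
    (x1, x2), (y1, y2) = to_torus(x), to_torus(y)
    return circular_distance(x1, y1, 3) + circular_distance(x2, y2, 4)


for x in range(12):
    print(x, to_torus(x), torus_distance(0, x))
```

Under this map, adding 4 (a major third) changes only the first coordinate and adding 3 (a minor third) changes only the second, so each step on the torus is a third.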

Geometrically, the elements of Z3 + Z4 can be represented as points on a torus, y1 representing the position on the vertical meridian and y2 the position on the horizontal equatorial circle (Figure 1.8). This representation has a musical meaning: a movement along a meridian corresponds to a major third, whereas a movement along a horizontal circle corresponds to a minor third. One can then define the “torus-distance” dtorus(x, y) by equating it to the minimal number of steps needed to move from x to y. The value of dtorus(x, y) expresses to what extent there is a third-relationship between x and y. The possible values of dtorus are 0 (if x = y), 1, 2, and 3 (weakest third-relationship). Note that dtorus can be decomposed into d3 + d4, where d3 counts the number of meridian steps and d4 the number of equatorial steps.

1.3.6 Transformations

For suitably chosen integers p1, p2, p3, p4, consider the four-dimensional module M = Zp1 × Zp2 × Zp3 × Zp4 over Z, where the coordinates represent onset time, pitch (well-tempered tuning if p2 = 12), duration, and volume. Transformations in this space play an essential role in music. A selection of historically relevant transformations used by classical composers is summarized in Table 1.1 (also see Figure 1.13). Generally, one may say that affine transformations are most important, and among these the invertible ones. In particular, it can be shown that each symmetry of Z12 can be written as a product (in the group of symmetries Symm(Z12)) of the following musically meaningful transformations:

• Multiplication by −1 (inversion);
• Multiplication by 5 (ordering of notes according to the circle of fourths);
• Addition of 3 (transposition by a minor third);
• Addition of 4 (transposition by a major third).

All these transformations have been used by composers for many centuries. Some examples of apparent similarities between groups of notes (or motifs) are shown in Figures 1.10 through 1.12.
In order not to clutter the pictures, only a small selection of similar motifs is marked. In dodecaphonic and serial music, transformation groups have been applied systematically (see e.g. Figure 1.9). For instance, in Schönberg’s Orchestervariationen op. 31, the full orbit generated by inversion, retrograde and transposition is used. Webern used 12-tone series that are diagonally symmetric in the two-dimensional space spanned by pitch and onset time. Other famous examples include Eimert’s rotation by 45 degrees together with a dilatation by √2 (Eimert 1964) and serial compositions such as Boulez’s “Structures” and Stockhausen’s “Kontra-Punkte”. With advanced computer technology (e.g. composition soft- and hardware such as Xenakis’s UPIC graphics/computer system or the recently developed Presto software by Mazzola 1989/1994), the application of affine transformations in musical spaces of arbitrary dimension is no longer the tedious work of the early dodecaphonic era. On the contrary, the practical ease and enormous artistic flexibility lead to an increasing popularity of computer aided transformations among contemporary composers (see e.g. Iannis Xenakis, Kurt Dahlke, Wilfried Jentzsch, Guerino Mazzola 1990b, Dieter Salbert, Karl-Heinz Schöppner, Tamas Ungvary, Jan Beran 1987, 1991, 1992, 2000; cf. Figure 1.14).

Table 1.1 Some affine transformations used in classical music

Function | Musical meaning
Shift: f(x) = x + a | Transposition, repetition, change of duration, change of loudness
Shear, e.g. of x = (x1, ..., x4)^t w.r.t. line y = βo + t · (0, 1, 0, 0): f(x) = x + a · (0, 1, 0, 0) for x not on line, f(x) = x for x on line | Arpeggio
Reflection, e.g. w.r.t. v = (a, 0, 0, 0): f(x) = (a − (x1 − a), x2, x3, x4) | Retrograde, inversion
Dilatation, e.g. w.r.t. pitch: f(x) = (x1, a · x2, x3, x4) | Augmentation
Exchange of coordinates: f(x) = (x2, x1, x3, x4) | Exchange of “parameters” (20th century)


Figure 1.6 W.A. Mozart (1756-1791) (authorship uncertain) – Spiegel-Duett. (Score for violin; Allegro, quarter note = 120.)


Figure 1.7 Wolfgang Amadeus Mozart (1756-1791). (Engraving by F. Müller after a painting by J.W. Schmidt; courtesy of Zentralbibliothek Zürich.)

Figure 1.8 The torus of thirds Z3 + Z4.

Figure 1.9 Arnold Schönberg – Sketch for the piano concerto op. 42 – notes with tone row and its inversions and transpositions. (Used by permission of Belmont Music Publishers.)

Figure 1.10 Notes of “Air” by Henry Purcell. (For better visibility, only a small selection of related “motifs” is marked.)

Figure 1.11 Notes of Fugue No. 1 (first half) from “Das Wohltemperierte Klavier” by J.S. Bach. (For better visibility, only a small selection of related “motifs” is marked.)

Figure 1.12 Notes of op. 68, No. 2 from “Album für die Jugend” by Robert Schumann. (For better visibility, only a small selection of related “motifs” is marked.)

Figure 1.13 A miraculous transformation caused by high exposure to Wagner operas. (Caricature from a 19th century newspaper; courtesy of Zentralbibliothek Zürich.)

Figure 1.14 Graphical representation of pitch and onset time in $Z_{71}^2$ together with instrumentation of polygonal areas. (Excerpt from Śānti – Piano concerto No. 2 by Jan Beran, col legno CD 20062; courtesy of col legno, Germany.)

Figure 1.15 Iannis Xenakis (1922-1998). (Courtesy of Philippe Gontier, Paris.)

Figure 1.16 Ludwig van Beethoven (1770-1827). (Courtesy of Zentralbibliothek Zürich.)


CHAPTER 2

Exploratory data mining in musical spaces

2.1 Musical motivation

The primary aim of descriptive statistics is to summarize data by a small set of numbers or graphical displays, with the purpose of finding typical relevant features. An in-depth descriptive analysis explores the data as far as possible in the hope of finding anything interesting. This activity is therefore also called “exploratory data analysis” (EDA; see Tukey 1977), or “data mining”. EDA does not require a priori model assumptions; the purpose is simply free exploration. Many exploratory tools are, however, inspired by probabilistic models and designed to detect features that may be captured by these.

Descriptive or exploratory analysis is of special interest in music. The reason is that in music very subtle local changes play an important role. For instance, a good pianist may achieve a desired emotional effect by slight local variations of tempo, dynamics, etc. Composers are able to do the same by applying subtle variations. Extreme examples of small gradual changes can be found, for instance, in minimal music (e.g. Reich, Glass, Riley). As a result, observed data consist of a dominating deterministic component plus many other very subtle (and presumably also deterministic, i.e. intended) components. Thus, because of their subtle nature, many musically relevant features are difficult to detect and can often be identified in a descriptive way only, for instance by suitable graphical displays. A formal statistical “proof” that these features are indeed real, and not just accidental, is then only possible if more similar data are collected.

To illustrate this, consider the tempo curves of three performances of Robert Schumann’s (1810-1856) Träumerei by Vladimir Horowitz (1903-1989), displayed in Figure 2.2. It is obvious that the three curves are very similar even with respect to small details. However, since these details are of a local nature and we observed only three performances, it is not an easy task to show formally (by statistical hypothesis testing or confidence intervals) that, apart from an overall smooth trend, Horowitz’s tempo variations are not random. An even more difficult task is to “explain” these features, i.e. to attach an explicit musical meaning to the local tempo changes.


Figure 2.1 Robert Schumann (1810-1856) – Träumerei op. 15, No. 7. (Piano score; quarter note = 100 (72).)


Figure 2.2 Tempo curves of Schumann’s Träumerei performed by Vladimir Horowitz. (Horizontal axis: onset time; vertical axis: log(tempo); curves for 1947, 1963 and 1965.)

2.2 Some descriptive statistics and plots for univariate data

2.2.1 Definitions

We give a brief summary of univariate descriptive statistics. For a comprehensive discussion we refer the reader to standard text books such as Tukey (1977), Mosteller and Tukey (1977), Hoaglin (1977), Tufte (1977), Velleman and Hoaglin (1981), Chambers et al. (1983), Cleveland (1985). Suppose that we observe univariate data x1, x2, ..., xn. To summarize general characteristics of the data, various numerical summary statistics can be calculated. Essential features are in particular center (location), variability, asymmetry, shape of distribution, and location of unusual values (outliers). The most frequently used statistics are listed in Table 2.1.

Table 2.1 Simple descriptive statistics

Name | Definition | Feature measured
Empirical distribution function | $F_n(x) = n^{-1} \sum_{i=1}^{n} 1\{x_i \le x\}$ | Proportion of obs. ≤ x
Minimum | $x_{min} = \min\{x_1, ..., x_n\}$ | Smallest value
Maximum | $x_{max} = \max\{x_1, ..., x_n\}$ | Largest value
Range | $x_{range} = x_{max} - x_{min}$ | Total spread
Sample mean | $\bar{x} = n^{-1} \sum_{i=1}^{n} x_i$ | Center
Sample median | $M = \inf\{x : F_n(x) \ge \frac{1}{2}\}$ | Center
Sample α-quantile | $q_\alpha = \inf\{x : F_n(x) \ge \alpha\}$ | Border of lower 100α%
Lower and upper quartile | $Q_1 = q_{1/4}$, $Q_2 = q_{3/4}$ | Border of lower 25%, upper 75%
Sample variance | $s^2 = (n-1)^{-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$ | Variability
Sample standard deviation | $s = +\sqrt{s^2}$ | Variability
Interquartile range | $IQR = Q_2 - Q_1$ | Variability
Sample skewness | $m_3 = n^{-1} \sum_{i=1}^{n} [(x_i - \bar{x})/s]^3$ | Asymmetry
Sample kurtosis | $m_4 = n^{-1} \sum_{i=1}^{n} [(x_i - \bar{x})/s]^4 - 3$ | Flat/sharp peak

We recall a few well known properties of these statistics:

• Sample mean: The sample mean can be understood as the “center of gravity” of the data, whereas the median divides the sample in two halves with an (approximately) equal number of observations. In contrast to the median, the mean is sensitive to outliers, since observations that are far from the majority of the data have a strong influence on its value.

• Sample standard deviation: The sample standard deviation is a measure of variability. In contrast to the variance, s is directly comparable with the data, since it is measured in the same unit. If observations are drawn independently from the same normal probability distribution (or a distribution that is similar to a normal distribution), then the following rule of thumb applies: (a) approximately 68% of the data are in the interval x̄ ± s; (b) approximately 95% of the data are in the interval x̄ ± 2s; (c) almost all data are in the interval x̄ ± 3s. For a sufficiently large sample size, these conclusions can be carried over to the population from which the data were drawn.


• Interquartile range: The interquartile range also measures variability. Its advantage, compared to s, is that it is much less sensitive to outliers. If the observations are drawn from the same normal probability distribution, then IQR/1.35 (or more precisely $IQR/[\Phi^{-1}(0.75) - \Phi^{-1}(0.25)]$, where $\Phi^{-1}$ is the quantile function of the standard normal distribution) estimates the same quantity as s, namely the population standard deviation.

• Quantiles: For α = i/n (i = 1, ..., n), q_α coincides with at least one observation. For other values of α, q_α can be defined as in Table 2.1 or, alternatively, by interpolating neighboring observed values as follows: let β = i/n < α < γ = (i+1)/n. Then the interpolated quantile q̃_α is defined by

\[
\tilde{q}_\alpha = q_\beta + \frac{\alpha - \beta}{1/n} (q_\gamma - q_\beta) \tag{2.1}
\]

Note that a slightly different convention used by some statisticians is to call inf{x : F_n(x) ≥ α} the (α − 0.5/n)-quantile (see e.g. Chambers et al. 1983).

• Skewness: Skewness measures symmetry/asymmetry. For exactly symmetric data, m3 = 0; for data with a long right tail, m3 > 0; for data with a long left tail, m3 < 0.

• Kurtosis: The kurtosis is mainly meaningful for unimodal distributions, i.e. distributions with one peak. For a sample from a normal distribution, m4 ≈ 0. The reason is that then E[(X − µ)^4] = 3σ^4, where µ = E(X). For samples from unimodal distributions with a sharper or flatter peak than the normal distribution, we then tend to have m4 > 0 and m4 < 0 respectively.

Simple, but very useful graphical displays are:

• Histogram: 1. Divide an interval (a, b] that includes all observations into disjoint intervals I1 = (a1, b1], ..., Ik = (ak, bk]. 2. Let n1, ..., nk be the number of observations in the intervals I1, ..., Ik respectively. 3. Above each interval Ij, plot a rectangle of width wj = bj − aj and height hj = nj/wj. Instead of the absolute frequencies, one can also use relative frequencies nj/n, where n = n1 + ... + nk. The essential point is that the area is proportional to nj. If the data are drawn from a probability distribution with density function f, then the histogram is an estimate of f.

• Kernel estimate of a density function: The histogram is a step function, and in that sense does not resemble most density functions. This can be improved as follows. If the data are realizations of a continuous random variable X with distribution $F(x) = P(X \le x) = \int_{-\infty}^{x} f(u)\,du$, then a smooth estimate of the probability density function f can be defined by a kernel estimate (Rosenblatt 1956, Parzen 1962, Silverman 1986) of the form

\[
\hat{f}(x) = \frac{1}{nb} \sum_{i=1}^{n} K\left(\frac{x_i - x}{b}\right) \tag{2.2}
\]

where K(u) = K(−u) ≥ 0 and $\int_{-\infty}^{\infty} K(u)\,du = 1$. Most kernels used in practice also satisfy the condition K(u) = 0 for |u| > 1. The “bandwidth” b then specifies which data in the neighborhood of x are used to estimate f(x). In situations where one has partial knowledge of the shape of f, one may incorporate this into the estimation procedure. For instance, Hjort and Glad (2002) combine parametric estimation based on a preliminary density function $f(x; \hat{\theta})$ with kernel smoothing of the “remaining density” $f/f(x; \hat{\theta})$. They show that major efficiency gains can be achieved if the preliminary model is close to the truth.

• Barchart: If data can assume only a few different values, or if data are qualitative (i.e. we only record which category an item belongs to), then one can plot the possible values or names of categories on the x-axis and on the vertical axis the corresponding (relative) frequencies.

• Boxplot (simple version): 1. Calculate Q1, M, Q2 and IQR = Q2 − Q1. 2. Draw parallel lines (in principle of arbitrary length) at the levels Q1, M, Q2, A1 = Q1 − (3/2) IQR, A2 = Q2 + (3/2) IQR, B1 = Q1 − 3 IQR and B2 = Q2 + 3 IQR. The points A1, A2 are called inner fence, and B1, B2 are called outer fence. 3. Identify the observation(s) between Q1 and A1 that is closest to A1 and draw a line connecting Q1 with this point. Do the same for Q2 and A2. 4. Identify observation(s) between A1 and B1 and draw points (or other symbols) at those places. Do the same for A2 and B2. 5. Draw points (or other symbols) for observations beyond B1 and B2 respectively. The boxplot can be interpreted as follows: the relative positions of Q1, M, Q2 and the inner and outer fences indicate symmetry or asymmetry. Moreover, the distance between Q1 and Q2 is the IQR and thus measures variability. The inner and outer fences help to identify outliers, i.e. values lying unusually far from most of the other observations.

• Q-q-plot for comparing two data sets x1, ..., xn and y1, ..., ym: 1. Define a certain number of points 0 < p1 < ... < pk ≤ 1 (the standard choice is pi = (i − 0.5)/N, i = 1, ..., N, where N = min(n, m)). 2. Plot the pi-quantiles of the y-observations versus those of the x-observations. Alternative plots for comparing distributions are discussed e.g. in Ghosh and Beran (2000) and Ghosh (1996, 1999).
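The numerical summaries and the boxplot fences described in this section can be computed directly from their definitions. The following is a minimal sketch in plain Python; the function names, the dictionary keys and the small floating-point guard in the quantile are assumptions made for illustration.

```python
import math


def ecdf_quantile(data, alpha):
    """q_alpha = inf{x : F_n(x) >= alpha}, i.e. the ceil(n*alpha)-th smallest value."""
    xs = sorted(data)
    # The tiny offset guards against n*alpha landing just above an integer in floating point.
    k = max(1, math.ceil(len(xs) * alpha - 1e-12))
    return xs[k - 1]


def summary(data):
    """Summary statistics in the spirit of Table 2.1, plus boxplot fences."""
    n = len(data)
    mean = sum(data) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    q1, median, q2 = (ecdf_quantile(data, a) for a in (0.25, 0.5, 0.75))
    iqr = q2 - q1
    return {
        "mean": mean,
        "median": median,
        "s": s,
        "Q1": q1,
        "Q2": q2,
        "IQR": iqr,
        "m3": sum(((x - mean) / s) ** 3 for x in data) / n,
        "m4": sum(((x - mean) / s) ** 4 for x in data) / n - 3,
        "inner_fences": (q1 - 1.5 * iqr, q2 + 1.5 * iqr),
        "outer_fences": (q1 - 3 * iqr, q2 + 3 * iqr),
    }


stats = summary([1, 2, 3, 4, 5, 6, 7, 8, 100])
print(stats)
```

In this toy sample the value 100 pulls the mean far above the median and lies beyond the upper outer fence, which illustrates the sensitivity of the mean to outliers.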

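The kernel estimate (2.2) is equally short to write down. The sketch below uses the Epanechnikov kernel as one possible choice that is symmetric, nonnegative, integrates to one and vanishes for |u| > 1; the choice of kernel and the function names are assumptions, not prescribed by the text.

```python
def epanechnikov(u):
    """Symmetric kernel with support [-1, 1] that integrates to one."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def kernel_density(data, x, b, kernel=epanechnikov):
    """Kernel estimate (1 / (n*b)) * sum_i K((x_i - x) / b) at the point x."""
    n = len(data)
    return sum(kernel((xi - x) / b) for xi in data) / (n * b)


sample = [0.0, 1.0, 2.0]
for x in (-1.0, 0.0, 1.0, 2.0, 3.0):
    print(x, kernel_density(sample, x, b=1.0))
```

With a compactly supported kernel, only observations within distance b of x contribute to the estimate at x, which is the role of the bandwidth described above.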

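The points of a q-q-plot follow from the two steps given above. This sketch assumes the standard choice p_i = (i − 0.5)/N with N = min(n, m) and the quantile definition inf{x : F_n(x) ≥ p}; the function names are illustrative, and plotting is left to any graphics library.

```python
import math


def quantile(data, p):
    """inf{x : F_n(x) >= p}, i.e. the ceil(n*p)-th smallest observation."""
    xs = sorted(data)
    k = max(1, math.ceil(len(xs) * p - 1e-12))
    return xs[k - 1]


def qq_points(x, y):
    """Pairs (x-quantile, y-quantile) at p_i = (i - 0.5) / N, N = min(n, m)."""
    n_points = min(len(x), len(y))
    probs = [(i - 0.5) / n_points for i in range(1, n_points + 1)]
    return [(quantile(x, p), quantile(y, p)) for p in probs]


print(qq_points([3, 1, 2], [10, 30, 20, 40]))
```

If the two samples have the same distribution up to location and scale, the points lie close to a straight line; deviations at either end point to differences in the tails.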
2.3 Specific applications in music – univariate

2.3.1 Tempo curves

Figure 2.3 displays 28 tempo curves for performances of Schumann’s Träumerei op. 15, No. 7, by 24 pianists. The names of the pianists and dates of the recordings (in brackets) are Martha Argerich (before 1983), Claudio Arrau (1974), Vladimir Ashkenazy (1987), Alfred Brendel (before 1980), Stanislav Bunin (1988), Sylvia Capova (before 1987), Alfred Cortot (1935, 1947 and 1953), Clifford Curzon (about 1955), Fanny Davies (1929), Jörg Demus (about 1960), Christoph Eschenbach (before 1966), Reine Gianoli (1974), Vladimir Horowitz (1947, before 1963 and 1965), Cyprien Katsaris (1980), Walter Klien (date unknown), André Krust (about 1960), Antonin Kubalek (1988), Benno Moiseiwitsch (about 1950), Elly Ney (about 1935), Guiomar Novaes (before 1954), Cristina Ortiz (before 1988), Artur Schnabel (1947), Howard Shelley (before 1990), Yakov Zak (about 1960).

Tempo is more likely to be varied in a relative rather than absolute way. For instance, a musician plays a certain passage twice as fast as the previous one, but may care less about the exact absolute tempo. This suggests consideration of the logarithm of tempo. Moreover, the main interest lies in comparing the shapes of the curves. Therefore, the plotted curves consist of standardized logarithmic tempo (each curve has sample mean zero and variance one).

Schumann’s Träumerei is divided into four main parts, each consisting of about eight bars, the first two and the last one being almost identical (see Figure 2.1). Thus, the structure is: A, A′, B, and A′′. Already a very simple exploratory analysis reveals interesting features. For each pianist, we calculate the following statistics for the four parts respectively: x̄, M, s, Q1, Q2, m3 and m4. Figures 2.4a through e show a distinct pattern that corresponds to the division into A, A′, B, and A′′. Tempo is much lower in A′′ and generally highest in B. Also, A′ seems to be played at a slightly slower tempo than A, though this distinction is not quite so clear (Figures 2.4a,b). Tempo is varied most towards the end and considerably less in the first half of the piece (Figure 2.4c). Skewness is generally negative, which is due to occasional extreme “ritardandi”. This is most extreme in part B and, again, least pronounced in the first half of the piece (A, A′). A mirror image of this pattern, with most extreme positive values in B, is observed for kurtosis. This indicates that in B (and also in A′′), most tempo values vary little around an average value, but occasionally extreme tempo changes occur. Also, for A, there are two outliers with an extremely negative skewness; these turn out to be Fanny Davies and Jörg Demus.

Figures 2.4f through h show another interesting comparison of boxplots. In Figure 2.4f, the differences between the lower quartiles in A and A′ for performances before 1965 are compared with those from performances recorded in 1965 or later. The clear difference indicates that, at least for the sample considered here, pianists of the “modern era” tend to make a much stronger distinction between A and A′ in terms of slow tempi. The only exceptions are Moiseiwitsch and Horowitz’s first performance (outliers in the left boxplot) and Ashkenazy (outlier in the right boxplot). The comparison of skewness and kurtosis in Figures 2.4g and h also indicates that “modern” pianists seem to prefer occasional extreme ritardandi. The only exception in the “early 20th century group” is Artur Schnabel, with an extreme skewness of −2.47 and a kurtosis of 7.04.

Direct comparisons of tempo distributions are shown in Figures 2.5a through f. The following observations can be made: a) compared to Demus (quantiles on the horizontal axis), Ortiz has a few relatively extreme slow tempi (Figure 2.5a); b) similarly, but in a less extreme way, Cortot’s interpretation includes occasional extremely slow tempo values (Figure 2.5b); c) Ortiz and Argerich have practically the same (marginal) distribution (Figure 2.5c); d) Figure 2.5d is similar to 2.5a and b, though less extreme; e) the tempo distribution of Cortot’s performance (Figure 2.5e) did not change much in 1947 compared to 1935; f) similarly, Horowitz’s tempo distributions in 1947 and 1963 are almost the same, except for slight changes for very low tempi (Figure 2.5f).

Figure 2.3 Twenty-eight tempo curves of Schumann’s Träumerei performed by 24 pianists. (For Cortot and Horowitz, three tempo curves were available.) (Horizontal axis: onset time; vertical axis: log(tempo); curves labelled ARGERICH, ARRAU, ASKENAZE, BRENDEL, BUNIN, CAPOVA, CORTOT1, CORTOT2, CORTOT3, CURZON, DAVIES, DEMUS, ESCHENBACH, GIANOLI, HOROWITZ1, HOROWITZ2, HOROWITZ3, KATSARIS, KLIEN, KRUST, KUBALEK, MOISEIWITSCH, NEY, NOVAES, ORTIZ, SCHNABEL, SHELLEY, ZAK.)

Figure 2.4 Boxplots of descriptive statistics for the 28 tempo curves in Figure 2.3.

Figure 2.5 q-q-plots of several tempo curves (from Figure 2.3). (Panels: 2.5a Demus (1960) - Ortiz (1988); 2.5b Demus (1960) - Cortot (1935); 2.5c Ortiz (1988) - Argerich (1983); 2.5d Demus (1960) - Krust (1960); 2.5e Cortot (1935) - Cortot (1947); 2.5f Horowitz (1947) - Horowitz (1963).)
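The standardization applied to the tempo curves (logarithm, then sample mean zero and variance one) can be sketched as follows. The function name and the toy tempo values are assumptions for illustration; the variance uses the divisor n − 1 as in Table 2.1.

```python
import math


def standardized_log_tempo(tempo):
    """Log-transform a tempo curve and rescale it to sample mean 0 and variance 1."""
    logs = [math.log(t) for t in tempo]
    n = len(logs)
    mean = sum(logs) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in logs) / (n - 1))
    return [(v - mean) / s for v in logs]


curve = [60.0, 120.0, 90.0, 30.0]            # hypothetical tempo values
z = standardized_log_tempo(curve)
z_doubled = standardized_log_tempo([2 * t for t in curve])
print(z)
print(z_doubled)
```

Playing the whole piece twice as fast only adds a constant to the log-tempo, so the standardized curve is unchanged; this is why the transformed curves compare shapes rather than absolute tempi.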

2.3.2 Notes modulo 12

In most classical music, a central tone around which notes “fluctuate” can be identified, and a small selected number of additional notes or chords (often triads) play a special role. For instance, from about 400 to 1500 A.D., music was mostly written using so-called modes. The main notes were the first one (finalis, the “final note”) and the fifth note of the scale (dominant). The system of 12 major and 12 minor scales was developed later, adding more flexibility with respect to modulation and scales. The main “representatives” of a major/minor scale are three triads, obtained by “adding” thirds, starting at the basic note corresponding to the first (tonic), fourth (subdominant) and fifth (dominant) note of the scale respectively. Other triads are also, but to a lesser degree, associated with the properties “tonic”, “subdominant” and/or “dominant”. In the 20th century, and partially already in the late 19th century, other systems of scales as well as systems that do not rely on any specific scales were proposed (in particular 12-tone music).

Figure 2.6 Frequencies of notes 0,1,...,11 for moving windows of onset-length 16. (Panels: 2.6a J.S. Bach - Fugue 1; 2.6b W.A. Mozart - KV 545; 2.6c R. Schumann - op. 15/2; 2.6d R. Schumann - op. 15/3; each panel shows frequencies of notes number i, i in [1+j, 16+j]. Horizontal axis: (Notes-Tonic) mod 12.)

A very simple illustration of this development can be obtained by counting the frequencies of notes (pitches) in the following way: consider a score in equal temperament. Ignoring transposition by octaves, we can represent all notes x(t1 ), ..., x(tn ) by the integers 0, 1, ..., 11. Here, t1 ≤ t2 ≤ ... ≤ tn


[Figure 2.7a: A. Scriabin - op. 51/2, frequencies of notes number i, i in [1+j, 16+j] (j = 0, ..., 64).]

[Figure 2.7b: A. Scriabin - op. 51/4, frequencies of notes number i, i in [1+j, 16+j] (j = 0, ..., 64).]

[Figure 2.7c: F. Martin - Prelude 6, frequencies of notes number i, i in [1+j, 16+j] (j = 0, ..., 64).]

[Figure 2.7d: F. Martin - Prelude 7, frequencies of notes number i, i in [1+j, 16+j] (j = 0, ..., 64).]

[All panels: horizontal axis (Notes-Tonic) mod 12; vertical axis: frequency.]


Figure 2.7 Frequencies of notes 0,1,...,11 for moving windows of onset-length 16.

denote the score-onset times of the notes. To make different compositions comparable, the notes are centered by subtracting the central note which is defined to be the most frequent note. Given a prespecified integer k (in our case k = 16), we calculate the relative frequencies

$$p_j(x) = (2k+1)^{-1} \sum_{i=j}^{j+2k} 1\{x(t_i) = x\}$$

where $1\{x(t_i) = x\} = 1$ if $x(t_i) = x$ and zero otherwise, and $j = 1, 2, ..., n-2k-1$. This means that we calculate the distribution of notes for a moving window of 2k + 1 notes. Figures 2.6a through d and 2.7a through d display the distributions $p_j(x)$ (j = 4, 8, ..., 64) for the following compositions: Fugue 1 from “Das Wohltemperierte Klavier I” by J.S. Bach (1685-1750), Sonata KV 545 (first movement) by W.A. Mozart (1756-1791; Figure 2.8), Kinderszenen No. 2 and 3 by R. Schumann (1810-1856; Figure 2.9), Préludes op. 51, No. 2 and 4 by A. Scriabin (1872-1915) and Préludes No.


Figure 2.8 Johannes Chrysostomus Wolfgangus Theophilus Mozart (1756-1791) in the house of Salomon Gessner in Zurich. (Courtesy of Zentralbibliothek Zürich.)


Figure 2.9 R. Schumann (1810-1856) – lithograph by H. Bodmer. (Courtesy of Zentralbibliothek Zürich.)


6 and 7 by F. Martin (1890-1971). For each j = 4, 8, ..., 64, the frequencies $p_j(0), ..., p_j(11)$ are joined by lines respectively. The obvious common feature for Bach, Mozart and Schumann is a distinct preference (local maximum) for the notes 5 and 7 (apart from 0). Note that if 0 is the root of the tonic triad, then 5 corresponds to the root of the subdominant triad. Similarly, 7 is the root of the dominant triad. Also relatively frequent are the notes 3 = minor third (second note of the tonic triad in minor) and 10 = minor seventh, which is the fourth note of the dominant seventh chord to the subdominant. Also note that, for Schumann, the local maxima are somewhat less pronounced.

A different pattern can be observed for Scriabin and even more for Martin. In Scriabin’s Prélude op. 51/2, the perfect fifth almost never occurs, but instead the major sixth is very frequent. In Scriabin’s Prélude op. 51/4, the tonal system is dissolved even further, as the clearly dominating note is 6 which builds together with 0 the augmented fourth (or diminished fifth) – an interval that is considered highly dissonant in tonal music. Nevertheless, even in Scriabin’s compositions, the distribution of notes does not change very rapidly, since the sixteen overlaid curves are almost identical. This may indicate that the notion of scales or a slow harmonic development still plays a role. In contrast, in Frank Martin’s Prélude No. 6, the distribution changes very quickly. This is hardly surprising, since Martin’s style incorporates, among other influences, dodecaphonism (12-tone music) – a compositional technique that does not impose traditional restrictions on the harmonic structure.

2.4 Some descriptive statistics and plots for bivariate data

2.4.1 Definitions

We give a short overview of important descriptive concepts for bivariate data. For a comprehensive treatment we refer the reader to standard text books given above (also see e.g. Plackett 1960, Ryan 1996, Srivastava and Sen 1997, Draper and Smith 1998, and Rao 1973 for basic theoretical results).

Correlation

If each observation consists of a pair of measurements $(x_i, y_i)$, then the main objective is to investigate the relationship between x and y. Consider, for example, the case where both variables are quantitative. The data can then be displayed in a scatter plot (y versus x). Useful statistics are Pearson’s sample correlation

$$r = \frac{1}{n}\sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}} \qquad (2.3)$$


where $s_x^2 = n^{-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$ and $s_y^2 = n^{-1}\sum_{i=1}^{n} (y_i - \bar{y})^2$, and Spearman’s rank correlation

$$r_{Sp} = \frac{1}{n}\sum_{i=1}^{n} \left(\frac{u_i - \bar{u}}{s_u}\right)\left(\frac{v_i - \bar{v}}{s_v}\right) = \frac{\sum_{i=1}^{n} (u_i - \bar{u})(v_i - \bar{v})}{\sqrt{\sum_{i=1}^{n} (u_i - \bar{u})^2 \sum_{i=1}^{n} (v_i - \bar{v})^2}} \qquad (2.4)$$

where $u_i$ denotes the rank of $x_i$ among the x-values and $v_i$ is the rank of $y_i$ among the y-values. In (2.3) and (2.4) it is assumed that $s_x$, $s_y$, $s_u$ and $s_v$ are not zero. Recall that these definitions imply the following properties:

a) $-1 \le r, r_{Sp} \le 1$;
b) $r = 1$, if and only if $y_i = \beta_0 + \beta_1 x_i$ and $\beta_1 > 0$ (exact linear relationship with positive slope);
c) $r = -1$, if and only if $y_i = \beta_0 + \beta_1 x_i$ and $\beta_1 < 0$ (exact linear relationship with negative slope);
d) $r_{Sp} = 1$, if and only if $x_i > x_j$ implies $y_i > y_j$ (strictly monotonically increasing relationship);
e) $r_{Sp} = -1$, if and only if $x_i > x_j$ implies $y_i < y_j$ (strictly monotonically decreasing relationship);
f) r measures the strength (and sign) of the linear relationship;
g) $r_{Sp}$ measures the strength (and sign) of monotonicity;
h) if the data are realizations of a bivariate random variable (X, Y), then r is an estimate of the population correlation $\rho = cov(X, Y)/\sqrt{var(X)var(Y)}$ where $cov(X, Y) = E[XY] - E[X]E[Y]$, $var(X) = cov(X, X)$ and $var(Y) = cov(Y, Y)$.

When using these measures of dependence one should bear in mind that each of them measures a specific type of dependence only, namely linear and monotonic dependence respectively. Thus, a Pearson or Spearman correlation near or equal to zero does not necessarily mean independence. Note also that correlation can be interpreted in a geometric way as follows: defining the n-dimensional vectors $x = (x_1, ..., x_n)^t$ and $y = (y_1, ..., y_n)^t$, r is equal to the standardized scalar product between the centered vectors $x - \bar{x}1$ and $y - \bar{y}1$, and is therefore equal to the cosine of the angle between these two vectors.

A special type of correlation is interesting for time series. Time series are data that are taken in a specific ordered (usually temporal) sequence.
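Returning to the two correlation measures (2.3) and (2.4), the following Python sketch computes both from scratch. The function names `pearson`, `ranks` and `spearman` are chosen here for illustration only, and the use of average ranks for tied values is an assumption, since ties are not addressed in the definitions above.

```python
import math


def pearson(x, y):
    """Pearson's sample correlation r as in (2.3)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Ranks 1..n; tied values receive their average rank (an assumption)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over a block of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rank correlation (2.4): Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Monotone but nonlinear relationship: rank correlation is perfect, r is not.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [math.exp(v) for v in x]
print(pearson(x, y), spearman(x, y))

# Exact (quadratic) dependence that neither coefficient detects.
x_sym = [-2, -1, 0, 1, 2]
y_sym = [v * v for v in x_sym]
print(pearson(x_sym, y_sym), spearman(x_sym, y_sym))
```

The second data set illustrates the warning above: y is a deterministic function of x, yet both coefficients vanish because the dependence is neither linear nor monotonic.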
If $Y_1, Y_2, ..., Y_n$ are random variables observed at time points i = 1, ..., n, then one would like to know whether there is any linear dependence between observations $Y_i$ and $Y_{i-k}$, i.e. between observations that are k time units apart. If this dependence is the same for all time points i, and the expected value of $Y_i$ is constant, then the corresponding population correlation can be written as a function of k only (see Chapter 4),

$$\rho(k) = \frac{cov(Y_i, Y_{i+k})}{\sqrt{var(Y_i)var(Y_{i+k})}} \qquad (2.5)$$

and a simple estimate of $\rho(k)$ is the sample autocorrelation (acf)

$$\hat{\rho}(k) = \frac{1}{n}\sum_{i=1}^{n-k} \left(\frac{y_i - \bar{y}}{s}\right)\left(\frac{y_{i+k} - \bar{y}}{s}\right) \qquad (2.6)$$

where $s^2 = n^{-1}\sum_{i=1}^{n} (y_i - \bar{y})^2$. Note that here summation stops at n − k, because no data are available beyond (n − k) + k = n. For large lags (large compared to n), $\hat{\rho}(k)$ is not a very precise estimate, since there are only very few pairs that are k time units apart. The definition of $\rho(k)$ and $\hat{\rho}(k)$ can be extended to multivariate time series, taking into account that dependence between different components of the series may be delayed. For instance, for a bivariate time series $(X_i, Y_i)$ (i = 1, 2, ...), one considers lag-k sample cross-correlations

$$\hat{\rho}_{XY}(k) = \frac{1}{n}\sum_{i=1}^{n-k} \left(\frac{x_i - \bar{x}}{s_X}\right)\left(\frac{y_{i+k} - \bar{y}}{s_Y}\right) \qquad (2.7)$$

as estimates of the population cross-correlations

$$\rho_{XY}(k) = \frac{cov(X_i, Y_{i+k})}{\sqrt{var(X_i)var(Y_{i+k})}} \qquad (2.8)$$

where $s_X^2 = n^{-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$ and $s_Y^2 = n^{-1}\sum_{i=1}^{n} (y_i - \bar{y})^2$. If $|\rho_{XY}(k)|$ is high, then there is a strong linear dependence between $X_i$ and $Y_{i+k}$.

Regression

In addition to measuring the strength of dependence between two variables, one is often interested in finding an explicit functional relationship. For instance, it may be possible to express the response variable y in terms of an explanatory variable x by $y = g(x, \varepsilon)$ where $\varepsilon$ is a variable representing the part of y that is unexplained. More specifically, we may have, for example, an additive relationship $y = g(x) + \varepsilon$ or a multiplicative equation $y = g(x)e^{\varepsilon}$. The simplest relationship is given by the simple linear regression equation

$$y = \beta_0 + \beta_1 x + \varepsilon \qquad (2.9)$$

where $\varepsilon$ is assumed to be a random variable with $E(\varepsilon) = 0$ (and usually finite variance $\sigma^2 = var(\varepsilon) < \infty$). Thus, the data are $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ (i = 1, ..., n) where the $\varepsilon_i$'s are generated by the same zero mean distribution. Often the $\varepsilon_i$'s are also assumed to be uncorrelated or even independent – this is however not a necessary assumption. An obvious estimate of the unknown parameters $\beta_0$ and $\beta_1$ is obtained by minimizing the total sum of squared errors

$$SSE = SSE(b_0, b_1) = \sum (y_i - b_0 - b_1 x_i)^2 = \sum r_i^2(b_0, b_1) \qquad (2.10)$$

with respect to $b_0$, $b_1$. The solution is found by setting the partial derivatives with respect to $b_0$ and $b_1$ equal to zero. A more elegant way to find the solution is obtained by interpreting the problem geometrically: defining the n-dimensional vectors $1 = (1, ..., 1)^t$, $b = (b_0, b_1)^t$ and the n × 2 matrix X with columns 1 and x, we have

$$SSE = ||y - b_0 1 - b_1 x||^2 = ||y - Xb||^2$$

where $||\cdot||$ denotes the Euclidean norm, or length of the vector. It is then clear that SSE is minimized by the orthogonal projection of y on the plane spanned by 1 and x. The estimate of $\beta = (\beta_0, \beta_1)^t$ is therefore

$$\hat{\beta} = (\hat{\beta}_0, \hat{\beta}_1)^t = (X^t X)^{-1} X^t y \qquad (2.11)$$

and the projection – which is the vector of estimated values $\hat{y}_i$ – is given by

$$\hat{y} = (\hat{y}_1, ..., \hat{y}_n)^t = X(X^t X)^{-1} X^t y \qquad (2.12)$$

Defining the measure of the total variability of y, $SST = ||y - \bar{y}1||^2$ (total sum of squares), and the quantities $SSR = ||\hat{y} - \bar{y}1||^2$ (regression sum of squares = variability due to the fact that the fitted line is not horizontal) and $SSE = ||y - \hat{y}||^2$ (error sum of squares, variability unexplained by regression line), we have by Pythagoras

$$SST = SSR + SSE \qquad (2.13)$$
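The least squares solution (2.11) and the decomposition (2.13) can be checked numerically with a few lines of Python. This is a minimal sketch for the simple regression case only: the 2 × 2 matrix $X^t X$ is inverted in closed form, and the function names `fit_line` and `sums_of_squares` as well as the small data set are illustrative assumptions.

```python
def fit_line(x, y):
    """Least squares estimates (b0, b1), solving the 2x2 system X'X b = X'y."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    sx, sy = sum(x), sum(y)
    sxx = sum(a * a for a in x)
    sxy = sum(a * b for a, b in zip(x, y))
    # X'X = [[n, sx], [sx, sxx]] and X'y = [sy, sxy]
    det = n * sxx - sx * sx
    if det == 0:
        raise ValueError("x must contain at least two distinct values")
    b0 = (sxx * sy - sx * sxy) / det
    b1 = (n * sxy - sx * sy) / det
    return b0, b1


def sums_of_squares(x, y):
    """Return (SST, SSR, SSE) for the fitted least squares line."""
    b0, b1 = fit_line(x, y)
    ybar = sum(y) / len(y)
    fitted = [b0 + b1 * xi for xi in x]
    sst = sum((yi - ybar) ** 2 for yi in y)
    ssr = sum((fi - ybar) ** 2 for fi in fitted)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return sst, ssr, sse


# Hypothetical data, roughly following y = 2x.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
print(fit_line(x, y))
sst, ssr, sse = sums_of_squares(x, y)
print(sst, ssr + sse)  # the two numbers agree up to rounding error
```

For more than one explanatory variable the closed-form inverse is no longer practical, and a linear algebra routine would normally be used to solve the normal equations.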

The proportion of variability “explained” by the regression line $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$ is therefore

$$R^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = \frac{||\hat{y} - \bar{y}1||^2}{||y - \bar{y}1||^2} = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}. \qquad (2.14)$$

By definition, $0 \le R^2 \le 1$, and $R^2 = 1$ if and only if $\hat{y}_i = y_i$ (i.e. all points are on the regression line). Moreover, for simple regression we also have $R^2 = r^2$. The advantage of defining $R^2$ as above (instead of via $r^2$) is that the definition remains valid for the multiple regression model (see below), i.e. when several explanatory variables are available. Finally, note that an estimate of $\sigma^2$ is obtained by $\hat{\sigma}^2 = (n-2)^{-1}\sum r_i^2(\hat{\beta}_0, \hat{\beta}_1)$.

In analogy to the sample mean and the sample variance, the least squares estimates of the regression parameters are sensitive to the presence of outliers. Outliers in regression can occur in the y-variable as well as in the x-variable. The latter are also called influential points. Outliers may often be correct and in fact very interesting observations (e.g. telling us that the assumed model may not be correct). However, since least squares estimates are highly influenced by outliers, it is often difficult to notice that there may be a problem, since the fitted curve tends to lie close to the outliers. Alternative, robust estimates can be helpful in such situations (see Huber 1981, Hampel et al. 1986). For instance, instead of minimizing the residual sum of squares we may minimize $\sum \rho(r_i)$ where $\rho$ is a bounded function. If $\rho$ is differentiable, then the solution can usually also be found by solving the equations

$$\sum_{i=1}^{n} \rho'\left(\frac{r_i}{\hat{\sigma}}\right)\frac{\partial}{\partial b_j} r_i(b) = 0 \quad (j = 0, ..., p) \qquad (2.15)$$

where $\hat{\sigma}^2$ is a robust estimate of $\sigma^2$ obtained from an additional equation and p is the number of explanatory variables. This leads to estimates that


are (up to a certain degree) robust with respect to outliers in y, not however with respect to influential points (outliers in x). To control the effect of influential points one can, for instance, solve a set of equations

$$\sum_{i=1}^{n} \psi_j\left(\frac{r_i}{\hat{\sigma}}, x_i\right) = 0 \quad (j = 0, ..., p) \qquad (2.16)$$

where $\psi$ is such that it downweights outliers in x as well. For a comprehensive theory of robustness see e.g. Huber (1981), Hampel et al. (1986). For more recent, efficient and highly robust methods see Yohai (1987), Rousseeuw and Yohai (1984), Gervini and Yohai (2002), and references therein.

The results for simple linear regression can be extended easily to the case where more than one explanatory variable is available. The multiple linear regression model with p explanatory variables is defined by $y = \beta_0 + \beta_1 x_1 + ... + \beta_p x_p + \varepsilon$. For data we write $y_i = \beta_0 + \beta_1 x_{i1} + ... + \beta_p x_{ip} + \varepsilon_i$ (i = 1, ..., n). Note that the word “linear” refers to linearity in the parameters $\beta_0, ..., \beta_p$. The function itself can be nonlinear. For instance, we may have polynomial regression with $y = \beta_0 + \beta_1 x + ... + \beta_p x^p + \varepsilon$. The same geometric arguments as above apply so that (2.11) and (2.12) hold with $\beta = (\beta_0, ..., \beta_p)^t$, and the n × (p + 1) matrix $X = (x^{(1)}, ..., x^{(p+1)})$ with columns $x^{(1)} = 1$ and $x^{(j+1)} = x_j = (x_{1j}, ..., x_{nj})^t$ (j = 1, ..., p).

Regression smoothing

A more general, but more difficult, approach to modeling a functional relationship is to impose less restrictive assumptions on the function g. For instance, we may assume

$$y = g(x) + \varepsilon \qquad (2.17)$$

with g being a twice continuously differentiable function. Under suitable additional conditions on x and $\varepsilon$ it is then possible to estimate g from observed data by nonparametric smoothing. As a special example consider observations $y_i$ taken at time points i = 1, 2, ..., n. A standard model is

$$y_i = g(t_i) + \varepsilon_i \qquad (2.18)$$

where $t_i = i/n$, $\varepsilon_i$ are independent identically distributed (iid) random variables with $E(\varepsilon_i) = 0$ and $\sigma^2 = var(\varepsilon_i) < \infty$. The reason for using standardized time $t_i \in [0, 1]$ is that this way g is observed on an increasingly fine grid. This makes it possible to ultimately estimate g(t) for all values of t by using neighboring values $t_i$, provided that g is not too “wild”. A simple estimate of g can be obtained, for instance, by a weighted average (kernel smoothing)

$$\hat{g}(t) = \sum_{i=1}^{n} w_i y_i \qquad (2.19)$$

with suitable weights $w_i \ge 0$, $\sum w_i = 1$. For example, one may use the Nadaraya-Watson weights

$$w_i = w_i(t; b, n) = \frac{K\left(\frac{t - t_i}{b}\right)}{\sum_{j=1}^{n} K\left(\frac{t - t_j}{b}\right)} \qquad (2.20)$$

with b > 0, and a kernel function $K \ge 0$ such that $K(u) = K(-u)$, $K(u) = 0$ ($|u| > 1$) and $\int_{-1}^{1} K(u)du = 1$. The role of b is to restrict observations that influence the estimate to a small window of neighboring time points. For instance, the rectangular kernel $K(u) = \frac{1}{2} 1\{|u| \le 1\}$ yields the sample mean of observations $y_i$ in the “window” $n(t - b) \le i \le n(t + b)$. An even more elegant formula can be obtained by approximating the Riemann sum $\frac{1}{nb}\sum_{j=1}^{n} K\left(\frac{t - t_j}{b}\right)$ by the integral $\int_{-1}^{1} K(u)du = 1$:

$$\hat{g}(t) = \sum_{i=1}^{n} w_i y_i = \frac{1}{nb}\sum_{i=1}^{n} K\left(\frac{t - t_i}{b}\right) y_i \qquad (2.21)$$
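The two versions of the kernel estimate, with the normalized weights (2.20) and with the Riemann-sum approximation (2.21), can be compared directly. The sketch below assumes the standardized time points $t_i = i/n$ and the rectangular kernel from the text; the function names `nw_smooth` and `riemann_smooth` and the test series are illustrative assumptions.

```python
import math


def rect_kernel(u):
    """Rectangular kernel K(u) = 1/2 on [-1, 1], zero elsewhere."""
    return 0.5 if abs(u) <= 1 else 0.0


def nw_smooth(y, t, b, kernel=rect_kernel):
    """Kernel estimate (2.19) with Nadaraya-Watson weights (2.20), t_i = i/n."""
    n = len(y)
    k = [kernel((t - (i + 1) / n) / b) for i in range(n)]
    total = sum(k)
    if total == 0:
        raise ValueError("no observations fall into the window around t")
    return sum(ki * yi for ki, yi in zip(k, y)) / total


def riemann_smooth(y, t, b, kernel=rect_kernel):
    """Kernel estimate (2.21): weights K((t - t_i)/b) / (n b), t_i = i/n."""
    n = len(y)
    return sum(kernel((t - (i + 1) / n) / b) * yi for i, yi in enumerate(y)) / (n * b)


# Hypothetical noisy observations of g(t) = sin(2 pi t) on the grid t_i = i/n.
n = 200
y = [math.sin(2 * math.pi * (i + 1) / n) + 0.1 * (-1) ** i for i in range(n)]

for t in (0.01, 0.25, 0.5):
    print(t, nw_smooth(y, t, 0.05), riemann_smooth(y, t, 0.05))
```

In the interior of [0, 1] the two estimates are nearly identical. Close to t = 0 or t = 1 part of the window lies outside the observed range, so the weights in (2.21) sum to less than one and the estimate is pulled towards zero, while the normalized weights (2.20) do not have this problem.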

In this case, the sum of the weights is not exactly equal to one, but asymptotically (as $n \to \infty$ and $b \to 0$ such that $nb^3 \to \infty$) this error is negligible. It can be shown that, under fairly general conditions on g and $\varepsilon$, $\hat{g}$ converges to g, in a certain sense that depends on the specific assumptions (see e.g. Gasser and Müller 1979, Gasser and Müller 1984, Härdle 1991, Beran and Feng 2002, Wand and Jones 1995, and references therein).

An alternative to kernel smoothing is local polynomial fitting (Fan and Gijbels 1995, 1996; also see Feng 1999). The idea is to fit a polynomial locally, i.e. to data in a small neighborhood of the point of interest. This can be formulated as a weighted least squares problem as follows:

$$\hat{g}(t) = \hat{\beta}_0 \qquad (2.22)$$

where $\hat{\beta} = (\hat{\beta}_0, \hat{\beta}_1, ..., \hat{\beta}_p)^t$ solves a local least squares problem defined by

$$\hat{\beta} = \arg\min_a \sum K\left(\frac{t_i - t}{b}\right) r_i^2(a). \qquad (2.23)$$

Here $r_i = y_i - [a_0 + a_1(t_i - t) + ... + a_p(t_i - t)^p]$, K is a kernel as above and b > 0 is the bandwidth defining the window of neighboring observations. It can be shown that asymptotically, a local polynomial smoother can be written as a kernel estimator (Ruppert and Wand 1994). A difference only occurs at the borders (t close to 0 or 1) where, in contrast to the local polynomial estimate, the kernel smoother has to be modified. The reason is that observations are no longer symmetrically spaced in the window $t \pm b$. A major advantage of local polynomials is that they automatically provide estimates of derivatives, namely $\hat{g}'(t) = \hat{\beta}_1$, $\hat{g}''(t) = 2\hat{\beta}_2$ etc. Kernel smoothing can also be used for estimation of derivatives; however different (and rather complicated) kernels have to be used for each derivative (Gasser and Müller 1984, Gasser et al. 1985). A third alternative, so-called wavelet


thresholding, will not be discussed here (see e.g. Daubechies 1992, Donoho and Johnstone 1995, 1998, Donoho et al. 1995, 1996, Vidakovic 1999, and Percival and Walden 2000 and references therein). A related method based on wavelets is discussed in Chapter 5.

Smoothing of two-dimensional distributions, sharpening

Estimating a relationship between x and y (where x and y are realizations of random variables X and Y respectively) amounts to estimating the joint two-dimensional distribution function $F(x, y) = P(X \le x, Y \le y)$. For continuous variables with $F(x, y) = \int_{u \le x}\int_{v \le y} f(u, v)\,du\,dv$, the density function f can be estimated, for instance, by a two-dimensional histogram. For visual and theoretical reasons, a better estimate is obtained by kernel estimation (see e.g. Silverman 1986) defined by

$$\hat{f}(x, y) = \frac{1}{nb_1b_2}\sum_{i=1}^{n} K(x_i - x, y_i - y; b_1, b_2) \qquad (2.24)$$

where the kernel K is such that $K(u, v) = K(-u, v) = K(u, -v) \ge 0$, and $\int\int K(u, v)\,du\,dv = 1$. Usually, $b_1 = b_2 = b$ and K(u, v) has compact support. Examples of kernels are $K(u, v) = \frac{1}{4} 1\{|u| \le 1\}1\{|v| \le 1\}$ (rectangular kernel with rectangular support), $K(u, v) = \pi^{-1} 1\{u^2 + v^2 \le 1\}$ (rectangular kernel with circular support), $K(u, v) = 2\pi^{-1}[1 - u^2 - v^2]$ (Epanechnikov kernel with circular support) or $K(u, v) = (2\pi)^{-1}\exp[-\frac{1}{2}(u^2 + v^2)]$ (normal density kernel with infinite support). In analogy to one-dimensional density estimation, it can be shown that under mild regularity conditions, $\hat{f}(x, y)$ is a consistent estimate of f(x, y), provided that $b_1, b_2 \to 0$, and $nb_1, nb_2 \to \infty$.

Graphical representations of two-dimensional distribution functions are
• 3-dimensional perspective plot: z = f(x, y) (or $\hat{f}(x, y)$) is plotted against x and y;
• contour plot: like in a geographic map, curves corresponding to equal levels of f are drawn in the x-y-plane;
• image plot: coloring of the x-y-plane with the color at point (x, y) corresponding to the value of f.
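A values grid for any of the plots listed above can be produced from the kernel estimate (2.24). The following Python sketch uses the normal density kernel; it assumes that the dependence on the bandwidths in (2.24) takes the usual form $K((x_i - x)/b_1, (y_i - y)/b_2)$, which is not spelled out above, and the names `normal_kernel` and `kde2d` as well as the data points are hypothetical.

```python
import math


def normal_kernel(u, v):
    """Normal density kernel K(u, v) = (2 pi)^(-1) exp(-(u^2 + v^2) / 2)."""
    return math.exp(-0.5 * (u * u + v * v)) / (2.0 * math.pi)


def kde2d(xs, ys, x, y, b1, b2, kernel=normal_kernel):
    """Two-dimensional kernel density estimate at the point (x, y).

    Assumes the scaled form K((x_i - x)/b1, (y_i - y)/b2) for the kernel
    term in (2.24).
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    if b1 <= 0 or b2 <= 0:
        raise ValueError("bandwidths must be positive")
    n = len(xs)
    total = sum(kernel((xi - x) / b1, (yi - y) / b2) for xi, yi in zip(xs, ys))
    return total / (n * b1 * b2)


# Hypothetical bivariate observations.
xs = [0.0, 1.0, -0.5, 0.3, 0.8]
ys = [0.0, 0.5, 1.0, 0.2, 0.9]

# Evaluate the estimate on a coarse grid, e.g. as input for a contour plot.
grid = [-2.0 + 0.5 * i for i in range(9)]
surface = [[kde2d(xs, ys, gx, gy, 0.7, 0.7) for gx in grid] for gy in grid]
for row in surface:
    print(" ".join(f"{value:.3f}" for value in row))
```

Since the kernel integrates to one and the sum is divided by $nb_1b_2$, the estimate is itself a density, which can be checked by summing the values on a fine grid.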
A simple way of enhancing the visual understanding of scatterplots is so-called sharpening (Tukey and Tukey 1981; also see Chambers et al. 1983): for given numbers a and b, only points with $a \le \hat{f}(x, y) \le b$ are drawn in the scatterplot. Alternatively, one may plot all points and highlight points with $a \le \hat{f}(x, y) \le b$.

Interpolation

Often a process may be generated in continuous time, but is observed at discrete time points. One may then wish to guess the values of the points


in between. Kernel and local polynomial smoothing provide this possibility, since $\hat{g}(t)$ can be calculated for any $t \in (0, 1)$. Alternatively, if the observations are assumed to be completely without “error”, i.e. $y_i = g(t_i)$, then deterministic interpolation can be used. The most popular method is spline interpolation. For instance, cubic splines connect neighboring observed values $y_{i-1}$, $y_i$ by cubic polynomials such that the first and second derivatives at the endpoints $t_{i-1}$, $t_i$ are equal. For observations $y_1, ..., y_n$ at equidistant time points $t_i$ with $t_i - t_{i-1} = t_j - t_{j-1} = \Delta t$ (i, j = 1, ..., n), we have n − 1 polynomials

$$p_i(t) = a_i + b_i(t - t_i) + c_i(t - t_i)^2 + d_i(t - t_i)^3 \quad (i = 1, ..., n-1) \qquad (2.25)$$

To achieve smoothness at the points $t_i$ where two polynomials $p_{i-1}$, $p_i$ meet, one imposes the condition that the polynomials and their first two derivatives are equal at $t_i$. This together with the conditions $p_i(t_i) = y_i$ leads to a system of 3(n − 2) + n = 4(n − 1) − 2 equations for 4(n − 1) parameters $a_i, b_i, c_i, d_i$ (i = 1, ..., n − 1). To specify a unique solution one therefore needs two additional conditions at the border. A typical assumption is $p''(t_1) = p''(t_n) = 0$ which defines so-called natural splines. Cubic splines have a physical meaning, since these are the curves that form when a thin rod is forced to pass through n knots (in our case the knots are $t_1, ..., t_n$), corresponding to minimum strain energy. The term “spline” refers to the thin flexible rods that were used in the past by draftsmen to draw smooth curves in ship design. In spite of their “natural” meaning, interpolation splines (and similarly other methods of interpolation) can be problematic since the interpolated values may be highly dependent on the specific method of interpolation and are therefore purely hypothetical unless the aim is indeed to build a ship.

Splines can also be used for smoothing purposes by removing the restriction that the curve has to go through all observed points. More specifically, one looks for a function $\hat{g}(t)$ such that

$$V(\lambda) = \sum_{i=1}^{n} (y_i - \hat{g}(t_i))^2 + \lambda\int_{-\infty}^{\infty} [\hat{g}''(t)]^2\,dt \qquad (2.26)$$

is minimized. The parameter $\lambda > 0$ controls the smoothness of the resulting curve. For small values of $\lambda$, the fitted curve will be rather rough but close to the data; for large values more smoothness is achieved but the curve is, in general, not as close to the data. The question of which $\lambda$ to choose reflects a standard dilemma in statistical smoothing: one needs to balance the aim of achieving a small bias ($\lambda$ small) against the aim of a small variance ($\lambda$ large). For a given value of $\lambda$, the solution to the minimization problem above turns out to be a natural cubic spline (see Reinsch 1967; also see Wahba 1990 and references therein). The solution can also be written as a kernel smoother with a kernel function K(u) proportional to $\exp(-|u|/\sqrt{2})\sin(\pi/4 + |u|/\sqrt{2})$ and a bandwidth b proportional to $\lambda^{1/4}$ (Silverman 1986). If $t_i = i/n$, then the bandwidth is exactly equal to $\lambda^{1/4}$.

Statistical inference

In this section, correlation, linear regression, nonparametric smoothing, and interpolation were introduced in an informal way, without exact discussion of probabilistic assumptions and statistical inference. All these techniques can be used in an informal way to explore possible structures without specific model assumptions. Sometimes, however, one wishes to obtain more solid conclusions by statistical tests and confidence intervals. There is an enormous literature on statistical inference in regression, including nonparametric approaches. For selected results see the references given above. For nonparametric methods also see Wand and Jones (1995), Simonoff (1996), Bowman and Azzalini (1997), Eubank (1999) and references therein.

2.5 Specific applications in music – bivariate

2.5.1 Empirical tempo-acceleration

Consider the tempo curves in Figure 2.3. An approximate measure of tempo-acceleration may be defined by

$$a(t_i) = \frac{\Delta^2 y(t)}{\Delta^2 t} = \frac{[y(t_i) - y(t_{i-1})] - [y(t_{i-1}) - y(t_{i-2})]}{[t_i - t_{i-1}] - [t_{i-1} - t_{i-2}]} \qquad (2.27)$$

where y(t) is the tempo (or log-tempo) at time t. Figures 2.10a through f show a(t) for the three performances by Cortot and Horowitz. From the pictures it is not easy to see to what extent there are similarities or differences. Consider now the pairs $(a_j(t_i), a_l(t_i))$ where $a_j$, $a_l$ are acceleration measurements of performance j and l respectively. We calculate the sample correlations for each pair $(j, l) \in \{1, ..., 28\} \times \{1, ..., 28\}$, ($j \ne l$). Figure 2.11a shows the correlations between Cortot 1 (1935) and the other performances. As expected, Cortot correlates best with Cortot: the correlation between Cortot 1 and Cortot’s other two performances (1947, 1953) is clearly highest. The analogous observation can be made for Horowitz 1 (1947) (Figure 2.11b). It is also interesting to compare how much overall resemblance there is between a selected performance and the other performances. For each of the 28 performances, the average and the maximal correlation with other performances were calculated. Figures 2.11c and d indicate that, in terms of acceleration, Cortot’s style appears to be quite unique among the pianists considered here. The overall (average and maximal) similarity between each of his three acceleration curves and the other performances is much smaller than for any other pianist.
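The summary behind Figures 2.11c and d (average and maximal correlation of each performance with all others) is straightforward to compute once the acceleration curves are available. The sketch below is an illustration only: the three short curves and their labels are invented placeholders, not the measured data, and the function name `similarity_summary` is an assumption.

```python
import math


def pearson(x, y):
    """Pearson's sample correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant curve")
    return sxy / math.sqrt(sxx * syy)


def similarity_summary(curves):
    """Map each performance to (mean, max) of its correlations with the others.

    `curves` is a dict: performance label -> acceleration values a(t_i),
    all measured at the same onset times.
    """
    names = sorted(curves)
    if len(names) < 2:
        raise ValueError("at least two performances are needed")
    summary = {}
    for j in names:
        others = [pearson(curves[j], curves[l]) for l in names if l != j]
        summary[j] = (sum(others) / len(others), max(others))
    return summary


# Invented placeholder curves (not taken from the recordings).
curves = {
    "PERF_A": [0.2, -0.1, 0.4, -0.3, 0.1, 0.0],
    "PERF_B": [0.3, -0.2, 0.5, -0.2, 0.0, 0.1],
    "PERF_C": [-0.1, 0.3, -0.2, 0.2, -0.4, 0.1],
}
for name, (mean_corr, max_corr) in similarity_summary(curves).items():
    print(name, round(mean_corr, 3), round(max_corr, 3))
```

A performance with a low mean and a low maximum stands apart from all others, which is the pattern described above for Cortot.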


[Figure 2.10, panels: a) Acceleration - Cortot (1935); b) Acceleration - Cortot (1947); c) Acceleration - Cortot (1953); d) Acceleration - Horowitz (1947); e) Acceleration - Horowitz (1963); f) Acceleration - Horowitz (1965). Horizontal axis: onset time t; vertical axis: a(t).]


Figure 2.10 Acceleration of tempo curves for Cortot and Horowitz.

2.5.2 Interpolated and smoothed tempo curves – velocity and acceleration

Conceptually it is plausible to assume that musicians control tempo in continuous time. The measure of acceleration given above is therefore a rather crude estimate of the actual acceleration curve. Interpolation splines provide a simple possibility to “guess” the tempo and its derivatives between the observed time points. One should bear in mind, however, that interpolation is always based on specific assumptions. For instance, cubic splines assume that the curve between two consecutive time points where observations are available is, or can be well approximated by, a third degree polynomial. This assumption can hardly be checked experimentally and can lead to undesirable effects. Figure 2.12 shows the observed and interpolated tempo for Martha Argerich. While most of the interpolated values seem plausible, there are a few rather doubtful interpolations (marked with arrows) where the cubic polynomial by far exceeds each of the two observed values at the neighboring knots.
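The overshoot effect described above can be reproduced on artificial data. The following Python sketch fits a natural cubic spline (second derivative zero at both ends) by solving the usual tridiagonal system for the second derivatives at the knots, and then flags intervals where the interpolated curve leaves the range of the two neighboring observations. The function names, the tolerance and the step-shaped test data are assumptions made for illustration; this is not the procedure used to produce Figure 2.12.

```python
import bisect


def natural_spline_second_derivs(t, y):
    """Second derivatives M_i of the natural cubic spline at the knots t_i."""
    n = len(t)
    if n < 3 or len(y) != n:
        raise ValueError("need at least three knots and matching y values")
    h = [t[i + 1] - t[i] for i in range(n - 1)]
    # Interior equations: h[i-1] M[i-1] + 2 (h[i-1] + h[i]) M[i] + h[i] M[i+1] = rhs
    diag = [2.0 * (h[i - 1] + h[i]) for i in range(1, n - 1)]
    rhs = [6.0 * ((y[i + 1] - y[i]) / h[i] - (y[i] - y[i - 1]) / h[i - 1])
           for i in range(1, n - 1)]
    m = len(diag)
    for k in range(1, m):  # forward elimination (Thomas algorithm)
        factor = h[k] / diag[k - 1]
        diag[k] -= factor * h[k]
        rhs[k] -= factor * rhs[k - 1]
    second = [0.0] * n  # natural boundary: M[0] = M[n-1] = 0
    for k in range(m - 1, -1, -1):  # back substitution
        second[k + 1] = (rhs[k] - h[k + 1] * second[k + 2]) / diag[k]
    return second


def spline_eval(t, y, second, s):
    """Value of the interpolating spline at s, with t[0] <= s <= t[-1]."""
    if not t[0] <= s <= t[-1]:
        raise ValueError("s lies outside the range of the knots")
    i = min(bisect.bisect_right(t, s) - 1, len(t) - 2)
    h = t[i + 1] - t[i]
    a, b = t[i + 1] - s, s - t[i]
    return ((second[i] * a ** 3 + second[i + 1] * b ** 3) / (6.0 * h)
            + (y[i] / h - second[i] * h / 6.0) * a
            + (y[i + 1] / h - second[i + 1] * h / 6.0) * b)


def overshooting_intervals(t, y, tol=0.05, steps=20):
    """Indices i where the spline on [t_i, t_{i+1}] leaves [min, max] by > tol."""
    second = natural_spline_second_derivs(t, y)
    flagged = []
    for i in range(len(t) - 1):
        lo, hi = min(y[i], y[i + 1]), max(y[i], y[i + 1])
        values = [spline_eval(t, y, second, t[i] + (t[i + 1] - t[i]) * k / steps)
                  for k in range(1, steps)]
        if max(values) > hi + tol or min(values) < lo - tol:
            flagged.append(i)
    return flagged


# Artificial "tempo" values with an abrupt change between t = 2 and t = 3.
t = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
print(overshooting_intervals(t, y))
```

On these data the spline rises above 1 right after the jump and dips below 0 right before it, although no observed value lies outside [0, 1]. Interpolated values of this kind are artefacts of the cubic assumption rather than information about the performance.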


Figure 2.11 Tempo acceleration – correlation with other performances. [Panels: a) Acceleration - Correlations of Cortot (1935) with other performances; b) Acceleration - Correlations of Horowitz (1947) with other performances; c) Mean correlations with other pianists; d) Maximal correlations with other pianists. Horizontal axis: performance; vertical axis: correlation.]


Figure 2.12 Martha Argerich – interpolation of tempo curve by cubic splines.

2.5.3 Tempo – hierarchical decomposition by smoothing

The tempo curve may be thought of as an aggregation of mostly smooth tempo curves at different onset-time-scales. This corresponds to the general structure of music as a mixture of global and local structures at various scales. It is therefore interesting to look at smoothed tempo curves, and their derivatives, at different scales. Reasonable smoothing bandwidths may be guessed from the general structure of the composition such as time signature(s), rhythmic, metric, melodic, and harmonic structure, and so on. For tempo curves of Schumann’s Träumerei (Figure 2.3), even multiples of 1/8th are plausible. Figures 2.13 through 2.16 show the following kernel-smoothed tempo curves with $b_1 = 8$, $b_2 = 1$, and $b_3 = 1/8$ respectively:

$$\hat{g}_1(t) = (nb_1)^{-1}\sum K\left(\frac{t - t_i}{b_1}\right) y_i \qquad (2.28)$$

$$\hat{g}_2(t) = (nb_2)^{-1}\sum K\left(\frac{t - t_i}{b_2}\right)[y_i - \hat{g}_1(t)] \qquad (2.29)$$

$$\hat{g}_3(t) = (nb_3)^{-1}\sum K\left(\frac{t - t_i}{b_3}\right)[y_i - \hat{g}_1(t) - \hat{g}_2(t)] \qquad (2.30)$$

and the residuals

$$\hat{e}(t) = y_i - \hat{g}_1(t) - \hat{g}_2(t) - \hat{g}_3(t). \qquad (2.31)$$
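The successive smoothing in (2.28) through (2.31) can be sketched in a few lines of Python. Several choices below are assumptions and not taken from the text: the Epanechnikov kernel, the normalization of the kernel weights so that they sum to one (instead of the factor $(nb)^{-1}$), the evaluation of each component at the observed onset times, and the artificial input series. The function names `smooth` and `decompose` are likewise hypothetical.

```python
import math


def epanechnikov(u):
    """Epanechnikov kernel on [-1, 1] (an assumed choice of K)."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1 else 0.0


def smooth(t, y, b, kernel=epanechnikov):
    """Kernel smoother evaluated at the observation times, weights summing to 1."""
    result = []
    for s in t:
        weights = [kernel((s - ti) / b) for ti in t]
        total = sum(weights)  # positive, since the point s itself gets weight K(0)
        result.append(sum(w * yi for w, yi in zip(weights, y)) / total)
    return result


def decompose(t, y, bandwidths=(8.0, 1.0, 0.125)):
    """Smooth y, then the remainder, and so on; return (components, residuals)."""
    components = []
    remainder = list(y)
    for b in bandwidths:
        g = smooth(t, remainder, b)
        components.append(g)
        remainder = [r - gi for r, gi in zip(remainder, g)]
    return components, remainder


# Artificial curve: slow trend plus a faster oscillation, on a grid of step 1/32.
t = [i / 32 for i in range(1, 257)]
y = [math.sin(s / 2.0) + 0.3 * math.sin(6.0 * s) for s in t]

(g1, g2, g3), e = decompose(t, y)
print(max(abs(v) for v in g2), max(abs(v) for v in g3), max(abs(v) for v in e))
```

By construction the three components and the residuals add up to the original series, so the decomposition splits the curve by scale without losing information. If the spacing of the observations is not smaller than the smallest bandwidth, the last smoothing step only sees the point itself and the residuals vanish.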

Figure 2.13 Smoothed tempo curves $\hat{g}_1(t) = (nb_1)^{-1}\sum K\left(\frac{t - t_i}{b_1}\right) y_i$ ($b_1 = 8$). [One panel per performance, ARGERICH through ZAK; horizontal axis: t.]

Figure 2.14 Smoothed tempo curves $\hat{g}_2(t) = (nb_2)^{-1}\sum K\left(\frac{t - t_i}{b_2}\right)[y_i - \hat{g}_1(t)]$ ($b_2 = 1$). [One panel per performance, ARGERICH through ZAK; horizontal axis: t.]

0

10 15 20 25 30 t

5

5

5

0 -3

0 -3

0 -3

10 15 20 25 30 t

0 -3

0

10 15 20 25 30 t

5

10 15 20 25 30 t

NOVAES 0

5

0

10 15 20 25 30 t

SHELLEY

5

10 15 20 25 30 t

ZAK

0 -3

0

5

10 15 20 25 30 t

0

5

10 15 20 25 30 t

Figure 2.15 Smoothed tempo curves gˆ3 (t) = (nb3 )−1 gˆ2 (t)] (b3 = 1/8).


1 -2

1

0

SCHNABEL -3

10 15 20 25 30 t

5

NEY

10 15 20 25 30 t

0

0 -3

5

5

0

5

10 15 20 25 30 t

KRUST

-3

0

0

ORTIZ

0

0

MOISEIWITSCH

10 15 20 25 30 t

0

10 15 20 25 30 t

KLIEN

10 15 20 25 30 t

-3

0 -3

5

5

0

5

5

HOROWITZ2

-3

0

0

KUBALEK

0

0

KATSARIS

10 15 20 25 30 t

0

10 15 20 25 30 t

HOROWITZ1

10 15 20 25 30 t

-3

0 -3

5

5

0

5

10 15 20 25 30 t

DEMUS

-3

0

0

HOROWITZ3

0

0

GIANOLI

10 15 20 25 30 t

5

DAVIES

10 15 20 25 30 t

-3

0 -3

5

0

10 15 20 25 30 t

0

0

10 15 20 25 30 t

ESCHENBACH

0

5

-3

5

10 15 20 25 30 t

CORTOT2

-3

0 -3

0 -3

0

0

10 15 20 25 30 t

CURZON

CORTOT3

5

CORTOT1 0

-3

0

10 15 20 25 30 t

0

10 15 20 25 30 t

-3

0

1 -2

5

5

CAPOVA

BUNIN

0

0

10 15 20 25 30 t

1

5

BRENDEL

-3

-2

0

ASKENAZE

-2

1

ARRAU

-2

1

ARGERICH

0

5

10 15 20 25 30 t

i )[yi − gˆ1 (t) − K( t−t b3

5

MOISEIWITSCH

-1.5

-1.5

0

5

0

10 15 20 25 30 t

5

10 15 20 25 30 t

5

10 15 20 25 30 t

-1.5

-1.5

-1.5

0

5

10 15 20 25 30 t

KRUST

0

10 15 20 25 30 t

5

5

10 15 20 25 30 t

NOVAES

0

10 15 20 25 30 t

5

10 15 20 25 30 t

ZAK

SHELLEY -1.5

0

HOROWITZ2

10 15 20 25 30 t

NEY

0

10 15 20 25 30 t

-1.5

1.5 -1.5

5

5

SCHNABEL

ORTIZ

0

5

KLIEN

0

10 15 20 25 30 t

10 15 20 25 30 t

-1.5

5

5

-1.5

KUBALEK

1.5

0

10 15 20 25 30 t 1.5

5

0

-1.5

1.5

KATSARIS

0

HOROWITZ1

10 15 20 25 30 t

10 15 20 25 30 t

-1.5

5

5

DEMUS

10 15 20 25 30 t

-1.5

-1.5

0

5

-1.5

0

-1.5

1.5

HOROWITZ3

0

GIANOLI

10 15 20 25 30 t

0

10 15 20 25 30 t

DAVIES

10 15 20 25 30 t

-1.5

-1.5

5

5

-1.5

0

ESCHENBACH

0

0

CURZON

10 15 20 25 30 t

10 15 20 25 30 t

CORTOT2

CORTOT1

10 15 20 25 30 t

-1.5

-1.5

5

5

1.5

5

0

10 15 20 25 30 t

-1.5

0

10 15 20 25 30 t

CORTOT3

0

5

CAPOVA -1.5

-1.5

5

0

10 15 20 25 30 t

1.5

5

BUNIN

0

-1.0

-1.0

1.0

0

10 15 20 25 30 t

1.5

5

1.5

-1.0

0

BRENDEL

ASKENAZE

ARRAU

-1.0

1.0

ARGERICH

0

5

10 15 20 25 30 t

0

5

10 15 20 25 30 t

Figure 2.16 Smoothed tempo curves – residuals eˆ(t) = yi − gˆ1 (t) − gˆ2 (t) − gˆ3 (t).


The tempo curves are thus decomposed into curves corresponding to a hierarchy of bandwidths. Each component reveals specific features. The first component $\hat{g}_1(t)$ reflects the overall tendency of the tempo. Most pianists have an essentially monotonically decreasing curve corresponding to a gradual ritardando that becomes more pronounced towards the end. For some performances (in particular Bunin, Capova, Gianoli, Horowitz 1, Kubalek, and Moiseiwitsch) there is a distinct initial acceleration with a local maximum in the middle of the piece. The second component $\hat{g}_2(t)$ reveals tempo fluctuations that correspond to a natural division of the piece into 8 times 4 bars. Some pianists, like Cortot, greatly emphasize the 8×4 structure. For other pianists, such as Horowitz, the 8×4 structure is less evident: the smoothed tempo curve is mostly quite flat, although the main (but smaller) tempo changes do take place at the junctions of the eight parts. Also striking is the distinction between part B (bars 17 to 24) and the other parts (A, A′, A″) of the composition, in particular in Argerich’s performance. The third component $\hat{g}_3(t)$ characterizes fluctuations at the resolution level of 2/8th. At this very local level, tempo changes frequently for pianists like Horowitz, whereas there is less local movement in Cortot’s performances. Finally, the residuals $\hat{e}(t)$ consist of the remaining fluctuations at the finest resolution of 1/8th. The similarity between the three residual curves by Horowitz illustrates that even at this very fine level, the “seismic” variation of tempo is a highly controlled process that is far from random.
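The hierarchical decomposition can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, a Gaussian kernel is used because the kernel K is not specified here, and the kernel weights are normalized to sum to one (a Nadaraya-Watson form) rather than divided by nb. Each component is obtained by smoothing what the coarser components have left unexplained.

```python
import numpy as np

def kernel_smooth(t, y, b):
    """Kernel-weighted average of y at every point of t (Gaussian kernel, assumed)."""
    u = (t[:, None] - t[None, :]) / b      # pairwise scaled distances
    w = np.exp(-0.5 * u**2)                # kernel weights
    return (w @ y) / w.sum(axis=1)         # weights normalized to sum to one

def hierarchical_decomposition(t, y, bandwidths=(8.0, 1.0, 0.125)):
    """Smooth successively with decreasing bandwidths.

    Returns the list of components (one per bandwidth) and the final residual.
    """
    t = np.asarray(t, dtype=float)
    resid = np.asarray(y, dtype=float).copy()
    components = []
    for b in bandwidths:
        g = kernel_smooth(t, resid, b)     # smooth the current residual
        components.append(g)
        resid = resid - g                  # pass the remainder to the next level
    return components, resid

# Synthetic "tempo" series on onset times 0, 1/8, ..., 32 - 1/8 (illustrative only)
t = np.arange(0, 32, 0.125)
y = -0.02 * t + 0.3 * np.sin(2 * np.pi * t / 4) + 0.1 * np.cos(2 * np.pi * t)
components, resid = hierarchical_decomposition(t, y)
```

By construction the components and the residual add up to the original series, so no information is lost; the bandwidths only determine how the variation is distributed over the levels.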

2.5.4 Tempo curves and melodic indicator

In Chapter 3, the so-called melodic indicator will be introduced. One of the aims will be to “explain” some of the variability in tempo curves by melodic structures in the score. Consider a simple melodic indicator $m(t) = w_{melod}(t)$ (see Section 3.3.4) that is essentially obtained by adding all indicators corresponding to individual motifs. Figures 2.17a and d display smoothed curves obtained by local polynomial smoothing of $-m(t)$ using a large and a small bandwidth respectively. Figures 2.17b and e show the first derivatives of the two curves in Figures 2.17a and d. Similarly, the second derivatives are given in Figures 2.17c and f. For the tempo curves, the first and second derivatives of local polynomial fits with b = 4 are given in Figures 2.18 and 2.19 respectively. A resemblance can be found in particular between the second derivative of $-m(t)$ in Figure 2.17f and the second derivatives of the tempo curves in Figure 2.19. There are also interesting similarities and differences between the performances with respect to the local variability of the first two derivatives. Many pianists start with a very small second derivative, with strongly increased values in part B.
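As a rough illustration of how derivatives can be read off a local polynomial fit, the sketch below fits a quadratic in $(t_i - t_0)$ by weighted least squares around every point $t_0$. The function name, the Gaussian weights and the bandwidth value are assumptions made for this example; the fits used for the figures may differ (they are specified in terms of a span).

```python
import numpy as np

def local_quadratic_fit(t, y, b):
    """Local quadratic fit at every point of t.

    Column 0: fitted value, column 1: first derivative,
    column 2: second derivative (twice the quadratic coefficient).
    """
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    out = np.empty((len(t), 3))
    for k, t0 in enumerate(t):
        d = t - t0
        sw = np.sqrt(np.exp(-0.5 * (d / b) ** 2))       # square roots of kernel weights
        X = np.vander(d, 3, increasing=True)            # columns 1, d, d^2
        beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
        out[k] = beta[0], beta[1], 2.0 * beta[2]
    return out

# Check on an exact quadratic, where the local fit reproduces the curve
t = np.linspace(0, 32, 65)
y = 3 + 2 * t + 0.5 * t**2
est = local_quadratic_fit(t, y, b=4.0)
```

For a series that is exactly quadratic the estimates coincide with the true curve and its derivatives, which makes a convenient sanity check before applying the fit to real tempo data.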


[Figure panels: a) −m(t) (span=24/32); b) −m′(t) (span=24/32); c) −m″(t) (span=24/32); d) −m(t) (span=8/32); e) −m′(t) (span=8/32); f) −m″(t) (span=8/32). Horizontal axis: t (0 to 30); vertical axes: mel. Ind., 1st der., 2nd der.]

Figure 2.17 Melodic indicator – local polynomial fits together with first and second derivatives.

2.5.5 Tempo and loudness

By invitation of Prince Charles, Vladimir Horowitz gave a benefit recital at London’s Royal Festival Hall on May 22, 1982. It was his first European appearance in 31 years. One of the pieces played at the concert was Schumann’s Kinderszene op. 15, No. 4. Figure 2.20 displays the (approximate) soundwave of Horowitz’s performance sampled from the CD recording. Two variables that can be extracted quite easily by visual inspection are: a) on the horizontal axis, the time when notes are played (and, derived from this quantity, the tempo) and b) on the vertical axis, loudness. More specifically, let $t_1, ..., t_n$ be the score onset times and $u(t_1), ..., u(t_n)$ the corresponding performance times. Then an approximate tempo at score onset time $t_i$ can be defined by $y(t_i) = (t_{i+1} - t_i)/(u(t_{i+1}) - u(t_i))$. A complication with loudness is that the amplitude level of piano sounds decreases gradually in a complex manner so that “loudness” as such is not defined exactly. For simplicity, we therefore define loudness as the initial amplitude level (or rather its logarithm). Moreover, we consider only events where the score onset time is a multiple of 1/8. For illustration, the first four events (score onset times 1/8, 2/8, 3/8, 4/8) are marked with arrows in Figure 2.20. An interesting question is what kind of relationship there may be between time delay y and loudness level x. The autocorrelations of $x(t_i) =$


[Figure: one panel per performance (ARGERICH through ZAK, as in Figures 2.13 to 2.16). Horizontal axis: t (0 to 30); vertical axis: 1st der.]

Figure 2.18 Tempo curves (Figure 2.3) – first derivatives obtained from local polynomial fits (span 24/32).


[Figure: one panel per performance (ARGERICH through ZAK, as in Figures 2.13 to 2.16). Horizontal axis: t (0 to 30); vertical axis: 2nd der.]

Figure 2.19 Tempo curves (Figure 2.3) – second derivatives obtained from local polynomial fits (span 8/32).


Figure 2.20 Kinderszene No. 4 – sound wave of performance by Horowitz at the Royal Festival Hall in London on May 22, 1982.

log(Amplitude) and $y(t_i)$ as well as the cross-correlations between the two time series are shown in Figure 2.21a. The most notable cross-correlation occurs at lag 8. This can also be seen visually when plotting $y(t_{i+8})$ against $x(t_i)$ (Figure 2.21b). There appears to be a strong relationship between the two variables, with the exception of four outliers. The three fitted lines correspond to a) a least squares linear regression fit using all data; b) a robust regression with high breakdown point and high efficiency (Yohai et al. 1991); and c) a least squares fit excluding the outliers. It should be noted that the “outliers” all occur together in a temporal cluster (see Figure 2.21c) and correspond to a phase where tempo is at its extreme (lowest for the first three outliers and fastest for the last outlier). This indicates that these are informative “outliers” (in contrast to wrong measurements) that should not be dismissed, since they may tell us something about the intention of the performer. Finally, Figure 2.21d displays a sharpened version of the scatterplot in Figure 2.21b: points with high estimated joint density $\hat{f}(x, y)$ are marked with “O”. In contrast to what one would expect from a regression model with random errors $\varepsilon_i$ that are independent of x, the points with highest density gather around a horizontal line rather than the regression line(s) fitted in Figure 2.21b. Thus, a linear regression model is hardly applicable. Instead, the data may possibly be divided into three clusters: a) a cluster with low loudness and low tempo; b) a second cluster with medium loudness and low to medium tempo; and c) a third cluster with a high level of loudness and medium to high tempo.
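The two basic computations of this section, the approximate tempo and a lagged correlation between loudness and tempo, can be sketched as follows. The function names are hypothetical and the data are synthetic; nothing here reproduces the Horowitz measurements.

```python
import numpy as np

def approximate_tempo(score_onsets, performance_times):
    """Tempo y(t_i) = (t_{i+1} - t_i) / (u(t_{i+1}) - u(t_i)) for i = 1, ..., n-1."""
    t = np.asarray(score_onsets, dtype=float)
    u = np.asarray(performance_times, dtype=float)
    return np.diff(t) / np.diff(u)

def lagged_correlation(x, y, lag):
    """Sample correlation between x(t_i) and y(t_{i+lag}) for lag >= 0."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if lag > 0:
        x, y = x[:-lag], y[lag:]           # align x(t_i) with y(t_{i+lag})
    return np.corrcoef(x, y)[0, 1]

# Score onsets in steps of 1/8, played at exactly half speed (synthetic)
t = np.arange(0, 2, 0.125)
u = 2.0 * t
tempo = approximate_tempo(t, u)

# Synthetic series in which y follows x with a delay of 8 steps
rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = np.concatenate([np.zeros(8), x[:-8]])
r8 = lagged_correlation(x, y, 8)
```

Scanning `lagged_correlation` over a range of lags gives one half of the cross-correlation function; a pronounced value at a single lag, such as lag 8 above, is what motivates plotting one series against the shifted other.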


Figure 2.21 log(Amplitude) and tempo for Kinderszene No. 4 – auto- and cross-correlations (Figure 2.21a), scatter plot with fitted least squares and robust lines (Figure 2.21b), time series plots (Figure 2.21c), and sharpened scatter plot (Figure 2.21d).


2.5.6 Loudness and tempo – two-dimensional distribution function

In the example above, the correlation between loudness and tempo, when measured at the same time, turned out to be relatively small, whereas there appeared to be quite a clear lagged relationship. Does this mean that there is indeed no “immediate” relationship between these two variables? Consider $x(t_i) = \log(\text{Amplitude})$ and the logarithm of tempo. The scatterplot and the boxplot in Figures 2.22a and b suggest that there may be a relationship after all, but that the dependence is nonlinear. This is further supported by the two-dimensional histogram (Figure 2.23a), the smoothed density (Figure 2.24a) and the corresponding image plots (Figures 2.23b and 2.24b; the actual observations are plotted as stars). The density was estimated by a kernel estimate with the Epanechnikov kernel. Since correlation only measures linear dependence, it cannot detect this kind of highly nonlinear relationship.
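A two-dimensional kernel density estimate with the Epanechnikov kernel can be sketched as below. The product form of the kernel, the bandwidths and the function names are assumptions made for illustration; the text does not state which bivariate version of the kernel was used.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel K(u) = 0.75 (1 - u^2) on [-1, 1], zero elsewhere."""
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

def kde2d(x, y, grid_x, grid_y, bx, by):
    """Product-kernel density estimate evaluated on grid_x x grid_y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    kx = epanechnikov((grid_x[:, None] - x[None, :]) / bx) / bx   # shape (Gx, n)
    ky = epanechnikov((grid_y[:, None] - y[None, :]) / by) / by   # shape (Gy, n)
    return kx @ ky.T / len(x)                                     # shape (Gx, Gy)

# Synthetic data standing in for (log(Amplitude), log(tempo))
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = rng.normal(size=50)
gx = np.linspace(x.min() - 0.5, x.max() + 0.5, 201)
gy = np.linspace(y.min() - 0.5, y.max() + 0.5, 201)
f = kde2d(x, y, gx, gy, bx=0.5, by=0.5)
mass = f.sum() * (gx[1] - gx[0]) * (gy[1] - gy[0])   # Riemann sum, close to 1
```

The grid extends one bandwidth beyond the data in each direction so that the whole estimated density is captured; the array `f` can be passed directly to a perspective or image plotting routine.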

Figure 2.22 Horowitz’ performance of Kinderszene No. 4 – log(tempo) versus log(Amplitude) and boxplots of log(tempo) for three ranges of amplitude.

2.5.7 Melodic tempo-sharpening

Sharpening can also be applied by using an “external” variable. This is illustrated in Figures 2.25 through 2.27. Figure 2.25a displays the estimated density function of $\log(m+1)$ where $m(t)$ is the value of a melodic indicator at onset time t. The marked region corresponds to very high values of the density function f (namely $f(x) > 0.793$). This defines a set $I_{sharp}$ of corresponding “sharpening onset times”. The series $m(t)$ is shown in Figure 2.25b, with sharpening onset times $t \in I_{sharp}$ highlighted by vertical


Figure 2.23 Horowitz’ performance of Kinderszene No. 4 – two-dimensional histogram of (x, y) = (log(tempo), log(Amplitude)) displayed in a perspective and image plot respectively.

Figure 2.24 Horowitz’ performance of Kinderszene No. 4 – kernel estimate of two-dimensional distribution of (x, y) = (log(tempo), log(Amplitude)) displayed in a perspective and image plot respectively.


Figure 2.25 R. Schumann, Träumerei op. 15, No. 7 – density of melodic indicator with sharpening region (a) and melodic curve plotted against onset time, with sharpening points highlighted (b).

[Figure panels: CORTOT1, CORTOT2, CORTOT3, HOROWITZ1, HOROWITZ2, HOROWITZ3; vertical axis: tempo.]

Figure 2.26 R. Schumann, Träumerei op. 15, No. 7 – tempo by Cortot and Horowitz at sharpening onset times.

[Figure panels: CORTOT1, CORTOT2, CORTOT3, HOROWITZ1, HOROWITZ2, HOROWITZ3; vertical axis: diff(tempo), from −10 to 10.]

Figure 2.27 R. Schumann, Träumerei op. 15, No. 7 – tempo “derivatives” for Cortot and Horowitz at sharpening onset times.


lines. Figures 2.26 and 2.27 show the tempo y and its discrete “derivative” $v(t_i) = [y(t_{i+1}) - y(t_i)]/(t_{i+1} - t_i)$ for $t_i \in I_{sharp}$ and the performances by Cortot and Horowitz. The pictures indicate a systematic difference between Cortot and Horowitz. A common feature is the negative derivative at the fifth and sixth sharpening onset time.

2.6 Some multivariate descriptive displays

2.6.1 Definitions

Suppose that we observe multivariate data $x_1, x_2, ..., x_n$ where each $x_i$ is a p-dimensional vector $(x_{i1}, ..., x_{ip})^t \in R^p$. Obvious numerical summary statistics are the sample mean

$$\bar{x} = (\bar{x}_1, \bar{x}_2, ..., \bar{x}_p)^t,$$

where $\bar{x}_j = n^{-1}\sum_{i=1}^{n} x_{ij}$, and the $p \times p$ covariance matrix S with elements

$$S_{jl} = (n-1)^{-1}\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{il} - \bar{x}_l).$$
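As a small worked instance of the two summary statistics, the sketch below computes the sample mean vector and the covariance matrix directly from their definitions. The function name and the toy data are illustrative assumptions.

```python
import numpy as np

def mean_and_covariance(X):
    """Sample mean vector and (n-1)-normalized covariance matrix of the rows of X."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    xbar = X.sum(axis=0) / n               # component-wise means
    centered = X - xbar
    S = centered.T @ centered / (n - 1)    # S_jl = sum_i (x_ij - xbar_j)(x_il - xbar_l) / (n-1)
    return xbar, S

# Three observations of a two-dimensional vector (toy data)
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]])
xbar, S = mean_and_covariance(X)
```

The result agrees with NumPy's built-in `np.cov(X, rowvar=False)`, which uses the same n − 1 normalization by default.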

Most methods for analyzing multivariate data are based on these two statistics. One of the main tools consists of dimension reduction by suitable projections, since it is easier to find and visualize structure in low dimensions. These techniques go far beyond descriptive statistics. We therefore postpone the discussion of these methods to Chapters 8 to 11.

Another set of methods consists of visualizing individual multivariate observations. The main purpose is a simple visual identification of similarities and differences between observations, as well as the search for clusters and other patterns. Typical examples are:

• Faces: $x_i = (x_{i1}, ..., x_{ip})^t$ is represented by a face with features depending on the values of corresponding coordinates. For instance, the face function in S-Plus has the following correspondence between coordinates and feature parameters: $x_{i,1}$ = area of face; $x_{i,2}$ = shape of face; $x_{i,3}$ = length of nose; $x_{i,4}$ = location of mouth; $x_{i,5}$ = curve of smile; $x_{i,6}$ = width of mouth; $x_{i,7}$ = location of eyes; $x_{i,8}$ = separation of eyes; $x_{i,9}$ = angle of eyes; $x_{i,10}$ = shape of eyes; $x_{i,11}$ = width of eyes; $x_{i,12}$ = location of pupil; $x_{i,13}$ = location of eyebrow; $x_{i,14}$ = angle of eyebrow; $x_{i,15}$ = width of eyebrows.

• Stars: Each coordinate is represented by a ray in a star, the length of each ray corresponding to the value of the coordinate. More specifically, a star for a data vector $x_i = (x_{i1}, ..., x_{ip})^t$ is constructed as follows:
1. Scale $x_i$ to the range [0, r]: $0 \le x_{1j}, ..., x_{nj} \le r$;
2. Draw p rays at angles $\varphi_j = 2\pi(j-1)/p$ (j = 1, ..., p); for a star with origin 0 representing observation $x_i$, the end point of the jth ray has the coordinates $r \cdot (x_{ij}\cos\varphi_j, x_{ij}\sin\varphi_j)$;
3. For visual reasons, the end points of the rays may be connected by straight lines.

• Profiles: An observation $x_i = (x_{i1}, ..., x_{ip})^t$ is represented by a plot of $x_{ij}$ versus j where neighboring points $x_{i,j-1}$ and $x_{ij}$ (j = 2, ..., p) are connected.

• Symbol plot: The horizontal and vertical positions represent $x_{i1}$ and $x_{i2}$ respectively (or any other two coordinates of $x_i$). The other coordinates $x_{i3}, ..., x_{ip}$ determine p − 2 characteristic shape parameters of a geometric object that is plotted at point $(x_{i1}, x_{i2})$. Typical symbols are circles (one additional dimension), rectangles (two additional dimensions), stars (arbitrary number of additional dimensions), and faces (arbitrary number of additional dimensions).

2.7 Specific applications in music – multivariate

2.7.1 Distribution of notes – Chernoff faces

In music that is based on scales, pitch (modulo 12) is usually not equally distributed. Notes that belong to the main scale are more likely to occur, and within these, there are certain preferred notes as well (e.g. the roots of the tonic, subtonic and supertonic triads). To illustrate this, we consider the following compositions:
1. Saltarello (Anonymus, 13th century);
2. Prelude and Fugue No. 1 from “Das Wohltemperierte Klavier” (J. S. Bach, 1685-1750);
3. Kinderszene op. 15, No. 1 (R. Schumann, 1810-1856);
4. Piano piece op. 19, No. 2 (A. Schönberg, 1874-1951; Figure 2.28);
5. Rain Tree Sketch 1 (T. Takemitsu, 1930-1996).

For each composition, the distribution of notes (pitches) modulo 12 is calculated and centered around the “central pitch” (defined as the most frequent pitch modulo 12). Thus, the central pitch is defined as zero. We then obtain five vectors of relative frequencies $p_j = (p_{j0}, ..., p_{j11})^t$ (j = 1, ..., 5) characterizing the five compositions.
In addition, for each of these vectors the number $n_j$ of local peaks in $p_j$ is calculated. We say that a local peak at $i \in \{1, ..., 10\}$ occurs if $p_{ji} > \max(p_{j,i-1}, p_{j,i+1})$. For i = 11, we say that a local peak occurs if $p_{ji} > p_{j,i-1}$. Figure 2.29a displays Chernoff faces of the 12-dimensional vectors $v_j = (n_j, p_{j1}, ..., p_{j11})^t$. In Figure 2.29b, the coordinates of $v_j$ (and thus the assignment of feature variables) were permuted. The two plots illustrate the usefulness of Chernoff faces, and at the same time the difficulties in finding an objective interpretation. On one hand, the method discovers a plausible division into two groups: both pictures show a clear distinction between classical tonal music (first three faces) and the three representatives of “avant-garde” music of the 20th century. On the other hand, the exact nature of the distinction cannot be seen. In Figure 2.29a, the classical faces look much more friendly than the rather miserable avant-garde fellows. The judgment of conservative music lovers that “avant-garde” music is unbearable, depressing, or even bad for health, seems to be confirmed! Yet, bad temper is the response of the classical masters to a simple permutation of the variables (Figure 2.29b), whereas the grim avant-garde seems to be much more at ease. The difficulty in interpreting Chernoff faces is that the result depends on the order of the variables, whereas due to their psychological effect most feature variables are not interchangeable.
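The feature vectors fed into the faces can be computed as in the following sketch. The function names are hypothetical, pitches are assumed to be given as integer semitone numbers (for instance MIDI note numbers), ties for the most frequent pitch class are resolved in favor of the lowest class, and the one-sided peak rule is applied at the last index (taken here to be i = 11).

```python
import numpy as np

def centered_pitch_distribution(pitches):
    """Relative frequencies of pitch modulo 12, rotated so that the
    most frequent pitch class sits at index 0."""
    counts = np.bincount(np.asarray(pitches, dtype=int) % 12, minlength=12)
    center = int(np.argmax(counts))                # "central pitch"
    return np.roll(counts, -center) / counts.sum()

def count_local_peaks(p):
    """Number of local peaks among indices 1, ..., 11 of the centered distribution."""
    n = 0
    for i in range(1, 11):
        if p[i] > max(p[i - 1], p[i + 1]):
            n += 1
    if p[11] > p[10]:                              # one-sided rule at the last index
        n += 1
    return n

# Toy "composition": pitch classes 0 (three times), 2, 4 (twice), 7
pitches = [60, 60, 60, 62, 64, 64, 67]
p = centered_pitch_distribution(pitches)
n_peaks = count_local_peaks(p)
v = np.concatenate([[n_peaks], p[1:]])             # 12-dimensional feature vector
```

Because the distribution is centered on the most frequent pitch class, transposing the whole piece leaves `p`, and hence `v`, unchanged.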

Figure 2.28 Arnold Schönberg (1874-1951), self-portrait. (Courtesy of Verwertungsgesellschaft Bild-Kunst, Bonn.)

2.7.2 Distribution of notes – star plots

We consider once more the distribution vectors $p_j = (p_{j0}, ..., p_{j11})^t$ of pitch modulo 12 where 0 is the tonal center. In contrast to Chernoff faces, permutation of coordinates in star plots is much less likely to have a subjective influence on the interpretation of the picture. Nevertheless, certain patterns can become more visible when using an appropriate ordering of the variables. From the point of view of tonal music, a natural ordering of pitch can be obtained, for instance, from the ascending circle of fourths. This leads to the following permutation $p_j^* = (p_5, p_{10}, p_3, p_8, p_1, p_6, p_{11}, p_4, p_9, p_2, p_7)^t$. ($p_0$ is omitted, since it is maximal by definition for all compositions.) Since stars are easy to look at, it is possible to compare a large number of observations simultaneously. We consider the following set of compositions:


[Panel a: faces labelled ANONYMUS, BACH, SCHUMANN, WEBERN, SCHOENBERG, TAKEMITSU.]

Figure 2.29 a) Chernoff faces for 1. Saltarello (Anonymus, 13th century); 2. Prelude and Fugue No. 1 from “Das Wohltemperierte Klavier” (J. S. Bach, 1685-1750); 3. Kinderszene op. 15, No. 1 (R. Schumann, 1810-1856); 4. Piano piece op. 19, No. 2 (A. Schönberg, 1874-1951); 5. Rain Tree Sketch 1 (T. Takemitsu, 1930-1996).

[Panel b: faces labelled ANONYMUS, BACH, SCHUMANN, WEBERN, SCHOENBERG, TAKEMITSU.]

Figure 2.29 b) Chernoff faces for the same compositions as in Figure 2.29a, after permuting coordinates.


• A. de la Halle (1235?-1287): “Or est Bayard en la pature, hure!”;
• J. de Ockeghem (1425-1495): Canon epidiatesseron;
• J. Arcadelt (1505-1568): a) Ave Maria, b) La ingratitud, c) Io dico fra noi;
• W. Byrd (1543-1623): a) Ave Verum Corpus, b) Alman, c) The Queen’s Alman;
• J.P. Rameau (1683-1764): a) La Poplinière, b) Le Tambourin, c) La Triomphante;
• J.S. Bach (1685-1750): Das Wohltemperierte Klavier – Preludes and Fugues No. 5, 6 and 7;
• D. Scarlatti (1685-1757): Sonatas K 222, K 345 and K 381;
• J. Haydn (1732-1809): Sonata op. 34, No. 2;
• W.A. Mozart (1756-1791): 2nd movements of Sonatas KV 332, KV 545 and KV 333;
• M. Clementi (1752-1832): Gradus ad Parnassum – Studies 2 and 9 (Figure 11.4);
• R. Schumann (1810-1856): Kinderszenen op. 15, No. 1, 2, and 3;
• F. Chopin (1810-1849): a) Nocturne op. 9, No. 2, b) Nocturne op. 32, No. 1, c) Etude op. 10, No. 6;
• R. Wagner (1813-1883): a) Bridal Chorus from “Lohengrin”, b) Ouverture to Act 3 of “Die Meistersinger”;
• C. Debussy (1862-1918): a) Clair de lune, b) Arabesque No. 1, c) Reflets dans l’eau;
• A. Scriabin (1872-1915): Preludes op. 2/2, op. 11/14 and op. 13/2;
• B. Bartók (1881-1945): a) Bagatelle op. 11, No. 2 and 3, b) Sonata for Piano;
• O. Messiaen (1908-1992): Vingt regards sur l’Enfant-Jésus, No. 3;
• S. Prokoffieff (1891-1953): Visions fugitives No. 11, 12 and 13;
• A. Schönberg (1874-1951): Piano piece op. 19, No. 2;
• T. Takemitsu (1930-1996): Rain Tree Sketch No. 1;
• A. Webern (1883-1945): Orchesterstück op. 6, No. 6;
• J. Beran (*1959): Śānti – piano concert No. 2 (beginning of 2nd Mov.)

The star plots of $p_j^*$ are given in Figure 2.31. From Halle (cf. Figure 2.30) up to about the early Scriabin, the long beams form more or less a half-circle. This means that the most frequent notes are neighbors in the circle of fourths and are much more frequent than all other notes. This is indeed what one would expect in music composed in the tonal system. The picture starts changing in the neighborhood of Scriabin, where long beams are either isolated (most extremely for Bartók’s Bagatelle No. 3) or tend to cover more or less the whole range of notes (e.g. Bartók, Prokoffieff, Takemitsu, Beran). Due to the variety of styles in the 20th century, the specific shape of each of the stars would need to be discussed in detail individually. For instance, Messiaen’s shape may be explained by the specific scales (Messiaen scales) he used. Generally speaking, the difference between star plots of the 20th century and earlier music reflects the replacement of the traditional tonal system with major/minor scales by other principles.
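The reordering by the ascending circle of fourths and the star construction of Section 2.6.1 can be sketched as follows. The function names are hypothetical, and the sketch assumes that the coordinates have already been scaled to [0, 1], so that r is simply the maximal ray length; this is one possible reading of the scaling step.

```python
import numpy as np

def circle_of_fourths_order(p):
    """Reorder (p_0, ..., p_11) as (p_5, p_10, p_3, ..., p_7), omitting p_0."""
    p = np.asarray(p, dtype=float)
    idx = [(5 * k) % 12 for k in range(1, 12)]   # successive steps of a fourth (5 semitones)
    return p[idx]

def star_endpoints(x, r=1.0):
    """End points of the rays of a star for coordinates x scaled to [0, 1].

    Ray j (j = 1, ..., p) points in direction phi_j = 2 pi (j - 1) / p.
    """
    x = np.asarray(x, dtype=float)
    p = len(x)
    phi = 2 * np.pi * np.arange(p) / p
    return r * np.column_stack((x * np.cos(phi), x * np.sin(phi)))

order = circle_of_fourths_order(np.arange(12))   # shows the index permutation itself
ends = star_endpoints([1.0, 1.0, 1.0, 1.0])      # a star with four rays of full length
```

Connecting consecutive rows of `ends` (and the last row back to the first) with straight lines gives the outline of the star; in the reordered vector, pitches that are adjacent on the circle of fourths appear as adjacent rays.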

Figure 2.30 The minnesinger Burchard von Wengen (1229-1280), contemporary of Adam de la Halle (1235?-1288). (From Codex Manesse, courtesy of the University Library Heidelberg.) (Color figures follow page 152.)


[Figure title: Distribution of notes ordered according to ascending circle of fourths. One star per composition, labelled HALLE, OCKEGHEM, ARCADELT (3), BYRD (3), RAMEAU (3), BACH (3), SCARLATTI (3), HAYDN, MOZART (3), CLEMENTI (2), SCHUMANN (3), CHOPIN (3), WAGNER (2), DEBUSSY (3), SCRIABIN (3), BARTOK (3), PROKOFFIEFF (3), MESSIAEN, SCHOENBERG, WEBERN, TAKEMITSU, BERAN.]

Figure 2.31 Star plots of $p_j^* = (p_5, p_{10}, p_3, p_8, p_1, p_6, p_{11}, p_4, p_9, p_2, p_7)^t$ for compositions from the 13th to the 20th century.

2.7.3 Joint distribution of interval steps of envelopes

Consider a composition consisting of onset times $t_i$ and pitch values $x(t_i)$. In a polyphonic score, several notes may be played simultaneously. To simplify analysis, we define a simplified score by considering the lower and upper envelope:

Definition 24 Let

$$C = \{(t_i, x(t_i)) : t_i \in A,\ x(t_i) \in B,\ i = 1, 2, ..., N\} = \bigcup_{j=1}^{n} C_j,$$

where $A = \{t_1^*, ..., t_n^*\} \subset Z_+$ ($t_1^* < t_2^* < ... < t_n^*$), $B \subset R$ or $Z$ and $C_j = \{(t, x(t)) \in C : t = t_j^*\}$. Then the lower and upper envelope of C are defined by

$$E_{low} = \{(t_j^*, \min_{(t,x(t)) \in C_j} x(t)),\ j = 1, ..., n\}$$

and

$$E_{up} = \{(t_j^*, \max_{(t,x(t)) \in C_j} x(t)),\ j = 1, ..., n\}.$$

In other words, for each onset time, the lowest and highest note are selected to define the lower and upper envelope respectively. In the example below, we consider interval steps $\Delta y(t_i) = y(t_{i+1}) - y(t_i) \bmod 12$ for the upper envelope of a composition with onset times $t_1, ..., t_n$ and pitches $y(t_1), ..., y(t_n)$. A simple aspect of melodic and harmonic structure is the question in which sequence intervals are likely to occur. Here, we look at the empirical two-dimensional distribution of $(\Delta y(t_i), \Delta y(t_{i+1}))$. For each pair (i, j) ($-11 \le i, j \le 11$; $i, j \ne 0$), we count the number $n_{ij}$ of occurrences and define $N_{ij} = \log(n_{ij} + 1)$. (The value 0 is excluded here, since repetitions of a note, or transpositions by an octave, are less interesting.) If only the type of interval and not its direction is of interest, then i, j assume the values 1 to 11 only. A useful representation of $N_{ij}$ can be obtained by a symbol plot. In Figures 2.32 and 2.33, the x- and y-coordinates correspond to i and j respectively. The radius of a circle with center (i, j) is proportional to $N_{ij}$. The compositions considered here are: a) J.S. Bach: Präludium No. 1 from “Das Wohltemperierte Klavier”; b) W.A. Mozart: Sonata KV 545 (beginning of 2nd movement); c) A. Scriabin: Prélude op. 51, No. 4; and d) F. Martin: Prélude No. 6. For Bach’s piece, there is a clear clustering in three main groups in the first plot (there are almost never two successive interval steps downwards) and a horseshoe-like pattern for absolute intervals. Remarkable is the clear negative correlation in Mozart’s first plot and the concentration on a few selected interval sequences. A negative correlation in the plots of interval steps with sign can also be found for Scriabin and Martin. However, considering only the types of intervals without their sign, the number and variety of interval sequences that are used relatively frequently is much higher for Scriabin and even more for Martin.
For Martin, the plane of absolute intervals (Figure 2.33d) is filled almost uniformly.

2.7.4 Pitch distribution – symbol plots with circles

Consider once more the distribution vectors $p_j = (p_{j0}, ..., p_{j11})^t$ of pitch modulo 12 as in the star-plot example above. The star plots show a clear distinction between “modern” compositions and classical tonal compositions. Symbol plots can be used to see more clearly which composers (or compositions) are close with respect to $p_j$. In Figure 2.34 the x- and y-axes correspond to $p_{j5}$ and $p_{j7}$. Recall that if 0 is the root of the tonic triad, then 5 is the root of the subdominant and 7 the root of the dominant


Figure 2.32 Symbol plot of the distribution of successive interval pairs $(\Delta y(t_i), \Delta y(t_{i+1}))$ (a, c) and their absolute values (b, d) respectively, for the upper envelopes of Bach’s Präludium No. 1 (Das Wohltemperierte Klavier I) and Mozart’s Sonata KV 545 (beginning of 2nd movement).


Figure 2.33 Symbol plot of the distribution of successive interval pairs (∆y(ti ), ∆y(ti+1 )) (a, c) and their absolute values (b, d) respectively, for the upper envelopes of Scriabin’s Prélude op. 51, No. 4 and F. Martin’s Prélude No. 6.



triad. The radius of the circles in Figure 2.34 is proportional to pj1 , the frequency of the “dissonant” minor second. In color Figure 2.35, the radius represents pj6 , i.e. the augmented fourth. Both plots show a clear positive relationship between pj5 and pj7 . Moreover the circles tend to be larger for small values of x and y. The positioning in the plane together with the size of the circles separates (apart from a few exceptions) classical tonal compositions from more recent ones. To visualize this, four different colors are chosen for “early music” (black), “baroque and classical” (green), “romantic” (blue) and “20/21st century” (red). The clustering of the four colors indicates that there is indeed an approximate clustering according to the four time periods. Interesting exceptions can be observed for “early” music with two extreme “outliers” (Halle and Arcadelt). Also, one piece by Rameau is somewhat far from the rest.

[Figure: circles labelled by composer name; both axes range from 0.0 to 0.20.]

Figure 2.34 Symbol plot with $x = p_{j5}$, $y = p_{j7}$ and radius of circles proportional to $p_{j1}$.

2.7.5 Pitch distribution – symbol plots with rectangles

By using rectangles, four dimensions can be represented. Color Figure 2.36 shows a symbol plot with (x, y)-coordinates $(p_{j5}, p_{j7})$ and rectangles with width


Figure 2.35 Symbol plot with $x = p_{j5}$, $y = p_{j7}$ and radius of circles proportional to $p_{j6}$. (Color figures follow page 152.)

$p_{j1}$ (diminished second) and height $p_{j6}$ (augmented fourth). Using the same colors for the names as above, a clustering similar to that in the circle plot can be observed. The picture not only visualizes a clear four-dimensional relationship between $p_{j1}$, $p_{j5}$, $p_{j6}$ and $p_{j7}$, but also shows that these quantities are related to the time period.

2.7.6 Pitch distribution – symbol plots with stars

Five dimensions are visualized in color Figure 2.37 with $(x, y) = (p_{j5}, p_{j7})$ and the variables $p_{j1}$, $p_{j6}$ and $p_{j10}$ (diminished seventh) defining a star plot for each observation, the first variable starting on the right and the subsequent variables winding counterclockwise around the star (in this case a triangle). The shape of the triangle is obviously a characteristic of the time period. For tonal music composed mostly before about 1900, the stars are very narrow with a relatively long beam in the direction of the diminished seventh. The diminished seventh is indeed an important pitch in tonal music, since it is the fourth note in the dominant seventh chord to the subdominant. In contrast, notes that are a diminished second and an


Figure 2.36 Symbol plot with $x = p_{j5}$, $y = p_{j7}$. The rectangles have width $p_{j1}$ (diminished second) and height $p_{j6}$ (augmented fourth). (Color figures follow page 152.)


augmented fourth above the root of the tonic triad form, together with the tonic root, highly dissonant intervals and are therefore less frequent in tonal music. Color Figure 2.37 shows the triangles; the names without the triangles are plotted in color Figure 2.38.

2.7.7 Pitch distribution – profile plots

Finally, as an alternative to star plots, Figure 2.39 displays profile plots of $p_j^* = (p_5, p_{10}, p_3, p_8, p_1, p_6, p_{11}, p_4, p_9, p_2, p_7)^t$. For compositions up to about 1900, the profiles are essentially U-shaped. This corresponds to stars with clustered long and short beams respectively, as seen previously. For “modern” compositions, there is a large variety of shapes different from a U-shape.


Figure 2.37 Symbol plot with $x = p_{j5}$, $y = p_{j7}$, and triangles defined by $p_{j1}$ (diminished second), $p_{j6}$ (augmented fourth) and $p_{j10}$ (diminished seventh). (Color figures follow page 152.)


Figure 2.38 Names plotted at locations $(x, y) = (p_{j5}, p_{j7})$. (Color figures follow page 152.)


Figure 2.39 Profile plots of $p_j^* = (p_5, p_{10}, p_3, p_8, p_1, p_6, p_{11}, p_4, p_9, p_2, p_7)^t$.


CHAPTER 3

Global measures of structure and randomness

3.1 Musical motivation

Essential aspects of music may be summarized under the keywords “structure”, “information” and “communication”. Even aleatoric pieces where events are generated randomly (e.g. Cage, Xenakis, Lutosławski) have structure and information induced by the definition of specific random distributions. It is therefore meaningful to measure the amount of structure and information contained in a composition. Clearly, this is a nontrivial task and many different, and possibly controversial, definitions can be invented. In this chapter, two types of measures are discussed: 1) general global measures of information or randomness, and 2) specific local measures indicating metric, melodic, and harmonic structures.

3.2 Basic principles

3.2.1 Measuring information and randomness

There is an enormous amount of literature on information measures and their applications. In this section, only some fundamental definitions and results are reviewed. These and other classical results can be found, in particular, in Fisher (1925, 1956), Hartley (1928), Bhattacharyya (1946a), Erdős (1946), Wiener (1948), Shannon (1948), Shannon and Weaver (1949), Barnard (1951), McMillan (1953), Mandelbrot (1953, 1956), Khinchin (1953, 1956), Goldman (1953), Bartlett (1955), Brillouin (1956), Kolmogorov (1956), Ashby (1956), Joshi (1957), Kullback (1959), Wolfowitz (1957, 1958, 1961), Woodward (1953), Rényi (1959a,b, 1961, 1965, 1970). Also see e.g. Ash (1965) for an overview.

A classical measure of information (or randomness) is entropy, which is also called Shannon information (Shannon 1948, Shannon and Weaver 1949). To explain its meaning, consider the following question: how much information is contained in a message, or more specifically, what is the necessary number of digits to encode the message unambiguously in the binary system?
For instance, if the entire vocabulary only consisted of the words “I”, “hungry”, “not”, “very”, then the words could be identified with the binary numbers 00 = “I”, 01 = “hungry”, 10 =


Figure 3.1 Ludwig Boltzmann (1844-1906). (Courtesy of Österreichische Post AG.)

“not” and 11 = “very”. Thus, for a vocabulary $V$ of $|V| = N = 2^2$ words, $n = 2$ digits would be sufficient. More generally, suppose that we have a set $V$ with $N = 2^n$ elements. Then we need $n = \log_2 N$ digits for encoding the elements in the binary system. The number $n$ is then called the information of a message from vocabulary $V$. Note that in the special case where $V$ consists of one element only, $n = 0$, i.e. the information content of a message is zero, because we know which element of $V$ will be contained in the message even before receiving it.

An extension of this definition to integers $N$ that are not necessarily powers of 2 can be justified as follows: consider a sequence of $k$ elements from $V$. The number of sequences $v_1, ..., v_k$ ($v_i \in V$) is $N^k$. (Note that one element is allowed to occur more than once.) The number of binary digits needed to express a sequence $v_1, ..., v_k$ is $n_k$, where $2^{n_k - 1} < N^k \le 2^{n_k}$. The average number of digits needed to express an element in this sequence is $n_k/k$, where $k \log_2 N \le n_k < k \log_2 N + 1$. We then have

$$\lim_{k \to \infty} \frac{n_k}{k} = \log_2 N.$$

The following definition is therefore meaningful:

Definition 25 Let $V_N$ be a finite set with $N$ elements. Then the information necessary to characterize the elements of $V_N$ is defined by

$$I(V_N) = \log_2 N \tag{3.1}$$
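The block-coding argument behind Definition 25 can be checked numerically. The sketch below is only an illustration with hypothetical function names; it computes the smallest $n_k$ with $N^k \le 2^{n_k}$ exactly in integer arithmetic and returns the average number of binary digits per element, $n_k/k$.

```python
import math

def information(n_elements):
    """I(V_N) = log2(N) as in Definition 25."""
    return math.log2(n_elements)

def digits_per_element(n_elements, k):
    """Average number of binary digits per element when k elements are encoded jointly.

    n_k is the smallest integer with N**k <= 2**n_k; for a positive integer m,
    (m - 1).bit_length() is exactly the smallest n with m <= 2**n.
    """
    n_k = (n_elements ** k - 1).bit_length()
    return n_k / k

# For N = 3, the average approaches log2(3) as the block length k grows.
averages = [digits_per_element(3, k) for k in (1, 5, 100)]
```

Encoding a single element of a three-element set needs two digits, whereas long blocks need only slightly more than $\log_2 3$ digits per element.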

This definition can also be derived by postulating the following properties a measure of information should have:
1. Additivity: If $|V_K| = NM$, then $I(V_K) = I(V_N) + I(V_M)$
2. Monotonicity: $I(V_N) \le I(V_{N+1})$
3. Definition of unit: $I(V_2) = 1$.

The only function that satisfies these conditions is $I(V_N) = \log_2 N$.

Consider now a more complex situation where $V_N = \cup_{j=1}^k V_j$, $V_j \cap V_l = \emptyset$ ($j \ne l$) and $|V_j| = N_j$ (and hence $N = N_1 + ... + N_k$), and define $p_j = N_j/N$. Suppose that we select an element from $V$ randomly, each element having the same probability of being chosen. If an element $v \in V$ is known to belong to a specific $V_j$, then the additional information needed to identify it within $V_j$ is equal to $I(V_j) = \log_2 N_j$. The expected value of this additional information is therefore

$$I_2 = \sum_{j=1}^k p_j \log_2 N_j = \sum_{j=1}^k p_j \log_2 (N p_j) \tag{3.2}$$

Let $I_1$ be the information needed to identify the set $V_j$ which $v$ belongs to. Then the total information needed for identifying (encoding) elements of $V$ is

$$\log_2 N = I_1 + I_2 \tag{3.3}$$

On the other hand, $\sum_{j=1}^k p_j \log_2 N = \log_2 N$, so that we obtain Shannon’s famous formula

$$I_1 = -\sum_{j=1}^k p_j \log_2 p_j \tag{3.4}$$
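The decomposition (3.3) can be verified for any partition. The following sketch uses hypothetical block sizes $N_j$ (not taken from the text) and checks numerically that $I_1 + I_2 = \log_2 N$.

```python
import math

def shannon_information(probs):
    """I_1 = -sum p_j log2 p_j as in (3.4); terms with p_j = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical partition of a set with N = 8 elements into blocks of sizes N_j.
sizes = [1, 2, 5]
N = sum(sizes)
p = [n_j / N for n_j in sizes]

I1 = shannon_information(p)                                  # which block?
I2 = sum(p_j * math.log2(n_j) for p_j, n_j in zip(p, sizes))  # which element within the block? (3.2)
total = I1 + I2                                              # should equal log2(N) by (3.3)
```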

$I_1$ is also called Shannon information. Shannon information is thus the expected information about the occurrence of the sets $V_1, ..., V_k$ contained in a randomly chosen element from $V$. Note that the term “information” can be used synonymously for “uncertainty”: the information obtained from a random experiment diminishes uncertainty by the same amount. The derivation of Shannon information is credited to Shannon (1948) and, independently, Wiener (1948). In physics, an analogous formula is known as entropy and is a measure of the disorder of a system (see Boltzmann 1896, Figure 3.1).

Shannon’s formula can also be derived by postulating the following properties for a measure of information of the outcome of a random experiment: let $V_1, ..., V_k$ be the possible outcomes of a random experiment and denote by $p_j = P(V_j)$ the corresponding probabilities. Then a measure of information, say $I$, obtained by the outcome of the random experiment should have the following properties:
1. Function of probabilities: $I = I(p_1, ..., p_k)$, i.e. $I$ depends on the probabilities $p_j$ only;
2. Symmetry: $I(p_1, ..., p_k) = I(p_{\pi(1)}, ..., p_{\pi(k)})$ for any permutation $\pi$;
3. Continuity: $I(p, 1 - p)$ is a continuous function of $p$ ($0 \le p \le 1$);
4. Definition of unit: $I(\tfrac{1}{2}, \tfrac{1}{2}) = 1$;
5. Additivity and weighting by probabilities:

$$I(p_1, ..., p_k) = I(p_1 + p_2, p_3, ..., p_k) + (p_1 + p_2)\, I\!\left(\frac{p_1}{p_1 + p_2}, \frac{p_2}{p_1 + p_2}\right) \tag{3.5}$$

The meaning of the first four properties is obvious. The last property can be interpreted as follows: suppose the outcome of an experiment does not distinguish between $V_1$ and $V_2$, i.e. if $v$ turns out to be in one of these two sets, we only know that $v \in V_1 \cup V_2$. Then the information provided by the experiment is $I(p_1 + p_2, p_3, ..., p_k)$. If the experiment did distinguish between $V_1$ and $V_2$, then it is reasonable to assume that the information would be larger by the amount

$$(p_1 + p_2)\, I\!\left(\frac{p_1}{p_1 + p_2}, \frac{p_2}{p_1 + p_2}\right).$$

Equation (3.5) tells us exactly that: the complete information $I(p_1, ..., p_k)$ can be obtained by adding the partial and the additional information. It turns out that the only function for which the postulates hold is Shannon’s information:

Theorem 9 Let $I$ be a functional that assigns each finite discrete distribution function $P$ (defined by probabilities $p_1, ..., p_k$, $k \ge 1$) a real number $I(P)$, such that the properties above hold. Then

$$I(P) = I(p_1, ..., p_k) = -\sum_{j=1}^k p_j \log_2 p_j \tag{3.6}$$

Shannon information has an obvious upper bound that follows from Jensen’s inequality: recall that Jensen’s inequality states that for a convex function $g$ and weights $w_j \ge 0$ with $\sum w_j = 1$ we have

$$g\Big(\sum w_j x_j\Big) \le \sum w_j g(x_j).$$

In particular, for $g(x) = x \log_2 x$ and $w_j = k^{-1}$,

$$k^{-1} \sum p_j \log_2 p_j = k^{-1} \sum g(p_j) \ge g\Big(k^{-1} \sum p_j\Big) = -k^{-1} \log_2 k.$$

Hence,

$$I(P) \le \log_2 k \tag{3.7}$$

This bound is achieved by the uniform distribution pj = 1/k. The other extreme case is pj = 1 for some j. This means that event Vj occurs with certainty and I(p1 , ..., pk ) = I(pj ) = I(1) = I(1, 0) = I(1, 0, 0) etc. Then from the fifth property we have I(1, 0) = I(1) + I(1, 0) so that I(1) = 0. The interpretation is that, if it is clear a priori which event will occur, then a random experiment does not provide any information. The notion of information can be extended in an obvious way to the case where one has an infinite but countable number of possible outcomes.
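The grouping property (3.5), the bound (3.7) and the two extreme cases can be confirmed numerically for Shannon information. The probabilities below are a hypothetical example chosen only for illustration.

```python
import math

def shannon_information(probs):
    """I(p_1, ..., p_k) = -sum p_j log2 p_j; terms with p_j = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

p = [0.5, 0.25, 0.125, 0.125]  # hypothetical distribution with k = 4 outcomes

# Property 5 (equation 3.5): merge the first two outcomes, then add the
# information needed to separate them, weighted by p_1 + p_2.
s = p[0] + p[1]
lhs = shannon_information(p)
rhs = shannon_information([s] + p[2:]) + s * shannon_information([p[0] / s, p[1] / s])

upper_bound = math.log2(len(p))                         # (3.7)
uniform_value = shannon_information([1 / len(p)] * len(p))  # attains the bound
certain_value = shannon_information([1.0, 0.0, 0.0])    # no uncertainty at all
```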


The information contained in the realization of a random variable $X$ with possible outcomes $x_1, x_2, ...$ is defined by $I(X) = -\sum p_j \log_2 p_j$ where $p_j = P(X = x_j)$. More subtle is the extension to continuous distributions and random variables. A nice illumination of the problem is given in Rényi (1970): for a random variable with uniform distribution on (0,1), the digits in the binary expansion of $X$ are infinitely many independent 0-1-random variables where 0 and 1 occur with probability 1/2 each. The information furnished by a realization of $X$ would therefore be infinite. Nevertheless, a meaningful measure of information can be defined as a limit of discrete approximations:

Theorem 10 Let $X$ be a random variable with density function $f$. Define $X_N = [NX]/N$ where $[x]$ denotes the integer part of $x$. If $I(X_1) < \infty$, then the following holds:

$$\lim_{N \to \infty} \frac{I(X_N)}{\log_2 N} = 1 \tag{3.8}$$

$$\lim_{N \to \infty} \big(I(X_N) - \log_2 N\big) = -\int_{-\infty}^{\infty} f(x) \log_2 f(x)\,dx \tag{3.9}$$
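Theorem 10 can be illustrated with a density for which everything is available in closed form. The sketch below assumes the hypothetical example $f(x) = 2x$ on (0,1), which is not taken from the text. For this density $P(X_N = j/N) = (2j+1)/N^2$, and the right-hand side of (3.9) equals $1/(2\ln 2) - 1$ bits, a negative number.

```python
import math

def discretized_information(n_cells):
    """I(X_N) for X with density f(x) = 2x on (0, 1) and X_N = [N X] / N.

    P(X_N = j/N) = ((j + 1)**2 - j**2) / N**2 = (2j + 1) / N**2.
    """
    probs = [(2 * j + 1) / n_cells ** 2 for j in range(n_cells)]
    return -sum(p * math.log2(p) for p in probs)

# Closed form of -integral f log2 f for f(x) = 2x on (0, 1).
limit = 1 / (2 * math.log(2)) - 1

N = 4096
difference = discretized_information(N) - math.log2(N)   # approaches `limit`, cf. (3.9)
ratio = discretized_information(N) / math.log2(N)        # approaches 1 slowly, cf. (3.8)
```

The difference converges quickly, while the ratio in (3.8) approaches one only at the rate $1/\log_2 N$.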

We thus have

Definition 26 Let $X$ be a random variable with density function $f$. Then

$$I(X) = -\int_{-\infty}^{\infty} f(x) \log_2 f(x)\,dx \tag{3.10}$$

is called the information (or entropy) of $X$.

Note that, in contrast to discrete distributions, information can be negative. This is due to the fact that $I(X)$ is in fact the limit of a difference of informations. The notion of entropy can also be carried over to measuring randomness in stationary time series in the sense of correlations. (For the definition of stationarity and time series in general see Chapter 4.)

Definition 27 Let $X_t$ ($t \in \mathbb{Z}$) be a stationary process with $\mathrm{var}(X_t) = 1$, and spectral density $f$. Then the spectral entropy of $X_t$ is defined by

$$I(X_t, t \in \mathbb{Z}) = -\int_{-\pi}^{\pi} f(x) \log_2 f(x)\,dx \tag{3.11}$$
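As a numerical illustration of Definition 27, the sketch below evaluates (3.11) by the midpoint rule. The first-order autoregressive (AR(1)) spectral density is an assumption introduced here for illustration; it is normalized so that the process has unit variance, and the parameter `phi` and the function names are hypothetical.

```python
import math

def ar1_spectral_density(lam, phi):
    """Spectral density of a unit-variance AR(1) process; it integrates to 1 over [-pi, pi]."""
    return (1 - phi ** 2) / (2 * math.pi * (1 - 2 * phi * math.cos(lam) + phi ** 2))

def spectral_entropy(density, n_grid=20000):
    """Midpoint-rule approximation of -integral of f log2 f over [-pi, pi], cf. (3.11)."""
    step = 2 * math.pi / n_grid
    total = 0.0
    for i in range(n_grid):
        f = density(-math.pi + (i + 0.5) * step)
        if f > 0:
            total -= f * math.log2(f) * step
    return total

white_noise = spectral_entropy(lambda lam: ar1_spectral_density(lam, 0.0))
moderate = spectral_entropy(lambda lam: ar1_spectral_density(lam, 0.5))
strong = spectral_entropy(lambda lam: ar1_spectral_density(lam, 0.9))
```

With `phi = 0` the density is constant and the entropy is maximal; as `phi` approaches one, the peak at frequency zero sharpens and the entropy decreases.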

This definition is plausible, because for a process with unit variance, f has the same properties as a probability distribution and can be interpreted as a distribution on frequencies. The process Xt is uncorrelated if and only if f is constant, i.e. if f is the uniform distribution on [−π, π]. Exactly in this case entropy is maximal, and knowledge of past observations does not help to predict future observations. On the other hand, if f has one or more


extreme peaks, then entropy is very low (and in the limit minus infinity). This corresponds to the fact that in this case future observations can be predicted with high accuracy from past values. Thus, future observations do not contain as much new information as in the case of independence.

3.2.2 Measuring metric, melodic, and harmonic importance

General idea

Western classical music is usually structured in at least three aspects: melody, metric structure, and harmony. With respect to representing the essential melodic, metric, and harmonic structures, not all notes are equally important. For a given composition $K$, we may therefore try to find metric, melodic, and harmonic structures and quantify them in a weight function $w : K \to \mathbb{R}^3$ (which we will also call an “indicator”). For each note event $x \in K$, the three components of $w(x) = (w_{melodic}(x), w_{metric}(x), w_{harmonic}(x))$ quantify the “importance” of $x$ with respect to the melodic, metric, and harmonic structure of the composition respectively.

Omnibus metric, melodic, and harmonic indicators

Specific definitions of structural indicators (or weight functions) are discussed for instance in Mazzola et al. (1995), Fleischer et al. (2000), and Beran and Mazzola (2001). To illustrate the general approach, we give a full definition of metric weights. Melodic and harmonic weights are defined in a similar fashion, taking into account the specific nature of melodic and harmonic structures respectively.

Metric structures characterize local periodic patterns in symbolic onset times. This can be formalized as follows: let $K \subset \mathbb{Z}^4$ be a composition (with coordinates “Onset Time”, “Pitch”, “Loudness”, and “Duration”), $T \subset \mathbb{Z}$ its set of onset times (i.e. the projection of $K$ on the first axis) and let $t_{max} = \max\{t : t \in T\}$. Without loss of generality the smallest onset time in $T$ is equal to one.
Definition 28 For each triple $(t, l, p) \in \mathbb{Z} \times \mathbb{N} \times \mathbb{N}$ the set $B(t, l, p) = \{t + kp : 0 \le k \le l\}$ is called a meter with starting point $t$, length $l$ and period $p$. The meter is called admissible, if $B(t, l, p) \subset T$.

The non-negative length $l$ of a local meter $M = B(t, l, p)$ is uniquely determined by the set $M$ and is denoted by $l(M)$. Note that by definition, $t \in B(t, l, p)$ for any $(t, l, p) \in \mathbb{Z} \times \mathbb{N} \times \mathbb{N}$. The importance of events at onset time $s$ is now measured by the number of meters this onset is contained in. For a given triple $(t, l, p)$, three situations can occur:
1. $B = B(t, l, p)$ is admissible and there is no other admissible local meter $B' = B(t', l', p')$ such that $B \subsetneq B'$;
2. $B(t, l, p)$ is not admissible;
3. $B = B(t, l, p)$ is admissible, but there is another admissible local meter $B' = B(t', l', p')$ such that $B \subsetneq B'$.

We count only case 1. This leads to the following definition:

Definition 29 An admissible meter $B(t, l, p)$ for a composition $K \subset \mathbb{Z}^4$ is called a maximal local meter if and only if it is not a proper subset of another admissible local meter $B(t', l', p')$ of $K$. Denote by $\mathcal{M}(K)$ the set of maximal local meters of $K$ and by $\mathcal{M}(K, t)$ the set of maximal local meters of $K$ containing onset $t$.

Note that the set $\mathcal{M}(K)$ is always a covering of $T$. Metric weights can now be defined, for instance, by

Definition 30 Let $x \in K$ be a note event at onset time $t(x) \in T$, $\mathcal{M} = \mathcal{M}(K, t(x))$ the set of maximal local meters of $K$ containing $t(x)$, and $h$ a nondecreasing real function on $\mathbb{Z}$. Specify a minimal length $l_{min}$. Then the metric indicator (or metric weight) of $x$, associated with the minimal length $l_{min}$, is given by

$$w_{metric}(x) = \sum_{M \in \mathcal{M},\ l(M) \ge l_{min}} h(l(M)) \tag{3.12}$$
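Definitions 28 to 30 translate directly into a brute-force computation on the set of onset times. The sketch below is one possible reading, not the implementation used in the cited references. It only considers meters with at least two onsets ($l \ge 1$), and the choice $h(l) = l^2$ as well as the onset set are hypothetical.

```python
def maximal_local_meters(onsets):
    """Return the maximal local meters of T as a list of (frozenset of onsets, length l).

    Candidates are the meters that cannot be extended for their own period;
    a candidate is maximal if it is not a proper subset of another candidate.
    """
    T = set(onsets)
    candidates = []
    for p in range(1, max(T) - min(T) + 1):
        for t in T:
            if t - p in T:
                continue  # could be extended to the left, so it is not maximal
            l = 0
            while t + (l + 1) * p in T:
                l += 1
            if l >= 1:
                candidates.append((frozenset(t + k * p for k in range(l + 1)), l))
    return [(m, l) for m, l in candidates
            if not any(m < other for other, _ in candidates)]

def metric_weight(onset, meters, h=lambda l: l ** 2, l_min=1):
    """w_metric as in (3.12): sum of h(l(M)) over maximal meters M containing the onset."""
    return sum(h(l) for m, l in meters if onset in m and l >= l_min)

T = [1, 2, 3, 4, 5, 7, 9]  # hypothetical onset times
meters = maximal_local_meters(T)
weights = {t: metric_weight(t, meters, l_min=2) for t in T}
```

In this example onset 1 lies in the three maximal meters {1,2,3,4,5}, {1,3,5,7,9} and {1,4,7} and therefore receives the largest weight.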

In a similar fashion, melodic indicators $w_{melodic}$ and harmonic indicators $w_{harmonic}$ can be derived from a melodic and harmonic analysis respectively.

Specific indicators

A possible objection to weight functions as defined above is that only information about pitch and onset time is used. A score, however, usually contains much more symbolic information that helps musicians to read it correctly. For instance, melodic phrases are often connected by a phrasing slur, notes are grouped by beams, separate voices are made visible by suitable orientation of note stems, etc. Ideally, structural indicators should take into account such additional information. An improved indicator that takes into account knowledge about musical “motifs” can be defined for example as follows:

Definition 31 Let $M = \{(\tau_1, y_1), ..., (\tau_k, y_k)\}$, $\tau_1 < \tau_2 < ... < \tau_k$ be a “motif” where $y$ denotes pitch and $\tau$ onset time. Given a composition $K \subset T \times \mathbb{Z} \subset \mathbb{Z}^2$, define for each score-onset time $t_i \in T$ ($i = 1, ..., n$) and $u \in \{1, ..., k\}$, the shifted motif

$$M(t_i, u) = \{(t_i + \tau_1 - \tau_u, y_1), ..., (t_i + \tau_k - \tau_u, y_k)\}$$

and denote by $T_u(t_i) = \{t_i + \tau_1 - \tau_u, ..., t_i + \tau_k - \tau_u\} = \{s_1, ..., s_k\}$ the corresponding onset times. Moreover, let $X_u(t_i) = \{x = (x(s_1), ..., x(s_k)) : (s_j, x(s_j)) \in K\}$ be the set of all pitch-vectors with onset set $T_u(t_i)$. Then we define the distance

$$d_u(t_i) = \min_{x \in X_u(t_i)} \sum_{j=1}^k (x(s_j) - y_j)^2 \tag{3.13}$$

If $X_u(t_i)$ is empty, then $d_u(t_i)$ is not defined or set equal to an arbitrary upper bound $D < \infty$.

In this definition, it is assumed that the motif is identified beforehand by other means (e.g. “by hand” using traditional musical analysis). The distance $d_u(t_i)$ thus measures to what extent there are notes that are similar to those in $M$, if $t_i$ is at the $u$th place of the rhythmic pattern of motif $M$. Note that the squared Euclidean distance $\sum (x(s_j) - y_j)^2$ could be replaced by any other reasonable distance. Analogously, distance or similarity can be measured by correlation:

Definition 32 Using the same definitions as above, let

$$x_o = \arg\min_{x \in X_u(t_i)} \sum_{j=1}^k (x(s_j) - y_j)^2,$$

and define $r_u(t_i)$ to be the sample correlation between $x_o$ and $y = (y_1, ..., y_k)$. If $M(t_i, u) \not\subset K$, then set $r_u(t_i) = 0$.

Disregarding the position within a motif, we can now define overall motivic indicators (or weights), for instance by

$$w_{d,mean}(t_i) = g\Big(\sum_{u=1}^k d_u(t_i)\Big) \tag{3.14}$$

where $g$ is a monotonically decreasing function,

$$w_{d,min}(t_i) = \min_{1 \le u \le k} d_u(t_i) \tag{3.15}$$

or

$$w_{corr}(t_i) = \max_{1 \le u \le k} r_u(t_i) \tag{3.16}$$
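The motivic distance and correlation can be sketched as follows. This is an illustration under several assumptions that are not in the text: the composition is a set of (onset, pitch) pairs, the correlation is set to zero whenever no matching notes exist or one of the vectors is constant, and the upper bound `D`, the motif and the toy composition are hypothetical. Because the sum of squares separates over onsets, the minimizing pitch vector can be chosen onset by onset.

```python
import math
from collections import defaultdict

def best_match(K, motif, t_i, u):
    """Return (d_u(t_i), x_o), or (None, None) if a shifted onset carries no note in K."""
    pitches_at = defaultdict(list)
    for onset, pitch in K:
        pitches_at[onset].append(pitch)
    tau_u = motif[u - 1][0]
    x_o, d = [], 0
    for tau, y in motif:
        s = t_i + tau - tau_u
        if s not in pitches_at:
            return None, None
        x = min(pitches_at[s], key=lambda q: (q - y) ** 2)  # closest pitch at onset s
        x_o.append(x)
        d += (x - y) ** 2
    return d, x_o

def correlation(a, b):
    """Sample correlation; returns 0.0 if either vector is constant (an assumption)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return 0.0 if saa == 0 or sbb == 0 else sab / math.sqrt(saa * sbb)

def motivic_weights(K, motif, t_i, D=1e6):
    """Return (w_d_min, w_corr) as in (3.15) and (3.16)."""
    y = [pitch for _, pitch in motif]
    ds, rs = [], []
    for u in range(1, len(motif) + 1):
        d, x_o = best_match(K, motif, t_i, u)
        ds.append(D if d is None else d)
        rs.append(0.0 if d is None else correlation(x_o, y))
    return min(ds), max(rs)

motif = [(0, 60), (1, 62), (2, 64)]                             # hypothetical three-note motif
K = {(1, 60), (2, 62), (3, 64), (4, 67), (5, 69), (6, 71)}      # hypothetical composition
exact = motivic_weights(K, motif, 1)        # the motif itself starts at onset 1
transposed = motivic_weights(K, motif, 4)   # a transposed copy starts at onset 4
```

The example shows the difference between the two measures: the transposed copy has a positive distance but a correlation of one, since correlation is invariant under transposition.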

Finally, given weights for $p$ different motifs, we may combine these into one overall indicator. For instance, an overall melodic indicator based on correlations can be defined by

$$w_{melod}(t_i) = \sum_{j=1}^p h(w_{corr,j}(t_i), L_j) \tag{3.17}$$

where $w_{corr,j}$ is the weight function for motif number $j$ and $L_j$ is the number of elements in the motif. Including $L_j$ has the purpose of attributing higher weights to the presence of longer motifs. The advantage of the motif-based definition is that one can first search for possible motifs in the score, making full use of the available information in the score as well as musicological and historical knowledge, and then incorporate these in the definition of melodic weights. Similar definitions may be obtained for metric and harmonic indicators.

3.2.3 Measuring dimension

There are many different definitions of dimension, each measuring a specific aspect of “objects”. Best known is the topological dimension. In the usual Euclidean space $\mathbb{R}^k$ with scalar product $\langle x, y \rangle = \sum_{i=1}^k x_i y_i$ and distances $|x - y| = \sqrt{\langle x - y, x - y \rangle}$, the topological dimension of the space is equal to $k$. The dimension of an object in this space is equal to the dimension of the subspace it is contained in. The Euclidean space is, however, rather special since it is metric with a scalar product. More generally, one can define a topological dimension in any topological (not necessarily metric) space in terms of coverings. We start with the definition of a topological space: a topological space is a nonempty set $X$ together with a family $O$ of so-called open subsets of $X$ satisfying the following conditions:
1. $X \in O$ and $\emptyset \in O$ ($\emptyset$ denotes the empty set)
2. If $U_1, U_2 \in O$, then $U_1 \cup U_2 \in O$
3. If $U_1, U_2 \in O$, then $U_1 \cap U_2 \in O$.

A covering of a set $S \subseteq X$ is a collection $\mathcal{U} \subseteq O$ of open sets such that $S \subseteq \cup_{U \in \mathcal{U}} U$. A refinement of a covering $\mathcal{U}$ is a covering $\mathcal{U}^*$ such that for each $U^* \in \mathcal{U}^*$ there exists a $U \in \mathcal{U}$ with $U^* \subseteq U$. The definition of topological dimension is now as follows:

Definition 33 A topological space $X$ has topological dimension $m$, if every covering $\mathcal{U}$ of $X$ has a refinement $\mathcal{U}^*$ in which every point of $X$ occurs in at most $m + 1$ sets of $\mathcal{U}^*$, and $m$ is the smallest such integer.
The topological dimension of a subset $S \subseteq X$ is defined analogously. For instance, a straight line in a Euclidean space can be divided into open intervals such that at most two intervals intersect – so that $d_T = 1$. Similarly, a simple geometric figure in the plane, such as a disk or a rectangle (including the inner area), can be covered with arbitrarily small circles or rectangles such that at most three such sets intersect – this number cannot, however, be made smaller. Thus, the topological dimension of such an object is $d_T = 3 - 1 = 2$.


The topological dimension is a relatively rough measure of dimension, since it can assume integer values only and thus classifies sets (in a topological space) into a finite or countable number of categories. On the other hand, $d_T$ is defined for very general spaces where a metric (i.e. distances) need not exist. A finer definition of dimension, which is however confined to metric spaces, is the Hausdorff-Besicovitch dimension. Suppose we have a set $A$ in a metric space $X$. In a metric space, we can define open balls of radius $r$ around each point $x \in X$ by $U(r) = \{y \in X : d_X(x, y) < r\}$ where $d_X$ is the metric in $X$. The idea is now to measure the size of $A$ by covering it with a finite number of balls $U_r = \{U_1(r), ..., U_k(r)\}$ of radius $r$ and to calculate an approximate measure of $A$ by

$$\mu_{U_r, r, h}(A) = \sum h(r) \tag{3.18}$$

where the sum is taken over all balls and $h$ is some positive function. This measure depends on $r$, the specific covering $U_r$ and $h$. To obtain a measure that is independent of a specific covering, we define the measure $\mu_{r,h}(A) = \inf_{U_\rho:\,\rho < r} \mu_{U_\rho, \rho, h}(A)$ [...] 2, $\mu_h(A) = 0$.

For standard sets, such as circles, rectangles, triangles, cylinders, etc., it is generally true that the intrinsic function for a set $A$ with topological dimension $d_T = d$ is given by (Hausdorff 1919)

$$h(r) = h_d(r) = \frac{\{\Gamma(\tfrac{1}{2})\}^d}{\Gamma(1 + \tfrac{d}{2})}\, r^d. \tag{3.21}$$

Many other more complicated sets, including randomly generated sets, have intrinsic functions of the form $h(r) = L(r) r^d$ for some $d > 0$ which is not always equal to $d_T$, and $L$ a function that is slowly varying at the origin (see e.g. Hausdorff 1919, Besicovitch 1935, Besicovitch and Ursell 1937, Mandelbrot 1977, 1983, Falconer 1985, 1986, Kono 1986, Telcs 1990, Devaney 1990). Here, $L$ is called slowly varying at zero, if for any $u > 0$, $\lim_{r \to 0} [L(ur)/L(r)] = 1$. This leads to the following definition of dimension:

Definition 35 Let $A$ be a subset of a metric space and $h(r) = L(r) \cdot r^d$ an intrinsic function of $A$ where $L(r)$ is slowly varying. Then $d_H = d$ is called the Hausdorff-Besicovitch dimension (or Hausdorff dimension) of $A$.

The definition of Hausdorff dimension leads to the definition of fractals (see e.g. Mandelbrot 1977):

Definition 36 Let $A$ be a subset of a metric space. Suppose that $A$ has topological dimension $d_T$ and Hausdorff dimension $d_H$ such that $d_H > d_T$. Then $A$ is called a fractal.

Figure 3.2 Fractal pictures (by Céline Beran, computer generated.) (Color figures follow page 152.)

Intuitively, $d_H > d_T$ means that the set $A$ is “more complicated” than a standard set with topological dimension $d_T$. An alternative to the Hausdorff dimension is the fractal dimension:

Definition 37 Let $A$ be a compact subset of a metric space. For each $\varepsilon > 0$, denote by $N(\varepsilon)$ the smallest number of balls of radius $r \le \varepsilon$ necessary to cover $A$. If

$$d_F = -\lim_{\varepsilon \to 0} \frac{\log N(\varepsilon)}{\log \varepsilon} \tag{3.22}$$

exists, then $d_F$ is called the fractal dimension of $A$.

It can be shown that $d_F \ge d_T$. Moreover, in $\mathbb{R}^k$ one has $d_F \le k = d_T$. Beautiful examples of fractal curves and surfaces (cf. Figure 3.2) can be found in


Mandelbrot (1977) and other related books. Many phenomena, not only in nature but also in art, appear to be fractal. For instance, fractal shapes can be found in Jackson Pollock’s (1912-1956) abstract drip paintings (Taylor 1999a,b,c, 2000). In music, the idea of fractals was used by some contemporary composers, though mainly as a conceptual inspiration rather than an exact algorithm (e.g. Harri Vuori, György Ligeti; Figure 3.3).

Figure 3.3 György Ligeti (*1923). (Courtesy of Philippe Gontier, Paris.)

The notion of fractals is closely related to self-similarity (see Mandelbrot 1977 and references therein). Self-similar geometric objects have the property that the same shapes are repeated at infinitely many scales. By drawing recursively $m$ smaller copies of the same shape – rescaling them by a factor $s$ – one can construct fractals. For self-similar objects, the fractal dimension can be calculated directly from the scaling factor $s$ and the number $m$ of repetitions of the rescaled objects by

$$d_F = \frac{\log m}{\log s} \tag{3.23}$$
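Formula (3.23) can be compared with a direct count of covering sets. The sketch below uses the middle-third Cantor set ($m = 2$, $s = 3$) as a standard example that is not discussed in the text, and it counts intervals of a ternary grid instead of balls, which is a common simplification of Definition 37. Points are stored as integer numerators over $3^{depth}$ so that the count is exact.

```python
import math
from itertools import product

def similarity_dimension(m, s):
    """d_F = log m / log s as in (3.23)."""
    return math.log(m) / math.log(s)

def cantor_points(depth):
    """Left endpoints of the depth-th construction stage, as numerators over 3**depth."""
    return [sum(d * 3 ** (depth - 1 - i) for i, d in enumerate(digits))
            for digits in product((0, 2), repeat=depth)]

def box_count(points, depth, n):
    """Number of grid intervals of length 3**(-n) that contain at least one point."""
    return len({x // 3 ** (depth - n) for x in points})

depth = 8
points = cantor_points(depth)
n = 5
estimate = math.log(box_count(points, depth, n)) / (n * math.log(3))  # -log N(eps) / log eps
cantor = similarity_dimension(2, 3)
koch = similarity_dimension(4, 3)  # Koch curve: 4 copies, each rescaled by the factor 3
```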

For many purposes more realistic are random fractals where instead of the shape itself, the distribution remains the same after rescaling. More specifically, we have

Definition 38 Let $X_t$ ($t \in \mathbb{R}$) be a stochastic process. The process is called self-similar with self-similarity parameter $H$, if for any $c > 0$

$$X_t =_d c^{-H} X_{ct}$$

where $=_d$ means equality of the two processes in distribution.

The parameter $H$ is also called Hurst exponent. Self-similar processes are (like their deterministic counterparts) very special models. However, they play a central role for stochastic processes just like the normal distribution for random variables. The reason is that, under very general conditions, the limit of partial sum processes (see Lamperti 1962, 1972) is always a self-similar process:


Theorem 11 Suppose that $Z_t$ ($t \in \mathbb{R}_+$) is a stochastic process such that $Z_1 \ne 0$ with positive probability and $Z_t$ is the limit in distribution of the sequence of normalized partial sums

$$a_n^{-1} S_{nt} = a_n^{-1} \sum_{s=1}^{[nt]} X_s \quad (n = 1, 2, ...) \tag{3.24}$$

where $X_1, X_2, ...$ is a stationary discrete time process with zero mean and $a_1, a_2, ...$ a sequence of positive normalizing constants such that $\log a_n \to \infty$. Then there exists an $H > 0$ such that for any $u > 0$, $\lim_{n \to \infty} (a_{nu}/a_n) = u^H$, $Z_t$ is self-similar with self-similarity parameter $H$, and $Z_t$ has stationary increments.

The self-similarity parameter therefore also makes sense for processes that are not exactly self-similar themselves, since it is defined by the rate $n^{-H}$ needed to standardize partial sums. Moreover, $H$ is related to the fractal dimension; the exact relationship between $H$ and the fractal dimension, however, depends on some other properties of the process as well. For instance, sample paths of (univariate) Gaussian self-similar processes, so-called fractional Brownian motion (see Chapter 4), have, with probability one, a fractal dimension of $2 - H$ with possible values of $H$ in the interval (0, 1). Thus, the closer $H$ is to 1, the more a sample path is similar to a simple geometric line with dimension one. On the other hand, as $H$ approaches zero, a typical sample path fills up most of the plane so that the dimension approaches two. Practically, $H$ can be determined from an observed series $X_1, ..., X_n$, for example by maximum likelihood estimation. For a thorough discussion of self-similar and related processes and statistical methods see e.g. Beran (1994). Further references on fractals apart from those given above are, for instance, Edgar (1990), Falconer (1990), Peitgen and Saupe (1988), Stoyan and Stoyan (1994), and Tricot (1995).

A cautionary remark should be made at this point: in view of Theorem 11, the fact that we do find self-similarity in aggregated time series is hardly surprising and can therefore not be interpreted as something very special that would distinguish the particular series from other data.
What may be special at most is which particular value of H is obtained and which particular self-similar process the normalized aggregated series converges to. 3.3 Specific applications in music 3.3.1 Entropy of melodic shapes Let x(ti ) be the upper and y(ti ) the lower envelope of a composition at score-onset times ti (i = 1, ..., n). To investigate the shape of the melodic


movement we consider the first and second discrete “derivatives”

$$x^{(1)}(t_i) = \frac{\Delta x(t_i)}{\Delta t_i} = \frac{x(t_{i+1}) - x(t_i)}{t_{i+1} - t_i} \qquad (3.25)$$

and

$$x^{(2)}(t_i) = \frac{\Delta^2 x(t_i)}{\Delta^2 t_i} = \frac{[x(t_{i+2}) - x(t_{i+1})] - [x(t_{i+1}) - x(t_i)]}{[t_{i+2} - t_{i+1}] - [t_{i+1} - t_i]} \qquad (3.26)$$

Alternatively, if octaves “do not count”, we define

$$x^{(1;12)}(t_i) = \frac{[x(t_{i+1}) - x(t_i)]_{12}}{t_{i+1} - t_i} \qquad (3.27)$$

and

$$x^{(2;12)}(t_i) = \frac{[x(t_{i+2}) - x(t_{i+1})]_{12} - [x(t_{i+1}) - x(t_i)]_{12}}{[t_{i+2} - t_{i+1}] - [t_{i+1} - t_i]} \qquad (3.28)$$

where $[x]_k = x \bmod k$. Thus, in this definition intervals between successive notes $x(t_i), x(t_{i+1})$ and $x(t_j), x(t_{j+1})$ respectively are considered identical if they differ by octaves only. The number of possible values of $x^{(2)}$ and $x^{(2;12)}$ is finite, though potentially very large. As a first approximation we may therefore consider both variables to be continuous. In the following, the distribution of $x^{(2)}$ and $x^{(2;12)}$ is approximated by a continuous kernel density estimate $\hat{f}$ (see Chapter 2). For illustration, we define the following measures of entropy:

1. $$E_1 = -\int \hat{f}(x) \log_2 \hat{f}(x)\,dx \qquad (3.29)$$
   where $\hat{f}$ is obtained from the observed data $x^{(2;12)}(t_1), \ldots, x^{(2;12)}(t_n)$ by kernel estimation.
2. $E_2$: Same as $E_1$, but using $x^{(2)}(t_1), \ldots, x^{(2)}(t_n)$ instead.
3. $$E_3 = -\int\!\!\int \hat{f}(x, y) \log_2 \hat{f}(x, y)\,dx\,dy \qquad (3.30)$$
   where $\hat{f}(x, y)$ is a kernel estimate based on observations $(a_i, b_i)$ with $a_i = x^{(2)}(t_{i-1})$ and $b_i = x^{(2)}(t_i)$. Thus, $E_3$ is the (empirical) entropy of the joint distribution of two successive values of $x^{(2)}$.
4. $E_4$: Same as Entropy 3, but using $(x^{(2;12)}(t_{i-1}), x^{(2;12)}(t_i))$ instead.
5. $E_5$: Same as Entropy 3, but using $(x(t_i) - y(t_i))^{(1)}$ instead.
6. $E_6$: Same as Entropy 3, but using $(x(t_i) - y(t_i))^{(1;12)}$ instead.
7. $E_7$: Same as Entropy 1, but using $(x(t_i) - y(t_i))^{(1)}$ instead.
8. $E_8$: Same as Entropy 1, but using $(x(t_i) - y(t_i))^{(1;12)}$ instead.
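As an illustration of how an entropy of the type (3.29) might be computed, the following minimal sketch estimates a density with a Gaussian kernel and approximates the integral by a Riemann sum. The kernel, the rule-of-thumb bandwidth, the grid size, and the function names are assumptions made for this example only; the text does not specify which kernel or bandwidth was used.

```python
import numpy as np

def gaussian_kde_on_grid(data, grid, bandwidth):
    """Gaussian kernel density estimate evaluated at each grid point."""
    data = np.asarray(data, dtype=float)
    u = (grid[:, None] - data[None, :]) / bandwidth
    kernel = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    return kernel.sum(axis=1) / (data.size * bandwidth)

def kernel_entropy(data, bandwidth=None, grid_size=512):
    """Approximate -int f(x) log2 f(x) dx for a kernel estimate f of the data."""
    data = np.asarray(data, dtype=float)
    if bandwidth is None:
        # Rule-of-thumb bandwidth (an assumption, not taken from the text).
        bandwidth = 1.06 * data.std(ddof=1) * data.size ** (-0.2)
    grid = np.linspace(data.min() - 4 * bandwidth,
                       data.max() + 4 * bandwidth, grid_size)
    f = gaussian_kde_on_grid(data, grid, bandwidth)
    dx = grid[1] - grid[0]
    safe = np.where(f > 0, f, 1.0)      # avoids log2(0); such terms contribute 0
    return -np.sum(f * np.log2(safe)) * dx
```

For example, `kernel_entropy(values)` could be applied to an array `values` holding the observed $x^{(2;12)}(t_i)$. Rescaling the data by a factor $c$ shifts the result by $\log_2 c$, so entropies are only comparable between pieces measured in the same units.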


Figure 3.4 Comparison of entropies 1, 2, 3, and 4 for J.S. Bach’s Cello Suite No. I and R. Schumann’s op. 15, No. 2, 3, 4, and 7, and op. 68, No. 2 and 16.


Each of these entropies characterizes information content (or randomness) of certain aspects of melodic patterns in the upper and lower envelope. Figures 3.4a through d show boxplots of Entropies 1 through 4 for Bach and Schumann (Figure 3.8). The pieces considered here are: J.S. Bach Cello Suite No. I (each of the six movements separately), Präludium und Fuge No. 1 and 8 from “Das Wohltemperierte Klavier” I (each piece separately); R. Schumann – op. 15, No. 2, 3, 4 and 7, and op. 68, No. 2 and 16. Obviously there is a difference between Bach and Schumann in all four entropy measures. In Bach’s pieces, entropy is higher, indicating a more uniform mixture of local melodic shapes.

3.3.2 Spectral entropy of local interval variability

Consider the local variability of intervals $y_i = x(t_{i+1}) - x(t_i)$ between successive notes. Specifically, we consider a moving “nearest neighbor” window $[t_i, t_{i+4}]$ ($i = 1, \ldots, n - 4$) and define local variances

$$v_i = \frac{1}{4 - 1} \sum_{j=0}^{3} (y_{i+j} - \bar{y}_i)^2 \qquad (3.31)$$

where $\bar{y}_i = 4^{-1} \sum_{j=0}^{3} y_{i+j}$. Based on this, a SEMIFAR-model is fitted to the time series $z_i = \log(v_i + \tfrac{1}{2})$ (see Chapter 4 for the definition of SEMIFAR models). The fitted spectral density $f(\lambda; \hat{\theta})$ is then used to define the spectral entropy

$$E_9 = -\int_{-\pi}^{\pi} f(\lambda; \hat{\theta}) \log f(\lambda; \hat{\theta})\,d\lambda \qquad (3.32)$$

If octaves do not count, then intervals are circular so that an estimate of variability for circular data should be used. Here, we use $R^* = 2(1 - R)$ as defined in Chapter 7. To transform the range $[0, 2]$ of $R^*$ to the real line, the logistic transformation is applied, defining

$$z_i = \log\left(\frac{R^* + \varepsilon}{2 + \varepsilon - R^*}\right)$$

where $\varepsilon$ is a small positive number that is needed in order that $-\infty < z_i < \infty$ even if $R^* = 0$ or 2 respectively. Fitting a SEMIFAR-model to $z_i$ we then define $E_{10}$ the same way as $E_9$ above. Figure 3.6 shows a comparison of $E_9$ and $E_{10}$ for the same compositions as in 3.3.1. In contrast to the previous measures of entropy, Bach is consistently lower than Schumann. With respect to $E_{10}$ this is also the case in comparison with Scriabin (Figure 3.5) and Martin. Thus, for Bach there appears to be a high degree of nonrandomness (i.e. organization) in the way variability of interval steps changes sequentially.
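The preprocessing steps of this section can be sketched as follows: local variances over windows of four successive intervals as in (3.31), a log transform with an additive offset, and the logistic transform of $R^*$. The function names and the explicit `offset` and `eps` arguments are assumptions of this sketch; the SEMIFAR fit itself is not shown.

```python
import numpy as np

def local_interval_variances(pitches, window=4):
    """Sample variances of `window` successive intervals, cf. (3.31)."""
    y = np.diff(np.asarray(pitches, dtype=float))   # intervals y_i
    n_windows = y.size - window + 1
    if n_windows < 1:
        raise ValueError("need at least window + 1 notes")
    return np.array([y[i:i + window].var(ddof=1) for i in range(n_windows)])

def log_transform(v, offset):
    """z_i = log(v_i + offset); the offset keeps z_i finite when v_i = 0."""
    return np.log(np.asarray(v, dtype=float) + offset)

def logistic_transform(r_star, eps):
    """Map R* in [0, 2] to the real line; eps > 0 keeps the endpoints finite."""
    r = np.asarray(r_star, dtype=float)
    return np.log((r + eps) / (2.0 + eps - r))

# Illustrative usage with made-up pitches (in semitones) and offsets.
v = local_interval_variances([60, 62, 64, 65, 67, 72, 71])
z = log_transform(v, offset=0.5)
```

With $n$ notes there are $n - 1$ intervals and hence $n - 4$ windows, which matches the index range $i = 1, \ldots, n - 4$ given above.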


Figure 3.5 Alexander Scriabin (1871-1915) (at the piano) and the conductor Serge Koussevitzky. (Painting by Robert Sterl, 1910; courtesy of Gemäldegalerie Neuer Meister, Dresden, and Robert-Sterl-House.)

Figure 3.6 Comparison of entropies 9 and 10 for Bach, Schumann, and Scriabin/Martin.


3.3.3 Omnibus metric, melodic, and harmonic indicators for compositions by Bach, Schumann, and Webern

Figures 3.7, and 3.9 through 3.11 show the “omnibus” metric, melodic, and harmonic weight functions for Bach’s Canon cancricans, Schumann’s op. 15/2 and 7, and for Webern’s Variations op. 27. For Bach’s composition, the almost perfect symmetry around the middle of the composition can be seen. Moreover, the metric curve exhibits a very regular up and down. Schumann’s curves, in particular the melodic one, show clear periodicities. This appears to be quite typical for Schumann and becomes even clearer when plotting a kernel-smoothed version of the curves (here a bandwidth of 8/8 was used). Interestingly, this type of pattern can also be observed for Webern. In view of the historic development of 12-tone music as a logical continuation of harmonic freedom and romantic gesture achieved in the 19th and early 20th centuries, this similarity is not completely unexpected. Finally, note that a relationship between metric,

Figure 3.7 Metric, melodic, and harmonic global indicators for Bach’s Canon cancricans.

melodic and harmonic structure can not be seen directly from the “raw” curves. However, smoothed weights as shown in the figures above reveal clear connections between the three weight functions. This is even the case for Webern, in spite of the absence of tonality.


Figure 3.8 Robert Schumann (1810-1856). (Courtesy of Zentralbibliothek Zürich.)

3.3.4 Specific melodic indicators for Schumann’s Träumerei

Schumann’s Träumerei is rich in local motifs. Here, we consider eight of these as indicated in Figure 3.12. Figure 3.13 displays the individual indicator functions obtained from (3.16). The overall indicator function $m(t) = w_{melod}(t)$ displayed in Figure 3.15 is defined by (3.17) with $h(w, L) = [2 \cdot \max(w, 0.5)]^L$ and $L_j$ = number of notes in motif $j$. The contributions $h(w_{corr,j}(t_i), L_j)$ of $w_{corr,j}$ ($j = 1, \ldots, 8$) are given in Figure 3.14.


Figure 3.9 Metric, melodic, and harmonic global indicators for Schumann’s op. 15, No. 2 (upper figure), together with smoothed versions (lower figure).


Figure 3.10 Metric, melodic, and harmonic global indicators for Schumann’s op. 15, No. 7 (upper figure), together with smoothed versions (lower figure).


Figure 3.11 Metric, melodic, and harmonic global indicators for Webern’s Variations op. 27, No. 2 (upper figure), together with smoothed versions (lower figure).


Figure 3.12 R. Schumann – Träumerei: motifs used for specific melodic indicators.


Figure 3.13 R. Schumann – Träumerei: indicators of individual motifs.

Figure 3.14 R. Schumann – Träumerei: contributions of individual motifs to overall melodic indicator.


Figure 3.15 R. Schumann – Träumerei: overall melodic indicator (vertical axis: w; horizontal axis: onset time).


CHAPTER 4

Time series analysis

4.1 Musical motivation

Musical events are ordered according to a specific temporal sequence. Time series analysis deals with observations that are indexed by an ordered variable (usually time). It is therefore not surprising that time series analysis is important for analyzing musical data. Traditional applications are concerned with “raw physical data” in the form of audio signals (e.g. digital CD-recording, sound analysis, frequency recognition, synthetic sounds, modeling musical instruments). In the last few years, time series models have been developed for modeling symbolic musical data and analyzing “higher level” structures in musical performance and composition. A few examples are discussed in this chapter.

4.2 Basic principles

4.2.1 Deterministic and random components, basic definitions

Time series analysis in its most sophisticated form is a complex subject that cannot be summarized in one short chapter. Here, we briefly mention some of the main ingredients only. For a thorough systematic account of the topic we refer the reader to standard text books such as Priestley (1981a,b), Brillinger (1981), Brockwell and Davis (1991), Diggle (1990), Beran (1994), Shumway and Stoffer (2000).

A time series is a family of (usually, but not necessarily) real variables $X_t$ with an ordered index $t$. For simplicity, we assume that observations are taken at equidistant discrete time points $t \in \mathbb{Z}$ (or $\mathbb{N}$). Usually, observations are random with certain deterministic components. For instance, we may have an additive decomposition $X_t = \mu(t) + U_t$ where $U_t$ is such that $E(U_t) = 0$ and $\mu(t)$ is a deterministic function of $t$. One of the main aims of time series analysis is to identify the probability model that generated an observed time series $x_1, \ldots, x_n$. In the additive model this would mean estimating the mean function $\mu(t)$ and the probability distribution of the random sequence $U_1, U_2, \ldots$. Note that a random sequence can also be understood as a function mapping positive integers $t$ to the real numbers $U_t$. The main difficulties in identifying the correct distribution are:


1. The probability law has to be defined on an infinite dimensional space of vectors $(X_1, X_2, \ldots)$. This difficulty is even more serious for continuous time series where a sample path is a function on $\mathbb{R}$;
2. The finite sample vector $X(n) = (X_1, \ldots, X_n)^t$ has an arbitrary $n$-dimensional distribution so that it cannot be estimated from observed values $x_1, \ldots, x_n$ consistently, unless some minimal assumptions are made.

Difficulty 1 can be solved by applying appropriate mathematical techniques and is described in detail in standard books on stochastic processes and time series analysis (see e.g. Billingsley 1986 and the references above). Difficulty 2 cannot be solved by mathematical arguments only. It is of course possible to give necessary or sufficient conditions such that the probability distribution can be estimated with arbitrary accuracy (measured in an appropriate sense) as $n$ tends to infinity. However, which concrete assumptions should be used depends on the specific application. Assumptions should neither be too general (otherwise population quantities cannot be estimated) nor too restrictive (otherwise results are unrealistic). A standard, and almost necessary, assumption is that $X_t$ can be reduced to a stationary process $U_t$ by applying a suitable transformation. For instance, we may have a deterministic “trend” $\mu(i)$ plus stationary “noise” $U_i$,

$$X_i = \mu(i) + U_i, \qquad (4.1)$$

or an integrated process of order $m$ for which the $m$th difference is stationary, i.e.

$$(1 - B)^m X_i = U_i \qquad (4.2)$$

where $(1 - B)X_i = X_i - X_{i-1}$. In the latter case, $X_t$ is called $m$-difference stationary. Stationarity is defined as follows:

Definition 39 A time series $X_i$ is called strictly stationary, if for any $k, i_1, \ldots, i_n \in \mathbb{N}$,

$$P(X_{i_1} \le x_1, \ldots, X_{i_n} \le x_n) = P(X_{i_1+k} \le x_1, \ldots, X_{i_n+k} \le x_n) \qquad (4.3)$$

The time series is called weakly (or second order) stationary, if

$$\mu(i) = E(X_i) = \mu = \text{const} \qquad (4.4)$$

and for any $i, j \in \mathbb{N}$, the autocovariance depends on the lag $k = |i - j|$ only, i.e.

$$\mathrm{cov}(X_i, X_{i+k}) = \gamma(k) = \gamma(-k) \qquad (4.5)$$

A second order stationary process can be decomposed into uncorrelated random components that correspond to periodic signals, via the so-called spectral representation

$$X_t = \mu + \int_{-\pi}^{\pi} e^{it\lambda}\,dZ_X(\lambda). \qquad (4.6)$$

Here $Z_X(\lambda) = Z_{X,1}(\lambda) + iZ_{X,2}(\lambda) \in \mathbb{C}$ is a so-called orthogonal increment process (in $\lambda$) with the following properties: $Z_X(0) = 0$, $E[Z_X(\lambda)] = 0$ and for $\lambda_1 > \lambda_2 \ge \nu_1 > \nu_2$,

$$E[\overline{\Delta Z_X(\lambda_2, \lambda_1)}\,\Delta Z_X(\nu_2, \nu_1)] = 0 \qquad (4.7)$$

where $\Delta Z_X(u, v) = Z_X(u) - Z_X(v)$. The integral in (4.6) is defined as a limit in mean square. It can be constructed by approximating the function $e^{it\lambda}$ by step functions $g_n(\lambda) = \sum_i \alpha_{i,n} 1\{a_{i,n} < \lambda \le b_{i,n}\}$ ($n \in \mathbb{N}$). For step functions we have the integrals

$$I_n = \int_{-\pi}^{\pi} g_n(\lambda)\,dZ_X(\lambda) = \sum_i \alpha_{i,n}[Z(b_{i,n}) - Z(a_{i,n})].$$

As $g_n \to e^{it\lambda}$, the integrals $I_n$ converge to a random variable $I$, in the sense that

$$\lim_{n\to\infty} E[(I - I_n)^2] = 0.$$

The random variable $I$ is then denoted by $\int \exp(it\lambda)\,dZ(\lambda)$.

The spectral representation is especially useful when one needs to identify (random) periodicities. For this purpose one defines the spectral distribution function

$$F_X(\lambda) = E[|Z_X(\lambda) - Z_X(0)|^2] = E[|Z_X(\lambda)|^2] \qquad (4.8)$$

The variance is then decomposed into frequency contributions by

$$\mathrm{var}(X_t) = \int_{-\pi}^{\pi} E[|dZ_X(\lambda)|^2] = \int_{-\pi}^{\pi} dF_X(\lambda) \qquad (4.9)$$

This means that the expected contribution (expected squared amplitude) of components with frequencies in the interval $(\lambda, \lambda + \varepsilon]$ to the variance of $X_t$ is equal to $F(\lambda + \varepsilon) - F(\lambda)$. Two interesting special cases can be distinguished:

Case 1 – F differentiable: In this case,

$$F(\lambda + \varepsilon) - F(\lambda) = \frac{d}{d\lambda}F(\lambda)\,\varepsilon + o(\varepsilon) = f(\lambda)\,\varepsilon + o(\varepsilon).$$

The function $f$ is called spectral density and can also be defined directly by

$$f(\lambda) = \frac{1}{2\pi} \sum_{k=-\infty}^{\infty} \gamma_X(k) e^{ik\lambda} \qquad (4.10)$$

where $\gamma_X(k) = \mathrm{cov}(X_t, X_{t+k})$. The inverse relationship is

$$\gamma_X(k) = \int_{-\pi}^{\pi} e^{ik\lambda} f(\lambda)\,d\lambda \qquad (4.11)$$
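The pair (4.10) and (4.11) can be checked numerically on a simple example. The sketch below assumes an MA(1)-type autocovariance sequence with $\gamma(0) = \sigma^2(1 + \psi^2)$ and $\gamma(\pm 1) = \sigma^2\psi$ (illustrative values), computes $f$ from the autocovariances, and recovers them again by integrating over $[-\pi, \pi]$. The function names are hypothetical.

```python
import numpy as np

def spectral_density(gamma, lam):
    """f(lam) = (1/2pi) sum_k gamma(k) e^{ik lam}, given gamma(0), ..., gamma(K).

    Because gamma(k) = gamma(-k), the complex exponentials combine into cosines.
    """
    gamma = np.asarray(gamma, dtype=float)
    lam = np.asarray(lam, dtype=float)
    k = np.arange(1, gamma.size)
    return (gamma[0] + 2.0 * np.cos(np.outer(lam, k)) @ gamma[1:]) / (2.0 * np.pi)

def autocovariance_from_density(f_values, lam, k):
    """gamma(k) = int e^{ik lam} f(lam) dlam, as a Riemann sum on a uniform grid."""
    dlam = lam[1] - lam[0]
    return np.real(np.sum(np.exp(1j * k * lam) * f_values) * dlam)

psi, sigma2 = 0.5, 1.0                                   # illustrative values
gamma = [sigma2 * (1 + psi**2), sigma2 * psi]
lam = np.linspace(-np.pi, np.pi, 2000, endpoint=False)   # one full period
f = spectral_density(gamma, lam)
recovered = [autocovariance_from_density(f, lam, k) for k in range(3)]
```

Here `recovered` reproduces $\gamma(0)$, $\gamma(1)$ and $\gamma(2) = 0$ up to rounding error, since a Riemann sum over a full period is exact for trigonometric polynomials of low degree.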

A high peak of $f$ at a frequency $\lambda_o$ means that the component(s) at (or in the neighborhood of) $\lambda_o$ contribute largely to the variability of $X_t$. Note that the period of $\exp(it\lambda)$, as a function of $t$, is $T = 2\pi/\lambda$ (sometimes one therefore defines $\tilde{\lambda} = \lambda/(2\pi)$ as frequency in order that the period $T$ is directly the inverse of the frequency). Thus, a peak of $f$ at $\lambda_o$ implies that a sample path of $X_t$ is likely to exhibit a strong periodic component with frequency $\lambda_o$. Periodicity is, however, random: the observed series is not a periodic function. The meaning of random periodicity can be explained best in the simplest case where $T$ is an integer: if $f$ has a peak at frequency $\lambda_o = 2\pi/T$, then the correlation between $X_t$ and $X_{t+jT}$ ($j \in \mathbb{Z}$) is relatively high compared to other correlations with similar lags. A further complication that blurs periodicity is that, if $f$ is continuous around a peak at $\lambda_o$, then the observed signal is a weighted sum of infinitely (in fact uncountably) many, relatively large components with frequencies that are similar to $\lambda_o$. The sharper the peak, the less this “blurring” takes place and a distinct periodicity (though still random) can be seen. In the other extreme case where $f$ is constant, there is no preference for any frequency, and $\gamma_X(k) = 0$ ($k \ne 0$), i.e. observations are uncorrelated.

Case 2 – F is a step function with a finite or countable number of jumps: this corresponds to processes of the form

$$X_t = \sum_{j=1}^{k} A_j e^{i\lambda_j t}$$

for some $k \le \infty$, and $\lambda_j \in [0, \pi]$, $A_j \in \mathbb{C}$. We then have

$$F(\lambda) = \sum_{j:\,\lambda_j \le \lambda} E[|A_j|^2], \qquad (4.12)$$

$$\mathrm{var}(X_t) = \sum_{j=1}^{k} E[|A_j|^2] \qquad (4.13)$$

This means that the variance is a sum of contributions that are due to the frequencies $\lambda_j$ ($1 \le j \le k$). A sample path of $X_t$ cannot be distinguished from a deterministic periodic function, because the randomly selected amplitudes $A_j$ are then fixed.

Finally, it should be noted that not all frequencies are observable when observations are taken at discrete time points $t = 1, 2, \ldots, n$. The smallest identifiable period is 2, which corresponds to a highest observable frequency of $2\pi/2 = \pi$. The largest identifiable period is $n/2$, which corresponds to the smallest frequency $4\pi/n$. As $n$ increases, the lowest frequency tends to zero, whereas the highest does not. In other words, the highest frequency resolution does not improve with increasing sample size.

To obtain more general models, one may wish to relax the condition of stationarity. An asymptotic concept of local stationarity is defined in Dahlhaus (1996a,b, 1997): a sequence of stochastic processes $X_{t,n}$


($n \in \mathbb{N}$) is called locally stationary, if we have a spectral representation

$$X_{t,n} = \mu\left(\frac{t}{n}\right) + \int_{-\pi}^{\pi} e^{it\lambda} A_{t,n}(\lambda)\,dZ_X(\lambda), \qquad (4.14)$$

with “=” meaning almost sure (a.s.) equality, $\mu(u)$ continuous, and there exists a $2\pi$-periodic function $A : [0, 1] \times \mathbb{R} \to \mathbb{C}$ such that $A(u, -\lambda) = \overline{A(u, \lambda)}$, $A(u, \lambda)$ is continuous in $u$, and

$$\sup_{t,\lambda} \left| A\left(\frac{t}{n}, \lambda\right) - A_{t,n}(\lambda) \right| \le c\,n^{-1} \qquad (4.15)$$

(a.s.) for some constant $c < \infty$. Intuitively, this means that for $n$ large enough, the observed process can be approximated locally in a small time window $t \pm \varepsilon$ by the stationary process $\int \exp(it\lambda) A(\frac{t}{n}, \lambda)\,dZ_X(\lambda)$. The order $n^{-1}$ of the approximation is chosen such that most standard estimation procedures, such as maximum likelihood estimation, can be applied locally and their usual properties (e.g. consistency, asymptotic normality) still hold. Under smoothness conditions on $A$ one can prove that a meaningful “evolving” spectral density $f_X(u, \lambda)$ ($u \in (0, 1)$) exists such that

$$f_X(u, \lambda) = \lim_{n\to\infty} \frac{1}{2\pi} \sum_{k=-\infty}^{\infty} \mathrm{cov}(X_{[u \cdot n - k/2],n}, X_{[u \cdot n + k/2],n})\,e^{-ik\lambda} \qquad (4.16)$$

The function $f_X(u, \lambda)$ is called evolutionary spectral density. Note that, for fixed $u$,

$$\lim_{n\to\infty} \mathrm{cov}(X_{[u \cdot n - k/2],n}, X_{[u \cdot n + k/2],n}) = \gamma_X(k) = \int_{-\pi}^{\pi} \exp(ik\lambda) f_X(u, \lambda)\,d\lambda.$$

Thumfart (1995) carries this concept over to series with discrete spectra. A simplified definition can be given as follows: a sequence of stochastic processes $X_{t,n}$ ($n \in \mathbb{N}$) is said to have a discrete evolutionary spectrum $F_X(u, \lambda)$, if

$$X_{t,n} = \mu\left(\frac{t}{n}\right) + \sum_{j \in M} A_j\left(\frac{t}{n}\right) e^{i\lambda_j(\frac{t}{n})t} \qquad (4.17)$$

where $M \subseteq \mathbb{Z}$, and $\mu_j(u)$ is twice continuously differentiable. The discrete evolutionary spectrum can be defined in analogy to the continuous case. For other definitions of nonstationary processes see e.g. Priestley (1965, 1981), Ghosh et al. (1997) and Ghosh and Draghicescu (2002a,b).

4.2.2 Sampling of continuous-time time series

Often time series observed at discrete time points $t = j \cdot \Delta\tau$ ($j = 1, 2, 3, \ldots$) actually “happen” in continuous time $\tau \in \mathbb{R}$. Sampling in discrete time


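leads to information loss in the following way: let $Y_\tau$ be a second order stationary time series with $\tau \in \mathbb{R}$. (Stationarity in continuous time is defined in exact analogy to definition 39.) Then, $Y_\tau$ has a spectral representation

$$Y_\tau = \int_{-\infty}^{\infty} e^{i\tau\lambda}\,dZ_Y(\lambda), \qquad (4.18)$$

a spectral distribution function

$$F_Y(\lambda) = \int_{-\infty}^{\lambda} E[|dZ(\lambda)|^2] \qquad (4.19)$$

and, if $F'$ exists, a spectral density function

$$f_Y(\lambda) = F'(\lambda) = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-i\tau\lambda} \gamma_Y(\tau)\,d\tau \qquad (4.20)$$

We also have

$$\gamma_Y(\tau) = \mathrm{cov}(Y_t, Y_{t+\tau}) = \int_{-\infty}^{\infty} e^{i\lambda\tau} f_Y(\lambda)\,d\lambda.$$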

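The reason why the frequency range extends to $(-\infty, \infty)$, instead of $[-\pi, \pi]$, is that in continuous time, by definition, arbitrarily high frequencies are observable. Suppose now that $Y_\tau$ is observed at discrete time points $t = j \cdot \Delta\tau$, i.e. we observe

$$X_t = Y_{j \cdot \Delta\tau} \qquad (4.21)$$

Then we can write

$$X_t = \int_{-\infty}^{\infty} e^{ij(\Delta\tau\lambda)}\,dZ_Y(\lambda) = \sum_{u=-\infty}^{\infty} \int_{-\pi/\Delta\tau + (2\pi/\Delta\tau)u}^{\pi/\Delta\tau + (2\pi/\Delta\tau)u} e^{ij(\Delta\tau\lambda)}\,dZ_Y(\lambda) \qquad (4.22)$$

$$= \sum_{u=-\infty}^{\infty} \int_{-\pi/\Delta\tau}^{\pi/\Delta\tau} e^{ij(\Delta\tau\lambda)}\,dZ_Y(\lambda + (2\pi/\Delta\tau)u) = \int_{-\pi/\Delta\tau}^{\pi/\Delta\tau} e^{it\lambda}\,dZ_X(\lambda) \qquad (4.23)$$

where

$$dZ_X(\lambda) = \sum_{u=-\infty}^{\infty} dZ_Y(\lambda + (2\pi/\Delta\tau)u) \qquad (4.24)$$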
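The folding of frequencies expressed by (4.22) through (4.24) can be made concrete with a small numerical check: two cosines whose frequencies differ by a multiple of the sampling rate yield identical samples. The sampling rate of 44100 Hz, the test frequencies, and the function names are illustrative assumptions of this sketch.

```python
import numpy as np

def sample_cosine(freq_hz, rate_hz, n):
    """n samples of cos(2*pi*freq*tau), taken at tau = j / rate, j = 0, ..., n-1."""
    tau = np.arange(n) / rate_hz
    return np.cos(2.0 * np.pi * freq_hz * tau)

def folded_frequency(freq_hz, rate_hz):
    """Frequency in [0, rate/2] with which freq_hz is confounded after sampling."""
    f = freq_hz % rate_hz
    return min(f, rate_hz - f)

rate = 44100.0
high = sample_cosine(30000.0, rate, 1000)   # above half the sampling rate
low = sample_cosine(folded_frequency(30000.0, rate), rate, 1000)
```

Here `folded_frequency(30000.0, rate)` equals 14100 Hz, and `high` and `low` agree up to rounding error, so the two components cannot be told apart from the sampled series.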

Moreover, if $Y_\tau$ has spectral density $f_Y$, then the spectral density of $X_t$ is

$$f_X(\lambda) = \sum_{u=-\infty}^{\infty} f_Y(\lambda + (2\pi/\Delta\tau)u) \qquad (4.25)$$

for $\lambda \in [-\frac{\pi}{\Delta\tau}, \frac{\pi}{\Delta\tau}]$. This result can be interpreted as follows: a frequency $\lambda > \pi/\Delta\tau$ can be written as $\lambda = \lambda_o + (2\pi/\Delta\tau)j$ for some $j \in \mathbb{N}$ where $\lambda_o$ is in the interval $[-\pi/\Delta\tau, \pi/\Delta\tau]$. The contributions of the two frequencies $\lambda$ and $\lambda_o$ to the observed function $X_t$ (in discrete time) are confounded, i.e. they cannot be distinguished. Thus, if we observe a peak of $f_X$ at a frequency $\lambda \in (0, \pi/\Delta\tau]$, then this may be due to any of the periodic components with periods $2\pi/(\lambda + (2\pi/\Delta\tau)u)$, $u = 0, 1, 2, \ldots$, or a combination of these. This has, for instance, direct implications for sampling of sound signals. Suppose that 22050 Hz (i.e. $\lambda = 22050 \cdot 2\pi \approx 138544.2$) is the highest frequency that we want to identify (and later reproduce) correctly, instead of attributing it to a lower frequency. This would cover the range perceivable by the human ear. Then $\Delta\tau$ must be so small that $\pi/\Delta\tau \ge 22050 \cdot 2\pi$. Thus the time gap $\Delta\tau$ between successive measurements of the sound wave must not exceed 1/44100.

4.2.3 Linear filters

Suppose we need to extract or eliminate frequency components from a signal $X_t$ with spectral density $f_X$. The aim is thus, for instance, to produce an output signal $Y_t$ whose spectral density $f_Y$ is zero for a frequency interval $a \le \lambda \le b$. The simplest, though not necessarily best, way to do this is linear filtering. A linear filter maps an input series $X_t$ to an output series $Y_t$ by

$$Y_t = \sum_{j=-\infty}^{\infty} a_j X_{t-j} \qquad (4.26)$$

The coefficients must fulfill certain conditions in order that the sum is defined. If $X_t$ is second order stationary, then we need $\sum a_j^2 < \infty$. The resulting spectral density of $Y_t$ is

$$f_Y(\lambda) = |A(\lambda)|^2 f_X(\lambda) \qquad (4.27)$$

where

$$A(\lambda) = \sum_{j=-\infty}^{\infty} a_j e^{-ij\lambda}. \qquad (4.28)$$
To eliminate a certain frequency band $[a, b]$ one thus needs a linear filter such that $A(\lambda) \equiv 0$ in this interval. Equation (4.27) also helps to construct and simulate time series models with desired spectral densities: a series with spectral density $f_Y(\lambda) = (2\pi)^{-1}|A(\lambda)|^2$ can be simulated by passing a series of independent observations $X_t$ through the filter $A(\lambda)$. Note that, in reality, one can use only a finite number of terms in the filter so that only an approximation can be achieved.

4.2.4 Special models

When modeling time series statistically, one may use one of the following approaches: a) parametric modeling; b) nonparametric modeling; and c) semiparametric modeling. In parametric modeling, the probability distribution of the time series is completely specified a priori, except for a finite dimensional parameter $\theta = (\theta_1, \ldots, \theta_p)^t$. In contrast, for nonparametric models, an infinite dimensional parameter is unknown and must be estimated from the data. Finally, semiparametric models have parametric and nonparametric components. A link between parametric and nonparametric models can also be established by data-based choice of the length $p$ of the unknown parameter vector $\theta$, with $p$ tending to infinity with the sample size. Some typical parametric models are:

1. White noise: $X_t$ second order stationary, $\mathrm{var}(X_t) = \sigma^2$, $f_X(\lambda) = \sigma^2/(2\pi)$, and $\gamma_X(k) = 0$ ($k \ne 0$)

2. Moving average process of order $q$, MA($q$):

$$X_t = \mu + \varepsilon_t + \sum_{k=1}^{q} \psi_k \varepsilon_{t-k} \qquad (4.29)$$

with $\mu \in \mathbb{R}$, $\varepsilon_t$ independent identically distributed (iid) r.v., $E(\varepsilon_t) = 0$ and $\sigma_\varepsilon^2 = \mathrm{var}(\varepsilon_t) < \infty$. This can also be written as

$$X_t - \mu = \psi(B)\varepsilon_t \qquad (4.30)$$

where $B$ is the backshift operator with $BX_t = X_{t-1}$, $\psi(B) = \sum_{k=0}^{q} \psi_k B^k$. If $\sum_{k=0}^{q} \psi_k z^k = 0$ implies $|z| > 1$, then $X_t$ is invertible in the sense that it can also be written as

$$X_t - \mu = \sum_{k=1}^{\infty} \varphi_k (X_{t-k} - \mu) + \varepsilon_t.$$

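3. Autoregressive process of order $p$, AR($p$):

$$(X_t - \mu) - \sum_{k=1}^{p} \varphi_k (X_{t-k} - \mu) = \varepsilon_t \qquad (4.31)$$

or $\varphi(B)(X_t - \mu) = \varepsilon_t$ where $\varphi(B) = 1 - \sum_{k=1}^{p} \varphi_k B^k$. If $1 - \sum_{k=1}^{p} \varphi_k z^k = 0$ implies $|z| > 1$, then $X_t$ is stationary.

4. Autoregressive moving average process, ARMA($p, q$):

$$\varphi(B)(X_t - \mu) = \psi(B)\varepsilon_t. \qquad (4.32)$$

The spectral density is

$$f_X(\lambda) = \sigma_\varepsilon^2 \frac{|\psi(e^{i\lambda})|^2}{|\varphi(e^{i\lambda})|^2}. \qquad (4.33)$$

5. Linear process:

$$X_t = \mu + \sum_{j=-\infty}^{\infty} \psi_j \varepsilon_{t-j} \qquad (4.34)$$

where $\psi_j$ depend on a finite dimensional parameter vector $\theta$. The spectral density is $f_X(\lambda) = \sigma_\varepsilon^2 |\psi(e^{i\lambda})|^2$.

6. Integrated ARIMA process, ARIMA($p, d, q$) (Box and Jenkins 1970):

$$\varphi(B)((1 - B)^d X_t - \mu) = \psi(B)\varepsilon_t \qquad (4.35)$$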

with $d = 0, 1, 2, \ldots$, where $\varphi(z)$ and $\psi(z)$ are not zero for $|z| \le 1$. This means that the $d$th difference $(1 - B)^d X_t$ is a stationary ARMA process.

7. Fractional ARIMA process, FARIMA($p, d, q$) (Granger and Joyeux 1980, Hosking 1981, Beran 1995):

$$(1 - B)^\delta \varphi(B)\{(1 - B)^m X_t - \mu\} = \psi(B)\varepsilon_t$$

with $d = m + \delta$, $-\tfrac{1}{2} < \delta < \tfrac{1}{2}$ [...] 0, $Y_t =_d c^{-H} Y_{ct}$. This definition implies that the covariances of $Y_t$ are equal to

$$\mathrm{cov}(Y_t, Y_s) = \frac{\sigma^2}{2}\left(|t|^{2H} + |s|^{2H} - |t - s|^{2H}\right)$$

where $\sigma^2 > 0$. If $Y_t$ is Gaussian (i.e. all joint distributions are normal), then the process is fully determined by its expected value and the covariance function. Therefore, there is only one self-similar Gaussian process. This process is called fractional Brownian motion $B_H(t)$ with self-similarity parameter $0 < H < 1$. The discrete time increment process

$$X_t = B_H(t) - B_H(t - 1) \quad (t \in \mathbb{N}) \qquad (4.38)$$

is called fractional Gaussian noise (FGN). FGN is stationary with autocovariances

$$\gamma(k) = \frac{\sigma^2}{2}\left(|k + 1|^{2H} + |k - 1|^{2H} - 2|k|^{2H}\right), \qquad (4.39)$$

the spectral density is equal to (Sinai 1976)

$$f(\lambda) = 2c_f(1 - \cos\lambda) \sum_{j=-\infty}^{\infty} |2\pi j + \lambda|^{-2H-1}, \quad \lambda \in [-\pi, \pi] \qquad (4.40)$$

with $c_f = c_f(H, \sigma^2) = \sigma^2 (2\pi)^{-1} \sin(\pi H)\Gamma(2H + 1)$ and $\sigma^2 = \mathrm{var}(X_i)$. For further discussion see e.g. Beran (1994).

8. Polynomial trend model:

$$X_t = \sum_{j=0}^{p} \beta_j t^j + U_t \qquad (4.41)$$

where $U_t$ is stationary.

9. Harmonic or seasonal trend model:

$$X_t = \sum_{j=0}^{p} \alpha_j \cos\lambda_j t + \sum_{j=0}^{p} \alpha_j \sin\lambda_j t + U_t \qquad (4.42)$$

with $U_t$ stationary.

10. Nonparametric trend model:

$$X_{t,n} = g\left(\frac{t}{n}\right) + U_t \qquad (4.43)$$

with $g : [0, 1] \to \mathbb{R}$ a “smooth” function (e.g. twice continuously differentiable) and $U_t$ stationary.

11. Semiparametric fractional autoregressive model, SEMIFAR($p, d, q$) (Beran 1998, Beran and Ocker 1999, 2001, Beran and Feng 2002a,b):

$$(1 - B)^\delta \varphi(B)\{(1 - B)^m X_t - g(s_t)\} = U_t \qquad (4.44)$$

where $d$, $\varphi$, $\varepsilon_t$ and $g$ are as above and $m = 0, 1$. In this case, the centered differenced process $Y_t = (1 - B)^m X_t - g(s_t)$ is a fractional ARIMA($p, \delta, 0$) model. The SEMIFAR model incorporates stationarity, difference stationarity, antipersistence, short memory and long memory, as well as an unspecified trend. Incorporating all these components enables us to distinguish statistically which of the components are present in an observed time series (see Beran and Feng 2002a,b). A software implementation by Beran is included in the S-Plus package FinMetrics and described in Zivot and Wang (2002).

4.2.5 Fitting parametric models

If $X_t$ is a second order stationary model with a distribution function that is known except for a finite dimensional parameter $\theta^o = (\theta_1^o, \ldots, \theta_k^o)^t \in \Theta \subseteq \mathbb{R}^k$, then the standard estimation technique is the maximum likelihood method: given an observed time series $x_1, \ldots, x_n$, estimate $\theta$ by

$$\hat{\theta} = \arg\max_{\theta \in \Theta} h(x_1, \ldots, x_n; \theta) \qquad (4.45)$$

where $h$ is the joint density function of $(X_1, \ldots, X_n)$. If observations are discrete, then $h$ is the joint probability $P(X_1 = x_1, \ldots, X_n = x_n)$. Equivalently, we may maximize the log-likelihood $L(x_1, \ldots, x_n; \theta) = \log h(x_1, \ldots, x_n; \theta)$. Under fairly general regularity conditions, $\hat{\theta}$ is asymptotically consistent, in the sense that it converges in probability to $\theta^o$. In other words, $\lim_{n\to\infty} P(|\hat{\theta} - \theta^o| > \varepsilon) = 0$ for all $\varepsilon > 0$. In the case of a Gaussian time series with spectral density $f_X(\lambda; \theta)$, we have

$$L(x_1, \ldots, x_n; \theta) = -\frac{1}{2}\left[\log 2\pi + \log|\Sigma_n| + (x - \bar{x})^t \Sigma_n^{-1} (x - \bar{x})\right] \qquad (4.46)$$

where $x = (x_1, \ldots, x_n)^t$, $\bar{x} = \bar{x} \cdot (1, 1, \ldots, 1)^t$, and $|\Sigma_n|$ is the determinant of the covariance matrix of $(X_1, \ldots, X_n)^t$ with elements $[\Sigma_n]_{ij} = \mathrm{cov}(X_i, X_j)$. Since under general conditions $n^{-1}\log|\Sigma_n|$ converges to $(2\pi)^{-1}$ times the integral of $\log f_X$ (Grenander and Szegö 1958), and the $(j, l)$th element of $\Sigma_n^{-1}$ can be approximated by $\int f_X^{-1}(\lambda)\exp\{i(j - l)\lambda\}\,d\lambda$, an approximation to $\hat{\theta}$ can be obtained by the so-called Whittle estimator $\tilde{\theta}$ (Whittle 1953; also see e.g. Fox and Taqqu 1986, Dahlhaus 1987) that minimizes

$$L_n(\theta) = \frac{1}{4\pi} \int_{-\pi}^{\pi} \left[\log f_X(\lambda; \theta) + \frac{I(\lambda)}{f_X(\lambda; \theta)}\right] d\lambda \qquad (4.47)$$

An alternative approximation for Gaussian processes is obtained by using an autoregressive representation of the type $X_t = \sum_{j=1}^{\infty} b_j X_{t-j} + \epsilon_t$, where $\epsilon_t$ are independent identically distributed zero mean normal variables with variance $\sigma_\epsilon^2$. This leads to minimizing the sum of the squared residuals as explained below in Equation (4.50) (see e.g. Box and Jenkins 1970, Beran 1995).
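To make the Whittle approach (4.47) more tangible, the following minimal sketch applies it to an AR(1) model with spectral density $f(\lambda) = \sigma_\varepsilon^2 (2\pi)^{-1} |1 - \varphi e^{i\lambda}|^{-2}$. It replaces the integral by an average over Fourier frequencies, eliminates $\sigma_\varepsilon^2$ analytically, and searches over a grid of $\varphi$ values. The discretization, the grid search, and all function names are assumptions of this example, not the procedure of any particular software.

```python
import numpy as np

def periodogram(x):
    """Periodogram I(lam_j) at Fourier frequencies lam_j = 2*pi*j/n, j = 1, ..., m."""
    x = np.asarray(x, dtype=float)
    n = x.size
    m = (n - 1) // 2
    dft = np.fft.fft(x - x.mean())
    lam = 2.0 * np.pi * np.arange(1, m + 1) / n
    return lam, np.abs(dft[1:m + 1]) ** 2 / (2.0 * np.pi * n)

def whittle_ar1(x, grid=None):
    """Grid-search Whittle estimate (phi, sigma2) for an AR(1) model."""
    if grid is None:
        grid = np.linspace(-0.99, 0.99, 199)
    lam, I = periodogram(x)
    best = None
    for phi in grid:
        shape = 1.0 / np.abs(1.0 - phi * np.exp(1j * lam)) ** 2
        sigma2 = 2.0 * np.pi * np.mean(I / shape)    # minimizer for this phi
        f = sigma2 / (2.0 * np.pi) * shape
        objective = np.mean(np.log(f) + I / f)       # discretized (4.47)
        if best is None or objective < best[0]:
            best = (objective, phi, sigma2)
    return best[1], best[2]
```

Calling `whittle_ar1(x)` on an observed series `x` returns the grid value of $\varphi$ with the smallest discretized objective together with the corresponding innovation variance. A numerical optimizer could replace the grid search when more parameters are involved.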


In general, the actual mathematical and practical difficulty lies in defining a computationally feasible estimation procedure and also in obtaining the asymptotic distribution of $\hat{\theta}$. There is a large variety of models for which this has been achieved. Most results are known for linear models $X_t = \sum \psi_j \varepsilon_{t-j}$ with iid $\varepsilon_t$. (All examples given in the previous section are linear.) The reason is that, if the distribution of $\varepsilon_t$ is known, then the distribution of the process can be recovered by looking at the autocovariances, or equivalently the spectral density, only. Furthermore, if $X_t$ is invertible, i.e. if $X_t$ can be written as $X_t = \sum_{k=1}^{\infty} \varphi_k X_{t-k} + \varepsilon_t$, then $\theta^o$ can be estimated by maximizing the loglikelihood of the independent variables $\varepsilon_t$:

$$\hat{\theta} = \arg\max_{\theta \in \Theta} \sum_{t=1}^{n} \log h_\varepsilon(e_t(\theta)) \qquad (4.48)$$

where $h_\varepsilon$ is the probability density of $\varepsilon$ and $e_t(\theta) = x_t - \sum_{k=1}^{\infty} \varphi_k x_{t-k}$. For a finite sample, $e_t(\theta)$ is approximated by $\hat{e}_t(\theta) = x_t - \sum_{k=1}^{t-1} \varphi_k x_{t-k}$. In the simplest case where $\varepsilon_t$ are normally distributed with $h_\varepsilon(x) = (2\pi\sigma_\varepsilon^2)^{-\frac{1}{2}} \exp\{-x^2/(2\sigma_\varepsilon^2)\}$ and $\theta = (\sigma_\varepsilon^2, \theta_2, \ldots, \theta_p) = (\sigma_\varepsilon^2, \eta)$, we have $e_t(\theta) = e_t(\eta)$ and

$$\hat{\theta} = \arg\min_{\theta \in \Theta} \left[\sum_{t=1}^{n} \log\sigma_\varepsilon^2 + \sum_{t=1}^{n} \left(\frac{e_t(\eta)}{\sigma_\varepsilon}\right)^2\right] \qquad (4.49)$$

Differentiating with respect to $\theta$ leads to

$$\hat{\eta} = \arg\min_{\eta} \sum_{t=1}^{n} e_t^2(\eta) \qquad (4.50)$$

and $\hat{\sigma}_\varepsilon^2 = n^{-1} \sum e_t^2(\hat{\eta})$. Under mild regularity conditions, as $n$ tends to infinity, the distribution of $\sqrt{n}(\hat{\theta} - \theta)$ tends to a normal distribution $N(0, V)$ with covariance matrix $V = 2B^{-1}$ where $B$ is a $p \times p$ matrix with elements

$$B_{ij} = (2\pi)^{-1} \int_{-\pi}^{\pi} \frac{\partial}{\partial\theta_i} \log f(\lambda; \theta) \frac{\partial}{\partial\theta_j} \log f(\lambda; \theta)\,d\lambda$$

(see e.g. Box and Jenkins 1970, Beran 1995).

The estimation method above assumes that the order of the model, i.e. the length $p$ of the parameter vector $\theta$, is known. This is not the case in general so that $p$ has to be estimated from data. Information theoretic considerations (based on definitions discussed in Section 3.1) lead to Akaike’s famous criterion (AIC; Akaike 1973a,b)

$$\hat{p} = \arg\min_{p} \{-2 \log\text{likelihood} + 2p\} \qquad (4.51)$$
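As a hedged illustration of criterion (4.51), the sketch below selects the order of an autoregressive model. It assumes Gaussian innovations, fits each AR($p$) by least squares in the spirit of (4.50), and conditions on the first `p_max` observations so that all orders are compared on the same data. The penalty weight `alpha` is a parameter of the sketch (2 corresponds to (4.51)); the function names are hypothetical.

```python
import numpy as np

def ar_residual_variance(x, p, p_max):
    """Mean squared residual of a least-squares AR(p) fit on x[p_max:]."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    y = x[p_max:]
    if p == 0:
        return np.mean(y ** 2)
    # Column k holds the series lagged by k, aligned with y.
    X = np.column_stack([x[p_max - k:x.size - k] for k in range(1, p + 1)])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.mean((y - X @ coef) ** 2)

def select_order(x, p_max, alpha=2.0):
    """Return (p_hat, criterion values) for -2 log-likelihood + alpha * p."""
    n_eff = len(x) - p_max
    crit = []
    for p in range(p_max + 1):
        s2 = ar_residual_variance(x, p, p_max)
        neg2loglik = n_eff * (np.log(2.0 * np.pi * s2) + 1.0)   # Gaussian, at the MLE
        crit.append(neg2loglik + alpha * p)
    return int(np.argmin(crit)), crit
```

Calling `select_order(x, p_max=6)` returns the order with the smallest criterion value among $0, \ldots, 6$. A larger `alpha` never selects a higher order than a smaller one on the same data.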

More generally, we may minimize $AIC_\alpha = -2 \log\text{likelihood} + \alpha p$ with respect to $p$. This includes the AIC ($\alpha = 2$), the BIC (Bayesian information criterion, Schwarz 1978, Akaike 1979) with $\alpha = \log n$ and the HIC (Hannan and Quinn 1979) with $\alpha = 2c \log\log n$ ($c > 1$). It can be shown that, if the observed process is indeed generated by a process from the postulated class of models, and if its order is $p_o$, then for $\alpha \ge O(2c \log\log n)$ the estimated order is asymptotically correct with probability one. In contrast, if $\alpha/(2c \log\log n) \to 0$ as $n \to \infty$, then the criterion tends to choose too many parameters in the sense that $P(\hat{p} > p_o)$ converges to a positive probability. This is, for instance, the case for Akaike’s criterion. Thus, if identification of a correct model is the aim, and the observed process is indeed likely to be at least very close to the postulated model class, then $\alpha \ge O(2c \log\log n)$ should be used. On the other hand, one may argue that no model is ever correct, so that increasing the number of parameters with increasing sample size may be the right approach. In this case, the original AIC is a good candidate. It should be noted, however, that if $p \to \infty$ as $n \to \infty$, then the asymptotic distribution and even the rate of convergence of $\hat{\theta}$ changes, since this is a kind of nonparametric modeling with an ultimately infinite dimensional parameter.

4.2.6 Fitting non- and semiparametric models

Most techniques for fitting nonparametric models rely on smoothing, combined with additional estimation of parameters needed for fine tuning of the smoothing procedure. To illustrate this, consider for instance,

$$(1 - B)^m X_t = g(s_t) + U_t \qquad (4.52)$$

as defined above where $U_t$ is second order stationary and $s_t = t/n$. If $m$ is known, then $g$ may be estimated, for instance, by a kernel smoother

$$\hat{g}(t_o) = \frac{1}{nb} \sum_{t=1}^{n} K\left(\frac{s_t - s_{t_o}}{b}\right) y_t \qquad (4.53)$$

as defined in Chapter 2, with $y_t = (1 - B)^m x_t$. However, results may differ considerably depending on the choice of the bandwidth $b$ (see e.g. Gasser and Müller 1979, Beran and Feng 2002a,b). The optimal bandwidth depends on the nature of the residual process $U_t$. A criterion for optimality is, for instance, the integrated mean squared error

$$IMSE = \int E\{[\hat{g}(s) - g(s)]^2\}\,ds.$$

The IMSE can be written as

$$IMSE = \int \{E[\hat{g}(s)] - g(s)\}^2\,ds + \int \mathrm{var}(\hat{g}(s))\,ds = \int \{\text{Bias}^2 + \text{variance}\}\,ds.$$

The Bias only depends on the function $g$, and is thus independent of the error process. The variance, on the other hand, is a function of the covariances $\gamma_U(k) = \mathrm{cov}(U_t, U_{t+k})$, or equivalently the spectral density $f_U$.
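A direct transcription of the kernel smoother (4.53) might look as follows. The Epanechnikov kernel and the function names are assumptions of this sketch (Chapter 2 may use a different kernel), and no boundary correction is applied, so the estimate is only meaningful for points at least one bandwidth away from 0 and 1.

```python
import numpy as np

def epanechnikov(u):
    """Kernel with support [-1, 1] that integrates to one."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)

def kernel_smoother(y, s0, b, kernel=epanechnikov):
    """g_hat(s0) = (1/(n b)) sum_t K((s_t - s0)/b) y_t with s_t = t/n, cf. (4.53)."""
    y = np.asarray(y, dtype=float)
    n = y.size
    s = np.arange(1, n + 1) / n
    return np.sum(kernel((s - s0) / b) * y) / (n * b)

# Illustrative usage: recover a linear trend g(s) = 2s from noisy values.
rng = np.random.default_rng(0)
n = 1000
s = np.arange(1, n + 1) / n
y = 2.0 * s + rng.normal(scale=0.1, size=n)
g_half = kernel_smoother(y, s0=0.5, b=0.1)
```

The value `g_half` should be close to $g(0.5) = 1$. How close depends on the bandwidth `b`, which governs the trade-off between bias and variance described in the text.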


The bandwidth that minimizes the IMSE thus depends on the unknown quantities $g$ and $f_U$. Both $g$ and $f_U$ therefore have to be estimated simultaneously in an iterative fashion. For instance, in a SEMIFAR model, the asymptotically optimal bandwidth can be shown to be equal to
$$b_{opt} = C_{opt}\, n^{(2\delta-1)/(5-2\delta)}$$
where $C_{opt}$ is a constant that depends on the unknown parameter vector $\theta = (\sigma^2, d, \varphi_1, ..., \varphi_p)^t$. Note that in this case, $m$ is also part of the unknown vector. An algorithm for estimating $g$ as well as $\theta$ can be defined by starting with an initial estimate of $\theta$, calculating the corresponding optimal bandwidth, subtracting $\hat g$ from $x_t$, reestimating $\theta$, estimating the new optimal bandwidth and so on. Note that in addition the order $p$ is unknown, so that a model choice criterion has to be used at some stage. This complicates matters considerably, and special care has to be taken to define a reliable algorithm. Algorithms that work theoretically as well as practically for reasonably small sample sizes are discussed in Beran and Feng (2002a,b).

4.2.7 Spectral estimation

Sometimes one is only interested in the spectral density $f_X$ of a stationary process or, equivalently, the autocovariances $\gamma_X(k)$, without modeling the whole distribution of the time series. The reason can be, for instance, that as discussed above, one may be mainly interested in (random) periodicities which are identifiable as peaks in the spectral density. A natural nonparametric estimate of $\gamma_X(k)$ is the sample autocovariance
$$\hat\gamma(k) = \frac{1}{n}\sum_{t=1}^{n-k}(x_t - \bar x)(x_{t+k} - \bar x) \qquad (4.54)$$
for $k \ge 0$ and $\hat\gamma(-k) = \hat\gamma(k)$. The corresponding estimate of $f_X$ is the periodogram
$$I(\lambda) = \frac{1}{2\pi}\sum_{k=-(n-1)}^{n-1}\hat\gamma(k)e^{ik\lambda} = \frac{1}{2\pi n}\Big|\sum_{t=1}^{n}(x_t - \bar x)e^{it\lambda}\Big|^2 \qquad (4.55)$$
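The equality of the two expressions in (4.55) can be checked numerically. The following is a minimal sketch in Python with NumPy; the function names, the simulated series and the frequency 0.7 are illustrative assumptions and are not taken from the text.

```python
import numpy as np


def sample_autocov(x, k):
    """Sample autocovariance at lag k >= 0 with divisor n, as in (4.54)."""
    n = len(x)
    xc = x - x.mean()
    return np.sum(xc[: n - k] * xc[k:]) / n


def periodogram_from_acov(x, lam):
    """First form in (4.55): Fourier sum of the sample autocovariances."""
    n = len(x)
    total = sample_autocov(x, 0)
    for k in range(1, n):
        # gamma(-k) = gamma(k), so the terms at +k and -k add up to 2*gamma(k)*cos(k*lam)
        total += 2.0 * sample_autocov(x, k) * np.cos(k * lam)
    return total / (2.0 * np.pi)


def periodogram_direct(x, lam):
    """Second form in (4.55): squared modulus of the centered Fourier sum."""
    n = len(x)
    t = np.arange(1, n + 1)
    xc = x - x.mean()
    return np.abs(np.sum(xc * np.exp(1j * t * lam))) ** 2 / (2.0 * np.pi * n)


# Illustrative data: white noise of length 64, evaluated at an arbitrary frequency
rng = np.random.default_rng(0)
x = rng.standard_normal(64)
lam = 0.7
i_acov = periodogram_from_acov(x, lam)
i_direct = periodogram_direct(x, lam)
```

Both functions should return the same value up to rounding error; the second form is the one usually computed in practice, since it needs only one pass over the data.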

Sometimes a so-called tapered periodogram is used:
$$I_w(\lambda) = (2\pi n)^{-1}\Big|\sum_{t=1}^{n} w\big(\tfrac{t}{n}\big)(x_t - \bar x)e^{it\lambda}\Big|^2$$
where $w$ is a weight function. It can be shown that $E[I(\lambda)] \to f_X(\lambda)$ as $n \to \infty$. However, for lags close to $n-1$, $\hat\gamma(k)$ is very inaccurate, because one averages over $n-k$ observed pairs only. For instance, for $k = n-1$, there is only one observed pair with this lag, namely $(x_1, x_n)$. As a result, $I(\lambda)$ does


not converge to $f_X(\lambda)$. Instead, the following holds under mild regularity conditions: if $0 < \lambda_1 < ... < \lambda_k < \pi$, then, as $n \to \infty$, the distribution of $2 \cdot [I(\lambda_1)/f_X(\lambda_1), ..., I(\lambda_k)/f_X(\lambda_k)]$ converges to the distribution of $(Z_1, ..., Z_k)$ where the $Z_i$ are independent $\chi^2_2$-distributed random variables. This result is also true for sequences of frequencies $0 < \lambda_{1,n} < ... < \lambda_{k,n} < \pi$ as long as the smallest distance between the frequencies, $\min |\lambda_{i,n} - \lambda_{j,n}|$, does not converge to zero faster than $n^{-1}$. Because of the latter condition, and also for computational reasons (fast Fourier transform, FFT; see Cooley and Tukey 1965, Bringham 1988), one usually calculates $I(\lambda)$ at the so-called Fourier frequencies $\lambda_j = 2\pi j/n$ ($j = 1, ..., m$ with $m = [(n-1)/2]$) only. Note that for Fourier frequencies, $\sum_{t=1}^{n} e^{it\lambda_j} = 0$, so that
$$I(\lambda_j) = (2\pi n)^{-1}\Big|\sum_{t=1}^{n} x_t e^{it\lambda_j}\Big|^2.$$
Thus, the sample mean actually does not need to be subtracted. The periodogram at Fourier frequencies can also be understood as a decomposition of the variance into orthogonal components, analogous to classical analysis of variance (Scheffé 1959): for $n$ odd,
$$\sum_{t=1}^{n}(x_t - \bar x)^2 = 4\pi\sum_{j=1}^{m} I(\lambda_j) \qquad (4.56)$$
and for $n$ even,
$$\sum_{t=1}^{n}(x_t - \bar x)^2 = 4\pi\sum_{j=1}^{m} I(\lambda_j) + 2\pi I(\pi). \qquad (4.57)$$
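The variance decomposition can be verified numerically for odd and even sample sizes, with the sum running over all Fourier frequencies $\lambda_1, ..., \lambda_m$. The sketch below assumes Python with NumPy and uses the FFT to obtain the periodogram; the helper names and the simulated data are hypothetical.

```python
import numpy as np


def periodogram_fourier(x):
    """Periodogram at the frequencies 2*pi*j/n, j = 0, ..., n-1, computed via the FFT."""
    n = len(x)
    xc = x - x.mean()
    return np.abs(np.fft.fft(xc)) ** 2 / (2.0 * np.pi * n)


def variance_decomposition(x):
    """Return the left- and right-hand sides of the decomposition (4.56)/(4.57)."""
    n = len(x)
    I = periodogram_fourier(x)
    m = (n - 1) // 2
    rhs = 4.0 * np.pi * I[1 : m + 1].sum()      # Fourier frequencies lambda_1, ..., lambda_m
    if n % 2 == 0:
        rhs += 2.0 * np.pi * I[n // 2]          # extra term I(pi) for even n
    lhs = np.sum((x - x.mean()) ** 2)
    return lhs, rhs


rng = np.random.default_rng(1)
lhs_odd, rhs_odd = variance_decomposition(rng.standard_normal(101))
lhs_even, rhs_even = variance_decomposition(rng.standard_normal(100))
```

In both cases the two sides should agree up to rounding error, which is a direct consequence of Parseval's identity for the discrete Fourier transform.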

The decomposition means that $I(\lambda_j)$ corresponds to the (empirically observed) contribution of periodic components with frequency $\lambda_j$ to the overall variability of $x_1, ..., x_n$. A consistent estimate of $f_X$ can be obtained by eliminating or downweighting sample autocovariances with too large lags:
$$\hat f(\lambda) = \frac{1}{2\pi}\sum_{k=-(n-1)}^{n-1} w_n(k)\hat\gamma(k)e^{ik\lambda} \qquad (4.58)$$
where $w_n(k) = 0$ (or becomes negligible) for $|k| > M_n$, with $M_n/n \to 0$ and $M_n \to \infty$. Equivalently, one can define a smoothed periodogram
$$\hat f(\lambda) = \int W_n(\nu - \lambda) I(\nu)\,d\nu \qquad (4.59)$$
for a suitable sequence of window functions $W_n$ such that $\int W_n(\nu - \lambda) f(\nu)\,d\nu$ converges to $f(\lambda)$ as $n \to \infty$. See e.g. Priestley (1981) for a detailed discussion. Finally, it should be noted that, in spite of its inconsistency, the raw periodogram is very useful for finding periodicities. In particular, in the case


of deterministic periodicities with frequencies $\omega_j$, $I(\lambda)$ diverges to infinity for $\lambda = \omega_j$ and remains finite (proportional to a $\chi^2_2$-variable) elsewhere.

4.2.8 The harmonic regression model

An important approach to analyzing musical sounds is the harmonic regression model
$$X_t = \sum_{j=1}^{p}[\alpha_j \cos\omega_j t + \beta_j \sin\omega_j t] + U_t \qquad (4.60)$$
with $U_t$ stationary. Note that, theoretically, this model can also be understood as a stationary process with jumps in the spectral distribution $F_X$ (see Section 4.2.1). Given $\omega = (\omega_1, ..., \omega_p)^t$, the parameter vector $\theta = (\alpha_1, ..., \alpha_p, \beta_1, ..., \beta_p)^t$ can be estimated by the least squares or, more generally, weighted least squares method,
$$\hat\theta = \arg\min_{\theta}\sum_{t=1}^{n} w\big(\tfrac{t}{n}\big)\Big[x_t - \sum_{j=1}^{p}(\alpha_j \cos\omega_j t + \beta_j \sin\omega_j t)\Big]^2 \qquad (4.61)$$

where $w$ is a weight function. The solution is obtained from usual linear regression formulas. In many applications the situation is more complex, since the frequencies $\omega_1, ..., \omega_p$ are also unknown. This leads to a nonlinear regression problem. A simple approximate solution can be given by (Walker 1971, Hannan 1973, Hassan 1982, Brown 1990, Quinn and Thomson 1991)
$$\hat\omega = \arg\max_{\omega}\sum_{j=1}^{p}\Big|\sum_{t=1}^{n} w\big(\tfrac{t}{n}\big)x_t e^{i\omega_j t}\Big|^2 = \arg\max_{\omega}\sum_{j=1}^{p} I_w(\omega_j), \qquad (4.62)$$

[...] 1 and $b > 0$ is a bandwidth that determines how large the window (block) is, i.e. how many consecutive observations are considered to correspond approximately to a harmonic regression model with fixed coefficients $\alpha_j$, $\beta_j$ and stationary noise $U_t$. This is illustrated in color Figure 4.7 for a harpsichord sound, with $W(u) = 1\{|u| \le 1\}$. Intense pink corresponds to high values of $I(t, \lambda)$. Figures 4.6a through d show explicitly the change in $I(t, \lambda)$ between four different blocks. Since the note was played “staccato”, the sound wave is very short, namely about 0.1 seconds. Nevertheless, there is a change in the spectrum of the sound, with some of the higher harmonics fading away.

Figure 4.3 Periodogram of piano sound wave in Figure 4.2.

Figure 4.4 Sound wave of e played on a harpsichord. [Plot title: Sound wave of e’’ flat played by harpsichord (0.25sec at sampling rate=44100 Hz); amplitude against time in seconds.]

Figure 4.5 Periodogram of harpsichord sound wave in Figure 4.4.

Figure 4.6 Harpsichord sound – periodogram plots for different time frames (moving windows of time points). [Panel titles include “Harpsichord - Periodogram (block 1)” and “Harpsichord - Periodogram (block 22)”; periodogram on logarithmic scale against frequency.]

Figure 4.7 A harpsichord sound and its spectrogram. Intense pink corresponds to high values of $I(t, \lambda)$. (Color figures follow page 152.)

Apart from the relative amplitudes of partials, most musical sounds include a characteristic nonperiodic noise component. This is a further justification, apart from possible measurement errors, to include a random deviation part in the harmonic regression equation. The properties of the stochastic process $U_t$ are believed to be characteristic for specific instruments (see e.g. Serra and Smith 1991, Rodet 1997). Typical noise components are, for instance, transient noise in percussive instruments, breath noise in wind instruments, or bow noise of string instruments. For a discussion of statistical issues in this context see e.g. Irizarry (2001). For most instruments, not only the harmonic amplitudes but also the characteristics of the noise component change gradually. This may be modeled by smoothly changing processes as defined for instance in Ghosh et al. (1997). Other approaches are discussed in Priestley (1965) and Dahlhaus (1996a,b, 1997) (see Section 4.2.1 above). Some interesting applications of the asymptotic results in Section 4.2.8 to questions arising in the analysis of musical sounds are discussed in Irizarry


(2001). In particular, the following experiment is described: recordings of a professional clarinet player trying to play concert pitch A ($\omega_1 = 441$ Hz) and a professional guitar player playing D ($\omega_1 = 146.8$ Hz) were made. For the analysis of the clarinet sound, a one-second segment was divided into non-overlapping blocks consisting of 1025 measurements (≈23 milliseconds) and the harmonic regression model was fitted to each block separately. For the guitar, the same was done with 60 non-overlapping intervals with 3000 observations each. Two types of results were obtained:

1. The clarinet player turned out to be always out of tune in the sense that the estimated fundamental frequency $\hat\omega_1$ was always outside the 95% acceptance region $441\,\mathrm{Hz} \pm 1.96\, C_{33}(\omega_1^o)\, n^{-3/2}$ where the null hypothesis is $H_o: \omega_1 = \omega_1^o = 441$ Hz. On the other hand, from the point of view of musical perception, the clarinet player was not out of tune, because the deviation from 441 Hz was less than 0.76 Hz, which corresponds to 0.03 semitones. According to experimental studies, the human ear cannot distinguish notes that are 0.03 semitones apart (Pierce 1983/1992).

2. Physical models (see e.g. Fletcher and Rossing 1991) postulate the following relationships between the fundamental frequency and partials: for a “harmonic instrument” such as the clarinet, one expects $\omega_j = j \cdot \omega_1$, whereas for a “plucked string instrument”, such as the guitar, one should have $\omega_j \approx c j^2 \cdot \omega_1$ where $c$ is a constant determined by properties of the strings. The experiment described in Irizarry (2001) supports the assumption for the clarinet in the sense that, in general, the 95%-confidence intervals for the difference $\omega_j - j\omega_1$ contained 0. For the guitar, his findings suggest a relationship of the form $\omega_j \approx c(a + j)^2 \omega_1$ with $a \ne 0$.

4.3.2 Licklider’s theory of pitch perception

Thumfart (1995) uses the theory of discrete evolutionary spectra to derive a simple linear model for pitch perception as proposed by Licklider (1951).
The general biological background is as follows (see e.g. Kelly 1991): vibrations of the ear drum caused by sound waves are transferred to the inner ear (cochlea) by three ossicles in the middle ear. The inner ear is a spiral structure that is partitioned along its length by the basilar membrane. The sound wave causes a traveling wave on the basilar membrane which in turn causes hair cells positioned at different locations to release a chemical transmitter. The chemical transmitter generates nerve impulses to the auditory nerve. At which location on the membrane the highest amplitude occurs, and thus which groups of hair cells are activated, depends on the frequency


of the sound wave. This means that certain frequency regions correspond to certain hair groups. Frequency bands with high spectral density $f$ (or high increments $dF$ of the spectral distribution) activate the associated hair groups. To obtain a simple model for the effect of a sound on the basilar membrane movement, Slaney and Lyon (1991) partition the cochlea into 86 sections, each section corresponding to a particular group of cells. Thumfart (1995) assumes that each group of cells acts like a separate linear filter $\Psi_j$ ($j = 1, ..., 86$). (This is a simplification compared to Slaney and Lyon who use nonlinear models.) The wave entering the inner ear is assumed to be the original sound wave $X_t$, filtered by the outer ear by a linear filter $A_1$, and by the middle ear by a linear filter $A_2$. Thus, the output of the inner ear that generates the final nerve impulses consists of 86 time series
$$Y_{t,j} = \Psi_j(B) A_2(B) A_1(B) X_t \quad (j = 1, ..., 86). \qquad (4.82)$$
Calculating tapered local periodograms $I_j(u, \lambda)$ of $Y_{t,j}$ for each of the 86 sections ($j = 1, ..., 86$), one can then define the quantity
$$c(k, j, u) = \int_{-\pi}^{\pi} I_j(u, \lambda) e^{ik\lambda}\,d\lambda \qquad (4.83)$$

which Slaney and Lyon call “correlogram”. This is in fact an estimated local autocovariance at lag $k$ for section $j$ and the time-segment with midpoint $u$. The “Slaney-Lyon-correlogram” thus essentially characterizes the local autocovariance structure of the resulting nerve impulse series. Thumfart (1995) shows formally how, and under which conditions, this model can be defined within the framework of processes with a discrete evolutionary spectrum. He also suggests a simple method for estimating pitch (the fundamental frequency) at local time $u$ by setting $\hat\omega(u) = 2\pi/k_{max}(u)$ where $k_{max}(u) = \arg\max_k C(k, u)$ and $C(k, u) = \sum_{j=1}^{86} c(k, j, u)$.

4.3.3 Identification of pitch, tone separation and purity of intonation

In a recent study, Weihs et al. (2001) investigate objective criteria for judging the quality of singing (also see Ligges et al. 2002). The main question asked in their analysis is how to assess purity of intonation. In an experimental setting, with standardized playback piano accompaniment in a recording studio, 17 singers were asked to sing Händel’s “Tochter Zion” and Beethoven’s “Ehre Gottes aus der Natur”. The audio signal of the vocal performance was recorded in CD quality in 16-bit format at a sampling rate of 44100 Hz. For the actual statistical analysis, data is reduced to 11000 Hz, for computational reasons, and standardized to the interval [-1,1]. The first question is how to identify the fundamental frequency (pitch) $\omega_1$. In the harmonic regression model above, estimates of $\omega_1$ and the partials $\omega_j$ ($2 \le j \le k$) are identical with the $k$ frequencies where the periodogram assumes its $k$ largest values. Weihs et al. suggest a simplified (though clearly suboptimal) version of this, in that they consider the periodogram at Fourier frequencies $\lambda_j = 2\pi j/n$ ($j = 1, 2, ..., m = [(n-1)/2]$) only and set
$$\tilde\omega_1 = \min_{\lambda_j \in \{\lambda_2, ..., \lambda_{m-1}\}}\{\lambda_j : I(\lambda_j) > \max[I(\lambda_{j-1}), I(\lambda_{j+1})]\}. \qquad (4.84)$$
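The first-peak rule (4.84) can be sketched in a few lines of Python with NumPy. This is an illustration only, not the implementation of Weihs et al.: the function names and the test signal are assumptions, and the optional argument `min_height` is a hypothetical safeguard that is not part of (4.84). It is added here because, for noisy data, small local maxima caused by noise may occur before the peak at the fundamental frequency.

```python
import numpy as np


def periodogram_fourier(x):
    """Periodogram at the frequencies 2*pi*j/n, j = 0, ..., n-1, computed via the FFT."""
    n = len(x)
    xc = np.asarray(x, dtype=float) - np.mean(x)
    return np.abs(np.fft.fft(xc)) ** 2 / (2.0 * np.pi * n)


def first_peak_frequency(x, min_height=0.0):
    """Smallest Fourier frequency lambda_j (2 <= j <= m-1) at which the periodogram
    has a local maximum, as in (4.84). min_height is a hypothetical extra threshold;
    with the default 0.0 the rule is applied literally."""
    n = len(x)
    m = (n - 1) // 2
    I = periodogram_fourier(x)
    for j in range(2, m):                      # candidates lambda_2, ..., lambda_{m-1}
        if I[j] > max(I[j - 1], I[j + 1]) and I[j] >= min_height:
            return 2.0 * np.pi * j / n
    return None


# Illustrative signal: fundamental at lambda_20, one partial at lambda_40, weak noise
n = 400
t = np.arange(n)
rng = np.random.default_rng(2)
x = (np.cos(2 * np.pi * 20 * t / n)
     + 0.5 * np.cos(2 * np.pi * 40 * t / n)
     + 0.1 * rng.standard_normal(n))
I = periodogram_fourier(x)
omega_literal = first_peak_frequency(x)
omega_guarded = first_peak_frequency(x, min_height=0.1 * I.max())
```

With the threshold, the estimate coincides with the fundamental Fourier frequency $2\pi \cdot 20/400$; applied literally, the rule can stop earlier at a local maximum produced by the noise alone.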

In other words, the estimate $\tilde\omega_1$ in (4.84) corresponds to the Fourier frequency where the first peak of the periodogram occurs. Because of the restriction to Fourier frequencies, the periodogram may have two adjacent peaks and the estimate is too inaccurate in general. An empirical interpolation formula is suggested by the authors to obtain an improved estimate $\hat\omega_1$. A comparison with harmonic regression is not made, however, so that it is not clear how well the interpolation works in comparison. Given a procedure for pitch identification, an automatic note separation procedure can be defined. This is a procedure that identifies time points in a sound signal where a new note starts. The interesting result in Weihs et al. is that automatic note separation works better for amateur singers than for professionals. The reason may be the absence of vibrato in amateur voices. In a third step, Weihs et al. address the question of how to assess computationally the purity of intonation based on a vocal time series. This is done using discriminant analysis. The discussion of these results is therefore postponed to Chapter 9.

4.3.4 Music as 1/f noise?

In the 1970s Voss and Clarke (1975, 1978) discovered a seemingly universal “law” according to which music has a $1/f$ spectrum. A $1/f$ spectrum means that the observed process has a spectral density $f$ such that $f(\lambda) \propto \lambda^{-1}$ as $\lambda \to 0$. In the sense of definition (4.10), such a density actually does not exist. However, a generalized version of spectral density exists in the sense that the expected value of the periodogram converges to this function (see Matheron 1973, Solo 1992, Hurvich and Ray 1995).
Specifically, Voss and Clarke analyzed acoustic music signals by first transforming the recorded signal $X_t$ in the following way: a) $X_t$ is filtered by a band-pass filter (frequencies outside the interval [10Hz, 10000Hz] are eliminated); and b) the “instantaneous power” $Y_t = X_t^2$ is filtered by a low-pass filter (frequencies above 20Hz are eliminated). This filtering technique essentially removes higher frequencies but retains the overall shape (or envelope) of each sound wave corresponding to a note and the relative position on the onset axis. In this sense, Voss and Clarke actually analyzed rhythmic structures. A recent, statistically more sophisticated study along this line is described in Brillinger and Irizarry (1998). One objection to this approach can be that in acoustic signals, structural


Figure 4.8 A harpsichord sound wave (a), logarithm of squared amplitudes (b), histogram of the series (c) and its periodogram on log-scale (d) together with fitted SEMIFAR-spectrum. [Panel titles: a) Harpsichord sound wave (e flat) sampled at 44100 Hz; b) Harpsichord - log(power); c) Harpsichord histogram of log(power); d) Harpsichord log-log-periodogram and SEMIFAR-fit (d=0.51).]

properties of the composition may be confounded with those of the instruments. Consider, for instance, the harpsichord sound wave in Figure 4.8a. The square of the wave is displayed in Figure 4.8b on logarithmic scale. The picture illustrates that, apart from obvious oscillation, the (envelope of the) signal changes slowly. Fitting a SEMIFAR-model (with order $p \le 8$ chosen by the BIC) yields a good fit to the periodogram. The estimated fractional differencing parameter is $\hat d = 0.51$ with a 95%-confidence interval of [0.29, 0.72]. This corresponds to a spectral density (defined in the generalized sense above) that is proportional to $\lambda^{-1.02}$, or approximately $\lambda^{-1}$. Thus, even in a composition consisting of one single note one would detect $1/f$ noise in the resulting sound wave.

Instead of recorded sound waves, we therefore consider the score itself, independently of which instrument is supposed to play. This is similar but not identical to considering zero crossings of a sound signal (see Voss and Clarke 1975, 1978, Voss 1988; Brillinger and Irizarry 1998). Figures 4.9a and c show the log-frequencies plotted against onset time for the first movement of Bach’s first Cello-Suite and for Paganini’s Capriccio No. 24. For Bach, the SEMIFAR-fit yields $\hat d \approx 0.7$ with a 95%-confidence interval of [0.46, 0.93]. This corresponds to a $1/f^{1.4}$ spectrum; however $1/f$ ($d = 1/2$) is included in the confidence interval. Thus, there is not enough evidence against the $1/f$ hypothesis. In contrast, for Paganini (Figure 4.11) we obtain $\hat d \approx 0.21$ with a 95%-confidence interval of [0.07, 0.35] which excludes $1/f$ noise. This indicates that there is a larger variety of fractal behavior than the “1/f law” would suggest. Note also that in both cases there is also a trend in the data which is in fact an even stronger type of long memory than the stochastic one. Moreover, Bach’s (and also to a lesser degree Paganini’s) spectrum has local maxima in the spectral density, indicating periodicities (see Section 4.2.9). Thus, there is no “pure” $1/f^{\alpha}$ behavior but instead a mixture of long-range dependence expressed by the power law near the origin, and short-range periodicities.
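The arithmetic behind these statements is simple: a fractional differencing parameter $d$ corresponds to a spectral density proportional to $\lambda^{-2d}$ near the origin, so that $1/f$ noise corresponds to $d = 1/2$. The following minimal Python sketch applies this to the estimates and confidence intervals quoted above; the function and variable names are hypothetical.

```python
def spectral_exponent(d):
    """Exponent a in f(lambda) ~ lambda^(-a) near zero for long-memory parameter d."""
    return 2.0 * d


def compatible_with_one_over_f(ci_lower, ci_upper):
    """True if d = 1/2 (i.e. a 1/f spectrum) lies inside the confidence interval for d."""
    return ci_lower <= 0.5 <= ci_upper


# (estimate of d, lower and upper limit of the 95%-confidence interval), as quoted in the text
estimates = {
    "harpsichord sound": (0.51, 0.29, 0.72),
    "Bach": (0.70, 0.46, 0.93),
    "Paganini": (0.21, 0.07, 0.35),
}

summary = {
    name: (spectral_exponent(d), compatible_with_one_over_f(lo, hi))
    for name, (d, lo, hi) in estimates.items()
}
```

The dictionary `summary` then holds, for each series, the implied exponent and whether the $1/f$ hypothesis is compatible with the reported interval.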

Figure 4.9 Log-frequencies with fitted SEMIFAR-trend and log-log-periodogram together with SEMIFAR-fit for Bach’s first Cello Suite (1st movement; a,b) and Paganini’s Capriccio No. 24 (c,d) respectively.

Finally, consider an alternative quantity, namely local variability of notes modulo octave. Since we are in $\mathbb{Z}_{12}$, a measure of variability for circular data should be used. Here, we use the measure $V = 1 - \bar R$ as defined in Chapter 7 or rather the transformed variable $\log[(V + 0.05)/(1.05 - V)]$. The resulting standardized time series are displayed in Figures 4.10a and c. The log-log-plots of the periodograms and fitted SEMIFAR-spectra are given in Figures 4.10b and d respectively. The estimated long-memory parameters are similar to before, namely $\hat d = 0.51$ ([0.20, 0.81]) for Bach and 0.33 ([0.24, 0.42]) for Paganini.

Figure 4.10 Local variability with fitted SEMIFAR-trend and log-log-periodogram together with SEMIFAR-fit for Bach’s first Cello Suite (1st movement; a,b) and Paganini’s Capriccio No. 24 (c,d) respectively.
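A local circular variability series of this kind can be sketched as follows in Python with NumPy. The definition of $\bar R$ is given in Chapter 7 and is not reproduced here; the sketch assumes the usual mean resultant length of the pitch classes mapped to angles $2\pi k/12$. The window length and the function names are hypothetical choices.

```python
import numpy as np


def circular_variability(pitch_classes):
    """V = 1 - R_bar, with R_bar assumed to be the mean resultant length of the
    pitch classes mapped to the unit circle (angles 2*pi*k/12)."""
    angles = 2.0 * np.pi * np.asarray(pitch_classes, dtype=float) / 12.0
    r_bar = np.abs(np.mean(np.exp(1j * angles)))
    return 1.0 - r_bar


def transformed_variability(v):
    """Transformation log[(V + 0.05) / (1.05 - V)] used in the text."""
    return np.log((v + 0.05) / (1.05 - v))


def local_variability(pitch_classes, window):
    """V computed in a moving window of `window` consecutive notes (hypothetical choice)."""
    pcs = np.asarray(pitch_classes)
    return np.array([
        circular_variability(pcs[i : i + window])
        for i in range(len(pcs) - window + 1)
    ])


# Illustrative sequence of notes modulo octave (0 = C, 1 = C sharp, ...)
notes = [0, 4, 7, 0, 4, 7, 2, 5, 9, 11]
v_local = local_variability(notes, window=4)
z_local = transformed_variability(v_local)
```

A window in which all notes coincide gives $V = 0$, while notes spread evenly around the circle of twelve pitch classes give $V = 1$.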


Figure 4.11 Niccolò Paganini (1782-1840). (Courtesy of Zentralbibliothek Zürich.)


CHAPTER 5

Hierarchical methods

5.1 Musical motivation

Musical structures are typically generated in a hierarchical manner. Most compositions can be divided approximately into natural segments (e.g. movements of a sonata); these are again divided into smaller units (e.g. exposition, development, and coda of a sonata movement). These can again be divided into smaller parts (e.g. melodic phrases), and so on. Different parts even at the same hierarchical level need not be disjoint. For instance, different melodic lines may overlap. Moreover, different parts are usually closely related within and across levels. A general mathematical approach to understanding the vast variety of possibilities can be obtained, for instance, by considering a hierarchy of maps defined in terms of a manifold (see e.g. Mazzola 1990a). The concept of hierarchical relationships and similarities is also related to “self-similarity” and fractals as defined in Mandelbrot (1977) (see Chapter 3). To obtain more concrete results, hierarchical regression models have been developed in the last few years (Beran and Mazzola 1999a,b, 2000, 2001).

5.2 Basic principles

5.2.1 Hierarchical aggregation and decomposition

Suppose that we have two time series $Y_t$, $X_t$ and we wish to model the relationship between $Y_t$ and $X_t$. The simplest model is simple linear regression
$$Y_t = \beta_0 + \beta_1 X_t + \varepsilon_t \qquad (5.1)$$

where $\varepsilon_t$ is a stationary zero mean process independent of $X_t$. If $Y_t$ and $X_t$ are expected to be “hierarchical”, then we may hope to find a more realistic model by first decomposing $X_t$ (and possibly also $Y_t$) and searching for dependence structures between $Y_t$ (or its components) and the components of $X_t$. Thus, given a decomposition $X_t = X_{t,1} + ... + X_{t,M}$, we consider the multiple regression model
$$Y_t = \beta_0 + \sum_{j=1}^{M}\beta_j X_{t,j} + \varepsilon_t \qquad (5.2)$$

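with $\varepsilon_t$ second order stationary and $E(\varepsilon_t) = 0$. Alternatively, if $Y_t = Y_{t,1} + ... + Y_{t,L}$, we may consider a system of $L$ regressions
$$Y_{t,1} = \beta_{01} + \sum_{j=1}^{M}\beta_{j1} X_{t,j} + \varepsilon_{t,1}$$
$$Y_{t,2} = \beta_{02} + \sum_{j=1}^{M}\beta_{j2} X_{t,j} + \varepsilon_{t,2}$$
$$\vdots$$
$$Y_{t,L} = \beta_{0L} + \sum_{j=1}^{M}\beta_{jL} X_{t,j} + \varepsilon_{t,L}.$$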

Three methods of hierarchical regression based on decompositions will be discussed here:
HIREG: hierarchical regression using explanatory variables obtained by kernel smoothing with predetermined fixed bandwidths;
HISMOOTH: hierarchical smoothing models with automatic bandwidth selection;
HIWAVE: hierarchical wavelet models.

5.2.2 Hierarchical regression

Given an explanatory time series $X_t$ ($t = 1, 2, ..., n$), a smoothing kernel $K$, and a hierarchy of bandwidths $b_1 > b_2 > ... > b_M > 0$, define
$$X_{t,1} = \frac{1}{nb_1}\sum_{s=1}^{n} K\Big(\frac{t-s}{nb_1}\Big) X_s \qquad (5.3)$$
and for $1 < j \le M$,
$$X_{t,j} = \frac{1}{nb_j}\sum_{s=1}^{n} K\Big(\frac{t-s}{nb_j}\Big)\Big[X_s - \sum_{l=1}^{j-1} X_{s,l}\Big] \qquad (5.4)$$

The collection of time series $\{X_{1,j}, ..., X_{n,j}\}$ ($j = 1, ..., M$) is called a hierarchical decomposition of $X_t$. The HIREG-model is then defined by (5.2). If $\varepsilon_t$ ($t = 1, 2, ...$) are independent, then usual techniques of multiple linear regression can be used (see e.g. Plackett 1960, Rao 1973, Ryan 1996, Srivastava and Sen 1997, Draper and Smith 1998). In case of correlated errors $\varepsilon_t$, appropriate adjustments of tests, confidence intervals, and parameter selection techniques must be made. The main assumption in the HIREG model is that we know which bandwidths to use. In some cases this may indeed be true. For instance, if there is a three-four meter at the beginning of a musical score, then bandwidths that are divisible by three are plausible.
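The hierarchical decomposition (5.3)-(5.4) followed by the regression (5.2) can be sketched as follows, assuming Python with NumPy. The kernel, the bandwidths, the simulated series and all function names are illustrative assumptions; ordinary least squares is used, which corresponds to the case of independent errors.

```python
import numpy as np


def kernel(u):
    """Epanechnikov-type kernel on [-1, 1] (an illustrative choice of K)."""
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)


def smooth(x, b):
    """Kernel smooth (1/(n b)) * sum_s K((t - s)/(n b)) x_s, as in (5.3)."""
    n = len(x)
    t = np.arange(1, n + 1)
    u = (t[:, None] - t[None, :]) / (n * b)
    return kernel(u) @ x / (n * b)


def hierarchical_decomposition(x, bandwidths):
    """Components X_{t,1}, ..., X_{t,M} of (5.3)-(5.4): each level smooths what the
    coarser levels have left unexplained. bandwidths must be decreasing."""
    residual = np.asarray(x, dtype=float).copy()
    components = []
    for b in bandwidths:
        comp = smooth(residual, b)
        components.append(comp)
        residual = residual - comp
    return np.column_stack(components)


def hireg_fit(y, x, bandwidths):
    """Least squares fit of the HIREG model (5.2); returns (beta_0, ..., beta_M)."""
    comps = hierarchical_decomposition(x, bandwidths)
    design = np.column_stack([np.ones(len(y)), comps])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta, comps


# Illustrative data: x is a random walk, y depends on two components of x without noise
rng = np.random.default_rng(3)
x = rng.standard_normal(200).cumsum()
bandwidths = [0.2, 0.05]
comps = hierarchical_decomposition(x, bandwidths)
y = 1.0 + 2.0 * comps[:, 0] - 0.5 * comps[:, 1]
beta, _ = hireg_fit(y, x, bandwidths)
```

In this noise-free example the fitted coefficients reproduce the values 1, 2 and -0.5 used to generate `y`. For correlated errors the least squares estimate can still be computed in this way, but standard errors and tests have to be adjusted as described in the text.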


5.2.3 Hierarchical smoothing

Beran and Mazzola (1999b) consider the case where the bandwidths $b_j$ are not known a priori. Essentially, this amounts to a nonlinear regression model $Y_t = \beta_0 + \sum_{j=1}^{M}\beta_j X_{t,j} + \varepsilon_t$ where not only $\beta_j$ ($j = 0, ..., M$) are unknown, but also $b_1, ..., b_M$, and possibly the order $M$, have to be estimated. The following definition formalizes the idea (for simplicity it is given for the case of one explanatory series $X_t$ only):

Definition 40 For integers $M, n > 0$, let $\beta = (\beta_1, ..., \beta_M) \in \mathbb{R}^M$, $b = (b_1, ..., b_M) \in \mathbb{R}^M$, $b_1 > b_2 > ... > b_M = 0$, $t_i \in [0, T]$, $0 < T < \infty$, $t_1 < t_2 < ... < t_n$, and $\theta = (\beta, b)^t$. Denote by $K : [0, 1] \to \mathbb{R}_+$ a non-negative symmetric kernel function such that $\int K(u)\,du = 1$, $K$ is twice continuously differentiable, and define for $b > 0$ and $t \in [0, T]$, the Nadaraya-Watson weights (Nadaraya 1964, Watson 1964)
$$a_b(t, t_i) = \frac{K\big(\frac{t - t_i}{b}\big)}{\sum_{j=1}^{n} K\big(\frac{t - t_j}{b}\big)} \qquad (5.5)$$

Also, let $\varepsilon_i$ ($i \in \mathbb{Z}$) be a stationary zero mean process satisfying suitable moment conditions, $f_\varepsilon$ the spectral density of $\varepsilon_i$, and assume $\varepsilon_i$ to be independent of $X_i$. Then the sequence of bivariate time series $\{(X_{1,n}, Y_{1,n}), ..., (X_{n,n}, Y_{n,n})\}$ ($n = 1, 2, 3, ...$) is a Hierarchical Smoothing Model (or HISMOOTH model), if
$$Y_{i,n} = Y(t_i) = \sum_{j=1}^{M}\beta_j g(t_i; b_j) + \varepsilon_i \qquad (5.6)$$
where $t_i = i/n$ and
$$g(t_i; b_j) = \sum_{l=1}^{n} a_{b_j}(t_i, t_l) X_{l,n} \qquad (5.7)$$

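Denote by $\theta^o = (\beta^o, b^o)^t$ the true parameter vector. Then $\theta^o$ can be estimated by a nonlinear least squares method as follows: define
$$e_i(\theta) = Y(t_i) - \sum_{j=1}^{M}\beta_j g(t_i; b_j) \qquad (5.8)$$
as a function of $\theta = (\beta, b)^t$, let $S(\theta) = \sum_{i=1}^{n} e_i^2(\theta)$ and $\dot g = \frac{\partial}{\partial b} g$. Then
$$\hat\theta = \arg\min_{\theta} S(\theta) \qquad (5.9)$$
or equivalently
$$\sum_{i=1}^{n}\psi(t_i, y; \hat\theta) = 0 \qquad (5.10)$$
where $\psi = (\psi_1, ..., \psi_{2M})^t$,
$$\psi_j(t_i, y; \theta) = e_i(\theta) g(t_i; b_j) \qquad (5.11)$$
for $j = 1, ..., M$, and
$$\psi_j(t_i, y; \theta) = e_i(\theta)\beta_{j-M}\,\dot g(t_i; b_{j-M}) \qquad (5.12)$$
for $j = M+1, ..., 2M$. Under suitable assumptions, the estimate $\hat\theta$ is asymptotically normal. More specifically, set
$$h_i(t; \theta^o) = g(t; b_i) \quad (i = 1, ..., M) \qquad (5.13)$$
$$h_i(t; \theta^o) = \beta_{i-M}\,\dot g(t; b_{i-M}) \quad (i = M+1, ..., 2M) \qquad (5.14)$$
$$\Sigma = [\gamma_\varepsilon(i - j)]_{i,j=1,...,n} = [\mathrm{cov}(\varepsilon_i, \varepsilon_j)]_{i,j=1,...,n} \qquad (5.15)$$
and define the $2M \times n$ matrix
$$G = G_{2M \times n} = [h_i(t_j; \theta^o)]_{i=1,...,2M;\,j=1,...,n} \qquad (5.16)$$
and the $2M \times 2M$ matrix
$$V_n = (GG^t)^{-1}(G\Sigma G^t)(GG^t)^{-1} \qquad (5.17)$$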

The following assumptions are sufficient to obtain asymptotic normality:

(A1) $f_\varepsilon(\lambda) \sim c_f |\lambda|^{-2d}$ ($c_f > 0$) as $\lambda \to 0$ with $-\frac{1}{2} < d < \frac{1}{2}$;

(A2) Let
$$a_r = n^{-1}\sum_{i,j=1}^{n}\gamma_\varepsilon(i - j) g(t_i; b_r) g(t_j; b_r),$$
$$b_r = n^{-1}\sum_{i,j=1}^{n}\gamma_\varepsilon(i - j)\dot g(t_i; b_r)\dot g(t_j; b_s).$$
Then, as $n \to \infty$, $\liminf |a_r| > 0$, and $\liminf |b_r| > 0$ for all $r, s \in \{1, ..., M\}$.

(A3) $x(t_i) = \xi(t_i)$ where $\xi : [0, T] \to \mathbb{R}$ is a function in $C[0, T]$, $T < \infty$.

(A4) The set of time points converges to a set $A$ that is dense in $[0, T]$.

Then we have (Beran and Mazzola 1999b):

Theorem 12 Let $\Theta_1$ and $\Theta_2$ be compact subsets of $\mathbb{R}$ and $\mathbb{R}_+$ respectively, $\Theta = \Theta_1^M \times \Theta_2^M$ and let $\eta = \frac{1}{2}\min\{1, 1 - 2d\}$. Suppose that (A1), (A2), (A3) and (A4) hold and $\theta^o$ is in the interior of $\Theta$. Then, as $n \to \infty$,
(i) $\hat\theta \to_p \theta^o$;
(ii) $V_n \to V$ where $V$ is a symmetric positive definite $2M \times 2M$ matrix;
(iii) $n^{\eta}(\hat\theta - \theta^o) \to_d N(0, V)$.


Thus, $\hat\theta$ is asymptotically normal, but for $d > 0$ (i.e. long-memory errors), the rate of convergence $n^{\frac{1}{2} - d}$ is slower than the usual $n^{\frac{1}{2}}$-rate. A particular aspect of HISMOOTH models is that the bandwidths $b_j$ are fixed positive unknown parameters that are estimated from the data. This means that, in contrast to nonparametric regression models (see e.g. Gasser and Müller 1979, Simonoff 1996, Bowman and Azzalini 1997, Eubank 1999), the notion of optimal bandwidth does not exist here. There is a fixed true bandwidth (or a vector of true bandwidths) that has to be estimated. A HISMOOTH model is in fact a semiparametric nonlinear regression rather than a nonparametric smoothing model. Theorem 12 can be interpreted as multiple linear regression where uncertainty due to (explanatory) variable selection is taken into account. The set of possible combinations of explanatory variables is parametrized by a continuous bandwidth-parameter vector $b \in \Theta_2^M$. Confidence intervals for $\beta$ based on the asymptotic distribution of $\hat\theta$ take into account additional uncertainty due to “variable selection” from the (infinite) parametric family of $M$ explanatory variables $X = \{(x_{b_1}, ..., x_{b_M}) : b_j \in \Theta_2, b_1 > b_2 > ... > b_M\}$. For the practical implementation of the model, the following algorithms that include estimation of $M$ are defined in Beran and Mazzola (1999b): if $M$ is fixed, then the algorithm consists of two basic steps: a) generation of the set of all possible explanatory variables $x_s$ ($s \in S$), and b) selection of $M$ variables (bandwidths) that maximize $R^2$. This means that after step a), the estimation problem is reduced to variable selection in multiple regression, with a fixed number $M$ of explanatory variables. Standard regression software, such as the function leaps in S-Plus, can be used for this purpose.
The detailed algorithm is as follows:

Algorithm 1 Define a sufficiently fine grid $S = \{s_1, ..., s_k\} \subset \Theta_2$ and carry out the following steps:
Step 1: Define $k$ explanatory time series $x_s = [x_s(t_1), ..., x_s(t_n)]^t$ ($s \in S$) by $x_s(t_i) = g(t_i, s)$.
Step 2: For each $b = (b_1, ..., b_M) \in S^M$, with $b_i > b_j$ ($i < j$) define the $n \times M$ matrix $X = (x_{b_1}, ..., x_{b_M})$ and let $\beta = \beta(b) = (X^t X)^{-1} X^t y$. Also, denote by $R^2(b)$ the corresponding value of $R^2$ obtained from least squares regression of $y$ on $X$.
Step 3: Define $\hat\theta = (\hat\beta, \hat b)^t$ by $\hat b = \arg\max_b R^2(b)$ and $\hat\beta = \beta(\hat b)$.
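The three steps of Algorithm 1 can be sketched as an exhaustive grid search, here in Python with NumPy instead of the S-Plus function mentioned above. The Gaussian kernel, the grid, the simulated data and the function names are illustrative assumptions. Since the total sum of squares of $y$ does not depend on $b$, maximizing $R^2$ is the same as minimizing the residual sum of squares.

```python
import itertools
import numpy as np


def nw_smooth(x, t, b):
    """Nadaraya-Watson smooth g(t_i; b) = sum_l a_b(t_i, t_l) x_l, see (5.5) and (5.7).
    A Gaussian kernel is used here as an illustrative choice."""
    u = (t[:, None] - t[None, :]) / b
    k = np.exp(-0.5 * u ** 2)
    return (k / k.sum(axis=1, keepdims=True)) @ x


def hismooth_grid_search(y, x, t, grid, M):
    """Algorithm 1 for fixed M: returns (R^2, bandwidth tuple, beta) of the best fit."""
    grid = sorted(grid, reverse=True)
    smooths = {s: nw_smooth(x, t, s) for s in grid}        # Step 1
    total_ss = np.sum((y - y.mean()) ** 2)
    best = None
    for b in itertools.combinations(grid, M):              # Step 2: b_1 > ... > b_M
        X = np.column_stack([smooths[s] for s in b])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)       # least squares for this b
        r2 = 1.0 - np.sum((y - X @ beta) ** 2) / total_ss
        if best is None or r2 > best[0]:
            best = (r2, b, beta)
    return best                                            # Step 3


# Illustrative data generated from a HISMOOTH model with bandwidths 0.2 and 0.02
rng = np.random.default_rng(4)
n = 150
t = np.arange(1, n + 1) / n
x = rng.standard_normal(n).cumsum()
y = (2.0 * nw_smooth(x, t, 0.2) - 1.0 * nw_smooth(x, t, 0.02)
     + 0.01 * rng.standard_normal(n))
grid = [0.4, 0.2, 0.1, 0.05, 0.02]
r2_hat, b_hat, beta_hat = hismooth_grid_search(y, x, t, grid, M=2)
```

The number of bandwidth combinations grows quickly with the grid size and with $M$, so for finer grids a dedicated best-subset routine is preferable to this exhaustive loop.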

If $M$ is unknown, then the algorithm can be modified, for instance by increasing $M$ as long as all $\beta$-coefficients are significant. In order to calculate the standard deviation of $\hat\beta$ at each stage, the error process $\varepsilon_i$ needs to be modeled explicitly. Beran and Mazzola (1999) use fractional autoregressive models together with the BIC for choosing the order of the process. This leads to

Algorithm 2 Define a sufficiently fine grid $S = \{s_1, ..., s_k\} \subset \Theta_2$ for the bandwidths, and calculate $k$ explanatory time series $x_s$ ($s \in S$) by $x_s(t_i) = g(t_i, s)$. Furthermore, define a significance level $\alpha$, set $M_o = 0$, and carry out the following steps:
Step 1: Set $M = M_o + 1$;
Step 2: For each $b = (b_1, ..., b_M) \in S^M$, with $b_i > b_j$ ($i < j$) define the $n \times M$ matrix $X = (x_{b_1}, ..., x_{b_M})$ and let $\beta = \beta(b) = (X^t X)^{-1} X^t y$. Also, denote by $R^2(b)$ the corresponding value of $R^2$ obtained from least squares regression of $y$ on $X$.
Step 3: Define $\hat\theta = (\hat b, \hat\beta)^t$ by $\hat b = \arg\max_b R^2(b)$ and $\hat\beta = \beta(\hat b)$.
Step 4: Let $e(\hat\theta) = [e_1, ..., e_n]^t$ be the vector of regression residuals. Assume that $e_i$ is a fractional autoregressive process of unknown order $p$ characterized by a parameter vector $\zeta = (\sigma_\varepsilon^2, d, \phi_1, ..., \phi_p)$. Estimate $p$ and $\zeta$ by maximum likelihood and the BIC.
Step 5: Calculate for each $j = 1, ..., M$, the estimated standard deviation $\sigma_j(\hat\zeta)$ of $\hat\beta_j$, and set
$$p_j = 2[1 - \Phi(|\hat\beta_j|\sigma_j^{-1}(\hat\zeta))]$$
where $\Phi$ denotes the cumulative standard normal distribution function. If $\max(p_j) < \alpha$, set $M_o = M_o + 1$ and repeat 1 through 5. Otherwise, stop the iteration and set $\hat M = M_o$ and $\hat\theta$ equal to the corresponding estimate.

5.2.4 Hierarchical wavelet models

Wavelet decomposition has become very popular in statistics and many fields of application in the last few years. This is due to the flexibility to depict local features at different levels of resolution. There is an extended literature on wavelets spanning a vast range between profound mathematical foundations and mathematical statistics to concrete applications such as data compression, image and sound processing, and data analysis, to name only a few. For references see for example Daubechies (1992), Meyer (1992, 1993), Kaiser (1994), Antoniadis and Oppenheim (1995), Ogden (1996), Mallat (1998), Härdle et al. (1998), Vidakovic (1999), Percival and Walden (2000), Jansen (2001), Jaffard et al. (2001).
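The essential principle of wavelets is to express square integrable functions in terms of orthogonal basis functions that are zero except in a small neighborhood, the neighborhoods being hierarchical in size. The set of basis functions $\Psi = \{\varphi_{ok}, k \in \mathbb{Z}\} \cup \{\psi_{jk}, j, k \in \mathbb{Z}\}$ is generated by two functions only, the father wavelet $\varphi$ and the mother wavelet $\psi$, by up/downscaling and shifting of the location. If scaling is done by powers of 2 and shifting by integers, then the basis functions are:
$$\varphi_{ok}(x) = \varphi_{oo}(x - k) = \varphi(x - k) \quad (k \in \mathbb{Z}) \qquad (5.18)$$
$$\psi_{jk}(x) = 2^{j/2}\psi_{oo}(2^j x - k) = 2^{j/2}\psi(2^j x - k) \quad (j \in \mathbb{N}, k \in \mathbb{Z}) \qquad (5.19)$$
With respect to the scalar product $\langle g, h \rangle = \int g(x)h(x)\,dx$, these basis functions are orthonormal:
$$\langle \varphi_{ok}, \varphi_{om} \rangle = 0 \ (k \ne m), \quad \langle \varphi_{ok}, \varphi_{ok} \rangle = \|\varphi_{ok}\|^2 = 1 \qquad (5.20)$$
$$\langle \psi_{jk}, \psi_{lm} \rangle = 0 \ (k \ne m \text{ or } j \ne l), \quad \langle \psi_{jk}, \psi_{jk} \rangle = \|\psi_{jk}\|^2 = 1 \qquad (5.21)$$
$$\langle \psi_{jk}, \varphi_{ol} \rangle = 0 \qquad (5.22)$$
Every function $g$ in $L^2(\mathbb{R})$ (the space of square integrable functions on $\mathbb{R}$) has a unique representation
$$g(x) = \sum_{k=-\infty}^{\infty} a_k \varphi_{ok}(x) + \sum_{j=0}^{\infty}\sum_{k=-\infty}^{\infty} b_{jk}\psi_{jk}(x) \qquad (5.23)$$
$$= \sum_{k=-\infty}^{\infty} a_k \varphi(x - k) + \sum_{j=0}^{\infty}\sum_{k=-\infty}^{\infty} b_{jk}\, 2^{j/2}\psi(2^j x - k) \qquad (5.24)$$
where
$$a_k = \langle g, \varphi_{ok} \rangle = \int g(x)\varphi_{ok}(x)\,dx \qquad (5.25)$$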

$$b_{jk} = \langle g, \psi_{jk} \rangle = \int g(x)\psi_{jk}(x)\,dx \qquad (5.26)$$
Note in particular that $\int g^2(x)\,dx = \sum a_k^2 + \sum b_{jk}^2$. The purpose of this representation is a decomposition with respect to frequency and time. A simple wavelet, where the meaning of the decomposition can be understood directly, is the Haar wavelet with
$$\varphi(x) = 1\{0 \le x < 1\} \qquad (5.27)$$
where $1\{0 \le x < 1\} = 1$ for $0 \le x < 1$ and zero otherwise, and
$$\psi(x) = 1\{0 \le x < \tfrac{1}{2}\} - 1\{\tfrac{1}{2} \le x < 1\}.$$
For the Haar basis functions $\varphi_k$, we have coefficients
$$a_k = \int_k^{k+1} g(x)\,dx$$

[...] $> 0$ implies that $n$ is a multiple of $\tau$. For an irreducible Markov chain, all states have the same period. Hence, the following definition is meaningful:

Definition 46 An irreducible Markov chain is called periodic if $\tau > 1$, and it is called aperiodic if $\tau = 1$.

It can be shown that for an aperiodic Markov chain, there is at most one stationary distribution and, if there is one, then the initial distribution does not play any role ultimately:

Theorem 15 If $X_t$ ($t = 0, 1, ...$) is an aperiodic irreducible Markov chain for which a stationary distribution $\pi$ exists, then the following holds:
(i) the Markov chain is persistent;
(ii) $\lim_{n \to \infty} p_{ij}^{(n)} = \pi_j > 0$ for all $i, j$;
(iii) the stationary distribution $\pi$ is unique.

In the other case of an aperiodic irreducible Markov chain for which no stationary distribution exists, we have
$$\lim_{n \to \infty} p_{ij}^{(n)} = 0$$
for all $i, j$. Note that this is even the case if the Markov chain is persistent. One then can classify irreducible aperiodic Markov chains into three classes:


Theorem 16 If $X_t$ (t = 0, 1, 2, ...) is an irreducible aperiodic Markov chain, then one of the following three possibilities is true:

(i) $X_t$ is transient,

$$\lim_{n\to\infty} p_{ij}^{(n)} = 0 \quad \text{and} \quad \sum_{n=1}^{\infty} p_{ij}^{(n)} < \infty$$

(ii) $X_t$ is persistent, but no stationary distribution $\pi$ exists,

$$\lim_{n\to\infty} p_{ij}^{(n)} = 0, \quad \sum_{n=1}^{\infty} p_{ij}^{(n)} = \infty \quad \text{and} \quad \mu_j = \sum_{n=1}^{\infty} n f_{jj}^{(n)} = \infty$$

(iii) $X_t$ is persistent, and a unique stationary distribution $\pi$ exists,

$$\lim_{n\to\infty} p_{ij}^{(n)} = \pi_j > 0$$

for all $i, j$, and the average number of steps till the process returns to state $j$ is given by $\mu_j = \pi_j^{-1}$.

For Markov chains with a finite state space, the results simplify further:

Theorem 17 If $X_t$ is an irreducible aperiodic Markov chain with a finite state space, then the following holds:
(i) $X_t$ is persistent;
(ii) a unique stationary distribution $\pi = (\pi_1, ..., \pi_m)^t$ exists and is the solution of

$$\pi^t (I - M) = 0, \quad (0 \le \pi_j \le 1,\ \sum_j \pi_j = 1) \qquad (6.11)$$

where $I$ is the $m \times m$ identity matrix.

Note that $\sum_j M_{ij} = \sum_j p_{ij} = 1$ so that $\sum_j (I - M)_{ij} = 0$, i.e. the matrix $(I - M)$ is singular. (If this were not the case, then the only solution to the system of linear equations would be 0 so that no stationary distribution would exist.) Thus, there are infinitely many solutions of the linear system in (6.11). However, there is only one solution that satisfies the conditions $0 \le \pi_j \le 1$ and $\sum_j \pi_j = 1$.
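The computation described in Theorem 17 can be sketched numerically as follows. This is only an illustration under stated assumptions: the function name `stationary_distribution` and the 2 x 2 example matrix are hypothetical, and the singular system is made solvable by appending the normalization $\sum_j \pi_j = 1$ as an extra equation and solving by least squares.

```python
import numpy as np

def stationary_distribution(M):
    """Solve pi^t (I - M) = 0 together with sum(pi) = 1.

    M is assumed to be the row-stochastic transition matrix of an
    irreducible aperiodic chain with finite state space.
    """
    M = np.asarray(M, dtype=float)
    m = M.shape[0]
    # (I - M)^t pi = 0 is a singular system; the additional row of ones
    # imposes the normalization that singles out the unique solution.
    A = np.vstack([(np.eye(m) - M).T, np.ones((1, m))])
    b = np.concatenate([np.zeros(m), [1.0]])
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

# Hypothetical two-state example (not taken from the text).
M_example = np.array([[0.9, 0.1],
                      [0.5, 0.5]])
pi_example = stationary_distribution(M_example)
```

For this toy matrix the balance equation $0.1\,\pi_1 = 0.5\,\pi_2$ together with $\pi_1 + \pi_2 = 1$ gives $\pi = (5/6, 1/6)$, which is what the function returns up to rounding.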


6.2.3 Hidden Markov models

A hidden Markov model is, as the name says, a model where an underlying Markov process is not directly observable. Instead, observations $X_t$ (t = 1, 2, ...) are generated by a series of probability distributions which in turn are controlled by an unobserved Markov chain. More specifically, the following definitions are used: let $\theta_t$ (t = 1, 2, ...) be a Markov chain with initial distribution $\pi$ so that $P(\theta_1 = j) = \pi_j$, and transition probabilities

$$p_{ij} = P(\theta_{t+1} = j \mid \theta_t = i). \qquad (6.12)$$

The state of the Markov chain determines the probability distribution of the observable random variables $X_t$ by

$$\psi_{ij} = P(X_t = j \mid \theta_t = i) \qquad (6.13)$$

In particular, if the state spaces of $\theta_t$ and $X_t$ are finite with dimensions $m_1$ and $m_2$ respectively, then the probability distribution of the process $X_t$ is determined by the $m_1$-dimensional vector $\pi$, the $m_1 \times m_1$-dimensional transition matrix $M = (p_{ij})_{i,j=1,...,m_1}$ and the $m_1 \times m_2$-dimensional matrix $\Psi = (\psi_{ij})_{i=1,...,m_1;\, j=1,...,m_2}$ that links $\theta_t$ with $X_t$. Analogous models can be defined for the case where $X_t$ ($t \in \mathbb{N}$) are continuous variables.

The flexibility of hidden Markov models is due to the fact that $X_t$ can be an arbitrary quantity with an arbitrary distribution that can change in time. For instance, $X_t$ itself can be equal to a time series $X_t = (Z_1, ..., Z_n) = (Z_1(t), ..., Z_n(t))$ whose distribution depends on $\theta_t$. Typically, such models are used in automatic speech processing (see e.g. Levinson et al. 1983, Juang and Rabiner 1991). The variable $\theta_t$ may represent the unobservable state of the vocal tract at time $t$, which in turn "produces" an observable acoustic signal $Z_1(t), ..., Z_n(t)$ generated by a distribution characterized by $\theta_t$. Given observations $X_t$ (t = 1, 2, ..., N), the aim is to guess which configurations $\theta_t$ (t = 1, 2, ..., N) the vocal tract was in. More specifically, it is sometimes assumed that there is only a finite number of possible acoustic signals. We may therefore denote by $X_t$ the label of the observed signal and estimate $\theta_t$ by maximizing the a posteriori probability $P(\theta_t = j \mid X_t = i)$. Using the Bayes rule, this leads to

$$\hat{\theta}_t = \arg\max_{j=1,...,m_1} P(\theta_t = j \mid X_t = i) = \arg\max_{j=1,...,m_1} \frac{P(X_t = i \mid \theta_t = j)\,P(\theta_t = j)}{\sum_{l=1}^{m_1} P(X_t = i \mid \theta_t = l)\,P(\theta_t = l)} \qquad (6.14)$$
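As a small illustration of (6.14), the following sketch computes the a posteriori probabilities for a single observed label and returns the maximizing state. The function name `map_state` and the numerical values of the prior and of $\Psi$ are hypothetical; the array `psi[j, i]` is assumed to hold $P(X_t = i \mid \theta_t = j)$, with states and labels numbered from 0.

```python
import numpy as np

def map_state(prior, psi, i):
    """Return the state j maximizing P(theta_t = j | X_t = i), and the posterior.

    prior[j]  = P(theta_t = j)
    psi[j, i] = P(X_t = i | theta_t = j)
    """
    prior = np.asarray(prior, dtype=float)
    psi = np.asarray(psi, dtype=float)
    joint = psi[:, i] * prior          # numerator of (6.14) for every j
    posterior = joint / joint.sum()    # denominator of (6.14) is the sum over l
    return int(np.argmax(posterior)), posterior

# Hypothetical example with two hidden states and two signal labels.
prior = np.array([0.7, 0.3])
psi = np.array([[0.9, 0.1],
                [0.2, 0.8]])
state, post = map_state(prior, psi, 1)
```

Since the denominator in (6.14) does not depend on $j$, the maximizer could equally be found from the numerator alone; the normalization is kept here only so that the posterior probabilities themselves are available.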

6.2.4 Parameter estimation for Markov and hidden Markov models

In principle, parameter estimation for Markov chains and hidden Markov models is simple, since the likelihood function can be written down explicitly in terms of simple conditional probabilities. The main difficulties that can occur are:

1. Large number of unknown parameters: the unknown parameters for a Markov chain are the initial distribution $\pi$ and the transition matrix $M = (p_{ij})_{i,j=1,...,m}$. If $m$ is finite, then the number of unknown parameters is $(m-1) + m(m-1)$. If the initial distribution does not matter, then this reduces to $m(m-1)$. Both numbers can be quite large compared to the available sample size, since they increase quadratically in $m$. The situation is even worse if the state space is infinite, since then the number of unknown parameters is infinite. A solution to this problem is to impose restrictions on the parameters or to define parsimonious models where $M$ is characterized by a low-dimensional parameter vector.

2. Implicit solution: The maximum likelihood estimate of the unknown parameters is the solution of a system of nonlinear equations, and therefore must be found by a suitable numerical algorithm. For real time applications with massive data input, as they typically occur in speech processing or processing of musical sound signals, fast algorithms are required.

3. Asymptotic distribution: The asymptotic distribution of maximum likelihood estimates is not always easy to derive.

6.3 Specific applications in music

6.3.1 Stationary distribution of intervals modulo 12

We consider intervals between successive notes modulo octave for the upper envelopes of the following compositions:

• Anonymous: a) Saltarello (13th century); b) Saltarello (14th century); c) Alle Psallite (13th century); d) Troto (13th century)
• A. de la Halle (1235?-1287): Or est Bayard en la pature, hure!
• J. de Ockeghem (1425-1495): Canon epidiatesseron
• J. Arcadelt (1505-1568): a) Ave Maria, b) La Ingratitud, c) Io Dico Fra Noi
• W. Byrd (1543-1623): a) Ave Verum Corpus, b) Alman, c) The Queen's Alman
• J. Dowland (1562-1626): a) Come Again, b) The Frog Galliard, c) The King of Denmark's Galliard
• H.L. Hassler (1564-1612): a) Galliard, b) Kyrie from "Missa secunda", c) Sanctus et Benedictus from "Missa secunda"
• G.P. Palestrina (1525-1594): a) Jesu Rex admirabilis, b) O bone Jesu, c) Pueri Hebraeorum
• J.P. Rameau (1683-1764): a) La Popliniere, b) Tambourin, c) La Triomphante (Figure 6.1)
• J.F. Couperin (1668-1733): a) Barriquades mysterieuses, b) La Linotte Effarouchée, c) Les Moissonneurs, d) Les Papillons
• J.S. Bach (1685-1750): Das Wohltemperierte Klavier; Cello-Suites I to VI (1st Movements)
• D. Scarlatti (1660-1725): a) Sonata K 222, b) Sonata K 345, c) Sonata K 381
• J. Haydn (1732-1809): Sonata op. 34, No. 2
• W.A. Mozart (1756-1791): a) Sonata KV 332, 2nd Mov., b) Sonata KV 545, 2nd Mov., c) Sonata KV 333, 2nd Mov.
• F. Chopin (1810-1849): a) Nocturne op. 9, No. 2, b) Nocturne op. 32, No. 1, c) Etude op. 10, No. 6 (Figure 6.2)
• R. Schumann (1810-1856): Kinderszenen op. 15
• J. Brahms (1833-1897): a) Hungarian dances No. 1, 2, 3, 6, 7, b) Intermezzo op. 117, No. 1 (Figures 6.12, 9.7, 11.5)
• C. Debussy (1862-1918): a) Clair de lune, b) Arabesque No. 1, c) Reflets dans l'eau
• A. Scriabin (1872-1915): Preludes a) op. 2, No. 2, b) op. 11, No. 14, c) op. 13, No. 2
• S. Rachmaninoff (1873-1943): a) Prelude op. 3, No. 2, b) Preludes op. 23, No. 3, 5, 9
• B. Bartók (1881-1945): a) Bagatelle op. 11, No. 2, b) Bagatelle op. 11, No. 3, c) Sonata for piano
• O. Messiaen (1908-1992): Vingt regards sur l'enfant-Jésus, No. 3
• S. Prokoffieff (1891-1953): Visions fugitives a) No. 11, b) No. 12, c) No. 13
• A. Schönberg (1874-1951): Piano piece op. 19, No. 2
• T. Takemitsu (1930-1996): Rain tree sketch No. 1
• A. Webern (1883-1945): Orchesterstück op. 6, No. 6

Since we are not interested in note repetitions, zero is excluded, i.e. the state space of $X_t$ consists of the numbers 1,...,11. For the sake of simplicity, $X_t$ is assumed to be a Markov chain. This is, of course, not really true; nevertheless, an "approximation" by a Markov chain may reveal certain characteristics of the composition. The elements of the transition matrix $M = (p_{ij})_{i,j=1,...,11}$ are estimated by relative frequencies

$$\hat{p}_{ij} = \frac{\sum_{t=2}^{n} 1\{x_{t-1} = i,\ x_t = j\}}{\sum_{t=1}^{n-1} 1\{x_t = i\}}, \qquad (6.15)$$


Figure 6.1 Jean-Philippe Rameau (1683-1764). (Engraving by A. St. Aubin after J. J. Cafferi, Paris after 1764; courtesy of Zentralbibliothek Zürich.)

and the stationary distribution $\pi$ of the Markov chain with transition matrix $\hat{M} = (\hat{p}_{ij})_{i,j=1,...,11}$ is estimated by solving the system of linear equations $\pi^t (I - \hat{M}) = 0$ as described above. Figures 6.3a through l show the resulting values of $\hat{\pi}_j$ (joined by lines). For each composition, the vector $\hat{\pi}_j$ is plotted against $j$. For visual clarity, points at neighboring states $j$ and $j-1$ are connected. The figures illustrate how the characteristic shape of $\pi$ changed in the course of the last 500 years. The most dramatic change occurred in the 20th century with a "flattening" of the peaks. Starting with Scriabin, a pioneer of atonal music though still rooted in the romantic style of the late 19th century, this is most extreme for the compositions by Schönberg, Webern, Takemitsu, and Messiaen. On the other hand, Prokoffieff's "Visions fugitives" exhibit clear peaks but at varying locations.

The estimated stationary distributions can also be used to perform a cluster analysis. Figure 6.4 shows the result of the single linkage algorithm with the Manhattan norm (see Chapter 10). To make names legible, only a subsample of the data was used. An almost perfect separation between Bach and composers from the classical and romantic period can be seen.
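The relative-frequency estimate (6.15) can be sketched as follows. The helper names `intervals_mod12` and `estimate_transition_matrix` and the short pitch sequence are hypothetical illustrations, not the data used for the figures; pitches are assumed to be given as integer semitone numbers.

```python
import numpy as np

def intervals_mod12(pitches):
    """Intervals between successive notes modulo 12, with zeros (repetitions) removed."""
    d = np.diff(np.asarray(pitches)) % 12
    return [int(v) for v in d if v != 0]

def estimate_transition_matrix(x, states):
    """Relative-frequency estimate of p_ij as in (6.15)."""
    index = {s: a for a, s in enumerate(states)}
    counts = np.zeros((len(states), len(states)))
    for prev, cur in zip(x[:-1], x[1:]):
        counts[index[prev], index[cur]] += 1
    # Row sums equal the number of times state i occurs among x_1, ..., x_{n-1}.
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows of states that never occur are left as zeros (an assumption of this sketch).
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

# Hypothetical toy melody (semitone numbers), not from the text.
pitches = [60, 62, 64, 62, 60, 67, 65]
x = intervals_mod12(pitches)                      # [2, 2, 10, 10, 7, 10]
P_hat = estimate_transition_matrix(x, list(range(1, 12)))
```

For short pieces many of the 11 states may never be visited, so the estimated matrix need not be irreducible; in that case the linear system for the stationary distribution no longer has a unique normalized solution, and some restriction or smoothing would be needed.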


Figure 6.2 Frédéric Chopin (1810-1849). (Courtesy of Zentralbibliothek Zürich.)

6.3.2 Stationary distribution of interval torus values

An analogous analysis can be carried out replacing the interval numbers by the corresponding values of the torus distance (see Chapter 1). Excluding zeroes, the state space consists of the three numbers 1, 2, 3 only. For the same compositions as above, the stationary probabilities $\hat{\pi}_j$ (j = 1, 2, 3) are calculated. A cluster analysis as above, but with the new probabilities, yields practically the same result as before (Figure 6.5). Since the state space contains three elements only, it is now even easier to find the patterns that determine clustering. In particular, log-odds-ratios $\log(\hat{\pi}_i/\hat{\pi}_j)$ ($i \neq j$) appear to be characteristic. Boxplots are shown in Figures 6.6a, 6.7a and 6.8a for categories of composers defined by date of birth as follows: a) before 1600 ("early music"); b) [1600,1720) ("baroque"); c) [1720,1800) ("classic"); d) [1800,1880) ("romantic and early 20th century") (Figure 6.12); e) 1880 and later ("20th century"). This is a simple, though somewhat arbitrary, division with some inaccuracies; for instance, Schönberg is classified in category d) instead of e). The log-odds-ratio between $\hat{\pi}_1$ and $\hat{\pi}_2$ is highest in the "classical" period and generally tends to decrease afterwards. Moreover, there is a distinct jump from the baroque to the classical period. This jump is also visible for $\log(\hat{\pi}_1/\hat{\pi}_3)$. Here, however, the attained level is kept in the subsequent time periods. For $\log(\hat{\pi}_2/\hat{\pi}_3)$ a gradual increase


Figure 6.3 Stationary distributions $\hat{\pi}_j$ (j = 1, ..., 11) of Markov chains with state space $\mathbb{Z}_{12} \setminus \{0\}$, estimated for the transition between successive intervals.


Figure 6.4 Cluster analysis based on stationary Markov chain distributions for compositions by Bach, Mozart, Haydn, Chopin, Schumann, Brahms, and Rachmaninoff. (Plot title: Clusters based on stationary distribution.)

can be observed. The differences are even more visible when comparing individual composers. This is illustrated in Figures 6.9a and b where Bach's and Schumann's $\log(\hat{\pi}_1/\hat{\pi}_3)$ and $\log(\hat{\pi}_2/\hat{\pi}_3)$ are compared, and in Figures 6.10a through f where the median and lower and upper quartiles of $\hat{\pi}_j$ are plotted against $j$. Finally, Figure 6.11 shows the plots of $\log(\hat{\pi}_1/\hat{\pi}_3)$ and $\log(\hat{\pi}_2/\hat{\pi}_3)$ against the date of birth.

6.3.3 Classification by hidden Markov models

Chai and Vercoe (2001) study classification of folk songs using hidden Markov models. They consider, essentially, four ways of representing a melody, namely by a) a vector of pitches modulo 12; b) a vector of pitches modulo 12 together with duration (duration being represented by repeating the same pitch); c) a sequence of intervals (differenced series of pitches); and d) a sequence of intervals, with intervals being classified into only five interval classes {0}, {−1, −2}, {1, 2}, {x ≤ −3} and {x ≥ 3}. The observed data consist of 187 Irish, 200 German, and 104 Austrian homophonic melodies from folk songs. For each melody representation, the authors estimate the parameters of several hidden Markov models which differ mainly with respect to the size of the hidden state space. The models are fitted for each


Figure 6.5 Cluster analysis based on stationary Markov chain distributions of torus distances for compositions by Bach, Mozart, Haydn, Chopin, Schumann, Brahms, and Rachmaninoff. (Plot title: Clusters based on stationary distribution of torus distances.)

country separately. Only 70% of the data are used for estimation. The remaining 30% are used for validation of a classification rule defined as follows: a melody is assigned to country j, if the corresponding likelihood (calculated using the country’s hidden Markov model) is the largest. Not surprisingly, the authors conclude that the most reliable distinction can be made between Irish and non-Irish songs.
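The classification rule just described (assign a melody to the model with the largest likelihood) can be sketched with the standard scaled forward recursion for discrete hidden Markov models. Everything specific below is a hypothetical illustration: the function names, the two toy models labeled "A" and "B", and their parameter values are assumptions and are not the models fitted by Chai and Vercoe.

```python
import numpy as np

def hmm_log_likelihood(obs, pi, M, Psi):
    """Log-likelihood of a label sequence under a discrete hidden Markov model.

    pi[i]     = P(theta_1 = i)
    M[i, j]   = P(theta_{t+1} = j | theta_t = i)
    Psi[i, k] = P(X_t = k | theta_t = i)
    Uses the forward recursion with rescaling at every step to avoid underflow.
    """
    alpha = pi * Psi[:, obs[0]]
    loglik = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = (alpha @ M) * Psi[:, obs[t]]
        c = alpha.sum()
        loglik += np.log(c)
        alpha = alpha / c
    return loglik

def classify(obs, models):
    """Assign obs to the model with the largest log-likelihood."""
    scores = {name: hmm_log_likelihood(obs, *params) for name, params in models.items()}
    return max(scores, key=scores.get), scores

# Two hypothetical toy models with two hidden states and two labels each.
models = {
    "A": (np.array([0.5, 0.5]),
          np.array([[0.9, 0.1], [0.1, 0.9]]),
          np.array([[0.8, 0.2], [0.2, 0.8]])),
    "B": (np.array([0.5, 0.5]),
          np.array([[0.5, 0.5], [0.5, 0.5]]),
          np.array([[0.5, 0.5], [0.5, 0.5]])),
}
label, scores = classify([0, 0, 0, 0, 1, 1, 1, 1], models)
```

In this toy example model "B" assigns probability $0.5^8$ to every sequence of length 8, whereas the "sticky" model "A" favors long runs of the same label, so the sequence is assigned to "A".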


Figure 6.6 Comparison of log odds ratios $\log(\hat{\pi}_1/\hat{\pi}_2)$ of stationary Markov chain distributions of torus distances. Panels: a) log(pi(1)/pi(2)) for five different periods (birth before 1600; 1600-1720; 1720-1800; 1800-1880; from 1880); b) log(pi(1)/pi(2)) for 'classic' vs. 'not classic' (birth 1720-1800; birth before 1720 or 1800 and later).

[Figure: log(pi(1)/pi(3)) by period of birth (before 1600; 1600-1720; 1720-1800; 1800-1880; from 1880); panel b) log(pi(1)/pi(3)) for 'up to baroque' vs. 'after baroque' (birth before 1720; birth 1720 and later).]

[Figure: log(pi(2)/pi(3)) by period of birth (before 1600; 1600-1720; 1720-1800; 1800-1880; from 1880); panel b) log(pi(2)/pi(3)) for 'up to baroque' vs. 'after baroque' (birth before 1720; birth 1720 and later).]

Figure 6.9 Comparison of log odds ratios $\log(\hat{\pi}_1/\hat{\pi}_3)$ and $\log(\hat{\pi}_2/\hat{\pi}_3)$ of stationary Markov chain distributions of torus distances. Panels: a) log(pi(1)/pi(3)) for Bach and Schumann; b) log(pi(2)/pi(3)) for Bach and Schumann.


Figure 6.10 Comparison of stationary Markov chain distributions of torus distances.

Figure 6.11 Log odds ratios $\log(\hat{\pi}_1/\hat{\pi}_3)$ and $\log(\hat{\pi}_2/\hat{\pi}_3)$ plotted against date of birth of composer. Panels: a) log(pi(1)/pi(3)) plotted against date of birth; b) log(pi(2)/pi(3)) plotted against date of birth (horizontal axis: year).


Figure 6.12 Johannes Brahms (1833-1897). (Courtesy of Zentralbibliothek Zürich.)

6.3.4 Reconstructing scores from acoustic signals

One of the ultimate dreams of musical signal recognition is to reconstruct a musical score from the acoustic signal of a musical performance. This is a highly complex task that has not yet been solved in a satisfactory manner. Consider, for instance, the problem of polyphonic pitch tracking defined as follows: given a musical audio signal, identify the pitches of the music. This problem is not easy for at least two reasons: a) different instruments have different harmonics and a different change of the spectrum; and b) in polyphonic music, one must be able to distinguish different voices (pitches) that are played simultaneously by the same or different instruments. An approach based on a rather complex hierarchical model is proposed for instance in Walmsley, Godsill, and Rayner (1999). Suppose that a maximal number $N$ of notes can be played simultaneously and denote by $\nu = (\nu_1, ..., \nu_N)^t$ the vector of 0-1-variables indicating whether note $j$ (j = 1, ..., N) is played or not. Each note $j$ is associated with a harmonic representation (see Chapter 4) with fundamental frequency j and amplitudes $b_1(j), ..., b_k(j)$ ($k$ = number of harmonics). Time is divided into


disjoint time intervals, so-called frames. In each frame $i$ of length $m_i$, the sound signal is assumed to be equal to $y_i(t) = \mu_i(t) + e_i(t)$, where $\mu_i(t)$ ($t = 1, ..., m_i$) is the sum of the harmonic representations of the notes and $e_i(t)$ is random noise. Walmsley et al. assume $e_i$ to be iid (independent identically distributed) normal with zero mean and variance $\sigma_i^2$. Taking everything together, the probability distribution of the acoustic signal is fully specified by a finite dimensional parameter vector $\theta$. In principle, given an observed signal, $\theta$ could be estimated by maximizing the likelihood (see Chapter 4). The difficulty is, however, that the dimension of $\theta$ is very high compared to the number of observations. The solution proposed by Walmsley et al. is to circumvent this problem by a Bayesian approach, in that $\theta$ is assumed to be generated by an a priori distribution. Given the data, consisting of a sound signal and an a priori distribution $p(\theta)$, the a posteriori distribution $p(\theta \mid y_i)$ of $\theta$ is given by

$$p(\theta \mid y_i) = \frac{f(y_i \mid \theta)\,p(\theta)}{\int f(y_i \mid \tilde{\theta})\,p(\tilde{\theta})\,d\tilde{\theta}} \qquad (6.16)$$

where

$$f(y_i \mid \theta) = (2\pi\sigma_i)^{-m_i/2} \exp\Big(-\sum_{t=1}^{m_i} e_i^2(t)/\sigma_i^2\Big)$$

and $e_i(t) = e_i(t; \theta)$. How many notes and which pitches are played can then be decided, for instance, by searching for the mode of the distribution. Even if this model is assumed to be realistic, a major practical difficulty remains: the dimension of $\theta$ can be several hundred. The computation of the a posteriori distribution is therefore very difficult since calculation of $\int f(y_i \mid \tilde{\theta})\,p(\tilde{\theta})\,d\tilde{\theta}$ involves high-dimensional numerical integration. A further complication is that some of the parameters may be highly correlated. Walmsley et al. therefore propose to use Markov Chain Monte Carlo methods (see e.g. Gilks et al. 1996). The essential idea is to simulate the integral by a sample mean of $f(y_i \mid \theta)$ where $\theta$ is sampled randomly from the a priori distribution $p(\theta)$. Sampling can be done by using a Markov process whose stationary distribution is $p$. The simulation can be simplified further by the so-called Gibbs sampler which uses suitable one-dimensional conditional distributions (Besag 1989).

A more modest task than polyphonic pitch tracking is automatic segmentation of monophonic music. The task is as follows: given a monophonic musical score and a sampled acoustic signal of a performance of the score, identify for each note and rest in the score the corresponding time interval in the performance. A possible approach based on hidden Markov processes and Bayesian models is proposed in Raphael (1999) (also see Raphael 2001a,b). Raphael, who is a professional oboist and a mathematical statistician, also implemented his method in a computer system, called Music Plus One, that performs the role of a musical accompanist.


CHAPTER 7

Circular statistics

7.1 Musical motivation

Many phenomena in music are circular. The best known examples are repeated rhythmic patterns, the circles of fourths and fifths, and scales modulo octave in the well-tempered system. In the circle of fourths, for example, one progresses by steps of a fourth and arrives, after 12 steps, at the initial starting point modulo octave. It is not immediately clear whether and how to "calculate" in such situations, and what type of statistical procedures may be used. The theory of circular statistics has been developed to analyze data on circles where angles have a meaning. Originally, this was motivated by data in biology (e.g. direction of bird flight), meteorology (e.g. direction of wind), and geology (e.g. magnetic fields). Here we give a very brief introduction, mostly to descriptive statistics. For an extended account of methods and applications of circular statistics see, for instance, Mardia (1972), Batschelet (1981), Watson (1983), Fisher (1993), and Jammalamadaka and SenGupta (2001). In music, circular methods can be applied to situations where angles measure a meaningful distance between points on the circle and arithmetic operations in the sense of circular data are well defined.

7.2 Basic principles

7.2.1 Some descriptive statistics

Circular data are observations on a circle. In other words, observations consist of directions expressed in terms of angles. The first question is which statistics describe the data in a meaningful way or, at an even more basic level, how to calculate at all when "moving" on a circle. The difficulty can be seen easily by trying to determine the "average direction". Suppose we observe two angles $\varphi_1 = 330°$ and $\varphi_2 = 10°$. It is plausible to say that the average direction is 350°. However, the arithmetic average is $(330° + 10°)/2 = 170°$, which is almost the opposite direction. Calculating the sample mean of angles is obviously not meaningful.

The simple solution is to interpret angular observations as vectors in the plane, with end points on the unit circle, and to apply vector addition instead of adding angles. Thus, we replace $\varphi_i$ (i = 1, ..., n) by $x_i = (\cos\varphi_i, \sin\varphi_i)^t$ where $\varphi$ is measured anti-clockwise relative to the horizontal axis. The following descriptive statistics can then be defined.

Definition 47 Let

$$C = \sum_{i=1}^{n} \cos\varphi_i, \quad S = \sum_{i=1}^{n} \sin\varphi_i, \quad R = \sqrt{C^2 + S^2}. \qquad (7.1)$$

The (vector of the) mean direction of $\varphi_i$ (i = 1, ..., n) is equal to

$$\bar{x} = \begin{pmatrix} \cos\bar{\varphi} \\ \sin\bar{\varphi} \end{pmatrix} = \begin{pmatrix} C/R \\ S/R \end{pmatrix} \qquad (7.2)$$

Equivalently one may use the following

Definition 48 The (angle of the) mean direction of $\varphi_i$ (i = 1, ..., n) is equal to

$$\bar{\varphi} = \arctan\frac{S}{C} + \pi 1\{C < 0\} + 2\pi 1\{C > 0, S < 0\} \qquad (7.3)$$

Moreover, we have

Definition 49 The mean resultant length of $\varphi_i$ (i = 1, ..., n) is equal to

$$\bar{R} = \frac{R}{n} \qquad (7.4)$$

Note that $R$ is the length of the vector $n\bar{x}$ obtained by adding all observed vectors. If all angles are identical, then $R = n$ so that $\bar{R} = 1$. In all other cases, we have $0 \le \bar{R} < 1$. In the other extreme case with $\varphi_i = 2\pi i/n$ (i.e. the angles are scattered uniformly over $[0, 2\pi]$, there are no clusters of directions), we have $\bar{R} = 0$. In this sense, $\bar{R}$ measures the amount of concentration around the mean direction. This leads to

Definition 50 The sample circular variance of $\varphi_i$ (i = 1, ..., n) is equal to

$$V = 1 - \bar{R} \qquad (7.5)$$
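The example with 330° and 10° can be checked numerically with a short sketch of Definitions 47 to 50. The function name `circular_summary` is hypothetical, and `numpy.arctan2` (reduced modulo $2\pi$) is used here in place of the case distinction in (7.3); the two agree whenever $C \neq 0$.

```python
import numpy as np

def circular_summary(phi):
    """Mean direction, mean resultant length and circular variance of angles in radians."""
    phi = np.asarray(phi, dtype=float)
    C = np.cos(phi).sum()
    S = np.sin(phi).sum()
    R = np.hypot(C, S)                           # R = sqrt(C^2 + S^2), see (7.1)
    mean_dir = np.arctan2(S, C) % (2 * np.pi)    # angle of the mean direction in [0, 2*pi)
    R_bar = R / len(phi)                         # mean resultant length (7.4)
    V = 1.0 - R_bar                              # sample circular variance (7.5)
    return mean_dir, R_bar, V

mean_dir, R_bar, V = circular_summary(np.deg2rad([330.0, 10.0]))
mean_dir_deg = np.rad2deg(mean_dir)
```

For these two angles the mean direction comes out as 350° and $\bar{R} = \cos 20°$, in line with the geometric intuition. If $R = 0$ the mean direction is not defined, and the value returned by the sketch is then meaningless.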

Note, however, that $\bar{R}$ is not a perfect measure of concentration, since $\bar{R} = 0$ does not necessarily imply that the data are scattered uniformly. For instance, suppose $n$ is even, $\varphi_{2i+1} = \pi$ and $\varphi_{2i} = 0$. Thus there are two preferred directions. Nevertheless, $\bar{R} = 0$.

Alternative measures of center and variability respectively are the median and the difference between the lower and upper quartile. The median direction is a direction $M_n = \varphi^o$ determined as follows: a) find the axis (straight line through zero) such that the data are divided into two groups of equal size (if $n$ is odd, then the axis passes through at least one point, otherwise through the midpoint between the two observations in the middle); b) take the direction $\phi$ on the chosen axis for which more points $x_i$ are closer to the point $(\cos\phi, \sin\phi)^t$ defined by $\phi$. Similarly, the lower and upper quartiles $Q_1$, $Q_2$ can be defined by dividing each of the halves into two halves again. An alternative measure of variability is then given by $IQR = Q_2 - Q_1$.

Since we are dealing with vectors in the two-dimensional plane, all quantities above can be expressed in terms of complex numbers. In particular, one can define trigonometric moments by

Definition 51 For p = 1, 2, ... let

$$C_p = \sum_{i=1}^{n} \cos p\varphi_i, \quad S_p = \sum_{i=1}^{n} \sin p\varphi_i, \quad R_p = \sqrt{C_p^2 + S_p^2} \qquad (7.6)$$

$$\bar{C}_p = \frac{C_p}{n}, \quad \bar{S}_p = \frac{S_p}{n}, \quad \bar{R}_p = \frac{R_p}{n} \qquad (7.7)$$

and

$$\bar{\varphi}(p) = \arctan\frac{S_p}{C_p} + \pi 1\{C_p < 0\} + 2\pi 1\{C_p > 0, S_p < 0\} \qquad (7.8)$$

Then

$$m_p = \bar{C}_p + i\bar{S}_p = \bar{R}_p e^{i\bar{\varphi}(p)} \qquad (7.9)$$

is called the pth trigonometric sample moment.

For p = 1, this definition yields

$$m_1 = \bar{C}_1 + i\bar{S}_1 = \bar{R}_1 e^{i\bar{\varphi}(1)}$$

with $\bar{C}_1 = \bar{C}$, $\bar{S}_1 = \bar{S}$, $\bar{R}_1 = \bar{R}$ and $\bar{\varphi}(1) = \bar{\varphi}$ as before. Similarly, we have

Definition 52 Let

$$C_p^o = \sum_{i=1}^{n} \cos p(\varphi_i - \bar{\varphi}(1)), \quad S_p^o = \sum_{i=1}^{n} \sin p(\varphi_i - \bar{\varphi}(1)) \qquad (7.10)$$

$$\bar{C}_p^o = \frac{C_p^o}{n}, \quad \bar{S}_p^o = \frac{S_p^o}{n} \qquad (7.11)$$

$$\bar{\varphi}^o(p) = \arctan\frac{S_p^o}{C_p^o} + \pi 1\{C_p^o < 0\} + 2\pi 1\{C_p^o > 0, S_p^o < 0\} \qquad (7.12)$$

Then

$$m_p^o = \bar{C}_p^o + i\bar{S}_p^o = \bar{R}_p e^{i\bar{\varphi}^o(p)} \qquad (7.13)$$

is called the pth centered trigonometric (sample) moment $m_p^o$, centered relative to the mean direction $\bar{\varphi}(1)$.

Note, in particular, that $\sum_{i=1}^{n} \sin(\varphi_i - \bar{\varphi}(1)) = 0$ so that $m_1^o = \bar{R}_1$. An overview of descriptive measures of center and variability is given in Table 7.1.
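Because $m_p = \bar{C}_p + i\bar{S}_p$ is simply the sample mean of $e^{ip\varphi_i}$, the trigonometric moments are easy to compute with complex arithmetic. The following sketch (with the hypothetical function name `trig_moment`) illustrates this and checks the remark that the first centered moment is real and equal to $\bar{R}_1$.

```python
import numpy as np

def trig_moment(phi, p, centered=False):
    """p-th trigonometric sample moment; optionally centered at the mean direction."""
    phi = np.asarray(phi, dtype=float)
    if centered:
        # Angle of the first moment m_1 is the mean direction phi_bar(1).
        phi = phi - np.angle(np.mean(np.exp(1j * phi)))
    # mean(cos(p*phi)) + i*mean(sin(p*phi)) written as one complex mean
    return np.mean(np.exp(1j * p * phi))

phi = np.deg2rad([330.0, 10.0])
m1 = trig_moment(phi, 1)
m1_centered = trig_moment(phi, 1, centered=True)
```

The modulus of `m1` is the mean resultant length $\bar{R}_1$, and `m1_centered` has (numerically) zero imaginary part, since centering only rotates all angles by $\bar{\varphi}(1)$.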


Table 7.1 Some important descriptive statistics for circular data

Name | Definition | Feature measured
-----|------------|-----------------
Sample mean | $\bar{x} = (C/R, S/R)^t$ with $R = \sqrt{C^2 + S^2}$ | Center (direction)
Mean resultant length | $\bar{R} = R/n$ | Concentration
Mean direction | $\bar{\varphi} = \arctan(S/C) + \pi 1\{C < 0\} + 2\pi 1\{C > 0, S < 0\}$ | Center (angle)
Median direction | $M_n = g(\phi)$ where $g(\phi) = \sum_{i=1}^{n} \lvert \pi - \lvert \varphi_i - \phi \rvert \rvert$ | Center (angle)
Quartiles $Q_1$, $Q_2$ | $Q_1$ = median of $\{\varphi_i : M_n - \pi \le \varphi_i \le M_n\}$; $Q_2$ = median of $\{\varphi_i : M_n \le \varphi_i \le M_n + \pi\}$ | Center of "left" and "right" half
Modal direction | $\tilde{M}_n = \arg\max \hat{f}(\varphi)$ where $\hat{f}(\varphi)$ = estimate of density $f$ | Center (angle)
Principal direction | $a$ = first eigenvector of $S = \sum_{i=1}^{n} x_i x_i^t$ | Center (direction, unit vector)
Concentration | $\hat{\lambda}_1$ = first eigenvalue of $S$ | Variability
Circular variance | $V_n = 1 - \bar{R}$ | Variability
Circular stand. dev. | $s_n = \sqrt{-2\log(1 - V)}$ | Variability
Circular dispersion | $d_n = (1 - \sqrt{\bar{C}_2^2 + \bar{S}_2^2})/(2\bar{R}^2)$ | Variability
Mean deviation | $D_n = \pi - \frac{1}{n}\sum_{i=1}^{n} \lvert \pi - \lvert \varphi_i - M_n \rvert \rvert$ | Variability
Interquartile range | $IQR = Q_2 - Q_1$ | Variability

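7.2.2 Correlation and autocorrelation

A model for perfect "linear" association between two circular random variables $\varphi$, $\psi$ is

$$\varphi = \pm\psi + c \pmod{2\pi} \qquad (7.14)$$

where $c \in [0, 2\pi)$ is a fixed constant. A sample statistic that measures how close we are to this perfect association is

$$r_{\varphi,\psi} = \frac{\sum_{i,j=1;\, i \neq j}^{n} \sin(\varphi_i - \varphi_j)\sin(\psi_i - \psi_j)}{\sqrt{\sum_{i,j=1;\, i \neq j}^{n} \sin^2(\varphi_i - \varphi_j)\ \sum_{i,j=1;\, i \neq j}^{n} \sin^2(\psi_i - \psi_j)}} \qquad (7.15)$$

or

$$r_{\varphi,\psi} = \frac{\det(n^{-1}\sum_{i=1}^{n} x_i y_i^t)}{\sqrt{\det(n^{-1}\sum_{i=1}^{n} x_i x_i^t)\,\det(n^{-1}\sum_{i=1}^{n} y_i y_i^t)}} \qquad (7.16)$$

where $x_i = (\cos\varphi_i, \sin\varphi_i)^t$ and $y_i = (\cos\psi_i, \sin\psi_i)^t$. For a time series $\varphi_t$ (t = 1, 2, ...) of circular data, this definition can be carried over to autocorrelations

$$r(k) = \frac{\sum_{i,j=1;\, i \neq j}^{n} \sin(\varphi_i - \varphi_j)\sin(\varphi_{i+k} - \varphi_{j+k})}{\sum_{i,j=1;\, i \neq j}^{n} \sin^2(\varphi_i - \varphi_j)} \qquad (7.17)$$

or

$$r_\varphi(k) = \frac{\det(n^{-1}\sum_{i=1}^{n-k} x_i x_{i+k}^t)}{\det(n^{-1}\sum_{i=1}^{n-k} x_i x_i^t)} \qquad (7.18)$$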
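A direct transcription of (7.15) is sketched below; the function name `circular_correlation` and the simulated angles are hypothetical. The double sums are evaluated via matrices of pairwise differences; the diagonal terms $i = j$ contribute $\sin 0 = 0$ and can therefore be left in the sums.

```python
import numpy as np

def circular_correlation(phi, psi):
    """Sample circular correlation (7.15) between two samples of angles (radians)."""
    phi = np.asarray(phi, dtype=float)
    psi = np.asarray(psi, dtype=float)
    dphi = np.sin(phi[:, None] - phi[None, :])   # sin(phi_i - phi_j) for all pairs
    dpsi = np.sin(psi[:, None] - psi[None, :])   # sin(psi_i - psi_j) for all pairs
    num = (dphi * dpsi).sum()
    den = np.sqrt((dphi ** 2).sum() * (dpsi ** 2).sum())
    return num / den

# Simulated angles (an assumption for illustration only).
rng = np.random.default_rng(0)
phi = rng.uniform(0.0, 2 * np.pi, size=20)
r_plus = circular_correlation(phi, (phi + 1.0) % (2 * np.pi))    # psi = phi + c
r_minus = circular_correlation(phi, (-phi + 0.5) % (2 * np.pi))  # psi = -phi + c
```

Under the perfect association model (7.14) the statistic takes the values $+1$ and $-1$ for the two signs, which is what the two test cases reproduce.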

7.2.3 Probability distributions

A probability distribution for circular data is a distribution $F$ on the interval $[0, 2\pi)$. The sample statistics defined in Section 7.2.1 are estimates of the corresponding population counterparts in Table 7.2. The most frequently used distributions are the uniform, cardioid, wrapped, von Mises, and mixture distributions.

Uniform distribution $U([0, 2\pi))$:

$$F(u) = P(0 \le \varphi \le u) = \frac{u}{2\pi} 1\{0 \le u < 2\pi\},$$

$$f(u) = F'(u) = \frac{1}{2\pi} 1\{0 \le u < 2\pi\}.$$

In this case, $\mu_p = \rho_p = 0$, the mean direction $\mu_\varphi$ is not defined, and the circular standard deviation $\sigma$ and dispersion $\delta$ are infinite. This expresses the fact that there is no preference for any direction and variability is therefore maximal.

Cardioid (or Cosine) distribution $C(\mu, \rho)$:

$$F(u) = \Big[\frac{\rho}{\pi}\sin(u - \mu) + \frac{u}{2\pi}\Big] 1\{0 \le u < 2\pi\}$$

and

$$f(u) = \frac{1}{2\pi}\big(1 + 2\rho\cos(u - \mu)\big) 1\{0 \le u < 2\pi\}$$

where $0 \le \rho \le \frac{1}{2}$. In this case, $\mu_\varphi = \mu$, $\rho_1 = \rho$, $\mu_p = 0$ ($p \ge 2$) and $\delta = 1/(2\rho^2)$. An interesting property is that this distribution tends to the uniform distribution as $\rho \to 0$.


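Table 7.2 Some important population statistics for circular data

Name | Definition | Feature
-----|------------|--------
pth trigonometric moment | $\mu_p = \int_0^{2\pi} \cos(p\varphi)\,dF(\varphi) + i\int_0^{2\pi} \sin(p\varphi)\,dF(\varphi) = \mu_{p,C} + i\mu_{p,S} = \rho_p e^{i\mu_\varphi(p)}$ | -
Mean direction | $\mu_\varphi = \arctan(\mu_{1,S}/\mu_{1,C}) + \pi 1\{\mu_{1,C} < 0\} + 2\pi 1\{\mu_{1,C} > 0, \mu_{1,S} < 0\}$ | Center (angle)
pth central trig. moment | $\mu_p^o = \int_0^{2\pi} \cos(p(\varphi - \mu_\varphi))\,dF(\varphi) + i\int_0^{2\pi} \sin(p(\varphi - \mu_\varphi))\,dF(\varphi) = \mu_{p,C}^o + i\mu_{p,S}^o$ | -
Mean resultant length | $\rho = \lvert \mu_1 \rvert$ | Concentration
Median direction | $M = \{\alpha : \int_{\alpha-\pi}^{\alpha} dF(\varphi) = \int_{\alpha}^{\alpha+\pi} dF(\varphi) = \frac{1}{2}\}$ | Center (angle)
Quartiles $q_1$, $q_2$ | $q_1$ = median of $\{\varphi : M - \pi \le \varphi \le M\}$; $q_2$ = median of $\{\varphi : M \le \varphi \le M + \pi\}$ | 25%-quantile; 75%-quantile
Modal direction | $\tilde{M} = \arg\max f(\varphi)$ | Center (angle)
Principal direction | $\alpha$ = first eigenvector of $\Sigma_\varphi = E(XX^t)$ | Center (direction)
Concentration | $\lambda_1$ = first eigenvalue of $\Sigma_\varphi$ | Variability
Circular variance | $\upsilon = 1 - \rho$ | Variability
Circular stand. dev. | $\sigma = \sqrt{-2\log(1 - \upsilon)}$ | Variability
Circular dispersion | $\delta = (1 - \rho_2)/(2\rho^2)$ | Variability
Mean deviation | $\Delta = \pi - \int_0^{2\pi} \lvert \pi - \lvert \varphi - M \rvert \rvert\,dF(\varphi)$ | Variability
Interquartile range | $IQR = q_2 - q_1$ | Variability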

Wrapped distribution: Let $X$ be a random variable with distribution function $F_X$. The random variable $\varphi = X \pmod{2\pi}$ has a distribution $F_\varphi$ on $[0, 2\pi)$ given by

$$F_\varphi(u) = \sum_{j=-\infty}^{\infty} [F_X(u + 2\pi j) - F_X(2\pi j)]$$

If $X$ has a density function $f_X$, then the density function of $\varphi$ is equal to

$$f_\varphi(u) = \sum_{j=-\infty}^{\infty} f_X(u + 2\pi j).$$
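The wrapping formula for densities can be evaluated numerically by truncating the infinite sum. The sketch below is an illustration under assumptions: the function names, the truncation point `J`, and the choice of a normal density with mean 1 and standard deviation 1 for $f_X$ are all hypothetical.

```python
import math

def wrapped_density(u, f_X, J=50):
    """Approximate f_phi(u) = sum over j of f_X(u + 2*pi*j), truncated to |j| <= J."""
    return sum(f_X(u + 2 * math.pi * j) for j in range(-J, J + 1))

def normal_pdf(x, mu=1.0, sigma=1.0):
    """Density of a normal distribution (used here only as an example for f_X)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Midpoint rule on [0, 2*pi): the wrapped density should integrate to one.
N = 2000
h = 2 * math.pi / N
total = sum(wrapped_density((k + 0.5) * h, normal_pdf) for k in range(N)) * h
```

For light-tailed densities such as the normal, only a few terms around $j = 0$ contribute noticeably, so a much smaller `J` would suffice in practice; the generous default is chosen only to keep the sketch safe for larger variances.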

An important special example is the wrapped normal distribution. The wrapped normal distribution $WN(\mu, \rho)$ is obtained by wrapping a normal distribution with $E(X) = \mu$ and $\mathrm{var}(X) = -2\log\rho$ ($0 < \rho \le 1$). This yields the circular density function

$$f_\varphi(u) = \frac{1}{2\pi}\Big[1 + 2\sum_{j=1}^{\infty} \rho^{j^2}\cos j(u - \mu)\Big] 1\{0 \le u < 2\pi\}$$

Then, $\mu_\varphi = \mu$, $\rho_1 = \rho$, $\delta = (1 - \rho^4)/(2\rho^2)$, $\mu_{p,C} = \rho^{p^2}$ and $\mu_{p,S} = 0$ ($p \ge 1$). For $\rho \to 0$, we obtain the uniform distribution, and for $\rho \to 1$ a distribution with point mass in the direction $\mu_\varphi$.

von Mises distribution $M(\mu, \kappa)$: The most frequently used unimodal circular distribution is the von Mises distribution with density function

$$f_\varphi(u) = \frac{1}{2\pi I_0(\kappa)} e^{\kappa\cos(u - \mu)} 1\{0 \le u < 2\pi\}$$

where $0 \le \kappa < \infty$, $0 \le \mu < 2\pi$ and

$$I_0 = \frac{1}{2\pi}\int_0^{2\pi} \exp(\kappa\cos(v - \mu))\,dv = \sum_{j=0}^{\infty} \frac{1}{(j!)^2}\Big(\frac{\kappa}{2}\Big)^{2j}$$

is the modified Bessel function of the first kind and order 0. In this case, we have $\mu_\varphi = \mu$, $\rho_1 = I_1/I_0$, $\delta = (\kappa I_1/I_0)^{-1}$, $\mu_{p,C} = I_p/I_0$ and $\mu_{p,S} = 0$ ($p \ge 1$) where

$$I_p = \sum_{j=0}^{\infty} \frac{1}{(j + p)!\,j!}\Big(\frac{\kappa}{2}\Big)^{2j+p}$$

is a modified Bessel function of order $p$. For $\kappa \to 0$, the $M(\mu, \kappa)$-distribution converges to $U([0, 2\pi))$, and for $\kappa \to \infty$ we obtain a point mass in the direction $\mu_\varphi$.

Mixture distribution: All distributions above are unimodal. Distributions with more than one mode can be modeled, for instance, by mixture distributions

$$f_\varphi(u) = p_1 f_{\varphi,1}(u) + ... + p_m f_{\varphi,m}(u)$$

where $0 \le p_1, ..., p_m \le 1$, $\sum p_i = 1$ and $f_{\varphi,j}$ are different circular probability densities.

7.2.4 Statistical inference

Statistical inference about population parameters is mainly known for the distributions above. Classical methods can be found in Mardia (1972), Batschelet (1981), Watson (1983), and Fisher (1993). For recent results see e.g. Jammalamadaka and SenGupta (2001).

7.3 Specific applications in music

7.3.1 Variability and autocorrelation of notes modulo 12

Figure 7.1 Béla Bartók, statue by Varga Imre in front of the Béla Bartók Memorial House in Budapest. (Courtesy of the Béla Bartók Memorial House.)

The following analysis is done for various compositions: pitch is represented in $\mathbb{Z}_{12}$ with 0 set equal to the note (modulo 12) with the highest frequency in the composition. Given a note $j$ in $\mathbb{Z}_{12}$, the corresponding circular point is then $x = (x_1, x_2)^t = (\cos(2\pi j/12), \sin(2\pi j/12))^t$. The following statistics are calculated: $\hat{\lambda}_1$, $\bar{R}$, $d$ and the maximal circular autocorrelation $m = \max_{1 \le k \le 10} |r_\varphi(k)|$. The compositions considered here are:

Figure 7.2 Sergei Prokoffieff as a child. (Courtesy of Karadar Bertoldi Ensemble; www.karadar.net/Ensemble/.)


Figure 7.3 Circular representation of compositions by J. S. Bach (Präludium und Fuge No. 5 from “Das Wohltemperierte Klavier”), D. Scarlatti (Sonata Kirkpatrick No. 125), B. Bartók (Bagatelles No. 3), and S. Prokoffieff (Visions fugitives No. 8).


• J. S. Bach: Das Wohltemperierte Klavier I (all preludes and fugues)
• D. Scarlatti: Sonatas Kirkpatrick No. 49, 125, 222, 345, 381, 412, 440, 541
• B. Bartók (Figure 7.1): Bagatelles No. 1–3, Sonata for Piano (2nd movement)
• S. Prokoffieff (Figure 7.2): Visions fugitives No. 1–15.

To simplify the analysis, the upper envelope is considered for each composition. The data set that was available consists of played music. Thus, instead of the written score we are looking at its realization by a pianist. This results in some changes of onset times. In particular, some notes with equal score onset times are not played simultaneously. Strictly speaking, the analysis thus refers to the played music rather than the original score.

In Figure 7.3, four representative compositions are displayed. $\mathbb{Z}_{12}$ is represented by a circle starting on top with 0 and proceeding clockwise as $j \in \mathbb{Z}_{12}$ increases. A composition is thus represented by pitches $j_1, ..., j_n \in \mathbb{Z}_{12}$, each pitch being represented by a dot on the circle. In order to visualize how frequent each note is, each point $x_i = (\cos\varphi_i, \sin\varphi_i)^t$ ($i = 1, ..., n$), where $\varphi_i = 2\pi j_i/12$, is displaced slightly by adding a random number from a uniform distribution on [0, 0.1] to the angle $\varphi_i$. (This technique of exploratory data analysis is often referred to as “jittering”; see Chambers et al. 1983.) Moreover, to obtain an impression of the dynamic movement, successive points $x_i$, $x_{i+1}$ are joined by a line. The connections visualize which notes are likely to follow each other.

Some clear differences are visible between the four plots: for Bach, the main movements take place along the edges, the main points and vertices corresponding to the D-major scale. The rather curious simple figure for Bartók’s Bagatelle No. 3 stems from the continuous repetition of the same chromatic figure in the upper voice. For Prokoffieff one can see two main vertices that are positioned symmetrically with respect to the middle vertical line. This is due to the repetitive nature of the upper envelope.

Figure 7.4 shows boxplots of $\hat\lambda_1$, $\bar{R}$, $d$, and $\log m$, comparing Bach, Scarlatti, Bartók and Prokoffieff. Variability is clearly lower for Bartók and Prokoffieff, independently of the specific statistic that is used. There are also some, but less extreme, differences with respect to the maximal autocorrelation $m$. As one may perhaps expect, Bartók has the highest values of $m$.

7.3.2 Variability and autocorrelation of note intervals modulo 12

The same analysis can be carried out for intervals between successive notes (Figure 7.5). Figure 7.6 shows that, again, variability is much lower for Bartók and Prokoffieff.
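The circular representation used in Sections 7.3.1 and 7.3.2 can be sketched in a few lines. The following is only an illustration under stated assumptions: the pitch sequence is invented (it is not data from the book), the function names are hypothetical, and $\bar{R}$ is taken to be the mean resultant length, its usual meaning in circular statistics. The statistics $\lambda_1$, $d$ and $m$ rely on definitions given earlier in the chapter and are not reproduced here.

```python
import numpy as np

def circular_points(pitches):
    """Map pitches j (taken modulo 12) to angles and points (cos, sin) on the unit circle."""
    phi = 2 * np.pi * (np.asarray(pitches) % 12) / 12
    return phi, np.column_stack([np.cos(phi), np.sin(phi)])

def mean_resultant_length(phi):
    """Length of the average unit vector: 1 if all angles coincide, near 0 if spread out."""
    return float(np.hypot(np.cos(phi).mean(), np.sin(phi).mean()))

def jitter(phi, width=0.1, seed=0):
    """Add uniform noise on [0, width] to each angle (for plotting only)."""
    rng = np.random.default_rng(seed)
    return phi + rng.uniform(0.0, width, size=len(phi))

# Hypothetical upper-voice pitches (assumption, not data from the book).
pitches = [0, 2, 4, 5, 7, 9, 11, 12, 7, 4, 0]
phi, xy = circular_points(pitches)

# Intervals between successive notes, again taken modulo 12 (Section 7.3.2).
intervals = np.diff(pitches) % 12
phi_int, xy_int = circular_points(intervals)

r_notes = mean_resultant_length(phi)
r_intervals = mean_resultant_length(phi_int)
phi_plot = jitter(phi)  # angles one would use for a jittered display
```

A small value of the mean resultant length indicates that the notes (or intervals) are spread widely around the circle, which is one way of reading the "variability" compared in the boxplots.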


Figure 7.4 Boxplots of $\hat\lambda_1$, $\bar{R}$, $d$ and $\log m$ for notes modulo 12, comparing Bach, Scarlatti, Bartók, and Prokoffieff.

Figure 7.5 Circular representation of intervals of successive notes in the following compositions: J. S. Bach (Präludium und Fuge No. 5 from “Das Wohltemperierte Klavier”), D. Scarlatti (Sonata Kirkpatrick No. 125), B. Bartók (Bagatelles No. 3), and S. Prokoffieff (Visions fugitives No. 8).

Figure 7.6 Boxplots of $\hat\lambda_1$, $\bar{R}$, $d$ and $\log m$ for note intervals modulo 12, comparing Bach, Scarlatti, Bartók, and Prokoffieff.

Figure 7.7 Circular representation of notes ordered according to circle of fourths in the following compositions: J. S. Bach (Präludium und Fuge No. 5 from “Das Wohltemperierte Klavier”), D. Scarlatti (Sonata Kirkpatrick No. 125), B. Bartók (Bagatelles No. 3), and S. Prokoffieff (Visions fugitives No. 8).

Figure 7.8 Boxplots of $\hat\lambda_1$, $\bar{R}$, $d$ and $\log m$ for notes modulo 12 ordered according to circle of fourths, comparing Bach, Scarlatti, Bartók and Prokoffieff.

Figure 7.9 Circular representation of intervals of successive notes ordered according to circle of fourths in the following compositions: J. S. Bach (Präludium und Fuge No. 5 from “Das Wohltemperierte Klavier”), D. Scarlatti (Sonata Kirkpatrick No. 125), B. Bartók (Bagatelles No. 3), and S. Prokoffieff (Visions fugitives No. 8).

Figure 7.10 Boxplots of $\hat\lambda_1$, $\bar{R}$, $d$ and $\log m$ for note intervals modulo 12 ordered according to circle of fourths, comparing Bach, Scarlatti, Bartók, and Prokoffieff.


7.3.3 Notes and intervals on the circle of fourths

Alternatively, the analysis above can be carried out by ordering notes according to the circle of fourths. Thus, a rotation by 360°/12 = 30° corresponds to a step of one fourth. The analogous plots are given in Figures 7.7 through 7.10. This specific circular representation makes some symmetries and their harmonic meaning more visible.
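The reordering on the circle of fourths can be written as a simple map on $\mathbb{Z}_{12}$. The sketch below assumes equal temperament (a fourth equals 5 semitones) and assumes that positions are counted in ascending fourths starting from 0; the function names are hypothetical.

```python
def fourths_position(j):
    """Position of pitch class j on the circle of fourths.

    Position k holds pitch class 5*k mod 12. Since 5*5 = 25 = 1 (mod 12),
    the inverse map is again multiplication by 5.
    """
    return (5 * j) % 12

def fourths_angle_degrees(j):
    """Angle of pitch class j: one step on the circle is 360/12 = 30 degrees."""
    return 30 * fourths_position(j)

# Pitch classes listed in the order in which they appear around the circle.
order = sorted(range(12), key=fourths_position)
```

Because the map is a bijection of $\mathbb{Z}_{12}$, the statistics of Section 7.3.1 can be recomputed after replacing each pitch class `j` by `fourths_position(j)`.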


CHAPTER 8

Principal component analysis

8.1 Musical motivation

Observations in music often consist of vectors. Consider, for instance, the tempo measurements for Schumann’s Träumerei (Figure 2.3). In this case, the observational units are performances and an observation consists of a tempo “curve”, which is a vector of $p$ tempo measurements $x(t_i)$ at symbolic score onset times $t_i$ ($i = 1, ..., p$). The main question is which similarities and differences there are between the performances. Principal component analysis (PCA) provides an answer in the sense that the “most interesting”, and hopefully interpretable, projections are found. In this chapter, a brief introduction to PCA is given. For a detailed account and references see e.g. Mardia et al. (1979), Anderson (1984), Dillon and Goldstein (1984), Seber (1984), Krzanowski (1988), Flury and Riedwyl (1988), Johnson and Wichern (2002).

8.2 Basic principles

8.2.1 Definition of PCA for multivariate probability distributions

Algorithmic definition

Let $X = (X_1, ..., X_p)^t$ be a random vector with expected value $E(X) = \mu$ and covariance matrix $\Sigma$. The following algorithm is defined:
• Step 0. Initialization: Set $j = 1$ and $Z^{(1)} = X$.
• Step 1. Find a direction, i.e. a vector $a^{(j)}$ with $|a^{(j)}| = 1$, such that the projection $Z_j = [a^{(j)}]^t Z^{(j)} = a_1^{(j)} Z_1^{(j)} + ... + a_p^{(j)} Z_p^{(j)}$ has the largest possible variance.
• Step 2. Consider the part of $Z^{(j)}$ that is orthogonal to $a^{(1)}, ..., a^{(j)}$, i.e. set $Z^{(j+1)} = Z^{(j)} - Z_j a^{(j)}$. If $j = p$, or all components of $Z^{(j+1)}$ have variance zero, then stop. Otherwise set $j = j + 1$ and go to Step 1.

The algorithm finds successively orthogonal directions $a^{(1)}, a^{(2)}, ...$ such that the corresponding projections of $Z$ have the largest variance among all projections that are orthogonal to the previous ones. A projection with a large variance is suitable for comparing, ranking, and classifying observations, since different random realizations of the projection tend to be widely scattered.
In contrast, if a projection has a small variance, then individuals do not differ very much with respect to that projection, and are therefore more difficult to distinguish.

Definition via spectral decomposition of matrices

The algorithm given above has an elegant interpretation:

Theorem 18 (Spectral decomposition theorem) Let $B$ be a symmetric $p \times p$ matrix. Then $B$ can be written as
\[ B = A \Lambda A^t = \sum_{j=1}^{p} \lambda_j a^{(j)} [a^{(j)}]^t \tag{8.1} \]
where
\[ \Lambda = \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & \lambda_p \end{pmatrix} \]
is a diagonal matrix, $\lambda_j$ are the eigenvalues and the columns $a^{(j)}$ of $A$ the corresponding orthonormal eigenvectors of $B$, i.e. we have
\[ B a^{(j)} = \lambda_j a^{(j)} \tag{8.2} \]
\[ |a^{(j)}|^2 = [a^{(j)}]^t a^{(j)} = 1, \quad \text{and} \quad [a^{(j)}]^t a^{(l)} = 0 \ \text{for} \ j \neq l \tag{8.3} \]
In matrix form equation (8.3) means that $A$ is an orthogonal matrix, i.e.
\[ A^t A = I \tag{8.4} \]

where $I$ denotes the identity matrix with $I_{jj} = 1$ and $I_{jl} = 0$ ($j \neq l$). This result can now be applied to the covariance matrix of a random vector $X = (X_1, ..., X_p)^t$:

Theorem 19 Let $X$ be a $p$-dimensional random vector with expected value $E(X) = \mu$ and $p \times p$ covariance matrix $\Sigma$. Then
\[ \Sigma = A \Lambda A^t \tag{8.5} \]
where the columns $a^{(j)}$ of $A$ are eigenvectors of $\Sigma$ and $\Lambda$ is a diagonal matrix with eigenvalues $\lambda_1, ..., \lambda_p \ge 0$.

In particular, we may permute the sequence of the $X$-components such that the eigenvalues are ordered. We thus obtain:

Theorem 20 Let $X$ be a $p$-dimensional random vector with expected value $E(X) = \mu$ and a $p \times p$ covariance matrix $\Sigma$. Then there exists an orthogonal matrix $A$ such that
\[ \Sigma = A \Lambda A^t \tag{8.6} \]
where the columns $a^{(j)}$ of $A$ are eigenvectors of $\Sigma$ and $\Lambda$ is a diagonal matrix with eigenvalues $\lambda_1 \ge \lambda_2 \ge ... \ge \lambda_p \ge 0$. Moreover, the covariance matrix of the transformed vector
\[ Z = A^t (X - \mu) \tag{8.7} \]
is equal to
\[ \mathrm{cov}(Z) = A^t \Sigma A = \Lambda \tag{8.8} \]
Note in particular that $\mathrm{var}(Z_1) = \lambda_1 \ge \mathrm{var}(Z_2) = \lambda_2 \ge ... \ge \mathrm{var}(Z_p) = \lambda_p$ and the covariance matrix $\Sigma$ may be approximated by a matrix
\[ \Sigma(q) = \sum_{j=1}^{q} \lambda_j a^{(j)} [a^{(j)}]^t \]
for a suitably chosen value $q \le p$. If a good approximation can be achieved for a relatively small value of $q$, then this means that most of the random variation in $X$ occurs in a low dimensional space spanned by the random vector $Z(q) = (Z_1, ..., Z_q)^t$.

Definition 53 The transformation defined by $Z = A^t (X - \mu)$ is called the principal component transformation. The $j$th component of $Z$,
\[ Z_j = [A^t (X - \mu)]_j = [(X - \mu)^t a^{(j)}]^t \tag{8.9} \]
is called the $j$th principal component of $X$. The $j$th column of $A$, i.e. the $j$th eigenvector $a^{(j)}$, is called the vector of principal component loadings.

In summary, the principal component transformation rotates the original random vector $X - \mu$ in such a way that the new coordinates $Z_1, ..., Z_p$ are uncorrelated (orthogonal) and they are ordered according to their importance with respect to characterizing the covariance structure of $X$. The following result states that the algorithmic and the algebraic definition are indeed the same:

Theorem 21 Consider $U = b^t X$ where $b = (b_1, ..., b_p)^t$ and $|b| = 1$. Suppose that $U$ is orthogonal (i.e. uncorrelated) to the first $k$ principal components of $X$. Then $\mathrm{var}(U)$ is maximal, among all such projections, if and only if $b = a^{(k+1)}$, i.e. if $U$ is the $(k+1)$st principal component $Z_{k+1}$.

8.2.2 Definition of PCA for observed data

The definition of principal components given above cannot be applied directly to data, since the expected value and covariance matrix are usually unknown. It can however be modified in an obvious way by replacing population quantities by suitable estimates. The simplest solution is to use the sample mean and the sample covariance matrix. For observed vectors $x(i) = (x_1(i), ..., x_p(i))^t$ ($i = 1, 2, ..., n$) one defines
\[ \hat\mu = \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x(i) \tag{8.10} \]
and the estimate of the covariance matrix
\[ \hat\Sigma = \frac{1}{n} \sum_{i=1}^{n} (x(i) - \bar{x})(x(i) - \bar{x})^t. \tag{8.11} \]
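The equivalence between the algorithmic definition of Section 8.2.1 and the spectral decomposition (Theorem 21) can be checked numerically on the estimates (8.10) and (8.11). The sketch below is an illustration only: the data are simulated, the function names are hypothetical, and the "direction of largest variance" in Step 1 is found with NumPy's symmetric eigensolver rather than by a separate optimization, which is a shortcut and not the book's procedure.

```python
import numpy as np

def sample_mean_cov(X):
    """Sample mean (8.10) and covariance with divisor n (8.11); rows are observations."""
    xbar = X.mean(axis=0)
    Xc = X - xbar
    return xbar, Xc.T @ Xc / X.shape[0]

def pca_by_deflation(S, tol=1e-12):
    """Steps 0-2 of the algorithmic definition, applied to a covariance matrix S."""
    S = S.copy()
    directions, variances = [], []
    for _ in range(S.shape[0]):
        if np.trace(S) <= tol:            # all remaining components have variance zero
            break
        w, V = np.linalg.eigh(S)          # eigenvalues in ascending order
        a = V[:, -1]                      # unit vector with the largest projected variance
        directions.append(a)
        variances.append(w[-1])
        P = np.eye(len(a)) - np.outer(a, a)
        S = P @ S @ P                     # covariance of Z - (a^t Z) a
    return np.array(variances), np.column_stack(directions)

# Simulated data (assumption): n = 200 observations of a 4-dimensional vector.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4)) @ np.diag([3.0, 2.0, 1.0, 0.5])

xbar, S_hat = sample_mean_cov(X)
lam_defl, A_defl = pca_by_deflation(S_hat)

# Direct spectral decomposition, reordered so that eigenvalues decrease.
w, V = np.linalg.eigh(S_hat)
lam_direct, A_direct = w[::-1], V[:, ::-1]
```

Up to the sign of each direction, `A_defl` and `A_direct` agree, and `A_defl @ np.diag(lam_defl) @ A_defl.T` reproduces `S_hat` as in (8.6).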

The estimated $j$th vector of principal component loadings, $\hat{a}^{(j)}$, is the standardized eigenvector corresponding to the $j$th-largest eigenvalue of $\hat\Sigma$. The estimated principal component transformation is then defined by
\[ z = \hat{A}^t (x - \bar{x}) = [(x - \bar{x})^t \hat{A}]^t \tag{8.12} \]
where the columns of $\hat{A}$ are equal to the orthogonal vectors $\hat{a}^{(j)}$. Applying this transformation to the observed vectors $x(1), ..., x(n)$ enables us to compare observations with respect to their principal components. The $j$th principal component of the $i$th observation is equal to
\[ z_j(i) = (x(i) - \bar{x})^t \hat{a}^{(j)} \tag{8.13} \]

In other words, the $i$th observed vector $x(i) - \bar{x}$ is transformed into a rotated vector $z(i) = (z_1(i), ..., z_p(i))^t$ with the corresponding observed principal components. In matrix form, we can define the $n \times p$ matrix of observations
\[ X = \begin{pmatrix} x_1(1) & x_2(1) & \cdots & x_p(1) \\ x_1(2) & x_2(2) & \cdots & x_p(2) \\ \vdots & \vdots & \ddots & \vdots \\ x_1(n) & x_2(n) & \cdots & x_p(n) \end{pmatrix} \tag{8.14} \]
and the $n \times p$ matrix of observed principal components
\[ Z = \begin{pmatrix} z_1(1) & z_2(1) & \cdots & z_p(1) \\ z_1(2) & z_2(2) & \cdots & z_p(2) \\ \vdots & \vdots & \ddots & \vdots \\ z_1(n) & z_2(n) & \cdots & z_p(n) \end{pmatrix} \tag{8.15} \]
so that
\[ Z = (X - \mathbf{1} \bar{x}^t) \hat{A} \tag{8.16} \]
where $\mathbf{1} = (1, ..., 1)^t$ denotes the $n$-dimensional vector of ones, so that $\mathbf{1} \bar{x}^t$ is the $n \times p$ matrix whose rows are all equal to $\bar{x}^t$. Note that the $j$th column $z^{(j)} = (z_j(1), ..., z_j(n))^t$ consists of the observed $j$th principal components. Therefore, the sample variance of the $j$th principal components is given by
\[ s_z^2 = n^{-1} \sum_{i=1}^{n} z_j^2(i) = \hat\lambda_j. \]
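The matrix form (8.16) and the identity between the sample variance of the $j$th principal components and $\hat\lambda_j$ can be illustrated numerically. This sketch uses simulated data and reads the centring term in (8.16) as the $n$-vector of ones times the transposed sample mean, which is an assumption about the intended notation.

```python
import numpy as np

# Simulated data (assumption): n = 50 observations, p = 3 coordinates.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3)) * np.array([2.0, 1.0, 0.5])

n = X.shape[0]
xbar = X.mean(axis=0)
Xc = X - np.ones((n, 1)) @ xbar[None, :]   # X - 1 xbar^t
S_hat = Xc.T @ Xc / n                      # covariance estimate with divisor n, as in (8.11)

lam, A_hat = np.linalg.eigh(S_hat)         # ascending eigenvalues
lam, A_hat = lam[::-1], A_hat[:, ::-1]     # reorder: lam[0] >= lam[1] >= ...

Z = Xc @ A_hat                             # n x p matrix of observed principal components
var_z = (Z ** 2).sum(axis=0) / n           # sample variance of each column of Z
cross = Z.T @ Z / n                        # should be diagonal with entries lam
```

Here `var_z` coincides with `lam`, and the off-diagonal entries of `cross` vanish up to rounding, reflecting that the observed principal components are uncorrelated.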

If $\hat\lambda_j$ is large, then the observed $j$th principal components $z_j(1), ..., z_j(n)$ have a large sample variance so that the observed values are scattered far apart.

8.2.3 Scale invariance?

The principal component transformation is based on the covariance matrix. It is therefore not scale invariant, since variance and covariance depend on the units in which individual components $X_j$ are measured. It is therefore often recommended to standardize all components. Thus, we replace each coordinate $x_j$ by $(x_j - \bar{x}_j)/s_j$ where $\bar{x}_j = n^{-1} \sum_{i=1}^{n} x_j(i)$ and $s_j^2 = n^{-1} \sum_{i=1}^{n} (x_j(i) - \bar{x}_j)^2$ (or $s_j^2 = (n-1)^{-1} \sum_{i=1}^{n} (x_j(i) - \bar{x}_j)^2$).

8.2.4 Choosing important principal components

Since an orthogonal transformation does not change the length of vectors, the “total variability” of the random vector $Z$ in (8.7) is the same as the one of the original random vector $X$ with covariance matrix $\Sigma = (\sigma_{ij})_{i,j=1,...,p}$. More specifically, one defines total variability by
\[ V_{total} = \mathrm{tr}(\Sigma) = \sum_{i=1}^{p} \sigma_{ii}. \tag{8.17} \]
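With the total variability (8.17) in hand, the lack of scale invariance discussed in Section 8.2.3 is easy to demonstrate. In the sketch below the data are simulated, the factor 100 stands for a hypothetical change of units in the third coordinate, and the function names are illustrative.

```python
import numpy as np

def first_loading_and_total(X):
    """Return (absolute first loading vector, total variability tr(S)) for data matrix X."""
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / X.shape[0]
    w, V = np.linalg.eigh(S)                 # ascending eigenvalues
    return np.abs(V[:, -1]), float(np.trace(S))

def standardize(X):
    """Replace x_j by (x_j - mean_j) / s_j, with s_j^2 computed with divisor n."""
    return (X - X.mean(axis=0)) / X.std(axis=0)   # numpy's std uses divisor n by default

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))                     # simulated data (assumption)
X_rescaled = X * np.array([1.0, 1.0, 100.0])      # third coordinate in different units

a_raw, v_raw = first_loading_and_total(X)
a_big, v_big = first_loading_and_total(X_rescaled)
a_std, v_std = first_loading_and_total(standardize(X))
a_std2, v_std2 = first_loading_and_total(standardize(X_rescaled))
```

After rescaling, the first loading vector `a_big` points almost entirely along the third coordinate and `v_big` is dominated by its variance. After standardization both versions of the data give the same loadings, and the total variability equals the number of coordinates.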

The singular value decomposition (spectral decomposition) of $\Sigma$ then implies

Theorem 22 Let $\Sigma$ be a covariance matrix with spectral decomposition $\Sigma = A \Lambda A^t$. Then
\[ V_{total} = \mathrm{tr}(\Sigma) = \sum_{i=1}^{p} \lambda_i \tag{8.18} \]

Since the eigenvalues $\lambda_i$ are ordered according to their size, we may therefore hope that the proportion of total variation
\[ P(q) = \frac{\lambda_1 + ... + \lambda_q}{\sum_{i=1}^{p} \lambda_i} \tag{8.19} \]
is close to one for a low value of $q$. If this is the case, then one may reduce the dimension of the random vector considerably without losing much information. For data, we plot $\hat{P}(q) = (\hat\lambda_1 + ... + \hat\lambda_q)/\sum \hat\lambda_i$ versus $q$ and judge by eye from which point on the increase in $\hat{P}(q)$ is not worth the price of adding additional dimensions. Alternatively, we may plot the contribution of each eigenvalue, $\hat\lambda_j/\sum \hat\lambda_i$ or $\hat\lambda_j$ itself, against $j$. This is the so-called scree graph. More formal tests, e.g. for testing which eigenvalues are nonzero or for comparing different eigenvalues, are available, however mostly under the rather restrictive assumption that the distribution of $X$ is multivariate normal (see e.g. Mardia et al. 1979, Ch. 8.3.2).

In addition to the scree plot, the decision on the number of principal components is often also based on the (possibly subjective) interpretability of the components. The interpretation of principal components may be based on the coefficients $a_j^{(i)}$ and/or on the correlation between $Z_j$ and the coordinates of the original random vector $X = (X_1, ..., X_p)^t$. Note that since $E(Z X^t) = E(A^t X X^t) = A^t \Sigma = A^t A \Lambda A^t = \Lambda A^t$, $\mathrm{var}(X_k) = \sigma_{kk}$ and $\mathrm{var}(Z_j) = \lambda_j$, the correlation between $Z_j$ and $X_k$ is equal to
\[ \rho_{j,k} = \mathrm{corr}(Z_j, X_k) = a_k^{(j)} \sqrt{\frac{\lambda_j}{\sigma_{kk}}} \tag{8.20} \]
Analogously, for observed data we have the empirical correlations
\[ \hat\rho_{j,k} = \hat{a}_k^{(j)} \sqrt{\frac{\hat\lambda_j}{\hat\sigma_{kk}}} \tag{8.21} \]
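The proportion of total variation (8.19) and the correlations (8.20) and (8.21) are straightforward to compute once the eigenvalues and loadings are available. The following sketch uses invented eigenvalues and hypothetical function names; the 80% threshold is only an example, not a rule from the text.

```python
import numpy as np

def explained_proportion(lam):
    """P(q) for q = 1, ..., p, given eigenvalues sorted in decreasing order."""
    lam = np.asarray(lam, dtype=float)
    return np.cumsum(lam) / lam.sum()

def pc_correlations(S):
    """Eigenvalues, loadings and the matrix rho[j, k] = a_k^(j) * sqrt(lam_j / s_kk)."""
    lam, A = np.linalg.eigh(S)
    lam, A = lam[::-1], A[:, ::-1]             # decreasing eigenvalues
    rho = A.T * np.sqrt(lam)[:, None] / np.sqrt(np.diag(S))[None, :]
    return lam, A, rho

# Invented eigenvalues (assumption) to illustrate the choice of q.
P = explained_proportion([4.0, 2.0, 1.0, 1.0])   # [0.5, 0.75, 0.875, 1.0]
q = int(np.argmax(P >= 0.8)) + 1                 # smallest q with P(q) >= 0.8, here 3
```

Applied to a sample covariance matrix, row `j` of `rho` shows which original coordinates the `j`th principal component is most strongly correlated with, which supports the interpretation of the loadings.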

8.2.5 Plots

One of the main difficulties with high-dimensional data is that they cannot be represented directly in a two-dimensional display. Principal components provide a possible solution to this problem. The situation is particularly simple if the first two principal components explain most of the variability. In that case, the original data $(x_1(i), ..., x_p(i))^t$ ($i = 1, 2, ..., n$) may be replaced by the first two principal components $(z_1(i), z_2(i))^t$ ($i = 1, 2, ..., n$). Thus, $z_2(i)$ is plotted against $z_1(i)$. If more than two principal components are needed, then the plot of $z_2(i)$ versus $z_1(i)$ provides at least a partial view of the data structure, and further projections can be viewed by corresponding scatter plots of other components, or by symbol plots as described in Chapter 2. The scatter plots can be useful for identifying structure in the data. In particular, one may detect unusual observations (outliers) or clusters of similar observations.

8.3 Specific applications in music

8.3.1 PCA of tempo skewness

The 28 tempo curves in Figure 2.3, each consisting of measurements at $p = 212$ onset times, can be considered as $n = 28$ observations of a 212-dimensional random vector. Principal component analysis cannot be applied directly to these data. The reason is that PCA relies on estimating the $p \times p$ covariance matrix. The number of observations ($n = 28$) is much smaller than $p$. Therefore, not all elements of the covariance matrix can be estimated consistently and an empirical PCA-decomposition would be highly unreliable. A solution to this problem is to reduce the dimension $p$ in a meaningful way. Here, we consider the following reduction: the onset time axis is divided into 8 disjoint blocks $A_1, A_2, A'_1, A'_2, B_1, B_2, A''_1, A''_2$ of 4 bars each. For each part number $i$ ($i = 1, ..., 8$) and each performance $j$ ($j = 1, ..., 28$), we calculate the skewness measure
\[ \eta_j(i) = \frac{\bar{x} - M}{Q_2 - Q_1} \]

[Figure 8.1 plot: “Skewness of tempo plotted against period 1, 2, ..., 8”; x-axis: part number 1 to 8; y-axis: skewness.]

Figure 8.1 Tempo curves for Schumann’s Träumerei: skewness for the eight parts $A_1, A_2, A'_1, A'_2, B_1, B_2, A''_1, A''_2$ for 28 performances, plotted against the number of the part.

where $M$ is the median and $Q_1$, $Q_2$ are the lower and upper quartile respectively. Figure 8.1 shows $\eta_j(i)$ plotted against $i$. An apparent pattern is the generally strong negative skewness in $B_2$. (Recall that negative skewness can be created by extreme ritardandi.) Apart from that, however, Figure 8.1 is difficult to interpret directly. Principal component analysis helps to find more interesting features. Figure 8.3 shows the loadings for the first four principal components which explain more than 80% of the variability (see Figure 8.2).

The loadings can be interpreted as follows: the first component corresponds to a weighted average emphasizing the skewness values in the first half of the piece. The 28 performances apparently differ most with respect to $\eta_j(i)$ during the first 16 bars of the piece (parts $A_1, A_2, A'_1, A'_2$). The second most important distinction between pianists is characterized by the second component. This component compares skewness for the A-parts with the values in $B_1$ and $B_2$. The third component essentially compares the first with the second half. Finally, the fourth component essentially compares the odd with the even numbered parts, excluding the end $A''_1, A''_2$.

Components two to five are displayed in Figure 8.4, with $z_2$ and $z_3$ on the x- and y-axis respectively and rectangles representing $z_4$ and $z_5$. Note in particular that Cortot and Horowitz mainly differ with respect to the third principal component. Horowitz has a more extreme difference in skewness between the first and second halves of the piece. Also striking are the “outliers” Brendel, Ortiz, and Gianoli. The overall skewness, as represented by the first component, is quite extreme for Brendel and Ortiz. For comparison, their tempo curves are plotted in Figure 8.5 together with Cortot’s and Horowitz’ first performances. In view of the PCA one may now indeed see that in the tempo curves by Brendel and Ortiz there is a strong contrast between small tempo variations applied most of the time and occasional strong local ritardandi.

[Figure 8.2 plot: “Skewness of tempo - screeplot”; x-axis: Comp. 1 to Comp. 8; y-axis: Variances; bar labels (cumulative proportion of variance): 0.355, 0.564, 0.709, 0.824, 0.889, 0.935, 0.971, 1.]

Figure 8.2 Schumann’s Träumerei: screeplot for skewness.
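The dimension reduction used in this section, from one tempo curve to eight skewness values, can be sketched as follows. Several points are assumptions: the tempo data are simulated rather than the Träumerei measurements, the blocks are formed by splitting the onset indices into eight consecutive groups (the book divides by bars, four bars per block), and the quartiles use NumPy's default interpolation. The function names are hypothetical.

```python
import numpy as np

def tempo_skewness(x):
    """Skewness (mean - median) / (upper quartile - lower quartile) of one block."""
    q_low, med, q_up = np.percentile(x, [25, 50, 75])
    return (np.mean(x) - med) / (q_up - q_low)

def skewness_features(tempo, n_blocks=8):
    """Split each tempo curve (one row per performance) into blocks; one eta per block."""
    blocks = np.array_split(np.arange(tempo.shape[1]), n_blocks)
    return np.array([[tempo_skewness(row[b]) for b in blocks] for row in tempo])

# Simulated stand-in for 28 tempo curves at 212 onset times (assumption).
rng = np.random.default_rng(4)
tempo = rng.normal(loc=60.0, scale=5.0, size=(28, 212))

eta = skewness_features(tempo)        # 28 x 8 matrix: performances by parts

# PCA on the reduced data: now n = 28 exceeds the dimension 8.
eta_c = eta - eta.mean(axis=0)
lam, A_hat = np.linalg.eigh(eta_c.T @ eta_c / eta.shape[0])
lam, A_hat = lam[::-1], A_hat[:, ::-1]
```

A block containing an isolated strong ritardando (a few very low tempo values) yields a negative value of `tempo_skewness`, in line with the remark on $B_2$ above.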


[Figure 8.3 plot: four panels titled “Skewness: Loadings of first PCA-component”, “Skewness: Loadings of second PCA-component”, “Skewness: Loadings of third PCA-component” and “Skewness: Loadings of fourth PCA-component”; x-axis: parts A1, A2, A’1, A’2, B1, B2, A’’1, A’’2 (numbered 1 to 8); y-axis: loading.]

Figure 8.3 Schumann’s Träumerei: loadings for PCA of skewness.

[Figure 8.4 plot: “PCA of skewness: symbol plot of principal components 2-5”; x-axis: z2; y-axis: z3; each performance is labeled with the name of the pianist.]

Figure 8.4 Schumann’s Träumerei: symbol plot of principal components $z_2, ..., z_5$ for PCA of tempo skewness.


[Figure 8.5 plot: tempo curves labeled Cortot1, Horowitz1, Gianoli and Brendel; x-axis: onset time, 0 to 200.]

Figure 8.5 Schumann’s Träumerei: tempo curves by Cortot, Horowitz, Brendel, and Gianoli.

8.3.2 PCA of entropies

Consider the entropy measures $E_1$, $E_2$, $E_3$, $E_4$, $E_9$ and $E_{10}$ defined in Chapter 3. We ask the following question: is there a combination of entropy measures that enables us to distinguish “computationally” between various styles of composition? The following compositions are included in the study:
• Henry Purcell: 2 Airs (Figure 8.6), Hornpipe
• J. S. Bach: First movements of Cello Suites No. 1-6, Prelude and Fugue No. 1 and 8 from “Das Wohltemperierte Klavier”
• W. A. Mozart: KV 1e, 331/1, 545/1
• R. Schumann: op. 15, No. 2, 3, 4, 7; op. 68, No. 2, 16
• A. Scriabin: op. 51, No. 2, 4
• F. Martin: Préludes No. 6, 7 (cf. Figures 8.11, 8.12).

For each composition, we define the vector $x = (x_1, ..., x_6)^t = (E_1, E_2, E_3, E_4, E_9, E_{10})^t$. The results of PCA are displayed in Figures 8.7 through 8.10. The first principal component mainly consists of an average of the first four components and a comparison with $E_{10}$ (Figure 8.8). The second component essentially includes a comparison between $E_9$ and $E_{10}$, whereas the third component is mainly a weighted average of $E_2$, $E_9$, and $E_{10}$. Finally, the fourth component compares $E_2$, $E_3$ with $E_1$. According to the screeplot (Figure 8.7), the first three components already explain more than 95% of the variability. Scatterplots of the first three components (Figures 8.9 and 8.10) together with symbols representing the next two components show a clear clustering. For clarity, only three different names (Purcell, Bach, and Schumann) are written explicitly in the plots. Schumann turns out to be completely separated from Bach. Moreover, Purcell appears to be somewhat outside the regions of Bach and Schumann, in particular in Figure 8.10. In conclusion, entropies, as defined above, do indeed seem to capture certain features of a composer’s style.


[Figure 8.6 score: “Air”, Henry Purcell (1659-1695), for piano, q = 96.]

Figure 8.6 Air by Henry Purcell (1659-1695).






Figure 8.7 Screeplot for PCA of entropies.

Figure 8.8 Loadings for PCA of entropies.


[Figure 8.9 plot: “Entropies - second vs. first principal component; rectangles with width=3rd comp., height=4th comp.”; points labeled Purcell, Bach and Schumann.]

Figure 8.9 Entropies – symbol plot of the first four principal components.

[Figure 8.10 plot: “Third vs. second principal component; rectangles with width=4th comp., height=5th comp.”; points labeled Purcell, Bach and Schumann.]

Figure 8.10 Entropies – symbol plot of principal components no. 2-5.


Figure 8.11 F. Martin (1890-1971). (Courtesy of the Société Frank Martin and Mrs. Maria Martin.)

Figure 8.12 F. Martin (1890-1971) - manuscript from 8 Préludes. (Courtesy of the Société Frank Martin and Mrs. Maria Martin.)


CHAPTER 9

Discriminant analysis

9.1 Musical motivation

Discriminant analysis, often also referred to under the more general notion of pattern recognition, answers the question of which category an observed item is most likely to belong to. A typical application in music is attribution of an anonymous composition to a time period or even to a composer. Other examples are discussed below. A prerequisite for the application of discriminant analysis is that a “training data set” is available where the correct answers are known. We give a brief introduction to basic principles of discriminant analysis. For a detailed account see e.g. Mardia et al. (1979), Klecka (1980), Breiman (1984), Seber (1984), Fukunaga (1990), McLachlan (1992), Huberty (1994), Ripley (1995), Duda et al. (2000), and Hastie et al. (2001).

9.2 Basic principles

9.2.1 Allocation rules

Suppose that an observation $x \in R^k$ is known to belong to one of $p$ mutually exclusive categories $G_1, G_2, ..., G_p$. Associated with each category is a probability density $f_i(x)$ of $X$ on $R^k$. This means that if an individual comes from group $i$, then the individual’s random vector $X$ has the probability distribution $f_i$. The problem addressed by discriminant analysis is as follows: observe $X = x$, and try to guess which group the observation comes from. The aim is, of course, to make as few mistakes as possible. In probability terms this amounts to minimizing the probability of misclassification. The solution is defined by a classification rule. A classification rule is a division of $R^k$ into $p$ disjoint regions: $R^k = R_1 \cup R_2 \cup ... \cup R_p$, $R_i \cap R_j = \emptyset$ ($i \neq j$). The rule allocates an observation to group $G_i$, if $x \in R_i$. More generally, we may define a randomized rule by allocating an observation to group $G_i$ with probability $\psi_i(x)$, where $\sum_{i=1}^{p} \psi_i(x) = 1$ for every $x$. The advantage of allowing random allocation is that discriminant rules can be averaged and the set of all random rules is convex, which makes it possible to find optimal rules. Note that deterministic rules are a special case, by setting $\psi_i(x) = 1$ if $x \in R_i$ and 0 otherwise.


9.2.2 Case I: Known population distributions

Discriminant analysis without prior group probabilities – the ML-rule

Assume that it is not known a priori which of the groups is more likely to occur; however for each group the distribution $f_i$ is known exactly. This case is mainly of theoretical interest; it does however illustrate the essential ideas of discriminant analysis. A plausible discriminant rule is the Maximum Likelihood Rule (ML-rule): allocate $x$ to group $G_i$, if
\[ f_i(x) = \max_{j=1,...,p} f_j(x) \tag{9.1} \]
If the maximum is reached for several groups, then $x$ is considered to be in the union of these (for continuous distributions this occurs with probability zero). In the case of two groups the ML-rule means that $x$ is allocated to $G_1$, if $f_1(x) > f_2(x)$, or, equivalently,
\[ \log \frac{f_1(x)}{f_2(x)} > 0 \tag{9.2} \]
In the case where all probability densities are normal with equal covariance matrices we have:

Theorem 23 Suppose that each $f_i$ is a multivariate normal distribution with expected value $\mu_i$ and covariance matrix $\Sigma_i$. Suppose further that $\Sigma_1 = \Sigma_2 = ... = \Sigma_p = \Sigma$ and $\det \Sigma > 0$. Then the ML-rule is given as follows: allocate $x$ to group $G_i$, if
\[ (x - \mu_i)^t \Sigma^{-1} (x - \mu_i) = \min_{j=1,...,p} (x - \mu_j)^t \Sigma^{-1} (x - \mu_j) \tag{9.3} \]
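Rule (9.3) amounts to choosing the group whose mean is closest to $x$ in the metric defined by $\Sigma^{-1}$. The sketch below uses invented means and an invented covariance matrix; the function name is hypothetical, and returned group indices start at 0.

```python
import numpy as np

def ml_rule_equal_cov(x, means, Sigma):
    """Index i minimizing (x - mu_i)^t Sigma^{-1} (x - mu_i), together with all distances."""
    Sigma_inv = np.linalg.inv(Sigma)
    d = [float((x - mu) @ Sigma_inv @ (x - mu)) for mu in means]
    return int(np.argmin(d)), d

# Invented example (assumption): two groups with strongly correlated coordinates.
means = [np.array([0.0, 0.0]), np.array([2.0, 0.0])]
Sigma = np.array([[1.0, 0.8], [0.8, 1.0]])
x = np.array([1.2, 1.5])

group, dist = ml_rule_equal_cov(x, means, Sigma)
euclid = [float(np.linalg.norm(x - mu)) for mu in means]
```

In this example the point is closer to the second mean in Euclidean distance, yet the rule allocates it to the first group, because the covariance between the two coordinates is taken into account.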

Note that the “Mahalanobis distance” $d_i = (x - \mu_i)^t \Sigma^{-1} (x - \mu_i)$ measures how far $x$ is from the expected value $\mu_i$, while taking into account covariances between the components of the random vector $X = (X_1, ..., X_k)^t$. In particular, for $p = 2$, $x$ is allocated to $G_1$, if
\[ a^t \Big(x - \frac{1}{2}(\mu_1 + \mu_2)\Big) > 0 \tag{9.4} \]
where $a = \Sigma^{-1}(\mu_1 - \mu_2)$. Thus, we obtain a linear rule where $x$ is compared with the midpoint between $\mu_1$ and $\mu_2$.

Discriminant analysis with prior group probabilities – the Bayesian rule

Sometimes one has a priori knowledge (or belief) how likely each of the groups is to occur. Thus, it is assumed that we know the probabilities
\[ \pi_i = P(\text{observation drawn from group } G_i) \quad (i = 1, ..., p) \tag{9.5} \]
where $0 \le \pi_i \le 1$ and $\sum_i \pi_i = 1$. The conditional likelihood that the observation comes from group $G_i$ given the observed value $X = x$ is proportional to $\pi_i f_i(x)$. The natural rule is then the Bayes rule: allocate $x$ to $G_i$, if
\[ \pi_i f_i(x) = \max_{j=1,...,p} \pi_j f_j(x) \tag{9.6} \]

For the “noninformative prior” $\pi_1 = \pi_2 = ... = \pi_p = 1/p$, representing complete lack of knowledge about which groups observations are more likely to come from, the Bayes rule coincides with the ML-rule. In the case of two groups, the Bayes rule is a simple modification of the ML-rule, since $x$ is allocated to $G_1$, if
\[ \log \frac{f_1(x)}{f_2(x)} > \log \frac{\pi_2}{\pi_1} \tag{9.7} \]

Which rule is better?

The quality of a rule is judged by the probability of correct classification (or misclassification). There are two standard ways of comparing classification rules: a) comparison of individual probabilities of correct classification; and b) comparison of the overall probability of correct classification. The first criterion can be understood as follows: for a random allocation rule with probabilities $\psi_i(\cdot)$, the probability that a randomly chosen individual coming from group $G_i$ is classified into group $G_j$ is equal to
\[ p_{ji} = \int \psi_j(x) f_i(x)\, dx \tag{9.8} \]
Thus, correct classification for individuals from group $G_i$ occurs with probability $p_{ii}$ and misclassification with probability $1 - p_{ii}$. A rule $r$ with correct-classification-probabilities $p_{ii}$ is said to be at least as good as a rule $\tilde{r}$ with probabilities $\tilde{p}_{ii}$, if $p_{ii} \ge \tilde{p}_{ii}$ for all $i$. If there is at least one “>” sign, then $r$ is better. If there is no better rule than $r$, then $r$ is called admissible.

Consider now a Bayes rule $r$ with probabilities $p_{ij}$. Is there any better rule than $r$? Suppose that $\tilde{r}$ is better. Then
\[ \sum_i \pi_i p_{ii} < \sum_i \pi_i \tilde{p}_{ii}. \]
On the other hand,
\[ \sum_i \pi_i \tilde{p}_{ii} = \int \sum_i \tilde\psi_i(x) \pi_i f_i(x)\, dx \le \int \sum_i \tilde\psi_i(x) \max_j \{\pi_j f_j(x)\}\, dx = \int \max_j \{\pi_j f_j(x)\}\, dx. \]
Since $r$ is a Bayes rule, we have
\[ \max_j \{\pi_j f_j(x)\} = \sum_i \psi_i(x) \pi_i f_i(x) \]
so that finally, the inequality is:
\[ \sum_i \pi_i p_{ii} = \int \sum_i \psi_i(x) \pi_i f_i(x)\, dx \ge \sum_i \pi_i \tilde{p}_{ii} \]
which contradicts the first inequality. The conclusion is therefore that every Bayes rule is optimal in the sense that it is admissible. If there are no a priori probabilities $\pi_i$, or more exactly the noninformative prior is used, then this means that the ML-rule is optimal.

The second criterion is applicable if a priori probabilities are available: the probability of correct allocation is
\[ p_{correct} = \sum_{i=1}^{p} \pi_i p_{ii} = \sum_{i=1}^{p} \pi_i \int \psi_i(x) f_i(x)\, dx \tag{9.9} \]
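For two normal groups in one dimension, the overall probability of correct allocation (9.9) can be evaluated in closed form, which makes the comparison between the ML-rule and the Bayes rule concrete. The parameter values below are invented, the function names are hypothetical, and the sketch assumes $\mu_1 > \mu_2$ with a common standard deviation.

```python
from math import erf, log, sqrt

def Phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def p_correct(c, mu1, mu2, sigma, pi1):
    """Overall probability of correct allocation for the rule 'G1 if x > c' (mu1 > mu2)."""
    p11 = 1.0 - Phi((c - mu1) / sigma)     # group 1 correctly allocated
    p22 = Phi((c - mu2) / sigma)           # group 2 correctly allocated
    return pi1 * p11 + (1.0 - pi1) * p22

def thresholds(mu1, mu2, sigma, pi1):
    """Cut points of the ML-rule (9.4) and of the Bayes rule (9.7) in one dimension."""
    mid = 0.5 * (mu1 + mu2)
    a = (mu1 - mu2) / sigma ** 2
    return mid, mid + log((1.0 - pi1) / pi1) / a

# Invented values (assumption): group 1 is a priori much more likely.
mu1, mu2, sigma, pi1 = 1.0, -1.0, 1.0, 0.9
c_ml, c_bayes = thresholds(mu1, mu2, sigma, pi1)
p_ml = p_correct(c_ml, mu1, mu2, sigma, pi1)
p_bayes = p_correct(c_bayes, mu1, mu2, sigma, pi1)
```

With unequal priors the Bayes cut point moves away from the midpoint towards the less likely group, and `p_bayes` exceeds `p_ml`; with `pi1 = 0.5` the two cut points coincide.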

A rule is optimal if $p_{correct}$ is maximal. In contrast to admissibility, all rules can be ordered according to “classification correctness”. As before, it can be shown that the Bayes rule is optimal. Both criteria can be generalized to the case where misclassification is associated with costs that may differ for different groups.

9.2.3 Case II: Population distribution form known, parameters unknown

Suppose that each $f_i$ is known, except for a finite dimensional parameter vector $\theta_i$. Then the rules above can be adapted accordingly, replacing parameters by their estimates. The ML-rule is then: allocate $x$ to $G_i$, if
\[ f_i(x; \hat\theta_i) = \max_{j=1,...,p} f_j(x; \hat\theta_j) \tag{9.10} \]
The Bayes rule allocates $x$ to $G_i$, if
\[ \pi_i f_i(x; \hat\theta_i) = \max_{j=1,...,p} \pi_j f_j(x; \hat\theta_j) \tag{9.11} \]
The rule becomes particularly simple if $f_i$ are normal with unknown means $\mu_i$ and equal covariance matrices $\Sigma_1 = \Sigma_2 = ... = \Sigma$. Let $\bar{x}_i$ be the sample mean and $\hat\Sigma_i$ the sample covariance matrix for observations from group $G_i$. Estimating the common covariance matrix $\Sigma$ by
\[ \hat\Sigma = (n_1 \hat\Sigma_1 + n_2 \hat\Sigma_2 + ... + n_p \hat\Sigma_p)/(n - p) \]
where $n_i$ is the number of observations from $G_i$ and $n = n_1 + ... + n_p$, the ML-rule allocates $x$ to $G_i$, if
\[ (x - \bar{x}_i)^t \hat\Sigma^{-1} (x - \bar{x}_i) = \min_{j=1,...,p} (x - \bar{x}_j)^t \hat\Sigma^{-1} (x - \bar{x}_j) \tag{9.12} \]
For two groups, we have the linear ML-rule
\[ \hat{a}^t \Big(x - \frac{1}{2}(\bar{x}_1 + \bar{x}_2)\Big) > 0 \tag{9.13} \]
where $\hat{a} = \hat\Sigma^{-1}(\bar{x}_1 - \bar{x}_2)$, and the corresponding Bayes rule
\[ \hat{a}^t \Big(x - \frac{1}{2}(\bar{x}_1 + \bar{x}_2)\Big) > \log \frac{\pi_2}{\pi_1} \tag{9.14} \]
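The two-group rules (9.13) and (9.14) can be estimated from a training sample in a few lines. The sketch below uses simulated training data and hypothetical function names; group covariance matrices are computed with divisor $n_i$, and the pooled estimate divides by $n - p$ with $p = 2$ groups, following the formula in the text.

```python
import numpy as np

def fit_two_group_rule(X1, X2):
    """Estimate a_hat and the midpoint used in (9.13)-(9.14) from two training samples."""
    n1, n2 = len(X1), len(X2)
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - m1).T @ (X1 - m1) / n1           # group covariances with divisor n_i
    S2 = (X2 - m2).T @ (X2 - m2) / n2
    S = (n1 * S1 + n2 * S2) / (n1 + n2 - 2)     # pooled estimate of the common covariance
    a_hat = np.linalg.solve(S, m1 - m2)
    return a_hat, 0.5 * (m1 + m2)

def allocate(x, a_hat, mid, pi1=0.5):
    """Return 1 or 2; pi1 = 0.5 gives the ML-rule (9.13), otherwise the Bayes rule (9.14)."""
    return 1 if a_hat @ (x - mid) > np.log((1.0 - pi1) / pi1) else 2

# Simulated training data (assumption): two well separated normal groups.
rng = np.random.default_rng(5)
X1 = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(50, 2))
X2 = rng.normal(loc=[-2.0, -2.0], scale=1.0, size=(50, 2))

a_hat, mid = fit_two_group_rule(X1, X2)
label_a = allocate(np.array([2.0, 2.0]), a_hat, mid)     # expected to fall in group 1
label_b = allocate(np.array([-2.0, -2.0]), a_hat, mid)   # expected to fall in group 2
```

A strong prior for one group shifts the boundary: a point just on the group 1 side of the midpoint is still allocated to group 2 when `pi1` is very small.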

It should be emphasized here that while a linear discriminant rule is meaningful for the normal distribution, this may not be so for other distributions. For instance, if for $G_1$ a one-dimensional random variable $X$ is observed with a uniform distribution on $[-1, 1]$ and for $G_2$ the variable $X$ is uniformly distributed on $[-3, -2] \cup [2, 3]$, then the two groups can be distinguished perfectly, but not by a linear rule.

9.2.4 Case III: Population distributions completely unknown

If the population distributions $f_i$ are completely unknown, then the search for reasonable rules is more difficult. In recent literature, some rules based on nonparametric estimation or suitable projection techniques have been proposed (see e.g. Friedman 1977, Breiman 1984, Hastie et al. 1994, Polzehl 1995, Ripley 1995, Duda et al. 2000, Hand et al. 2001). The simplest, and historically most important, rule is based on Fisher’s linear discriminant function. Fisher postulated that a linear rule may often be reasonable (see however the remark in Section 9.2.3 on why this need not always be so). He proposed to find a vector $a$ such that the linear function $a^t x$ maximizes the ratio of the variability between groups to the variability within the groups.

More specifically, define $X_{n \times p} = X$ to be the $n \times p$ matrix where each row $i$ corresponds to an observed vector $x_i = (x_{i1}, ..., x_{ip})^t$. We denote the columns of $X$ by $x^{(j)}$ ($j = 1, ..., p$). The rows are assumed to be ordered according to groups, i.e. rows 1 to $n_1$ are observations from $G_1$, rows $n_1 + 1$ through $n_1 + n_2$ are from $G_2$, and so on. Moreover, define the matrix $M_{n \times n} = M = I - n^{-1} 1 \cdot 1^t$ where $I$ is the identity matrix and $1 = (1, ..., 1)^t$. We denote the submatrices of $X$ and $M$ that belong to the different groups by $X^{(j)}_{n_j \times p} = X^{(j)}$ and $M^{(j)}_{n_j \times n_j} = M^{(j)}$ respectively. The corresponding subvectors of $y = (y_1, ..., y_n)^t$ are denoted by $y^{(j)}$. Then the variability of the vector $y = Xa$, defined by

$$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2 = y^t M y = a^t X^t M X a \qquad (9.15)$$

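can be written as

$$SST = SST_{within} + SST_{between} \qquad (9.16)$$

where

$$SST_{within} = \sum_{j=1}^{p} \sum_{i=1}^{n_j} \left( y_i^{(j)} - \bar{y}^{(j)} \right)^2 = a^t W a \qquad (9.17)$$

and

$$SST_{between} = \sum_{j=1}^{p} n_j \left( \bar{y}^{(j)} - \bar{y} \right)^2 = a^t B a \qquad (9.18)$$

Here,

$$W = \sum_{j=1}^{p} n_j S_j = \sum_{j=1}^{p} [X^{(j)}]^t M^{(j)} X^{(j)}$$

is the within groups matrix and

$$B = \sum_{j=1}^{p} n_j (\bar{x}^{(j)} - \bar{x})(\bar{x}^{(j)} - \bar{x})^t$$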

the between groups matrix, $S_j$ is the sample covariance matrix of observations $x_i$ from group $G_j$, $\bar{y} = n^{-1} \sum_{j=1}^{p} \sum_{i=1}^{n_j} y_i^{(j)}$ is the overall mean, $\bar{y}^{(j)} = n_j^{-1} \sum_{i=1}^{n_j} y_i^{(j)}$ the mean in group $G_j$, and $\bar{x}^{(j)}$ and $\bar{x}$ are the corresponding (vector) means for $x$. Fisher’s linear discriminant function (or first canonical variate) is the linear function $a^t x$ where $a$ maximizes the ratio

$$Q(a) = \frac{SST_{between}}{SST_{within}} = \frac{a^t B a}{a^t W a} \qquad (9.19)$$

The solution is given by

Theorem 24 Let $a$ be the eigenvector of $W^{-1} B$ that corresponds to the largest eigenvalue. Then $Q(a)$ is maximal.

The classification rule is then: allocate $x$ to $G_i$ if

$$|a^t x - a^t \bar{x}^{(i)}| = \min_{j=1,...,p} |a^t x - a^t \bar{x}^{(j)}| \qquad (9.20)$$

If there are only $p = 2$ groups, then

$$B = \frac{n_1 n_2}{n} (\bar{x}^{(1)} - \bar{x}^{(2)})(\bar{x}^{(1)} - \bar{x}^{(2)})^t$$

has rank 1 and the only non-zero eigenvalue is

$$\mathrm{tr}(W^{-1} B) = \frac{n_1 n_2}{n} (\bar{x}^{(1)} - \bar{x}^{(2)})^t W^{-1} (\bar{x}^{(1)} - \bar{x}^{(2)})$$

with eigenvector $a = W^{-1}(\bar{x}^{(1)} - \bar{x}^{(2)})$. The discriminant rule then becomes the same as the ML-rule for normal distributions with equal covariance matrices: allocate $x$ to $G_1$ if

$$(\bar{x}^{(1)} - \bar{x}^{(2)})^t W^{-1} \left( x - \tfrac{1}{2}(\bar{x}^{(1)} + \bar{x}^{(2)}) \right) > 0 \qquad (9.21)$$

and to $G_2$ otherwise.
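For two groups, the claim that $a = W^{-1}(\bar{x}^{(1)} - \bar{x}^{(2)})$ maximizes $Q(a)$ can be checked numerically. The sketch below is only an illustration for bivariate data: it builds $W$ and $B$ as defined above, computes $a$ with a closed-form 2×2 inverse, and evaluates the ratio (9.19). The function names and the toy data are hypothetical.

```python
import math

def mean_vec(rows):
    """Coordinate-wise mean of a list of 2-d points."""
    n = len(rows)
    return [sum(r[k] for r in rows) / n for k in range(2)]

def scatter(rows):
    """n_j * S_j: sum of (x - mean)(x - mean)^t within one group."""
    m = mean_vec(rows)
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = (r[0] - m[0], r[1] - m[1])
        for u in range(2):
            for v in range(2):
                s[u][v] += d[u] * d[v]
    return s

def quad(a, m):
    """Quadratic form a^t m a for a 2-vector a and a 2x2 matrix m."""
    return sum(a[u] * m[u][v] * a[v] for u in range(2) for v in range(2))

def fisher_two_groups(g1, g2):
    """Return (a, W, B) with a = W^{-1}(mean1 - mean2)."""
    n1, n2 = len(g1), len(g2)
    m1, m2 = mean_vec(g1), mean_vec(g2)
    s1, s2 = scatter(g1), scatter(g2)
    W = [[s1[u][v] + s2[u][v] for v in range(2)] for u in range(2)]
    d = (m1[0] - m2[0], m1[1] - m2[1])
    c = n1 * n2 / (n1 + n2)
    B = [[c * d[u] * d[v] for v in range(2)] for u in range(2)]
    det = W[0][0] * W[1][1] - W[0][1] * W[1][0]
    a = ((W[1][1] * d[0] - W[0][1] * d[1]) / det,
         (-W[1][0] * d[0] + W[0][0] * d[1]) / det)
    return a, W, B

def q_ratio(a, W, B):
    """Fisher's criterion Q(a) = a^t B a / a^t W a, as in (9.19)."""
    return quad(a, B) / quad(a, W)

def allocate(x, a, g1, g2):
    """Two-group rule (9.21): group 1 if a^t (x - midpoint) > 0."""
    m1, m2 = mean_vec(g1), mean_vec(g2)
    mid = ((m1[0] + m2[0]) / 2, (m1[1] + m2[1]) / 2)
    return 1 if a[0] * (x[0] - mid[0]) + a[1] * (x[1] - mid[1]) > 0 else 2

# Hypothetical toy data with correlated coordinates.
g1 = [(1.0, 2.0), (2.0, 1.5), (3.0, 3.5), (4.0, 3.0)]
g2 = [(3.0, 0.5), (4.0, 0.0), (5.0, 2.0), (6.0, 1.5)]
a, W, B = fisher_two_groups(g1, g2)
q_a = q_ratio(a, W, B)
```

Comparing `q_a` with `q_ratio(b, W, B)` for other directions `b`, for instance unit vectors on a grid of angles, should never give a larger value.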

9.2.5 How good is an empirical discriminant rule?

If the densities $f_i$ are not known, then the classification rule as well as the probabilities $p_{ii}$ of correct classification must be estimated from the given data. In principle this is easy, since the corresponding estimates can simply be plugged into the formula for $p_{ii}$. The observed data that are used for estimation are also called the “training sample”. A problem with these estimates is, however, that the search for the optimal discriminant rule was done with the same data. Therefore, $\hat{p}_{11}$ will tend to be too optimistic (i.e. too large), unless $n$ is very large. The same is true for any method that estimates classification probabilities from the training data. One way to avoid this is to partition the data set randomly into a “training” sample that is used for estimation of the discriminant rule, and a disjoint “validation” sample that is used for estimation of classification probabilities. Obviously, this can only be done for large enough data sets. For recently developed computational methods of validation, such as the bootstrap, see e.g. Efron (1979), Läuter (1985), Fukunaga (1990), Hirst (1996), LeBlanc and Tibshirani (1996), Davison and Hinkley (1997), Chernick (1999), Good (2001).

Figure 9.1 Discriminant analysis combined with time series analysis can be used to judge purity of intonation (“Elvira” by J.B.).

9.3 Specific applications in music

9.3.1 Identification of pitch, tone separation, and purity of intonation

Weihs et al. (2001) investigate objective criteria for judging purity of intonation of singing. The acoustic data are as described in Chapter 4. In order to address the question of how to computationally assess purity of intonation, a vocal expert classified 132 selected tones of 17 performances (Figure 9.1) of Händel’s “Tochter Zion” into the classes “flat”, “correct”, and “sharp”. The opinion of the expert is assumed to be the truth. An objective measure of purity is defined by

$$\Delta = \log_{12}(\omega_{observed}) - \log_{12}(\omega_o)$$


where $\omega_o$ is the correct basic frequency, corresponding to the note in the score and adjusted to the tuning of the accompanying piano, and $\omega_{observed}$ is the actually measured frequency. Maximum likelihood discriminant analysis leads to the following classification rule: the maximal permissible error which is accepted in order to classify a tone as “correct” is about 0.4 halftones below and above the target tone. Note that this is much higher than 0.03 halftones, which is the minimal distance between frequencies a trained ear can distinguish in principle (see Pierce 1992). If a note is considered incorrect by an expert, then the estimated probability of being nevertheless classified as “correct” by the discriminant rule turns out to be 0.174. This rather high error rate may be due to several causes. “Purity of intonation” is a phenomenon that probably depends on more than just the basic frequency. Possible factors are, for instance, amount of vibrato, loudness, pitch, context (e.g. previous and subsequent notes), timbre, etc. Thus, more variables that characterize the sound may have to be incorporated, in addition to $\Delta$, in order to define a musically meaningful notion of “purity of intonation”.

9.3.2 Identification of historic periods

For a composition, consider notes modulo octave, with 0 being set equal to the most frequent note (which we will also call “basic tone”). The relative frequencies of the notes 0, ..., 11 are denoted by $p_0, ..., p_{11}$. We then set $x_1 = p_5$. Note that, if 0 is the root of the tonic triad, then 5 is the root of the subdominant. Moreover we define

$$x_2 = E = -\sum_{i=1}^{n} \log(p_i + 0.001)\, p_i$$

which is a slightly modified measure of entropy. We now describe each composition by a bivariate observation $x = (p_5, E)^t$. The question is now whether this very simple 2-dimensional descriptive statistic can tell us anything about the time when the music was composed. In view of the somewhat naive simplicity of $x$, the answer is not at all obvious. To simplify the problem, composers are divided into two groups: Group 1 = composers who died before 1800, and Group 2 = composers who died after 1800 (or are still alive). Essentially, the two groups correspond to the partition into “early music to baroque” and “classical till today”. The compositions considered here are those given in the star plot example (Section 2.7.2). In order to be able to check objectively how the procedure works, only a subset of $n = 94$ compositions is used for estimation. Applying a linear discriminant rule partitions the plane into two half planes by


[Figure 9.2: scatter plot titled “Fitted discriminant rule and training data used for estimation”; horizontal axis: entropy; vertical axis: P(Subdominant); legend: before 1800, after 1800.]

Figure 9.2 Linear discriminant analysis of compositions before and after 1800, with the training sample. The data used for the discriminant rule consists of $x = (p_5, E)$.

a straight line. Figure 9.2 shows the estimated partitioning line together with the training sample (o = before 1800, x = after 1800). Apparently, the two groups can indeed be separated quite well by the estimated straight line. This is quite surprising, given the simplicity of the two variables. As expected, however, the partition is not perfect, and it does not seem to be possible to improve it by more complicated partitioning lines. To assess how well the rule may indeed classify, we consider 50 other compositions that were not used for estimating the discriminant rule. Figure 9.3 shows that the rule works well, since almost all observations in the validation sample are classified correctly. An unusual composition is Bartók’s Bagatelle No. 3, which lies far on the left in the “wrong” group. The partitioning can be improved if the time periods of the two groups are chosen farther apart. This is done in Figures 9.4 and 9.5 with Group 1 = “Early Music to Baroque” and Group 2 = “Romantic to 20th century”. (A beautiful example of early music is displayed in Figure 9.6; also see Figures 9.7 and 9.8 for portraits of Brahms and Wagner.) Figure 9.4 shows the corresponding plot of the partition together with the data ($n = 72$). Compositions not used in the estimation are shown in Figure 9.5. Again, the rule works well, except for Bartók’s third Bagatelle.
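The two descriptors used in this section, $x_1 = p_5$ and the modified entropy $E$, are simple to compute from a list of notes. The following sketch is an illustration under several assumptions that are not stated in the text: pitches are encoded as integers in halftone steps (as in MIDI), ties for the most frequent pitch class are broken in favor of the lowest class, the logarithm is the natural one, and the sum runs over all twelve pitch classes. The function name is hypothetical.

```python
import math
from collections import Counter

def subdominant_entropy_features(pitches):
    """Return (p5, E) for a sequence of integer pitches.

    Pitch classes are shifted so that class 0 is the most frequent
    one (the "basic tone"); p5 is then the relative frequency of the
    class five halftones above it.
    """
    counts = Counter(p % 12 for p in pitches)
    # Most frequent pitch class; ties go to the lowest class (assumption).
    basic = max(range(12), key=lambda k: (counts.get(k, 0), -k))
    n = len(pitches)
    p = [counts.get((basic + k) % 12, 0) / n for k in range(12)]
    # Modified entropy; natural log over all 12 classes (assumption).
    entropy = -sum(math.log(pk + 0.001) * pk for pk in p)
    return p[5], entropy

# Hypothetical toy "composition": four C's, two F's, two G's.
toy = [60] * 4 + [65] * 2 + [67] * 2
p5, E = subdominant_entropy_features(toy)
```

Because the pitch classes are expressed relative to the most frequent note, the pair (p5, E) does not change when the whole piece is transposed.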


[Figure 9.3: scatter plot titled “Fitted discriminant rule and validation data not used for estimation”; horizontal axis: entropy; vertical axis: P(Subdominant); legend: before 1800, after 1800; one point is labeled “Bartok”.]

Figure 9.3 Linear discriminant analysis of compositions before and after 1800, with the validation sample. The data used for the discriminant rule consists of $x = (p_5, E)$.

[Figure 9.4: scatter plot titled “Fitted discriminant rule and data used for estimation”; horizontal axis: entropy; vertical axis: P(Subdominant); legend: Early & Baroque, Romantic & 20th.]

Figure 9.4 Linear discriminant analysis of “Early Music to Baroque” and “Romantic to 20th Century”. The points (“o” and “×”) belong to the training sample. The data used for the discriminant rule consists of $x = (p_5, E)$.


[Figure 9.5: scatter plot titled “Fitted discriminant rule and validation data not used for estimation”; horizontal axis: entropy; vertical axis: P(Subdominant); legend: Early & Baroque, Romantic & 20th; one point is labeled “Bartok”.]

Figure 9.5 Linear discriminant analysis of “Early Music to Baroque” and “Romantic to 20th century”. The points (“o” and “×”) belong to the validation sample. The data used for the discriminant rule consists of $x = (p_5, E)$.

Figure 9.6 Graduale written for an Augustinian monastery of the diocese Konstanz, 13th century. (Courtesy of Zentralbibliothek Zürich.) (Color figures follow page 152.)

Figure 9.7 Johannes Brahms (1833-1897). (Photograph by Maria Fellinger, courtesy of Zentralbibliothek Zürich.)

Figure 9.8 Richard Wagner (1813-1883). (Engraving by J. Bankel after a painting by C. Jäger, courtesy of Zentralbibliothek Zürich.)


CHAPTER 10

Cluster analysis

10.1 Musical motivation

In discriminant analysis, an optimal allocation rule between different groups is estimated from a training sample. The type and number of groups are known. In some situations, however, it is neither known whether the data can be divided into homogeneous subgroups nor how many subgroups there may be. Finding such clusters in previously ungrouped data is the purpose of cluster analysis. In music, one may for instance be interested in how far compositions or performances can be grouped into clusters representing different “styles”. In this chapter, a brief introduction to basic principles of statistical cluster analysis is given. For an extended account of cluster analysis see e.g. Jardine and Sibson (1971), Anderberg (1973), Hartigan (1978), Mardia et al. (1979), Seber (1984), Blashfield et al. (1985), Hand (1986), Fukunaga (1990), Arabie et al. (1996), Gordon (1999), Höppner et al. (1999), Everitt et al. (2001), Jajuga et al. (2002), Webb (2002).

10.2 Basic principles

10.2.1 Maximum likelihood classification

Suppose that observations $x_1, ..., x_n \in R^k$ are realizations of $n$ independent random variables $X_i$ ($i = 1, ..., n$). Assume further that each random variable comes from one of $p$ possible groups such that if $X_i$ comes from group $j$, then it is distributed according to a probability density $f(x; \theta_j)$. In contrast to discriminant analysis, it is not observed which groups the $x_i$ ($i = 1, ..., n$) belong to. Each observation $x_i$ is thus associated with an unobserved parameter (or label) $\eta_i$ specifying group membership. We may simply define $\eta_i = j$ if $x_i$ belongs to group $j$. Denote by $\eta = (\eta_1, ..., \eta_n)^t$ the vector of labels and, for each $j = 1, ..., p$, let $A_j = \{x_i : 1 \le i \le n, \eta_i = j\}$ be the unknown set of observations that belong to group $j$. Then the likelihood function of the observed data is

$$L(x_1, ..., x_n; \theta_1, ..., \theta_p, \eta_1, ..., \eta_n) = \prod_{j=1}^{p} \Big\{ \prod_{x_i \in A_j} f(x_i; \theta_j) \Big\} \qquad (10.1)$$

Maximizing $L$ with respect to the unknown parameters $\theta_1, ..., \theta_p$ and $\eta_1, ..., \eta_n$, we obtain ML-estimates $\hat{\theta}_1, ..., \hat{\theta}_p$, $\hat{\eta}_1, ..., \hat{\eta}_n$ and estimated sets $\hat{A}_1, ..., \hat{A}_p$. Denoting by $m$ the dimension of $\theta_j$, the number of estimated parameters is $p \cdot m + n$. This is larger than the number of observations. It can therefore not be expected that all parameters are estimated consistently. Nevertheless, the ML-estimate provides a classification rule due to the following property: suppose that we change one of the $\hat{A}_j$s by removing an observation $x_{i_o}$ from $\hat{A}_j$ and putting it into another set $\hat{A}_l$ ($l \ne j$). Then the likelihood can at most become smaller. The new likelihood is obtained from the old one by dividing by $f(x_{i_o}; \hat{\theta}_j)$ and multiplying by $f(x_{i_o}; \hat{\theta}_l)$. We therefore have the following property

$$\frac{f(x_{i_o}; \hat{\theta}_l)}{f(x_{i_o}; \hat{\theta}_j)} L(x_1, ..., x_n; \hat{\theta}_1, ..., \hat{\theta}_p, \hat{\eta}_1, ..., \hat{\eta}_n) \le L(x_1, ..., x_n; \hat{\theta}_1, ..., \hat{\theta}_p, \hat{\eta}_1, ..., \hat{\eta}_n) \qquad (10.2)$$

or, dividing by $L$ (assuming that it is not zero),

$$f(x; \hat{\theta}_j) \ge f(x; \hat{\theta}_l) \quad \text{for } x \in \hat{A}_j \qquad (10.3)$$

This is identical with the ML-allocation rule in discriminant analysis. The only, but essential, difference here is that $\eta$ is unknown, i.e. our sample (“training data”) gives us only information about the distribution of $X$ but not about $\eta$. This makes the task much more difficult. In particular, since the number of unknown parameters is too large in general, maximum likelihood clustering is not only computationally difficult, but its asymptotic performance may also fail to stabilize sufficiently. In special cases, however, a simple method can be obtained. Suppose, for instance, that the distributions in the groups are multivariate normal with means $\mu_j$ and covariance matrices $\Sigma_j$. Then the ML-estimates of these parameters, given $\eta$, are the group sample means

$$\bar{x}_j(\eta) = \frac{1}{n_j(\eta)} \sum_{i:\, x_i \in A_j(\eta)} x_i$$

and group sample covariance matrices

$$\hat{\Sigma}_j(\eta) = \frac{1}{n_j(\eta)} \sum_{i:\, x_i \in A_j(\eta)} (x_i - \bar{x}_j(\eta))(x_i - \bar{x}_j(\eta))^t$$

respectively. The log-likelihood function then reduces to a constant minus $\frac{1}{2} \sum_{j=1}^{p} n_j \log |\hat{\Sigma}_j|$. Maximization with respect to $\eta$ leads to the estimate

$$\hat{\eta} = \arg\min_{\eta} h(\eta) \qquad (10.4)$$

where

$$h(\eta) = \prod_{j=1}^{p} |\hat{\Sigma}_j(\eta)|^{n_j(\eta)} \qquad (10.5)$$

Computationally this means that the function $h(\eta)$ is evaluated for all groupings $\eta$ of the observations $x_1, ..., x_n$, and the estimate is the grouping that minimizes $h(\eta)$. Clearly, this is a computationally demanding task. A simpler rule is obtained if we assume that all covariance matrices are equal to a common covariance matrix $\Sigma$. Then

$$\hat{\eta} = \arg\min_{\eta} |\hat{\Sigma}| = \arg\min_{\eta} \Big| n^{-1} \sum_{j=1}^{p} n_j \hat{\Sigma}_j \Big| = \arg\min_{\eta} \Big| \sum_{j=1}^{p} n_j \hat{\Sigma}_j \Big| \qquad (10.6)$$
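For a very small data set, the minimization in (10.6) can be carried out by brute force. The sketch below is only an illustration: it treats one-dimensional observations, for which the determinant of $\sum_j n_j \hat{\Sigma}_j$ is simply the pooled within-group sum of squares, and it considers two groups. The first observation is always placed in group 1 so that relabelled copies of the same partition are not counted twice. The function names and the data are hypothetical.

```python
from itertools import product

def within_ss(values):
    """n_j times the group variance: sum of squared deviations."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_two_group_split(x):
    """Exhaustive search over all two-group labelings of 1-d data.

    Returns (labels, criterion, number_of_labelings_examined).
    """
    best_labels, best_crit, examined = None, None, 0
    # Observation 0 is fixed in group 1; the others are free.
    for tail in product((1, 2), repeat=len(x) - 1):
        labels = (1,) + tail
        g1 = [v for v, lab in zip(x, labels) if lab == 1]
        g2 = [v for v, lab in zip(x, labels) if lab == 2]
        crit = within_ss(g1) + within_ss(g2)
        examined += 1
        if best_crit is None or crit < best_crit:
            best_labels, best_crit = labels, crit
    return best_labels, best_crit, examined

# Hypothetical toy data with two obvious clusters.
data = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8]
labels, crit, examined = best_two_group_split(data)
```

Even for six observations, 32 labelings are examined, and the number doubles with every additional observation, which is why exhaustive search is limited to very small samples.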

Even in this simplified form, finding the best clustering is computationally demanding. For instance, if data have to be divided into two groups, then the number of possible assignments for which $\sum_{j=1}^{2} n_j \hat{\Sigma}_j$ may differ is equal to $2^{n-1}$. In addition, if the number of groups is not known a priori, then a suitable, and usually computationally costly, method for estimating $p$ must be applied. As a matter of principle it should also be noted that if normal distributions or any other distributions with overlapping domains are assumed, then there are no perfect clusters. Even if the distributions were known, an observation $x$ can be from any group with $f_i(x) > 0$, with positive probability, so that one can never be absolutely sure where it belongs.

A variation of ML-clustering is obtained if the groups themselves are associated with probabilities. Let $\pi_j$ be the probability that a randomly sampled observation comes from group $j$. In analogy to the arguments above, maximization of the likelihood with respect to all parameters including $\pi_j$ ($j = 1, ..., p$) leads to a Bayesian allocation rule with $\hat{\pi}_j$ as prior distribution.

10.2.2 Hierarchical clustering

ML-clustering yields a partition of observations into $p$ groups. Sometimes it is desirable to obtain a sequence of clusters, e.g. starting with two main groups and then subdividing these into increasingly homogeneous clusters. This is particularly suitable for data where a hierarchy is expected, as is the case, for instance, in music. Generally speaking, a hierarchical method has the following property: a partitioning into $p + 1$ clusters consists of

• two clusters whose union is equal to one of the clusters from the partitioning into $p$ groups
• $p - 1$ clusters that are identical with $p - 1$ clusters of the partitioning into $p$ groups.

In a first step, data are transformed into a matrix $D = (d_{ij})_{i,j=1,...,n}$ of distances or a matrix $S = (s_{ij})_{i,j=1,...,n}$ of similarities. The definition of distance and similarity used in cluster analysis is more general than the usual definition of a metric:

Definition 54 Let $X$ be an arbitrary set and $d : X \times X \to R$ a real valued function such that for all $x, y \in X$


D1. $d(x, y) = d(y, x)$
D2. $d(x, y) \ge 0$
D3. $d(x, x) = 0$

Then $d$ is called a distance. If in addition we also have

D4. $d(x, y) = 0 \Leftrightarrow x = y$
D5. $d(x, z) \le d(x, y) + d(y, z)$ (triangle inequality),

then $d$ is a metric.

A measure of similarity is usually assumed to have the following properties:

Definition 55 Let $X$ be an arbitrary set and $s : X \times X \to R$ a real valued function such that for all $x, y \in X$

S1. $s(x, y) = s(y, x)$
S2. $s(x, y) > 0$
S3. $s(x, y)$ increases with increasing “similarity”.

Then $s$ is called a measure of similarity.

Axiom S3 is of course somewhat subjective, since it depends on what is meant exactly by “similarity”. Table 10.1 gives examples of distances and measures of similarity.

Suppose now that, for an observed data set $x_1, ..., x_n$, we can define a distance matrix $D = (d_{ij})_{i,j=1,...,n}$ where $d_{ij}$ denotes the distance between vectors $x_i$ and $x_j$. A hierarchical clustering algorithm tries to group the data into a hierarchy of clusters in such a way that the distances within these clusters are generally much smaller than those between the clusters. Numerous algorithms are available in the literature. The reason for the variety of solutions is that in general the result depends on various “free choices”, such as the sequence in which clusters are built or the definition of distance between clusters. For illustration, we give the definition of the complete linkage (or furthest neighbor) algorithm:

1. Set a threshold $d_o$.

2. Start with the initial clusters $A_1^{(o)} = \{x_1\}, ..., A_n^{(o)} = \{x_n\}$ and set $i = 1$. The distances between the clusters are defined by $d_{jl}^{(o)} = d(A_j^{(o)}, A_l^{(o)}) = d(x_j, x_l)$. This gives the $n \times n$ distance matrix $D^{(o)} = (d_{jl}^{(o)})_{j,l=1,...,n}$.

3. Join the two clusters for which the distance $d_{jl}^{(i-1)}$ is minimal, thus obtaining new clusters $A_1^{(i)}, ..., A_{n-i}^{(i)}$.

4. Calculate the new distances between clusters by

$$d_{jl}^{(i)} = d(A_j^{(i)}, A_l^{(i)}) = \max_{x \in A_j^{(i)},\, y \in A_l^{(i)}} d(x, y) \qquad (10.7)$$

and the corresponding $(n-i) \times (n-i)$ distance matrix $D^{(i)}$ with elements $d_{jl}^{(i)}$ ($j, l = 1, ..., n - i$).


5. If

$$\max_{j,l=1,...,n-i} d_{jl}^{(i)} \le d_o \qquad (10.8)$$

then stop. Otherwise, set $i = i + 1$ and go to step 3.

Note in particular that for the final clusters, the maximal distance within each cluster is at most $d_o$. As a result, the final clusters tend to be very “compact”. A related method is the so-called nearest neighbor (or single linkage) algorithm. It is identical with the above except that the distance between clusters is defined as the minimal distance between points in the two clusters. This can lead to so-called “chaining” in the form of elongated clusters.

Table 10.1 Some measures of distance and similarity between $x = (x_1, ..., x_k)^t$, $y = (y_1, ..., y_k)^t \in R^k$. For some of the distances, it is assumed that a data set of observations in $R^k$ is available to calculate sample variances $s_j^2$ ($j = 1, ..., k$) and a $k \times k$ sample covariance matrix $S$.

Name | Definition | Comments
Euclidian distance | $d(x, y) = \sqrt{\sum_{i=1}^{k} (x_i - y_i)^2}$ | Usual distance in $R^k$
Pearson distance | $d(x, y) = \sqrt{\sum_{i=1}^{k} (x_i - y_i)^2 / s_i^2}$ | Standardized Euclidian
Mahalanobis distance | $d(x, y) = \sqrt{(x - y)^t S^{-1} (x - y)}$ | Standardized Euclidian
Manhattan metric | $d(x, y) = \sum_{i=1}^{k} w_i |x_i - y_i|$ ($w_i \ge 0$) | Less sensitive to outliers
Minkowski metric | $d(x, y) = \left( \sum_{i=1}^{k} w_i |x_i - y_i|^{\lambda} \right)^{1/\lambda}$ ($\lambda \ge 1$) | For $\lambda = 1$: Manhattan
Bhattacharyya distance | $d(x, y) = \left( \sum_{i=1}^{k} (\sqrt{x_i} - \sqrt{y_i})^2 \right)^{1/2}$ | For $x_i, y_i \ge 0$ (example: proportions)
Binary similarity | $s(x, y) = k^{-1} \sum_{i=1}^{k} x_i y_i$ | Suitable for $x_i = 0, 1$
Simple matching coefficient | $s(x, y) = k^{-1} \sum_{i=1}^{k} a_i$, $a_i = x_i y_i + (1 - x_i)(1 - y_i)$ | Suitable for $x_i = 0, 1$
Gower’s similarity coefficient | $s(x, y) = 1 - k^{-1} \sum_{i=1}^{k} w_i |x_i - y_i|$, $w_i = 1$ if $x_i$ qualitative, $w_i = 1/R_i$ if quantitative (with $R_i$ = range of $i$th coordinate) | Suitable if some $x_i$ qualitative, some $x_i$ quantitative
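The agglomerative procedure described in steps 1 to 5 can be sketched in a few lines. This is an illustration, not a reference implementation. It uses the Euclidian distance and one common convention for the threshold: merging stops as soon as the two closest clusters are farther apart than $d_o$. That reading of the stopping rule is an assumption on our part, chosen because it guarantees the compactness property noted above (no within-cluster distance exceeds $d_o$ under complete linkage). Passing `link=min` instead of `link=max` gives single linkage. All names are hypothetical.

```python
import math

def euclid(x, y):
    """Euclidian distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def agglomerate(points, d0, link=max):
    """Agglomerative clustering with a distance threshold d0.

    link=max: complete linkage (furthest neighbor);
    link=min: single linkage (nearest neighbor).
    Returns a list of clusters, each a sorted list of point indices.
    """
    # Every observation starts as its own cluster.
    clusters = [[i] for i in range(len(points))]

    def cluster_dist(a, b):
        return link(euclid(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        d, j, l = min(
            (cluster_dist(clusters[j], clusters[l]), j, l)
            for j in range(len(clusters))
            for l in range(j + 1, len(clusters))
        )
        # Assumed stopping rule: no pair is within the threshold.
        if d > d0:
            break
        # Join the two clusters.
        clusters[j] = clusters[j] + clusters[l]
        del clusters[l]
    return [sorted(c) for c in clusters]

# Hypothetical toy data: four equally spaced points and one far away.
pts = [(0.0,), (1.0,), (2.0,), (3.0,), (10.0,)]
complete = agglomerate(pts, d0=1.5, link=max)
single = agglomerate(pts, d0=1.5, link=min)
```

On this toy example complete linkage keeps the clusters compact, whereas single linkage chains the four equally spaced points into one elongated cluster, which illustrates the remark on “chaining”.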


For other algorithms and further properties see the references given at the beginning of this chapter, and references therein.

10.2.3 HISMOOTH and HIWAVE clustering

HISMOOTH and HIWAVE models, as defined in Chapter 5, can be used to extract dominating features of a time series $y(t)$ that are related to an explanatory series $x(t)$. Suppose that we have several $y$-series, $y_j(t)$ ($j = 1, ..., N$), that share the same explanatory series $x(t)$. An interesting question is then to what extent features related to $x(t)$ are similar, and which series have more in common than others. One way to answer the question consists of the following clustering algorithm:

1. For each series $y_j(t)$, fit a HISMOOTH or HIWAVE model, thus obtaining a decomposition $y_j(t) = \hat{\mu}_j(t, x_t) + e_j(t)$ where $\hat{\mu}_j$ is the estimated expected value of $y_j$ given $x(t)$.
2. Perform a cluster analysis of the fitted curves $\hat{\mu}_j(t, x_t)$.

10.3 Specific applications in music

10.3.1 Distribution of notes

Consider the distribution $p_j$ ($j = 0, 1, ..., 11$) of notes modulo 12 as defined for the star plots in Chapter 2. Can the visual impression of star plots in Figure 2.31 be confirmed by cluster analysis? We consider the transformed data vectors $\zeta = (\zeta_1, ..., \zeta_{11})^t$, with $\zeta_j = \log(p_j / (1 - p_j))$, for the following compositions:

1) Anonymus: Saltarello (13th century); Saltarello (14th century); Troto (13th century); Alle psalite (13th century);
2) A. de la Halle (1235?-1287): Or est Bayard en la pature, hure!;
3) J. Ockeghem (1425-1495): Canon epidiatesseron;
4) J. Arcadelt (1505-1568): Ave Maria; La Ingratitud; Io dico fra noi;
5) W. Byrd (1543-1623): Ave Verum Corpus; Alman; The Queen’s Alman;
6) J. Dowland (1562-1626): The Frog Galliard; The King of Denmark’s Galliard; Come again;
7) H.L. Hassler (1564-1612): Galliarda; Kyrie from Missa Secunda; Sanctus et Benedictus from Missa Secunda;
8) Palestrina (1525-1594): Jesu! Rex admirablis; O bone Jesu; Pueri hebraeorum;
9) J.H. Schein (1586-1630): Banchetto musicale;
10) J.S. Bach (1685-1750): Preludes and Fugues 1-24 from “Das Wohltemperierte Klavier”;
11) J. Haydn (1732-1809): Sonata op. 34/3 (Figure 10.3);
12) W.A. Mozart (1756-1791): Sonata KV 545 (2nd Mv.); Sonata KV 281 (2nd Mv.); Sonata KV 332 (2nd Mv.); Sonata KV 333 (2nd Mv.);
13) C. Debussy (1862-1918): Claire de lune; Arabesque 1; Reflections dans l’eau;
14) A. Schönberg (1874-1951): op. 19/2 (Figure 10.4);
15) A. Webern (1883-1945): Orchesterstück op. 6, No. 6;
16) Bartók (1881-1945): Bagatelles No.


[Figure 10.1: dendrogram titled “Distribution of notes modulo 12 - complete linkage”; leaves are labeled by composer.]

Figure 10.1 Complete linkage clustering of log-odds-ratios of note-frequencies.

[Figure 10.2: dendrogram titled “Distribution of notes modulo 12 - single linkage”; leaves are labeled by composer.]

Figure 10.2 Single linkage clustering of log-odds-ratios of note-frequencies.

Figure 10.3 Joseph Haydn (1732-1809). (Title page of a biography published by the Allgemeine Musik-Gesellschaft Zürich, 1830; courtesy of Zentralbibliothek Zürich.)

1-3; Piano Sonata (2nd Mv.);
17) O. Messiaen (1908-1992): Vingts regards de Jesu No. 3;
18) T. Takemitsu (1930-1996): Rain tree sketch No. 1.

Figure 10.1 shows the result of complete linkage clustering of the vectors $(\zeta_1, ..., \zeta_{11})^t$, based on the Euclidian distance and $d_o = 5$. The most striking feature is the clear separation of “early music” from the rest. Moreover, the 20th century composers considered here are in a separate cluster, except for Bartók’s Bagatelle No. 3 (and Debussy, who may be considered as belonging to the 19th and 20th centuries). In contrast, clusters provided by a single linkage algorithm are less easy to interpret. Figure 10.2 illustrates a typical result of this method, namely long narrow clusters where the maximal distance within a cluster can be quite large. In our example this does


Figure 10.4 Klavierstück op. 19, No. 2 by Arnold Schönberg. (Facsimile; used by permission of Belmont Music Publishers.)


[Figure 10.5: dendrogram titled “Clusters of entropies - complete linkage”; leaves: Bach Cello Suites I to VI (1st movements) and Preludes and Fugues No. 1 and 8 from “Das Wohltemperierte Klavier” I.]

Figure 10.5 Complete linkage clustering of entropies.

not seem appropriate, since, due to the “organic” historic development of music, the effect of chaining is likely to be particularly pronounced.

10.3.2 Entropies

Consider entropies as defined in Chapter 3. More specifically, we define for each composition a vector $y = (E_1, ..., E_{10})^t$. After standardization of each coordinate, cluster analysis is applied to the following compositions by J.S. Bach: Cello Suites No. I to VI (1st movement from each); Preludes and Fugues No. 1 and 8 from “Das Wohltemperierte Klavier” (each separately). The complete linkage algorithm leads to a clear separation of the Cello Suites from “Das Wohltemperierte Klavier”, as displayed in Figure 10.5.

10.3.3 Tempo curves

One of the obvious questions with respect to the tempo curves in Figure 2.3 is whether one can find clusters of similar performances. Applying complete linkage cluster analysis (with the euclidian distance) to the raw data yields the clusters in Figure 10.6. Cortot and Horowitz appear to have very individual styles, since they build distinct clusters on their own. It should be noted, however, that this does not imply that other pianists do not have their own styles. Cortot and Horowitz simply happen to be the lucky ones


[Figure 10.6: dendrogram titled “Clusters of tempo curves - complete linkage”; leaves are labeled by pianist.]

Figure 10.6 Complete linkage clustering of tempo.

who are represented more than once in the sample, so that the consistency of their performances can be checked empirically. Figure 10.6 also shows that Cortot is somewhat of an “outlier”, since his cluster separates from all other pianists at the top level.

10.3.4 Tempo curves and melodic structure

Cluster analysis alone does not explain the meaning of the observed clusters. In particular, we do not know which musically meaningful characteristics determine the clustering of tempo curves. In contrast, cluster analysis based on HISMOOTH or HIWAVE models provides a way to gain more insight. The fitted HISMOOTH curves in Figures 5.9a through d extract essential features that make comparisons easier. The estimated bandwidths can be interpreted as a measure of how much emphasis a pianist puts on global and local features respectively. Figure 10.7 shows clusters based on the fitted HISMOOTH curves. In contrast to the original data, complete and single linkage turn out to yield almost the same clusters. Thus, applying the HISMOOTH fit first leads to a stabilization of results. From Figure 10.7, we may identify about six main clusters, namely:

• A: KRUST, KATSARIS, SCHNABEL;


[Figure 10.7: dendrogram titled “Clusters of HISMOOTH fits - complete linkage”; leaves are labeled by pianist.]

Figure 10.7 Complete linkage clustering of HISMOOTH-fits to tempo curves.

• B: MOISEIWITSCH, NOVAES, ORTIZ;
• C: DEMUS, CORTOT1, CORTOT2, CORTOT3, ARGERICH, SHELLEY, CAPOVA;
• D: ARRAU, BUNIN, KUBALEK, CURZON, GIANOLI;
• E: ASKENAZE, DAVIES;
• F: HOROWITZ1, HOROWITZ2, HOROWITZ3, ZAK, ESCHENBACH, NEY, KLIEN, BRENDEL.

This is related to the grouping of the vectors of estimated bandwidths, $(b_1, b_2, b_3)^t \in R_+^3$. In Figure 10.8, the x- and y-coordinates correspond to $b_1$ and $b_2$ respectively, and the radius of a circle is proportional to $b_3$. The letters A through F identify locations where one or more observations from that cluster occur. The picture shows that only a few distinct values of $b_1$ and $b_2$ occur. Particularly striking are the large bandwidths for clusters A and B. Apparently, these pianists emphasize mostly larger structures of the composition. Also note that the clusters do not separate equally well in each projection. Apart from clusters A and B, one cannot “order” the performances in terms of large versus small bandwidth. Overall, one may conclude that HISMOOTH-clustering together with analytic indicator functions provides a better understanding of essential characteristics of musical performance (Figure 10.9).


[Figure 10.8: symbol plot; horizontal axis: $b_1$; vertical axis: $b_2$; circles are labeled with the cluster letters A to F.]

Figure 10.8 Symbol plot of HISMOOTH bandwidths for tempo curves. The radius of each circle is proportional to a constant plus $\log b_3$; the horizontal and vertical axes are equal to $b_1$ and $b_2$ respectively. The letters A–F indicate where at least one observation from the corresponding cluster occurs.

Figure 10.9 Maurizio Pollini (*1942). (Courtesy of Philippe Gontier, Paris.)


CHAPTER 11

Multidimensional scaling

11.1 Musical motivation

In some situations data consist of distances only. These distances are not necessarily euclidian, so that they do not necessarily correspond to a configuration of points in a euclidian space. The question addressed by multidimensional scaling (MDS) is to what extent one may nevertheless find points in a hopefully low-dimensional euclidian space that have exactly or approximately the observed distances. The procedure is mainly an exploratory tool that helps to find structure in distance data. We give a brief introduction to the basic principles of MDS. For a detailed discussion and an extended bibliography see, for instance, Kruskal and Wish (1978), Cox and Cox (1994), Everitt and Rabe-Hesketh (1997), Borg and Groenen (1997), Schiffman (1997); also see textbooks on multivariate statistics, such as the ones given in the previous chapters. For the origins of MDS and early references see Young and Householder (1941), Guttman (1954), Shepard (1962a,b), Kruskal (1964a,b), Ramsay (1977).

11.2 Basic principles

11.2.1 Basic definitions

In MDS, any symmetric $n \times n$ matrix $D = (d_{ij})_{i,j=1,...,n}$ with $d_{ij} \ge 0$ and $d_{ii} = 0$ is called a distance matrix. Note that this corresponds to the axioms D1, D2, and D3 in the previous chapter. If instead of distances, a similarity matrix $S = (s_{ij})_{i,j=1,...,n}$ is given, then one can define a corresponding distance matrix by a suitable transformation. One possible transformation is, for instance,

$$d_{ij} = \sqrt{s_{ii} - 2 s_{ij} + s_{jj}} \qquad (11.1)$$

The question addressed by metric MDS can be formulated as follows: given an $n \times n$ distance matrix $D$, can one find a dimension $k$ and $n$ points $x_1, ..., x_n$ in $R^k$ such that these points have a distance matrix $\tilde{D}$ with $\tilde{D}$ approximately, or even exactly, equal to $D$? Clearly one prefers low dimensions ($k = 2$ or 3, if possible), since it is then easy to display the points graphically. On the other hand, the dimension cannot be too low if one wants to obtain a good approximation of $D$, and hence a realistic picture of structures in the data. As an alternative to metric MDS, one may also consider


non-metric methods where one tries to find points in a euclidian space such that the ranking of the distances remains the same, whereas their nominal values may differ. 11.2.2 Metric MDS In the ideal case, the metric solution constructs n points x1 , ..., xn ∈ Rk ˜ with elements for some k such that their euclidian distance matrix D, t ˜ dij = (xi − xj ) (xi − xj ), is exactly equal to the original distance matrix D. If this is possible, then D is called euclidian. The condition under which this is possible is as follows: Theorem 25 D = Dn×n = (dij )i,j=1,...,n is euclidian if and only if the matrix B = Bn×n = M AM is positive semidefinite, where M = (I − n−1 11t ), I = In×n is the identity matrix, 1 = (1, ..., 1)t and A = An×n has elements 1 aij = − d2ij (i, j = 1, ..., n). 2 The reason for positive semidefiniteness of B is that if D is indeed a euclidian matrix corresponding to points x1 , ..., xn ∈ Rk , then bij = (xi − x ¯)t (xj − x¯)

(11.2)

so that B defines a “centered” scalar product for these points. In matrix form we have $B = (MX)(MX)^t$, where the n rows of $X_{n \times k}$ correspond to the vectors $x_i$ (i = 1, ..., n). Since for any matrix C, the matrices $C^t C$ and $C C^t$ are positive semidefinite, so is B.

The construction of the points $x_1, \dots, x_n$ given $D = D_{n \times n}$ (or $B_{n \times n} \ge 0$) is done as follows: suppose that B is of rank k ≤ n. Since B is a symmetric matrix, we have the spectral decomposition

$$B = C \Lambda C^t = Z Z^t \qquad (11.3)$$

where Λ is the n × n diagonal matrix with eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_k > 0$ and $\lambda_j = 0$ (j > k) on the diagonal, and $Z = Z_{n \times n} = (z_{ij})_{i,j=1,\dots,n}$ is the n × n matrix whose first k columns $z^{(j)}$ (j = 1, ..., k) are equal to the first k eigenvectors, scaled so that $z^{(j)t} z^{(j)} = \lambda_j$ (that is, $Z = C \Lambda^{1/2}$). Then the rows

$$x_i = (z_{i1}, \dots, z_{ik})^t \quad (i = 1, \dots, n) \qquad (11.4)$$

of Z are points in $\mathbb{R}^k$ with distance matrix D.

In practice, the following difficulties can occur:
1. D is euclidian, but k is too large to be of any use (after all, the purpose is to obtain an interpretable picture of the data);
2. D is not euclidian, with a) all $\lambda_i$ positive, or b) some $\lambda_i$ negative.
Because of these problems, one often uses a rough approximation of D, based on a small number of eigenvectors that correspond to positive eigenvalues.

Finally, note that if instead of distances, similarities are given and the similarity matrix S is positive semidefinite, then S can be transformed into a euclidian distance matrix by defining

$$d_{ij} = \sqrt{s_{ii} - 2 s_{ij} + s_{jj}} \qquad (11.5)$$
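The construction in (11.2) to (11.5) can be illustrated numerically. The following is only a minimal sketch, assuming Python with NumPy; the function names (`classical_mds`, `is_euclidian`, `similarity_to_distance`, `pairwise_distances`), the tolerance `tol` and the two toy distance matrices are our own illustrative choices and do not come from the text.

```python
import numpy as np

def classical_mds(D, k=2):
    """Metric MDS sketch: return (X, eigenvalues of B) for a distance matrix D."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    A = -0.5 * D ** 2                        # a_ij = -d_ij^2 / 2
    M = np.eye(n) - np.ones((n, n)) / n      # centering matrix M = I - 11^t / n
    B = M @ A @ M
    eigvals, eigvecs = np.linalg.eigh(B)     # eigh returns ascending eigenvalues
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    lam = np.clip(eigvals[:k], 0.0, None)    # negative eigenvalues contribute nothing
    X = eigvecs[:, :k] * np.sqrt(lam)        # rows of X are the points x_i
    return X, eigvals

def is_euclidian(eigvals, tol=1e-9):
    """Check of Theorem 25: B positive semidefinite up to rounding error."""
    return bool(eigvals.min() >= -tol * max(1.0, np.abs(eigvals).max()))

def similarity_to_distance(S):
    """Transformation (11.5): d_ij = sqrt(s_ii - 2 s_ij + s_jj)."""
    S = np.asarray(S, dtype=float)
    s = np.diag(S)
    return np.sqrt(np.clip(s[:, None] - 2.0 * S + s[None, :], 0.0, None))

def pairwise_distances(X):
    """Euclidian distance matrix of the rows of X."""
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

# Toy example 1: the corners of the unit square (an exactly euclidian D).
square = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
D_square = pairwise_distances(square)
X, eigvals = classical_mds(D_square, k=2)

# Toy example 2: d_13 > d_12 + d_23, so no euclidian configuration exists.
D_bad = np.array([[0.0, 1.0, 3.0], [1.0, 0.0, 1.0], [3.0, 1.0, 0.0]])
_, eigvals_bad = classical_mds(D_bad, k=2)

print(is_euclidian(eigvals), is_euclidian(eigvals_bad))
```

The recovered points `X` reproduce `D_square` only up to rotation, reflection and translation, which is all that a distance matrix can determine. Clipping negative eigenvalues corresponds to the rough approximation mentioned above, where only eigenvectors with positive eigenvalues are used.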

11.2.3 Non-metric MDS

For qualitative data, or generally observations in non-metric spaces, distances can only be interpreted in terms of ranking. For instance, the subjective judgement of an audience may be that a composition by Webern is slightly more “difficult” than one by Wagner, but much more difficult than one by Mozart, thus defining a larger distance between Webern and Mozart than between Webern and Wagner. It may, however, not be possible to express distances between the compositions by numbers that could be interpreted directly. In such cases, D is often called a dissimilarity matrix rather than a distance matrix. Since only the relative size of distances is meaningful, various computationally demanding algorithms have been developed in the literature for defining points in a euclidian space such that the ranking of the distances remains the same (e.g. Shepard 1962a,b, Kruskal 1964a,b, Guttman 1968, Lingoes and Roskam 1973).
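To make the role of the ranking concrete, one may measure how well a given configuration preserves the order of the dissimilarities. The sketch below computes a stress value of the type introduced by Kruskal (1964a), using a monotone (pool-adjacent-violators) fit; it only evaluates a candidate configuration and does not search for an optimal one. It assumes Python 3.8 or later (for `math.dist`); the function names and the toy data are hypothetical illustrations.

```python
from itertools import combinations
from math import dist, sqrt

def pava(values):
    """Pool-adjacent-violators: least-squares nondecreasing fit to `values`."""
    blocks = []                                # each block is [sum, count]
    for v in values:
        blocks.append([v, 1])
        # Merge blocks while the previous block mean exceeds the last one.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted

def stress(dissim, points):
    """Stress of `points` with respect to the ranking of `dissim` (0 = perfect)."""
    pairs = sorted(combinations(range(len(points)), 2),
                   key=lambda p: dissim[p[0]][p[1]])
    d = [dist(points[i], points[j]) for i, j in pairs]   # configuration distances
    d_hat = pava(d)                                      # best monotone fit
    num = sum((a - b) ** 2 for a, b in zip(d, d_hat))
    den = sum(a ** 2 for a in d)
    return sqrt(num / den)

# Three points on a line; their distances are 1, 2 and 3.
points = [(0.0,), (1.0,), (3.0,)]

# Dissimilarities with the same ranking as the distances (here: their squares).
same_ranking = [[0, 1, 9], [1, 0, 4], [9, 4, 0]]
# Dissimilarities with the ranking reversed.
reversed_ranking = [[0, 5, 1], [5, 0, 3], [1, 3, 0]]

stress_same = stress(same_ranking, points)
stress_reversed = stress(reversed_ranking, points)
print(stress_same, stress_reversed)
```

In the first case the stress is zero although the numerical values of the dissimilarities differ from the distances, which is exactly the sense in which only the ranking matters.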

11.2.4 Chronological ordering

Suppose a distance matrix D (or a similarity matrix S) is given and one would like to find out whether there is a natural ordering of the observational units. For instance, a listener may assign a distance matrix between various musical pieces without knowing anything about these pieces a priori. The question then may be whether the listener’s distance matrix corresponds approximately to the sequence in time when the pieces were composed. This problem is also called seriation. MDS provides a possible solution in the following way: if the distances expressed the temporal (or any other) sequence exactly, then the configuration of points found by MDS would be one-dimensional. In the more realistic case where distances are only partially due to the temporal sequence, the points in $\mathbb{R}^k$ should be scattered around a one-dimensional, not necessarily straight, line in $\mathbb{R}^k$. In the simplest case, this may already be visible in a two-dimensional plot.
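The idealized case, in which the distances express the temporal sequence exactly, can be sketched as follows. This is an illustration under our own assumptions (Python with NumPy, hypothetical years, and a helper named `first_mds_coordinate`), not an analysis from the text: the distances are absolute differences of the years, and the ordering is read off from the first MDS coordinate.

```python
import numpy as np

def first_mds_coordinate(D):
    """First coordinate of the metric MDS solution for a distance matrix D."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    M = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = M @ (-0.5 * D ** 2) @ M
    eigvals, eigvecs = np.linalg.eigh(B)         # ascending eigenvalues
    return eigvecs[:, -1] * np.sqrt(max(eigvals[-1], 0.0))

# Hypothetical years of composition, given in arbitrary order.
years = np.array([1600, 1900, 1750, 1820, 1685])

# Distances that express the temporal sequence exactly.
D = np.abs(years[:, None] - years[None, :])

# Sorting by the first MDS coordinate recovers the sequence,
# up to a reversal (the sign of an eigenvector is arbitrary).
order = np.argsort(first_mds_coordinate(D))
recovered = years[order].tolist()
print(recovered)
```

With real data the ordering would only be approximate, and the direction of time has to be fixed from outside information, since MDS cannot distinguish a sequence from its reversal.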


11.3 Specific applications in music

11.3.1 Seriation by simple descriptive statistics

Suppose we would like to guess which time a composition is from, without listening to the music but instead using an algorithm. There is a large amount of music theory that can be used to determine the time when a composition was written. One may wonder, however, whether there may be a very simple computational way of guessing. Consider, for instance, the following frequencies: $x_i = p_{i-1}$ (i = 1, ..., 12) are the relative frequencies of notes modulo 12 centered around the central tone, as defined in section 9.3.2. Moreover, set $x_{13}$ equal to the relative frequency of a sequence of four notes following the sequence of interval steps 3, 3 and 3. This corresponds to an arpeggio of the diminished seventh chord. Thus, we consider a vector $x = (x_1, \dots, x_{13})^t$ with coordinates corresponding to proportions. An appropriate measure of distance between proportions is the Bhattacharyya distance (Bhattacharyya 1946b) given in Table 10.1, namely

$$d(x, y) = \left( \sum_{i=1}^{k} \left( \sqrt{x_i} - \sqrt{y_i} \right)^2 \right)^{1/2}.$$
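The distance formula above translates directly into code. The following short sketch uses only the Python standard library; the function names and the two example vectors are hypothetical and are not data from the analysis described here.

```python
from math import sqrt

def bhattacharyya_distance(x, y):
    """d(x, y) = (sum_i (sqrt(x_i) - sqrt(y_i))^2)^(1/2) for proportions x, y."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return sqrt(sum((sqrt(a) - sqrt(b)) ** 2 for a, b in zip(x, y)))

def distance_matrix(vectors):
    """Symmetric matrix of pairwise Bhattacharyya distances."""
    return [[bhattacharyya_distance(u, v) for v in vectors] for u in vectors]

# Two hypothetical vectors of proportions.
x = [0.5, 0.5, 0.0]
y = [0.25, 0.25, 0.5]

d_xy = bhattacharyya_distance(x, y)
D = distance_matrix([x, y])
print(d_xy)
```

A matrix built in this way is symmetric with zero diagonal and can be passed to any MDS routine that accepts a distance matrix.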

This is not the euclidian distance between x and y, so it is not a priori clear whether a suitable low-dimensional representation of the observations in a euclidian space is possible. MDS with $k^* = 2$ yields the points in Figure 11.1. Three time periods are distinguished by using different symbols for the points. The periods are defined in a very simple way, namely by date of birth of the composer: a) before 1720 (“early to baroque”; see e.g. Figure 11.3); b) 1720-1880 (“classical to romantic”); and c) 1880 or later (“20th century”). The configuration of the respective points does show an “effect” of time. The three time periods can be associated with regional clusters, though the regions overlap. An outlier from the middle category is Schoenberg. This is due to the crude definition of the time periods: Schoenberg (in particular his op. 19/2) clearly belongs to the 20th century; he just happens to have been born a little too early (1874), and is therefore classified as “classical to romantic”. The dependence between time period and second MDS-coordinate can also be seen by comparing boxplots (Figure 11.2).

[Figure 11.1: scatter plot of x2 against x1; different symbols mark composers born before 1720, 1720-1880, and 1880 or later; the point for Schoenberg is labeled.]

Figure 11.1 Two-dimensional multidimensional scaling of compositions ranging from the 13th to the 20th century, based on frequencies of intervals and interval sequences.

11.3.2 Perception and music psychology

MDS is frequently used to analyze data that consist of subjective distances between musical sounds (e.g. with respect to pitch or timbre) or compositions obtained in controlled experiments. Typical examples are Grey and Gordon (1978), Gromko (1993), Ueda and Ohgushi (1987), Wedin (1972), Wedin and Goude (1972), Markuse and Schneider (1995). Since it is not known to what extent the cognitive “metric” may correspond approximately to a euclidian distance, MDS is a useful method to investigate this question, to simplify high-dimensional distance data and possibly find interesting structures. Grey and Gordon consider perceptual effects of timbres characterized by spectra. For a related study see Wedin and Goude (1972). Gromko (1993) carries out an MDS analysis to study perceptual differences between expert and novice music listeners. Ueda and Ohgushi (1987) study perceptual components of pitch and use MDS to obtain a spatial representation of pitch.


[Figure 11.2: boxplots with vertical axis ticks from -0.4 to 0.6, for three groups: birth before 1720, 1720-1880, 1880 and later.]

Figure 11.2 Boxplots of second MDS-component where compositions are classified according to three time periods.

Figure 11.3 Fragment of a graduale from the 14th century. (Courtesy of Zentralbibliothek Zürich.)

Figure 11.4 Muzio Clementi (1752-1832). (Lithography by H. Bodmer, courtesy of Zentralbibliothek Zürich.)

Figure 11.5 Freddy (by J.B.) and Johannes Brahms (1833-1897) going for a drink. (Caricature from a contemporary newspaper; courtesy of Zentralbibliothek Zürich.)


List of figures

Figure 1.1: Quantitative analysis of music helps to understand creative processes. (Pierre Boulez, photograph courtesy of Philippe Gontier, Paris; and “Jim” by J.B.)
Figure 1.2: J.S. Bach (1685-1750). (Engraving by L. Sichling after a painting by Elias Gottlob Haussmann, 1746; courtesy of Zentralbibliothek Zürich.)
Figure 1.3: Ludwig van Beethoven (1770-1827). (Drawing by E. Dürck after a painting by J.K. Stieler, 1819; courtesy of Zentralbibliothek Zürich.)
Figure 1.4: Anton Webern (1883-1945). (Courtesy of Österreichische Post AG.)
Figure 1.5: Gottfried Wilhelm Leibniz (1646-1716). (Courtesy of Deutsche Post AG and Elisabeth von Janota-Bzowski.)
Figure 1.6: W.A. Mozart (1756-1791) (authorship uncertain) – Spiegel-Duett.
Figure 1.7: Wolfgang Amadeus Mozart (1756-1791). (Engraving by F. Müller after a painting by J.W. Schmidt; courtesy of Zentralbibliothek Zürich.)
Figure 1.8: The torus of thirds $Z_3 + Z_4$.
Figure 1.9: Arnold Schönberg – Sketch for the piano concert op. 42 – notes with tone row and its inversions and transpositions. (Used by permission of Belmont Music Publishers.)
Figure 1.10: Notes of “Air” by Henry Purcell. (For better visibility, only a small selection of related “motifs” is marked.)
Figure 1.11: Notes of Fugue No. 1 (first half) from “Das Wohltemperierte Klavier” by J.S. Bach. (For better visibility, only a small selection of related “motifs” is marked.)
Figure 1.12: Notes of op. 68, No. 2 from “Album für die Jugend” by Robert Schumann. (For better visibility, only a small selection of related “motifs” is marked.)
Figure 1.13: A miraculous transformation caused by high exposure to Wagner operas. (Caricature from a 19th century newspaper; courtesy of Zentralbibliothek Zürich.)


Figure 1.14: Graphical representation of pitch and onset time in $Z_{71}^2$ together with instrumentation of polygonal areas. (Excerpt from Śānti – Piano concert No. 2 by Jan Beran, col legno CD 20062; courtesy of col legno, Germany.)
Figure 1.15: Iannis Xenakis (1922-2001). (Courtesy of Philippe Gontier, Paris.)
Figure 1.16: Ludwig van Beethoven (1770-1827). (Courtesy of Zentralbibliothek Zürich.)
Figure 2.1: Robert Schumann (1810-1856) – Träumerei op. 15, No. 7.
Figure 2.2: Tempo curves of Schumann’s Träumerei performed by Vladimir Horowitz.
Figure 2.3: Twenty-eight tempo curves of Schumann’s Träumerei performed by 24 pianists. (For Cortot and Horowitz, three tempo curves were available.)
Figure 2.4: Boxplots of descriptive statistics for the 28 tempo curves in Figure 2.3.
Figure 2.5: q-q-plots of several tempo curves (from Figure 2.3).
Figure 2.6: Frequencies of notes 0,1,...,11 for moving windows of onset-length 16.
Figure 2.7: Frequencies of notes 0,1,...,11 for moving windows of onset-length 16.
Figure 2.8: Johannes Chrysostomus Wolfgangus Theophilus Mozart (1756-1791) in the house of Salomon Gessner in Zurich. (Courtesy of Zentralbibliothek Zürich.)
Figure 2.9: R. Schumann (1810-1856) – lithography by H. Bodmer. (Courtesy of Zentralbibliothek Zürich.)
Figure 2.10: Acceleration of tempo curves for Cortot and Horowitz.
Figure 2.11: Tempo acceleration – correlation with other performances.
Figure 2.12: Martha Argerich – interpolation of tempo curve by cubic splines.
Figure 2.13: Smoothed tempo curves $\hat{g}_1(t) = (n b_1)^{-1} \sum_i K\left(\frac{t - t_i}{b_1}\right) y_i$ ($b_1 = 8$).
Figure 2.14: Smoothed tempo curves $\hat{g}_2(t) = (n b_2)^{-1} \sum_i K\left(\frac{t - t_i}{b_2}\right) [y_i - \hat{g}_1(t)]$ ($b_2 = 1$).
Figure 2.15: Smoothed tempo curves $\hat{g}_3(t) = (n b_3)^{-1} \sum_i K\left(\frac{t - t_i}{b_3}\right) [y_i - \hat{g}_1(t) - \hat{g}_2(t)]$ ($b_3 = 1/8$).
Figure 2.16: Smoothed tempo curves – residuals $\hat{e}(t) = y_i - \hat{g}_1(t) - \hat{g}_2(t) - \hat{g}_3(t)$.


Figure 2.17: Melodic indicator – local polynomial fits together with first and second derivatives.
Figure 2.18: Tempo curves (Figure 2.3) – first derivatives obtained from local polynomial fits (span 24/32).
Figure 2.19: Tempo curves (Figure 2.3) – second derivatives obtained from local polynomial fits (span 8/32).
Figure 2.20: Kinderszene No. 4 – sound wave of performance by Horowitz at the Royal Festival Hall in London on May 22, 1982.
Figure 2.21: log(Amplitude) and tempo for Kinderszene No. 4 – auto- and cross correlations (Figure 2.24a), scatter plot with fitted least squares and robust lines (Figure 2.24b), time series plots (Figure 2.24c), and sharpened scatter plot (Figure 2.24d).
Figure 2.22: Horowitz’ performance of Kinderszene No. 4 – log(tempo) versus log(Amplitude) and boxplots of log(tempo) for three ranges of amplitude.
Figure 2.23: Horowitz’ performance of Kinderszene No. 4 – two-dimensional histogram of (x, y) = (log(tempo), log(Amplitude)) displayed in a perspective and image plot respectively.
Figure 2.24: Horowitz’ performance of Kinderszene No. 4 – kernel estimate of two-dimensional distribution of (x, y) = (log(tempo), log(Amplitude)) displayed in a perspective and image plot respectively.
Figure 2.25: R. Schumann, Träumerei op. 15, No. 7 – density of melodic indicator with sharpening region (a) and melodic curve plotted against onset time, with sharpening points highlighted (b).
Figure 2.26: R. Schumann, Träumerei op. 15, No. 7 – tempo by Cortot and Horowitz at sharpening onset times.
Figure 2.27: R. Schumann, Träumerei op. 15, No. 7 – tempo “derivatives” for Cortot and Horowitz at sharpening onset times.
Figure 2.28: Arnold Schönberg (1874-1951), self-portrait. (Courtesy of Verwertungsgesellschaft Bild-Kunst, Bonn.)
Figure 2.29: a) Chernoff faces for 1. Saltarello (Anonymus, 13th century); 2. Prelude and Fugue No. 1 from “Das Wohltemperierte Klavier” (J. S. Bach, 1685-1750); 3. Kinderszene op. 15, No. 1 (R. Schumann, 1810-1856); 4. Piano piece op. 19, No. 2 (A. Schönberg, 1874-1951); 5. Rain Tree Sketch 1 (T. Takemitsu, 1930-1996); b) Chernoff faces for the same compositions as in figure 2.29a, after permuting coordinates.
Figure 2.30: The minnesinger Burchard von Wengen (1229-1280), contemporary of Adam de la Halle (1235?-1288). (From Codex Manesse, courtesy of the University Library Heidelberg.) (Color figures follow page 168.)


Figure 2.31: Star plots of $p_j^* = (p_6, p_{11}, p_4, p_9, p_2, p_7, p_{12}, p_5, p_{10}, p_3, p_8)^t$ for compositions from the 13th to the 20th century.
Figure 2.32: Symbol plot of the distribution of successive interval pairs $(\Delta y(t_i), \Delta y(t_{i+1}))$ (2.36a, c) and their absolute values (b, d) respectively, for the upper envelopes of Bach’s Präludium No. 1 (Das Wohltemperierte Klavier I) and Mozart’s Sonata KV 545 (beginning of 2nd movement).
Figure 2.33: Symbol plot of the distribution of successive interval pairs $(\Delta y(t_i), \Delta y(t_{i+1}))$ (a, c) and their absolute values (b, d) respectively, for the upper envelopes of Scriabin’s Prélude op. 51, No. 4 and F. Martin’s Prélude No. 6.
Figure 2.34: Symbol plot with $x = p_{j5}$, $y = p_{j7}$ and radius of circles proportional to $p_{j1}$.
Figure 2.35: Symbol plot with $x = p_{j5}$, $y = p_{j7}$ and radius of circles proportional to $p_{j6}$. (Color figures follow page 168.)
Figure 2.36: Symbol plot with $x = p_{j5}$, $y = p_{j7}$. The rectangles have width $p_{j1}$ (diminished second) and height $p_{j6}$ (augmented fourth). (Color figures follow page 168.)
Figure 2.37: Symbol plot with $x = p_{j5}$, $y = p_{j7}$, and triangles defined by $p_{j1}$ (diminished second), $p_{j6}$ (augmented fourth) and $p_{j10}$ (diminished seventh). (Color figures follow page 168.)
Figure 2.38: Names plotted at locations $(x, y) = (p_{j5}, p_{j7})$. (Color figures follow page 168.)
Figure 2.39: Profile plots of $p_j^* = (p_5, p_{10}, p_3, p_8, p_1, p_6, p_{11}, p_4, p_9, p_2, p_7)^t$.
Figure 3.1: Ludwig Boltzmann (1844-1906). (Courtesy of Österreichische Post AG.)
Figure 3.2: Fractal pictures (by Céline Beran, computer generated.) (Color figures follow page 168.)
Figure 3.3: György Ligeti (*1923). (Courtesy of Philippe Gontier, Paris.)
Figure 3.4: Comparison of entropies 1, 2, 3, and 4 for J.S. Bach’s Cello Suite No. I and R. Schumann’s op. 15, No. 2, 3, 4, and 7, and op. 68, No. 2 and 16.
Figure 3.5: Alexander Scriabin (1871-1915) (at the piano) and the conductor Serge Koussevitzky. (Painting by Robert Sterl, 1910; courtesy of Gemäldegalerie Neuer Meister, Dresden, and Robert-Sterl-House.)
Figure 3.6: Comparison of entropies 9 and 10 for Bach, Schumann, and Scriabin/Martin.
Figure 3.7: Metric, melodic, and harmonic global indicators for Bach’s Canon cancricans.
Figure 3.8: Robert Schumann (1810-1856). (Courtesy of Zentralbibliothek Zürich.)


Figure 3.9: Metric, melodic, and harmonic global indicators for Schumann’s op. 15, No. 2 (upper figure), together with smoothed versions (lower figure).
Figure 3.10: Metric, melodic, and harmonic global indicators for Schumann’s op. 15, No. 7 (upper figure), together with smoothed versions (lower figure).
Figure 3.11: Metric, melodic, and harmonic global indicators for Webern’s Variations op. 27, No. 2 (upper figure), together with smoothed versions (lower figure).
Figure 3.12: R. Schumann – Träumerei: motifs used for specific melodic indicators.
Figure 3.13: R. Schumann – Träumerei: indicators of individual motifs.
Figure 3.14: R. Schumann – Träumerei: contributions of individual motifs to overall melodic indicator.
Figure 3.15: R. Schumann – Träumerei: overall melodic indicator.
Figure 4.1: Sound wave of c and f played on a piano.
Figure 4.2: Zoomed piano sound wave – shaded area in Figure 4.1.
Figure 4.3: Periodogram of piano sound wave in Figure 4.2.
Figure 4.4: Sound wave of e played on a harpsichord.
Figure 4.5: Periodogram of harpsichord sound wave in Figure 4.4.
Figure 4.6: Harpsichord sound – periodogram plots for different time frames (moving windows of time points).
Figure 4.7: A harpsichord sound and its spectrogram. Intense pink corresponds to high values of $I(t, \lambda)$. (Color figures follow page 168.)
Figure 4.8: A harpsichord sound wave (a), logarithm of squared amplitudes (b), histogram of the series (c) and its periodogram on log-scale (d) together with fitted SEMIFAR-spectrum.
Figure 4.9: Log-frequencies with fitted SEMIFAR-trend and log-log-periodogram together with SEMIFAR-fit for Bach’s first Cello Suite (1st movement; a,b) and Paganini’s Capriccio No. 24 (c,d) respectively.
Figure 4.10: Local variability with fitted SEMIFAR-trend and log-log-periodogram together with SEMIFAR-fit for Bach’s first Cello Suite (1st movement; a,b) and Paganini’s Capriccio No. 24 (c,d) respectively.
Figure 4.11: Niccolò Paganini (1782-1840). (Courtesy of Zentralbibliothek Zürich.)
Figure 5.1: Simulated signal (a) and wavelet coefficients (b); (c) and (d): wavelet components of simulated signal in a; (e) and (f): wavelet components of simulated signal in a and frequency plot of coefficients.
Figure 5.2: Decomposition of x-series in simulated HIWAVE model.


Figure 5.3: Simulated HIWAVE model - explanatory series $g_1$ (a), y-series (b), y versus x (c), y versus $g_1$ (d), y versus $g_2 = x - g_1$ (e) and time frequency plot of y (f).
Figure 5.4: HIWAVE time series and fitted function $\hat{g}_1$.
Figure 5.5: Hierarchical decomposition of metric, melodic, and harmonic indicators for Bach’s “Canon cancricans” (Das Musikalische Opfer BWV 1079) and Webern’s Variation op. 27, No. 2.
Figure 5.6: Quantitative analysis of performance data is an attempt to understand “objectively” how musicians interpret a score without attaching any subjective judgement. (Left: “Freddy” by J.B.; right: J.S. Bach, woodcut by Ernst Würtemberger, Zürich. Courtesy of Zentralbibliothek Zürich.)
Figure 5.7: Most important melodic curves obtained from HIREG fit to tempo curves for Schumann’s Träumerei.
Figure 5.8: Successive aggregation of HIREG-components for tempo curves by Ashkenazy and Horowitz (third performance).
Figure 5.9 a and b: HISMOOTH-fits to tempo curves (performances 1-14); Figure 5.9 c and d: HISMOOTH-fits to tempo curves (performances 15-28).
Figure 5.10: Time frequency plots for Cortot’s and Horowitz’s three performances.
Figure 5.11: Wavelet coefficients for Cortot’s and Horowitz’s three performances.
Figure 5.12: Tempo curves – approximation by most important 2 best basis functions.
Figure 5.13: Tempo curves – approximation by most important 5 best basis functions.
Figure 5.14: Tempo curves – approximation by most important 10 best basis functions.
Figure 5.15: Tempo curves (a) by Cortot (three curves on top) and Horowitz, $R^2$ obtained in HIWAVE-fit plotted against trial cut-off parameter η (b) and fitted HIWAVE-curves (c).
Figure 5.16: First derivative of tempo curves (a) by Cortot (three curves on top) and Horowitz, $R^2$ obtained in HIWAVE-fit plotted against trial cut-off parameter η (b) and fitted HIWAVE-curves (c).
Figure 5.17: Second derivative of tempo curves (a) by Cortot (three curves on top) and Horowitz, $R^2$ obtained in HIWAVE-fit plotted against trial cut-off parameter η (b) and fitted HIWAVE-curves (c).
Figure 6.1: Jean-Philippe Rameau (1683-1764). (Engraving by A. St. Aubin after J. J. Cafferi, Paris after 1764; courtesy of Zentralbibliothek Zürich.)


Figure 6.2: Frédéric Chopin (1810-1849). (Courtesy of Zentralbibliothek Zürich.)
Figure 6.3: Stationary distributions $\hat{\pi}_j$ (j = 1, ..., 11) of Markov chains with state space $Z_{12} \setminus \{0\}$, estimated for the transition between successive intervals.
Figure 6.4: Cluster analysis based on stationary Markov chain distributions for compositions by Bach, Mozart, Haydn, Chopin, Schumann, Brahms, and Rachmaninoff.
Figure 6.5: Cluster analysis based on stationary Markov chain distributions of torus distances for compositions by Bach, Mozart, Haydn, Chopin, Schumann, Brahms, and Rachmaninoff.
Figure 6.6: Comparison of log odds ratios $\log(\hat{\pi}_1/\hat{\pi}_2)$ of stationary Markov chain distributions of torus distances.
Figure 6.7: Comparison of log odds ratios $\log(\hat{\pi}_1/\hat{\pi}_3)$ of stationary Markov chain distributions of torus distances.
Figure 6.8: Comparison of log odds ratios $\log(\hat{\pi}_2/\hat{\pi}_3)$ of stationary Markov chain distributions of torus distances.
Figure 6.9: Comparison of log odds ratios $\log(\hat{\pi}_1/\hat{\pi}_3)$ and $\log(\hat{\pi}_2/\hat{\pi}_3)$ of stationary Markov chain distributions of torus distances.
Figure 6.10: Comparison of stationary Markov chain distributions of torus distances.
Figure 6.11: Log odds ratios $\log(\hat{\pi}_1/\hat{\pi}_3)$ and $\log(\hat{\pi}_2/\hat{\pi}_3)$ plotted against date of birth of composer.
Figure 6.12: Johannes Brahms (1833-1897). (Courtesy of Zentralbibliothek Zürich.)
Figure 7.1: Béla Bartók – statue by Varga Imre in front of the Béla Bartók Memorial House in Budapest. (Courtesy of the Béla Bartók Memorial House.)
Figure 7.2: Sergei Prokoffieff as a child. (Courtesy of Karadar Bertoldi Ensemble; www.karadar.net/Ensemble/.)
Figure 7.3: Circular representation of compositions by J. S. Bach (Präludium und Fuge No. 5 from “Das Wohltemperierte Klavier”), D. Scarlatti (Sonata Kirkpatrick No. 125), B. Bartók (Bagatelles No. 3), and S. Prokoffieff (Visions fugitives No. 8).
Figure 7.4: Boxplots of $\hat{\lambda}_1$, $\bar{R}$, $d$ and $\log m$ for notes modulo 12, comparing Bach, Scarlatti, Bartók, and Prokoffieff.
Figure 7.5: Circular representation of intervals of successive notes in the following compositions: J. S. Bach (Präludium und Fuge No. 5 from “Das Wohltemperierte Klavier”), D. Scarlatti (Sonata Kirkpatrick No. 125), B. Bartók (Bagatelles No. 3), and S. Prokoffieff (Visions fugitives No. 8).


Figure 7.6: Boxplots of $\hat{\lambda}_1$, $\bar{R}$, $d$ and $\log m$ for note intervals modulo 12, comparing Bach, Scarlatti, Bartók, and Prokoffieff.
Figure 7.7: Circular representation of notes ordered according to circle of fourths in the following compositions: J. S. Bach (Präludium und Fuge No. 5 from “Das Wohltemperierte Klavier”), D. Scarlatti (Sonata Kirkpatrick No. 125), B. Bartók (Bagatelles No. 3), and S. Prokoffieff (Visions fugitives No. 8).
Figure 7.8: Boxplots of $\hat{\lambda}_1$, $\bar{R}$, $d$ and $\log m$ for notes modulo 12 ordered according to circle of fourths, comparing Bach, Scarlatti, Bartók and Prokoffieff.
Figure 7.9: Circular representation of intervals of successive notes ordered according to circle of fourths in the following compositions: J. S. Bach (Präludium und Fuge No. 5 from “Das Wohltemperierte Klavier”), D. Scarlatti (Sonata Kirkpatrick No. 125), B. Bartók (Bagatelles No. 3), and S. Prokoffieff (Visions fugitives No. 8).
Figure 7.10: Boxplots of $\hat{\lambda}_1$, $\bar{R}$, $d$ and $\log m$ for note intervals modulo 12 ordered according to circle of fourths, comparing Bach, Scarlatti, Bartók, and Prokoffieff.
Figure 8.1: Tempo curves for Schumann’s Träumerei: skewness for the eight parts A1, A2, A1, A2, B1, B2, A1, A2 for 28 performances, plotted against the number of the part.
Figure 8.2: Schumann’s Träumerei: screeplot for skewness.
Figure 8.3: Schumann’s Träumerei: loadings for PCA of skewness.
Figure 8.4: Schumann’s Träumerei: symbol plot of principal components $z_2, \dots, z_5$ for PCA of tempo skewness.
Figure 8.5: Schumann’s Träumerei: tempo curves by Cortot, Horowitz, Brendel, and Gianoli.
Figure 8.6: Air by Henry Purcell (1659-1695).
Figure 8.7: Screeplot for PCA of entropies.
Figure 8.8: Loadings for PCA of entropies.
Figure 8.9: Entropies – symbol plot of the first four principal components.
Figure 8.10: Entropies – symbol plot of principal components no. 2-5.
Figure 8.11: F. Martin (1890-1971). (Courtesy of the Société Frank Martin and Mrs. Maria Martin.)
Figure 8.12: F. Martin (1890-1971) - manuscript from 8 Préludes. (Courtesy of the Société Frank Martin and Mrs. Maria Martin.)
Figure 9.1: Discriminant analysis combined with time series analysis can be used to judge purity of intonation (“Elvira” by J.B.).
Figure 9.2: Linear discriminant analysis of compositions before and after 1800, with the training sample. The data used for the discriminant rule consists of $x = (p_5, E)$.


Figure 9.3: Linear discriminant analysis of compositions before and after 1800, with the validation sample. The data used for the discriminant rule consists of $x = (p_5, E)$.
Figure 9.4: Linear discriminant analysis of “Early Music to Baroque” and “Romantic to 20th Century”. The points (“o” and “×”) belong to the training sample. The data used for the discriminant rule consists of $x = (p_5, E)$.
Figure 9.5: Linear discriminant analysis of “Early Music to Baroque” and “Romantic to 20th century”. The points (“o” and “×”) belong to the validation sample. The data used for the discriminant rule consists of $x = (p_5, E)$.
Figure 9.6: Graduale written for an Augustinian monastery of the diocese Konstanz, 13th century. (Courtesy of Zentralbibliothek Zürich.) (Color figures follow page 168.)
Figure 9.7: Johannes Brahms (1833-1897). (Photograph by Maria Fellinger, courtesy of Zentralbibliothek Zürich.)
Figure 9.8: Richard Wagner (1813-1883). (Engraving by J. Bankel after a painting by C. Jäger, courtesy of Zentralbibliothek Zürich.)
Figure 10.1: Complete linkage clustering of log-odds-ratios of note-frequencies.
Figure 10.2: Single linkage clustering of log-odds-ratios of note-frequencies.
Figure 10.3: Joseph Haydn (1732-1809). (Title page of a biography published by the Allgemeine Musik-Gesellschaft Zürich, 1830; courtesy of Zentralbibliothek Zürich.)
Figure 10.4: Klavierstück op. 19, No. 2 by Arnold Schönberg. (Facsimile; used by permission of Belmont Music Publishers.)
Figure 10.5: Complete linkage clustering of entropies.
Figure 10.6: Complete linkage clustering of tempo.
Figure 10.7: Complete linkage clustering of HISMOOTH-fits to tempo curves.
Figure 10.8: Symbol plot of HISMOOTH bandwidths for tempo curves. The radius of each circle is proportional to a constant plus $\log b_3$; the horizontal and vertical axes are equal to $b_1$ and $b_2$ respectively. The letters A–F indicate where at least one observation from the corresponding cluster occurs.
Figure 10.9: Maurizio Pollini (*1942). (Courtesy of Philippe Gontier, Paris.)
Figure 11.1: Two-dimensional multidimensional scaling of compositions ranging from the 13th to the 20th century, based on frequencies of intervals and interval sequences.
Figure 11.2: Boxplots of second MDS-component where compositions are classified according to three time periods.


Figure 11.3: Fragment of a graduale from the 14th century. (Courtesy of Zentralbibliothek Zürich.)
Figure 11.4: Muzio Clementi (1752-1832). (Lithography by H. Bodmer, courtesy of Zentralbibliothek Zürich.)
Figure 11.5: Freddy (by J.B.) and Johannes Brahms (1833-1897) going for a drink. (Caricature from a contemporary newspaper; courtesy of Zentralbibliothek Zürich.)


References

Akaike, H. (1973a). Information theory and an extension of the maximum likelihood principle. In: Second International Symposium on Information Theory, B.N. Petrow and F. Csaki (eds.), Akademiai Kiado, Budapest, 267-281.
Akaike, H. (1973b). Maximum likelihood identification of Gaussian autoregressive moving average models. Biometrika, Vol. 60, 255-265.
Akaike, H. (1979). A Bayesian extension of the minimum AIC procedure of autoregressive model fitting. Biometrika, Vol. 66, 237-242.
Albert, A.A. (1956). Fundamental Concepts of Higher Algebra. University of Chicago Press, Chicago.
Anderberg, M.R. (1973). Cluster Analysis for Applications. Academic Press, New York and London.
Anderson, T.W. (1984). An Introduction to Multivariate Statistical Analysis (2nd ed.). Wiley, New York.
Andreatta, M. (1997). Group-theoretical methods applied to music. PhD thesis, University of Sussex.
Andreatta, M., Noll, T., Agon, C. and Assayag, G. (2001). The geometrical groove: rhythmic canons between theory, implementation and musical experiment. In: Les Actes des 8èmes Journées d'Informatique Musicale, Bourges 7-9 juin 2001, p. 93-97.
Antoniadis, A. and Oppenheim, G. (1995). Wavelets and Statistics. Lecture Notes in Statistics, No. 103, Springer, New York.
Arabie, P., Hubert, L.J. and De Soete, G. (1996). Clustering and Classification. World Scientific Pub., London.
Archibald, B. (1972). Some thoughts on symmetry in early Webern. Persp. New Music, 10, 159-163.
Ash, R.B. (1965). Information Theory. Wiley, New York.
Ashby, W.R. (1956). An Introduction to Cybernetics. Wiley, New York.
Babbitt, M. (1960). Twelve-tone invariants as compositional determinant. Musical Quarterly, 46, 245-259.
Babbitt, M. (1961). Set structure as a compositional determinant. JMT, 5, No. 2, 72-94.
Babbitt, M. (1987). Words about Music. Dembski A. and Straus J.N. (eds.), University of Wisconsin Press, Madison.
Backus, J. (1969). The Acoustical Foundations of Music. W.W. Norton & Co., New York (reprinted 1977).
Bailhache, P. (2001). Une Histoire de l’Acoustique Musicale. CNRS Editions.
Balzano, G.J. (1980). The group-theoretic description of 12-fold and microtonal pitch systems. Computer Music Journal, Vol. 4, No. 4, 66-84.


Barnard, G.A. (1951). The theory of information. J. Royal Statist. Soc., Series B, Vol. 13, 46-69. Bartlett, M.S. (1955). An Introduction to Stochastic Processes. Cambridge University Press, Cambridge. Batschelet, E. (1981). Circular Statistics. Academic Press, London. Beament, J. (1997). The Violin Explained: Components, Mechanism, and Sound. Oxford University Press, Oxford. Benade, A.H. (1976). Fundamentals of Musical Acoustics. Oxford University Press, Oxford. (Reprinted by Dover in 1990). Benson, D. (1995-2002). Mathematics and Music. Internet Lecture Notes, Department of Mathematics, University of Georgia, USA (available at http://www.math.uga.edu/~djb/html/math-music.html). Beran, J. (1987). Aniseikonia. H.O.E. (Bison Records). Beran, J. (1991). Cirri. Centaur Records, CRC 2100. Beran, J. (1994). Statistics for Long-Memory Processes. Chapman & Hall, New York. Beran, J. (1995). Maximum likelihood estimation of the differencing parameter for invertible short- and long-memory ARIMA models. J. R. Statist. Soc., Series B, Vol. 57, No.4, 659-672. Beran, J. (1998) Modeling and objective distinction of trends, stationarity and long-range dependence. Proceedings of the VIIth International Congress of Ecology - INTECOL 98, Farina, A., Kennedy, J. and Boss´ u, V. (Eds.), p. 41. ánti. col legno, WWE 1CD 20062 (http://www.col-legno.de). Beran, J. (2000). S¯ Beran, J. and Feng. Y. (2002a). SEMIFAR models – a semiparametric framework for modelling trends, long-range dependence and nonstationarity. Computational Statistics & Data Analysis, Vol. 40, No. 2, 393-419. Beran, J. and Feng, Y. (2002b). Iterative plug-in algorithms for SEMIFAR models – definition, convergence, and asymptotic properties. J. Computational Graphical Statist., Vol. 11, No. 3, 690-713. Beran, J. and Ghosh, S. (2000). Estimation of the dominating frequency for stationary and nonstationary fractional autoregressive processes. J. Time Series Analysis, Vol. 21, No. 5, 513-533. Beran, J. and Mazzola, G. 
(1992). Immaculate Concept. SToA music, 1 CD 1002.92, Zürich. Beran, J. and Mazzola, G. (1999). Analyzing musical structure and performance - a statistical approach. Statistical Science, Vol. 14, No. 1, pp. 47-79. Beran, J. and Mazzola, G. (1999). Visualizing the relationship between two time series by hierarchical smoothing. J. Computational Graphical Statist., Vol. 8, No. 2, pp. 213-238. Beran, J. and Mazzola, G. (2000). Timing Microstructure in Schumann’s “Träumerei” as an Expression of Harmony, Rhythm, and Motivic Structure in Music Performance. Computers Mathematics Appl., Vol. 39, No. 5-6, pp. 99-130. Beran, J. and Mazzola, G. (2001). Musical composition and performance – statistical decomposition and interpretation. Student, Vol. 4, No. 1, 13-42. Beran, J. and Ocker, D. (1999). SEMIFAR forecasts, with applications to foreign


exchange rates. J. Statistical Planning Inference, 80, 137-153. Beran, J. and Ocker, D. (2001). Volatility of stock market indices - an analysis based on SEMIFAR models. J. Bus. Economic Statist., Vol. 19, No. 1, 103-116. Berg, R.E. and Stork, D.G. (1995). The Physics of Sound (2nd ed.). Prentice Hall, New Jersey. Berry, W. (1987). Structural Function in Music. Dover, Mineola. Besag, J. (1989). Towards Bayesian image analysis. J. Appl. Statistics, Vol. 16, 395-407. Besicovitch, A.S. (1935). On the sum of digits of real numbers represented in the dyadic system (On sets of fractional dimensions II). Mathematische Annalen, Vol. 110, 321-330. Besicovitch, A.S. and Ursell, H.D. (1937). Sets of fractional dimensions (V): On dimensional numbers of some continuous curves. J. London Mathematical Society, Vol. 29, 449-459. Bhattacharyya, A. (1946a). On some analogues of the amount of information and their use in statistical estimation. Sankhya, Vol. 8, 1-14. Bhattacharyya, A. (1946b). On a measure of divergence between two multinomial populations. Sankhya, 7, 401-406. Billingsley, P. (1986). Probability and Measure (2nd ed.). Wiley, New York. Blashfield, R.K. and Aldenderfer, M.S. (1985). Cluster Analysis. Sage, London. Boltzmann, L. (1896). Vorlesungen über Gastheorie. Johann Ambrosius Barth, Leipzig. Borg, I. and Groenen, P. (1997). Modern Multidimensional Scaling: Theory and Applications. Springer, New York. Bowman, A.W. and Azzalini, A. (1997). Applied Smoothing Techniques for Data Analysis: The Kernel Approach with S-Plus Illustrations. Oxford University Press, Oxford. Box, G.E.P. and Jenkins, G.M. (1970). Time Series Analysis: Forecasting and Control. Holden-Day, San Francisco. Breiman, L. (1984). Classification and Regression Trees. CRC Press, Boca Raton. Brémaud, P. (1999). Markov Chains. Springer, New York. Brillouin, L. (1956). Science and Information Theory. Academic Press, New York. Brillinger, D. (1981). Time Series Data Analysis and Theory (expanded ed.). 
Holden-Day, San Francisco. Brillinger, D. and Irizarry, R.A. (1998). An investigation of the second- and higher-order spectra of music. Signal Processing, Vol. 65, 161-179. Brigham, E.O. (1988). The Fast Fourier Transform and Applications. Prentice Hall, New Jersey. Brockwell, P.J. and Davis, R.A. (1991). Time series: Theory and methods (2nd ed.). Springer, New York. Brown, E.N. (1990). A note on the asymptotic distribution of the parameter estimates for the harmonic regression model. Biometrika, Vol. 77, No. 3, 653-656. Chai, W. and Vercoe, B. (2001). Folk Music Classification Using Hidden Markov Models. Proceedings of International Conference on Artificial Intelligence, June 2001 (//web.media.mit.edu/∼ chaiwei/papers/chai ICAI183.pdf). Chambers, J., Cleveland, W., Kleiner, B., and Tukey, P. (1983). Graphical Methods for Data Analysis. Wadsworth Publishing Company: Belmont, California. Chernick, M.R. (1999). Bootstrap Methods: A Practitioner’s Guide. Jossey-Bass, New York. Chung, K.L. (1967). Markov Chains with Stationary Transition Probabilities. Springer, Berlin. Cleveland, W. (1985). Elements of Graphing Data. Wadsworth Publishing Company: Belmont, California. Coifman, R., Meyer, Y., and Wickerhauser, V. (1992). Wavelet analysis and signal processing. In: Wavelets and Their Applications, pp. 153-178. Jones and Bartlett Publishers, Boston. Coifman, R. and Wickerhauser, V. (1992). Entropy-based algorithms for best basis selection. IEEE Transactions on Information Theory, Vol. 38, No. 2, 713-718. Conway, J.H. and Sloane, N.J.A. (1988). Sphere packings, lattices and groups. Grundlehren der mathematischen Wissenschaften 290, Springer, Berlin. Cooley, J.W. and Tukey, J.W. (1965). An algorithm for the machine calculation of complex Fourier series. Math. Comput., Vol. 19, 297-301. Cox, T.F. and Cox, M.A.A. (1994). Multidimensional Scaling. Chapman & Hall, London. Cremer, L. (1984). The Physics of The Violin, MIT Press, 1984. Crocker, M.J. (ed.) (1998). Handbook of Acoustics, Wiley Interscience: New York. Dahlhaus, R. (1987). Efficient parameter estimation for self-similar processes. Ann. Statist., Vol. 17, 1749-1766. Dahlhaus, R. (1996a). Maximum likelihood estimation and model selection for locally stationary processes. J. Nonpar. Statist., Vol. 6, 171-191. Dahlhaus, R. (1996b). Asymptotic statistical inference for nonstationary processes with evolutionary spectra. In: Athens Conference on Applied Probability and Time Series, Vol. II, P.M. Robinson and M. Rosenblatt (Eds.), 145-159, Lecture Notes in Statistics, 115, Springer, New York. Dahlhaus, R. (1997). Fitting time series models to nonstationary processes. Ann. Statistics, Vol. 25, 1-37. Daubechies, I. (1992). Ten Lectures on Wavelets. SIAM, Philadelphia, PA. Davison, A.C. and Hinkley, D.V. (1997). 
Bootstrap Methods and Their Application. Cambridge University Press, Cambridge. de la Motte-Haber, H. (1996). Handbuch der Musikpsychologie (2nd ed.). Laaber Verlag, Laaber. Devaney, R.L. (1990). Chaos, Fractals and Dynamics. Addison-Wesley, California. Diaconis, P., Graham, R.L., and Kantor, W.M. (1983). The mathematics of perfect shuffles. Adv. Appl. Math., Vol. 4, 175-196. Diggle, P. (1990). Time Series – A Biostatistical Introduction. Oxford University Press, Oxford. Dillon, W. R. and Goldstein, M. (1984). Multivariate Analysis, Methods and Applications. Wiley, New York. Donoho, D.L. and Johnstone, I.M. (1995). Adapting to unknown smoothness via wavelet shrinkage. JASA, 90, 1200-1224. Donoho, D.L. and Johnstone, I.M. (1998). Minimax estimation via wavelet shrinkage. Ann. Statistics 26, 879-921. Donoho, D.L., Johnstone, I.M., Kerkyacharian, G., and Picard, D. (1995). Wavelet shrinkage: Asymptopia? J. R. Statist. Soc., Series B, 57, 301-337. Donoho, D.L., Johnstone, I.M., Kerkyacharian, G., and Picard, D. (1996). Density estimation by wavelet thresholding. Ann. Statistics, 24, 508-539. Draper, N.R. and Smith, H. (1998). Applied Regression Analysis (3rd ed.). Wiley, New York. Duda, R.O., Hart, P.E. and Stork, D.G. (2000). Pattern classification (2nd ed.). Wiley, New York. Edgar, G.A. (1990). Measure, Topology and Fractal Geometry. Springer, New York. Effelsberg, W. and Steinmetz, R. (1998). Video Compression Techniques. Dpunkt Verlag, Heidelberg. Efron, B. (1979). Bootstrap methods: another look at the jackknife. Ann. Statistics, Vol. 7, 1-26. Eimert, H. (1964). Grundlagen der musikalischen Reihentechnik. Universal Edition, Vienna. Elliott, R.J., Aggoun, L., and Moore, J.B. (1995). Hidden Markov Models: Estimation and Control. Springer, New York. Erdős, P. (1946). On the distribution function of additive functions. Ann. Mathematics, Vol. 43, 1-20. Eubank, R.L. (1999). Nonparametric Regression and Spline Smoothing (2nd ed.). Marcel Dekker: New York. Everitt, B.S., Landau, S. and Leese, M. (2001). Cluster Analysis (4th ed.). Oxford University Press, Oxford. Everitt, B.S. and Rabe-Hesketh, S. (1997). The Analysis of Proximity Data. Arnold, London. Falconer, K.J. (1985). The Geometry of Fractal Sets. Cambridge University Press, Cambridge. Falconer, K.J. (1986). Random Fractals. Math. Proc. Cambridge Philos. Soc., Vol. 100, 559-582. Falconer, K.J. (1990). Fractal Geometry. Wiley, New York. Fan, J. and Gijbels, I. (1995). Data-driven bandwidth selection in local polynomial fitting: Variable bandwidth and spatial adaptation. J. R. Statist. Soc., Ser. B, 57, 371–394. Fan, J. and Gijbels, I. (1996). Local Polynomial Modeling and Its Applications. Chapman & Hall, London. Feng, Y. (1999). 
Kernel- and Locally Weighted Regression – with Applications to Time Series Decomposition. Verlag für Wissenschaft und Forschung, Berlin. Fisher, N.I. (1993). Statistical Analysis of Circular Data. Cambridge University Press, Cambridge. Fisher, R.A. (1925). Theory of Statistical Information. Proc. Camb. Phil. Soc., Vol. 22, pp. 700-725. Fisher, R.A. (1956). Statistical Methods and Scientific Inference. Oliver & Boyd, London. Fleischer, A. (2003). Die analytische Interpretation. Schritte zur Erschließung eines Forschungsfeldes am Beispiel der Metrik. PhD dissertation, Humboldt-University Berlin. dissertation.de, Verlag im Internet GmbH, Berlin. Fleischer, A., Mazzola, G., Noll, Th. Zur Konzeption der Software RUBATO für musikalische Analyse und Performance. Musiktheorie, Heft 4, pp. 314-325, 2000. Fletcher, T.J. (1956). Campanological groups. American Math. Monthly, 63/9, 619-626. Fletcher, N.H. and Rossing, T.D. (1991). The Physics of Musical Instruments. Springer, Berlin/New York. Flury, B. and Riedwyl, H. (1988). Multivariate Statistics: A Practical Approach. Cambridge University Press, Cambridge, UK. Forte, A. (1964). A theory of set-complexes for music. JMT, 8, No. 2, 136-183. Forte, A. (1973). Structure of atonal music. Yale University Press, New Haven. Forte, A. (1989). La set-complex theory: élevons les enjeux! Analyse musicale, 4ème trimestre, 80-86. Fox, R. and Taqqu, M.S. (1986). Large sample properties of parameter estimates for strongly dependent stationary Gaussian time series. Ann. Statistics, Vol. 14, 517-532. Friedman, J.H. (1977). A recursive partitioning decision rule for nonparametric classification. IEEE Transactions on Computers, Vol. 26, No. 4, 404-408. Fripertinger, H. (1991). Enumeration in music theory. Séminaire Lotharingien de Combinatoire, 26, 29-42. Fripertinger, H. (1999). Enumeration and construction in music theory. Diderot Forum on Mathematics and Music Computational and Mathematical Methods in Music, Vienna, Austria. December 2–4, 1999. H. G. Feichtinger and M. Dörfler, editors. Österreichische Computergesellschaft, 179-204. Fripertinger, H. (2001). Enumeration of non-isomorphic canons. Tatra Mountains Math. Publ., 23. Fukunaga, K. (1990). Introduction to Statistical Pattern Recognition (2nd ed.). Academic Press, New York. Gasser, T. and Müller, H.G. (1979). Kernel estimation of regression functions. In: Smoothing Techniques for Curve Estimation. Gasser, T., Rosenblatt, M. (Eds.), Springer, New York, pp. 23-68. Gasser, T. and Müller, H.G. (1984). 
Estimating regression functions and their derivatives by the kernel method. Scand. J. Statist., Vol. 11, 171-185. Gasser, T., Müller, H.G., and Mammitzsch, V. (1985). Kernels for nonparametric curve estimation. J. R. Statist. Soc., Ser. B, Vol. 47, 238-252. Genevois, H. and Orlarey, Y. (1997). Musique et Mathématiques. Aléas-Grame, Lyon. Gervini, D. and Yohai, V.J. (2002). A class of robust and fully efficient regression estimators. Ann. Statistics, Vol. 30, 583-616. Ghosh, S. (1996). A new graphical tool to detect non-normality. J. R. Statist. Society, Series B, Vol. 58, 691-702. Ghosh, S. (1999). T3-plot. In: Encyclopedia for Statistical Sciences, Update volume 3, (S. Kotz ed.), pp. 739-744, Wiley, New York. Ghosh, S. and Beran, J. (2000). Comparing two distributions: The two sample T3 plot. J. Computational Graphical Statist., Vol. 9, No. 1, 167-179. Gilbert, W.J. (2002). Modern Algebra with Applications. Wiley, New York. Ghosh, S. and Draghicescu, D. (2002a). Predicting the distribution function for


long-memory processes. Int. J. Forecasting, 18, 283-290. Ghosh, S., Draghicescu, D. (2002b). An algorithm for optimal bandwidth selection for smooth nonparametric quantiles and distribution functions. In: Statistics in Industry and Technology: Statistical Data Analysis Based on the L1-Norm and Related Methods. Dodge Y. (Ed.), Birkhäuser Verlag, Basel, Switzerland, pp. 161-168. Ghosh, S., Beran, J. and Innes, J. (1997). Nonparametric conditional quantile estimation in the presence of long memory. Student - Special issue on the conference on L1-Norm and related methods, Vol. 2, 109-117. Gilks, W.R., Richardson, S., and Spiegelhalter, D.J. (Eds.) (1996). Markov Chain Monte Carlo in Practice. Chapman & Hall, London. Goldman, S. (1953). Information Theory. Prentice Hall, New Jersey. Good, P.I. (2001). Resampling Methods. Birkhäuser, Basel. Gordon, A.D. (1999). Classification (2nd ed.). Chapman and Hall, London. Götze, H. and Wille, R. (Eds.) (1985). Musik und Mathematik (Salzburger Musikgespräch 1984 unter Vorsitz von Herbert von Karajan). Springer, Berlin. Graeser, W. (1924). Bachs “Kunst der Fuge”. In: Bach-Jahrbuch, 1924. Graff, K.F. (1975). Wave Motion in Elastic Solids. Oxford University Press. (reprinted by Dover, 1991). Granger, C.W.J. and Joyeux, R. (1980). An introduction to long-range time series models and fractional differencing. J. Time Series Anal., Vol. 1, 15-30. Grenander, U. and Szegö, G. (1958). Toeplitz Forms and Their Application. Univ. California Press, Berkeley. Grey, J. (1977). Multidimensional perceptual scaling of musical timbre. J. Acoustical Soc. America, Vol. 62, 1270-1277. Grey, J. and Gordon, J. (1978). Perceptual Effects of spectral modifications on musical timbres. J. Acoust. Soc. America, 63, 1493-1500. Gromko, J.E. (1993). Perceptual Differences between expert and novice music listeners at multidimensional scaling analysis. Psychology of Music, 21, 34-47. Guttman, L. (1954). A new approach to factor analysis: the radex. 
In: Mathematical thinking in the behavioral sciences, P. Lazarsfeld (Ed.). Free Press, New York, pp. 258-348. Guttman, L. (1968). A general non-metric technique for finding the smallest coordinate space for a configuration of points. Psychometrika, 33, 469-506. Hall, D.E. (1980). Musical Acoustics. Wadsworth Publishing Company: Belmont, California. Halsey, D. and Hewitt, E. (1978). Eine gruppentheoretische Methode in der Musiktheorie. Jahresbericht der Deutschen Math. Vereinigung, Vol. 80. Hampel, F.R., Ronchetti, E., Rousseeuw, P., and Stahel, W.A. (1986). Robust Statistics: The Approach based on Influence Functions. Wiley, New York. Hand, D.J. (1986). Discrimination and Classification. Wiley, New York. Hand, D.J., Mannila, H., and Smyth, P. (2001). Principles of Data Mining. MIT Press, Cambridge (USA). Hannan, E.J. (1973). The estimation of frequency. J. Appl. Probab., Vol. 10, 510-519. Hannan, E.J. and Quinn, B.G. (1979). The determination of the order of an autoregression. J. R. Statist. Soc., Series B, Vol. 41, 190-195.


Härdle, W. (1991). Smoothing Techniques. Springer, New York. Härdle, W., Kerkyacharian, G., Picard, D., and Tsybakov, A. (1998). Wavelets, Approximation, and Statistical Applications. Lecture Notes in Statistics, No. 129. Springer, New York. Hartigan, J.A. (1975). Clustering Algorithms. Wiley, New York. Hartley, R.V. (1928). Transmission of information. Bell Syst. Techn. J., 535-563. Hassan, T. (1982). Nonlinear time series regression for a class of amplitude modulated cosinusoids. J. Time Series Analysis, Vol. 3, 109-122. Hastie, T., Tibshirani, R., and Buja, A. (1994). Flexible discriminant analysis by optimal scoring. JASA, Vol. 89, 1255-1270. Hastie, T., Tibshirani, R., and Friedman, J.H. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York. Hausdorff, F. (1919). Dimension und äusseres Mass. Mathematische Annalen, Vol. 79, 157-179. von Helmholtz, H. (1863). Die Lehre von den Tonempfindungen als physiologische Grundlage der Musik, Reprinted in Darmstadt, 1968. Herstein, I.N. (1975). Topics in Algebra. Wiley, New York. Hirst, D. (1996). Error-rate estimation in multiple-group linear discriminant analysis. Technometrics, Vol. 38, 389-399. Hjort, N.L. and Glad, I.K. (2002). Nonparametric density estimation with a parametric start. Ann. Statistics, Vol. 23, No. 3, 882-904. Hofstadter, D.R. (1999). Gödel, Escher, Bach, Basic Books, New York. Höppner, F., Klawonn, F., Kruse, R. and Runkler, T. (1999). Fuzzy Cluster Analysis. Wiley, New York. Hosking, J.R.M. (1981). Fractional differencing. Biometrika, Vol. 68, 165-176. Howard, D.M. and Angus, J. (1996). Acoustics and Psychoacoustics, Focal Press. Huber, P. (1981). Robust Statistics. Wiley, New York. Huberty, C.J. (1994). Applied Discriminant Analysis. Wiley, New York. Hurvich, C.M. and Ray, B.K. (1995). Estimation of the memory parameter for nonstationary or noninvertible fractionally integrated processes. J. Time Series Anal., Vol. 16, 17-41. Irizarry, R.A. 
(1998). Statistics and music: fitting a local harmonic model to musical sound signals. PhD thesis, University of California, Berkeley. Irizarry, R.A. (2000). Asymptotic distribution of estimates for a time-varying parameter in a harmonic model with multiple fundamentals. Statistica Sinica, Vol. 10, 1041-1067. Irizarry, R.A. (2001). Local harmonic estimation in musical sound signals. JASA, Vol. 96, No. 454, 357-367. Irizarry, R.A. (2002). Weighted estimation of harmonic components in a musical sound signal. J. Time Series Anal., Vol. 23, 29-48. Isaacson, D.L. and Madsen, R.W. (1976). Markov Chains Theory and Applications. Wiley, New York. Jaffard, S., Meyer, Y., and Ryan, R. (2001). Wavelets: Tools for Science and Technology. SIAM, Philadelphia. Jajuga, K., Sokołowski, A. and Bock, H.H. (Eds.) (2002). Statistical Pattern Recognition. Springer, New York. Jammalamadaka, S.R. and SenGupta, A. (2001). Topics in circular statistics.


Series on Multivariate Analysis, Vol. 5. World Scientific, River Edge, NJ. Jansen, M. (2001). Noise Reduction by Wavelet Thresholding. Lecture Notes in Statistics, No. 161. Springer, New York. Jardine, N. and Sibson, R. (1971). Mathematical Taxonomy. Wiley, New York. Johnson, J. (1997). Graph Theoretical Methods of Abstract Musical Transformation. Greenwood Publishing Group, London. Johnson, R.A. and Wichern, D.W. (2002). Applied Multivariate Statistical Analysis. Prentice Hall, New Jersey. Johnston, I. (1989). Measured Tones: The Interplay of Physics and Music. Institute of Physics Publishing, Bristol and Philadelphia. Joshi, D.D. (1957). L’information en statistique mathématique et dans la théorie des communications. PhD thesis, Faculté des Sciences de l’Université de Paris. Juang, B.H. and Rabiner, L.R. (1991). Hidden Markov models for speech recognition. Technometrics, Vol. 33, 251-272. Kaiser, G. (1994). A Friendly Guide to Wavelets. Birkhäuser, Boston. Keil, W. (1991). Gibt es den Goldenen Schnitt in der Musik des 16. bis 19. Jahrhunderts? Eine kritische Untersuchung rezenter Forschungen. Augsburger Jahrbuch für Musikwissenschaft, Vol. 8, 1991, pp. 7-70. Schneider, Tutzing, Germany. Kelly, J.P. (1991). Hearing. In: Principles of Neural Science, E.R. Kandel, J.H. Schwarz, T.M. Jessel (Eds.), Elsevier, New York, pp. 481-499. Kemeny, J.G., Snell, J.L., and Knapp, A.W. (1976). Denumerable Markov Chains. Springer, New York. Khinchin, A.I. (1953). The entropy concept in probability theory. Uspekhi Matematicheskikh Nauk, Vol. 8, No. 3 (55), 3-20 (Russian). Khinchin, A.I. (1956). On the fundamental theorems of information theory. Uspekhi Matematicheskikh Nauk, Vol. 11, No. 1 (67), 17-75 (Russian). Kinsler, L.E., Frey, A.R., Coppens, A.B., and Sanders, J.V. (2000). Fundamentals of Acoustics (4th ed.). Wiley, New York. Klecka, W.R. (1980). Discriminant Analysis. Sage, London. Kolmogorov, A.N. (1956). 
On the Shannon theory of information transmission in the case of continuous signals. IRE Trans. on Inform. Theory, Vol. IT-2, 102-108. Kono, N. (1986). Hausdorff dimension of sample paths for self-similar processes. In: Dependence in Probability and Statistics, E. Eberlein and M.S. Taqqu (eds.), Birkhäuser, Boston. Kruskal, J.B. (1964a). Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29, 1-27. Kruskal, J.B. (1964b). Nonmetric multidimensional scaling: a numerical method. Psychometrika, 29, 115-129. Kruskal, J.B. and Wish, M. (1978). Multidimensional Scaling. Sage, London. Krzanowski, W.J. (1988). Principles of Multivariate Analysis. Oxford University Press, Oxford. Kullback, S. (1959). Information Theory and Statistics. Wiley, New York. Lanciani, A. (2001). Mathématiques et musique: les labyrinthes de la phénoménologie. Editions Jérôme Millon, Grenoble. Läuter, H. (1985). An efficient estimator for the error rate in discriminant analysis. Statistics, Vol. 16, 107-119. Lamperti, J.W. (1962). Semi-stable stochastic processes. Trans. American Math. Soc., Vol. 104, 62-78. Lamperti, J.W. (1972). Semi-stable Markov processes. Z. Wahrsch. verw. Geb., Vol. 22, 205-225. LeBlanc, M. and Tibshirani, R. (1996). Combining estimates in regression and classification. JASA, Vol. 91, 1641-1650. Lendvai, E. (1993). Symmetries of Music. Kodály Institute, Kecskemet. Levinson, S.E., Rabiner, L.R., and Sondhi, M.M. (1983). An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition. Bell Systems Tech. J., Vol. 62, 1035-1074. Lewin, D. (1987). Generalized Musical Intervals and Transformations. Yale University Press, New Haven/London. Leyton, M. (2001). A Generative Theory of Shape. Springer, New York. Licklider, J.R.C. (1951). A duplex theory of pitch reception. Experientia, Vol. 7, 128-134. Ligges, U., Weihs, C., Hasse-Becker, P. (2002). Detection of locally stationary segments in time series. In: Proceedings in Computational Statistics, W. Härdle, B. Rönz (Eds.), pp. 285-290. Lindley, M. and Turner-Smith, R. (1993). Mathematical Models of Musical Scales. Verlag für systematische Musikwissenschaft GmbH, Bonn. Lingoes, J.C. and Roskam, E.E. (1973). A mathematical and empirical analysis of two multidimensional scaling algorithms. Psychometrika, 38, Monograph Suppl. No. 19. MacDonald, I.L. and Zucchini, W. (1997). Hidden Markov and Other Models for Discrete-valued Time Series. Chapman & Hall, London. Mallat, S. (1998). A Wavelet Tour of Signal Processing. Academic Press, London. Mandelbrot, B.B. (1953). Contribution à la théorie mathématique des jeux de communication. Publs. Inst. Statist. Univ. Paris, Vol. 2, Fasc. 1 et 2, 3-124. Mandelbrot, B.B. (1956). An outline of a purely phenomenological theory of statistical thermodynamics: I. canonical ensembles. IRE Trans. on Inform. Theory, Vol. IT-2, 190-203. Mandelbrot, B.B. (1977). 
Fractals: Form, Chance and Dimension. Freeman & Co., San Francisco. Mandelbrot, B.B. (1983). The Fractal Geometry of Nature. Freeman & Co., San Francisco. Mandelbrot, B.B. and van Ness, J.W. (1968). Fractional Brownian motions, fractional noises and applications. SIAM Review, Vol. 10, No. 4, 422-437. Mandelbrot, B.B. and Wallis, J.R. (1969). Computer experiments with fractional Gaussian noises. Water Resour. Res., Vol. 5, No. 1, 228-267. Mardia, K.V. (1972). Statistics of Directional Data. Academic Press, London. Mardia, K.V., Kent, J.T. and Bibby, J.M. (1979). Multivariate Analysis. Academic Press, London. Markuse, B. and Schneider, A. (1995). Ähnlichkeit, Nähe, Distanz: Zur Anwendung multidimensionaler Skalierung in musikwissenschaftlichen Untersuchungen. In: Festschrift für Jobst Peter Fricke zum 65. Geburtstag, W. Auhagen, B. Gätjen and K. Niemöller (Eds.), Musikwissenschaftliches Institut der Universität zu Köln (http://www.uni-koeln.de/philfak/muwi/publ/fs fricke/festschrift.html). Matheron, G. (1973). The intrinsic random functions and their applications. Adv. Appl. Prob., Vol. 5, 439-468. Mathieu, E. (1861). Mémoire sur l’étude des fonctions de plusieurs quantitées. J. Math. Pures Appl., Vol. 6, 241-243. Mathieu, E. (1873). Sur la fonction cinq fois transitive de 24 quantitées. J. Math. Pures Appl., Vol. 18, 25-46. Mazzola, G. (1985). Gruppen und Kategorien in der Musik, Heldermann-Verlag, Berlin. Mazzola, G. (1990a). Geometrie der Töne. Birkhäuser, Basel. Mazzola, G. (1990b). Synthesis. SToA music 1001.90, Zürich. Mazzola, G. (1989/1994). Presto. SToA music, Zürich. Mazzola, G. (2002). The Topos of Music. Birkhäuser, Basel. Mazzola, G. and Beran, J. (1998). Rational composition of performance. In: controlling creative processes in music, W. Auhagen, R. Kopiez (Eds.), Staatliches Institut für Musikforschung (Berlin), Lang Verlag, Frankfurt/New York. Mazzola, G., Zahorka, O. and Stange-Elbe, J. (1995). Analysis and Performance of a Dream. In: Proceedings of the 1995 Symposium on Musical Performance, J. Sundberg (ed.), KTH, Stockholm. McLachlan, G.J. (1992). Discriminant Analysis and Statistical Pattern Recognition. Wiley, New York. McMillan, B. (1953). The basic theorems of information theory. Ann. Math. Statistics, 24, 196-219. Meyer, Y. (1992). Wavelets and Operators. Cambridge University Press, Cambridge. Meyer, Y. (1993). Wavelets: Algorithms and Applications. SIAM, Philadelphia, PA. Morris, R.D. (1987). Composition with Pitch-Classes. Yale University Press, New Haven. Morris, R.D. (1995). Compositional spaces and other territories. PNM 33, 328-358. Morse, P.M. and Ingard, K.U. (1968). Theoretical Acoustics. McGraw Hill. (Reprinted by Princeton University Press 1986.) Mosteller, F. and Tukey, J.W. (1977). Data Analysis and Regression. Addison-Wesley, Reading, MA. Nadaraya, E.A. (1964). 
On estimating regression. Theory of Probability and its Applications, Vol. 9, 141-142. Nederveen, C.J. (1998). Acoustical Aspects of Woodwind Instruments. Northern Illinois University Press, de Kalb. Nettheim, N. (1997). A Bibliography of Statistical Applications in Musicology. Musicology Australia, Vol. 20, 94-106. Newton, H.J. and Pagano, M. (1983). A method for determining periods in time series. JASA, Vol. 78, 152-157. Noll, T. (1997). Harmonische Morpheme. Musikometrika, Vol. 8, 7-32. Norden, H. (1964). Proportions in Music. Fibonacci Quarterly, Vol. 2, 219. Norris, J.R. (1998). Markov Chains. Cambridge University Press, Cambridge.


Ogden, R.T. (1996). Essential Wavelets for Statistical Applications and Data Analysis. Birkhäuser, Boston. Orbach, J. (1999). Sound and Music. University Press of America, Lanham, MD. Parzen, E. (1962). On estimation of a probability density function and mode. Ann. Math. Statistics, Vol. 33, 1065-1076. Peitgen, H.-O. and Saupe, D. (1988). The Science of Fractal Images. Springer, New York. Percival, D.B. and Walden, A.T. (2000). Wavelet Methods for Time Series Analysis. Cambridge University Press, Cambridge, UK. Perle, G. (1955). Symmetric formations in the string quartets of Béla Bartók. Music Review 16, 300-312. Pierce, J.R. (1983). The Science of Musical Sound. Scientific American Books, New York (2nd ed. printed by W.H. Freeman & Co, 1992). Plackett, R.L. (1960). Principles of Regression Analysis. Clarendon Press, Oxford. Polzehl, J. (1995). Projection pursuit discriminant analysis. Computational Statist. Data Anal., Vol. 20, 141-157. Price, B.D. (1969). Mathematical groups in campanology. Math. Gaz., 53, 129-133. Priestley, M.B. (1965). Evolutionary spectra and non-stationary processes. J. R. Statist. Soc., Series B, Vol. 27, 204-237. Priestley, M.B. (1981a). Spectral Analysis and Time Series, (Vol. 1): Univariate Time Series. Academic Press, New York. Priestley, M.B. (1981b). Spectral Analysis and Time Series, (Vol. 2): Multivariate Series, Prediction and Control. Academic Press, New York. Quinn, B.G. and Thomson, P.J. (1991). Estimating the frequency of a periodic function. Biometrika, Vol. 78, No. 1, 65-74. Rahn, J. (1980). Basic Atonal Theory. Longman, New York. Raichel, D.R. (2000). The Science and Applications of Acoustics. American Inst. of Physics, College Park, PA. Ramsay, J.O. (1977). Maximum likelihood estimation in multidimensional scaling. Psychometrika, 42, 241-266. Raphael, C.S. (1999). Automatic segmentation of acoustic music signals using hidden Markov models. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 21, No. 
4, 360-370. Raphael, C.S. (2001a). A probabilistic expert system for automatic musical accompaniment. J. Computational Graphical Statist., Vol. 10, No. 3, 487-512. Raphael, C.S. (2001b). Synthesizing musical accompaniment with Bayesian belief networks. J. New Music Res., Vol. 30, No. 1, 59-67. Rao, C.R. (1973). Linear Statistical Inference and its Applications (2nd ed.). Wiley & Sons, New York. Rayleigh, J.W.S. (1896). The Theory of Sound (2 vols), 2nd ed., Macmillan, London (Reprinted by Dover, 1945). Read, R.C. (1997). Combinatorial problems in the theory of music. Discrete Mathematics, 167/168, 543-551. Reiner, D. (1985). Enumeration in music theory, American Math. Monthly, 92/1, 51-54.


Rényi, A. (1959a). On the dimension and entropy of probability distributions. Acta Math. Acad. Sci. Hung., Vol. 10, 193-215. Rényi, A. (1959b). On a theorem of P. Erdős and its applications in information theory. Mathematica Cluj, Vol. 1, No. 24, 341-344. Rényi, A. (1961). On measures of entropy and information. Proc. Fourth Berkeley Symposium on Math. Stat. Prob., Vol. I, Univ. California Press, Berkeley, 547-561. Rényi, A. (1965). On foundations of information theory. Review of the International Statistical Institute, Vol. 33, 1-14. Rényi, A. (1970). Probability Theory. North Holland, Amsterdam. Repp, B. (1992). Diversity and Communality in Music Performance: An Analysis of Timing Microstructure in Schumann’s “Träumerei”. J. Acoustic Soc. Am., 92, 2546-2568. Rigden, J.S. (1977). Physics and the Sound of Music. Wiley, New York. Ripley, B. (1995). Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge. Rodet, X. (1997). Musical sound signals analysis/synthesis: sinusoidal+residual and elementary waveform models. Appl. Signal Processing, 4, 131-141. Roederer, J.G. (1995). The Physics and Psychophysics of Music. Springer, Berlin/New York. Rosenblatt, M. (1956). Remarks on some nonparametric estimates of a density function. Ann. Math. Statistics, Vol. 27, 832-837. Rossing, T.D. (ed.) (1984). Acoustics of Bells. Van Nostrand Reinhold, New York. Rossing, T.D. (1990). The Science of Sound (2nd ed.). Addison-Wesley, Reading, MA. Rossing, T.D. (2000). Science of Percussion Instruments. World Scientific, London. Rossing, T.D. and Fletcher, N.H. (1995). Principles of Vibration and Sound. Springer, Berlin/New York. Rotman, J.J. (2002). Advanced Modern Algebra. Prentice Hall, New Jersey. Rousseeuw, P. and Yohai, V.J. (1984). Robust regression by means of S-estimators. In: Robust Nonlinear Time Series Analysis, J. Franke, W. Härdle, and D. Martin (Eds.), Lecture Notes in Statistics, Vol. 26, 256-277, Springer, New York. Ruppert, D. and Wand, M.P. (1994). 
Multivariate locally weighted least squares regression. Ann. Statistics, Vol. 22, 1346-1370. Ryan, T.P. (1997). Modern Regression Methods. Wiley, New York. Scheffé, H. (1959). The Analysis of Variance. Wiley, New York. Schnitzler, G. (1976). Musik und Zahl. Verlag für systematische Musikwissenschaft, Bonn. Schönberg, A. (1950). Die Komposition in 12 Tönen. In: Style and Idea, New York. Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist., Vol. 6, 461-464. Seber, G.A.F. (1984). Multivariate Observations. Wiley, New York. Serra, X. and Smith, J.O. (1991). Spectral modeling synthesis: A sound analysis/synthesis system based on deterministic plus stochastic decomposition. Computer Music J., Vol. 14, No. 4, 12-24.


Shannon, C.E. (1948). A mathematical theory of communication. Bell Syst. Techn. J., Vol. 27, 379-423. Shannon, C.E. and Weaver, W. (1949). The Mathematical Theory of Communication. Univ. Illinois Press, Urbana. Shepard, R.N. (1962a). The analysis of proximities: multidimensional scaling with unknown distance function Part I. Psychometrika, 27, 125-140. Shepard, R.N. (1962b). The analysis of proximities: multidimensional scaling with unknown distance function Part II. Psychometrika, 27, 219-246. Schiffman, S. (1997). Introduction to Multidimensional Scaling: Theory, Methods, and Applications. Academic Press, New York. Shumway, R. and Stoffer, D.S. (2000). Time Series Analysis and Its Applications. Springer, New York. Silverman, B. (1986). Density estimation for statistics and data analysis. Chapman & Hall, London. Simonoff, J.S. (1996). Smoothing methods in statistics. Springer, New York. Sinai, Y.G. (1976). Self-similar probability distributions. Theory Probab. Appl., Vol. 21, 64-80. Slaney, M. and Lyon, R.F. (1991). Apple hearing demo reel. Apple Technical Report No. 25, Apple Computer Inc., Cupertino, CA. Solo, V. (1992). Intrinsic random fluctuations. SIAM Appl. Math., Vol. 52, 270-291. Solomon, L.J. (1973). Symmetry as a determinant of musical composition. PhD thesis, University of West Virginia. Srivastava, M. and Sen, A.K. (1997). Regression Analysis: Theory, Methods and Applications. Springer, New York. Stamatatos, E. and Widmer, G. (2002). Music performer recognition using an ensemble of simple classifiers. Austrian Research Institute for Artificial Intelligence, Vienna, TR-2002-02. Stange-Elbe, J. (2000). Analyse- und Interpretationsperspektiven zu J.S. Bachs “Kunst der Fuge” mit Werkzeugen der objektorientierten Informationstechnologie. Habilitation thesis, University of Osnabrück. Steinberg, R. (ed.) (1995). Music and the Mind Machine. Springer, Heidelberg. Stewart, I. (1992). Another fine math you’ve got me into..., W. H. Freeman. Stoyan, D. 
and Stoyan, H. (1994). Fractals, Random Shapes and Point Fields: Methods of Geometrical Statistics. Wiley, New York. Straub, H. (1989). Beitr¨ age zur modultheoretischen Klassifikation musikalischer Motive. Diploma thesis, ETH Z¨ urich. Taylor, R. (1999a). Fractal analysis of Pollocks drip paintings. Nature, Vol. 399, p. 422. Taylor, R. (1999b). Fractal Expressionism. Physics World, Vol. 12, No. 10, p. 25. Taylor, R. (1999c). Fractal expressionism where art meets science. In: Art and Complexity, J. Casti (ed.), Perseus Press, Vol. Taylor, R. (2000). The use of science to investigate Jackson Pollocks drip paintings. Art and the Brain, Journal of Consciousness Studies, Vol. 7, No. 8-9, p137. Telcs, A. (1990). Spectra of graphs and fractal dimensions. Probab. Th. Rel. Fields, Vol. 82, 435-449.


Thumfart, A. (1995). Discrete Evolutionary Spectra and their Application to a Theory of Pitch Perception. StatLab Heidelberg, Beiträge zur Statistik, No. 30.
Tricot, C. (1995). Curves and Fractal Dimension. Springer, New York.
Tufte, E. (1983). The visual display of quantitative information. Addison-Wesley, Reading, MA.
Tukey, J.W. (1977). Exploratory data analysis. Addison-Wesley, Reading, MA.
Tukey, P.A. and Tukey, J.W. (1981). Graphical display of data sets in 3 or more dimensions. In: Interpreting Multivariate Data, V. Barnett (ed.), Wiley, Chichester, UK.
Ueda, K. and Ohgushi, K. (1987). Perceptual components of pitch: spatial representation using a multidimensional scaling technique. J. Acoust. Soc. Am., 82, 1193-1200.
Velleman, P. and Hoaglin, D. (1981). The ABC’s of EDA: Applications, Basics, and Computing of Exploratory Data Analysis. Duxbury, Belmont, CA.
Vidakovic, B. (1999). Statistical Modeling by Wavelets. John Wiley, New York.
Voss, R.F. and Clarke, J. (1975). 1/f noise in music and speech. Nature, Vol. 258, 317-318.
Voss, R.F. and Clarke, J. (1978). 1/f noise in music: music from 1/f noise. J. Acoust. Soc. America, Vol. 63, 258-263.
Voss, R.F. (1988). Fractals in nature: From characterization to simulation. In: Science of fractal images, H.-O. Peitgen and D. Saupe (Eds.), Springer, Berlin, pp. 26-69.
Vuza, D.T. (1991). Supplementary sets and regular complementary unending canons (part one). Persp. New Music, Vol. 29, No. 2, 22-49.
Vuza, D.T. (1992a). Supplementary sets and regular complementary unending canons (part two). Persp. New Music, Vol. 30, No. 1, 184-207.
Vuza, D.T. (1992b). Supplementary sets and regular complementary unending canons (part three). Persp. New Music, Vol. 30, No. 2, 102-125.
Vuza, D.T. (1993). Supplementary sets and regular complementary unending canons (part four). Persp. New Music, Vol. 31, No. 1, 270-305.
van der Waerden, B.L. (1979). Die Pythagoreer. Artemis, Zürich.
Wahba, G. (1990). Spline Models for Observational Data. SIAM, Philadelphia.
Walker, A.M. (1971). On the estimation of a harmonic component in a time series with stationary independent residuals. Biometrika, Vol. 58, 21-36.
Walmsley, P.J., Godsill, S.J. and Rayner, P.J.W. (1999). Bayesian graphical models for polyphonic pitch tracking. In: Diderot Forum on Mathematics and Music: Computational and Mathematical Methods in Music, Vienna, Austria, December 2-4, 1999, H. G. Feichtinger and M. Dörfler (eds.), Österreichische Computergesellschaft.
Wand, M.P. and Jones, M.C. (1995). Kernel Smoothing. Chapman and Hall, London.
Watson, G. (1964). Smooth regression analysis. Sankhya, Series A, Vol. 26, 359-372.
Watson, G. (1983). Statistics on Spheres. Wiley, New York.
Waugh, W.A. (1996). Music, probability, and statistics. In: Encyclopedia of Statistical Sciences, S. Kotz, C.B. Read, and D.L. Banks (Eds.), 6, 134-137.


Webb, A.R. (2002). Statistical Pattern Recognition (2nd ed.). Wiley, New York.
Wedin, L. (1972). Multidimensional scaling of emotional expression in music. Svensk Tidskrift för Musikforskning, 54, 115-131.
Wedin, L. and Goude, G. (1972). Dimension analysis of the perception of musical timbre. Scand. J. Psychol., 13, 228-240.
Weihs, C., Berghoff, S., Hasse-Becker, P. and Ligges, U. (2001). Assessment of Purity of Intonation in Singing Presentations by Discriminant Analysis. In: Mathematical Statistics and Biometrical Applications, J. Kunert and G. Trenkler (Eds.), pp. 395-410.
White, A.T. (1983). Ringing the changes. Math. Proc. Camb. Phil. Soc. 94, 203-215.
White, A.T. (1985). Ringing the changes II. Ars Combinatorica, 20-A, 65-75.
White, A.T. (1987). Ringing the cosets. American Math. Monthly 94/8, 721-746.
Whittle, P. (1953). Estimation and information in stationary time series. Ark. Mat., Vol. 2, 423-434.
Widmer, G. (2001). Discovering Simple Rules in Complex Data: A Meta-learning Algorithm and Some Surprising Musical Discoveries. Austrian Research Institute for Artificial Intelligence, Vienna, TR-2001-31.
Wiener, N. (1948). Cybernetics or control and communication in the animal and the machine. Act. Sci. Indust., No. 1053, Hermann et Cie, Paris.
Wilson, W.G. (1965). Change Ringing. October House Inc., New York.
Wolfowitz, J. (1957). The coding of messages subject to chance errors. Illinois J. Math., Vol. 1, 591-606.
Wolfowitz, J. (1958). Information theory for mathematicians. Ann. Math. Statistics, Vol. 29, 351-356.
Wolfowitz, J. (1961). Coding Theorems of Information Theory. Springer, Berlin.
Woodward, P.M. (1953). Probability and Information Theory with Applications to Radar. Pergamon Press, London.
Xenakis, I. (1971). Formalized Music: Thought and Mathematics in Composition. Indiana University Press, Bloomington/London.
Yaglom, A.M. and Yaglom, I.M. (1967). Wahrscheinlichkeit und Information. Deutscher Verlag der Wissenschaften, Berlin.
Yost, W.A. (1977). Fundamentals of Hearing. An Introduction. Academic Press, San Diego.
Yohai, V.J. (1987). High breakdown-point and high efficiency robust estimates for regression. Ann. Statistics, Vol. 15, 642-656.
Yohai, V.J., Stahel, W.A., and Zamar, R. (1991). A procedure for robust estimation and inference in linear regression. In: Directions in robust statistics and diagnostics, Part II, W.A. Stahel and S.W. Weisberg (Eds.), Springer, New York.
Young, G. and Householder, A.S. (1941). A note on multidimensional psychophysical analysis. Psychometrika, 6, 331-333.
Zassenhaus, H.J. (1999). The Theory of Groups. Dover, Mineola.
Zivot, E. and Wang, J. (2002). Modeling Financial Time Series with S-Plus. Springer, New York.

