Compositional Data Analysis in the Geosciences: From Theory to Practice
The Geological Society of London Books Editorial Committee B. PANKHURST (UK) (CHIEF EDITOR)
Society Books Editors J. GREGORY (UK) J. GRIFFITHS (UK) J. HOWE (UK) P. LEAT (UK) N. ROBINS (UK) J. TURNER (UK)
Society Books Advisors M. BROWN (USA) E. BUFFETAUT (France) R. GIERÉ (Germany) J. GLUYAS (UK) D. STEAD (Canada) R. STEPHENSON (Netherlands)
Geological Society books refereeing procedures The Society makes every effort to ensure that the scientific and production quality of its books matches that of its journals. Since 1997, all book proposals have been refereed by specialist reviewers as well as by the Society's Books Editorial Committee. If the referees identify weaknesses in the proposal, these must be addressed before the proposal is accepted. Once the book is accepted, the Society Book Editors ensure that the volume editors follow strict guidelines on refereeing and quality control. We insist that individual papers can only be accepted after satisfactory review by two independent referees. The questions on the review forms are similar to those for Journal of the Geological Society. The referees' forms and comments must be available to the Society's Book Editors on request. Although many of the books result from meetings, the editors are expected to commission papers that were not presented at the meeting to ensure that the book provides a balanced coverage of the subject. Being accepted for presentation at the meeting does not guarantee inclusion in the book. More information about submitting a proposal and producing a book for the Society can be found on its web site: www.geolsoc.org.uk.
It is recommended that reference to all or part of this book should be made in one of the following ways: BUCCIANTI, A., MATEU-FIGUERAS, G. & PAWLOWSKY-GLAHN, V. (eds) 2006. Compositional Data Analysis in the Geosciences: From Theory to Practice. Geological Society, London, Special Publications, 264. BUCCIANTI, A., TASSI, F. & VASELLI, O. 2006. Compositional changes in a fumarolic field, Vulcano Island: a statistical case study. In: BUCCIANTI, A., MATEU-FIGUERAS, G. & PAWLOWSKY-GLAHN, V. (eds) Compositional Data Analysis in the Geosciences: From Theory to Practice. Geological Society, London, Special Publications, 264, 67-77.
GEOLOGICAL SOCIETY SPECIAL PUBLICATION NO. 264
Compositional Data Analysis in the Geosciences: From Theory to Practice
EDITED BY
A. BUCCIANTI Università degli Studi di Firenze, Italy
G. MATEU-FIGUERAS Universitat de Girona, Spain
and
V. PAWLOWSKY-GLAHN Universitat de Girona, Spain
2006 Published by The Geological Society London
Published by The Geological Society from: The Geological Society Publishing House, Unit 7, Brassmill Enterprise Centre, Brassmill Lane, Bath BA1 3JN, UK (Orders: Tel. +44 (0)1225 445046, Fax +44 (0)1225 442836) Online bookshop: www.geolsoc.org.uk/bookshop
The publishers make no representation, express or implied, with regard to the accuracy of the information contained in this book and cannot accept any legal responsibility for any errors or omissions that may be made. © The Geological Society of London 2006. All rights reserved. No reproduction, copy or transmission of this publication may be made without written permission. No paragraph of this publication may be reproduced, copied or transmitted save with the provisions of the Copyright Licensing Agency, 90 Tottenham Court Road, London W1P 9HE. Users registered with the Copyright Clearance Center, 27 Congress Street, Salem, MA 01970, USA: the item-fee code for this publication is 0305-8719/06/$15.00.
British Library Cataloguing in Publication Data A catalogue record for this book is available from the British Library. ISBN-10:1-86239-205-6 ISBN-13:978-1-86239-205-2 Typeset by Techset Composition, Salisbury, UK Printed by The Cromwell Press, Wiltshire, UK
Contents
Preface

PAWLOWSKY-GLAHN, V. & EGOZCUE, J. J. Compositional data and their analysis: an introduction

Applications to the solution of real geological problems
Ó.KOVÁCS, L., KOVÁCS, G. P., MARTÍN-FERNÁNDEZ, J. A. & BARCELÓ-VIDAL, C. Major-oxide compositional discrimination in Cenozoic volcanites of Hungary
THOMAS, C. W. & AITCHISON, J. Log-ratios and geochemical discrimination of Scottish Dalradian limestones: a case study
GORELIKOVA, N., TOLOSANA-DELGADO, R., PAWLOWSKY-GLAHN, V., KHANCHUK, A. & GONEVCHUK, V. Discriminating geodynamical regimes of tin ore formation using trace element composition of cassiterite: the Sikhote-Alin case (Far Eastern Russia)
REYMENT, R. A. On stability of compositional canonical variate vector components
BUCCIANTI, A., TASSI, F. & VASELLI, O. Compositional changes in a fumarolic field, Vulcano Island, Italy: a statistical case study
WELTJE, G. J. Ternary sandstone composition and provenance: an evaluation of the 'Dickinson model'

Software and related issues
THIÓ-HENESTROSA, S. & MARTÍN-FERNÁNDEZ, J. A. Detailed guide to CoDaPack: a freeware compositional software
VAN DER BOOGAART, K. G. & TOLOSANA-DELGADO, R. Compositional data analysis with 'R' and the package 'compositions'
BREN, M. & BATAGELJ, V. Visualization of three- and four-part (sub)compositions with R

General theory and methods
EGOZCUE, J. J. & PAWLOWSKY-GLAHN, V. Simplicial geometry for compositional data
DAUNIS-I-ESTADELLA, J., BARCELÓ-VIDAL, C. & BUCCIANTI, A. Exploratory compositional data analysis
BUCCIANTI, A., MATEU-FIGUERAS, G. & PAWLOWSKY-GLAHN, V. Frequency distributions and natural laws in geochemistry
MARTÍN-FERNÁNDEZ, J. A. & THIÓ-HENESTROSA, S. Rounded zeros: some practical aspects for compositional data
BARRABÉS, E. & MATEU-FIGUERAS, G. Is the simplex open or closed? (some topological concepts)

Index
Preface
Compositions are positive vectors whose components represent a relative contribution of different parts to a whole; therefore their sum is a constant, usually 1 or 100. Compositions are a familiar and important kind of data for geologists because they appear in many geological datasets (chemical analyses, geochemical compositions of rocks, sand-silt-clay sediments, etc). Since Karl Pearson wrote his famous paper on spurious correlation back in 1897, much has been said and written about the statistical analysis of compositional data, mainly by geologists such as Felix Chayes. His famous work concerned the G-2 granite sample and is used for comparison and standardization of geochemical analytical techniques between different laboratories. As with most igneous (and metamorphic) rocks that have achieved a stable mineralogy and minimized chemical-free energy, the number of minerals is limited by the phase rule to 0
$\delta_i$ (10)

where $\delta_i$ is the replacement value for the i-th part and is defined by the user. The default constant $\delta_i$ is 0.005, but the user can define another constant or a column of constants that contains as many constants as there are parts in the composition. The user has to select the input columns and where to put the results (Fig. 2).
Graphs menu
This menu enables the user to create two-dimensional graphs. The user can customize the appearance of each graph and, in some cases, plot the observations in the graph according to a previous classification.
Ternary diagram. This feature displays a ternary diagram of three selected parts. The following options are available:
(1) differentiate, by colour or by shape, each point depending on a previous classification;
(2) label the vertices of the triangle (the default labels are the part names);
(3) perturb the data with the inverse of the centre (centring) or with a given vector;
(4) display a reference grid of values. The default values of the grid are 1, 10, 33, 66, 90 and 99, but the user can define other values in a column.
ALR plot. This feature displays a plot according to the ALR transformation of the three columns selected. There are two options which modify the appearance of the graph (Fig. 19):
(1) differentiate, by colour or by shape, each point depending on a previous classification;
(2) label the axes (default labels are log(x1/x3) and log(x2/x3)).
CLR plot. This feature displays a plot according to the centred log-ratio transformation (clr) of three selected parts. There are two options to modify the appearance of the graph:
(1) differentiate, by colour or by shape, each point depending on a previous classification;
(2) label the axes (default labels are ILR1 and ILR2).

Fig. 17. Form of the Centering routine from the Operations menu.
Fig. 18. From the Graphs menu: main form and Options form of the Ternary Diagram routine.
Fig. 19. From the Graphs menu: main form and Options form of the ALR Plot routine.
Biplot. This feature performs a 'Compositional Biplot' of selected parts. There are six options which modify the appearance of the graph (Fig. 4):
(1) indicate a column with the labels of the axes;
(2) differentiate, by colour or by shape, each point depending on a previous classification;
(3) choose the factor plane indicating which parts to display;
(4) label the observations (default is no label);
(5) display or not the observations (default is yes);
(6) display with a different mark the observations that are outliers (default is no mark).
Principal components. This feature calculates the two compositional Principal Components for a three-part composition of three selected parts and displays the result in a ternary diagram. There are two options which modify the appearance of the graph (Fig. 20):
(1) differentiate, by colour or by shape, each point depending on a previous classification;
(2) label the vertices of the triangle (default labels are the part names).
ALN confidence region. This feature calculates the 'Additive Logistic Normal Confidence Region' for the ALR-mean vector of the selected parts and displays the result in a ternary diagram. There are three options which modify the appearance of the graph (Fig. 22):
(1) perform an ALN Confidence Region for each group defined by a column;
(2) label the vertices of the triangle (the default labels are the part names);
(3) define the confidence level (the default is 0.95).
ALN predictive region. This feature calculates the 'Additive Logistic Normal Predictive Region' of the selected parts and displays the result in a ternary diagram. There are two options which modify the appearance of the graph (Fig. 21):
(1) label the vertices of the triangle (the default labels are the part names);
(2) choose the predictive levels (the default levels are 0.90, 0.95 and 0.99).

Fig. 20. From the Graphs menu: main form and Options form of the Principal Components routine.
Fig. 21. From the Graphs menu: main form and Options form of the ALN Predictive Region routine.
Fig. 22. From the Graphs menu: main form and Options form of the ALN Confidence Region routine.

Descriptive statistics menu
This menu returns characteristic values for a dataset.

Summary. Performs five descriptive statistics: two of log-ratios ('Variation Array' and 'CLR Variance') and three compositional descriptive statistics ('Centre', 'Min', 'Max' and quartiles).
(1) 'Variation Array'. Returns a matrix whose upper triangle contains the log-ratio variances and whose lower triangle contains the log-ratio means. That is, the ij-th component of the upper triangle is $\mathrm{var}[\ln(X_i/X_j)]$, and the ij-th component of the lower triangle is $E[\ln(X_i/X_j)]$, where $i, j = 1, 2, \ldots, D$.
(2) 'CLR Variance'. Returns the sum of log-ratio variances that involve each part. The sum of all 'CLR Variances' is the 'Total Variance'. So

    $$\text{CLR-Variance}_i = \frac{1}{2D} \sum_{j=1}^{D} \mathrm{var}[\ln(X_i/X_j)] \qquad (11)$$

(3) 'Total Variance'. Returns the sum of all 'CLR Variances'.
(4) 'Centre'. Returns the centre of the dataset, that is, $\hat{\mathbf{g}} = C[g_1, g_2, \ldots, g_D]$, where $g_i = \left(\prod_{k=1}^{N} x_{ki}\right)^{1/N}$ is the geometric mean of part $X_i$ in dataset X. The dataset X has been previously closed.
(5) 'Minimum' and 'Maximum'. For each part of the dataset X, returns the maximum and the minimum of the closed dataset C(X).
(6) 'Quartiles'. For each part of the dataset X, returns Q1, the median and Q3 of the closed dataset C(X).
The user has to select the columns to be closed and where to put the results. There are two options on this routine (Fig. 23):
(1) perform the statistics for each group defined by a column;
(2) choose the descriptive statistics wanted (at least one must be chosen).
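The statistics of the Summary routine can also be reproduced outside CoDaPack. The following is a minimal base-R sketch, not CoDaPack's own code: the toy data, the helper name 'closure' and the use of the sample variance (divisor N - 1) are assumptions made purely for illustration.

```r
# Hypothetical helper: close each row of a matrix to a constant sum.
closure <- function(x, total = 1) total * x / rowSums(x)

set.seed(1)
# Toy three-part dataset (simulated, not taken from the chapter).
X <- closure(matrix(rlnorm(60), ncol = 3,
                    dimnames = list(NULL, c("A", "B", "C"))))
D <- ncol(X)
L <- log(X)

# Variation array: log-ratio variances above the diagonal,
# log-ratio means below it.
variation_array <- matrix(0, D, D, dimnames = list(colnames(X), colnames(X)))
pair_var <- matrix(0, D, D)
for (i in seq_len(D)) {
  for (j in seq_len(D)) {
    lr <- L[, i] - L[, j]                      # ln(X_i / X_j)
    pair_var[i, j] <- var(lr)
    if (i < j) variation_array[i, j] <- var(lr)
    if (i > j) variation_array[i, j] <- mean(lr)
  }
}

# 'CLR Variance' of each part as in equation (11), and their sum.
clr_variance <- rowSums(pair_var) / (2 * D)
names(clr_variance) <- colnames(X)
total_variance <- sum(clr_variance)

# Centre: closed vector of geometric means of the parts.
g <- exp(colMeans(L))
centre <- g / sum(g)
```

Under these assumptions the total variance obtained from equation (11) coincides with the sum of the variances of the clr-transformed parts, which gives a quick consistency check.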
Centre. With this feature the user obtains the centre of the dataset as described in the Summary routine. The user has to select the columns to calculate the centre and where to put the result.

Total variance. With this feature the user obtains the total variance of the selected columns of the dataset as described in the Summary routine. The user has to select the columns to calculate the total variance and where to put the result.

Variation array. With this feature the user obtains the variation array of the selected columns. It returns a matrix whose upper triangle contains the log-ratio variances and whose lower triangle contains the log-ratio means of the dataset, as described in Summary. The user has to select the columns to calculate the variation array and where to put the results.

Fig. 23. From the Descriptive Statistics menu: main form and Options form of the Summary routine.
Fig. 24. Form of Atypicality Indices routine from the Descriptive Statistics menu.
Atypicality indices. With this feature the user obtains the atypical observations and their indices under the assumption of an 'Additive Logistic Normal' distribution of the selected parts. The user has to select the columns for which atypical observations are calculated and where to put the results. The user also has to indicate the threshold of atypicality (usually 0.95) (Fig. 24).

Analysis menu
In the present version this menu performs only the 'Logistic Normality Test'.

Logistic Normality Test. This feature performs a test for:
(1) all marginal, univariate distributions (with a total of D tests);
(2) all bivariate angle distributions (with a total of D(D - 1)/2 tests);
(3) the D-dimensional radius distribution.
For each kind of test, Anderson-Darling, Cramér-von Mises and Watson tests are performed.

Preferences menu
Screen size. With this feature the user indicates the resolution of the screen in order to obtain complete pictures of graphs on the screen. The default value is 1152 x 864 pixels.

Sum-constraint. With this feature the user indicates the constant used to close the data. The default value is 1 (Fig. 25).

Fig. 25. Form of the Sum Constraint routine from the Preferences menu.

This work has received financial support from the Dirección General de Investigación of the Spanish Ministry for Science and Technology through the project BFM2003-05640/MATE. The dataset from the database of Cenozoic volcanic rocks of Hungary has been kindly provided by L. Ó.Kovács and G. P. Kovács from the Hungarian Geological Survey.
Compositional data analysis with 'R' and the package 'compositions'

K. G. VAN DER BOOGAART¹ & R. TOLOSANA-DELGADO²

¹Institut für Mathematik und Informatik, Ernst-Moritz-Arndt-Universität Greifswald, Greifswald D-17487, Germany (e-mail: boogaart@uni-greifswald.de)
²Departament Informàtica i Matemàtica Aplicada, Universitat de Girona, Girona E-17071, Spain

Abstract: This paper is a hands-on introduction and shows how to perform basic tasks in the analysis of compositional data following Aitchison's philosophy, within the statistical package 'R' and using a contributed package (called 'compositions'), which is devoted specially to compositional data analysis. The studied tasks are: descriptive statistics and plots (ternary diagrams, boxplots), principal component analysis (using biplots), cluster analysis with Aitchison distance, analysis of variance (ANOVA) of a dependent composition, some transformations and operations between compositions in the simplex.
From: BUCCIANTI, A., MATEU-FIGUERAS, G. & PAWLOWSKY-GLAHN, V. (eds) Compositional Data Analysis in the Geosciences: From Theory to Practice. Geological Society, London, Special Publications, 264, 119-127. 0305-8719/06/$15.00 © The Geological Society of London 2006.

This paper will show how the basic tasks of compositional data analysis (Aitchison et al. 2002) can be performed with the package 'compositions' in the free statistical environment 'R' (R Development Core Team 2003). The paper aims to be useful for a wide spectrum of 'R' users: for this reason, it is suggested that the experienced skip these first steps, whereas those who have never heard of 'R' should begin with Appendix A before continuing with the text. It is strongly recommended that the reader be in front of the computer, typing the examples outlined here: thus, text output of these instructions is kept to a minimum, and almost all figures are not included, although they are described briefly (with a few exceptions).

First steps
'R' is a powerful computer environment for multipurpose statistics and data analysis. It is available for all computer platforms and can be downloaded from 'http://www.cran.R-project.org'. 'Compositions' is a contributed package for 'R', devoted specially to the analysis of compositional data; it can be downloaded from 'http://www.stat.boogaart.de/compositions'. 'R' and 'compositions' are both distributed and developed under the GNU public license, hence they are available free of charge. Further instructions on downloading, installation and getting started with the software can be found in Appendix A. 'R' is classically based on a command line interface, but various graphical user interfaces are available from 'http://www.cran.R-project.org'. When compared with other compositional software, the 'R' package provides a maximum of flexibility. However, being based on a computer language, it demands from its users not to be afraid of reading manuals or of typing to a command line any command found out there. It should also be remembered that 'R' and its packages are a living project, permanently adapted to the development of the field. More instructions can be found at 'http://www.stat.boogaart.de/compositions/'.

After starting 'R' (either by clicking on the appropriate icon, selecting the entry 'R' in the start menu or by typing the command 'R' to a console or command window, after installing the software) a command window appears where commands can be given to 'R'. The following appears:

    R : Copyright 2004, The R Foundation for Statistical Computing
    Version 2.0.1 (2004-11-15), ISBN 3-900051-07-0

    R is free software and comes with ABSOLUTELY NO WARRANTY.
    You are welcome to redistribute it under certain conditions.
    Type 'license()' or 'licence()' for distribution details.

    R is a collaborative project with many contributors.
    Type 'contributors()' for more information and
    'citation()' on how to cite R or R packages in publications.

    Type 'demo()' for some demos, 'help()' for on-line help, or
    'help.start()' for a HTML browser interface to help.
    Type 'q()' to quit R.
The version number should be checked, since at least version 2.0.0 is required for running compositions. The '>' mark shows that 'R' is willing to accept commands. This character should not be typed with the commands. To see how 'R' works, type '3*7', and hit the ENTER key to make 'R' execute this command:

    > 3*7
    [1] 21
    >

'R' executes the command by multiplying 3 and 7 and then prints the result 21. For the moment, ignore the '[1]'. 'R' can in this way be used as an (extremely powerful) calculator. To prepare 'R' for compositional data analysis the library compositions must be loaded with the library command:

    > library(compositions)

    Attaching package 'compositions':

        The following object(s) are masked from package:stats:

            cor cov dist var

        The following object(s) are masked from package:base:

            %*%
    >

Either such output, or no output at all, informs the user that the package has loaded properly. When an error such as this appears:

    > library(compositions)
    Error in library(compositions): There is no package called 'compositions'

this means that the package is not properly downloaded or installed. Instructions for downloading and installation of the package can be found in Appendix A. After loading the package, some example data from the package should be loaded with the 'data' command:

    > data(SimulatedAmounts)   # Load example data (no output)
    > ? SimulatedAmounts       # Show help about example data
    > ls()                     # Show names of all variables/datasets
    [1] "sa.dirichlet"  "sa.dirichlet.dil"  "sa.dirichlet.mix"
    [4] "sa.dirichlet5" "sa.dirichlet5.dil" "sa.dirichlet5.mix"
    ... (lines omitted)

Note that a hash mark '#' denotes the beginning of a comment: after it, the rest of the line is ignored by 'R'. Therefore, it is not necessary to type the comments. When working in a terminal, the help can be closed by typing 'q' for Quit. In a windows-based environment the help window can simply be closed.

The other commands show a typical usage of 'R': use '?' to get help information, or 'ls()' to show all variables/datasets defined previously. Just type the name of a dataset to show its content, which in this case is a set of simulated amounts of three different chemical elements in ppm:

    > sa.lognormals   # Show one of the datasets
                 Cu         Zn        Pb
     [1,] 8.8043262 35.1671810 45.895025
     [2,] 0.8115227  2.6547329 47.804310
     [3,] 1.2836130 12.4472047 40.553628
    ... (lines omitted)
    [60,] 3.9854998  6.1301909 40.579417
To edit or just to inspect the dataset in a spreadsheet-like environment, the command 'fix(sa.lognormals)' may be used. Appendix A contains instructions on how to load datasets.

Basic compositional data analysis
The zero step when using the package is to mark your data explicitly as a set of elements from a simplex under Aitchison geometry (Aitchison et al. 2002). This is done by converting the dataset to an Aitchison compositional set through the function 'acomp', and storing it into a new variable by using the assignment sign '<-' [...]

    > pie(mean(cdata))
    > barplot(mean(cdata))
¹For some commands (barplot, boxplot, cdt, cor, cov, idt, mean, names, perturbe, plot, power, princomp, qqnorm, rnorm, runif, scale, segments, split, summary, var, +, -, *, /, %*%) you need to add '.acomp' to see the Aitchison compositional specific help.

Fig. 1. Biplot of a three-part composition (Cu, Zn, Pb).
[...] component analysis, which uses the clr transform (Aitchison 2002):

    > pca <- princomp(cdata)
    > pca               # display results as text
    Call:
    princomp.acomp(x = cdata)

    Standard deviations:
       Comp.1    Comp.2
    1.3604382 0.4460269

     3 variables and 60 observations.
    Mean (compositional):
            Cu         Zn         Pb
    0.08918175 0.23949922 0.67131903
    attr(,"class")
    [1] "acomp"
    +Loadings (compositional):
                  Cu        Zn        Pb
    Comp.1 0.5533583 0.5570883 1.8895534
    Comp.2 0.4207858 1.7307697 0.8484445
    attr(,"class")
    [1] "acomp"
    -Loadings (compositional):
                 Cu        Zn        Pb
    Comp.1 1.312246 1.3034604 0.3842932
    Comp.2 1.725060 0.4193976 0.8555428
    attr(,"class")
    [1] "acomp"
    > screeplot(pca)    # display importance of components
    > biplot(pca)       # display direction of components
The last component always has no importance. The first component, giving the highest variation, corresponds here to the balance of Pb against Zn and Cu (as can be seen in Fig. 1). It explains 90% of the variability, as can be obtained by dividing the variance of the first component by the metric variance of 'cdata'.
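The 90% figure can be checked from the standard deviations printed by 'pca'. The short sketch below assumes that the two printed standard deviations belong to the only non-trivial components, so that their squared sum plays the role of the metric variance of 'cdata'; the numbers are copied from the output above and the variable names are illustrative.

```r
# Standard deviations copied from the printed princomp output.
sdev <- c(Comp.1 = 1.3604382, Comp.2 = 0.4460269)

# Share of the total (metric) variance carried by each component.
explained <- sdev^2 / sum(sdev^2)
round(explained, 3)
```

The first entry is close to 0.90, in agreement with the statement in the text.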
Working with compositions of four or more parts
To analyse a different dataset or a subcomposition you might assign something different to 'cdata' or any other variable representing your compositional dataset, e.g. by

    > data(SimulatedAmounts)   # Load the example datasets
    > sa.groups5               # One of these
    > cdata <- [...]
    > cdata

The optional parameter 'parts=' allows you to select the parts to be used in the subcomposition. Optional parameters are a typical way for 'R' to provide additional functionality beyond the default behaviour of a command. The possible optional parameters and their effects are documented in the help to each command, which can be invoked by '? nameoffunction'. The 'c()' function is just here to concatenate the variable names. In principle, every one of the aforementioned commands can now be applied to the new dataset. Try:

    > plot(cdata)

Since a ternary diagram can display only three parts at the same time, a table of multiple ternary diagrams, containing subcompositions or marginal compositions (Fig. 2), must be displayed. By default, two parts are determined by the row and the column occupied by each plot, and the geometric mean of the remaining components is taken as the third component. Alternatively one can specify a component by using the optional parameter 'margin':

    > plot(cdata, margin="Cd")

Performing a cluster analysis with Aitchison distance
A hierarchical cluster analysis can be performed with the following instructions. First the clustering must be computed and the result stored in a variable:

    > Clusters <- [...]
    > [...]                    # shows the dendrogram
When the user has decided on the number of groups to interpret, maybe four in this case, a new variable containing the groups assigned to each case can be generated and the group membership can be displayed in ternary diagrams, boxplots and biplots.

    > group <- [...]
    > group
     [1] 1 1 2 2 2 2 2 2 2 1 2 2 2 2 2 1 1 2 2 1 3 2 3 2 4 3 3 2 2 4 3 4 2 2 3 2 2 2
    [39] 3 1 3 3 2 4 4 3 4 4 4 3 4 4 3 4 4 4 3 3 4 4
    > plot(cdata, col=group)
    > plot(cdata, pch=group)
    > plot(cdata, pch=as.character(group))
    > plot(cdata, col=group, center=T)   # display centered data
    > plot(Clusters, labels=group)
    > biplot(princomp(cdata), xlabs=group)
    > boxplot(cdata, factor(group))

Fig. 2. Matrix of ternary diagrams of a four-part composition (Cd, Pb, Co, Cu).
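The idea behind clustering with the Aitchison distance can be sketched in base 'R' without the package: the Aitchison distance between two compositions equals the Euclidean distance between their clr-transformed vectors. The example below is an illustration under stated assumptions, not the package's implementation: the simulated data, the helper 'clr', the complete-linkage method and the choice of two groups are all hypothetical.

```r
# Hypothetical helper: centred log-ratio transform applied row-wise.
clr <- function(x) {
  lx <- log(x)
  lx - rowMeans(lx)
}

set.seed(42)
# Toy amounts for four parts; rows 11-20 are strongly enriched in Cd.
amounts <- matrix(rlnorm(80, sdlog = 0.2), ncol = 4,
                  dimnames = list(NULL, c("Cd", "Pb", "Co", "Cu")))
amounts[11:20, "Cd"] <- amounts[11:20, "Cd"] * 100
comp <- amounts / rowSums(amounts)          # closed compositions

# Aitchison distance = Euclidean distance between clr vectors.
aitchison_dist <- dist(clr(comp))

clusters <- hclust(aitchison_dist, method = "complete")
group <- cutree(clusters, k = 2)            # cut the dendrogram into 2 groups
table(group)
```

Because the clr transform removes the row totals, the same distances are obtained from the raw amounts and from the closed data, which is the scale invariance expected of a compositional method.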
Compositional computation

Various mathematical transformations and operations are defined for the Aitchison simplex (Aitchison et al. 2002). These are interesting mainly for developers of new statistical methods. The perturbation and the power transform are considered as addition and scalar multiplication in a vector space structure of the simplex,

> a <- ...
> a
...
> b <- ...
> b
[1] 0.8 0.1 0.1
attr(,"class")
[1] "acomp"
> a+b # adding is perturbation
[1] 0.6666667 0.1666667 0.1666667
attr(,"class")
[1] "acomp"
> 2*a # multiplication is power transform
[1] 0.1111111 0.4444444 0.4444444
attr(,"class")
[1] "acomp"
> (a+a)/2-a # inverse operations
[1] 0.3333333 0.3333333 0.3333333
attr(,"class")
[1] "acomp"
> xx <- ...
> xx
...
> mean(xx) # xx is centered
  Cd   Pb   Co   Cu
0.25 0.25 0.25 0.25
attr(,"class")
[1] "acomp"
> msd(xx) # and normalized
[1] 1
> a %*% a # scalar product
[1] 0.320302
> norm(a) # norm
[1] 0.5659523
> cdata %*% acomp(c(1,2,3,4)) # scalar products
[1]  0.43526714 -1.53700304  2.40682769  1.61307672  2.64960822  2.38350333
... (lines omitted)
> yy <- ...
> var(yy) # var^-0.5 * cdata
      [,1]  [,2]  [,3]  [,4]
[1,]  0.75 -0.25 -0.25 -0.25
[2,] -0.25  0.75 -0.25 -0.25
[3,] -0.25 -0.25  0.75 -0.25
[4,] -0.25 -0.25 -0.25  0.75
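The operations shown in the session above can also be written out in a few lines of base 'R'. The following is only a minimal sketch, not the code of the 'compositions' package: the helper names closure, perturb, powering and clr are hypothetical, and the definition of a as the closure of (1, 2, 2) is an assumption chosen so that the printed results for a+b, 2*a, a %*% a and norm(a) are reproduced.

```r
# Closure: rescale a vector of positive parts so that it sums to 1.
closure <- function(x) x / sum(x)

# Perturbation (the "addition" of the simplex): componentwise product, then closure.
perturb <- function(x, y) closure(x * y)

# Power transform (the "scalar multiplication"): componentwise power, then closure.
powering <- function(alpha, x) closure(x^alpha)

# Centred log-ratio transform of a single composition.
clr <- function(x) log(x) - mean(log(x))

a <- closure(c(1, 2, 2))   # assumption: 0.2 0.4 0.4
b <- closure(c(8, 1, 1))   # 0.8 0.1 0.1, as printed in the session

a_plus_b <- perturb(a, b)                # counterpart of a+b
two_a    <- powering(2, a)               # counterpart of 2*a
# Counterpart of (a+a)/2-a: should be the neutral element (1/3, 1/3, 1/3).
neutral  <- perturb(powering(0.5, perturb(a, a)), powering(-1, a))
a_dot_a  <- sum(clr(a) * clr(a))         # Aitchison scalar product of a with itself
a_norm   <- sqrt(a_dot_a)                # Aitchison norm of a
```

Writing the scalar product through the clr transform makes it visible that the Aitchison geometry is ordinary Euclidean geometry on clr-transformed data.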
The standard transforms can be computed by

> CenteredLogRatio <- ...
...
                Df Pillai approx F num Df den Df    Pr(>F)
sa.groups5.area  2 1.0872  22.2312      6    112 < 2.2e-16 ***
Residuals       57
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> plot(ilr.inv(residuals(m)), col=sa.groups5.area)
> plot(ilr.inv(predict(m)), col=sa.groups5.area)
> qqnorm(ilr.inv(residuals(m)))
> mvar(predict(m))/(mvar(residuals(m)+predict(m))) # "R"^2
[1] 0.3980416
> diag(ilrvar2clr(var(predict(m)))/ilrvar2clr(var(residuals(m)+predict(m))))
[1] 0.4001846 0.5670027 0.1392320 0.2654141
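The ratio labelled "R"^2 in the session above compares the total variance of the predictions with the total variance of the data. A minimal sketch in base 'R' follows. It assumes simulated three-column coordinates in place of the ilr-transformed data and a made-up three-level grouping factor; total_var is a hypothetical helper that takes the trace of the covariance matrix, which is what a metric variance amounts to when the columns are ilr coordinates.

```r
set.seed(1)
n <- 60
group <- factor(rep(c("A", "B", "C"), each = n / 3))  # made-up groups

# Simulated coordinates standing in for ilr(cdata); the group shifts are arbitrary.
shift <- as.numeric(group) - 1
Y <- cbind(shift + rnorm(n), -shift + rnorm(n), rnorm(n))

m    <- lm(Y ~ group)       # multivariate linear model, one column per coordinate
pred <- fitted(m)
res  <- residuals(m)

# Total variance: trace of the covariance matrix of the coordinates.
total_var <- function(M) sum(diag(var(M)))

r2 <- total_var(pred) / total_var(pred + res)
```

Because least-squares residuals are uncorrelated with the fitted values, the same number is obtained as one minus the residual share of the total variance.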
Here one sees a highly significant influence of the group, given by a p-value stated as '< 2.2e-16'. If this example were run, a series of plots would result: the first one would show the residuals with substantial spread. The second plot shows the location of the predicted group means in ternary diagrams. Unfortunately, the variable names are lost during the ilr transform, such that the plots are drawn without labels. The third plot shows qqnorm-plots of the pairwise log-ratios, in order to check the normality assumption used in the manova. The last two calculations give the total R^2 of the model of about 40% and the individual R^2's for the four parts of the composition. In a similar way a discriminant analysis can be performed based on the ilr transform and standard functionality of 'R':

> library(MASS) # Loading appropriate library
> # Generating example data
> subsample <- ...
> TrainingData <- ...
> TrainingGroups <- ...
> ControlData <- ...
> ControlGroups <- ...
> ControlGroups
 [1] Upper Upper Upper Middle Middle Middle Middle Middle Middle Middle
[11] Lower Lower Lower Lower Lower
Levels: Lower Middle Upper
> # Performing the discriminant analysis
> dscr <- ...
> dscr
... (output omitted)
> predict(dscr, newdata=ilr(ControlData)) # Classify ControlData
$class
 [1] Upper Upper Upper Middle Middle Lower Middle Middle Middle Middle
[11] Lower Lower Lower Lower Lower
Levels: Lower Middle Upper
$posterior
         Lower       Middle        Upper
1 3.626286e-16 1.851031e-07 9.999998e-01
2 7.991869e-12 8.473827e-05 9.999153e-01
... (lines omitted)
> table(ControlGroups, predict(dscr, newdata=ilr(ControlData))$class)
ControlGroups Lower Middle Upper
       Lower      5      0     0
       Middle     1      6     0
       Upper      0      0     3
The calculated classification of the 15 control samples based on 45 training samples was thus correct, with one exception. More detailed information about discriminant analysis and the 'lda' function can be found in the 'R' help.
Conclusions

For the beginner, this approach immediately provides all basic compositional plots, summaries and transformations in the form of simple standard commands given in this publication. More helpful reading can be found in Using the R package 'compositions', available at http://www.stat.boogaart.de/compositions. Users can perform advanced analysis using the package in combination with the statistical sub-routines of 'R' as exemplified in the later chapters, and experts can even extend the functionality through the programming interface of 'R'. The authors are open to suggestions to include more functionality and convenience in the package.

Appendix A: Help with technical details

Downloading and installing 'R'

On 'http://www.cran.R-project.org' one can find detailed instructions on downloading and installing 'R', as well as the downloadable packages themselves. Users must download the setup program of the base part of a precompiled binary distribution of 'R' for their platform. For example, for Windows users it is sufficient to download the 'rw????.exe' file from 'http://www.cran.R-project.org' and to double click the downloaded file to start the installation process.

Downloading and installing 'compositions'

The package is available from 'http://www.stat.boogaart.de/compositions/' or as a contributed package from 'http://www.cran.R-project.org'. In a Windows system ('R' v.2.0.1 or later) the package can now be installed through the 'Packages' menu option 'Install package(s) from local zip file...'. On Unix/Linux systems it is done by the command:

R CMD INSTALL DownloadedPackage.tar.gz

with 'DownloadedPackage.tar.gz' replaced by the actual filename of the downloaded package.

Importing data to 'R'

The simplest way to provide data to 'R' is to store them in a simple text file. The first row should contain the variable names separated by semicolons. The following lines contain the data, again separated by semicolons.

Cd;Zn;Pb;Cd;Co
1.2;2.6;4.9;0.2;5
23.4;11;0.2;0.002;6.2
...

The data can then be loaded by the 'R' commands:

> mydata <- ...
> fix(mydata) # you must close the window afterwards

One should always check with the fix command that the data are properly loaded before using them. Directories must always be separated by a forward slash '/' in pathnames. All spreadsheet programs can export to this format when instructed to store as '.csv'. The separator and the decimal symbol can vary with the local configuration of the computer. Note that 'R' only uses a dot as the decimal symbol in any output, although the import procedure accepts the optional parameter 'dec=","' to deal with the decimal comma (see '?read.csv' or '?read.table'). To use the tabulator as a separating character use 'sep="\t"' in the 'read.csv' command.

Newbee problems and solutions

• 'R' is not exactly made for beginners: it's not your fault. Try to find someone to help you. Next year you will be the expert.
• When 'R' does not find your file, give the whole path and separate the directories with '/', not with a backslash. Don't forget the extension (e.g. '.txt').
• Wear glasses when copying and typing commands.
• When you get neither a plot nor an error, the plot window is probably iconified.
• When 'R' answers with '+' instead of '>', you have made a typing error and 'R' thinks that the command is not yet finished. Type ';', the ENTER-key, and try again.
• When you are bored by retyping commands again and again, try the up and down arrow keys or copy and paste from your favourite editor or a script window.
• It doesn't work in the second session: have you loaded all necessary libraries and prepared all variables?
• 'R' comes with plenty of help. Type the 'help.start()' command and start with 'Introduction to R'.
• Type 'q()' for quit and the ENTER-key to leave 'R'. Save your workspace, when asked.
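The import procedure described under 'Importing data to R' can be tried without creating a file. The sketch below assumes a small made-up dataset written with semicolons as separators and commas as decimal symbols; the text argument of read.csv stands in for a file name, and the column names and values are illustrative only.

```r
# Made-up file contents: semicolon-separated, decimal commas.
csv_text <- "Cd;Zn;Pb;Co\n1,2;2,6;4,9;5\n23,4;11;0,2;6,2"

# sep and dec are spelled out explicitly; read.csv2() uses the same two
# settings as its defaults.
mydata <- read.csv(text = csv_text, sep = ";", dec = ",")

# A tab-separated variant of the same idea.
tab_text <- "Cd\tZn\n1.2\t2.6"
mytab <- read.csv(text = tab_text, sep = "\t")

str(mydata)   # every column should have been read as numeric
```

With a real file, the first argument would be the path, written with forward slashes, instead of text=.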
References

AITCHISON, J. 2002. Simplicial inference. In: VIANA, M. A. G. & RICHARDS, D. S. P. (eds) Algebraic Methods in Statistics and Probability. Contemporary Mathematics Series, 287, American Mathematical Society, Providence, Rhode Island, 1-22.
AITCHISON, J., BARCELO-VIDAL, C., EGOZCUE, J. J. & PAWLOWSKY-GLAHN, V. 2002. A concise guide to the algebraic geometric structure of the simplex, the sample space for compositional data analysis. In: BAYER, U., BURGER, H. & SKALA, W. (eds) Proceedings of the 8th Annual Conference of the International Association for Mathematical Geology, Berlin, Germany, 387-392.
PAWLOWSKY-GLAHN, V. & EGOZCUE, J. J. 2001. Geometric approach to statistical analysis on the simplex. Stochastic Environmental Research and Risk Assessment, 15 (5), 384-398.
R Development Core Team 2003. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria (http://www.R-project.org).
Visualization of three- and four-part (sub)compositions with R

M. BREN 1,3 & V. BATAGELJ 2,3

1 University of Maribor, Faculty of Organizational Sciences, Kidričeva 55a, 4000 Kranj, Slovenia (e-mail: matevz.bren@fov.uni-mb.si)
2 University of Ljubljana, Faculty of Mathematics and Physics, Jadranska 19, 1000 Ljubljana, Slovenia
3 Institute of Mathematics, Physics and Mechanics, Jadranska 19, 1000 Ljubljana, Slovenia

Abstract: In 2003 the MixeR (Mixtures with R) project was started and work began to develop a library of functions written in R to support the analysis of compositional data, i.e. mixtures. This paper presents the 'mix' object in R, reading different data file formats and some MixeR routines for graphical presentation of three- and four-part (sub)compositions in ternary diagrams and tetrahedrons. Additional graphical features and use of parameters are applied on real data - a glacial dataset and a dataset of a researcher's daily activities, both from Aitchison's (1986) book. All these routines and datasets are available at http://vlado.fmf.uni-lj.si/pub/MixeR
The paper begins by introducing the two datasets that will serve as examples: the data on a researcher's daily activities and the glacial data. Readers are strongly recommended to sit in front of a computer with R installed (R beginners: see Downloading and installing R in Boogaart & Tolosana-Delgado 2006) and to type the examples outlined here. In this way the reader will also see all figures produced in colour and will understand the ideas on classification and classes more clearly. All MixeR routines and the two datasets are available at http://vlado.fmf.uni-lj.si/pub/MixeR
Researcher's daily activities data

Dataset No. 31 from Aitchison (1986) gives the activity patterns of a statistician for 20 days. The data show the proportions of the 24 hours of a day devoted to each of six activities: teaching, consultation, administrative work, research, other wakeful activities and sleep. The activity proportions, not the values in hours, are given. The data are therefore portions of a day summing to one, thus compositional, i.e. mixtures. The data are stored in a file in matrix form, with days as rows and activities as columns. The first row comprises the abbreviations of the activities, i.e. the variable names: teac - teaching, cons - consultation, admi - administration, rese - research, wake - other wakeful activities and slee - sleep. This dataset is presented in Table 1. The six activities may be divided into two categories: 'work' comprising activities 1, 2, 3 and 4, and 'leisure' comprising activities 5 and 6. These data will be used to present the subcompositional concepts, visualization of the data in ternary diagrams presenting variability with border percentile lines, centring, etc.
Glacial dataset
Dataset No. 18 from Aitchison (1986) gives 92 samples of pebbles of glacial tills sorted into four categories: red sandstone, grey sandstone, crystalline and miscellaneous. The percentages by weight of these four categories and the total pebble counts are recorded. These data are stored in the 'CoDa' data file format, in which each line comprises just one record: the first line holds the data file name, the second the number of variables, the third the number of cases and the next rows the variable labels; then follow No. 1 for the first case and the variable values for the first case, then No. 2 and the variable values for the second case, etc. until the last, No. 92 (Table 2). The data in Table 2 will be used to present the visualization in tetrahedrons and KiNG Mage viewer animation, dealing with zeros, Aitchison's distance computation and classification.
Compositional data analysis software tools
First, a brief, non-exhaustive history of software tools for compositional data analysis will be given. CoDa, a microcomputer package for the statistical analysis of compositional data, was the first software on compositions, written in Quick Basic by John Aitchison and available with his book (Aitchison 1986). CoDa was later upgraded by John Bacon-Shone.

From: BUCCIANTI, A., MATEU-FIGUERAS, G. & PAWLOWSKY-GLAHN, V. (eds) Compositional Data Analysis in the Geosciences: From Theory to Practice. Geological Society, London, Special Publications, 264, 129-143. 0305-8719/06/$15.00 © The Geological Society of London 2006.

Table 1. Researcher's daily activities dataset

      teac   cons   admi   rese   wake   slee
 1   0.162  0.041  0.138  0.123  0.254  0.282
 2   0.200  0.039  0.073  0.076  0.346  0.266
 3   0.201  0.082  0.115  0.146  0.194  0.261
 4   0.134  0.077  0.107  0.146  0.214  0.321
 5   0.224  0.080  0.091  0.162  0.195  0.248
 6   0.144  0.063  0.103  0.123  0.316  0.252
 7   0.125  0.054  0.137  0.102  0.312  0.270
 8   0.127  0.077  0.110  0.101  0.341  0.244
 9   0.139  0.052  0.128  0.111  0.266  0.304
10   0.108  0.052  0.082  0.075  0.413  0.270
11   0.187  0.091  0.113  0.116  0.264  0.228
12   0.184  0.070  0.066  0.151  0.305  0.216
13   0.155  0.086  0.101  0.119  0.225  0.315
14   0.181  0.097  0.081  0.164  0.271  0.206
15   0.224  0.096  0.101  0.142  0.203  0.234
16   0.198  0.067  0.139  0.154  0.162  0.281
17   0.214  0.073  0.102  0.130  0.201  0.281
18   0.132  0.037  0.148  0.099  0.307  0.277
19   0.167  0.073  0.127  0.122  0.266  0.245
20   0.166  0.064  0.101  0.145  0.242  0.282

From dataset 31 (Aitchison 1986).
Table 2. Glacial dataset

b:glacial.dat
5
92
Case no   A      B      C     D     Count
1         91.8    7.1   1.1   0     282
2         88.9   10.1   0.5   0.5   368
...
92        31.4   65.9   2.7   0     698

From dataset 18 (Aitchison 1986).
CoDaPack freeware software was next, written in Excel in 2001 by Santiago Thió Fernández de Henestrosa and Josep Antoni Martín-Fernández from the Girona compositional research group. An introduction is available in this volume (Thió-Henestrosa & Martín-Fernández 2006).

There were also some attempts written in R - a language and environment for statistical computing and graphics. R (http://www.r-project.org/) is 'GNU S'; it provides a wide variety of statistical and graphical techniques (linear and non-linear modelling, statistical tests, time-series analysis, classification, clustering, etc.). Further extensions can be provided as packages. Basic Compositional Data Analysis functions for S+/R comprising basic operations, transformations, estimators and plots, written by Joel Reynolds & Dean Billheimer (2002) from the Washington University, are available at http://www.biostat.wustl.edu/archives/html/s-news/2003-12/msg00139.html

In 2003 work began to develop a library MixeR of functions in R to support the analysis of compositional data, i.e. mixtures (Bren & Batagelj 2003). Routines were provided for:

• Operations on compositions: perturbation and power transformation, subcomposition with or without residuals, centring of the data, computing Aitchison's, Euclidean, Bhattacharyya distances and compositional Kullback-Leibler divergence - see Martín-Fernández et al. (1999).
• Graphical presentation of three- and four-part (sub)compositions in ternary diagrams and tetrahedrons with additional features: geometric mean of the dataset, the percentiles and ratio lines, centring of the data, notation of individual data in the set, marking and colouring of subsets of the dataset, their geometric means, etc.
• Log-ratio transformations of compositions into real vectors that are amenable to standard multivariate statistical analysis, etc.

The current version of the MixeR library is available at http://vlado.fmf.uni-lj.si/pub/MixeR

In April 2005 Kjetil Halvorsen, author of the 'Fahrmeir' R package, reported his work on coding some compositional routines in R (operations on compositions, alr and clr transformations, etc.). In June 2005 a 'compositions' package, written by K. Gerald van der Boogaart and Raimon Tolosana-Delgado, was published and is now available at http://cran.r-project.org/src/contrib/Descriptions/compositions.html. To analyse compositions this package supports four different multivariate scales represented by four different classes:

'rplus' - the total amount is meaningful and data are analysed in real geometry;
'rcomp' - the total amount is meaningless or the individual amounts are parts of a whole in equal units and data are analysed in real geometry;
'acomp' - the total amount is meaningless or the individual amounts are parts of a whole in equal units and the data should be analysed in a relative, i.e. Aitchison's geometry;
'aplus' - the total amount is meaningful and the data should be analysed in relative geometry.

Choosing the right type of analysis according to the data is left to the user. The package manual 'The compositions Package' and the introduction 'Using the R package "compositions"' are also available. An introduction to this package is also available in this volume (Boogaart & Tolosana-Delgado 2006).
The mixture class in R

The mix object in R will be presented, dealing with different data file formats and some R routines for graphical presentation of three- and four-part (sub)compositions in ternary diagrams and tetrahedrons, not incorporated in the 'compositions' package. Additional features will be applied: plotting the geometric mean of the dataset, ratio lines and/or percentile lines, marking and colouring subsets of the dataset and centring of the dataset. The input mixture data consist of a data matrix preceded by a title. They are represented as an R 'data frame', an object m consisting of

m$tit     the title of the dataset,
m$sum     the value of the row sums, if constant,
m$sta     the status of the mix object with values
            -2  matrix contains negative elements,
            -1  zero row sum exists,
             0  matrix contains zero elements,
             1  matrix contains positive elements, rows with different row sum(s),
             2  matrix with constant row sum and
             3  normalized mixture, the row sums are all equal to 1.
m$mat     the matrix with the data, and
m$class   the special attribute of the object, used to allow for an object-orientated style of programming in R. For the MixeR purpose the class is defined 'mixture'.

Example of the mixture object - the dataset of researcher's daily activities (Table 1). Proportions of a day in activity are given for a statistician for 20 days.

> m <- ...
> m

Here the R cursor mark '>' denotes the start of the command line and '<-' is an assignment operator. The output is

$tit
[1] "Researcher's daily activities"

$sum
[1] NA

$sta
[1] 1

$mat
    teac  cons  admi  rese  wake  slee
1  0.162 0.041 0.138 0.123 0.254 0.282
2  0.200 0.039 0.073 0.076 0.346 0.266
3  0.201 0.082 0.115 0.146 0.194 0.261
...
19 0.167 0.073 0.127 0.122 0.266 0.245
20 0.166 0.064 0.101 0.145 0.242 0.282

attr(,"class")
[1] "mixture"

It should be explained that $sum is NA, not available, because the row sums are not all exactly equal to one due to rounding errors; therefore, the status $sta is one, i.e. the matrix contains positive elements and rows with different row sums.

The 'mix' procedures in R

Some basic MixeR routines are presented.
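Before turning to the individual routines, the structure of the mix object and its status codes can be illustrated with a small base R sketch. This is not the MixeR source: mix_status and make_mix are hypothetical helpers, and the order in which the conditions are tested is an assumption. Only the field names ($tit, $sum, $sta, $mat), the status values and the default eps=1e-6 are taken from the description of the mixture class.

```r
# Hypothetical status check following the codes -2, -1, 0, 1, 2, 3.
mix_status <- function(mat, eps = 1e-6) {
  rs <- unname(rowSums(mat))
  if (any(mat < 0))  return(list(sum = NA, sta = -2))  # negative elements
  if (any(rs == 0))  return(list(sum = NA, sta = -1))  # zero row sum exists
  if (any(mat == 0)) return(list(sum = NA, sta = 0))   # zero elements
  if (max(rs) - min(rs) >= eps) return(list(sum = NA, sta = 1))  # different row sums
  if (abs(rs[1] - 1) < eps) return(list(sum = 1, sta = 3))       # normalized mixture
  list(sum = rs[1], sta = 2)                                     # constant row sum
}

# Hypothetical constructor: a list with the four fields and class 'mixture'.
make_mix <- function(mat, tit) {
  st <- mix_status(mat)
  structure(list(tit = tit, sum = st$sum, sta = st$sta, mat = mat),
            class = "mixture")
}

# First three days of Table 1; the third row sums to 0.999 because of rounding.
act <- rbind(c(0.162, 0.041, 0.138, 0.123, 0.254, 0.282),
             c(0.200, 0.039, 0.073, 0.076, 0.346, 0.266),
             c(0.201, 0.082, 0.115, 0.146, 0.194, 0.261))
colnames(act) <- c("teac", "cons", "admi", "rese", "wake", "slee")

m  <- make_mix(act, "Researcher's daily activities")
mn <- make_mix(act / rowSums(act), "Researcher's daily activities")  # rows rescaled to 1
```

In this sketch m gets status 1 with $sum NA, as in the printed object, while the rescaled version mn gets status 3.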
mix.Read(file, eps=1e-6)

Reads mixture data from the file and returns it as a mixture object. If |m$sum-1|<eps it sets m$sta=3. The default value for eps is 1e-6.

mix.ReadML(file, eps=1e-6)

Reads a 'CoDa' data file and returns a mixture object. If |m$sum-1|<eps it sets m$sta=3. The default value for eps is 1e-6.

mix.Check(m, eps=1e-6)

Determines the m$sum and m$sta of a given mixture object m. The default value for eps is 1e-6.

mix.Normalize(m, c=1)

Normalizes a given mixture object m if m$sta>=0. The row sums are now normalized to the constant c with default value c=1.

mix.Matrix(a, t)

Joins a matrix data a and the title t into a mixture object.

mix.Random(nr, nc, c=1)

Constructs a random mix object with nr rows and nc columns and constant row sum c with default value c=1. The command matrix(runif(nr*nc),nr,nc) is applied to calculate matrix elements, where the runif(n, min=0, max=1) function generates random deviates of the uniform distribution. Then the row sums are normalized to the constant c.

The subcomposition routines

mix.Sub(m, k, Normalize=TRUE)

Output mix object is computed as a subcomposition of m without the columns given by the components of the vector k. The output mix object is normalized if Normalize=TRUE, the default value.

mix.SubRes(m, k)

Output is the normalized subcomposition without the columns given by the components of the vector k; these columns are amalgamated in the residual.

mix.Extract(m, k, Normalize=TRUE)

Gives the subcomposition of m with only the columns given by components of the vector k, normalized if Normalize=TRUE, the default value.

mix.ExtractRes(m, k)

Gives the subcomposition with the columns given by the components of the vector k; all the rest is amalgamated in the residual. Output is the normalized mixture object with length(k)+1 columns.

Example

Determine the three-part subcompositions of activity data comprising teaching, consulting and research activities, therefore excluding the 3rd, 5th and 6th columns. To define a vector in R the c - concatenate command is applied.

> m <- ...
> mix.Sub(m, c(3,5,6))
$tit
[1] "Researcher's daily activities"

$sum
[1] 1

$sta
[1] 3

$mat
    teac  cons  rese
1  0.497 0.126 0.377
2  0.635 0.124 0.241
...
20 0.443 0.171 0.387

attr(,"class")
[1] "mixture"

Example

To determine the three-part subcompositions of activity data on teaching and research activities, with all the rest (the 2nd, 3rd, 5th and 6th variables) amalgamated in the residual.

> mix.ExtractRes(m, c(1,4))
$tit
[1] "Researcher's daily activities"

$sum
[1] 1

$sta
[1] 3

$mat
    teac  rese residual
1  0.162 0.123    0.715
2  0.200 0.076    0.724
...
20 0.166 0.145    0.689

attr(,"class")
[1] "mixture"

Visualization in the ternary diagram routine

The mix.Ternary routine draws a ternary diagram with points marking the data. The routine mix.Ternary(m, dist, distG, cls, Centre, Borders, Gmean) has the following parameters:

m         the mix object,
dist      displaces the numbers marking percentile lines for additional space given by the components of the vector dist. First component for percentile lines to the vertex No. 1 = top, second to the vertex No. 2 = right, and third to the vertex No. 3 = left, i.e. each component corresponds to one vertex. This displacing is needed to prevent overlaying. The default value is dist=c(0.05, 0.05, 0.05),
distG     displaces the numbers marking percentile lines of the geometric mean for additional space to prevent overlaying. Additional space is given by the components of the vector distG, each component corresponds to one vertex. The default value is distG=c(0.05, 0.05, 0.05),
cls       colours of the percentile lines,
Centre    centres the dataset if Centre=TRUE,
Borders   draws border percentile lines if Borders=TRUE,
Gmean     draws the geometric mean of the data if Gmean=TRUE.

The default value for all these options is FALSE.

Example

In Figure 1, the plots of activity data subcompositions in ternary diagrams with the geometric mean and the border percentile lines are produced by the following commands

> mix.Ternary(mix.Sub(m, c(3,5,6)), distG=c(.1, .1, .1), Gmean=T)
> mix.Ternary(mix.Sub(m, c(3,5,6)), Borders=T, cls=c("red", "magenta", "blue"))

Fig. 1. Three-part (teaching, consultation and research) subcompositions with (a) the geometric mean; and (b) border percentile lines.

In Figure 2, the plots of activity data subcompositions (with residual) in ternary diagrams with border percentile lines are produced by the following commands

> mix.Ternary(mix.SubRes(m, c(2,3,5,6)), dist=c(.05, .1, .05), Borders=T)
> mix.Ternary(mix.SubRes(m, c(2,3,5,6)), dist=c(.05, .1, .05), Borders=T, Centre=T)

Example

The glacial dataset (Table 2) consists of percentages by weight for 92 samples of pebbles of glacial tills sorted into four categories - red sandstone, grey sandstone, crystalline and miscellaneous. The percentages by weight of these four categories and the total pebble counts are recorded. The data are stored in a CoDa data file format.

> m <- ...
> m
$tit
[1] "GLACIAL DATA 92 samples of pebbles of glacial tills sorted into four categories percentages by weight"

$sum
[1] NA

$sta
[1] 0

$mat
      A    B   C   D Count
1  91.8  7.1 1.1 0.0   282
2  88.9 10.1 0.5 0.5   368
...
90 15.9 83.3 0.8 0.0   245
91 16.9 74.3 1.2 5.9   575
92 31.4 65.9 2.7 0.0   698

attr(,"class")
[1] "mixture"

> mix.Ternary(mix.Sub(m, c(4,5)), Gmean=T)
> mix.Ternary(mix.Sub(m, c(4,5)), Borders=T, Centre=T)

See Figure 3.
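The subcomposition outputs printed for the activity data can be checked by hand. The sketch below uses base R only and the first two days of Table 1; sub_without and extract_res are hypothetical helpers written for this illustration, not the MixeR routines mix.Sub and mix.ExtractRes themselves.

```r
# First two days of Table 1.
act <- rbind(c(0.162, 0.041, 0.138, 0.123, 0.254, 0.282),
             c(0.200, 0.039, 0.073, 0.076, 0.346, 0.266))
colnames(act) <- c("teac", "cons", "admi", "rese", "wake", "slee")

# Drop the columns in k and rescale each row to sum 1 (analogue of mix.Sub).
sub_without <- function(mat, k) {
  s <- mat[, -k, drop = FALSE]
  s / rowSums(s)
}

# Keep the columns in k, amalgamate the rest into one 'residual' part and
# rescale each row to sum 1 (analogue of mix.ExtractRes).
extract_res <- function(mat, k) {
  res <- cbind(mat[, k, drop = FALSE],
               residual = rowSums(mat[, -k, drop = FALSE]))
  res / rowSums(res)
}

sub3 <- round(sub_without(act, c(3, 5, 6)), 3)  # teac, cons, rese
res3 <- round(extract_res(act, c(1, 4)), 3)     # teac, rese, residual
```

Rounded to three decimals, the two rows of sub3 and res3 agree with the first two rows of the printed $mat tables.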
Fig. 2. Three-part (teaching, research and residual) subcompositions with (a) border percentile lines; and (b) centred for better visualization of the differences between cases. To avoid misunderstanding of this centred visualization, border percentile lines with exact max variation values are obligatory.

Fig. 3. Three-part (red sandstone, grey sandstone and crystalline) subcompositions, with (a) geometric mean; and (b) centred for better visualization of the differences between cases - border percentile lines showing actual variation.

The percentile lines routine

The routine that draws percentile lines into a drawn ternary diagram is percentile.lines(y, direction, cls, dist, lt) with parameters

y           the vector of percents or decimal values of percentile lines;
direction   directions for percentile lines with values 1 - percentile lines to the vertex No. 1 = top, 2 - percentile lines to the vertex No. 2 = right, and 3 - percentile lines to the vertex No. 3 = left. The default value is direction=1:3, i.e. all directions;
cls         the vector with colours, each component corresponds to one vertex. The default value is cls=c("yellow", "yellow2", "yellow3"), visible on the screen, but for printing or presentation stronger colours are advised;
dist        moves the numbers marking the percentile lines for additional space given by the components of the vector dist, to prevent overlaying. The default value is dist=c(0.05, 0.05, 0.05);
lt          is the vector with line types, the lty parameter in R graphics routines (values 1, 2, ..., 10), each component corresponds to one vertex. The default value is lt=c(4,3,2).

Example

A normalized mix object m with nine cases and three variables, i.e. a 9 x 3 matrix, is constructed, having 0.1 to 0.9 values in the first column and ratios of one half between the second and third. A ternary diagram is drawn with these nine points in different colours - cls, different shapes - pch and the size cex=1 (see Fig. 4a).

$tit
[1] "Deciles values in the first column"

$sum
[1] 1

$sta
[1] 3

$mat
   aa         bb         cc
1 0.1 0.30000000 0.60000000
2 0.2 0.26666670 0.53333330
3 0.3 0.23333330 0.46666670
4 0.4 0.20000000 0.40000000
5 0.5 0.16666670 0.33333330
6 0.6 0.13333330 0.26666670
7 0.7 0.10000000 0.20000000
8 0.8 0.06666667 0.13333330
9 0.9 0.03333333 0.06666667

attr(,"class")
[1] "mixture"

> cls <- ...
> mix.Ternary(m, col=cls, pch=0:8, cex=1)
> perc.lines(10*1:9, dir=1, cls="cyan", lt=1)

To draw the ternary diagram in Figure 4b, the mix.Random(nr, nc, s=1) routine is used, which constructs a random mix object with nr rows and nc columns with a constant row sum s.

> mix.Ternary(mix.Random(22,3))
> perc.lines(10*1:9, cls=c("blue", "blueviolet", "violet"))

Fig. 4. Three-part compositions: (a) deciles values in the first column, constant ratios half between the second and the third column plotted in the ternary diagram with deciles lines in the first direction; (b) ternary diagram with the random 22 points and deciles lines in all three directions.

The ratio lines routine

The command that draws lines of constant ratios of two components into a drawn ternary diagram is ratio.lines(y, direction, cls, dist). The routine parameters are

y           the vector of ratios for ratio lines;
direction   directions for ratio lines with value 1 - ratio lines to the vertex No. 1, i.e. x2/x3 = y, 2 - ratio lines to the vertex No. 2, i.e. x1/x3 = y, and 3 - ratio lines to the vertex No. 3, i.e. x1/x2 = y. The default value is direction=1:3 that stands for all the directions;
cls         the vector with colours, each component corresponds to one vertex. The default value is cls=c("green", "green3", "yellowgreen"), visible on the screen, but for printing or presentation stronger colours are advised;
dist        moves the numbers marking ratio lines for additional space given by the components of the vector dist, to prevent overlaying. The default value is dist=c(0.05, 0.05, 0.05).

Example

A matrix with nine cases and three variables is constructed, the first triple of cases having a constant ratio of one half between the second and third variables, the second triple between the first and third ... and each triplet is coloured (see Fig. 5a). In this ternary diagram we draw the 1/7, 1/3, 1/2, 1, 2, 3 and 4 ratio lines to all three sides (see Fig. 5b).

> m <- ...
> t <- ...
> co <- ...
> mix.Ternary(m, col=co[t]) # draws ternary diagram
> ratio.lines(c(1/3,1/2,1,2,3,4,1/7), cls=co) # draws ratio lines
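The nine-case matrix used in the percentile lines example, with deciles in the first column and a constant ratio of one half between the second and third columns, is simple to build directly. The following base R sketch is one possible construction under that description; the column names aa, bb and cc are taken from the printed output, everything else is an assumption and no MixeR routine is involved.

```r
aa <- (1:9) / 10          # deciles 0.1, 0.2, ..., 0.9 in the first column
bb <- (1 - aa) / 3        # second part: one third of what remains
cc <- 2 * (1 - aa) / 3    # third part: twice the second, so bb/cc = 1/2

dec <- cbind(aa, bb, cc)  # 9 x 3 matrix with unit row sums

# The constant ratio that a ratio line to vertex No. 1 (x2/x3 = y) would mark.
ratio <- dec[, "bb"] / dec[, "cc"]
```

All nine points lie on the x2/x3 = 1/2 ratio line and, at the same time, one on each of the decile lines drawn towards the first vertex.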
Visualization with the tetrahedron routine

The mix.Q2kin routine transforms a four-part mixture m into three-dimensional XYZ coordinates using quadrays transformations and saves them as a file.kin. This transformation applies Quadrays and XYZ by K. Urner and Quadray formulas by T. Ace, available on the web. The kin file is displayed as 3D animation using the KiNG or MAGE viewer - free software available at http://kinemage.biochem.duke.edu. The mix.Q2kin(kinfile, m, clu=NULL, vec=NULL, king=TRUE, scale=0.2, col=1) routine's parameters are

kinfile   the name of a file.kin,
m         the mix object with four variables,
clu       partition determining the colours of points,
vec       vector of values determining points sizes,
king      FALSE for Mage, TRUE for King,
scale     relative size of points, and
col       colour of points if clu=NULL.
Example

From the activity data mix object m, a four-part composition will be constructed with variables teaching, consulting and research, administration work, and leisure comprising other wakeful activities and sleep. The Aitchison distance will be computed and the complete linkage classification method performed. Out of the dendrogram four clusters will be detected and drawn with the mix.Q2kin command in an ac4.kin file to be displayed with the KiNG viewer in a tetrahedron using different colours (Figs 6 and 7).

> m$mat <- cbind(m$mat[,1], m$mat[,2]+m$mat[,4], m$mat[,3], m$mat[,5]+m$mat[,6])
> dimnames(m$mat)[[2]] <- ...
> m
$tit
[1] "Researcher's daily activities"

$sum
[1] NA

$sta
[1] 1

$mat
   teach cons&rese  admi  leas
1  0.162     0.164 0.138 0.398
2  0.200     0.115 0.073 0.490
...
19 0.167     0.195 0.127 0.410
20 0.166     0.209 0.101 0.386
Fig. 6. Activity dataset classification represented by a dendrogram (panel title: Cluster Dendrogram; method label: hclust (*, "complete")).
Fig. 7. Two snapshots of 3D KiNG view of tetrahedral display of the activity data - four-part compositions.
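The classification step of the activity example (amalgamation into four parts, Aitchison distance, complete linkage) can be sketched with base R alone. The block below is an illustration under stated assumptions and not the MixeR code: it uses only the first six days of Table 1, the column names of the amalgamated matrix and the helper clr_rows are hypothetical, and the tree is cut into two clusters instead of the four used for the full dataset. It relies on the fact that the Aitchison distance equals the Euclidean distance between clr-transformed compositions.

```r
# First six days of Table 1 (teac, cons, admi, rese, wake, slee).
act <- rbind(c(0.162, 0.041, 0.138, 0.123, 0.254, 0.282),
             c(0.200, 0.039, 0.073, 0.076, 0.346, 0.266),
             c(0.201, 0.082, 0.115, 0.146, 0.194, 0.261),
             c(0.134, 0.077, 0.107, 0.146, 0.214, 0.321),
             c(0.224, 0.080, 0.091, 0.162, 0.195, 0.248),
             c(0.144, 0.063, 0.103, 0.123, 0.316, 0.252))

# Amalgamate into four parts: teaching, consulting + research,
# administration, leisure (other wakeful activities + sleep).
four <- cbind(teach = act[, 1], cons_rese = act[, 2] + act[, 4],
              admi = act[, 3], leis = act[, 5] + act[, 6])

# Row-wise centred log-ratio transform.
clr_rows <- function(mat) {
  l <- log(mat)
  l - rowMeans(l)
}

d  <- dist(clr_rows(four))             # Aitchison distances between the days
hc <- hclust(d, method = "complete")   # complete linkage classification
clusters <- cutree(hc, k = 2)          # two clusters for this small subset
```

Calling plot(hc) would draw a dendrogram of the kind shown in Figure 6, and the vector clusters is the sort of partition that a colouring argument such as clu expects.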
Fig. 4. Compositional evolution of two compositional processes of an exponential decay of mass (plain line and plus marker). Rates of decay are equal but initial compositions differ (see text). Plot of cumulated parts against time.

where the $\lambda_i$ are assumed positive in this particular case. Frequently, the relevant information is the relative abundance of each isotope. To study the isotope composition in the sample, a compositional vector can be defined as $x(t) = \mathcal{C}[x_1(t), x_2(t), \ldots, x_D(t)]$. The evolution in time (forward and backward) of $x(t)$ is identified readily as

$$x(t) = x(0) \oplus (t \odot \exp(-\lambda)), \qquad \lambda = [\lambda_1, \lambda_2, \ldots, \lambda_D], \tag{16}$$

which is a compositional line with direction $v = \exp(-\lambda)$ and starting point $x(0)$. Note that closure of $v$ and $x(0)$ is irrelevant. A conclusion is that all exponential mass-growth or mass-decay processes follow compositional lines when considered from the compositional point of view (Egozcue et al. 2003; Egozcue & Pawlowsky-Glahn 2005). Figures 4 and 5 show two compositional lines corresponding to an exponential mass decay with $D = 3$, $\lambda = [1, 1/2, 1/4]$. The initial or starting points are $x(0) = \mathcal{C}[1, 1, 1] = n$ for the first one (plain line), and $x(0) = \mathcal{C}[5, 5, 1]$ for the second one (plus marker). Figure 4 shows the evolution in $t$ of the cumulated parts; Figure 5 represents the lines in the ternary diagram. Since the disintegration rate of isotope 3 ($\lambda_3$) is the smallest one, both processes tend to the vertex associated with the pure isotope 3, despite initial conditions.
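The two decay processes can be reproduced numerically. The following is a minimal sketch, assuming only that perturbation and powering reduce here to multiplying each part by exp(-rate * t) and closing the result; the function names closure and decay_line are introduced for illustration and do not come from the text.

```python
import math

def closure(parts):
    """Rescale positive parts so that they sum to 1."""
    total = sum(parts)
    return [p / total for p in parts]

def decay_line(x0, rates, t):
    """Composition at time t on the compositional line through x0:
    each part x0_i is multiplied by exp(-rates_i * t), then the vector is closed."""
    return closure([x * math.exp(-lam * t) for x, lam in zip(x0, rates)])

rates = [1.0, 0.5, 0.25]                      # decay rates used in Figures 4 and 5
starts = {"plain line": [1, 1, 1], "plus marker": [5, 5, 1]}

for label, x0 in starts.items():
    for t in (0, 5, 20, 100):
        comp = decay_line(x0, rates, t)
        print(label, t, [round(c, 4) for c in comp])
```

Because the starting vector is closed inside decay_line, passing [1, 1, 1] or its closed version gives the same line, which is the sense in which closure of the starting point is irrelevant.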
Fig. 5. Same processes of exponential decay of mass as in Figure 4, represented as compositional lines in a ternary diagram.

Distance and other metric concepts

Aitchison (1983, 1986) introduced a simplicial distance suitable for the analysis of compositional data. If $x = [x_1, x_2, \ldots, x_D]$ and $y = [y_1, y_2, \ldots, y_D]$ are compositions in $S^D$, the squared Aitchison distance between them is

$$d_a^2(x, y) = \frac{1}{D} \sum_{i=1}^{D-1} \sum_{j=i+1}^{D} \left( \ln\frac{x_i}{x_j} - \ln\frac{y_i}{y_j} \right)^2 = \sum_{i=1}^{D} \left( \ln\frac{x_i}{g(x)} - \ln\frac{y_i}{g(y)} \right)^2, \tag{17}$$
SIMPLICIAL GEOMETRY

where $g(x) = [x_1 \cdot x_2 \cdots x_D]^{1/D}$ denotes the geometric mean of the parts of the compositional vector in the argument. The main properties of such a distance are:

1. it does not depend on the closure constant;
2. it is invariant under perturbation: if $p \in S^D$,
   $$d_a(x \oplus p, y \oplus p) = d_a(x, y); \tag{18}$$
3. it is scaled by powering: if $c$ is a real number, then
   $$d_a(c \odot x, c \odot y) = |c| \cdot d_a(x, y); \tag{19}$$
4. it is invariant under permutation of the parts;
5. it guarantees the so-called subcompositional dominance: let R be a set of r subscripts, 1 ...

... 0.03. Their probability plots are reported in Figures 1 and 2. If (1/√2) ln(Ca²⁺/HCO₃⁻) and (1/√2) ln(Na⁺/K⁺) are well represented by the normal model, then the ratios (Ca²⁺/HCO₃⁻)^(1/√2) and (Na⁺/K⁺)^(1/√2) follow the log-normal one. The
exponent affects only the scale of the ratios and is important to preserve consistency with the geometric properties of the full composition. Nevertheless, for interpretation, the scale can be changed, i.e. the exponent can be omitted, and the ratios Ca²⁺/HCO₃⁻ and Na⁺/K⁺ will also follow a lognormal distribution.

A. BUCCIANTI ET AL.

Fig. 1. Probability plot of (1/√2) ln(Ca²⁺/HCO₃⁻). Continuous line represents a perfect normal distribution.

From a geochemical point of view, these ratios appear to be the product of many independent random processes, so that the quantity present in
each state can be expressed as a random proportion of the quantity present in the immediately prior state. The chemical species involved in the ratios would have experienced considerable physical movement and agitation, as well as dilution in successive, independent stages. The resulting lognormal distributions would thus represent dominant and general phenomena affecting the investigated waters, where weathering of silicates and carbonates is important. The input of carbon dioxide from the deep uprising gaseous flow is, in fact, able to give an aggressive character to the water. It produces a weak rock weathering with consequent presence in water of Na⁺, K⁺, Ca²⁺ and HCO₃⁻. Physical movement and agitation, as well as dilution in successive, independent stages, can lead to the log-normal distribution.

FREQUENCY DISTRIBUTIONS IN GEOCHEMISTRY

Fig. 2. Probability plot of (1/√2) ln(Na⁺/K⁺). Continuous line represents a perfect normal distribution.

The same simple modelling path cannot be accepted for the log-ratios (1/√2) ln(Na⁺/Cl⁻), (1/√2) ln(Ca²⁺/Mg²⁺), (1/√2) ln(Mg²⁺/SO₄²⁻) and (1/√2) ln(Ca²⁺/SO₄²⁻) (all p-values of the Kolmogorov-Smirnov test are less than 0.01). Their probability plots, with the reference line of the Gaussian model, are reported in Figures 3, 4, 5 and 6. As the log-ratios display a moderate skewness, the performance of the skew-normal model or, equivalently, the logskew-normal model is explored for the corresponding ratios. In Figures 7, 8, 9 and 10 the histograms and the estimated skew-normal curves obtained using
the maximum likelihood estimation procedure (Azzalini 1985) are reported. As can be seen, the skew-normal model appears to capture the skewness affecting the log-ratios; the log-likelihood function (a value similar to the sum of squared error in regression analysis) shows, in fact, a better value if compared with the normal model, for each log-ratio. Furthermore, the value of the likelihood ratio test statistic to compare both models (i.e. the null hypothesis that the shape parameter λ is zero) always leads to a p-value < 0.01. Thus, the conclusion is that the skew-normal model is significantly better than the normal one in all cases. As can be seen in Table 1, the estimated shape parameter λ̂ explains a skewness ranging from γ̂ = -0.48 to 0.58. Consequently, omitting the exponent for interpretational purposes, low values can be found with higher frequency for the ratios Na⁺/Cl⁻ and Ca²⁺/Mg²⁺, and high values for Mg²⁺/SO₄²⁻ and Ca²⁺/SO₄²⁻, when a comparison with the log-normal model is performed.

Fig. 3. Probability plot of (1/√2) ln(Na⁺/Cl⁻). Continuous line represents a perfect normal distribution.

However, the application of some goodness-of-fit tests for the skew-normal model (Kolmogorov-Smirnov, Kuiper, Anderson-Darling, Cramer-von Mises and Watson (Mateu-Figueras 2003)) indicates that only for the log-ratio (1/√2) ln(Na⁺/Cl⁻) can the skew-normal model be considered statistically acceptable, taking a significance level of 0.01 (p-value > 0.01). In the other cases, the bimodality
affecting the data indicates a more complex structure that cannot be explained considering only a moderate skewness.

Fig. 4. Probability plot of (1/√2) ln(Ca²⁺/Mg²⁺). Continuous line represents a perfect normal distribution.

The adoption of the skew-normal model for (1/√2) ln(Na⁺/Cl⁻) implies that Na⁺/Cl⁻
(omitting again the exponent) follows the logskew-normal one, so that a sort of mechanism able to generate a further skewness (compared with the log-normal) is present.

Fig. 5. Probability plot of (1/√2) ln(Mg²⁺/SO₄²⁻). Continuous line represents a perfect normal distribution.

This result indicates that the samples have not experienced dilution as
the dominant process, but that the influence of marine water (Na⁺/Cl⁻ ≈ 0.55) is not able to generate clearly different groups of observations.

Fig. 6. Probability plot of (1/√2) ln(Ca²⁺/SO₄²⁻). Continuous line represents a perfect normal distribution.

In the case of the log-ratios (1/√2) ln(Ca²⁺/Mg²⁺), (1/√2) ln(Mg²⁺/SO₄²⁻) and (1/√2) ln(Ca²⁺/SO₄²⁻), the solution of minerals typical of weathered hydromagmatic deposits not homogeneously present in the area (i.e. sulphates) might be the mechanism able to
increase the ratio of Ca²⁺/SO₄²⁻ in part of the waters. Here, to take into account moderate skewness is insufficient to describe reality and the presence of groups of data appears to be the dominant feature.

Fig. 7. Histogram of (1/√2) ln(Na⁺/Cl⁻) values and fitted skew-normal densities.

Fig. 8. Histogram of (1/√2) ln(Ca²⁺/Mg²⁺) values and fitted skew-normal densities.

Summarizing, univariate frequency distributions in water geochemistry of Vulcano island can be
characterized by three different models: log-normal, logskew-normal and multimodal. The goodness of fit of the skew-normal (logskew-normal) model compared with the normal (log-normal) one depends on the persistence and continuity of natural processes affecting a part of the population, so that a moderate skewness is generated. However, when the geochemical phenomena recurrently affect only a part of the population, groups tend to develop and a more complicated analysis is needed. The geochemical-statistical approach presented can be used to verify how far successive independent stages of dilution are able to describe the behaviour of the data, and how other processes are able to mark their presence and persistence in time and/or space until groups of cases are generated.

Fig. 9. Histogram of (1/√2) ln(Mg²⁺/SO₄²⁻) values and fitted skew-normal densities.

Fig. 10. Histogram of (1/√2) ln(Ca²⁺/SO₄²⁻) values and fitted skew-normal densities.
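The dilution argument used in this section, in which each state is a random proportion of the preceding one, can be illustrated with a small simulation. This is a sketch under stated assumptions rather than a model of the Vulcano waters: the number of stages (50), the uniform range of the retained proportion (0.5 to 1) and the function names diluted_quantity and skewness are all hypothetical choices made here. The logarithm of the final quantity is a sum of independent terms, so its distribution should be close to symmetric while the quantity itself is strongly right-skewed.

```python
import math
import random

def diluted_quantity(rng, stages, low=0.5, high=1.0, start=1.0):
    """Quantity left after successive independent stages, each retaining
    a random proportion (uniform between low and high) of the prior state."""
    q = start
    for _ in range(stages):
        q *= rng.uniform(low, high)
    return q

def skewness(values):
    """Moment-based sample skewness index."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

rng = random.Random(264)  # fixed seed so the run is repeatable
quantities = [diluted_quantity(rng, stages=50) for _ in range(5000)]
log_quantities = [math.log(q) for q in quantities]

print("skewness of quantities:    ", round(skewness(quantities), 2))
print("skewness of log-quantities:", round(skewness(log_quantities), 2))
```

A marked skewness remaining in the log-quantities would, in the spirit of the discussion above, point to processes other than pure successive dilution.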
Table 1. Value of the estimated shape parameter, λ̂, and the corresponding estimated skewness index, γ̂, of each log-ratio

Log-ratio                   λ̂        γ̂
(1/√2) ln(Na⁺/Cl⁻)          2.97     0.55
(1/√2) ln(Ca²⁺/Mg²⁺)        2.52     0.58
(1/√2) ln(Mg²⁺/SO₄²⁻)      -2.11    -0.48
(1/√2) ln(Ca²⁺/SO₄²⁻)      -1.74    -0.38

Log-ratios and their multivariate statistical modelling in water chemistry: results and discussion

In the previous part, log-ratios were analysed using a geochemical-statistical procedure to investigate their behaviour. The terms of the log-ratios were chosen following geochemical criteria. However, water chemistry can be considered also as a whole, and to do so the shape of the frequency distribution of the composition X = (Na⁺, K⁺, Ca²⁺, Mg²⁺, HCO₃⁻, SO₄²⁻, Cl⁻) should be considered. In this case, one can verify if X follows a logistic normal distribution, or a logistic skew-normal one. The importance of processes able to introduce skewness in single log-ratios is considered when all the members of the composition are analysed together. The adequacy of the models was investigated applying the isometric log-ratio transformation, ilr(X) = (ilr₁, ilr₂, ilr₃, ilr₄, ilr₅, ilr₆), given by Egozcue et al. (2003)
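$$
\begin{aligned}
\mathrm{ilr}_1 &= \frac{1}{\sqrt{2}} \ln\left( \frac{\mathrm{Na^+}}{\mathrm{K^+}} \right), &
\mathrm{ilr}_2 &= \frac{1}{\sqrt{6}} \ln\left( \frac{\mathrm{Na^+\,K^+}}{(\mathrm{SO_4^{2-}})^2} \right), \\
\mathrm{ilr}_3 &= \frac{1}{\sqrt{12}} \ln\left( \frac{\mathrm{Na^+\,K^+\,SO_4^{2-}}}{(\mathrm{Cl^-})^3} \right), &
\mathrm{ilr}_4 &= \frac{1}{\sqrt{20}} \ln\left( \frac{\mathrm{Na^+\,K^+\,SO_4^{2-}\,Cl^-}}{(\mathrm{HCO_3^-})^4} \right), \\
\mathrm{ilr}_5 &= \frac{1}{\sqrt{30}} \ln\left( \frac{\mathrm{Na^+\,K^+\,SO_4^{2-}\,Cl^-\,HCO_3^-}}{(\mathrm{Ca^{2+}})^5} \right), &
\mathrm{ilr}_6 &= \frac{1}{\sqrt{42}} \ln\left( \frac{\mathrm{Na^+\,K^+\,SO_4^{2-}\,Cl^-\,HCO_3^-\,Ca^{2+}}}{(\mathrm{Mg^{2+}})^6} \right).
\end{aligned}
\tag{1}
$$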
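The sequential structure of these coordinates lends itself to a short computation. The sketch below assumes the parts are ordered as Na⁺, K⁺, SO₄²⁻, Cl⁻, HCO₃⁻, Ca²⁺, Mg²⁺ and that the i-th coordinate is the log-ratio of the product of the first i parts to the i-th power of part i+1, scaled by 1/√(i(i+1)); the two water samples are hypothetical values used only to check that Euclidean distances between ilr vectors reproduce the Aitchison distance.

```python
import math

# Assumed order of the parts for the coordinates in equation (1)
PARTS = ["Na+", "K+", "SO4^2-", "Cl-", "HCO3-", "Ca2+", "Mg2+"]

def ilr(x):
    """Sequential ilr coordinates:
    ilr_i = ln(x_1 * ... * x_i / x_{i+1}^i) / sqrt(i * (i + 1)), i = 1..D-1."""
    logs = [math.log(v) for v in x]
    coords = []
    for i in range(1, len(x)):
        numerator = sum(logs[:i])             # ln of the product of the first i parts
        coords.append((numerator - i * logs[i]) / math.sqrt(i * (i + 1)))
    return coords

def aitchison_distance(x, y):
    """Aitchison distance computed from centred differences of log parts."""
    diffs = [math.log(a) - math.log(b) for a, b in zip(x, y)]
    mean = sum(diffs) / len(diffs)
    return math.sqrt(sum((d - mean) ** 2 for d in diffs))

def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical concentrations (arbitrary common units), in the order of PARTS
sample_a = [120.0, 15.0, 300.0, 180.0, 400.0, 90.0, 30.0]
sample_b = [60.0, 25.0, 150.0, 70.0, 650.0, 140.0, 45.0]

print([round(c, 3) for c in ilr(sample_a)])
print(round(euclidean_distance(ilr(sample_a), ilr(sample_b)), 6),
      round(aitchison_distance(sample_a, sample_b), 6))
```

Each coordinate is a log-ratio, so rescaling a sample (for instance changing units or closing to 100%) leaves its ilr vector unchanged.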
Then, a multivariate normal model and a multivariate skew-normal model are fitted to the transformed samples. The maximum likelihood method is used to obtain estimates of the parameters. The two fitted models are compared by applying the likelihood ratio test (i.e. the null hypothesis that all the components of the shape parameter λ are 0). As the p-value is ...

Fig. 12. Box-plots of the investigated log-ratios.

... observation of the vector Y (i = 1, ..., 977), and μ̂ and Σ̂ are the estimates of the parameters μ and Σ of a skew-normal model (Azzalini & Capitanio 1999). If the log-ratio vector Y has a multivariate skew-normal distribution, then the Mahalanobis distances are sampled from a χ² distribution. However, application of graphical tests (P-P plot and Q-Q plot of the Mahalanobis distance versus χ² values shown in Fig. 13) and numerical tests (based on the Anderson-Darling, Cramer-von Mises and Kolmogorov-Smirnov statistics) to validate the χ² distribution of the values d_i allows one to conclude that the multivariate skew-normal model is not yet able to describe statistically the composition in an exhaustive way, and a more complex investigation about the presence of groups and anomalous values is needed. The water composition of some wells at Vulcano island thus appears to be affected by persistent phenomena (i.e. influence of uprising gas and/or presence of secondary minerals) able to generate isolated groups of observations with homogeneous behaviour.
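The logic of the graphical check can be sketched on a much simpler case. The example below is an assumption-laden stand-in, not the analysis of the paper: it uses a simulated bivariate normal sample instead of the six ilr coordinates and a skew-normal fit, so that the reference χ² distribution has 2 degrees of freedom and a closed-form distribution function. The function names are introduced here for illustration. Squared Mahalanobis distances are computed from the sample mean and covariance, and their empirical cumulative probabilities are paired with the χ² probabilities, as in a P-P plot.

```python
import math
import random

def mahalanobis_sq_2d(points):
    """Squared Mahalanobis distances of 2-D points from their sample mean,
    using the sample covariance matrix (inverted in closed form)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    out = []
    for x, y in points:
        dx, dy = x - mx, y - my
        out.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return out

def chi2_cdf_2dof(d):
    """Distribution function of a chi-square variable with 2 degrees of freedom."""
    return 1.0 - math.exp(-d / 2.0)

def pp_points(distances):
    """Pairs (observed cumulative probability, expected chi-square probability)."""
    ordered = sorted(distances)
    n = len(ordered)
    return [((i + 0.5) / n, chi2_cdf_2dof(d)) for i, d in enumerate(ordered)]

rng = random.Random(13)  # fixed seed so the run is repeatable
sample = []
for _ in range(977):      # same number of observations as in the text
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    sample.append((z1, 0.6 * z1 + 0.8 * z2))   # correlated hypothetical coordinates

distances = mahalanobis_sq_2d(sample)
pairs = pp_points(distances)
max_gap = max(abs(obs - exp) for obs, exp in pairs)
print(f"largest P-P deviation: {max_gap:.3f}")
```

Points lying close to the diagonal (a small largest deviation) support the assumed model; systematic departures, such as those reported for the Vulcano data, suggest groups or anomalous values.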
Conclusions
In any aquifer, chemical relationships between the different species are affected by the development of acid-base and redox reactions, solution-precipitation processes and adsorption phenomena. Which process dominates at any time depends on the mineralogy of the aquifer, the hydrogeological environment and the history of the groundwater
movement (i.e. residence time). In the investigated volcanic environment the mobilization of chemical species as the result of fumarolic activity and of the alteration of volcanic products directly affects the unconfined aquifer feeding the wells of Vulcano. Several investigations in this area indicate that the weak acidity of the circulating solutions, able to leach chemical species, is attributable to the volcanic CO2 provided at a discontinuous rate to the shallow-water body. Consequently, the mobilization depends on the CO2 input, on the rate of neutralization by rock weathering and on the intensity of rainfall acting as a diluting factor. In this general framework, if solutions circulate in volcanic products involved in syn-depositional changes, significant quantities of calcium sulphate, sodium chloride, fluorine and other trace elements can be provided to the groundwater and no natural mechanism can remove them to a significant extent. The only limiting factor is the saturation of the solution with respect to the mineral. Further occasional contributions have to be ascribed to marine-like solutions that inflow into the overlying water bodies. In this situation, investigation of the shape of the frequency distribution of single log-ratios for the Vulcano waters (properly chosen from a geochemical point of view) has been used to verify: (1) whether one general process, dominated by dilution, is able to describe the behaviour of the data; or (2) whether further processes are overlapping dilution, so that a moderate negative or positive skewness is present; or (3) whether these processes are important enough to generate a
multimodal complex distribution. The analysis can be extended to the multivariate case, choosing an appropriate transformation. It has indicated that the multivariate distribution is described better by the skew-normal model compared with the normal one. Thus, the processes affecting the composition would tend to follow a log-skew-normal law and dilution is not the only present and dominant phenomenon. However, the multivariate skew-normal model is not yet completely adequate to describe the simultaneous relationships among the variables. The presence of bimodality in some marginals, as well as of anomalous values, may be the underlying reason. Summarizing the results, in water chemistry univariate frequency distributions of log-ratios can show three fundamental features: log-normal, logskew-normal, and overlapping processes leading to bimodality. In this context, the skew-normal distribution family appears to have an important intermediate role in a better description of natural phenomena where dilution is not the only phenomenon, but the persistence and continuity of other processes has not yet clearly generated different groups of observations. When marginal distributions of all the three previous types are considered together in a multivariate framework, the reciprocal relationships are complex. In such a case, the multivariate skew-normal model may be better than the normal one, but it is not yet adequate to describe the whole composition. Other statistical procedures are required to identify samples pertaining to different potential groups.

Fig. 13. P-P plot and Q-Q plot for multivariate skew-normal distribution (P-P plot axis: observed cumulative probability; Q-Q plot axis: Mahalanobis distances).
This research has been financially supported by Italian MIUR (Ministero dell'Istruzione, dell'Università e della Ricerca Scientifica e Tecnologica), PRIN 2004, through the GEOBASI project (prot. 2004048813002) and by the Dirección General de Enseñanza Superior (DGES) of the Spanish Ministry for Education and Culture through the project BFM2003-05640.
References

AHRENS, L. H. 1953. A fundamental law of geochemistry. Nature, 172, 1148.
AHRENS, L. H. 1954a. The lognormal distribution of the elements (a fundamental law of geochemistry and its subsidiary). Geochimica et Cosmochimica Acta, 6, 49-74.
AHRENS, L. H. 1954b. The lognormal distribution of the elements, ii. Geochimica et Cosmochimica Acta, 6, 121-132.
AHRENS, L. H. 1957. Lognormal type distribution, iii. Geochimica et Cosmochimica Acta, 11, 205-213.
AITCHISON, J. 1982. The statistical analysis of compositional data (with discussion). Journal of the Royal Statistical Society, Series B, 44 (2), 139-177.
AITCHISON, J. 1986. The Statistical Analysis of Compositional Data. Chapman & Hall, London.
ALLÈGRE, C. J. & LEWIN, E. 1995. Scaling laws and geochemical distributions. Earth and Planetary Science Letters, 132, 1-13.
AUBREY, K. V. 1954. Frequency distribution of the concentrations of the elements in rocks. Nature, 174, 141-142.
AUBREY, K. V. 1956. Frequency distributions of elements in igneous rocks. Geochimica et Cosmochimica Acta, 9, 83-90.
AZZALINI, A. 1985. A class of distribution which includes the normal ones. Scandinavian Journal of Statistics, 12, 171-178.
AZZALINI, A. & CAPITANIO, A. 1999. Statistical applications of the multivariate skew-normal distribution. Journal of the Royal Statistical Society, Series B, 61 (3), 579-602.
AZZALINI, A. & DALLA VALLE, A. 1996. The multivariate skew-normal distribution. Biometrika, 83 (4), 715-726.
BUCCIANTI, A. & PAWLOWSKY-GLAHN, V. 2005. New perspectives on water chemistry and compositional data analysis. Mathematical Geology, 37 (7), 703-727.
CAPASSO, G., FAVARA, R., FRANCOFONTE, S. & INGUAGGIATO, S. 1999. Chemical and isotopic variations in fumarolic discharge and thermal waters at Vulcano island (Aeolian islands, Italy) during 1996: evidence of resumed volcanic activity. Journal of Volcanology and Geothermal Research, 88, 167-175.
CAPASSO, G., D'ALESSANDRO, W., FAVARA, R., INGUAGGIATO, S. & PARELLO, F. 2001. Interaction between the deep fluids and the shallow groundwaters on Vulcano island (Italy). Journal of Volcanology and Geothermal Research, 108, 187-198.
CHAYES, F. 1954. The lognormal distribution of elements: a discussion. Geochimica et Cosmochimica Acta, 6, 119-121.
DAVIES, S. N. & DEWIEST, R. C. M. 1996. Hydrogeology. Wiley and Sons, New York.
DI LIBERTO, V., NUCCIO, P. M. & PAONITA, A. 2002. Genesis of chlorine and sulphur in fumarolic emissions at Vulcano island (Italy): assessment of pH and redox conditions in the hydrothermal system. Journal of Volcanology and Geothermal Research, 116, 137-150.
EGOZCUE, J. J. & PAWLOWSKY-GLAHN, V. 2005. Groups of parts and their balances in compositional data analysis. Mathematical Geology, 37 (7), 795-828.
EGOZCUE, J. J., PAWLOWSKY-GLAHN, V., MATEU-FIGUERAS, G. & BARCELÓ-VIDAL, C. 2003. Isometric logratio transformations for compositional data analysis. Mathematical Geology, 35 (3), 279-300.
KAPTEYN, J. C. 1903. Skew Frequency Curves in Biology and Statistics. Astronomical Laboratory, Groningen, Noordhoff.
MARTINI, M. 1980. Geochemical survey on the phreatic waters of Vulcano (Aeolian Islands, Italy). Bulletin of Volcanology, 43 (1), 265-274.
MARTINI, M. 1989. The forecasting significance of chemical indicators in areas of quiescent volcanism: examples from Vulcano and Phlegrean Fields (Italy). In: LATTER, J. H. (ed.) Volcanic Hazard, IAVCEI Proceedings in Volcanology 1. Springer-Verlag, Berlin Heidelberg, Germany, 372-383.
MARTINI, M. 1996. Chemical characters of the gaseous phase in different stages of volcanism: precursors and volcanic activity. In: SCARPA, R. & TILLING, R. I. (eds) Monitoring and Mitigation of Volcanic Hazard. Springer-Verlag, Berlin Heidelberg, Germany, 200-219.
MATEU-FIGUERAS, G. 2003. Models de distribució sobre el símplex. PhD thesis, Universitat Politècnica de Catalunya, Barcelona, Spain.
MATEU-FIGUERAS, G., PAWLOWSKY-GLAHN, V. & BARCELÓ-VIDAL, C. 2005. The additive logistic skew-normal distribution on the simplex. Stochastic Environmental Research and Risk Assessment (SERRA), 19 (3), 205-214.
MCGRATH, S. P. & LOVELAND, P. J. 1992. The Soil Geochemical Atlas of England and Wales. Blackie Academic, London.
MILLER, R. L. & GOLDBERG, E. D. 1955. The normal distribution in geochemistry. Geochimica et Cosmochimica Acta, 8, 53-62.
MONTALTO, A. 1996. Signs of potential renewal of eruptive activity at La Fossa (Vulcano, Aeolian Islands). Bulletin of Volcanology, 57, 483-492.
REIMANN, C. & FILZMOSER, P. 1999. Normal and lognormal data distribution in geochemistry: death of a myth. Consequences for the statistical treatment of geochemical and environmental data. Environmental Geology, 39 (9), 1001-1014.
VISTELIUS, A. B. 1960. The skew frequency distributions and the fundamental law of the geochemical processes. Journal of Geology, 68, 1-22.
WAYNE, R. O. 1990. A physical explanation of the lognormality of pollutant concentrations. Journal of Air & Waste Management Association, 40 (10), 1378-1383.
Rounded zeros: some practical aspects for compositional data

J. A. MARTÍN-FERNÁNDEZ & S. THIÓ-HENESTROSA

Departament Informàtica i Matemàtica Aplicada, Universitat de Girona, Campus Montilivi, Edifici P-IV, E-17071, Girona, Spain (e-mail: josepantoni.martin@udg.es)

Abstract: It is very important to realize that the well-known 'zeros problem' in compositional data is inherent in the nature of the data rather than in the log-ratio methodology. In a strict sense, any null value is informative itself and needs specific treatment before a multivariate method is applied. In the rounded zero case specific techniques of missing data should be applied as a previous step, taking into account that any of these techniques must respect the compositional nature of the data. In practice, when an imputation method is applied, it is necessary to make a sensitivity analysis of the results from multivariate analysis. These methodological aspects are applied and illustrated using compositional data from geological samples.
Martín-Fernández et al. (2004a) dealt with the zeros in the database of Cenozoic volcanic rocks of Hungary (Ó.Kovács & Kovács 2001). In that study the authors are interested in log-ratio analysis (Aitchison 1986) of subcompositional patterns in order to contribute to the understanding of petrogenetic processes (Martín-Fernández et al. 2004b) that occurred in the Carpatho-Pannonian region. The dataset consists of 959 unaltered rock samples and nine major oxides from that database: [SiO₂; TiO₂; Al₂O₃; Fe₂O₃total; MgO; CaO; Na₂O; K₂O; P₂O₅]. Since some of these observations have null values the authors conclude that they have the well-known 'zeros problem'. Nobody disagrees with this assertion. Nevertheless, one should think about the reasons for this 'logical' conclusion. Obviously, the first temptation is to reason as follows: since multivariate statistical methods based on log-ratio methodology are applied, one cannot work with null values and thus zeros are a problem; therefore it is preferable to apply classical multivariate methods based on Euclidean distance, because this methodology does not have the zeros problem. Certainly, it is obvious that if some sample has null values neither ratios nor logarithms can be formed. Nevertheless, that kind of reasoning is clearly very simple and incomplete because it does not take into account the nature of the null value. The first question that one must answer in relation to a null value is about its nature: one must decide if the zero value is a true value or not. If it is considered as a true value then it is informative by itself. Therefore, this null value means the absolute absence of the part in the observation, i.e. the null value is an essential or structural zero. On the other hand, if the null value indicates the presence of a component, but below the detection limit, then this zero represents a missing small value, i.e. null values are rounded zeros. Since the nature of the two kinds of zeros is different the treatment should be different. Note that this different treatment is a consequence of the nature of the zero rather than of the statistical methodology (Euclidean, log-ratio, ...).

Two kinds of zeros, two different treatments

In the structural zeros case, two initial questions must be answered: (1) is the number of parts too large for the goals of the study?; (2) is the presence in a part of an essential zero an indication that the composition belongs to a different group or population? An affirmative answer to the first question is related to the sampling or measuring step and suggests amalgamating some related parts (Aitchison 1986, p. 36). This amalgamation procedure reduces the dimensionality and probably the number of null values. The answer to the second question is related to the true information of the null value. For example, an affirmative answer could indicate that it is sensible to divide the sample, and then a statistical analysis of any kind would be applied to each sub-sample separately. After both questions have been solved, and after data have been closed, the statistical analysis can be applied. But now the question is: which one? Nowadays, in the context of compositional studies, most scientists apply either Euclidean or log-ratio statistical methods. Obviously, when a scientist chooses the Euclidean option there are no problems with the remaining structural zeros. Nevertheless, note that a scientist who selects this option is assuming, for example, that the difference or similarity between two three-compositions, in percentages, such as (0, 10, 90) and
From: BUCCIANTI, A., MATEU-FIGUERAS, G. & PAWLOWSKY-GLAHN, V. (eds) Compositional Data Analysis in the Geosciences: From Theory to Practice. Geological Society, London, Special Publications, 264, 191-201. 0305-8719/06/$15.00 © The Geological Society of London 2006.
(10, 0, 90) is exactly the same as the difference between (30, 40, 30) and (40, 30, 30):

$$
\begin{aligned}
d_e((0, 10, 90), (10, 0, 90)) &= \sqrt{(0 - 10)^2 + (10 - 0)^2 + (90 - 90)^2} = 10\sqrt{2}, \\
d_e((30, 40, 30), (40, 30, 30)) &= \sqrt{(30 - 40)^2 + (40 - 30)^2 + (30 - 30)^2} = 10\sqrt{2},
\end{aligned}
\tag{1}
$$
where $d_e$ means the Euclidean distance. This assumption seems to be unsuitable when, for example, one is working with geological data such as percentages of {clay, silt, sand} in sediments. This weakness is serious because all the classical multivariate methods are based on Euclidean concepts. Certainly, all these multivariate methods contain in their formulation the classical variance/covariance matrix and, in a strict sense, this matrix is a Euclidean measure of the variability. On the other hand, selection of the log-ratio methodology involves a weakness with the null values. Fortunately, recent works (Aitchison & Kay 2003; Bacon-Shone 2003) offer new strategies for the application of the log-ratio methodology combined with conditional models and subcompositional analysis.

When the null values denote that no quantifiable proportion could be recorded to the accuracy of the measurement process, then this kind of zero is usually understood as 'a trace too small to measure'. Note that in reality these null values are missing values rather than zeros (Martín-Fernández et al. 2003a). Consequently, it seems reasonable to apply specific techniques for a statistical analysis of incomplete multivariate data (Little & Rubin 2002). By analogy to the structural zeros case, it is very important to emphasize that the decision to apply specific techniques for missing values to deal with the rounded zeros is independent of the statistical methodology (Euclidean, log-ratio, ...) that one has selected. In other words, even when a scientist selects the Euclidean option the 'rounded zeros problem' appears because the missing values should be treated. Moreover, as in the structural zeros case, in the Euclidean option the scientist is assuming, for example, that the difference between two three-compositions such as (0.1, 10, 89.9) and (10, 0.1, 89.9) is exactly the same as the difference between (30.1, 40, 29.9) and (40, 30.1, 29.9). Here, for example, the value 0.1 comes from some imputation procedure which has replaced the null value by 0.1. The calculations (1) can be repeated and, in this case, for both pairs, the Euclidean distance is equal to $9.9\sqrt{2}$. Observe that in both cases the third part does not contribute to the Euclidean distance between the compositions. For example, if one is working with geological data as percentages of {clay, silt, sand} in sediments, the information that in one case the sample has 89.9% of sand and in the other case it has 29.9% is missing in the calculations. This unsuitable effect is clearer if the subcompositions in terms of clay and silt are formed. The first two three-compositions have respectively the {clay, silt} subcompositions (0.99, 99.01) and (99.01, 0.99). The other pair transforms respectively into (42.94, 57.06) and (57.06, 42.94). The difference between the first pair seems to be more important than the difference between the second pair. Figure 1 allows the difference between the first pair of compositions to be compared with the difference between the second pair. It seems reasonable to accept that the difference between the first pair must be greater than the difference between the second pair. In this sense, from the log-ratio methodology, the Aitchison distances between the two pairs are

$$
\begin{aligned}
d_a((0.1, 10, 89.9), (10, 0.1, 89.9)) &= d_e(\mathrm{clr}(0.1, 10, 89.9), \mathrm{clr}(10, 0.1, 89.9)) = 6.5127, \\
d_a((30.1, 40, 29.9), (40, 30.1, 29.9)) &= d_e(\mathrm{clr}(30.1, 40, 29.9), \mathrm{clr}(40, 30.1, 29.9)) = 0.4021.
\end{aligned}
\tag{2}
$$
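The figures quoted in this comparison can be checked directly. The sketch below is an illustrative computation only; the function names (euclidean_distance, clr, aitchison_distance, subcomposition) are introduced here and are not part of any package mentioned in the text. It evaluates both distances for the two pairs with imputed value 0.1 and forms the {clay, silt} subcompositions in percentages.

```python
import math

def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def clr(x):
    """Centred log-ratio transform of a composition with positive parts."""
    logs = [math.log(v) for v in x]
    mean = sum(logs) / len(logs)
    return [v - mean for v in logs]

def aitchison_distance(x, y):
    """Aitchison distance as the Euclidean distance between clr vectors."""
    return euclidean_distance(clr(x), clr(y))

def subcomposition(x, indices, total=100.0):
    """Select the parts at the given indices and close them to the given total."""
    selected = [x[i] for i in indices]
    s = sum(selected)
    return [total * v / s for v in selected]

# {clay, silt, sand} percentages from the text (0.1 is the imputed value)
pairs = [
    ((0.1, 10, 89.9), (10, 0.1, 89.9)),
    ((30.1, 40, 29.9), (40, 30.1, 29.9)),
]

for x, y in pairs:
    print(x, y,
          "d_e =", round(euclidean_distance(x, y), 4),
          "d_a =", round(aitchison_distance(x, y), 4),
          "clay-silt:", [round(v, 2) for v in subcomposition(x, (0, 1))])
```

The Euclidean distance is the same for both pairs, whereas the Aitchison distance separates them by more than an order of magnitude, in line with the subcompositional picture.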
Aitchison distance works better than the Euclidean distance because it shows more similarity between the second pair of compositions. For more details about the definition and properties of this distance see Egozcue & Pawlowsky-Glahn (2006). The well-known 'zeros problem' in compositional data is inherent in the nature of the data. In any case - structural or rounded - one null value is informative and calls for a specific treatment before some multivariate method is applied. This paper focuses on the rounded zeros case since it is the most frequent case in geological studies. T r e a t m e n t of rounded zeros The assumption that a rounded zero is a small missing value indicates that some specific technique to deal with these missing values must be applied. After the missing data procedure is applied one can then apply such log-ratio method as desired. Many missing data techniques have been suggested in the literature (Little & Rubin 2002); these can be classified into parametric and non-parametric techniques. Among parametric techniques to treat the inference problem in the presence of missing data in real space several methods have been developed,
ROUNDED ZEROS: SOME PRACTICAL ASPECTS
[Figure 1: bar diagram of the compositions (0.1, 10, 89.9), (10, 0.1, 89.9), (30.1, 40, 29.9) and (40, 30.1, 29.9); axis: Percentages (0 to 100); legend: Clay, Silt, Sand.]
Fig. 1. Four three-compositions of {clay, silt, sand}.
but the EM algorithm and its extensions, and the multiple imputation method, are the most widely used approaches. These techniques rely on fully parametric models for multivariate data, usually the normal distribution, and their formulation involves the variance/covariance matrix. In view of the above arguments, unreasonable results are likely to be obtained when classical methodology is applied to compositional data. Consequently, it seems reasonable to combine the parametric techniques for missing data with the log-ratio methodology. Buccianti & Rosso (1999), following the method proposed in Sandford et al. (1993), made a first approach to such a combination, examining from an empirical point of view the performance of the EM algorithm together with the log-ratio methodology. Martín-Fernández et al. (2003b) presented a first approach to combining the log-ratio methodology with multiple imputation techniques via Markov Chain Monte Carlo (MCMC) simulation algorithms. These research lines are unfinished and require more effort. On the other hand, the group of non-parametric techniques for missing data in real space (Little & Rubin 2002) consists essentially of a family of imputation strategies: cold deck imputation, composite methods, hot deck imputation and mean imputation. Imputation is equivalent to a replacement strategy that completes the incomplete dataset by inserting a quantity for each missing value. Sandford et al. (1993) indicated that if the missing values are reported as 'less than' a given threshold value for some variables, a replacement can be considered. Any multivariate method can then be applied to the completed dataset. Nevertheless, the literature on imputation techniques recommends care when using a replacement strategy because the general structure of the data could be seriously distorted. In particular, the covariance structure and the metric properties of the dataset should be preserved so that further analysis of sub-populations is not misleading. This last requirement provides the clue to replacement techniques for rounded zeros in compositional data. The specific nature of compositional data forces a decision in advance as to which kind of covariance structure and metric
J. A. MARTÍN-FERNÁNDEZ & S. THIÓ-HENESTROSA
properties one wishes to preserve. The choice is between preserving the classical Euclidean covariance and metric, or the covariance and metric induced by the log-ratio methodology. On the one hand, examples (1) and (2) above show that the Euclidean distance may fail to produce reasonable results. On the other hand, compositional data are formed by continuous variables whose scale of measurement is a ratio scale (Jobson 1992, p. 7) and their main operations are perturbation and subcomposition. Consequently, at least for compositional data, the replacement strategies must be coherent with all these basic aspects. Following this point of view, Martín-Fernández et al. (2003a) analysed a multiplicative replacement and, following Sandford et al. (1993), suggested that when the proportion of null values is not large (less than 10% of the values in the data matrix) a simple replacement method that uses an imputation value equal to 65% of the threshold value can be used.
Three non-parametric methods of imputation for rounded zeros

Replacements: additive, simple and multiplicative
Suppose that a D-composition $x = (x_1, x_2, \ldots, x_D)$ contains Z rounded zeros and a scientist wants to replace x by a new composition $r = (r_1, r_2, \ldots, r_D)$ without zeros according to the above arguments. In the literature the scientist finds three different formulae:

$$
r_j =
\begin{cases}
\dfrac{\delta_j (Z+1)(D-Z)}{D^2} & \text{if } x_j = 0, \\[2ex]
x_j - \dfrac{Z+1}{D^2} \displaystyle\sum_{k \mid x_k = 0} \delta_k & \text{if } x_j > 0;
\end{cases}
\qquad (3)
$$

$$
r_j =
\begin{cases}
\dfrac{c}{c + \sum_{k \mid x_k = 0} \delta_k}\, \delta_j & \text{if } x_j = 0, \\[2ex]
\dfrac{c}{c + \sum_{k \mid x_k = 0} \delta_k}\, x_j & \text{if } x_j > 0;
\end{cases}
\qquad (4)
$$

$$
r_j =
\begin{cases}
\delta_j & \text{if } x_j = 0, \\[1ex]
\left(1 - \dfrac{\sum_{k \mid x_k = 0} \delta_k}{c}\right) x_j & \text{if } x_j > 0;
\end{cases}
\qquad (5)
$$

where $\delta_j$ is a small value, less than the given threshold of part $x_j$, and c is the constant of the sum-constraint; for example, c = 100 where data are percentages. This constraint forces the replacement formulae to contain a modification of the non-zero values. In (3) this modification is additive (Aitchison 1986). Formula (3) is referred to as the additive replacement. The term simple replacement is assigned to formula (4) because its procedure is very simple: each rounded zero in the composition is replaced by an appropriate $\delta_j$ and the vector is then closed. In (5) the modification of the non-zero values is multiplicative. This kind of modification was suggested independently by Fry et al. (2000) and Martín-Fernández et al. (2000). In Martín-Fernández et al. (2003a) the multiplicative replacement (5) was analysed in depth. There, from a theoretical point of view, the authors compared its properties with those of the additive and simple replacements.

This work focuses on showing, from a practical point of view, how, why and when the multiplicative replacement may provide better results than the others. For this practical goal the dataset from the database of Cenozoic volcanic rocks of Hungary referred to above is used. Tables 1 and 2 summarize the pattern and the location of rounded zeros in the dataset. Observe that three components have no zeros: SiO2, Al2O3 and K2O. Among the rest, 101 compositions have at least one null value and the zeros are concentrated mainly in components P2O5, TiO2 and MgO. Note (Table 1) that the minimum measured value in these components is 0.01%. The number of null values is small (1.8%): out of the 959 x 9 values in the data matrix, 153 are zeros. Therefore, it seems reasonable to consider a simple-substitution strategy (Martín-Fernández et al. 2003a).

Table 1. Descriptive measures for each component of the dataset

Component    Number obs.      Minimum observed   Zeros     Zeros
             without zeros    value (%)          Number    Percent
SiO2         959              42.30              0         0.0
TiO2         914              0.01               45        4.7
Al2O3        959              8.21               0         0.0
Fe2O3_tot    958              0.26               1         0.1
MgO          935              0.01               24        2.5
CaO          957              0.07               2         0.2
Na2O         956              0.10               3         0.3
K2O          959              0.62               0         0.0
P2O5         881              0.01               78        8.1

Before any of the three replacement formulae is applied, the value of the threshold of each component must be selected. For these purposes four samples (Table 3) from the dataset were selected. These samples represent different possibilities in relation to the number of null values. For these samples a running number
Table 2. The pattern of null values in the dataset
Column headings: Pattern of zeros (SiO2, TiO2, Al2O3, Fe2O3_tot, MgO, CaO, Na2O, K2O, P2O5); Number of observations; Observations without zeros (a)
Number of observations: 858 39 3 12 3 8 25 5 3
Observations without zeros (a): 858 897 912 870 881 866 930 953 933 931 898 914
Remaining cell entries: 0 0 | 1 1 1 | 0 0 0 0 0 0 | 0 0 | 0
Note: '0' symbolizes that the component contains a null value.
(a) Number of observations without zeros if the corresponding variables with zero values are not considered.
has been included in the first column in order to reference them in calculations and examples. The sample numbered 1* in the last row is an artificial sample obtained from sample number 1 by forcing parts TiO2, MgO and P2O5 to take the value zero, and then closing the resulting sample so that it again sums to 100. All these values are recorded with four decimal digits in order to illustrate the effect of each replacement formula more clearly.
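Formulae (3), (4) and (5) translate almost directly into code. The following is a minimal sketch in Python (standard library only); the function names and the convention of passing one $\delta_j$ per part (ignored where the part is non-zero) are assumptions made for illustration. It is applied to the artificial sample 1* of Table 3, with $\delta_j$ set to the values that were removed from sample 1 (0.14, 0.13 and 0.03).

```python
def additive_replacement(x, delta):
    """Formula (3): zeros receive a scaled delta; non-zero parts are shifted additively."""
    D = len(x)
    zeros = [j for j, value in enumerate(x) if value == 0]
    Z = len(zeros)
    shift = (Z + 1) / D**2 * sum(delta[k] for k in zeros)
    return [delta[j] * (Z + 1) * (D - Z) / D**2 if x[j] == 0 else x[j] - shift
            for j in range(D)]

def simple_replacement(x, delta, c=100.0):
    """Formula (4): put delta in place of each zero, then close the vector to c."""
    zero_sum = sum(delta[j] for j, value in enumerate(x) if value == 0)
    scale = c / (c + zero_sum)
    return [scale * (delta[j] if value == 0 else value) for j, value in enumerate(x)]

def multiplicative_replacement(x, delta, c=100.0):
    """Formula (5): zeros become delta; non-zero parts are rescaled multiplicatively."""
    zero_sum = sum(delta[j] for j, value in enumerate(x) if value == 0)
    return [delta[j] if value == 0 else value * (1 - zero_sum / c)
            for j, value in enumerate(x)]

# Sample 1* (Table 3): SiO2, TiO2, Al2O3, Fe2O3_tot, MgO, CaO, Na2O, K2O, P2O5.
sample_1_star = [75.3033, 0.0, 14.4390, 1.6545, 0.0, 0.9425, 2.9179, 4.7428, 0.0]
# One delta per part; only the entries at zero positions are used.
delta = [0.0, 0.14, 0.0, 0.0, 0.13, 0.0, 0.0, 0.0, 0.03]

additive = additive_replacement(sample_1_star, delta)
simple = simple_replacement(sample_1_star, delta)
multiplicative = multiplicative_replacement(sample_1_star, delta)

for name, r in (("additive", additive), ("simple", simple),
                ("multiplicative", multiplicative)):
    print(name, [round(value, 4) for value in r], round(sum(r), 4))
```

All three results still sum to 100. Because the inputs are the four-decimal values of Table 3, the last printed digit may differ slightly from the values tabulated in the text.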
Restoration of the 'true' composition: natural replacement
It seems logical to expect that a replacement strategy will restore the 'true' sample when the null values are replaced by the 'true' values. Consider the sample numbered 1* in Table 3. Suppose that a scientist decides that the appropriate $\delta_j$ for parts TiO2, MgO and P2O5 are equal to 0.14, 0.13 and 0.03, respectively, which are the true values of sample number 1 (Table 3). These $\delta_j$ values are used in formulae (3), (4) and (5). Table 4 shows the result of applying the additive, simple and multiplicative replacements to this composition. Observe that the multiplicative replacement is the only one which, with this direct imputation, restores exactly the true composition (number 1 in Table 3).

It is clear from the structure of formulae (3) and (4) that they are also capable of restoring the true sample. It is only necessary to analyse the relationship between the specific value $\delta_j$ and the final imputed value. For example, consider part TiO2 in the additive replacement. If one wishes the replaced composition r to restore the true value 0.140 then $\delta_2$ must satisfy

$$
\frac{\delta_2 (3+1)(9-3)}{9^2} = 0.140. \qquad (6)
$$

From this relationship it is easy to calculate that $\delta_2 = 0.4724$. Making this calculation for parts MgO and P2O5, the values $\delta_5 = 0.4386$ and $\delta_9 = 0.1012$ are obtained. In this way the final values imputed by the additive replacement (3) are the true ones. However, the modified non-null values do not coincide with the corresponding true values. This distortion happens because the non-null values in (3) are modified in an additive way using the arithmetic mean of the imputed values. For example, in the additive replacement (3) the SiO2 part is forced to take the value

$$
r_1 = 75.3033 - \frac{3+1}{9^2}\,(0.4724 + 0.4386 + 0.1012) = 75.2533, \qquad (7)
$$

which is different from the true value 75.0775 in sample number 1. Following the same example, in the simple replacement (4) the relationship for part TiO2 is

$$
\frac{100}{100 + \delta_2 + \delta_5 + \delta_9}\, \delta_2 = 0.140. \qquad (8)
$$

It is obvious that in order to know the value of $\delta_2$ it is necessary to solve this relationship simultaneously for parts TiO2, MgO and P2O5. In the end, when these $\delta_j$ values are used in (4), all true values of sample number 1, zeros and non-zeros, can be restored. In relation to the possibility of restoring the true values in one sample it is concluded that the multiplicative replacement (5) is more natural than the additive and simple replacements because it allows restoration of the true values in an easier and faster way.

Table 3. Arbitrarily selected samples from the dataset

No.   SiO2      TiO2     Al2O3     Fe2O3_tot   MgO      CaO      Na2O     K2O      P2O5
1     75.0775   0.1400   14.3957   1.6495      0.1300   0.9397   2.9091   4.7286   0.0300
2     76.2700   0.0000   14.1100   1.1600      0.0300   0.9800   3.3100   4.1200   0.0200
3     75.6819   0.0000   13.3080   2.5377      0.0699   1.1290   3.1671   4.1063   0.0000
4     76.3489   0.0000   13.6091   1.7486      0.0000   1.1791   3.4173   3.6970   0.0000
1*    75.3033   0.0000   14.4390   1.6545      0.0000   0.9425   2.9179   4.7428   0.0000

Table 4. Result from the application of replacement to the composition numbered 1* in Table 3

Replacement      SiO2      TiO2     Al2O3     Fe2O3_tot   MgO      CaO      Na2O     K2O      P2O5
None (1*)        75.3033   0.0000   14.4390   1.6545      0.0000   0.9425   2.9179   4.7428   0.0000
Additive         75.2885   0.0415   14.4242   1.6397      0.0385   0.9277   2.9031   4.7280   0.0089
Simple           75.0782   0.1395   14.3958   1.6495      0.1296   0.9397   2.9092   4.7286   0.0299
Multiplicative   75.0775   0.1400   14.3957   1.6495      0.1300   0.9397   2.9091   4.7286   0.0300

Relationship between final imputed value and threshold: clear replacement
Dealing with rounded zeros is essentially the same as dealing with NMAR missing values, where NMAR (Little & Rubin 2002) means Not Missing At Random. For NMAR missing values the probability that a component is missing may depend on the unobserved component of the data. That is, the mechanism of 'missingness' is non-ignorable. This is essentially the case for rounded zeros in compositional data because the 'missingness' is strictly related to the nature of the variable rather than to the sample itself. Consequently, in non-parametric imputation methods, it seems logical to expect that the final imputed value in a specific part depends on the nature of the part rather than on the other values in the sample. For ease of readability $\delta_j$ is taken to be 0.1% for all parts that have null values, but it is important to remark that the recommended value (Martín-Fernández et al. 2003a) is 65% of the threshold value. Using $\delta_j$ = 0.1%, the replacement formulae (3), (4) and (5) are applied to the samples numbered 2 to 4 and 1* (Table 3). Table 5 shows the replaced compositions for each
Table 5. Results from the application of replacement to the samples numbered 2, 3, 4 and 1* in Table 3

Replacement/No.   SiO2      TiO2     Al2O3     Fe2O3_tot   MgO      CaO      Na2O     K2O      P2O5
Additive
  2               76.2675   0.0198   14.1075   1.1575      0.0275   0.9775   3.3075   4.1175   0.0175
  3               75.6745   0.0259   13.3006   2.5303      0.0625   1.1216   3.1597   4.0989   0.0259
  4               76.3341   0.0296   13.5943   1.7338      0.0296   1.1642   3.4025   3.6822   0.0296
  1*              75.2885   0.0296   14.4242   1.6397      0.0296   0.9277   2.9031   4.7280   0.0296
Simple
  2               76.1938   0.0999   14.0959   1.1588      0.0300   0.9790   3.3067   4.1159   0.0200
  3               75.5308   0.0998   13.2815   2.5327      0.0698   1.1267   3.1608   4.0981   0.0998
  4               76.1206   0.0997   13.5684   1.7434      0.0997   1.1755   3.4070   3.6860   0.0997
  1*              75.0781   0.0997   14.3958   1.6495      0.0997   0.9397   2.9092   4.7286   0.0997
Multiplicative
  2               76.1937   0.1000   14.0959   1.1588      0.0300   0.9790   3.3067   4.1159   0.0200
  3               75.5305   0.1000   13.2814   2.5326      0.0698   1.1267   3.1608   4.0981   0.1000
  4               76.1199   0.1000   13.5683   1.7434      0.1000   1.1755   3.4070   3.6860   0.1000
  1*              75.0774   0.1000   14.3957   1.6495      0.1000   0.9397   2.9091   4.7286   0.1000
Table 6. Some ratios from the application of replacement ($\delta_j$ = 0.1%) to the samples numbered 2, 3, 4 and 1* in Table 3

Ratio (in %)                       Initial     Additive    Simple      Multiplicative
Al2O3/SiO2 of sample 4             17.8249     17.8089     17.8249     17.8249
Between 1* and 4 in part Na2O      85.3863     85.3227     85.3863     85.3863
Between 3 and 2 in part Na2O       95.6843     95.5318     95.5888     95.5885
Between 4 and 2 in part Na2O       103.2407    102.8698    103.0348    103.0340
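The first row of Table 6 can be reproduced from formulae (3), (4) and (5) alone, because the modification of a non-zero part depends only on D, Z, c and the sum of the imputed values. The sketch below (Python, standard library only) uses the illustrative helper name `modified_nonzero`, which is not part of the original text, and the four-decimal values of sample 4 from Table 3.

```python
D, c, delta = 9, 100.0, 0.1   # number of parts, closure constant, common delta (in %)

def modified_nonzero(x, Z, method):
    """Value of a non-zero part x after Z zeros have each been replaced using delta."""
    total = Z * delta
    if method == "additive":          # formula (3)
        return x - (Z + 1) / D**2 * total
    if method == "simple":            # formula (4)
        return x * c / (c + total)
    if method == "multiplicative":    # formula (5)
        return x * (1 - total / c)
    raise ValueError(f"unknown method: {method}")

# Sample 4 in Table 3 has Z = 3 zeros (TiO2, MgO and P2O5).
al2o3, sio2, Z = 13.6091, 76.3489, 3

initial = 100 * al2o3 / sio2
ratios = {
    method: 100 * modified_nonzero(al2o3, Z, method) / modified_nonzero(sio2, Z, method)
    for method in ("additive", "simple", "multiplicative")
}

print("initial", round(initial, 4))
for method, value in ratios.items():
    print(method, round(value, 4))
```

The simple and multiplicative replacements multiply both parts by the same factor, so the ratio is unchanged; the additive replacement subtracts the same amount from both parts, which shifts the ratio.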
replacement. Consider part TiO2. For the additive (3) and the simple (4) replacements the final imputed value differs among samples 2, 3 and 4 because it depends on the number of null values in the sample. For the additive replacement the final imputed value increases when the number of zeros increases; for the simple replacement the effect is the opposite. These effects do not seem reasonable since, for the same part, i.e. the same threshold, the final imputed value depends on the presence or absence of null values in other parts of the sample. In contrast, the multiplicative replacement imputes exactly the same value for all the null values of the part. This effect is reasonable, taking into account the nature of the rounded zeros. In addition, this replacement introduces artificial correlation between parts which have null values in the same samples. This effect is undesirable and can distort the results of subsequent multivariate analysis when the number of null values in the dataset is large, more than 10% (Sandford et al. 1993). Nevertheless, this artificial correlation is an inherent effect of the non-parametric methods of imputation rather than an effect of the specific formula (5). Note that this effect is present in this kind of technique for data in real space and is well known in the missing data approach (Little & Rubin 2002). When the number of zeros in the dataset is quite large, parametric methods of imputation are recommended. These methods incorporate the information included in the covariance structure and impute values taking this information into account. As a consequence, the imputed value would be different for each sample, and the specific mechanism of imputation must be the multiplicative replacement formula (5), since its procedure is clearer than those of the additive and simple formulae.
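The dependence of the final imputed value on the number of zeros Z can be read directly from the zero branches of formulae (3), (4) and (5). The following minimal sketch (Python, standard library only; the function name `imputed_value` is illustrative) evaluates them for D = 9, c = 100 and a common $\delta_j$ = 0.1, the setting of Table 5.

```python
D, c, delta = 9, 100.0, 0.1   # number of parts, closure constant, common delta (in %)

def imputed_value(Z):
    """Final imputed value in a zero part when the sample contains Z zeros."""
    additive = delta * (Z + 1) * (D - Z) / D**2   # zero branch of formula (3)
    simple = c * delta / (c + Z * delta)          # zero branch of formula (4)
    multiplicative = delta                        # zero branch of formula (5)
    return additive, simple, multiplicative

# Z = 1, 2, 3 correspond to samples 2, 3 and 4 (or 1*) of Table 3.
table = {Z: tuple(round(value, 4) for value in imputed_value(Z)) for Z in (1, 2, 3)}
for Z, row in table.items():
    print(Z, row)
```

The printed rows match the TiO2 column of Table 5: the additive value grows with Z, the simple value shrinks slightly, and the multiplicative value stays at 0.1.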
Avoiding the distortion of covariance structure: preservation of ratios
In order to evaluate the distortion of the covariance structure, the effect of each replacement on the ratios of the data is analysed in this section. It is important to analyse the ratios between two parts in one sample and also the ratios between two samples in one specific part. Naturally, as the log-ratio covariance is based on the ratios, those replacements that preserve the ratios will be more reliable in preserving the covariance structure. Table 6 shows the ratio between the parts Al2O3 and SiO2 for sample number 4. Initially, this ratio is equal to 17.8249. After each replacement is applied ($\delta_j$ = 0.1%), the simple and multiplicative replacements preserve the ratio and the additive replacement distorts it. Samples 4 and 1* have the same number (3) of null values, located in the same parts. For the part Na2O, the initial ratio between samples 1* and 4 is equal to 85.3863. After the replacements are applied the additive method distorts the ratio and the simple and multiplicative methods preserve it. This preservation of ratios by the simple and multiplicative replacements is a consequence of the multiplicative modification of the non-null values in formulae (4) and (5). In other words, the additive modification of the non-zero values in formula (3) distorts the ratios. The ratios between samples 3, 4 and 2 are distorted by all the replacements since these samples have different numbers of null values. Nevertheless, note that in these cases the behaviour of the simple and the multiplicative replacements is extremely similar. In Martín-Fernández et al. (2003a) an extended analysis of the theoretical properties of these replacements is presented. There the authors review in depth the properties of these replacements in relation to the basic operations (subcomposition, perturbation and power transformation) and in relation to basic elements of the log-ratio methodology (Aitchison distance, compositional geometric mean, variance matrix and total variance).

Sensitivity analysis: the obligatory step in non-parametric replacements
In any non-parametric replacement strategy the imputed value is selected in advance. After the replacement is made some multivariate method will be applied and results expressed through some indices will be obtained. For example, after PCA is applied one has the proportion of explained variance; in cluster analysis the number of groups is obtained; in discriminant analysis the linear discriminant function (LDF) misclassification rate is calculated; in linear regression the R² index is computed. There are numerous examples of results from a multivariate analysis. The question that naturally arises is how robust the results are in relation to the imputed values in the replacement method. Therefore, a sensitivity analysis of the results in relation to these values must be made. It is very important to emphasize that this step is not inherent in the log-ratio methodology. This kind of analysis is obligatory for any kind of dataset and any methodology applied. In Tauber (1999) a descriptive example is presented in order to illustrate the strong influence of the selection of the $\delta_j$ value in cluster analysis studies using log-ratio methodology for compositional data. There the main argument was that when the imputed value tends to zero spurious clusters appear. Note that this effect is logical and not inherent in the log-ratio methodology since in the Euclidean context exactly the same effect would appear. If there are missing values in some Euclidean dataset and these are replaced by increasingly large or decreasingly small imputed values, spurious clusters are obtained when these imputed values tend to infinity or minus infinity.

For the dataset of Cenozoic volcanic rocks, a detailed revision of the data and geological knowledge of the sampling process (Ó. Kovács & Kovács 2001) suggest a common threshold for all parts equal to 0.01%, and hence the maximum rounding-off error $\delta_{mro}$ = 0.005% is considered. Martín-Fernández et al. (2003a) suggested 65% of the threshold, $\delta_j$ = 0.0065%, as a suitable imputed value. Using these values the multiplicative replacement (5) was applied. In Martín-Fernández et al. (2004b) the authors were interested in linear discriminant analysis using log-ratio methodology on the dataset from the database of Cenozoic volcanic rocks of Hungary referred to above. In this dataset two groups of samples exist: alkaline basalts and the calc-alkaline series. The main goal of this study was to contribute to the understanding of petrogenetic processes that occurred in the Carpatho-Pannonian region. After a first descriptive step based on the compositional biplot (Aitchison & Greenacre 2002) of the replaced dataset (Fig. 2), it was noted that the rays of clr(SiO2), clr(K2O), clr(Na2O) and clr(Al2O3) are the closest rays to the first axis of the biplot, the direction along which the projections of the alkaline basalts and the calc-alkaline series are best separated.
[Figure 2: compositional biplot; axes: axis1, axis2; rays include clr(MgO), clr(CaO), clr(Al2O3), clr(SiO2), clr(K2O), clr(TiO2) and clr(P2O5).]
Fig. 2. Biplot in the clr-transformed space. Clr components and samples: circles represent calc-alkaline series; dots, alkaline basalts.
[Figure 3: ternary diagram; vertex labels include MgOc and (Na2O+K2O)c.]
Fig. 3. Centred [SiO2; MgO; Na2O + K2O] subcomposition. Circles represent calc-alkaline series; dots, alkaline basalts. Superindex c indicates centred parts.
The good separation of the alkaline basalts and the calc-alkaline series in the biplot is also numerically confirmed by a linear discriminant analysis of the two groups applied to the clr-transformed dataset: only 3.96% of the observations are incorrectly classified (misclassification rate) by the LDF. Subcompositional linear patterns and, simultaneously, a reasonable separation of the two groups are obtained with ternary diagrams including SiO2 and K2O, or SiO2 and Na2O, and a third component, e.g. MgO, whose vertex lies further apart in the biplot (Fig. 2). Further, in order to allow comparison of the log-ratio analysis with the results from traditional methods (Martín-Fernández et al. 2004b), the amalgamated subcomposition [SiO2; MgO; Na2O + K2O] is considered. Clearly, the separation of the two groups is better visually (Fig. 3; LDF misclassification rate: 3.96%) with the amalgamated subcomposition. Note that the data have been centred (Martín-Fernández et al. 1999). For more details of the centring operation see Pawlowsky-Glahn & Egozcue (2006). A sensitivity analysis must now be performed. In Aitchison (1986) the range $\delta_{mro}/5 < \delta < 2\delta_{mro}$, where $\delta_{mro}$ is the maximum rounding-off error, is suggested as reasonable for a sensitivity analysis. The imputed value (Martín-Fernández et al. 2003a) is 65% of the threshold and so the range seems to be appropriate. Figure 4 shows the pattern of variation of the LDF misclassification rate when the value $\delta_j$ = 0.000065 varies simultaneously for all parts within $0.00001 < \delta_j < 0.0001$. Observe that for values around the imputed value the LDF rate is reasonably stable and that, when $\delta$ tends to zero, the LDF rate increases, showing that the two groups become more mixed. The reason for this behaviour lies in the null values in the part MgO of the samples which belong to the calc-alkaline series group. These values are responsible for the calc-alkaline group increasing its variability.
[Figure 4: plot of the LDF misclassification rate (vertical axis) against $\delta$ (horizontal axis, 0.00001 < $\delta$ < 0.0001).]
Fig. 4. Sensitivity analysis: variation of the LDF misclassification rate.
Therefore, those calc-alkaline samples which are close to the alkaline basalts group, with high MgO, are misclassified. This sensitivity analysis could be made more sophisticated by considering different combinations of the variation of $\delta_j$ among the parts. For example, one could fix the value $\delta_j$ in some parts and vary it in other parts in order to detect the contribution of each combination of parts to the sensitivity. Another possibility is to decrease the value $\delta_j$ for some parts and simultaneously increase it for other parts. Naturally, the results produced are different in each practical study performed. From the authors' experience the most interesting and interpretable results are obtained when a global sensitivity analysis is performed in the way that has produced the LDF misclassification rate in Figure 4.
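The mechanics of a global sensitivity sweep can be illustrated with a toy example. The sketch below (Python, standard library only) is not the analysis behind Figure 4: the two three-part compositions are hypothetical, and the Aitchison distance between them is used as a stand-in index in place of the LDF misclassification rate. Only the range $\delta_{mro}/5 < \delta < 2\delta_{mro}$ with $\delta_{mro}$ = 0.005% and the multiplicative replacement (5) are taken from the text.

```python
import math

def clr(x):
    """Centred log-ratio transform of a strictly positive composition."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]

def aitchison(x, y):
    """Aitchison distance between two compositions without zeros."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(clr(x), clr(y))))

def multiplicative_replacement(x, delta, c=100.0):
    """Formula (5) with a common delta for every zero part."""
    n_zeros = sum(1 for v in x if v == 0)
    return [delta if v == 0 else v * (1 - n_zeros * delta / c) for v in x]

delta_mro = 0.005                      # maximum rounding-off error, in %
low, high = delta_mro / 5, 2 * delta_mro
grid = [low + i * (high - low) / 9 for i in range(10)]

x_with_zero = [0.0, 40.0, 60.0]        # hypothetical sample with one rounded zero
reference = [10.0, 40.0, 50.0]         # hypothetical sample without zeros

# Index of interest (here a distance) recomputed for each candidate delta.
sweep = [(d, aitchison(multiplicative_replacement(x_with_zero, d), reference))
         for d in grid]

for d, dist in sweep:
    print(f"delta = {d:.4f}%  distance = {dist:.4f}")
```

In this toy setting the distance grows steadily as $\delta$ decreases, which mirrors the qualitative behaviour described above; in a real study the index would be the statistic actually reported, recomputed on the full replaced dataset for each $\delta$.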
Concluding remarks
The well-known 'problem of zeros' is inherent in the nature of compositional data rather than in the log-ratio methodology. In particular, rounded zeros should be considered as small missing values. In datasets where null values are less than 10% of the data, three different formulae of non-parametric replacement can be applied. The multiplicative replacement appears to be the easiest, fastest, most coherent and most natural formula for substituting rounded zeros by appropriate small values. Whatever the replacement method employed, a sensitivity analysis is obligatory in order to assess the variability of the results in relation to the variation of the imputed value. For datasets with a large number of rounded zeros, parametric methods for missing data should be applied. These methods are not developed here but it seems reasonable to expect that they will consist of a combination of either the EM algorithm or MCMC methods with the appropriate log-ratio transformation: additive log-ratio (alr), isometric log-ratio (ilr) or centred log-ratio (clr). Future research will focus on this strategy.

This work has received financial support from the Dirección General de Investigación of the Spanish Ministry for Science and Technology through the project BFM2003-05640/MATE. The dataset from the database of Cenozoic volcanic rocks of Hungary has been kindly provided by Drs L. Ó. Kovács and G. P. Kovács from the Hungarian Geological Survey.
References
AITCHISON, J. 1986. The Statistical Analysis of Compositional Data. Chapman & Hall, London. Reprinted (2003) by The Blackburn Press, Caldwell, NJ.
AITCHISON, J. & GREENACRE, M. 2002. Biplots of compositional data. Applied Statistics, 51, 375-392.
AITCHISON, J. & KAY, J. W. 2003. Possible solutions of some essential zero problems in compositional data analysis. In: THIÓ-HENESTROSA, S. & MARTÍN-FERNÁNDEZ, J. A. (eds) Proceedings of CODAWORK'03, The First Compositional Data Analysis Workshop, October 15-17, University of Girona (Spain). CD-ROM (World Wide Web: http://ima.udg.es/Activitats/CoDaWork03/index.html#session2).
BACON-SHONE, J. 2003. Modelling structural zeros in compositional data. In: THIÓ-HENESTROSA, S. & MARTÍN-FERNÁNDEZ, J. A. (eds) Proceedings of CODAWORK'03, The First Compositional Data Analysis Workshop, October 15-17, University of Girona (Spain). CD-ROM (World Wide Web: http://ima.udg.es/Activitats/CoDaWork03/index.html#session2).
BUCCIANTI, A. & ROSSO, F. 1999. A new approach to the statistical analysis of compositional (closed) data with observations below the 'detection limit'. Geoinformatica, 3, 17-31.
EGOZCUE, J. J. & PAWLOWSKY-GLAHN, V. 2006. Simplicial geometry for compositional data. In: BUCCIANTI, A., MATEU-FIGUERAS, G. & PAWLOWSKY-GLAHN, V. (eds) Compositional Data Analysis in the Geosciences: From Theory to Practice. Geological Society, London, Special Publications, 264, 1-10.
FRY, J. M., FRY, T. R. L. & MCLAREN, K. R. 2000. Compositional data analysis and zeros in micro data. Applied Economics, 32, 953-959.
LITTLE, R. J. A. & RUBIN, D. B. 2002. Statistical Analysis with Missing Data (2nd edn). John Wiley and Sons, New York.
JOBSON, J. D. 1992. Applied multivariate data analysis, Vol. II. Categorical and multivariate data analysis. Springer texts in statistics, Springer, New York.
MARTÍN-FERNÁNDEZ, J. A., BREN, M., BARCELÓ-VIDAL, C. & PAWLOWSKY-GLAHN, V. 1999. A measure of difference for compositional data based on measures of divergence. In: Proceedings of IAMG'99. Trondheim, Norway, 1, 211-216.
MARTÍN-FERNÁNDEZ, J. A., BARCELÓ-VIDAL, C. & PAWLOWSKY-GLAHN, V. 2000. Zero replacement in compositional data sets. In: KIERS, H. A. L., RASSON, J.-P., GROENEN, P. J. F. & SCHADER, M. (eds) Proceedings of the 7th Conference of the International Federation of Classification Societies, University of Namur (Belgium). Springer-Verlag, Berlin, Germany, 155-160.
MARTÍN-FERNÁNDEZ, J. A., BARCELÓ-VIDAL, C. & PAWLOWSKY-GLAHN, V. 2003a. Dealing with zeros and missing values in compositional data sets. Mathematical Geology, 35 (3), 253-278.
MARTÍN-FERNÁNDEZ, J. A., PALAREA-ALBALADEJO, J. & GÓMEZ-GARCÍA, J. 2003b. Markov Chain Monte Carlo Method Applied to Rounding Zeros of Compositional Data: First Approach. In: THIÓ-HENESTROSA, S. & MARTÍN-FERNÁNDEZ, J. A. (eds) Proceedings of CODAWORK'03, The First Compositional Data Analysis Workshop, October 15-17, University of Girona (Spain). CD-ROM (World Wide Web: http://ima.udg.es/Activitats/CoDaWork03/index.html#session2).
MARTÍN-FERNÁNDEZ, J. A., Ó.KOVÁCS, L., KOVÁCS, G. P. & PAWLOWSKY-GLAHN, V. 2004a. The treatment of zeros in compositional data analysis: the database of cenozoic volcanites of Hungary. 32nd International Geological Conference, Florence (I), Abstracts Volume, part 1, abstract 41-12, p. 213.
MARTÍN-FERNÁNDEZ, J. A., PAWLOWSKY-GLAHN, V., Ó.KOVÁCS, L. & KOVÁCS, G. P. 2004b. Subcompositional exploration in the database of cenozoic volcanites of Hungary. 32nd International Geological Conference, Florence (I), Abstracts Volume, part 1, abstract 41-16, p. 214.
Ó.KOVÁCS, L. & KOVÁCS, G. P. 2001. Petrochemical database of the Cenozoic volcanites in Hungary: structure and statistics. Acta Geologica Hungarica, 44 (4), 381-417.
PAWLOWSKY-GLAHN, V. & EGOZCUE, J. J. 2006. Compositional data and their analysis: an introduction. In: BUCCIANTI, A., MATEU-FIGUERAS, G. & PAWLOWSKY-GLAHN, V. (eds) Compositional Data Analysis in the Geosciences: From Theory to Practice. Geological Society, London, Special Publications, 264, 1-10.
SANDFORD, R. F., PIERSON, C. T. & CROVELLI, R. A. 1993. An objective replacement method for censored geochemical data. Mathematical Geology, 25 (1), 59-80.
TAUBER, F. 1999. Spurious clusters in granulometric data caused by logratio transformation. Mathematical Geology, 31 (5), 491-504.
Is the simplex open or closed? (some topological concepts) E. B A R R A B I ~ S & G. M A T E U - F I G U E R A S
Department Informgttica i Matemhtica Aplicada, Universitat de Girona, Campus Montilivi, P4, E-17071 Girona, Spain (e-mail:
[email protected]) Abstract: The simplex is the natural space to work with when compositional data are considered. Sometimes, the concepts of open simplex and closed simplex are used, although most of the time they are not well defined. The objective of this contribution is to expose some of the mathematical concepts related to the simplex and its structure in order to make clear when the terms open and closed are mathematically appropriate. Moreover, these concepts sometimes generate discussion about the proper representation of the simplex. It will be shown that this discussion makes no sense when considering the simplex as a Euclidean vector space.
From: BUCCIANTI, A., MATEU-FIGUERAS, G. & PAWLOWSKY-GLAHN, V. (eds) Compositional Data Analysis in the Geosciences: From Theory to Practice. Geological Society, London, Special Publications, 264, 203-206. 0305-8719/06/$15.00 © The Geological Society of London 2006.

Since Aitchison (1982) introduced the log-ratio approach to compositional data, there has been occasional discussion about the proper representation of this type of observation and the terminology to be used. Since this discussion arises time and again, it seems appropriate to clarify the concepts involved. The discussion arises from two ways of defining the sample space of compositional data. On the one hand, there is the traditional sample space in geology and other fields of science, which is

$$R = \{x = (x_1, x_2, \ldots, x_D);\; x_1 + x_2 + \cdots + x_D = 1,\; x_i \ge 0,\; i = 1, \ldots, D\}. \qquad (1)$$

Observe that the set $R$ can be regarded as embedded in the $D$-dimensional space $\mathbb{R}^D$ and that it includes data with zero values. On the other hand, the sample space in the log-ratio approach excludes zero values and is given by

$$S = \{x = (x_1, x_2, \ldots, x_D);\; x_1 + x_2 + \cdots + x_D = 1,\; x_i > 0,\; i = 1, \ldots, D\}. \qquad (2)$$

The restriction in $S$ to strictly positive values for all the components is necessary for modelling within the log-ratio approach, as division by zero and the logarithm of zero are not defined. Both sets, $R$ and $S$, are embedded in the $D$-dimensional real space $\mathbb{R}^D$, and the essential difference between them is the inclusion, respectively exclusion, of n-tuples with zero components. The question is whether it is right or wrong to use the terminology closed and open simplex for the sets $R$ and $S$, respectively. Figure 1 shows both sets for $D = 3$.

Fig. 1. Representation of the sets $R$ and $S$ (left and right respectively) in $\mathbb{R}^3$. In the first case, the segments of the triangle (its boundary) on the coordinate planes are included, while in the second case they are not. The vertices of the triangle correspond to data with exactly two components equal to zero and the sides (excluding the vertices) correspond to data with one component equal to zero.

General concepts

To analyse the concepts open simplex and closed simplex, it is necessary to recall the definitions of open and closed ball, open and closed set, and boundary and frontier of a set in a metric space (see, for example, Schechter (1997) or Rudin (1976)). Let $(\mathbb{R}^D, d)$ be the $D$-dimensional real space with the usual Euclidean metric, which is given by the function

$$d(x, x^*) = \|x - x^*\| = \sqrt{(x_1 - x_1^*)^2 + \cdots + (x_D - x_D^*)^2}, \qquad (3)$$

where $d(x, x^*)$ stands for the distance between $x$ and $x^*$, and $\|x - x^*\|$ for the norm of the vector $x - x^*$. For any $z \in \mathbb{R}^D$ and $r > 0$, the open ball centred at $z$ with radius $r$ is the set

$$B_d(z, r) = \{x \in \mathbb{R}^D;\; d(x, z) < r\}, \qquad (4)$$

that is, the set of all points of $\mathbb{R}^D$ located at a distance less than $r$ from $z$. The closed ball with centre $z$ and radius $r$ is the set

$$K_d(z, r) = \{x \in \mathbb{R}^D;\; d(x, z) \le r\}. \qquad (5)$$

The difference between the two sets is whether or not equality is admitted (recall that the same situation holds for the sets $R$ and $S$): closed balls include all points at distance exactly $r$ from the centre, while open balls do not. For example, for $D = 3$, closed and open balls are spheres with and without their surface, respectively.

A set $T \subseteq \mathbb{R}^D$ is said to be open if for each point $z \in T$ there exists $r > 0$ such that $B_d(z, r) \subseteq T$. That is, open sets are those in which an open ball contained in the set can be placed around each of their points, no matter how small the radius of the ball has to be taken. A set $T \subseteq \mathbb{R}^D$ is said to be closed if its complement, that is, the set of all points of $\mathbb{R}^D$ that do not belong to $T$, is an open set. The complement of a set $T$ is written as $T^c = \mathbb{R}^D \setminus T$. Observe that a set can be neither open nor closed. For example, in the real line, the interval $(0, 1)$ is an open set and $[0, 1]$ is a closed set, but $(0, 1]$ is neither open nor closed.

Defining the boundary or frontier of a set requires the concepts of interior and closure of a set. The interior of a set is the largest open set contained in it, and shall be denoted by $\operatorname{int}(T)$. The closure of a set is the smallest closed set which contains it, and shall be denoted by $\operatorname{cls}(T)$. Observe that an open ball is an open set, and that the closure of an open ball is the closed ball with the same centre and radius. Furthermore, a closed ball is a closed set, and the interior of a closed ball is the open ball with the same centre and radius, that is,

$$\operatorname{cls}(B_d(z, r)) = K_d(z, r) \quad \text{and} \quad \operatorname{int}(K_d(z, r)) = B_d(z, r). \qquad (6)$$

Now, the boundary or frontier $\partial T$ of a set $T$ is defined as the intersection of $\operatorname{cls}(T)$ with $\operatorname{cls}(T^c)$, and it can be shown that for any $T \subseteq \mathbb{R}^D$, the space $\mathbb{R}^D$ can be partitioned into the three disjoint sets $\operatorname{int}(T)$, $\operatorname{int}(T^c)$ and $\partial T$, which are open, open and closed, respectively. That is,
$$\mathbb{R}^D = \operatorname{int}(T) \cup \operatorname{int}(T^c) \cup \partial T. \qquad (7)$$

Furthermore, $T$ is a closed set if and only if it is equal to its closure, that is, if and only if $\operatorname{cls}(T) = T$.

An immediate consequence of the definition of an open set is that in $\mathbb{R}^3$ open sets have volume (in general, in $\mathbb{R}^D$, open sets contain open balls); thus, neither $R$ nor $S$ is an open set, given that they are completely flat (see Fig. 1). This can be generalized to any dimension: $R$ and $S$ are not open sets in $\mathbb{R}^D$ with the Euclidean distance because $\operatorname{int}(R) = \operatorname{int}(S) = \emptyset$ and, therefore, they cannot contain any open ball. On the other hand, $R$ is a closed set of $\mathbb{R}^D$, as its complement is open, but $S$ is not a closed set because $\operatorname{cls}(S) = R \ne S$. Furthermore,

$$\partial R = \partial S = \{x = (x_1, x_2, \ldots, x_D) \in R;\; x_i = 0 \text{ for some } i = 1, \ldots, D\}, \qquad (8)$$

that is, in both cases the boundary is exactly the same: the set of the compositions with one or more components equal to 0. In conclusion, if one considers $R, S \subset \mathbb{R}^D$ with the Euclidean distance $d$, the set $R$ is closed but the set $S$ is neither open nor closed.

The simplex as a subset of $\mathbb{R}^{D-1}$

When dealing with the sets $R$ and $S$, a different approach can be used. The condition $x_1 + x_2 + \cdots + x_D = 1$ allows one of the components, say $x_D$, to be expressed in terms of the other variables as

$$x_D = 1 - x_1 - \cdots - x_{D-1}. \qquad (9)$$

Therefore, the following sets can be defined:

$$R' = \{x = (x_1, x_2, \ldots, x_{D-1});\; x_1 + x_2 + \cdots + x_{D-1} \le 1,\; x_i \ge 0,\; i = 1, \ldots, D-1\}, \qquad (10)$$

$$S' = \{x = (x_1, x_2, \ldots, x_{D-1});\; x_1 + x_2 + \cdots + x_{D-1} < 1,\; x_i > 0,\; i = 1, \ldots, D-1\}, \qquad (11)$$

and $x \in R$ if and only if $x = (x', x_D)$, where $x' \in R'$ and $x_D$ satisfies (9) (analogously for the sets $S$ and $S'$). Whereas $R$ and $S$ are embedded in the $D$-dimensional real space, $R'$ and $S'$ are embedded in the $(D-1)$-dimensional real space. Figure 2 shows the sets $R'$ and $S'$ for the case $D = 3$.

Fig. 2. The sets $R'$ and $S'$ for $D = 3$ (left and right respectively) in two-dimensional space. In the first case the sides of the triangle are included while in the second case they are not.

$R'$ and $S'$ are then regarded as subsets of $\mathbb{R}^{D-1}$, and here it is found that $R'$ is a closed set whereas $S'$ is an open set. Furthermore, $R' = \operatorname{cls}(S')$ and $S' = \operatorname{int}(R')$ (recall the difference with $R$ and $S$). This justifies the common terminology closed simplex for $R'$ and open simplex for $S'$. It must be emphasized, however, that in this case only $D - 1$ variables are involved, so the representation is different from the representation of the sets $R$ and $S$.
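The contrast between S as a subset of the D-dimensional real space and S' as a subset of the (D-1)-dimensional real space can be probed numerically. The following Python sketch is an illustration only and is not taken from the paper: the function names, the tolerance `TOL` and the test point are assumptions. It checks membership in R, S and S', shows that an arbitrarily small Euclidean step off the unit-sum hyperplane leaves S (so no open ball of the D-dimensional space fits inside S), and computes a radius for which an open ball around a point of S' stays inside S'.

```python
import numpy as np

TOL = 1e-12  # tolerance for the unit-sum constraint (an assumption of this sketch)

def in_R(x):
    """Membership in R: parts sum to 1 and are all >= 0."""
    x = np.asarray(x, dtype=float)
    return bool(abs(x.sum() - 1.0) <= TOL) and bool(np.all(x >= 0.0))

def in_S(x):
    """Membership in S: parts sum to 1 and are all > 0."""
    x = np.asarray(x, dtype=float)
    return bool(abs(x.sum() - 1.0) <= TOL) and bool(np.all(x > 0.0))

def in_S_prime(xp):
    """Membership in S' (first D-1 parts): all > 0 and sum < 1."""
    xp = np.asarray(xp, dtype=float)
    return bool(np.all(xp > 0.0)) and bool(xp.sum() < 1.0)

def ball_radius_in_S_prime(xp):
    """Radius r such that the open Euclidean ball B(xp, r) lies inside S'.

    r is the smallest distance from xp to the hyperplanes x_i = 0 and
    x_1 + ... + x_{D-1} = 1; it is 0 when xp lies on one of them.
    """
    xp = np.asarray(xp, dtype=float)
    dist_to_faces = np.append(xp, (1.0 - xp.sum()) / np.sqrt(xp.size))
    return float(dist_to_faces.min())

x = np.array([0.2, 0.3, 0.5])                     # a point of S for D = 3
step = 1e-6 * np.ones(3) / np.sqrt(3)             # tiny step normal to the hyperplane
print("x in S:", in_S(x))
print("x + step in S:", in_S(x + step))           # the step leaves the hyperplane
print("ball radius around x' in S':", ball_radius_in_S_prime(x[:2]))
```

The same tiny step that leaves S in three dimensions has no counterpart in the two-coordinate description, where every sufficiently small displacement of x' stays in S'; this is the numerical face of the statement that S' is open while S is not.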
The simplex as a Euclidean vector space

Aitchison (1986) showed that the standard operations in real space may make no sense from a compositional point of view. Aitchison defined two fundamental operations, perturbation and powering, as well as a distance in the simplex, known as the Aitchison distance, which is given by

$$d_a(x, x^*) = \sqrt{\frac{1}{D} \sum_{i<j} \left( \ln\frac{x_i}{x_j} - \ln\frac{x_i^*}{x_j^*} \right)^2}. \qquad (12)$$

The distance $d_a$ cannot be defined for any composition $x$ with one or more components equal to zero because it involves division by zero and/or the logarithm of zero. Consequently, the distance $d_a$ cannot be defined on the set $R$, whereas there is no problem in defining it on the set $S$. Later, Billheimer et al. (2001) and Pawlowsky-Glahn & Egozcue (2001) proved that the set $S$ with the operations defined by Aitchison has a Euclidean vector space structure, and that it is not adequate to consider it as a subset of the real space $\mathbb{R}^D$. A summary of the state of the art can be found in this volume in the article 'Simplicial geometry for compositional data' by Egozcue & Pawlowsky-Glahn (2006). Thus, to be coherent from a compositional point of view, the simplex must be considered as a Euclidean vector space in itself, that is, $(S, d_a)$ must be considered as a metric space. In this context, the discussion about closed and open simplex has no meaning whatsoever, as $S$ is the whole space and is therefore open and closed at the same time.

Do the compositions with zero parts play any role in this context? It is clear that they do not belong to the space $S$, and that they are at infinite distance ($d_a$) from any point of the set $S$. So, any point $x \in R$ with $x_i = 0$ for some $i = 1, \ldots, D$ behaves like a point with an infinite component in real space. But what happens if a dataset contains zeros? There will be problems using the log-ratio approach. At this point it will be necessary to apply a replacement strategy to the zeros (see, for example, Martín-Fernández et al. (2003)) in order to use the Aitchison distance without numerical problems.
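The behaviour of the Aitchison distance near zero parts can be illustrated with a short Python sketch. It is not part of the original paper and rests on stated assumptions: the distance is computed in the usual pairwise log-ratio form, `clr` is the centred log-ratio map used only as a cross-check, and `multiplicative_replacement` with its value `delta` is a hypothetical stand-in for a zero replacement strategy of the kind the text refers to, not a prescription from the authors.

```python
import numpy as np

def clr(x):
    """Centred log-ratio coefficients of a strictly positive composition."""
    logx = np.log(np.asarray(x, dtype=float))
    return logx - logx.mean()

def aitchison_distance(x, y):
    """Aitchison distance computed from all pairwise log-ratios with i < j."""
    lx = np.log(np.asarray(x, dtype=float))
    ly = np.log(np.asarray(y, dtype=float))
    D = lx.size
    # diff[i, j] = ln(x_i / x_j) - ln(y_i / y_j)
    diff = (lx[:, None] - lx[None, :]) - (ly[:, None] - ly[None, :])
    i, j = np.triu_indices(D, k=1)
    return float(np.sqrt(np.sum(diff[i, j] ** 2) / D))

def multiplicative_replacement(x, delta=1e-3):
    """Replace zeros by delta and rescale the other parts so the sum stays 1.

    Assumes x is closed to 1; delta is an illustrative value, not a recommendation.
    """
    x = np.asarray(x, dtype=float)
    zeros = x == 0.0
    return np.where(zeros, delta, x * (1.0 - delta * zeros.sum()))

barycentre = np.array([1.0, 1.0, 1.0]) / 3.0
for eps in (1e-2, 1e-4, 1e-8):
    x = np.array([eps, (1.0 - eps) / 2.0, (1.0 - eps) / 2.0])
    # the distance keeps growing as the first part approaches zero
    print(eps, aitchison_distance(barycentre, x))

z = multiplicative_replacement([0.0, 0.4, 0.6])
print(z, aitchison_distance(barycentre, z))
```

The pairwise form agrees with the Euclidean norm of the difference of clr coefficients, which gives a convenient check; the replacement keeps the ratio between the non-zero parts unchanged, so only log-ratios involving the replaced part depend on the choice of `delta`.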
Representation: the ternary diagram

The discussion now turns to the suitable representation for the simplex $(S, d_a)$. The argument put forward by supporters of different representations is that $R$ is a closed set, whereas $S$ is an open set. As has been seen before, this discussion is meaningless once $S$ is considered as a vector space in itself, which means that it is open and closed at the same time. Let attention be centred on the case $D = 3$. All the ideas developed here can be generalized easily to an arbitrary number of parts (although there is no suitable representation for the simplex for $D > 4$). It is widely known that a convenient way to represent a three-part composition is the ternary diagram. As can be seen in Figure 3, there is a one-to-one correspondence between a three-part composition and a point in the triangle. Note that all the parts $(x_1, x_2, x_3)$ are involved and appear in the diagram, so the representation is different from the one shown in Figure 2, where the set $S'$ is considered (and only two components are represented).
Fig. 3. Representation of a three-part composition, $x = (x_1, x_2, x_3)$, in the ternary diagram.
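The one-to-one correspondence between a three-part composition and a point of the triangle can be made concrete with a small Python sketch. It is an illustration under assumptions that are not in the paper: the triangle is taken as equilateral with unit side, the vertex for x1 is placed at the top and those for x2 and x3 at the bottom left and right, and the names `VERTICES`, `to_ternary` and `from_ternary` are hypothetical.

```python
import numpy as np

H = np.sqrt(3.0) / 2.0  # height of an equilateral triangle with unit side

# Cartesian positions of the three vertices (a layout assumption of this sketch).
VERTICES = np.array([[0.5, H],    # vertex x1 (top)
                     [0.0, 0.0],  # vertex x2 (bottom left)
                     [1.0, 0.0]]) # vertex x3 (bottom right)

def to_ternary(x):
    """Map a three-part composition to Cartesian coordinates in the triangle."""
    x = np.asarray(x, dtype=float)
    x = x / x.sum()       # closure: parts per one, per hundred or ppm all work
    return x @ VERTICES   # barycentric combination of the vertices

def from_ternary(p):
    """Recover the composition (x1, x2, x3), closed to 1, from a point (u, v)."""
    u, v = p
    x1 = v / H
    x3 = u - 0.5 * x1
    x2 = 1.0 - x1 - x3
    return np.array([x1, x2, x3])

p = to_ternary([20.0, 30.0, 50.0])   # data given in parts per hundred
print(p, from_ternary(p))
print(to_ternary([0.0, 0.25, 0.75])) # x1 = 0: the point lies on the side opposite x1
```

Because the map is an affine bijection between the unit-sum plane and the plane of the triangle, compositions with one zero part land on a side and those with two zero parts on a vertex, in agreement with the reading of the diagram given in the text.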
Observe that the borderlines of the ternary diagram are associated with scales of variation of the different parts involved in the composition. In fact, when a ternary diagram is read, it is understood that every side of the triangle represents a regular scale between 0 and $c$, where, for example, $c = 1$ if observations are in parts per one, $c = 100$ if they are in parts per hundred, or $c = 1\,000\,000$ if they are in ppm. A composition without zero parts is represented by a point inside the triangle (Fig. 3). A composition with exactly one component $x_i$ equal to zero is represented by a point on the side opposite to the vertex $x_i$. Each vertex represents a composition with exactly one component different from zero. Therefore, in this context the sides of the ternary diagram can be viewed as axes in a coordinate system instead of parts of the whole space.

This representation has been used for a long time. Only recently has the approach presented in the previous section, based on the Aitchison distance, turned out to be more suitable than the classical methods. This implies that only the set $S$ has to be considered, but the ternary diagram representation is still valid, as the sides of the triangle are reference axes. Furthermore, the points on the sides are at infinite distance $d_a$ from any point inside the triangle. Representing points at infinity is not a contradiction. For example, the stereographic projection is a well-known case where there is a correspondence between each point on the plane and a point on a sphere. In this case, the whole sphere is always represented, although the north pole represents the points at infinity.

Conclusions

The concepts of open simplex and closed simplex have been used commonly to refer to the sets $S$ and $R$ respectively. From a mathematical point of view these expressions are not correct: as subsets of the Euclidean space $\mathbb{R}^D$, only the terminology for the set $R$ is correct (it is a closed set), whereas $S$ is neither closed nor open. Nevertheless, the terminology is correct when it is applied to the sets $R'$ and $S'$, which are embedded in the $(D-1)$-dimensional Euclidean space but, in this case, only $D - 1$ components are taken into account. Finally, if one considers the simplex as a Euclidean vector space with the Aitchison distance $d_a$, only the subset $S$ makes sense. Then, the whole space is open and closed at the same time and the terminology of open and closed simplex should be ruled out. With respect to their representation, the authors propose the use of the classical ternary diagram, where the sides of the triangle represent the points at infinity of the simplex space $(S, d_a)$.

This research has been supported by the Dirección General de Enseñanza Superior (DGES) of the Spanish Ministry for Education and Culture through the projects BFM2000-0540 and BFM2003-05640/MATE.
References

AITCHISON, J. 1982. The statistical analysis of compositional data (with discussion). Journal of the Royal Statistical Society, Series B (Statistical Methodology), 44 (2), 139-177.
AITCHISON, J. 1986. The Statistical Analysis of Compositional Data. Monographs on Statistics and Applied Probability. Chapman & Hall Ltd, London. Reprinted (2003) with additional material by The Blackburn Press, Caldwell, NJ.
BILLHEIMER, D., GUTTORP, P. & FAGAN, W. 2001. Statistical interpretation of species composition. Journal of the American Statistical Association, 96 (456), 1205-1214.
EGOZCUE, J. J. & PAWLOWSKY-GLAHN, V. 2006. Simplicial geometry for compositional data. In: BUCCIANTI, A., MATEU-FIGUERAS, G. & PAWLOWSKY-GLAHN, V. (eds) Compositional Data Analysis in the Geosciences: From Theory to Practice. Geological Society, London, Special Publications, 264, 145-159.
MARTÍN-FERNÁNDEZ, J. A., BARCELÓ-VIDAL, C. & PAWLOWSKY-GLAHN, V. 2003. Dealing with zeros and missing values in compositional data sets using nonparametric imputation. Mathematical Geology, 35 (3), 253-278.
PAWLOWSKY-GLAHN, V. & EGOZCUE, J. J. 2001. Geometric approach to statistical analysis on the simplex. Stochastic Environmental Research and Risk Assessment (SERRA), 15 (5), 384-398.
RUDIN, W. 1976. Principles of Mathematical Analysis (3rd edn). International Series in Pure and Applied Mathematics. McGraw-Hill Book Co, NY.
SCHECHTER, E. 1997. Handbook of Analysis and its Foundations. Academic Press, San Diego.
Compositional Data Analysis in the Geosciences From Theory to Practice Edited by
A. Buccianti, G. Mateu-Figueras and V. Pawlowsky-Glahn
Since Karl Pearson wrote his paper on spurious correlation in 1897, a lot has been said about the statistical analysis of compositional data, mainly by geologists such as Felix Chayes. The solution appeared in the 1980s, when John Aitchison proposed to use logratios. Since then, the approach has seen a great expansion, mainly building on the idea of the 'natural geometry' of the sample space. Statistics is expected to give sense to our perception of the natural scale of the data, and this is made possible for compositional data using logratios. This publication will be a milestone in this process. This book will be of interest to geologists using statistical methods. It includes the intuitive justification of the methodology, convincing through case studies and presenting user-friendly software, which includes a section for those who need to see the proof of the mathematical consistency of the methods used.
Visit our online bookshop: http://www.geolsoc.org.uk/bookshop Geological Society web site: http://www.geolsoc.org.uk
ISBN -86239-205-6
Cover illustration: Volcan Licancabur (22°50'S 67°50'W, 5900 m) is a stratovolcano which lies on the border of Chile and Bolivia (the peak proper being located in Chile), 30 km east of the village of San Pedro de Atacama. The 70 x 90 m crater lake at the summit is believed to be the highest lake in the world, and despite air temperatures of -30°C it contains numerous living creatures. Photograph courtesy of Prof. Piermaria Luigi Rossi, University of Bologna (I). The equation represents the set of the d-dimensional simplex embedded in D-dimensional real space. When D = 3 the simplex is represented by a triangle. (Aitchison, 1986, The Statistical Analysis of Compositional Data, Chapman & Hall).