Advances in Sound Localization
Edited by Paweł Strumiłło
Published by InTech, Janeza Trdine 9, 51000 Rijeka, Croatia. Copyright © 2011 InTech. All chapters are Open Access articles distributed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 license, which permits users to copy, distribute, transmit, and adapt the work in any medium, so long as the original work is properly cited. After this work has been published by InTech, authors have the right to republish it, in whole or in part, in any publication of which they are the author, and to make other personal use of the work. Any republication, referencing or personal use of the work must explicitly identify the original source. Statements and opinions expressed in the chapters are those of the individual contributors and not necessarily those of the editors or publisher. No responsibility is accepted for the accuracy of information contained in the published articles. The publisher assumes no responsibility for any damage or injury to persons or property arising out of the use of any materials, instructions, methods or ideas contained in the book. Publishing Process Manager: Ivana Lorkovic. Technical Editor: Teodora Smiljanic. Cover Designer: Martina Sirotic. Image copyright 2010, used under license from Shutterstock.com. First published March, 2011. Printed in India. A free online edition of this book is available at www.intechopen.com. Additional hard copies can be obtained from
[email protected] Advances in Sound Localization, Edited by Paweł Strumiłło p. cm. ISBN 978-953-307-224-1
free online editions of InTech Books and Journals can be found at www.intechopen.com
Contents

Preface XI

Part 1 Signal Processing Techniques for Sound Localization 1

Chapter 1 The Linear Method for Acoustical Source Localization (Constant Speed Localization Method) - A Discussion of Receptor Geometries and Time Delay Accuracy for Robust Localization 3
Sergio R. Buenafuente and Carmelo M. Militello

Chapter 2 Direction-Selective Filters for Sound Localization 19
Dean Schmidlin

Chapter 3 Single-Channel Sound Source Localization Based on Discrimination of Acoustic Transfer Functions 39
Ryoichi Takashima, Tetsuya Takiguchi and Yasuo Ariki

Chapter 4 Localization Error: Accuracy and Precision of Auditory Localization 55
Tomasz Letowski and Szymon Letowski

Chapter 5 HRTF Sound Localization 79
Martin Rothbucher, David Kronmüller, Marko Durkovic, Tim Habigt and Klaus Diepold

Chapter 6 Effect of Space on Auditory Temporal Processing with a Single-Stimulus Method 95
Martin Roy, Tsuyoshi Kuroda and Simon Grondin

Part 2 Sound Localization Systems 105

Chapter 7 Sound Source Localization Method Using Region Selection 107
Yong-Eun Kim, Dong-Hyun Su, Chang-Ha Jeon, Jae-Kyung Lee, Kyung-Ju Cho and Jin-Gyun Chung
Chapter 8 Robust Audio Localization for Mobile Robots in Industrial Environments 117
Manuel Manzanares, Yolanda Bolea and Antoni Grau

Chapter 9 Source Localization for Dual Speech Enhancement Technology 141
Seungil Kim, Hyejeong Jeon and Lag-Young Kim

Chapter 10 Underwater Acoustic Source Localization and Sounds Classification in Distributed Measurement Networks 157
Octavian Adrian Postolache, José Miguel Pereira and Pedro Silva Girão

Chapter 11 Using Virtual Acoustic Space to Investigate Sound Localisation 179
Laura Hausmann and Hermann Wagner

Chapter 12 Sound Waves Generated Due to the Absorption of a Pulsed Electron Beam 199
A. Pushkarev, J. Isakova, G. Kholodnaya and R. Sazonov
Part 3 Auditory Interfaces for Enhancing Human Perceptive Abilities 223

Chapter 13 Spatial Audio Applied to Research with the Blind 225
Brian FG Katz and Lorenzo Picinali

Chapter 14 Sonification of 3D Scenes in an Electronic Travel Aid for the Blind 251
Michal Bujacz, Michal Pec, Piotr Skulimowski, Pawel Strumillo and Andrzej Materka

Chapter 15 Virtual Moving Sound Source Localization through Headphones 269
Larisa Dunai, Guillermo Peris-Fajarnés, Teresa Magal-Royo, Beatriz Defez and Victor Santiago Praderas

Chapter 16 Unilateral Versus Bilateral Hearing Aid Fittings 283
Monique Boymans and Wouter A. Dreschler

Chapter 17 Auditory Guided Arm and Whole Body Movements in Young Infants 297
Audrey L.H. van der Meer and F.R. (Ruud) van der Weel
Part 4 Spatial Sounds in Multimedia Systems and Teleconferencing 315

Chapter 18 Camera Pointing with Coordinate-Free Localization and Tracking 317
Evan Ettinger and Yoav Freund

Chapter 19 Sound Image Localization on Flat Display Panels 343
Gabriel Pablo Nava, Yoshinari Shirai, Kaji Katsuhiko, Masafumi Matsuda, Keiji Hirata and Shigemi Aoyagi

Chapter 20 Backward Compatible Spatialized Teleconferencing based on Squeezed Recordings 363
Christian H. Ritz, Muawiyath Shujau, Xiguang Zheng, Bin Cheng, Eva Cheng and Ian S Burnett
Part 5 Applications in Biomedical and Diagnostic Studies 385

Chapter 21 Neurophysiological Correlate of Binaural Auditory Filter Bandwidth and Localization Performance Studied by Auditory Evoked Fields 387
Yoshiharu Soeta and Seiji Nakagawa

Chapter 22 Processing of Binaural Information in Human Auditory Cortex 407
Blake W. Johnson

Chapter 23 The Impact of Stochastic and Deterministic Sounds on Visual, Tactile and Proprioceptive Modalities 431
J.E. Lugo, R. Doti and J. Faubert

Chapter 24 Discrete Damage Modelling for Computer Aided Acoustic Emissions in Health Monitoring 459
Antonio Rinaldi, Gualtiero Gusmano and Silvia Licoccia

Part 6 Sound Localization in Animal Studies 475

Chapter 25 Comparative Analysis of Spatial Hearing of Terrestrial, Semiaquatic and Aquatic Mammals 477
Elena Babushina and Mikhail Polyakov

Chapter 26 Directional Hearing in Fishes 493
Richard R. Fay

Chapter 27 Frequency Dependent Specialization for Processing Binaural Auditory Cues in Avian Sound Localization Circuits 513
Rei Yamada and Harunori Ohmori

Chapter 28 Highly Defined Whale Group Tracking by Passive Acoustic Stochastic Matched Filter 527
Frédéric Bénard, Hervé Glotin and Pascale Giraudet

Chapter 29 Localising Cetacean Sounds for the Real-Time Mitigation and Long-Term Acoustic Monitoring of Noise 545
Michel André, Ludwig Houégnigan, Mike van der Schaar, Eric Delory, Serge Zaugg, Antonio M. Sánchez and Alex Mas

Chapter 30 Sound Localisation in Practice: An Application in Localisation of Sick Animals in Commercial Piggeries 575
Vasileios Exadaktylos, Mitchell Silva, Sara Ferrari, Marcella Guarino and Daniel Berckmans
Preface

Awareness of one's environment is important in everyday life situations for humans and animals, and in various scientific and engineering applications. Living organisms can observe their surroundings using their senses, whereas man-made systems need to be equipped with appropriate sensors (e.g. image, acoustic or touch). Whatever the nature of the signal acquisition system, be it technical or biological, advanced processing of the sensory data is needed in order to derive localization information. Among the sources of physical modalities that can be localized from far distances are electromagnetic waves (which can propagate in a vacuum) and sound waves, which require some physical medium (air, water or a solid material) to propagate through. A consequence of the mechanical nature of sound propagation is the considerable dissipation of the carried energy and a high dependence of the propagation speed on the medium type (e.g. 340 m/s in air). Although different techniques need to be employed in locating electromagnetic and sound radiation sources, some of them are conceptually alike, e.g. the processes used in radar and echolocation (including animal echolocation).

Sound source localization (SSL) is defined predominantly as the determination of the direction of a sound source from a receiver, but it also includes the distance from it. The direction can be expressed by two polar angles: the azimuth angle (i.e. horizontal bearing) and the elevation angle (i.e. vertical bearing). Determination of a sound source's distance can be achieved through measurements of sound intensity and/or its spectrum; however, a priori knowledge is needed about the source's radiation characteristics.

SSL is a complex computational problem. Because of the wave nature of sound, propagation phenomena such as refraction, diffraction, diffusion, reflection, reverberation and interference occur. The wide spectrum of sound frequencies, ranging from infrasound (below 20 Hz) through acoustic sounds perceived by the human auditory system (nominally ~20 Hz to 20 kHz) to ultrasound (above 20 kHz), also introduces difficulties, as different spectral components have different penetration properties through the medium. Wide-band sound sources can be perceived differently (in terms of distance, direction and pitch) depending on the geometric characteristics of the sound propagation environment. Consequently, the development of robust sound localization techniques calls for different approaches, including multisensor schemes, null-steering beamforming and time-difference-of-arrival techniques.
SSL is an important research field that has attracted research efforts from many technical and biomedical sciences. Sound localization techniques can be vital in rescue missions, medicine (ultrasonography), seismology (oil and gas exploration), as well as in robotics, noise cancellation and the improvement of immersion in virtual reality systems. Remarkable sound localization capabilities are exhibited by humans and other living organisms, which use them for communication, spatial orientation and wayfinding, and also for locating prey or fleeing from predators.

Advances in Sound Localization is a collection of 30 contributions reporting up-to-date studies of different aspects of sound localization research, ranging from purely theoretical approaches to their implementation in specific applications. The contributions are organized in six major sections.

Part I provides a state-of-the-art exposition of a number of advanced concepts for SSL, starting from the mathematical background of sensor arrays and binaural techniques (including the Head-Related Transfer Functions - HRTFs) to conceptually appealing methods that employ direction-selective filters and discrimination of acoustic transfer functions to achieve single-channel sound source localization.

Part II reports systems that implement signal processing techniques and sensor setups for robust SSL in real-life environments. It is shown that source localization can find application in robotics (e.g. for aiding environment mapping) and underwater acoustics. Techniques are proposed for considerable reduction of the computing time required to run SSL algorithms. Also, approaches to the generation of virtual acoustic space for studying SSL abilities in humans and animals are described. Finally, it is demonstrated how SSL techniques can be applied for speech enhancement purposes.

In Part III, applications of SSL techniques are covered that are aimed at enhancing human perception abilities. Applications include aiding the blind in spatial orientation by means of auditory display systems and an investigation of how bilateral hearing fittings improve spatial hearing. The part is concluded by studies underlining the importance of auditory information for environmental awareness in infants.

Applications of SSL in multimedia and teleconferencing systems are addressed in Part IV. The concept of an automatic cameraman is reported, in which a pan-tilt-zoom camera is driven by an SSL system to point in the direction of a speaker. Another contribution deals with enriching video material projected onto large displays by spatialization of sounds using a novel loudspeaker setup. Finally, a technique employing a microphone array for spatial location of speakers in teleconferencing systems is described.

Part V is devoted to applications of SSL techniques in biomedical and diagnostic studies. The first two contributions in this section deal with studies of the human auditory cortex. The former attempts to identify the characteristics of the human binaural auditory filter by examining the activity of auditory evoked fields, whereas the latter explains how binaural information is processed in the auditory cortex by using electroencephalography (EEG) and magnetoencephalography (MEG). In another interesting study it is postulated that sound stimuli (stochastic or
deterministic) can facilitate the perception of stimuli by other sensory modalities. This observation could form the basis for treatments of Parkinson's and Alzheimer's diseases. The part is concluded by studies on the detection of structural damage in materials using acoustic emission techniques.

Finally, Part VI focuses on the intriguing field of SSL in animal studies. Two lines of research are reported. The first addresses how avian, terrestrial and aquatic animals excel in SSL through their extraordinary spatial hearing abilities. The second is devoted to techniques used in practical applications of SSL methods (e.g. matched filtering) for localizing animal groups or an individual animal within a group.

While preparing this preface I have become strongly convinced that this book offers a rich source of valuable material on up-to-date advances in sound source localization that should appeal to researchers representing diverse engineering and scientific disciplines.

March 2011
Paweł Strumiłło, Ph.D., D.Sc.
Technical University of Lodz, Poland
Part 1 Signal Processing Techniques for Sound Localization
1 The Linear Method for Acoustical Source Localization (Constant Speed Localization Method) - A Discussion of Receptor Geometries and Time Delay Accuracy for Robust Localization
Sergio R. Buenafuente and Carmelo M. Militello
University of La Laguna (ULL), Spain
1. Introduction

One of the most widely used methodologies for the passive localization of acoustic sources is based on measuring the time delay of arrival (TDOA) of the source signal at receptor pairs. In 2D, two pairs of receptors are necessary, implying the need for three receptors. In 3D, three pairs are needed, and a minimum of four receptors. The only data available to solve for the source spatial coordinates are the receptor positions and the best possible computation of the TDOA between receptor pairs. In a 2D problem, if we have two receptors and we compute a TDOA between them, it is a well-known fact that a source capable of producing that delay must lie on one of two symmetric hyperbolas, Figure 1. Because this is true for each pair, it becomes clear that the source must be located at the intersection of the hyperbolas of two different pairs. That is why this method is known as hyperbolic localization (HL for short). The resulting system of equations is nonlinear. In 3D the hyperbolas become hyperboloids, a third coordinate appears as an unknown, and one more pair of receptors is needed. This reasoning justifies the minimum number of receptors mentioned above. Of course, although the mathematical minimum is correct, in finite-precision computations the available pairs can provide a numerically inadequate set of equations. Providing more pairs, and receptors, than strictly necessary makes available an ample set of equations from which to choose adequate ones. Nevertheless, nonlinearity and equation redundancy are different issues that should not be confused.

For the sake of self-consistency, the equations of the HL problem are developed here. Let $s = \{x, y, z\}$ be the unknown spatial position of the source. For each receptor $m_i$ we have its position $\{x_i, y_i, z_i\}$ and the vector $r_i = s - m_i$ that points from the receptor to the source. Assuming spherical sound propagation, the following relationship is satisfied by each receptor pair:

$$r_i - r_j = d_{ij} = v\tau_{ij} \tag{1}$$

where $d_{ij}$, a signed quantity, is the difference between the distances of each receptor to the source, $v$ is the sound propagation speed in the medium and $\tau_{ij}$ is the TDOA computed from the receptor registers. The $\tau_{ij}$ are signed quantities too. Working over Equation 1, the following expression is obtained:

$$(x_i - x_j)x + (y_i - y_j)y + (z_i - z_j)z + d_{ij}r_j = \frac{m_i^2 - m_j^2 - d_{ij}^2}{2} \tag{2}$$

Fig. 1. A source positioned over the hyperbolas, irrespective of the distance, will produce the same TDOA absolute value. Which of the two hyperbolas is involved is determined by the TDOA sign.
The same equation can be written for the other two pairs. Assuming that the three pairs are constructed from three receptors, the resulting system of equations is:

$$\begin{aligned}
(x_i - x_j)x + (y_i - y_j)y + (z_i - z_j)z + d_{ij}r_j &= 0.5\,(m_i^2 - m_j^2 - d_{ij}^2) \\
(x_k - x_l)x + (y_k - y_l)y + (z_k - z_l)z + d_{kl}r_l &= 0.5\,(m_k^2 - m_l^2 - d_{kl}^2) \\
(x_i - x_k)x + (y_i - y_k)y + (z_i - z_k)z + d_{ik}r_k &= 0.5\,(m_i^2 - m_k^2 - d_{ik}^2)
\end{aligned} \tag{3}$$

where

$$r_q = \sqrt{(x_q - x)^2 + (y_q - y)^2 + (z_q - z)^2}, \qquad m_q = \sqrt{x_q^2 + y_q^2 + z_q^2}; \quad \text{for } q = j, k, l \tag{4}$$
Equations 3 constitute a nonlinear system of equations and can be solved, iteratively, by traditional numerical methods. In 1987 several authors, in closely sequenced papers, presented a different way to obtain Equation 3 (Abel & Smith, 1987; Friedlander, 1987; H.C. Schau & Robinson, 1987). First, they chose one of the receptors, for example receptor $j$, as a master receptor. This allows computing all the receptor-source distances as a function of the distance of the master receptor to the source. The values of $d_{ij}$ are computed from the $\tau_{ij}$ and the medium propagation speed:

$$d_{jl} = r_j - r_l \;\Longrightarrow\; r_l = r_j - d_{jl} \tag{5}$$
Second, receptor $m_j$ is renamed $m_0$ and $r_j$ becomes $r_0$, obtaining

$$(x_i - x_j)x + (y_i - y_j)y + (z_i - z_j)z + d_{ij}r_0 = \frac{m_i^2 - m_j^2 - d_{ij}^2}{2} + d_{ij}d_{0j} \tag{6}$$

where $r_0$ is now the distance between the master receptor and the source, the so-called range, computed as

$$r_0 = \sqrt{(x - x_0)^2 + (y - y_0)^2 + (z - z_0)^2} \tag{7}$$
In Equation 6, the unknowns are still $\{x, y, z\}$. One way to overcome the nonlinearity of the system was to introduce $r_0$ as a new unknown or parameter (Friedlander, 1987). The new unknown required the introduction of one more equation, expanding the original equation system. At that time nobody believed that the values of $r_0$ and $\{x, y, z\}$ obtained from the expanded system would satisfy Equation 7, and it seems that nobody checked it in the following 20 years either. Because of the clearly nonlinear nature of Equation 7, many authors developed ways to solve the new expanded system by iterative methods (Chan & Ho, 1994). The use of redundant pairs made it necessary to combine iterative methods with least squares procedures, increasing the difficulty. In 2000, (Huang et al., 2000) found that the redundant system can be solved correctly in only one iteration. It was not noticed that this can only happen if the system is linear, or if the initial guess in the nonlinear system is always coincident with the right solution.
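As a point of reference for the linear method introduced next, the classical HL system can be solved iteratively. The following is a minimal sketch, assuming NumPy and SciPy and an invented receptor geometry; it is not the chapter's implementation:

```python
# Hypothetical sketch: iterative solution of the nonlinear HL system,
# Eq. (3), via nonlinear least squares (geometry and source are made up).
import numpy as np
from scipy.optimize import least_squares

m = np.array([[0.0, 0.0, 0.0],          # four illustrative receptors (3D)
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
s_true = np.array([3.0, 4.0, 2.0])      # source position to recover

pairs = [(1, 0), (2, 0), (3, 0)]        # three receptor pairs
d = {(i, j): np.linalg.norm(s_true - m[i]) - np.linalg.norm(s_true - m[j])
     for i, j in pairs}                 # d_ij = v * tau_ij, exact here

def residuals(s):
    # Residual of r_i - r_j - d_ij for each pair, cf. Eq. (1)
    return [np.linalg.norm(s - m[i]) - np.linalg.norm(s - m[j]) - d[(i, j)]
            for i, j in pairs]

sol = least_squares(residuals, x0=np.ones(3))
print(sol.x)                            # converges to (3, 4, 2) here
```

As the text notes, convergence of such an iterative solver depends on the initial guess; the linear method below removes that dependence.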
2. The constant speed localization method, CSLM

Fig. 2. Straight front propagation

In 2007 the authors (Militello & Buenafuente, 2007) presented a new way of interpreting the source localization problem, from now on called CSLM (Constant Speed Localization Method). This made it possible to demonstrate that the problem can be transformed into a linear one by the mere addition of one receiver beyond the minimum required by the hyperbolic localization method. It was also shown that the work of Friedlander et al. and the methods derived from it are special cases of the general case presented, making the linearity of the method clear. To explain the CSLM, the receptors are considered to act as sources, each one emitting sound, but each one starts emitting in the inverse order in which they capture the sound from the source. In this way, all the emitted wave fronts will intersect at the source at the same time.
Consider two receptors at a distance $2c$ from each other that receive the signal with a time delay $t_a$. For a sound speed $v$, a spatial delay is defined as $2a = t_a v$. Now the two receptors start emitting with a time delay $t_a$. The two circles will intersect, and the successive intersections will describe a hyperbola. The hyperbola is symmetric with respect to the line joining the receptors, and one of its branches will contain the source. But if we join the successive intersection points with a straight line, as in Figure 2, a straight front can be identified. In (Militello & Buenafuente, 2007) it was proved that this front propagates with a constant speed $v_l = va/c$. Because of this constant-speed property of the straight front, the method is called Constant Speed Localization. Each receptor pair will produce one straight front propagating at a constant speed, and all the fronts will reach the source at the same time, i.e. all the constant-speed travelling straight lines will intersect at the source position. In this way, a linear system of equations having as unknowns the source coordinates and the time of arrival can be constructed. The unknowns are clearly independent, and there is neither a preferred coordinate system nor a preferred time origin. If one receptor position is taken as the coordinate origin, and the distance from this point to the source is called the range, the values of $vt$ appearing in the equations can be replaced by $r_0$ and Friedlander's equations are recovered. This is the only case where $R = vt = \sqrt{x^2 + y^2}$. A detailed development of CSLM for 2D and 3D problems in its general form is presented in (Militello & Buenafuente, 2007). Here, for the sake of comparison, the equations are developed following Friedlander's methodology, and the following particular form is obtained:

$$(x_i - x_j)x + (y_i - y_j)y + (z_i - z_j)z + d_{ij}\,vt = \frac{m_i^2 - m_j^2 - d_{ij}^2}{2} + d_{ij}d_{0j} \tag{8}$$
To reach (8), the time origin is established as the time when receptor $m_0$ starts emitting. In the original CSLM method the time origin is the time when the furthest receptor starts emitting. Because the problem is linear in time and space, a time or coordinate shift does not change the nature of the solution. Equations 6 and 8 are almost identical; the difference is that $r_0$ is replaced by $vt$. This replacement is consistent with the meaning of $r_0$ in Friedlander's formulation and the meaning of the independent variable $t$ in the CSLM formulation. Then $r_0$ is an independent variable because it can be obtained as the product of the independent variable $t$ and the sound speed in the medium. Now the linear nature of both methods and their equivalence has been established. Because a new independent variable appears, $r_0$ or $t$, one more equation is needed. The linear system can be solved by using a minimum of four sensors instead of three in a 2D problem, and five sensors instead of four in a 3D problem. But the use of the correct number of sensors does not preclude the appearance of numerical errors when solving the system. Something worth noting: in the CSLM method it is necessary to create a common time axis. This can only be done if the TDOAs are computed not only between the active receptor pairs but also between one receptor, say a master one, and one receptor of each active pair. This is totally equivalent to Friedlander's method, where all the receptor positions are computed as a function of the position of the master receptor. The computational workload involved in both methods is therefore the same.
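The constant-speed property of the straight front can be checked numerically. The sketch below assumes NumPy and an illustrative 2D geometry (sound speed, receptor half-spacing and source position are invented, not taken from the chapter); it intersects the two expanding circles and differences the intersection abscissae:

```python
# Sketch (illustrative numbers): the intersections of the two expanding
# circles advance along the receptor axis at the constant speed vl = v*a/c.
import numpy as np

v, c = 340.0, 0.5                       # sound speed; receptors at (-c,0),(c,0)
s = np.array([2.0, 3.0])                # illustrative source position
r1 = np.linalg.norm(s - np.array([-c, 0.0]))
r2 = np.linalg.norm(s - np.array([+c, 0.0]))
ta = (r1 - r2) / v                      # TDOA between the two receptors
a = v * ta / 2.0                        # spatial delay parameter

# Receptor 1 emits at t = 0, receptor 2 at t = ta (inverse capture order).
# Subtracting the two circle equations gives the intersection abscissa:
#   4*c*x = (v*t)**2 - (v*(t - ta))**2
for t in (0.010, 0.020, 0.030):
    x = ((v * t) ** 2 - (v * (t - ta)) ** 2) / (4.0 * c)
    print(t, x)                         # x grows linearly in t

print("slope dx/dt:", v * a / c)        # matches the differences above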
3. The design of the reception system

There are many variables and uncertainties in the design of a receptor system. To mention some of them, the following list is proposed.

Uncertainties:
1. The error in the TDOA estimates. This error depends on the ability to identify a specific perturbation introduced by the source in each sensor register and to assign a time to it, or on the ability to compute the TDOA for a receptor pair.
2. The geometrical position of the receptors. Nowadays receptors are small in size, and the pressure centre of a microphone can be determined with an error of the order of millimetres.

Design variables:

1. The spatial distribution of the receptors.
2. The receptors chosen to constitute active pairs.

As will be shown, the design variables are responsible for the system performance. They govern the way the effects of the uncertainties are amplified in some detection scenarios, and the quality of detection when the relative position of the source changes with respect to the detection system.

3.1 Selecting the active pairs and the master receptor (time origin)
This study focuses on the way the design variables affect the source localization through the inevitable TDOA uncertainties. The superscript $\circ$ is used to indicate the correct, or exact, values. These are affected by an uncertainty value so that $\tau_{ij} = \tau_{ij}^{\circ} \pm e_{ij}$. Replacing this in (8) and rearranging terms:

$$(x_i - x_j)x^{\circ} + (y_i - y_j)y^{\circ} + (z_i - z_j)z^{\circ} + v\tau_{ij}^{\circ}\,vt^{\circ} - 0.5\left(m_i^2 - m_j^2 - v^2(\tau_{ij}^{\circ})^2\right) = 0 \tag{9}$$

$$\pm v^2 e_{ij}t^{\circ} - 0.5\,e_{ij}^2 \pm v\tau_{ij}^{\circ}e_{ij} + v^2\left(\tau_{ij}^{\circ}\tau_{0j}^{\circ} \pm \tau_{ij}^{\circ}e_{0j} \pm \tau_{0j}^{\circ}e_{ij} \pm e_{ij}e_{0j}\right) = \epsilon_{ij} \tag{10}$$

Equation 9 recasts Equation 8. Equation 10 is an error term and can be seen as a contribution to the uncertainty of the left-hand side of the original equation system. Neglecting second-order terms and adding up the uncertainties, an upper bound can be computed:

$$\epsilon_{ij} = v^2\left[e_{ij}\left(t^{\circ} + \tau_{ij}^{\circ} + \tau_{0j}^{\circ}\right) + \tau_{ij}^{\circ}e_{0j}\right] \tag{11}$$

This upper bound can be reduced if all the active pairs include the master receptor, since in that case $\tau_{00}^{\circ} = 0$ and Equation 11 can be further simplified to:

$$\epsilon_{i0} = v\,e_{i0}\left(vt^{\circ} + d_{i0}\right) \tag{12}$$
From this equation many conclusions can be drawn about the amplification of the TDOA inaccuracies. The main factors are:

1. The speed of sound in the medium.
2. The distance from the source.
3. The TDOA uncertainty.

In other words, for a given medium, the farther the source, the higher the error. And, for a given set of receptors, the active pairs should be chosen so that one receptor appears in all the pairs and the distance between receptors is kept to a minimum.
4. Error propagation

Although the rules extracted in the preceding section seem logical, they are not conclusive. This is due to the fact that, in a linear problem, the quality of the solution depends on the conditioning of the system of equations. In 3D the number of unknowns is four, so four pairs are needed. The system of equations takes the form $Mx = b$, where

$$M = \begin{bmatrix} x_i - x_j & y_i - y_j & z_i - z_j & d_{ij} \\ x_k - x_l & y_k - y_l & z_k - z_l & d_{kl} \\ x_m - x_n & y_m - y_n & z_m - z_n & d_{mn} \\ x_p - x_q & y_p - y_q & z_p - z_q & d_{pq} \end{bmatrix} \tag{13}$$

$$x = \begin{bmatrix} x & y & z & vt \end{bmatrix}^T \tag{14}$$

$$b = \frac{1}{2}\begin{bmatrix} m_i^2 - m_j^2 - d_{ij}^2 + 2d_{ij}d_{0j} \\ m_k^2 - m_l^2 - d_{kl}^2 + 2d_{kl}d_{0l} \\ m_m^2 - m_n^2 - d_{mn}^2 + 2d_{mn}d_{0n} \\ m_p^2 - m_q^2 - d_{pq}^2 + 2d_{pq}d_{0q} \end{bmatrix} \tag{15}$$

and the solution is

$$x = M^{-1}b \tag{16}$$

provided that the inverse of $M$ exists. Note the use of eight different sensors, which is the most general way to construct the system; but, as one sensor can be part of many pairs, this number can be reduced to five. Because of the uncertainties pointed out before, the matrices $M$ and $b$ are perturbed. As before, only TDOA uncertainties are considered. The real equation system becomes

$$(M + \delta M)\hat{x} = (b + \delta b) \tag{17}$$

where $\hat{x}$ is an approximation to the exact solution,

$$\hat{x} = x^{\circ} + \delta x \tag{18}$$

Because the system is linear, perturbation theory can be applied in order to obtain a bound on the expected error in the system solution. The relative solution error satisfies:

$$\frac{\lVert\delta x\rVert}{\lVert x^{\circ}\rVert} \le \frac{\mathrm{cond}(M)}{1 - \mathrm{cond}(M)\dfrac{\lVert\delta M\rVert}{\lVert M\rVert}}\left(\frac{\lVert\delta M\rVert}{\lVert M\rVert} + \frac{\lVert\delta b\rVert}{\lVert b\rVert}\right) \tag{19}$$

where $\mathrm{cond}(M)$ is the matrix condition number, defined as

$$\mathrm{cond}(M) = \lVert M\rVert\,\lVert M^{-1}\rVert \ge 1 \tag{20}$$

and $\lVert\cdot\rVert$ is a matrix norm, usually the $l_2$ norm. In a badly conditioned system, $\mathrm{cond}(M)$ is much bigger than 1. If it is assumed that the perturbation matrices have a small norm and $\mathrm{cond}(M)$ is not a big number (Moon & Stirling, 2000), the relative error in the system solution can be approximated by

$$\frac{\lVert\delta x\rVert}{\lVert x^{\circ}\rVert} \le \mathrm{cond}(M)\left(\frac{\lVert\delta M\rVert}{\lVert M\rVert} + \frac{\lVert\delta b\rVert}{\lVert b\rVert}\right) + O(e^2) \tag{21}$$

where $e$ is the order of magnitude of the TDOA uncertainty. From Equation 21 it can be seen that the relative error in the system solution can be approximated as the sum of the relative error in the matrix plus the relative error in the independent term, amplified by the condition number. In order to clarify the effect of this equation on the results, two examples are presented below.
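Before turning to those examples, the construction of Equations 13-16 can be made concrete. The following is a minimal sketch, assuming NumPy and an invented receptor layout, in which every active pair shares the master receptor $m_0$ (so the $d_{0j}$ correction terms of Eq. (15) vanish):

```python
# Hedged sketch of Eqs. (13)-(16): build M and b from receptor positions
# and exact TDOAs, solve the linear CSLM system, inspect cond(M).
import numpy as np

v = 340.0
m = np.array([[0.0, 0.0, 0.0],      # m0, the master receptor
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])     # five receptors -> four pairs (m_i, m0)
s_true = np.array([4.0, 3.0, 2.0])  # illustrative source

r = np.linalg.norm(s_true - m, axis=1)
tau = (r - r[0]) / v                # exact TDOAs tau_i0
d = v * tau                         # d_i0 = v * tau_i0 (noise-free here)

rows, rhs = [], []
for i in range(1, 5):               # one row of Eq. (13) per active pair
    rows.append(np.r_[m[i] - m[0], d[i]])
    # with j = 0 the correction term 2*d_ij*d_0j of Eq. (15) is zero
    rhs.append(0.5 * (m[i] @ m[i] - m[0] @ m[0] - d[i] ** 2))

M, b = np.array(rows), np.array(rhs)
x = np.linalg.solve(M, b)           # x = [x, y, z, v*t], cf. Eq. (14)
print(x[:3])                        # recovers s_true
print(x[3], np.linalg.norm(s_true - m[0]))   # v*t equals the range r0
print(np.linalg.cond(M))            # amplification factor of Eq. (21)
```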
4.1 Directivity of a given sensor configuration

In this context the term "directivity" is defined as $1/\mathrm{cond}(M)$, having a maximum value of 1, and is used to indicate how a given sensor configuration amplifies the uncertainties for a source placed on a circle around the designated master receptor. Matrix $M$ has three columns that can be evaluated from the receptor coordinates, but the fourth one depends on the relative positions of the source and the receptor pairs, i.e. on the TDOAs. Matrix $M$ can easily be constructed for any expected source position and its condition evaluated. Following Equation 21, the value $1/\mathrm{cond}(M)$ can be seen as a directivity property: a high value in a given direction indicates that direction as a preferred one, with small uncertainty amplification.

Simulation A

Fig. 3. Simulation A. (a) A starting receptor configuration and range computation with CSLM. (b) Matrix M condition, showing the lobes responsible for error amplification. (c) Receptor array directivity; minimum directivity lies in the direction of maximum error propagation.

A set of receptors is positioned at $m_0\{0, 0\}$, $m_1\{-5, 8\}$, $m_2\{4, 6\}$ and $m_3\{-2, 4\}$. The receptor pairs are $\{m_0, m_1\}$, $\{m_0, m_3\}$ and $\{m_0, m_2\}$. It must be noticed that receptors $m_0$, $m_1$ and $m_3$ seem to lie on a straight line at $120^{\circ}$ from the X axis, but they do not; if they were on the same line, the system would be singular and could not be inverted. A circle of radius 40 m centred at $m_0$ is drawn, with 1000 sources uniformly distributed over it. For each source, exact (within machine precision) quantities are computed. The exact TDOAs are computed and perturbed with a random Gaussian error distribution, with the error standard deviation set to 10 µs. The values of $vt$ computed for each source are plotted in Figure 3(a). Figure 3(b) plots the computed matrix condition and clearly shows the coincidence of large condition values with high source localization error; an amplification factor of 800 can be seen at $300^{\circ}$. Figure 3(c) is the directivity, showing large values in the directions where the computed error will be low. From the travelling-straight-front point of view, a wrong selection of receptor pairs will produce almost parallel lines, making it difficult to compute their intersection. Why does the $120^{\circ}$ direction produce less dispersion than the $300^{\circ}$ one? This will be explained later.

Simulation B

A robust configuration is defined as one without pronounced directivity lobes. From this point of view, the best configuration is one with no lobes and a directivity value near 1. In order to achieve this, the receptors are placed at the vertices of an equilateral triangle and the master receptor is placed at the triangle's centre of gravity, Figure 4. The triangle side is $4\sqrt{3}$ m.
Fig. 4. Simulation B. A centred triangle configuration. (a) Computed range with CSLM. (b) Matrix M condition. (c) Directivity.

The TDOA uncertainties are computed in exactly the same manner as in Simulation A. It can be seen that three lobes appear, with a very uniform shape, and the directivity is uniform too. It should be noticed that a directivity value better than 0.02 is not achieved for this configuration. Simulation B shows how, at the same computational and hardware cost, a better system can be constructed. The matrix condition number increases as the distance to the source increases, and the ideal value of 1 is hard to reach. For the triangular configuration of Simulation B, a condition number of 1.4 is obtained for a source placed at the triangle centre, on top of the master receptor.
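A Monte Carlo experiment in the spirit of Simulation A can be sketched as follows, assuming NumPy (in 2D the unknowns are $x$, $y$ and $vt$, so three pairs from four receptors suffice). This is an illustrative re-creation, not the authors' code:

```python
# Perturb exact TDOAs with Gaussian noise (sigma = 10 us) for sources on a
# 40 m circle and relate localization error to the directivity 1/cond(M).
import numpy as np

rng = np.random.default_rng(0)
v, sigma = 340.0, 10e-6
m = np.array([[0.0, 0.0], [-5.0, 8.0], [4.0, 6.0], [-2.0, 4.0]])  # Sim. A

def localize(src):
    r = np.linalg.norm(src - m, axis=1)
    tau = (r - r[0]) / v + rng.normal(0.0, sigma, size=r.shape)
    d = v * tau
    M = np.c_[m[1:] - m[0], d[1:]]                  # 2D analogue of Eq. (13)
    b = 0.5 * ((m[1:] ** 2).sum(1) - (m[0] ** 2).sum() - d[1:] ** 2)
    return np.linalg.solve(M, b), np.linalg.cond(M)

for deg in (120.0, 300.0):
    src = 40.0 * np.array([np.cos(np.radians(deg)), np.sin(np.radians(deg))])
    err = [np.linalg.norm(localize(src)[0][:2] - src) for _ in range(1000)]
    print(deg, np.mean(err), 1.0 / localize(src)[1])
```

Directions with small $1/\mathrm{cond}(M)$ show the large error dispersion reported for the $300^{\circ}$ lobe.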
5. An upper bound for the solution error

When designing a reception system, the effect of the TDOA error on system performance is crucial: all the electronic and computational effort spent reducing this uncertainty has a direct impact on localization. Equation 21 provides an easy way to predict the value of uncertainty necessary for a desired performance. Assuming no error in the receptor positions, the perturbation matrix can be written as

$$\delta M = \begin{bmatrix} 0 & 0 & 0 & e_{ij}v \\ 0 & 0 & 0 & e_{kl}v \\ 0 & 0 & 0 & e_{mn}v \\ 0 & 0 & 0 & e_{pq}v \end{bmatrix} \tag{22}$$

where $e_{ij}$ is the error in computing the TDOA for each receptor pair. The maximum value of $e_{ij}$ is set to $e_{max}$. The $l_1$ norm is computed for this matrix, giving a bound for the perturbation matrix:

$$\lVert\delta M\rVert < n\,v\,e_{max} \tag{23}$$

In Equation 23, $n$ is the number of receptor pairs. To compute an upper bound on $\delta b$, it must be recalled that $d_{ij} = d_{ij}^{\circ} + ve_{ij}$. The perturbed $b$ can be written as:

$$\delta b = -\frac{v^2}{2}\begin{bmatrix} e_{ij}^2 + 2\tau_{ij}e_{ij} \\ e_{kl}^2 + 2\tau_{kl}e_{kl} \\ e_{mn}^2 + 2\tau_{mn}e_{mn} \\ e_{pq}^2 + 2\tau_{pq}e_{pq} \end{bmatrix} \tag{24}$$
Now, if $e_{ij}$ is neglected with respect to $\tau_{ij}$ (recall that $\tau_{ij}$ is the TDOA and $e_{ij}$ the error in computing it; it is assumed that $e_{ij} \ll \tau_{ij}$), ...

2 Direction-Selective Filters for Sound Localization
Dean Schmidlin

$$a \ge \frac{\omega_{max}}{\omega_{min}} \tag{34}$$
2.3 Directivity index of prototype filter

In a receiving aperture, directivity serves to reject noise and other interference arriving from directions other than the look direction. The directive effect of a spatial filter is summarized in a single number called the directivity, which is computed from (Ziomek, 1995)

$$D = \frac{P(\omega : 0)}{\dfrac{1}{4\pi}\displaystyle\int_0^{2\pi}\!\!\int_0^{\pi} P(\omega : \psi)\sin\psi\,d\psi\,d\zeta} \tag{35}$$

where $P(\omega : \psi)$ is the filter's beam power pattern, which for $\gamma = 0$ is given by

$$P(\omega : \psi) = \left|B(\omega : \psi)\right|^2 = \frac{(a - 1)^2}{\omega^2(a - \cos\psi)^2} \tag{36}$$

Equation (35) can be simplified to
$$D = \frac{2P(\omega : 0)}{\displaystyle\int_{-1}^{1} P(\omega : x)\,dx} \tag{37}$$

where $x = \cos\psi$. The substitution of Eq. (36) into Eq. (37) results in

$$D = \frac{2}{\displaystyle\int_{-1}^{1}\frac{(a - 1)^2}{(a - x)^2}\,dx} = \frac{a + 1}{a - 1} \tag{38}$$

Equation (38) represents the directivity of the first-order prototype filter. The directivity index is defined as

$$DI \triangleq 10\log_{10} D \ \text{dB} \tag{39}$$
Equation (34) gives a constraint on the parameter $a$. Let $\omega_1 \le \omega_{min}$ and $\omega_2 \ge \omega_{max}$ denote the lower and upper cutoff frequencies of the temporal bandpass filter that is to filter out the undesirable frequency component in Eq. (33), and let $a = \omega_2/\omega_1$. The lower and upper cutoff frequencies are related to the center frequency $\omega_0$ and the quality factor $Q$ by

$$\omega_1 = \omega_0\left(\sqrt{1 + \frac{1}{4Q^2}} - \frac{1}{2Q}\right), \qquad \omega_2 = \omega_0\left(\sqrt{1 + \frac{1}{4Q^2}} + \frac{1}{2Q}\right) \tag{40}$$

From Eq. (40) one may write

$$\frac{\omega_2 + \omega_1}{\omega_2 - \omega_1} = \frac{a + 1}{a - 1} = \sqrt{1 + 4Q^2} \tag{41}$$

From Eqs. (38) and (39) the directivity index becomes

$$DI = 10\log_{10}\sqrt{1 + 4Q^2} \tag{42}$$

For $Q \gg 1/2$ the DI may be approximated as

$$DI \approx 3 + 10\log_{10} Q \ \text{dB} \tag{43}$$

If the input plane wave function fits within the pass band of the temporal filter, then the directivity index is given by Eq. (43). For $Q = 10$, the directivity index is 13 dB. It was noted in Section 2.1 that the maximum directivity index for a vector sensor is 6.02 dB. Using Eq. (41) to solve for $a$ yields

$$a = \frac{\sqrt{1 + 4Q^2} + 1}{\sqrt{1 + 4Q^2} - 1} \tag{44}$$
When the quality factor is 10, the parameter $a$ of the prototype filter is 1.105. The discriminating function of the filter is given by Eq. (30); the function has a value of 1 at $\psi = 0$. The beamwidth of the prototype filter is obtained by equating Eq. (30) to $1/\sqrt{2}$, solving for $\psi$, and multiplying by 2. The result is

$$BW = 2\psi_{3\,\mathrm{dB}} = 2\cos^{-1}\left[a\left(1 - \sqrt{2}\right) + \sqrt{2}\right] \tag{45}$$

For the case $a = 1.105$, the beamwidth is 33.9°. This is in sharp contrast to the beamwidth of the maximum-DI vector sensor, which is 104.9°. Figure 1 gives a plot of the discriminating function as a function of the angle $\psi$. Note that the discriminating function is a monotonic function of $\psi$; this is not true for the discriminating functions of directional acoustic sensors (Schmidlin, 2007).
Fig. 1. Discriminating function for a = 1.105.
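The chain from quality factor to filter parameter, directivity index and beamwidth (Eqs. (42), (44) and (45)) is short enough to verify directly; a sketch assuming NumPy:

```python
# Reproduce the Section 2.3 numbers from a chosen quality factor Q.
import numpy as np

Q = 10.0
root = np.sqrt(1.0 + 4.0 * Q ** 2)
a = (root + 1.0) / (root - 1.0)          # Eq. (44)
DI = 10.0 * np.log10(root)               # Eq. (42)
BW = 2.0 * np.degrees(np.arccos(a * (1.0 - np.sqrt(2.0)) + np.sqrt(2.0)))
print(a, DI, BW)                         # ~1.105, ~13.0 dB, ~33.9 degrees
```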
3. Direction-Selective filters with rational discriminating functions

3.1 Interconnection of prototype filters

The first-order prototype filter can be used as a fundamental building block for generating filters that have discriminating functions which are rational functions of $\cos\psi$. As an example, consider a discriminating function that is a proper rational function and whose denominator polynomial has roots that are real and distinct. Such a discriminating function may be expressed as

$$g_{u_L}(\psi) = \frac{\displaystyle\sum_{j=0}^{\mu} d_j\cos^j\psi}{\displaystyle\sum_{j=0}^{\nu} c_j\cos^j\psi} = K\frac{\displaystyle\prod_{j=1}^{\mu}\left(b_j - \cos\psi\right)}{\displaystyle\prod_{j=1}^{\nu}\left(a_j - \cos\psi\right)} \tag{46}$$

where $c_{\nu} = 1$ and $\mu < \nu$. The discriminating function of Eq. (46) can be expanded in the partial fraction expansion

$$g_{u_L}(\psi) = \sum_{i=1}^{\nu}\frac{K_i}{a_i - \cos\psi} \tag{47}$$
The function specified by Eq. (47) may be realized by a parallel interconnection of $\nu$ prototype filters (with $\gamma = 0$). Each component of the above expansion has the form of Eq. (30). Normalizing the discriminating function such that it has a value of 1 at $\psi = 0$ yields

$$\sum_{i=1}^{\nu}\frac{K_i}{a_i - 1} = 1 \tag{48}$$

Similar to Eq. (36), the beam power pattern of the composite filter is given by

$$P(\omega : \psi) = \frac{\left|g_{u_L}(\psi)\right|^2}{\omega^2} \tag{49}$$

Equations (47) and (49), together with Eq. (35), lead to the following expression for the directivity:

$$D^{-1} = \sum_{i=1}^{\nu}\sum_{j=1}^{\nu} K_i K_j g_{ij} \tag{50}$$

where

$$g_{ii} = \frac{1}{a_i^2 - 1} \tag{51}$$

$$g_{ij} = \frac{1}{a_i - a_j}\coth^{-1}\!\left(\frac{a_i a_j - 1}{a_i - a_j}\right), \quad i \ne j \tag{52}$$

For a given set of $a_i$ values, the directivity can be maximized by minimizing the quadratic form given by Eq. (50) subject to the linear constraint specified by Eq. (48). To solve this optimization problem, it is useful to represent it in matrix form, namely,

$$\text{minimize } D^{-1} = \mathbf{K}'\mathbf{G}\mathbf{K} \ \text{ subject to } \ \mathbf{U}'\mathbf{K} = 1 \tag{53}$$

where
$$\mathbf{K}' = \begin{bmatrix} K_1 & K_2 & \cdots & K_{\nu} \end{bmatrix} \tag{54}$$

$$\mathbf{U}' = \begin{bmatrix} \dfrac{1}{a_1 - 1} & \dfrac{1}{a_2 - 1} & \cdots & \dfrac{1}{a_{\nu} - 1} \end{bmatrix} \tag{55}$$

and $\mathbf{G}$ is the matrix containing the elements $g_{ij}$. Utilizing the method of Lagrange multipliers, the solution for $\mathbf{K}$ is given by

$$\mathbf{K} = \frac{\mathbf{G}^{-1}\mathbf{U}}{\mathbf{U}'\mathbf{G}^{-1}\mathbf{U}} \tag{56}$$

The minimum of $D^{-1}$ has the value

$$D^{-1} = \left(\mathbf{U}'\mathbf{G}^{-1}\mathbf{U}\right)^{-1} \tag{57}$$

The maximum value of the directivity index is

$$DI_{max} = -10\log_{10}\left[\left(\mathbf{U}'\mathbf{G}^{-1}\mathbf{U}\right)^{-1}\right] \tag{58}$$
3.2 An example: a second-degree rational discriminating function

As an example of applying the contents of the previous section, consider the proper rational function of the second degree,

$$g_{u_L}(\psi) = \frac{d_0 + d_1\cos\psi}{c_0 + c_1\cos\psi + \cos^2\psi} = \frac{K_1}{a_1 - \cos\psi} + \frac{K_2}{a_2 - \cos\psi} \tag{59}$$

where $a_2 > a_1$ and

$$d_0 = a_2 K_1 + a_1 K_2, \quad d_1 = -K_1 - K_2, \quad c_0 = a_1 a_2, \quad c_1 = -a_1 - a_2 \tag{60}$$

In the example presented in Section 2.3, the parameter $a$ had the value 1.105. In this example let $a_1 = 1.105$ and $a_2 = 1.200$. The values of the matrices $\mathbf{G}$ and $\mathbf{U}$ are given by

$$\mathbf{G} = \begin{bmatrix} 4.5244 & 3.1590 \\ 3.1590 & 2.2727 \end{bmatrix} \tag{61}$$

$$\mathbf{U} = \begin{bmatrix} 9.5238 \\ 5.0000 \end{bmatrix} \tag{62}$$

If Eqs. (56) and (58) are used to compute $\mathbf{K}$ and $DI_{max}$, the result is

$$\mathbf{K} = \begin{bmatrix} 0.3181 \\ -0.4058 \end{bmatrix} \tag{63}$$

$$DI_{max} = 17.8289 \ \text{dB} \tag{64}$$

From Eqs. (60), one obtains

$$d_0 = -0.0668, \quad d_1 = 0.0878, \quad c_0 = 1.3260, \quad c_1 = -2.3050 \tag{65}$$
Figure 2 illustrates the discriminating function specified by Eqs. (59) and (65). Also shown for comparison (as a dashed line) is the discriminating function of Fig. 1. The dashed-line plot represents a discriminating function that is a rational function of degree one, whereas the solid-line plot corresponds to a discriminating function that is a rational function of degree two. The latter function decays more quickly, having a 3-dB down beamwidth of 22.6° as compared to 33.9° for the former.
Fig. 2. Plots of the discriminating functions of the examples presented in Sections 2.3 and 3.2.

In order to see what directivity index is achievable with a second-degree discriminating function, it is useful to consider the second-degree discriminating function of Eq. (59) with equal roots in the denominator, that is, $c_0 = a^2$, $c_1 = -2a$. It is shown in a technical report by the author (2010c) that the maximum directivity for this discriminating function is equal to

$$D_{max} = 4\,\frac{a + 1}{a - 1} \tag{66}$$

and is achieved when $d_0$ and $d_1$ have the values

$$d_0 = \frac{a - 1}{4}\left(a - 3\right) \tag{67}$$

$$d_1 = \frac{a - 1}{4}\left(3a - 1\right) \tag{68}$$

Note that the directivity given by Eq. (66) is four times the directivity given by Eq. (38). Analogous to Eqs. (42) and (43), the maximum directivity index can be expressed as

$$DI_{max} = 6 + 10\log_{10}\sqrt{1 + 4Q^2} \ \text{dB} \approx 9 + 10\log_{10} Q \ \text{dB} \tag{69}$$
For $a_1 = 1.105$, $Q = 10$ and the maximum directivity index is 19 dB, which is a 6 dB improvement over that of the first-degree discriminating function of Eq. (30). In the example presented in this section, $a_1 = 1.105$, $a_2 = 1.200$ and $DI_{max} = 17.8$ dB. As $a_2$ moves closer to $a_1$, the maximum directivity index moves closer to 19 dB. For a specified $a_1$, Eq. (69) represents an upper bound on the maximum directivity index, the bound being approached more closely as $a_2$ approaches $a_1$.

3.3 Design of discriminating functions from the magnitude response of digital filters

In designing and implementing transfer functions of IIR digital filters, advantage has been taken of the wealth of knowledge and practical experience accumulated in the design and implementation of the transfer functions of analog filters. Continuous-time transfer functions are, by means of the bilinear or impulse-invariant transformations, transformed into equivalent discrete-time transfer functions. The goal of this section is to do a similar thing by generating discriminating functions from the magnitude response of digital filters. As a starting point, consider the following frequency response:

$$H\!\left(e^{j\omega}\right) = \frac{1 - \rho}{1 - \rho e^{-j\omega}} \tag{70}$$
where $\rho$ is real, positive and less than 1. Equation (70) corresponds to a causal, stable discrete-time system. The digital frequency $\omega$ is not to be confused with the analog frequency $\omega$ appearing in previous sections. The magnitude-squared response of this system is obtained from Eq. (70) as

$$\left|H\!\left(e^{j\omega}\right)\right|^2 = \frac{1 - 2\rho + \rho^2}{1 - 2\rho\cos\omega + \rho^2} \tag{71}$$

Letting $\rho = e^{-\sigma}$ allows one to recast Eq. (71) into the simpler form

$$\left|H\!\left(e^{j\omega}\right)\right|^2 = \frac{\cosh\sigma - 1}{\cosh\sigma - \cos\omega} \tag{72}$$
If the variable $\omega$ is replaced by $\psi$, the resulting function looks like the discriminating function of Eq. (30) with $a = \cosh\sigma$. This suggests a means for generating discriminating functions from the magnitude response of digital filters: express the magnitude-squared response of the filter in terms of $\cos\omega$ and define

$$g_{u_L}(\psi) \triangleq \left|H\!\left(e^{j\psi}\right)\right|^2 \tag{73}$$

To illustrate the process, consider a lowpass Butterworth filter of order 2, which has the magnitude-squared function

$$\left|H\!\left(e^{j\omega}\right)\right|^2 = \frac{1}{1 + \left[\dfrac{\tan(\omega/2)}{\tan(\omega_c/2)}\right]^4} \tag{74}$$

where $\omega_c$ is the cutoff frequency of the filter. Utilizing the relationship

$$\tan^2\!\left(\frac{A}{2}\right) = \frac{1 - \cos A}{1 + \cos A} \tag{75}$$

one can express Eq. (74) as

$$\left|H\!\left(e^{j\omega}\right)\right|^2 = \frac{\alpha\left(1 + \cos\omega\right)^2}{\alpha\left(1 + \cos\omega\right)^2 + \left(1 - \cos\omega\right)^2} \tag{76}$$

where

$$\alpha = \tan^4\!\left(\frac{\omega_c}{2}\right) = \frac{\left(1 - \cos\omega_c\right)^2}{\left(1 + \cos\omega_c\right)^2} \tag{77}$$

The substitution of Eq. (77) into Eq. (76) and simplifying yields the final result

$$\left|H\!\left(e^{j\omega}\right)\right|^2 = \frac{1 - \cos\theta}{2}\cdot\frac{1 + 2\cos\omega + \cos^2\omega}{1 - 2\cos\theta\cos\omega + \cos^2\omega} \tag{78}$$

where

$$\cos\theta = \frac{2\cos\omega_c}{1 + \cos^2\omega_c} \tag{79}$$

By replacing $\omega$ by $\psi$ in Eq. (78), one obtains the discriminating function

$$g_{u_L}(\psi) = \frac{1 - \cos\theta}{2}\cdot\frac{1 + 2\cos\psi + \cos^2\psi}{1 - 2\cos\theta\cos\psi + \cos^2\psi} \tag{80}$$
where $\omega_c$ is replaced by $\psi_c$ in Eq. (79). A plot of Eq. (80) is shown in Fig. 3 for $\psi_c = 10°$. From the figure it is observed that $\psi_c = 10°$ is the 6-dB down angle, because the discriminating function is equal to the magnitude-squared function of the Butterworth filter. The discriminating function of Fig. 3 can be said to be providing a "maximally-flat beam" of order 2 in the look direction $u_L$. Equation (80) cannot be realized by a parallel interconnection of first-order prototype filters because the roots of the denominator of Eq. (80) are complex. Its realization requires the development of a second-order prototype filter, which is the focus of current research.
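Equations (79) and (80) are easy to exercise numerically; a sketch assuming NumPy, checking the unit response in the look direction and the 6-dB point at $\psi_c$:

```python
# The "maximally-flat beam" discriminating function of Eqs. (79)-(80).
import numpy as np

psi_c = np.radians(10.0)
cos_t = 2.0 * np.cos(psi_c) / (1.0 + np.cos(psi_c) ** 2)   # Eq. (79)

def g(psi):                                                # Eq. (80)
    c = np.cos(psi)
    return 0.5 * (1.0 - cos_t) * (1.0 + 2.0 * c + c * c) \
               / (1.0 - 2.0 * cos_t * c + c * c)

print(g(0.0))                     # 1 in the look direction
print(20.0 * np.log10(g(psi_c)))  # -6.02 dB in beam power (g is |H|^2)
```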
Fig. 3. Discriminating function of Eq. (80).

4. Summary and future research

4.1 Summary

The objective of this paper is to improve the directivity index, the beamwidth, and the flexibility of spatial filters by introducing spatial filters having rational discriminating functions. A first-order prototype filter has been presented which has a rational discriminating function of degree one. By interconnecting prototype filters in parallel, a rational discriminating function can be created which has real, distinct, simple poles. As brought out by Eq. (33), a negative aspect of the prototype filter is the appearance at the output of a spurious frequency whose value is equal to the input frequency divided by the parameter $a$ of the filter, where $a > 1$. Since the directivity of the filter is inversely proportional to $a - 1$, there exists, as $a$ approaches 1, a tension between an arbitrarily increasing directivity $D$ and destructive interference between the real and spurious frequencies. The problem was alleviated by placing a temporal bandpass filter at the output of the prototype filter and assigning $a$ the value equal to the ratio of the upper to the lower cutoff frequency of the bandpass filter. This resulted in the dependence of the directivity index $DI$ on the value of the bandpass filter's quality factor $Q$, as indicated by Eqs. (42) and (43). Consequently, for the prototype filter to be useful, the input plane wave function must be a bandpass signal which fits within the pass band of the temporal bandpass filter. It was noted in Section 2.3 that for $Q = 10$ the directivity index is 13 dB and the beamwidth is 33.9°.

Directional acoustic sensors as they exist today have discriminating functions that are polynomials; their processors do not have the spurious frequency problem. The vector sensor has a maximum directivity index of 6.02 dB, and the associated beamwidth is 104.9°. According to Eq. (42), the prototype filter has a DI of 6.02 dB when $Q = 1.94$; the corresponding beamwidth is 87.3°. Section 3.2 demonstrated that the directivity index and the beamwidth can be improved by adding an additional pole. Figure 4 illustrates the directivity index and the beamwidth for the case of two equal roots, or poles, in the denominator of the discriminating function. As a means of comparison, it is instructive to consider the dyadic sensor, which has a polynomial of the second degree as its discriminating function. That sensor's maximum directivity index is 9.54 dB, and the associated beamwidth is 65°. The directivity index in Fig. 4 varies from 9.5 dB at $Q = 1$ to 19.0 dB at $Q = 10$; the beamwidth varies from 63.2° at $Q = 1$ to 19.7° at $Q = 10$. The directivity index and beamwidth of the two-equal-poles discriminating function at $Q = 1$ are essentially the same as those of the dyadic sensor, but as the quality factor increases, the directivity index goes up while the beamwidth goes down. It is important to note that the curves in Fig. 4 are theoretical curves; in any practical implementation, one may be required to operate at the lower end of each curve. However, the performance will still be an improvement over that of a dyadic sensor. The two-equal-poles case cannot be realized exactly by first-order prototype filters, but the implementation presented in Section 3.2 comes arbitrarily close.

Finally, in Section 3.3 it was shown that discriminating functions can be derived from the magnitude-squared response of digital filters. This allows a great deal of flexibility in the design of discriminating functions. For example, Section 3.3 used the magnitude response of a second-order Butterworth digital filter to generate a discriminating function that provides a "maximally-flat beam" centered in the look direction, whose beamwidth is controlled directly by a single parameter.

4.2 Future research

Many rational discriminating functions, specifically those with complex-valued poles and multiple-order poles, cannot be realized as parallel interconnections of first-order prototype filters. Examples of such discriminating functions appear in Figs. 2 and 3. Research is underway involving the development of a second-order temporal-spatial filter having the prototypical beampattern
$$B(\omega : \psi) = \frac{g_{u_L}(\psi)}{(j\omega)^2} \tag{81}$$

where the prototypical discriminating function $g_{u_L}(\psi)$ has the form

$$g_{u_L}(\psi) = \frac{d_0 + d_1\cos\psi}{1 + c_1\cos\psi + c_2\cos^2\psi} \tag{82}$$
Fig. 4. DI and beamwidth as a function of Q.

With the second-order prototype in place, the discriminating function of Eq. (80), as an example, can be realized by expressing it as a partial fraction expansion and connecting two prototypal filters in parallel: for the first, $d_0 = (1 - \cos\theta)/2$ and $d_1 = c_1 = c_2 = 0$, and for the second, $d_0 = 0$, $d_1 = \sin^2\theta$, $c_1 = -2\cos\theta$, $c_2 = 1$. Though the development of a second-order prototype is critical for the implementation of more general rational discriminating functions than that of the first-order prototype, additional research is also necessary for the first-order prototype. In Section 2.2 the number of spatial dimensions was reduced from three to one by restricting pressure measurements to a radial line extending from the origin in the direction defined by the unit vector $u_L$. This allowed processing of the plane-wave pressure function by a temporal-spatial filter describable by a linear first-order partial differential equation in two variables (Eq. (21)). The radial line (when finite in length) represents a linear aperture or antenna. In many instances, the linear aperture is replaced by a linear array of pressure sensors. This necessitates the numerical integration of the partial differential equation in order to come up with the output of the associated filter. Numerical integration techniques for PDEs generally fall into two categories: finite-difference methods (LeVeque, 2007) and finite-element methods (Johnson, 2009). If $q$ prototypal filters are connected in parallel, the associated set of partial differential equations forms a set of $q$ symmetric hyperbolic systems (Bilbao, 2004). Such systems can be numerically integrated using principles of multidimensional wave digital filters (Fettweis and Nitsche, 1991a, 1991b). The resulting algorithms inherit all the good properties known to hold for wave digital filters,
specifically the full range of robustness properties typical for these filters (Fettweis, 1990). Of special interest in the filter implementation process is the length of the aperture; the goal is to achieve a particular directivity index and beamwidth with the smallest possible aperture length. Another important area for future research is studying the effect of noise (both ambient and system noise) on the filtering process. The fact that the prototypal filter tends to act as an integrator should help soften the effect of uncorrelated input noise to the filter. Finally, upcoming research will also include the array gain (Burdic, 1991) of the filter prototype for the case of anisotropic noise (Buckingham, 1979a, b; Cox, 1973); this paper considered the directivity index, which is the array gain for the case of isotropic noise.
5. References

Bienvenu, G. & Kopp, L. (1980). Adaptivity to background noise spatial coherence for high resolution passive methods, Int. Conf. on Acoust., Speech and Signal Processing, pp. 307-310.
Bilbao, S. (2004). Wave and Scattering Methods for Numerical Simulation, John Wiley and Sons, ISBN 0-470-87017-6, West Sussex, England.
Bresler, Y. & Macovski, A. (1986). Exact maximum likelihood parameter estimation of superimposed exponential signals in noise, IEEE Trans. ASSP, Vol. ASSP-34, No. 5, pp. 1361-1375.
Buckingham, M. J. (1979a). Array gain of a broadside vertical line array in shallow water, J. Acoust. Soc. Am., Vol. 65, No. 1, pp. 148-161.
Buckingham, M. J. (1979b). On the response of steered vertical line arrays to anisotropic noise, Proc. R. Soc. Lond. A, Vol. 367, pp. 539-547.
Burdic, W. S. (1991). Underwater Acoustic System Analysis, Prentice-Hall, ISBN 0-13-947607-5, Englewood Cliffs, New Jersey, USA.
Cox, H. (1973). Spatial correlation in arbitrary noise fields with application to ambient sea noise, J. Acoust. Soc. Am., Vol. 54, No. 5, pp. 1289-1301.
Cray, B. A. (2001). Directional acoustic receivers: signal and noise characteristics, Proc. of the Workshop of Directional Acoustic Sensors, Newport, RI.
Cray, B. A. (2002). Directional point receivers: the sound and the theory, Oceans '02, pp. 1903-1905.
Cray, B. A.; Evora, V. M. & Nuttall, A. H. (2003). Highly directional acoustic receivers, J. Acoust. Soc. Am., Vol. 113, No. 3, pp. 1526-1532.
D'Spain, G. L.; Hodgkiss, W. S.; Edmonds, G. L.; Nickles, J. C.; Fisher, F. H. & Harris, R. A. (1992). Initial analysis of the data from the vertical DIFAR array, Proc. Mast. Oceans Tech. (Oceans '92), pp. 346-351.
D'Spain, G. L.; Luby, J. C.; Wilson, G. R. & Gramann, R. A. (2006). Vector sensors and vector sensor line arrays: comments on optimal array gain and detection, J. Acoust. Soc. Am., Vol. 120, No. 1, pp. 171-185.
Fettweis, A. (1990). On assessing robustness of recursive digital filters, European Transactions on Telecommunications, Vol. 1, pp. 103-109.
Fettweis, A. & Nitsche, G. (1991a). Numerical integration of partial differential equations using principles of multidimensional wave digital filters, Journal of VLSI Signal Processing, Vol. 3, pp. 7-24, Kluwer Academic Publishers, Boston.
Fettweis, A. & Nitsche, G. (1991b). Transformation approach to numerically integrating PDEs by means of WDF principles, Multidimensional Systems and Signal Processing, Vol. 2, pp. 127-159, Kluwer Academic Publishers, Boston.
Hawkes, M. & Nehorai, A. (1998). Acoustic vector-sensor beamforming and Capon direction estimation, IEEE Trans. Signal Processing, Vol. 46, No. 9, pp. 2291-2304.
Hawkes, M. & Nehorai, A. (2000). Acoustic vector-sensor processing in the presence of a reflecting boundary, IEEE Trans. Signal Processing, Vol. 48, No. 11, pp. 2981-2993.
Hines, P. C. & Hutt, D. L. (1999). SIREM: an instrument to evaluate superdirective and intensity receiver arrays, Oceans 1999, pp. 1376-1380.
Hines, P. C.; Rosenfeld, A. L.; Maranda, B. H. & Hutt, D. L. (2000). Evaluation of the endfire response of a superdirective line array in simulated ambient noise environments, Proc. Oceans 2000, pp. 1489-1494.
Johnson, C. (2009). Numerical Solution of Partial Differential Equations by the Finite Element Method, Dover Publications, ISBN-13 978-0-486-46900-3, Mineola, New York, USA.
Krim, H. & Viberg, M. (1996). Two decades of array signal processing research, IEEE Signal Processing Magazine, Vol. 13, No. 4, pp. 67-94.
Kumaresan, R. & Shaw, A. K. (1985). High resolution bearing estimation without eigendecomposition, Proc. IEEE ICASSP 85, pp. 576-579, Tampa, FL.
Kythe, P. K.; Puri, P. & Schaferkotter, M. R. (2003). Partial Differential Equations and Boundary Value Problems with Mathematica, Chapman & Hall/CRC, ISBN 1-58488-314-6, Boca Raton, London, New York, Washington, D.C.
LeVeque, R. J. (2007). Finite Difference Methods for Ordinary and Partial Differential Equations, SIAM, ISBN 978-0-898716-29-0, Philadelphia, USA.
Nehorai, A. & Paldi, E. (1994). Acoustic vector-sensor array processing, IEEE Trans. Signal Processing, Vol. 42, No. 9, pp. 2481-2491.
Schmidlin, D. J. (2007). Directionality of generalized acoustic sensors of arbitrary order, J. Acoust. Soc. Am., Vol. 121, No. 6, pp. 3569-3578.
Schmidlin, D. J. (2010a). Distribution theory approach to implementing directional acoustic sensors, J. Acoust. Soc. Am., Vol. 127, No. 1, pp. 292-299.
Schmidlin, D. J. (2010b). Concerning the null contours of vector sensors, Proc. Meetings on Acoustics, Vol. 9, Acoustical Society of America.
Schmidlin, D. J. (2010c). The directivity index of discriminating functions, Technical Report No. 31-2010-1, El Roi Analytical Services, Valdese, North Carolina.
Schmidt, R. O. (1986). Multiple emitter location and signal parameter estimation, IEEE Trans. Antennas and Propagation, Vol. AP-34, No. 3, pp. 276-280.
Silvia, M. T. (2001). A theoretical and experimental investigation of acoustic dyadic sensors, SITTEL Technical Report No. TP-4, SITTEL Corporation, Ojai, CA.
Silvia, M. T.; Franklin, R. E. & Schmidlin, D. J. (2001). Signal processing considerations for a general class of directional acoustic sensors, Proc. of the Workshop of Directional Acoustic Sensors, Newport, RI.
Van Veen, B. D. & Buckley, K. M. (1988). Beamforming: a versatile approach to spatial filtering, IEEE ASSP Magazine, Vol. 5, No. 2, pp. 4-24.
38
Advances in Sound Localization
Wong, K. T. & Zoltowski, M. D. (1999). Root-MUSIC-based azimuth-elevation angle-ofarrival estimation with uniformly spaced but arbitrarily oriented velocity hydrophones, IEEE Trans. Signal Processing, Vol. 47, No. 12, pp. 3250-3260. Wong, K. T. & Zoltowski, M. D. (2000). Self-initiating MUSIC-based direction finding in underwater acoustic particle velocity-field beamspace, IEEE Journal of Oceanic Engineering, Vol. 25, No. 2, pp. 262-273. Wong, K. T. & Chi, H. (2002). Beam patterns of an underwater acoustic vector hydrophone located away from any reflecting boundary, IEEE Journal Oceanic Engineering, Vol. 27, No. 3, pp. 628-637. Ziomek, L. J. (1995). Fundamentals of Acoustic Field Theory and Space-Time Signal Processing, CRC Press, ISBN 0-8493-9455-4, Boca Raton, Ann Arbor, London, Tokyo. Zou, N. & Nehorai, A. (2009). Circular acoustic vector-sensor array for mode beamforming, IEEE Trans. Signal Processing, Vol. 57, No. 8, pp. 3041-3052.
3

Single-Channel Sound Source Localization Based on Discrimination of Acoustic Transfer Functions

Ryoichi Takashima, Tetsuya Takiguchi and Yasuo Ariki
Graduate School of System Informatics, Kobe University, Kobe, Japan
1. Introduction

Many systems using microphone arrays have been tried in order to localize sound sources. Conventional techniques, such as MUSIC, CSP, and so on (e.g., (Johnson & Dudgeon, 1996; Omologo & Svaizer, 1996; Asano et al., 2000; Denda et al., 2006)), use simultaneous phase information from microphone arrays to estimate the direction of the arriving signal. There have also been studies on binaural source localization based on interaural differences, such as interaural level difference and interaural time difference (e.g., (Keyrouz et al., 2006; Takimoto et al., 2006)). However, microphone-array-based systems may not be suitable in some cases because of their size and cost. Therefore, single-channel techniques are of interest, especially in small-device-based scenarios. The problem of single-microphone source separation is one of the most challenging scenarios in the field of signal processing, and some techniques have been described (e.g., (Kristjansson et al., 2004; Raj et al., 2006; Jang et al., 2003; Nakatani & Juang, 2006)). In our previous work (Takiguchi et al., 2001; Takiguchi & Nishimura, 2004), we proposed HMM (Hidden Markov Model) separation for reverberant speech recognition, where the observed (reverberant) speech is separated into the acoustic transfer function and the clean speech HMM. Using HMM separation, it is possible to estimate the acoustic transfer function using some adaptation data (only several words) uttered from a given position. For this reason, measurement of impulse responses is not required. Because the characteristics of the acoustic transfer function depend on each position, the obtained acoustic transfer function can be used to localize the talker.

In this paper, we will discuss a new talker localization method using only a single microphone. In our previous work (Takiguchi et al., 2001) for reverberant speech recognition, HMM separation required the texts of a user's utterances in order to estimate the acoustic transfer function. However, it is difficult to obtain the texts of utterances for talker-localization estimation tasks. In this paper, the acoustic transfer function is estimated from observed (reverberant) speech using a clean speech model without having to rely on user utterance texts, where a GMM (Gaussian Mixture Model) is used to model clean speech features. This estimation is performed in the cepstral domain, employing an approach based upon maximum likelihood. This is possible because the cepstral parameters are an effective representation for retaining useful clean speech information. The results of our talker-localization experiments show the effectiveness of our method.
Fig. 1. Training process for the acoustic transfer function GMM
2. Estimation of the acoustic transfer function

2.1 System overview
Figure 1 shows the training process for the acoustic transfer function GMM. First, we record the reverberant speech data O^(θ) from each position θ in order to build the GMM of the acoustic transfer function for θ. Next, the frame sequence of the acoustic transfer function Ĥ^(θ) is estimated from the reverberant speech O^(θ) (any utterance) using the clean-speech acoustic model, where a GMM is used to model the clean speech feature:

\hat{H}^{(\theta)} = \operatorname*{argmax}_{H} \Pr(O^{(\theta)} \mid H, \lambda_S) \qquad (1)

Here, λ_S denotes the set of GMM parameters for clean speech, while the suffix S represents the clean speech in the cepstral domain. The clean speech GMM enables us to estimate the acoustic transfer function from the observed speech without needing to have user utterance texts (i.e., text-independent acoustic transfer estimation). Using the estimated frame sequence data of the acoustic transfer function Ĥ^(θ), the acoustic transfer function GMM for each position λ_H^(θ) is trained.

Figure 2 shows the talker localization process. For test data, the talker position θ̂ is estimated based on discrimination of the acoustic transfer function, where the GMMs of the acoustic transfer function are used. First, the frame sequence of the acoustic transfer function Ĥ is estimated from the test data (any utterance) using the clean-speech acoustic model. Then, from among the GMMs corresponding to each position, we find the GMM having the maximum likelihood with regard to Ĥ:

\hat{\theta} = \operatorname*{argmax}_{\theta} \Pr(\hat{H} \mid \lambda_H^{(\theta)}) \qquad (2)

where λ_H^{(θ)} denotes the estimated acoustic transfer function GMM for direction (location) θ.
Fig. 2. Estimation of talker localization based on discrimination of the acoustic transfer function

2.2 Cepstrum representation of reverberant speech
The observed signal (reverberant speech), o(t), in a room environment is generally considered to be the convolution of clean speech and the acoustic transfer function:

o(t) = \sum_{l=0}^{L-1} s(t-l)\, h(l) \qquad (3)

where s(t) is a clean speech signal and h(l) is an acoustic transfer function (room impulse response) from the sound source to the microphone. The length of the acoustic transfer function is L. The spectral analysis for the acoustic modeling is generally carried out using short-term windowing. If the length L is shorter than that of the window, the observed complex spectrum is generally represented by

O(\omega; n) = S(\omega; n) \cdot H(\omega; n) \qquad (4)

However, since the length of the acoustic transfer function is greater than that of the window, the observed spectrum is approximately represented by O(ω; n) ≈ S(ω; n) · H(ω; n). Here O(ω; n), S(ω; n), and H(ω; n) are the short-term linear complex spectra in analysis window n. Applying the logarithm transform to the power spectrum, we get

\log |O(\omega; n)|^{2} \approx \log |S(\omega; n)|^{2} + \log |H(\omega; n)|^{2} \qquad (5)

In speech recognition, cepstral parameters are an effective representation when it comes to retaining useful speech information. Therefore, we use the cepstrum for the acoustic modeling that is necessary to estimate the acoustic transfer function. The cepstrum of the observed signal is given by the inverse Fourier transform of the log spectrum:

O_{cep}(t; n) \approx S_{cep}(t; n) + H_{cep}(t; n) \qquad (6)

where O_cep, S_cep, and H_cep are cepstra for the observed signal, clean speech signal, and acoustic transfer function, respectively. In this paper, we introduce a GMM (Gaussian Mixture Model) of the acoustic transfer function to deal with the influence of a room impulse response.
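To make the additivity in Equation (6) concrete, the short Python sketch below convolves one frame of noise with a toy impulse response and checks that the cepstrum of the observed signal is the sum of the component cepstra. This is our illustration, not the chapter's implementation; the signal, the impulse response, and the FFT length are arbitrary choices.

import numpy as np

def real_cepstrum(x, n_fft=1024):
    # Inverse FFT of the log power spectrum, as in Eqs. (5)-(6)
    spec = np.fft.rfft(x, n=n_fft)
    return np.fft.irfft(np.log(np.abs(spec) ** 2))

rng = np.random.default_rng(0)
s = rng.standard_normal(512)            # stand-in for one windowed clean-speech frame
h = np.zeros(64)
h[0], h[5], h[20] = 1.0, 0.5, 0.25      # toy impulse response, shorter than the frame
o = np.convolve(s, h)                   # observed signal, Eq. (3)

# Since n_fft exceeds len(o), the convolution is an exact spectral product,
# so the difference below sits at the level of floating-point noise.
diff = real_cepstrum(o) - (real_cepstrum(s) + real_cepstrum(h))
print(abs(diff).max())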
Fig. 3. Difference between acoustic transfer functions obtained by subtraction of short-term-analysis-based speech features in the cepstrum domain

2.3 Difference of acoustic transfer functions
Figure 3 shows the mean values of the cepstrum, H̄_cep, that were computed for each word using the following equations:

H_{cep}(t; n) \approx O_{cep}(t; n) - S_{cep}(t; n) \qquad (7)

\bar{H}_{cep}(t) = \frac{1}{N} \sum_{n}^{N} H_{cep}(t; n) \qquad (8)

where t is the cepstral index. Reverberant speech, O, was created using linear convolution of clean speech and an impulse response. The impulse responses were taken from the RWCP sound scene database (Nakamura, 2001), where the loudspeaker was located at 30 and 90 degrees from the microphone. The lengths of the impulse responses are 300 msec and 0 msec. The reverberant speech and clean speech were processed using a 32-msec Hamming
window, and then for each frame, n, a set of 16 MFCCs was computed. The 10th and 11th cepstral coefficients for 216 words are plotted in Figure 3. As shown in this figure, for the 300-msec condition a difference between the two acoustic transfer functions (30 and 90 degrees) appears in the cepstral domain. This difference will be useful for sound source localization estimation. On the other hand, in the case of the 0-msec impulse response, the influence of the microphone and loudspeaker characteristics is a significant problem, and it is therefore difficult to discriminate between the positions. This figure also shows that the variability of the acoustic transfer function in the cepstral domain appears to be large for the reverberant speech. When the length of the impulse response is shorter than the analysis window used for the spectral analysis of speech, the acoustic transfer function obtained by subtraction of short-term-analysis-based speech features in the cepstrum domain is essentially constant over the whole utterance. However, as the length of the impulse response for the room reverberation becomes longer than the analysis window, the variability of the acoustic transfer function obtained by the short-term analysis becomes large, with the acoustic transfer function being only approximately represented by Equation (7). To compensate for this variability, a GMM is employed to model the acoustic transfer function.
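As a rough illustration of Equations (7) and (8) (not the authors' exact feature pipeline), the mean cepstral transfer function of an utterance can be estimated by subtracting the MFCC sequences of the clean and reverberated versions of that utterance. The sketch below assumes 12-kHz audio and the chapter's 32-msec window with an 8-msec shift; clean and h are placeholder arrays, and librosa's MFCC implementation stands in for whatever analysis the authors used.

import numpy as np
import librosa

sr = 12000
win = int(0.032 * sr)       # 32-msec Hamming window (384 samples)
hop = int(0.008 * sr)       # 8-msec frame shift (96 samples)

def mfcc_frames(y):
    # 16 MFCCs per frame, roughly matching the chapter's settings
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=16, n_fft=win,
                                win_length=win, hop_length=hop, window="hamming")

clean = np.random.randn(sr)                              # placeholder clean-speech word
h = np.zeros(int(0.3 * sr)); h[0], h[1200] = 1.0, 0.3    # toy 300-msec impulse response
reverb = np.convolve(clean, h)[:len(clean)]              # simulated observation, Eq. (3)

H_frames = mfcc_frames(reverb) - mfcc_frames(clean)      # Eq. (7), frame by frame
H_mean = H_frames.mean(axis=1)                           # Eq. (8), per-utterance mean
print(H_mean[9], H_mean[10])    # coefficients near those plotted in Fig. 3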
3. Maximum-likelihood-based parameter estimation

This section presents a new method for estimating the GMM (Gaussian Mixture Model) of the acoustic transfer function. The estimation is implemented by maximizing the likelihood of the training data from a user's position. In (Sankar & Lee, 1996), a maximum-likelihood (ML) estimation method to decrease the acoustic mismatch for a telephone channel was described, and in (Kristjansson et al., 2001) channel distortion and noise are simultaneously estimated using an expectation-maximization (EM) method. In this paper, we introduce the utilization of the GMM of the acoustic transfer function based on the ML estimation approach to deal with a room impulse response. The frame sequence of the acoustic transfer function in (6) is estimated in an ML manner by using the EM algorithm, which maximizes the likelihood of the observed speech:

\hat{H} = \operatorname*{argmax}_{H} \Pr(O \mid H, \lambda_S) \qquad (9)

Here, λ_S denotes the set of clean speech GMM parameters, while the suffix S represents the clean speech in the cepstral domain. The EM algorithm is a two-step iterative procedure. In the first step, called the expectation step, the following auxiliary function is computed:

Q(\hat{H} \mid H) = E[\log \Pr(O, c \mid \hat{H}, \lambda_S) \mid H, \lambda_S] = \sum_{c} \frac{\Pr(O, c \mid H, \lambda_S)}{\Pr(O \mid H, \lambda_S)} \cdot \log \Pr(O, c \mid \hat{H}, \lambda_S) \qquad (10)

Here c represents the unobserved mixture component labels corresponding to the observation sequence O. The joint probability of observing sequences O and c can be calculated as

\Pr(O, c \mid \hat{H}, \lambda_S) = \prod_{n(v)} w_{c_{n(v)}} \Pr(O_{n(v)} \mid \hat{H}, \lambda_S) \qquad (11)
where w is the mixture weight and O_{n(v)} is the cepstrum at the n-th frame for the v-th training data (observation data). Since we consider the acoustic transfer function as additive noise in the cepstral domain, the mean of mixture k in the model λ_O is derived by adding the acoustic transfer function. Therefore, (11) can be written as

\Pr(O, c \mid \hat{H}, \lambda_S) = \prod_{n(v)} w_{c_{n(v)}} \cdot N\!\left(O_{n(v)};\, \mu^{(S)}_{k_{n(v)}} + \hat{H}_{n(v)},\, \Sigma^{(S)}_{k_{n(v)}}\right) \qquad (12)
where N(O; μ, Σ) denotes the multivariate Gaussian distribution. It is straightforward to derive that (Juang, 1985)

Q(\hat{H} \mid H) = \sum_{k}\sum_{n(v)} \Pr(O_{n(v)}, c_{n(v)} = k \mid \lambda_S) \log w_k + \sum_{k}\sum_{n(v)} \Pr(O_{n(v)}, c_{n(v)} = k \mid \lambda_S) \cdot \log N\!\left(O_{n(v)};\, \mu_k^{(S)} + \hat{H}_{n(v)},\, \Sigma_k^{(S)}\right) \qquad (13)
Here μ_k^{(S)} and Σ_k^{(S)} are the k-th mean vector and the (diagonal) covariance matrix in the clean speech GMM, respectively. It is possible to train those parameters by using a clean speech database. Next, we focus only on the term involving H:

Q(\hat{H} \mid H) = \sum_{k}\sum_{n(v)} \Pr(O_{n(v)}, c_{n(v)} = k \mid \lambda_S) \cdot \log N\!\left(O_{n(v)};\, \mu_k^{(S)} + \hat{H}_{n(v)},\, \Sigma_k^{(S)}\right)
= -\sum_{k}\sum_{n(v)} \gamma_{k,n(v)} \sum_{d=1}^{D} \left\{ \frac{1}{2}\log\left((2\pi)^{D}\sigma_{k,d}^{(S)2}\right) + \frac{\left(O_{n(v),d} - \mu_{k,d}^{(S)} - \hat{H}_{n(v),d}\right)^{2}}{2\sigma_{k,d}^{(S)2}} \right\} \qquad (14)

\gamma_{k,n(v)} = \Pr(O_{n(v)}, k \mid \lambda_S) \qquad (15)
Here D is the dimension of the observation vector O_n, and μ_{k,d}^{(S)} and σ_{k,d}^{(S)2} are the d-th mean value and the d-th diagonal variance value of the k-th component in the clean speech GMM, respectively. The maximization step (M-step) in the EM algorithm becomes "max Q(Ĥ | H)". The re-estimation formula can, therefore, be derived, knowing that ∂Q(Ĥ | H)/∂Ĥ = 0, as

\hat{H}_{n(v),d} = \frac{\displaystyle\sum_{k} \gamma_{k,n(v)} \frac{O_{n(v),d} - \mu_{k,d}^{(S)}}{\sigma_{k,d}^{(S)2}}}{\displaystyle\sum_{k} \frac{\gamma_{k,n(v)}}{\sigma_{k,d}^{(S)2}}} \qquad (16)
After calculating the frame sequence data of the acoustic transfer function for all training data (several words), the GMM for the acoustic transfer function is created. The m-th mean vector and covariance matrix in the acoustic transfer function GMM (λ_H^{(θ)}) for the direction (location) θ can be represented using the term Ĥ_n as follows:

\mu_m^{(H)} = \sum_{v}\sum_{n(v)} \frac{\gamma_{m,n(v)}\, \hat{H}_{n(v)}}{\gamma_m} \qquad (17)

\Sigma_m^{(H)} = \sum_{v}\sum_{n(v)} \frac{\gamma_{m,n(v)}\, \left(\hat{H}_{n(v)} - \mu_m^{(H)}\right)^{T} \left(\hat{H}_{n(v)} - \mu_m^{(H)}\right)}{\gamma_m} \qquad (18)
Here n(v) denotes the frame number for the v-th training data. Finally, using the estimated GMM of the acoustic transfer function, the estimation of talker localization is handled in an ML framework:

\hat{\theta} = \operatorname*{argmax}_{\theta} \Pr(\hat{H} \mid \lambda_H^{(\theta)}) \qquad (19)

where λ_H^{(θ)} denotes the estimated GMM for direction (location) θ, and the GMM having the maximum likelihood is found for each set of test data from among the estimated GMMs corresponding to each position.
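In practice, Equations (17)-(19) amount to fitting one GMM per training position on its estimated Ĥ sequence and choosing the position whose model scores a test sequence highest. The sketch below uses scikit-learn's GaussianMixture, which runs its own EM rather than the weighted sums of Equations (17) and (18), but realizes the same scheme; all names are ours.

import numpy as np
from sklearn.mixture import GaussianMixture

def train_position_gmms(H_by_position, n_mix=16, seed=0):
    # H_by_position maps a position label theta to an (N, D) array of estimated H frames
    return {theta: GaussianMixture(n_components=n_mix, covariance_type="diag",
                                   random_state=seed).fit(H)
            for theta, H in H_by_position.items()}

def localize(H_test, gmms):
    # Eq. (19): score() returns the average per-frame log-likelihood
    return max(gmms, key=lambda theta: gmms[theta].score(H_test))

For example, gmms = train_position_gmms({30: H_30, 90: H_90, 130: H_130}) followed by localize(H_test, gmms) reproduces the three-position decision rule (H_30 and the like are hypothetical training arrays).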
4. Experiments

4.1 Simulation experimental conditions
The new talker localization method was evaluated in both a simulated reverberant environment and a real environment. In the simulated environment, the reverberant speech was created by linear convolution of clean speech and an impulse response. The impulse response was taken from the RWCP database recorded in real acoustical environments (Nakamura, 2001). The reverberation time was 300 msec, and the distance to the microphone was about 2 meters. The size of the recording room was about 6.7 m × 4.2 m (width × depth). Figure 4 and Figure 5 show the experimental room environment and the impulse response (90 degrees), respectively. The speech signal was sampled at 12 kHz and windowed with a 32-msec Hamming window every 8 msec. The experiment utilized the speech data of four males in the ATR Japanese speech database. The clean speech GMM (speaker-dependent model) was trained using 2,620 words and has 64 Gaussian mixture components. The test data for one location consisted of 1,000 words, and 16 MFCCs (Mel-Frequency Cepstral Coefficients) were used as the feature vector. The total number of test data for one location was 1,000 (words) × 4 (males). The number of training data for the acoustic transfer function GMM was 10 words and 50 words. The speech data for training the clean speech model, training the acoustic transfer function, and testing were spoken by the same speakers but consisted of different text utterances. The speaker positions for training and testing consisted of three positions (30, 90, and 130 degrees), five positions (10, 50, 90, 130, and 170 degrees), seven positions (30, 50, 70, ..., 130, and 150 degrees), and nine positions (10, 30, 50, 70, ..., 150, and 170 degrees).
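The simulated observations described here amount to a single linear convolution. A minimal sketch (scipy's fftconvolve is our choice of tool, and the impulse-response file name is a hypothetical placeholder rather than an actual RWCP path):

import numpy as np
from scipy.signal import fftconvolve

sr = 12000                                   # chapter's sampling rate
clean = np.random.randn(2 * sr)              # placeholder for a clean-speech word
h = np.load("rwcp_ir_90deg.npy")             # hypothetical: measured 300-msec impulse response

reverb = fftconvolve(clean, h)[:len(clean)]  # simulated reverberant observation
reverb /= np.max(np.abs(reverb)) + 1e-12     # normalize to avoid clipping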
Fig. 4. Experiment room environment for simulation
Fig. 5. Impulse response (90 degrees, reverberation time: 300 msec)

Then, for each set of test data, we found the GMM having the maximum likelihood from among those GMMs corresponding to each position. These experiments were carried out for each speaker, and the localization accuracy was averaged over the four talkers.

4.2 Performance in a simulated reverberant environment
Figure 6 shows the localization accuracy in the three-position estimation task, where 50 words are used for the estimation of the acoustic transfer function. As can be seen from this figure, increasing the number of Gaussian mixture components for the acoustic transfer function improves the localization accuracy. We can expect the GMM of the acoustic transfer function to be effective for carrying out localization estimation. Figure 7 shows the results for different numbers of training data, where the number of Gaussian mixture components for the acoustic transfer function is 16. The performance of training with ten words may be somewhat poor due to the lack of data for estimating the acoustic transfer function; increasing the amount of training data (50 words) improves the performance. In the proposed method, the frame sequence of the acoustic transfer function is separated from the observed speech using (16), and the GMM of the acoustic transfer function is trained by (17) and (18) using the separated sequence data.
Fig. 6. Effect of increasing the number of mixtures in modeling the acoustic transfer function. Here, 50 words are used for the estimation of the acoustic transfer function.
Fig. 7. Comparison of different numbers of training data

On the other hand, a simple way to carry out voice (talker) localization may be to use the GMM of the observed speech without separating the acoustic transfer function. The GMM of the observed speech can be derived in a similar way as in (17) and (18):
\mu_m^{(O)} = \sum_{v}\sum_{n(v)} \frac{\gamma_{m,n(v)}\, O_{n(v)}}{\gamma_m} \qquad (20)

\Sigma_m^{(O)} = \sum_{v}\sum_{n(v)} \frac{\gamma_{m,n(v)}\, \left(O_{n(v)} - \mu_m^{(O)}\right)^{T} \left(O_{n(v)} - \mu_m^{(O)}\right)}{\gamma_m} \qquad (21)
The GMM of the observed speech includes not only the acoustic transfer function but also the clean speech, which is meaningless information for sound source localization. Figure 8 shows a comparison of four methods. The first method is our proposed method, and the second is the method using the GMM of the observed speech without separation of the acoustic transfer function. The third is a simpler method that uses the cepstral mean of the observed speech instead of a GMM. (The position that has the minimum distance from the learned cepstral mean to that of the test data is then selected as the talker's position.) The fourth is a CSP (Cross-power Spectrum Phase) algorithm based on two microphones, where the CSP uses simultaneous phase information from a microphone pair to estimate the direction of the arriving signal (Omologo & Svaizer, 1996). As shown in this figure, the GMM of the observed speech yields higher accuracy than the mean of the observed speech, and the GMM of the acoustic transfer function in turn yields higher accuracy than the GMM of the observed speech. The proposed method separates the acoustic transfer function from the short observed speech signal, so the GMM of the acoustic transfer function is not greatly affected by the characteristics of the clean speech (phonemes), and it is able to achieve good performance regardless of the content of each test word. However, the localization accuracy of the methods using just one microphone decreases as the number of training positions increases, whereas the CSP algorithm based on two microphones retains high accuracy even in the 9-position task. Because the proposed method (single microphone only) uses the acoustic transfer function estimated from a user's utterance, its accuracy is lower.
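For reference, the two-microphone CSP baseline can be sketched as a phase-only cross-correlation, the formulation described by Omologo & Svaizer (1996); the NumPy outline below is our own, not the exact implementation used in these experiments:

import numpy as np

def csp_delay(x1, x2):
    # Cross-power spectrum phase: whiten the cross spectrum, then inverse FFT
    n = len(x1) + len(x2)
    X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
    cross = X1 * np.conj(X2)
    csp = np.fft.irfft(cross / (np.abs(cross) + 1e-12), n)
    lag = int(np.argmax(csp))
    return lag if lag < n // 2 else lag - n      # map wrapped lags to negative delays

The estimated delay, together with the microphone spacing and the speed of sound, yields the direction of arrival.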
Fig. 8. Performance comparison of the proposed method using the GMM of the acoustic transfer function, a method using the GMM of the observed speech, a method using the cepstral mean of the observed speech, and a CSP algorithm based on two microphones

4.3 Performance in simulated noisy reverberant environments and using a speaker-independent speech model
Figure 9 shows the localization accuracy in noisy environments. The observed speech data were simulated by adding pink noise to the clean speech convolved with the impulse response, so that the signal-to-noise ratios (SNRs) were 25 dB, 15 dB, and 5 dB. As shown in Figure 9, the localization accuracy at an SNR of 25 dB decreases by about 30% in comparison to that in a noiseless environment, and it decreases further as the SNR decreases. Figure 10 shows the comparison of the performance between a speaker-dependent speech model and a speaker-independent speech model. For training the speaker-independent clean speech model and the speaker-independent acoustic transfer function model, speech data spoken by four males in the ASJ Japanese speech database were used. The clean speech GMM was then trained using 160 sentences (40 sentences × 4 males) and has 256 Gaussian mixture components. The acoustic transfer function for the training locations was estimated with this clean speech model from 10 sentences per male, so the total number of training data for the acoustic transfer function GMM was 40 sentences (10 sentences × 4 males). For training the speaker-dependent model and for testing, speech data spoken by four males in the ATR Japanese speech database were used in the same way as described in Section 4.1. The test speech data were provided by the same speakers used to train the speaker-dependent model, but different speakers were used to train the speaker-independent model. Both the speaker-dependent and the speaker-independent GMMs for the acoustic transfer function have 16 Gaussian mixture components. As shown in Figure 10, the localization accuracy of the speaker-independent speech model decreases by about 20% in comparison to the speaker-dependent speech model.
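Mixing at a prescribed SNR, as in these simulations, only requires scaling the noise to the appropriate power. A brief sketch (the 1/f generator below is a crude stand-in for whatever pink-noise source the authors used, and reverb is a placeholder for the convolved speech):

import numpy as np

def pink_noise(n, seed=0):
    # Approximate 1/f noise by shaping white noise with 1/sqrt(f) in frequency
    rng = np.random.default_rng(seed)
    spec = np.fft.rfft(rng.standard_normal(n))
    f = np.arange(spec.size, dtype=float)
    f[0] = 1.0                                   # avoid division by zero at DC
    return np.fft.irfft(spec / np.sqrt(f), n)

def add_noise_at_snr(signal, noise, snr_db):
    # Scale noise so that 10*log10(P_signal / P_noise) equals snr_db
    scale = np.sqrt(np.mean(signal ** 2) / (np.mean(noise ** 2) * 10 ** (snr_db / 10)))
    return signal + scale * noise

reverb = np.random.randn(12000)                  # placeholder reverberant speech
noisy = add_noise_at_snr(reverb, pink_noise(len(reverb)), snr_db=25)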
Fig. 9. Localization accuracy for noisy environments
Fig. 10. Comparison of performance using speaker-dependent/-independent speech models (speaker-independent: 256 Gaussian mixture components; speaker-dependent: 64 Gaussian mixture components)

4.4 Performance using a speaker-dependent speech model in a real environment
The proposed method, which uses a speaker-dependent speech model, was also evaluated in a real environment. The distance to the microphone was 1.5 m, and the height of the microphone was about 0.45 m. The size of the recording room was about 5.5 m × 3.6 m × 2.7 m (width × depth × height). Figure 11 depicts the room environment of the experiment. The experiment used speech data, spoken by two males, from the ASJ Japanese speech database. The clean speech GMM (speaker-dependent model) was trained using 40 sentences and has 64 Gaussian mixture components. The test data for one location consisted of 200, 100, and 66 segments, where one segment has a length of 1, 2, and 3 sec, respectively. The number of training data for the acoustic transfer function was 10 sentences. The speech data for training the clean speech model, training the acoustic transfer function, and testing were spoken by the same speakers, but they consisted of different text utterances. The experiments were carried out for each speaker, and the localization accuracy of the two speakers was averaged.

Figure 12 shows the comparison of the performance using different test segment lengths. There were three speaker positions for training and testing (45, 90, and 135 degrees), and one loudspeaker (BOSE Mediamate II) was used for each position. As shown in this figure, the longer the segment, the higher the localization accuracy, since the mean of the estimated acoustic transfer function becomes more stable. Figure 13 shows the effect when the orientation of the speaker changed from that used for training. There were five speaker positions for training (45, 65, 90, 115, and 135 degrees) and two speaker positions for the test (45 and 90 degrees), and the orientation of the speaker was changed to 0, 45, and 90 degrees, as shown in Figure 14. As shown in Figure 13, as the orientation of the speaker changed, the localization accuracy decreased. Figure 15 shows the plot of the acoustic transfer function estimated for each position and orientation of the speaker. The plot of the training data is the mean value of all training data, and that of the test data is the mean value of the test data per 40 seconds. As shown in Figure 15, as the orientation of the speaker changed from that used for training, the estimated acoustic transfer functions were distributed away from the position of the training data. As a result, these estimated acoustic transfer functions were not correctly recognized.
Fig. 11. Experiment room environment

Fig. 12. Comparison of performance using different test segment lengths

Fig. 13. Effect of speaker orientation
Fig. 14. Speaker orientation
Fig. 15. Mean acoustic transfer function values for five positions (top graph) and for three speaker orientations (0 deg, 45 deg, and 90 deg) at positions of 45 deg and 90 deg (bottom graph)

5. Conclusion

This paper has described a voice (talker) localization method using a single microphone. The sequence of the acoustic transfer function is estimated by maximizing the likelihood of training data uttered from a given position, where the cepstral parameters are used to effectively represent useful clean speech information. The GMM of the acoustic transfer function based on the ML estimation approach is introduced to deal with a room impulse response. The experimental results in a room environment confirmed its effectiveness for location estimation tasks. However, the proposed method requires the measurement of speech for each room environment in advance, and the localization accuracy decreases as the number of training positions increases. In addition, not only the position of the speaker but also various other factors (e.g., the orientation of the speaker) affect the acoustic transfer function. Future work will include efforts to improve both localization estimation from more locations and estimation when conditions other than the speaker position change. We also hope to improve the localization accuracy in noisy environments and for speaker-independent speech models.
We will also investigate a text-independent technique based on an HMM for modeling the speech content.
6. References

Johnson, D. & Dudgeon, D. (1996). Array Signal Processing, Prentice Hall, Englewood Cliffs, NJ.
Omologo, M. & Svaizer, P. (1996). Acoustic Event Localization in Noisy and Reverberant Environment Using CSP Analysis, Proceedings of ICASSP 1996, IEEE, Atlanta, Georgia, pp. 921-924.
Asano, F., Asoh, H. & Matsui, T. (2000). Sound Source Localization and Separation in Near Field, IEICE Trans. Fundamentals, Vol. E83-A, No. 11, pp. 2286-2294.
Denda, Y., Nishiura, T. & Yamashita, Y. (2006). Robust Talker Direction Estimation Based on Weighted CSP Analysis and Maximum Likelihood Estimation, IEICE Trans. on Information and Systems, Vol. E89-D, No. 3, pp. 1050-1057.
Keyrouz, F., Naous, Y. & Diepold, K. (2006). A New Method for Binaural 3-D Localization Based on HRTFs, Proceedings of ICASSP 2006, IEEE, Toulouse, France, pp. V-341-V-344.
Takimoto, M., Nishino, T. & Takeda, K. (2006). Estimation of a talker and listener's positions in a car using binaural signals, The Fourth Joint Meeting of the Acoustical Society of America and the Acoustical Society of Japan, Honolulu, Hawaii, 3pSP33, p. 3216.
Kristjansson, T., Attias, H. & Hershey, J. (2004). Single Microphone Source Separation Using High Resolution Signal Reconstruction, Proceedings of ICASSP 2004, IEEE, Montreal, Quebec, Canada, pp. 817-820.
Raj, B., Shashanka, M. & Smaragdis, P. (2006). Latent Dirichlet Decomposition for Single Channel Speaker Separation, Proceedings of ICASSP 2006, IEEE, Toulouse, France, pp. 821-824.
Jang, G., Lee, T. & Oh, Y. (2003). A Subspace Approach to Single Channel Signal Separation Using Maximum Likelihood Weighting Filters, Proceedings of ICASSP 2003, IEEE, Hong Kong, pp. 45-48.
Nakatani, T. & Juang, B. (2006). Speech Dereverberation Based on Probabilistic Models of Source and Room Acoustics, Proceedings of ICASSP 2006, IEEE, Toulouse, France, pp. I-821-I-824.
Takiguchi, T., Nakamura, S. & Shikano, K. (2001). HMM-separation-based speech recognition for a distant moving speaker, IEEE Transactions on Speech and Audio Processing, Vol. 9, No. 2, pp. 127-140.
Takiguchi, T. & Nishimura, M. (2004). Acoustic Model Adaptation Using First Order Prediction for Reverberant Speech, Proceedings of ICASSP 2004, IEEE, Montreal, Quebec, Canada, pp. 869-872.
Sankar, A. & Lee, C. (1996). A Maximum-Likelihood Approach to Stochastic Matching for Robust Speech Recognition, IEEE Transactions on Speech and Audio Processing, Vol. 4, No. 3, pp. 190-202.
Kristjansson, T., Frey, B., Deng, L. & Acero, A. (2001). Joint Estimation of Noise and Channel Distortion in a Generalized EM Framework, IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), IEEE, Trento, Italy, pp. 155-158.
Juang, B. (1985). Maximum-likelihood estimation of mixture multivariate stochastic observations of Markov chains, AT&T Tech. J., Vol. 64, No. 6, pp. 1235-1249.
Nakamura, S. (2001). Acoustic sound database collected for hands-free speech recognition and sound scene understanding, International Workshop on Hands-Free Speech Communication, International Speech Communication Association, Kyoto, Japan, pp. 43-46.
4

Localization Error: Accuracy and Precision of Auditory Localization

Tomasz Letowski1 and Szymon Letowski2
1U.S. Army Research Laboratory, Human Research and Engineering Directorate, Aberdeen Proving Ground, MD 21005-5425, USA
2Evidence Based Research, Inc., Vienna, VA 22182, USA
1. Introduction

The act of localization is the estimation of the true location of an object in space and is characterized by a certain amount of inherent uncertainty and operational bias that results in estimation errors. The type and size of the estimation errors depend on the properties of the emitted sound, the characteristics of the surrounding environment, the specific localization task, and the abilities of the listener. While the general idea of localization error is straightforward, the specific concepts and measures of localization error encountered in the psychoacoustic literature are quite diverse and frequently poorly described, making generalizations and data comparison quite difficult. In addition, the same concept is sometimes described in different papers by different terms, and the same term is used by different authors to refer to different concepts. This variety of terms and metrics used with inconsistent semantics can easily be a source of confusion and may cause the reader to misinterpret the reported data and conclusions.

A fundamental property of localization estimates is that in most cases they are angular and thus represent circular (spherical) variables, which in general cannot be described by a linear distribution as assumed in classical statistics. The azimuth and elevation of the sound source locations define an ambiguous conceptual sphere, which can only be fully analyzed with the methods of spherical statistics. However, these methods are seldom used in psychoacoustic studies, and it is not immediately clear to what degree they should be utilized. In many cases, localization estimates may, in fact, be correctly analyzed using linear methods, but neither the necessary conditions for nor the limitations of linear methods have been clearly stated.

In sum, localization error is a widely used and intuitively simple measure of spatial uncertainty and spatial bias in the perception of sound source location, but both a common terminology for its description and a broad understanding of the implications of its circular character are lacking. Some of the issues related to these topics are discussed in the subsequent sections. The presented concepts and explanations are intended to clarify some existing terminological ambiguities and offer some guidance as to the statistical treatment of localization error data. The focus of the discussion is on issues related to localization judgments, with only marginal attention given to distance estimation judgments, which deserve to be the object of a separate article.
2. Basis of auditory localization

Spatial hearing provides information about the acoustic environment: about its geometry and physical properties and about the locations of sound sources. Sound localization generally refers to the act or process of identifying the direction toward a sound source on the basis of sound emitted by the source (see discussion of this definition in Section 3). For living organisms, this is a sensory act based on the perceived auditory stimulation. In the case of machine localization, it is an algorithmic comparison of signals arriving at various sensors. The sound can be either the main product of the source or a by-product of its operation. The act of sound localization when performed by living organisms can also be referred to as auditory localization, and this term is used throughout this chapter.

The localization ability of humans depends on a number of anatomical properties of the human auditory system. The most important of these is the presence of two entry points to the auditory system (the external ears) that are located on opposite sides of the human head. Such a configuration of the auditory input system causes a sound coming at the listener from an angle to have a different sound intensity and time of arrival at each ear. The difference in sound intensity is mainly caused by the acoustic shadow and baffle effects of the head and results in a lower sound intensity at the ear located farther away from the sound source (Strutt, 1876; Steinhauser, 1879). The difference in time of arrival is caused by the difference in the distance the sound has to travel to each of the ears (Strutt, 1907; Wilson and Myers, 1908). These differences are normally referred to as the interaural intensity difference (IID) and the interaural time difference (ITD). In the case of continuous pure tones and other periodic signals, the term interaural phase difference (IPD) is used in place of ITD, since such sounds have no clear reference point in time. The IID and ITD (IPD) together are called the binaural localization cues. The IID is the dominant localization cue for high frequency sounds, while the ITD (IPD) is the dominant cue for low frequency sounds (waveform phase difference). The ITD (IPD) is additionally an important cue for high frequency sounds because of differences in the waveform envelope delay (group delay) (Henning, 1974; 1980; McFadden & Pasanen, 1976).

Binaural cues are the main localization mechanisms in the horizontal plane but are only marginally useful for vertical localization or front-back differentiation. This is due to spatial ambiguity caused by head symmetry and referred to as the cone of confusion (Wallach, 1939). The cone of confusion is the imaginary cone extending outward from each ear along the interaural axis that represents sound source locations producing the same interaural differences. Although asymmetry in ear placement on the head and in the shape of the pinnae provides some disambiguation, the sound source positions located on the surface of the cone of confusion cannot be identified using binaural cues and can only be resolved using spectral cues associated with the directional sound filtering of the human body. These cues are called monaural cues as they do not depend on the presence of two ears.
Monaural cues result from the shadowing and baffle effects of the pinna and the sound reflections caused by the outer ear (pinna and tragus), head, and upper torso (Steinhauser, 1879; Batteau, 1967; Musicant & Butler, 1984; Lopez-Poveda & Meddis, 1996). These effects and reflections produce peaks and troughs in the sound spectrum that are unique for each sound source location in space relative to the position of the listener (Bloom, 1977; Butler & Belendiuk, 1977; Watkins, 1978). Monaural cues and the related Interaural Spectrum Difference (ISD) also help binaural horizontal localization (Jin et al., 2004; Van Wanrooij & Van Opstal, 2004), but they are most
critical for vertical localization and front-back differentiation. The spectral cues that are the most important for accurate front-back and up-down differentiation are located in the 4-16 kHz frequency range (e.g., Langendijk & Bronkhorst, 2002). Spatial localization ability in both horizontal and vertical planes is also dependent on slight head movements, which cause momentary changes in the peak-and-trough pattern of the sound spectrum at each ear (Young, 1931; Wallach, 1940; Perrett & Noble, 1997; Iwaya et al., 2003), visual cues, and prior knowledge of the stimulus (Pierce, 1901; Rogers & Butler, 1992). More information about the physiology and psychology of auditory localization can be found elsewhere (e.g., Blauert, 1974; Yost & Gourevitch, 1987; Moore, 1989; Yost et al., 2008; Emanuel & Letowski, 2009).
3. Terminology, notation, and conventions

The broad interest and large number of publications in the field of auditory localization has advanced our knowledge of the neurophysiologic processing of spatial auditory signals, the psychology of spatial judgments, and environmental issues in determining the locations of sound sources. The authors of the various experimental and theoretical publications range from physiologists to engineers and computer scientists, each bringing their specific expertise and perspective. The large number of diversified publications has also led to a certain lack of consistency regarding the meaning of some concepts. Therefore, before discussing the methods and measures used to describe and quantify auditory localization errors in Section 5, some key concepts and terminological issues are discussed in this and the following section.

Auditory spatial perception involves the perception of the surrounding space and the locations of the sound sources within that space on the basis of perceived sound. In other words, auditory spatial perception involves the perception of sound spaciousness, which results from the specific volume and shape of the surrounding space, and the identification of the locations of the primary and secondary (sound reflections) sound sources operating in the space in relation to each other and to the position of the listener. In very general terms, auditory spatial perception involves four basic elements:
• Horizontal localization (azimuth, declination)
• Vertical localization (elevation)
• Distance estimation
• Perception of space properties (spaciousness)
The selection of these four elements is based on a meta-analysis of the literature on spatial perception and refers to the traditional terminology used in psychoacoustic research studies on the subject matter. It seems to be a logical, albeit obviously arbitrary, classification.

A direction judgment toward a sound source located in space is an act of localization and can be considered a combination of both horizontal and vertical localization judgments. Horizontal and vertical localization judgments are direction judgments in the corresponding planes and may vary from simple left-right, up-down, and more-less discriminations, to categorical judgments, to the absolute identifications of specific directions in space. A special form of localization judgment for phantom sound sources located in the head of the listener is called lateralization. Therefore, the terms lateralization and localization refer respectively to judgments of the internal and external positions of sound sources in reference to the listener's head (Yost & Hafter, 1987; Emanuel & Letowski, 2009).
Similarly to localization judgments, distance judgments may have the form of discrimination judgments (closer-farther), relative numeric judgments (half as far, twice as far), or absolute numeric judgments in units of distance. In the case of two sound sources located at different distances from the listener, the listener may estimate their relative difference in distance using the same types of judgments. Such relative judgments are referred to as auditory distance difference or auditory depth judgments. Both distance and depth judgments are less accurate than angular localization judgments and show large intersubject variability. In general, perceived distance PD is a power function of the actual distance d and can be described as

PD = k\,d^{a} \qquad (1)

where a and k are fitting constants dependent on the individual listener. Typically, k is close to but slightly smaller than 1 (k ≈ 0.9), and a is about 0.4 but varies widely (0.3-0.8) across listeners (Zahorik et al., 2005).
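Equation (1) implies a strongly compressive mapping; with the typical constants quoted above, a quick computation shows how far sources are underestimated (a plain illustration of the published fit, not new data):

k, a = 0.9, 0.4        # typical values from Zahorik et al. (2005)
for d in (0.5, 1.0, 2.0, 5.0, 10.0):            # actual distance in meters
    print(f"actual {d:5.1f} m -> perceived {k * d ** a:5.2f} m")

For example, a source 10 m away is perceived at roughly 2.3 m under these constants.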
The above differentiation between localization and distance estimation is consistent with the common interpretation of auditory localization as the act of identifying the direction toward the sound source (White, 1987; Morfey, 2001; Illusion, 2010). It may seem, however, inconsistent with the general definition of localization, which includes distance estimation (APA, 2007; Houghton Mifflin, 2007). Therefore, some authors who view distance estimation as an inherent part of auditory localization propose other terms, e.g., direction-of-arrival (DOA) (Dietz et al., 2010), to denote direction-only judgments and distinguish them from general localization judgments. The introduction of a new term describing direction-only judgments is intended to add clarity to the language describing auditory spatial perception. However, the opposite may be true, since the term localization has a long tradition in the psychoacoustic literature of being used to mean the judgment of direction. This meaning also agrees with the common usage of this term. Therefore, it seems reasonable to accept that while the general definition of localization includes judging the distance to a specific location, it does not mandate it, and in its narrow meaning, localization refers to the judgment of direction. In this context, the term localization error refers to errors in direction judgment, and the term distance estimation error to errors in distance estimation.

Spaciousness is the perception of being surrounded by sound and is related to the type and size of the surrounding space. It depends not only on the type and volume of the space but also on the number, type, and locations of the sound sources in the space. Perception of spaciousness has not yet been well researched and has only recently become of more scientific interest due to the rapid development of various types of spatial sound recording and reproduction systems and AVR simulations (Griesinger, 1997). The literature on this subject is very fragmented, inconsistent, and contradictory. The main reason for this is that unlike horizontal localization, vertical localization, and distance estimation judgments, which are made along a single continuum, spaciousness is a multidimensional phenomenon without well-defined dimensions and one that as of now can only be described in relative terms or using categorical judgments. The two terms related to spaciousness that are the most frequently used are listener envelopment (LEV) and apparent source width (ASW). Listener envelopment describes the degree to which a listener is surrounded by sound, as opposed to listening to sound that happens "somewhere else". It is synonymous with spatial impression as defined by Barron and Marshall (1981).
Some authors treat both of these terms as synonymous with spaciousness, but spaciousness can exist without listener envelopment. The ASW is also frequently equated with spaciousness, but such an association does not agree with the common meanings of both width and spaciousness and should be abandoned (Griesinger, 1999). The concept of ASW relates more to the size of the space occupied by the active sound sources and should be a subordinate term to spaciousness. Thus, LEV and ASW can be treated as two complementary elements of spaciousness (Morimoto, 2002). Some other correlated or subordinate aspects of spaciousness are panorama (a synonym of ASW), perspective, ambience, presence, and warmth.

Depending on the task given to the listener, there are two basic types of localization judgments:
• Relative localization (discrimination task)
• Absolute localization (identification task)
Relative localization judgments are made when one sound source location is compared to another, either simultaneously or sequentially. Absolute localization judgments involve only one sound source location that needs to be directly pointed out. In addition, absolute localization judgments can be made on a continuous circular scale and expressed in degrees (°) or can be restricted to a limited set of preselected directions. The latter type of judgment occurs when all the potential sound source locations are marked by labels (e.g., numbers), and the listener is asked to identify the sound source location by label. The actual sound sources may or may not be visible. This type of localization judgment, in which the identification data are later expressed as selection percentages, i.e., the percent of responses indicating each (or just the correct) location, is referred to throughout this chapter as categorical localization. From the listener's perspective, the most complex and demanding judgments are the absolute localization judgments, and they are the main subject of this chapter. The other two types of judgments, discrimination judgments and categorization judgments, are only briefly described and compared to absolute judgments later in the chapter.

In order to assess the human ability to localize the sources of incoming sounds, the physical reference space needs to be defined in relation to the position of the human head. This reference space can be described in either the rectangular or the polar coordinate system. The rectangular coordinate system x, y, z is the basis of Euclidean geometry and is also called the Cartesian coordinate system. In the head-oriented Cartesian coordinate system, the x, y, and z axes are typically oriented as left-right (west-east), back-front (south-north), and down-up (nadir-zenith), respectively. The east, front, and up directions indicate the positive ends of the scales. The Euclidean planes associated with the Cartesian coordinate system are the vertical lateral (x-z), the vertical sagittal (y-z), and the horizontal (x-y) planes. The main reference planes of symmetry for the human body are:
• Median sagittal (midsagittal) plane: y-z plane
• Frontal (coronal) lateral plane: x-z plane
• Axial (transversal, transaxial) horizontal plane: x-y plane
The relative orientations of the sagittal and lateral planes and the positions of the median and frontal planes are shown in Figure 1. The virtual line passing through both ears in the frontal plane is called the interaural axis.
The ear closer to the sound source is termed the ipsilateral ear and the ear farther away from the sound source is the contralateral ear.
Fig. 1. Main reference planes of the human body. The axial plane is parallel to the page.

The median (midsagittal) plane is the sagittal plane (see Figure 1) that is equidistant from both ears. The frontal (coronal) plane is the lateral plane that divides the listener's head into front and back hemispheres along the interaural axis. The axial (transversal) plane is the horizontal plane of symmetry of the human body. Since the axial plane is not level with the interaural axis of human hearing, the respective plane, called the visuoaural plane by Knudsen (1982), is referred to here as the hearing plane, or just the horizontal plane. In the polar system of coordinates, the reference dimensions are d (distance or radius), θ (declination or azimuth), and φ (elevation). Distance is the amount of linear separation between two points in space, usually between the observation point and the target. The angle of declination (azimuth) is the horizontal angle between the median plane and the line connecting the point of observation to the target. The angle of elevation is the vertical angle between the hearing plane and the line from the point of observation to the target. The Cartesian and polar systems are shown together in Figure 2.
Fig. 2. Commonly used symbols and names in describing spatial hearing coordinates.

One advantage of the polar coordinate system over the Cartesian coordinate system is that it can be used in both Euclidean geometry and the spherical, non-Euclidean, geometry that is useful in describing relations between points on a closed surface such as a sphere. In auditory perception studies, two spherical systems of coordinates are used. They are referred to as the single-pole system and the two-pole system. Both are shown in Figure 3. The head-oriented single-pole system is analogous to the planetary coordinate system of longitudes and latitudes. In the two-pole system, both longitudes and latitudes are represented by series of parallel circles. The single-pole system is widely used in many fields of science. However, in this system the length of an arc between two angles of azimuth depends on elevation. The two-pole system makes the length of the arc between two angles of azimuth the same regardless of elevation. Though less intuitive, this system may be convenient for some types of data presentation (Knudsen, 1982; Makous & Middlebrooks, 1990).
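The relation between the two systems in Figure 2 is the usual polar-to-Cartesian mapping. A short sketch under our reading of the conventions above (x positive toward the right/east, y toward the front, z up; azimuth measured clockwise from the median plane and elevation from the hearing plane):

import numpy as np

def polar_to_cartesian(d, azimuth_deg, elevation_deg):
    az, el = np.radians(azimuth_deg), np.radians(elevation_deg)
    x = d * np.cos(el) * np.sin(az)      # left-right (positive = right)
    y = d * np.cos(el) * np.cos(az)      # back-front (positive = front)
    z = d * np.sin(el)                   # down-up (positive = up)
    return x, y, z

print(polar_to_cartesian(2.0, 90.0, 0.0))   # a source 2 m to the right: (2.0, ~0.0, 0.0)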
Fig. 3. Single-pole (left) and two-pole (right) spherical coordinate systems. Adapted from Carlile (1996).

Since both of these systems share the same concepts of azimuth and elevation, it is essential that the selection of the specific spherical coordinate system always be made explicit (Leong & Carlile, 1998). It should also be noted that there are two conventions for numerically labeling angular degrees that are used in the scientific literature: the 360° scheme and the ±180° scheme. There are also two possibilities for selecting the direction of positive angular change: clockwise (e.g., Tonning, 1970) or counterclockwise (e.g., Pedersen & Jorgensen, 2005). The use of two notational schemes is primarily a nuisance that necessitates data conversion in order to compare or combine data sets labeled with different schemes. However, converting angles that are expressed differently in the two schemes from one scheme to the other is just a matter of either adding or subtracting 360°. In the case of localization studies, where differences between angles are the primary consideration, the ±180° labeling scheme is overwhelmingly preferred. First, it is much simpler and more intuitive to use positive and negative angles to describe angular difference. Second, and more importantly, the direct summing and averaging of angular values can only be done with angles that are contained within a (numerically) continuous range of 180°, such as ±90°. If the 360° scheme is used, then angles to the left and right of 0° (the reference angle) cannot be directly added and must be converted into vectors and added using vector addition. Less clear is the selection of the positive and negative directions of angular difference. However, if the ±180° scheme is used, the absolute magnitude of angular values is the same regardless of directionality, which is another reason to prefer the ±180° scheme. Under the 360° scheme, the clockwise measurement of any angle other than 180° will have a different magnitude than that same angle measured counterclockwise, i.e., 30° in the clockwise direction is 330° in the counterclockwise direction. In mathematics (e.g., geometry) and physics (e.g., astronomy), a displacement in a counterclockwise direction is considered positive, and a displacement in a clockwise direction is considered negative. In geometry, the quadrants of the circle are ordered in a counterclockwise direction, and an angle is considered positive if it extends from the x axis in a counterclockwise direction. In astronomy, all the planets of our solar system, when observed from above the Sun, rotate and revolve around the Sun in a counterclockwise direction (except for the rotation of Venus). However, despite the scientific basis of the counterclockwise rule, the numbers on clocks and all circular measuring scales, including the compass, increase in a clockwise direction, effectively making it the positive direction. This convention is shown in Figure 2 and is accepted in this chapter. For locations that differ in elevation, the upward direction from a 0° reference point in front of the listener is normally considered the positive direction, and the downward direction is considered the negative direction.
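Both the scheme conversion and the signed angular difference reduce to a single wrap-around operation; a minimal sketch:

def wrap_pm180(angle_deg):
    # Map any angle label (e.g., from the 360° scheme) to the ±180° scheme
    return (angle_deg + 180.0) % 360.0 - 180.0

def angular_error(response_deg, target_deg):
    # Signed localization error in the ±180° scheme (positive = clockwise here)
    return wrap_pm180(response_deg - target_deg)

print(wrap_pm180(330.0))            # -30.0: 330° clockwise equals 30° counterclockwise
print(angular_error(350.0, 10.0))   # -20.0 rather than 340.0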
and is accepted in this chapter. For locations that differ in elevation, the upward direction from a 0° reference point in front of the listener is normally considered as the positive direction, and the downward direction is considered to be the negative direction.
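Because conversions between the two labeling schemes recur throughout the literature, a small helper is handy. The following Python sketch (the function names are ours, not from the chapter) wraps angles between the 360° and ±180° schemes:

```python
import numpy as np

def to_signed(angle_deg):
    """Map an angle from the 0-360 deg scheme to the +/-180 deg scheme."""
    return (angle_deg + 180.0) % 360.0 - 180.0

def to_unsigned(angle_deg):
    """Map an angle from the +/-180 deg scheme to the 0-360 deg scheme."""
    return angle_deg % 360.0

print(to_signed(330.0))    # -30.0: 330 deg clockwise equals -30 deg
print(to_unsigned(-30.0))  # 330.0
```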
4. Accuracy and precision of auditory localization
The human judgment of sound source location is a noisy process laden with judgment uncertainty, which leads to localization errors. Auditory localization error (LE) is the difference between the estimated and actual directions toward the sound source in space. This difference can be limited to the difference in azimuth or elevation or can include both (e.g., Carlile et al., 1997). The latter can be referred to as compound LE. Once the localization act is repeated several times, LE becomes a statistical variable. The statistical properties of this variable are generally described by spherical statistics due to the spherical/circular nature of angular values (θ and θ + 360° denote the same direction). However, if the angular judgments are within a ±90° range (as is often the case in localization judgments, after disregarding front-back reversals), the data distribution can be assumed to have a linear character, which greatly simplifies data analysis. Front-back errors should be extracted from the data set and analyzed separately in order to avoid an inflated localization error (Oldfield & Parker, 1984; Makous & Middlebrooks, 1990; Bergault, 1992; Carlile et al., 1997). Some authors (e.g., Wightman & Kistler, 1989) mirror the perceived reverse locations about the interaural axis prior to data analysis in order to preserve the sample size. However, this approach inflates the power of the resulting conclusions. Only under specific circumstances and with great caution should front-back errors be analyzed together with other errors (Fisher, 1987). The measures of linear statistics commonly used to describe the results of localization studies are discussed in Section 5. The methods of spherical (circular) statistical data analysis are discussed in Section 6. The linear distribution used to describe localization judgments, and in fact most human judgment phenomena, is the normal distribution, also known as the Gaussian distribution. It is a purely theoretical distribution, but it approximates distributions of human errors well; hence its common use in experiments with human subjects. In the case of localization judgments, this distribution reflects the random variability of the localizations while emphasizing the tendency of the localizations to be centered on some direction (ideally the true sound source direction) and to become (symmetrically) less likely the further away we move from that central direction. The normal distribution has the shape of a bell and is completely described in its ideal form by two parameters: the mean (µ) and the standard deviation (σ). The mean corresponds to the central value around which the distribution extends, and the standard deviation describes the range of variation. In particular, approximately 2/3 of the values (68.2%) will be within one standard deviation of the mean, i.e., within the range [μ − σ, μ + σ]. The mathematical formula and graph of the normal distribution are shown in Figure 4. Based on the above discussion, each set of localization judgments can be described by a specific normal distribution with a specific mean and standard deviation. Ideally, the mean of the distribution should correspond with the true sound source direction. However, any lack of symmetry in listener hearing or in the listening conditions may result in a certain bias in listener responses and cause a misalignment between the perceived location of the sound source and its actual location. Such bias is called constant error (CE).
Fig. 4. Normal distribution. Standard deviation (σ) is the range of variability around the mean value (μ ± σ) that accounts for approximately 2/3 of all responses.
Another type of error is introduced by both listener uncertainty/imprecision and random changes in the listening conditions. This error is called random error (RE). Therefore, LE can be considered as being composed of two error components with different underlying causes: constant error (CE) resulting from a bias in the listener and/or environment and random error (RE) resulting from the inherent variability of listener perception and listening conditions. If LE is described by a normal distribution, CE is given by the difference between the true sound source location and the mean of the distribution (xo), and RE is characterized by the standard deviation (σ) of the distribution. The concepts of CE and RE can be equated, respectively, with the concepts of accuracy and precision of a given set of measurements. The definitions of both these terms, along with common synonyms (although not always used correctly), are given below:
Accuracy (constant error, systematic error, validity, bias) is the measure of the degree to which the measured quantity is the same as its actual value.
Precision (random error, repeatability, reliability, reproducibility) is the measure of the degree to which the same measurement made repeatedly produces the same results.
The relationship between accuracy and precision and the normal distribution from Figure 4 is shown in Figure 5.
Fig. 5. Concepts of accuracy and precision in localization judgments.
Localization accuracy depends mainly on the symmetry of the auditory system of the listener, the type and behavior of the sound source, and the acoustic conditions of the surrounding space. It also depends on the familiarity of the listener with the listening conditions and on the non-acoustic cues available to the listener. For example, auditory localization accuracy is affected by eye position (Razavi et al., 2007). Some potential bias may also be introduced by the reported human tendency to misperceive the midpoint of the angular distance between two horizontally distinct sound sources. Several authors have reported the midpoint to be located 1° to 2° rightward (Cusak et al., 2001; Dufour et al., 2007; Sosa et al., 2010), although this shift may be modulated by listener handedness. For example, Ocklenburg et al. (2010) observed a rightward shift for left-handed listeners and a leftward shift for right-handed listeners. Localization precision depends primarily on fluctuations in the listener's attention, the type and number of sound sources, their location in space, and the acoustic conditions of the surrounding space. In addition, both localization accuracy and precision depend to a great degree on the data collection methodology (e.g., direct or indirect pointing, verbal identification, etc.). In general, the overall goodness-of-fit of the localization data to the true target location can be expressed in terms of error theory (Bolshev, 2002) as:
p(θ) = (1/2) × 1/(CE² + RE²)    (2)
5. Linear statistical measures
The two fundamental classes of measures describing probability distributions are measures of central tendency and measures of dispersion. Measures of central tendency, also known as measures of location, characterize the central value of a distribution. Measures of dispersion, also known as measures of spread, characterize how spread out the distribution is around its central value. In general, distributions are described and compared on the basis of a specific measure of central tendency in conjunction with a specific measure of spread. For the normal distribution, the mean (μ), a measure of central tendency, and the standard deviation (σ), a measure of dispersion, serve to completely describe (parametrize) the distribution. There is, however, no way of directly determining the true, actual values of these parameters for a normal distribution that has been postulated to characterize some population of judgments, measurements, etc. Thus these parameters must be estimated on the basis of a representative sample taken from the population. The sample arithmetic mean (xo) and the sample standard deviation (SD) are the standard measures used to estimate the mean and standard deviation of the underlying normal distribution. The sample mean and standard deviation are highly influenced by outliers (extreme values) in the data set. This is especially true for smaller sample sizes. Measures that are less sensitive to the presence of outliers are referred to as robust measures (Huber & Ronchetti, 2009). Unfortunately, many robust measures are not very efficient, which means that they require larger sample sizes for reliable estimates. In fact, for normally distributed data (without outliers), the sample mean and standard deviation are the most efficient estimators of the underlying parameters.
A very robust and relatively efficient measure of central tendency is the median (ME). A closely related measure of dispersion is the median absolute deviation (MEAD), which is also very robust but unfortunately also very inefficient. A more efficient measure of dispersion that is however not quite as robust is the mean absolute deviation (MAD). Note that the abbreviation "MAD" is used in other publications to refer to either of these two measures. The formulas for both the standard and robust sample measures discussed above are given below in Table 1. They represent the basic measures used in calculating LE when traditional statistical analysis is performed.

| Measure Name | Symbol | Definition/Formula | Comments |
|---|---|---|---|
| Arithmetic Mean | xo | xo = (1/n) Σᵢ xᵢ | |
| Standard Deviation | SD | SD = √[(1/n) Σᵢ (xᵢ − xo)²] | V (variance) = SD² |
| Median | ME | middle value of responses | |
| Median Absolute Deviation | MEAD | middle value of the absolute deviations from the median | |
| Mean Absolute Deviation | MAD | MAD = (1/n) Σᵢ ∣xᵢ − xo∣ | |
Table 1. Basic measures used to estimate the parameters of a normal distribution.
Strictly speaking, the sample median estimates the population median, which is the midpoint of the distribution, i.e., half the values (from the distribution) are below it and half are above it. The median together with the midpoints of the two halves of the distribution on either side of the median divide the distribution into four parts of equal probability. The three dividing points are called the 1st, 2nd, and 3rd quartiles (Q1, Q2 and Q3), with the 2nd quartile simply being another name for the median. Since the normal distribution is symmetric around its mean, its mean is also its median, and so the sample median can be used to directly estimate the mean of a normal distribution. The median absolute deviation of a distribution does not coincide with its standard deviation, thus the sample median absolute deviation does not give a direct estimate of the population standard deviation. However, in the case of a normal distribution, the median absolute deviation corresponds to the difference between the 3rd and 2nd quartiles, which is proportional to the standard deviation. Thus for a normal distribution the relationship between the standard deviation and the MEAD is given by (Goldstein & Taleb, 2007):
σ ≈ 1.4826 (Q3 − Q2) = 1.4826 · MEAD    (3)
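As a concrete illustration, the following Python sketch (the function name and sample data are hypothetical) estimates the parameters of an assumed normal distribution from a sample using the robust measures just described, rescaling the MEAD by 1.4826 per Eq. (3):

```python
import numpy as np

def robust_normal_estimates(x):
    """Estimate the mean and SD of an assumed normal distribution
    using the robust sample measures discussed above."""
    x = np.asarray(x, dtype=float)
    me = np.median(x)                    # sample median (estimates the mean)
    mead = np.median(np.abs(x - me))     # median absolute deviation
    sigma_mead = 1.4826 * mead           # Eq. (3): rescaled MEAD estimates sigma
    return me, sigma_mead

# Example: responses (deg) around a source at 0 deg, with one gross outlier
responses = [2.1, -3.4, 0.5, 1.8, -1.2, 0.9, 45.0]
print(robust_normal_estimates(responses))
```

Unlike the sample mean and SD, the estimates above are barely perturbed by the single 45° response.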
The SD is the standard measure of RE, while the standard measure of CE is the mean signed error (ME), also called mean bias error, which is equivalent to the difference between the sample mean of the localization data (xo) and the true location of the sound source. The unsigned, or absolute, counterpart to the ME, the mean unsigned error (MUE), is a measure of total LE, as it represents a combination of both the CE and the RE. The MUE was used, among others, by Makous and Middlebrooks (1990) in analyzing their data. Another error
measure that combines the CE and RE is the root mean squared error (RMSE). The relationship between these three measures is given by the following inequality, where n is the sample size (Willmott & Matsuura, 2005):

|ME| ≤ MUE ≤ RMSE ≤ √n · MUE.    (4)
The RE part of the RMSE is given by the sample standard deviation (SD), but the RE in the MUE does not in general correspond to any otherwise defined measure. However, if each localization estimate is shifted by the ME so as to make the CE equal to zero, the MUE of the data normalized in this way is reduced to the sample mean absolute deviation (MAD). Since the MAD is not affected by such a constant shift of the data, the MAD of the normalized data is equal to the MAD of the non-normalized localizations and so represents the RE of the localizations. Thus, the MAD is also a measure of RE. For a normal distribution, the standard deviation is proportional to the mean absolute deviation in the following ratio (Goldstein & Taleb, 2007):
σ = √(π/2) · MAD ≈ 1.253 · MAD    (5)
This means that for sufficiently large sample sizes drawn from a normal distribution, the normalized MUE (= MAD) will be approximately equal to 0.8 times the SD. The effect of sample size on the ratio between sample MAD and sample SD for samples from a normal distribution is shown below in Fig. 6.
Fig. 6. The standard deviation of the ratios between sample MAD and sample SD for 1000 simulated samples plotted against the size of the sample.
Note that unlike the RMSE, which is equal to the square root of the sum of the squares of the CE (ME) and RE (σ), the MUE is not expressible as a function of CE (ME) and RE (MAD). The formulas for the error measures are given below in Table 2. The formulas listed in Table 2 and the above discussion apply to normal or similar unimodal distributions. In the case of a multimodal data distribution, these measures are in general not applicable. However, if there are only a few modes that are relatively far apart, then these measures (or similar statistics) can be calculated for each of the modes using appropriate subsets of the data set. This is in particular applicable to the analysis of front-back errors, which tend to define a separate unimodal distribution.
| Measure Name | Symbol | Type | Definition/Formula | Comments |
|---|---|---|---|---|
| Mean Error (Mean Signed Error) | ME | CE | ME = (1/n) Σᵢ (xᵢ − η) = xo − η | |
| Mean Absolute Error (Mean Unsigned Error) | MUE | CE & RE | MUE = (1/n) Σᵢ ∣xᵢ − η∣ | ∣ME∣ ≤ MUE ≤ ∣ME∣ + MAD |
| Root-Mean-Squared Error | RMSE | CE & RE | RMSE = √[(1/n) Σᵢ (xᵢ − η)²] | RMSE² = ME² + SD² |
| Standard Deviation | SD | RE | SD = √[(1/n) Σᵢ (xᵢ − xo)²] | |
| Mean Absolute Deviation | MAD | RE | MAD = (1/n) Σᵢ ∣xᵢ − xo∣ | |
Table 2. Basic measures used to calculate localization error (η denotes the true location of the sound source).
There is a continuing debate in the literature as to what constitutes a front-back error. Most authors define front-back errors as any estimates that cross the interaural axis (Carlile et al., 1997; Wenzel, 1999). Other criteria include errors crossing the interaural axis by more than 10º (Schonstein, 2008) or 15º (Best et al., 2009) or errors that are within a certain angle after subtracting 180º. An example of the last case is using a ±20º range around the directly opposite angle (position), which corresponds closely to the range of typical listener uncertainty in the frontal direction (e.g., Carlile et al., 1997). The criterion proposed in this chapter is that only estimates exceeding a ±150º error should be considered nominal front-back errors. This criterion is based on a comparative analysis of location estimates made in anechoic and less than optimal listening conditions. The extraction and separate analysis of front-back errors should not be confused with the process of trimming the data set to remove outliers, even though they have the same effect. Front-back errors are not outliers in the sense that they simply represent extreme errors. They represent a different type of error that has a different underlying cause and as such should be treated differently. Any remaining errors exceeding ±90º may be trimmed (discarded) or winsorized to keep the data set within the ±90º range. Winsorizing is a strategy in which the extreme values are not removed from the sample, but rather are replaced with the maximal remaining values on either side. This strategy has the advantage of not reducing the sample size for statistical data analysis. Both these procedures mitigate the effects of extreme values and are a way of making the resultant sample mean and standard deviation more robust. The common primacy of the sample arithmetic mean and sample standard deviation for estimating the population parameters is based on the assumption that the underlying distribution is in fact perfectly normal and that the data are a perfect reflection of that distribution. This is frequently not the case with human experiments, which have numerous potential sources of data contamination. In general, this is evidenced by more values farther away from the mean than expected (heavier tails or greater kurtosis) and the presence of extreme values, especially for small data sets. Additionally, the true underlying
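The error measures of Table 2 are straightforward to compute. The following Python sketch (with made-up response data) implements them and checks the identity RMSE² = ME² + SD²:

```python
import numpy as np

def localization_errors(responses_deg, true_deg):
    """Compute the error measures of Table 2 for a set of azimuth
    judgments (all angles assumed to lie within a continuous +/-90 deg
    range, so linear statistics apply)."""
    x = np.asarray(responses_deg, dtype=float)
    eta = float(true_deg)
    xo = x.mean()
    me = xo - eta                                # constant error (CE)
    mue = np.abs(x - eta).mean()                 # total error (CE & RE)
    rmse = np.sqrt(((x - eta) ** 2).mean())      # total error (CE & RE)
    sd = np.sqrt(((x - xo) ** 2).mean())         # random error (RE)
    mad = np.abs(x - xo).mean()                  # random error (RE)
    assert abs(rmse**2 - (me**2 + sd**2)) < 1e-9  # RMSE^2 = ME^2 + SD^2
    return {"ME": me, "MUE": mue, "RMSE": rmse, "SD": sd, "MAD": mad}

print(localization_errors([4.0, -2.0, 7.0, 1.0, 3.0], true_deg=0.0))
```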
distribution may deviate slightly in other ways from the assumed normal distribution (Huber & Ronchetti, 2009). It is generally desired that a small number of inaccurate results should not overly affect the conclusions based on the data. Unfortunately, this is not the case with the sample mean and standard deviation. As mentioned earlier, the mean and, in particular, the standard deviation are quite sensitive to outliers (the inaccurate results). Their more robust counterparts discussed in this section are a way of dealing with this problem without having to specifically identify which results constitute the outliers, as is done in trimming and winsorizing. Moreover, the greater efficiency of the sample SD over the MAD disappears with only a few inaccurate results in a large sample (Huber & Ronchetti, 2009). Thus, since there is little chance of human experiments generating perfect data and a high chance of the underlying distribution not being perfectly normal, the use of more robust measures for estimating the CE (mean) and RE (standard deviation) may be recommended. It is also recommended that both components of localization error, CE and RE, always be reported individually. A single compound measure of error such as the RMSE or MUE is not sufficient for understanding the nature of the errors. These compound measures can be useful for describing total LE, but they should be treated with caution. Opinions as to whether RMSE or MUE provides the better characterization of total LE are divided. The overall goodness-of-fit measure given in Eq. 2 clearly uses RMSE as its base. Some authors also consider RMSE as "the most meaningful single number to describe localization performance" (Hartmann, 1983). However, others argue that MUE is a better measure than RMSE. Their criticism of RMSE is based on the fact that RMSE includes MUE but is additionally affected by the square root of the sample size and the distribution of the squared errors, which confounds its interpretation (Willmott & Matsuura, 2005).
6. Spherical statistics
The traditional statistical methods discussed above were developed for linear infinite distributions. These methods are in general not appropriate for the analysis of data having a spherical or circular nature, such as angles. The analysis of angular (directional) data requires statistical methods that are concerned with probability distributions on the sphere and circle. Only if the entire data set is restricted to a ±90º range can angular data be analyzed as if coming from a linear distribution. In all other cases, the methods of linear statistics are not appropriate, and the data analysis requires the techniques of a branch of statistics called spherical statistics. Spherical statistics, also called directional statistics, is a set of analytical methods specifically developed for the analysis of probability distributions on spheres. Distributions on circles (spheres in two dimensions) are handled by a subfield of spherical statistics called circular statistics. The fundamental reason that spherical statistics is necessary is that if the numerical difference between two angles is greater than 180°, then their linear average will point in the opposite direction from their actual mean direction. For example, the mean direction of 0° and 360° is actually 0°, but the linear average is 180°. Note that the same issue also occurs with the ±180° notational scheme (consider −150° and 150°). Since parametric statistical analysis relies on the summation of data, it is clear that something other than standard addition must serve as the basis for the statistical analysis of angular data. The simple solution comes from considering the angles as vectors of unit length and applying vector addition. The Cartesian coordinates X and Y of the mean vector for a set of vectors corresponding to a set of angles θ about the origin are given by:
X = (1/n) Σᵢ sin(θᵢ)    (6)

and

Y = (1/n) Σᵢ cos(θᵢ).    (7)
The direction θo of the mean vector is the mean angular direction of all the angles in the data set; since tan(θo) = X/Y, its calculation depends on the quadrant the mean vector is in:
θo = arctan(X/Y)         for Y > 0
θo = π + arctan(X/Y)     for Y < 0, X ≥ 0
θo = −π + arctan(X/Y)    for Y < 0, X < 0
θo = π/2                 for Y = 0, X ≥ 0
θo = −π/2                for Y = 0, X < 0    (8)
The magnitude of the mean vector is called the mean resultant length (R):
R = √(X² + Y²).    (9)
R is a measure of concentration, the opposite of dispersion, and plays an important role in defining the circular standard deviation. Its magnitude varies from 0 to 1, with R = 1 indicating that all the angles in the set point in the same direction. Note that R = 0 not only for a set of angles that are evenly distributed around the circle but also for one in which they are equally divided between two opposite directions. Thus, like the linear measures discussed in the previous section, R is most meaningful for unimodal distributions. One of the most significant differences between spherical statistics and linear statistics is that, due to the bounded range over which the distribution is defined, there is no generally valid counterpart to the linear standard deviation in the sense of a measure whose multiples delimit intervals of constant probability regardless of the measure's value. Clearly, as the circular standard deviation increases, fewer and fewer standard deviations are needed to cover the whole circle. The circular counterpart to the linear normal distribution is known as the von Mises distribution (Fisher, 1993):
f(θ, κ) = [1 / (2π I₀(κ))] e^(κ cos(θ − θo)),    (10)
where θo is the mean angle and I₀(κ) is the modified Bessel function of order 0. The κ parameter of the von Mises function is not a measure of dispersion, like the standard deviation, but, like R, a measure of concentration. At κ = 0, the von Mises distribution is equal to the uniform distribution on the circle, while at higher values of κ the distribution becomes more and more concentrated around its mean. As κ continues to increase above 1, the von Mises distribution begins to more and more closely resemble a wrapped normal distribution, which is a linear normal distribution that has been wrapped around the circle:
f(θ) = [1 / (σ√(2π))] Σₖ₌₋∞^∞ e^(−(θ − θo + 2πk)² / (2σ²)),    (11)
where θo and σ are the mean and standard deviation of the linear distribution. A reasonable approach to defining the circular standard deviation would be to base it on the wrapped normal distribution so that for a wrapped normal distribution it would coincide with the standard deviation of the underlying linear distribution. This can be accomplished due to the fact that for the wrapped normal distribution there is a direct relationship between the mean resultant length, R, and the underlying linear standard deviation
R = e^(−σ²/2).    (12)
The above equality provides the general definition of the circular standard deviation as:
σc = σ = √(−2 ln R).    (13)
The sample circular mean direction and sample circular standard deviation can be used to describe any circular data set drawn from a normal circular distribution. However, if the angular data are within ±90º, or within any other numerically continuous 180° range, then linear measures can still be used. Since standard addition applies, the linear mean can be calculated, and it will be equal to the circular mean angle. The linear standard deviation will also be almost identical to the circular standard deviation as long as the results are not overly dispersed. In fact, the relationship between the linear standard deviation and the circular standard deviation is not so much a function of the range of the data as of its dispersion. For samples drawn from a normal linear distribution, the two sample standard deviations begin to deviate slightly at about σ = 30°, but even at σ = 60° the difference is not too great for larger sample sizes. Results from a set of simulations in which the two sample standard deviations were compared for 500 samples of size 10 and 100 are shown in Fig. 6. The samples were drawn from linear normal distributions with standard deviations randomly selected in the range 1° ≤ σ ≤ 60°. So, for angular data that are assumed to come from a reasonably concentrated normal distribution, as would be expected in most localization studies, the linear standard deviation can be used even if the data spans the full 360°, as long as the mean is calculated as the circular mean angle. This does not mean, however, that localization errors greater than 120° (front-back errors) should not still be excluded from the data set for separate analysis. Once the circular mean has been calculated, the formulas in Table 2 in Section 5 can be used to calculate the circular counterparts to the other linear error measures. The determination of the circular median, and thus the MEAD, is in general a much more involved process. The problem is that there is in general no natural point on the circle from which to start ordering the data set. However, a defining property of the median is that for any data set the average absolute deviation from the median is less than for any other point. Thus, the circular median is defined on this basis. It is the (angle) point on the circle for which the average absolute deviation is minimized, with deviation calculated as the length of the shorter arc between each data point and the reference point. Note that a circular median does not necessarily always exist, as for example, for a data set that is uniformly distributed around the
circle (Mardia, 1972). If however, the range of the data set is less than 360° and has two clear endpoints, then the calculation of the median and MEAD can be done as in the linear case.
Fig. 6. Comparison of circular and linear standard deviations (plotted against each other over 0°-60°) for 500 samples of (a) small (n = 10) and (b) large (n = 100) size.
Two basic examples of circular statistics significance tests are the nonparametric Rayleigh z test and the Watson two-sample U² test. The Rayleigh z test is used to determine whether data distributed around a circle are sufficiently random to assume a uniform distribution. The Watson two-sample U² test can be used to compare two data distributions. Critical values for both tests and for many other circular statistics tests can be found in many advanced statistics books (e.g., Batschelet, 1981; Mardia, 1972; Zar, 1999; Rao and SenGupta, 2001). The special-purpose package Oriana (see http://www.kovcomp.co.uk) provides direct support for circular statistics, as do add-ons such as SAS macros (Kölliker, 2005), CircStat, a MATLAB toolbox for circular statistics (Berens, 2009), and CircStat for S-Plus, R, and Stata (e.g., Rao and SenGupta, 2001).
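The simulation behind Fig. 6 is easy to replicate. The sketch below is a rough reconstruction of the described setup (not the authors' original script): for each sample size it draws 500 normal samples with σ between 1° and 60° and compares the two standard deviations:

```python
import numpy as np

rng = np.random.default_rng(0)

def circular_sd_deg(angles_deg):
    """Circular standard deviation of Eq. (13), in degrees."""
    theta = np.radians(angles_deg)
    R = np.hypot(np.mean(np.sin(theta)), np.mean(np.cos(theta)))
    return np.degrees(np.sqrt(-2.0 * np.log(R)))

for n in (10, 100):
    diffs = []
    for _ in range(500):
        sigma = rng.uniform(1.0, 60.0)            # 1 deg <= sigma <= 60 deg
        sample = rng.normal(0.0, sigma, size=n)   # linear normal sample
        diffs.append(circular_sd_deg(sample) - np.std(sample))
    print(f"n = {n}: mean absolute difference = {np.mean(np.abs(diffs)):.2f} deg")
```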
7. Relative (discrimination) and categorical localization
The LE analysis conducted so far in this text was limited to the absolute identification of sound source locations in space. Two other types of localization judgments are relative judgments of sound source location (location discrimination) and categorical localization. The basic measure of relative localization acuity is the minimum audible angle (MAA). The MAA, or localization blur (Blauert, 1974), is the minimum detectable difference in azimuth (or elevation) between the locations of two identical but not simultaneous sound sources (Mills, 1958; 1972; Perrott, 1969). In other words, the MAA is the smallest perceptible difference in the position of a sound source. To measure the MAA, the listener is presented with two successive sounds coming from two different locations in space and is asked to determine whether the second sound came from the left or the right of the first one. The MAA is calculated as half the angle between the minimal positions to the left and right of the sound source that result in 75% correct response rates. It depends on both the frequency and the direction of arrival of the sound wave. For wideband stimuli and low frequency tones, the MAA is on the order of 1° to 2° for the frontal position, increases to 8-10° at 90° (Kuhn, 1987), and decreases again to 6-7° at the rear (Mills, 1958; Perrott, 1969; Blauert, 1974). For low frequency tones arriving from the frontal position, the MAA corresponds well with the difference limen (DL)
for ITD (~10 μs), and for high frequency tones, it matches well with the difference limen for IID (0.5-1.0 dB), both measured by earphone experiments. The MAA is largest for mid-high frequencies, especially for angles exceeding 40° (Mills, 1958; 1960; 1972). The vertical MAA is about 3-9° for the frontal position (e.g., Perrott & Saberi, 1990; Blauert, 1974). The MAA has frequently been considered to be the smallest attainable precision (difference limen) in absolute sound source localization in space (e.g., Hartmann, 1983; Hartmann & Rakerd, 1989; Recanzone et al., 1998). However, the precision of absolute localization judgments observed in most studies is generally much poorer than the MAA for the same type of sound stimulus. For example, the average error in absolute localization for a broadband sound source is about 5º for the frontal and about 20º for the lateral position (Hofman & Van Opstal, 1998; Langendijk et al., 2001). Thus, it is possible that the acuity of the MAA, where two sounds are presented in succession, and the precision of absolute localization, where only a single sound is presented, are not well correlated and measure two different human capabilities (Moore et al., 2008). This view is supported by results from animal studies indicating that some types of lesions in the brain affect the precision of absolute localization but not the acuity of the MAA (e.g., Young et al., 1992; May, 2000). In another set of studies, Spitzer and colleagues observed that barn owls exhibited different MAA acuity in anechoic and echoic conditions while displaying similar localization precision across both conditions (Spitzer et al., 2003; Spitzer & Takahashi, 2006). The explanation of these differences may be the difference in the cognitive tasks and the much greater difficulty of the absolute localization task. Another method of determining LE is to ask listeners to specify the sound source location by selecting from a set of specifically labeled locations. These locations can be indicated by either visible sound sources or special markers on the curtain covering the sound sources (Butler et al., 1990; Abel & Banerjee, 1996). Such approaches restrict the number of possible directions to the predetermined target locations and lead to categorical localization judgments (Perrett & Noble, 1995). The results of categorical localization studies are normally expressed as percentages of correct responses rather than angular deviations. The distance between the labeled target locations is the resolution of the localization judgments and describes the localization precision of the study. In addition, if the targets are only distributed across a limited region of the space, this may provide cues resolving potential front-back confusion (Carlile et al., 1997). Although categorical localization was the predominant localization methodology in older studies, it is still used in many studies today (Abel & Banerjee, 1996; Vause & Grantham, 1999; Van Hoesel & Clark, 1999; Macaulay et al., 2010). Additionally, the Source Azimuth Identification in Noise Test (SAINT) uses categorical judgments with a clock-like array of 12 loudspeakers (Vermiglio et al., 1998), and a standard system for testing the localization ability of cochlear implant users is categorical with 8 loudspeakers distributed in a symmetric manner in the horizontal plane in front of the listener with 15.5º of separation (Tyler & Witt, 2004).
In order to directly compare the results of a categorical localization study to an absolute localization study, it is necessary to extract a mean direction and standard deviation from the distribution of responses over the target locations. If the full distribution is known, then by treating each response as an indication of the actual angular position of the selected target location, the mean and standard deviation can be calculated as usual. If only the percent of correct responses is provided, then as long as the percent correct is over 50%, a normal distribution z-table (giving probabilities of a result being less than a given z-score) can be used to estimate the standard deviation. If d is the angle of target separation (i.e., the
angle between two adjacent loudspeakers), p the proportion of correct responses and z the z-score corresponding to (p + 1)/2, then the standard deviation is given by
σ = d / (2z)    (14)
and the mean by the angular position of the correct target location. This is based on the assumption that the correct responses are normally distributed over the range delimited by the points halfway between the correct loudspeaker and the two loudspeakers on either side. This range spans the angle of target separation (d), and thus d/2 is the corresponding z-score for the actual distribution. The relationship between the standard z-score and the z-score for a normal distribution N(μ, σ) is given by:

z_N(μ,σ) = μ + σ · z.    (15)
In this case, the mean, μ, is 0, as the responses are centered around the correct loudspeaker position, so solving for the standard deviation gives Equation 14. As an example, consider an array of loudspeakers separated by 15° and an 85% correct response rate for some individual speaker. The z-score for (1 + .85)/2 = .925 is 1.44, so the standard deviation is estimated to be 7.5°/1.44 = 5.2°. An underlying assumption in the preceding discussion is that the experimental conditions of the categorical judgment task are such that the listener is surrounded by evenly spaced target locations. If this is not the case, then the results for the extreme locations at either end may have been affected by the fact that there are no further locations. In particular, this is a problem when the location with the highest percent of responses is not the correct location and the distribution is not symmetric around it. For example, this appears to be the case for the speakers located at ±90° in the 30° loudspeaker arrangement used by Abel & Banerjee (1996).
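The conversion in Eq. (14) is easily automated. The Python sketch below uses SciPy's inverse normal CDF (the function name sd_from_percent_correct is ours, not from the literature) and reproduces the worked example above:

```python
from scipy.stats import norm

def sd_from_percent_correct(d_deg, p_correct):
    """Estimate the SD of localization responses from a categorical
    study, per Eq. (14): sigma = d / (2 z), with z = Phi^-1((p + 1) / 2)."""
    if p_correct <= 0.5:
        raise ValueError("Eq. (14) requires more than 50% correct responses")
    z = norm.ppf((p_correct + 1.0) / 2.0)  # standard normal z-score
    return d_deg / (2.0 * z)

# The worked example above: 15 deg spacing, 85% correct -> about 5.2 deg
print(round(sd_from_percent_correct(15.0, 0.85), 1))
```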
8. Summary
Judgments of sound source location, as well as the resultant localization errors, are angular (circular) variables and in general cannot be properly analyzed by the standard statistical methods that assume an underlying (infinite) linear distribution. The appropriate methods of statistical analysis are provided by the field of spherical or circular statistics for three- and two-dimensional angular data, respectively. However, if the directional judgments are relatively well concentrated around a central direction, the differences between the circular and linear measures are minimal, and linear statistics can effectively be used in lieu of circular statistics. The criteria under which the linear analysis of directional data is justified have been a focus of the present discussion. Some basic elements of circular statistics have also been presented to demonstrate the fundamental differences between the two types of data analysis. It has to be stressed that in both cases, it is important to differentiate front-back errors from other gross errors and analyze the front-back errors separately. Gross errors may then be trimmed or winsorized. Both the processing and interpretation of localization data become more intuitive and simpler when the ±180º scale is used for data representation instead of the 0-360º scale, although both scales can be successfully used. In order to meaningfully interpret overall localization error, it is important to individually report both the constant error (accuracy) and random error (precision) of the localization judgments. Error measures like root mean squared error and mean unsigned error represent
a specific combination of these two error components and do not on their own provide an adequate characterization of localization error. Overall localization error can be used to characterize a given set of results but does not give any insight into the underlying causes of the error. Since the overall purpose of this chapter was to provide information for the effective processing and interpretation of sound localization data, the initial part of the chapter was devoted to differentiating auditory spatial perception from auditory localization and to summarizing the basic terminology used in spatial perception studies and data description. This terminology is not always consistently used in the literature, and some standardization would be beneficial. In addition, prior to the discussion of circular data analysis, the most common measures used to describe directional data were compared, and their advantages and limitations indicated. It has been stressed that the standard statistical measures for assessing constant and random error are not robust measures, as they are quite susceptible to being overly influenced by extreme values in the data set. The robust measures discussed in this chapter are intended to provide a starting point for researchers unfamiliar with robust statistics. Given that localization studies, like many experiments involving human judgment, are apt to produce some number of outlying or inaccurate results, it may often be beneficial to utilize robust alternatives to the standard measures. In any case, researchers should be aware of this consideration. All of the above discussion was related to absolute localization judgments as the most commonly studied form of localization. Therefore, the last section of the chapter deals briefly with location discrimination and categorical localization judgments. The specific focus of this section was to indicate how results from absolute localization and categorical localization studies could be directly compared and what simplifying assumptions are made in carrying out these types of comparisons.
9. References
Abel, S.M. & Banerjee, P.J. (1996). Accuracy versus choice response time in sound localization. Applied Acoustics, 49, 405-417. APA (2007). APA Concise Dictionary of Psychology. American Psychology Association, ISBN 1-4338-0391-7, Washington (DC). Barron, M. & Marshall, A.H. (1981). Spatial impression due to early lateral reflections in concert halls: The derivation of a physical measure. Journal of Sound and Vibration, 77 (2), 211-232. Batschelet, E. (1981). Circular Statistics in Biology. Academic Press, ISBN 978-0120810505, New York (NY). Batteau, D.W. (1967). The role of the pinna in human localization. Proceedings of the Royal Society London. Series B: Biological Sciences, 168, 158-180. Berens, P. (2009). CircStat: A MATLAB Toolbox for Circular Statistics. Journal of Statistical Software, 31 (10), 1-21. Bergault, D.R. (1992). Perceptual effects of synthetic reverberation on three-dimensional audio systems. Journal of the Audio Engineering Society, 40 (11), 895-904. Best, V., Brungart, D., Carlile, S., Jin, C., Macpherson, E., Martin, R.L., McAnally, K.I., Sabin, A.T., & Simpson, B. (2009). A meta-analysis of localization errors made in the anechoic free field, Proceedings of the International Workshop on the Principles and Applications of Spatial Hearing (IWPASH). Miyagi (Japan): Tohoku University.
Blauert, J. (1974). Räumliches Hören. Stuttgart (Germany): S. Hirzel Verlag. (Available in English as Blauert, J. Spatial Hearing. Cambridge (MA): MIT, 1997.) Bloom, P.J. (1977). Determination of monaural sensitivity changes due to the pinna by use of the minimum-audible-field measurements in the lateral vertical plane. Journal of the Acoustical Society of America, 61, 820-828. Bolshev, L.N. (2002). Theory of errors. In: M. Hazewinkel (Ed.), Encyclopaedia of Mathematics. Springer Verlag, ISBN 1-4020-0609-8, New York (NY). Butler, R.A. & Belendiuk, K. (1977). Spectral cues utilized in the localization of sound in the median sagittal plane. Journal of the Acoustical Society of America, 61, 1264-1269. Butler, R.A., Humanski, R.A., & Musicant, A.D. (1990). Binaural and monaural localization of sound in two-dimensional space. Perception, 19, 241-256. Carlile, S. (1996). Virtual Auditory Space: Generation and Application. R. G. Landes Company, ISBN 978-1-57059-341-3, Austin (TX). Carlile, S., Leong, P., & Hyams, S. (1997). The nature and distribution of errors in sound localization by human listeners. Hearing Research, 114, 179-196. Cusak, R., Carlyon, R.P., & Robertson, I.H. (2001). Auditory midline and spatial discrimination in patients with unilateral neglect. Cortex, 37, 706-709. Dietz, M., Ewert, S.D., & Hohmann, V. (2010). Auditory model based direction estimation of concurrent speakers from binaural signals. Speech Communication (in print). Dufour, A., Touzalin, P., & Candas, V. (2007). Rightward shift of the auditory subjective straight ahead in right- and left-handed subjects. Neuropsychologia, 45, 447-453. Emanuel, D. & Letowski, T. (2009). Hearing Science. Lippincott, Williams, & Wilkins, ISBN 978-0781780476, Baltimore (MD). Fisher, N.I. (1987). Problems with the current definition of the standard deviation of wind direction. Journal of Climate and Applied Meteorology, 26, 1522-1529. Fisher, N.I. (1993). Statistical Analysis of Circular Data. Cambridge University Press, ISBN 9780521568906, Cambridge (UK). Goldstein, D.G. & Taleb, N.N. (2007). We don't quite know what we are talking about when we talk about volatility. Journal of Portfolio Management, 33 (4), 84-86. Griesinger, D. (1997). The psychoacoustics of apparent source width, spaciousness, and envelopment in performance spaces. Acustica, 83, 721-731. Griesinger, D. (1999). Objective measures of spaciousness and envelopment, Proceedings of the 16th AES International Conference on Spatial Sound Reproduction, pp. 1-15. Rovaniemi (Finland): Audio Engineering Society. Hartmann, W.M. (1983). Localization of sound in rooms. Journal of the Acoustical Society of America, 74, 1380-1391. Hartmann, W.M. & Rakerd, B. (1989). On the minimum audible angle – A decision theory approach. Journal of the Acoustical Society of America, 85, 2031-2041. Henning, G.B. (1974). Detectability of the interaural delay in high-frequency complex waveforms. Journal of the Acoustical Society of America, 55, 84-90. Henning, G.B. (1980). Some observations on the lateralization of complex waveforms. Journal of the Acoustical Society of America, 68, 446-454. Hofman, P.M. & Van Opstal, A.J. (1998). Spectro-temporal factors in two-dimensional human sound localization. Journal of the Acoustical Society of America, 103, 2634-2648. Houghton Mifflin (2007). The American Heritage Medical Dictionary. Orlando (FL): Houghton Mifflin Company. Huber, P.J. & Ronchetti, E. (2009). Robust Statistics (2nd Ed.). John Wiley & Sons, ISBN 978-0470-12990-6, Hoboken (NJ).
Illusion. (2010). In: Encyclopedia Britannica. Retrieved 16 September 2010 from Encyclopedia Britannica Online: http://search.eb.com/eb/article-46670 (Accessed 15 Sept 2010). Iwaya, Y., Suzuki, Y., & Kimura, D. (2003). Effects of head movement on front-back error in sound localization. Acoustical Science and Technology, 24 (5), 322-324. Jin, C., Corderoy, A., Carlile, S., & van Schaik, A. (2004). Contrasting monaural and interaural spectral cues for human sound localization. Journal of the Acoustical Society of America, 115, 3124-3141. Knudsen, E.I. (1982). Auditory and visual maps of space in the optic tectum of the owl. Journal of Neuroscience, 2 (9), 1177-1194. Kölliker, M. (2005). Circular statistics macros in SAS. Freely available online at http://www.evolution.unibas.ch/koelliker/misc.htm (Accessed 15 Sept 2010). Kuhn, G.F. (1987). Physical acoustics and measurements pertaining to directional hearing. In: W.A. Yost & G. Gourevitch (eds.), Directional Hearing, pp. 3-25. Springer Verlag, ISBN 978-0387964935, New York (NY). Langendijk, E., Kistler, D.J., & Wightman, F.L. (2001). Sound localization in the presence of one or two distractors. Journal of the Acoustical Society of America, 109, 2123-2134. Langendijk, E. & Bronkhorst, A.W. (2002). Contribution of spectral cues to human sound localization. Journal of the Acoustical Society of America, 112, 1583-1596. Leong, P. & Carlile, S. (1998). Methods for spherical data analysis and visualization. Journal of Neuroscience Methods, 80, 191-200. Lopez-Poveda, E.A. & Meddis, R. (1996). A physical model of sound diffraction and reflections in the human concha. Journal of the Acoustical Society of America, 100, 3248-3259. Macaulay, E.J., Hartmann, W.M., & Rakerd, B. (2010). The acoustical bright spot and mislocalization of tones by human listeners. Journal of the Acoustical Society of America, 127, 1440-1449. Makous, J. & Middlebrooks, J.C. (1990). Two-dimensional sound localization by human listeners. Journal of the Acoustical Society of America, 87, 2188-2200. Mardia, K.V. (1972). Statistics of Directional Data. Academic Press, ISBN 978-0124711501, New York (NY). May, B.J. (2000). Role of the dorsal cochlear nucleus in sound localization behavior in cats. Hearing Research, 148, 74-87. McFadden, D.M. & Pasanen, E. (1976). Lateralization of high frequencies based on interaural time differences. Journal of the Acoustical Society of America, 59, 634-639. Mills, A.W. (1958). On the minimum audible angle. Journal of the Acoustical Society of America, 30, 237-246. Mills, A.W. (1960). Lateralization of high-frequency tones. Journal of the Acoustical Society of America, 32, 132-134. Mills, A.W. (1972). Auditory localization. In: J. Tobias (Ed.), Foundations of Modern Auditory Theory, vol. 2 (pp. 301-345). New York (NY): Academic Press. Moore, B.C.J. (1989). An Introduction to the Psychology of Hearing (4th Ed.). Academic Press, ISBN 0-12-505624-9, San Diego (CA). Moore, J.M., Tollin, D.J., & Yin, T. (2008). Can measures of sound localization acuity be related to the precision of absolute location estimates? Hearing Research, 238, 94-109. Morfey, C.L. (2001). Dictionary of Acoustics. Academic Press, ISBN 0-12-506940-5, San Diego (CA). Morimoto, M. (2002). The relation between spatial impression and precedence effect, Proceedings of the 8th International Conference on Auditory Display (ICAD2002). Kyoto (Japan): ATR.
Musicant, A.D. & Butler, R.A. (1984). The influence of pinnae-based spectral cues on sound localization. Journal of the Acoustical Society of America, 75, 1195-1200. Ocklenburg, S., Hirnstein, M., Hausmann, M., & Lewald, J. (2010). Auditory space perception in left- and right-handers. Brain and Cognition, 72 (2), 210-217. Oldfield, S.R. & Parker, S.P.A. (1984). Acuity of sound localization: A topography of auditory space I. Normal hearing conditions. Perception, 13, 581-600. Pedersen, J.A. & Jorgensen, T. (2005). Localization performance of real and virtual sound sources, Proceedings of the NATO RTO-MP-HFM-123 New Directions for Improving Audio Effectiveness Conference, pp. 29-1 to 29-30. Neuilly-sur-Seine (France): NATO. Perrett, S. & Noble, W. (1995). Available response choices affect localization of sound. Perception and Psychophysics, 57, 150-158. Perrett, S. & Noble, W. (1997). The effect of head rotation on vertical plane sound localization. Journal of the Acoustical Society of America, 102, 2325-2332. Perrott, D.R. (1969). Role of signal onset in sound localization. Journal of the Acoustical Society of America, 45, 436-445. Perrott, D.R. & Saberi, K. (1990). Minimum audible angle thresholds for sources varying in both elevation and azimuth. Journal of the Acoustical Society of America, 87, 1728-1731. Pierce, A.H. (1901). Studies in Auditory and Visual Space Perception. Longmans, Green, and Co, ISBN 1-152-19101-2, New York (NY). Rao Jammalamadaka, S. & SenGupta, A. (2001). Topics in Circular Statistics. World Scientific Publishing, ISBN 9810237782, River Edge (NJ). Razavi, B., O'Neill, W.E., & Paige, G.D. (2007). Auditory spatial perception dynamically realigns with changing eye position. Journal of Neuroscience, 27 (38), 10249-10258. Recanzone, G.H., Makhamra, S., & Guard, D.C. (1998). Comparison of absolute and relative sound localization ability in humans. Journal of the Acoustical Society of America, 103, 1085-1097. Rogers, M.E. & Butler, R.A. (1992). The linkage between stimulus frequency and covert peak areas as it relates to monaural localization. Perception and Psychophysics, 52, 536-546. Schonstein, D., Ferre, L., & Katz, F.G. (2009). Comparison of headphones and equalization for virtual auditory source localization, Proceedings of the Acoustics'08 Conference. Paris (France): European Acoustics Association. Sosa, Y., Teder-Sälejärvi, W.A., & McCourt, M.E. (2010). Biases in spatial attention in vision and audition. Brain and Cognition, 73, 229-235. Spitzer, M.W., Bala, A., & Takahashi, T.T. (2003). Auditory spatial discrimination by barn owls in a simulated echoic environment. Journal of the Acoustical Society of America, 113, 1631-1645. Spitzer, M.W. & Takahashi, T.T. (2006). Sound localization by barn owls in a simulated echoic environment. Journal of Neurophysiology, 95, 3571-3584. Steinhauser, A. (1879). The theory of binaural audition. A contribution to the theory of sound. Philosophical Magazine (Series 5), 7, 181-197. Strutt, J.W. (Lord Rayleigh). (1876). Our perception of the direction of a source of sound. Nature, 7, 32-33. Strutt, J.W. (Lord Rayleigh). (1907). On our perception of sound direction. Philosophical Magazine (Series 5), 13, 214-232. Tonning, F.M. (1970). Directional audiometry. I. Directional white-noise audiometry. Acta Otolaryngologica, 72, 352-357.
Tyler, R.S. & Witt, S. (2004). Cochlear implants in adults: Candidacy. In: R.D. Kent (ed.), The MIT Encyclopedia of Communication Disorders, pp. 450-454. Cambridge (MA): MIT Press. Van Hoesel, R.M. & Clark, G.M. (1999). Speech results with a bilateral multi-channel cochlear implant subject for spatially separated signal and noise. Australian Journal of Audiology, 21, 23-28. Van Wanrooij, M.M. & Van Opstal, A.J. (2004). Contribution of head shadow and pinna cues to chronic monaural sound localization. Journal of Neuroscience, 24 (17), 4163-4171. Vause, N. & Grantham, D.W. (1999). Effects of earplugs and protective headgear on auditory localization ability in the horizontal plane. Journal of the Human Factors and Ergonomics Society, 41 (2), 282-294. Vermiglio, A., Nilsson, M., Soli, S., & Freed, D. (1998). Development of virtual test of sound localization: the Source Azimuth Identification in Noise Test (SAINT), Poster presented at the American Academy of Audiology Convention. Los Angeles (CA): AAA. Wallach, H. (1939). On sound localization. Journal of the Acoustical Society of America, 10, 270-274. Wallach, H. (1940). The role of head movements and the vestibular and visual cues in sound localization. Journal of Experimental Psychology, 27, 339-368. Watkins, A.J. (1978). Psychoacoustical aspects of synthesized vertical locale cues. Journal of the Acoustical Society of America, 63, 1152-1165. Wenzel, E.M. (1999). Effect of increasing system latency on localization of virtual sounds, Proceedings of the 16th AES International Conference on Spatial Sound Reproduction, pp. 1-9. Rovaniemi (Finland): Audio Engineering Society. White, G.D. (1987). The Audio Dictionary. University of Washington Press, ISBN 0-295-96527-4, Seattle (WA). Wightman, F.L. & Kistler, D.J. (1989). Headphone simulation of free field listening. II: Psychophysical validation. Journal of the Acoustical Society of America, 85, 868-878. Willmott, C.J. & Matsuura, K. (2005). Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Climate Research, 30, 79-82. Wilson, H.A. & Myers, C. (1908). The influence of binaural phase differences on the localization of sounds. British Journal of Psychology, 2, 363-385. Yost, W.A. & Gourevitch, G. (1987). Directional Hearing. Springer, ISBN 978-0387964935, New York (NY). Yost, W.A. & Hafter, E.R. (1987). Lateralization. In: W.A. Yost & G. Gourevitch (eds.), Directional Hearing, pp. 49-84. Springer, ISBN 978-0387964935, New York (NY). Yost, W.A., Popper, A.N., & Fay, R.R. (2008). Auditory Perception of Sound Sources. Springer, ISBN 978-0-387-71304-5, New York (NY). Young, P.T. (1931). The role of head movements in auditory localization. Journal of Experimental Psychology, 14, 95-124. Young, E.D., Spirou, G.A., Rice, J.J., & Voigt, H.F. (1992). Neural organization and response to complex stimuli in the dorsal cochlear nucleus. Philosophical Transactions of the Royal Society London B: Biological Sciences, 336, 407-413. Zahorik, P., Brungart, D.S., & Bronkhorst, A.W. (2005). Auditory distance perception in humans: A summary of past and present research. Acta Acustica, 91, 409-420. Zar, J.H. (1999). Biostatistical Analysis (4th ed.). Prentice Hall, ISBN 9780131008465, Upper Saddle River (NJ).
5
HRTF Sound Localization
Martin Rothbucher, David Kronmüller, Marko Durkovic, Tim Habigt and Klaus Diepold
Institute for Data Processing, Technische Universität München, Germany
1. Introduction
In order to improve interactions between the human (operator) and the robot (teleoperator) in human-centered robotic systems, e.g. telepresence systems as seen in Figure 1, it is important to equip the robotic platform with multimodal human-like sensing, e.g. vision, haptics and audition.
Fig. 1. Schematic view of the telepresence scenario, with the operator site and the teleoperator site separated by barriers.
Recently, robotic binaural hearing approaches based on Head-Related Transfer Functions (HRTFs) have become a promising technique to enable sound localization on mobile robotic platforms. Robotic platforms would benefit from this human-like sound localization approach because of its noise tolerance and the ability to localize sounds in a three-dimensional environment with only two microphones. As seen in Figure 2, HRTFs describe spectral changes of sound waves when they enter the ear canal, due to diffraction and reflection by the human body, i.e. the head, shoulders, torso and ears. In far-field applications, they can be considered as functions of two spatial variables (elevation and azimuth) and frequency. HRTFs can be regarded as direction-dependent filters, as the diffraction and reflection properties of the human body are different for each direction. Since
the geometric features of the body differ from person to person, HRTFs are unique for each individual (Blauert, 1997).
Fig. 2. HRTFs over varying azimuth and constant elevation.
The problem of HRTF-based sound localization on mobile robotic platforms can be separated into three main parts, namely the HRTF-based localization algorithms, the HRTF data reduction and the application of predictors that improve the localization performance. For robotic HRTF-based localization, an incoming sound signal is reflected, diffracted and scattered by the robot's torso, shoulders, head and pinnae, dependent on the direction of the sound source. Thus both the left and right perceived signals have been altered through the robot's HRTFs, which the robot has learned to associate with a specific direction. We have investigated several HRTF-based sound localization algorithms, which are compared in the first section. Due to its high dimensionality, it is inefficient to utilize the robot's original HRTFs. Therefore, the second section will provide a comparison of HRTF reduction techniques. Once the HRTF dataset has been reduced and restored, it serves as the basis for localization. HRTF localization is computationally very expensive; therefore, it is advantageous to reduce the search region for sound sources to a region of interest (ROI). Given an HRTF dataset, it is necessary to check the presence of each HRTF in the perceived signal individually. Simply applying a brute-force search will localize the sound source but may be inefficient. To improve upon this, a search region may be defined that determines which HRTF subset is to be searched and in what order the HRTFs are evaluated. The evaluation of the respective approaches is made by conducting comprehensive numerical experiments.
2. HRTF Localization Algorithms

In this section, we briefly describe four HRTF-based sound localization algorithms, namely the Matched Filtering Approach, the Source Cancellation Approach, the Reference Signal Approach and the Cross Convolution Approach. These algorithms return the position of the sound source using the recorded ear signals and a stored HRTF database. As illustrated in Figure 3, the unknown signal $S$ emitted from a source is filtered by the corresponding left and right HRTFs, denoted by $H_{L,i_0}$ and $H_{R,i_0}$, before being captured by a humanoid robot, i.e., the left and right microphone recordings $X_L$ and $X_R$ are constructed as

$$X_L = H_{L,i_0} \cdot S, \qquad X_R = H_{R,i_0} \cdot S. \quad (1)$$
The key idea of the HRTF-based localization algorithms is to identify the pair of HRTFs corresponding to the emitting position of the source, such that the correlation between the left and right microphone observations is maximized.
Fig. 3. Single-Source HRTF Model

2.1 Matched Filtering Approach
The Matched Filtering Approach seeks to reverse the $H_{R,i_0}$- and $H_{L,i_0}$-filtering of the unknown sound source $S$ as illustrated in Figure 3. A schematic view of the Matched Filtering Approach is given in Figure 4.

Fig. 4. Schematic view of the Matched Filtering Approach

The localization algorithm is based on the fact that filtering $X_L$ and $X_R$ with the inverses of the correct emitting HRTFs yields identical signals $\tilde{S}_{R,i}$ and $\tilde{S}_{L,i}$, i.e., the original mono sound signal $S$ in the ideal case:
$$\tilde{S}_{L,i} = H_{L,i}^{-1} \cdot X_L, \qquad \tilde{S}_{R,i} = H_{R,i}^{-1} \cdot X_R, \qquad \tilde{S}_{L,i} = \tilde{S}_{R,i} \iff i = i_0. \quad (2)$$

In the real case, the sound source can be localized by maximizing the cross-correlation between $\tilde{S}_{R,i}$ and $\tilde{S}_{L,i}$,

$$\arg\max_i \; \tilde{S}_{R,i} \oplus \tilde{S}_{L,i}, \quad (3)$$
where $i$ is the index of the HRTFs in the database and $\oplus$ denotes a cross-correlation operation. Unfortunately, the inversion of HRTFs can be problematic due to instability. This is mainly due to the linear-phase component of HRTFs responsible for encoding ITDs. Hence, a stable approximation of the unstable inverse must be made, retaining all direction-dependent information. One method is to use outer-inner factorization, converting an unstable inverse into an anti-causal and bounded inverse (Keyrouz et al., 2006).

2.2 Source Cancellation Algorithm
The Source Cancellation Algorithm is an extension of the Matched Filtering Approach. Equivalently to cross-correlating all pairs $X_L \cdot H_{L,i}^{-1}$ and $X_R \cdot H_{R,i}^{-1}$, the problem can be restated as a cross-correlation between all pairs $\frac{X_L}{X_R}$ and $\frac{H_{L,i}}{H_{R,i}}$. The improvement is that the ratio of HRTFs does not need to be inverted and can be precomputed and stored in memory (Keyrouz & Diepold, 2006; Usman et al., 2008).

$$\arg\max_i \; \left( \frac{X_L}{X_R} \oplus \frac{H_{L,i}}{H_{R,i}} \right) \quad (4)$$
2.3 Reference Signal Approach
Fig. 5. Schematic view of the Reference Signal Approach setup

This approach uses four microphones, as shown in Figure 5: two for the HRTF-filtered signals ($X_L$ and $X_R$) and two outside the ear canal for the original sound signals ($X_{L,out}$ and $X_{R,out}$). The previous algorithms used two microphones, each receiving the HRTF-filtered mono sound signals. The four captured signals are:

$$X_L = S \cdot H_L \quad (5)$$
$$X_R = S \cdot H_R \quad (6)$$
$$X_{L,out} = S \cdot \alpha \quad (7)$$
$$X_{R,out} = S \cdot \beta \quad (8)$$

$\alpha$ and $\beta$ represent time-delay and attenuation elements that occur due to the head's shadowing. From these signals three ratios are calculated: $\frac{X_L}{X_{L,out}}$ and $\frac{X_R}{X_{R,out}}$ are the left and right HRTFs, respectively, and $\frac{X_L}{X_R}$ is the ratio between the left and right HRTFs. The three ratios are then cross-correlated with the respective reference HRTFs (HRTF ratios in the case of $\frac{X_L}{X_R}$). The cross-correlation coefficients are summed, and the HRTF pair yielding the maximum sum

$$\arg\max_i \; \left( \frac{X_L}{X_{L,out}} \oplus H_{L,i} + \frac{X_L}{X_R} \oplus \frac{H_{L,i}}{H_{R,i}} + \frac{X_R}{X_{R,out}} \oplus H_{R,i} \right) \quad (9)$$
defines the incident direction (Keyrouz & Abou Saleh, 2007). The advantage of this system is that the HRTFs can be calculated directly while the original undistorted sound signals $X_{L,out}$ and $X_{R,out}$ are retained. Thus, the direction-dependent filter can alter the incident spectra without regard to the contained information, possibly allowing for better localization. However, the need for four microphones diverges from the concept of binaural localization, requiring more hardware and consequently higher cost.

2.4 Convolution Based Approach
To avoid the instability problem, this approach exploits the associative property of the convolution operator (Usman et al., 2008). Figure 6 illustrates the single-source cross-convolution localization approach. Namely, the left and right observations are filtered with a pair of contralateral HRTFs. The filtered observations turn out to be identical at the correct source position in the ideal case:

$$\tilde{S}_{L,i} = H_{R,i} \cdot X_L = H_{R,i} \cdot H_{L,i_0} \cdot S$$
$$\tilde{S}_{R,i} = H_{L,i} \cdot X_R = H_{L,i} \cdot H_{R,i_0} \cdot S$$
$$\tilde{S}_{L,i} = \tilde{S}_{R,i} \iff i = i_0. \quad (10)$$

Similar to the Matched Filtering Approach, the source can be localized in the real case by solving the following problem:

$$\arg\max_i \; \tilde{S}_{R,i} \oplus \tilde{S}_{L,i}. \quad (11)$$
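Since the Convolution Based Approach is the algorithm carried forward in Sections 3 and 4, a minimal numpy sketch of the search in (10)-(11) may be useful. The array layout and the energy normalization of the correlation score are our assumptions, not details specified in the chapter.

```python
import numpy as np

def localize_cross_convolution(x_left, x_right, hrir_left, hrir_right):
    """Cross-convolution search of Eqs. (10)-(11): filter each ear signal
    with the contralateral HRIR of every candidate direction i and pick
    the direction whose filtered pair correlates best.

    x_left, x_right  : 1-D recorded ear signals
    hrir_left/right  : HRIR database, shape (num_directions, num_taps)
    """
    best_i, best_score = -1, -np.inf
    for i in range(hrir_left.shape[0]):
        s_l = np.convolve(hrir_right[i], x_left)   # H_R,i * X_L
        s_r = np.convolve(hrir_left[i], x_right)   # H_L,i * X_R
        corr = np.correlate(s_r, s_l, mode="full")
        # normalize so that louder filtered pairs do not dominate the score
        score = corr.max() / (np.linalg.norm(s_l) * np.linalg.norm(s_r))
        if score > best_score:
            best_i, best_score = i, score
    return best_i
```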
2.5 Numerical Comparison
In this section, the previously described localization algorithms are compared by numerical simulations. We use the CIPIC database (Algazi et al., 2001) for our HRTF-based localization experiments. The spatial resolution of the database is 1250 sampling points ($N_e = 50$ in elevation and $N_a = 25$ in azimuth) and the HRIR length is 200 samples. In each experiment, generic and real-world test signals are virtually synthesized at the 1250 directions of the database, using the corresponding HRTFs. The algorithms are then used to localize the signals, and a localization success rate is computed. Noise robustness of the algorithms is investigated with different signal-to-noise ratios (SNRs) of the test signals. It should be noted that the testing of the localization performance is rigorous, meaning that we
do not apply any preprocessing to avoid, e.g., the instability of the HRTF inversion. The localization algorithms are implemented as described above. Figure 7 shows the achieved localization results of the simulation. The Convolution Based Algorithm, where no HRTF inversion has to be computed, outperforms the other algorithms in terms of noise robustness and localization success. Furthermore, the best localization results are achieved with white Gaussian noise sources, as these ideally cover the entire frequency spectrum. A more realistic sound source is music. It can be seen in Figure 7(d) that the localization performance is slightly degraded compared to the white Gaussian sound sources. The reason for this is that music generally does not occupy the entire frequency spectrum equally. Speech signals are even sparser than music, resulting in localization success rates worse than for music signals. Due to the results of the numerical comparison of the different HRTF-based localization algorithms, only the Convolution Based Approach will be utilized to evaluate HRTF data reduction techniques in Section 3 and predictors in Section 4.

Fig. 6. Schematic view of the cross-convolution approach
3. HRTF Data reduction techniques

In general, as illustrated in Figure 8, each HRTF dataset can be represented as a three-way array $\mathcal{H} \in \mathbb{R}^{N_a \times N_e \times N_t}$. The dimensions $N_a$ and $N_e$ are the spatial resolutions in azimuth and elevation, respectively, and $N_t$ is the time sample size. Using a Matlab-like notation, in this section we denote by $\mathcal{H}(i,j,k) \in \mathbb{R}$ the $(i,j,k)$-th entry of $\mathcal{H}$, by $\mathcal{H}(l,m,:) \in \mathbb{R}^{N_t}$ the vector with a fixed pair $(l,m)$, and by $\mathcal{H}(l,:,:) \in \mathbb{R}^{N_e \times N_t}$ the $l$-th slice (matrix) of $\mathcal{H}$ along the azimuth direction.

3.1 Principal Component Analysis (PCA)
Principal Component Analysis expresses high-dimensional data in a lower dimension, thus removing information yet retaining the critical features. PCA uses statistics to extract the adequately named principal components from a signal (in essence being the information that defines the target signal). The dimensionality reduction of HRIRs using PCA is described as follows. First of all, we construct the matrix

$$H := [\mathrm{vec}(\mathcal{H}(:,:,1)), \ldots, \mathrm{vec}(\mathcal{H}(:,:,N_t))]^\top \in \mathbb{R}^{N_t \times (N_a \cdot N_e)}, \quad (12)$$
Fig. 7. Comparison of HRTF-based sound localization algorithms: (a) Matched Filtering Approach, (b) Source Cancellation Approach, (c) Reference Signal Approach, (d) Convolution Based Approach.

where the operator vec(·) puts a matrix into vector form. Let $h_1, \ldots, h_{N_t}$ denote the rows of $H$, i.e., $h_i = \mathrm{vec}(\mathcal{H}(:,:,i))$. The mean of these rows is computed by

$$\mu = \frac{1}{N_t} \sum_{i=1}^{N_t} h_i. \quad (13)$$

After centering each row of $H$, i.e., computing $\tilde{H} = [\tilde{h}_1, \ldots, \tilde{h}_{N_t}]^\top \in \mathbb{R}^{N_t \times (N_a \cdot N_e)}$ where $\tilde{h}_i = h_i - \mu$ for $i = 1, \ldots, N_t$, the covariance matrix of $\tilde{H}$ is computed as

$$C := \frac{1}{N_t}\, \tilde{H} \tilde{H}^\top. \quad (14)$$

Fig. 8. HRIR dataset represented as a three-way array
Now we compute the eigenvalue decomposition of $C$ and select the $q$ eigenvectors $\{x_1, \ldots, x_q\}$ corresponding to the $q$ largest eigenvalues. Then, denoting $X = [x_1, \ldots, x_q] \in \mathbb{R}^{N_t \times q}$, the HRIR dataset can be reduced as

$$\hat{H} = X^\top \tilde{H} \in \mathbb{R}^{q \times (N_a \cdot N_e)}. \quad (15)$$

Note that the storage space for the reduced HRIR dataset depends on the value of $q$. Finally, to reconstruct the HRIR dataset one needs to compute

$$H_r = X \hat{H} + \mu \in \mathbb{R}^{N_t \times (N_a \cdot N_e)}. \quad (16)$$

We refer to (Jolliffe, 2002) for further discussions on PCA.
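The pipeline (12)-(16) is compact enough to sketch in numpy. This sketch assumes a row-major unfolding; Matlab's column-major vec(·) gives an equivalent result as long as reduction and reconstruction share the same layout.

```python
import numpy as np

def pca_reduce_hrir(H, q):
    """PCA compression of an HRIR tensor H (shape Na x Ne x Nt),
    following Eqs. (12)-(16): each row of the unfolded matrix is one
    time slice vec(H(:,:,k)) of length Na*Ne."""
    Na, Ne, Nt = H.shape
    Hm = H.reshape(Na * Ne, Nt).T        # Nt x (Na*Ne), Eq. (12)
    mu = Hm.mean(axis=0)                 # Eq. (13)
    Hc = Hm - mu                         # center each row
    C = Hc @ Hc.T / Nt                   # Eq. (14), Nt x Nt covariance
    w, V = np.linalg.eigh(C)             # eigenvalues in ascending order
    X = V[:, -q:]                        # q leading eigenvectors
    H_red = X.T @ Hc                     # Eq. (15), q x (Na*Ne)
    H_rec = X @ H_red + mu               # Eq. (16)
    return H_red, H_rec.T.reshape(Na, Ne, Nt)
```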
3.2 Tensor-SVD of three-way array
Fig. 9. Schematic view of the Tensor-SVD.

Unlike the PCA algorithm, which vectorizes the HRIR dataset, Tensor-SVD keeps the structure of the original 3D dataset intact. As shown in Figure 9, given an HRIR dataset $\mathcal{H} \in \mathbb{R}^{N_a \times N_e \times N_t}$, Tensor-SVD computes its best multilinear rank-$(r_a, r_e, r_t)$ approximation $\hat{\mathcal{H}} \in \mathbb{R}^{N_a \times N_e \times N_t}$, where $N_a > r_a$, $N_e > r_e$ and $N_t > r_t$, by solving the following minimization problem

$$\min_{\hat{\mathcal{H}} \in \mathbb{R}^{N_a \times N_e \times N_t}} \| \mathcal{H} - \hat{\mathcal{H}} \|_F, \quad (17)$$

where $\|\cdot\|_F$ denotes the Frobenius norm of tensors. The rank-$(r_a, r_e, r_t)$ tensor $\hat{\mathcal{H}}$ can be decomposed as a trilinear multiplication of a rank-$(r_a, r_e, r_t)$ core tensor $\mathcal{C} \in \mathbb{R}^{r_a \times r_e \times r_t}$ with three full-rank matrices $X \in \mathbb{R}^{N_a \times r_a}$, $Y \in \mathbb{R}^{N_e \times r_e}$ and $Z \in \mathbb{R}^{N_t \times r_t}$, which is defined by

$$\hat{\mathcal{H}} = (X, Y, Z) \cdot \mathcal{C}, \quad (18)$$

where the $(i,j,k)$-th entry of $\hat{\mathcal{H}}$ is computed by

$$\hat{\mathcal{H}}(i,j,k) = \sum_{\alpha=1}^{r_a} \sum_{\beta=1}^{r_e} \sum_{\gamma=1}^{r_t} x_{i\alpha}\, y_{j\beta}\, z_{k\gamma}\, \mathcal{C}(\alpha, \beta, \gamma). \quad (19)$$

Thus, without loss of generality, the minimization problem defined in (17) is equivalent to the following:

$$\min_{X,Y,Z,\mathcal{C}} \| \mathcal{H} - (X, Y, Z) \cdot \mathcal{C} \|_F, \quad \text{s.t. } X^\top X = I_{r_a},\ Y^\top Y = I_{r_e},\ Z^\top Z = I_{r_t}. \quad (20)$$

We refer to (Savas & Lim, 2008) for Tensor-SVD algorithms and further discussions.
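The best multilinear rank approximation itself requires the iterative Grassmannian methods of (Savas & Lim, 2008), but a truncated higher-order SVD, which such methods typically use as a starting point, already illustrates the decomposition (18)-(19). This is a sketch, not the chapter's exact algorithm:

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding of a 3-way array into a matrix."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def truncated_hosvd(H, ra, re, rt):
    """Quasi-optimal rank-(ra, re, rt) approximation of an HRIR tensor:
    leading left singular vectors of each unfolding give X, Y, Z; the
    core C follows from Eq. (18) with the transposed factors."""
    ranks = (ra, re, rt)
    factors = []
    for mode in range(3):
        U, _, _ = np.linalg.svd(unfold(H, mode), full_matrices=False)
        factors.append(U[:, :ranks[mode]])
    X, Y, Z = factors
    C = np.einsum('abc,ai,bj,ck->ijk', H, X, Y, Z)   # core tensor
    return X, Y, Z, C

def reconstruct(X, Y, Z, C):
    """H_hat = (X, Y, Z) . C, entrywise as in Eq. (19)."""
    return np.einsum('ijk,ai,bj,ck->abc', C, X, Y, Z)
```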
3.3 Generalized Low Rank Approximations of Matrices
Fig. 10. Schematic view of the Generalized Low Rank Approximations of Matrices

Similar to Tensor-SVD, GLRAM methods, shown in Figure 10, do not destroy the structure of the 3D tensor. Instead of compressing along all three directions as Tensor-SVD does, GLRAM methods work with two pre-selected directions of a 3D data array. Given an HRIR dataset $\mathcal{H} \in \mathbb{R}^{N_a \times N_e \times N_t}$, we assume to compress $\mathcal{H}$ in the first two directions. Then the task of GLRAM is to approximate the slices (matrices) $\mathcal{H}(:,:,i)$, for $i = 1, \ldots, N_t$, of $\mathcal{H}$ along the third direction by a set of low-rank matrices $\{X M_i Y^\top\} \subset \mathbb{R}^{N_a \times N_e}$, for $i = 1, \ldots, N_t$, where the matrices $X \in \mathbb{R}^{N_a \times r_a}$ and $Y \in \mathbb{R}^{N_e \times r_e}$ are of full rank, and the set of matrices $\{M_i\} \subset \mathbb{R}^{r_a \times r_e}$ with $N_a > r_a$ and $N_e > r_e$. This can be formulated as the following optimization problem

$$\min_{X, Y, \{M_i\}_{i=1}^{N_t}} \sum_{i=1}^{N_t} \| \mathcal{H}(:,:,i) - X M_i Y^\top \|_F, \quad \text{s.t. } X^\top X = I_{r_a},\ Y^\top Y = I_{r_e}. \quad (21)$$

Here, by abuse of notation, $\|\cdot\|_F$ denotes the Frobenius norm of matrices. Let us construct a 3D array $\mathcal{M} \in \mathbb{R}^{r_a \times r_e \times N_t}$ by assigning $\mathcal{M}(:,:,i) = M_i$ for $i = 1, \ldots, N_t$. The minimization problem defined in (21) can be reformulated in a Tensor-SVD style, i.e.,

$$\min_{X, Y, \mathcal{M}} \| \mathcal{H} - (X, Y, I_{N_t}) \cdot \mathcal{M} \|_F, \quad \text{s.t. } X^\top X = I_{r_a},\ Y^\top Y = I_{r_e}. \quad (22)$$

We refer to (Ye, 2005) for more details on GLRAM algorithms. GLRAM methods work on two pre-selected directions out of three. There are then in total three different combinations of directions to implement GLRAM on an HRIR dataset. The performance of GLRAM in different directions might vary significantly. This issue will be investigated and discussed in Section 3.5.
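Problem (21) is commonly solved by the alternating scheme of (Ye, 2005), updating X and Y in turn from eigenvalue problems. The following sketch assumes compression of the first two modes; initialization and stopping criteria are simplified relative to the reference:

```python
import numpy as np

def glram(H, ra, re, n_iter=20):
    """Alternating scheme for GLRAM on the first two modes:
    approximate each slice H[:, :, i] by X @ M_i @ Y.T (Eq. 21)."""
    Na, Ne, Nt = H.shape
    Y = np.eye(Ne)[:, :re]                 # simple initialization
    for _ in range(n_iter):
        ML = sum(H[:, :, i] @ Y @ Y.T @ H[:, :, i].T for i in range(Nt))
        _, X = np.linalg.eigh(ML)
        X = X[:, -ra:]                     # ra leading eigenvectors
        MR = sum(H[:, :, i].T @ X @ X.T @ H[:, :, i] for i in range(Nt))
        _, Y = np.linalg.eigh(MR)
        Y = Y[:, -re:]                     # re leading eigenvectors
    M = np.stack([X.T @ H[:, :, i] @ Y for i in range(Nt)], axis=2)
    return X, Y, M   # reconstruct slice i as X @ M[:, :, i] @ Y.T
```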
3.4 Diffuse Field Equalization (DFE)

A technique that provides good compression performance is diffuse field equalization. The technique reduces the number of samples per HRIR, yet retains the original characteristics. We define the matrix $H$ containing the HRTFs as

$$H := [\mathrm{vec}(\mathcal{H}(:,:,1)), \ldots, \mathrm{vec}(\mathcal{H}(:,:,N_t))] \in \mathbb{R}^{(N_a \cdot N_e) \times N_t}, \quad (23)$$
where the operator vec(·) puts a matrix into vector form. Let $H = [h_1, \ldots, h_{N_a \cdot N_e}]$, where each $h_i$ is one HRIR of length $N_t$. DFE removes the time delay at the beginning of each HRTF and then calculates the average power spectrum of all HRTFs, which is then deconvolved from each HRTF, thus removing direction-independent information. The average power $\bar{h}$ is computed by

$$\bar{h} = \mathcal{F}^{-1}\left\{ \frac{1}{N_a \cdot N_e} \sum_{i=1}^{N_a \cdot N_e} |\mathcal{F}\{h_i\}|^2 \right\}, \quad (24)$$

where $\mathcal{F}\{\cdot\}$ denotes the Fourier transform. Then, $\bar{h}$ is shifted circularly by half the kernel length:

$$\bar{h}_1 = \left[\bar{h}\left(\tfrac{N_t}{2}+1 \ldots N_t\right),\; \bar{h}\left(1 \ldots \tfrac{N_t}{2}\right)\right]. \quad (25)$$

The filter kernel $\bar{h}_1$ is inverted and minimum-phase reconstruction is applied, yielding $\bar{h}_1^{-1}$. The diffuse-field-equalized dataset is retrieved by

$$h_{DFE} = [(h_1 * \bar{h}_1^{-1}), \ldots, (h_{N_a \cdot N_e} * \bar{h}_1^{-1})]. \quad (26)$$

After retrieving the dataset $h_{DFE}$, the time delay samples at the beginning of each HRIR can be removed. To achieve higher compression of the dataset, samples at the end of each HRTF, which do not contain crucial direction-dependent information, can also be removed. For further information on DFE see (Moeller, 1992).
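One plausible numpy reading of (24)-(26) is sketched below; the homomorphic (real-cepstrum) minimum-phase reconstruction and the spectral regularization constant are our implementation choices, as the procedure in (Moeller, 1992) leaves such details open.

```python
import numpy as np

def min_phase_from_mag(mag):
    """Minimum-phase impulse response with the given FFT magnitude,
    via the real-cepstrum (homomorphic) method."""
    n = len(mag)
    cep = np.fft.ifft(np.log(mag + 1e-12)).real
    win = np.zeros(n)
    win[0] = 1.0
    win[1:(n + 1) // 2] = 2.0          # fold cepstrum onto causal part
    if n % 2 == 0:
        win[n // 2] = 1.0
    return np.fft.ifft(np.exp(np.fft.fft(cep * win))).real

def diffuse_field_equalize(hrirs):
    """DFE sketch per Eqs. (24)-(26); hrirs has shape (Na*Ne, Nt)."""
    Nt = hrirs.shape[1]
    avg_pow = np.mean(np.abs(np.fft.fft(hrirs, axis=1)) ** 2, axis=0)
    h_bar = np.fft.ifft(avg_pow).real                      # Eq. (24)
    h1 = np.roll(h_bar, Nt // 2)                           # Eq. (25)
    # minimum-phase inverse of the common (diffuse-field) kernel
    h1_inv = min_phase_from_mag(1.0 / (np.abs(np.fft.fft(h1)) + 1e-12))
    return np.array([np.convolve(h, h1_inv)[:Nt] for h in hrirs])  # Eq. (26)
```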
3.5 Numerical Comparison

In this section, PCA, GLRAM, Tensor-SVD and Diffuse Field Equalization are applied to an HRTF-based sound localization problem, in order to evaluate the performance of these methods for data reduction. In each experiment, the left- and right-ear KEMAR HRTFs are reduced with one of the introduced reduction methods. A test signal (white noise) is virtually synthesized using the corresponding original HRTF. The convolution-based sound localization algorithm described in Section 2.4 is fed with the restored databases and used to localize the signals. Finally, the localization success rate is computed. As already mentioned, GLRAM works on two preselected directions out of three. Therefore, we conduct localization experiments for a subset of directions (35 randomly chosen locations) to detect a combination of well-working parameters for GLRAM. After finding a suitable combination of the variables, localization experiments for all 1250 directions are conducted. Firstly, the dataset is reduced in the first two directions, i.e., elevation and azimuth. The contour plot given in Figure 11(a) shows the localization success rate for each pair of values (Nra, Nre). Similar results with respect to the pairs (Nra, Nrt) and (Nre, Nrt) are plotted in Figure 11(b) and Figure 11(c), respectively. Clearly, applying GLRAM to the pair (Nre, Nrt) outperforms the other two combinations. The application of GLRAM in the directions of elevation and time performs best; therefore, we compare this optimal GLRAM with the standard PCA and Tensor-SVD. As mentioned in Section 3.3, GLRAM is a simple form of Tensor-SVD that leaves one direction out. Thus, we investigate the effect of additionally reducing the third direction, whereas the dimensions in elevation and time are fixed to the parameters of the optimal GLRAM. Figure 13 shows that additionally decreasing the dimension in azimuth leads to a huge loss of localization accuracy. After determining the optimal parameters for GLRAM, the simulations are conducted for all 1250 directions of the CIPIC dataset. Figure 12 shows the localization success rate as a function of the compression rate for GLRAM and PCA. It can be seen that an optimized GLRAM outperforms the standard PCA in terms of compression.
Fig. 11. Contour plots of localization success rate using GLRAM in different settings: (a) GLRAM on (azimuth, elevation); (b) GLRAM on (azimuth, time); (c) GLRAM on (elevation, time).
Fig. 12. Comparison between DFE, PCA and GLRAM
4. Predictors for HRTF sound localization

To reduce the computational costs of HRTF-based sound localization, especially for moving sound sources, it is advantageous to determine a region of interest (ROI), as illustrated in Figure 14. A ROI constricts the 3D search space around the robotic platform, leading to a reduced set of eligible HRTFs. Various tracking models have been implemented in microphone sound localization. Primarily, they predict the path of a sound source as it is traveling and thus acquire faster, more accurate and non-ambiguous localization results (Belcher et al., 2003; Ward et al., 2003). Most of these filters are updated periodically in scans. In this section, three predictors, namely Time
Delay of Arrival, the Kalman filter and the Particle filter, are briefly introduced; each determines a ROI that reduces the set of eligible HRTFs to be processed when localizing moving sound sources.

4.1 Time Delay of Arrival
The time delay between two signals $x_i[n]$ and $x_j[n]$ is found where the cross-correlation value $R_{ij}(\tau)$ is maximal. Given that $\tau$ has been determined, the time delay is calculated by

$$\Delta T = \frac{\tau}{f_s}, \quad (27)$$
where $f_s$ is the sampling rate. Knowing the geometry (distance between the robot's ears) of the microphones and the delays between microphone pairs, a number of locations for the sound source can be disregarded (Brandstein & Ward, 2001; Kwok et al., 2005; Potamitis et al., 2004; Valin et al., 2003). Then, an HRTF-based localization algorithm only evaluates the remaining possible locations of the source.
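A minimal sketch of this TDOA estimate follows; the max_lag bound on the search, set from the microphone spacing, is our addition rather than part of the chapter's description.

```python
import numpy as np

def tdoa(x_i, x_j, fs, max_lag):
    """Estimate the time delay between two microphone signals from the
    peak of their cross-correlation (Eq. 27); max_lag restricts the
    search to physically possible delays."""
    r = np.correlate(x_i, x_j, mode="full")
    lags = np.arange(-len(x_j) + 1, len(x_i))   # lag of each entry of r
    keep = np.abs(lags) <= max_lag
    tau = lags[keep][np.argmax(r[keep])]
    return tau / fs
```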
4.2 Kalman Filter

The Kalman filter is a frequently used predictor (its usage for microphone array localization is described in (Belcher et al., 2003)). The discrete version exhibits two main stages: time update (prediction) and measurement update (correction). The Kalman filter predicts the state $x_k$ at time $k$ given the linear stochastic difference equation

$$x_k = A x_{k-1} + B u_{k-1} + w_{k-1} \quad (28)$$

and measurement

$$z_k = H x_k + v_k. \quad (29)$$

The matrices $A$, $B$ and $H$ provide the relation from discrete time $k-1$ to $k$ for their respective variables $x$ (the state) and $u$ (an optional control input); $w$ and $v$ add noise to the model. A set of time and measurement update equations is used to predict the next state (Kalman, 1960). The state vector is defined by the current location coordinates $x$ and $y$ and the velocity components $v_x$ and $v_y$ (Potamitis et al., 2004; Usman et al., 2008). Note that here the predictor is applied to two-dimensional space:

$$x = [x, v_x, y, v_y]^T \quad (30)$$

An unreliable location estimate during the initialization of the Kalman filter may be a source of error. To improve upon this, particle filters have been implemented in (Chen & Rui, 2004).
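A sketch of one predict-correct cycle for the state (30) under a constant-velocity model; the noise levels q and r, and the assumption that the localizer supplies a (x, y) position measurement, are illustrative choices.

```python
import numpy as np

def kalman_predict_update(x, P, z, dt, q=1e-2, r=1e-1):
    """One cycle of a constant-velocity Kalman filter for the planar
    state x = [x, vx, y, vy]^T of Eq. (30); z is the measured (x, y)
    position from the HRTF localizer."""
    A = np.array([[1, dt, 0, 0],
                  [0, 1,  0, 0],
                  [0, 0,  1, dt],
                  [0, 0,  0, 1]], float)
    H = np.array([[1, 0, 0, 0],
                  [0, 0, 1, 0]], float)
    Q, R = q * np.eye(4), r * np.eye(2)
    # time update (prediction) -> center of the next region of interest
    x_pred = A @ x
    P_pred = A @ P @ A.T + Q
    # measurement update (correction)
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(4) - K @ H) @ P_pred
    return x_pred, x_new, P_new
```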
Fig. 13. Localization success rate by Tensor-SVD
Fig. 14. Schematic view of the application of predictors in HRTF-based localization.

4.3 Particle Filter
The particle filter is based on the idea of randomly generating samples from a distribution and assigning a weight to each sample to define its reliability. The particles and their associated weights define an averaged center, which is the predicted value for the next step. Each weight $w_k^i$ is associated with a particle $x^i$ in iteration $k$. A set of $N$ particles is initially drawn from a distribution $q(x^i | x_{k-1}^i, z_k)$, with $z_k$ being the current observed value. For each particle, the weight is calculated by

$$w_k^i = w_{k-1}^i\, \frac{p(z_k | x_k^i)\, p(x_k^i | x_{k-1}^i)}{q(x_k^i | x_{0:k-1}^i, z_{1:k})}. \quad (31)$$

Once all weights are calculated, their sum is normalized. To determine the predicted value, the weighted average of the particles is taken:

$$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} w_k^i \cdot x^i \quad (32)$$

Over time it may occur that very few particles possess most of the weight. This case requires resampling to protect against particle degeneration. The variance of the weights is used as a measure to check for this case and, if required, the set of weights is exchanged with a better approximation (Gordon et al., 1993). Many particle filter variations exist, such as Monte Carlo approximations and Sampling Importance Resampling. However, a particle filter may find only a local optimum and thus never reach the global optimum. Evolutionary estimation is proposed in (Kwok et al., 2005) to overcome such problems. Initially, a set of potential speaker locations is estimated and then a heuristic search is performed. The speaker locations are called chromosomes and can only move within a defined region. After the initialization, the Time Delay of Arrival (TDOA) is evaluated for each potential location as well as each microphone. The difference $v_i$ between the expected and actual TDOAs is used to define a fitness function for each chromosome $i$, together with the error variance $\sigma_\tau^2$:

$$\omega_i = e^{-0.5 \frac{v_i^2}{\sigma_\tau^2}} \quad (33)$$

$\omega_i$ is then scaled such that $\sum_{i=1}^n \omega_i = 1 \rightarrow \bar{\omega}_i$. The new estimate of the source location is given by

$$s_x = \sum_{i=1}^n \bar{\omega}_i\, s_{x_i}. \quad (34)$$
Fig. 15. Comparison of predictors for HRTF Sound Localization: (a) Time Delay of Arrival, (b) Particle Filter, (c) Kalman Filter.
Chromosomes are then selected according to a linearly spaced pointer spanning the fitness magnitude scale, with higher-fitness chromosomes being selected more often. The latter chromosomes receive less mutation compared to weaker chromosomes, depending on $r_g$, the zero-mean Gaussian random number variance, and $d_m$, the distance for mutation (Kwok et al., 2005).

$$s_{x_{i+1}} = s_{x_i} + r_g\, d_m \quad (35)$$
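A generic sampling-importance-resampling step corresponding to (31)-(32) is sketched below. Using the transition prior as the proposal and an effective-sample-size resampling trigger are standard simplifications, not prescriptions from the text; with normalized weights, the 1/N factor of (32) is absorbed. The transition and likelihood callables are problem-specific placeholders.

```python
import numpy as np

def sir_step(particles, weights, z, transition, likelihood):
    """One sampling-importance-resampling step (cf. Eqs. 31-32).
    transition(p) draws the next states; likelihood(z, p) evaluates
    p(z_k | x_k^i) for all particles at once."""
    particles = transition(particles)
    weights = weights * likelihood(z, particles)        # Eq. (31), q = prior
    weights = weights / weights.sum()
    x_bar = (weights[:, None] * particles).sum(axis=0)  # weighted mean, Eq. (32)
    # resample when the effective sample size collapses (degeneration)
    if 1.0 / np.sum(weights ** 2) < 0.5 * len(weights):
        idx = np.random.choice(len(weights), len(weights), p=weights)
        particles = particles[idx]
        weights = np.full(len(weights), 1.0 / len(weights))
    return particles, weights, x_bar
```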
4.4 Numerical comparison

This section gives a performance overview of the applied predictors in an HRTF-based sound localization scenario. We simulate moving sound by virtually synthesizing a sound source (white noise) using different pairs of HRTFs. This way, a random path of 500 different source positions is generated, simulating a moving sound source. Then, Time Delay of Arrival, the Kalman filter and the Particle filter seek to reduce the search region for the HRTF-based sound localization to a region of interest. The Convolution Based Algorithm is utilized to localize the moving sound source. The experiments were conducted three times with different speeds of the sound source. Figure 15 summarizes the results of applying predictors to HRTF-based sound localization. The left plots show the localization success rates as a function of the size of the region of interest. In the right plots, the number of directions that have to be evaluated within the localization algorithms is shown. The bigger the region of interest, the more HRTF pairs have to be utilized to maximize the cross-correlation (11), resulting in a higher processing time. On the other hand, the smaller the region of interest, the higher the danger of excluding the HRTF pair that maximizes the cross-correlation (11), leading to false localization results. Our simulation results show that the number of HRTFs to be evaluated by the Convolution Based Algorithm can be significantly reduced to speed up HRTF-based localization for moving sources. Time Delay of Arrival reduces the search region to 500 directions while reaching one hundred percent correct localization of the path, meaning all 500 source positions are detected correctly for the different speeds of the sources. The Particle and Kalman filters are able to reduce the search region to 130 directions in the case of sound sources with a speed of 20 deg/s. For slower sources, only 60 directions need to be taken into account.
Acknowledgements

This work was fully supported by the German Research Foundation (DFG) within the collaborative research center SFB-453 "High Fidelity Telepresence and Teleaction".
5. References

Algazi, V. R., Duda, R. O., Thompson, D. M. & Avendano, C. (2001). The CIPIC HRTF database, IEEE ASSP Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, NY, pp. 21–24.
Belcher, D., Grimm, M. & Kroschel, K. (2003). Speaker tracking with a microphone array using a Kalman filter, Advances in Radio Science 1: 113–117.
Blauert, J. (1997). An introduction to binaural technology, Binaural and Spatial Hearing, R. Gilkey, T. Anderson, Eds., Lawrence Erlbaum, Hillsdale, NJ, USA, pp. 593–609.
Brandstein, M. & Ward, D. (2001). Microphone Arrays - Signal Processing Techniques and Applications, Springer.
Chen, Y. & Rui, Y. (2004). Real-time speaker tracking using particle filter sensor fusion, Proceedings of the IEEE 92(3): 485–494.
Gordon, N., Salmond, D. & Smith, A. (1993). Novel approach to nonlinear/non-Gaussian Bayesian state estimation, Radar and Signal Processing, IEE Proceedings F 140(2): 107–113.
Jolliffe, I. T. (2002). Principal Component Analysis, second edn, Springer.
Kalman, R. (1960). A new approach to linear filtering and prediction problems, Transactions of the ASME - Journal of Basic Engineering 82(Series D): 35–45.
Keyrouz, F. & Abou Saleh, A. (2007). Intelligent sound source localization based on head-related transfer functions, IEEE International Conference on Intelligent Computer Communication and Processing, pp. 97–104.
Keyrouz, F. & Diepold, K. (2006). An enhanced binaural 3D sound localization algorithm, 2006 IEEE International Symposium on Signal Processing and Information Technology, pp. 662–665.
Keyrouz, F., Diepold, K. & Dewilde, P. (2006). Robust 3D robotic sound localization using state-space HRTF inversion, IEEE International Conference on Robotics and Biomimetics, 2006. ROBIO'06, pp. 245–250.
Kwok, N., Buchholz, J., Fang, G. & Gal, J. (2005). Sound source localization: microphone array design and evolutionary estimation, IEEE International Conference on Industrial Technology, pp. 281–286.
Moeller, H. (1992). Fundamentals of binaural technology, Applied Acoustics 36(3-4): 171–218.
Potamitis, I., Chen, H. & Tremoulis, G. (2004). Tracking of multiple moving speakers with multiple microphone arrays, IEEE Transactions on Speech and Audio Processing 12(5): 520–529.
Savas, B. & Lim, L. (2008). Best multilinear rank approximation of tensors with quasi-Newton methods on Grassmannians, Technical Report LITH-MAT-R-2008-01-SE, Department of Mathematics, Linköpings University.
Usman, M., Keyrouz, F. & Diepold, K. (2008). Real time humanoid sound source localization and tracking in a highly reverberant environment, Proceedings of 9th International Conference on Signal Processing, Beijing, China, pp. 2661–2664.
Valin, J., Michaud, F., Rouat, J. & Letourneau, D. (2003). Robust sound source localization using a microphone array on a mobile robot, IEEE/RSJ International Conference on Intelligent Robots and Systems, Vol. 2.
Ward, D. B., Lehmann, E. A. & Williamson, R. C. (2003). Particle filtering algorithms for tracking an acoustic source in a reverberant environment, IEEE Transactions on Speech and Audio Processing 11(6): 826–836.
Ye, J. (2005). Generalized low rank approximations of matrices, Machine Learning 61(1-3): 167–191.
6 Effect of Space on Auditory Temporal Processing with a Single-Stimulus Method Martin Roy, Tsuyoshi Kuroda and Simon Grondin Université Laval, Québec Canada
1. Introduction

The exact nature of the relation between space and time is certainly one of the most fundamental issues in physics (Buccheri, Saniga, & Stuckey, 2003), but it is also an intriguing question for experimental psychologists (Casasanto, Fotakopoulou, & Boroditsky, 2010). A function of perception is to form mental representations indicating what object exists, where it is located, and how it acts, i.e., how the object moves in space with the lapse of time. Space and time are integrated in the perceptual system to cause the perception of motion and speed, and such integration is required to determine the performance of the motor system (e.g., hand movement; see Lee, 2000). How space and time exert mutual influence is a question that was addressed many years ago (Abe, 1935; Helson, 1930), notably by J. Piaget, who studied the ontogenesis of the relations between time, distance and speed (Piaget, 1955). Time perception has often been explained with the "internal-clock hypothesis," which is worth noting when discussing the perceptual relation between space and time. An internal clock is usually assumed to be a pacemaker-counter device, with the first module emitting pulses accumulated by the second one (Grondin, 2001, 2010). The amount of accumulation determines the perceived time duration. The performance level varies, however, when variations of nontemporal factors are introduced in experiments. This variability, in a duration discrimination task for instance, can be observed by varying the time intervals' structure (filled or empty; Grondin, 1993), or by varying the sensory modality to be stimulated. Space is a nontemporal factor that is liable to affect the performance level of the internal clock. There are two illusions concerning the perceptual relation between space and time, which have been studied since the early 20th century (see Jones & Huang, 1982; Sarrazin, Giraudo, & Pittenger, 2007; ten Hoopen, Miyauchi, & Nakajima, 2008). The tau effect typically takes place in the successive presentation of three signals, say, X, Y, and Z, with Y somewhere between X and Z (Helson, 1930; Helson & King, 1931; Henry, McAuley, & Zaleha, 2009). They are delivered from different sources spaced at equal intervals, resulting in two equal intervals in space, X-Y and Y-Z. These intervals are perceived as unequal in their distance, however, if the signals are presented at unequal intervals in time; if the time interval defined by X and Y is shorter (longer) than the time interval defined by Y and Z, the spatial distance between X and Y is perceived as shorter (longer) than the spatial distance between Y and Z. In other words, the spatial-interval ratio is perceived as if it were similar to the time-interval ratio. Such interaction between space and time can be caused in the opposite direction with
the same signal configuration, i.e., the time-interval ratio is perceived as if it were similar to the spatial-interval ratio. This opposite-direction effect was named the kappa effect (Cohen, Hansel, & Sylvester, 1953; Price-Williams, 1954). The kappa effect in the auditory mode was investigated in the present study. The kappa effect has been tested more often in the visual mode (Cohen, Hansel, & Sylvester, 1953, 1955; Collyer, 1977; Miyatani, 1984-1985; Sarrazin, Giraudo, Pailhous, & Bootsma, 2004), and even in the tactile mode (Goldreich, 2007; Suto, 1952, 1955, 1957). There are studies testing the kappa effect in the auditory mode, but most of them focused on the effects of frequency distance (difference), instead of spatial distance, on the perception of time duration (Cohen, Hansel, & Sylvester, 1954; Henry & McAuley, 2009; Jones & Huang, 1982; Shigeno, 1986; Yoblick & Salvendy, 1970). In a typical case, three successive signals differed in their frequency, causing two intervals in time and in frequency, and the time-interval ratio was perceived as if it had been similar to the frequency-interval ratio. There is little evidence for the occurrence of the kappa effect in the perception of space and time in the auditory mode (Sarrazin, Giraudo, & Pittenger, 2007; see Ouellet, 2003). The kappa effect indicates that time duration increases perceptually in proportion to the spatial distance between two signals, but this effect has been demonstrated with three successive signals, where the two intervals are bounded on each other. Few studies have examined whether or not a similar effect can take place when a single interval is presented. It is important in this context to indicate that there are cases where time duration is perceived as shorter when spatial distance is increased in the visual mode (Guay & Grondin, 2001). This result was observed in an experiment employing a single-stimulus method, where a categorization judgment was conducted after the presentation of one interval. The interval was defined by two signals delivered from different sources, which were selected from three sources (above, middle and below) located in front of participants on the same vertical plane. All location pairs were presented in random order within each block. The interval was more often perceived as shorter when it was marked by the above and below sources, in comparison with intervals marked by the above and middle sources or the middle and below sources. The purpose of the present study was to verify whether space exerts influence on time perception (1) when intervals to be measured perceptually are marked by sounds delivered from sources having different distances between them, and (2) when these intervals are presented according to a single-stimulus method.
2. Method

Participants
Twelve 19- to 26-year-old volunteer students at Université Laval (six females and six males) with no hearing problems participated in this experiment. They were paid CAN $20 for their participation.

Apparatus and stimuli
A time interval was defined by two sound stimuli of 20 ms. The stimuli were 1-kHz sinusoidal sounds generated by an IBM PC running E-Prime software (version 1.1.4.1 - SP3). The computer was equipped with an SB Audigy 2 sound card, and the stimuli were delivered by Logitech Z-640 loudspeakers. Participants pressed "1" or "3" on the computer keyboard to indicate that the interval was short or long, respectively.
Procedure
The single-stimulus method was employed (Allan, 1979; Morgan, Watamaniuk, & McKee, 2000), i.e., each trial consisted of presenting one interval. The duration of the time intervals was controlled as follows: eight values of time-interval duration were distributed around a midpoint value called the base duration. The four values below the base duration were called the "short" durations, and the four values above the base duration were called the "long" durations. There were two base-duration conditions, 125 and 250 ms. In the former case, the "short" intervals lasted 104, 110, 116 and 122 ms, and the "long" intervals 128, 134, 140 and 146 ms. In the latter case, the "short" intervals lasted 208, 220, 232 and 244 ms, and the "long" intervals 256, 268, 280 and 292 ms. The participants were asked to judge whether the presented interval belonged to the "short" or to the "long" category. A 1.5-s feedback signal was presented on the computer screen immediately after the response, indicating whether the response was correct or not. There were two conditions of spatial distance between the auditory sources (loudspeakers), 1.1 m and 3.3 m (see Figure 1), and there were two conditions of the direction of stimulus presentation, right to left and left to right. Each participant completed eight sessions, four for the 125-ms base duration and four for the 250-ms base duration. Six participants completed the 125-ms sessions before the 250-ms sessions, and six completed the 250-ms sessions before the 125-ms sessions. The four sessions in each base duration corresponded to the four spatial conditions (2 distances x 2 directions), and they were carried out in random order. Each session had six blocks of 64 trials where the eight intervals were presented eight times in random order; thus, 48 responses were obtained for each interval in each spatial condition.

Data analysis
The two direction conditions were collapsed, resulting in four conditions in the data analysis (2 distance and 2 base-duration conditions). For each participant and for each condition, an 8-point psychometric function was traced, plotting the time interval duration on the x-axis and the "long" response proportion on the y-axis. Each point on the psychometric function was based on 96 presentations. The pseudo-logistic model (Killeen, Fetterman, & Bizo, 1997) was employed to calculate psychometric functions that were fitted to the resulting curves. Two indices of performance were estimated from each psychometric function, one for sensitivity and one for perceived duration. As an indicator of temporal sensitivity, one standard deviation (SD) on each psychometric function was employed. Using one SD (or variance) is a common procedure to express temporal sensitivity (Grondin, 2008; Grondin, Roussel, Gamache, Roy, & Ouellet, 2005; Killeen & Weiss, 1987). The other parameter was the temporal bisection point (BP). In the context of the kappa effect, this dependent variable is the most important. The BP can be defined as the x value corresponding to the 0.50 proportion on the y-axis. The observed shift of the BP for different conditions can be interpreted as an indication of differences in perceived duration. If an interval is perceived as longer, the "long" response occurs more frequently, which causes a downward shift of the BP. If an interval is perceived as shorter, the "long" response occurs less frequently, which causes an upward shift of the BP.
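The chapter fits the pseudo-logistic model of Killeen et al. (1997); the sketch below uses a plain logistic instead, with made-up response proportions, simply to illustrate how the BP, the SD and the derived Constant Error and Coefficient of Variation fall out of a fitted psychometric function.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(t, bp, s):
    """P('long') as a function of interval duration t;
    bp is the bisection point (P = .50), s the scale parameter."""
    return 1.0 / (1.0 + np.exp(-(t - bp) / s))

# the eight 125-ms-condition durations; proportions are illustrative only
durations = np.array([104, 110, 116, 122, 128, 134, 140, 146], float)
p_long = np.array([0.05, 0.12, 0.28, 0.45, 0.58, 0.75, 0.90, 0.96])

(bp, s), _ = curve_fit(logistic, durations, p_long, p0=[125.0, 5.0])
sd = s * np.pi / np.sqrt(3)   # SD of a logistic distribution with scale s
ce = bp - 125.0               # Constant Error = BP minus base duration
cv = sd / bp                  # Coefficient of Variation = SD / BP
print(f"BP = {bp:.1f} ms, CE = {ce:+.1f} ms, SD = {sd:.1f} ms, CV = {cv:.3f}")
```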
3. Results

Figure 2 reports the grouped psychometric function for each of the four experimental conditions: 2 Distances x 2 Base Durations. In order to allow direct comparisons between the
base duration conditions, two dependent variables were calculated from the above parameters. One is the Constant Error, which is the BP minus the base duration. The other is the Coefficient of Variation, which is the SD divided by the BP. Figure 3 shows the results for the Constant Error. Essentially, it reveals higher values in the 3.3-m condition than in the 1.1-m condition. A 2 x 2 ANOVA with repeated measures revealed that both the distance effect, F(1,11) = 10.55, p < .01, ηp² = .49, and the base duration effect, F(1,11) = 4.91, p < .05, ηp² = .31, were significant. The interaction effect was also significant, F(1,11) = 8.36, p < .05, ηp² = .43. Figure 4 shows the results for the Coefficient of Variation. The 2 x 2 ANOVA with repeated measures revealed that both the distance effect, F(1,11) = 7.23, p < .05, ηp² = .40, and the base duration effect, F(1,11) = 19.80, p < .001, ηp² = .64, were significant. The interaction effect was not significant, F(1,11) = .60, p = .45, ηp² = .05.
4. Discussion

The results of the present experiment clearly indicate that increasing the distance between the sound sources marking time intervals leads to a decrease of the perceived duration (a higher constant error). These results are inconsistent with what is usually reported when referring to the kappa effect, but consistent with results obtained in the visual mode with a single-stimulus method (Guay & Grondin, 2001). Other results linking space and time in the auditory mode revealed no such effect of distance between sound sources when sequences of four sounds from four sources were used (Ouellet, 2003). The present experiment also revealed that increasing the distance between the markers' sources results in a higher coefficient of variation. The present results can be explained on the basis of the internal-clock hypothesis, where the accumulation process is controlled by an attentional mechanism, with more attention to time resulting in a higher accumulation of pulses (Grondin & Macar, 1992; Grondin & Plourde, 2007; Macar, Grondin, & Casini, 1994). When the two stimuli were farther away from each other in space, more attentional resources were allocated to their location perception (Mondor & Zattore, 1995; Rhodes, 1987; Roussel, Grondin, & Killeen, 2009), which caused a decrease of the resources allocated to time perception. There were fewer accumulated pulses in the counter of the internal clock, and thus the time duration was perceived as shorter. This explanation is also consistent with the results obtained with the coefficient of variation. Allocating more resources to spatial perception caused more variance (more categorization errors – a higher coefficient of variation) in the observers' judgments. Finally, the results also revealed higher coefficients of variation in the 125-ms base duration than in the 250-ms base duration. This finding is consistent with a generalized form of Weber's law applied to time perception, in which sensory noise (nontemporal noise due to attention disturbance) causes more damage to performance with briefer intervals.
5. Acknowledgements

This research was made possible by a grant awarded to SG by the Natural Sciences and Engineering Research Council of Canada (NSERC). We would like to thank Marie-Claude Simard and Karine Drouin for their help with data collection. Correspondence should be addressed to Simon Grondin, École de psychologie, 2325 rue des Bibliothèques, Université Laval, Québec, Qc, Canada, G1V 0A6 (E-mail:
[email protected])
Fig. 1. Experimental set-up. Four loudspeakers faced the participant: the interior pair (speakers #2 and #3) was separated by d1 = 1.1 m and the exterior pair (speakers #1 and #4) by d2 = 3.3 m; responses were given on a keyboard.
Fig. 2. Psychometric function (grouped data) in each experimental condition.
Fig. 3. Mean Constant Error as a function of distance between auditory markers. (Bars are standard errors)
Fig. 4. Mean Coefficient of Variation as a function of distance between auditory markers. (Bars are standard errors)
6. References

Abe, S. (1935). Experimental study on the co-relation between time and space. Tohoku Psychologica Folia, 3, 53–68.
Allan, L. G. (1979). The perception of time. Perception & Psychophysics, 26, 340-354.
Buccheri, R., Saniga, M., & Stuckey, W. M. (Eds.). (2003). The Nature of Time: Geometry, Physics and Perception. Dordrecht, Netherlands: Kluwer Academic Publishers.
Casasanto, D., Fotakopoulou, O., & Boroditsky, L. (2010). Space and time in the child's mind: Evidence for a cross-dimensional asymmetry. Cognitive Science, 34, 387-405.
Cohen, J., Hansel, C. E. M., & Sylvester, J. D. (1953). A new phenomenon in time judgment. Nature, 172, 901.
Cohen, J., Hansel, C. E. M., & Sylvester, J. D. (1954). Interdependence of temporal and auditory judgments. Nature, 174, 642-644.
Cohen, J., Hansel, C. E. M., & Sylvester, J. D. (1955). Interdependence in judgments of space, time and movement. Acta Psychologica, 11, 360-372.
Collyer, C. E. (1977). Discrimination of spatial and temporal intervals defined by three light flashes: Effects of spacing on temporal judgments and of timing on spatial judgments. Perception & Psychophysics, 21, 357-364.
Goldreich, D. (2007). A Bayesian perceptual model replicates the cutaneous rabbit and other tactile spatiotemporal illusions. PLoS ONE, 2, e333.
Grondin, S. (1993). Duration discrimination of empty and filled intervals marked by auditory and visual signals. Perception & Psychophysics, 54, 383-394.
Grondin, S. (2001). From physical time to the first and second moments of psychological time. Psychological Bulletin, 127, 22-44.
Grondin, S. (2008). Methods for studying psychological time. In S. Grondin (Ed.), Psychology of Time (pp. 51-74). Bingley, UK: Emerald Group Publishing.
Grondin, S. (2010). Timing and time perception: A review of recent behavioral and neuroscience findings and theoretical directions. Attention, Perception, & Psychophysics, 72, 561-582.
Grondin, S., & Macar, F. (1992). Dividing attention between temporal and nontemporal tasks: A performance operating characteristic -POC- analysis. In F. Macar, V. Pouthas, & W. J. Friedman (Eds.), Time, Action, Cognition: Towards Bridging the Gap (pp. 119-128). Dordrecht, Netherlands: Kluwer Academic Publishers.
Grondin, S., & Plourde, M. (2007). Discrimination of time intervals presented in sequences: Spatial effects with multiple auditory sources. Human Movement Science, 26, 702-716.
Grondin, S., Roussel, M.-È., Gamache, P.-L., Roy, M., & Ouellet, B. (2005). The structure of sensory events and the accuracy of time judgments. Perception, 34, 45-58.
Guay, I., & Grondin, S. (2001). Influence on time interval categorization of distance between markers located on a vertical plane. In E. Sommerfeld, R. Kompass, & T. Lachman (Eds.), Proceedings of the 17th Annual Meeting of the International Society for Psychophysics (pp. 391-396). Berlin, Germany: Pabst Science Publishers.
Helson, H. (1930). The tau effect: An example of psychological relativity. Science, 71, 536–537.
Helson, H., & King, S. M. (1931). The tau effect: An example of psychological relativity. Journal of Experimental Psychology, 14, 202–217.
Henry, M. J., & McAuley, J. D. (2009). Evaluation of an imputed pitch velocity model of the auditory kappa effect. Journal of Experimental Psychology: Human Perception and Performance, 35, 551-564.
Henry, M. J., McAuley, J. D., & Zaleha, M. (2009). Evaluation of an imputed pitch velocity model of the auditory tau effect. Attention, Perception & Psychophysics, 71, 1399-1413.
Jones, B., & Huang, Y. L. (1982). Space-time dependencies in psychophysical judgment of extent and duration: Algebraic models of the tau and kappa effect. Psychological Bulletin, 91, 128-142.
Killeen, P. R., Fetterman, J. G., & Bizo, L. A. (1997). Time's cause. In C. M. Bradshaw & E. Szabadi (Eds.), Time and Behavior: Psychological and Neurobehavioral Analyses (pp. 79-131). Amsterdam, Netherlands: North-Holland/Elsevier Science.
Killeen, P. R., & Weiss, N. A. (1987). Optimal timing and the Weber function. Psychological Review, 94, 455-468.
Lee, D. (2000). Learning of spatial and temporal patterns in sequential hand movements. Cognitive Brain Research, 9, 35-39.
Macar, F., Grondin, S., & Casini, L. (1994). Controlled attention sharing influences time estimation. Memory & Cognition, 22, 673-686.
Miyatani (1984-1985). The time and distance judgments at different levels of discriminability of temporal and spatial information. Hiroshima Forum for Psychology, 10, 45-55.
Mondor, T. A., & Zattore, R. J. (1995). Shifting and focusing auditory spatial attention. Journal of Experimental Psychology: Human Perception and Performance, 21, 387-409.
Morgan, M. J., Watamaniuk, S. N. J., & McKee, S. P. (2000). The use of an implicit standard for measuring discrimination thresholds. Vision Research, 40, 2341-2349.
Ouellet, B. (2003). L'influence de la distance entre des marqueurs statiques sur la discrimination d'intervalles temporels : À la recherche de l'effet kappa classique en modalité auditive. Unpublished master's dissertation, Université Laval, Québec, Canada.
Piaget, J. (1955). The development of time concepts in the child. In P. H. Hoch & J. Zubin (Eds.), Psychopathology of Childhood (pp. 34-44). New York, USA: Grune & Stratton.
Price-Williams, D. R. (1954). The kappa effect. Nature, 173, 363-364.
Rhodes, G. (1987). Auditory attention and the representation of spatial information. Perception & Psychophysics, 42, 1-14.
Roussel, M.-È., Grondin, S., & Killeen, P. (2009). Spatial effects on temporal categorization. Perception, 38, 748-762.
Sarrazin, J.-C., Giraudo, M.-D., Pailhous, J., & Bootsma, R. J. (2004). Dynamics of balancing space and time in memory: Tau and kappa effects revisited. Journal of Experimental Psychology: Human Perception and Performance, 30, 411-430.
Sarrazin, J.-C., Giraudo, M.-D., & Pittenger, J. B. (2007). Tau and kappa effects in physical space: The case of audition. Psychological Research, 71, 201-218.
Shigeno, S. (1986). The auditory tau and kappa effects for speech and nonspeech stimuli. Perception & Psychophysics, 40, 9-19.
Suto, Y. (1952). The effect of space on time estimation (S-effect) in tactual space. I. Japanese Journal of Psychology, 22, 45-57.
Suto, Y. (1955). The effect of space on time estimation (S-effect) in tactual space. II: The role of vision in the S effect upon skin. Japanese Journal of Psychology, 26, 94-99.
Suto, Y. (1957). Role of apparent distance in time perception. Research Reports of Tokyo Electrical Engineering College, 5, 73-82.
ten Hoopen, G., Miyauchi, R., & Nakajima, Y. (2008). Time-based illusions in the auditory mode. In S. Grondin (Ed.), Psychology of Time (pp. 139-188). Bingley, UK: Emerald Group Publishing.
Yoblick, D. A., & Salvendy, G. (1970). Influence of frequency on the estimation of time for auditory, visual, and tactile modalities: The kappa effect. Journal of Experimental Psychology, 86, 157-164.
Part 2 Sound Localization Systems
7 Sound Source Localization Method Using Region Selection Yong-Eun Kim1, Dong-Hyun Su2, Chang-Ha Jeon2, Jae-Kyung Lee2, Kyung-Ju Cho3 and Jin-Gyun Chung2 1Korea
Automotive Technology Institute in Chonan, 2Chonbuk National University in Jeonju, 3Korea Association Aids to Navigation in Seoul, Korea
1. Introduction

There are many applications that would be aided by the determination of the physical position and orientation of users. Some of these applications include service robots, video conferencing, intelligent living environments, security systems and speech separation for hands-free communication devices (Coen, 1998; Wax & Kailath, 1983; Mungamuru & Aarabi, 2004; Sasaki et al., 2006; Lv & Zhang, 2008). As an example, without information on the spatial location of users in a given environment, it would not be possible for a service robot to react naturally to the needs of the user. To localize a user, sound source localization techniques are widely used (Nakadai et al., 2000; Brandstein & Ward, 2001; Cheng & Wakefield, 2001; Sasaki et al., 2006). Sound localization is the process of determining the spatial location of a sound source based on multiple observations of the received sound signals. Current sound localization techniques are generally based upon the idea of computing the time difference of arrival (TDOA) information with microphone arrays (Knapp & Carter, 1976; Brandstein & Silverman, 1997). An efficient method to obtain the TDOA information between two signals is to compute the cross-correlation of the two signals. The computed correlation values give the point at which the two signals from separate microphones are at their maximum correlation. When only two isotropic (i.e., not directional as in the mammalian ear) microphones are used, the system experiences the front-back confusion effect: the system has difficulty in determining whether the sound is originating from in front of or behind the system. A simple and efficient method to overcome this problem is to incorporate more microphones (Huang et al., 1999). Various weighting functions or pre-filters such as Roth, SCOT, PHAT, the Eckart filter and HT can be used to increase the performance of time difference estimation (Knapp & Carter, 1976). However, the performance improvement is achieved with the penalty of large power consumption and hardware overhead, which may not be suitable for the implementation of portable systems such as service robots. In this chapter, we propose an efficient sound source localization method under the assumption that three isotropic microphones are used to avoid the front-back confusion
effect. By the proposed approach, the region from 0° to 180° is divided into three regions and only one of the three regions is selected for the sound source localization. Thus, a considerable amount of computation time and hardware cost can be reduced. In addition, the estimation accuracy is improved due to the proper choice of the selected region.
2. Sound localization using TDOA

If a signal emanating from a remote sound source is monitored at two spatially separated sensors in the presence of noise, the two monitored signals can be mathematically modeled as
$$x_1(t) = s_1(t) + n_1(t), \qquad x_2(t) = \alpha s_1(t - D) + n_2(t), \quad (1)$$

where $\alpha$ and $D$ denote the relative attenuation and the time delay of $x_2(t)$ with respect to $x_1(t)$, respectively. It is assumed that the signal $s_1(t)$ and the noise $n_i(t)$ are uncorrelated and jointly stationary random processes. A common method to determine the time delay $D$ is to compute the cross-correlation

$$R_{x_1x_2}(\tau) = E[x_1(t)\, x_2(t-\tau)], \quad (2)$$

where $E$ denotes the expectation operator. The time argument at which $R_{x_1x_2}(\tau)$ achieves its maximum is the desired delay estimate.
Fig. 1. Sound source localization using two microphones

Fig. 1 shows the sound localization test environment using two microphones. We assume that the sound waves arrive in parallel at each microphone, as shown in Fig. 1. Then, the time delay $D$ can be expressed as

$$D = \frac{d}{v_{sound}} = \frac{l_{mic} \cos\phi}{v_{sound}}, \quad (4)$$

where $v_{sound}$ denotes the sound velocity of 343 m/s. Thus, the angle of the sound source is computed as

$$\phi = \cos^{-1} \frac{D\, v_{sound}}{l_{mic}} = \cos^{-1} \frac{d}{l_{mic}}. \quad (5)$$

If the sound wave is sampled at the rate of $f_s$, and the sampled signal is delayed by $n_d$ samples, the distance $d$ can be computed as
$$d = \frac{v_{sound}\, n_d}{f_s}. \quad (6)$$
In Fig. 1, since $d$ is a side of a right-angled triangle, we have

$$d < l_{mic}. \quad (7)$$
Thus, when $d = l_{mic}$ in (6), the maximum number of delayed samples $n_{d,max}$ is obtained as

$$n_{d,max} = \frac{f_s\, l_{mic}}{v_{sound}}. \quad (8)$$
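Equations (4)-(8) reduce to a few lines of code. A sketch using the test-setup values reported later in Section 4 (18.5 cm microphone spacing, 16 kHz sampling rate):

```python
import numpy as np

V_SOUND = 343.0    # sound velocity (m/s)
L_MIC   = 0.185    # distance between microphones (m), as in Section 4
FS      = 16000    # sampling rate (Hz), as in Section 4

def angle_from_delay(n_d):
    """Map a delay of n_d samples between two microphones to a source
    angle in degrees, Eqs. (5)-(6)."""
    d = V_SOUND * n_d / FS                                # Eq. (6)
    return np.degrees(np.arccos(np.clip(d / L_MIC, -1.0, 1.0)))  # Eq. (5)

n_d_max = int(FS * L_MIC / V_SOUND)   # Eq. (8): about 8 samples here
```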
3. Proposed sound source localization method

3.1 Region selection for sound localization

The desired angle in (5) is obtained using the inverse cosine function. Fig. 2 shows the inverse cosine graph as a function of d. Since the inverse cosine function is nonlinear, Δd (the estimation error in d) has a different effect on the estimated angle depending on the sound source location.
Fig. 2. Inverse cosine graph as a function of d
Fig. 3. Estimation error of sound source location as a function of Δd

Fig. 3 shows the estimation error (in degrees) of the sound source location as a function of Δd. As can be seen from Fig. 3, Δd has a smaller effect for sources located from 60° to 120°. As an example, when the source is located at 90° with an estimation error Δd = 0.01, the mapped angle is 89.427°. However, if the source is located at 0° with an estimation error Δd = 0.01, the mapped angle is 8.11°. Thus, for the same estimation error Δd, the effect for the source located at 0° is 14 times larger than that for the source at 90°. To efficiently implement the inverse cosine function, we consider the region from 60° to 120° as approximately linear, as shown in Fig. 2. Fig. 4 shows the front-back confusion effect: the system has difficulty in determining whether the sound is originating from in front of (sound source A) or behind (sound source B) the system. A simple and efficient method to overcome this problem is to incorporate more microphones. In Fig. 5, three microphones are used to avoid the front-back confusion effect, where L, R and B denote the microphones located at the left, right and back sides, respectively. In this chapter, to apply the cross-correlation operation in (2), for each arrow between the microphones in Fig. 5, the signals received at the tail part and the head part are designated as $x_1(t)$ and $x_2(t)$, respectively. In conventional approaches, correlation functions are calculated between each microphone pair and mapped to angles as shown in Fig. 6-(a), (b) and (c). Notice that, due to the front-back confusion effect, each microphone pair provides two equivalent maximum values. Fig. 6-(d) is obtained by adding the three curves. In Fig. 6-(d), the angle corresponding to the maximum magnitude is the desired sound source location.
Fig. 4. Front-back confusion effect
Fig. 5. Sound source localization using three microphones
Fig. 6. Angles obtained from microphone pairs: (a) L-R, (b) B-L, (c) R-B, and (d) (L-R)+(B-L)+(R-B)
Source location (angle)          Proper microphone pair
60°~120°, 240°~300°              R-L
120°~180°, 300°~360°             B-R
180°~240°, 0°~60°                L-B
Table 1. Selection of proper microphone pair for six different source locations. Due to the nonlinear characteristic of the inverse cosine function, the accuracy of each estimation result is different depending on the source location. Notice that in Fig. 5, wherever the source is located, exactly one microphone pair has the sound source within its approximately linear region (60°~120° or 240°~300° for the microphone pair). As an example, if a sound source is located at 30° in Fig. 5, the location is within the approximately linear region for L-B pair. Table 1 summarizes the choice of proper microphone pairs for six different source locations. The proper selection of microphone pairs can be achieved by comparing the time index τmax values (or, the number of shifted samples) in (2) at which the maximum correlation values are obtained. Fig. 7 shows the comparison of the correlation values obtained from three microphone pairs when the source is located at 90°. For the smallest estimation error, we select the microphone pair whose τmax value is closest to 0. Notice that the correlation curve in the center (by the microphone pair R-L) has the τmax value which is closest to 0. In fact, for the smallest estimation error, we just need to select the correlation curve in the center. As an example, assume that a sound source is located at 90° in Fig. 5. Then, for the microphone pair R-L, the two signals arrived at the microphones R and L have little difference in their arrival times since the distances from the source to each microphone are almost the same. Thus, the cross correlation has its maximum around τ = 0. However, for LB pair, the microphone L is closer to the source than the microphone B. Since the received signals at microphones B and L are designated as x1 (t ) and x 2 (t ), respectively, the cross
Fig. 7. Comparison of the correlation values obtained from three microphone pairs for the source located at 90°
Maximum correlation positions                         Proper Mic.    Front / Back
τmax(BR) ≤ τmax(RL) ≤ τmax(LB)                        R-L            Front
τmax(BR) ≤ τmax(LB) ≤ τmax(RL)                        L-B            Front
τmax(RL) ≤ τmax(BR) ≤ τmax(LB)                        B-R            Front
τmax(LB) ≤ τmax(RL) ≤ τmax(BR)                        R-L            Back
τmax(RL) ≤ τmax(LB) ≤ τmax(BR)                        L-B            Back
τmax(LB) ≤ τmax(BR) ≤ τmax(RL)                        B-R            Back

Table 2. Selection of the proper microphone pair

If the sampled signals of x1(t) and x2(t) are denoted by two vectors X1 and X2, the length of the cross-correlated signal RX1X2 is determined as

n(R_{X_1 X_2}) = n(X_1) + n(X_2) - 1,    (9)
where n(X) denotes the length of vector X. In other words, to obtain the cross-correlation result, vector shift and inner product operations need to be performed n(RX1X2) times. It is interesting to notice that, once the distance between the microphones and the sampling rate are determined, the maximum time delay between two received signals is bounded by nd,max in (8). Thus, instead of performing vector shift and inner product operations n(RX1X2) times as in the conventional approaches, it is sufficient to perform them only nd,max times. Specifically, we perform the correlation operation from n = -nd,max/2 to n = nd,max/2 (for sampled signals, τ = n/fs, integer n). In the simulation shown in Fig. 7, n(X1) = n(X2) = 256 and nd,max = 64. Thus, the number of operations for cross-correlation is reduced from 511 to 65 by the proposed method, which means the computation time for cross-correlation can be reduced by 87%.

3.2 Simplification of angle mapping using a linear equation

Conventional angle mapping circuits require a look-up table for the inverse cosine function. An interpolation circuit is also needed to obtain better resolution with a reduced look-up table. However, since the proposed region selection approach uses only the approximately linear part of the inverse cosine function, the look-up table and the interpolation circuit can be avoided. Instead, the approximately linear region is approximated by the following equation:

y = ax + b,
(10)
where

a = \frac{-60}{(\cos(\pi/3) - \cos(2\pi/3)) \cdot l_{mic}}, \qquad b = 120 + \frac{60 \cos(2\pi/3)}{\cos(\pi/3) - \cos(2\pi/3)}    (11)
When the distance between the two microphones is given, the coefficients a and b in (10) can be pre-calculated. Thus, angle mapping can be performed using only one multiplication and one addition for a given value of d (see the sketch after Fig. 8). Fig. 8 shows the block diagrams of the conventional sound source localization system and the proposed system.
Fig. 8. Block diagrams of conventional and proposed methods: (a) conventional method, and (b) proposed method.
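To make the data flow of Fig. 8(b) concrete, the following minimal Python/NumPy sketch combines the three ingredients described above: the bounded-lag cross-correlation, the τmax-based pair selection of Table 2, and the linear angle mapping of Eqs. (10)-(11). It is an illustrative reconstruction, not the authors' FPGA implementation; the constants and the pair/sign conventions are assumptions.

```python
import numpy as np

C = 345.0       # sound speed (m/s) -- assumed value
FS = 16000      # sampling rate (Hz), as in the chapter's test setup
L_MIC = 0.185   # microphone spacing (m), as in the chapter's test setup
ND_MAX = 64     # bound on correlation lags, as in the simulation of Fig. 7

# Coefficients of the linear mapping of Eqs. (10)-(11), with d in metres
A = -60.0 / ((np.cos(np.pi / 3) - np.cos(2 * np.pi / 3)) * L_MIC)
B = 120.0 + 60.0 * np.cos(2 * np.pi / 3) / (np.cos(np.pi / 3) - np.cos(2 * np.pi / 3))

def tau_max(x1, x2, nd_max=ND_MAX):
    """Lag (in samples) of the cross-correlation peak, evaluated only over
    |lag| <= nd_max/2 instead of the full n(X1)+n(X2)-1 range."""
    n = len(x1)
    best_lag, best_val = 0, -np.inf
    for lag in range(-nd_max // 2, nd_max // 2 + 1):
        if lag >= 0:
            val = float(np.dot(x1[lag:], x2[:n - lag]))
        else:
            val = float(np.dot(x1[:n + lag], x2[-lag:]))
        if val > best_val:
            best_lag, best_val = lag, val
    return best_lag

def localize(sig_l, sig_r, sig_b):
    """Pick the pair whose correlation peak is closest to lag 0 (the idea of
    Table 2), then map its delay to an angle with the linear Eq. (10)."""
    taus = {'R-L': tau_max(sig_r, sig_l),
            'L-B': tau_max(sig_l, sig_b),
            'B-R': tau_max(sig_b, sig_r)}
    pair = min(taus, key=lambda p: abs(taus[p]))
    d = C * taus[pair] / FS          # path-length difference in metres
    return pair, A * d + B           # angle within the pair's linear region
```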
4. Simulation results

Fig. 9 shows the sound source localization system test environment. The distance between the microphones is 18.5 cm. The sound signals received by the three microphones are sampled at 16 kHz, and the sampled signals are sent to the sound localization system implemented on an Altera Stratix II FPGA. The estimation result is then transmitted to a host PC through two FlexRay communication systems. The test results are shown in Table 3. Notice that the average error of the proposed method is only 31% of that of the conventional method. To further reduce the estimation error, the sampling rate and the distance between the microphones need to be increased.
Fig. 9. Sound localization system test environments

Distance                  0°       30°      60°      90°
1 m                       0°       27°      56°      88°
2 m                       0°       27°      59°      85°
3 m                       0°       27°      59°      88°
4 m                       2.5°     34°      57°      95°
5 m                       4.1°     37°      67°      82°
Maximum absolute error    4.1°     7°       7°       8°
Average error             1.32°    4°       3.2°     4.4°

(a)

Distance                  0°       30°      60°      90°
1 m                       0°       32.7°    60°      87.2°
2 m                       0°       32°      59°      85°
3 m                       0°       32.7°    60°      87.2°
4 m                       1°       28°      62°      86°
5 m                       2°       33°      61°      92°
Maximum absolute error    2°       3°       2°       4°
Average error             0.6°     2.48°    0.8°     3.32°

(b)

Table 3. Simulation results: (a) conventional method, and (b) proposed method
5. Conclusion

Compared with conventional sound source localization methods, the proposed method achieves more accurate estimation results with reduced hardware overhead thanks to the new region selection approach. In this approach, the region from 0° to 180° is divided into three regions and only one of them is selected, such that the selected region corresponds to the approximately linear part of the inverse cosine function. The computation time for cross-correlation is reduced by 87% compared with the conventional approach. Simulations show that the estimation error of the proposed method is only 31% of that of the conventional approach. The proposed sound source localization system can be applied to the implementation of portable service robot systems, since it requires a smaller area and lower power consumption than conventional methods. With some modifications, the proposed method can also be combined with the generalized correlation method.
6. Acknowledgment

This research was financially supported by the Ministry of Education, Science and Technology (MEST) and the National Research Foundation of Korea (NRF) through the Human Resource Training Project for Regional Innovation.
7. References

Brandstein, M. S. & Silverman, H. (1997). A practical methodology for speech source localization with microphone arrays. Comput. Speech Lang., Vol.11, No.2, pp. 91-126, ISSN 0885-2308

Brandstein, M. & Ward, D. B. (2001). Robust Microphone Arrays: Signal Processing Techniques and Applications, New York: Springer, ISBN 978-3540419532

Cheng, C. I. & Wakefield, G. H. (2001). Introduction to head-related transfer functions (HRTFs): representations of HRTFs in time, frequency, and space. J. Audio Eng. Soc., Vol.49, No.4, (April, 2001), pp. 231-248, ISSN 1549-4950

Coen, M. (1998). Design principles for intelligent environments, Proceedings of the 15th National Conference on Artificial Intelligence, pp. 547-554

Huang, J.; Supaongprapa, T.; Terakura, I.; Wang, F.; Ohnishi, N. & Sugie, N. (1999). A model-based sound localization system and its application to robot navigation. Robot. Auton. Syst., Vol.27, No.4, (June, 1999), pp. 199-209, ISSN 0921-8890

Knapp, C. H. & Carter, G. C. (1976). The generalized correlation method for estimation of time delay. IEEE Trans. Acoust. Speech Signal Process., Vol.24, No.4, (August, 1976), pp. 320-327, ISSN 0096-3518

Lv, X. & Zhang, M. (2008). Sound source localization based on robot hearing and vision, Proceedings of ICCSIT 2008 International Conference on Computer Science and Information Technology, pp. 942-946, ISBN 978-0-7695-3308-7, Singapore, August 29 - September 2, 2008

Mungamuru, B. & Aarabi, P. (2004). Enhanced sound localization. IEEE Trans. Syst. Man Cybern. Part B - Cybern., Vol.34, No.3, (June, 2004), pp. 1526-1540, ISSN 1083-4419

Nakadai, K.; Lourens, T.; Okuno, H. G. & Kitano, H. (2000). Active audition for humanoid, Proceedings of the 17th National Conference on Artificial Intelligence and 12th Conference on Innovative Applications of Artificial Intelligence, pp. 832-839

Sasaki, Y.; Kagami, S. & Mizoguchi, H. (2006). Multiple sound source mapping for a mobile robot by self-motion triangulation, Proceedings of the 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 380-385, ISBN 1-4244-0250-X, Beijing, China, October, 2006

Wax, M. & Kailath, T. (1983). Optimum localization of multiple sources by passive arrays. IEEE Trans. Acoust. Speech Signal Process., Vol.31, No.6, (October, 1983), pp. 1210-1217, ISSN 0096-3518
8 Robust Audio Localization for Mobile Robots in Industrial Environments Manuel Manzanares, Yolanda Bolea and Antoni Grau
Technical University of Catalonia, UPC, Barcelona, Spain
1. Introduction

For autonomous navigation in a workspace, a mobile robot has to know its position in this space precisely; that is, the robot must be able to self-localize in order to move and to perform the different entrusted tasks successfully. At present, one of the most used systems in open spaces is GPS navigation; however, in indoor spaces (factories, buildings, hospitals, warehouses…) GPS signals are not operative because their intensity is too weak. The absence of GPS navigation in these environments has stimulated the development of new local positioning systems, each with its particular problems. Such systems have required in many cases the installation of beacons that operate like satellites (similarly to GPS), the use of landmarks, or even the use of other auxiliary systems to determine the robot's position.

The problem of mobile robot localization is part of a more global problem, because in autonomous navigation, when a robot is exploring an unknown environment, it usually needs to obtain two important pieces of information: a map of the environment and the robot's location in that map. Since mapping and localization are related to each other, these two problems are usually considered as a single problem called simultaneous localization and mapping (SLAM). SLAM is a significant open problem in mobile robotics which is difficult because of the following paradox: to localize itself the robot needs the map of the environment, and to build the map the robot's location must be known precisely.

Mobile robots use different kinds of sensors to determine their position: for instance, the use of odometric or inertial sensors is very common; however, wheel slippage and sensor drift introduce noise and cause error accumulation, leading to erroneous estimates. Other kinds of external sensors used in robotics to solve localization are, for instance, CCD cameras, infrared sensors, ultrasonic sensors, mechanical wave sensors and lasers. Other sensors recently applied are instruments sensitive to the magnetic field, known as electronic compasses (Navarro & Benet, 2009). Mobile robotics is interested in those able to measure the Earth's magnetic field and express it through an electrical signal. One type of electronic compass is based on magneto-resistive transducers, whose electrical resistance varies with changes in the applied magnetic field. This type of sensor presents sensitivities below 0.1 milligauss, with response times below 1 sec, allowing its reliable use in vehicles moving at high speed (Caruso, 2000). In SLAM, some applications with electronic compasses have been developed, working simultaneously with other sensors such as artificial vision (Kim et al., 2006) and ultrasonic sensors (Kim et al., 2007).
In mobile robotics, since different sensors are used at the same time to provide localization information, the problem of data fusion arises, and many algorithms have been implemented. Multisensor fusion algorithms can be broadly classified as follows: estimation methods, classification methods, inference methods, and artificial intelligence methods (Luo et al., 2002); notable among the latter are neural networks, fuzzy logic and genetic algorithms (Begum et al., 2006); (Brunskill & Roy, 2005). Regarding the processing of sensor information in the SLAM context, many works can be found: for instance, in (Di Marco et al., 2000) the estimates of the position of the robot and of the selected landmarks are derived in terms of uncertainty regions, under the hypothesis that the errors affecting all sensor measurements are unknown but bounded, and in (Begum et al., 2006) an algorithm processes sensor data incrementally and therefore has the capability to work online.

A comprehensive body of research has been reported on SLAM, most of which stems from the pioneering work of (Smith et al., 1990). This early work provides a Kalman Filter (KF) based statistical framework for solving SLAM. KF-based SLAM algorithms require feature extraction and identification from sensor data for estimating the pose and the parameters. When the system and measurement noises obey a Gaussian distribution, the KF uses the recursive state equation, together with the noise estimates, to obtain the optimal pose of the mobile robot; however, localization errors appear if the noise does not obey this distribution. The KF is also able to merge low-grade multisensor data models. The particle filter is the next probabilistic technique that has earned popularity in the SLAM literature. The hybrid SLAM algorithm proposed in (Thrun, 2001) uses a particle filter for posterior estimation over a robot's poses and is capable of mapping large cyclic environments. Another fusion method broadly used is the Extended Kalman Filter (EKF); the EKF can be used where the model is nonlinear but can be suitably linearized around a stable operating point.

Several systems have been researched to overcome the localization limitation. For example, the Cricket Indoor Location system (Priyantha, 2000) relies on active beacons placed in the environment. These beacons transmit two signals simultaneously (an RF and an ultrasound wave). Passive listeners mounted, for example, on mobile robots can, by knowing the difference in propagation speed of the RF and ultrasound signals, estimate their own position in the environment. GSM and WLAN technologies can also be used for localization: using triangulation methods and measuring several signal parameters, such as the signal's angle and time of arrival, it becomes possible to estimate the position of a mobile transmitter/receiver in the environment (Sayed et al., 2005). In (Christo et al., 2009), a specific architecture is suggested for the use of multiple iGPS Web Services for mobile robot localization.

Most mobile robot localization systems are based on robot vision, which is also a hot spot in robotics research. The camera, the most popular visual sensor, is widely used for the localization of mobile robots. However, some difficulties arise because of the limitation of the camera's visual field and the dependence on lighting conditions: if the target is not in the visual field of the camera or the lighting conditions are poor, the visual localization system of the mobile robot cannot work effectively.
Nowadays, the role of acoustic perception in autonomous robots, intelligent buildings and industrial environments is increasingly important, and different works can be found in the literature (Yang et al., 2007); (Mumolo et al., 2003); (Csyzewski, 2003). Compared to the study of visual perception, the study of auditory perception is still in its infancy. The human auditory system is a complex and organic information processing system: it can perceive the intensity of sound as well as spatial orientation information. Compared with vision, audition has several unique properties. Audition is omni-directional; sound waves have strong diffraction ability, and audition is less affected by obstacles. Therefore, the audio abilities possessed by a robot can make up for the restrictions of other sensors, such as a limited field of view or non-translucent obstacles. Nevertheless, audio signal processing presents some particular problems, such as the effect of reverberations and noise signals, complex boundary conditions and near-field effects, among others; therefore, the use of audio sensors together with other sensors is common to determine the position and also for the autonomous navigation of a mobile robot, leading to a data fusion problem.

There are many applications that would be aided by the determination of the physical position and orientation of users. As an example, without information on the spatial location of users in a given environment, it would not be possible for a service robot to react naturally to the needs of the user. To localize a user, sound source localization techniques are widely used. Such techniques can also help a robot to self-localize in its working area. Therefore, sound source localization (of one or more sources) has been studied by many researchers (Ying & Runze, 2007); (Sasaki et al., 2006); (Kim et al., 2009). Sound localization can be defined as the process of determining the spatial location of a sound source based on multiple observations of the received sound signals. Current sound localization techniques are generally based upon the idea of computing the time difference of arrival (TDOA) with microphone arrays (Brandstein & Silverman, 1997); (Knapp & Carter, 1976), or the interaural time difference (ITD) (Nakashima & Mukai, 2005). The ITD is the difference in the arrival time of a sound source between two ears; a representative application can be found in (Kim & Choi, 2009), with a binaural sound localization system using sparse-coding-based ITD (SITD) and a self-organizing map (SOM). The sparse coding is used for decomposing given sounds into three components, time, frequency and magnitude, and the azimuth angle is estimated through the SOM. Other works in this field use structured sound sources (Yi & Chu-na, 2010) or the processing of different audio features (Rodemann et al., 2009), among other techniques.

The works that the authors present in this chapter use audio signals generated by electric machines for mobile robot localization in industrial environments. A common problem encountered in industrial environments is that electric machine sounds are often corrupted by non-stationary and non-Gaussian interferences such as speech signals, environmental noise, background noise, etc. Consequently, pure machine sounds may be difficult to identify using conventional frequency domain analysis techniques, such as the Fourier transform (Mori et al., 1996), and statistical techniques, such as Independent Component Analysis (ICA) (Roberts & Everson, 2001). The wavelet transform has attracted increasing attention in recent years for its ability in signal feature extraction (Bolea et al., 2003); (Mallat & Zhang, 1993) and noise elimination (Donoho, 1999). For many mechanical dynamic signals, such as the acoustic signals of an engine, Donoho's method seems rather ineffective; the reason for this inefficiency is that the features of the mechanical signals are not considered.
Therefore, when the idea of Donoho's method is combined with the sound features, and a de-noising method based on the Morlet wavelet is added, the methodology becomes very effective when applied to engine sound detection (Lin, 2001). In (Grau et al., 2007), the authors propose a new approach to identify different industrial machine sounds, which can be affected by non-stationary noise sources.
It is also important to consider that non-speech audio signals are non-stationary, like many real signals encountered in speech processing, image processing, ECG analysis, communications, control and seismology. To represent the behaviour of a stationary process, it is common to use models (AR, ARX, ARMA, ARMAX, OE, etc.) obtained from experimental identification (Ljung, 1987). The coefficient estimation can be done with different criteria: LSE, MLE, among others. But in the case of non-stationary signals, the classical identification theory and its results are not suitable. Many authors have proposed different approaches to modelling this kind of non-stationary signal, which can be classified as: i) assuming that a non-stationary process is locally stationary in a finite time interval, so that various recursive estimation techniques (RLS, PLR, RIV, etc.) can be applied (Ljung, 1987); ii) state space modelling and Kalman filtering; iii) expanding each time-varying parameter coefficient onto a set of basis sequences (Charbonnier et al., 1987); and iv) nonparametric approaches for non-stationary spectrum estimation, such as the local evolving spectrum, STFT and WVD, also developed to characterize non-stationary signals (Kayhan et al., 1994). To overcome the drawbacks of the identification algorithms, wavelets can also be considered for time-varying model identification. The distinct feature of a wavelet is its multiresolution characteristic, which is very suitable for non-stationary signal processing (Tsatsanis & Giannakis, 1993).

The work presented in this chapter investigates different approaches based on the study of audio signals with the purpose of obtaining the robot location (in the x-y plane) using industrial machines as sound sources. By their own nature, these typical industrial machines produce a stationary signal in a certain time interval. The resulting stationary waves depend on the resonant frequencies of the plant (which depend on the plant geometry and dimensions) and also on the different absorption coefficients of the wall materials and other objects present in the environment.

A first approach that the authors investigate is based on the recognition of patterns in the audio signal acquired by the robot in different locations (Bolea et al., 2008). These patterns are found through a process of feature extraction of the signal in the identification process. To establish the signal models the wavelet transform is used, specifically the Daubechies wavelet, because it captures very well the characteristics and information of non-speech audio signals. This set of wavelets has been extensively used because its coefficients capture the maximum amount of the signal energy. A MAX (Moving Average with eXogenous input) model represents the sampled signals in different points of the space domain, because the signals are correlated. We use the signal closest to the audio source as the input signal for the model. Only the model coefficients need to be stored to compare and discriminate the different audio signals. This would not happen if the signals were represented by an AR model, because then the coefficients would depend on the signal itself and, with a different signal at every point in the space domain, these coefficients would not be significant enough to discriminate the audio signals.

When the model identification is obtained by the wavelet transform, the coefficients that do not give enough information for the model are ignored: the eigenvalues of the covariance matrix are analyzed and those coefficients that do not have discriminatory power are rejected (a sketch of this selection follows below). For the estimation of each signal, the approximation signal and its significant details are used following this process: i) model structure selection; ii) model parameter calibration with an estimation method (the LSE method can be used for its simplicity; furthermore, good convergence of the identified model coefficients is assured); iii) validation of the model.
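A minimal sketch of this coefficient screening, assuming the labeled models' coefficient vectors are stacked row-wise in a matrix F (the keep_ratio threshold and the PCA-style projection are illustrative assumptions; the text only states that coefficients with negligible covariance eigenvalues are rejected):

```python
import numpy as np

def discriminative_projection(F, keep_ratio=0.99):
    """F: one wavelet-coefficient feature vector per row (one per labeled signal).
    Rank directions by the eigenvalues of the covariance matrix and drop those
    with negligible variance, i.e. without discriminatory power."""
    w, V = np.linalg.eigh(np.cov(F, rowvar=False))   # eigenvalues, ascending
    order = np.argsort(w)[::-1]                      # largest first
    cum = np.cumsum(w[order]) / np.sum(w)
    k = int(np.searchsorted(cum, keep_ratio)) + 1    # smallest k reaching the ratio
    return V[:, order[:k]]                           # columns: retained directions

# Usage: project every feature vector before computing distances
# F_reduced = F @ discriminative_projection(F)
```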
Another approach that is also investigated is based on the determination of the transfer function of a room, denoted RTF (Room Transfer Function); this model is an LPV (Linear Parameter Varying) model because its parameters vary along the robot's navigation (Manzanares et al., 2009). For an industrial plant, there are different modelling approaches to establish the transmission characteristics of a sound between a stationary audio source and a microphone in a closed environment: i) beam theory applied to the propagation of the direct and reflected audio waves in the room (Kinsler et al., 1995); ii) the development of a lumped parameter model, similar to the model used to explain the propagation of electromagnetic waves in transmission lines (Kinsler et al., 1995), and the study of the solutions given by the wave equation (Kuttruff, 1979). Other authors propose an RTF that models the sound in an industrial plant (Haneda et al., 1992); (Haneda et al., 1999); (Gustaffson et al., 2000). In these works, the complexity of obtaining the RTFs is evident, as well as the need for a high number of parameters to model the complete acoustic response for a specific frequency range; moreover, considering a real environment presents an added difficulty. In this research we study how to obtain the RTF of a real plant. Since this RTF will be used by a mobile robot to navigate in an industrial plant, we have simplified the methodology, and our goal is to determine the x-y coordinates of the robot. In such a case, the obtained RTF will not represent the complete acoustic response, but it will be powerful enough to determine the robot's position.
2. Method based on the recognition of patterns of the audio signal

This method is based on the recognition of patterns in the audio signal acquired by the robot in different locations. To establish the signal models, the Daubechies wavelets are used. A MAX model represents the sampled signals in different points of the space domain, and for the estimation of each signal the approximation signal and its significant details are used following the process steps mentioned previously: i) model structure selection; ii) model parameter calibration with an estimation method; iii) validation of the model. Let us consider the following TV-MAX model, with Si = y(n):

y(n) = \sum_{k=0}^{q} b(n;k)\, u(n-k) + \sum_{k=0}^{r} c(n;k)\, e(n-k)    (1)
where y(n) is the system output, u(n) is the observable input, which is assumed to be the signal closest to the audio source, and e(n) is a noise signal. The second term is necessary whenever the measurement noise is colored and needs further modeling. The coefficients of the different models will be used as the feature vector, which can be defined as XS, where
X_S = (b_1, b_2, \ldots, b_{q+1}, c_1, c_2, \ldots, c_{r+1})    (2)
where q+1 and r+1 are the numbers of b and c coefficients, respectively. From every input signal a new feature vector is obtained, representing a new point in the (q+r+2)-dimensional
feature space, fs. For feature selection, it is not necessary to apply any statistical test to verify that each component of the vector has enough discriminatory power, because this step has already been done in the wavelet transform preprocessing. This feature space will be used to classify the different audio signals entering the system. Some labeled samples with their precise positions in the space domain are needed; in this chapter a specific experiment is shown. When an unlabeled sample enters the feature space, the minimum distance to a labeled sample is computed, and this distance will be used to estimate the distance to the same sample in the space domain. For this reason a transformation function fT is needed, which converts a distance in the feature space into a distance in the space domain; note that the distance is a scalar value, independently of the dimension of the space where it has been computed. The Euclidean distance is used, and the distance between two samples Si and Sj in the feature space is defined as
d_{fs}(S_i, S_j) = \sqrt{ \sum_{k=0}^{q} \left( b_k^{S_i} - b_k^{S_j} \right)^2 + \sum_{k=0}^{r} \left( c_k^{S_i} - c_k^{S_j} \right)^2 }    (3)
where b_k^{Si} and c_k^{Si} are the b and c coefficients, respectively, of the wavelet transform for the signal Si. It is not necessary to normalize the coefficients before the distance calculation, because they are already normalized intrinsically by the wavelet transformation. Because the same relative distances exist between signals with different models, and knowing that the greater the distortion, the farther the signal is from the audio source, we choose the correspondences (dxy, dfs) between the samples closest to the audio source that are equidistant along the dxy axis. These points serve to estimate an n-order curve, that is, the transformation function fT. An initial approximation for this function is a 4th-order polynomial, and there are several solutions for a unique distance in the feature space, that is, it yields different distances in the x-y space domain.
Fig. 1. Localization system in space domain from non-speech audio signals.

We solve this drawback by adding a new variable: the previous position of the robot. If we have an approximate position of the robot, its speed and the computation time between feature extraction samples, we will have a coarse approximation of the new robot position, coarse enough to discriminate among the solutions of the 4th-order polynomial (see the sketch below). In the experiments section a waveform for the fT function can be seen; it follows the model from the sound partial differential equation proposed in (Kinsler et al., 1995) and (Kuttruff, 1979).
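A minimal sketch of this inversion and motion-based pruning (the polynomial coefficients are those given later in the experiments section, and the constant-speed bound follows the text; everything else is an illustrative assumption):

```python
import numpy as np

# 4th-order fit of f_T from the experiments section: d_fs = P(d_xy)
P = [9.65e10, 1.61e5, -8.49e2, 144.9, 107.84]   # highest power first

def ft_inverse(d_fs):
    """All real, non-negative d_xy solving f_T(d_xy) = d_fs (up to four roots)."""
    coeffs = list(P)
    coeffs[-1] -= d_fs                     # P(d_xy) - d_fs = 0
    roots = np.roots(coeffs)
    return sorted(r.real for r in roots if abs(r.imag) < 1e-9 and r.real >= 0)

def prune_by_motion(candidates, d_prev, v=0.15, t_cycle=3.0):
    """Keep only roots compatible with the robot's displacement since the
    previous fix: |d_new - d_prev| <= v * t_cycle (triangle inequality)."""
    reach = v * t_cycle                    # 0.45 m at 15 cm/s and ~3 s per estimate
    return [d for d in candidates if abs(d - d_prev) <= reach]
```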
Figure 1 shows the localization system, including the wavelet transformation block, the modeling blocks, the feature space and the spatial recognition block, which has as inputs the environment of the robot and the function fT.

2.1 Sound source angle detection

As stated in the Introduction, several works have used a microphone array in order to locate sound sources. Because we work with a unique sound source, and in order to reduce the number of sensors, we propose a system that detects the direction from which the maximum sound intensity is received, emulating in this way the response of a microphone array located on the perimeter of a circular platform. To achieve this effect we propose a turning platform with two opposed microphones. The robot computes the angle between the platform origin (0°) and the magnetic north given by its compass. Figure 2 depicts the block diagram of the electronic circuit that acquires the sound signals. The signal is decoupled and amplified in a first stage in order to obtain a suitable working range for the following stages. Then, the maximum of the mean values of the rectified sampled audio signal determines the position of the turning platform.
Fig. 2. Angle detection block diagram.

There are two modes of operation: looking for local values or for global values. To find the global maximum, the platform must turn 180° (because there are two microphones); this mode guarantees that the maximum value is found, but the operation time is longer than with local value detection, in which the determination is done when the system detects the first maximum. In most of the experiments this latter operation mode is enough.

2.2 Spatial recognition

The distance computation between the unlabelled audio sample and the labeled ones is repeated for the two labeled samples closest to the unlabelled one. Applying the transformation function fT, two distances in the x-y domain are then obtained. These distances indicate where the unlabelled sample is located. Now, with a simple geometric process, the position of the unlabelled sample can be estimated, but with a certain ambiguity; see Figure 3. In (Bolea et al., 2003) we used the intersection of three circles, which theoretically gives a unique solution; but in practice these three circles never intersect in a point but in an area, which induces an approximation and thus an error (uncertainty) in the localization point. The intersection of two circles (as shown in Figure 3) leads to a two-point solution. To discriminate correctly between these points, the angle between the robot and the sound source is computed.
Since the robot computes the angle between itself and the sound source, the problem is to identify the correct point of the circle intersection. Figure 4 shows the situation: I1 and I2 are the intersection points. For each point the angle with respect to the sound source is computed (α1 and α2), because the exact source position (xs, ys) is known.
Fig. 3. Geometric process of the intersection of two (right) or three (left) circles to find the position of the unlabeled sample Sk.
Fig. 4. Computation of the angles between the ambiguous robot localizations and the sound source.

The angles α1 and α2 correspond to:
\alpha_1 = \arctan\!\left( \frac{y_{I_1} - y_s}{x_{I_1} - x_s} \right), \qquad \alpha_2 = \arctan\!\left( \frac{y_{I_2} - y_s}{x_{I_2} - x_s} \right)    (4)
These angles must be corrected with respect to the north in order to have the same offset as the angle computed aboard the robot:

αFN1 = α1 - αF-N; αFN2 = α2 - αF-N    (5)
where αF-N is the angle between the room reference and the magnetic north (previously calibrated).
Now, to compute the correct intersection point, it is only necessary to find the angle that is closest to the angle measured on the robot with the sensor.
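The disambiguation step can be summarized in a few lines (a sketch assuming angles in degrees; np.arctan2 is used instead of a plain arctangent to avoid quadrant ambiguities in Eq. (4)):

```python
import numpy as np

def pick_intersection(i1, i2, source, alpha_f_n, gamma):
    """Choose between the two circle-intersection points I1 and I2.
    alpha_f_n: calibrated angle between room reference and magnetic north.
    gamma: bearing to the source measured on board the robot."""
    xs, ys = source
    best, best_diff = None, np.inf
    for (x, y) in (i1, i2):
        alpha = np.degrees(np.arctan2(y - ys, x - xs))          # Eq. (4)
        alpha_fn = alpha - alpha_f_n                            # Eq. (5)
        diff = abs((alpha_fn - gamma + 180.0) % 360.0 - 180.0)  # wrapped difference
        if diff < best_diff:
            best, best_diff = (x, y), diff
    return best
```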
3. Method based on the LPV model with audio features

In this second approach we study how to obtain the RTF of a real plant. Since this RTF will be used by a mobile robot to navigate in an industrial plant, we have simplified the methodology, and our goal is to determine the x-y coordinates of the robot. In such a case, the obtained RTF will not represent the complete acoustic response, but it will be powerful enough to determine the robot's position. This work investigates the feasibility of using sound features in the space domain for robot localization (in the x-y plane), as well as for the detection of the robot's orientation.

3.1 Sound model in a closed room

The acoustical response of a closed room (with rectangular shape), where the pressure at a point depends on the defined (x, y, z) position, is represented by the following wave equation:

\frac{\partial^2 p}{\partial x^2} + \frac{\partial^2 p}{\partial y^2} + \frac{\partial^2 p}{\partial z^2} + k^2 p = 0    (6)
Lx, Ly and Lz denote the length, width and height of the room, with ideally rigid walls where the waves are reflected without loss. The solution of Eq. (6) can be written in the separated form

p(x, y, z) = p_1(x)\, p_2(y)\, p_3(z)    (7)
when the evolution of the pressure with time is not taken into account. Substituting Eq. (7) into Eq. (6), three differential equations can be derived (and likewise for the boundary conditions). For example, p1 must satisfy the equation

\frac{d^2 p_1}{dx^2} + k_x^2\, p_1 = 0    (8)

with boundary conditions at x = 0 and x = Lx:

\frac{dp_1}{dx} = 0

The constants kx, ky and kz are related by the following expression:

k_x^2 + k_y^2 + k_z^2 = k^2    (9)

Equation (8) has the general solution

p_1(x) = A_1 \cos(k_x x) + B_1 \sin(k_x x)    (10)

Applying the boundary conditions to this solution, the constants in Eq. (10) take the following values:
k_x = \frac{n_x \pi}{L_x}; \quad k_y = \frac{n_y \pi}{L_y}; \quad k_z = \frac{n_z \pi}{L_z}

where nx, ny and nz are positive integers. Replacing these values in Eq. (10), the eigenvalues of the wave equation are obtained:

k_{n_x n_y n_z} = \pi \left[ \left( \frac{n_x}{L_x} \right)^2 + \left( \frac{n_y}{L_y} \right)^2 + \left( \frac{n_z}{L_z} \right)^2 \right]^{1/2}    (11)
The eigenfunctions or normal modes associated with these eigenvalues are expressed by

p_{n_x n_y n_z}(x, y, z) = C_1 \cos\!\left( \frac{n_x \pi x}{L_x} \right) \cos\!\left( \frac{n_y \pi y}{L_y} \right) \cos\!\left( \frac{n_z \pi z}{L_z} \right) e^{j\omega t}    (12)

with e^{j\omega t} = \cos(\omega t) + j \sin(\omega t),
where C1 is an arbitrary constant and the variation of the pressure with time is introduced by the factor e^{jωt}. This expression represents a three-dimensional stationary wave in the room. The eigenfrequencies corresponding to the eigenvalues of Eq. (11) can be expressed as:
f_{n_x n_y n_z} = \frac{c}{2\pi} k_{n_x n_y n_z} = \sqrt{ f_{n_x}^2 + f_{n_y}^2 + f_{n_z}^2 } = \sqrt{ \left( \frac{n_x c}{2 L_x} \right)^2 + \left( \frac{n_y c}{2 L_y} \right)^2 + \left( \frac{n_z c}{2 L_z} \right)^2 }    (13)
where c is the speed of sound. Therefore, the acoustic response of any closed room presents resonance frequencies (eigenfrequencies) at which the response to a sound source emitting in the room is the highest. The eigenfrequencies depend on the geometry of the room and also on the reflection coefficients of the materials, among other factors. The microphones capture the environmental sound and are located at a constant height (z1) with respect to the floor, and thus the factor

\cos\!\left( \frac{n_z \pi z_1}{L_z} \right)    (14)
is constant. Therefore, if the dependence of the pressure on time is not considered, Eq. (12) becomes

p_{n_x n_y n_z}(x, y) = C_2 \cos\!\left( \frac{n_x \pi x}{L_x} \right) \cos\!\left( \frac{n_y \pi y}{L_y} \right)    (15)
In our experiments, Lx = 10.54 m, Ly = 5.05 m and Lz = 4 m, considering a sound propagation speed of 345 m/s.
When Eq. (15) is applied to the experiment rooms for mode (1, 1, 2), it gives the acoustic pressure in the room as a function of the robot's x-y position:

p_{n_x n_y n_z}(x, y) = C_2 \cos\!\left( \frac{\pi x}{10.54} \right) \cos\!\left( \frac{\pi y}{5.05} \right)    (16)
Under these ideal conditions and for an ideal value of the constant, C2 = 2, the theoretical acoustic response of the room, in absolute value of pressure and for this propagation mode, can be seen in Figure 5.
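The surface of Fig. 5 can be reproduced with a few lines of NumPy by evaluating Eq. (16) over a grid of candidate robot positions (C2 = 2 and the room dimensions are those given in the text; the grid resolution is an arbitrary choice):

```python
import numpy as np

LX, LY = 10.54, 5.05   # room dimensions (m), from the text
C2 = 2.0               # ideal constant used in the text

def mode_pressure(x, y):
    """Absolute acoustic pressure of Eq. (16), mode (1, 1, 2) at fixed height z1."""
    return np.abs(C2 * np.cos(np.pi * x / LX) * np.cos(np.pi * y / LY))

# Theoretical response over the x-y plane (the surface shown in Fig. 5)
X, Y = np.meshgrid(np.linspace(0.0, LX, 106), np.linspace(0.0, LY, 51))
P = mode_pressure(X, Y)
```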
Fig. 5. Room response for propagation mode (1,1,2).

The shape of Figure 5 would be obtained for a sound source that excited only this propagation mode; in reality, the acoustic response becomes more complex as more propagation modes are excited by the sound source.

3.2 Transfer function in a closed room

In (Gustaffson et al., 2000) a model based on the sum of second-order transfer functions is proposed; these functions relate a sound source located at a position ds, emitting an audio signal with a specific acoustic pressure Ps, and a microphone located at dm, which receives a signal of pressure Pm; each function represents the system response to a propagation mode. The first contribution of this work is to introduce an initial variation to this model, considering that the sound source has a fixed location; the model can then be expressed as:
\frac{P_m(d_m, s)}{P_s(s)} = \sum_{n=1}^{M} \frac{K[d_m]\, s}{s^2 + 2 \xi_n \omega_n s + \omega_n^2}    (17)
Because our objective is not to obtain a complete model of the acoustic response of the industrial plant, it is not necessary to consider all the propagation modes in the room, and we will try to simplify the problem for this specific application without the need to work with higher-order models.
To implement this experiment, the first step is to select the frequency of interest by a prior analysis of the frequency spectrum of the audio signal emitted by the considered sound source (an industrial machine). Those frequency components with a significant acoustic power will be considered, with the only requirement that they are close to one of the resonant frequencies of the environment. These frequencies are selected through a band-pass digital filter centered at the frequency of interest. The term M in the sum of our model then takes the value N, the number of propagation modes resulting from the filtering process. The spectra of the sound sources used in our experiments show an important component close to 100 Hz for the climatic chamber, and a component at 50 Hz for the PCB insulator; see Figure 10 (right) and Figure 11 (right). For a concrete propagation mode, the variation that a stationary audio signal undergoes at different robot positions can be modeled; this signal can be smoothed by the variation of the absorption coefficients of the different materials that make up the objects in the room. These parameters are named K[dm] and ξ[dm], and Eq. (17) becomes:

H(s, d_m) = \frac{P_m(d_m, s)}{P_s(s)} = \sum_{n=1}^{N} \frac{K[d_m]\, s}{s^2 + 2 \xi_n[d_m] \omega_n s + \omega_n^2}    (18)
where the gain (K), the smoothing coefficient (ξn) and the natural frequency (ωn) of the room transfer function depend on the room characteristics dm, nx, ny, Lx and Ly, yielding an LPV indoor model. Using Eq. (17), the modulus of the closed-room response at a specific transmission mode ωn1 is:

|H(j\omega_{n1}, d_{m1})| = \frac{K}{2 \xi_{n1} \omega_{n1}}    (19)
The room response in the propagation mode ωn1 (z1 is constant), assuming that the audio source emits only the frequency ωn1, for a specific coordinate (x, y) of the room is:

H = \left. \frac{P_m}{P_s} \right|_{n_x, n_y} = C \cos\!\left( \frac{n_x \pi x}{L_x} \right) \cos\!\left( \frac{n_y \pi y}{L_y} \right)    (20)

with f_{n1} = \sqrt{ f_{n_x}^2 + f_{n_y}^2 } and \omega_{n1} = 2\pi f_{n1}.
Equating Eqs. (19) and (20), it results:

\xi_{n1} = \frac{k}{2 \omega_{n1} \cos\!\left( \frac{n_x \pi x}{L_x} \right) \cos\!\left( \frac{n_y \pi y}{L_y} \right)}    (21)
If the filter is non-ideal, more than one transmission mode may have to be considered, and therefore the following expression is obtained:

\sum_{l=1}^{m} \frac{K_{nl}}{2 \xi_{nl} \omega_{nl}} = \sum_{l=1}^{m} C \cos\!\left( \frac{n_{xl} \pi x}{L_x} \right) \cos\!\left( \frac{n_{yl} \pi y}{L_y} \right)    (22)
The best results in the identification process for determining the robot's position have been obtained, for each considered propagation mode, by keeping the K[dm] coefficient constant and capturing the variations of the acquired audio signal in the smoothing coefficient ξ[dm]. If the zeros of the system are forced to be constant in the identification process for the different robot locations, and we admit that the signal power emitted by the sound sources is also constant while the power of the audio signal acquired by the microphones varies with the robot's position, then the pole positions in the s plane, for the considered propagation mode, will vary with the robot's position, and their values will be:

s_{1n}[d_m] = -\xi_n[d_m]\, \omega_n + \omega_n \sqrt{ (\xi_n[d_m])^2 - 1 }    (23)

s_{2n}[d_m] = -\xi_n[d_m]\, \omega_n - \omega_n \sqrt{ (\xi_n[d_m])^2 - 1 }    (24)
It is worth noting that this reduced-order model gives good results for determining the robot's position and, although it does not provide a complete physical description of the evolution of the different parameters of the acoustic response for the different robot positions, we can admit that, according to the physical model given by the wave equation in Eq. (16), the moduli of the proposed transfer functions will vary following a sinusoidal pattern, and the pole positions in the s plane will show those variations in the same fashion.
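As a small illustration of Eqs. (23)-(24), the pole pair can be computed for a few hypothetical values of the smoothing coefficient (ωn is set to the 100 Hz mode used later in the experiments; the ξ values themselves are illustrative):

```python
import numpy as np

def mode_poles(xi, wn):
    """Pole pair of Eqs. (23)-(24); complex-conjugate for xi < 1 (underdamped)."""
    disc = np.sqrt(complex(xi * xi - 1.0))
    return (-xi * wn + wn * disc, -xi * wn - wn * disc)

wn = 2.0 * np.pi * 100.0          # 100 Hz propagation mode
for xi in (0.05, 0.20, 1.50):     # hypothetical xi[dm] at different positions
    print(xi, mode_poles(xi, wn))
```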
4. Experiments and discussions

4.1 Method based on the recognition of patterns of the audio signal

For the first proposed method, based on the recognition of patterns of the audio signal, in order to prepare a setting as real as possible we have used a workshop with a CNC milling machine as the non-speech audio source. The room has dimensions of 7 meters by 10 meters, and we obtain 9 labeled samples (S1 to S9), acquired at regular positions, covering the entire representative workshop surface. With the dimensions of the room, these 9 samples are enough because there is no significant variance when oversampling. Figure 6 shows the arrangement of the labelled samples. The robot enters the room, describes a predefined trajectory and exits. Along its trajectory, the robot picks up four unlabeled samples (audio signals) that will be used as test data for our algorithms (S10, S11, S12 and S13). The sampling frequency is 8 kHz, following the same criteria as (Bielińska, 2002), chosen because of the similarity of these signals to speech signals.

First, in order to obtain the model coefficients corresponding to the 9 labeled non-stationary audio signals, these signals are decomposed by the wavelet transform into 4 levels, with one approximation signal and 4 detail signals (Figure 7). For all the samples, the relevance of every signal is analyzed. We observe which decompositions are the most significant for formulating the prediction model, that is, those details containing most of the signal energy. The approximation (A4i) and the 4th-level detail signal (D4i) are enough to represent the original signal, because the mean and deviation of the D3i, D2i and D1i detail signals are two orders of magnitude below those of A4i and D4i. Figure 7 (bottom left) shows the difference between the original signal and the signal estimated with A4i and D4i: practically, there is no error when they are overlapped. In this experiment we have chosen the Daubechies-45 wavelet transform, after testing different Daubechies wavelets, because it yields good results in identification (Tsatsanis & Giannakis, 1993).
After an initial step for selecting the model structure, it is determined that the order of the model has to be 20 (10 coefficients for A4i and 10 for D4i), and a MAX model has been selected, for the reasons explained above. Once those 9 models are calibrated, they are validated with the FPE (Final Prediction Error) and MSE (Mean Square Error) criteria, yielding values of about 10e(-6) and 5% respectively, using 5000 data points for identification and 1000 for validation. Besides, for all the estimated models the residual autocorrelation and the cross-correlation between the inputs and residuals are uncorrelated, indicating the goodness of the models.
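A minimal sketch of the decomposition and estimation steps follows (PyWavelets ships Daubechies wavelets only up to db38, so db38 stands in for the chapter's Daubechies-45; the plain least-squares fit regresses the output on lagged inputs only, a simplification of the full MAX estimation, whose MA noise part would need pseudo-linear regression):

```python
import numpy as np
import pywt  # PyWavelets

def a4_d4(sig, wavelet='db38'):
    """4-level decomposition; keep A4 and D4, which carry almost all the
    energy, and reconstruct the signal from them alone (D1-D3 zeroed)."""
    cA4, cD4, cD3, cD2, cD1 = pywt.wavedec(sig, wavelet, level=4)
    approx = pywt.waverec([cA4, cD4, np.zeros_like(cD3),
                           np.zeros_like(cD2), np.zeros_like(cD1)], wavelet)
    return cA4, cD4, approx[:len(sig)]

def lse_coeffs(u, y, order=10):
    """Least-squares fit of y(n) ~ sum_k b_k u(n-1-k), used here as a
    stand-in for the chapter's MAX parameter calibration with LSE."""
    N = len(y)
    Phi = np.column_stack([u[order - k - 1: N - k - 1] for k in range(order)])
    b, *_ = np.linalg.lstsq(Phi, y[order:], rcond=None)
    return b
```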
Fig. 6. Robot environment: labeled audio signals and actual robot trajectory with unlabelled signals (S10, S11, S12, S13).

These coefficients form the feature space, where the relative distances among all the samples are calculated and related in the way explained in Section 2 in order to obtain the transform function fT. With these relations, the curve appearing in Figure 8 is obtained under the minimum square error criterion, approximated by a 4th-order polynomial with the following expression:

f_T = d_{fs} = 9.65e(10)\, d_{xy}^4 + 1.61e(5)\, d_{xy}^3 - 8.49e(2)\, d_{xy}^2 + 144.9\, d_{xy} + 107.84
which is related to the solution of the sound equation in (Kinsler et al., 1995); (Kuttruff, 1979), and thus has a physical meaning. With the transform function fT we proceed to find the two minimum distances in the feature space from each unlabelled sample to the labeled ones, that is, for audio signals S10, S11, S12 and S13 with respect to S1, ..., S9.
We obtain four solutions for each signal, because each distance in the feature space crosses the fT curve four times. In order to discard the false solutions we use the previous position information of the robot, that is, the point (xi, yi)prev. We also know the robot speed (v = 15 cm/s) and the computation time between each new position given by the system, which is close to 3 s. If we consider the movement of the robot at constant speed, the new position will be (xi, yi)prev ± (450, 450) mm.
Fig. 7. (Up) Multilevel wavelet decomposition of a non-speech signal (S2) by an approximation signal and four signal details; (down) comparison between (left) original signal (A4+D4) and the estimated signal and (right) its error for S11.
Fig. 8. Transform function fT (distance in feature space vs. distance in space domain; real and estimated data).

With this information we choose the solution that best fits the crossing-circles solution and the possible robot movement. In order to solve the ambiguity of the two intersection points, the angle computed by the robot (γ) is compared with the angles (αFN1, αFN2) analytically computed between the two intersection points and the sound source (corrected with respect to the magnetic north). The solution is the angle αFNi closest to γ. The uncertainty of this location is bounded by d·sin(ε), where ε is the difference between the actual angle of the robot with respect to the sound source and max{αFNi, γ}, and d is the actual distance of the robot to the sound source. In our experiments, we have verified that ε is limited to 1.9° for d = 1 m and 1° for d = 2.5 m (between 3.3 and 4.3 cm of absolute error in localization).

4.2 Method based on the LPV model

In the second proposed method, based on the LPV model, the methodology applied to determine the robot's position is the following:
1. The robot acquires an audio signal in its current position and performs an identification process, taking as input signal the filtered sound source signal and as output signal the acquired and filtered signal. The parameters corresponding to the poles obtained in this identification process will be the feature components for further steps.
2. The Euclidean distances in the feature space are calculated between the current position and the different labeled samples.
3. The two first samples are chosen and the distances between them and the robot's position are calculated. Through a transformation function fT, in the same way as in the previous approach, the distance in the feature domain is converted to a distance in the space domain. These two distances in the space domain give two possible positions by the crossing circles of distances.
4. To discriminate between both possible solutions, the angles between each one and the platform containing the microphone array (which contains a compass) are calculated,
and the one closest to the platform angle will be chosen as the discriminatory variable to select the current robot's position.
5. Steps 3 and 4 are repeated with the remaining labeled samples, and the solution is chosen as the one with the closest angle to the robot's platform.

The acoustic response of the environment is very directional, and this fact leads us to consider some uncertainty in the determination of the transformation function which relates the distance in the feature space and the distance in the space domain. The robot, in order to determine its location, performs the identification process between the sound signal emitted by the sound sources and the signal acquired by the microphone. As can be seen in Figure 9, the robot follows the trajectory indicated by the arrows. The sound sources (climate chamber and PCB insulator) are indicated in the map. Two experiments are carried out, using the two sound sources separately. There are two kinds of audio samples: R1, R2, R3, R4, R5, R6 and R7, which are used in the recognition step, whereas M1, M2, M3, M4 and M5 are labeled samples used in the learning step. The signal acquired from the climatic chamber will be used in the identification process. This signal is time-continuous and, initially, non-stationary; but because the signal is generated by rotating electrical machines, it has some degree of stationarity when a high number of samples is used, in this case 50,000 samples (1.13 seconds).
Fig. 9. Robot environment: labeled audio signals and actual robot trajectory with unlabeled signals (R1, R2, R3, R4, R5, R6 and R7).

The fundamental frequency is located at 100 Hz (see Figure 10), and there are also some significant harmonics above and below it. In order to simplify the identification process, only the fundamental frequency at 100 Hz will be taken into account. In this approach the sampling frequency is 44,100 Hz. Lower frequencies could be used instead, avoiding working with a high number of samples, but this frequency has been chosen because in the near future a voice recognition system will be implemented aboard the robot, and it will be shared with this audio localization system.
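Locating the dominant spectral component that drives this choice is straightforward (a sketch; the Hann window is an illustrative detail):

```python
import numpy as np

FS = 44100   # sampling frequency used in this approach (Hz)

def dominant_frequency(x, fs=FS):
    """Frequency of the strongest spectral peak; for the climatic chamber
    signal of Fig. 10 this should land near 100 Hz."""
    spectrum = np.abs(np.fft.rfft(x * np.hanning(len(x))))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    return freqs[np.argmax(spectrum)]
```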
The signal emitted by the PCB insulator machine and its spectrum can be seen in Figure 11. To facilitate the plant identification process, centering its response on the 100 Hz component, the input and output signals are filtered; consequently, the input-output relationship in linear systems is an ARX model. To do that, a band-pass filter is applied to the sound signals acquired by the robot, specifically a 6th-order digital Cauer filter. Figure 12 shows the result of the filter for the input signal at, for instance, robot position R4 in the climatic chamber (experiment 1). After an initial step for selecting the model structure, an ARX model has been selected, for the stationarity reasons explained above (Charbonnier et al., 1987), with na = 10, nb = 4 and a delay of 2 for the case of the climatic chamber (experiment 1), and na = 10, nb = 2 and a delay of 4 in the case of the PCB insulator (experiment 2). Once those 5 models are calibrated, they are validated with the FPE (Final Prediction Error) and MSE (Mean Square Error) criteria, yielding values of about 10e(-10) and 3% respectively, using 5000 data points for identification and 3000 for validation. Besides, for all the estimated models the residual autocorrelation and the cross-correlation between the inputs and residuals are uncorrelated, indicating the goodness of the models.
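A sketch of such a filter with SciPy (ellip designs a Cauer filter; a prototype order of 3 yields a 6th-order band-pass; the 90-110 Hz band edges and the ripple/attenuation figures are illustrative assumptions, and second-order sections are used for numerical robustness at this narrow relative bandwidth):

```python
from scipy.signal import ellip, sosfiltfilt

FS = 44100   # sampling frequency (Hz)

# Cauer (elliptic) band-pass around the 100 Hz component.
# Prototype order 3 -> overall 6th-order band-pass filter.
sos = ellip(3, 0.5, 60.0, [90.0, 110.0], btype='bandpass', output='sos', fs=FS)

def isolate_100hz(x):
    """Zero-phase filtering applied to both input and output records
    before the ARX identification step."""
    return sosfiltfilt(sos, x)
```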
Fig. 10. Source signal (climate chamber) and its frequency spectrum.
Fig. 11. Source signal (PCB insulator) and its frequency spectrum.
Fig. 12. R4 sound signal (left) and its filtered signal (right).
Fig. 13. Original M5 signal and its estimation.

For instance, for the labeled sample M5 the signal and its estimation can be seen in Figure 13 for the first experiment, validating the model. When observing the pole-zero diagrams of the different transfer function models obtained in the identification process for the labeled signals, there is no difference between the zero positions; on the other hand, there is a significant variation in the pole positions, due mainly to the presence of obstacles and to reverberations, among other effects (see Figure 14). Therefore, we focus on the poles to determine the points in the feature space. In experiment 1, in order to determine the transformation function, for every point in the feature space the distances between the points and the source signal are calculated, and these distances are plotted together with their corresponding distances in the space domain. With these values, after an interpolation process, the transform function fT is computed. In order to estimate the robot localization, we use other information such as the robot speed (in this case 15 cm/s) and the computation time between each new position (3 s). This fact is a source of uncertainty that adds on average ±45 cm to the robot's position.
Fig. 14. Poles and zeros positions in experiment 1 (left) and 2 (right).
In experiment 1, when the climatic chamber is used as the sound source, the obtained transformation function is:

y = 4.4 + 4.4 \sin\!\left( \frac{2\pi x}{170} - \frac{80\pi}{170} \right)

Now, if an uncertainty interval of ±50 cm is assumed, the transformation function that covers this variability in the robot's position can be expressed (for both experiments) as:

y = A + A \sin\!\left( \frac{2\pi x}{170 \pm 50} - \frac{\phi}{170 \pm 50} \right)

Fig. 15. Nominal transformation function and the limits of the interval for the uncertainty in experiment 1.

In Figure 15, the nominal transformation function and the limits of the uncertainty interval transformation functions can be seen.
There is another uncertainty of about ±7.5 degrees in the angle determination, due to the rotary platform on the robot that contains the microphones. Finally, to determine the current robot's position, the solution that provides the closest angle to the robot's platform is chosen. The results of our experiments yield an average error of -1.242% in the X axis and 0.454% in the Y axis in experiment 1, and of 0.335% in the X axis and -0.18% in the Y axis in experiment 2, providing estimated x-y positions that are accurate enough and robust.
5. Conclusion

With the approaches presented in this chapter we have achieved some interesting results that encourage the authors to continue in this research field. The room feature extraction is carried out by identification of the sound signals. Besides, to reinforce the localization, avoiding ambiguity, reducing uncertainty and adding robustness, a sensorial system is used aboard the robot to compute the angle between itself and the sound source. The obtained feature space is related to the space domain through a general approach with acoustical meaning. This novel approach has been validated in different environments, obtaining good results. The results remain very good when the uncertainty is incorporated in the transformation function.
6. References Navarro, D. & Benet, G. (2009). Magnetic Map Building for Mobile Robot Localization Purpose, 14th International Conference on Emerging Technologies and Factory Automation, Palma de Mallorca, September, 2009. Caruso, M. (2000). Applications of magnetic sensors for low cost compass systems, IEEE Position Location and Navigation Symposium, pp. 177–184, San Diego, CA, USA, March 2000. Kim, H.-D.; Kim, D.-W. & Sim, K.B. (2006). Simultaneous Localization and Map Building using Vision Camera and Electrical Compass, SICE-ICASE International Joint Conference, Korea, October, 2006. Kim, H.-D.; Seo, S-.W.; Jang, I.-H. & Sim, K.B. (2007). SLAM of Mobile Robot in the indoor Environment with Digital Magnetic Compass and Ultrasonic Sensors, International Conference on Control, Automation and Systems, Oct. 17-20, 2007, Seoul, Korea. Luo, R.C.; Yih, C.-C. & Su, K.L. (2002). Multisensor Fusion and Integration: Approaches, Applications and Future Research Directions, IEEE Sensors Journal, Vol. 2, no. 2, April 2002. Begum, M.; Mann, G.K.I. & Gosine, R. (2006). A Fuzzy-Evolutionary Algorithm for Simultaneous Localization and Mapping of Mobile Robots, IEEE Congress on Evolutionary Computation, Canada, July, 2006. Brunskill, M. & Roy, N. (2005). SLAM using Incremental Probabilistic PCA and Dimensionality Reduction, Proc. of the IEEE International Conference on Robotics and Automation, Spain, April, 2005. Di Marco, M.; Garulli, A.; Lacroix, S. & Vicino, A. (2000). A Set Theoretic Approach to the Simultaneous Localization and Map Building Problem, Proceedings of the 39th IEEE Conference on Decision and Control, Sydney, Australia, December, 2000.
138
Advances in Sound Localization
Smith, R.; Self, M. & Cheeseman, P. (1990). Estimating uncertain spatial relationships in robotics, Autonomous Robot Vehicles, vol. 8, pp. 167–193, 1990. Thrun, S. (2001). A probabilistic online mapping algorithm for teams of mobile robots, Journal of Robotics Research, vol. 20, no. 5, pp. 335-363, 2001. Begum, M.; Mann, G.K.I. & Gosine, R.G. (2006). An Evolutionary SLAM Algorithm for Mobile Robots, Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, October 9 - 15, 2006, Beijing, China. Priyantha, N.B. (2000). The Cricket location-support system, Proc. of the 6th Annual International Conference on Mobile Computing and Networking, pp. 32-43, 2000. Sayed, A.H.; Tarighat, A. & Khajehnouri, N. (2005). Network-Based Wireless Location: Challenges faced in developing techniques for accurate wireless location information, IEEE Signal Processing Magazine, vol. 22, no. 4, July 2005. Christo, C.; Carvalho, E.; Silva, M.P. & Cardeira, C. (2009). Autonomous Mobile Robots Localization with Multiples iGPS Web Services, 14th International Conference on Emerging Technologies and Factory Automation, Palma de Mallorca, September, 2009. Yang, P.; Sun, H. & Zu, L. (2007). An Acoustic Localization System Using Microphone Array for Mobile Robot, International Journal of Intelligent Engineering & Systems, 2007. Mumolo, E.; Nolich, M. & Vercelli, G. (2003). Algorithms for acoustic localization based on microphone array in service robotics, Robotics and Autonomous Systems, vol. 42, pp. 69-88, 2003. Csyzewski, A. (2003). Automatic identification of sound source position employing neural networks and rough sets, Pattern Recognition Letters, vol. 24, pp. 921-933, 2003. Ying, J. & Runze, Y. (2007). Research Status and Prospect of the Acoustic Localization Techniques, Audio Engineering, vol. 31, no. 2, pp. 4-8, 2007. Sasaki, Y.; Kagami S. & Mizoguchi, H. (2006). Multiple sound source mapping for a mobile robot by selfmotion triangulation, Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Beijing, China, 2006. Kim, Y.-E.; Su, D.-H.; Chung, G.-J.; Huang, X. & Lee, C.-D. (2009). Efficient Sound Source Localization Method Using Region Selection, IEEE International Symposium on Industrial Electronics, ISlE 2009, Seoul Olympic Parktel, Seoul, Korea July 5-8, 2009. Brandstein, M.S. & Silverman, H. (1997). A practical methodology for speech source localization with microphone arrays, Computer Speech and Language, vol. 11, no. 2, pp. 91-126, 1997. Knapp, C.H. & Carter, G.C. (1976). The generalized correlation method for estimation of time delay, IEEE Trans. on Acoustics, Speech and Signal Processing, vol. Assp-24, no. 4, 1976. Nakashima, H. & Mukai, T. (2005). 3D Sound source localization system based on learning of binaural hearing, IEEE International Conference on Systems, Man and Cybernetics, vol. 4, pp. 3534-3539, 2005. Kim, H.S. & Choi, J. (2009). Binaural Sound Localization based on Sparse Coding and SOM, IEEE/RSJ International Conference on Intelligent Robots and Systems, October 11-15, 2009, St. Louis, USA. Yi, H. & Chu-na, W. (2010). A New Moving Sound Source Localization Method Based on the Time Difference of Arrival, Proc. of the International Conference on Image Analysis and Signal Processing, pp. 118-122, 9-11 April, 2010, Zhejiang, China.
Robust Audio Localization for Mobile Robots in Industrial Environments
139
Rodemann, T.; Joublin, F. & Goerick, C. (2009). Audio Proto Objects for Improved Sound Localization, IEEE/RSJ International Conference on Intelligent Robots and Systems, October 11-15, 2009, St. Louis, USA Mori, K.; Kasashima, N.; Yoshioha, T. & Ueno, Y. (1996). Prediction of Spalling on a Ball Bearing by Applying the Discrete Wavelet Transform to Vibration Signals, Wear, vol. 195, no. 1-2, pp. 162-168, 1996. Roberts, S. & Everson, R. (2001). Independent Component Analysis: Principles and Practice, Cambridge Univ. Press, Cambridge, UK, 2001. Bolea, Y.; Grau, A. & Sanfeliu, A. (2003). Non-speech Sound Feature Extraction based on Model Identification for Robot Navigation, 8th Iberoamerican Congress on Pattern Recognition, CIARP 2003, Lectures Notes in Computer Science, LNCS 2905, pp. 221228, Havana, Cuba, November 2003. Mallat, S. & Zhang, Z. (1993). Matching pursuits with time-frequency dictionaries, IEEE Trans. on Signal Processing, vol.45, no.12, pp. 3397-3415, 1993. Donoho, D.-L. (1999). De-noising by soft-thresholding, IEEE Trans. on Information Theory, vol. 33, no. 7, pp. 2183-2191, 1999. Lin, J. (2001). Feature Extraction of Machine Sound using Wavelet and its Application in Fault Diagnosis, NTD&E International, vol. 34, pp. 25-30, 2001. Grau, A.; Bolea, Y. & Manzanares, M. (2007). Robust Industrial Machine Sounds Identification based on Frequency Spectrum Analysis, 12th Iberoamerican Congress on Pattern Recognition, CIARP 2007, Lecture notes in Computer Science, LNCS 4756, pp. 71-77, Valparaiso, Chile, November 2007. Ljung, L. (1987). System identification: Theory for the user. Prentice-Hall, 1987. Charbonnier, R.; Barlaud, M.; Alengrin, G. & Menez, J. (1987). Results on AR-modeling of nonstationary signals, IEEE Trans. Signal Processing, vol. 12, no. 2, pp. 143-151. Kayhan, A.S.; Ei-Jaroudi, A. & Chaparro, L.F. (1994). Evolutionary periodogram for nonstationary signals, IEEE Trans. Signal Process, vol. 42, no. 6, pp. 1527-1536. Tsatsanis, M.K. & Giannakis, G.B. (1993). Time-varying system identification and model validation using wavelets, IEEE Trans. Signal Process, vol. 41, no. 12, pp. 3512-3523. Kinsler, L.; Frey, A.; Coppens, A. & Sanders, J. (1995). Fundamentals of Acoustics, Limusa Ed., Barcelona, 1995. Kuttruff, H. (1979). Room Acoustics, Applied Science Publishers Ltd., 1979. Haneda, Y.; Makino, S. & Kaneda, Y. (1992). Modeling of a Room Transfer Function Using Common Acoustical Poles, IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP-92, vol.2, pp. 213–216, 1992. Haneda, Y.; Kaneda, Y. & Kitawaki, N. (1999). Common-Acoustical-Pole and Residue Model and Its Application to Spatial Interpolation and Extrapolation of a Room Transfer Function, IEEE Transactions on Speech and Audio Processing, vol. 7, no. 6, Nov. 1999. Gustaffson, T.; Pota, H.R.; Vance, J.; Rao, B.D. & Trivedi, M.M. (2000). Estimation of Acoustical Room Transfer Functions, Proceedings of the 39th IEEE Conference on Decision and Control, Sydney, Australia, December 2000. Bolea, Y.; Manzanares, M. & Grau, A. (2008). Robust robot localization using non-speech sound in industrial environments, IEEE International Symposium on Industrial Electronics, ISIE 2008, Cambridge, United Kingdom, 30 June- 2 July 2008.
140
Advances in Sound Localization
Manzanares, M.; Guerra, E.; Bolea, Y. & Grau, A. (2009). Robot Localization Method by Acoustical Signal Identification, 14th IEEE Emerging Tech and Factory Automation, ETFA’09, Palma de Mallorca, Spain, 2009. Bielińska, E. (2002). Speaker identification, Artificial Intelligence Methods, AI-METH, 2002.
9 Source Localization for Dual Speech Enhancement Technology Seungil Kim1, Hyejeong Jeon2, and Lag-Young Kim2 1Speech
2Information
Innovation Consulting Group Technology Lab. LG Electronics Korea
1. Introduction Many researchers have investigated multi-channel speech enhancement techniques which can be used for the pre-processing of the speech recognition system. Numerous microphones can give high performance, but they require additional hardware costs and generate the design problem about microphone position. Therefore speech enhancement technique using two microphones is preferred in mobile phone such as LG KM900, iPhone 4 and Nexus One. For enhancing the speech with two or more microphones, the spatial information from the input signal's incident angle should be used. Therefore, various sound source localization(SSL) methods have been used to estimate the talker’s direction-ofarrival(DOA). There are two main approaches to localization (Brandstein, 1995), (Dibase, 2000): the steered-beamformer approach, which includes various kinds of beamformers; and time-difference of arrival (TDOA) approach, which includes a generalized cross-correlation (GCC). The steered-beamformer approach has the capability of enhancing a desired signal that originates from a particular direction. The beamformer can steer its response at a particular angle; it can then find the spatial information required to maximize the beamformer output by scanning over a predefined spatial region. For this purpose, we can use a simple conventional delay-and-sum beamformer or many optimum beamfomers (Naguib, 1996). The TDOA approach uses classical time delay estimation techniques, such as cross-correlation, GCC, adaptive time delay estimation, and the adaptive eigenvalue decomposition algorithm (Chen et al., 2006). The most common time delay estimation method is the GCC, which consists of various types such as the unfiltered type, the maximum likelihood (ML) type, and the phase transform (PHAT) type. The GCC-PHAT is a widely used for TDOA estimation method because it works well in a realistic environment. The resolution of the DOA estimator is deeply related to the aperture size of the array and the number of microphone. A large aperture size and microphones make an accurate estimation result. Therefore, SSL method using two microphones cannot give the accurate direction-of-arrival (DOA) estimation result. Moreover, the implementation of a TDOA estimator requires a voice activity detector (Araki et al., 2007) or a speech/non speech detector (Lathoud, 2006). However, the TDOA estimation often shows a failed result in spite of these kinds of additional processing. Hence, reliable SSL algorithm is needed for dual channel speech enhancement system.
142
Advances in Sound Localization
In this chapter, we will define the reliability measure based on waterbed effect of DOA estimator and then show a method of increasing the accuracy of DOA estimation by using reliability measure(Jeon et al., 2007).
2. Dual Speech Enhancement technology Dual Speech Enhancement (DSETM) is a trademark of advanced two-channel speech enhancement technology developed by LG Electronics. It has been shown that DSE would be competitive to the other state-of-art speech enhancement technologies. DSE technology can be divided into two sub-technologies according to its function and aim. One is the Dual Speech Enhancement for Talk (DSE.TTM) and the other is the Dual Speech Enhancement for Recording (DSE.RTM). DSE.T is a solution for speech communication system. Comfortable call is unfortunately impossible in noisy environments. DSE.T can be a new solution to enhance speech quality in noisy place. DSE.T technology was introduced at CES 2009 in Las Vegas via Woo-hyun Baek (LG CTO) as one of the representative technologies prepared by LG Electronics. DSE.R makes clear video recording with directionality. In DSE.R, two omni-directional microphones are processed and virtually make them as one directional microphone. One more useful thing in DSE.R is the function of electrical steering. Therefore, the user can select the direction of sound focusing. If user wants to record the voice of person who is pictured, user only needs to select “Producer Mode”. If user wants to record the landscape or something else, user will select “Narrator Mode”. DSE technology was applied to commercial LG mobile phone, KM900 Arena as shown in Fig. 1. Four related video clips are available in following link and they will be also presented in the multimedia appendix of this book. LG DSE.T technology - http://goo.gl/TUEo LG’s dual mic noise reduction demo at CES 2009 - http://goo.gl/QlFx LG DSE.R technology - http://goo.gl/dbz3 DSE.R Test : LG KM900 Arena Video Recording - http://goo.gl/kwzJ
Fig. 1. LG KM900 Arena Phone
143
Source Localization for Dual Speech Enhancement Technology
3. Sound source localization for Dual Speech Enhancement technology The direction of talker can be used for DSE technology. However, two microphones are not enough to get high angular resolution at the DOA estimator. Therefore it is needed to reject some unreliable results and select the good one. For the reliable DOA estimation, we adopt the new scheme “Reliability Measure” which arises from waterbed effect. If the obtained reliability measure is lower than a predefined threshold, the result will be rejected. By using the reliable results only, we can decrease the detection failure of the signal. 3.1 Waterbed effect in DOA estimation The waterbed effect means that if somewhere the amplification characteristic is pushed down, it goes up somewhere else. This term is usually used in filter response. Stoica and Ninness showed the waterbed effect would appear in spectral estimation (Stoica & Ninness, 2004) (Ninness, 2003). He proved that the power spectral density estimated by a periodogram has a constant average relative variance. The searching method for the DOA estimation is very similar to a spectral estimation such as a periodogram. Therefore the waterbed effect in DOA estimation can be obtained by similar process to spectral estimation (Jeon et al. 2007). Let Φ(ω ) be the power spectral density of a Gaussian white noise process ˆ (ω ) be the periodogram estimate of Φ(ω ) . We can then show that the variance of and Φ ˆ Φ(ω ) is proportional to the square of the power spectral density (Hayes, 1996). Thus,
{
}
ˆ (ω ) = Φ 2 (ω ). var Φ
(1)
ˆ (ω ) has the following form: The average relative variance of Φ
{ } ˆ (ω )} var {Φ dω
ˆ (ω ) average relative variance Φ =
1 π ∫ 2π −π
(2)
Φ 2 (ω )
= 1. This phenomenon has been called the waterbed effect. The waterbed effect in DOA estimation can be reduced by a similar process. If R(θ ) is the cross-correlation value of GCC-PHAT, then K −1
X1[ k ]X 2* [ k ]
k =0
X1[ k ]X 2* [ k ]
R(θ ) = ∑
e
−j
2π k d sin θ K c
, θ ∈ [ −π , π ].
(3)
In addition, a power pattern estimate of R(θ ) , Pˆ (θ ) , is expressed by 1 1 K − 1 K − 1 X1[ k ]X 2* [ k ] X1* [l ]X 2 [l ] − j 2 Pˆ (θ ) = R(θ ) = e ∑ ∑ K K k = 0 l = 0 X1[ k ]X 2* [ k ] X1* [l ]X 2 [l ]
Furthermore, the expected value of the power pattern is
2π ( k − l ) d sin θ K c
.
(4)
144
Advances in Sound Localization 2π ( k − l ) d sin θ 0 ⎫ 2π ( k − l ) d sin θ 1 K − 1 K − 1 ⎪⎧ j K ⎪ −j K c c . E Pˆ (θ ) = ∑ ∑ E ⎨e ⎬e K k =0 l =0 ⎪ ⎪ ⎩ ⎭
{
}
(5)
Let the input signal be a spatially white noise process, and note that the signal is assumed to be Gaussian white noise in the spectral estimation. For spatially white noise, the expected j
2π ( k − l ) d sin θ 0
c is equal to unity when k = l only; otherwise it is equal to zero. Thus, value of e K the expected value of the power pattern in (5) is equal to unity. That is,
{
}
E Pˆ (θ ) = 1.
(6)
The second-order moment of the power pattern estimate is 2π ( k − l + m − n ) d sin θ 0 ⎫ 2π ( k − l + m − n ) d sin θ 1 K − 1 K − 1 K − 1 K − 1 ⎧⎪ j ⎪ −j K c K c , E Pˆ 2 (θ ) = 2 ∑ ∑ ∑ ∑ E ⎨ e e ⎬ K k = 0 l = 0 m = 0 n = 0 ⎩⎪ ⎭⎪
{
}
(7)
and separated by sum of two parts as follows: ⎧⎪ j 2π ( k − l + m − n ) d sin θ0 ⎫⎪ − j 2π ( k − l + m − n ) d sin θ 1 K c K c E Pˆ 2 (θ ) = 2 ∑ ∑ ∑ ∑ E ⎨ e ⎬e K k + m= l + n ⎪⎩ ⎪⎭
{
}
⎧⎪ j 2π ( k − l + m − n ) d sin θ0 ⎫⎪ − j 2π ( k − l + m − n ) d sin θ 1 K c K c . + 2 ∑ ∑ ∑ ∑ E ⎨e ⎬e K o. w ⎩⎪ ⎭⎪ The number that can be satisfy k + m = l + n is
(8)
(K + 1)(K + 2)(2 K + 3) − 3K . Hence, the 6
equation (8) can be simplified to
1 ⎡ (K + 1)(K + 2)(2K + 3) ⎤ E Pˆ 2 (θ ) = 2 ⎢ − 3K ⎥ . 6 K ⎣ ⎦
{
The variance of Pˆ (θ ) is
}
{
} {
} {
}
Var Pˆ (θ ) = E Pˆ 2 (θ ) − E Pˆ (θ )
=
(9)
2
2 K 3 + 3K 2 − 5K + 6 K ≈ . 3 6K 2
(10)
By using (6) and (10), we can calculate the average relative variance of Pˆ (θ ) as follows:
{ } var {Pˆ (θ )} dθ
average relative variance Pˆ (θ ) =
1 π ∫ 2π −π
=
2K 3 + 3K 2 − 5K + 6 K ≈ . 3 6K 2
P 2 (θ )
(11)
Source Localization for Dual Speech Enhancement Technology
145
This equation is the waterbed effect in the DOA estimation. Figure 2 shows the result of the DOA estimation. The input signal which had the source in the angle of 30° location was used. The result showed that the direction was correctly estimated and the waterbed effect appeared in the angle of -30°. Even though there was no other signals, the result showed that there is the negative value in the angle of -30°. 3.2 Reliability measure The concept of reliability measure was presented in (Jeon et al., 2007) and (Jeon, 2008). Figure 3 shows the cross-correlation value of the GCC-PHAT when the speech source is present at a direction of 0° and when the speech source is absent.
Fig. 2. Waterbed Effect in the DOA Estimation To test the waterbed effect, we seated the talker in front of the dual microphone receiver. When a dominant source exists, the waterbed effect should cause the mainlobe to be prominent. If there is no directional source, R(θ ) has a flat pattern for all directions. The reliability measure ( z ), which indicates the prominence of the lobe of R(θ ) , is defined as
z = f ( Rmax − Rmin ) ,
(12)
where f ( x ) is any monotone-increasing function, Rmax is the maximum value of R and Rmin is the minimum. We used the formula f ( x ) =
x K
2
.
In Fig. 3, the reliability ( z ) is 0.0177 when speech is absent and the reliability ( z ) is 0.9878 when speech is present. Because the reliability measure refers to the directivity of the sound source, we only selected the DOA estimation results that had a high reliability value and we clustered those results. If we assume that a reliable DOA estimation result can be obtained when a dominant directional input exists, we can consider the following two hypotheses of reliability decision problem:
146
Advances in Sound Localization
Assuming that reliable DOA estimation result can be obtained when dominant directional input exists, two hypotheses of reliability decision problem are as follow: H 0 : unreliable DOA estimation result H 1 : reliable DOA estimation result
And the hypothesis test equation can be defined as d1
1 2 > z = 2 ( Rmax − Rmin ) η , < K
(13)
d0
where η is the threshold for the selection of reliable results.
Fig. 3. The cross-correlation value when the speech source is present and when speech source is absent. 3.3 Determination of the threshold To determine whether the estimate is reliable or not, we need to find the optimum threshold for detection. In (Kim et al., 2008), the optimum threshold was calculated based on maximum likelihood criteria. If we assume that the structure of z is known, reliable source detection can be considered as a simple binary decision problem. To determine which probabilistic model is fit to z , we made observations of z . The recorded data used to calculate the value of z was measured in a quiet conference room. The microphones were 8 cm apart and a single talker was located in front of the microphones. We visually determined that z could be modeled with a Rayleigh pdf as follows:
p( z|H 0 ) =
⎛ z2 ⎞ exp ⎜⎜ − 2 ⎟⎟ σ 02 ⎝ 2σ 0 ⎠ z
(14)
147
Source Localization for Dual Speech Enhancement Technology
p( z|H 1 ) =
⎛ z2 ⎞ exp ⎜⎜ − 2 ⎟⎟ . σ 12 ⎝ 2σ 1 ⎠ z
(
(15)
)
The ML estimation for the unknown parameter σ 02 ,σ 12 is given by the maximum value of the log-likelihood function (Schmidt et al., 1996). If we have N 0 items of observation data for z , which is in a decision region Z0 , then
σ 02 =
1 N0 2 ∑ zi 2N0 i =1
, zi ∈ Z0 .
(16)
, z j ∈ Z1 .
(17)
Similarly, σ 12 can be easily obtained as follows:
σ 12 =
1 N1 2 ∑ zi 2N 1 j =1
Figure 4 depicts the observation data distributions fitted with a Rayleigh model. In the quiet conference room, the estimated variances σ 0 and σ 1 are 0.0183 and 0.1997, respectively. If we make use of the likelihood ratio Λ( z ) =
p( z| H 1 ) , p( z|H 0 )
(18)
the decision rule can be represented by d1
⎛σ 2 −σ 2 ⎞> σ2 Λ( z) = 02 exp ⎜⎜ 1 2 20 ⋅ z2 ⎟⎟ λ . σ1 ⎝ 2σ 0 σ 1 ⎠
ln ⎜⎜ 02 ⎟⎟ − ⎜⎜ 1 2 20 ⋅ z2 ⎟⎟ ln λ . 2 σ σ σ 0 1 ⎝ 1⎠ ⎝ ⎠ d
2σ 02σ 12 ⎧⎪ ln ln ⋅ λ + ⎜ ⎨ ⎜ σ 2 ⎟⎟ ⎬ = η . < σ 12 − σ 02 ⎩⎪ ⎝ 0 ⎠ ⎭⎪ d
(21)
0
When ln λ is equal to zero, the threshold of the ML decision rule (Melsa & Cohn, 1978) can be determined by
η ML =
⎛σ 2 ⎞ 2σ 02σ 12 ⋅ ln ⎜⎜ 12 ⎟⎟ . 2 2 σ1 −σ0 ⎝ σ0 ⎠
(22)
148
Advances in Sound Localization
(
)
If we use σ 02 ,σ 12 = ( 0.0183, 0.1997 ) , which is previously calculated, η ML becomes 0.0567 for Fig. 4. Probability density function
20
Data p(z|H0) p(z|H1)
18 16 14 12 10 8 6 4 2 0
0
0.1
0.2
0.3
0.4
0.5 z
0.6
0.7
0.8
0.9
1
Fig. 4. The cross-correlation value when the speech source is present and when speech source is absent.
4. Performance evaluations 4.1 Simulations The simulation was performed with the male talker’s speech signal. The input speech came from the 30° and the spatially white random noise was mixed to make the SNR of 5dB, 10 dB, 15 dB, and 20 dB. The distance between two microphones was assumed to be 8cm. The comparison of the estimated DOA is shown in Fig. 5. When the reliability measure and the threshold selection were applied, the average value of the estimated DOA was close to the speech direction. Also, the standard deviation and the RMS error was drastically reduced. 4.2 Experiments To evaluate the performance of the proposed method, we applied it to the speech data recorded in a quiet conference room. The size of room was 8.5m x 5.5m x2.5m. This conference room, which was suitable for a conference with the several people, generated a normal reverberation effect. The impulse response of the conference room is shown in Fig. 6. The room had various kinds of office furniture such as tables, chairs, a white board standing on the floor, and a projector fixed to the ceiling. The two microphones were placed on the table in the center of the room, and the distance between the microphones was set to 8 cm. Figure 7 shows the experimental setup. The sampling rate of the recorded signal was 8 kHz, and the sample resolution of the signal was 16 bits. Because the proposed method worked efficiently for the probabilistic model of reliability, we found it useful to eliminate the perturbed results of the estimated DOA in the speech recorded in this room. We compared the results with the normal GCC-PHAT method.
Source Localization for Dual Speech Enhancement Technology
149
Fig. 5. (a) The average estimated DOA (b) The standard deviation (c) The RMS error when the SNR was 5 dB, 10 dB, and 20 dB
Fig. 6. Impulse response of the conference room for the experiments
150
Advances in Sound Localization
4.2.1 Reliability As shown in Fig. 7 and Fig. 8, we performed the experiment of the DOA estimator for a talker's speech from a direction of 60°. White noise and tone noise resulted from the fan of the projector. Screen
Whiteboard Microphones
Table Chairs
Fig. 7. The Experimental Setup Screen Whiteboard
60 ° Microphones
1.5m
Fig. 8. The Recording Setup for Fixed Talker’s Location Figure 9(a) shows the waveform of the talker's speech. We calculated the direction of the talker's speech on the basis of the GCC-PHAT, and the result is shown in Fig. 9(b). The small circles in the figure indicate the results of the estimated DOA. There are many incorrect results for the estimated DOA, especially in periods when the talker didn’t talk. Because of the estimated DOA results for when the talker didn’t talk, there was a drastic drop in the performance of the estimated DOA. We calculated the reliability values of the given speech and applied the results to the estimated DOA.
Source Localization for Dual Speech Enhancement Technology
151
Fig. 9. (a) A waveform of the talker’s speech (b) DOA estimation results of GCC-PHAT. It doesn’t use the reliability measure.
Fig. 10. (a) The calculated reliability for Fig. 9(a). (b) DOA estimation results of GCC-PHAT. It uses the reliability measure and eliminates unreliable estimates. Figure 10(a) shows the reliability measures of the given speech, and Fig. 10(b) shows the estimated DOA after the removal of any unreliable results. We set the threshold, η , to 0.15. The x-marks indicate the eliminated values; these values were eliminated because the reliability measure revealed that those results were perturbed.
152
Advances in Sound Localization
We can trace the talker’s direction by using this method. In the experiment, the talker spoke some sentences while walking around the table, and the distance from the talker to the microphones was about 1.5 m. Figure 11 shows the talker's path in the room. Screen Whiteboard 90 ° 135 °
45 ° Table
180 ° Microphones
Talker
270 °
0°
315 °
Fig. 11. The Recording Setup for Moving Talker Figure 12(a) and Fig. 12(b) show the waveform and the estimated DOA based on the GCCPHAT. The results of the estimated DOA are very disturbed because of the perturbed results. Figure 13(a) shows the calculated reliability values for the speech. By applying the reliability measure, as shown in Fig. 13(b), we can eliminate the perturbed values and produce better results for the estimated DOA. The x-marks represent the eliminated results. By eliminating the perturbed results, we can ensure that the estimated DOA is more accurate and has a smaller variance. There is a degree of difference between the source direction and the average estimated DOA value. The difference occurs with respect to the height of the talker’s mouth. Basically, we calculated the direction of the source from the phase difference of the two input signals. When we set the source direction, we thought the source was located on the same horizontal plane as the microphones. Thus, when the height of the source is not the same as the table, the phase difference cannot be the intended value as shown in Fig. 14. Even though we set the source direction at 90°, the actual source direction was 90°- θ h , where θ h is ⎛h⎞ ⎝ ⎠
θ h = tan −1 ⎜ ⎟ d
(23)
Because we used the source signal incident from the direction of 60° in Fig. 8, the actual source direction would be 48.5507° by using (23). The same phenomenon also occured in the next experiment; hence, the estimated DOA range was reduced to (-90°+ θ h , 90°- θ h ), not (90°, 90°).
Source Localization for Dual Speech Enhancement Technology
Fig. 12. A waveform of the talker’s speech (b) DOA estimation results of GCC-PHAT. It doesn’t use the reliability measure.
Fig. 13. (a) The calculated reliability for Fig. 11(a). (b) DOA estimation results of GCCPHAT. It uses the reliability measure and eliminates unreliable estimates.
153
154
Advances in Sound Localization
90 °
θh
h
d
Fig. 14. The Recording Setup for Moving Talker 4.2.2 Speech recognition with DSE technology The source localization has played an important role in the speech enhancement system. We applied the proposed localization method to the speech recognition system and evaluate its performance in a real car environment (Jeon, 2008). The measurements were made in a mid-sized car. The input microphones were mounted on a sun visor for speech signal to impinge toward the input device (at the direction of 0°) as shown in Fig. 15. And a single condenser microphone was mounted between the two microphones. It was installed for the comparison with DSE output. The reference microphone was set in front of speaker. We controlled the background noise with the driving speed. In the high and low noise condition, the speed of car was 80-100km/h and 40-60km/h, respectively.
Fig. 15. The experiment setup in a car
155
Source Localization for Dual Speech Enhancement Technology
For speech recognition test, we used the Hidden Markov Model Toolkit (HTK) 3.4 version as speech recognizer. HTK is a portable toolkit for building and manipulating hidden Markov models. HTK is primarily used for speech recognition research (http://htk.eng.cam.ac.uk/). We used 30 Korean phonemes word set for the experiments. The 30 words were composed of commands which were indispensable to use the telematics system. The speech recognition result is shown in Table 1. The speech recognition rate was decreased according as the background noise was increased. Noise Type
Speech Recognition Rate
Low (low speed) High (high speed)
73.33 58.83
Table 1. The speech recognition rate results : No pre-processing We tested the DSE technology and source localization method using reliability measure. For evaluation, signal-to-noise ratio (SNR) and speech recognition rate were used. The SNR results are shown in table 2. The SNR for the low noise environment was increased from 9.5 to 18.5 and for the high noise from 1.8 to 14.9. The increased performance of the DSE technology affected to the speech recognition rate. The speech recognition rate is shown in table 3 when the DSE technology was adopted. Without reliability measure, the speech recognition system for the high noise environment didn’t give a good result as table 1. However the speech recognition rate was increased from 58.83 to 65.81 for the high noise environment when DSE technology was used. Method
Low Noise
High Noise
Single Microphone DSE w/o reliability measure DSE with reliability measure
9.5 5.2 18.5
1.8 2.7 14.9
Table 2. SNR comparison results Noise Type
Speech Recognition Rate
Low (low speed) High (high speed)
77.42 65.81
Table 3. Speech recognition rate results : DSE pre-processing with reliability measure
5. Conclusions We introduced a method of detecting a reliable DOA estimation result. The reliability measure indicates the prominence of the lobe of the cross-correlation value, which is used to find the DOA. We derived the waterbed effect in the DOA estimation and used this effect to calculate the reliability measure. To detect reliable results, we then used the maximum likelihood decision rule. By using the assumption of the Rayleigh distribution of reliability, we calculated the appropriate threshold and then eliminated the perturbed results of the
156
Advances in Sound Localization
DOA estimates. We evaluated the performance of the proposed reliability measure in a fixed talker environment and a moving talker environment. Finally we also verified that DSE technology using this reliable DOA estimator would be useful to speech recognition system in a car environment.
6. References S. Araki, H. Sawada, and S. Makino (2007). “Blind speech separation in a meeting situation with maximum SNR beamformers,” IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. I, p. 41-44. M. Brandstein (1995). A Framework for Speech Source Localization Using Sensor Arrays, Ph. D Thesis, Brown University. J. Chen, J. Benesty, and Y. Huang (2006). “Time delay estimation in room acoustic environments: An overview,” EURASIP Journal on Applied Signal Processing, Vol. 2006, pp. 1-19. J. Dibase (2000). A High-Accuracy, Low-Latency Technique for Talker Localization in reverberant Environments Using Microphone Arrays, Ph. D Thesis, Brown University. M. Hayes (1996). Statistical Digital Signal Processing and Modeling, John Wiley & Sons. H. Jeon, S. Kim, L. Kim, H. Yeon, and H. Youn (2007). “Reliability Measure for Sound Source Localization,” IEICE Electronics Express, Vol.5, No.6, pp.192-197. H. Jeon (2008). Two-Channel Sound Source Localization Method for Speech Enhancement System, Ph. D Thesis, Korea Advanced Institute of Science and Technology. G. Lathoud (2006). Spatio-Temporal Analysis of Spontaneous Speech with Microphone Arrays, Ph. D Thesis, Ecole Polytechnique Fédérale de Lausanne. J. Melsa, and D. Cohn (1978). Decision and Estimation Theory, McGraw-Hill. A. Naguib (1996). Adaptive Antennas for CDMA Wireless Networks, Ph. D Thesis, Stanford University. B. Ninness (2003). “The asymptotic CRLB for the spectrum of ARMA processes,” IEEE Transactions on Signal Processing, Vol. 51, No. 6, pp. 1520-1531. F. Schmitt, M. Mignotte, C. Collet, and P. Thourel (1996). ''Estimation of noise parameters on SONAR images'', in SPIE International Society for Optical Engineering - Technical Conference on Application of Digital Image Processing XIX - SPIE'96 , Vol. 2823, pp. 112, Denver, USA. P. Stoica, J. Li, and B. Ninness (2004). “The Waterbed Effect in Spectral Estimation,” IEEE Signal Processing Magazine, Vol. 21, pp. 88-100.
10 Underwater Acoustic Source Localization and Sounds Classification in Distributed Measurement Networks Octavian Adrian Postolache1,2, José Miguel Pereira1,2 and Pedro Silva Girão1 1Instituto
de Telecomunicações, (LabIM) Portugal
2ESTSetúbal/IPS
1. Introduction Underwater sound signals classification, localization and tracking of sound sources, are challenging tasks due to the multi-path nature of sound propagation, the mutual effects that exist between different sound signals and the large number of non-linear effects that reduces substantially the signal to noise ratio (SNR) of sound signals. In the region under observation, the Sado estuary, dolphins’ sounds and anthropogenic noises are those that are mainly present. Referring to the dolphins’ sounds, they can be classified in different types: narrow-band-frequency-modulated continuous tonal sounds, referred to as whistles, broadband sonar clicks and broadband burst pulse sounds. The system used to acquire the underwater sound signals is based on a set of hydrophones. The hydrophones are usually associated with pre-amplifying blocks followed by data acquisition systems with data logging and advanced signal processing capabilities for sound recognition, underwater sound source localization and motion tracking. For the particular case of dolphin’s sound recognition, dolphin localization and tracking, different practical approaches are reported in the literature that combine time-frequency representation and intelligent signal processing based on neural networks (Au et al., 2000; Wright, 2002; Carter, 1981). This paper presents a distributed virtual system that includes a sound acquisition component expressed by 3 hydrophones array, a sound generation device, expressed by a sound projector, and two acquisition, data logging, data processing and data communication units, expressed by a laptop PC, a personal digital assistant (PDA) and a multifunction acquisition board. A water quality multiparameter measurement unit and two GPS devices are also included in the measurement system. Several filtering blocks were designed and incorporated in the measurement system to improve the SNR ratio of the captured sound signals and a special attention was dedicated to present two techniques, one to locate sound signals’ sources, based on triangulation, and other to identify and classify different signal types by using a wavelet packet based technique.
158
Advances in Sound Localization
2. Main principles of acoustics’ propagation Sound is a mechanical oscillating pressure that causes particles of matter to vibrate as they transfer their energy from one to the next. These vibrations produce relatively small changes in pressure that are propagated through a material medium. Compared with the atmospheric pressure, those pressure variations are very small but can still be detected if their amplitudes are above the hearing threshold of the receiver that is about a few tenths of micro Pascal. Sound is characterized by its amplitude (i.e., relative pressure level), intensity (the power of the wave transmitted in a particular direction in watts per square meter), frequency and propagation speed. This section includes a short review of the basic sound propagation modes, namely, planar and spherical modes, and a few remarks about underwater sound propagation. 2.1 Plane sound waves Considering an homogeneous medium and static conditions, i.e. a constant sound pressure over time, a stimulation force applied in YoZ plane, originates a plane sound wave traveling in the positive x direction whose pressure value, according to Hooke’s law, is given by, p(x) = − Y ⋅ ε
(1)
where p represents the differential pressure caused the sound wave, Y represents the elastic modulus of the medium and ε represents the relative value of its mechanical deformation caused by sound pressure. For time-varying conditions, there will be a differential pressure across an elementary volume, with a unitary transversal area and an elementary length dx, given by,
dp =
∂p(x, t) ⋅ dx ∂x
(2)
Using Newton’s second law and the relationships (1) and (2), it is possible to obtain the relation between time pressure variation and the particle speed caused by the sound pressure, ∂p ∂u(x, t) = −ρ ⋅ ∂x ∂t
(3)
where ρ represents the density of the medium and u(x,t) represents the particle speed at a given point (x) and a given time instant (t). Considering expressions (1), (2) and (3), it is possible to obtain the differential equation of sound plane waves that is expressed by, ∂2 p ∂t
2
=
Y ∂2 p ⋅ ρ ∂ x2
(4)
where Y represents the elastic modulus of the medium and ρ represents its density. 2.2 Spherical sound waves This approximation still considers a homogeneous and lossless propagation medium but, in this case, it is assumed that the sound intensity decreases with the square value of the
Underwater Acoustic Source Localization and Sounds Classification in Distributed Measurement Networks
159
distance from sound source (1/r2), that means, the sound pressure is inversely proportional to that distance (1/r). In this case, for static conditions, the spatial pressure variation is given by (Burdic, 1991), ∇p =
∂p ∂p ∂p ⋅ uˆ x + ⋅ uˆ y ⋅ uˆ z ∂x ∂y ∂z
(5)
where uˆ x , uˆ y and uˆ z represent the Cartesian unit vectors and ∇ represents the gradient operator. Using spherical polar coordinates, the sound pressure (p) dependents only on the distance between a generic point in the space (r, θ, φ) and the sound source coordinates that is located in the origin of the coordinates’ system. In this case, for time variable conditions, the incremental variation of pressure is given by, 1 ∂ 2 (r ⋅ p) ∂ 2 p ⋅ = r ∂t 2 ∂t 2
(6)
where r represents the radial distance between a generic point and the sound source. Concerning sound intensity, for spherical waves in homogeneous and lossless mediums, its value decreases with the square value of the distance (r) since the total acoustic power remains constant across spherical surfaces. It is important to underline that this approximation is still valid for mediums with low power losses as long as the distance from the sound source is higher than ten times the sound wavelength (r>10⋅λ). 2.3 Definition of some sound parameters There are a very large number of sound parameters. However, according the aim of the present chapter, only a few parameters and definitions will be reviewed, namely, the concepts of sound impedance, transmission and reflection coefficients and sound intensity. The transmission of sound waves, through two different mediums, is determined by the sound impedance of each medium. The acoustic impedance of a medium represents the ratio between the sound pressure (p) and the particle velocity (u) and is given by, Zm = ρ ⋅ c
(7)
where, as previously, ρ represents the density of the medium and c represents the propagation speed of the acoustic wave that is, by its turn, equal to the product of the acoustic wavelength by its frequency (c=λ⋅f). Sound propagation across two different mediums depends on the sound impedance of each one, namely, on the transmission and reflection coefficients. For the normal component of the acoustic wave, relatively to the separation plane of the mediums, the sound reflection and transmission coefficients are defined by, ΓR =
Z m1 − Z m 2 Z m1 + Z m 2
ΓT =
2 ⋅ Z m2 Z m1 + Z m 2
(8)
160
Advances in Sound Localization
where ΓR and ΓT represent the refection and transmission coefficients, and, Zm1 and Zm2, represent the acoustic impedance of medium 1 and 2, respectively. For spherical waves, the acoustic intensity that represents the power of sound signals is defined by, I=
1 r2
⋅
(p2 )av ρ⋅c
(9)
where (p2)av is the mean square value of the acoustic pressure for r=1 m and the others variables have the meaning previously defined. The total acoustic power at a distance r, from the sound source, is obtained by multiplying the previous result by the area of a sphere with radius equal r. The results that is obtained is given by, P = 4π ⋅
(p2 )av 2ρ ⋅ c
(10)
This constant value of sound intensity was expected since it is assumed a sound propagation in a homogenous lossless propagation medium. In which concerns the sound pressure level, it is important to underline that this parameter represents, not acoustic energy per time unit, but acoustic strength per unit area. The sound pressure level (SPL) is defined by, SPL = 20 ⋅ log10 (p/p ref )
(11)
where the reference power (pref) is equal to 1 μPa for sound propagation in water or others liquids. Similarly, the logarithmic expression of sound intensity level (SIL) and sound power level (SL) are defined by, I = 10 ⋅ log10 (I/I ref ) dB(SIL) S WL = 10 ⋅ log10 ( W / Wref )
(12)
where the reference values of intensity and power are given by Iref=10-12 W/m2 and Wref=10-12 W, respectively. 2.4 A few remarks about underwater sound propagation It should be noted that the speed of sound in water, particularly seawater, is not the same for all frequencies, but varies with aspects of the local marine environment such as density, temperature and salinity. Due mainly to the greater “stiffness” of seawater relative to air, sound travels approximately with a velocity (c) about 1500 m/s in seawater while in air it travels with a velocity about 340 m/s. In a simplified way it is possible to say that underwater sound propagation velocity is mainly affected by water temperature (T), depth (D) and salinity (S). A simple and empirical relationship that can be used to determine the sound velocity in salt water is given by (Hodges, 2010),
c(T, S, DP) ≅ A 1 + A 2 ⋅ T + A 3 ⋅ T 2 + A 4 ⋅ T 3 + (B1 − B 2 ⋅ T) ⋅ (S − C 1 ) + D 1 ⋅ D
[A1 , A 2 , A 3 , A 4 ] ≅ [1449, 4.6, − 0.055, 0.0003] [B1 , B2 , C1 , D1 ] ≅ [1.39, 0.012, 35, 0.017 ]
(13)
Underwater Acoustic Source Localization and Sounds Classification in Distributed Measurement Networks
161
where temperature is expressed in ºC, salinity in expressed in parts per thousand and depth in m. The sensitivity of sound velocity depends mainly on water temperature. However, the variation of temperature in low depth waters, that sometimes is lower than 2 m in river estuaries, is very small and salinity is the main parameter that affects sound velocity in estuarine salt waters. Moreover, salinity is estuarine zones depends strongly on tides and each sound monitoring measuring node must include at least a conductivity/salinity transducer to compensate underwater sound propagation velocity from its dependence on salinity (Mackenzi, 1981). As a summary it must be underlined that underwater sound transmission is a very complex issue, besides the effects previously referred, the ocean surface and bottom reflects, refracts and scatters the sound in a random fashion causing interference and attenuation that exhibit variations over time. Moreover, there are a large number of non-linear effects, namely temperature and salinity gradients, that causes complex time-variable and non-linear effects.
3. Spectral characterization of acoustic signals Several MATLAB scripts were developed to identify and to classify acoustic signals. Using a given dolphin sound signal as a reference signal, different time to frequency conversion methods (TFCM) were applied to test the main characteristics of each one. 3.1 Dolphin sounds In which concerns dolphin sounds (Evans, 1973; Podos et al., 2002), there are different types with different spectral characteristics. Between these different sound types we can refer whistles, clicks, bursts, pops and mews, between others. Dolphin whistles, also called signature sounds, appear to be an identification sound since they are unique for each dolphin. The frequency range of these sounds is mainly contained in the interval between 200 Hz and 20 kHz (Reynolds et al., 1999). Clicks sounds are though to be used exclusively for echolocation (Evans, 1973). These sounds contains mainly high frequency spectral components and they require data acquisition systems with high analog to digital conversion rates. The frequency range for echolocation clicks includes the interval between 200 Hz and 150 kHz (Reynolds et al., 1999). Usually, low frequency clicks are used for long distance targets and high frequency clicks are used for short distance targets. When dolphins are closer to an object, they increase the frequency used for echolocation to obtain a more detailed information about the object characteristics, like shape, speed, moving direction, and object density, between others. For long distance objects low frequency acoustic signals are used because their attenuation is lower than the attenuation that is obtained with high frequency acoustic signals. By its turn, burst pulse sounds that include, mainly, pops, mews, chirps and barks, seem to be used when dolphins are angry or upset. These signals are frequency modulated and their frequency range includes the interval between 15 kHz and 150 kHz. 3.2 Time to frequency conversion methods As previously referred, in order to compare the performance of different TFCM, that can be used to identify and classify dolphin sounds a dolphin whistle sound will be considered as reference. In which concerns signals’ amplitudes, it makes only sense, for classification
162
Advances in Sound Localization
purposes, to used normalized amplitudes. Sound signals’ amplitudes depend on many factors, namely on the distance between sound sources and the measurement system, being this distance variable for moving objects, for example dolphins and ships. A data acquisition sample rate equal to 44.1 kS/s was used to digitize sound signals and the acquisition period was equal to 1 s. Figure 1 represents the time variation of the whistle sound signal under analysis.
Fig. 1. Time variation of the dolphin whistle sound signal under analysis Fourier time to frequency conversion method
The first TFCM that will be considered is the Fourier transform method (Körner, 1996). The complex version of this time to frequency operator is defined by, X (f ) =
+∞
∫ x ( t ) ⋅e
− j2 π ⋅ f ⋅ t
dt
(14)
−∞
where x(t) and X(f) represent the signal and its Fourier transform, respectively. The results that are obtained with this FTCM don’t give any information about the frequency contents of the signal over time. However, some information about the signal bandwidth and its spectral energy distribution can be accessed. Figure 2 represents the power spectral density (PSD) of the sound signal represented in figure 1. As it is clearly shown, the PSD of the signal exhibits two peaks, one around 2.8 kHz and the other, with higher amplitude, is a spectral component whose frequency is approximately equal to 50 Hz. This spectral component is caused by the mains power supply and can be strongly attenuated, almost removed, by hardware or digital filtering. It is important to underline that this FTCM is not suitable for non-stationary signals, like the ones generated by dolphins. Short time Fourier transform method
Short time Fourier transform (STFT) is a TFCM that can be used to access the variation of the spectral components of a non-stationary signal over time. This TFCM is defined by,
Underwater Acoustic Source Localization and Sounds Classification in Distributed Measurement Networks
163
PSD peak (50 Hz)
PSD peak 2.8 kHz
Fig. 2. Power spectral density of the dolphin whistle sound signal X(t, f) =
+∞
∫ x(t) ⋅ w(t −τ) ⋅ e
− j2π⋅f ⋅t
dt t ∈ ℜ
(15)
−∞
where x(t) and X(t,f) presents the signal and its STFT, respectively, and w(t) represents the time window function that is used the evaluate the STFT. With this TFCM it is possible to obtain the variation of the frequency contents of the signal over time. Figure 3 represents the spectrogram of the whistle sound signal when the STFT method is used. The spectogram considers a window length of 1024 samples, an overlap length of 128 samples and a number of points that are used for FFT evaluation, in each time window, equal to 1024. However, the STFT of a given signal depends significantly on the parameters that are used for its evaluation. Confirming this statement, figure 4 represents the spectrogram of the whistle signal obtained with a different window length, in this case equal to 64 samples, an overlap length equal to 16 samples and a number of points used for FFT evaluation, in each time interval, equal to 64. In this case, it is clearly shown that different time and frequency resolutions are obtained. The STFT parameters previously referred, namely, time window length, number of overlapping points, and the number of points used for FFT evaluation in each time window, together with the time window function, affect the time and frequency resolution that are obtained. Essentially, if a large time window is used, spectral resolution is improved but time resolution gets worst. This is the main drawback of the STFT method, there is a compromise between time and frequency resolutions. It is possible to demonstrate (Allen & Rabiner, 1997; Flandrin, 1984) that the constraint between time and frequency resolutions is given by, Δf ≥
1 4π ⋅ Δt
where Δf and Δt represent the frequency and time resolutions, respectively.
(16)
164
Advances in Sound Localization
Fig. 3. Spectogram of the whistle sound signal (window length equal to1024 samples, overlap length equal to 128 samples and a number of points used for FFT evaluation equal to 1024)
Fig. 4. Spectogram of the whistle sound signal (window length equal to 64 samples, overlap length equal to 16 samples and a number of points used for FFT evaluation equal to 64)
Underwater Acoustic Source Localization and Sounds Classification in Distributed Measurement Networks
165
Time to frequency conversion methods based on time-frequency distributions
When the signal exhibits slow variations in time, and there is no hard requirements of time and frequency resolutions, the STFT, previously described, gives acceptable results. Otherwise, time-frequency distributions can be used to obtain a better spectral power characterization of the signal over time (Claasen & Mecklenbrauker, 1980; Choi & Williams, 1989). A well know case of these methods is the Choi-Williams time to frequency transform that is defined by,
X(t, f) =
+∞
∫
−∞
e
+∞ − j2π2π
∫
σ/4π ⋅ τ 2 ⋅ e
−
σ(μ − t)2 4τ 2
⋅ x(μ + τ/2) ⋅ x * (μ − τ/2) ⋅ dμ ⋅ dτ
(17)
−∞
where x(μ+τ/2) represents the signal amplitude for a generic time t equal to μ+τ/2 and the exponential term is the distribution kernel function that depends on the value of σ coefficient. The Wigner-Ville distribution (WVD) time to frequency transform is a particular case of the Choi-Williams TFCM that is obtained when σ→∞, and its time to frequency transform operator is defined by, X(t, f) =
+∞
∫
e − j2π2π⋅ x(μ + τ/2) ⋅ x * (μ − τ/2)dτ
(18)
−∞
These TFCM could give better results in which concerns the evaluation of the main spectral components of non-stationary signals. They can minimize the spectral interference between adjacent frequency components as long as the distributions kernel function parameters’ are properly selected. These TFCM provide a joint function of time and frequency that describes the energy density of the signal simultaneously in time and frequency. However, ChoiWilliams and WVD TCFM based on time-frequency distributions depends on non-linear quadratic terms that introduce cross-terms in the time-frequency plane. It is even possible to obtain non-sense results, namely, negative values of the energy of the signal in some regions of the time-frequency plane. Figure 5 represents the spectrogram of the whistle sound signal calculated using the Choi-Williams distribution. The graphical representation considers a time window of 1 s, a unitary default Kernel coefficient (σ=1), a time smoothing window (Lg) equal to 17, a smoothing width (Lh) equal to 43 and a representation threshold equal to 5 %. Wavelets time to scale conversion method
Conversely to others TFCM that are based on Fourier transforms, in this case, the signal is decomposed is multiple components that are obtained by using different scales and time shifts of a base function, usually known as the mother wavelet function. The time to scale wavelet operator is defined by, X(τ(α) =
+∞
∫
~l
x(t) ⋅ α 0.5 ⋅ ψ (α(t − τ)) ⋅dt
(19)
−∞
where ψ is the mother wavelet function, α and τ are the wavelet scaling and time shift coefficients, respectively.
166
Advances in Sound Localization
Fig. 5. Spectrogram of the whistle sound signal using the Choi-Williams distribution (time window=1 s, a unitary default Kernel coefficient, time smoothing window=17, a smoothing width=43) It is important to underline that the frequency contents of the signal is not directly obtained from its wavelet transform (WT). However, as the scale of the mother wavelet gets lower, a lower number of signal’s samples are contained in each scaled mother wavelet, and there the WT gives an increased knowledge of the high frequency components of the signal. In this case, there is no compromise between time and frequency resolutions. Moreover, wavelets are particularly interesting to detect signals’ trends, breakdowns and sharp peaks variations, and also to perform signals’ compressing and de-noising with minimal distortion. Figure 6 represents the scalogram of the whistle sound signal when a Morlet mother wavelet with a bandwidth parameter equal to 10 is used (Cristi, 2004; Donoho & Johnstone, 1994). The contour plot uses time and frequency linear scales and a logarithmic scale, with a dynamic range equal to 60 dB, to represent scalogram values. The scalogram was evaluated with 132 scales values, 90 scales between 1 and 45.5 with 0,5 units’ increments, an 42 scales between 46 and 128 with 2 units’ increments. The scalogram shows clearly that the main frequency components of the whistle sound signal are centered on the amplitude peaks of the signal, confirming the results previously obtained with the Fourier based TFCM. 3.3 Anthropogenic sound signals In which concerns underwater sound analysis it is important to analyze anthropogenic sound signals because they can disturb deeply the sounds generated by dolphins’ sounds. Anthropogenic noises are ubiquitous, they exist everywhere there is human activities. The powerful anthropogenic power sources come from sonars, ships and seismic survey pulses. Particularly in estuarine zones, noises from ships, ferries, winches and motorbikes, interfere with marine life in many ways (Holt et al., 2009).
Underwater Acoustic Source Localization and Sounds Classification in Distributed Measurement Networks
167
Fig. 6. Scalogram of the whistle sound signal when a Morlet mother wavelet with a bandwidth parameter equal to 10 is used Since the communication between dolphins is based on underwater sounds, anthropogenic noises can originate an increase of dolphin sounds’ amplitudes, durations and repetition rates. These negative effects happen, particularly, whenever anthropogenic noises frequencies overlap the frequency bandwidth of the acoustic signals used by dolphins. It is generally accepted that anthropogenic noises can affect dolphins’ survival, reproduction and also divert them from their original habitat (NRC, 2003; Oherveger & Goller, 2001). Assuming equal amplitudes of dolphin and anthropogenic sounds, it is important to know their spectral components. Two examples of the time variations and scalograms of anthropogenic sounds signals will be presented. Figures 7 and 8 represent the time variations and the scalograms of a ship-harbor and submarine sonar sound signals, respectively. As it is clearly shown, both signals contain spectral components that overlap the frequency bandwidth of dolphin sound signals, thus, affecting dolphins’ communication and sound signals’ analysis.
Fig. 7. Ship-harbour signal: (a) time variation and (b) scalogram
Fig. 8. Submarine sonar signal: (a) time variation and (b) scalogram
4. Measurement system

The measurement system includes several measurement units that can, in turn, be integrated in a distributed measurement network with wireless communication capabilities (Postolache et al., 2006). Each measurement unit, described in the present section, includes the acoustic devices that establish the interface between the electrical devices and the underwater medium, a water quality measurement unit that is used for environmental assessment purposes, and the signal conditioning, data acquisition and data processing units.

4.1 Hardware

Figure 9 represents the intelligent distributed virtual measurement system that was implemented for underwater sound monitoring and sound source localization. The system includes two main units: a base unit, where the acoustic signals are detected and digitized, and a remote unit that generates testing underwater acoustic signals used to validate the implemented algorithms for time delay measurement (Carter, 1981; Chan & Ho, 1994), acoustic signal classification and underwater acoustic source localization. A set of three hydrophones (Sensor Technology, model SS03) is mounted on a 20 m structure with 6 buoys that assure a linear distribution of the hydrophones. The number and the linear distribution of the hydrophones make it possible to implement a hyperbolic algorithm (Mattos & Grant, 2004; Glegg et al., 2001) for underwater acoustic source localization, and also to perform underwater sound monitoring tasks, including sound detection and classification. The main characteristics of the hydrophones include a frequency range between 200 Hz and 20 kHz, a sensitivity of -169 dB relative to 1 V/μPa and a maximum operating depth of 100 m. The azimuth angle (ϕ) obtained from the hydrophone array structure, together with the information obtained from the GPS1 device (Garmin GPSMAP 76), installed on the base unit, and the information obtained from the fluxgate compass (SIMRAD RFC35NS), is used to calculate the absolute position of the remote underwater acoustic source. After the estimation of the underwater acoustic source localization, a comparison with the values
given by the GPS2 is carried out to validate the performance of the algorithms that are used for sound source localization.
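As a rough illustration of this time-delay-based processing, the sketch below estimates a far-field source bearing from the cross-correlation of two hydrophone channels. It is a simplified two-hydrophone version, not the three-hydrophone hyperbolic algorithm (Chan & Ho, 1994) used by the system; the function name, the array spacing and the nominal sound speed are assumptions for the example (in the real system the sound speed is corrected with the WQMU data).

```python
import numpy as np

def tdoa_bearing(sig_a, sig_b, fs, spacing, c=1500.0):
    """Estimate a far-field bearing from the time difference of arrival
    (TDOA) between two hydrophones of a linear array.

    sig_a, sig_b : signals from the two hydrophones (numpy arrays)
    fs           : sampling rate in Hz
    spacing      : distance between the hydrophones in metres
    c            : assumed underwater sound speed in m/s
    """
    # cross-correlate the two channels; the peak lag is the TDOA in samples
    corr = np.correlate(sig_a, sig_b, mode='full')
    lag = np.argmax(np.abs(corr)) - (len(sig_b) - 1)
    tdoa = lag / fs
    # far-field geometry for a linear array: tdoa = spacing * sin(theta) / c
    sin_theta = np.clip(c * tdoa / spacing, -1.0, 1.0)
    return np.degrees(np.arcsin(sin_theta))
```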
Fig. 9. The architecture of the distributed virtual system for underwater acoustic signal acquisition, underwater sound source localization and sound analysis (H1, H2, H3 - hydrophones; H-CC-3 - channels' conditioning circuits; WQMU - water quality measurement unit; NI DAQCard-6024 - multifunction DAQCard; GPS1 and GPS2 - remote and base GPS units)

As presented in figure 9, a three-channel hydrophones' conditioning circuit (H-CC) provides the analog voltage signals associated with the captured sounds. These signals are acquired using three analog input channels (ACH0, ACH1 and ACH2) of the DAQCard at a data acquisition rate equal to 44.1 kS/s. The azimuth angle information, expressed by the V·sinϕ and V·cosϕ voltages delivered by the electronic compass, is acquired using the ACH3 and ACH4 channels of the DAQCard. The water quality parameters, temperature and salinity, are acquired using a multiparameter Quanta Hydrolab unit (Eco Environmental Inc.) that is controlled by the laptop PC through an RS232 connection. During the system's testing phase, acoustic signal generation is triggered through a Wi-Fi communication link that exists between the PC and the PDA, or by a start-up table that is stored in the PDA and in the PC. Thus, at pre-defined time instants, a specific sound signal is generated by the sound projector (Lubell LL9816) and acquired by the hydrophones. The acquisition time delays are then evaluated and localization algorithms, based on the time difference of arrival (TDOA), are used to locate sound sources. The main characteristics of the sound projector include a frequency range (±3 dB) between 200 Hz and 20 kHz, a maximum SPL of 180 dB/μPa/m at a frequency equal to 1 kHz and a maximum cable voltage-to-current ratio equal to 20 Vrms/3 A. Temperature and salinity measurements, obtained from the WQMU (Postolache et al., 2002; Postolache et al., 2006; Postolache et al., 2010), are used to compensate sound source localization errors caused by underwater sound velocity variations (13).

4.2 Software

The system's software includes two main parts: one related to dolphin sound classification and the other related to the GIS (Postolache et al., 2007). Both software parts are integrated in a common application that simultaneously identifies sound sources and locates them in the geographical area under assessment. In this way, it is possible to
locate and pursue the trajectory of moving sound sources, particularly dolphins in a river estuary.

4.2.1 Dolphin sounds classification based on wavelet packets

This software part basically performs the following tasks: hydrophone channel voltage acquisition and processing, fluxgate compass voltage data acquisition and processing, noise filtering using wavelet threshold denoising (Mallat, 1999; Guo et al., 2000), digital filtering, and detection and classification of sound signals. Additional software routines were developed to perform data logging of the acquired signals, to implement the GIS and to perform geographic coordinates' analysis based on historical data. The laptop PC software was developed in LabVIEW (National Instruments) and includes some embedded MATLAB scripts. The generation of the acoustic signals, at the remote unit, is controlled by the distributed LabVIEW software (laptop PC software and PDA software). The laptop software component triggers the sound generation by sending a command to the PDA using the TCP/IP client-server communication protocol. The sound type (e.g. dolphin whistle) and its time duration are defined using a specific command code. Concerning the underwater acoustic analysis, the hydrophones' data is processed in order to extract information about the type of underwater sound source by using a wavelet packet (WP) signal decomposition and a set of neural network processing blocks. Feature extraction of sound signals is performed using the root mean square (RMS) values of the coefficients that are obtained after WP decomposition (Chang & Wang, 2003); a minimal sketch of this step is given below. Based on the WP decomposition it is possible to obtain a reduced set of feature parameters that characterize the main types of underwater sounds detected in the monitored area. It is important to underline that, conversely to the traditional wavelet decomposition method, where only the approximation portion of the original signal is split into successive approximations and details, the proposed WP decomposition method extends the capabilities of the traditional method by decomposing the detail part as well. The complete decomposition tree for a three-level WP decomposition is represented in Fig. 10.
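A minimal sketch of this feature extraction step, using the PyWavelets wavelet packet implementation, is given here; the function name and the final normalization comment are illustrative assumptions.

```python
import numpy as np
import pywt

def wp_rms_features(signal, wavelet='db1', level=3):
    """RMS values of the wavelet packet coefficients at the terminal nodes.

    For a three-level decomposition this yields 2**3 = 8 features
    (nodes AAA ... DDD of the tree in Fig. 10).
    """
    wp = pywt.WaveletPacket(data=signal, wavelet=wavelet, maxlevel=level)
    nodes = wp.get_level(level, order='natural')
    return np.array([np.sqrt(np.mean(node.data ** 2)) for node in nodes])

# the features are typically normalized to their maximum amplitude before
# being fed to the neural network classifier described in section 5.2
```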
Fig. 10. Decomposition tree for a three-level WP decomposition (D - details, associated with the high-pass decimation filter; A - approximations, associated with the low-pass decimation filter)

4.2.2 Geographic Information System: its application to locating sound sources

This software part implements the GIS and provides a flexible solution to locate and pursue moving sound sources. The main components included in this software part are the
hyperbolic bearing angle and range algorithms, both related to the estimation of sound source localizations. In order to transform the relative position coordinates determined by the system of hydrophones (Hurell & Duck, 2000) into absolute position coordinates, it is necessary to transform the GPS data, obtained from the Garmin GPSMAP 76, into a cartographic representation system. The mapping scale used to represent the geographical data is equal to 1/25000. This scale value was selected taking into account the accuracy of the GPS device that was used for testing purposes. The conversion from relative to absolute coordinates is performed in three steps: a Molodensky (Clynch, 2006) three-dimensional transformation, a Gauss-Krüger (Grafarend & Ardalan, 1993) projection and, finally, the absolute positioning calculation. In the last step, a polar to Cartesian coordinates' conversion is performed, considering the water surface as a reference plane (XoY) and defining the direction of the X-axis by using the data provided by the electronic compass. Figure 11 is a pictorial representation of the geometrical parameters that are used to locate underwater acoustic sources; a sketch of the final conversion step follows.
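The sketch below illustrates only this last, polar-to-Cartesian step; it assumes the Molodensky and Gauss-Krüger steps have already produced projected (easting/northing) coordinates for the base unit, and all names are hypothetical.

```python
import numpy as np

def absolute_position(base_easting, base_northing, heading_deg, bearing_deg, range_m):
    """Convert a source bearing and range, measured relative to the
    hydrophone array, into absolute map coordinates.

    base_easting, base_northing : projected coordinates of the base unit (m)
    heading_deg                 : compass heading defining the array's X-axis
    bearing_deg                 : source azimuth relative to the array
    range_m                     : estimated source range (m)
    """
    azimuth = np.radians(heading_deg + bearing_deg)   # absolute azimuth from north
    east = base_easting + range_m * np.sin(azimuth)
    north = base_northing + range_m * np.cos(azimuth)
    return east, north
```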
Fig. 11. Geometrical parameters that are used to locate underwater acoustic sources

The main software tasks performed by the measurement system are represented in figure 12. Finally, it is also important to note that the underwater sound source localization is calculated and displayed on the user interface together with some water quality parameters, namely temperature, turbidity and conductivity, that are provided by the WQMU. Future software developments can also provide improved localization accuracy by profiling the coverage area into regions, where multiple measurement results of reference sound sources are stored in a localization database. The best match between the localization measurement data and the data that is stored in the localization database is determined, and then interpolation can be used to improve sound source localization accuracy.
Fig. 12. Measurement system software block diagram
5. Experimental results

To evaluate the performance of the proposed measurement system, two sets of experimental results will be presented. The first is related to the capabilities provided by the sound source localization algorithms, and the second to the capabilities of wavelet-based techniques to detect and classify dolphin sounds.

5.1 Sound source localization

Several laboratory experiments were done to test the different measuring units, including the WQMU. Field tests, similar to the ones previously performed in the laboratory, were performed in the Sado estuary. During the field tests of the measurement system, dolphins were sighted but none produced a clear sound signal that could be acquired or traced to the source. In order to fill this gap, a number of experiments took place involving pre-recorded dolphin sounds. For sound reproduction an underwater sound projector was used, which allowed the testing of the sound source localization algorithms. The sound projector, installed in a second boat (remote boat), away from the base ship-boat, was moved away from the hydrophones' structure and several pre-recorded sounds were played. This methodology was used to test the performance of the hydrophones' array structure and the TDOA algorithm in measuring the localization of the sound source for different distances and azimuth angles. Using the time delay values between the sounds captured by the hydrophones and the data obtained from the electronic compass (zero heading), sounds can be traced to their sources. Table 1 presents the localization errors that are expected when estimating the localization of the sound source, as a function of the angle resolution that can be defined by the electronic compass and of the distance between the hydrophones and the position of the sound source. The data contained in the table assume that the distance from the hydrophones to the sound source is always lower than 500 m. As can be verified, in order to obtain the desired precision, characteristic of a 1/25000 scale representation, the resolution in the zero heading acquisition angle, for distances lower than 500 m, cannot be greater than ±½ degree. Experimental results obtained using the GPS1 and GPS2 units gave an absolute error lower than 10 m. This value is in accordance
with the SimRad RFC35NS electronic compass characteristics whose datasheet specifies an accuracy better than 1º and repeatability equal to ±0.5º.
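The tabulated errors are consistent with the cross-range displacement d·sin(θ) produced by a bearing error θ at distance d; the short sketch below reproduces the values in Table 1 under that assumption.

```python
import numpy as np

# cross-range localization error (m) for a bearing error theta at range d
angles_deg = np.array([0.25, 0.5, 1, 2.5, 5, 10])
distances_m = np.array([10, 50, 100, 150, 300, 500])

err = distances_m[np.newaxis, :] * np.sin(np.radians(angles_deg))[:, np.newaxis]
# e.g. err[-1, -1] -> 86.8 m for a 10 degree error at 500 m, matching Table 1
```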
                             Distance (m)
Angle (º)      10       50      100      150      300      500
  0.25        0.04     0.22     0.44     0.65     1.31     2.18
  0.5         0.09     0.44     0.87     1.31     2.62     4.36
  1           0.17     0.87     1.75     2.62     5.24     8.73
  2.5         0.44     2.18     4.36     6.54    13.09    21.81
  5           0.87     4.36     8.72    13.07    26.15    43.58
  10          1.74     8.68    17.36    26.05    52.09    86.82
Table 1. Localization errors (in metres) as a function of the zero heading angle accuracy and of the distance between the hydrophones and the sound source

5.2 Wavelet based classification of dolphin sounds

The WP-based method that was used to identify and classify sound signals offers great flexibility in choosing the best combination of the WP features that are used for detection and classification purposes. During the design of the feature extraction algorithm, different levels of decomposition, varying between two and five, and different sound periods, varying between 30 ms and 1000 ms, were used. Regarding the wavelet packet decomposition used for underwater sound feature extraction, a practical approach concerning the choice of the best level of decomposition and mother wavelet function was carried out. Thus, the capabilities of Daubechies, Symlets and Coiflets functions as mother wavelets were tested. For the studied cases, the RMS values of the WP coefficients for different bands of interest were evaluated. As an example, figure 13 represents the features' values obtained with a three-level decomposition tree and a db1 mother wavelet when different sound types are considered, namely a dolphin chirp, a dolphin whistle, and two anthropogenic noises, in this case a water motorbike and a ping sonar sound.
Fig. 13. Wavelet-based feature extraction using a three-level decomposition tree (green line - dolphin chirp, red line - dolphin whistle, blue line - motorbike, black line - ping sonar).
As can be easily verified, the feature values are significantly different across sound types for a third-level WP decomposition. Figure 14 represents the WP coefficients for a dolphin whistle when a six-level WP decomposition tree is used. In this case, the number of terminal nodes is higher, 64 instead of 8, but the features' variation profile over the wavelet packet terminal nodes has a similar pattern and the sound classification performance is better. As expected, there is always a compromise between the data processing load and the sound classification performance.
Fig. 14. Wavelet-based feature extraction of a dolphin whistle using a six-level decomposition tree

Neural Network Classifier
The calculated features for different types of sound signals, real or artificially generated by the sound projector located in the remote unit, are used to train a neural network sound classifier (NN-SC) characterized by a multilayer perceptron architecture with 8 neurons in the input layer, a set of 10 to 20 neurons in the hidden layer and one or more neurons (nout) in the output layer (Haykin, 1994). Each input neuron collects one of the 8 features obtained from the WP decomposition. During the training phase of the NN-SC, the target vector elements are defined according to the different sound types used for training purposes. For a given sound type, for example dolphin whistles, the target vector values are within a pre-defined interval. It is important to underline that all values used in the NN-SC are normalized to their maximum amplitude in order to improve the sound identification performance of the neural network. The number of output neurons (nout) of the NN-SC depends on the number of different signal types to identify. Thus, in the simplified case of the detection of dolphin sounds, independently of their types, the NN-SC uses a single neuron in the output layer with two separated feature range intervals that correspond to "dolphin sound" and "no dolphin sound" detected, respectively. In order to identify different sound sources and types, more than two feature ranges are required. This is the case when it is required to classify different sound types, like dolphin bursts, whistles and clicks, or other anthropogenic
sounds, like water motorbikes, ship sounds or other underwater noise sounds. Figure 15 represents the NN-SC feature range amplitudes that are used for sound classification.
Fig. 15. NN-SC feature amplitudes that are used for sound classification

To test the performance of the sound classification algorithm, the following feature amplitudes were considered: between 0.1 and 0.3 for anthropogenic noise sounds, between 0.5 and 0.7 for dolphin whistles and between 0.9 and 1.1 for dolphin chirps. When feature amplitudes are outside these intervals there is no sound identification. This happens if the NN-SC gives an erroneous output or if the training set is limited with respect to the number of different sound types to be identified. Figure 16 represents the NN-SC normalized output values that were obtained for a third-level wavelet packet decomposition, training and validation sets with 16 elements (sound signals) each, a root-mean-square training goal equal to 10^-5, a number of hidden layer neurons equal to eight, and the Levenberg-Marquardt minimization algorithm used to evaluate the weights and biases of each ANN neuron.
Fig. 16. NN-SC normalized output values that were obtained using a third-level wavelet packet decomposition
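A hedged sketch of such a classifier, using scikit-learn, is shown below. scikit-learn does not provide the Levenberg-Marquardt algorithm used in the chapter, so a quasi-Newton solver stands in for it, and the training data shown are random placeholders rather than the recorded feature sets.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# X: one row of 8 normalized WP-RMS features per training sound;
# y: target values at the centres of the class intervals described above
X = np.random.rand(16, 8)                   # placeholder 16-element training set
y = np.repeat([0.2, 0.6, 1.0], [6, 5, 5])   # noise, whistle, chirp targets (placeholders)

net = MLPRegressor(hidden_layer_sizes=(8,), solver='lbfgs', tol=1e-5, max_iter=5000)
net.fit(X, y)

def classify(output):
    """Map the network output onto the feature-amplitude intervals of Fig. 15."""
    if 0.1 <= output <= 0.3: return 'anthropogenic noise'
    if 0.5 <= output <= 0.7: return 'dolphin whistle'
    if 0.9 <= output <= 1.1: return 'dolphin chirp'
    return 'no identification'
```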
The results that were obtained present a maximum relative error almost equal to 3.23 % and a standard deviation of the error values almost equal to 1.14 %. Since, in this example, all dolphin sound features are in the range between 0.9 and 1.1, we conclude that there is no classification error. Different tests were performed with other sound types and a very good performance was verified, with more than 95 % correct classifications, as long as there is no overlap between the NN-SC feature identification ranges.
6. Conclusions

This chapter includes a review of sound propagation principles, the presentation of different TFCMs that can be used to represent the frequency contents of non-stationary signals and, finally, the presentation of a measurement system to acquire and process measurement data. Concerning sound propagation principles, particular attention was dedicated to the characterization of plane and spherical sound propagation modes and to the definition of power-related acoustic parameters. Some details about underwater sound propagation were presented, particularly the ones that affect sound propagation speed. Variations of sound propagation speed in estuarine waters, where salinity can exhibit large variations, must be accounted for to minimize measurement errors of sound source localizations. Regarding the TFCMs, particular attention was dedicated to short-time frequency transforms and to the wavelet characterization of underwater sounds. Several examples of the application of these methods to characterize dolphin sounds and anthropogenic noises were presented. Several field tests of the measurement system were performed to evaluate its performance for sound signal detection and classification, and to test its capability to locate underwater sound sources. To validate the triangulation algorithms, an underwater sound projector and an array of hydrophones were used to obtain a large set of measurement data. Using the GPS coordinates of the sound projector, located in a remote boat, and the GPS coordinates of the base ship-boat, where the hydrophones' array and the data acquisition units are located, validation of the relative and absolute localization of sound sources was performed for distances lower than 500 m and for a frequency range between 200 Hz and 20 kHz. The measurement system also includes data logging and GIS capabilities. The first capability is important to evaluate changes over time in dolphins' habitats, and the second to locate and pursue the trajectory of moving sound sources, particularly dolphins in a river estuary. Concerning the detection and classification of underwater acoustic sounds, a wavelet packet technique, based on a third-level decomposition and on RMS feature extraction from the terminal nodes' coefficients, followed by an ANN classification method, is proposed. The classification results that were obtained present a maximum relative error almost equal to 3.23 % and a standard deviation of the error values almost equal to 1.14 %. Further tests are required to evaluate the sound detection and classification algorithms when different sound sources interfere mutually, particularly when dolphin sounds are mixed with anthropogenic noises.
7. References

Allen, J. & Rabiner, L. (1977). "A Unified Approach to Short-Time Fourier Analysis and Synthesis", Proc. IEEE, Vol. 65, No. 11, pp. 1558-1564, 1977
Au, W.; Popper, A.N. & Fay, R.F. (2000). "Hearing by Whales and Dolphins", Springer-Verlag, New York
Burdic, W.S. (1991). "Underwater Acoustic System Analysis", 2nd edition, Prentice Hall / Peninsula Publishing, California, U.S.A., 1991
Carter (1981). "Time Delay Estimation for Passive Signal Processing", IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. ASSP-29, No. 3, pp. 463-470, 1981
Chan, Y. & Ho, K. (1994). "A Simple and Efficient Estimator for Hyperbolic Location", IEEE Transactions on Signal Processing, Vol. 42, No. 8, pp. 1905-1915, Aug. 1994
Chang, S.H. & Wang, F. (2003). "Underwater Sound Detection based on Hilbert Transform Pair of Wavelet Bases", Proceedings OCEANS'2003, pp. 1680-1684, San Diego, USA, 2003
Choi, H. & Williams, W.J. (1989). "Improved Time-Frequency Representation of Multicomponent Signals Using Exponential Kernels", IEEE Trans. ASSP, Vol. 37, No. 6, pp. 862-871, June 1989
Claasen, T. & Mecklenbrauker, W. (1980). "The Wigner Distribution - A Tool for Time-Frequency Signal Analysis" (3 parts), Philips J. Res., Vol. 35, No. 3, 4/5, 6, pp. 217-250, 276-300, 372-389, 1980
Clynch, J.R. (2006). "Datums - Map Coordinate Reference Frames, Part 2 - Datum Transformations", Feb. 2006 (available at http://www.gmat.unsw.edu.au/snap/gps/clynch_pdfs/Datum_ii.pdf)
Cristi, R. (2004). "Modern Digital Signal Processing", Thompson Learning Inc., Brooks/Cole, 2004
Donoho, D.L. & Johnstone, I.M. (1994). "Ideal Spatial Adaptation by Wavelet Shrinkage", Biometrika, Vol. 81, pp. 425-455, 1994
Eco Environmental. "Hydrolab Quanta" (available at http://www.ecoenvironmental.com.au/eco/water/hydrolab_quantag.htm)
Evans, W.E. (1973). "Echolocation by marine delphinids and one species of fresh-water dolphin", J. Acoust. Soc. Am., 54, pp. 191-199, 1973
Flandrin, P. (1984). "Some Features of Time-Frequency Representations of Multi-Component Signals", IEEE Int. Conf. on Acoust., Speech and Signal Proc., pp. 41.B.4.1-41.B.4.4, San Diego (CA), 1984
Glegg, S.; Olivieri, M.; Coulson, R. & Smith, S. (2001). "A Passive Sonar System Based on an Autonomous Underwater Vehicle", IEEE Journal of Oceanic Engineering, Vol. 26, No. 4, pp. 700-710, October 2001
Grafarend, E. & Ardalan, A. (1993). "World Geodetic Datum 2000", Journal of Geodesy, 73, pp. 611-623
Guo, D.; Zhu, W.; Gao, Z. & Zhang, J. (2000). "A Study of Wavelet Thresholding Denoising", Proceedings of IEEE ICSP2000, pp. 329-332, 2000
Haykin, S. (1994). "Neural Networks", Prentice Hall, New Jersey, USA
Hodges, R.P. (2010). "Underwater Acoustics: Analysis, Design and Performance of SONAR", John Wiley & Sons, Ltd, 2010
Holt, M.M.; Noren, D.P.; Veirs, V.; Emmons, C.K. & Veirs, S. (2009). "Speaking Up: Killer Whales Increase their Call Amplitude in Response to Vessel Noise", Journal of the Acoustical Society of America, 125(1), 2009
Hurell, A. & Duck, F. (2000). "A two-dimensional hydrophone array using piezoelectric PVDF", IEEE Transactions on Ultrasonics, Ferroelectrics and Frequency Control, Vol. 47, Issue 6, pp. 1345-1353, Nov. 2000
Körner, T.W. (1996). "Fourier Analysis", Cambridge University Press, Cambridge, United Kingdom, 1996
Mackenzie, K.V. (1981). "Discussion of Sea Water Sound-Speed Determination", Journal of the Acoustical Society of America, (70), pp. 801-806, 1981
Mallat, S. (1999). "A Wavelet Tour of Signal Processing", Elsevier, 1999
Mattos, L. & Grant, E. (2004). "Passive Sonar Applications: Target Tracking and Navigation of an Autonomous Robot", Proceedings of the IEEE International Conference on Robotics and Automation, pp. 4265-4270, New Orleans, 2004
National Instruments (2005). "LabVIEW Advanced Signal Processing Toolbox", Nat. Instr. Press, 2005
NRC (National Research Council) (2003). "Ocean Noise and Marine Animals", National Academies Press, Washington DC, 2003
Oberweger, K. & Goller, F. (2001). "The Metabolic Cost of Birdsong Production", Journal of Experimental Biology, (204), pp. 3379-3388, 2001
Podos, J.; Silva, V.F. & Rossi-Santos, M. (2002). "Vocalizations of Amazon River Dolphins, Inia geoffrensis: Insights into the Evolutionary Origins of Delphinid Whistles", Ethology, Blackwell Verlag Berlin, 108, pp. 601-612, 2002
Postolache, O.; Pereira, J.M.D. & Girão, P.S. (2002). "An Intelligent Turbidity and Temperature Sensing Unit for Water Quality Assessment", IEEE Canadian Conference on Electrical & Computer Engineering, CCECE 2002, pp. 494-499, Manitoba, Canada, May 2002
Postolache, O.; Girão, P.S.; Pereira, M.D. & Figueiredo, M. (2006). "Distributed Virtual System for Dolphins' Sound Acquisition and Time-Frequency Analysis", IMEKO XVIII World Congress, Rio de Janeiro, Brazil, Sept. 2006
Postolache, O.; Girão, P.S. & Pereira, J.M.D. (2007). "Intelligent Distributed Virtual System for Underwater Acoustic Source Localization and Sounds Classification", Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications (IDAACS'2007), pp. 132-135, Dortmund, Germany, September 2007
Postolache, O.; Girão, P. & Pereira, M. (2010). "Smart Sensors and Intelligent Signal Processing in Water Quality Monitoring Context", 4th International Conference on Sensing Technology (ICST'2010), Lecce, Italy, June 2010
Reynolds, J.E.; Odell, D.H. & Rommel, A. (1999). "Biology of Marine Mammals", edited by John E. Reynolds III and Sentiel A. Rommel, Melbourne University Press, Australia, 1999
Wright, D. (2001). "Undersea with Geographical Information Systems", ESRI Press, USA, 2001
11

Using Virtual Acoustic Space to Investigate Sound Localisation

Laura Hausmann and Hermann Wagner
RWTH Aachen, Institute of Biology II, Germany
1. Introduction

It is an important task for the future to further close the gap between basic and applied science, in other words to make our understanding of the basic principles of auditory processing available for applications in medicine or information technology. Current examples are hearing aids (Dietz et al., 2009) or sound-localising robots (Calmes et al., 2007). This effort will be helped by better quantitative data resulting from more and more sophisticated experimental approaches. Despite new methodologies and techniques, the complex human auditory system is only accessible in a restricted way to many experimental approaches. This gap is closed by animal model systems that allow a more focused analysis of single aspects of auditory processing than human studies. The most commonly used animals in auditory research are birds (barn owls, chickens) and mammals (monkeys, cats, bats, ferrets, guinea pigs, rats and gerbils). When these animals are tested with various auditory stimuli in behavioural experiments, the accuracy (distance of a measured value to the true value) and precision (repeatability of a given measured value) of the animal's behavioural response allow conclusions to be drawn on the difficulty with which the animal can use the stimulus to locate sound sources. An example is the measurement of minimum audible angles (MAA) to reveal the resolution threshold of the auditory system for the horizontal displacement of a sound source (Bala et al., 2007). Similarly, one can exploit the head-turn amplitude of humans or animals in response to narrowband or broadband sounds as a measure of the relevance of specific frequency bands, as well as of binaural and monaural cues or perception thresholds (e.g. May & Huang, 1995; Poganiatz et al., 2001; Populin, 2006).

The barn owl (Tyto alba) is an auditory specialist, depending to a large extent on listening while localising potential prey. In the course of evolution, the barn owl has developed several morphological and neuronal adaptations, which may be regarded as more optimal solutions to problems than the structures and circuits found in generalists. The owl has a characteristic facial ruff, which amplifies sound and is directionally sensitive for frequencies above 4 kHz (Coles & Guppy, 1988). Additionally, the left and right ear openings and flaps are arranged asymmetrically, with the left ear lying slightly higher than the right one. This asymmetry creates a steep gradient of interaural level differences (ILDs) in the owl's frontal field (Campenhausen & Wagner, 2006). These adaptations to sound localisation are one of the reasons why barn owl hearing was established as an important model system during the last decades.
This chapter will focus on the application of a powerful technique for the investigation of sound processing, the virtual auditory space technique. Its basics, relevance and applications will be discussed for human listeners as well as for barn owls, supplemented by a comparison with other species.

Sound localisation is based on the extraction of physical cues from the sound reaching the eardrums. Such physical cues are the monaural spectral properties of the sound as well as differences between the sounds reaching the left and right ears, leading to binaural cues. These cues vary systematically with sound source position relative to an animal's head. A sound originates from a source and travels through the air until it reaches the eardrums of a listener. Several distortions (reflection, attenuation) are imposed on the sound along its path. Sound parameters may be measured at or close to the eardrum. The comparison of the measured sound at the eardrum with the sound emitted by the source allows for a determination of the distortion and is unique for each individual. The resulting transfer functions are called head-related transfer functions (HRTFs), referring to the major influence of head shape in the process of distortion. HRTFs carry information about the location of a sound source. Note that the term HRTF refers to the frequency domain, whereas one speaks of the head-related impulse response (HRIR) when the signal is represented in the time domain. Both signals may be transformed from one domain to the other by means of a Fourier transformation (Blauert, 1997).

In the monaural spectra, the large decreases in amplitude, termed notches, carry information about sound source direction due to their systematic directional variations. Animals and humans use this information during sound localisation, in particular when resolving front-back confusions (Gardner & Gardner, 1973; Hebrank & Wright, 1974). The comparison of the HRTFs measured at the two ears yields two major binaural parameters: the interaural time difference (ITD) and the interaural level difference (ILD). The ITD depends on the angle of incidence as well as on the distance between the two ears. This cue may be further divided into envelope and carrier ITDs. Envelope ITDs occur specifically at the onset and end of a sound and are then called onset ITDs, whereas ITDs derived from the carrier occur in the ongoing sound and are, therefore, called ongoing ITDs. The ITD is constant along the surface of a cone centered on the interaural axis. For sound sources on this surface, identical ITDs do not allow unambiguous localisation of narrowband stimuli, leading in particular to ambiguities with respect to front and back; this surface is therefore termed the "cone of confusion" (cf. Blauert, 1997). ILDs arise from the frequency- and position-dependent attenuation of sound by the pinna, the head and the body, which typically differs between the two ears.
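One common way to quantify these two binaural cues from a measured pair of head-related impulse responses is sketched below; it computes a single broadband value per cue, whereas the studies discussed later also consider frequency-band-specific variants. The function name is illustrative.

```python
import numpy as np

def itd_ild_from_hrirs(hrir_left, hrir_right, fs):
    """Broadband ITD and ILD estimates from left- and right-ear HRIRs."""
    # ITD: lag of the peak of the interaural cross-correlation, in seconds
    corr = np.correlate(hrir_left, hrir_right, mode='full')
    itd = (np.argmax(np.abs(corr)) - (len(hrir_right) - 1)) / fs

    # ILD: interaural energy ratio, in dB
    ild = 10 * np.log10(np.sum(hrir_left ** 2) / np.sum(hrir_right ** 2))
    return itd, ild
```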
2. Investigation of sound localisation - current approaches and problems

The simplest approach to find out more about the relevance of the sound parameters is to replay natural sounds from a loudspeaker and measure the subject's reaction to the sounds. These experiments are typically carried out in rooms having walls that strongly suppress sound reflections. If the distance between source and listener is large enough, we have a free-field situation, and the approach is called free-field stimulation. Free-field sounds have a major disadvantage: the physical cues to sound location cannot be varied independently, because a specific ITD resulting from a given spatial displacement of the sound source also involves a change in the ILD and the monaural spectra. This renders it difficult or even impossible to derive the contribution of single cues to sound localisation.
On the other hand, free-field sounds contain all relevant cues a subject may use in behaviour. Although free-field stimulation allows for an investigation of how relevant specific sound characteristics are, such as the frequency spectrum, the limits of this technique are obvious. Since this chapter focusses on the virtual space technique, we will not review the results from the numerous studies dealing with free-field stimulation.

One way to overcome the problems inherent in free-field stimulation is dichotic stimulation via headphones, which allows the independent manipulation of ITDs or ILDs in the stimulus. Dichotic stimulation was used to prove that humans use ITDs for azimuthal sound localisation for frequencies up to 1.5 kHz and ILDs for frequencies above 5 kHz (reviewed in Blauert, 1997). The upper frequency limit for ITD extraction seems to be determined by the ability of neurons to encode the phase of the signal's carrier frequency, which in turn is necessary to compare phase differences between both ears. The lower border for ILD extraction, likewise, seems to be related to the observation that the head of an animal only creates sufficiently large ILDs above a certain frequency. These conclusions are supported by data from animals such as the cat, the ferret, the monkey and the barn owl (Köppl, 1997; Koka et al., 2008; Moiseff & Konishi, 1981; Parsons et al., 2009; Spezio et al., 2000; Tollin & Koka, 2009). The use of both ITDs and ILDs in azimuthal sound localisation is known as the duplex theory (Blauert, 1997; Macpherson & Middlebrooks, 2002; Rayleigh, 1907).

In the barn owl, the filtering properties of the facial ruff together with the asymmetrical arrangement of the ear openings and the preaural flaps in the vertical plane cause ILDs to vary along an axis inclined to the horizontal plane. This allows the barn owl to use ILDs for elevational sound localisation (Moiseff, 1989; Campenhausen & Wagner, 2006; Keller et al., 1998). In contrast, mammals use ILDs for high-frequency horizontal localisation (reviewed in Blauert, 1997). The ability of the owl's auditory neurons to lock to the signal's phase within almost the whole hearing range (Köppl, 1997) - again in contrast to most mammals - together with the use of ILDs for elevational localisation is one of the reasons that make the barn owl interesting for auditory research, despite the mentioned differences to mammals.

With earphone stimulation, binaural cues can be manipulated independently. For example, the systematic variation of either ITDs or ILDs while keeping the other cue constant is nowadays a commonly used technique to characterise neuronal tuning to sound or to investigate the impact of the cues on sound localisation ability (reviewed in Butts & Goldman, 2006). Another example is the specific variation of ongoing ITDs, but not onset ITDs (Moiseff & Konishi, 1981; von Kriegstein et al., 2008), or a systematic variation of the degree of interaural correlation in binaurally presented noises (Egnor, 2001). Although dichotic stimulation helped to make progress in our understanding of sound localisation, one disadvantage of this method is that human listeners perceive the sources as lying inside the head (Hartmann & Wittenberg, 1996; Wightman & Kistler, 1989b) rather than in outside space. Consequently, when only ITDs or ILDs are introduced, but no spectral cues, the sound is "lateralised" towards a direction corresponding to the amplitude of the ITD or ILD, respectively.
For human listeners, this may yield a horizontal displacement of the sound image sufficient for many applications. However, both vertical localisation and distance estimation are severely hampered, if possible at all. In contrast, a free-field sound source or an appropriately simulated sound is really "localised". This means that dichotic stimuli do not contain all physical cues of free-field sounds. A method to overcome the problems of dichotic stimulation as described so far, while
preserving its advantages, is the creation of a virtual auditory space (VAS), the method and implementation of which are the topic of this chapter. The work of Wightman and Kistler (1989a,b) and others showed that free-field sources could be simulated adequately by filtering a sound with the personal head-related transfer functions (HRTFs). Bronkhorst (1995) reported that performance degraded when subjects were stimulated with very high-frequency virtual sounds. This observation reflects the large difficulties in generating veridical virtual stimuli at high frequencies.
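In its simplest form, this filtering amounts to a convolution of the source signal with the left- and right-ear head-related impulse responses for the desired direction, as in the sketch below (function and variable names are illustrative; both HRIRs are assumed to have the same length):

```python
import numpy as np
from scipy.signal import fftconvolve

def render_virtual_source(mono_sound, hrir_left, hrir_right):
    """Create a virtual free-field source for headphone playback by
    filtering a mono sound with the HRIRs of the desired direction."""
    left = fftconvolve(mono_sound, hrir_left)
    right = fftconvolve(mono_sound, hrir_right)
    return np.stack([left, right], axis=1)   # stereo signal, one column per ear
```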
3. The virtual space technique

While dichotic stimulation does not lead to externalisation of sound sources, the use of HRTF-filtered stimuli in a virtual auditory space does (Hartmann & Wittenberg, 1996; Plenge, 1974; Wightman & Kistler, 1989b). For that reason, numerous attempts have been made to develop virtual auditory worlds for humans or animals. The main goal is that virtual sound sources in VAS should unambiguously reflect all free-field sound characteristics. A second goal, especially in human research, is to create virtual auditory worlds that are universally applicable across all listeners. This requires a trade-off between the realistic simulation of free-field characteristics and computational power; that is, one wants to discard nonessential cues while preserving all relevant cues. For that purpose, knowledge is required on which cues are relevant for sound localisation and which are not.

The methods involved in creating VAS originated in the 1950s, when systematic experiments using artificial head manikins were undertaken (reviewed by Paul, 2009). However, it is only the computational power developed within the last two decades that allows for elaborate calculations and manipulations of virtual auditory stimuli.

Measuring the HRTFs is usually done by inserting small microphones into the ear canals of the subject, as sketched in Figure 1, and measuring the sound impinging on the eardrum. Sound is replayed from a free-field loudspeaker (see Fig. 1a). The loudspeaker signal should contain all relevant frequencies within the hearing range of the subject. It has been shown that measurement at or close to the eardrum is adequate, because the measured signal contains the important information (Wightman & Kistler, 1989a,b). When the signal arrives at the eardrum, it has been filtered by the outer ear, the head and the body of the subject. The amplitude and phase spectra at the eardrum represent the HRTF for the given ear and the respective position. The monaural spectrum of a specific sound at a given position may be obtained by filtering the sound with the respective HRTF. The ITDs and ILDs occurring at a given position are derived by comparing the measured HRTFs at the two ears.

The procedure of replaying a free-field sound and recording the resulting impulse response at the subject's eardrum is usually carried out for representative spatial locations, i.e., the free-field speaker is positioned at a constant distance at different locations, for example by moving it along a circular hoop (Fig. 1a). In this way, the desired spatial positions in both the azimuthal and elevational planes may be sampled (Fig. 1b+c). One may use click stimuli (Dirac pulses, see Blauert, 1997) as free-field sounds. In this way, a broad range of frequencies can be presented in a very short time. However, such stimuli do not contain much energy and as a consequence have to be repeated many times (typically 1000) in order to increase the energy provided to the listener (reviewed in Blauert, 1997; see also Poganiatz & Wagner, 2001; Poganiatz et al., 2001).
Fig. 1. Schematic of a setup for HRTF measurements (after Hausmann et al., 2010). A) During HRTF measurements, the anesthetised owl is fixated with the help of a cloth jacket in the center of a metal hoop. A loudspeaker can be moved upwards or downwards along the hoop, allowing variation of the vertical stimulus angle as shown in panel B). The hoop can be rotated about its vertical axis, which allows positioning of the hoop at various azimuthal values, with 0° being directly in front of the owl, as shown in panel C).

Other stimuli are so-called sweep signals, which run from low to high frequencies or vice versa in a given time interval. For example, logarithmically rising sweeps have successfully been used for HRTF recordings in the owl (Campenhausen & Wagner, 2006; Hausmann et al., 2010). Such sweep signals have the advantage that a small number of repetitions of sound emissions suffices to yield reproducible measurements, while containing energy in all desired frequencies within the subject's hearing range.
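A logarithmic sweep of this kind can be generated, for example, as follows; the sampling rate, duration and frequency limits are assumptions chosen for the example rather than the values used in the cited recordings.

```python
import numpy as np
from scipy.signal import chirp

fs = 96000                      # sampling rate in Hz; assumption
duration = 1.0                  # sweep duration in seconds; assumption
t = np.arange(0, duration, 1.0 / fs)
# logarithmically rising sweep, e.g. 0.2-16 kHz
sweep = chirp(t, f0=200.0, t1=duration, f1=16000.0, method='logarithmic')
```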
The quality of HRTF recordings measured with both types of stimuli is comparable, as demonstrated by the similar shape of HRTFs and similar localisation performance in the owl for HRTF-filtered stimuli recorded either with click noise (Poganiatz & Wagner, 2001; Poganiatz et al., 2001) or with sweeps (Campenhausen & Wagner, 2006; Hausmann et al., 2009; Hausmann et al., 2010). Short click stimuli have also commonly been used for HRTF measurements in other animals and in humans, leading to localisation performance comparable to free-field stimulation (Delgutte et al., 1999; Musicant et al., 1990; Tollin & Yin, 2002; Wightman & Kistler, 1989b).

The impulse responses recorded at the subject's eardrum are influenced by the individual transfer functions not only of the subject itself, but also of the equipment used for the recordings, such as the microphones, the loudspeaker and the hardware components. In order to provide an accurate picture of the transfer characteristics, all impulse responses recorded with the subject (specific for each azimuthal (α) and elevational (ε) position) have to be corrected for the transfer characteristics of the system components (Tsys). The correction can easily be done by transforming each impulse response into the frequency domain via a Fast Fourier transformation (FFT) and then dividing each subject-specific spectrum by the reference measurement recorded for the system components, including the microphone, but without the subject (Tsys), following equation 1.
\[
H_{\alpha\varepsilon} = \frac{\tilde{H}_{\alpha\varepsilon}}{T_{sys}} = \frac{H_{\alpha\varepsilon}\cdot T_{sys}}{T_{sys}}
\qquad (1)
\]

where $\tilde{H}_{\alpha\varepsilon} = H_{\alpha\varepsilon}\cdot T_{sys}$ denotes the raw spectrum recorded at the eardrum for azimuth α and elevation ε.
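In practice, the division of equation 1 is performed bin by bin in the frequency domain, typically with a small regularization term to avoid dividing by near-zero bins; a minimal sketch with illustrative names:

```python
import numpy as np

def correct_hrtf(measured_ir, system_ir, eps=1e-12):
    """Remove the measurement chain from a recorded impulse response by
    spectral division with the system reference (equation 1)."""
    n = max(len(measured_ir), len(system_ir))
    H_meas = np.fft.rfft(measured_ir, n)
    T_sys = np.fft.rfft(system_ir, n)
    # eps regularizes bins where the reference has almost no energy
    return H_meas / (T_sys + eps)
```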
In both behavioural and electrophysiological experiments, HRTF-filtered stimuli open a wide range of possible manipulations to analyse single characteristics of sound processing and allow for predictions of localisation behaviour based on HRTF characteristics (humans: Getzmann & Lewald, 2010; Hebrank & Wright, 1974; cat: Brugge et al., 1996; May & Huang, 1996; guinea pig: Sterbing et al., 2003; owl: Hausmann et al., 2009; Poganiatz et al., 2001; Witten et al., 2010). Virtual auditory worlds can be created for all animals whose HRTFs are measured.

An advantage of using the barn owl rather than many mammalian species is that the owl performs saccadic head-turns towards a sound source when sitting on a perch (Knudsen et al., 1979), while the eyes and pinnae can barely be moved (Steinbach, 1972). In contrast, many mammals may move their eyes and pinnae, which allows, for example, cats or monkeys to locate sound sources to a certain extent even with a restrained head (Dent et al., 2009; Populin, 2006; Populin & Yin, 1998). The owl's saccadic head-turn response makes it possible to use the head-turn angle as a measure for the perceived sound source location (Knudsen & Konishi, 1978). The next section will review how HRTF-filtered stimuli have been implemented in the barn owl as a model system to tackle specific issues of sound localisation which are also relevant for human sound localisation.
4. Virtual auditory space and its applications in an auditory specialist

One of the first applications of VAS for the barn owl was the work of Poganiatz and co-workers (2001). The authors conducted a behavioural study in which individualised HRTFs of barn owls were manipulated in that the broadband ITD was artificially set to a specific value, irrespective of the natural ITD. This artificial ITD was either -100 µs, corresponding to a position of approximately -40° of azimuth (by definition left of the
animal) based on a change of 2.5 µs per degree (Campenhausen & Wagner, 2006), or +100 µs, corresponding to +40° of azimuth (by definition right of the animal). All other cues, such as the ILD and the monaural spectra, were preserved. That is, the stimuli were ambiguous in that the ITD might point towards a different hemisphere than did all the remaining cues. The authors of the study predicted that the owl should turn its head towards the position encoded by the ITD if the ITD was the relevant cue for azimuthal sound localisation. Similarly, the owl should turn towards the position encoded by the ILD and the monaural spectra if those cues were relevant for azimuthal localisation. When these manipulated stimuli were replayed via headphones, the owls always turned their heads towards the position that was encoded by the ITD and not by the remaining cues. From these findings, Poganiatz et al. (2001) concluded that the owls used exclusively the ITD to determine stimulus azimuth. As we will show below, this may hold for a large range of auditory space. However, the resolution of spatially ambiguous ITDs in the frontal and rear hemispheres requires further cues.

The same approach of manipulating virtual stimuli was used for investigating the role of ILD for elevational sound localisation, by setting the broadband ILD in HRTF-filtered stimuli to a fixed value (Poganiatz & Wagner, 2001). Such experiments showed that barn owls' elevational head-turn angles depend partly, but not exclusively, on ILDs. The role of ILDs and other cues for elevational localisation will be tackled in more detail below.

Thus, these earlier studies did not identify the cues needed to resolve front-back confusions or the localisation of phantom sources, which occur at positions that can be predicted from a narrowband sound's period duration and ITD. Both phenomena are commonly known problems in humans, especially for localisation in the median plane (Gardner & Gardner, 1973; Hill et al., 2000; Wenzel et al., 1993; Zahorik et al., 2006). Furthermore, it is still unclear which cues, apart from the broadband ILD, contribute to elevational sound localisation. The owl's ability to locate sound source elevation is essentially based on its ear asymmetry and facial ruff. Going a step further and utilising the morphological specialisations of the barn owl, a possible application for humans might thus be to mimic an owl's facial ruff to achieve better localisation performance in humans.

We extended the use of HRTF-filtered stimuli to answer some of the questions raised above. The method introduced by Campenhausen & Wagner (2006) allowed us to measure the influence of the barn owl's facial ruff for a closer analysis of the role of external filtering, as well as of the interplay of the owl's asymmetrically placed ears with the characteristically heart-shaped ruff. Using VAS enabled us to analyse the contributions of the facial ruff and the asymmetrically placed ear openings independently from each other, an important aspect if one wants to implement the owl's specialisations in the engineering of sound localisation devices.
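The ITD manipulation used by Poganiatz et al. (2001) can be approximated by delaying one channel of the binaural stimulus, as sketched below. The integer-sample delay, the sign convention and the function name are simplifying assumptions; a 100 µs shift generally falls between samples, so published work may rely on fractional-delay filtering instead.

```python
import numpy as np

def impose_itd(left, right, itd_us, fs):
    """Set the broadband ITD of a binaural stimulus to a fixed value by
    delaying one channel; all other cues stay untouched."""
    shift = int(round(abs(itd_us) * 1e-6 * fs))   # delay in whole samples
    pad = np.zeros(shift)
    if itd_us > 0:   # positive ITD: delay the left ear (sign convention assumed)
        left, right = np.concatenate([pad, left]), np.concatenate([right, pad])
    else:
        left, right = np.concatenate([left, pad]), np.concatenate([pad, right])
    return left, right
```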
Virtual ruff removal (Hausmann et al., 2009) was realised by recording HRTFs for anesthetised barn owls a) with the intact ruff of the animal that was later tested in the behavioural experiments (individualised HRTFs), b) for a reference animal with intact ruff (reference owl, normal non-individualised HRTFs) and c) for the same reference animal after successive removal of all feathers of the facial disk, leaving only the rear body feathers intact (see also Campenhausen & Wagner, 2006), named the "ruffcut" condition. The advantage of simulating ruff removal rather than actually removing the ruff of the behaving owls was a better reproducibility of stimulus conditions over the course of the experiments, as the feathers regrow after removal and thus the stimulus conditions change. Furthermore, the responses to the stimuli were comparable between subjects, since the stimulus
conditions were equal for all three owls included in the study. And third, virtual ruff removal is a more animal-friendly approach than real removal of the feathers, since the behaving owls are not hampered in their usual localisation behaviour, as would be the case if one actually removed their facial ruff.

The measurements yielded three sets of HRTFs. In the behavioural experiments, broadband noise (1-12 kHz) was filtered with these HRTFs to simulate ruff removal for three owls. The former two stimulus conditions with intact ruff were required for comparison of the normal localisation performance with that in response to simulated ruff removal. In parallel, the changes of the binaural cues were analysed. Virtual ruff removal resulted not only in a reduction of the ITD range in the periphery (Fig. 2B), but also in a corresponding reduction of azimuthal head-turn angles for ruffcut versus normal stimuli (Fig. 2A). That is, the virtual ruff removal induced a change in localisation behaviour that could be correlated with the accompanying changes of the ITD as the relevant cue for azimuthal localisation (see also Poganiatz et al., 2001).

Virtual ruff removal influenced behaviour in two major ways. First, it caused the ILDs in the frontal field to become smaller, and the ILDs no longer varied with elevation, as is the case in HRTFs recorded with an intact ruff (see also Keller et al., 1998). Correspondingly, the owls lost their ability to determine stimulus elevation. Second, while owls having a normal ruff could discriminate stimuli coming from the rear from those coming from the front even if the stimuli had the same ITD (Hausmann et al., 2009), this ability to distinguish between front and back for HRTFs having the same ITD was lost after virtual ruff removal. This finding implies that the ITD is indeed the only relevant cue for azimuthal localisation in the frontal field, as suggested by Poganiatz et al. (2001), but stimulus positions with equal ITD in the front and in the rear, respectively, may not be discriminated based on the ITD alone. Hence, the ruff provides cues other than the ITD to resolve position along the cone of confusion (see Blauert, 1997). Potential candidates for the cues provided by the ruff for front-back disambiguation are ILDs and monaural spectral cues, both of which are altered after ruff removal (Campenhausen & Wagner, 2006; Hausmann et al., 2009).

The role of ILDs and spectral cues can be investigated by keeping the ILD in virtual auditory stimuli constant, while the ITD and spectral cues vary with location according to their natural amplitudes. Such an approach was pursued in an earlier study by Poganiatz & Wagner (2001), where the ILDs in virtual acoustic stimuli were set to a fixed value of either -6 dB (left ear louder) or +6 dB in the frequency range from 4 to 10 kHz. In response to those manipulated stimuli, the owls responded with a positive elevational head-turn to the +6 dB stimuli and with a negative head-turn to the -6 dB stimuli. When the stimulus ILD was set to +6 dB, the owls' head-turn was directed to a relatively constant elevational position. In response to stimuli whose ILD was set to -6 dB, however, the elevational head-turn amplitude was constant at positive stimulus azimuths but increased with incrementally negative stimulus angles, or vice versa. The localisation behaviour depended on the stimulus position, meaning that the elevational localisation was not exclusively defined by the mean broadband ILD.
The study of Poganiatz & Wagner (2001) argued against monaural spectral cues as this additional cue, since the spectra had been preserved according to the natural shape. However, the owls’ elevational head-turn angles did not follow a simple and clear relationship, which renders conclusions on the contribution of single binaural and monaural cues difficult. The question remained of whether owls needed broadband ILDs to determine
Using Virtual Acoustic Space to Investigate Sound Localisation
187
the elevation of virtual sound sources, or whether ILDs in single frequency bands could be used as well.
Fig. 2. ITDs and azimuthal head-turn angle under normal and ruffcut conditions (after Hausmann et al., 2009). A) The azimuthal head-turn angles of owls in response to azimuthal stimulation (x-axis) with individualised HRTFs (dotted, data of two owls), non-individualised HRTFs of a reference animal (normal, black, three owls) and the stimuli from the reference owl after ruff removal (ruffcut, blue, three owls). Arrows mark the ±140° stimulus positions in the periphery, where the azimuthal head-turn angle decreased for stimulation with simulated ruff removal, in contrast to stimulation with intact ruff (individualised and reference owl normal), where the head-turn angles approach a plateau at about ±60°. Significant differences between stimulus conditions are marked with asterisks depending on the significance level (**p