Speech Enhancement in the Karhunen-Loeve Expansion Domain (Synthesis Lectures on Speech and Audio Processing)

Author: Jacob Benesty | Jingdong Chen | Yiteng Huang

Speech Enhancement in the Karhunen-Loève Expansion Domain

Synthesis Lectures on Speech and Audio Processing Editor B.H. Juang, Georgia Tech

Speech Enhancement in the Karhunen-Loève Expansion Domain Jacob Benesty, Jingdong Chen, and Yiteng Huang 2011

Sparse Adaptive Filters for Echo Cancellation Constantin Paleologu, Jacob Benesty, and Silviu Ciochina June 2010

Multi-Pitch Estimation Mads Græsbøll Christensen and Andreas Jakobsson 2009

Discriminative Learning for Speech Recognition: Theory and Practice Xiaodong He and Li Deng 2008

Latent Semantic Mapping: Principles & Applications Jerome R. Bellegarda 2007

Dynamic Speech Models: Theory, Algorithms, and Applications Li Deng 2006

Articulation and Intelligibility Jont B. Allen 2005

Copyright © 2011 by Morgan & Claypool

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means—electronic, mechanical, photocopy, recording, or any other except for brief quotations in printed reviews, without the prior permission of the publisher.

Speech Enhancement in the Karhunen-Loève Expansion Domain Jacob Benesty, Jingdong Chen, and Yiteng Huang www.morganclaypool.com

ISBN: 9781608456048 (paperback)
ISBN: 9781608456055 (ebook)

DOI 10.2200/S00326ED1V01Y201101SAP007

A Publication in the Morgan & Claypool Publishers series
SYNTHESIS LECTURES ON SPEECH AND AUDIO PROCESSING
Lecture #7
Series Editor: B.H. Juang, Georgia Tech
Series ISSN: Synthesis Lectures on Speech and Audio Processing
Print 1932-121X, Electronic 1932-1678

Speech Enhancement in the Karhunen-Loève Expansion Domain

Jacob Benesty INRS-EMT, University of Quebec

Jingdong Chen WeVoice, Inc.

Yiteng Huang WeVoice, Inc.

SYNTHESIS LECTURES ON SPEECH AND AUDIO PROCESSING #7

Morgan & Claypool Publishers

ABSTRACT This book is devoted to the study of the problem of speech enhancement whose objective is the recovery of a signal of interest (i.e., speech) from noisy observations. Typically, the recovery process is accomplished by passing the noisy observations through a linear filter (or a linear transformation). Since both the desired speech and undesired noise are filtered at the same time, the most critical issue of speech enhancement resides in how to design a proper optimal filter that can fully take advantage of the difference between the speech and noise statistics to mitigate the noise effect as much as possible while maintaining the speech perception identical to its original form. The optimal filters can be designed either in the time domain or in a transform space. As the title indicates, this book will focus on developing and analyzing optimal filters in the Karhunen-Loève expansion (KLE) domain. We begin by describing the basic problem of speech enhancement and the fundamental principles to solve it in the time domain. We then explain how the problem can be equivalently formulated in the KLE domain. Next, we divide the general problem in the KLE domain into four groups, depending on whether interframe and interband information is accounted for, leading to four linear models for speech enhancement in the KLE domain. For each model, we introduce signal processing measures to quantify the performance of speech enhancement, discuss the formation of different cost functions, and address the optimization of these cost functions for the derivation of different optimal filters. Both theoretical analysis and experiments will be provided to study the performance of these filters and the links between the KLE-domain and time-domain optimal filters will be examined.

KEYWORDS noise reduction, speech enhancement, single-channel microphone signal processing, Karhunen-Loève expansion (KLE), time domain, KLE domain, Wiener filter, tradeoff filter, maximum signal-to-noise ratio (SNR) filter, minimum variance distortionless response (MVDR) filter.

Contents

1 Introduction
    1.1 The Problem of Speech Enhancement
    1.2 Organization of the Book

2 Problem Formulation
    2.1 Signal Model
    2.2 Karhunen-Loève Expansion (KLE)

3 Optimal Filters in the Time Domain
    3.1 Performance Measures
    3.2 Mean-Square Error (MSE) Criterion
    3.3 Wiener Filter
    3.4 Tradeoff Filters
    3.5 Subspace-Type Filter
    3.6 Maximum Signal-to-Noise Ratio (SNR) Filter

4 Linear Models for Signal Enhancement in the KLE Domain
    4.1 Model 1
    4.2 Model 2
    4.3 Model 3
    4.4 Model 4

5 Optimal Filters in the KLE Domain with Model 1
    5.1 Performance Measures
    5.2 MSE Criterion
    5.3 Wiener Filter
    5.4 Tradeoff Filter
    5.5 Maximum SNR Filter

6 Optimal Filters in the KLE Domain with Model 2
    6.1 Performance Measures
    6.2 Maximum SNR Filter
    6.3 MSE Criterion
    6.4 Wiener Filter
    6.5 Minimum Variance Distortionless Response (MVDR) Filter
    6.6 Tradeoff Filter

7
    7.2 MSE Criterion
    7.3 Wiener Filter
    7.4 Tradeoff Filter
    7.5

8
    8.2 MSE Criterion
    8.3 Wiener Filter
    8.4 Tradeoff Filter
    8.5 MVDR Filter
    8.6

9 Experimental Study
    9.1 Experimental Conditions
    9.2 Estimation of the Correlation Matrices and Vectors
    9.3
    9.4 Performance of the Time-Domain Filters
        9.4.1 Wiener Filter
        9.4.2 Tradeoff Filter
    9.5 Performance of the KLE-Domain Filters with Model 1
        9.5.1 KLE-Domain Wiener Filter
        9.5.2 KLE-Domain Tradeoff Filter
    9.6 Performance of the KLE-Domain Filters with Model 3
    9.7 Performance of the KLE-Domain Filters with Model 2
    9.8 KLE-Domain Filters with Model 4

Bibliography
Authors’ Biographies
Index

CHAPTER 1

Introduction

A signal of interest (usually speech), when picked up by microphones, is inevitably contaminated by unwanted acoustic distortions. Depending on the mechanism that generates them, these distortions can be broadly classified into four basic categories: additive noise originating from various ambient sound sources, interference from concurrent competing speakers, filtering effects caused by room surface reflections and spectral shaping of recording devices, and echo from coupling between loudspeakers and microphones. These four categories of distortions interfere with the measurement, processing, recording, and communication of the desired speech signal in very distinct ways, and combating them has led to four important research areas: speech enhancement (also called noise reduction), source separation, speech dereverberation, and echo cancellation and suppression. A broad coverage of these research areas can be found in [6], [30]. This book is devoted to the study of the problem of single-channel speech enhancement in the Karhunen-Loève expansion (KLE) domain.

1.1 THE PROBLEM OF SPEECH ENHANCEMENT

Speech enhancement consists of recovering a speech signal of interest from microphone observations, which are corrupted by unwanted additive noise. By additive noise, we mean that the signal picked up by a microphone is a superposition of the clean speech and noise. In this scenario, the noise does not directly modify the statistics of the desired speech signal. However, the observed noisy signal can have very different characteristics in comparison to the desired speech. To illustrate this, Fig. 1.1 shows a clean speech signal, the same signal observed in a noisy conference room, and their spectrograms. Inspecting the difference between the clean speech and noisy signal spectrograms, one may notice that the noise effect manifests itself in several different aspects, including but not limited to: 1) many new frequency components are added into the observed signal, 2) a great portion of the time-varying spectra of the desired speech is masked, 3) the spectral intensity is increased, 4) the dynamic properties of the desired speech spectra near phonetic boundaries are smeared, and 5) the intermittent nature of speech becomes less distinct. These changes may greatly affect human perception of the desired speech. On the one hand, one can still perceive the useful information embedded in the desired speech signal when listening to the noisy one, but doing so takes more attention and may easily lead to listening fatigue. On the other hand, it may become impossible to comprehend the desired speech if the noise is strong. As a result, how to mitigate the noise effect, thereby recovering the desired speech signal from its noisy observations, has become an important problem for many applications such as voice communication and human-machine interfaces.

Figure 1.1: Illustration of the noise effect: (a) a clean speech signal, (b) the clean speech spectrogram, (c) a noisy speech observed in a conference room, and (d) the noisy speech spectrogram. Panels (a) and (c) plot amplitude and panels (b) and (d) plot frequency (kHz), all against time (s).

With a single microphone, the noise mitigation process is generally accomplished by properly filtering the noisy speech. The earliest attempt at this was made at Bell Laboratories, where Schroeder proposed a system for reducing noise in telecommunication environments in 1960 [48]. His method divides the noisy signal into a number of subbands. For each subband, a rectifier and a lowpass filter are applied in tandem to estimate the noisy speech envelope. The noise level in the corresponding subband is then estimated and subtracted from the noisy speech envelope, resulting in an estimate of the clean speech envelope for the subband. A second rectification process is applied to force the negative results, due to the subtraction, to zero. The rectified clean speech envelope estimate, which serves as a gain filter, is then multiplied with the unmodified subband signal.

Finally, the fullband signal is synthesized from all the subband outputs. This spectral subtraction method, implemented with analog circuits, did not receive much public attention, however, probably because it was never published in the form of a journal or conference paper for easy and broad circulation. In the late 1970s, Boll, in his informative paper [9], reformulated the spectral subtraction method in the framework of digital short-time Fourier analysis, which was later proved to be a particular case of the so-called parametric Wiener filter [41]. Almost at the same time, Lim and Oppenheim, in their landmark work [39], systematically formulated the speech enhancement problem and studied and compared the different algorithms known in the 1970s. Their work demonstrated that speech enhancement was not only effective in improving the quality of noise-corrupted speech, but also useful for increasing both the quality and intelligibility of linear prediction coding (LPC) based parametric speech coding systems. It was this work that sparked a huge amount of research attention to the problem. Many algorithms have been developed since then. The most notable contributions include the maximum likelihood (ML) estimator [41], the minimum-mean-square-error (MMSE) estimator [15], [16], and the maximum a posteriori (MAP) estimator [51], to name a few. These algorithms share the common key idea of applying a gain (whose value is between 0 and 1) to the noisy speech spectrum in each frequency band to attenuate the noise. They differ only in the form of the gain and how it is estimated. To derive their gains, the aforementioned MMSE, ML, and MAP estimators assume explicit knowledge of the marginal and joint probability distributions of the clean speech and noise spectra, so that the conditional expected value of the clean speech spectrum, given the noisy speech spectrum, can be evaluated.
However, the assumed distributions may not accurately reflect the behavior of real signals. One way to circumvent this issue is to collect some speech and noise samples and learn the distributions from the collected data. This has led to the development of the hidden Markov model (HMM) based speech enhancement technique. An HMM is a statistical model that uses a finite number of states and the associated state transitions to jointly model the temporal and spectral variation of the signals [2]. It has long been used for speech modeling with applications in speech recognition [1], [32], [44]. The HMM was introduced to deal with the speech enhancement problem in the late 1980s [17], [18], [19]. This method tackles the problem in two steps. In the first step, which is often called a training process, the probability distributions of the clean speech and the noise process are estimated from given training sequences. The estimated distributions are then applied in the second step to construct speech enhancement filters. Similar to the traditional frequency-domain techniques, the HMM method also applies a gain to the noisy speech spectrum to reduce noise, and many different gains can be formed [19], [47]. Besides not requiring explicit knowledge of the speech and noise distributions, the HMM technique has another advantage of being able to tolerate some nonstationarity in noise, depending on the number of states and mixtures used in the noise HMM. But distortion will arise when the characteristics of the noise are not represented in the training noise data.
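
The gain-based idea shared by the estimators above (a value between 0 and 1 applied to each frequency band of the noisy spectrum) can be illustrated with a minimal Python sketch. It uses a simple power-subtraction gain on a single frame and is not the exact method of any of the cited works. The function name `subtraction_gain`, the toy sinusoidal "speech" frame, the Hann window, and the assumption that the white-noise variance is known are all illustrative choices.

```python
import numpy as np

def subtraction_gain(noisy_power, noise_power):
    """Power-subtraction gain per frequency bin, rectified to lie in [0, 1]."""
    noisy_power = np.asarray(noisy_power, dtype=float)
    # Avoid dividing by zero in bins that happen to be empty.
    safe_power = np.maximum(noisy_power, np.finfo(float).tiny)
    return np.clip(1.0 - np.asarray(noise_power, dtype=float) / safe_power, 0.0, 1.0)

rng = np.random.default_rng(0)
n_fft, sigma = 256, 0.3
k = np.arange(n_fft)
clean = np.sin(2 * np.pi * 16 * k / n_fft)           # toy stand-in for one speech frame
noisy = clean + sigma * rng.standard_normal(n_fft)   # additive white noise

window = np.hanning(n_fft)
Y = np.fft.rfft(window * noisy)                      # short-time spectrum of the frame
noise_power = sigma**2 * np.sum(window**2)           # expected noise power per bin (white noise assumed)
G = subtraction_gain(np.abs(Y) ** 2, noise_power)    # one gain per frequency bin
enhanced = np.fft.irfft(G * Y, n=n_fft)              # back to the time domain
```

The clipping step plays the role of the rectification described in the text: without it, bins where the noise estimate exceeds the observed power would yield negative gains.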

Most early attempts in speech enhancement were made in the frequency domain. One may wonder why the frequency domain is preferred to the time domain, given that the noisy signal is originally observed in the time domain and the enhanced signal has to be in the time domain as well. There are many practical reasons for this. First of all, most of our knowledge and understanding of speech production and perception is related to frequencies. In the frequency domain, it is not only easier for us to design speech enhancement filters, but it is more straightforward to analyze and monitor their performance as well. Secondly, thanks to the fast Fourier transform (FFT), the implementation of frequency-domain filters can be made, in general, computationally more efficient than filters in the time domain. Furthermore, the statistics of a speech signal are time and frequency varying and noise can be either white or colored. In the frequency domain, the speech enhancement filters at different frequency bands are designed and handled independently. This gives significant flexibility in exploiting the difference between speech and noise statistics to optimize the amount of noise reduction. However, working in the frequency domain can incur some problems that may not be seen in the time domain, which need special attention. First, due to the circular convolution, some frequency aliasing will be added into the enhanced signal after applying a speech enhancement filter. This problem cannot be completely avoided unless we use a unit gain, which will not give any noise reduction. But one can manage to minimize the effect by applying a proper windowing function (such as the Kaiser one) before FFT and after the inverse FFT (IFFT). Second, speech enhancement filters are generally a function of the noisy and noise spectra. The two spectra are not known a priori and have to be estimated in real applications. 
A de facto standard practice in the field of speech enhancement is to treat the short-time FFT spectrum as an estimate of the true spectrum. Such an estimate, however, generally has very large variations about the true spectrum, causing the estimated gains to fall outside their theoretical range between 0 and 1. As a result, a nonlinear rectification process has to be used to force the gain to be between 0 and 1. But this would produce some isolated narrowband frequency components in the filtered spectrum. When transformed into the time domain, these isolated components produce tone-like sounds, which are widely referred to as “musical noise.” Musical noise is very unpleasant to hear. Much evidence has shown that, in most cases, listeners would rather listen to the original noisy signal than to the enhanced signal with musical noise. Therefore, it is important not to introduce such noise when we implement a frequency-domain algorithm. But getting rid of musical noise is not a trivial job and it took several decades for engineers to figure out how to do it. Even today, it is still not uncommon to see implementations that result in a signal that is of a lower perceptual quality than the original noisy signal.

Because of these problems with the frequency-domain techniques, it is often worthwhile to examine the speech enhancement problem in the time domain [4], [7], [11]. The formulation in the time domain not only can avoid some problems with the frequency-domain methods, but also can offer new insights into how to design optimal filters and properly evaluate them.

The time and frequency domains are not the only signal spaces in which the speech enhancement problem can be formulated and tackled. In the literature, several other transform spaces have been investigated, such as the LPC model space [22], [23], [35], [37], [42], [43] and the KLE domain. Among them, the KLE domain has received extensive attention. The major difference between the frequency and KLE domains is that the former uses a fixed transform (the Fourier transform) while the latter employs a signal-dependent transform (the KL transform) that is computed from the signal covariance matrix. There are at least two advantages of using the signal-dependent KL transform. First, if the covariance matrix is accurately estimated, there will be no aliasing problem. Second, the desired speech and noise may be better separated in the KLE domain than in the frequency domain.

The earliest attempts at using the KL transform were made by Dendrinos, Bakamidis, and Carayannis [14] and by Ephraim and Van Trees [20], where the so-called subspace technique was developed. In essence, the subspace approach projects the noisy signal vector into a different domain via the KL transform through the eigenvalue decomposition of an estimate of the correlation matrix of the noisy signal [20]. Once transformed, the speech signal only spans a portion of the entire space, and as a result, the entire vector space can be divided into two subspaces: the signal-plus-noise subspace and the noise-only subspace. Noise reduction is then achieved by removing the noise subspace and cleaning the signal-plus-noise subspace. The rationale of the subspace method in dealing with white noise is rather straightforward, but it becomes less obvious when noise is colored. To cope with colored noise, the subspace approach is extended to a more general form by using the generalized eigenvalue decomposition that simultaneously diagonalizes the clean, noisy, and noise covariance matrices. This extension was first reported in [33] and then redeveloped in [27], [28], [29]. However, one should note that this so-called generalized subspace method is not really a subspace technique since there is no noise-only subspace anymore after the generalized analysis transform. It is more appropriate to call it a constrained Wiener filter. Nevertheless, both the original subspace technique and the generalized one share the common idea of modifying the eigenvalues of the noisy covariance matrix to achieve noise reduction. By analogy, this is similar to filtering the noisy power spectrum in the frequency domain.

Recently, a general formulation of the speech enhancement problem in the KLE domain has been developed [7], [8], [13]. The basic paradigm of this new formulation follows an analysis-filtering-synthesis model. Given a noisy speech signal, which is assumed to be a superposition of a desired clean speech and an unwanted noise signal, a KLE analysis transform is estimated and applied to transform a vector of the noisy speech into the KLE domain. Following the convention used in the frequency domain, we call the components corresponding to each KLT basis vector a subband. For every subband, a filter is designed and applied to the noisy KLE coefficients, thereby obtaining an estimate of the clean speech KLE coefficients. Finally, the filtered KLE coefficients are transformed back to the time domain using the KLE synthesis transform. The most critical issue with this new formulation is how to design the optimal filters in the KLE domain, which is indeed the focus of this entire book.
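
The analysis-filtering-synthesis paradigm described above can be sketched in Python as follows. This is only an illustration under strong assumptions: the noise is white with known variance, the correlation matrix is estimated by averaging over all frames, and each subband is scaled by a simple rectified gain. The function name `kle_filter`, the gain rule, and the toy sinusoidal "speech" are hypothetical and are not the optimal filters derived later in the book.

```python
import numpy as np

def kle_filter(frames, noise_var):
    """Analysis, per-subband gain, and synthesis for noisy frames of shape (M, L).

    noise_var is the (assumed known) variance of white noise.
    """
    R_y = frames.T @ frames / frames.shape[0]        # estimate of the noisy correlation matrix
    eigvals, Q = np.linalg.eigh(R_y)                 # columns of Q are the KLT basis vectors
    coeffs = frames @ Q                              # analysis: one coefficient per subband
    # Hypothetical gain rule: shrink each subband by (lambda - noise_var) / lambda,
    # rectified to [0, 1]; subbands dominated by noise are driven toward zero.
    gains = np.clip(1.0 - noise_var / np.maximum(eigvals, 1e-12), 0.0, 1.0)
    return (coeffs * gains) @ Q.T                    # synthesis back to the time domain

rng = np.random.default_rng(0)
M, L, sigma = 2000, 16, 0.5
clean = np.sin(0.3 * np.arange(M * L)).reshape(M, L)  # toy narrowband "speech"
noisy = clean + sigma * rng.standard_normal((M, L))
enhanced = kle_filter(noisy, sigma**2)

mse_in = np.mean((noisy - clean) ** 2)     # error before filtering
mse_out = np.mean((enhanced - clean) ** 2) # error after filtering
```

In this toy setting the clean signal occupies only two of the sixteen subbands, so most subbands receive a gain near zero, which mirrors the subspace intuition for white noise.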

1.2 ORGANIZATION OF THE BOOK

The material in this book is organized into nine chapters, including this one. While the focus of the book is on the KLE-domain algorithms as its title indicates, we also attempt to cover the most
basic concepts and fundamental principles used to design the optimal filters in the time domain and explain the strong links between the time-domain and KLE-domain filters, which in turn help us better understand how noise reduction works in the frequency domain. The work discussed in these chapters is as follows.

Chapter 2 describes the speech enhancement problem that is going to be dealt with throughout the text. We first formulate the problem in the time domain, and then explain the principles of the KLE and how the time-domain signal model can be equivalently expressed in the KLE domain.

Noisy signals are originally observed in the time domain. It is, therefore, legitimate to tackle the speech enhancement problem in this domain. As pointed out earlier, the fundamental issue of speech enhancement in the time domain is how to design a linear filter or a linear transformation that can reduce noise while maintaining the desired speech perception identical to its original form. Typically, the design of a noise reduction filter follows three basic steps: defining a cost function, optimizing the cost function to obtain a noise reduction filter, and evaluating whether the filter achieves the expected performance. Chapter 3 provides an overview of the filter design issues in the time domain. We present several performance measures that can be used to evaluate noise reduction filters in the time domain. We also discuss how to define different mean-square errors (MSEs) and how to minimize these MSEs to obtain different noise reduction filters.

In Chapter 4, we discuss the basic speech enhancement problem in the KLE domain and present four linear models depending on whether the interframe and interband information is accounted for. These four linear models will lead to four different filter design approaches in the KLE domain.

Chapters 5 to 8 focus on the optimal noise reduction filter design issues in the KLE domain, with each chapter addressing the design issue associated with one linear model. For each linear model, we discuss the definitions of the performance measures, the MSE cost functions, and how to minimize these cost functions to obtain the optimal noise reduction filters. Also discussed in these chapters is the relationship between the KLE-domain and time-domain filters.

Chapter 9 provides experimental results to validate some of the key filters derived in Chapters 3 and 5–8.

CHAPTER 2

Problem Formulation

In this chapter, we formulate the problem of the additive noise picked up by a microphone along with the desired signal. We also explain the principle of the Karhunen-Loève expansion (KLE) and reformulate the time-domain signal model in the KLE domain.

2.1 SIGNAL MODEL

The noise reduction problem considered in this work is one of recovering the desired signal (or clean speech) $x(k)$, $k$ being the discrete-time index, of zero mean from the noisy observation (microphone signal) [7], [50]

\[ y(k) = x(k) + v(k), \tag{2.1} \]

where $v(k)$ is the unwanted additive noise, which is assumed to be a zero-mean random process (white or colored) and uncorrelated with $x(k)$. The signal model given in (2.1) can be written in a vector form if we process the data by blocks of $L$ samples:

\[ \mathbf{y}(m) = \mathbf{x}(m) + \mathbf{v}(m), \tag{2.2} \]

where $m \ge 0$ is the time-frame index,

\[ \mathbf{y}(m) = \begin{bmatrix} y(mL) & y(mL+1) & \cdots & y(mL+L-1) \end{bmatrix}^T \tag{2.3} \]

is a vector of length $L$, superscript $T$ denotes transposition of a vector or a matrix, and $\mathbf{x}(m)$ and $\mathbf{v}(m)$ are defined in a similar way to $\mathbf{y}(m)$. Since $x(k)$ and $v(k)$ are uncorrelated by assumption, the correlation matrix (of size $L \times L$) of the noisy signal is

\[ \mathbf{R}_y = E\left[\mathbf{y}(m)\mathbf{y}^T(m)\right] = \mathbf{R}_x + \mathbf{R}_v, \tag{2.4} \]

where $E[\cdot]$ denotes mathematical expectation, and

\[ \mathbf{R}_x = E\left[\mathbf{x}(m)\mathbf{x}^T(m)\right], \qquad \mathbf{R}_v = E\left[\mathbf{v}(m)\mathbf{v}^T(m)\right] \]


are the correlation matrices of $\mathbf{x}(m)$ and $\mathbf{v}(m)$, respectively. Our objective is then to find a “good” estimate of either $x(k)$ or $\mathbf{x}(m)$ in the sense that the additive noise is significantly reduced while the desired signal is only slightly distorted. This book will focus on the estimation of $\mathbf{x}(m)$. For that purpose, we will fully exploit the properties of the KLE.
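
As a rough numerical illustration of (2.1)–(2.4), the following Python sketch builds frames of length L from synthetic signals and checks that the sample correlation matrix of the noisy signal is close to the sum of those of the speech and the noise. The helper names `to_frames` and `sample_correlation`, the filtered-noise stand-in for speech, and the use of time averages in place of expectations are assumptions made for the example.

```python
import numpy as np

def to_frames(signal, L):
    """Stack non-overlapping blocks [s(mL), ..., s(mL + L - 1)] as rows, as in (2.3)."""
    M = len(signal) // L
    return np.reshape(signal[: M * L], (M, L))

def sample_correlation(frames):
    """Time-averaged estimate of E[s(m) s^T(m)] from the rows of `frames`."""
    return frames.T @ frames / frames.shape[0]

rng = np.random.default_rng(0)
N, L = 80_000, 8
# Toy "speech": white noise shaped by a decaying impulse response (an assumption).
x = np.convolve(rng.standard_normal(N), 0.9 ** np.arange(60))[:N]
v = rng.standard_normal(N)                # white noise, independent of x
y = x + v                                 # signal model (2.1)

R_x, R_v, R_y = (sample_correlation(to_frames(s, L)) for s in (x, v, y))
# With finite data the cross terms between x and v are small but not exactly zero.
rel_err = np.linalg.norm(R_y - (R_x + R_v)) / np.linalg.norm(R_y)
```

The residual `rel_err` reflects the sample cross-correlation between speech and noise, which vanishes only in expectation.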

2.2 KARHUNEN-LOÈVE EXPANSION (KLE)

As explained in [7], [8], [13], it may be advantageous to perform noise reduction in the KLE domain. In this section, we briefly recall the principle of the KLE which can be applied to $\mathbf{y}(m)$, $\mathbf{x}(m)$, or $\mathbf{v}(m)$. In this study, we choose to apply it to $\mathbf{y}(m)$ while the same concept was developed for $\mathbf{x}(m)$ in [7], [8], [13]. Fundamentally, we should not expect much difference between the two, but it is preferable to apply the KLE to $\mathbf{y}(m)$ as the corresponding covariance matrix is usually full rank and well conditioned. Let us first diagonalize the correlation matrix $\mathbf{R}_y$ as follows [24]:

\[ \mathbf{Q}^T \mathbf{R}_y \mathbf{Q} = \mathbf{\Lambda}, \tag{2.5} \]

where

\[ \mathbf{Q} = \begin{bmatrix} \mathbf{q}_1 & \mathbf{q}_2 & \cdots & \mathbf{q}_L \end{bmatrix} \tag{2.6} \]

and

\[ \mathbf{\Lambda} = \mathrm{diag}\left(\lambda_1, \lambda_2, \ldots, \lambda_L\right) \tag{2.7} \]

are, respectively, orthogonal and diagonal matrices. The orthonormal vectors $\mathbf{q}_1, \mathbf{q}_2, \ldots, \mathbf{q}_L$ are the eigenvectors corresponding, respectively, to the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_L$ of the matrix $\mathbf{R}_y$. The vector $\mathbf{y}(m)$ can be written as a combination (expansion) of the eigenvectors of the correlation matrix $\mathbf{R}_y$ as follows:

\[ \mathbf{y}(m) = \sum_{l=1}^{L} c_{y,l}(m)\,\mathbf{q}_l, \tag{2.8} \]

where

\[ c_{y,l}(m) = \mathbf{q}_l^T \mathbf{y}(m), \quad l = 1, 2, \ldots, L \tag{2.9} \]

are the coefficients of the expansion and $l$ is the subband¹ index. The representation of the random vector $\mathbf{y}(m)$, described by (2.8) and (2.9), is the Karhunen-Loève expansion (KLE) [25]. Equations (2.8) and (2.9) are, respectively, the synthesis and analysis parts of this expansion.

¹ In this book, the term subband refers to the signal component along each basis vector of the KLE.
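
The diagonalization (2.5) and the analysis and synthesis steps (2.9) and (2.8) can be tried numerically with a short Python sketch. The synthetic frames, the sample estimate of the correlation matrix, and the use of `numpy.linalg.eigh` (which returns the eigenvalues in ascending order) are illustrative assumptions, not a procedure prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(1)
L, M = 8, 5000
A = rng.standard_normal((L, L))
# Rows are synthetic zero-mean vectors y(m) with correlated entries (a stand-in for real frames).
frames = rng.standard_normal((M, L)) @ A.T
R_y = frames.T @ frames / M             # sample estimate of the correlation matrix

eigvals, Q = np.linalg.eigh(R_y)        # columns of Q are q_1, ..., q_L
Lambda = Q.T @ R_y @ Q                  # (2.5): diagonal up to rounding error
coeffs = frames @ Q                     # (2.9): coeffs[m, l-1] = q_l^T y(m)
reconstructed = coeffs @ Q.T            # (2.8): sum over l of c_{y,l}(m) q_l
```

Because Q is orthogonal, the synthesis step recovers every frame exactly (up to floating-point rounding), whatever the quality of the correlation estimate.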

From (2.9), we can easily verify that

$$E\left[c_{y,l}(m)\right] = 0, \quad l = 1, 2, \ldots, L \tag{2.10}$$

and

$$E\left[c_{y,i}(m)c_{y,j}(m)\right] = \begin{cases} \lambda_i, & i = j \\ 0, & i \neq j \end{cases}. \tag{2.11}$$

It can also be checked from (2.9) that

$$\sum_{l=1}^{L} c_{y,l}^2(m) = \left\|\mathbf{y}(m)\right\|_2^2, \tag{2.12}$$

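where ‖y(m)‖₂ is the Euclidean norm of y(m). The previous expression shows the energy conservation through the KLE process. We also define

$$c_{x,l}(m) = \mathbf{q}_l^T \mathbf{x}(m), \quad l = 1, 2, \ldots, L, \tag{2.13}$$

$$c_{v,l}(m) = \mathbf{q}_l^T \mathbf{v}(m), \quad l = 1, 2, \ldots, L. \tag{2.14}$$

We can check that

$$\sum_{l=1}^{L} c_{x,l}^2(m) = \left\|\mathbf{x}(m)\right\|_2^2, \tag{2.15}$$

$$\sum_{l=1}^{L} c_{v,l}^2(m) = \left\|\mathbf{v}(m)\right\|_2^2. \tag{2.16}$$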

From (2.11), we see that the interband correlation of the coefficients cy,l(m) is equal to 0. But the interband correlations of the coefficients cx,l(m) and cv,l(m) are

$$E\left[c_{x,i}(m)c_{x,j}(m)\right] = \mathbf{q}_i^T \mathbf{R}_x \mathbf{q}_j, \tag{2.17}$$

$$E\left[c_{v,i}(m)c_{v,j}(m)\right] = \mathbf{q}_i^T \mathbf{R}_v \mathbf{q}_j. \tag{2.18}$$

It is easy to verify that these interband correlations, i.e., E[cx,i(m)cx,j(m)] and E[cv,i(m)cv,j(m)] for i ≠ j, are equal to 0 only when the noise is white (assuming that the desired signal, i.e., speech, is always correlated, which is usually the case). However, in practice, noise is rarely white and the interband correlation should be taken into account in the design of filters for noise reduction. This idea was first proposed in [38] but in the frequency domain.

The speech signal is highly correlated. Therefore, the interframe correlation cannot be expected to be zero, i.e., E[cx,l(m)cx,l(m − i)] ≠ 0, and should be considered in the development of noise reduction algorithms.

Left multiplying both sides of (2.2) by the transpose of ql, the time-domain signal model is transformed into the KLE domain as

$$c_{y,l}(m) = c_{x,l}(m) + c_{v,l}(m), \quad l = 1, 2, \ldots, L. \tag{2.19}$$

Therefore, noise reduction in the KLE domain corresponds to the estimation of the coefficients cx,l(m), l = 1, 2, ..., L, from the observations cy,l(m), l = 1, 2, ..., L [7], [8], [13].
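The analysis and synthesis steps (2.8) and (2.9) can be sketched numerically. The following is only an illustration under stated assumptions: an AR(1) process stands in for speech, the noise is white, frames are taken without overlap, and Ry is replaced by a sample estimate. None of these choices come from the text, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
L = 8             # frame length (assumed value)
n_frames = 4000   # number of frames used for the sample statistics

def ar1(n, a, rng):
    """AR(1) sequence x(k) = a x(k-1) + w(k), used here as a stand-in for speech."""
    w = rng.standard_normal(n)
    x = np.empty(n)
    x[0] = w[0]
    for k in range(1, n):
        x[k] = a * x[k - 1] + w[k]
    return x

X = ar1(n_frames * L, 0.9, rng).reshape(n_frames, L)  # row m is the frame x(m)
V = rng.standard_normal((n_frames, L))                # white noise frames v(m)
Y = X + V                                             # y(m) = x(m) + v(m)

Ry = Y.T @ Y / n_frames        # sample estimate of R_y
lam, Q = np.linalg.eigh(Ry)    # R_y = Q diag(lam) Q^T with orthogonal Q, as in (2.5)

Cy = Y @ Q                     # analysis (2.9): row m holds c_{y,l}(m), l = 1..L
Y_rec = Cy @ Q.T               # synthesis (2.8)

Phi_cy = Cy.T @ Cy / n_frames  # sample version of (2.11), should equal diag(lam)
energy_gap = np.max(np.abs((Cy ** 2).sum(axis=1) - (Y ** 2).sum(axis=1)))  # (2.12)
Cx, Cv = X @ Q, V @ Q          # coefficients of x(m) and v(m), (2.13) and (2.14)
```

Because Q is computed from the same sample matrix that defines the statistics, the decorrelation in (2.11) holds to machine precision here; with a Q estimated from other data it would only hold approximately.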

CHAPTER 3

Optimal Filters in the Time Domain

This chapter reviews the classical time-domain linear filtering technique for noise reduction. Some new results are also presented. This chapter is important for the rest of this work since we will show later some interesting and strong links with noise reduction in the KLE domain.

In the time domain, the objective of noise reduction is to estimate x(m) from the observation vector y(m). Usually, we estimate the noise-free speech, x(m), by applying a linear transformation to the microphone signal [4], [5], [12], [30], [40], [50], i.e.,

$$\mathbf{z}(m) = \mathbf{H}_t \mathbf{y}(m) = \mathbf{H}_t\left[\mathbf{x}(m) + \mathbf{v}(m)\right] = \mathbf{x}_f(m) + \mathbf{v}_{rn}(m), \tag{3.1}$$

where Ht is a filtering matrix of size L × L,

$$\mathbf{x}_f(m) = \mathbf{H}_t \mathbf{x}(m) \tag{3.2}$$

is the filtered clean speech (or filtered desired signal), and

$$\mathbf{v}_{rn}(m) = \mathbf{H}_t \mathbf{v}(m) \tag{3.3}$$

is the filtered noise, which is often called the residual noise. The correlation matrix of the estimated signal is then

$$\mathbf{R}_z = E\left[\mathbf{z}(m)\mathbf{z}^T(m)\right] = \mathbf{H}_t \mathbf{R}_x \mathbf{H}_t^T + \mathbf{H}_t \mathbf{R}_v \mathbf{H}_t^T. \tag{3.4}$$

Therefore, with this time-domain formulation, the noise reduction problem becomes one of finding “good” filtering matrices that would attenuate the noise as much as possible while keeping the clean speech from being dramatically distorted. We start this chapter by defining some important performance measures.

3.1 PERFORMANCE MEASURES

One of the most important measures in noise reduction is the signal-to-noise ratio (SNR). We define the input SNR as the ratio of the intensity of the signal of interest (speech) over the intensity of the background noise, i.e.,

$$\mathrm{iSNR} = \frac{\sigma_x^2}{\sigma_v^2}, \tag{3.5}$$

where

$$\sigma_x^2 = E\left[x^2(k)\right] \quad \text{and} \quad \sigma_v^2 = E\left[v^2(k)\right]$$

are the variances of the signals x(k) and v(k), respectively. This definition of the input SNR can also be written in another form. With the signal model shown in (2.2), it is easy to check that

$$\sigma_x^2 = \frac{\mathrm{tr}\left(\mathbf{R}_x\right)}{L} \quad \text{and} \quad \sigma_v^2 = \frac{\mathrm{tr}\left(\mathbf{R}_v\right)}{L},$$

where tr(·) denotes the trace of a square matrix. Therefore, the input SNR can be rewritten as

$$\mathrm{iSNR} = \frac{\mathrm{tr}\left(\mathbf{R}_x\right)}{\mathrm{tr}\left(\mathbf{R}_v\right)}. \tag{3.6}$$

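After noise reduction with the time-domain model given in (3.1), the output SNR can be expressed as

$$\mathrm{oSNR}(\mathbf{H}_t) = \frac{E\left[\mathbf{x}_f^T(m)\mathbf{x}_f(m)\right]}{E\left[\mathbf{v}_{rn}^T(m)\mathbf{v}_{rn}(m)\right]} = \frac{\mathrm{tr}\left(\mathbf{H}_t \mathbf{R}_x \mathbf{H}_t^T\right)}{\mathrm{tr}\left(\mathbf{H}_t \mathbf{R}_v \mathbf{H}_t^T\right)}. \tag{3.7}$$

One of the most important goals of noise reduction is to improve the SNR after filtering [5], [11]. Therefore, we must design a filter, Ht, in such a way that oSNR(Ht) ≥ iSNR.

Another important measure in noise reduction is the noise-reduction factor, which quantifies the amount of noise being attenuated by the filter. With the time-domain formulation, this factor is defined as [5], [11]

$$\xi_{nr}(\mathbf{H}_t) = \frac{\mathrm{tr}\left(\mathbf{R}_v\right)}{\mathrm{tr}\left(\mathbf{H}_t \mathbf{R}_v \mathbf{H}_t^T\right)}. \tag{3.8}$$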

The larger the value of ξnr(Ht), the more the noise is reduced. After the filtering operation, the residual noise level is expected to be lower than that of the original noise level; therefore, this factor should have a lower bound of 1 for optimal filters.

The filtering operation adds distortion to the speech signal. In order to evaluate the amount of speech distortion, the concept of speech-distortion index has been introduced in [5], [11]. With this time-domain model, the speech-distortion index is defined as

$$\upsilon_{sd}(\mathbf{H}_t) = \frac{E\left\{\left[\mathbf{x}_f(m) - \mathbf{x}(m)\right]^T\left[\mathbf{x}_f(m) - \mathbf{x}(m)\right]\right\}}{E\left[\mathbf{x}^T(m)\mathbf{x}(m)\right]} = \frac{E\left\{\left[\mathbf{H}_t\mathbf{x}(m) - \mathbf{x}(m)\right]^T\left[\mathbf{H}_t\mathbf{x}(m) - \mathbf{x}(m)\right]\right\}}{\mathrm{tr}\left(\mathbf{R}_x\right)} = \frac{\mathrm{tr}\left[(\mathbf{H}_t - \mathbf{I})\mathbf{R}_x(\mathbf{H}_t - \mathbf{I})^T\right]}{\mathrm{tr}\left(\mathbf{R}_x\right)}, \tag{3.9}$$

where I is the identity matrix of size L × L. The speech-distortion index has a lower bound of 0 and an upper bound of 1 for optimal filters. The higher the value of υsd(Ht), the more the speech is distorted.

A measure that is somewhat similar to the noise-reduction factor is the speech-reduction factor defined as [7]

$$\xi_{sr}(\mathbf{H}_t) = \frac{\mathrm{tr}\left(\mathbf{R}_x\right)}{\mathrm{tr}\left(\mathbf{H}_t \mathbf{R}_x \mathbf{H}_t^T\right)}. \tag{3.10}$$

The larger the value of ξsr(Ht), the more the speech is reduced (or distorted). After the filtering operation, the speech level is typically lower than that of the original speech level; therefore, this factor should have a lower bound of 1 for optimal filters. It is easy to verify that we always have

$$\frac{\mathrm{oSNR}(\mathbf{H}_t)}{\mathrm{iSNR}} = \frac{\xi_{nr}(\mathbf{H}_t)}{\xi_{sr}(\mathbf{H}_t)}. \tag{3.11}$$

3.2 MEAN-SQUARE ERROR (MSE) CRITERION

Although many different criteria can be defined, the mean-square error (MSE) is, by far, the most used one because of its simplicity in terms of deriving useful filters and closed-form estimators. We define the error signal vector between the estimated and desired signals as

$$\mathbf{e}(m) = \mathbf{z}(m) - \mathbf{x}(m) = \mathbf{H}_t \mathbf{y}(m) - \mathbf{x}(m), \tag{3.12}$$

which can also be written as the sum of two orthogonal error signal vectors:

$$\mathbf{e}(m) = \mathbf{e}_x(m) + \mathbf{e}_v(m), \tag{3.13}$$

where

$$\mathbf{e}_x(m) = (\mathbf{H}_t - \mathbf{I})\,\mathbf{x}(m) \tag{3.14}$$

is the speech distortion due to the linear transformation and

$$\mathbf{e}_v(m) = \mathbf{H}_t \mathbf{v}(m) \tag{3.15}$$

represents the residual noise [20]. Having defined the error signal, we can now write the MSE criterion:

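$$J(\mathbf{H}_t) = \mathrm{tr}\left\{E\left[\mathbf{e}(m)\mathbf{e}^T(m)\right]\right\} = \mathrm{tr}\left(\mathbf{R}_x\right) + \mathrm{tr}\left(\mathbf{H}_t \mathbf{R}_y \mathbf{H}_t^T\right) - 2\,\mathrm{tr}\left(\mathbf{H}_t \mathbf{R}_{yx}\right) = \mathrm{tr}\left(\mathbf{R}_x\right) + \mathrm{tr}\left(\mathbf{H}_t \mathbf{R}_y \mathbf{H}_t^T\right) - 2\,\mathrm{tr}\left(\mathbf{H}_t \mathbf{R}_x\right), \tag{3.16}$$

where

$$\mathbf{R}_{yx} = E\left[\mathbf{y}(m)\mathbf{x}^T(m)\right]$$

is the cross-correlation matrix between the observation and desired signals, which can also be expressed as Ryx = Rx since Rvx = E[v(m)xT(m)] = 0 [x(m) and v(m) are assumed to be uncorrelated]. Similarly, using the uncorrelation assumption, expression (3.16) can be structured in terms of two MSEs, i.e.,

$$J(\mathbf{H}_t) = \mathrm{tr}\left\{E\left[\mathbf{e}_x(m)\mathbf{e}_x^T(m)\right]\right\} + \mathrm{tr}\left\{E\left[\mathbf{e}_v(m)\mathbf{e}_v^T(m)\right]\right\} = J_x(\mathbf{H}_t) + J_v(\mathbf{H}_t). \tag{3.17}$$

For the particular transformation Ht = I (the identity matrix), we get

$$J(\mathbf{I}) = \mathrm{tr}\left(\mathbf{R}_v\right), \tag{3.18}$$

so there will be neither noise reduction nor speech distortion. Using this particular case of the MSE, we define the normalized MSE (NMSE) as

$$\tilde{J}(\mathbf{H}_t) = \frac{J(\mathbf{H}_t)}{J(\mathbf{I})} = \mathrm{iSNR} \cdot \upsilon_{sd}(\mathbf{H}_t) + \frac{1}{\xi_{nr}(\mathbf{H}_t)}, \tag{3.19}$$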

where

$$\upsilon_{sd}(\mathbf{H}_t) = \frac{J_x(\mathbf{H}_t)}{\mathrm{tr}\left(\mathbf{R}_x\right)}, \tag{3.20}$$

$$\xi_{nr}(\mathbf{H}_t) = \frac{\mathrm{tr}\left(\mathbf{R}_v\right)}{J_v(\mathbf{H}_t)}. \tag{3.21}$$

This shows the connection between the NMSE and the speech-distortion index and the noise-reduction factor defined in Section 3.1.

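3.3 WIENER FILTER

If we differentiate the MSE criterion, J(Ht) [eq. (3.16)], with respect to Ht and equate the result to zero, we easily find the Wiener filtering matrix:

$$\mathbf{H}_{t,W} = \mathbf{R}_x \mathbf{R}_y^{-1} = \mathbf{I} - \mathbf{R}_v \mathbf{R}_y^{-1}. \tag{3.22}$$

This optimal filtering matrix depends on the correlation matrices Ry and Rv: the first one can be estimated during speech-and-noise periods while the second one can be estimated during noise-only intervals, assuming that the statistics of the noise do not change much with time. Now, if we substitute (2.5) into (3.22), we get another useful form of the time-domain Wiener filtering matrix:

$$\mathbf{H}_{t,W} = \mathbf{Q}\left[\boldsymbol{\Lambda} - \mathbf{Q}^T \mathbf{R}_v \mathbf{Q}\right]\boldsymbol{\Lambda}^{-1}\mathbf{Q}^T. \tag{3.23}$$

Let us define the following normalized correlation matrices:

$$\tilde{\mathbf{R}}_v = \frac{\mathbf{R}_v}{\sigma_v^2}, \qquad \tilde{\mathbf{R}}_x = \frac{\mathbf{R}_x}{\sigma_x^2}.$$

A third way to write the Wiener filtering matrix is

$$\mathbf{H}_{t,W} = \tilde{\mathbf{R}}_x \tilde{\mathbf{R}}_v^{-1}\left[\frac{\mathbf{I}}{\mathrm{iSNR}} + \tilde{\mathbf{R}}_x \tilde{\mathbf{R}}_v^{-1}\right]^{-1}. \tag{3.24}$$

We can see from (3.24) that

$$\lim_{\mathrm{iSNR} \to \infty} \mathbf{H}_{t,W} = \mathbf{I}, \tag{3.25}$$

$$\lim_{\mathrm{iSNR} \to 0} \mathbf{H}_{t,W} = \mathbf{0}. \tag{3.26}$$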

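Clearly, the Wiener filtering matrix may have a disastrous effect for low input SNRs since it may remove everything (noise and speech).

Property 3.1. With the optimal Wiener filtering matrix given in (3.22), the output SNR is always greater than or equal to the input SNR, i.e., oSNR(Ht,W) ≥ iSNR.

Proof. See [7]. □

The minimum MSE (MMSE) and minimum NMSE (MNMSE) are obtained by replacing Ht by Ht,W in (3.16) and (3.19):

$$J(\mathbf{H}_{t,W}) = \mathrm{tr}\left(\mathbf{R}_x\right) - \mathrm{tr}\left(\mathbf{R}_x \mathbf{R}_y^{-1} \mathbf{R}_x\right) = \mathrm{tr}\left(\mathbf{R}_v\right) - \mathrm{tr}\left(\mathbf{R}_v \mathbf{R}_y^{-1} \mathbf{R}_v\right), \tag{3.27}$$

$$\tilde{J}(\mathbf{H}_{t,W}) = 1 - \frac{\mathrm{tr}\left(\mathbf{R}_v \mathbf{R}_y^{-1} \mathbf{R}_v\right)}{\mathrm{tr}\left(\mathbf{R}_v\right)} \le 1. \tag{3.28}$$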
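As a numerical illustration of the Wiener filtering matrix and of the performance measures of Section 3.1, the sketch below evaluates (3.6) to (3.10), (3.16), (3.19), and (3.28) for one pair of correlation matrices. The Toeplitz matrices used for Rx and Rv are assumed values chosen only for the example, and the helper `toeplitz_corr` is hypothetical.

```python
import numpy as np

def toeplitz_corr(r):
    """Symmetric Toeplitz matrix built from the autocorrelation sequence r."""
    n = len(r)
    idx = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])
    return np.asarray(r)[idx]

L = 6
# Assumed statistics: strongly correlated "speech", mildly coloured noise.
Rx = 2.0 * toeplitz_corr(0.9 ** np.arange(L))
Rv = 1.0 * toeplitz_corr(0.3 ** np.arange(L))
Ry = Rx + Rv
I = np.eye(L)
Ry_inv = np.linalg.inv(Ry)

H = Rx @ Ry_inv                                            # Wiener matrix (3.22)

iSNR = np.trace(Rx) / np.trace(Rv)                         # (3.6)
oSNR = np.trace(H @ Rx @ H.T) / np.trace(H @ Rv @ H.T)     # (3.7)
xi_nr = np.trace(Rv) / np.trace(H @ Rv @ H.T)              # (3.8)
v_sd = np.trace((H - I) @ Rx @ (H - I).T) / np.trace(Rx)   # (3.9)
xi_sr = np.trace(Rx) / np.trace(H @ Rx @ H.T)              # (3.10)

J = np.trace(Rx) + np.trace(H @ Ry @ H.T) - 2 * np.trace(H @ Rx)   # (3.16)
nmse = J / np.trace(Rv)                                    # (3.19), J(I) = tr(R_v)
mnmse = 1 - np.trace(Rv @ Ry_inv @ Rv) / np.trace(Rv)      # (3.28)
```

For these matrices the output SNR exceeds the input SNR, in line with Property 3.1, and the NMSE computed from (3.16) agrees with the closed form (3.28).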

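We can compute the speech-distortion index by substituting (3.22) into (3.9):

$$\upsilon_{sd}(\mathbf{H}_{t,W}) = 1 - \frac{\mathrm{oSNR}(\mathbf{H}_{t,W}) + 2}{\mathrm{iSNR} \cdot \xi_{nr}(\mathbf{H}_{t,W})} \le 1. \tag{3.29}$$

Using (3.19) and (3.29), we get the noise-reduction factor:

$$\xi_{nr}(\mathbf{H}_{t,W}) = \frac{\mathrm{oSNR}(\mathbf{H}_{t,W}) + 1}{\mathrm{iSNR} - \tilde{J}(\mathbf{H}_{t,W})} \ge 1. \tag{3.30}$$

Property 3.2. We have

$$\frac{\mathrm{iSNR}}{1 + \mathrm{oSNR}(\mathbf{H}_{t,W})} \le \tilde{J}(\mathbf{H}_{t,W}) \le \frac{\mathrm{iSNR}}{1 + \mathrm{iSNR}}, \tag{3.31}$$

$$\frac{\left[1 + \mathrm{oSNR}(\mathbf{H}_{t,W})\right]^2}{\mathrm{iSNR} \cdot \mathrm{oSNR}(\mathbf{H}_{t,W})} \le \xi_{nr}(\mathbf{H}_{t,W}) \le \frac{(1 + \mathrm{iSNR})\left[1 + \mathrm{oSNR}(\mathbf{H}_{t,W})\right]}{\mathrm{iSNR}^2}, \tag{3.32}$$

$$\frac{1}{\left[1 + \mathrm{oSNR}(\mathbf{H}_{t,W})\right]^2} \le \upsilon_{sd}(\mathbf{H}_{t,W}) \le \frac{1 + \mathrm{oSNR}(\mathbf{H}_{t,W}) - \mathrm{iSNR}}{(1 + \mathrm{iSNR})\left[1 + \mathrm{oSNR}(\mathbf{H}_{t,W})\right]}. \tag{3.33}$$

Proof. See [7]. □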

PARTICULAR CASE: WHITE NOISE

We assume here that the noise picked up by the microphone is white (i.e., Rv = σv² I). In this situation, the Wiener filtering matrix becomes

$$\mathbf{H}_{t,W} = \mathbf{I} - \sigma_v^2 \mathbf{R}_y^{-1}, \tag{3.34}$$

where Ry = Rx + σv² I. It is well known that the inverse of the Toeplitz matrix Ry can be factorized as follows [3], [34]:

$$\mathbf{R}_y^{-1} = \begin{bmatrix} 1 & -c_{21} & \cdots & -c_{L1} \\ -c_{12} & 1 & \cdots & -c_{L2} \\ \vdots & \vdots & \ddots & \vdots \\ -c_{1L} & -c_{2L} & \cdots & 1 \end{bmatrix} \begin{bmatrix} 1/E_1 & 0 & \cdots & 0 \\ 0 & 1/E_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1/E_L \end{bmatrix}, \tag{3.35}$$

where the columns of the first matrix on the right-hand side of (3.35) are the linear interpolators of the signal y(k) and the elements El in the diagonal matrix are the respective interpolation-error powers. Using the factorization of the inverse of Ry in (3.27) and (3.28), the MMSE and MNMSE can be rewritten, respectively, as

$$J(\mathbf{H}_{t,W}) = L\sigma_v^2 - \left(\sigma_v^2\right)^2 \sum_{l=1}^{L} \frac{1}{E_l}, \tag{3.36}$$

$$\tilde{J}(\mathbf{H}_{t,W}) = 1 - \frac{\sigma_v^2}{L} \sum_{l=1}^{L} \frac{1}{E_l}. \tag{3.37}$$

Assume that the noise-free speech signal, x(k), is very well predictable. In this scenario, El ≈ σv², ∀ l, and replacing this value in (3.37), we find that J̃(Ht,W) ≈ 0. From (3.19), we then deduce that υsd(Ht,W) ≈ 0 (no speech distortion) and ξnr(Ht,W) ≈ ∞ (infinite noise reduction). Notice that, from a theoretical point of view (and with white noise), this result is independent of the SNR. Also,

$$\mathbf{H}_{t,W} \approx \begin{bmatrix} 0 & c_{12} & \cdots & c_{1L} \\ c_{21} & 0 & \cdots & c_{2L} \\ \vdots & \vdots & \ddots & \vdots \\ c_{L1} & c_{L2} & \cdots & 0 \end{bmatrix} \tag{3.38}$$

and Ht,W x(m) ≈ x(m), so that ξsr(Ht,W) ≈ 1 and oSNR(Ht,W) ≈ ∞; therefore, we can almost perfectly recover the signal x(k).

At the other extreme, let us see now what happens when the signal of interest x(k) is not predictable at all. In this situation, El ≈ σy², ∀ l, and cij ≈ 0, ∀ i, j, i ≠ j. Using these values, we get

$$\mathbf{H}_{t,W} \approx \frac{\mathrm{iSNR}}{1 + \mathrm{iSNR}}\,\mathbf{I}, \tag{3.39}$$

$$\tilde{J}(\mathbf{H}_{t,W}) \approx \frac{\mathrm{iSNR}}{1 + \mathrm{iSNR}}. \tag{3.40}$$

With the help of the two previous equations, it is straightforward to obtain

$$\xi_{nr}(\mathbf{H}_{t,W}) \approx \left(1 + \frac{1}{\mathrm{iSNR}}\right)^2, \tag{3.41}$$

$$\upsilon_{sd}(\mathbf{H}_{t,W}) \approx \frac{1}{(1 + \mathrm{iSNR})^2}, \tag{3.42}$$

$$\mathrm{oSNR}(\mathbf{H}_{t,W}) \approx \mathrm{iSNR}. \tag{3.43}$$

While some noise reduction is achieved (at the price of speech distortion), there is no improvement in the SNR, meaning that the Wiener filter has no positive effect on the microphone signal y(k). This analysis, even though simple, is quite insightful. It shows that the Wiener filter can mitigate the noise effect and improve the SNR, as long as the desired signal is somewhat predictable. However, in practice some discontinuities could be heard from a voiced signal to an unvoiced one, since for the former the noise will be mostly removed while it will not for the latter.
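The unpredictable-signal extreme can be checked with a few lines of code. This is a minimal sketch that idealizes "not predictable at all" as an exactly white desired signal (Rx = σx² I); the variances are assumed values.

```python
import numpy as np

L = 4
sigma_x2, sigma_v2 = 3.0, 1.5      # assumed variances of x(k) and v(k)
I = np.eye(L)
Rx = sigma_x2 * I                  # completely unpredictable (white) desired signal
Rv = sigma_v2 * I                  # white noise
H = Rx @ np.linalg.inv(Rx + Rv)    # Wiener filtering matrix (3.22)

iSNR = sigma_x2 / sigma_v2
oSNR = np.trace(H @ Rx @ H.T) / np.trace(H @ Rv @ H.T)
xi_nr = np.trace(Rv) / np.trace(H @ Rv @ H.T)
v_sd = np.trace((H - I) @ Rx @ (H - I).T) / np.trace(Rx)
```

In this idealized setting the filter reduces to a single scalar gain applied to every sample, which attenuates speech and noise alike, so the SNR cannot change.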

3.4 TRADEOFF FILTERS

The time-domain NMSE as shown in (3.19) is the sum of two terms. One depends on the speech distortion while the other one depends on the noise reduction. Instead of minimizing the NMSE with respect to Ht as we already did to find the Wiener filter, we can minimize the speech-distortion index with the constraint that the noise-reduction factor is equal to a value that is greater than one. Mathematically, this is equivalent to

$$\min_{\mathbf{H}_t} J_x(\mathbf{H}_t) \quad \text{subject to} \quad J_v(\mathbf{H}_t) = \beta \cdot \mathrm{tr}\left(\mathbf{R}_v\right), \tag{3.44}$$

where 0 < β < 1 in order to have some noise reduction. If we use a Lagrange multiplier, μ, to adjoin the constraint to the cost function, (3.44) can be rewritten as

$$\mathbf{H}_{t,T,\mu} = \arg\min_{\mathbf{H}_t} \mathcal{L}(\mathbf{H}_t, \mu), \tag{3.45}$$

with

$$\mathcal{L}(\mathbf{H}_t, \mu) = J_x(\mathbf{H}_t) + \mu\left[J_v(\mathbf{H}_t) - \beta \cdot \mathrm{tr}\left(\mathbf{R}_v\right)\right] \tag{3.46}$$

and μ ≥ 0. From (3.45) and assuming that the sum matrix Rx + μRv is invertible (if it is not, the pseudo inverse can be used), we can easily derive the optimal filtering matrix:

$$\mathbf{H}_{t,T,\mu} = \mathbf{R}_x\left(\mathbf{R}_x + \mu\mathbf{R}_v\right)^{-1} = \left[\mathbf{R}_y - \mathbf{R}_v\right]\left[\mathbf{R}_y + (\mu - 1)\mathbf{R}_v\right]^{-1} = \left[(1 - \mu)\mathbf{I} + \mu\mathbf{H}_{t,W}^{-1}\right]^{-1}, \tag{3.47}$$

where the Lagrange multiplier, μ, satisfies Jv(Ht,T,μ) = β · tr(Rv), which implies that

$$\xi_{nr}(\mathbf{H}_{t,T,\mu}) = \frac{1}{\beta} > 1. \tag{3.48}$$

In practice, it is not easy to determine the optimal μ. Therefore, when this parameter is chosen in an ad-hoc way, we can see that for

• μ = 1, Ht,T,1 = Ht,W, so the tradeoff filter degenerates to the Wiener one;
• μ = 0, Ht,T,0 = I, which is an identity filtering matrix that passes the noisy speech without changing it;
• μ > 1, results in low residual noise at the expense of high speech distortion;
• μ < 1, leads to little speech distortion and little noise reduction.

Property 3.3. With the tradeoff filtering matrix given in (3.47), the output SNR is always greater than or equal to the input SNR, i.e., oSNR(Ht,T,μ) ≥ iSNR, ∀μ ≥ 0.

Proof. See [7]. □
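The role of μ can be explored numerically. The sketch below evaluates the tradeoff filtering matrix (3.47) for a few ad-hoc values of μ; the correlation matrices, the list of μ values, and the helper names are assumptions made for the example.

```python
import numpy as np

def toeplitz_corr(r):
    """Symmetric Toeplitz matrix built from the autocorrelation sequence r."""
    n = len(r)
    idx = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])
    return np.asarray(r)[idx]

def tradeoff_filter(Rx, Rv, mu):
    """Tradeoff filtering matrix R_x (R_x + mu R_v)^{-1}, first form of (3.47)."""
    return Rx @ np.linalg.inv(Rx + mu * Rv)

def out_snr(H, Rx, Rv):
    """Output SNR (3.7)."""
    return np.trace(H @ Rx @ H.T) / np.trace(H @ Rv @ H.T)

L = 6
Rx = 2.0 * toeplitz_corr(0.9 ** np.arange(L))   # assumed speech statistics
Rv = 1.0 * toeplitz_corr(0.3 ** np.arange(L))   # assumed noise statistics
I = np.eye(L)
iSNR = np.trace(Rx) / np.trace(Rv)
H_w = Rx @ np.linalg.inv(Rx + Rv)               # Wiener matrix (3.22)

mus = [0.0, 0.5, 1.0, 2.0, 5.0]                 # ad-hoc choices of mu
filters = [tradeoff_filter(Rx, Rv, mu) for mu in mus]
snrs = [out_snr(H, Rx, Rv) for H in filters]
xi_nrs = [np.trace(Rv) / np.trace(H @ Rv @ H.T) for H in filters]

# Third form of (3.47), evaluated for mu = 2.
H_alt = np.linalg.inv((1 - 2.0) * I + 2.0 * np.linalg.inv(H_w))
```

With these matrices, both the output SNR and the noise-reduction factor grow with μ, which matches the qualitative description in the list above.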

We can find another tradeoff filtering matrix by minimizing the residual noise with the constraint that some level of speech distortion is allowed. Mathematically, this is equivalent to

$$\min_{\mathbf{H}_t} J_v(\mathbf{H}_t) \quad \text{subject to} \quad J_x(\mathbf{H}_t) = \beta_2 \cdot \mathrm{tr}\left(\mathbf{R}_x\right), \tag{3.49}$$

where β2 > 0 in order to have some noise reduction. If we use a Lagrange multiplier, μ2, to adjoin the constraint to the cost function, (3.49) can be rewritten as

$$\mathbf{H}_{t,T,2,\mu_2} = \arg\min_{\mathbf{H}_t} \mathcal{L}(\mathbf{H}_t, \mu_2), \tag{3.50}$$

with

$$\mathcal{L}(\mathbf{H}_t, \mu_2) = J_v(\mathbf{H}_t) + \mu_2\left[J_x(\mathbf{H}_t) - \beta_2 \cdot \mathrm{tr}\left(\mathbf{R}_x\right)\right] \tag{3.51}$$

and μ2 > 0. The optimal solution to this optimization problem is

$$\mathbf{H}_{t,T,2,\mu_2} = \mathbf{R}_x\left(\mathbf{R}_x + \frac{\mathbf{R}_v}{\mu_2}\right)^{-1}, \tag{3.52}$$

where the Lagrange multiplier, μ2, satisfies Jx(Ht,T,2,μ2) = β2 · tr(Rx), which implies that

$$\upsilon_{sd}(\mathbf{H}_{t,T,2,\mu_2}) = \beta_2 > 0. \tag{3.53}$$

From a practical point of view, the two tradeoff filters derived here are fundamentally the same since by taking μ = 1/μ2, we see that Ht,T,μ = Ht,T,2,1/μ.

3.5 SUBSPACE-TYPE FILTER

In [21], it is shown that two symmetric matrices Rx and Rv can be jointly diagonalized if Rv is positive definite. This joint diagonalization was first introduced by Jensen et al. [33] and then by Hu and Loizou [27], [28], [29] in the single-channel noise reduction problem. For our time-domain model, we get

$$\mathbf{R}_x = \mathbf{B}\boldsymbol{\Lambda}_{jd}\mathbf{B}^T, \tag{3.54}$$

$$\mathbf{R}_v = \mathbf{B}\mathbf{B}^T, \tag{3.55}$$

$$\mathbf{R}_y = \mathbf{B}\left[\mathbf{I} + \boldsymbol{\Lambda}_{jd}\right]\mathbf{B}^T, \tag{3.56}$$

where B is a full rank square matrix but not necessarily orthogonal, and the diagonal matrix

$$\boldsymbol{\Lambda}_{jd} = \mathrm{diag}\left(\lambda_{jd,1}, \lambda_{jd,2}, \ldots, \lambda_{jd,L}\right) \tag{3.57}$$

contains the eigenvalues of the matrix Rv⁻¹Rx with λjd,1 ≥ λjd,2 ≥ · · · ≥ λjd,L ≥ 0. Applying the decompositions (3.54)–(3.56) in (3.47), the tradeoff filter becomes

$$\mathbf{H}_{t,T,\mu} = \mathbf{B}\boldsymbol{\Lambda}_{jd}\left[\boldsymbol{\Lambda}_{jd} + \mu\mathbf{I}\right]^{-1}\mathbf{B}^{-1}. \tag{3.58}$$

Therefore, the estimation of the speech signal, x(m), is done in three steps: first, we apply the transform B⁻¹ to the noisy signal; second, the transformed signal is modified by the gain function Λjd[Λjd + μI]⁻¹; and, finally, we transform back the signal to its original domain by applying the transform B.

It is believed that a speech signal can be modelled as a linear combination of a number of some (linearly independent) basis vectors smaller than the dimension of these vectors [14], [20], [26], [31]. As a result, the vector space of the noisy signal can be decomposed in two subspaces: the signal-plus-noise subspace of length Ls and the null subspace of length Ln, with L = Ls + Ln. This implies that the last Ln eigenvalues of the matrix Rv⁻¹Rx are equal to zero. Therefore, we can rewrite (3.58) to obtain the subspace-type filter:

$$\mathbf{H}_{t,S,\mu} = \mathbf{B}\begin{bmatrix} \boldsymbol{\Sigma}_\mu & \mathbf{0}_{L_s \times L_n} \\ \mathbf{0}_{L_n \times L_s} & \mathbf{0}_{L_n \times L_n} \end{bmatrix}\mathbf{B}^{-1}, \tag{3.59}$$

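where

$$\boldsymbol{\Sigma}_\mu = \mathrm{diag}\left(\frac{\lambda_{jd,1}}{\lambda_{jd,1} + \mu}, \frac{\lambda_{jd,2}}{\lambda_{jd,2} + \mu}, \ldots, \frac{\lambda_{jd,L_s}}{\lambda_{jd,L_s} + \mu}\right) \tag{3.60}$$

is an Ls × Ls diagonal matrix. This algorithm is now often referred to as the generalized subspace approach. One should note, however, that there is no noise-only subspace with this formulation. Therefore, noise reduction can only be achieved by modifying the speech-plus-noise subspace by setting μ to a positive number.

Using (3.58) in (3.7), we find that

$$\mathrm{oSNR}(\mathbf{H}_{t,T,\mu}) = \frac{\mathrm{tr}\left(\mathbf{B}\boldsymbol{\Lambda}_{jd}^3\left[\boldsymbol{\Lambda}_{jd} + \mu\mathbf{I}\right]^{-2}\mathbf{B}^T\right)}{\mathrm{tr}\left(\mathbf{B}\boldsymbol{\Lambda}_{jd}^2\left[\boldsymbol{\Lambda}_{jd} + \mu\mathbf{I}\right]^{-2}\mathbf{B}^T\right)}. \tag{3.61}$$

As a result,

$$\lim_{\mu \to \infty} \mathrm{oSNR}(\mathbf{H}_{t,T,\mu}) = \frac{\mathrm{tr}\left(\mathbf{B}\boldsymbol{\Lambda}_{jd}^3\mathbf{B}^T\right)}{\mathrm{tr}\left(\mathbf{B}\boldsymbol{\Lambda}_{jd}^2\mathbf{B}^T\right)}. \tag{3.62}$$

In this limiting case, the tradeoff filter is of no interest since Ht,T,∞ = 0.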
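One possible way to obtain a joint diagonalization of the form (3.54)–(3.55) numerically is through a Cholesky factor of Rv followed by a symmetric eigendecomposition. This is a sketch of one standard construction, not necessarily the procedure used in the cited references; the rank-deficient Rx, the noise matrix, and the function name `joint_diag` are assumptions made for the example.

```python
import numpy as np

def joint_diag(Rx, Rv):
    """Return (B, lam) with Rx = B diag(lam) B^T and Rv = B B^T, lam in descending order."""
    C = np.linalg.cholesky(Rv)                # Rv = C C^T
    Ci = np.linalg.inv(C)
    lam, U = np.linalg.eigh(Ci @ Rx @ Ci.T)   # same eigenvalues as Rv^{-1} Rx
    order = np.argsort(lam)[::-1]
    lam, U = lam[order], U[:, order]
    return C @ U, lam

rng = np.random.default_rng(1)
L, Ls, mu = 6, 3, 2.0
A = rng.standard_normal((L, Ls))
Rx = A @ A.T                                  # rank-Ls speech correlation (assumed model)
N = rng.standard_normal((L, L))
Rv = N @ N.T + L * np.eye(L)                  # positive definite coloured noise (assumed)

B, lam = joint_diag(Rx, Rv)
B_inv = np.linalg.inv(B)

gains = np.zeros(L)
gains[:Ls] = lam[:Ls] / (lam[:Ls] + mu)       # diagonal gains of (3.60); last L - Ls are zero
H_S = B @ np.diag(gains) @ B_inv              # subspace-type filter (3.59)
H_T = Rx @ np.linalg.inv(Rx + mu * Rv)        # tradeoff filter, first form of (3.47)
```

Since the last L − Ls eigenvalues vanish for an exactly rank-deficient Rx, the subspace-type filter and the tradeoff filter coincide here; they differ once the small eigenvalues are only approximately zero, as with estimated statistics.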

3.6 MAXIMUM SIGNAL-TO-NOISE RATIO (SNR) FILTER

Contrary to what may be believed, the filtering matrix B⁻¹ that jointly diagonalizes the two matrices Rx and Rv does not maximize the output SNR. To derive the maximum SNR filter, we first need to rewrite the filtering matrix as

$$\mathbf{H}_t = \begin{bmatrix} \mathbf{h}_{t,1}^T \\ \mathbf{h}_{t,2}^T \\ \vdots \\ \mathbf{h}_{t,L}^T \end{bmatrix}, \tag{3.63}$$

where ht,l is a finite-impulse-response (FIR) filter of length L. We can rewrite the output SNR as

$$\mathrm{oSNR}(\mathbf{H}_t) = \frac{\sum_{l=1}^{L} \mathbf{h}_{t,l}^T \mathbf{R}_x \mathbf{h}_{t,l}}{\sum_{l=1}^{L} \mathbf{h}_{t,l}^T \mathbf{R}_v \mathbf{h}_{t,l}}. \tag{3.64}$$

Lemma 3.4. We have

$$\mathrm{oSNR}(\mathbf{H}_t) \le \max_l \frac{\mathbf{h}_{t,l}^T \mathbf{R}_x \mathbf{h}_{t,l}}{\mathbf{h}_{t,l}^T \mathbf{R}_v \mathbf{h}_{t,l}} = \chi. \tag{3.65}$$


Proof. Let us define the positive reals al = ht,lᵀ Rx ht,l and bl = ht,lᵀ Rv ht,l. We have

$$\frac{\sum_{l=1}^{L} a_l}{\sum_{l=1}^{L} b_l} = \sum_{l=1}^{L} \frac{a_l}{b_l} \cdot \frac{b_l}{\sum_{i=1}^{L} b_i}. \tag{3.66}$$

Now, define the two following vectors:

$$\mathbf{u} = \begin{bmatrix} \dfrac{a_1}{b_1} & \dfrac{a_2}{b_2} & \cdots & \dfrac{a_L}{b_L} \end{bmatrix}^T, \tag{3.67}$$

$$\mathbf{u}' = \begin{bmatrix} \dfrac{b_1}{\sum_{i=1}^{L} b_i} & \dfrac{b_2}{\sum_{i=1}^{L} b_i} & \cdots & \dfrac{b_L}{\sum_{i=1}^{L} b_i} \end{bmatrix}^T. \tag{3.68}$$

Using Hölder's inequality, we see that

$$\frac{\sum_{l=1}^{L} a_l}{\sum_{l=1}^{L} b_l} = \mathbf{u}^T\mathbf{u}' \le \left\|\mathbf{u}\right\|_\infty \left\|\mathbf{u}'\right\|_1 = \max_l \frac{a_l}{b_l}, \tag{3.69}$$

which ends the proof. □

Theorem 3.5. The maximum SNR filtering matrix is given by

$$\mathbf{H}_{t,\max} = \begin{bmatrix} \beta_1 \mathbf{h}_{t,\max}^T \\ \beta_2 \mathbf{h}_{t,\max}^T \\ \vdots \\ \beta_L \mathbf{h}_{t,\max}^T \end{bmatrix}, \tag{3.70}$$

where βl, l = 1, 2, ..., L are real numbers with at least one of them different from 0 and ht,max is the eigenvector corresponding to the maximum eigenvalue, λmax, of the matrix Rv⁻¹Rx. The corresponding output SNR is

$$\mathrm{oSNR}(\mathbf{H}_{t,\max}) = \lambda_{\max}. \tag{3.71}$$

Proof. From Lemma 3.4, we know that the output SNR is upper bounded by χ whose maximum value is clearly λmax. On the other hand, it can be checked from (3.64) that oSNR(Ht,max) = λmax. Since this output SNR is maximal, Ht,max is indeed the maximum SNR filter. □

It can be shown that for μ ≥ 1,

$$\mathrm{iSNR} \le \mathrm{oSNR}(\mathbf{H}_{t,W}) \le \mathrm{oSNR}(\mathbf{H}_{t,T,\mu}) \le \mathrm{oSNR}(\mathbf{H}_{t,\max}) = \lambda_{\max} \tag{3.72}$$

and for μ ≤ 1,

$$\mathrm{iSNR} \le \mathrm{oSNR}(\mathbf{H}_{t,T,\mu}) \le \mathrm{oSNR}(\mathbf{H}_{t,W}) \le \mathrm{oSNR}(\mathbf{H}_{t,\max}) = \lambda_{\max}. \tag{3.73}$$

Note that the filtering matrix

$$\mathbf{H}'_{t,\max} = \mathbf{Q}\mathbf{H}_{t,\max} \tag{3.74}$$

also maximizes the output SNR, so that Ht,max and H′t,max are fundamentally equivalent, following the basic principle of maximizing the time-domain output SNR.
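Theorem 3.5 and the ordering (3.72) can be illustrated with a small numerical experiment. The sketch below is built on assumed Toeplitz statistics, an arbitrary vector of βl values, and hypothetical helper names; the dominant eigenvector of Rv⁻¹Rx is obtained through a Cholesky whitening of Rv, which is one convenient route among several.

```python
import numpy as np

def toeplitz_corr(r):
    """Symmetric Toeplitz matrix built from the autocorrelation sequence r."""
    n = len(r)
    idx = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])
    return np.asarray(r)[idx]

def out_snr(H, Rx, Rv):
    """Output SNR (3.7), equivalently (3.64)."""
    return np.trace(H @ Rx @ H.T) / np.trace(H @ Rv @ H.T)

L = 6
Rx = 2.0 * toeplitz_corr(0.9 ** np.arange(L))   # assumed speech statistics
Rv = 1.0 * toeplitz_corr(0.3 ** np.arange(L))   # assumed noise statistics
iSNR = np.trace(Rx) / np.trace(Rv)

# Solve Rx h = lam Rv h by whitening with the Cholesky factor of Rv.
C = np.linalg.cholesky(Rv)
Ci = np.linalg.inv(C)
lam, U = np.linalg.eigh(Ci @ Rx @ Ci.T)         # ascending eigenvalues
lam_max = lam[-1]
h_max = Ci.T @ U[:, -1]                         # eigenvector of Rv^{-1} Rx for lam_max

beta = np.array([1.0, -0.5, 0.0, 2.0, 0.3, 1.5])   # arbitrary, not all zero
H_max = np.outer(beta, h_max)                   # rows beta_l * h_max^T, as in (3.70)

H_w = Rx @ np.linalg.inv(Rx + Rv)               # Wiener (3.22)
H_t3 = Rx @ np.linalg.inv(Rx + 3.0 * Rv)        # tradeoff (3.47) with mu = 3

# Lemma 3.4 for an arbitrary filtering matrix.
H_rand = np.random.default_rng(2).standard_normal((L, L))
a = np.einsum("li,ij,lj->l", H_rand, Rx, H_rand)   # a_l = h_l^T Rx h_l
b = np.einsum("li,ij,lj->l", H_rand, Rv, H_rand)   # b_l = h_l^T Rv h_l
chi = np.max(a / b)
```

Note that Ht,max has rank one, so a high output SNR by itself says nothing about how much the speech is distorted.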

CHAPTER 4

Linear Models for Signal Enhancement in the KLE Domain

From the KLE-domain signal model explained in Chapter 2, there are four possible linear models for the estimation of the desired signal as explained in this part.

4.1 MODEL 1

In the first and simplest model, which we call Model 1, neither interframe nor interband correlations are taken into account. With this model, the estimate of cx,l(m) is obtained with

$$c_{z_1,l}(m) = h_{1,l}\,c_{y,l}(m) = h_{1,l}\,c_{x,l}(m) + h_{1,l}\,c_{v,l}(m), \quad l = 1, 2, \ldots, L, \tag{4.1}$$

where h1,l is a (positive) gain factor that should be smaller than 1. This approach is pretty much equivalent to noise reduction in the frequency domain [7], which ignores the interband and interframe correlations of the signals. The variance of cz1,l(m) is

$$\phi_{c_{z_1,l}} = E\left[c_{z_1,l}^2(m)\right] = h_{1,l}^2\,\phi_{c_{y,l}} = h_{1,l}^2\,\lambda_l = h_{1,l}^2\,\phi_{c_{x,l}} + h_{1,l}^2\,\phi_{c_{v,l}}, \quad l = 1, 2, \ldots, L, \tag{4.2}$$

where

$$\phi_{c_{y,l}} = \lambda_l, \tag{4.3}$$

$$\phi_{c_{x,l}} = \mathbf{q}_l^T \mathbf{R}_x \mathbf{q}_l, \tag{4.4}$$

$$\phi_{c_{v,l}} = \mathbf{q}_l^T \mathbf{R}_v \mathbf{q}_l, \tag{4.5}$$

are the variances of cy,l(m), cx,l(m), and cv,l(m), respectively. Intuitively, we see from (4.2) that for the eigenvalues dominated by noise, the corresponding gains should be close to 0, while for the eigenvalues dominated by speech, the corresponding gains should be close to 1.

With Model 1, we can deduce the estimate of x(m) as

$$\mathbf{z}_1(m) = \sum_{l=1}^{L} c_{z_1,l}(m)\,\mathbf{q}_l = \sum_{l=1}^{L} h_{1,l}\,\mathbf{q}_l\mathbf{q}_l^T\,\mathbf{y}(m) = \mathbf{H}_{TD,1}\,\mathbf{y}(m), \tag{4.6}$$

where

$$\mathbf{H}_{TD,1} = \sum_{l=1}^{L} h_{1,l}\,\mathbf{q}_l\mathbf{q}_l^T = \mathbf{Q}\,\mathrm{diag}\left(h_{1,1}, h_{1,2}, \ldots, h_{1,L}\right)\mathbf{Q}^T \tag{4.7}$$

is a matrix of size L × L, which is the equivalent time-domain version of the gains h1,l in the KLE domain. Hence, the correlation matrix of z1(m) is

$$\mathbf{R}_{z_1} = \mathbf{Q}\,\mathrm{diag}\left(h_{1,1}^2\lambda_1, h_{1,2}^2\lambda_2, \ldots, h_{1,L}^2\lambda_L\right)\mathbf{Q}^T. \tag{4.8}$$
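The equivalence between per-subband gains and the time-domain matrix in (4.7) can be sketched as follows. The correlation matrix and the gains are random placeholders chosen for illustration, not values derived from any optimality criterion.

```python
import numpy as np

rng = np.random.default_rng(4)
L = 5
A = rng.standard_normal((L, L))
Ry = A @ A.T + np.eye(L)              # assumed positive definite R_y
lam, Q = np.linalg.eigh(Ry)           # KLE basis, as in (2.5)

h = rng.uniform(0.0, 1.0, size=L)     # hypothetical gains h_{1,l} in [0, 1)
H_td1 = Q @ np.diag(h) @ Q.T          # matrix form of (4.7)
H_sum = sum(h[l] * np.outer(Q[:, l], Q[:, l]) for l in range(L))  # sum form of (4.7)

y = rng.standard_normal(L)            # one observation frame
z1 = Q @ (h * (Q.T @ y))              # analysis, per-subband gain, synthesis (4.6)

Rz1 = H_td1 @ Ry @ H_td1.T            # correlation matrix of z_1(m)
```

Applying the gains coefficient by coefficient costs two matrix-vector products and L multiplications, and gives the same output as the L × L matrix form.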

4.2 MODEL 2

In Model 2, the interframe correlation is taken into account. Therefore, we estimate the coefficients cx,l(m), l = 1, 2, ..., L, by passing cy,l(m), l = 1, 2, ..., L, from consecutive time-frames through a linear filter, i.e.,

$$c_{z_2,l}(m) = \mathbf{h}_{2,l}^T\,\mathbf{c}_{y,l}(m) = \mathbf{h}_{2,l}^T\,\mathbf{c}_{x,l}(m) + \mathbf{h}_{2,l}^T\,\mathbf{c}_{v,l}(m), \quad l = 1, 2, \ldots, L, \tag{4.9}$$

where

$$\mathbf{h}_{2,l} = \begin{bmatrix} h_{2,l,0} & h_{2,l,1} & \cdots & h_{2,l,M-1} \end{bmatrix}^T$$

is an FIR filter of length M corresponding to the subband l,

$$\mathbf{c}_{y,l}(m) = \begin{bmatrix} c_{y,l}(m) & c_{y,l}(m-1) & \cdots & c_{y,l}(m-M+1) \end{bmatrix}^T$$

is a vector of length M, the vectors $\mathbf{c}_{x,l}(m)$ and $\mathbf{c}_{v,l}(m)$ are defined in a similar way to $\mathbf{c}_{y,l}(m)$, and M is the chosen number of consecutive frames. Taking M = 1 for all the filters h2,l in (4.9), we get Model 1 presented in the previous section. However, for M > 1, the interframe correlation will now be taken into account.

At time-frame m, our desired signal is the scalar $c_{x,l}(m)$ [and not the whole vector $\mathbf{c}_{x,l}(m)$]. However, the vector $\mathbf{c}_{x,l}(m)$ contains both the desired signal, $c_{x,l}(m)$, and the components $c_{x,l}(m-i)$, i ≠ 0, which are not the desired signals at time-frame m but signals that are correlated with $c_{x,l}(m)$. Therefore, the elements $c_{x,l}(m-i)$, i ≠ 0, contain both a part of the desired signal and a component that we consider as an interference. This suggests that we should decompose $c_{x,l}(m-i)$ into two orthogonal components corresponding to the part of the desired signal and interference, i.e.,

$$c_{x,l}(m-i) = \gamma_{c_{x,l}}(i)\,c_{x,l}(m) + c'_{x,l}(m-i), \tag{4.10}$$

where

$$c'_{x,l}(m-i) = c_{x,l}(m-i) - \gamma_{c_{x,l}}(i)\,c_{x,l}(m), \tag{4.11}$$

$$E\left[c_{x,l}(m)\,c'_{x,l}(m-i)\right] = 0, \tag{4.12}$$

and

$$\gamma_{c_{x,l}}(i) = \frac{E\left[c_{x,l}(m)\,c_{x,l}(m-i)\right]}{E\left[c_{x,l}^2(m)\right]} \tag{4.13}$$

is the interframe correlation coefficient of the signal $c_{x,l}(m)$. Hence, we can write the vector $\mathbf{c}_{x,l}(m)$ as

$$\mathbf{c}_{x,l}(m) = c_{x,l}(m)\,\boldsymbol{\gamma}_{c_{x,l}} + \mathbf{c}'_{x,l}(m) = \mathbf{c}_{x_d,l}(m) + \mathbf{c}'_{x,l}(m), \tag{4.14}$$

where $\mathbf{c}_{x_d,l}(m) = c_{x,l}(m)\,\boldsymbol{\gamma}_{c_{x,l}}$ is a vector depending on the desired signal,

$$\mathbf{c}'_{x,l}(m) = \begin{bmatrix} c'_{x,l}(m) & c'_{x,l}(m-1) & \cdots & c'_{x,l}(m-M+1) \end{bmatrix}^T$$

is the interference signal vector, and

$$\boldsymbol{\gamma}_{c_{x,l}} = \begin{bmatrix} \gamma_{c_{x,l}}(0) & \gamma_{c_{x,l}}(1) & \cdots & \gamma_{c_{x,l}}(M-1) \end{bmatrix}^T = \begin{bmatrix} 1 & \gamma_{c_{x,l}}(1) & \cdots & \gamma_{c_{x,l}}(M-1) \end{bmatrix}^T = \frac{E\left[c_{x,l}(m)\,\mathbf{c}_{x,l}(m)\right]}{E\left[c_{x,l}^2(m)\right]} \tag{4.15}$$

is the (normalized) interframe correlation vector. Substituting (4.14) into (4.9), we get

$$c_{z_2,l}(m) = c_{x,l}(m)\,\mathbf{h}_{2,l}^T\boldsymbol{\gamma}_{c_{x,l}} + \mathbf{h}_{2,l}^T\,\mathbf{c}'_{x,l}(m) + \mathbf{h}_{2,l}^T\,\mathbf{c}_{v,l}(m), \quad l = 1, 2, \ldots, L. \tag{4.16}$$
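The orthogonal decomposition (4.10)–(4.15) can be reproduced with sample averages. In this sketch an AR(1) sequence with coefficient 0.8 plays the role of the clean subband coefficient; that model, the sequence length, and all names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n, M, a = 20000, 3, 0.8

# Hypothetical clean subband coefficient sequence: AR(1) with coefficient a.
w = rng.standard_normal(n)
c = np.empty(n)
c[0] = w[0]
for m in range(1, n):
    c[m] = a * c[m - 1] + w[m]

# Row r holds [c(m), c(m-1), ..., c(m-M+1)] for m = M-1+r.
frames = np.stack([c[M - 1 - i : n - i] for i in range(M)], axis=1)
cur = frames[:, 0]                          # desired signal at frame m

gamma = frames.T @ cur / (cur @ cur)        # sample version of (4.15)
interf = frames - np.outer(cur, gamma)      # interference components from (4.11)
cross = interf.T @ cur / len(cur)           # sample version of (4.12)
```

With sample averages the orthogonality (4.12) holds by construction, and for an AR(1) sequence the estimated γ(1) is close to the AR coefficient.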

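We observe that the estimate of the desired signal is the sum of three terms that are mutually uncorrelated. The first one is clearly the filtered desired signal while the two others are the filtered undesired signals (interference-plus-noise). Therefore, the variance of cz2,l(m) is

$$\phi_{c_{z_2,l}} = \mathbf{h}_{2,l}^T\boldsymbol{\Phi}_{c_{y,l}}\mathbf{h}_{2,l} = \mathbf{h}_{2,l}^T\boldsymbol{\Phi}_{c_{x_d,l}}\mathbf{h}_{2,l} + \mathbf{h}_{2,l}^T\boldsymbol{\Phi}_{c'_{x,l}}\mathbf{h}_{2,l} + \mathbf{h}_{2,l}^T\boldsymbol{\Phi}_{c_{v,l}}\mathbf{h}_{2,l}, \quad l = 1, 2, \ldots, L, \tag{4.17}$$

where

$$\boldsymbol{\Phi}_{c_{y,l}} = E\left[\mathbf{c}_{y,l}(m)\mathbf{c}_{y,l}^T(m)\right], \tag{4.18}$$

$$\boldsymbol{\Phi}_{c_{x_d,l}} = E\left[\mathbf{c}_{x_d,l}(m)\mathbf{c}_{x_d,l}^T(m)\right] = \phi_{c_{x,l}}\,\boldsymbol{\gamma}_{c_{x,l}}\boldsymbol{\gamma}_{c_{x,l}}^T, \tag{4.19}$$

$$\boldsymbol{\Phi}_{c'_{x,l}} = E\left[\mathbf{c}'_{x,l}(m)\mathbf{c}'^T_{x,l}(m)\right] = \boldsymbol{\Phi}_{c_{x,l}} - \boldsymbol{\Phi}_{c_{x_d,l}}, \tag{4.20}$$

$$\boldsymbol{\Phi}_{c_{v,l}} = E\left[\mathbf{c}_{v,l}(m)\mathbf{c}_{v,l}^T(m)\right], \tag{4.21}$$

are the correlation matrices of the vectors $\mathbf{c}_{y,l}(m)$, $\mathbf{c}_{x_d,l}(m)$, $\mathbf{c}'_{x,l}(m)$, and $\mathbf{c}_{v,l}(m)$, respectively. We see clearly from these correlation matrices that the interframe correlation is taken into account. The estimate of the vector x(m) would be

$$\mathbf{z}_2(m) = \sum_{l=1}^{L} c_{z_2,l}(m)\,\mathbf{q}_l = \sum_{l=1}^{L}\sum_{i=0}^{M-1} h_{2,l,i}\,c_{y,l}(m-i)\,\mathbf{q}_l = \sum_{i=0}^{M-1}\sum_{l=1}^{L} h_{2,l,i}\,\mathbf{q}_l\mathbf{q}_l^T\,\mathbf{y}(m-i) = \sum_{i=0}^{M-1} \mathbf{H}_{TD,2,i}\,\mathbf{y}(m-i), \tag{4.22}$$

where

$$\mathbf{H}_{TD,2,i} = \sum_{l=1}^{L} h_{2,l,i}\,\mathbf{q}_l\mathbf{q}_l^T, \quad i = 0, 1, \ldots, M-1 \tag{4.23}$$

are the time-domain filtering matrices. We see again from (4.22) how the estimate depends on the M successive frames of the observation signal vector y(m). The correlation matrix of z2(m) is

$$\mathbf{R}_{z_2} = \sum_{i=0}^{M-1}\sum_{j=0}^{M-1} \mathbf{H}_{TD,2,i}\,E\left[\mathbf{y}(m-i)\mathbf{y}^T(m-j)\right]\mathbf{H}_{TD,2,j}^T. \tag{4.24}$$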

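4.3 MODEL 3

In our third model, the interband correlation is taken into account. Then, we have

$$c_{z_3,l}(m) = \mathbf{h}_{3,l}^T\,\mathbf{c}_y(m) = \mathbf{h}_{3,l}^T\,\mathbf{c}_x(m) + \mathbf{h}_{3,l}^T\,\mathbf{c}_v(m), \quad l = 1, 2, \ldots, L, \tag{4.25}$$

where

$$\mathbf{h}_{3,l} = \begin{bmatrix} h_{3,l,0} & h_{3,l,1} & \cdots & h_{3,l,L'-1} \end{bmatrix}^T$$

is an FIR filter of length L′ ≤ L, corresponding to the subband l,

$$\mathbf{c}_y(m) = \begin{bmatrix} c_{y,1}(m) & c_{y,2}(m) & \cdots & c_{y,L'}(m) \end{bmatrix}^T \tag{4.26}$$

is a vector of length L′, and cx(m) and cv(m) are defined in a similar way to cy(m). Taking L′ = 1 for all the filters h3,l in (4.25), we obtain Model 1. However, for L′ > 1, the interband correlation will now be taken into account. In the rest, we will always assume that L′ = L. In this case, cy(m) = Qᵀy(m), cx(m) = Qᵀx(m), and cv(m) = Qᵀv(m). In a vector form, (4.25) is

$$\mathbf{c}_{z_3}(m) = \begin{bmatrix} c_{z_3,1}(m) & c_{z_3,2}(m) & \cdots & c_{z_3,L}(m) \end{bmatrix}^T = \mathbf{H}_3\,\mathbf{c}_y(m) = \mathbf{H}_3\,\mathbf{c}_x(m) + \mathbf{H}_3\,\mathbf{c}_v(m), \tag{4.27}$$

where

$$\mathbf{H}_3 = \begin{bmatrix} \mathbf{h}_{3,1}^T \\ \mathbf{h}_{3,2}^T \\ \vdots \\ \mathbf{h}_{3,L}^T \end{bmatrix}$$

is a filtering matrix of size L × L. For this model, cx(m) is our desired signal vector. The correlation matrix of cz3(m) is

$$\boldsymbol{\Phi}_{c_{z_3}} = \mathbf{H}_3\boldsymbol{\Phi}_{c_y}\mathbf{H}_3^T = \mathbf{H}_3\boldsymbol{\Phi}_{c_x}\mathbf{H}_3^T + \mathbf{H}_3\boldsymbol{\Phi}_{c_v}\mathbf{H}_3^T, \tag{4.28}$$

where

$$\boldsymbol{\Phi}_{c_y} = E\left[\mathbf{c}_y(m)\mathbf{c}_y^T(m)\right] = \boldsymbol{\Lambda}, \tag{4.29}$$

$$\boldsymbol{\Phi}_{c_x} = E\left[\mathbf{c}_x(m)\mathbf{c}_x^T(m)\right] = \mathbf{Q}^T\mathbf{R}_x\mathbf{Q}, \tag{4.30}$$

$$\boldsymbol{\Phi}_{c_v} = E\left[\mathbf{c}_v(m)\mathbf{c}_v^T(m)\right] = \mathbf{Q}^T\mathbf{R}_v\mathbf{Q}, \tag{4.31}$$

are the correlation matrices of the vectors cy(m), cx(m), and cv(m), respectively. With Model 3, the estimate of x(m) is

$$\mathbf{z}_3(m) = \mathbf{Q}\,\mathbf{c}_{z_3}(m) = \mathbf{Q}\mathbf{H}_3\mathbf{Q}^T\,\mathbf{y}(m) = \mathbf{H}_{TD,3}\,\mathbf{y}(m), \tag{4.32}$$

where

$$\mathbf{H}_{TD,3} = \mathbf{Q}\mathbf{H}_3\mathbf{Q}^T \tag{4.33}$$

is the time-domain form of H3. Therefore, the correlation matrix of z3(m) is

$$\mathbf{R}_{z_3} = \mathbf{Q}\mathbf{H}_3\boldsymbol{\Lambda}\mathbf{H}_3^T\mathbf{Q}^T, \tag{4.34}$$

which is interesting to compare to Rz1 of Model 1.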
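A short numerical sketch of (4.32)–(4.34) follows, with a random matrix standing in for H3 (an assumption made for illustration, not an optimal filter). It also shows how a diagonal H3 falls back to the Model 1 form (4.7).

```python
import numpy as np

rng = np.random.default_rng(5)
L = 5
A = rng.standard_normal((L, L))
Ry = A @ A.T + np.eye(L)               # assumed positive definite R_y
lam, Q = np.linalg.eigh(Ry)            # KLE basis, as in (2.5)

H3 = rng.standard_normal((L, L))       # hypothetical KLE-domain filtering matrix
H_td3 = Q @ H3 @ Q.T                   # time-domain form (4.33)

y = rng.standard_normal(L)             # one observation frame
z3 = Q @ (H3 @ (Q.T @ y))              # analysis, interband filtering, synthesis (4.32)

Rz3 = H_td3 @ Ry @ H_td3.T             # correlation matrix of z_3(m)
Rz3_kle = Q @ H3 @ np.diag(lam) @ H3.T @ Q.T   # closed form (4.34)

# Keeping only the diagonal of H3 ignores interband correlation (Model 1).
H_td1 = Q @ np.diag(np.diag(H3)) @ Q.T
```

The Model 1 matrix is always symmetric, whereas the Model 3 matrix generally is not, since its off-diagonal entries in the KLE domain couple different subbands.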

4.4 MODEL 4

In our fourth and last model, we take into account both the interframe and interband correlations. In this case, the coefficients cx,l(m), l = 1, 2, ..., L, are estimated as

$$c_{z_4,l}(m) = \sum_{i=0}^{M-1} \mathbf{h}_{4,l,i}^T\,\mathbf{c}_y(m-i) = \sum_{i=0}^{M-1} \mathbf{h}_{4,l,i}^T\,\mathbf{c}_x(m-i) + \sum_{i=0}^{M-1} \mathbf{h}_{4,l,i}^T\,\mathbf{c}_v(m-i), \quad l = 1, 2, \ldots, L, \tag{4.35}$$

where

$$\mathbf{h}_{4,l,i} = \begin{bmatrix} h_{4,l,i,0} & h_{4,l,i,1} & \cdots & h_{4,l,i,L'-1} \end{bmatrix}^T$$

is an FIR filter of length L′ ≤ L, corresponding to the subband index l and time-frame index i,

$$\mathbf{c}_y(m-i) = \begin{bmatrix} c_{y,1}(m-i) & c_{y,2}(m-i) & \cdots & c_{y,L'}(m-i) \end{bmatrix}^T \tag{4.36}$$

is a vector of length L′, and cx(m − i) and cv(m − i) are defined in a similar way to cy(m − i). Model 4 is a generalization of the three previous models. Indeed, taking L′ = M = 1 for all the filters in (4.35) gives Model 1; L′ = 1 leads to Model 2; and M = 1 corresponds to Model 3. In the rest, we will always assume that L′ = L. Expression (4.35) can be rewritten in a more convenient way as

$$c_{z_4,l}(m) = \mathbf{h}_{4,l}^T\,\underline{\mathbf{c}}_y(m) = \mathbf{h}_{4,l}^T\,\underline{\mathbf{c}}_x(m) + \mathbf{h}_{4,l}^T\,\underline{\mathbf{c}}_v(m), \quad l = 1, 2, \ldots, L, \tag{4.37}$$

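where

$$\mathbf{h}_{4,l} = \begin{bmatrix} \mathbf{h}_{4,l,0}^T & \mathbf{h}_{4,l,1}^T & \cdots & \mathbf{h}_{4,l,M-1}^T \end{bmatrix}^T$$

is an FIR filter of length ML,

$$\underline{\mathbf{c}}_y(m) = \begin{bmatrix} \mathbf{c}_y^T(m) & \mathbf{c}_y^T(m-1) & \cdots & \mathbf{c}_y^T(m-M+1) \end{bmatrix}^T$$

is a vector of length ML, and $\underline{\mathbf{c}}_x(m)$ and $\underline{\mathbf{c}}_v(m)$ are defined in a similar way to $\underline{\mathbf{c}}_y(m)$. In a vector form, (4.37) becomes

$$\mathbf{c}_{z_4}(m) = \begin{bmatrix} c_{z_4,1}(m) & c_{z_4,2}(m) & \cdots & c_{z_4,L}(m) \end{bmatrix}^T = \mathbf{H}_4\,\underline{\mathbf{c}}_y(m) = \mathbf{H}_4\,\underline{\mathbf{c}}_x(m) + \mathbf{H}_4\,\underline{\mathbf{c}}_v(m), \tag{4.38}$$

where

$$\mathbf{H}_4 = \begin{bmatrix} \mathbf{h}_{4,1}^T \\ \mathbf{h}_{4,2}^T \\ \vdots \\ \mathbf{h}_{4,L}^T \end{bmatrix}$$

is a filtering matrix of size L × ML.

At time-frame m, our desired signal vector is $\mathbf{c}_x(m)$ but not the whole vector $\underline{\mathbf{c}}_x(m)$. Therefore, we should decompose $\underline{\mathbf{c}}_x(m)$ into two orthogonal components:

$$\underline{\mathbf{c}}_x(m) = \boldsymbol{\Gamma}_{c_x}\,\mathbf{c}_x(m) + \underline{\mathbf{c}}'_x(m) = \underline{\mathbf{c}}_{x_d}(m) + \underline{\mathbf{c}}'_x(m), \tag{4.39}$$

where $\underline{\mathbf{c}}_{x_d}(m) = \boldsymbol{\Gamma}_{c_x}\,\mathbf{c}_x(m)$ is a linear version of the desired signal vector, $\underline{\mathbf{c}}'_x(m)$ is the interference signal vector of length ML,

$$\boldsymbol{\Gamma}_{c_x} = \begin{bmatrix} \boldsymbol{\Phi}_{c_x,0}\,\boldsymbol{\Phi}_{c_x}^{-1} \\ \boldsymbol{\Phi}_{c_x,1}\,\boldsymbol{\Phi}_{c_x}^{-1} \\ \vdots \\ \boldsymbol{\Phi}_{c_x,M-1}\,\boldsymbol{\Phi}_{c_x}^{-1} \end{bmatrix}$$

is the normalized interframe correlation matrix,

$$\boldsymbol{\Phi}_{c_x,i} = E\left[\mathbf{c}_x(m-i)\mathbf{c}_x^T(m)\right], \quad i = 0, 1, \ldots, M-1, \tag{4.40}$$

and

$$E\left[\mathbf{c}_x(m)\,\underline{\mathbf{c}}'^T_x(m)\right] = \mathbf{0}. \tag{4.41}$$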

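Substituting (4.39) in (4.38), we obtain

$$\mathbf{c}_{z_4}(m) = \mathbf{H}_4\boldsymbol{\Gamma}_{c_x}\,\mathbf{c}_x(m) + \mathbf{H}_4\,\underline{\mathbf{c}}'_x(m) + \mathbf{H}_4\,\underline{\mathbf{c}}_v(m) \tag{4.42}$$

and the correlation matrix of cz4(m) is

$$\boldsymbol{\Phi}_{c_{z_4}} = \mathbf{H}_4\boldsymbol{\Phi}_{\underline{c}_{x_d}}\mathbf{H}_4^T + \mathbf{H}_4\boldsymbol{\Phi}_{\underline{c}'_x}\mathbf{H}_4^T + \mathbf{H}_4\boldsymbol{\Phi}_{\underline{c}_v}\mathbf{H}_4^T, \tag{4.43}$$

where

$$\boldsymbol{\Phi}_{\underline{c}_{x_d}} = E\left[\underline{\mathbf{c}}_{x_d}(m)\,\underline{\mathbf{c}}_{x_d}^T(m)\right] = \boldsymbol{\Gamma}_{c_x}\boldsymbol{\Phi}_{c_x}\boldsymbol{\Gamma}_{c_x}^T, \tag{4.44}$$

$$\boldsymbol{\Phi}_{\underline{c}'_x} = E\left[\underline{\mathbf{c}}'_x(m)\,\underline{\mathbf{c}}'^T_x(m)\right] = \boldsymbol{\Phi}_{\underline{c}_x} - \boldsymbol{\Phi}_{\underline{c}_{x_d}}, \tag{4.45}$$

$$\boldsymbol{\Phi}_{\underline{c}_v} = E\left[\underline{\mathbf{c}}_v(m)\,\underline{\mathbf{c}}_v^T(m)\right], \tag{4.46}$$

are the correlation matrices of the vectors $\underline{\mathbf{c}}_{x_d}(m)$, $\underline{\mathbf{c}}'_x(m)$, and $\underline{\mathbf{c}}_v(m)$, respectively. With this model, the estimate of x(m) is

$$\mathbf{z}_4(m) = \sum_{l=1}^{L} c_{z_4,l}(m)\,\mathbf{q}_l = \sum_{l=1}^{L}\sum_{i=0}^{M-1} \mathbf{h}_{4,l,i}^T\,\mathbf{c}_y(m-i)\,\mathbf{q}_l = \sum_{i=0}^{M-1}\sum_{l=1}^{L} \mathbf{q}_l\mathbf{h}_{4,l,i}^T\mathbf{Q}^T\,\mathbf{y}(m-i) = \sum_{i=0}^{M-1} \mathbf{H}_{TD,4,i}\,\mathbf{y}(m-i), \tag{4.47}$$

where

$$\mathbf{H}_{TD,4,i} = \sum_{l=1}^{L} \mathbf{q}_l\mathbf{h}_{4,l,i}^T\mathbf{Q}^T, \quad i = 0, 1, \ldots, M-1 \tag{4.48}$$

are the time-domain filtering matrices. The correlation matrix of z4(m) is

$$\mathbf{R}_{z_4} = \sum_{i=0}^{M-1}\sum_{j=0}^{M-1} \mathbf{H}_{TD,4,i}\,E\left[\mathbf{y}(m-i)\mathbf{y}^T(m-j)\right]\mathbf{H}_{TD,4,j}^T. \tag{4.49}$$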

CHAPTER 5

Optimal Filters in the KLE Domain with Model 1

In this chapter, we study noise reduction with Model 1. We recall that in Model 1, neither interframe nor interband correlations are taken into account. To simplify the presentation, we drop the subscript “1” from the gain (see Chapter 4, Section 4.1), so that now h1,l is written as hl.

5.1


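To examine what happens in each subband, we define the subband input SNR as

$$\mathrm{iSNR}_l = \frac{\phi_{c_{x,l}}}{\phi_{c_{v,l}}} = \frac{\mathbf{q}_l^T\mathbf{R}_x\mathbf{q}_l}{\mathbf{q}_l^T\mathbf{R}_v\mathbf{q}_l}, \quad l = 1, 2, \ldots, L. \tag{5.1}$$

We can rewrite the input SNR (already defined in Chapter 3) as

$$\mathrm{iSNR} = \frac{\sum_{l=1}^{L} \mathbf{q}_l^T\mathbf{R}_x\mathbf{q}_l}{\sum_{l=1}^{L} \mathbf{q}_l^T\mathbf{R}_v\mathbf{q}_l} = \frac{\sigma_x^2}{\sigma_v^2}. \tag{5.2}$$

We can demonstrate that [7]

$$\mathrm{iSNR} \le \sum_{l=1}^{L} \mathrm{iSNR}_l. \tag{5.3}$$

The output SNR is the SNR after the filtering operation. From (4.2), we deduce the subband output SNR:

$$\mathrm{oSNR}(h_l) = \frac{h_l^2\,\phi_{c_{x,l}}}{h_l^2\,\phi_{c_{v,l}}} = \mathrm{iSNR}_l, \quad l = 1, 2, \ldots, L \tag{5.4}$$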

and the fullband output SNR:

$$\mathrm{oSNR}(h_{:}) = \frac{\sum_{l=1}^{L} h_l^2\,\phi_{c_{x,l}}}{\sum_{l=1}^{L} h_l^2\,\phi_{c_{v,l}}}. \tag{5.5}$$

We notice that the subband output SNR cannot be improved with just a gain, but the fullband output SNR can. We always have [7]

$$\mathrm{oSNR}(h_{:}) \le \sum_{l=1}^{L} \mathrm{iSNR}_l. \tag{5.6}$$

The previous inequality shows that the fullband output SNR is upper bounded no matter how the gains hl, l = 1, 2, ..., L are chosen. The subband and fullband noise-reduction factors are

$$\xi_{nr}(h_l) = \frac{\phi_{c_{v,l}}}{h_l^2\,\phi_{c_{v,l}}} = \frac{1}{h_l^2}, \quad l = 1, 2, \ldots, L, \tag{5.7}$$

$$\xi_{nr}(h_{:}) = \frac{\sum_{l=1}^{L} \phi_{c_{v,l}}}{\sum_{l=1}^{L} h_l^2\,\phi_{c_{v,l}}} = \frac{\sum_{l=1}^{L} \phi_{c_{v,l}}}{\sum_{l=1}^{L} \xi_{nr}^{-1}(h_l)\,\phi_{c_{v,l}}}. \tag{5.8}$$

The noise-reduction factor is supposed to have a lower bound of 1 for optimal gains, and the larger its value, the more the noise is reduced. We also have

$$\xi_{nr}(h_{:}) \le \sum_{l=1}^{L} \xi_{nr}(h_l). \tag{5.9}$$

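To quantify the speech distortion, we give the subband speech-distortion index

$$\upsilon_{sd}(h_l) = \frac{E\left\{\left[h_l\,c_{x,l}(m) - c_{x,l}(m)\right]^2\right\}}{\phi_{c_{x,l}}} = (h_l - 1)^2, \quad l = 1, 2, \ldots, L \tag{5.10}$$

and the fullband speech-distortion index

$$\upsilon_{sd}(h_{:}) = \frac{\sum_{l=1}^{L} E\left\{\left[h_l\,c_{x,l}(m) - c_{x,l}(m)\right]^2\right\}}{\sum_{l=1}^{L} \phi_{c_{x,l}}} = \frac{\sum_{l=1}^{L} \upsilon_{sd}(h_l)\,\phi_{c_{x,l}}}{\sum_{l=1}^{L} \phi_{c_{x,l}}}. \tag{5.11}$$

The speech-distortion index is usually upper bounded by 1. We have

$$\upsilon_{sd}(h_{:}) \le \sum_{l=1}^{L} \upsilon_{sd}(h_l). \tag{5.12}$$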

Another way to quantify signal distortion is via the speech-reduction factor. The subband and fullband definitions are

$$\xi_{sr}(h_l) = \frac{\phi_{c_{x,l}}}{h_l^2\,\phi_{c_{x,l}}} = \frac{1}{h_l^2}, \quad l = 1, 2, \ldots, L, \tag{5.13}$$

$$\xi_{sr}(h_{:}) = \frac{\sum_{l=1}^{L} \phi_{c_{x,l}}}{\sum_{l=1}^{L} h_l^2\,\phi_{c_{x,l}}} = \frac{\sum_{l=1}^{L} \phi_{c_{x,l}}}{\sum_{l=1}^{L} \xi_{sr}^{-1}(h_l)\,\phi_{c_{x,l}}}. \tag{5.14}$$

The speech-reduction factor is supposed to have a lower bound of 1 for optimal gains. We also have

$$\xi_{sr}(h_{:}) \le \sum_{l=1}^{L} \xi_{sr}(h_l). \tag{5.15}$$

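It can easily be checked that

$$\frac{\mathrm{oSNR}(h_l)}{\mathrm{iSNR}_l} = \frac{\xi_{\mathrm{nr}}(h_l)}{\xi_{\mathrm{sr}}(h_l)}, \quad l = 1, 2, \ldots, L, \tag{5.16}$$

$$\frac{\mathrm{oSNR}(h_{:})}{\mathrm{iSNR}} = \frac{\xi_{\mathrm{nr}}(h_{:})}{\xi_{\mathrm{sr}}(h_{:})}. \tag{5.17}$$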
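The fullband measures of this section can be evaluated numerically with a few lines of code. The following is only a sketch: the subband variances `phi_x`, `phi_v` and the gains `h` are made-up illustrative values, and the function name `fullband_measures` is hypothetical. It evaluates (5.2), (5.5), (5.8), (5.11), and (5.14), and prints both sides of (5.17) and the bound (5.6).

```python
# Sketch of the Model 1 fullband measures (5.2), (5.5), (5.8), (5.11), (5.14).
# phi_x[l] and phi_v[l] stand for the subband variances of the speech and
# noise KLE coefficients; all numbers are illustrative assumptions.

def fullband_measures(h, phi_x, phi_v):
    """Return (iSNR, oSNR, xi_nr, xi_sr, v_sd) for one real gain per subband."""
    sum_x, sum_v = sum(phi_x), sum(phi_v)
    out_x = sum(g * g * px for g, px in zip(h, phi_x))  # filtered speech power
    out_v = sum(g * g * pv for g, pv in zip(h, phi_v))  # residual noise power
    v_sd = sum((g - 1.0) ** 2 * px for g, px in zip(h, phi_x)) / sum_x
    return sum_x / sum_v, out_x / out_v, sum_v / out_v, sum_x / out_x, v_sd

phi_x = [4.0, 1.0, 0.25]
phi_v = [1.0, 1.0, 1.0]
h = [0.9, 0.5, 0.1]

i_snr, o_snr, xi_nr, xi_sr, v_sd = fullband_measures(h, phi_x, phi_v)
sub_isnr = [px / pv for px, pv in zip(phi_x, phi_v)]

print("iSNR =", i_snr, "oSNR =", o_snr)
print("oSNR/iSNR =", o_snr / i_snr, "xi_nr/xi_sr =", xi_nr / xi_sr)
print("bound (5.6) holds:", o_snr <= sum(sub_isnr))
```

With gains that attenuate the low-SNR subbands more strongly, as in this example, the fullband output SNR exceeds the input SNR even though each subband SNR is unchanged.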

5.2 MSE CRITERION

In the KLE domain and with Model 1, the error signal between the estimated and desired signals in the subband $l$ is

$$e_l(m) = c_{z_1,l}(m) - c_{x,l}(m) = h_l c_{y,l}(m) - c_{x,l}(m), \tag{5.18}$$

which can also be written as the sum of two uncorrelated error signals:

$$e_l(m) = e_{x,l}(m) + e_{v,l}(m), \tag{5.19}$$

where

$$e_{x,l}(m) = h_l c_{x,l}(m) - c_{x,l}(m) \tag{5.20}$$

is the speech distortion due to the gain and

$$e_{v,l}(m) = h_l c_{v,l}(m) \tag{5.21}$$

represents the residual noise. From the error signal (5.18), we give the corresponding KLE-domain (or subband) MSE criterion:

$$J(h_l) = E\left[e_l^2(m)\right] = h_l^2 \lambda_l - 2 h_l \phi_{c_x c_y,l} + \phi_{c_x,l}, \tag{5.22}$$

where

$$\phi_{c_x c_y,l} = E\left[c_{x,l}(m) c_{y,l}(m)\right] = E\left[c_{x,l}^2(m)\right] = \phi_{c_x,l}$$

is the cross-correlation between the signals $c_{x,l}(m)$ and $c_{y,l}(m)$. Expression (5.22) can be structured in a different way:

$$J(h_l) = E\left[e_{x,l}^2(m)\right] + E\left[e_{v,l}^2(m)\right] = J_x(h_l) + J_v(h_l). \tag{5.23}$$

For the particular gain $h_l = 1$, $\forall l$, we get

$$J(1) = E\left[c_{v,l}^2(m)\right] = \phi_{c_v,l} = \mathbf{q}_l^T \mathbf{R}_v \mathbf{q}_l, \tag{5.24}$$

so there will be neither noise reduction nor speech distortion. Using this particular case of the MSE, we define the KLE-domain (or subband) normalized MSE (NMSE) as

$$\tilde{J}(h_l) = \frac{J(h_l)}{J(1)} = \mathrm{iSNR}_l \cdot \upsilon_{\mathrm{sd}}(h_l) + \frac{1}{\xi_{\mathrm{nr}}(h_l)}, \tag{5.25}$$

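where

$$\upsilon_{\mathrm{sd}}(h_l) = \frac{J_x(h_l)}{\phi_{c_x,l}}, \tag{5.26}$$

$$\xi_{\mathrm{nr}}(h_l) = \frac{\mathbf{q}_l^T \mathbf{R}_v \mathbf{q}_l}{J_v(h_l)}. \tag{5.27}$$

The KLE-domain NMSE depends explicitly on the subband speech-distortion index and the subband noise-reduction factor. We define the fullband MSE and fullband NMSE as

$$J(h_{:}) = \frac{1}{L} \sum_{l=1}^{L} J(h_l) = \frac{1}{L} \sum_{l=1}^{L} (h_l - 1)^2 \phi_{c_x,l} + \frac{1}{L} \sum_{l=1}^{L} h_l^2 \phi_{c_v,l} = J_x(h_{:}) + J_v(h_{:}) \tag{5.28}$$

and

$$\tilde{J}(h_{:}) = \frac{L\, J(h_{:})}{\sum_{l=1}^{L} \mathbf{q}_l^T \mathbf{R}_v \mathbf{q}_l} = \frac{\sum_{l=1}^{L} (h_l - 1)^2 \phi_{c_x,l}}{\sum_{l=1}^{L} \mathbf{q}_l^T \mathbf{R}_v \mathbf{q}_l} + \frac{\sum_{l=1}^{L} h_l^2 \phi_{c_v,l}}{\sum_{l=1}^{L} \mathbf{q}_l^T \mathbf{R}_v \mathbf{q}_l} = \mathrm{iSNR} \cdot \upsilon_{\mathrm{sd}}(h_{:}) + \frac{1}{\xi_{\mathrm{nr}}(h_{:})}, \tag{5.29}$$

where

$$\upsilon_{\mathrm{sd}}(h_{:}) = \frac{L\, J_x(h_{:})}{\sum_{l=1}^{L} \phi_{c_x,l}}, \tag{5.30}$$

$$\xi_{\mathrm{nr}}(h_{:}) = \frac{\sum_{l=1}^{L} \mathbf{q}_l^T \mathbf{R}_v \mathbf{q}_l}{L\, J_v(h_{:})}. \tag{5.31}$$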

Again, the fullband NMSE with the KLE depends explicitly on the fullband speech-distortion index and the fullband noise-reduction factor. It is straightforward to see that minimizing the subband MSE for each l is equivalent to minimizing the fullband MSE.
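As a quick numerical check of the subband relations above, the sketch below evaluates the MSE (5.22), its decomposition (5.23), and the NMSE identity (5.25) for a single subband. The values of `phi_x`, `phi_v`, and `h` are arbitrary assumptions, the helper names are hypothetical, and the code relies on $\lambda_l = \phi_{c_x,l} + \phi_{c_v,l}$.

```python
# Sketch: subband MSE and NMSE for Model 1 with illustrative numbers.

def subband_mse(h, phi_x, phi_v):
    """J(h_l) of (5.22), with lambda_l = phi_x + phi_v and phi_{cx cy} = phi_x."""
    lam = phi_x + phi_v
    return h * h * lam - 2.0 * h * phi_x + phi_x

def subband_nmse(h, phi_x, phi_v):
    """Right-hand side of (5.25): iSNR_l * v_sd(h_l) + 1 / xi_nr(h_l)."""
    i_snr = phi_x / phi_v
    v_sd = (h - 1.0) ** 2   # (5.10)
    inv_xi_nr = h * h       # 1 / xi_nr(h_l), from (5.7)
    return i_snr * v_sd + inv_xi_nr

phi_x, phi_v, h = 2.0, 0.5, 0.7
j = subband_mse(h, phi_x, phi_v)
j_x = (h - 1.0) ** 2 * phi_x             # speech-distortion part J_x(h_l)
j_v = h * h * phi_v                      # residual-noise part J_v(h_l)
j_one = subband_mse(1.0, phi_x, phi_v)   # J(1) = phi_v, see (5.24)
nmse = subband_nmse(h, phi_x, phi_v)

print("J =", j, "J_x + J_v =", j_x + j_v)
print("J / J(1) =", j / j_one, "NMSE via (5.25) =", nmse)
```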

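5.3 WIENER FILTER

By minimizing $J(h_l)$ [eq. (5.22)] with respect to $h_l$, we easily find the Wiener gain:

$$h_{\mathrm{W},l} = \frac{E\left[c_{x,l}^2(m)\right]}{E\left[c_{y,l}^2(m)\right]} = 1 - \frac{E\left[c_{v,l}^2(m)\right]}{E\left[c_{y,l}^2(m)\right]} = \frac{\phi_{c_x,l}}{\phi_{c_x,l} + \phi_{c_v,l}} = \frac{\mathrm{iSNR}_l}{1 + \mathrm{iSNR}_l}. \tag{5.32}$$

This gain is the equivalent form of the frequency-domain Wiener gain [7]. Clearly, $0 \le h_{\mathrm{W},l} \le 1$, $\forall l$. We deduce the different subband performance measures:

$$\tilde{J}(h_{\mathrm{W},l}) = \frac{\mathrm{iSNR}_l}{1 + \mathrm{iSNR}_l} \le 1, \tag{5.33}$$

$$\xi_{\mathrm{nr}}(h_{\mathrm{W},l}) = \xi_{\mathrm{sr}}(h_{\mathrm{W},l}) = \left(1 + \frac{1}{\mathrm{iSNR}_l}\right)^2 \ge 1, \tag{5.34}$$

$$\upsilon_{\mathrm{sd}}(h_{\mathrm{W},l}) = \frac{1}{(1 + \mathrm{iSNR}_l)^2} \le 1. \tag{5.35}$$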

The fullband output SNR is

$$\mathrm{oSNR}(h_{\mathrm{W},:}) = \frac{\sum_{l=1}^{L} \phi_{c_x,l} \left(\dfrac{\mathrm{iSNR}_l}{1 + \mathrm{iSNR}_l}\right)^2}{\sum_{l=1}^{L} \phi_{c_v,l} \left(\dfrac{\mathrm{iSNR}_l}{1 + \mathrm{iSNR}_l}\right)^2}. \tag{5.36}$$

Property 5.1 With the optimal KLE-domain Wiener gain given in (5.32), the fullband output SNR is always greater than or equal to the input SNR, i.e., $\mathrm{oSNR}(h_{\mathrm{W},:}) \ge \mathrm{iSNR}$.

Proof. We can use exactly the same techniques as the ones exposed in [7] to show this property. □
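The behavior of the Wiener gain can be illustrated numerically. The sketch below, which assumes arbitrary subband variances `phi_x` and `phi_v` and uses hypothetical helper names, computes the gains (5.32) and the fullband output SNR (5.36), and compares the subband MSE at the Wiener gain with the MSE at slightly perturbed gains.

```python
# Sketch: Model 1 Wiener gains and their effect, with illustrative numbers.

def wiener_gain(phi_x, phi_v):
    """Subband Wiener gain (5.32): phi_x / (phi_x + phi_v)."""
    return phi_x / (phi_x + phi_v)

def subband_mse(h, phi_x, phi_v):
    """J(h_l) = J_x(h_l) + J_v(h_l) as in (5.23)."""
    return (h - 1.0) ** 2 * phi_x + h * h * phi_v

def fullband_osnr(h, phi_x, phi_v):
    """Fullband output SNR (5.5) for a list of subband gains."""
    num = sum(g * g * px for g, px in zip(h, phi_x))
    den = sum(g * g * pv for g, pv in zip(h, phi_v))
    return num / den

phi_x = [4.0, 1.0, 0.25]
phi_v = [1.0, 1.0, 1.0]

h_w = [wiener_gain(px, pv) for px, pv in zip(phi_x, phi_v)]
i_snr = sum(phi_x) / sum(phi_v)
o_snr = fullband_osnr(h_w, phi_x, phi_v)

print("Wiener gains:", h_w)
print("iSNR =", i_snr, "oSNR(h_W) =", o_snr)
for g, px, pv in zip(h_w, phi_x, phi_v):
    # The MSE at the Wiener gain should not exceed the MSE at nearby gains.
    print(subband_mse(g, px, pv), subband_mse(g - 0.05, px, pv), subband_mse(g + 0.05, px, pv))
```

In this example the subbands with low input SNR receive small gains, which is what raises the fullband output SNR above the input SNR.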

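Property 5.2 We have

$$\frac{\mathrm{iSNR}}{1 + \mathrm{oSNR}(h_{\mathrm{W},:})} \le \tilde{J}(h_{\mathrm{W},:}) \le \frac{\mathrm{iSNR}}{1 + \mathrm{iSNR}}, \tag{5.37}$$

$$\frac{\left[1 + \mathrm{oSNR}(h_{\mathrm{W},:})\right]^2}{\mathrm{iSNR} \cdot \mathrm{oSNR}(h_{\mathrm{W},:})} \le \xi_{\mathrm{nr}}(h_{\mathrm{W},:}) \le \frac{(1 + \mathrm{iSNR}) \left[1 + \mathrm{oSNR}(h_{\mathrm{W},:})\right]}{\mathrm{iSNR}^2}, \tag{5.38}$$

$$\frac{1}{\left[1 + \mathrm{oSNR}(h_{\mathrm{W},:})\right]^2} \le \upsilon_{\mathrm{sd}}(h_{\mathrm{W},:}) \le \frac{1 + \mathrm{oSNR}(h_{\mathrm{W},:}) - \mathrm{iSNR}}{(1 + \mathrm{iSNR}) \left[1 + \mathrm{oSNR}(h_{\mathrm{W},:})\right]}. \tag{5.39}$$

Proof. We can use exactly the same techniques as the ones exposed in [7] to show these different inequalities. □

It is of great interest to understand how the time-domain Wiener filter (see Chapter 3)

$$\mathbf{H}_{\mathrm{t,W}} = \mathbf{Q} \left[\mathbf{I} - \mathbf{Q}^T \mathbf{R}_v \mathbf{Q} \boldsymbol{\Lambda}^{-1}\right] \mathbf{Q}^T \tag{5.40}$$

is related to the KLE-domain Wiener gain given in (5.32). Substituting the KLE-domain Wiener gain into (4.6), we see that the estimator of the vector $\mathbf{x}(m)$ can be written as

$$\mathbf{z}_{1,\mathrm{W}}(m) = \sum_{l=1}^{L} h_{\mathrm{W},l}\, c_{y,l}(m)\, \mathbf{q}_l = \left[\sum_{l=1}^{L} h_{\mathrm{W},l}\, \mathbf{q}_l \mathbf{q}_l^T\right] \mathbf{y}(m) = \mathbf{H}_{\mathrm{TD,W}}\, \mathbf{y}(m). \tag{5.41}$$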

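Therefore, the time-domain filtering matrix

$$\mathbf{H}_{\mathrm{TD,W}} = \sum_{l=1}^{L} h_{\mathrm{W},l}\, \mathbf{q}_l \mathbf{q}_l^T \tag{5.42}$$

is strictly equivalent to the KLE-domain gains $h_{\mathrm{W},l}$, $l = 1, 2, \ldots, L$. Substituting (5.32) into (5.42), we easily find that

$$\mathbf{H}_{\mathrm{TD,W}} = \mathbf{Q} \left[\mathbf{I} - \mathrm{diag}\left(\mathbf{Q}^T \mathbf{R}_v \mathbf{Q}\right) \boldsymbol{\Lambda}^{-1}\right] \mathbf{Q}^T. \tag{5.43}$$

Clearly, the two filters $\mathbf{H}_{\mathrm{t,W}}$ and $\mathbf{H}_{\mathrm{TD,W}}$ may be very close to each other. For example, if the noise is white, then $\mathbf{H}_{\mathrm{t,W}} = \mathbf{H}_{\mathrm{TD,W}}$. Also, the orthogonal matrix $\mathbf{Q}$ tends to diagonalize the Toeplitz matrix $\mathbf{R}_v$ for a large $L$. In this case, $\mathbf{Q}^T \mathbf{R}_v \mathbf{Q} \approx \mathrm{diag}\left(\mathbf{Q}^T \mathbf{R}_v \mathbf{Q}\right)$, and as a result, $\mathbf{H}_{\mathrm{t,W}} \approx \mathbf{H}_{\mathrm{TD,W}}$.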
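The substitution of (5.32) into (5.42) can be spelled out in a few lines. The sketch below uses only the relations $\lambda_l = \phi_{c_x,l} + \phi_{c_v,l}$ and $\phi_{c_v,l} = \mathbf{q}_l^T \mathbf{R}_v \mathbf{q}_l$ from the text; the white-noise case assumes $\mathbf{R}_v = \sigma_v^2 \mathbf{I}$.

```latex
\begin{align*}
h_{\mathrm{W},l} &= \frac{\phi_{c_x,l}}{\phi_{c_x,l} + \phi_{c_v,l}}
  = 1 - \frac{\phi_{c_v,l}}{\lambda_l}
  = 1 - \frac{\mathbf{q}_l^T \mathbf{R}_v \mathbf{q}_l}{\lambda_l}, \\
\mathbf{H}_{\mathrm{TD,W}} &= \sum_{l=1}^{L} h_{\mathrm{W},l}\, \mathbf{q}_l \mathbf{q}_l^T
  = \mathbf{Q}\, \mathrm{diag}\left(h_{\mathrm{W},1}, \ldots, h_{\mathrm{W},L}\right) \mathbf{Q}^T
  = \mathbf{Q} \left[\mathbf{I} - \mathrm{diag}\left(\mathbf{Q}^T \mathbf{R}_v \mathbf{Q}\right)
    \boldsymbol{\Lambda}^{-1}\right] \mathbf{Q}^T, \\
\mathbf{R}_v = \sigma_v^2 \mathbf{I}
  &\;\Longrightarrow\; \mathbf{Q}^T \mathbf{R}_v \mathbf{Q} = \sigma_v^2 \mathbf{I}
  = \mathrm{diag}\left(\mathbf{Q}^T \mathbf{R}_v \mathbf{Q}\right)
  \;\Longrightarrow\; \mathbf{H}_{\mathrm{t,W}} = \mathbf{H}_{\mathrm{TD,W}}.
\end{align*}
```

The only difference between (5.40) and (5.43) is therefore the $\mathrm{diag}(\cdot)$ operator, which discards the off-diagonal elements of $\mathbf{Q}^T \mathbf{R}_v \mathbf{Q}$.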

5.4 TRADEOFF FILTER

The tradeoff gain is obtained by minimizing the speech distortion with the constraint that the residual noise level is equal to a value smaller than the level of the original noise. This is equivalent to solving the problem

$$\min_{h_l} J_x(h_l) \quad \text{subject to} \quad J_v(h_l) = \beta \phi_{c_v,l}, \tag{5.44}$$

where

$$J_x(h_l) = (1 - h_l)^2 \phi_{c_x,l}, \tag{5.45}$$

$$J_v(h_l) = h_l^2 \phi_{c_v,l}, \tag{5.46}$$

and $0 < \beta < 1$ in order to have some noise reduction in the subband $l$. If we use a Lagrange multiplier, $\mu \ge 0$, to adjoin the constraint to the cost function, we get the tradeoff gain:

$$h_{\mathrm{T},\mu,l} = \frac{\phi_{c_x,l}}{\phi_{c_x,l} + \mu \phi_{c_v,l}} = \frac{\lambda_l - \phi_{c_v,l}}{\lambda_l + (\mu - 1) \phi_{c_v,l}} = \frac{\mathrm{iSNR}_l}{\mu + \mathrm{iSNR}_l}. \tag{5.47}$$

This gain can be seen as a KLE-domain Wiener gain with adjustable input noise level $\mu \phi_{c_v,l}$. The particular cases of $\mu = 1$ and $\mu = 0$ correspond to the Wiener and identity gains, respectively. The fullband output SNR is

$$\mathrm{oSNR}(h_{\mathrm{T},\mu,:}) = \frac{\sum_{l=1}^{L} \phi_{c_x,l} \left(\dfrac{\mathrm{iSNR}_l}{\mu + \mathrm{iSNR}_l}\right)^2}{\sum_{l=1}^{L} \phi_{c_v,l} \left(\dfrac{\mathrm{iSNR}_l}{\mu + \mathrm{iSNR}_l}\right)^2}. \tag{5.48}$$

Property 5.3 With the tradeoff gain given in (5.47), the fullband output SNR is always greater than or equal to the input SNR, i.e., $\mathrm{oSNR}(h_{\mathrm{T},\mu,:}) \ge \mathrm{iSNR}$, $\forall \mu \ge 0$.

Proof. We can use exactly the same techniques as the ones exposed in [7] to show this property. □

From (5.48), we deduce that

$$\lim_{\mu \to \infty} \mathrm{oSNR}(h_{\mathrm{T},\mu,:}) = \frac{\sum_{l=1}^{L} \phi_{c_x,l}\, \mathrm{iSNR}_l^2}{\sum_{l=1}^{L} \phi_{c_v,l}\, \mathrm{iSNR}_l^2} \le \sum_{l=1}^{L} \mathrm{iSNR}_l. \tag{5.49}$$

This shows how the fullband output SNR of the tradeoff gain is upper bounded. The fullband speech-distortion index is

$$\upsilon_{\mathrm{sd}}(h_{\mathrm{T},\mu,:}) = \frac{\sum_{l=1}^{L} \dfrac{\phi_{c_x,l}\, \mu^2}{(\mu + \mathrm{iSNR}_l)^2}}{\sum_{l=1}^{L} \phi_{c_x,l}}. \tag{5.50}$$

Property 5.4 The fullband speech-distortion index of the tradeoff gain is an increasing function of the parameter $\mu$.

Proof. It is straightforward to verify that

$$\frac{d \upsilon_{\mathrm{sd}}(h_{\mathrm{T},\mu,:})}{d\mu} \ge 0, \tag{5.51}$$

which ends the proof. □

It is clear that

$$0 \le \upsilon_{\mathrm{sd}}(h_{\mathrm{T},\mu,:}) \le 1, \quad \forall \mu \ge 0. \tag{5.52}$$

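Therefore, as $\mu$ increases, the fullband output SNR increases at the price of more distortion to the desired signal. As we already did for the Wiener gain, we can express the KLE-domain tradeoff gain in the time domain. Indeed, substituting (5.47) into (4.6), we find that

$$\mathbf{H}_{\mathrm{TD,T},\mu} = \mathbf{Q} \left[\boldsymbol{\Lambda} - \mathrm{diag}\left(\mathbf{Q}^T \mathbf{R}_v \mathbf{Q}\right)\right] \left[\boldsymbol{\Lambda} + (\mu - 1) \cdot \mathrm{diag}\left(\mathbf{Q}^T \mathbf{R}_v \mathbf{Q}\right)\right]^{-1} \mathbf{Q}^T, \tag{5.53}$$

which has a similar form to the filtering matrix proposed in [46]. This matrix can be compared to the time-domain tradeoff filtering matrix (see Chapter 3)

$$\mathbf{H}_{\mathrm{t,T},\mu} = \mathbf{Q} \left[\boldsymbol{\Lambda} - \mathbf{Q}^T \mathbf{R}_v \mathbf{Q}\right] \left[\boldsymbol{\Lambda} + (\mu - 1) \cdot \mathbf{Q}^T \mathbf{R}_v \mathbf{Q}\right]^{-1} \mathbf{Q}^T. \tag{5.54}$$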

We see that if the noise is white, the two matrices are the same.
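The role of $\mu$ in the tradeoff gain can be illustrated with a small sweep. This is a sketch under assumed subband variances `phi_x` and `phi_v` (illustrative values, hypothetical function names): it evaluates (5.47), the fullband output SNR (5.48), the fullband speech-distortion index (5.50), and the limit (5.49).

```python
# Sketch: sweep of the tradeoff parameter mu for Model 1 (illustrative numbers).

def tradeoff_gain(i_snr_l, mu):
    """Tradeoff gain (5.47) for one subband."""
    return i_snr_l / (mu + i_snr_l)

def fullband_osnr_and_vsd(phi_x, phi_v, mu):
    """Return the fullband output SNR (5.48) and distortion index (5.50)."""
    gains = [tradeoff_gain(px / pv, mu) for px, pv in zip(phi_x, phi_v)]
    out_x = sum(g * g * px for g, px in zip(gains, phi_x))
    out_v = sum(g * g * pv for g, pv in zip(gains, phi_v))
    v_sd = sum((g - 1.0) ** 2 * px for g, px in zip(gains, phi_x)) / sum(phi_x)
    return out_x / out_v, v_sd

phi_x = [4.0, 1.0, 0.25]
phi_v = [1.0, 1.0, 1.0]
mus = [0.0, 0.5, 1.0, 2.0, 5.0, 20.0]

results = [fullband_osnr_and_vsd(phi_x, phi_v, mu) for mu in mus]
# Limit of the fullband output SNR for mu -> infinity, see (5.49).
limit = (sum(px * (px / pv) ** 2 for px, pv in zip(phi_x, phi_v))
         / sum(pv * (px / pv) ** 2 for px, pv in zip(phi_x, phi_v)))

for mu, (o_snr, v_sd) in zip(mus, results):
    print("mu =", mu, "oSNR =", o_snr, "v_sd =", v_sd)
print("limit (5.49) =", limit)
```

For $\mu = 0$ all gains equal 1, so the output SNR equals the input SNR and the distortion index is zero; larger values of $\mu$ trade distortion for output SNR.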

5.5 MAXIMUM SNR FILTER

Let us define the $L \times 1$ vector

$$\mathbf{h} = \begin{bmatrix} h_1 & h_2 & \cdots & h_L \end{bmatrix}^T, \tag{5.55}$$

which contains all the subband gains. The fullband output SNR can be rewritten as

$$\mathrm{oSNR}(h_{:}) = \mathrm{oSNR}(\mathbf{h}) = \frac{\mathbf{h}^T \mathbf{D}_{\phi_{c_x}} \mathbf{h}}{\mathbf{h}^T \mathbf{D}_{\phi_{c_v}} \mathbf{h}}, \tag{5.56}$$

where

$$\mathbf{D}_{\phi_{c_x}} = \mathrm{diag}\left(\phi_{c_x,1}, \phi_{c_x,2}, \ldots, \phi_{c_x,L}\right), \tag{5.57}$$

$$\mathbf{D}_{\phi_{c_v}} = \mathrm{diag}\left(\phi_{c_v,1}, \phi_{c_v,2}, \ldots, \phi_{c_v,L}\right), \tag{5.58}$$

are two diagonal matrices. We assume here that $\phi_{c_v,l} \ne 0$, $\forall l$. In the maximum SNR approach, we find the filter, $\mathbf{h}$, that maximizes the fullband output SNR defined in (5.56). The solution to this problem, which we denote by $\mathbf{h}_{\max}$, is simply the eigenvector corresponding to the maximum eigenvalue of the matrix $\mathbf{D}_{\phi_{c_v}}^{-1} \mathbf{D}_{\phi_{c_x}}$. Since this matrix is diagonal, its maximum eigenvalue is its largest diagonal element, i.e.,

$$\max_l \frac{\phi_{c_x,l}}{\phi_{c_v,l}} = \max_l \mathrm{iSNR}_l. \tag{5.59}$$

Assume that this maximum is the $l_0$th diagonal element of the matrix $\mathbf{D}_{\phi_{c_v}}^{-1} \mathbf{D}_{\phi_{c_x}}$. In this case, the $l_0$th component of $\mathbf{h}_{\max}$ is 1 and all its other components are 0. As a result,

$$\mathrm{oSNR}(\mathbf{h}_{\max}) = \max_l \mathrm{iSNR}_l = \mathrm{iSNR}_{l_0}. \tag{5.60}$$

We also deduce that

$$\mathrm{oSNR}(h_{:}) \le \max_l \mathrm{iSNR}_l, \quad \forall h_{:}. \tag{5.61}$$

This means that with the Wiener, tradeoff, or any other gain, the fullband output SNR cannot exceed the maximum subband input SNR, which is a very interesting result on its own. It is easy to derive the fullband speech-distortion index:

$$\upsilon_{\mathrm{sd}}(\mathbf{h}_{\max}) = 1 - \frac{\phi_{c_x,l_0}}{\sum_{l=1}^{L} \phi_{c_x,l}}, \tag{5.62}$$

which can be very close to 1, implying very large distortions of the desired signal. The equivalent time-domain version of $\mathbf{h}_{\max}$ is simply

$$\mathbf{H}_{\mathrm{TD},\max} = \mathbf{q}_{l_0} \mathbf{q}_{l_0}^T. \tag{5.63}$$

Needless to say, this maximum SNR filter is never used in practice, since all subband signals but one are suppressed. It is nevertheless interesting from a theoretical point of view.
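The bound (5.61) and the maximum SNR solution can be checked numerically. In the sketch below, the subband variances are assumed values and the random gain vectors are only an illustrative probe (with a fixed seed), not a proof; the function name `fullband_osnr` is hypothetical.

```python
# Sketch: maximum SNR gain vector for Model 1 and a random check of (5.61).
import random

def fullband_osnr(h, phi_x, phi_v):
    """Fullband output SNR (5.56) for a vector of subband gains."""
    num = sum(g * g * px for g, px in zip(h, phi_x))
    den = sum(g * g * pv for g, pv in zip(h, phi_v))
    return num / den

phi_x = [4.0, 1.0, 0.25]
phi_v = [1.0, 2.0, 0.5]

sub_isnr = [px / pv for px, pv in zip(phi_x, phi_v)]
l0 = max(range(len(sub_isnr)), key=sub_isnr.__getitem__)        # best subband
h_max = [1.0 if l == l0 else 0.0 for l in range(len(phi_x))]     # see (5.60)

rng = random.Random(0)
trials = [[rng.uniform(0.01, 1.0) for _ in phi_x] for _ in range(1000)]
best_random = max(fullband_osnr(h, phi_x, phi_v) for h in trials)
v_sd_max = 1.0 - phi_x[l0] / sum(phi_x)                          # (5.62)

print("oSNR(h_max) =", fullband_osnr(h_max, phi_x, phi_v))
print("largest oSNR over random gains =", best_random)
print("v_sd(h_max) =", v_sd_max)
```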

CHAPTER 6

Optimal Filters in the KLE Domain with Model 2

In Model 2, the interframe correlation is taken into account. In this chapter, we show how to exploit this feature in order to develop noise reduction algorithms that are different from the ones developed with Model 1. To simplify the presentation, we drop the subscript “2” from the FIR filter of length $M$ (see Chapter 4, Section 4.2), so that now $\mathbf{h}_{2,l}$ is written as $\mathbf{h}_l$.

6.1 PERFORMANCE MEASURES

From (4.16), we can deduce the most important performance measures. The subband output SNR is defined as¹

$$\mathrm{oSNR}(\mathbf{h}_l) = \frac{\mathbf{h}_l^T \boldsymbol{\Phi}_{c_{x_\mathrm{d}},l} \mathbf{h}_l}{\mathbf{h}_l^T \boldsymbol{\Phi}_{\mathrm{in},l} \mathbf{h}_l} = \frac{\phi_{c_x,l} \left(\mathbf{h}_l^T \boldsymbol{\gamma}_{c_x,l}\right)^2}{\mathbf{h}_l^T \boldsymbol{\Phi}_{\mathrm{in},l} \mathbf{h}_l}, \quad l = 1, 2, \ldots, L, \tag{6.1}$$

where

$$\boldsymbol{\Phi}_{\mathrm{in},l} = \boldsymbol{\Phi}_{c'_x,l} + \boldsymbol{\Phi}_{c_v,l}, \quad l = 1, 2, \ldots, L \tag{6.2}$$

is the interference-plus-noise correlation matrix. With Model 2, the subband output SNR is not equal, in general, to the subband input SNR, contrary to Model 1. But for the particular filter $\mathbf{h}_l = \mathbf{i}_{M,1}$, where $\mathbf{i}_{M,1}$ is the first column of the identity matrix $\mathbf{I}_M$ of size $M \times M$, we have

$$\mathrm{oSNR}(\mathbf{i}_{M,1}) = \mathrm{iSNR}_l, \quad l = 1, 2, \ldots, L. \tag{6.3}$$

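For any two vectors $\mathbf{h}_l$ and $\boldsymbol{\gamma}_{c_x,l}$ and a positive definite matrix $\boldsymbol{\Phi}_{\mathrm{in},l}$, we have

$$\left(\mathbf{h}_l^T \boldsymbol{\gamma}_{c_x,l}\right)^2 \le \left(\mathbf{h}_l^T \boldsymbol{\Phi}_{\mathrm{in},l} \mathbf{h}_l\right) \left(\boldsymbol{\gamma}_{c_x,l}^T \boldsymbol{\Phi}_{\mathrm{in},l}^{-1} \boldsymbol{\gamma}_{c_x,l}\right). \tag{6.4}$$

Using the previous inequality in (6.1), we deduce an upper bound for the subband output SNR:

$$\mathrm{oSNR}(\mathbf{h}_l) \le \phi_{c_x,l}\, \boldsymbol{\gamma}_{c_x,l}^T \boldsymbol{\Phi}_{\mathrm{in},l}^{-1} \boldsymbol{\gamma}_{c_x,l}, \quad l = 1, 2, \ldots, L. \tag{6.5}$$

¹ In this study, we consider the interference as part of the noise in the definitions of the performance measures.

We define the fullband output SNR as

$$\mathrm{oSNR}(\mathbf{h}_{:}) = \frac{\sum_{l=1}^{L} \phi_{c_x,l} \left(\mathbf{h}_l^T \boldsymbol{\gamma}_{c_x,l}\right)^2}{\sum_{l=1}^{L} \mathbf{h}_l^T \boldsymbol{\Phi}_{\mathrm{in},l} \mathbf{h}_l}. \tag{6.6}$$

We always have [7]

$$\mathrm{oSNR}(\mathbf{h}_{:}) \le \sum_{l=1}^{L} \mathrm{oSNR}(\mathbf{h}_l) \le \sum_{l=1}^{L} \phi_{c_x,l}\, \boldsymbol{\gamma}_{c_x,l}^T \boldsymbol{\Phi}_{\mathrm{in},l}^{-1} \boldsymbol{\gamma}_{c_x,l}. \tag{6.7}$$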

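The previous inequality shows that the fullband output SNR is upper bounded no matter how the filters $\mathbf{h}_l$, $l = 1, 2, \ldots, L$ are chosen. The subband and fullband noise-reduction factors are

$$\xi_{\mathrm{nr}}(\mathbf{h}_l) = \frac{\phi_{c_v,l}}{\mathbf{h}_l^T \boldsymbol{\Phi}_{\mathrm{in},l} \mathbf{h}_l}, \quad l = 1, 2, \ldots, L, \tag{6.8}$$

$$\xi_{\mathrm{nr}}(\mathbf{h}_{:}) = \frac{\sum_{l=1}^{L} \phi_{c_v,l}}{\sum_{l=1}^{L} \mathbf{h}_l^T \boldsymbol{\Phi}_{\mathrm{in},l} \mathbf{h}_l}. \tag{6.9}$$

These factors should be lower bounded by 1 for optimal filters. We also have

$$\xi_{\mathrm{nr}}(\mathbf{h}_{:}) \le \sum_{l=1}^{L} \xi_{\mathrm{nr}}(\mathbf{h}_l). \tag{6.10}$$

From the inequality in (6.4), we easily find that

$$\xi_{\mathrm{nr}}(\mathbf{h}_l) \le \frac{\phi_{c_v,l}\, \boldsymbol{\gamma}_{c_x,l}^T \boldsymbol{\Phi}_{\mathrm{in},l}^{-1} \boldsymbol{\gamma}_{c_x,l}}{\left(\mathbf{h}_l^T \boldsymbol{\gamma}_{c_x,l}\right)^2}, \quad l = 1, 2, \ldots, L, \tag{6.11}$$

$$\xi_{\mathrm{nr}}(\mathbf{h}_{:}) \le \sum_{l=1}^{L} \frac{\phi_{c_v,l}\, \boldsymbol{\gamma}_{c_x,l}^T \boldsymbol{\Phi}_{\mathrm{in},l}^{-1} \boldsymbol{\gamma}_{c_x,l}}{\left(\mathbf{h}_l^T \boldsymbol{\gamma}_{c_x,l}\right)^2}. \tag{6.12}$$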

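To quantify the speech distortion, we give the subband speech-distortion index

$$\upsilon_{\mathrm{sd}}(\mathbf{h}_l) = \frac{E\left\{\left[c_{x,l}(m)\, \mathbf{h}_l^T \boldsymbol{\gamma}_{c_x,l} - c_{x,l}(m)\right]^2\right\}}{\phi_{c_x,l}} = \left(\mathbf{h}_l^T \boldsymbol{\gamma}_{c_x,l} - 1\right)^2, \quad l = 1, 2, \ldots, L \tag{6.13}$$

and the fullband speech-distortion index

$$\upsilon_{\mathrm{sd}}(\mathbf{h}_{:}) = \frac{\sum_{l=1}^{L} \phi_{c_x,l} \left(\mathbf{h}_l^T \boldsymbol{\gamma}_{c_x,l} - 1\right)^2}{\sum_{l=1}^{L} \phi_{c_x,l}} = \frac{\sum_{l=1}^{L} \upsilon_{\mathrm{sd}}(\mathbf{h}_l)\, \phi_{c_x,l}}{\sum_{l=1}^{L} \phi_{c_x,l}}. \tag{6.14}$$

The speech-distortion index is usually upper bounded by 1. We have

$$\upsilon_{\mathrm{sd}}(\mathbf{h}_{:}) \le \sum_{l=1}^{L} \upsilon_{\mathrm{sd}}(\mathbf{h}_l). \tag{6.15}$$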

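We can also quantify signal distortion via the subband and fullband speech-reduction factors, which are defined as

$$\xi_{\mathrm{sr}}(\mathbf{h}_l) = \frac{\phi_{c_x,l}}{\phi_{c_x,l} \left(\mathbf{h}_l^T \boldsymbol{\gamma}_{c_x,l}\right)^2} = \frac{1}{\left(\mathbf{h}_l^T \boldsymbol{\gamma}_{c_x,l}\right)^2}, \quad l = 1, 2, \ldots, L, \tag{6.16}$$

$$\xi_{\mathrm{sr}}(\mathbf{h}_{:}) = \frac{\sum_{l=1}^{L} \phi_{c_x,l}}{\sum_{l=1}^{L} \phi_{c_x,l} \left(\mathbf{h}_l^T \boldsymbol{\gamma}_{c_x,l}\right)^2} = \frac{\sum_{l=1}^{L} \phi_{c_x,l}}{\sum_{l=1}^{L} \xi_{\mathrm{sr}}^{-1}(\mathbf{h}_l)\, \phi_{c_x,l}}. \tag{6.17}$$

The speech-reduction factor is supposed to have a lower bound of 1 for optimal filters. We also have

$$\xi_{\mathrm{sr}}(\mathbf{h}_{:}) \le \sum_{l=1}^{L} \xi_{\mathrm{sr}}(\mathbf{h}_l). \tag{6.18}$$

A key observation from (6.13) or (6.16) is that the design of a noise reduction algorithm that does not distort the desired signal requires the constraint

$$\mathbf{h}_l^T \boldsymbol{\gamma}_{c_x,l} = 1, \quad \forall l. \tag{6.19}$$

It can easily be checked that

$$\frac{\mathrm{oSNR}(\mathbf{h}_l)}{\mathrm{iSNR}_l} = \frac{\xi_{\mathrm{nr}}(\mathbf{h}_l)}{\xi_{\mathrm{sr}}(\mathbf{h}_l)}, \quad l = 1, 2, \ldots, L, \tag{6.20}$$

$$\frac{\mathrm{oSNR}(\mathbf{h}_{:})}{\mathrm{iSNR}} = \frac{\xi_{\mathrm{nr}}(\mathbf{h}_{:})}{\xi_{\mathrm{sr}}(\mathbf{h}_{:})}. \tag{6.21}$$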

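6.2 MAXIMUM SNR FILTER

The maximum SNR filter, $\mathbf{h}_{\max,l}$, is obtained by maximizing the subband output SNR as defined in (6.1). Therefore, $\mathbf{h}_{\max,l}$ is the eigenvector corresponding to the maximum eigenvalue of the matrix $\boldsymbol{\Phi}_{\mathrm{in},l}^{-1} \boldsymbol{\Phi}_{c_{x_\mathrm{d}},l}$. Let us denote this eigenvalue by $\lambda_{\max,l}$. Since the rank of the matrix $\boldsymbol{\Phi}_{c_{x_\mathrm{d}},l}$ is equal to 1, we have

$$\lambda_{\max,l} = \mathrm{tr}\left(\boldsymbol{\Phi}_{\mathrm{in},l}^{-1} \boldsymbol{\Phi}_{c_{x_\mathrm{d}},l}\right) = \phi_{c_x,l}\, \boldsymbol{\gamma}_{c_x,l}^T \boldsymbol{\Phi}_{\mathrm{in},l}^{-1} \boldsymbol{\gamma}_{c_x,l}, \quad l = 1, 2, \ldots, L. \tag{6.22}$$

As a result,

$$\mathrm{oSNR}(\mathbf{h}_{\max,l}) = \phi_{c_x,l}\, \boldsymbol{\gamma}_{c_x,l}^T \boldsymbol{\Phi}_{\mathrm{in},l}^{-1} \boldsymbol{\gamma}_{c_x,l}, \quad l = 1, 2, \ldots, L, \tag{6.23}$$

which corresponds to the maximum possible output SNR according to the inequality in (6.5). Obviously, we also have

$$\mathbf{h}_{\max,l} = \alpha_l \boldsymbol{\Phi}_{\mathrm{in},l}^{-1} \boldsymbol{\gamma}_{c_x,l}, \quad l = 1, 2, \ldots, L, \tag{6.24}$$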

where $\alpha_l$ is an arbitrary nonzero scaling factor. While this factor has no effect on the subband output SNR, it does affect the fullband output SNR and the speech distortion (subband and fullband). In fact, all filters derived in the rest of this chapter are equivalent up to this scaling factor; they differ only in how the scaling factor is chosen, depending on what we optimize.
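The maximum SNR filter can be illustrated with a small example for $M = 2$. This is only a sketch: the values of `phi_x`, `gamma`, and `phi_in` are assumptions chosen so that `phi_in` is positive definite, the helper names are hypothetical, and the 2-by-2 system is solved with Cramer's rule to keep the code free of external libraries.

```python
# Sketch: Model 2 maximum SNR filter in one subband, M = 2, illustrative numbers.

def solve2(a, b):
    """Solve the 2x2 linear system a x = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def quad(h, a):
    """Quadratic form h^T a h for a 2x2 matrix a."""
    return sum(h[i] * a[i][j] * h[j] for i in range(2) for j in range(2))

def subband_osnr(h, phi_x, gamma, phi_in):
    """Subband output SNR (6.1)."""
    return phi_x * dot(h, gamma) ** 2 / quad(h, phi_in)

phi_x = 2.0                          # variance of the desired coefficient
gamma = [1.0, 0.6]                   # interframe correlation vector, gamma[0] = 1
phi_in = [[1.0, 0.3], [0.3, 1.5]]    # interference-plus-noise matrix (assumed)

w = solve2(phi_in, gamma)            # Phi_in^{-1} gamma
lam_max = phi_x * dot(gamma, w)      # (6.22)
h_max = [0.7 * x for x in w]         # (6.24) with the arbitrary choice alpha_l = 0.7
i_m1 = [1.0, 0.0]                    # first column of the identity matrix

print("lambda_max =", lam_max)
print("oSNR(h_max) =", subband_osnr(h_max, phi_x, gamma, phi_in))
print("oSNR(i_M1) =", subband_osnr(i_m1, phi_x, gamma, phi_in))
```

Changing the value 0.7 leaves the subband output SNR unchanged, which is the scaling invariance mentioned in the text.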

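6.3 MSE CRITERION

The error signal between the estimated and desired signals in the subband $l$ is

$$e_l(m) = c_{z_2,l}(m) - c_{x,l}(m) = \mathbf{h}_l^T \mathbf{c}_{y,l}(m) - c_{x,l}(m). \tag{6.25}$$

This error signal can also be written as the sum of two uncorrelated error signals:

$$e_l(m) = e_{x,l}(m) + e_{\mathrm{in},l}(m), \tag{6.26}$$

where

$$e_{x,l}(m) = \mathbf{h}_l^T \mathbf{c}_{x_\mathrm{d},l}(m) - c_{x,l}(m) = \left(\mathbf{h}_l^T \boldsymbol{\gamma}_{c_x,l} - 1\right) c_{x,l}(m) \tag{6.27}$$

is the speech distortion due to the filter and

$$e_{\mathrm{in},l}(m) = \mathbf{h}_l^T \mathbf{c}'_{x,l}(m) + \mathbf{h}_l^T \mathbf{c}_{v,l}(m) \tag{6.28}$$

represents the residual interference-plus-noise. The subband MSE criterion is then

$$J(\mathbf{h}_l) = E\left[e_l^2(m)\right] = \mathbf{h}_l^T \boldsymbol{\Phi}_{c_y,l} \mathbf{h}_l - 2 \mathbf{h}_l^T \boldsymbol{\Phi}_{c_y c_x,l}\, \mathbf{i}_{M,1} + \phi_{c_x,l}, \tag{6.29}$$

where

$$\boldsymbol{\Phi}_{c_y c_x,l} = E\left[\mathbf{c}_{y,l}(m) \mathbf{c}_{x,l}^T(m)\right] = E\left[\mathbf{c}_{x,l}(m) \mathbf{c}_{x,l}^T(m)\right] = \boldsymbol{\Phi}_{c_x,l}$$

is the cross-correlation matrix between the two signal vectors $\mathbf{c}_{y,l}(m)$ and $\mathbf{c}_{x,l}(m)$. We can rewrite the subband MSE as

$$J(\mathbf{h}_l) = J_x(\mathbf{h}_l) + J_{\mathrm{in}}(\mathbf{h}_l),$$

where

$$J_x(\mathbf{h}_l) = E\left[e_{x,l}^2(m)\right] = \phi_{c_x,l} \left(\mathbf{h}_l^T \boldsymbol{\gamma}_{c_x,l} - 1\right)^2 \tag{6.30}$$

and

$$J_{\mathrm{in}}(\mathbf{h}_l) = E\left[e_{\mathrm{in},l}^2(m)\right] = \mathbf{h}_l^T \boldsymbol{\Phi}_{\mathrm{in},l} \mathbf{h}_l. \tag{6.31}$$

For the particular filter $\mathbf{h}_l = \mathbf{i}_{M,1}$, $\forall l$, we get

$$J(\mathbf{i}_{M,1}) = \phi_{c_v,l}. \tag{6.32}$$

Using this particular case of the MSE, we define the subband normalized MSE (NMSE) as

$$\tilde{J}(\mathbf{h}_l) = \frac{J(\mathbf{h}_l)}{J(\mathbf{i}_{M,1})} = \mathrm{iSNR}_l \cdot \upsilon_{\mathrm{sd}}(\mathbf{h}_l) + \frac{1}{\xi_{\mathrm{nr}}(\mathbf{h}_l)}, \tag{6.33}$$

where

$$\upsilon_{\mathrm{sd}}(\mathbf{h}_l) = \frac{J_x(\mathbf{h}_l)}{\phi_{c_x,l}}, \tag{6.34}$$

$$\xi_{\mathrm{nr}}(\mathbf{h}_l) = \frac{\phi_{c_v,l}}{J_{\mathrm{in}}(\mathbf{h}_l)}. \tag{6.35}$$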

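The KLE-domain NMSE depends explicitly on the subband speech-distortion index and the subband noise-reduction factor. We define the fullband MSE and fullband NMSE as

$$J(\mathbf{h}_{:}) = \frac{1}{L} \sum_{l=1}^{L} J(\mathbf{h}_l) = \frac{1}{L} \sum_{l=1}^{L} J_x(\mathbf{h}_l) + \frac{1}{L} \sum_{l=1}^{L} J_{\mathrm{in}}(\mathbf{h}_l) = J_x(\mathbf{h}_{:}) + J_{\mathrm{in}}(\mathbf{h}_{:}) \tag{6.36}$$

and

$$\tilde{J}(\mathbf{h}_{:}) = \frac{L\, J(\mathbf{h}_{:})}{\sum_{l=1}^{L} \phi_{c_v,l}} = \mathrm{iSNR} \cdot \upsilon_{\mathrm{sd}}(\mathbf{h}_{:}) + \frac{1}{\xi_{\mathrm{nr}}(\mathbf{h}_{:})}, \tag{6.37}$$

where

$$\upsilon_{\mathrm{sd}}(\mathbf{h}_{:}) = \frac{L\, J_x(\mathbf{h}_{:})}{\sum_{l=1}^{L} \phi_{c_x,l}}, \tag{6.38}$$

$$\xi_{\mathrm{nr}}(\mathbf{h}_{:}) = \frac{\sum_{l=1}^{L} \phi_{c_v,l}}{L\, J_{\mathrm{in}}(\mathbf{h}_{:})}. \tag{6.39}$$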

The fullband NMSE with the KLE depends also explicitly on the fullband speech-distortion index and the fullband noise-reduction factor. It is straightforward to see that minimizing the subband MSE for each l is equivalent to minimizing the fullband MSE.

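6.4 WIENER FILTER

The Wiener filter is easily derived by taking the gradient of the MSE, $J(\mathbf{h}_l)$, with respect to $\mathbf{h}_l$ and equating the result to zero:

$$\mathbf{h}_{\mathrm{W},l} = \boldsymbol{\Phi}_{c_y,l}^{-1} \boldsymbol{\Phi}_{c_x,l}\, \mathbf{i}_{M,1} = \left(\mathbf{I}_M - \boldsymbol{\Phi}_{c_y,l}^{-1} \boldsymbol{\Phi}_{c_v,l}\right) \mathbf{i}_{M,1}. \tag{6.40}$$

Since

$$\boldsymbol{\Phi}_{c_x,l}\, \mathbf{i}_{M,1} = \phi_{c_x,l}\, \boldsymbol{\gamma}_{c_x,l}, \tag{6.41}$$

we can rewrite (6.40) as

$$\mathbf{h}_{\mathrm{W},l} = \phi_{c_x,l}\, \boldsymbol{\Phi}_{c_y,l}^{-1} \boldsymbol{\gamma}_{c_x,l}. \tag{6.42}$$

It is easy to verify that

$$\boldsymbol{\Phi}_{c_y,l} = \phi_{c_x,l}\, \boldsymbol{\gamma}_{c_x,l} \boldsymbol{\gamma}_{c_x,l}^T + \boldsymbol{\Phi}_{\mathrm{in},l}. \tag{6.43}$$

Determining the inverse of $\boldsymbol{\Phi}_{c_y,l}$ from (6.43) with Woodbury's identity

$$\boldsymbol{\Phi}_{c_y,l}^{-1} = \boldsymbol{\Phi}_{\mathrm{in},l}^{-1} - \frac{\boldsymbol{\Phi}_{\mathrm{in},l}^{-1} \boldsymbol{\gamma}_{c_x,l} \boldsymbol{\gamma}_{c_x,l}^T \boldsymbol{\Phi}_{\mathrm{in},l}^{-1}}{\phi_{c_x,l}^{-1} + \boldsymbol{\gamma}_{c_x,l}^T \boldsymbol{\Phi}_{\mathrm{in},l}^{-1} \boldsymbol{\gamma}_{c_x,l}} \tag{6.44}$$

and substituting the result into (6.42) leads to another interesting formulation of the Wiener filter:

$$\mathbf{h}_{\mathrm{W},l} = \frac{\boldsymbol{\Phi}_{\mathrm{in},l}^{-1} \boldsymbol{\gamma}_{c_x,l}}{\phi_{c_x,l}^{-1} + \boldsymbol{\gamma}_{c_x,l}^T \boldsymbol{\Phi}_{\mathrm{in},l}^{-1} \boldsymbol{\gamma}_{c_x,l}}, \tag{6.45}$$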

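that we can rewrite as

$$\mathbf{h}_{\mathrm{W},l} = \frac{\boldsymbol{\Phi}_{\mathrm{in},l}^{-1} \boldsymbol{\Phi}_{c_y,l} - \mathbf{I}_M}{1 - M + \mathrm{tr}\left(\boldsymbol{\Phi}_{\mathrm{in},l}^{-1} \boldsymbol{\Phi}_{c_y,l}\right)}\, \mathbf{i}_{M,1} = \frac{\boldsymbol{\Phi}_{\mathrm{in},l}^{-1} \boldsymbol{\Phi}_{c_{x_\mathrm{d}},l}}{1 + \lambda_{\max,l}}\, \mathbf{i}_{M,1}. \tag{6.46}$$

We can deduce from (6.45) that the subband output SNR is

$$\mathrm{oSNR}(\mathbf{h}_{\mathrm{W},l}) = \lambda_{\max,l} = \mathrm{tr}\left(\boldsymbol{\Phi}_{\mathrm{in},l}^{-1} \boldsymbol{\Phi}_{c_y,l}\right) - M, \tag{6.47}$$

and the subband speech-distortion index is a clear function of the subband output SNR:

$$\upsilon_{\mathrm{sd}}(\mathbf{h}_{\mathrm{W},l}) = \frac{1}{\left[1 + \mathrm{oSNR}(\mathbf{h}_{\mathrm{W},l})\right]^2}. \tag{6.48}$$

The higher the value of $\mathrm{oSNR}(\mathbf{h}_{\mathrm{W},l})$, the less the desired signal is distorted. Clearly,

$$\mathrm{oSNR}(\mathbf{h}_{\mathrm{W},l}) \ge \mathrm{iSNR}_l, \tag{6.49}$$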

since the Wiener filter maximizes the subband output SNR. Recall that in Model 1, the subband output SNR cannot be improved.

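It is of great interest to observe that the two filters, $\mathbf{h}_{\max,l}$ and $\mathbf{h}_{\mathrm{W},l}$, are equivalent up to a scaling factor. Indeed, taking

$$\alpha_l = \frac{\phi_{c_x,l}}{1 + \lambda_{\max,l}} \tag{6.50}$$

in (6.24) (maximum SNR filter), we find (6.46) (Wiener filter). With the Wiener filter, the subband noise-reduction factor is

$$\xi_{\mathrm{nr}}(\mathbf{h}_{\mathrm{W},l}) = \frac{\left[1 + \mathrm{oSNR}(\mathbf{h}_{\mathrm{W},l})\right]^2}{\mathrm{iSNR}_l \cdot \mathrm{oSNR}(\mathbf{h}_{\mathrm{W},l})} \ge \left[1 + \frac{1}{\mathrm{oSNR}(\mathbf{h}_{\mathrm{W},l})}\right]^2. \tag{6.51}$$

Using (6.48) and (6.51) in (6.33), we find the minimum NMSE:

$$\tilde{J}(\mathbf{h}_{\mathrm{W},l}) = \frac{\mathrm{iSNR}_l}{1 + \mathrm{oSNR}(\mathbf{h}_{\mathrm{W},l})} \le 1. \tag{6.52}$$

The fullband output SNR is

$$\mathrm{oSNR}(\mathbf{h}_{\mathrm{W},:}) = \frac{\sum_{l=1}^{L} \phi_{c_x,l} \dfrac{\mathrm{oSNR}^2(\mathbf{h}_{\mathrm{W},l})}{\left[1 + \mathrm{oSNR}(\mathbf{h}_{\mathrm{W},l})\right]^2}}{\sum_{l=1}^{L} \phi_{c_x,l} \dfrac{\mathrm{oSNR}(\mathbf{h}_{\mathrm{W},l})}{\left[1 + \mathrm{oSNR}(\mathbf{h}_{\mathrm{W},l})\right]^2}}. \tag{6.53}$$

Property 6.1 With the optimal KLE-domain Wiener filter given in (6.40), the fullband output SNR is always greater than or equal to the input SNR, i.e., $\mathrm{oSNR}(\mathbf{h}_{\mathrm{W},:}) \ge \mathrm{iSNR}$.

Proof. We can use exactly the same techniques as the ones exposed in [7] to show this property. □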
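The two formulations (6.42) and (6.45) of the Wiener filter can be compared numerically for $M = 2$. In this sketch, `phi_x`, `gamma`, and `phi_in` are assumed values, the matrix $\boldsymbol{\Phi}_{c_y,l}$ is built from (6.43), and the helper names are hypothetical. The speech-distortion index is also checked against (6.48).

```python
# Sketch: two forms of the Model 2 Wiener filter in one subband (M = 2).

def solve2(a, b):
    """Solve the 2x2 linear system a x = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

phi_x = 2.0
gamma = [1.0, 0.6]
phi_in = [[1.0, 0.3], [0.3, 1.5]]    # assumed positive definite matrix

# Phi_cy = phi_x * gamma gamma^T + Phi_in, see (6.43).
phi_cy = [[phi_x * gamma[i] * gamma[j] + phi_in[i][j] for j in range(2)]
          for i in range(2)]

h_w_a = [phi_x * x for x in solve2(phi_cy, gamma)]   # form (6.42)

w = solve2(phi_in, gamma)                            # Phi_in^{-1} gamma
g = dot(gamma, w)
h_w_b = [x / (1.0 / phi_x + g) for x in w]           # form (6.45)

lam_max = phi_x * g                                  # subband output SNR, (6.47)
v_sd = (dot(h_w_a, gamma) - 1.0) ** 2                # (6.13)

print("h_W via (6.42):", h_w_a)
print("h_W via (6.45):", h_w_b)
print("v_sd =", v_sd, "1/(1 + lambda_max)^2 =", 1.0 / (1.0 + lam_max) ** 2)
```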

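6.5 MINIMUM VARIANCE DISTORTIONLESS RESPONSE (MVDR) FILTER

The celebrated minimum variance distortionless response (MVDR) filter proposed by Capon [10], [36] is usually derived in a context where we have at least two sensors (or microphones) available. Interestingly, with Model 2, we can also derive the MVDR (with one sensor only) by minimizing the MSE of the residual interference-plus-noise, $J_{\mathrm{in}}(\mathbf{h}_l)$, with the constraint that the desired signal is not distorted. Mathematically, this is equivalent to

$$\min_{\mathbf{h}_l} \mathbf{h}_l^T \boldsymbol{\Phi}_{\mathrm{in},l} \mathbf{h}_l \quad \text{subject to} \quad \mathbf{h}_l^T \boldsymbol{\gamma}_{c_x,l} = 1, \tag{6.54}$$

for which the solution is

$$\mathbf{h}_{\mathrm{MVDR},l} = \frac{\phi_{c_x,l}}{\lambda_{\max,l}} \boldsymbol{\Phi}_{\mathrm{in},l}^{-1} \boldsymbol{\gamma}_{c_x,l} = \frac{\boldsymbol{\Phi}_{\mathrm{in},l}^{-1} \boldsymbol{\Phi}_{c_y,l} - \mathbf{I}_M}{\mathrm{tr}\left(\boldsymbol{\Phi}_{\mathrm{in},l}^{-1} \boldsymbol{\Phi}_{c_y,l}\right) - M}\, \mathbf{i}_{M,1}. \tag{6.55}$$

Obviously, we can rewrite the MVDR as

$$\mathbf{h}_{\mathrm{MVDR},l} = \frac{\boldsymbol{\Phi}_{c_y,l}^{-1} \boldsymbol{\gamma}_{c_x,l}}{\boldsymbol{\gamma}_{c_x,l}^T \boldsymbol{\Phi}_{c_y,l}^{-1} \boldsymbol{\gamma}_{c_x,l}}. \tag{6.56}$$

Taking

$$\alpha_l = \frac{\phi_{c_x,l}}{\lambda_{\max,l}} \tag{6.57}$$

in (6.24) (maximum SNR filter), we find (6.55) (MVDR filter), showing how the maximum SNR, MVDR, and Wiener filters are equivalent up to a scaling factor. From a subband point of view, this scaling is not significant, but from a fullband point of view, it can be important since speech signals are broadband in nature. Indeed, it can easily be verified that this scaling factor affects the fullband output SNRs and fullband speech-distortion indices. While the subband output SNRs of the maximum SNR, Wiener, and MVDR filters are the same, the fullband output SNRs are not because of the scaling factor. It is clear that we always have

$$\mathrm{oSNR}(\mathbf{h}_{\mathrm{MVDR},l}) = \mathrm{oSNR}(\mathbf{h}_{\mathrm{W},l}), \tag{6.58}$$

$$\upsilon_{\mathrm{sd}}(\mathbf{h}_{\mathrm{MVDR},l}) = 0, \tag{6.59}$$

$$\xi_{\mathrm{sr}}(\mathbf{h}_{\mathrm{MVDR},l}) = 1, \tag{6.60}$$

$$\xi_{\mathrm{nr}}(\mathbf{h}_{\mathrm{MVDR},l}) = \frac{\lambda_{\max,l}}{\mathrm{iSNR}_l} \le \xi_{\mathrm{nr}}(\mathbf{h}_{\mathrm{W},l}), \tag{6.61}$$

and

$$1 \ge \tilde{J}(\mathbf{h}_{\mathrm{MVDR},l}) = \frac{\mathrm{iSNR}_l}{\lambda_{\max,l}} \ge \tilde{J}(\mathbf{h}_{\mathrm{W},l}). \tag{6.62}$$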

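The fullband output SNR is

$$\mathrm{oSNR}(\mathbf{h}_{\mathrm{MVDR},:}) = \frac{\sum_{l=1}^{L} \phi_{c_x,l}}{\sum_{l=1}^{L} \dfrac{\phi_{c_x,l}}{\mathrm{oSNR}(\mathbf{h}_{\mathrm{MVDR},l})}}. \tag{6.63}$$

Property 6.2 With the optimal KLE-domain MVDR filter given in (6.55), the fullband output SNR is always greater than or equal to the input SNR, i.e., $\mathrm{oSNR}(\mathbf{h}_{\mathrm{MVDR},:}) \ge \mathrm{iSNR}$.

Proof. See next section. □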
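For completeness, the solution (6.55) of the constrained problem (6.54) can be obtained with a standard Lagrangian argument. The following is a sketch of that derivation; the multiplier $\kappa$ is a symbol introduced here for illustration and does not appear in the text.

```latex
\begin{align*}
\mathcal{L}(\mathbf{h}_l, \kappa) &= \mathbf{h}_l^T \boldsymbol{\Phi}_{\mathrm{in},l} \mathbf{h}_l
  + \kappa \left(1 - \mathbf{h}_l^T \boldsymbol{\gamma}_{c_x,l}\right), \\
\frac{\partial \mathcal{L}}{\partial \mathbf{h}_l} = \mathbf{0}
  &\;\Longrightarrow\; 2 \boldsymbol{\Phi}_{\mathrm{in},l} \mathbf{h}_l = \kappa\, \boldsymbol{\gamma}_{c_x,l}
  \;\Longrightarrow\; \mathbf{h}_l = \frac{\kappa}{2}\, \boldsymbol{\Phi}_{\mathrm{in},l}^{-1} \boldsymbol{\gamma}_{c_x,l}, \\
\mathbf{h}_l^T \boldsymbol{\gamma}_{c_x,l} = 1
  &\;\Longrightarrow\; \frac{\kappa}{2}
  = \frac{1}{\boldsymbol{\gamma}_{c_x,l}^T \boldsymbol{\Phi}_{\mathrm{in},l}^{-1} \boldsymbol{\gamma}_{c_x,l}}, \\
\mathbf{h}_{\mathrm{MVDR},l}
  &= \frac{\boldsymbol{\Phi}_{\mathrm{in},l}^{-1} \boldsymbol{\gamma}_{c_x,l}}
          {\boldsymbol{\gamma}_{c_x,l}^T \boldsymbol{\Phi}_{\mathrm{in},l}^{-1} \boldsymbol{\gamma}_{c_x,l}}
   = \frac{\phi_{c_x,l}}{\lambda_{\max,l}}\, \boldsymbol{\Phi}_{\mathrm{in},l}^{-1} \boldsymbol{\gamma}_{c_x,l},
\end{align*}
```

where the last equality uses $\lambda_{\max,l} = \phi_{c_x,l}\, \boldsymbol{\gamma}_{c_x,l}^T \boldsymbol{\Phi}_{\mathrm{in},l}^{-1} \boldsymbol{\gamma}_{c_x,l}$ from (6.22).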

6.6 TRADEOFF FILTER

In the tradeoff approach, we try to compromise between noise reduction and speech distortion. Instead of minimizing the MSE to find the Wiener filter or minimizing the MSE of the residual interference-plus-noise with the constraint of no distortion to find the MVDR, we could minimize the speech-distortion index with the constraint that the noise-reduction factor is equal to a positive value that is greater than 1. Mathematically, this is equivalent to

$$\min_{\mathbf{h}_l} J_x(\mathbf{h}_l) \quad \text{subject to} \quad J_{\mathrm{in}}(\mathbf{h}_l) = \beta \phi_{c_v,l}, \tag{6.64}$$

where $0 < \beta < 1$ to ensure that we get some noise reduction. By using a Lagrange multiplier, $\mu > 0$, to adjoin the constraint to the cost function, we easily deduce the tradeoff filter:

$$\mathbf{h}_{\mathrm{T},\mu,l} = \phi_{c_x,l} \left[\phi_{c_x,l}\, \boldsymbol{\gamma}_{c_x,l} \boldsymbol{\gamma}_{c_x,l}^T + \mu \boldsymbol{\Phi}_{\mathrm{in},l}\right]^{-1} \boldsymbol{\gamma}_{c_x,l} = \frac{\phi_{c_x,l}\, \boldsymbol{\Phi}_{\mathrm{in},l}^{-1} \boldsymbol{\gamma}_{c_x,l}}{\mu + \lambda_{\max,l}}, \tag{6.65}$$

where the Lagrange multiplier, $\mu$, satisfies $J_{\mathrm{in}}(\mathbf{h}_{\mathrm{T},\mu,l}) = \beta \phi_{c_v,l}$. However, in practice, it is not easy to determine the optimal $\mu$. Therefore, when this parameter is chosen in an ad-hoc way, we can see that for

• $\mu = 1$, $\mathbf{h}_{\mathrm{T},1,l} = \mathbf{h}_{\mathrm{W},l}$, which is the Wiener filter;
• $\mu = 0$, $\mathbf{h}_{\mathrm{T},0,l} = \mathbf{h}_{\mathrm{MVDR},l}$, which is the MVDR filter;
• $\mu > 1$, results in low residual noise at the expense of high speech distortion;
• $\mu < 1$, results in high residual noise and low speech distortion.

Note that the MVDR filter cannot be derived from the first line of (6.65) since by taking $\mu = 0$, we have to invert a matrix that is not full rank. Again, we observe here as well that the tradeoff and Wiener filters are equivalent up to a scaling factor. As a result, the subband output SNR with the tradeoff filter is obviously the same as the subband output SNR with the Wiener filter, i.e.,

$$\mathrm{oSNR}(\mathbf{h}_{\mathrm{T},\mu,l}) = \lambda_{\max,l}, \tag{6.66}$$

and does not depend on $\mu$. However, the subband speech-distortion index is now both a function of the variable $\mu$ and the subband output SNR:

$$\upsilon_{\mathrm{sd}}(\mathbf{h}_{\mathrm{T},\mu,l}) = \frac{\mu^2}{\left(\mu + \lambda_{\max,l}\right)^2}. \tag{6.67}$$

From (6.67), we observe how $\mu$ can affect the desired signal. The tradeoff filter is interesting from several perspectives since it encompasses both the Wiener and MVDR filters. It is then useful to study the fullband output SNR and the fullband speech-distortion index of the tradeoff filter, which both depend on the variable $\mu$. Using (6.65) in (6.6), we find that the fullband output SNR is

$$\mathrm{oSNR}(\mathbf{h}_{\mathrm{T},\mu,:}) = \frac{\sum_{l=1}^{L} \dfrac{\phi_{c_x,l}\, \lambda_{\max,l}^2}{\left(\mu + \lambda_{\max,l}\right)^2}}{\sum_{l=1}^{L} \dfrac{\phi_{c_x,l}\, \lambda_{\max,l}}{\left(\mu + \lambda_{\max,l}\right)^2}}. \tag{6.68}$$

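We propose the following.

Property 6.3 The fullband output SNR of the tradeoff filter is an increasing function of the parameter $\mu$.

Proof. The proof is very similar to the one given in [49]. In order to determine the variations of $\mathrm{oSNR}(\mathbf{h}_{\mathrm{T},\mu,:})$ with respect to the parameter $\mu$, we will check the sign of the following differentiation with respect to $\mu$:

$$\frac{d\, \mathrm{oSNR}(\mathbf{h}_{\mathrm{T},\mu,:})}{d\mu} = 2\, \frac{\mathrm{Num}(\mu)}{\mathrm{Den}(\mu)}, \tag{6.69}$$

where

$$\mathrm{Num}(\mu) = -\sum_{l=1}^{L} \frac{\phi_{c_x,l} \lambda_{\max,l}^2}{(\mu + \lambda_{\max,l})^3} \sum_{l=1}^{L} \frac{\phi_{c_x,l} \lambda_{\max,l}}{(\mu + \lambda_{\max,l})^2} + \sum_{l=1}^{L} \frac{\phi_{c_x,l} \lambda_{\max,l}^2}{(\mu + \lambda_{\max,l})^2} \sum_{l=1}^{L} \frac{\phi_{c_x,l} \lambda_{\max,l}}{(\mu + \lambda_{\max,l})^3} \tag{6.70}$$

and

$$\mathrm{Den}(\mu) = \left[\sum_{l=1}^{L} \frac{\phi_{c_x,l} \lambda_{\max,l}}{(\mu + \lambda_{\max,l})^2}\right]^2. \tag{6.71}$$

We only focus on the numerator of the above derivative to see the variations of the fullband output SNR since the denominator is always positive. Multiplying and dividing by $\mu + \lambda_{\max,l}$, this numerator can be rewritten as

$$\begin{aligned}
\mathrm{Num}(\mu) &= -\sum_{l=1}^{L} \frac{\phi_{c_x,l} \lambda_{\max,l}^2}{(\mu + \lambda_{\max,l})^3} \sum_{l=1}^{L} \frac{\phi_{c_x,l} \lambda_{\max,l} (\mu + \lambda_{\max,l})}{(\mu + \lambda_{\max,l})^3} + \sum_{l=1}^{L} \frac{\phi_{c_x,l} \lambda_{\max,l}^2 (\mu + \lambda_{\max,l})}{(\mu + \lambda_{\max,l})^3} \sum_{l=1}^{L} \frac{\phi_{c_x,l} \lambda_{\max,l}}{(\mu + \lambda_{\max,l})^3} \\
&= -\left[\sum_{l=1}^{L} \frac{\phi_{c_x,l} \lambda_{\max,l}^2}{(\mu + \lambda_{\max,l})^3}\right]^2 - \mu \sum_{l=1}^{L} \frac{\phi_{c_x,l} \lambda_{\max,l}^2}{(\mu + \lambda_{\max,l})^3} \sum_{l=1}^{L} \frac{\phi_{c_x,l} \lambda_{\max,l}}{(\mu + \lambda_{\max,l})^3} \\
&\quad + \sum_{l=1}^{L} \frac{\phi_{c_x,l} \lambda_{\max,l}^3}{(\mu + \lambda_{\max,l})^3} \sum_{l=1}^{L} \frac{\phi_{c_x,l} \lambda_{\max,l}}{(\mu + \lambda_{\max,l})^3} + \mu \sum_{l=1}^{L} \frac{\phi_{c_x,l} \lambda_{\max,l}^2}{(\mu + \lambda_{\max,l})^3} \sum_{l=1}^{L} \frac{\phi_{c_x,l} \lambda_{\max,l}}{(\mu + \lambda_{\max,l})^3} \\
&= -\left[\sum_{l=1}^{L} \frac{\phi_{c_x,l} \lambda_{\max,l}^2}{(\mu + \lambda_{\max,l})^3}\right]^2 + \sum_{l=1}^{L} \frac{\phi_{c_x,l} \lambda_{\max,l}^3}{(\mu + \lambda_{\max,l})^3} \sum_{l=1}^{L} \frac{\phi_{c_x,l} \lambda_{\max,l}}{(\mu + \lambda_{\max,l})^3}.
\end{aligned} \tag{6.72}$$

As long as $\mu$, $\lambda_{\max,l}$, and $\phi_{c_x,l}$ are positive $\forall l$, we can use the Cauchy-Schwarz inequality

$$\sum_{l=1}^{L} \frac{\phi_{c_x,l} \lambda_{\max,l}^3}{(\mu + \lambda_{\max,l})^3} \sum_{l=1}^{L} \frac{\phi_{c_x,l} \lambda_{\max,l}}{(\mu + \lambda_{\max,l})^3} \ge \left[\sum_{l=1}^{L} \sqrt{\frac{\phi_{c_x,l} \lambda_{\max,l}^3}{(\mu + \lambda_{\max,l})^3}} \sqrt{\frac{\phi_{c_x,l} \lambda_{\max,l}}{(\mu + \lambda_{\max,l})^3}}\right]^2 = \left[\sum_{l=1}^{L} \frac{\phi_{c_x,l} \lambda_{\max,l}^2}{(\mu + \lambda_{\max,l})^3}\right]^2. \tag{6.73}$$

Substituting (6.73) into (6.72), we conclude that

$$\frac{d\, \mathrm{oSNR}(\mathbf{h}_{\mathrm{T},\mu,:})}{d\mu} \ge 0, \tag{6.74}$$

proving that the fullband output SNR is increasing with respect to $\mu$. □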

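From Property 6.3, we deduce that the MVDR filter gives the smallest fullband output SNR, which is

$$\mathrm{oSNR}(\mathbf{h}_{\mathrm{T},0,:}) = \frac{\sum_{l=1}^{L} \phi_{c_x,l}}{\sum_{l=1}^{L} \dfrac{\phi_{c_x,l}}{\lambda_{\max,l}}}. \tag{6.75}$$

We give another interesting property.

Property 6.4 We have

$$\lim_{\mu \to \infty} \mathrm{oSNR}(\mathbf{h}_{\mathrm{T},\mu,:}) = \frac{\sum_{l=1}^{L} \phi_{c_x,l}\, \lambda_{\max,l}^2}{\sum_{l=1}^{L} \phi_{c_x,l}\, \lambda_{\max,l}} \le \sum_{l=1}^{L} \lambda_{\max,l}. \tag{6.76}$$

Proof. Easy to show from (6.68). □

While the fullband output SNR is upper bounded, it is easy to show that the fullband noise-reduction factor and fullband speech-reduction factor are not. So when $\mu$ goes to infinity, so do $\xi_{\mathrm{nr}}(\mathbf{h}_{\mathrm{T},\mu,:})$ and $\xi_{\mathrm{sr}}(\mathbf{h}_{\mathrm{T},\mu,:})$. The fullband speech-distortion index is

$$\upsilon_{\mathrm{sd}}(\mathbf{h}_{\mathrm{T},\mu,:}) = \frac{\sum_{l=1}^{L} \dfrac{\phi_{c_x,l}\, \mu^2}{\left(\mu + \lambda_{\max,l}\right)^2}}{\sum_{l=1}^{L} \phi_{c_x,l}}. \tag{6.77}$$

Property 6.5 The fullband speech-distortion index of the tradeoff filter is an increasing function of the parameter $\mu$.

Proof. It is straightforward to verify that

$$\frac{d \upsilon_{\mathrm{sd}}(\mathbf{h}_{\mathrm{T},\mu,:})}{d\mu} \ge 0, \tag{6.78}$$

which ends the proof. □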

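It is clear that

$$0 \le \upsilon_{\mathrm{sd}}(\mathbf{h}_{\mathrm{T},\mu,:}) \le 1, \quad \forall \mu \ge 0. \tag{6.79}$$

Therefore, as $\mu$ increases, the fullband output SNR increases at the price of more distortion to the desired signal.

Property 6.6 With the tradeoff filter, $\mathbf{h}_{\mathrm{T},\mu,l}$, the fullband output SNR is always greater than or equal to the input SNR, i.e., $\mathrm{oSNR}(\mathbf{h}_{\mathrm{T},\mu,:}) \ge \mathrm{iSNR}$, $\forall \mu \ge 0$.

Proof. We know that

$$\lambda_{\max,l} \ge \mathrm{iSNR}_l, \tag{6.80}$$

which implies that

$$\sum_{l=1}^{L} \phi_{c_v,l} \frac{\mathrm{iSNR}_l}{\lambda_{\max,l}} \le \sum_{l=1}^{L} \phi_{c_v,l} \tag{6.81}$$

and hence,

$$\mathrm{oSNR}(\mathbf{h}_{\mathrm{T},0,:}) = \frac{\sum_{l=1}^{L} \phi_{c_x,l}}{\sum_{l=1}^{L} \phi_{c_v,l} \dfrac{\mathrm{iSNR}_l}{\lambda_{\max,l}}} \ge \frac{\sum_{l=1}^{L} \phi_{c_x,l}}{\sum_{l=1}^{L} \phi_{c_v,l}} = \mathrm{iSNR}. \tag{6.82}$$

But from Property 6.3, we have

$$\mathrm{oSNR}(\mathbf{h}_{\mathrm{T},\mu,:}) \ge \mathrm{oSNR}(\mathbf{h}_{\mathrm{T},0,:}), \quad \forall \mu \ge 0, \tag{6.83}$$

as a result,

$$\mathrm{oSNR}(\mathbf{h}_{\mathrm{T},\mu,:}) \ge \mathrm{iSNR}, \quad \forall \mu \ge 0, \tag{6.84}$$

which completes the proof. □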
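Properties 6.3 to 6.6 only involve the scalars $\phi_{c_x,l}$, $\phi_{c_v,l}$, and $\lambda_{\max,l}$, so they can be explored with a short sweep over $\mu$. The sketch below uses assumed values that satisfy $\lambda_{\max,l} \ge \mathrm{iSNR}_l$ as in (6.80); the function names are hypothetical.

```python
# Sketch: fullband behavior of the Model 2 tradeoff filter versus mu.

def fullband_osnr(mu, phi_x, lam):
    """Fullband output SNR (6.68)."""
    num = sum(px * lm * lm / (mu + lm) ** 2 for px, lm in zip(phi_x, lam))
    den = sum(px * lm / (mu + lm) ** 2 for px, lm in zip(phi_x, lam))
    return num / den

def fullband_v_sd(mu, phi_x, lam):
    """Fullband speech-distortion index (6.77)."""
    num = sum(px * mu * mu / (mu + lm) ** 2 for px, lm in zip(phi_x, lam))
    return num / sum(phi_x)

phi_x = [4.0, 1.0, 0.25]
phi_v = [1.0, 1.0, 1.0]
lam = [6.0, 1.5, 0.4]      # assumed lambda_max,l, each >= phi_x[l] / phi_v[l]

i_snr = sum(phi_x) / sum(phi_v)
mus = [0.0, 0.5, 1.0, 2.0, 10.0, 100.0]
o_snrs = [fullband_osnr(mu, phi_x, lam) for mu in mus]
v_sds = [fullband_v_sd(mu, phi_x, lam) for mu in mus]

mvdr_osnr = sum(phi_x) / sum(px / lm for px, lm in zip(phi_x, lam))   # (6.75)
limit = (sum(px * lm * lm for px, lm in zip(phi_x, lam))
         / sum(px * lm for px, lm in zip(phi_x, lam)))                # (6.76)

for mu, o, v in zip(mus, o_snrs, v_sds):
    print("mu =", mu, "oSNR =", o, "v_sd =", v)
print("iSNR =", i_snr, "MVDR oSNR =", mvdr_osnr, "limit =", limit)
```

In this example the output SNR rises from the MVDR value at $\mu = 0$ toward the limit (6.76), while the distortion index rises from 0 toward 1.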

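CHAPTER 7

Optimal Filters in the KLE Domain with Model 3

This chapter is dedicated to the study of optimal filters with Model 3 where the interband correlation is taken into account. To simplify the presentation, we drop the subscript “3” from the FIR filter of length $L$ and the filtering matrix of size $L \times L$ (see Chapter 4, Section 4.3), so that now $\mathbf{h}_{3,l}$ and $\mathbf{H}_3$ are written as $\mathbf{h}_l$ and $\mathbf{H}$, respectively.

7.1 PERFORMANCE MEASURES

In this section, we derive the most important performance measures based on Model 3 derived in Chapter 4, Section 4.3. We define the subband output SNR as

$$\mathrm{oSNR}(\mathbf{h}_l) = \frac{\mathbf{h}_l^T \boldsymbol{\Phi}_{c_x} \mathbf{h}_l}{\mathbf{h}_l^T \boldsymbol{\Phi}_{c_v} \mathbf{h}_l} = \frac{\left(\mathbf{Q}\mathbf{h}_l\right)^T \mathbf{R}_x \left(\mathbf{Q}\mathbf{h}_l\right)}{\left(\mathbf{Q}\mathbf{h}_l\right)^T \mathbf{R}_v \left(\mathbf{Q}\mathbf{h}_l\right)}, \quad l = 1, 2, \ldots, L. \tag{7.1}$$

It is interesting to notice that, contrary to Model 1 and Model 2 where the subband output SNRs depend only on the energies of the desired and noise signals in the considered subband, the subband output SNR for Model 3 depends on the whole energies (from all subbands) of the desired and noise signals. We easily find the definition of the fullband output SNR, which is

$$\mathrm{oSNR}(\mathbf{h}_{:}) = \mathrm{oSNR}(\mathbf{H}) = \frac{\sum_{l=1}^{L} \mathbf{h}_l^T \boldsymbol{\Phi}_{c_x} \mathbf{h}_l}{\sum_{l=1}^{L} \mathbf{h}_l^T \boldsymbol{\Phi}_{c_v} \mathbf{h}_l} = \frac{\mathrm{tr}\left(\mathbf{H} \boldsymbol{\Phi}_{c_x} \mathbf{H}^T\right)}{\mathrm{tr}\left(\mathbf{H} \boldsymbol{\Phi}_{c_v} \mathbf{H}^T\right)}. \tag{7.2}$$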

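We recall that

$$\mathbf{H} = \begin{bmatrix} \mathbf{h}_1^T \\ \mathbf{h}_2^T \\ \vdots \\ \mathbf{h}_L^T \end{bmatrix}$$

is a filtering matrix of size $L \times L$. We always have

$$\mathrm{oSNR}(\mathbf{H}) \le \sum_{l=1}^{L} \mathrm{oSNR}(\mathbf{h}_l). \tag{7.3}$$

The previous inequality shows that the fullband output SNR is upper bounded no matter how the filters $\mathbf{h}_l$, $l = 1, 2, \ldots, L$ are chosen. The subband and fullband noise-reduction factors are

$$\xi_{\mathrm{nr}}(\mathbf{h}_l) = \frac{\phi_{c_v,l}}{\mathbf{h}_l^T \boldsymbol{\Phi}_{c_v} \mathbf{h}_l} = \frac{\mathbf{q}_l^T \mathbf{R}_v \mathbf{q}_l}{\left(\mathbf{Q}\mathbf{h}_l\right)^T \mathbf{R}_v \left(\mathbf{Q}\mathbf{h}_l\right)}, \quad l = 1, 2, \ldots, L, \tag{7.4}$$

$$\xi_{\mathrm{nr}}(\mathbf{H}) = \frac{\sum_{l=1}^{L} \phi_{c_v,l}}{\sum_{l=1}^{L} \mathbf{h}_l^T \boldsymbol{\Phi}_{c_v} \mathbf{h}_l} = \frac{\mathrm{tr}\left(\mathbf{R}_v\right)}{\mathrm{tr}\left[\mathbf{H}\mathbf{Q}^T \mathbf{R}_v \left(\mathbf{H}\mathbf{Q}^T\right)^T\right]}. \tag{7.5}$$

These factors should be lower bounded by 1 for optimal filters. We also have

$$\xi_{\mathrm{nr}}(\mathbf{H}) \le \sum_{l=1}^{L} \xi_{\mathrm{nr}}(\mathbf{h}_l). \tag{7.6}$$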

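The distortion of the desired signal can be quantified with the subband speech-distortion index

$$\upsilon_{\mathrm{sd}}(\mathbf{h}_l) = \frac{E\left\{\left[\mathbf{h}_l^T \mathbf{c}_x(m) - c_{x,l}(m)\right]^2\right\}}{\phi_{c_x,l}} = \frac{\left(\mathbf{h}_l - \mathbf{i}_l\right)^T \boldsymbol{\Phi}_{c_x} \left(\mathbf{h}_l - \mathbf{i}_l\right)}{\phi_{c_x,l}}, \quad l = 1, 2, \ldots, L \tag{7.7}$$

and the fullband speech-distortion index

$$\upsilon_{\mathrm{sd}}(\mathbf{H}) = \frac{\sum_{l=1}^{L} \left(\mathbf{h}_l - \mathbf{i}_l\right)^T \boldsymbol{\Phi}_{c_x} \left(\mathbf{h}_l - \mathbf{i}_l\right)}{\sum_{l=1}^{L} \phi_{c_x,l}} = \frac{\mathrm{tr}\left[\left(\mathbf{H} - \mathbf{I}\right) \boldsymbol{\Phi}_{c_x} \left(\mathbf{H} - \mathbf{I}\right)^T\right]}{\mathrm{tr}\left(\boldsymbol{\Phi}_{c_x}\right)}, \tag{7.8}$$

where $\mathbf{i}_l$ is a vector of length $L$, corresponding to the $l$th column of the identity matrix $\mathbf{I}$ of size $L \times L$. The speech-distortion index is usually upper bounded by 1. We have

$$\upsilon_{\mathrm{sd}}(\mathbf{H}) \le \sum_{l=1}^{L} \upsilon_{\mathrm{sd}}(\mathbf{h}_l). \tag{7.9}$$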

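We can also quantify signal distortion via the subband and fullband speech-reduction factors, which are defined as

$$\xi_{\mathrm{sr}}(\mathbf{h}_l) = \frac{\phi_{c_x,l}}{\mathbf{h}_l^T \boldsymbol{\Phi}_{c_x} \mathbf{h}_l} = \frac{\mathbf{q}_l^T \mathbf{R}_x \mathbf{q}_l}{\left(\mathbf{Q}\mathbf{h}_l\right)^T \mathbf{R}_x \left(\mathbf{Q}\mathbf{h}_l\right)}, \quad l = 1, 2, \ldots, L, \tag{7.10}$$

$$\xi_{\mathrm{sr}}(\mathbf{H}) = \frac{\sum_{l=1}^{L} \phi_{c_x,l}}{\sum_{l=1}^{L} \mathbf{h}_l^T \boldsymbol{\Phi}_{c_x} \mathbf{h}_l} = \frac{\mathrm{tr}\left(\mathbf{R}_x\right)}{\mathrm{tr}\left[\mathbf{H}\mathbf{Q}^T \mathbf{R}_x \left(\mathbf{H}\mathbf{Q}^T\right)^T\right]}. \tag{7.11}$$

The speech-reduction factor is supposed to have a lower bound of 1 for optimal filters. We also have

$$\xi_{\mathrm{sr}}(\mathbf{H}) \le \sum_{l=1}^{L} \xi_{\mathrm{sr}}(\mathbf{h}_l). \tag{7.12}$$

We can verify that

$$\frac{\mathrm{oSNR}(\mathbf{h}_l)}{\mathrm{iSNR}_l} = \frac{\xi_{\mathrm{nr}}(\mathbf{h}_l)}{\xi_{\mathrm{sr}}(\mathbf{h}_l)}, \quad l = 1, 2, \ldots, L, \tag{7.13}$$

$$\frac{\mathrm{oSNR}(\mathbf{H})}{\mathrm{iSNR}} = \frac{\xi_{\mathrm{nr}}(\mathbf{H})}{\xi_{\mathrm{sr}}(\mathbf{H})}. \tag{7.14}$$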

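7.2 MSE CRITERION

We define the error signal between the estimated and desired signals in the subband $l$ as

$$e_l(m) = c_{z_3,l}(m) - c_{x,l}(m) = \mathbf{h}_l^T \mathbf{c}_y(m) - c_{x,l}(m). \tag{7.15}$$

This error signal can also be written as the sum of two uncorrelated error signals:

$$e_l(m) = e_{x,l}(m) + e_{v,l}(m), \tag{7.16}$$

where

$$e_{x,l}(m) = \mathbf{h}_l^T \mathbf{c}_x(m) - c_{x,l}(m) = \left(\mathbf{h}_l - \mathbf{i}_l\right)^T \mathbf{c}_x(m) \tag{7.17}$$

is the speech distortion due to the filter and

$$e_{v,l}(m) = \mathbf{h}_l^T \mathbf{c}_v(m) \tag{7.18}$$

represents the residual noise. From the error signal defined in (7.15), we can now deduce the subband MSE criterion:

$$J(\mathbf{h}_l) = E\left[e_l^2(m)\right] = \mathbf{h}_l^T \boldsymbol{\Phi}_{c_y} \mathbf{h}_l - 2 \mathbf{h}_l^T \boldsymbol{\Phi}_{c_y c_x} \mathbf{i}_l + \phi_{c_x,l}, \tag{7.19}$$

where

$$\boldsymbol{\Phi}_{c_y c_x} = E\left[\mathbf{c}_y(m) \mathbf{c}_x^T(m)\right] = E\left[\mathbf{c}_x(m) \mathbf{c}_x^T(m)\right] = \boldsymbol{\Phi}_{c_x}$$

is the cross-correlation matrix between the two signal vectors $\mathbf{c}_y(m)$ and $\mathbf{c}_x(m)$. The subband MSE can be rewritten as

$$J(\mathbf{h}_l) = J_x(\mathbf{h}_l) + J_v(\mathbf{h}_l),$$

where

$$J_x(\mathbf{h}_l) = E\left[e_{x,l}^2(m)\right] = \left(\mathbf{h}_l - \mathbf{i}_l\right)^T \boldsymbol{\Phi}_{c_x} \left(\mathbf{h}_l - \mathbf{i}_l\right) \tag{7.20}$$

and

$$J_v(\mathbf{h}_l) = E\left[e_{v,l}^2(m)\right] = \mathbf{h}_l^T \boldsymbol{\Phi}_{c_v} \mathbf{h}_l. \tag{7.21}$$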

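For the particular filters $\mathbf{h}_l = \mathbf{i}_l$, $\forall l$, we get

$$J(\mathbf{i}_l) = \phi_{c_v,l}. \tag{7.22}$$

Using this particular case of the MSE, we define the subband normalized MSE (NMSE) as

$$\tilde{J}(\mathbf{h}_l) = \frac{J(\mathbf{h}_l)}{J(\mathbf{i}_l)} = \mathrm{iSNR}_l \cdot \upsilon_{\mathrm{sd}}(\mathbf{h}_l) + \frac{1}{\xi_{\mathrm{nr}}(\mathbf{h}_l)}, \tag{7.23}$$

where

$$\upsilon_{\mathrm{sd}}(\mathbf{h}_l) = \frac{J_x(\mathbf{h}_l)}{\phi_{c_x,l}}, \tag{7.24}$$

$$\xi_{\mathrm{nr}}(\mathbf{h}_l) = \frac{\phi_{c_v,l}}{J_v(\mathbf{h}_l)}. \tag{7.25}$$

The KLE-domain NMSE depends explicitly on the subband speech-distortion index and the subband noise-reduction factor. We define the fullband MSE and fullband NMSE as

$$J(\mathbf{H}) = \frac{1}{L} \sum_{l=1}^{L} J(\mathbf{h}_l) = \frac{1}{L} \sum_{l=1}^{L} J_x(\mathbf{h}_l) + \frac{1}{L} \sum_{l=1}^{L} J_v(\mathbf{h}_l) = J_x(\mathbf{H}) + J_v(\mathbf{H}) \tag{7.26}$$

and

$$\tilde{J}(\mathbf{H}) = \frac{L\, J(\mathbf{H})}{\sum_{l=1}^{L} \phi_{c_v,l}} = \mathrm{iSNR} \cdot \upsilon_{\mathrm{sd}}(\mathbf{H}) + \frac{1}{\xi_{\mathrm{nr}}(\mathbf{H})}, \tag{7.27}$$

where

$$\upsilon_{\mathrm{sd}}(\mathbf{H}) = \frac{L\, J_x(\mathbf{H})}{\sum_{l=1}^{L} \phi_{c_x,l}}, \tag{7.28}$$

$$\xi_{\mathrm{nr}}(\mathbf{H}) = \frac{\sum_{l=1}^{L} \phi_{c_v,l}}{L\, J_v(\mathbf{H})}. \tag{7.29}$$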


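7.3 WIENER FILTER

If we differentiate the MSE criterion, $J(\mathbf{h}_l)$, with respect to $\mathbf{h}_l$ and equate the result to zero, we find the Wiener filter:

$$\mathbf{h}_{\mathrm{W},l} = \boldsymbol{\Phi}_{c_y}^{-1} \boldsymbol{\Phi}_{c_x} \mathbf{i}_l = \left(\mathbf{I} - \boldsymbol{\Phi}_{c_y}^{-1} \boldsymbol{\Phi}_{c_v}\right) \mathbf{i}_l = \left(\mathbf{I} - \boldsymbol{\Lambda}^{-1} \boldsymbol{\Phi}_{c_v}\right) \mathbf{i}_l. \tag{7.30}$$

Combining all filters $\mathbf{h}_{\mathrm{W},l}$, $l = 1, 2, \ldots, L$ in a matrix, we get

$$\mathbf{H}_{\mathrm{W}} = \boldsymbol{\Phi}_{c_x} \boldsymbol{\Phi}_{c_y}^{-1} = \mathbf{I} - \boldsymbol{\Phi}_{c_v} \boldsymbol{\Phi}_{c_y}^{-1} = \mathbf{I} - \boldsymbol{\Phi}_{c_v} \boldsymbol{\Lambda}^{-1}. \tag{7.31}$$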

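Property 7.1 The Wiener filter derived with Model 3 [eq. (7.31)] is strictly equivalent to the classical time-domain Wiener filter derived in Chapter 3 [eq. (3.22)].

Proof. Indeed, from Chapter 4, Section 4.3, we know that the time-domain form of $\mathbf{H}_{\mathrm{W}}$ is

$$\mathbf{H}_{\mathrm{TD,W}} = \mathbf{Q} \mathbf{H}_{\mathrm{W}} \mathbf{Q}^T. \tag{7.32}$$

Substituting (7.31) into the previous expression, we find that

$$\mathbf{H}_{\mathrm{TD,W}} = \mathbf{Q} \boldsymbol{\Phi}_{c_x} \boldsymbol{\Lambda}^{-1} \mathbf{Q}^T = \mathbf{Q} \boldsymbol{\Phi}_{c_x} \mathbf{Q}^T \mathbf{Q} \boldsymbol{\Lambda}^{-1} \mathbf{Q}^T = \mathbf{R}_x \mathbf{R}_y^{-1} = \mathbf{H}_{\mathrm{t,W}}, \tag{7.33}$$

which completes the proof. □

It is interesting to see that the subband and fullband output SNRs are

$$\mathrm{oSNR}(\mathbf{h}_{\mathrm{W},l}) = \frac{\mathbf{q}_l^T \mathbf{R}_x \mathbf{R}_y^{-1} \mathbf{R}_x \mathbf{R}_y^{-1} \mathbf{R}_x \mathbf{q}_l}{\mathbf{q}_l^T \mathbf{R}_x \mathbf{R}_y^{-1} \mathbf{R}_v \mathbf{R}_y^{-1} \mathbf{R}_x \mathbf{q}_l}, \quad l = 1, 2, \ldots, L, \tag{7.34}$$

$$\mathrm{oSNR}(\mathbf{H}_{\mathrm{W}}) = \mathrm{oSNR}(\mathbf{H}_{\mathrm{TD,W}}) = \mathrm{oSNR}(\mathbf{H}_{\mathrm{t,W}}) = \frac{\mathrm{tr}\left(\mathbf{R}_x \mathbf{R}_y^{-1} \mathbf{R}_x \mathbf{R}_y^{-1} \mathbf{R}_x\right)}{\mathrm{tr}\left(\mathbf{R}_x \mathbf{R}_y^{-1} \mathbf{R}_v \mathbf{R}_y^{-1} \mathbf{R}_x\right)}. \tag{7.35}$$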

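From some results of this chapter and Chapter 3, we easily deduce that

$$\mathrm{oSNR}(\mathbf{H}_{\mathrm{W}}) \ge \mathrm{iSNR} \tag{7.36}$$

and

$$\mathrm{oSNR}(\mathbf{H}_{\mathrm{W}}) \le \sum_{l=1}^{L} \frac{\mathbf{q}_l^T \mathbf{R}_x \mathbf{R}_y^{-1} \mathbf{R}_x \mathbf{R}_y^{-1} \mathbf{R}_x \mathbf{q}_l}{\mathbf{q}_l^T \mathbf{R}_x \mathbf{R}_y^{-1} \mathbf{R}_v \mathbf{R}_y^{-1} \mathbf{R}_x \mathbf{q}_l}. \tag{7.37}$$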
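Property 7.1 can be checked numerically on a toy example with $L = 2$. In the sketch below, the rotation angle, the eigenvalues of $\mathbf{R}_y$, and the noise matrix `r_v` are assumptions chosen so that $\mathbf{R}_x = \mathbf{R}_y - \mathbf{R}_v$ is positive definite; the small matrix helpers are hypothetical and written out to avoid external libraries.

```python
# Sketch: Model 3 Wiener matrix mapped to the time domain, L = 2.
import math

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(a):
    return [[a[j][i] for j in range(2)] for i in range(2)]

def inv2(a):
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]

theta = 0.4
c, s = math.cos(theta), math.sin(theta)
q = [[c, -s], [s, c]]                       # orthogonal matrix Q (assumed)
lam = [[3.0, 0.0], [0.0, 1.5]]              # eigenvalues of R_y (assumed)
r_y = matmul(matmul(q, lam), transpose(q))  # R_y = Q Lambda Q^T
r_v = [[0.5, 0.1], [0.1, 0.4]]              # noise correlation matrix (assumed)
r_x = [[r_y[i][j] - r_v[i][j] for j in range(2)] for i in range(2)]

phi_cx = matmul(matmul(transpose(q), r_x), q)   # Phi_cx = Q^T R_x Q
h_w = matmul(phi_cx, inv2(lam))                 # KLE-domain Wiener matrix (7.31)
h_td_w = matmul(matmul(q, h_w), transpose(q))   # time-domain form (7.32)
h_t_w = matmul(r_x, inv2(r_y))                  # classical filter R_x R_y^{-1}

max_diff = max(abs(h_td_w[i][j] - h_t_w[i][j])
               for i in range(2) for j in range(2))
print("H_TD,W =", h_td_w)
print("H_t,W  =", h_t_w)
print("largest absolute difference =", max_diff)
```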

7.4 TRADEOFF FILTER

The basic principle of the tradeoff filter is to compromise between noise reduction and speech distortion. From the optimization procedure

$$
\min_{\mathbf{h}_l} J_x(\mathbf{h}_l) \quad \text{subject to} \quad J_v(\mathbf{h}_l) = \beta\phi_{c_v,l}, \tag{7.38}
$$

where $0 < \beta < 1$ to ensure that we get some noise reduction, we find that the optimal tradeoff filter is

$$
\mathbf{h}_{T,\mu,l} = \left(\boldsymbol{\Phi}_{c_x} + \mu\boldsymbol{\Phi}_{c_v}\right)^{-1}\boldsymbol{\Phi}_{c_x}\mathbf{i}_l, \tag{7.39}
$$

where $\mu \ge 0$ is a Lagrange multiplier satisfying $J_v(\mathbf{h}_{T,\mu,l}) = \beta\phi_{c_v,l}$. Usually $\mu$ is chosen in an ad hoc way, so that for

• $\mu = 1$, $\mathbf{h}_{T,1,l} = \mathbf{h}_{W,l}$, which is the Wiener filter;
• $\mu = 0$, $\mathbf{h}_{T,0,l} = \mathbf{i}_l$, which is the identity filter (neither noise reduction nor speech distortion);
• $\mu > 1$, results in low residual noise at the expense of high speech distortion;
• $\mu < 1$, results in high residual noise and low speech distortion.

Combining all filters $\mathbf{h}_{T,\mu,l}$, $l = 1, 2, \ldots, L$ in a matrix, we get

$$
\mathbf{H}_{T,\mu} = \boldsymbol{\Phi}_{c_x}\left(\boldsymbol{\Phi}_{c_x} + \mu\boldsymbol{\Phi}_{c_v}\right)^{-1}. \tag{7.40}
$$
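The role of $\mu$ in (7.40) can be illustrated with a small numerical sketch. It assumes synthetic positive-definite matrices for $\boldsymbol{\Phi}_{c_x}$ and $\boldsymbol{\Phi}_{c_v}$; the function names are hypothetical, and the two quantities printed are the unnormalized residual noise $\mathrm{tr}(\mathbf{H}\boldsymbol{\Phi}_{c_v}\mathbf{H}^T)$ and speech distortion $\mathrm{tr}[(\mathbf{H}-\mathbf{I})\boldsymbol{\Phi}_{c_x}(\mathbf{H}-\mathbf{I})^T]$.

```python
import numpy as np

rng = np.random.default_rng(1)
L = 5


def random_spd(n):
    """Random symmetric positive-definite matrix (synthetic correlation matrix)."""
    a = rng.standard_normal((n, 3 * n))
    return a @ a.T / (3 * n)


Phi_cx = random_spd(L)   # KLE-domain speech correlation matrix (synthetic)
Phi_cv = random_spd(L)   # KLE-domain noise correlation matrix (synthetic)
I = np.eye(L)


def tradeoff_matrix(mu):
    """H_{T,mu} = Phi_cx (Phi_cx + mu Phi_cv)^{-1}, eq. (7.40)."""
    return Phi_cx @ np.linalg.inv(Phi_cx + mu * Phi_cv)


def residual_noise(H):
    """Unnormalized residual noise energy tr(H Phi_cv H^T)."""
    return np.trace(H @ Phi_cv @ H.T)


def speech_distortion(H):
    """Unnormalized speech distortion tr((H - I) Phi_cx (H - I)^T)."""
    return np.trace((H - I) @ Phi_cx @ (H - I).T)


mus = [0.0, 0.5, 1.0, 4.0]
noise = [residual_noise(tradeoff_matrix(mu)) for mu in mus]
dist = [speech_distortion(tradeoff_matrix(mu)) for mu in mus]
for mu, n, d in zip(mus, noise, dist):
    print(f"mu={mu:3.1f}  residual noise={n:.4f}  speech distortion={d:.4f}")
```

With $\mu = 0$ the matrix reduces to the identity, and as $\mu$ grows the residual noise falls while the distortion rises, which is the compromise described in the list above.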

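Property 7.2 The tradeoff filter derived with Model 3 [eq. (7.40)] is strictly equivalent to the classical time-domain tradeoff filter derived in Chapter 3 [eq. (3.47)].

Proof. Indeed, from Chapter 4, Section 4.3, we know that the time-domain form of $\mathbf{H}_{T,\mu}$ is

$$
\mathbf{H}_{\mathrm{TD},T,\mu} = \mathbf{Q}\mathbf{H}_{T,\mu}\mathbf{Q}^T. \tag{7.41}
$$

Substituting (7.40) into the previous expression, we find that

$$
\begin{aligned}
\mathbf{H}_{\mathrm{TD},T,\mu} &= \mathbf{Q}\boldsymbol{\Phi}_{c_x}\mathbf{Q}^T\mathbf{Q}\left(\boldsymbol{\Phi}_{c_x} + \mu\boldsymbol{\Phi}_{c_v}\right)^{-1}\mathbf{Q}^T \\
&= \mathbf{R}_x\left(\mathbf{R}_x + \mu\mathbf{R}_v\right)^{-1} = \mathbf{H}_{t,T,\mu},
\end{aligned}
\tag{7.42}
$$

which completes the proof. $\square$

Obviously, we also have the following important results:

$$
\mathrm{oSNR}(\mathbf{H}_{T,\mu}) = \mathrm{oSNR}(\mathbf{H}_{\mathrm{TD},T,\mu}) = \mathrm{oSNR}(\mathbf{H}_{t,T,\mu}) \tag{7.43}
$$

and

$$
\mathrm{oSNR}(\mathbf{H}_{T,\mu}) \ge \mathrm{iSNR}, \quad \forall \mu \ge 0. \tag{7.44}
$$

7.5

MAXIMUM SNR FILTER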

The maximum SNR filter is obtained by maximizing the subband output SNR defined in (7.1). Assume that the matrix $\boldsymbol{\Phi}_{c_v}$ is full rank. In this case, the maximum SNR filter is the same in all subbands and is equal to the eigenvector, $\mathbf{h}_{\max}$, corresponding to the maximum eigenvalue, $\lambda_{\max}$, of the matrix $\boldsymbol{\Phi}_{c_v}^{-1}\boldsymbol{\Phi}_{c_x}$. As a result, the subband and fullband output SNRs are

$$
\mathrm{oSNR}(\mathbf{h}_{\max}) = \lambda_{\max}, \quad \forall l, \tag{7.45}
$$

$$
\mathrm{oSNR}(\mathbf{H}_{\max}) = \lambda_{\max}, \tag{7.46}
$$

where

$$
\mathbf{H}_{\max} = \begin{bmatrix} \beta_1\mathbf{h}_{\max}^T \\ \beta_2\mathbf{h}_{\max}^T \\ \vdots \\ \beta_L\mathbf{h}_{\max}^T \end{bmatrix} \tag{7.47}
$$

and $\beta_l$, $l = 1, 2, \ldots, L$ are real numbers with at least one of them different from 0.

Property 7.3 The maximum SNR filter derived with Model 3 [eq. (7.47)] is strictly equivalent to the time-domain maximum SNR filter derived in Chapter 3 [eq. (3.74)].

Proof. Indeed, from Chapter 4, Section 4.3, we know that the time-domain form of $\mathbf{H}_{\max}$ is

$$
\mathbf{H}_{\mathrm{TD},\max} = \mathbf{Q}\mathbf{H}_{\max}\mathbf{Q}^T. \tag{7.48}
$$

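It is clear that

$$
\mathbf{R}_v^{-1}\mathbf{R}_x\mathbf{h}_{t,\max} = \lambda_{\max}\mathbf{h}_{t,\max}, \tag{7.49}
$$

$$
\boldsymbol{\Phi}_{c_v}^{-1}\boldsymbol{\Phi}_{c_x}\mathbf{h}_{\max} = \lambda_{\max}\mathbf{h}_{\max}, \tag{7.50}
$$

where

$$
\mathbf{h}_{\max} = \mathbf{Q}^T\mathbf{h}_{t,\max}. \tag{7.51}
$$

Therefore, substituting (7.51) into (7.48), we find that

$$
\mathbf{H}_{\mathrm{TD},\max} = \mathbf{Q}\mathbf{H}_{t,\max} = \mathbf{H}'_{t,\max}, \tag{7.52}
$$

where $\mathbf{H}_{t,\max} = \mathbf{H}_{\max}\mathbf{Q}^T$ has rows $\beta_l\mathbf{h}_{t,\max}^T$, and $\mathbf{H}'_{t,\max}$ has the same form with a different set of real scalars, since every row of $\mathbf{Q}\mathbf{H}_{t,\max}$ is again a scalar multiple of $\mathbf{h}_{t,\max}^T$. $\square$

This is another interesting (and simpler) way to derive the maximum SNR filter in the time domain, compared with its direct derivation from the time-domain output SNR as explained in Chapter 3.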
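Relations (7.49) to (7.51) lend themselves to a quick numerical check. The sketch below assumes synthetic positive-definite matrices and solves the eigenproblem through Cholesky whitening, which is one convenient numerical route and not a method prescribed by the text; `max_snr_filter` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(2)
L = 6


def random_spd(n):
    """Random symmetric positive-definite matrix (synthetic correlation matrix)."""
    a = rng.standard_normal((n, 3 * n))
    return a @ a.T / (3 * n)


def max_snr_filter(Phi_sig, Phi_noise):
    """Largest eigenpair of Phi_noise^{-1} Phi_sig, via Cholesky whitening."""
    C = np.linalg.cholesky(Phi_noise)                   # Phi_noise = C C^T
    C_inv = np.linalg.inv(C)
    w, U = np.linalg.eigh(C_inv @ Phi_sig @ C_inv.T)    # eigenvalues in ascending order
    h = C_inv.T @ U[:, -1]                              # map back to the original problem
    return w[-1], h / np.linalg.norm(h)


Rx, Rv = random_spd(L), random_spd(L)
_, Q = np.linalg.eigh(Rx + Rv)                          # KL transform of Ry = Rx + Rv
Phi_cx, Phi_cv = Q.T @ Rx @ Q, Q.T @ Rv @ Q

lam_t, h_t_max = max_snr_filter(Rx, Rv)                 # time domain, eq. (7.49)
lam_k, h_max = max_snr_filter(Phi_cx, Phi_cv)           # KLE domain, eq. (7.50)

# Subband output SNR of h_max and alignment of h_max with Q^T h_t_max, eq. (7.51)
osnr_h_max = (h_max @ Phi_cx @ h_max) / (h_max @ Phi_cv @ h_max)
alignment = abs(h_max @ (Q.T @ h_t_max))
print(lam_t, lam_k, osnr_h_max, alignment)
```

The two maximum eigenvalues coincide, the output SNR of `h_max` equals that eigenvalue, and the two eigenvectors agree up to sign after applying $\mathbf{Q}^T$.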

CHAPTER 8

Optimal Filters in the KLE Domain with Model 4

Model 4 is at the same time the most general and the most complicated model, since both the interband and interframe correlations are taken into account. To simplify the presentation, we drop the subscript “4” from the FIR filter of length ML and the filtering matrix of size L × ML (see Chapter 4, Section 4.4), so that $\mathbf{h}_{4,l}$ and $\mathbf{H}_4$ are now written as $\mathbf{h}_l$ and $\mathbf{H}$, respectively.

8.1

PERFORMANCE MEASURES

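All performance measures are derived from expressions (4.37) and (4.42) of Section 4.4. We define the subband output SNR as¹

$$
\mathrm{oSNR}(\mathbf{h}_l) = \frac{\mathbf{h}_l^T\boldsymbol{\Phi}_{\underline{c}_{x_\mathrm{d}}}\mathbf{h}_l}{\mathbf{h}_l^T\boldsymbol{\Phi}_{\mathrm{in}}\mathbf{h}_l} = \frac{\mathbf{h}_l^T\boldsymbol{\Gamma}_{c_x}\boldsymbol{\Phi}_{c_x}\boldsymbol{\Gamma}_{c_x}^T\mathbf{h}_l}{\mathbf{h}_l^T\boldsymbol{\Phi}_{\mathrm{in}}\mathbf{h}_l}, \quad l = 1, 2, \ldots, L, \tag{8.1}
$$

where

$$
\boldsymbol{\Phi}_{\mathrm{in}} = \boldsymbol{\Phi}_{\underline{c}'_x} + \boldsymbol{\Phi}_{\underline{c}_v} \tag{8.2}
$$

is the interference-plus-noise covariance matrix. As in Model 3, this subband output SNR depends on the total energies (from all subbands) of the desired, interference, and noise signals. The fullband output SNR is then

$$
\mathrm{oSNR}(\mathbf{h}_{1:L}) = \mathrm{oSNR}(\mathbf{H}) = \frac{\sum_{l=1}^{L}\mathbf{h}_l^T\boldsymbol{\Phi}_{\underline{c}_{x_\mathrm{d}}}\mathbf{h}_l}{\sum_{l=1}^{L}\mathbf{h}_l^T\boldsymbol{\Phi}_{\mathrm{in}}\mathbf{h}_l} = \frac{\mathrm{tr}\left(\mathbf{H}\boldsymbol{\Phi}_{\underline{c}_{x_\mathrm{d}}}\mathbf{H}^T\right)}{\mathrm{tr}\left(\mathbf{H}\boldsymbol{\Phi}_{\mathrm{in}}\mathbf{H}^T\right)}. \tag{8.3}
$$

¹ In this study, we consider the interference as part of the noise in the definitions of the performance measures.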

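We recall that

$$
\mathbf{H} = \begin{bmatrix} \mathbf{h}_1^T \\ \mathbf{h}_2^T \\ \vdots \\ \mathbf{h}_L^T \end{bmatrix}
$$

is a filtering matrix of size L × ML. We always have

$$
\mathrm{oSNR}(\mathbf{H}) \le \sum_{l=1}^{L}\mathrm{oSNR}(\mathbf{h}_l). \tag{8.4}
$$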

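The previous inequality shows that the fullband output SNR is upper bounded no matter how the filters $\mathbf{h}_l$, $l = 1, 2, \ldots, L$ are chosen. The subband and fullband noise-reduction factors are

$$
\xi_{\mathrm{nr}}(\mathbf{h}_l) = \frac{\phi_{c_v,l}}{\mathbf{h}_l^T\boldsymbol{\Phi}_{\mathrm{in}}\mathbf{h}_l}, \quad l = 1, 2, \ldots, L, \tag{8.5}
$$

$$
\xi_{\mathrm{nr}}(\mathbf{H}) = \frac{\sum_{l=1}^{L}\phi_{c_v,l}}{\sum_{l=1}^{L}\mathbf{h}_l^T\boldsymbol{\Phi}_{\mathrm{in}}\mathbf{h}_l} = \frac{\mathrm{tr}\left(\boldsymbol{\Phi}_{c_v}\right)}{\mathrm{tr}\left(\mathbf{H}\boldsymbol{\Phi}_{\mathrm{in}}\mathbf{H}^T\right)}. \tag{8.6}
$$

These factors should be lower bounded by 1 for optimal filters. We also have

$$
\xi_{\mathrm{nr}}(\mathbf{H}) \le \sum_{l=1}^{L}\xi_{\mathrm{nr}}(\mathbf{h}_l). \tag{8.7}
$$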

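The distortion of the desired signal can be quantified with the subband speech-distortion index

$$
\upsilon_{\mathrm{sd}}(\mathbf{h}_l) = \frac{E\left\{\left[\mathbf{h}_l^T\boldsymbol{\Gamma}_{c_x}\mathbf{c}_x(m) - c_{x,l}(m)\right]^2\right\}}{\phi_{c_x,l}} = \frac{\left(\boldsymbol{\Gamma}_{c_x}^T\mathbf{h}_l - \mathbf{i}_l\right)^T\boldsymbol{\Phi}_{c_x}\left(\boldsymbol{\Gamma}_{c_x}^T\mathbf{h}_l - \mathbf{i}_l\right)}{\phi_{c_x,l}}, \quad l = 1, 2, \ldots, L \tag{8.8}
$$

and the fullband speech-distortion index

$$
\upsilon_{\mathrm{sd}}(\mathbf{H}) = \frac{\sum_{l=1}^{L}\left(\boldsymbol{\Gamma}_{c_x}^T\mathbf{h}_l - \mathbf{i}_l\right)^T\boldsymbol{\Phi}_{c_x}\left(\boldsymbol{\Gamma}_{c_x}^T\mathbf{h}_l - \mathbf{i}_l\right)}{\sum_{l=1}^{L}\phi_{c_x,l}} = \frac{\mathrm{tr}\left[\left(\mathbf{H}\boldsymbol{\Gamma}_{c_x} - \mathbf{I}\right)\boldsymbol{\Phi}_{c_x}\left(\mathbf{H}\boldsymbol{\Gamma}_{c_x} - \mathbf{I}\right)^T\right]}{\mathrm{tr}\left(\boldsymbol{\Phi}_{c_x}\right)}, \tag{8.9}
$$

where $\mathbf{i}_l$ is a vector of length L, corresponding to the lth column of the identity matrix $\mathbf{I}$ of size L × L. The speech-distortion index is usually upper bounded by 1. We have

$$
\upsilon_{\mathrm{sd}}(\mathbf{H}) \le \sum_{l=1}^{L}\upsilon_{\mathrm{sd}}(\mathbf{h}_l). \tag{8.10}
$$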

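We can also quantify signal distortion via the subband and fullband speech-reduction factors, which are defined as

$$
\xi_{\mathrm{sr}}(\mathbf{h}_l) = \frac{\phi_{c_x,l}}{\mathbf{h}_l^T\boldsymbol{\Gamma}_{c_x}\boldsymbol{\Phi}_{c_x}\boldsymbol{\Gamma}_{c_x}^T\mathbf{h}_l}, \quad l = 1, 2, \ldots, L, \tag{8.11}
$$

$$
\xi_{\mathrm{sr}}(\mathbf{H}) = \frac{\sum_{l=1}^{L}\phi_{c_x,l}}{\sum_{l=1}^{L}\mathbf{h}_l^T\boldsymbol{\Gamma}_{c_x}\boldsymbol{\Phi}_{c_x}\boldsymbol{\Gamma}_{c_x}^T\mathbf{h}_l} = \frac{\mathrm{tr}\left(\boldsymbol{\Phi}_{c_x}\right)}{\mathrm{tr}\left(\mathbf{H}\boldsymbol{\Gamma}_{c_x}\boldsymbol{\Phi}_{c_x}\boldsymbol{\Gamma}_{c_x}^T\mathbf{H}^T\right)}. \tag{8.12}
$$

The speech-reduction factor is supposed to have a lower bound of 1 for optimal filters. We also have

$$
\xi_{\mathrm{sr}}(\mathbf{H}) \le \sum_{l=1}^{L}\xi_{\mathrm{sr}}(\mathbf{h}_l). \tag{8.13}
$$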

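An important observation from (8.9) or (8.12) is that the design of a noise reduction algorithm with Model 4 that does not distort the desired signal requires the constraint

$$
\mathbf{H}\boldsymbol{\Gamma}_{c_x} = \mathbf{I}. \tag{8.14}
$$

We can verify that

$$
\frac{\mathrm{oSNR}(\mathbf{h}_l)}{\mathrm{iSNR}_l} = \frac{\xi_{\mathrm{nr}}(\mathbf{h}_l)}{\xi_{\mathrm{sr}}(\mathbf{h}_l)}, \quad l = 1, 2, \ldots, L, \tag{8.15}
$$

$$
\frac{\mathrm{oSNR}(\mathbf{H})}{\mathrm{iSNR}} = \frac{\xi_{\mathrm{nr}}(\mathbf{H})}{\xi_{\mathrm{sr}}(\mathbf{H})}. \tag{8.16}
$$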



8.2

MSE CRITERION

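The error signal between the estimated and desired signals in the subband l is defined as

$$
e_l(m) = c_{z_4,l}(m) - c_{x,l}(m) = \mathbf{h}_l^T\underline{\mathbf{c}}_y(m) - c_{x,l}(m). \tag{8.17}
$$

The previous error can be decomposed as follows:

$$
e_l(m) = e_{x,l}(m) + e_{\mathrm{in},l}(m), \tag{8.18}
$$

where

$$
e_{x,l}(m) = \mathbf{h}_l^T\boldsymbol{\Gamma}_{c_x}\mathbf{c}_x(m) - c_{x,l}(m) = \left(\boldsymbol{\Gamma}_{c_x}^T\mathbf{h}_l - \mathbf{i}_l\right)^T\mathbf{c}_x(m) \tag{8.19}
$$

is the speech distortion due to the filter and

$$
e_{\mathrm{in},l}(m) = \mathbf{h}_l^T\underline{\mathbf{c}}'_x(m) + \mathbf{h}_l^T\underline{\mathbf{c}}_v(m) \tag{8.20}
$$

represents the residual interference-plus-noise. The subband MSE criterion for Model 4 is then

$$
J(\mathbf{h}_l) = E\left[e_l^2(m)\right] = \mathbf{h}_l^T\boldsymbol{\Phi}_{\underline{c}_y}\mathbf{h}_l - 2\mathbf{h}_l^T\boldsymbol{\Phi}_{\underline{c}_y\underline{c}_x}\underline{\mathbf{i}}_l + \phi_{c_x,l}, \tag{8.21}
$$

where

$$
\boldsymbol{\Phi}_{\underline{c}_y} = E\left[\underline{\mathbf{c}}_y(m)\underline{\mathbf{c}}_y^T(m)\right]
$$

is the correlation matrix of the signal $\underline{\mathbf{c}}_y(m)$,

$$
\boldsymbol{\Phi}_{\underline{c}_y\underline{c}_x} = E\left[\underline{\mathbf{c}}_y(m)\underline{\mathbf{c}}_x^T(m)\right] = E\left[\underline{\mathbf{c}}_x(m)\underline{\mathbf{c}}_x^T(m)\right] = \boldsymbol{\Phi}_{\underline{c}_x}
$$

is the cross-correlation matrix between the two signal vectors $\underline{\mathbf{c}}_y(m)$ and $\underline{\mathbf{c}}_x(m)$, and $\underline{\mathbf{i}}_l$ is a vector of length ML whose lth component is equal to 1 and whose other components are all equal to 0. Expression (8.21) can be rewritten as

$$
J(\mathbf{h}_l) = J_x(\mathbf{h}_l) + J_{\mathrm{in}}(\mathbf{h}_l),
$$

where

$$
J_x(\mathbf{h}_l) = E\left[e_{x,l}^2(m)\right] = \left(\boldsymbol{\Gamma}_{c_x}^T\mathbf{h}_l - \mathbf{i}_l\right)^T\boldsymbol{\Phi}_{c_x}\left(\boldsymbol{\Gamma}_{c_x}^T\mathbf{h}_l - \mathbf{i}_l\right) \tag{8.22}
$$

and

$$
J_{\mathrm{in}}(\mathbf{h}_l) = E\left[e_{\mathrm{in},l}^2(m)\right] = \mathbf{h}_l^T\boldsymbol{\Phi}_{\mathrm{in}}\mathbf{h}_l. \tag{8.23}
$$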

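For the particular filters $\mathbf{h}_l = \underline{\mathbf{i}}_l$, $\forall l$, we get

$$
J(\underline{\mathbf{i}}_l) = \phi_{c_v,l}. \tag{8.24}
$$

Using this particular case of the MSE, we define the subband normalized MSE (NMSE) as

$$
\tilde{J}(\mathbf{h}_l) = \frac{J(\mathbf{h}_l)}{J(\underline{\mathbf{i}}_l)} = \mathrm{iSNR}_l \cdot \upsilon_{\mathrm{sd}}(\mathbf{h}_l) + \frac{1}{\xi_{\mathrm{nr}}(\mathbf{h}_l)}, \tag{8.25}
$$

where

$$
\upsilon_{\mathrm{sd}}(\mathbf{h}_l) = \frac{J_x(\mathbf{h}_l)}{\phi_{c_x,l}}, \tag{8.26}
$$

$$
\xi_{\mathrm{nr}}(\mathbf{h}_l) = \frac{\phi_{c_v,l}}{J_{\mathrm{in}}(\mathbf{h}_l)}. \tag{8.27}
$$

The KLE-domain NMSE depends explicitly on the subband speech-distortion index and the subband noise-reduction factor. We define the fullband MSE and fullband NMSE as

$$
J(\mathbf{H}) = \frac{1}{L}\sum_{l=1}^{L}J(\mathbf{h}_l) = \frac{1}{L}\sum_{l=1}^{L}J_x(\mathbf{h}_l) + \frac{1}{L}\sum_{l=1}^{L}J_{\mathrm{in}}(\mathbf{h}_l) = J_x(\mathbf{H}) + J_{\mathrm{in}}(\mathbf{H}) \tag{8.28}
$$

and

$$
\tilde{J}(\mathbf{H}) = \frac{L \cdot J(\mathbf{H})}{\sum_{l=1}^{L}\phi_{c_v,l}} = \mathrm{iSNR} \cdot \upsilon_{\mathrm{sd}}(\mathbf{H}) + \frac{1}{\xi_{\mathrm{nr}}(\mathbf{H})}, \tag{8.29}
$$

where

$$
\upsilon_{\mathrm{sd}}(\mathbf{H}) = \frac{L \cdot J_x(\mathbf{H})}{\sum_{l=1}^{L}\phi_{c_x,l}}, \tag{8.30}
$$

$$
\xi_{\mathrm{nr}}(\mathbf{H}) = \frac{\sum_{l=1}^{L}\phi_{c_v,l}}{L \cdot J_{\mathrm{in}}(\mathbf{H})}. \tag{8.31}
$$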

8.3

WIENER FILTER

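From the MSE criterion given in (8.21), we easily derive the Wiener filter, which is

$$
\mathbf{h}_{W,l} = \boldsymbol{\Phi}_{\underline{c}_y}^{-1}\boldsymbol{\Phi}_{\underline{c}_x}\underline{\mathbf{i}}_l = \left(\mathbf{I}_{ML} - \boldsymbol{\Phi}_{\underline{c}_y}^{-1}\boldsymbol{\Phi}_{\underline{c}_v}\right)\underline{\mathbf{i}}_l, \tag{8.32}
$$

where $\mathbf{I}_{ML}$ is the identity matrix of size ML × ML. Combining all filters $\mathbf{h}_{W,l}$, $l = 1, 2, \ldots, L$ in a matrix, we get

$$
\mathbf{H}_W = \underline{\mathbf{I}}\boldsymbol{\Phi}_{\underline{c}_x}\boldsymbol{\Phi}_{\underline{c}_y}^{-1} = \underline{\mathbf{I}} - \underline{\mathbf{I}}\boldsymbol{\Phi}_{\underline{c}_v}\boldsymbol{\Phi}_{\underline{c}_y}^{-1}, \tag{8.33}
$$

where

$$
\underline{\mathbf{I}} = \begin{bmatrix} \mathbf{I} & \mathbf{0}_{L\times(ML-L)} \end{bmatrix}.
$$

Lemma 8.1 We can rewrite the Wiener filter as

$$
\mathbf{H}_W = \left(\boldsymbol{\Phi}_{c_x}^{-1} + \boldsymbol{\Gamma}_{c_x}^T\boldsymbol{\Phi}_{\mathrm{in}}^{-1}\boldsymbol{\Gamma}_{c_x}\right)^{-1}\boldsymbol{\Gamma}_{c_x}^T\boldsymbol{\Phi}_{\mathrm{in}}^{-1}. \tag{8.34}
$$

Proof. This expression is easy to show by using Woodbury's identity in the following decomposition

$$
\boldsymbol{\Phi}_{\underline{c}_y} = \boldsymbol{\Gamma}_{c_x}\boldsymbol{\Phi}_{c_x}\boldsymbol{\Gamma}_{c_x}^T + \boldsymbol{\Phi}_{\mathrm{in}} \tag{8.35}
$$

and replacing it in (8.33). $\square$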

The form of the Wiener filter presented in (8.34) is interesting because it shows an obvious link with some other optimal filters, as will be verified later. The Wiener filter with Model 4 also has several interesting properties. For example, it can be shown that the fullband output SNR is always greater than or equal to the input SNR, i.e., $\mathrm{oSNR}(\mathbf{H}_W) \ge \mathrm{iSNR}$.
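Lemma 8.1 can be verified numerically. The sketch below assumes synthetic matrices: a random ML × L matrix stands in for $\boldsymbol{\Gamma}_{c_x}$, and the Wiener matrix is written in the direct form $\boldsymbol{\Phi}_{c_x}\boldsymbol{\Gamma}_{c_x}^T\boldsymbol{\Phi}_{\underline{c}_y}^{-1}$, which is what (8.33) reduces to under the assumption that the interference is uncorrelated with $\mathbf{c}_x(m)$.

```python
import numpy as np

rng = np.random.default_rng(3)
L, M = 4, 3
ML = M * L


def random_spd(n):
    """Random symmetric positive-definite matrix (synthetic correlation matrix)."""
    a = rng.standard_normal((n, 3 * n))
    return a @ a.T / (3 * n)


Phi_cx = random_spd(L)                  # L x L speech correlation matrix (synthetic)
Phi_in = random_spd(ML)                 # ML x ML interference-plus-noise matrix (synthetic)
Gamma = rng.standard_normal((ML, L))    # ML x L stand-in for Gamma_cx (assumption)

# Decomposition (8.35)
Phi_cy = Gamma @ Phi_cx @ Gamma.T + Phi_in

# Direct form of the Wiener matrix (requires an ML x ML inverse)
H_direct = Phi_cx @ Gamma.T @ np.linalg.inv(Phi_cy)

# Lemma 8.1, eq. (8.34)
Phi_in_inv = np.linalg.inv(Phi_in)
H_lemma = (
    np.linalg.inv(np.linalg.inv(Phi_cx) + Gamma.T @ Phi_in_inv @ Gamma)
    @ Gamma.T
    @ Phi_in_inv
)

print(np.allclose(H_direct, H_lemma))
```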

8.4

TRADEOFF FILTER

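The tradeoff filter is an elegant way to compromise between noise reduction and speech distortion. One natural approach for its derivation is as follows:

$$
\min_{\mathbf{h}_l} J_x(\mathbf{h}_l) \quad \text{subject to} \quad J_{\mathrm{in}}(\mathbf{h}_l) = \beta\phi_{c_v,l}, \tag{8.36}
$$

where $0 < \beta < 1$ to ensure that we get some noise reduction. From (8.36), we find that the optimal tradeoff filter is

$$
\mathbf{h}_{T,\mu,l} = \left(\boldsymbol{\Phi}_{\underline{c}_{x_\mathrm{d}}} + \mu\boldsymbol{\Phi}_{\mathrm{in}}\right)^{-1}\boldsymbol{\Phi}_{\underline{c}_x}\underline{\mathbf{i}}_l, \tag{8.37}
$$

where $\boldsymbol{\Phi}_{\underline{c}_{x_\mathrm{d}}} = \boldsymbol{\Gamma}_{c_x}\boldsymbol{\Phi}_{c_x}\boldsymbol{\Gamma}_{c_x}^T$ and $\mu > 0$ is a Lagrange multiplier satisfying $J_{\mathrm{in}}(\mathbf{h}_{T,\mu,l}) = \beta\phi_{c_v,l}$. Taking $\mu = 1$, we obviously find the Wiener filter. Combining all filters $\mathbf{h}_{T,\mu,l}$, $l = 1, 2, \ldots, L$ in a matrix, we get

$$
\mathbf{H}_{T,\mu} = \underline{\mathbf{I}}\boldsymbol{\Phi}_{\underline{c}_x}\left(\boldsymbol{\Phi}_{\underline{c}_{x_\mathrm{d}}} + \mu\boldsymbol{\Phi}_{\mathrm{in}}\right)^{-1}, \tag{8.38}
$$

which can be rewritten, thanks to Woodbury's identity, as

$$
\mathbf{H}_{T,\mu} = \left(\mu\boldsymbol{\Phi}_{c_x}^{-1} + \boldsymbol{\Gamma}_{c_x}^T\boldsymbol{\Phi}_{\mathrm{in}}^{-1}\boldsymbol{\Gamma}_{c_x}\right)^{-1}\boldsymbol{\Gamma}_{c_x}^T\boldsymbol{\Phi}_{\mathrm{in}}^{-1}. \tag{8.39}
$$

We can show here as well that

$$
\mathrm{oSNR}(\mathbf{H}_{T,\mu}) \ge \mathrm{iSNR}, \quad \forall \mu > 0. \tag{8.40}
$$

8.5

MVDR FILTER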

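The minimum variance distortionless response (MVDR) filter is found by

$$
\min_{\mathbf{H}} \mathrm{tr}\left(\mathbf{H}\boldsymbol{\Phi}_{\mathrm{in}}\mathbf{H}^T\right) \quad \text{subject to} \quad \mathbf{H}\boldsymbol{\Gamma}_{c_x} = \mathbf{I}. \tag{8.41}
$$

Therefore, the optimal solution is

$$
\mathbf{H}_{\mathrm{MVDR}} = \left(\boldsymbol{\Gamma}_{c_x}^T\boldsymbol{\Phi}_{\mathrm{in}}^{-1}\boldsymbol{\Gamma}_{c_x}\right)^{-1}\boldsymbol{\Gamma}_{c_x}^T\boldsymbol{\Phi}_{\mathrm{in}}^{-1}. \tag{8.42}
$$

Lemma 8.2 We can rewrite the MVDR filter as

$$
\mathbf{H}_{\mathrm{MVDR}} = \left(\boldsymbol{\Gamma}_{c_x}^T\boldsymbol{\Phi}_{\underline{c}_y}^{-1}\boldsymbol{\Gamma}_{c_x}\right)^{-1}\boldsymbol{\Gamma}_{c_x}^T\boldsymbol{\Phi}_{\underline{c}_y}^{-1}. \tag{8.43}
$$

Proof. This expression is easy to show by using Woodbury's identity in $\boldsymbol{\Phi}_{\underline{c}_y}^{-1}$. $\square$

From (8.43), it is easy to deduce the relationship between the MVDR and Wiener filters:

$$
\mathbf{H}_{\mathrm{MVDR}} = \left(\mathbf{H}_W\boldsymbol{\Gamma}_{c_x}\right)^{-1}\mathbf{H}_W. \tag{8.44}
$$

It can be shown that

$$
\mathrm{oSNR}(\mathbf{H}_{\mathrm{MVDR}}) \ge \mathrm{iSNR}. \tag{8.45}
$$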
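The MVDR expressions (8.42) to (8.44) and the distortionless constraint can be checked on synthetic data. This is a minimal sketch: the random ML × L matrix `Gamma` is a stand-in for $\boldsymbol{\Gamma}_{c_x}$, and the Wiener matrix is taken in the form $\boldsymbol{\Phi}_{c_x}\boldsymbol{\Gamma}_{c_x}^T\boldsymbol{\Phi}_{\underline{c}_y}^{-1}$, both of which are assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(8)
L, M = 4, 3
ML = M * L


def random_spd(n):
    """Random symmetric positive-definite matrix (synthetic correlation matrix)."""
    a = rng.standard_normal((n, 3 * n))
    return a @ a.T / (3 * n)


Phi_cx = random_spd(L)                  # L x L speech correlation matrix (synthetic)
Phi_in = random_spd(ML)                 # ML x ML interference-plus-noise matrix (synthetic)
Gamma = rng.standard_normal((ML, L))    # ML x L stand-in for Gamma_cx (assumption)
Phi_cy = Gamma @ Phi_cx @ Gamma.T + Phi_in   # decomposition (8.35)

Phi_in_inv = np.linalg.inv(Phi_in)
Phi_cy_inv = np.linalg.inv(Phi_cy)

# MVDR filter from the interference-plus-noise matrix, eq. (8.42)
H_mvdr = np.linalg.inv(Gamma.T @ Phi_in_inv @ Gamma) @ Gamma.T @ Phi_in_inv

# MVDR filter from the noisy-signal matrix, eq. (8.43)
H_mvdr_y = np.linalg.inv(Gamma.T @ Phi_cy_inv @ Gamma) @ Gamma.T @ Phi_cy_inv

# MVDR filter from the Wiener matrix, eq. (8.44)
H_W = Phi_cx @ Gamma.T @ Phi_cy_inv
H_mvdr_w = np.linalg.inv(H_W @ Gamma) @ H_W

print(np.allclose(H_mvdr @ Gamma, np.eye(L)))   # distortionless constraint (8.14)
print(np.allclose(H_mvdr, H_mvdr_y), np.allclose(H_mvdr, H_mvdr_w))
```

In practice, (8.43) is the convenient form because $\boldsymbol{\Phi}_{\underline{c}_y}$ can be estimated directly from the observed signal.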

8.6

MAXIMUM SNR FILTER

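The maximum SNR filter is obtained by maximizing the subband output SNR defined in (8.1). It is assumed that the matrix $\boldsymbol{\Phi}_{\mathrm{in}}$ is full rank. In this case, the maximum SNR filter is the same in all subbands and is equal to the eigenvector, $\mathbf{h}_{\max}$, corresponding to the maximum eigenvalue, $\lambda_{\max}$, of the matrix $\boldsymbol{\Phi}_{\mathrm{in}}^{-1}\boldsymbol{\Phi}_{\underline{c}_{x_\mathrm{d}}}$, where $\boldsymbol{\Phi}_{\underline{c}_{x_\mathrm{d}}} = \boldsymbol{\Gamma}_{c_x}\boldsymbol{\Phi}_{c_x}\boldsymbol{\Gamma}_{c_x}^T$. As a result, the subband and fullband output SNRs are

$$
\mathrm{oSNR}(\mathbf{h}_{\max}) = \lambda_{\max}, \quad \forall l, \tag{8.46}
$$

$$
\mathrm{oSNR}(\mathbf{H}_{\max}) = \lambda_{\max}, \tag{8.47}
$$

where

$$
\mathbf{H}_{\max} = \begin{bmatrix} \beta_1\mathbf{h}_{\max}^T \\ \beta_2\mathbf{h}_{\max}^T \\ \vdots \\ \beta_L\mathbf{h}_{\max}^T \end{bmatrix} \tag{8.48}
$$

and $\beta_l$, $l = 1, 2, \ldots, L$ are real numbers with at least one of them different from 0. It can be observed that for $\mu \ge 1$,

$$
\mathrm{iSNR} \le \mathrm{oSNR}(\mathbf{H}_{\mathrm{MVDR}}) \le \mathrm{oSNR}(\mathbf{H}_W) \le \mathrm{oSNR}(\mathbf{H}_{T,\mu}) \le \mathrm{oSNR}(\mathbf{H}_{\max}) = \lambda_{\max} \tag{8.49}
$$

and for $\mu \le 1$,

$$
\mathrm{iSNR} \le \mathrm{oSNR}(\mathbf{H}_{T,\mu}) \le \mathrm{oSNR}(\mathbf{H}_{\mathrm{MVDR}}) \le \mathrm{oSNR}(\mathbf{H}_W) \le \mathrm{oSNR}(\mathbf{H}_{\max}) = \lambda_{\max}. \tag{8.50}
$$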

CHAPTER 9

Experimental Study

By dividing the general speech enhancement problem in the KLE domain into four categories, depending on whether the interframe and interband information is accounted for, we have derived a number of optimal noise reduction filters in Chapters 5 to 8. For each category of filters, we have analyzed their performance through theoretical evaluation of either the subband or the fullband output SNRs, noise-reduction factors, and speech-distortion indices. We have also discussed their connection to the time-domain filters. In this chapter, we study some of those key noise reduction filters through experiments and highlight the merits and limitations inherent in each optimal filter.

9.1

EXPERIMENTAL CONDITIONS

The clean speech used in all the experiments was recorded from a female talker in a quiet office room. It was sampled at 8 kHz. The overall length of the signal is 2 minutes. The first 10 seconds of this clean speech signal and the corresponding spectrogram are visualized in Fig. 9.1.

[Figure 9.1 plots: (a) waveform, amplitude versus time (s); (b) spectrogram, frequency (kHz) versus time (s).]

Figure 9.1: The clean speech x(k) used in the experiments: (a) the first 10-second waveform and (b) the first 10-second spectrogram.

The noisy speech is obtained by adding noise to the clean speech, where the noise signal is properly scaled to control the input SNR level. As we pointed out in the introduction, noise is a general term encompassing a broad range of unwanted signals. They can be either white (their spectral density is the same across all the frequency bands within the signal bandwidth) or colored (their power is not the same at different frequency bands); they can also be either stationary (their statistics stay the same over time) or nonstationary (their statistics are time-varying). Because of this, it is very difficult to evaluate a noise reduction filter and to compare different filters fairly, as the experimental results obtained in one noise condition may not be consistent with the ones obtained in another noise condition. Therefore, it is important to assess a noise reduction filter in many different conditions before we choose to implement it in a practical system. In this chapter, we choose three types of noise that we consider very representative of real applications: a computer-generated stationary white Gaussian random process, a car noise signal (quasi-stationary but colored), and a babble noise signal (nonstationary and colored).

The car noise is recorded in a Volvo sedan running at 55 miles/hour on a highway with all its windows closed. The first 10 seconds of this noise and its spectrogram are shown in Fig. 9.2. It is seen from the spectrogram that most of the car noise energy is concentrated in low frequencies, so this noise is colored. Also plotted in Fig. 9.2 are the autocorrelation coefficients of this noise (the first column of $\mathbf{R}_v$) computed using a long-time average. Clearly, there is a strong correlation between adjacent noise samples. This, again, illustrates that this car noise is colored.

The babble noise is recorded in a New York Stock Exchange (NYSE) room, so we shall call it the NYSE noise from now on. This noise consists of many sounds from various sources such as electrical fans, computer fans, telephone rings, and even some background speech. The first 10-second waveform and spectrogram and the first 20 autocorrelation coefficients (computed using a long-time average) of this noise are plotted in Fig. 9.3. One can easily see that the NYSE noise is nonstationary and colored.

9.2

ESTIMATION OF THE CORRELATION MATRICES AND VECTORS

The implementation of both the time-domain and KLE-domain noise reduction filters requires the estimation of the correlation matrices $\mathbf{R}_y$, $\mathbf{R}_x$, and $\mathbf{R}_v$. Since the noisy signal y(k) is accessible, the correlation matrix $\mathbf{R}_y$ can be computed by approximating the mathematical expectation in its definition with a sample average. However, a noise estimator or a voice activity detector (VAD) is needed in practice to compute the other two matrices. While they are very important, the noise estimation and VAD issues will be left to the reader’s investigation. Instead, we directly compute the noise correlation matrix from the noise signal. Specifically, in the following experiments, estimates of the matrices $\mathbf{R}_y$ and $\mathbf{R}_v$ are obtained using either a short-time or a long-time average. In the short-time average case, at each time instant k, a segment of the noisy and noise signals that consists of a number of the most recent samples is used to compute the corresponding correlation matrices. The length of the segment (or window), denoted by N, may vary depending on the experimental setup, and will be specified in each experiment. In the long-time average case, all the signal samples are used to compute the correlation matrices. Once estimates of the matrices $\mathbf{R}_y$ and $\mathbf{R}_v$ are obtained, the estimate of $\mathbf{R}_x$ is computed by subtracting the estimate of $\mathbf{R}_v$ from that of $\mathbf{R}_y$.
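The short-time average described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the signals are synthetic, the function name `short_time_corr` is hypothetical, and the same window length N is used for both matrices.

```python
import numpy as np


def short_time_corr(s, k, L, N):
    """Sample-average estimate of the L x L correlation matrix at time k,
    using the N most recent samples s[k-N+1], ..., s[k]."""
    seg = s[k - N + 1 : k + 1]
    # Each row is one vector [s(n), s(n-1), ..., s(n-L+1)]
    frames = np.lib.stride_tricks.sliding_window_view(seg, L)[:, ::-1]
    return frames.T @ frames / frames.shape[0]


rng = np.random.default_rng(4)
n = 2000
# Synthetic correlated "speech" and white "noise" (placeholders for real recordings)
x = np.convolve(rng.standard_normal(n), [1.0, 0.8, 0.4], mode="same")
v = 0.5 * rng.standard_normal(n)
y = x + v

L, N, k = 8, 320, n - 1
Ry_hat = short_time_corr(y, k, L, N)
Rv_hat = short_time_corr(v, k, L, N)
Rx_hat = Ry_hat - Rv_hat   # may fail to be positive semi-definite with finite data

print(Ry_hat.shape, np.linalg.eigvalsh(Rx_hat).min())
```

The subtraction step does not guarantee a positive semi-definite estimate of $\mathbf{R}_x$, which is one reason the choice of N matters in the experiments.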


[Figure 9.2 plots: (a) waveform, amplitude versus time (s); (b) spectrogram, frequency (kHz) versus time (s); (c) autocorrelation coefficients, amplitude versus time lag (samples).]

Figure 9.2: The car noise used in the experiments: (a) the first 10-second waveform, (b) the first 10-second spectrogram, and (c) the first 20 autocorrelation coefficients.

In the KLE domain, we also need to estimate the variances, correlation matrices, and correlation vectors of the subband signals cy,l , cx,l , and cv,l . All the parameters of the noisy signal are computed from cy,l using a recursive method, and those parameters of the noise signal are directly computed from cv,l without using any VAD. The parameters associated with the clean speech are estimated by subtracting the corresponding noise parameters from those of the noisy signal.
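A recursive estimate of the subband statistics can be sketched as below, using the exponentially weighted update that appears as (9.5) in Section 9.7. The coefficient vectors are synthetic, the function name is hypothetical, and the forgetting factors 0.8 and 0.9 simply mirror the values quoted later in the chapter.

```python
import numpy as np


def recursive_update(Phi_prev, c, alpha):
    """One step of Phi(m) = alpha * Phi(m-1) + (1 - alpha) * c(m) c(m)^T."""
    return alpha * Phi_prev + (1.0 - alpha) * np.outer(c, c)


rng = np.random.default_rng(5)
M = 4                            # length of the subband coefficient vector (illustrative)
num_frames = 200
alpha_y, alpha_v = 0.8, 0.9      # forgetting factors for the noisy and noise signals

Phi_y = np.zeros((M, M))
Phi_v = np.zeros((M, M))
for m in range(num_frames):
    c_x = rng.standard_normal(M)          # synthetic speech coefficients
    c_v = 0.3 * rng.standard_normal(M)    # synthetic noise coefficients
    Phi_y = recursive_update(Phi_y, c_x + c_v, alpha_y)
    Phi_v = recursive_update(Phi_v, c_v, alpha_v)

# Clean-speech statistics by subtraction, as described in the text
Phi_x = Phi_y - Phi_v
print(np.round(Phi_x, 3))
```

A forgetting factor closer to 1 gives a smoother but slower estimate, which suits the noise when it is more stationary than the speech.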

9.3

PERFORMANCE MEASURES

For ease of comparison, we evaluate both the time- and KLE-domain filters using the fullband output SNR, noise-reduction factor, and speech-distortion index as the performance measures. These measures are computed according to their definitions given, respectively, in (3.7), (3.8), and (3.9) by replacing the expectation by a long-time average. Note that the KLE-domain filters are designed on a subband basis. To compute the fullband performance measures for these filters, we need to have the time-domain filtered speech, residual noise, and interference signal (if any), which are constructed from the corresponding filtered KLE coefficients using the synthesis transform given in (2.8).

[Figure 9.3 plots: (a) waveform, amplitude versus time (s); (b) spectrogram, frequency (kHz) versus time (s); (c) autocorrelation coefficients, amplitude versus time lag (samples).]

Figure 9.3: The NYSE noise used in the experiments: (a) the first 10-second waveform, (b) the first 10-second spectrogram, and (c) the first 20 autocorrelation coefficients.

9.4

PERFORMANCE OF THE TIME-DOMAIN FILTERS

In this set of experiments, we study the performance of two important time-domain noise reduction filters: Wiener and tradeoff. As described earlier, we compute the noisy signal and noise correlation matrices directly from the noisy and noise signals using a short-time average. Specifically, at each time instant k, an estimate of $\mathbf{R}_y$, denoted by $\hat{\mathbf{R}}_y(k)$, is calculated from y(k) using the most recent 320 samples (a 40-ms window length). The matrix $\mathbf{R}_v$ is computed in a similar way. But noise is supposed to be relatively more stationary than speech, so we use 640 samples (an 80-ms window length) to compute the estimate of $\mathbf{R}_v$. The Wiener and tradeoff filters are then implemented by replacing $\mathbf{R}_y$ and $\mathbf{R}_v$ with $\hat{\mathbf{R}}_y(k)$ and $\hat{\mathbf{R}}_v(k)$ in (3.23) and (3.47), respectively. The input SNR for this experiment is set to 10 dB.

[Figure 9.4 plots: (a) oSNR (dB) and (b) υsd (dB), both versus filter length L (samples), with one curve each for white noise, car noise (◦), and NYSE noise (∗).]

Figure 9.4: Performance of the time-domain Wiener filter as a function of the filter length L: (a) output SNR and (b) speech-distortion index. The input SNR is 10 dB.

9.4.1

WIENER FILTER

With the above experimental setup, the key parameter that affects the performance of the Wiener filter is the filter length L. The optimal value of L depends on many factors such as the degree of autocorrelation of the desired speech signal and that of the noise. Figure 9.4 plots the performance of the Wiener filter as a function of L in three noise conditions (white Gaussian, car, and NYSE). Note that only the output SNR and speech-distortion index are plotted while the noise-reduction factor is omitted in the figure because its curve is similar to that of the output SNR and does not provide much additional information.

In the white Gaussian noise condition, it is seen that the output SNR increases while the speech-distortion index decreases with L at first. But when the value of L is larger than 20, the two measures do not change much with L. The reason can be explained as follows. When the noise is white Gaussian, the product matrix $\mathbf{Q}^T\mathbf{R}_v\mathbf{Q}$ becomes $\sigma_v^2\mathbf{I}$ and the Wiener filter given in (3.23) can be written as

$$
\mathbf{H}_{t,W} = \mathbf{Q}\boldsymbol{\Lambda}_{t,W}\mathbf{Q}^T, \tag{9.1}
$$

where

$$
\boldsymbol{\Lambda}_{t,W} = \mathrm{diag}\left[\frac{\lambda_{x,1}}{\lambda_{x,1} + \sigma_v^2}, \frac{\lambda_{x,2}}{\lambda_{x,2} + \sigma_v^2}, \ldots, \frac{\lambda_{x,L}}{\lambda_{x,L} + \sigma_v^2}\right] \tag{9.2}
$$

is a diagonal matrix, and $\lambda_{x,l}$, $l = 1, 2, \ldots, L$, are the eigenvalues of the matrix $\mathbf{R}_x$ with $\lambda_{x,1} \ge \lambda_{x,2} \ge \cdots \ge \lambda_{x,L} \ge 0$. Substituting (9.1) into (3.7) and (3.9), we obtain

$$
\mathrm{oSNR}(\mathbf{H}_{t,W}) = \frac{\displaystyle\sum_{l=1}^{L}\frac{\lambda_{x,l}^3}{\left(\lambda_{x,l} + \sigma_v^2\right)^2}}{\displaystyle\sum_{l=1}^{L}\frac{\lambda_{x,l}^2 \cdot \sigma_v^2}{\left(\lambda_{x,l} + \sigma_v^2\right)^2}}, \tag{9.3}
$$

$$
\upsilon_{\mathrm{sd}}(\mathbf{H}_{t,W}) = \frac{\displaystyle\sum_{l=1}^{L}\frac{\lambda_{x,l} \cdot \sigma_v^4}{\left(\lambda_{x,l} + \sigma_v^2\right)^2}}{\displaystyle\sum_{l=1}^{L}\lambda_{x,l}}. \tag{9.4}
$$

As discussed in Chapter 3, a speech signal is predictable in nature and can be modelled as a linear combination of a small number of (linearly independent) basis vectors. So, the positive semi-definite matrix $\mathbf{R}_x$ has only a limited number of positive eigenvalues and the rest are zero. Let us assume that the number of positive eigenvalues is $L_s$. It is easy to check from (9.3) and (9.4) that once the value of L is greater than $L_s$, further increasing L has no effect on either the output SNR or the speech-distortion index. Of course, the value of $L_s$ varies depending on the nature of the sounds in the speech signal. It is relatively small for voiced sounds and large for unvoiced sounds. On average, however, the value of $L_s$ is around 20 for an 8-kHz sampling rate [11]. That is why, in Fig. 9.4, the optimal performance of the Wiener filter in the white Gaussian noise condition is achieved when the value of L is around 20, and further increasing L does not lead to much performance improvement.

If noise is colored, the product matrix $\mathbf{Q}^T\mathbf{R}_v\mathbf{Q}$ is no longer diagonal. In this situation, the Wiener filtering matrix depends not only on the degree of the autocorrelation of the speech signal, but also on that of the noise signal. So, a larger filter length should be used in colored noise as compared to the white noise condition. From Fig. 9.4, it is seen that in both the car and NYSE noise conditions the output SNR increases with L; it increases quickly for L ≤ 80, whereas for L > 80 the improvement in the output SNR is almost negligible. Unlike the output SNR, the speech-distortion index in the car and NYSE noise conditions is not a monotonic function of L. It first increases slightly and then decreases as L increases. This, again, indicates that a larger filter length is needed if the noise is colored. We see that in both conditions, there is not much change in the speech-distortion index for L > 80. Therefore, 80 should be a sufficient value of L in the car and NYSE noise conditions.

It is also seen that when the filter length is sufficiently large (e.g., L > 40), the Wiener filter achieves the best performance in the car noise condition. By comparison, the performance in the NYSE noise condition is poorer, which is intuitively reasonable since babble noise is nonstationary and therefore more difficult to deal with than stationary noise.
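The white-noise closed forms (9.3) and (9.4) can be evaluated directly to illustrate the saturation argument. The sketch below is an idealization: the positive eigenvalues of $\mathbf{R}_x$ are invented values held fixed while zeros are appended as L grows, and the function name is hypothetical.

```python
import numpy as np


def wiener_white_noise_measures(lam_x, sigma_v2):
    """Output SNR (9.3) and speech-distortion index (9.4) of the Wiener filter
    in white noise, given the eigenvalues of Rx and the noise variance."""
    lam_x = np.asarray(lam_x, dtype=float)
    denom = (lam_x + sigma_v2) ** 2
    osnr = np.sum(lam_x ** 3 / denom) / np.sum(lam_x ** 2 * sigma_v2 / denom)
    vsd = np.sum(lam_x * sigma_v2 ** 2 / denom) / np.sum(lam_x)
    return osnr, vsd


Ls = 20                                  # number of positive eigenvalues (illustrative)
lam_pos = np.geomspace(10.0, 0.1, Ls)    # invented positive eigenvalues
sigma_v2 = 0.5                           # noise variance (illustrative)

results = {}
for L in (20, 40, 80):
    lam_x = np.concatenate([lam_pos, np.zeros(L - Ls)])   # zeros beyond Ls
    results[L] = wiener_white_noise_measures(lam_x, sigma_v2)
    osnr_db = 10 * np.log10(results[L][0])
    vsd_db = 10 * np.log10(results[L][1])
    print(f"L={L:3d}  oSNR={osnr_db:.2f} dB  vsd={vsd_db:.2f} dB")
```

Zero eigenvalues contribute nothing to any of the sums, so the printed values are identical for every L at or above $L_s$.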

9.4.2

TRADEOFF FILTER

In the tradeoff filter given in (3.47), a parameter μ is introduced to control the compromise between the amount of noise reduction and the degree of speech distortion. This experiment illustrates the impact of the value of μ on the noise reduction performance. Based on the previous experiment, we set the filter length L to 60, and the results are shown in Fig. 9.5. When μ = 0, the tradeoff filter becomes the identity matrix, which passes the noisy speech without modifying it. So, there will be neither speech distortion nor noise reduction. This can be seen in Fig. 9.5: when μ = 0, the output SNR is the same as the input SNR and the speech-distortion index is very small (its value is less than −100 dB when μ = 0, which is not displayed in the figure). When μ = 1, the tradeoff filter becomes the Wiener one. As we increase the value of μ, a higher output SNR is obtained, but at the price of more speech distortion: Fig. 9.5 shows that both the output SNR and the speech-distortion index increase with μ.

Before leaving this subsection, we want to bring the reader’s attention to a numerical issue in implementing the tradeoff filter. From (3.47), one can see that we need to compute the inverse of the sum matrix $\mathbf{R}_y + (\mu - 1)\mathbf{R}_v$. In practice, both matrices $\mathbf{R}_y$ and $\mathbf{R}_v$ are generally positive definite. However, when μ < 1, we subtract a scaled version of $\mathbf{R}_v$ from $\mathbf{R}_y$ and the resulting matrix can become singular. This problem becomes more and more serious as μ decreases from 1 to 0. In the extreme case where μ = 0, we have $\mathbf{R}_y + (\mu - 1)\mathbf{R}_v = \mathbf{R}_x$. As we discussed earlier, when the value of L is large, $\mathbf{R}_x$ is rank deficient, so its inverse does not exist. A straightforward way to circumvent this issue is through the use of a pseudo-inverse when μ < 1, which was adopted in our implementation.
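The pseudo-inverse workaround can be sketched as follows, writing the time-domain tradeoff matrix as $\mathbf{R}_x[\mathbf{R}_y + (\mu - 1)\mathbf{R}_v]^{-1}$ in agreement with (7.42). The function name, the `rcond` threshold, and the synthetic rank-deficient $\mathbf{R}_x$ are assumptions of this example and are not taken from the authors' implementation.

```python
import numpy as np


def tradeoff_time_domain(Ry, Rv, mu, rcond=1e-10):
    """Tradeoff matrix Rx [Ry + (mu - 1) Rv]^{-1}, with a pseudo-inverse for mu < 1."""
    Rx = Ry - Rv
    S = Ry + (mu - 1.0) * Rv     # equals Rx + mu * Rv
    if mu >= 1.0:
        S_inv = np.linalg.inv(S)
    else:
        S_inv = np.linalg.pinv(S, rcond=rcond)   # guards against a singular S
    return Rx @ S_inv


rng = np.random.default_rng(6)
L, rank = 8, 3
B = rng.standard_normal((L, rank))
Rx = B @ B.T                 # rank-deficient speech correlation matrix (synthetic)
Rv = 0.1 * np.eye(L)         # white noise (synthetic)
Ry = Rx + Rv

H_wiener = tradeoff_time_domain(Ry, Rv, 1.0)
H_half = tradeoff_time_domain(Ry, Rv, 0.5)
H_zero = tradeoff_time_domain(Ry, Rv, 0.0)

# With an exactly rank-deficient Rx, the mu = 0 result is the orthogonal projector
# onto the range of Rx, which leaves the speech component untouched.
print(np.linalg.matrix_rank(H_zero, tol=1e-8))
```

When the estimate of $\mathbf{R}_x$ happens to be full rank, the pseudo-inverse coincides with the ordinary inverse and the μ = 0 filter is the identity matrix.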

9.5

PERFORMANCE OF THE KLE-DOMAIN FILTERS WITH MODEL 1

In this set of experiments, we study the performance of the KLE domain filters with Model 1. Again, we choose to illustrate the Wiener and tradeoff filters. The matrices Ry and Rv are estimated in the same way as in the previous experiments, and the input SNR is set to 10 dB.

[Figure 9.5 plots: (a) oSNR (dB) and (b) υsd (dB), both versus the tradeoff parameter μ.]

Figure 9.5: Performance of the time-domain tradeoff filter as a function of the parameter μ: (a) output SNR and (b) speech-distortion index. The input SNR is 10 dB and L = 60.

9.5.1

KLE-DOMAIN WIENER FILTER

The Wiener filter is implemented according to (5.43) by replacing the matrices $\mathbf{R}_y$ and $\mathbf{R}_v$ with the corresponding estimates. The performance of this Wiener filter as a function of L is sketched in Fig. 9.6. Comparing Figs. 9.6 and 9.4, one can see that the relationship between the noise reduction performance and the filter length L of this Wiener filter is similar to that of the time-domain Wiener filter. We notice that the time- and KLE-domain Wiener filters have the same performance in the white Gaussian noise condition. This is due to the fact that the product matrix $\mathbf{Q}^T\mathbf{R}_v\mathbf{Q}$ becomes a diagonal one and, therefore, the two filters are identical. However, in the car and NYSE noise conditions, the KLE-domain Wiener filter has a slightly lower output SNR and a higher speech-distortion index. The underlying reason will be explained when we discuss the experiments for filters with Model 3.

[Figure 9.6 plots: (a) oSNR (dB) and (b) υsd (dB), both versus filter length L (samples).]

Figure 9.6: Performance of the KLE-domain Wiener filter with Model 1 as a function of the filter length L: (a) output SNR and (b) speech-distortion index. The input SNR is 10 dB.

9.5.2

KLE-DOMAIN TRADEOFF FILTER

The tradeoff filter with Model 1 can be implemented either according to (5.47) or based on (5.53). Here, we choose to use (5.53). Figure 9.7 plots the output SNR and speech-distortion index, both as a function of μ. Comparing Fig. 9.7 with Fig. 9.5, one can see that the KLE- and time-domain tradeoff filters have the same performance in white Gaussian noise. This is because the two filters are identical in this condition. However, for the other noises, the performance of the KLE-domain tradeoff filter is inferior to that of its time-domain counterpart for the same value of μ. The reason will be explained in the next section when we discuss the filters with Model 3.

[Figure 9.7 plots: (a) oSNR (dB) and (b) υsd (dB), both versus the tradeoff parameter μ.]

Figure 9.7: Performance of the KLE-domain tradeoff filter with Model 1 as a function of the parameter μ: (a) output SNR and (b) speech-distortion index. The input SNR is 10 dB and L = 60.

Before discussing the filters with Model 3, we want to point out that the KLE-domain filters with Model 1 are easier and more efficient to implement than the time-domain filters, since the matrix that needs to be inverted in Model 1 is diagonal.
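The following sketch contrasts a per-subband Wiener gain with the full time-domain Wiener matrix. It assumes that the Model 1 gain takes the usual scalar form $1 - \phi_{c_v,l}/\lambda_l$; this is an assumption of the example, since (5.43) is not reproduced in this chapter, and the matrices are synthetic.

```python
import numpy as np

rng = np.random.default_rng(7)
L = 6


def random_spd(n):
    """Random symmetric positive-definite matrix (synthetic correlation matrix)."""
    a = rng.standard_normal((n, 3 * n))
    return a @ a.T / (3 * n)


def model1_wiener_time_domain(Rx, Rv):
    """Q diag(g) Q^T with per-subband gains g_l = 1 - phi_cv_l / lambda_l
    (assumed Model 1 form); only elementwise divisions are needed."""
    lam, Q = np.linalg.eigh(Rx + Rv)
    phi_cv = np.einsum("il,ij,jl->l", Q, Rv, Q)   # q_l^T Rv q_l for each subband
    gains = 1.0 - phi_cv / lam
    return Q @ np.diag(gains) @ Q.T


def full_wiener(Rx, Rv):
    """Classical time-domain Wiener matrix Rx Ry^{-1} (full matrix inverse)."""
    return Rx @ np.linalg.inv(Rx + Rv)


Rx = random_spd(L)
Rv_white = 0.2 * np.eye(L)      # white noise
Rv_colored = random_spd(L)      # colored noise (synthetic)

same_in_white = np.allclose(model1_wiener_time_domain(Rx, Rv_white),
                            full_wiener(Rx, Rv_white))
same_in_colored = np.allclose(model1_wiener_time_domain(Rx, Rv_colored),
                              full_wiener(Rx, Rv_colored))
print(same_in_white, same_in_colored)
```

The two matrices coincide in white noise and differ in colored noise, which matches the behavior reported for Figs. 9.6 and 9.7.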


9.6

PERFORMANCE OF THE KLE-DOMAIN FILTERS WITH MODEL 3


Before talking about the performance of the filters with Model 2, let us first discuss the experiments for the filters with Model 3. One may ask why we need the filters with Model 3, given that the KL transform diagonalizes the noisy covariance matrix and, therefore, the KLE coefficients in different subbands should be uncorrelated. There are several reasons for using the interband information.

In order to estimate the KL transform, we need to obtain an estimate of the noisy covariance matrix $\mathbf{R}_y$. Typically, such an estimate is computed from the noisy signal by using a short-time average (or a recursive method [13]) to approximate the expectation operation. The window length (number of samples) used in the short-time average plays a vital role in the accuracy of the estimated covariance matrix. If the window length is too short, the estimation variance would be very large, which will eventually be translated into less SNR improvement and more speech distortion. In addition, the covariance matrix estimate may not be full rank. To reduce the variance of the $\mathbf{R}_y$ estimate and make it invertible, we need to use a large window length. But with a large window length, the covariance estimate may not be able to follow the time-varying statistics of the speech. As a result, the KL transform that diagonalizes the estimate of the covariance matrix may not diagonalize the true covariance matrix. One easy way to verify this is through examining the cross-correlation between KLE coefficients from different subbands, which will be left to the reader’s investigation.

Here, we take a different approach to illustrate the existence of interband correlation: we compare the noise reduction performance between Wiener filters with Model 1 and Model 3 by varying the window length N. The results in the car noise condition are plotted in Fig. 9.8. It is seen that the output SNR of the Wiener filter with Model 1 decreases as the window length N increases. If a long-time average is used to estimate the noisy covariance matrix, the output SNR is only 1 dB higher than the input SNR. In comparison, the output SNR for the Wiener filter with Model 3 does not decrease much with N. One can easily notice that the difference in output SNR between the Wiener filters with Model 3 and Model 1 increases with the window length N. We also see that the Wiener filter with Model 3 has a smaller speech-distortion index. All these points indicate that there exists cross-correlation between the KLE coefficients from different subbands and that the degree of this correlation increases as we use a longer window in the short-time average to estimate the noisy covariance matrix. Therefore, it is necessary to use the interband information in developing noise reduction filters, particularly when we use a long window in the short-time average.

[Figure 9.8 plots: (a) oSNR (dB) and (b) υsd (dB), both versus window length N (seconds), up to the “∞” case, with one curve for the Wiener filter with Model 3 and one for the Wiener filter with Model 1 (◦).]

Figure 9.8: Performance of the Wiener filters with Model 1 and Model 3 as a function of the window length N in the car noise condition: (a) output SNR and (b) speech-distortion index. The input SNR is 10 dB and L = 60. In the “∞” case, the window length N is equal to the overall length of the y(k) signal. So, the short-time average becomes a long-time average in this situation.

While it diagonalizes the noisy covariance matrix, the KL transform may not diagonalize the noise and speech covariance matrices. This is another reason to use the interband information. The only exceptional case is when noise is white. In this situation, the noise covariance is a diagonal matrix, and the KL transform would simultaneously diagonalize both the speech and noisy covariance matrices.

In Chapter 7, we have shown that the KLE-domain noise reduction filters with Model 3 are identical to their counterparts in the time domain. This is intuitively obvious since the time-domain filters are derived using the fullband signal, which is equivalent to using all the self- and cross-band information in the KLE domain. The equivalence between the time-domain filters and the KLE-domain filters with Model 3, on the one hand, naturally explains why the time-domain Wiener and tradeoff filters have better performance than their counterparts in the KLE domain with Model 1 in colored noise, and on the other hand, demonstrates the motivation for developing filters with Model 3. But filters with Model 1 are simple, may be good enough in most applications, and are equivalent to the optimal gains in the frequency domain.


9.7

PERFORMANCE OF THE KLE-DOMAIN FILTERS WITH MODEL 2


In the previous experiments, we applied a short-time average to estimate the noisy covariance matrix. The window length of this average should be selected so that the covariance matrix estimate can follow the nonstationarity of the desired speech signal, which is needed to achieve noise reduction. Alternatively, the nonstationarity of speech can be exploited in the KLE domain, which has led to the development of the filters with Model 2. Unlike Model 1 and Model 3, where each frame may have a different transformation matrix $\mathbf{Q}$, algorithms with Model 2 assume that all frames share the same matrix $\mathbf{Q}$; otherwise, filtering the KLE coefficients across different frames would not make much sense. With this requirement, estimating $\mathbf{Q}$ is relatively easy: we can use a long-term sample average to compute the correlation matrix $\mathbf{R}_y$ and then obtain the KL transform matrix $\mathbf{Q}$ from the eigenvalue decomposition of $\mathbf{R}_y$.

In the course of our study, we found that the estimation accuracy of $\mathbf{Q}$ plays a less important role in noise reduction performance for the filters with Model 2 than it does for the filters with Model 1 and Model 3. We can even replace $\mathbf{Q}$ with either the Fourier matrix $\mathbf{F}$ used in the DFT or the coefficient matrix of the discrete cosine transform (DCT) without degrading the noise reduction performance of the filters with Model 2. In other words, the idea behind the filters with Model 2 can also be applied to frequency-domain approaches. However, to follow the theoretical development in Chapter 6 strictly, we still use the transformation matrix $\mathbf{Q}$ in our experiments, with the correlation matrix $\mathbf{R}_y$ estimated using a long-term average. This matrix $\mathbf{Q}$ is then applied to each frame of the noisy and noise signals to compute the KLE coefficients $c_{y,l}(m)$ and $c_{v,l}(m)$.

Constructing the filters with Model 2 requires the subband correlation matrices $\mathbf{\Phi}_{\mathbf{c}_{y,l}}$ and $\mathbf{\Phi}_{\mathbf{c}_{v,l}}$. Again, we can use a short-time average to approximate the mathematical expectation when computing these two matrices, but we found that it is easier to optimize the performance with a recursive method as in [13]. Specifically, in this experiment, an estimate of $\mathbf{\Phi}_{\mathbf{c}_{y,l}}$ at the $m$th frame is computed using the following recursion:

\[ \hat{\mathbf{\Phi}}_{\mathbf{c}_{y,l}}(m) = \alpha_{c_{y,l}} \hat{\mathbf{\Phi}}_{\mathbf{c}_{y,l}}(m-1) + (1 - \alpha_{c_{y,l}}) \mathbf{c}_{y,l}(m) \mathbf{c}_{y,l}^T(m), \qquad (9.5) \]

where $\alpha_{c_{y,l}}$ is a forgetting factor that controls the influence of the previous data samples on the current estimate of the noisy correlation matrix. The noise covariance matrix $\mathbf{\Phi}_{\mathbf{c}_{v,l}}$ is estimated in a similar manner but with a different forgetting factor $\alpha_{c_{v,l}}$. With the estimated covariance matrices $\hat{\mathbf{\Phi}}_{\mathbf{c}_{y,l}}(m)$ and $\hat{\mathbf{\Phi}}_{\mathbf{c}_{v,l}}(m)$, an estimate of $\mathbf{\Phi}_{\mathbf{c}_{x,l}}$ at time $m$ is computed as $\hat{\mathbf{\Phi}}_{\mathbf{c}_{x,l}}(m) = \hat{\mathbf{\Phi}}_{\mathbf{c}_{y,l}}(m) - \hat{\mathbf{\Phi}}_{\mathbf{c}_{v,l}}(m)$, and the interframe correlation vector at time $m$, i.e., $\hat{\boldsymbol{\gamma}}_{\mathbf{c}_{x,l}}(m)$, is taken as the first column of $\hat{\mathbf{\Phi}}_{\mathbf{c}_{x,l}}(m)$ normalized with its first element. Substituting $\hat{\mathbf{\Phi}}_{\mathbf{c}_{y,l}}(m)$, $\hat{\mathbf{\Phi}}_{\mathbf{c}_{v,l}}(m)$, and $\hat{\boldsymbol{\gamma}}_{\mathbf{c}_{x,l}}(m)$ into (6.40) and (6.56), we implemented the Wiener and MVDR filters with Model 2.

Many parameters affect the performance of the filters with Model 2, such as the transformation length $L$, the filter length $M$, and the forgetting factors $\alpha_{c_{y,l}}$ and $\alpha_{c_{v,l}}$. These parameters can be tuned experimentally, step by step, by varying one parameter at a time while fixing the others. Let us set $L$, $\alpha_{c_{y,l}}$, and $\alpha_{c_{v,l}}$ to 20, 0.8, and 0.9, respectively, following the results in [13], and study the impact of the filter length $M$ on the output SNR and speech-distortion index. The results for both the Wiener and MVDR filters in the white Gaussian noise condition are shown in Fig. 9.9.

[Figure 9.9: two panels plotted against the filter length M (samples): (a) output SNR, oSNR (dB), for the Wiener and MVDR filters; (b) speech-distortion index, υsd (dB), for the Wiener filter.]

Figure 9.9: Performance of the Wiener and MVDR filters with Model 2 as a function of the filter length M in the white Gaussian noise condition: (a) output SNR and (b) speech-distortion index. The input SNR is 10 dB and L = 20. Note that the speech-distortion index for the MVDR filter is smaller than −100 dB and it is not displayed.

It is seen from Fig. 9.9 that the output SNR for both the Wiener and MVDR filters first increases and then decreases as the filter length $M$ increases. With a properly selected value of $M$, the Model 2 Wiener filter achieves better performance than the Model 1 Wiener filter, which justifies the use of interframe information. We also see that for $M > 8$, further increases in $M$ cause some performance degradation. A probable explanation is as follows. With the same forgetting factor, the matrix $\hat{\mathbf{\Phi}}_{\mathbf{c}_{y,l}}(m)$ becomes less well conditioned for a larger value of $M$, and inverting this matrix may cause numerical problems. One way to overcome this problem is to use either a larger forgetting factor or more regularization when computing the inverse of $\hat{\mathbf{\Phi}}_{\mathbf{c}_{y,l}}(m)$. We leave this to the reader's own exploration.

It is interesting that Model 2 allows us to derive an MVDR filter. When $M = 1$, there is neither noise reduction nor speech distortion. But as we increase $M$ beyond 1, we can achieve some noise reduction without adding speech distortion. Note that the speech-distortion index is smaller than −100 dB for the MVDR filter, which is why it is not displayed in Fig. 9.9. We also notice some difference between the results in Fig. 9.9 and those in [13]. This is because the whole filtered speech is treated as the desired speech component in [13], while in this book we have divided the filtered speech into a desired speech component and an interference component. Treating the interference component as noise after noise reduction is more reasonable, since it is uncorrelated with the desired speech samples that we want to estimate.
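The recursion (9.5) and the two steps that follow it (subtracting the noise estimate, then normalizing the first column) can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: it assumes that the vector in (9.5) holds the M most recent KLE coefficients of subband l, all function names are hypothetical, the input vectors are made-up numbers, and the diagonal-loading helper is just one common form of regularization that the text does not specify. No guard is included for the case where the first element of the difference matrix is zero or negative.

```python
def recursive_update(phi_prev, c, alpha):
    """One step of (9.5): alpha * phi_prev + (1 - alpha) * c c^T."""
    size = len(c)
    return [
        [alpha * phi_prev[i][j] + (1.0 - alpha) * c[i] * c[j] for j in range(size)]
        for i in range(size)
    ]


def speech_correlation(phi_y, phi_v):
    """Speech matrix estimate as the difference of the noisy and noise estimates."""
    return [[a - b for a, b in zip(row_y, row_v)] for row_y, row_v in zip(phi_y, phi_v)]


def interframe_vector(phi_x):
    """First column of phi_x, normalized by its first element."""
    first_column = [row[0] for row in phi_x]
    return [value / first_column[0] for value in first_column]


def diagonal_loading(phi, delta):
    """Hypothetical regularizer: add delta to the diagonal before any inversion."""
    size = len(phi)
    return [[phi[i][j] + (delta if i == j else 0.0) for j in range(size)] for i in range(size)]


# Toy run with M = 2 and the forgetting factors quoted in the text (0.8 and 0.9).
phi_y = [[0.0, 0.0], [0.0, 0.0]]
phi_v = [[0.0, 0.0], [0.0, 0.0]]
noisy_vectors = [[1.0, 0.5], [0.8, 1.0], [1.2, 0.8]]      # made-up noisy coefficient vectors
noise_vectors = [[0.1, -0.2], [-0.1, 0.1], [0.2, -0.1]]   # made-up noise coefficient vectors
for c_y, c_v in zip(noisy_vectors, noise_vectors):
    phi_y = recursive_update(phi_y, c_y, 0.8)
    phi_v = recursive_update(phi_v, c_v, 0.9)

phi_x = speech_correlation(phi_y, phi_v)
gamma_x = interframe_vector(phi_x)
print(gamma_x)
```

A forgetting factor closer to 1 averages over more frames, which is consistent with the remark above that a larger forgetting factor may help when the matrix becomes poorly conditioned for large M.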

9.8

KLE-DOMAIN FILTERS WITH MODEL 4

Similar to the filters with Model 2, the filters with Model 4 require the same transformation matrix $\mathbf{Q}$ across different frames. Within this framework, we can use a long-term average to estimate the correlation matrix $\mathbf{R}_y$ and thereby obtain the matrix $\mathbf{Q}$. We can also replace $\mathbf{Q}$ with either the Fourier matrix $\mathbf{F}$ or the DCT matrix. The previous experiments with Model 2 have already shown that taking the interframe information into account can help improve noise reduction performance. We have also demonstrated with Model 3 that the interband information is needed to cope better with colored noise. Together, these results justify the development of the filters with Model 4. But we leave the performance study of the filters with Model 4 to the reader's own investigation.
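As an illustration of replacing the data-dependent matrix with a fixed transform shared by all frames, the sketch below builds a DCT matrix and uses it for analysis and synthesis of one frame. The text does not say which DCT variant or normalization was used, so the orthonormal DCT-II is an assumption here, as are the function names and the convention that the basis vectors are stored as columns and each coefficient is the inner product of a column with the frame.

```python
import math


def dct_matrix(size):
    """size x size orthonormal DCT-II matrix; column l is the l-th basis vector."""
    matrix = [[0.0] * size for _ in range(size)]
    for l in range(size):        # basis (column) index
        scale = math.sqrt(1.0 / size) if l == 0 else math.sqrt(2.0 / size)
        for n in range(size):    # sample (row) index
            matrix[n][l] = scale * math.cos(math.pi * (n + 0.5) * l / size)
    return matrix


def analyze(matrix, frame):
    """Coefficient l is the inner product of column l with the frame."""
    size = len(frame)
    return [sum(matrix[n][l] * frame[n] for n in range(size)) for l in range(size)]


def synthesize(matrix, coeffs):
    """Rebuild the frame as a weighted sum of the columns."""
    size = len(coeffs)
    return [sum(matrix[n][l] * coeffs[l] for l in range(size)) for n in range(size)]


L = 20  # transformation length used in the experiment above
Q_dct = dct_matrix(L)
frame = [math.sin(0.3 * n) for n in range(L)]   # made-up test frame
coeffs = analyze(Q_dct, frame)
rebuilt = synthesize(Q_dct, coeffs)
print(max(abs(a - b) for a, b in zip(frame, rebuilt)))
```

Because the matrix is orthonormal, synthesis recovers the frame up to rounding error, so the same matrix can be applied to every frame before filtering the coefficients across frames.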


Bibliography

[1] J. K. Baker, “The dragon system–An overview,” IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-23, pp. 24–29, Feb. 1975. DOI: 10.1109/TASSP.1975.1162650

[2] L. E. Baum and T. Petrie, “Statistical inference for probabilistic functions of finite state Markov chains,” Ann. Math. Stat., vol. 37, pp. 1554–1563, 1966. DOI: 10.1214/aoms/1177699147

[3] J. Benesty and T. Gaensler, “New insights into the RLS algorithm,” EURASIP J. Applied Signal Process., vol. 2004, pp. 331–339, Mar. 2004. DOI: 10.1155/S1110865704310188

[4] J. Benesty, S. Makino, and J. Chen, Eds., Speech Enhancement. Berlin, Germany: Springer-Verlag, 2005.

[5] J. Benesty, J. Chen, Y. Huang, and S. Doclo, “Study of the Wiener filter for noise reduction,” in Speech Enhancement, J. Benesty, S. Makino, and J. Chen, Eds., Berlin, Germany: Springer-Verlag, 2005, Chapter 2, pp. 9–41.

[6] J. Benesty, J. Chen, and Y. Huang, Microphone Array Signal Processing. Berlin, Germany: Springer-Verlag, 2008.

[7] J. Benesty, J. Chen, Y. Huang, and I. Cohen, Noise Reduction in Speech Processing. Berlin, Germany: Springer-Verlag, 2009.

[8] J. Benesty, J. Chen, and Y. Huang, “On noise reduction in the Karhunen-Loeve expansion domain,” in Proc. IEEE ICASSP, 2009, pp. 25–28. DOI: 10.1109/ICASSP.2009.4959511

[9] S. F. Boll, “Suppression of acoustic noise in speech using spectral subtraction,” IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-27, pp. 113–120, Apr. 1979. DOI: 10.1109/TASSP.1979.1163209

[10] J. Capon, “High resolution frequency-wavenumber spectrum analysis,” Proc. IEEE, vol. 57, pp. 1408–1418, Aug. 1969. DOI: 10.1109/PROC.1969.7278

[11] J. Chen, J. Benesty, Y. Huang, and S. Doclo, “New insights into the noise reduction Wiener filter,” IEEE Trans. Audio, Speech, Language Process., vol. 14, pp. 1218–1234, July 2006. DOI: 10.1109/TSA.2005.860851


[12] J. Chen, J. Benesty, Y. Huang, and E. J. Diethorn, “Fundamentals of noise reduction,” in Springer Handbook of Speech Processing, J. Benesty, M. M. Sondhi, and Y. Huang, Eds., Berlin, Germany: Springer-Verlag, 2007, Chapter 43, Part H, pp. 843–872.

[13] J. Chen, J. Benesty, and Y. Huang, “Study of the noise-reduction problem in the Karhunen-Loeve expansion domain,” IEEE Trans. Audio, Speech, Language Process., vol. 17, pp. 787–802, May 2009. DOI: 10.1109/TASL.2009.2014793

[14] M. Dendrinos, S. Bakamidis, and G. Carayannis, “Speech enhancement from noise: a regenerative approach,” Speech Commun., vol. 10, pp. 45–57, Feb. 1991. DOI: 10.1016/0167-6393(91)90027-Q

[15] Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator,” IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-32, pp. 1109–1121, Dec. 1984. DOI: 10.1109/TASSP.1984.1164453

[16] Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean-square error log-spectral amplitude estimator,” IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-33, pp. 443–445, Apr. 1985. DOI: 10.1109/TASSP.1985.1164550

[17] Y. Ephraim, D. Malah, and B.-H. Juang, “On the application of hidden Markov models for enhancing noisy speech,” IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-37, pp. 1846–1856, Dec. 1989. DOI: 10.1109/29.45532

[18] Y. Ephraim, “A Bayesian estimation approach for speech enhancement using hidden Markov models,” IEEE Trans. Signal Process., vol. 40, pp. 725–735, Apr. 1992. DOI: 10.1109/78.127947

[19] Y. Ephraim, “Statistical-model-based speech enhancement systems,” Proc. IEEE, vol. 80, pp. 1526–1555, Oct. 1992. DOI: 10.1109/5.168664

[20] Y. Ephraim and H. L. Van Trees, “A signal subspace approach for speech enhancement,” IEEE Trans. Speech, Audio Process., vol. 3, pp. 251–266, July 1995. DOI: 10.1109/89.397090

[21] K. Fukunaga, Introduction to Statistical Pattern Recognition. San Diego, CA: Academic Press, 1990.

[22] S. Gannot, D. Burshtein, and E. Weinstein, “Iterative and sequential Kalman filter-based speech enhancement algorithms,” IEEE Trans. Speech, Audio Process., vol. 6, pp. 373–385, July 1998. DOI: 10.1109/89.701367

[23] Z. Goh, K.-C. Tan, and B. T. G. Tan, “Kalman-filtering speech enhancement method based on a voiced-unvoiced speech model,” IEEE Trans. Speech, Audio Process., vol. 7, pp. 510–524, Sept. 1999. DOI: 10.1109/89.784103


[24] G. H. Golub and C. F. Van Loan, Matrix Computations. Baltimore, MD: The Johns Hopkins University Press, 1996.

[25] S. Haykin, Adaptive Filter Theory. Fourth Edition, Upper Saddle River, NJ: Prentice-Hall, 2002.

[26] K. Hermus, P. Wambacq, and H. Van hamme, “A review of signal subspace speech enhancement and its application to noise robust speech recognition,” EURASIP J. Advances Signal Process., vol. 2007, Article ID 45821, 15 pages, 2007. DOI: 10.1155/2007/45821

[27] Y. Hu and P. C. Loizou, “A subspace approach for enhancing speech corrupted by colored noise,” IEEE Signal Process. Lett., vol. 9, pp. 204–206, July 2002. DOI: 10.1109/LSP.2002.801721

[28] Y. Hu and P. C. Loizou, “A subspace approach for enhancing speech corrupted by colored noise,” in Proc. IEEE ICASSP, 2002, pp. I-573–I-576. DOI: 10.1109/ICASSP.2002.1005804

[29] Y. Hu and P. C. Loizou, “A generalized subspace approach for enhancing speech corrupted by colored noise,” IEEE Trans. Speech Audio Process., vol. 11, pp. 334–341, July 2003. DOI: 10.1109/TSA.2003.814458

[30] Y. Huang, J. Benesty, and J. Chen, Acoustic MIMO Signal Processing. Berlin, Germany: Springer-Verlag, 2006.

[31] F. Jabloun and B. Champagne, “Signal subspace techniques for speech enhancement,” in Speech Enhancement, J. Benesty, S. Makino, and J. Chen, Eds., Berlin, Germany: Springer-Verlag, 2005, Chapter 7, pp. 135–159.

[32] F. Jelinek, “Continuous speech recognition by statistical methods,” Proc. IEEE, vol. 64, pp. 532–536, Apr. 1976. DOI: 10.1109/PROC.1976.10159

[33] S. H. Jensen, P. C. Hansen, S. D. Hansen, and J. A. Sorensen, “Reduction of broad-band noise in speech by truncated QSVD,” IEEE Trans. Speech Audio Process., vol. 3, pp. 439–448, Nov. 1995. DOI: 10.1109/89.482211

[34] S. Kay, “Some results in linear interpolation theory,” IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-31, pp. 746–749, June 1983. DOI: 10.1109/TASSP.1983.1164088

[35] B. Koo and J. D. Gibson, “Filtering of colored noise for speech enhancement and coding,” in Proc. IEEE ICASSP, 1989, pp. 345–352. DOI: 10.1109/78.91144

[36] R. T. Lacoss, “Data adaptive spectral analysis methods,” Geophysics, vol. 36, pp. 661–675, Aug. 1971. DOI: 10.1190/1.1440203


[37] B. Lee, K. Y. Lee, and S. Ann, “An EM-based approach for parameter enhancement with an application to speech signals,” Signal Process., vol. 46, pp. 1–14, Sept. 1995. DOI: 10.1016/0165-1684(95)00068-O

[38] C. Li and S. Vang Andersen, “Inter-frequency dependency in MMSE speech enhancement,” in Proc. NORSIG, 2004, pp. 200–203. DOI: 10.1109/NORSIG.2004.250161

[39] J. S. Lim and A. V. Oppenheim, “Enhancement and bandwidth compression of noisy speech,” Proc. IEEE, vol. 67, pp. 1586–1604, Dec. 1979. DOI: 10.1109/PROC.1979.11540

[40] P. Loizou, Speech Enhancement: Theory and Practice. Boca Raton, FL: CRC Press, 2007.

[41] R. J. McAulay and M. L. Malpass, “Speech enhancement using a soft-decision noise suppression filter,” IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-28, pp. 137–145, Apr. 1980. DOI: 10.1109/TASSP.1980.1163394

[42] M. Niedźwiecki and K. Cisowski, “Adaptive scheme for elimination of broadband noise and impulsive disturbances from AR and ARMA signals,” IEEE Trans. Signal Process., vol. 44, pp. 528–537, Mar. 1996. DOI: 10.1109/78.489026

[43] K. K. Paliwal and A. Basu, “A speech enhancement method based on Kalman filtering,” in Proc. IEEE ICASSP, 1987, pp. 177–180. DOI: 10.1109/ICASSP.1987.1169756

[44] L. R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proc. IEEE, vol. 77, pp. 257–286, Feb. 1989. DOI: 10.1109/5.18626

[45] L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition. Englewood Cliffs, NJ: Prentice-Hall, 1993.

[46] A. Rezayee and S. Gazor, “An adaptive KLT approach for speech enhancement,” IEEE Trans. Speech Audio Process., vol. 9, pp. 87–95, Feb. 2001. DOI: 10.1109/89.902276

[47] H. Sameti, H. Sheikhzadeh, L. Deng, and R. L. Brennan, “HMM-based strategies for enhancement of speech signals embedded in nonstationary noise,” IEEE Trans. Speech, Audio Process., vol. 6, pp. 445–455, Sept. 1998. DOI: 10.1109/89.709670

[48] M. R. Schroeder, “Apparatus for suppressing noise and distortion in communication signals,” U.S. Patent No. 3,180,936, filed Dec. 1, 1960, issued Apr. 27, 1965.

[49] M. Souden, J. Benesty, and S. Affes, “On the global output SNR of the parameterized frequency-domain multichannel noise reduction Wiener filter,” IEEE Signal Process. Lett., vol. 17, pp. 425–428, May 2010. DOI: 10.1109/LSP.2010.2042520

[50] P. Vary and R. Martin, Digital Speech Transmission: Enhancement, Coding and Error Concealment. Chichester, England: John Wiley & Sons Ltd, 2006.


[51] P. J. Wolfe and S. J. Godsill, “Simple alternatives to the Ephraim and Malah suppression rule for speech enhancement,” in Proc. IEEE ICASSP, 2001, pp. 496–499. DOI: 10.1109/SSP.2001.955331


Authors’ Biographies

JACOB BENESTY

Jacob Benesty was born in 1963. He received a Master's degree in microwaves from Pierre & Marie Curie University, France, in 1987, and a Ph.D. degree in control and signal processing from Orsay University, France, in April 1991. During his Ph.D. program (from November 1989 to April 1991), he worked on adaptive filters and fast algorithms at the Centre National d’Etudes des Telecommunications (CNET), Paris, France. From January 1994 to July 1995, he worked at Telecom Paris University on multichannel adaptive filters and acoustic echo cancellation. From October 1995 to May 2003, he was first a Consultant and then a Member of the Technical Staff at Bell Laboratories, Murray Hill, NJ, USA. In May 2003, he joined INRS-EMT, University of Quebec, in Montreal, Quebec, Canada, as a Professor. His research interests are in signal processing, acoustic signal processing, and multimedia communications. Dr. Benesty received the 2001 and 2008 Best Paper Awards from the IEEE Signal Processing Society. In 2010, he received the Gheorghe Cartianu Award from the Romanian Academy. He was a member of the editorial board of the EURASIP Journal on Applied Signal Processing, a member of the IEEE Audio & Electroacoustics Technical Committee, the co-chair of the 1999 International Workshop on Acoustic Echo and Noise Control (IWAENC), and the general co-chair of the 2009 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). Dr. Benesty co-authored and co-edited many books in the area of acoustic signal processing.

JINGDONG CHEN

Jingdong Chen received B.S. and M.S. degrees in electrical engineering from the Northwestern Polytechnic University, Xi'an, China, in 1993 and 1995, respectively, and the Ph.D. degree in pattern recognition and intelligence control from the Chinese Academy of Sciences, Beijing, in 1998. From 1998 to 1999, he was with ATR Interpreting Telecommunications Research Laboratories, Kyoto, Japan, where he conducted research on speech synthesis, speech analysis, and objective measurements for evaluating speech synthesis. He then joined Griffith University, Brisbane, Australia, as a Research Fellow, where he engaged in research in robust speech recognition and signal processing. From 2000 to 2001, he worked at ATR Spoken Language Translation Research Laboratories on robust speech recognition and speech enhancement. From 2001 to 2009, he was a Member of Technical Staff at Bell Laboratories, Murray Hill, New Jersey, working on acoustic signal processing for telecommunications. He is currently serving as the Chief Scientist of WeVoice Inc. in New Jersey. His research interests include adaptive signal processing, speech enhancement, adaptive noise/echo cancellation, microphone array signal processing, signal separation, and source localization. Dr. Chen co-authored and co-edited several books in the area of acoustic and speech signal processing. He is currently an Associate Editor of the IEEE Transactions on Audio, Speech, and Language Processing, a member of the IEEE Audio and Electroacoustics Technical Committee, and a member of the editorial board of the Open Signal Processing Journal. He helped organize the 2005 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), and was the technical Co-Chair of the 2009 WASPAA. Dr. Chen received the 2008 Best Paper Award from the IEEE Signal Processing Society, the 1998-1999 Japan Trust International Research Grant from the Japan Key Technology Center, the Young Author Best Paper Award from the 5th National Conference on Man-Machine Speech Communications, and the 1998 President’s Award from the Chinese Academy of Sciences.

YITENG HUANG

Yiteng Huang received his B.S. degree from Tsinghua University, Beijing, China, in 1994 and the M.S. and Ph.D. degrees from the Georgia Institute of Technology (Georgia Tech), Atlanta, in 1998 and 2001, respectively, all in electrical and computer engineering. From March 2001 to January 2008, he was a Member of Technical Staff at Bell Laboratories, Murray Hill, NJ. In January 2008, he founded WeVoice, Inc., in Bridgewater, New Jersey, and served as its CTO. His current research interests are in acoustic signal processing, multimedia communications, and wireless sensor networks. Dr. Huang served as an Associate Editor for the EURASIP Journal on Applied Signal Processing from 2004 to 2008 and for the IEEE Signal Processing Letters from 2002 to 2005. He served as a technical Co-Chair of the 2005 Joint Workshop on Hands-Free Speech Communication and Microphone Array and the 2009 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics. He is a coeditor/coauthor of the books Noise Reduction in Speech Processing (Springer-Verlag, 2009), Microphone Array Signal Processing (Springer-Verlag, 2008), Springer Handbook of Speech Processing (Springer-Verlag, 2007), Acoustic MIMO Signal Processing (Springer-Verlag, 2006), Audio Signal Processing for Next-Generation Multimedia Communication Systems (Kluwer, 2004), and Adaptive Signal Processing: Applications to Real-World Problems (Springer-Verlag, 2003). He received the 2008 Best Paper Award and the 2002 Young Author Best Paper Award from the IEEE Signal Processing Society, the 2000-2001 Outstanding Graduate Teaching Assistant Award from the School of Electrical and Computer Engineering, Georgia Tech, the 2000 Outstanding Research Award from the Center of Signal and Image Processing, Georgia Tech, and the 1997-1998 Colonel Oscar P. Cleaver Outstanding Graduate Student Award from the School of Electrical and Computer Engineering, Georgia Tech.

