Signals and Communication Technology
For further volumes: http://www.springer.com/series/4748
Qi (Peter) Li
Speaker Authentication
Dr. Qi (Peter) Li
Li Creative Technologies (LcT), Inc.
30 A Vreeland Road, Suite 130
Florham Park, NJ 07932
USA
email: [email protected]

ISSN 1860-4862
ISBN 978-3-642-23730-0
eISBN 978-3-642-23731-7
DOI 10.1007/978-3-642-23731-7
Springer Heidelberg Dordrecht London New York

Library of Congress Control Number: 2011939406

© Springer-Verlag Berlin Heidelberg 2012

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.

The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Cover design: eStudio Calamar, Berlin/Figueres

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)
To my parents Yuan-Lin Shen and Yan-Bin Li
Preface
My research on speaker authentication started in 1995 when I was an intern at Bell Laboratories, Murray Hill, New Jersey, USA, while working on my Ph.D. dissertation. Later, I was hired by Bell Labs as a Member of Technical Staff, which gave me the opportunity to continue my research on speaker authentication with my Bell Labs colleagues. In 2002, I established Li Creative Technologies, Inc. (LcT), located in Florham Park, New Jersey, where I am continuing my research in speaker authentication with my LcT colleagues. Recently, when I looked at my publications from the last fifteen years, I found that my research has covered all the major research topics in speaker authentication: from front end to back end; from endpoint detection to decoding; from feature extraction to discriminative training; from speaker recognition to verbal information verification. This has motivated me to put my research results together into a book in order to share my experience with my colleagues in the field.

This book is organized by research topic. Each chapter focuses on a major topic and can be read independently. Each chapter contains advanced algorithms along with real speech examples and evaluation results to validate the usefulness of the selected topics. Special attention has been given to topics related to improving overall system robustness and performance, such as robust endpoint detection, fast discriminative training theory and algorithms, detection-based decoding, and sequential authentication. I have also given attention to novel approaches that may lead to new research directions, such as a recently developed auditory transform (AT) to replace the fast Fourier transform (FFT), and auditory-based feature extraction algorithms. For real applications, a good speaker authentication system must first have acceptable authentication accuracy and then be robust to background noise, channel distortion, and speaker variability.
A number of speaker authentication systems can be designed based on the methods and techniques presented in this book. A particular system can be designed to meet required specifications by selecting an authentication method or combining several authentication and decision methods introduced in the book.
Speaker authentication is a subject that relies on the research efforts of many different fields, including, but not limited to, physics, acoustics, psychology, physiology, hearing and the auditory nerve, the brain, auditory perception, parametric and nonparametric statistics, signal processing, pattern recognition, acoustic phonetics, linguistics, natural language processing, linear and nonlinear programming, optimization, and communications. This book covers only a subset of these topics. Due to my limited time and experience, it focuses on the topics in my published research. I encourage people with the above backgrounds to consider contributing their knowledge to speech recognition and speaker authentication research. I also encourage colleagues in the fields of speech recognition and speaker authentication to extend their knowledge to the above fields in order to achieve breakthrough research results.

This book does not include fundamental topics that have been well introduced in other textbooks. The author assumes the reader has a basic understanding of linear systems, signal processing, statistics, and pattern recognition.

This book can be used as a reference by government and company officers and researchers working in information technology, homeland security, law enforcement, and information security, as well as by researchers and developers in the areas of speaker recognition, speech recognition, pattern recognition, and audio and signal processing. It can also serve as a reference or textbook for senior undergraduate and graduate students in electrical engineering, computer science, biomedical engineering, and information management.
Acknowledgments

The author would like to thank the many people who helped him in his career and in the fields of speaker and speech recognition. I am particularly indebted to Dr. Donald W. Tufts of the Department of Electrical, Computer, and Biomedical Engineering, University of Rhode Island, for his role in guiding and training me in pattern recognition and speech signal processing. Special thanks are due to Dr. S. Parthasarathy and Dr. Aaron Rosenberg, who served as mentors when I first joined Bell Laboratories and led me into the field of speaker verification research. I am particularly grateful to Dr. Biing-Hwang (Fred) Juang for his guidance in verbal information verification research. That work extended speaker recognition to speaker authentication, which has broader applications.

Most topics in this book were prepared based on previously published peer-reviewed journal and conference papers on which I served as the first author. I would like to thank all the coauthors of those publications, namely Dr. Donald Tufts, Dr. Peter Swaszek, Dr. S. Parthasarathy, Dr. Aaron Rosenberg, Dr. Biing-Hwang Juang, Dr. Frank Soong, Dr. Chin-Hui Lee, Qiru Zhou, Jinsong Zheng, Dr. Augustine Tsai, and Yan Huang. Also, I would like to
thank the many anonymous reviewers and editors for their helpful comments and suggestions.

The author also would like to thank Dr. Bishnu Atal, Dr. Joe Olive, Dr. Wu Chou, Dr. Oliver Siohan, Dr. Mohan Sondhi, Dr. Oded Ghitza, Dr. Jingdong Chen, Dr. Rafid Sukkar, Dr. Larry O'Gorman, Dr. Richard Rose, and Dr. David Roe, all former Bell Laboratories colleagues, for their useful discussions and their kind help and support of my research there. Also, I would like to thank Dr. Ivan Selesnick for our recent collaborations.

Within Li Creative Technologies, the author would like to thank Yan Huang and Yan Yin for our recent collaborations in speaker identification research. I also would like to thank my colleagues Dr. Manli Zhu, Dr. Bozhao Tan, Uday Jain, and Joshua Hajicek for useful discussions on biometrics, acoustics, speech, and hearing systems.

From 2008 to 2010, the author's research on speaker identification was supported by the U.S. AFRL under contract number FA8750-08-C-0028. I would like to thank program managers Michelle Grieco, John Parker, and Dr. Stanley Wenndt for their help and support. Some of the research results have been included in Chapter 7 and Chapter 8 of this book; other results will be published later.

I would like to thank my colleague Craig B. Adams and my daughter Joy Y. Li for their work in editing this book, and Uday Jain, Dr. Manli Zhu, and Dr. Bozhao Tan for their proofreading. Also, this book could not have been finished without the support of my wife Vivian during the many weekends which I spent working on it.

The author also would like to thank the IEEE Intellectual Property Rights Office for permission to use in this book the IEEE copyright materials which I previously published in IEEE publications. Finally, I would like to thank Dr. Christoph Baumann, Engineering Editor at Springer, for his kind invitation to prepare and publish this book.
Florham Park, NJ
July 2011
Qi (Peter) Li
Contents
1 Introduction . . . . . . . . . . 1
  1.1 Authentication . . . . . . . . . . 1
  1.2 Biometric-Based Authentication . . . . . . . . . . 3
  1.3 Information-Based Authentication . . . . . . . . . . 5
  1.4 Speaker Authentication . . . . . . . . . . 6
    1.4.1 Speaker Recognition . . . . . . . . . . 7
    1.4.2 Verbal Information Verification . . . . . . . . . . 9
  1.5 Historical Perspective and Further Reading . . . . . . . . . . 11
  1.6 Book Organization . . . . . . . . . . 13
  References . . . . . . . . . . 18
2 Multivariate Statistical Analysis and One-Pass Vector Quantization . . . . . . . . . . 23
  2.1 Multivariate Gaussian Distribution . . . . . . . . . . 23
  2.2 Principal Component Analysis . . . . . . . . . . 25
  2.3 Vector Quantization . . . . . . . . . . 27
  2.4 One-Pass VQ . . . . . . . . . . 28
    2.4.1 The One-Pass VQ Algorithm . . . . . . . . . . 28
    2.4.2 Steps of the One-Pass VQ Algorithm . . . . . . . . . . 32
    2.4.3 Complexity Analysis . . . . . . . . . . 33
    2.4.4 Codebook Design Examples . . . . . . . . . . 34
    2.4.5 Robustness Analysis . . . . . . . . . . 38
  2.5 Segmental K-Means . . . . . . . . . . 39
  2.6 Conclusions . . . . . . . . . . 40
  References . . . . . . . . . . 40
3 Principal Feature Networks for Pattern Recognition . . . . . . . . . . 43
  3.1 Overview of the Design Concept . . . . . . . . . . 43
  3.2 Implementations of Principal Feature Networks . . . . . . . . . . 46
  3.3 Hidden Node Design . . . . . . . . . . 48
    3.3.1 Gaussian Discriminant Node . . . . . . . . . . 49
    3.3.2 Fisher's Node Design . . . . . . . . . . 51
  3.4 Principal Component Hidden Node Design . . . . . . . . . . 52
    3.4.1 Principal Component Discriminant Analysis . . . . . . . . . . 53
  3.5 Relation between PC Node and the Optimal Gaussian Classifier . . . . . . . . . . 53
  3.6 Maximum Signal-to-Noise-Ratio (SNR) Hidden Node Design . . . . . . . . . . 55
  3.7 Determining the Thresholds from Design Specifications . . . . . . . . . . 56
  3.8 Simplification of the Hidden Nodes . . . . . . . . . . 56
  3.9 Application 1 – Data Recognition . . . . . . . . . . 56
  3.10 Application 2 – Multispectral Pattern Recognition . . . . . . . . . . 58
  3.11 Conclusions . . . . . . . . . . 59
  References . . . . . . . . . . 59
4 Non-Stationary Pattern Recognition . . . . . . . . . . 61
  4.1 Introduction . . . . . . . . . . 61
  4.2 Gaussian Mixture Models (GMM) for Stationary Process . . . . . . . . . . 62
    4.2.1 An Illustrative Example . . . . . . . . . . 63
  4.3 Hidden Markov Model (HMM) for Non-Stationary Process . . . . . . . . . . 66
  4.4 Speech Segmentation . . . . . . . . . . 68
  4.5 Bayesian Decision Theory . . . . . . . . . . 68
  4.6 Statistical Verification . . . . . . . . . . 70
  4.7 Conclusions . . . . . . . . . . 71
  References . . . . . . . . . . 72
5 Robust Endpoint Detection . . . . . . . . . . 75
  5.1 Introduction . . . . . . . . . . 76
  5.2 A Filter for Endpoint Detection . . . . . . . . . . 78
  5.3 Real-Time Endpoint Detection and Energy Normalization . . . . . . . . . . 81
    5.3.1 A Filter for Both Beginning- and Ending-Edge Detection . . . . . . . . . . 82
    5.3.2 Decision Diagram . . . . . . . . . . 82
    5.3.3 Real-Time Energy Normalization . . . . . . . . . . 83
    5.3.4 Database Evaluation . . . . . . . . . . 85
  5.4 Conclusions . . . . . . . . . . 88
  References . . . . . . . . . . 89
6 Detection-Based Decoder . . . . . . . . . . 93
  6.1 Introduction . . . . . . . . . . 94
  6.2 Change-Point Detection . . . . . . . . . . 96
  6.3 HMM State Change-Point Detection . . . . . . . . . . 97
  6.4 HMM Search-Space Reduction . . . . . . . . . . 100
    6.4.1 Concept of Search-Space Reduction . . . . . . . . . . 100
    6.4.2 Algorithm Summary and Complexity Analysis . . . . . . . . . . 102
  6.5 Experiments . . . . . . . . . . 104
    6.5.1 An Example of State Change-Point Detection . . . . . . . . . . 104
    6.5.2 Application to Speaker Verification . . . . . . . . . . 104
  6.6 Conclusions . . . . . . . . . . 108
  References . . . . . . . . . . 109

7 Auditory-Based Time-Frequency Transform . . . . . . . . . . 111
  7.1 Introduction . . . . . . . . . . 112
    7.1.1 Observing Problems with the Fourier Transform . . . . . . . . . . 112
    7.1.2 Brief Introduction of the Ear . . . . . . . . . . 114
    7.1.3 Time-Frequency Analyses . . . . . . . . . . 117
  7.2 Definition of the Auditory-Based Transform . . . . . . . . . . 118
  7.3 The Inverse Auditory Transform . . . . . . . . . . 120
  7.4 The Discrete-Time and Fast Transform . . . . . . . . . . 123
  7.5 Experiments and Discussions . . . . . . . . . . 124
    7.5.1 Verifying the Inverse Auditory Transform . . . . . . . . . . 124
    7.5.2 Applications . . . . . . . . . . 126
  7.6 Comparisons to Other Transforms . . . . . . . . . . 127
  7.7 Conclusions . . . . . . . . . . 131
  References . . . . . . . . . . 131
8 Auditory-Based Feature Extraction and Robust Speaker Identification . . . . . . . . . . 135
  8.1 Introduction . . . . . . . . . . 135
  8.2 Auditory-Based Feature Extraction Algorithm . . . . . . . . . . 138
    8.2.1 Forward Auditory Transform and Cochlea Filter Bank . . . . . . . . . . 138
    8.2.2 Cochlear Filter Cepstral Coefficients (CFCC) . . . . . . . . . . 140
    8.2.3 Analysis and Comparison . . . . . . . . . . 141
  8.3 Speaker Identification and Experimental Evaluation . . . . . . . . . . 142
    8.3.1 Experimental Datasets . . . . . . . . . . 142
    8.3.2 The Baseline Speaker Identification System . . . . . . . . . . 143
    8.3.3 Experiments . . . . . . . . . . 145
    8.3.4 Further Comparison with PLP and RASTA-PLP . . . . . . . . . . 146
  8.4 Conclusions . . . . . . . . . . 147
  References . . . . . . . . . . 149
9 Fixed-Phrase Speaker Verification . . . . . . . . . . 151
  9.1 Introduction . . . . . . . . . . 151
  9.2 A Fixed-Phrase System . . . . . . . . . . 152
  9.3 An Evaluation Database and Model Parameters . . . . . . . . . . 154
  9.4 Adaptation and Reference Results . . . . . . . . . . 155
  9.5 Conclusions . . . . . . . . . . 155
  References . . . . . . . . . . 156
10 Robust Speaker Verification with Stochastic Matching . . . . . . . . . . 157
  10.1 Introduction . . . . . . . . . . 157
  10.2 A Fast Stochastic Matching Algorithm . . . . . . . . . . 158
  10.3 Fast Estimation for a General Linear Transform . . . . . . . . . . 160
  10.4 Speaker Verification with Stochastic Matching . . . . . . . . . . 161
  10.5 Database and Experiments . . . . . . . . . . 163
  10.6 Conclusions . . . . . . . . . . 164
  References . . . . . . . . . . 164

11 Randomly Prompted Speaker Verification . . . . . . . . . . 165
  11.1 Introduction . . . . . . . . . . 165
  11.2 Normalized Discriminant Analysis . . . . . . . . . . 168
  11.3 Applying NDA in the Hybrid Speaker-Verification System . . . . . . . . . . 169
    11.3.1 Training of the NDA System . . . . . . . . . . 169
    11.3.2 Training of the HMM System . . . . . . . . . . 171
    11.3.3 Training of the Data Fusion Layer . . . . . . . . . . 173
  11.4 Speaker Verification Experiments . . . . . . . . . . 173
    11.4.1 Experimental Database . . . . . . . . . . 173
    11.4.2 NDA System Results . . . . . . . . . . 174
    11.4.3 Hybrid Speaker-Verification System Results . . . . . . . . . . 174
  11.5 Conclusions . . . . . . . . . . 175
  References . . . . . . . . . . 176

12 Objectives for Discriminative Training . . . . . . . . . . 179
  12.1 Introduction . . . . . . . . . . 179
  12.2 Error Rates vs. Posterior Probability . . . . . . . . . . 180
  12.3 Minimum Classification Error vs. Posterior Probability . . . . . . . . . . 181
  12.4 Maximum Mutual Information vs. Minimum Classification Error . . . . . . . . . . 183
  12.5 Generalized Minimum Error Rate vs. Other Objectives . . . . . . . . . . 185
  12.6 Experimental Comparisons . . . . . . . . . . 186
  12.7 Discussion . . . . . . . . . . 186
  12.8 Relations between Objectives and Optimization Algorithms . . . . . . . . . . 187
  12.9 Conclusions . . . . . . . . . . 188
  References . . . . . . . . . . 189

13 Fast Discriminative Training . . . . . . . . . . 191
  13.1 Introduction . . . . . . . . . . 191
  13.2 Objective for Fast Discriminative Training . . . . . . . . . . 193
  13.3 Derivation of Fast Estimation Formulas . . . . . . . . . . 195
    13.3.1 Estimation of Covariance Matrices . . . . . . . . . . 196
    13.3.2 Determination of Weighting Scalar . . . . . . . . . . 196
    13.3.3 Estimation of Mean Vectors . . . . . . . . . . 197
    13.3.4 Estimation of Mixture Parameters . . . . . . . . . . 198
    13.3.5 Discussions . . . . . . . . . . 198
  13.4 Summary of Practical Training Procedure . . . . . . . . . . 200
  13.5 Experiments . . . . . . . . . . 200
    13.5.1 Continuing the Illustrative Example . . . . . . . . . . 200
    13.5.2 Application to Speaker Identification . . . . . . . . . . 204
  13.6 Conclusions . . . . . . . . . . 204
  References . . . . . . . . . . 205

14 Verbal Information Verification . . . . . . . . . . 207
  14.1 Introduction . . . . . . . . . . 207
  14.2 Single Utterance Verification . . . . . . . . . . 209
    14.2.1 Normalized Confidence Measures . . . . . . . . . . 211
  14.3 Sequential Utterance Verification . . . . . . . . . . 212
    14.3.1 Examples in Sequential-Test Design . . . . . . . . . . 214
  14.4 VIV Experimental Results . . . . . . . . . . 215
  14.5 Conclusions . . . . . . . . . . 219
  References . . . . . . . . . . 220

15 Speaker Authentication System Design . . . . . . . . . . 223
  15.1 Introduction . . . . . . . . . . 223
  15.2 Automatic Enrollment by VIV . . . . . . . . . . 224
  15.3 Fixed-Phrase Speaker Verification . . . . . . . . . . 226
  15.4 Experiments . . . . . . . . . . 227
    15.4.1 Features and Database . . . . . . . . . . 227
    15.4.2 Experimental Results on Using VIV for SV Enrollment . . . . . . . . . . 228
  15.5 Conclusions . . . . . . . . . . 229
  References . . . . . . . . . . 230

Index . . . . . . . . . . 231
List of Tables
1.1 List of Biometric Authentication Error Rates . . . . . . . . . . 5
2.1 Quantizer MSE Performance . . . . . . . . . . 35
2.2 Comparison of One-Pass and LBG Algorithms . . . . . . . . . . 36
2.3 Comparison of Different VQ Design Approaches . . . . . . . . . . 37
2.4 Comparison for the Correlated Gaussian Source . . . . . . . . . . 37
2.5 Comparison on the Laplace Source . . . . . . . . . . 38
3.1 Comparison of Three Algorithms in the Land Cover Recognition . . . . . . . . . . 58
5.1 Database Evaluation Results (%) . . . . . . . . . . 88
7.1 Correlation Coefficients, σ²₁₂, for Different Sizes of Filter Bank in AT/Inverse AT . . . . . . . . . . 126
8.1 Summary of the Training, Development, and Testing Sets . . . . . . . . . . 143
8.2 Comparison of MFCC, MGFCC, and CFCC Features Tested on the Development Set . . . . . . . . . . 144
9.1 Experimental Results in Average Equal-Error Rates of All Tested Speakers . . . . . . . . . . 155
10.1 Experimental Results in Average Equal-Error Rates (%) . . . . . . . . . . 163
11.1 Segmentation of the Database . . . . . . . . . . 174
11.2 Results on Discriminant Analysis . . . . . . . . . . 175
11.3 Major Results . . . . . . . . . . 175
12.1 Comparisons on Training Algorithms . . . . . . . . . . 187
13.1 Three-Class Classification Results of the Illustration Example . . . . . . . . . . 203
13.2 Comparison on Speaker Identification Error Rates . . . . . . . . . . 204
14.1 False Acceptance Rates when Using Two Thresholds and Maintaining False Rejection Rates to Be 0.0% . . . . . . . . . . 216
14.2 Comparison on Two- and Single-Threshold Tests . . . . . . . . . . 216
14.3 Summary of the Experimental Results on Verbal Information Verification . . . . . . . . . . 219
15.1 Experimental Results without Adaptation in Average Equal-Error Rates . . . . . . . . . . 229
15.2 Experimental Results with Adaptation in Average Equal-Error Rates . . . . . . . . . . 229
List of Figures
1.1 Speaker authentication approaches . . . . . . . . . . 6
1.2 A speaker verification system . . . . . . . . . . 7
1.3 An example of verbal information verification by asking sequential questions. Similar sequential tests can also be applied in speaker recognition and other biometric or multimodality verification . . . . . . . . . . 10
2.1 An example of bivariate Gaussian distribution: ρ₁₁ = 1.23, ρ₁₂ = ρ₂₁ = 0.45, and ρ₂₂ = 0.89 . . . . . . . . . . 24
2.2 The contour of the Gaussian distribution in Fig. 2.1 . . . . . . . . . . 25
2.3 An illustration of a constant density ellipse and the principal components for a normal random vector X. The largest eigenvalue associates with the long axis of the ellipse and the second eigenvalue associates with the short axis. The eigenvectors associate with the axes . . . . . . . . . . 26
2.4 The method to determine a code vector: (a) select the highest density cell; (b) examine a group of cells around the selected one; (c) estimate the center of the data subset; (d) cut a "hole" in the training data set . . . . . . . . . . 29
2.5 The Principal Component (PC) method to determine a centroid . . . . . . . . . . 31
2.6 Left: Uncorrelated Gaussian source training data. Right: The residual data after four code vectors have been located . . . . . . . . . . 35
2.7 Left: The residual data after all 16 code vectors have been located. Right: The "+" and "◦" are the centroids after one and three iterations of the LBG algorithm, respectively . . . . . . . . . . 36
2.8 Left: The Laplace source training data. Right: The residual data, one-pass designed centroids "+", and one-pass + 2 LBG centroids "◦" . . . . . . . . . . 38
3.1 An illustrative example to demonstrate the concept of the PFN: (a) The original training data of two labeled classes which are not linearly separable. (b) The hyperplanes of the first hidden node (LDA node). (c) The residual data set and the hyperplanes of the second hidden node (SNR node). (d) The input space partitioned by two hidden nodes and four thresholds designed by the principal feature classification (PFC) method . . . . . . . . . . 45
3.2 A parallel implementation of PFC by a Principal Feature Network (PFN) . . . . . . . . . . 47
3.3 A sequential implementation of PFC by a Principal Feature Tree (PFT) . . . . . . . . . . 47
3.4 (a) Partitioned input space for parallel implementation. (b) Parallel implementation . . . . . . . . . . 48
3.5 (a) Partitioned input space for sequential implementation. (b) Sequential (tree) implementation . . . . . . . . . . 49
3.6 (a) A single Gaussian discriminant node. (b) A Fisher's node. (c) A quadratic node. (d) An approximation of the quadratic node . . . . . . . . . . 50
3.7 When using only Fisher's nodes, three hidden nodes and six thresholds are needed to finish the design . . . . . . . . . . 54
3.8 Application 1: (a) (bottom) The sorted contribution of each threshold in the order of its contribution to the class separated by the threshold. (b) (top) Accumulated network performance in the order of the sorted thresholds . . . . . . . . . . 57
4.1 Class 1: a bivariate Gaussian distribution with m₁ = [0 5], m₂ = [−3 3], and m₃ = [−5 0]. Σ₁ = [1.41 0; 0 1.41], Σ₂ = [1.22 0.09; 0.09 1.22], and Σ₃ = [1.37 0.37; 0.27 1.37] . . . . . . . . . . 64
4.2 Class 2: a bivariate Gaussian distribution with m₁ = [2 5], m₂ = [−1 3], and m₃ = [0 0]. Σ₁ = [1.41 0; 0 1.41], Σ₂ = [0.77 1.11; 1.11 1.09], and Σ₃ = [1.41 0.04; 0.04 1.41] . . . . . . . . . . 64
4.3 Class 3: a bivariate Gaussian distribution with m₁ = [−3 −1], m₂ = [−2 −2], and m₃ = [−5 −2]. Σ₁ = [1.41 0; 0 1.41], Σ₂ = [0.76 0.11; 0.11 1.09], and Σ₃ = [1.41 0.04; 0.04 1.41] . . . . . . . . . . 65
4.4 Contours of the pdf's of 3-mixture GMMs: the models are used to generate 3 classes of training data . . . . . . . . . . 65
4.5 Contours of the pdf's of 2-mixture GMMs: the models are trained from ML estimation using 4 iterations . . . . . . . . . . 66
4.6 Enlarged decision boundaries for the ideal 3-mixture models (solid line) and 2-mixture ML models (dashed line) . . . . . . . . . . 66
4.7 Left-to-right hidden Markov model . . . . . . . . . . 67
5.1 Shape of the designed optimal filter . . . . . . . . . . 81
5.2 Endpoint detection and energy normalization for real-time ASR . . . . . . . . . . 81
List of Figures
5.3 5.4 5.5
5.6
5.7
6.1
6.2 6.3
6.4
6.5
6.6
6.7
6.8
State transition diagram for endpoint decision. . . . . . . . . . . . . . . . Example: (A) Energy contour of digit “4”. (B) Filter outputs and state transitions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . (A) Energy contours of “4327631Z214” from original utterance (bottom, 20 dB SNR) and after adding car noise (top, 5 dB SNR). (B) Filter outputs for 5 dB (dashed line) and 20 dB (solid line) SNR cases. (C) Detected endpoints and normalized energy for the 20 dB SNR case, and (D) for the 5 dB SNR case. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Comparisons on realtime connected digit recognition with various signaltonoise ratios (SNR’s). From 5 to 20 dB SNR’s, the introduced realtime algorithm provided word error rate reductions of 90.2%, 93.4%, 57.1%, and 57.1%, respectively. . . . . (A) Energy contour of the 523th utterance in DB5: “1 Z 4 O 5 8 2”. (B) Endpoints and normalized energy from the baseline system. The utterance was recognized as “1 Z 4 O 5 8”. (C) Endpoints and normalized energy from the realtime, endpointdetection system. The utterance was recognized correctly as “1 Z 4 O 5 8 2”. (D) The ﬁlter output. . . . . . . . . . . .
xxi
83 84
85
87
89
The scheme of the changepoint detection algorithm with tδ = 2: (a) the endpoint detection for state 1; (b) the endpoint detection for state 2; and (c) the grid points involved in p1 , p2 and p3 computations (dots). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 Lefttoright hidden Markov model. . . . . . . . . . . . . . . . . . . . . . . . . . 98 All the grid points construct a full search space Ψ . The grid points involved in the changepoint detection are marked as black points. A single path (solid line) is detected from the forward and backward changepoint detection. . . . . . . . . . . . . . . . 102 A “hole” is detected from the forward and backward state changepoint detection. A search is needed only among four grid points, (8,3), (8,4), (9,3) and (9,4). The solid line indicates the path with the maximum likelihood score. . . . . . . . . 102 A search is needed in the reduced search space Ω which includes all the black points in between the two dashed lines. The points along the dashed lines are involved in changepoint detection, but they do not belong to the reduced search space. . 103 A special case is located between (11,4) and (18,6), where the forward boundary is under the backward one. A full search can be done in the subspace {(t, st )  11 ≤ t ≤ 18; 4 < st < 6}. . 103 The procedure of sequential state changepoint detection from state 1 (top) to state 7 (bottom), where the vertical dashed lines are the detected endpoints of each state. . . . . . . . . . . . . . . . . 105 The procedure of sequential state changepoint detection from state 8 (top) to state 13 (bottom). . . . . . . . . . . . . . . . . . . . . . . . . . . 106
xxii
List of Figures
6.9
7.1
(a) Comparison of average individual equalerror rates (EER’s); (b) Comparison on average speedups. . . . . . . . . . . . . . . . 107
Male’s voice: “2 0 5” recorded simultaneously by closetalking (top) and handsfree microphones in a moving car (bottom). . . . 112 7.2 The speech waveforms in Fig. 7.1 were converted to spectrograms by FFT and displayed in Bark scale from 0 to 16.4 Barks (0 to 3500 KHz). The background noise and the pitch harmonics were generated mainly by FFT. . . . . . . . . . . . . . . 113 7.3 The spectrum of FFT at the 1.15 second time frame from Fig. 7.2: The solid line represents the speech from a closetalking microphone. The dashed line is from a handsfree microphone mounted on the visor of a moving car. Both speech ﬁles were recorded simultaneously. The FFT spectrum shows 30 dB distortion at low frequency bands due to background noise and pitch harmonics as noise. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 7.4 Illustration of human ear and cochlea. . . . . . . . . . . . . . . . . . . . . . . 115 7.5 Illustration of a stretched out cochlea and a traveling wave exciting a portion of the basilar membrane. . . . . . . . . . . . . . . . . . . 115 7.6 Impulse responses of the BM in the AT when α = 3 and β = 0.2. They are very similar to the research results reported in hearing research. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 7.7 The frequency responses of the cochlear ﬁlters when α = 3: (A) β = 0.2; and (B) β = 0.035. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122 7.8 The traveling wave generated by the auditory transform from the speech data in Fig. 7.1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 7.9 A section of the traveling wave generated by the auditory transform. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124 7.10 Spectrograms from the output of the cochlear transform for the speech data in Fig. 7.1 respectively. 
The spectrogram at top is from the data recorded by the closetalking microphone, while the spectrogram at bottom is from the handsfree microphone. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 7.11 The spectrum of AT at the 1.15 second time frame from Fig. 7.10: The solid line represents the speech from a closetalking microphone. The dashed line is from a handsfree microphone mounted on the visor of a moving car. Both speech ﬁles were recorded simultaneously. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 7.12 Comparison of speech waveforms: (A) The original waveform of a male voice speaking the words “two, zero, ﬁve.” (B) The synthesized waveform by inverse AT with the bandwidth of 80 to 5K Hz. When the ﬁlter numbers are 8, 16, 32, and 64, the 2 correlation coeﬃcients σ12 for the two speech data sets are 0.74, 0.96, 0.99, and 0.99, respectively. . . . . . . . . . . . . . . . . . . . . . . 126
List of Figures xxiii
7.13 (A) and (B) are speech waveforms simultaneously recorded in a moving car. The microphones are located on the car visor (A) and speaker’s lapel (B), respectively. (C) is after noise reduction using the AT from the waveform in (A), where results are very similar to (B). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 7.14 Comparison of FT and AT spectrums: (A) The FFT spectrogram of a male voice “2 0 5”, warped into the Bark scale from 0 to 6.4 Barks (0 to 3500 KHz). (B) The spectrogram from the cochlear ﬁlter output for the same male voice. The AT is harmonic free and has less computational noise.128 7.15 Comparison of AT (top) and FFT (bottom) spectrums at the 1.15 second time frame for robustness: The solid line represents speech from a closetalking microphone. The dashed line represents speech from a handsfree microphone mounted on the visor of a moving car. Both speech ﬁles were recorded simultaneously. The FFT spectrum shows 30 dB distortion at lowfrequency bands due to background noise compared to the AT. Compared to the FFT spectrum, the AT spectrum has no pitch harmonics and much less distortion at low frequency bands due to background noise. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129 7.16 The Gammatone ﬁlter bank: (A) The frequency responses of the Gammatone ﬁlter bank generated by (7.19). (B) The frequency responses of the Gammatone ﬁlter bank generated by (7.19) plus a equal loudness function. . . . . . . . . . . . . . . . . . . . . 130 8.1 8.2 8.3 8.4 8.5 8.6 8.7 9.1
Schematic diagram of the auditorybased feature extraction algorithm named cochlear ﬁlter cepstral coeﬃcients (CFCC). . . 138 Comparison of MFCC, MGFCC, and the CFCC features tested on noisy speech with white noise. . . . . . . . . . . . . . . . . . . . . . 145 Comparison of MFCC, MGFCC, and CFCC features tested on noisy speech with car noise. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146 Comparison of MFCC, MGFCC, and CFCC features tested on noisy speech with babble noise. . . . . . . . . . . . . . . . . . . . . . . . . . . 147 Comparison of PLP, RASTAPLP, and the CFCC features tested on noisy speech with white noise. . . . . . . . . . . . . . . . . . . . . . 148 Comparison of PLP, RASTAPLP, and the CFCC features tested on noisy speech with car noise. . . . . . . . . . . . . . . . . . . . . . . . 148 Comparison of PLP, RASTAPLP, and the CFCC features tested on noisy speech with babble noise. . . . . . . . . . . . . . . . . . . . . 149 A ﬁxedphrase speaker veriﬁcation system. . . . . . . . . . . . . . . . . . . . 153
xxiv List of Figures
10.1 A geometric interpretation of the fast stochastic matching. (a) The dashed line is the contour of training data. (b) The solid line is the contour of test data. The crosses are the means of the two data sets. (c) The test data were scaled and rotated toward the training data. (d) The test data were translated to the same location as the training data. Both contours overlap each other. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159 10.2 A phrasebased speaker veriﬁcation system with stochastic matching. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162 11.1 The structure of a hybrid speaker veriﬁcation (HSV) system. . . . 167 11.2 The NDA feature extraction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170 11.3 The Type 2 classiﬁer (NDA system) for one speaker. . . . . . . . . . . 171 13.1 Contours of the pdf ’s of 3mixture GMM’s: the models are used to generate three classes of training data. . . . . . . . . . . . . . . . 201 13.2 Contours of the pdf ’s of 2mixture GMM’s: the models are from ML estimation using four iterations. . . . . . . . . . . . . . . . . . . . 201 13.3 Contours of the pdf ’s of 2mixture GMM’s: The models are from the fast GMER estimation with two iterations on top of the ML estimation results. The overlaps among the three classes are signiﬁcantly reduced. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202 13.4 Enlarged decision boundaries for the ideal 3mixture models (solid line), 2mixture ML models (dashed line), and 2mixture GMER models (dashdotted line): After GMER training, the boundary of ML estimation shifted toward the decision boundary of the ideal models. This illustrates how GMER training improves decision accuracies. . . . . . . . . . . . . . . . . . . . . . . . 
202 13.5 Performance improvement versus iterations using the GMER estimation: The initial performances were from the ML estimation with four iterations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203 14.1 An example of verbal information veriﬁcation by asking sequential questions. (Similar sequential tests can also be applied in speaker veriﬁcation and other biometric or multimodality veriﬁcation.) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208 14.2 Utterance veriﬁcation in VIV. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210 14.3 False acceptance rate as a function of robust interval with SD threshold for a 0% false rejection rate. The horizontal axis indicates the shifts of the values of the robust interval τ . . . . . . . 218 14.4 An enlarged graph of the system performances using two and three questions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219 15.1 A conventional speaker veriﬁcation system . . . . . . . . . . . . . . . . . . . 224
List of Figures
xxv
15.2 An example of speaker authentication system design: Combining verbal information veriﬁcation with speaker veriﬁcation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225 15.3 A ﬁxedphrase speaker veriﬁcation system . . . . . . . . . . . . . . . . . . . 226
Chapter 1 Introduction
1.1 Authentication

Authentication is the process of positively verifying the identity of a user, device, or any entity in a computer system, often as a prerequisite to allowing access to resources in the system [49]. Authentication has been used by humans for thousands of years to recognize each other, to identify friends and enemies, and to protect information and assets. In the computer era, the purpose of identification is no longer just to identify people in our presence, but also to identify people in remote locations, computers on a network, or any entity in computer networks. As such, authentication has been extended from a manual identification process to an automatic one. People are paying more and more attention to security and privacy; thus authentication processes are everywhere in our daily lives. Automatic authentication technology is now necessary for all computer and network access, and it plays an important role in security.

In general, an automatic authentication process includes two sessions: enrollment/registration and testing/verification. During an enrollment/registration session, the identity of a user or entity is verified and an authentication method is initialized by both parties. During a testing/verification session, the user must follow the same authentication method to prove his or her identity. If the user passes the authentication procedure, the user is accepted and allowed to access the protected territory, networks, or systems; if not, the user is rejected and no access is allowed.

One example is the authentication procedure in banks. When a customer opens an account, the bank asks the customer to show a passport, a driver's license, or other documents to verify the customer's identity. An authentication method, such as an account number plus a PIN (personal identification number), is then initialized during the registration session.
When the customer wants to access the account, the customer must provide both the account number and the PIN for verification to gain access. We will see similar procedures in the automatic authentication systems using speech processing throughout this book.

Following the discussions in [32], authentication can be further differentiated into human-human, machine-machine, human-machine, and machine-human authentication. Human-human authentication is the traditional method. It can be done by visually verifying a human face or signature, or by identifying a speaker's voice on the phone. As the most fundamental method, human-human authentication continues to play an important role in our daily lives. Machine-machine authentication is the process by which machines authenticate other machines automatically. Since the process on both sides can be predesigned for the best performance in terms of accuracy and speed, this kind of authentication usually provides very high performance. Examples are encryption and decryption procedures, well-established protocols, and secured interface designs. Machine-human authentication is the process by which a person verifies a machine-generated identity or password. For example, people can identify a host ID, an account name, or the manufacturing code of a machine or device. Finally, human-machine authentication, also called user authentication, is the process by which a machine automatically verifies the validity of a claimed user [32]. Because of its many potential applications, this is a very active research area. We will focus our discussions on user authentication.

We can further label a user authentication process as information based, token based, or biometric based [32]. The information-based approach is characterized by the use of secret or private information. A typical example is to ask private questions, such as one's mother's maiden name or the last four digits of one's social security number. The information can be updated or changed at various times; for example, the private questions can be the date and amount of the last deposit.
The token-based approach is characterized by the use of physical objects, such as metal keys, magnetic keys, electronic keys, or photo IDs. The biometric-based approach is characterized by biometric matching, such as using voice, fingerprint, iris, signature, or DNA characteristics. In real applications, an authentication system may use a combination of several or all of the above approaches. For example, a bank or credit card authentication system may include an information-based approach – a PIN; a token-based approach – a bank card; and a biometric-based approach – a signature.

In terms of decision procedure, authentication can be sequential or parallel. A sequential procedure makes its decision using "AND" logic, while a parallel procedure makes its decision using "OR" logic. Using the above bank example, if an authentication decision is made by investigating the bank card, the PIN, and the signature one by one, it is a sequential process. If the decision is made by investigating only one of the three items, it is a parallel decision process. Details of the sequential decision procedure will be discussed in Chapter 14. We now discuss the biometric-based and information-based approaches in more detail in the following sections. A speaker authentication system can be implemented with either approach.
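The "AND" versus "OR" decision logic above can be sketched in a few lines. This is an illustrative example of my own (the check names and outcomes are hypothetical), not an implementation from this book:

```python
# Sketch of sequential ("AND") vs. parallel ("OR") authentication decisions.
# Each check is a (name, passed) pair; a real system would run the checks
# (card reader, PIN lookup, signature match) instead of using fixed booleans.

def sequential_decision(checks):
    """Sequential procedure: accept only if every check passes ("AND" logic)."""
    return all(passed for _name, passed in checks)

def parallel_decision(checks):
    """Parallel procedure: accept if any single check passes ("OR" logic)."""
    return any(passed for _name, passed in checks)

# Hypothetical bank-style checks: card and PIN pass, signature fails.
checks = [("bank card", True), ("PIN", True), ("signature", False)]

print(sequential_decision(checks))  # False: the signature check failed
print(parallel_decision(checks))    # True: one passing check suffices
```

The sequential procedure is stricter (lower false acceptance, higher false rejection), while the parallel procedure trades in the opposite direction.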
1.2 Biometric-Based Authentication

A biometric is a physical or biomedical characteristic or personal trait of a person that can be measured, with or without contact, and used to recognize the person in an authentication process. For practical applications, useful biometrics must have the following properties [51, 15]:

• Measurability: A characteristic or trait can be captured by a sensor and extracted by an algorithm.
• Distinctiveness: A measure of a characteristic is significantly different from subject to subject.
• Repeatability: A measured characteristic can be repeated through multiple measurements at different environments, locations, and times.
• Robustness: A measure of a characteristic shows no significant changes when the same subject is measured over a period of time, in different environments, or even by different sensors.
• Accessibility: A subject is willing to cooperate for feature extraction, and the biometric authentication process can be finished in an acceptable amount of time and at an acceptable cost.
Well-known and popular biometrics include voice, fingerprint, iris, retina, face, hand, signature, and DNA. The measurement or feature extraction can be based on different kinds of data, such as optical images, video, audio, thermal video, DNA sequences, etc. Some biometric features, such as voice and face images, are easy to capture; others, such as DNA sequences, may be difficult. Some features, such as DNA and iris, can have very high authentication distinctiveness; others may not. Also, some can be captured at very low cost, such as voice; others, such as DNA, can be too expensive in terms of cost and time given existing technology. When developing a biometric system, all of the above factors need to be considered. For speech as a biometric, measurability and accessibility are advantages, but distinctiveness, repeatability, and robustness are the challenges for speaker authentication. Most of the topics in this book address these challenges.

In terms of statistical properties, biometric signals can be divided into two categories: stationary and nonstationary. A biometric signal is stationary if its probability density function does not vary over any time shift during feature extraction, as with still fingerprints and irises; otherwise, it is nonstationary, as with speech and video. Based on this definition, fingerprint, iris, hand, and still face can be considered stationary biometric signals, while speech and video are nonstationary biometric signals. Special statistical methods are then needed to handle the authentication process for a nonstationary biometric signal.

An automatic biometric authentication system is essentially a pattern recognition system that operates through enrollment and testing sessions, as discussed above. During the enrollment session, the biometric data from an individual are acquired, features are extracted from the data, and statistical models are trained based on the extracted features. During the testing session, the acquired data are compared with the statistical models for an authentication decision. A biometric authentication process can be used to recognize a person by identification or verification. Identification is the process of searching for a match through a group of previously enrolled or characterized biometric information; it is often called "one-to-many" matching. Verification is the process of comparing the claimed identity with one previously enrolled identity or one set of characterized information; it is often called "one-to-one" matching.

To give our readers an understanding of the accuracy of biometric authentication, we collected some reported results in Table 1.1. We note that it is unfair to use a table to compare different biometrics, because the evaluations were designed for different purposes, the data collections and experiments are based on different conditions and objectives, and different biometric modalities have different advantages and disadvantages. The purpose is only to give readers a general level of understanding of biometric approaches.
Table 1.1. List of Biometric Authentication Error Rates

Biometric    Task              Description                               False Rejection  False Accept
Iris         MBGC-2009 [34]    Still, HD-NIR portal video                10%              0.1%
             ICE-2006 [35]     Left/right eye                            1.0% / 1.5%      0.1%
Fingerprint  FVC2006 [13]      Open category, 4 databases                0.021%-5.56%     0.021%-5.56%
                               Light category, 4 databases               0.148%-5.4%      0.148%-5.4%
             FVC2004 [12]      Average, 20 years old                     2%               2%
             FpVTE-2003 [10]   US Gov. Ops. Data                         0.1%             1.0%
Face         MBGC-2009 [34]    Still/portable (controlled-uncontrolled)  1%-5%            0.1%
             FRVT-2006 [13]    Varied resolution                         1%-5%            0.1%
                               Varied lighting                           1%               0.1%
Speech       NIST-2006 [37]    Speaker recognition, text independent     3% (from DET)    1.2% (from DET)
             See Table 10.1    Speaker verification, text dependent      1.8%             1.8%
             See Table 14.3    Verbal information verification           0%               0%

Since the numbers in the table may not represent the latest experimental and evaluation results, readers are encouraged to consult the references for the detailed evaluation methods and approaches of each evaluation. In the table, we added our experimental results on speaker verification and verbal information verification as reported in several chapters of this book. We note that in the speaker verification experiments all the tested speakers used the same passphrase, and the passphrase is less than two seconds long on average. If different speakers used different passphrases, the speaker verification equal-error rate could be below 0.5%. We also note that the NIST 2006 error rates were extracted from the detection error tradeoff (DET) curves in [28]. For detailed results about the evaluation, readers should read [37].

As we discussed above, biometric tests can also be combined, and a biometric-based decision procedure can be parallel or sequential. A decision can be made based on one or multiple kinds of biometric tests. Furthermore, biometric-based authentication can be combined with other authentication methods. For example, in [30, 31] speech was used to generate the key for encryption, which is a combination of biometric-based and information-based authentication. Therefore, understanding authentication technologies and methods can help readers design specific authentication procedures to meet application requirements.

We note that among all biometric modalities, a person's voice is one of the most convenient biometrics for user authentication purposes, because it is easy to produce, capture, and transmit over the ubiquitous telephone and Internet networks. It can also be supported by existing telephone and wireless communication services without requiring special or additional hardware.
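The verification (one-to-one) and identification (one-to-many) modes described in this section can be sketched as follows. This is a toy illustration of my own, not code from this book: the scores are placeholder similarity values, whereas a real system would compute them from acoustic features and statistical models.

```python
# Sketch of the two biometric recognition modes:
#  - verification: one-to-one comparison against a claimed identity
#  - identification: one-to-many search over all enrolled identities

# Hypothetical score(test_sample, model) for each enrolled identity.
enrolled_scores = {"alice": 2.1, "bob": -0.3, "carol": 0.8}

def verify(claimed_id, threshold=1.0):
    """One-to-one: accept if the claimed identity's score exceeds a threshold."""
    return enrolled_scores[claimed_id] >= threshold

def identify():
    """One-to-many: return the enrolled identity with the best-matching model."""
    return max(enrolled_scores, key=enrolled_scores.get)

print(verify("alice"))  # True: 2.1 exceeds the threshold of 1.0
print(identify())       # 'alice': the highest-scoring enrolled identity
```

Note that identification cost grows with the enrolled population, while verification cost does not; this is one reason the two modes are treated separately throughout the book.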
1.3 Information-Based Authentication

Information-based authentication is the process of securing a transaction using preregistered information and knowledge. The most popular examples are passwords, PINs, one's mother's maiden name, or other private information. The private information is preregistered with a bank, trusted agent, or computer server, and a user needs to provide exactly the same information to gain access. More complex authentication systems use cryptographic protocols, where the key is the preregistered information for encryption and decryption [47]. Although the private information can be provided by typing, typing is inconvenient for applications where keyboards are unavailable or difficult to access, such as handheld devices and telephone communications. When speech is used to provide the information, human operators are normally needed to verify it. Verifying the verbal information automatically is called verbal information verification, which will be discussed in detail in this book.
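The exact-match principle behind information-based authentication can be illustrated with a small sketch. This is my own example, not a design from this book: the server stores only a salted hash of the preregistered answer and accepts a user when the answer supplied at test time reproduces it; the PIN value and PBKDF2 parameters are arbitrary choices for illustration.

```python
import hashlib
import hmac
import os

# Sketch of an information-based check: only a salted hash of the
# preregistered answer is stored, never the answer itself.

def register(answer: str):
    """Enrollment: derive and store a salted hash of the secret answer."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", answer.encode(), salt, 100_000)
    return salt, digest  # kept in the user's profile on the server

def check(answer: str, salt: bytes, stored: bytes) -> bool:
    """Testing: accept only if the supplied answer hashes to the stored value."""
    digest = hashlib.pbkdf2_hmac("sha256", answer.encode(), salt, 100_000)
    return hmac.compare_digest(digest, stored)

salt, stored = register("1234")      # e.g., a PIN registered at enrollment
print(check("1234", salt, stored))   # True: exact match is accepted
print(check("4321", salt, stored))   # False: anything else is rejected
```

The all-or-nothing behavior shown here is what separates information-based authentication from biometric matching, where scores are continuous and a threshold must be chosen.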
1.4 Speaker Authentication

As discussed above, speaker authentication is concerned with authenticating a user's identity either via voice characteristics, a biometric-based authentication, or via verbal content, an information-based authentication. There are two major approaches to speaker authentication: speaker recognition (SR) and verbal information verification (VIV). The SR approach attempts to verify a speaker's identity based on his/her voice characteristics, while the VIV approach verifies a speaker's identity through verification of the content of his/her utterance(s).
Fig. 1.1. Speaker authentication approaches: speaker recognition (authentication by speech characteristics), which comprises speaker verification and speaker identification, and verbal information verification (authentication by verbal content).
As shown in Fig. 1.1, the approaches to speaker authentication can be categorized into two groups: one uses a speaker's voice characteristics, which leads to speaker recognition, and the other focuses on the verbal content of the spoken utterance, which leads to verbal information verification (VIV). Based on our above definitions, speaker recognition, recognizing who is talking, is a biometric-based authentication, while VIV, verifying a user based on what is being said, is an information-based authentication. The two techniques can be further combined to provide an enhanced system, as indicated by the dashed line.
Fig. 1.2. A speaker verification system. In the enrollment session, training utterances (e.g., "Open Sesame" spoken several times) are passed to model training to build a speaker-dependent model, which is stored in a database. In the test session, an identity claim and a test utterance (e.g., "Open Sesame") are passed to the speaker verifier, which produces scores against the stored model.
1.4.1 Speaker Recognition

Speaker recognition, as one of the voice authentication techniques, has been studied for several decades [2, 1, 39, 5, 4, 23]. As shown in Fig. 1.1, speaker recognition (SR) can be formulated in two operating modes: speaker verification and speaker identification. Speaker verification (SV) is the process of verifying whether an unknown speaker is the person as claimed, i.e., a yes-no, one-to-one hypothesis-testing problem. Speaker identification (SID) is the process of associating an unknown speaker with a member of a preregistered, known population, i.e., a one-to-many matching problem.

A typical SV system is shown in Fig. 1.2; it has two operating scenarios: an enrollment session and a test session. A speaker needs to enroll before she or he can use the system. In the enrollment session, the user's identity, such as an account number, together with a passphrase, such as a digit string or a key phrase like "open sesame" shown in the figure, is assigned to the speaker. The system then prompts the speaker to say the passphrase several times to allow training, or constructing, of a speaker-dependent (SD) model that registers the speaker's speech characteristics. The digit string can be the same as the account number, and the key phrase can be selected by the user so that it is easy to remember. An enrolled speaker can use the verification system in a future test. Similar procedures apply in the case of SID. These schemes are sometimes referred to as direct methods, as they use the speaker's speech characteristics to infer or verify the speaker's identity directly.

In a test session, the user first claims his/her identity by entering or speaking the identity information. The system then prompts the speaker to say the passphrase. The passphrase utterance is compared against the stored SD model. The speaker is accepted if the verification score exceeds a preset threshold; otherwise, the speaker is rejected. Note that the passphrase may or may not be kept secret.
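The threshold decision just described can be sketched as follows. This is a deliberately simplified illustration of my own, not the book's algorithm: the SD and background models are single one-dimensional Gaussians with made-up parameters, whereas a real system would score sequences of cepstral feature vectors against HMM or GMM models.

```python
import math

# Sketch of an SV accept/reject decision: the score is the average
# log-likelihood ratio between the speaker-dependent (SD) model and a
# background (impostor) model, compared against a preset threshold.

def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian at x."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def verification_score(frames, sd_model, bg_model):
    """Average per-frame log-likelihood ratio of SD model vs. background."""
    llr = [log_gauss(x, *sd_model) - log_gauss(x, *bg_model) for x in frames]
    return sum(llr) / len(llr)

sd_model = (1.0, 0.5)   # hypothetical (mean, variance) trained at enrollment
bg_model = (0.0, 2.0)   # hypothetical background/impostor model
threshold = 0.0         # preset decision threshold

frames = [0.9, 1.2, 0.8, 1.1]  # toy feature sequence from a test utterance
score = verification_score(frames, sd_model, bg_model)
print(score >= threshold)       # True: accept the claimed identity
```

Raising the threshold lowers false acceptances at the cost of more false rejections; choosing it is the trade-off behind the equal-error rates quoted throughout this book.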
When the passphrases are the same in training and testing, the system is called a fixed-passphrase or fixed-phrase system. Frequently, a short phrase or a connected-digit sequence, such as a telephone or account number, is chosen as the fixed passphrase. Using a digit string for a passphrase is distinctly different from other, non-digit choices. The high performance of current connected-digit speech recognition systems and the embedded error-correcting possibilities of digit strings make it feasible for the identity claim to be made via spoken, rather than keyed-in, input [40, 43]. If such an option is installed, the spoken digit string is first recognized by automatic speech recognition (ASR), and the standard verification procedure then follows using the same digit string. Obviously, successful verification of a speaker relies upon correct recognition of the input digit string.

A security concern may be raised about using fixed passphrases, since a spoken passphrase can be tape-recorded by impostors and used in later trials to gain access to the system. Text-prompted SV systems have been proposed to circumvent this problem. A text-prompted system uses a set of speaker-dependent word or subword models, possibly for a small vocabulary such as the digits. These models are employed as the building blocks for constructing the models for the prompted utterance, which may or may not be part of the training material. When the user tries to access the system, the system prompts the user to utter a randomly picked sequence of words in the vocabulary. The word sequence is aligned with the pretrained word models, and a verification decision is made based upon the evaluated likelihood score. Compared to a fixed-phrase system, such a text-prompted system normally needs a longer enrollment time in order to collect enough data to train the SD word or subword models. The performance of a text-prompted system is, in general, not as high as that of a fixed-phrase system.
This is due to the fact that the phrase model constructed from concatenating elementary word or subword models is usually not as accurate as that directly trained from the phrase utterance in a ﬁxedphrase system. Details on a textprompted system will be discussed later in this book. A typical SID system also has two operation scenarios – training section and testing section. During training, we need to train multiple speakerdependent acoustic models, one model associated with one speaker. During testing, when we receive testing speech data, we evaluate the data using all the trained speakerdependent models. The identiﬁed speaker is the speaker who’s acoustic model has the best match to the given testing data compared to others. The above systems are called textdependent, or textconstrained speaker recognition (SR) systems because the input utterance is constrained, either by a ﬁxed phrase or by a ﬁxed vocabulary. A veriﬁcation system can also be textindependent. In a textindependent SR system, a speaker’s model is trained on the general speech characteristics of the person’s voice [38, 14]. Once such a model is trained, the speaker can be veriﬁed regardless of the underlying text of the spoken input. Such a system has wide applications for monitoring and
verifying a speaker on a continuous basis. In order to characterize a speaker's general voice pattern without a text constraint, we normally need a large amount of phonetically or acoustically rich training data in the enrollment procedure. Also, without the text or lexical constraint, longer test utterances are usually needed to maintain satisfactory SR performance. Without a large training set and long test utterances, the performance of a text-independent system is usually inferior to that of a text-dependent system. When an SR system is both trained and tested on the same set of speakers, the evaluation is called a closed test; otherwise, it is an open test. In a closed test, data from all the potential impostors (i.e., all speakers except the true speaker) in the population can be used to train a set of high-performance, discriminative speaker models. However, as most SR applications are of an open-test nature, training the discriminative model against all possible impostors is not possible. As an alternative, a set of speakers whose speech characteristics are close to the true speaker's can be used to train the speaker-dependent discriminative model, or speaker-independent models can be used to model impostors.

1.4.2 Verbal Information Verification

When applying current speaker recognition technology to real-world applications, several problems are encountered. One such problem is the need for a voice enrollment session to collect data for training the speaker-dependent (SD) model. The voice enrollment is an inconvenience to the user as well as to the system operators, who often have to supervise the process and check the quality of the collected data to ensure system performance. The quality of the collected training data has a critical effect on the performance of an SV system. A speaker may make a mistake when repeating the training utterances/passphrases several times.
Furthermore, as we have discussed in [26], since the enrollment and testing voices may come from different telephone handsets and networks, acoustic mismatch between the training and testing environments may occur. The SD models trained on the data collected in an enrollment session may not perform well when the test session takes place in a different environment or over a different transmission channel. The mismatch significantly affects SV performance. This is a significant drawback of the direct method, in which robustness in comparative evaluation is difficult to ensure. Alternatively, in light of the progress in modeling for speech recognition, the concept and algorithms of verbal information verification (VIV) were proposed [24, 22] to take advantage of the speaker's registered information. VIV is the process of verifying spoken utterances against the information stored in a given personal data profile. A VIV system may use a dialogue procedure to verify a user by asking questions. An example of a VIV system is shown in Fig. 1.3. It is similar to a typical tele-banking procedure: after an account number is provided, the operator verifies the user by asking for some personal information, such as the mother's maiden name, birth date, address, home telephone number, etc. The user must provide answers to
[Fig. 1.3 is a flowchart of verbal information verification by sequential questions: the system asks "In which year were you born?", then "In which city/state did you grow up?", then "May I have your telephone number, please?"; each answer utterance is verified in turn, a wrong answer leads to rejection, and acceptance requires all three utterances to be verified correctly.]

Fig. 1.3. An example of verbal information verification by asking sequential questions. Similar sequential tests can also be applied in speaker recognition and other biometric or multi-modality verification.
the questions correctly in order to gain access to his or her account and services. In this manner, a speaker's identity is embedded in the knowledge she or he has of some particular questions, and thus VIV is often considered an indirect method. To automate the whole procedure, the questions can be prompted by a text-to-speech (TTS) system or by pre-recorded messages. The difference between SR (the direct method) and VIV (the indirect method) can be further addressed in the following three aspects. First, in a speaker recognition system, whether for SID or for SV, we need to train speaker-dependent (SD) models, while in VIV we usually use speaker-independent statistical models with associated acoustic-phonetic identities. Second, a speaker recognition system needs to enroll a new user and train the SD model, while a VIV system does not require voice enrollment; instead, a user's personal data profile is created when the user's account is set up. Finally, in speaker recognition, the system has the ability to reject an impostor even when the input utterance contains a legitimate passphrase, if the utterance fails to match the pre-trained SD model. In VIV, it is solely the user's responsibility to protect his or her personal information because no speaker-specific
voice characteristics are used in the verification process. In real applications, there are several ways to circumvent the situation in which an impostor uses a speaker's personal information obtained by eavesdropping on a particular session. A VIV system can ask for information that may not be constant from one session to another, e.g. the amount or date of the last deposit, or for a subset of the registered personal information, i.e. a number of randomly selected information fields in the personal data profile. To improve user convenience and system performance, we can further combine VIV and SV to construct a progressive, integrated speaker authentication system. In the combined system, VIV is used to verify a user during the first few accesses. Simultaneously, the system collects verified training data for constructing speaker-dependent models. Later, the system migrates to an SV system for authentication. The combined system is convenient for users since they can start to use the system without going through a formal enrollment session and waiting for model training. Furthermore, since the training data may be collected from different channels in different VIV sessions, the acoustic mismatch problem is mitigated, potentially leading to better system performance in test sessions. The SD statistical models can also be updated to cover different acoustic environments while the system is in use, further improving system performance. VIV can also be used to ensure the quality of training data for SV. Details of this approach will be discussed in the following chapters.
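As an illustration, the sequential question-and-answer logic of a VIV dialogue such as the one in Fig. 1.3 can be sketched as follows. The question list, the profile fields, and the `verify_answer` helper are hypothetical stand-ins; in a real VIV system the answer utterance would be verified against the profile using ASR and speaker-independent acoustic models, not a string comparison.

```python
QUESTIONS = [
    "In which year were you born?",
    "In which city/state did you grow up?",
    "May I have your telephone number, please?",
]

def verify_answer(question, utterance, profile):
    """Placeholder for the real utterance-verification step: an actual
    system scores the answer utterance against acoustic models of the
    expected answer stored in the personal data profile."""
    return utterance.strip().lower() == profile.get(question, "").strip().lower()

def viv_session(utterances, profile):
    """Ask the questions in order; reject at the first wrong answer,
    accept only after all answers are verified."""
    for question, utterance in zip(QUESTIONS, utterances):
        if not verify_answer(question, utterance, profile):
            return False  # rejection
    return True           # acceptance on all three utterances
```

The early-exit structure mirrors the flowchart: each wrong answer leads directly to rejection, so an impostor is stopped at the first question he or she cannot answer.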
1.5 Historical Perspective and Further Reading

Speaker authentication has been studied for several decades, and it is difficult to provide a complete review of its research history. This section is not intended to review the entire history of speaker authentication, but rather to briefly summarize the important progress in the field based on the author's experience and understanding of the technical approaches, and to provide links between previous research and development and the chapters of this book. From 1976 to 1997, the Proceedings of the IEEE published review or tutorial papers on speaker authentication about once a decade [1, 39, 8, 5]. A general overview of speaker recognition was published in 1992 [44]. Before 1998, speaker authentication mainly focused on speaker recognition, including SV and SID, for text-dependent and text-independent applications. In 1999, the author and his colleagues' review paper extended speaker recognition to speaker authentication [23]. The above review papers and the references therein are good starting points for reviewing the history of speaker authentication. A speaker recognition system consists of two major subsystems: feature extraction and pattern recognition. Feature extraction is used to extract acoustic features from speech waveforms, while pattern recognition is used to recognize
true speakers and impostors from their acoustic features. We briefly review these two areas in turn. In the 1960s, pitch contours [3], linear prediction analysis, and the cepstrum [2] were applied to feature extraction for speaker recognition. In [11], cepstral normalization was introduced; it is still a useful technique in today's speaker recognition systems. In [18, 9], filter banks were applied to feature extraction, where the filter banks were implemented by analogue filters followed by wave rectifiers. In [38], MFCCs were used in speaker recognition. As analyzed in Chapter 8, the linear predictive cepstrum and the Mel cepstrum may perform similarly for narrowband signals, but for wideband signals the Mel cepstrum can provide better performance. In [19], an auditory-based transform and auditory filter bank were introduced; furthermore, an auditory-based feature was presented in [21, 20] to replace the FFT. We note that, compared with an FFT-spectrum-based filter bank [7], such as that used in Mel frequency cepstral coefficients (MFCC), the filter bank in the analogue implementation and the auditory transform have many advantages; the details will be discussed in Chapter 7. Also beginning in the 1960s, basic statistical pattern recognition techniques were being used, including the mean, variance, and Gaussian distribution [36, 3, 17]. In [36], the ratio of the variance of speaker means to the average intra-speaker variance was introduced, a concept similar to that of linear discriminant analysis (LDA). At that time, the statistical methods were based on the assumption that a speaker's data follow a single Gaussian distribution. In [48], vector quantization (VQ) was applied to speaker recognition. The VQ approach is more general than the previous approaches because it can handle data in any distribution. Later, the Gaussian mixture model (GMM) and hidden Markov model (HMM) [50, 42, 45, 38, 14, 29] were applied to speaker recognition.
A GMM can model any data distribution, and an HMM can characterize a non-stationary signal with multiple states, where the data in each state of the HMM can be modeled by a GMM. The HMM approach is still the popular approach to text-dependent speaker verification [33], while the GMM is popular in text-independent speaker recognition [38]. In [41, 43, 33], cohort and background models were used to improve robustness. Instead of computing the absolute scores of speaker-dependent models alone, the cohort and background models provide relative scores between a speaker-dependent model and a background model, which makes the speaker recognition system more robust. Starting with [27], discriminative training began to be applied to speaker recognition. Training speed was a concern for real applications; the research in Chapter 13 was conducted to speed up discriminative training. In 1996 [46], the support vector machine (SVM) was applied to speaker recognition. More recently, as SVM software has become publicly available, more papers have applied SVMs to speaker recognition [6]. Recently, soft margin estimation was applied as a discriminative objective to speaker identification, with convex optimization used to solve the parameter estimation problem [52].
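To make the GMM and background-model scoring concrete, here is a toy sketch, not the cited systems, using synthetic two-dimensional "features" in place of real cepstral vectors and scikit-learn's `GaussianMixture`; the speaker names, data, and all parameter choices are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic 2-D "features" standing in for cepstral feature vectors.
train = {
    "alice": rng.normal([0.0, 0.0], 0.5, size=(200, 2)),
    "bob":   rng.normal([3.0, 3.0], 0.5, size=(200, 2)),
}

# One speaker-dependent GMM per speaker ...
speaker_models = {
    name: GaussianMixture(n_components=2, random_state=0).fit(x)
    for name, x in train.items()
}
# ... and one speaker-independent background GMM trained on everyone.
background = GaussianMixture(n_components=4, random_state=0).fit(
    np.vstack(list(train.values())))

def identify(features):
    """Closed-set SID: pick the speaker whose model gives the highest
    average log-likelihood for the test features."""
    return max(speaker_models, key=lambda name: speaker_models[name].score(features))

def verify(claim, features, threshold=0.0):
    """SV with background normalization: accept when the relative score
    (log-likelihood ratio against the background model) is high enough."""
    llr = speaker_models[claim].score(features) - background.score(features)
    return llr > threshold
```

The `verify` function illustrates the relative-scoring idea above: rather than thresholding the absolute likelihood of the claimed speaker's model, it thresholds the difference against a background model, which is less sensitive to conditions that shift both scores together.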
In 1976 [39], approaches and challenges in speaker verification, along with two speaker verification systems developed by Texas Instruments and Bell Labs, were reviewed. Some of the challenges introduced then still confront today's speaker authentication systems, such as recording environments and transmission conditions on dialed-up telephone lines; those problems are even more severe in today's wireless and VoIP networks. To address the mismatch between training and testing conditions, a linear transform approach was developed in [26] and is described in Chapter 10. The factor analysis approach was developed in [16]. Both attempt to address the mismatch problem in the model domain. In [21, 20], an auditory-based feature was developed to address the problem in the feature domain; the details are available in Chapter 8. Regarding system evaluations, in text-independent speaker recognition NIST has been coordinating annual or biannual public evaluations since 1996 [37]. Readers can find recent approaches in the related publications. In text-dependent speaker verification, in 1997 a bank invited companies with speaker verification systems to attend a performance evaluation using the bank's proprietary database; the author attended the evaluation with his speaker verification system developed in the Bell Labs software environment. The term speaker authentication was introduced in 1997, while the author and his Bell Labs colleagues were developing the VIV technique [25, 24]. Since VIV goes beyond the traditional speaker recognition system, we introduced the term speaker authentication. As will be introduced in Chapters 14 and 15, VIV opens up a new research and application area in speaker authentication. In addition to the above discussions, another research area in speaker authentication is generating cryptographic keys from spoken passphrases or passwords.
The goal of this research is to enable a device to generate a key from voice biometrics to encrypt data saved on the device. An attacker who captures the device and extracts all the information it contains should nevertheless be unable to determine the key or the retrieval information from the device. This research was reported in [30, 31].
1.6 Book Organization

The chapters of this book cover three areas: pattern recognition, speech signal processing, and speaker authentication. As the foundation of this book, pattern recognition theory and algorithms are introduced in Chapters 2–4, 12, and 13. They are general enough to be applied to any pattern recognition task, including speech and speaker recognition. Signal processing algorithms and techniques are discussed in Chapters 5–8. They can be applied to audio signal processing, communications, speech recognition, and speaker authentication. Finally, algorithms, techniques, and methods for speaker authentication are introduced in Chapters 9, 10, 11, 14, and 15.
The book focuses on novel research ideas and effective, useful approaches. For each traditional topic, the author presents a newer, effective approach. For example, in vector quantization (VQ), in addition to the traditional K-means algorithm, the author presents a fast one-pass VQ algorithm. In neural networks (NN), instead of the back-propagation algorithm, the author presents a sequential and fast NN design method with data pruning. In decoding, instead of the Viterbi algorithm, the author presents a detection-based decoding algorithm. In signal processing, instead of the FFT, the author presents the auditory transform. In discriminative training, the author presents a fast, closed-form solution, and so on. Each chapter discusses one major topic and can be read independently of the other chapters. We now provide a brief synopsis of each chapter and how it relates to the others in the book. Chapter 2 – Multivariate Statistical Analysis and One-Pass Vector Quantization: Since speaker authentication technology is basically a statistical approach, multivariate statistical analysis is foundational to this book. This chapter introduces the popular multivariate Gaussian distribution and principal component analysis (PCA) with illustrations. The Gaussian distribution function is the core of the GMM and HMM used in speaker authentication, and PCA is important for understanding the geometrical properties of datasets. Following that, we introduce the traditional vector quantization (VQ) algorithm, in which the initial centroids are selected randomly, and then present the one-pass VQ algorithm, which initializes the centroids in a more efficient way. The segmental K-means algorithm extends the VQ algorithm to non-stationary, time-variant signals. The traditional VQ and one-pass algorithms can be applied to train the Gaussian models and large background models for speaker recognition, while segmental K-means can be used to train the HMM for speaker verification.
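The traditional VQ baseline mentioned above, random initial centroids followed by iterative refinement, can be sketched as follows; this is only the classical K-means starting point, not the one-pass algorithm presented in Chapter 2.

```python
import numpy as np

def vq_kmeans(data, k, iters=20, seed=0):
    """Traditional K-means VQ: initialize the codebook with k randomly
    chosen training vectors, then alternate nearest-codeword assignment
    and codeword (centroid) update."""
    rng = np.random.default_rng(seed)
    codebook = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Assign each training vector to its nearest codeword.
        dist = np.linalg.norm(data[:, None, :] - codebook[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Move each codeword to the mean of its quantization cell.
        for j in range(k):
            if np.any(labels == j):
                codebook[j] = data[labels == j].mean(axis=0)
    return codebook, labels
```

The random initialization is exactly the weakness the one-pass algorithm addresses: a poor initial draw can leave K-means in a bad local optimum or require many refinement passes.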
Chapter 3 – Principal Feature Networks for Pattern Recognition: Pattern recognition is one of the fundamental technologies in speaker authentication. During the last several decades, many pattern recognition algorithms have been developed, from linear discriminant analysis to decision trees, Bayesian decision theory, multi-layer neural networks, and support vector machines. Many books already introduce the techniques and software toolboxes for most of these methods. Instead of reviewing each of them, in this chapter we introduce a novel neural network training and construction approach. It is a sequential design procedure: given the required recognition performance or error rates, the algorithm constructs a recognizer or neural network step by step based on discriminative pattern recognition techniques, and it can determine the structure of the recognizer or network from the required performance. The introduced technique can help readers understand the relationship between multivariate statistical analysis and neural networks. Principal feature networks have been used in real-world applications.
Chapter 4 – Non-Stationary Pattern Recognition: Traditional pattern recognition technologies assume stationary patterns, i.e. patterns that do not change over time; however, the patterns of speech signals in a feature domain do change with time, so non-stationary pattern recognition techniques are necessary for speaker authentication. The current approach to non-stationary pattern recognition is actually built on the stationary approach: it divides speech utterances into a sequence of short time segments, called states. Within each state, the speech pattern is assumed to be stationary; thus, stationary pattern recognition techniques can be applied to solve the non-stationary problem. This chapter provides the foundation for further descriptions of the algorithms used in speaker authentication. Chapter 5 – Robust Endpoint Detection: Recorded speech signals for speaker authentication normally contain a combination of speech and silence. To achieve the best performance in terms of recognition speed and accuracy, one usually removes the silence in a front-end process. The detection of the presence of speech embedded in various types of non-speech events and background noise is called endpoint detection, speech detection, or speech activity detection. When the signal-to-noise ratio (SNR) is high, the task is not very difficult; but when the SNR is low, as in wireless communications, VoIP, or strong background noise, the task can be very difficult. In this chapter, we present a robust endpoint detection algorithm that is invariant to different background noise levels. The algorithm has been used in real applications and has significantly improved speech recognition performance. For any real application, a robust endpoint detection algorithm is necessary. The technique can be used not only for speaker authentication, but for audio signal processing and speech recognition as well.
Chapter 6 – Detection-Based Decoder: When the HMM is used in speech and speaker authentication, a decoding algorithm, such as the Viterbi algorithm, is needed to search for the best state sequence for non-stationary pattern recognition. However, the search space is usually large, and one has to reduce it for practical applications. A popular approach is to pre-assign, or guess, a beam width to limit the search space; obviously, this is not the best way to reduce it. In this chapter, we bring detection theory to the decoding task and present a detection-based decoding algorithm that reduces the search space based on change-point detection theory. Our experimental results show that the algorithm can significantly speed up the decoding procedure. The algorithm can be used in speech recognition as well. Chapter 7 – Auditory-Based Time-Frequency Transform: The time-frequency transform plays an important role in signal processing. The Fourier transform (FT) has been used for decades but, as analyzed in this chapter, the FT is not robust to background noise and also generates significant computational noise. In this chapter, we present an auditory-based time-frequency transform based on our study of the hearing periphery system. The auditory transform (AT) is a pair of forward and inverse transforms whose invertibility has been proved in theory and validated in experiments. The AT has much less computational noise than the FT and can be free from pitch harmonics. It provides a solution for robust signal processing and can be used as an alternative to the FT. We also compare the AT with the FFT, the wavelet transform, and the Gammatone filter bank. Chapter 8 – Auditory-Based Feature Extraction and Robust Speaker Identification: In this chapter, we present an auditory-based feature extraction algorithm. The features are based on the robust time-frequency transform introduced in Chapter 7, plus a set of modules that mimic the signal processing functions of the cochlea. The purpose is to address the acoustic mismatch problem between training and testing in speaker recognition. In our experiments, the new auditory-based features have been shown to be more robust than the traditional MFCC (Mel frequency cepstral coefficients), PLP, and RASTA-PLP features. Chapter 9 – Fixed-Phrase Speaker Verification: In this chapter, we focus on a fixed-phrase SV system for open-set applications. Here, fixed-phrase means that the same passphrase is used for one speaker in both training and testing sessions and the text of the passphrase is known to the system through registration. A short, user-selected phrase, also called a passphrase, is easy to remember and use. For example, it is easier to remember "open sesame" as a passphrase than a 10-digit phone number. Based on our experiments, the selected passphrase can be short, less than two seconds in duration, and still yield good performance. Chapter 10 – Robust Speaker Verification with Stochastic Matching: In this chapter, we address the acoustic mismatch between training and testing environments from a different approach – transforming the feature space.
Speaker authentication performance degrades when a model trained under one set of conditions is used to evaluate data collected from different telephone channels, microphones, etc. The mismatch can be approximated as a linear transform in the cepstral domain. We present a fast, efficient algorithm to estimate the parameters of the linear transform for real-time applications. Using the algorithm, test data are transformed toward the training conditions by rotation, scaling, and translation without destroying the detailed characteristics of the speech. As a result, the pre-trained SD models can be used to evaluate the details under the same conditions as in training. Compared to cepstral mean subtraction (CMS) and other bias-removal techniques, the presented linear transform is more general, since CMS and the others consider only translation; compared to maximum-likelihood approaches to stochastic matching, the presented algorithm is simpler and faster, since iterative techniques are not required. Chapter 11 – Randomly Prompted Speaker Verification: In this chapter we first introduce the randomly prompted SV system and then present a robust algorithm for randomly prompted SV, referred to here as normalized discriminant analysis (NDA). Using this technique, it is
possible to design an efficient linear classifier with very limited training data and to generate normalized discriminant scores of comparable magnitudes from different classifiers. The NDA technique is applied to a recognizer for randomly prompted speaker verification, where speaker-specific information is obtained when utterances are processed with speaker-independent models. In experiments conducted on a network-based telephone database, the NDA technique shows a significant improvement over Fisher linear discriminant analysis. Furthermore, when NDA is used in a hybrid SV system combining information from speaker-dependent and speaker-independent models, verification performance is better than that of the HMM with cohort normalization. Chapter 12 – Objectives for Discriminative Training: Discriminative training has shown advantages over maximum likelihood training in speech and speaker recognition. To this end, a discriminative objective needs to be defined first. In this chapter, the relations among several popular discriminative training objectives for speech and speaker recognition, language processing, and pattern recognition are derived through theoretical analysis. These objectives are the minimum classification error (MCE), maximum mutual information (MMI), minimum error rate (MER), and a recently proposed generalized minimum error rate (GMER). The results show that all the objectives can be related to minimum error rates and maximum a posteriori probability. The results and the analytical methods used in this chapter can help in judging and evaluating discriminative objectives, and in defining new objectives for different tasks and better performance. Chapter 13 – Fast Discriminative Training: Currently, most discriminative training algorithms for nonlinear classifier design are based on gradient-descent (GD) methods for objective minimization.
These algorithms are easy to derive and effective in practice, but they are slow in training and make selecting the learning rates difficult. To address the problem, we present our study of a fast discriminative training algorithm. The algorithm initializes the parameters with the expectation-maximization (EM) algorithm and then uses a set of closed-form formulas, derived in this chapter, to further optimize a proposed objective of minimizing the error rate. Experiments in speech applications show that the algorithm provides better recognition accuracy in fewer iterations than the EM algorithm and than a neural network trained by hundreds of gradient-descent iterations. Our contribution in this chapter is a new way to formulate the objective minimization process, and thereby a process that can be efficiently implemented with the desired result promised by discriminative training. Chapter 14 – Verbal Information Verification: Traditional speaker authentication focuses on SV and SID, which are both accomplished by matching the speaker's voice with his or her registered speech patterns. In this chapter, we introduce a technique named verbal information verification (VIV), in which spoken utterances of a claimed speaker are automatically verified against the key (usually confidential) information in the speaker's registered profile to decide whether the claimed identity should be accepted or rejected. Using
the proposed sequential procedure involving three question-response turns, VIV achieved an error-free result in a telephone speaker authentication experiment with 100 speakers. VIV opens up a new research direction and application area in speaker authentication. Chapter 15 – Speaker Authentication System Design: Throughout this book we introduce various speaker authentication techniques. In real-world applications, two or more of these techniques can be combined to construct a useful and convenient system that meets the requirements of a particular application. In this chapter, we provide an example of a speaker authentication system design. The design requirement comes from an online banking system that requires speaker verification but does not want to inconvenience customers with an enrollment procedure. The designed system is a combination of VIV and SV. Following this example, readers can design their own systems for any particular application.
References

1. Atal, B. S., "Automatic recognition of speakers from their voices," Proceedings of the IEEE, vol. 64, pp. 460–475, 1976.
2. Atal, B. S., "Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification," Journal of the Acoustical Society of America, vol. 55, pp. 1304–1312, 1974.
3. Atal, B., Automatic speaker recognition based on pitch contours. PhD thesis, Polytechnic Institute of Brooklyn, Brooklyn, NY, June 1968.
4. Campbell, J. P., "Forensic speaker recognition," IEEE Signal Processing Magazine, pp. 95–103, March 2009.
5. Campbell, J. P., "Speaker recognition: a tutorial," Proceedings of the IEEE, vol. 85, pp. 1437–1462, Sept. 1997.
6. Campbell, W. M., "Support vector machines using GMM supervectors for speaker verification," IEEE Signal Processing Letters, pp. 308–311, May 2006.
7. Davis, S. B. and Mermelstein, P., "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. ASSP-28, pp. 357–366, August 1980.
8. Doddington, G., "Speaker recognition – identifying people by their voices," Proceedings of the IEEE, vol. 73, pp. 1651–1664, Nov. 1985.
9. Doddington, G., "Speaker verification – final report," Tech. Rep. RADC 74-179, Rome Air Development Center, Griffiss AFB, NY, Apr. 1974.
10. FpVTE2003, "http://www.nist.gov/itl/iad/ig/fpvte03.cfm," in Proceedings of The Fingerprint Vendor Technology Evaluation (FpVTE), 2003.
11. Furui, S., "Cepstral analysis techniques for automatic speaker verification," IEEE Trans. Acoust., Speech, Signal Processing, vol. 27, pp. 254–277, April 1981.
12. FVC2004, "http://bias.csr.unibo.it/fvc2004/," in Proceedings of The Third International Fingerprint Verification Competition, 2004.
13. FVC2006, "http://bias.csr.unibo.it/fvc2006/," in Proceedings of The Fourth International Fingerprint Verification Competition, 2006.
14. Gish, H. and Schmidt, M., "Text-independent speaker identification," IEEE Signal Processing Magazine, pp. 18–32, Oct. 1994.
15. Jain, A. K., Ross, A., and Prabhakar, S., "An introduction to biometric recognition," IEEE Trans. on Circuits and Systems for Video Technology, vol. 14, pp. 4–20, January 2004.
16. Kenny, P. and Dumouchel, P., "Experiments in speaker verification using factor analysis likelihood ratios," in Proceedings of Odyssey, pp. 219–226, 2004.
17. Li, K. and Hughes, G., "Talker differences as they appear in correlation matrices of continuous speech spectra," J. Acoust. Soc. Amer., vol. 55, pp. 833–837, Apr. 1974.
18. Li, K., Dammann, J. E., and Chapman, W. D., "Experimental studies in speaker verification using an adaptive system," J. Acoust. Soc. Amer., vol. 40, pp. 966–978, Nov. 1966.
19. Li, Q., "An auditory-based transform for audio signal processing," in Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, (New Paltz, NY), Oct. 2009.
20. Li, Q. and Huang, Y., "An auditory-based feature extraction algorithm for robust speaker identification under mismatched conditions," IEEE Trans. on Audio, Speech and Language Processing, Sept. 2011.
21. Li, Q. and Huang, Y., "Robust speaker identification using an auditory-based feature," in ICASSP 2010, 2010.
22. Li, Q. and Juang, B.-H., "Speaker verification using verbal information verification for automatic enrollment," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, (Seattle), May 1998.
23. Li, Q., Juang, B.-H., Lee, C.-H., Zhou, Q., and Soong, F. K., "Recent advancements in automatic speaker authentication," IEEE Robotics & Automation Magazine, vol. 6, pp. 24–34, March 1999.
24. Li, Q., Juang, B.-H., Zhou, Q., and Lee, C.-H., "Automatic verbal information verification for user authentication," IEEE Trans. on Speech and Audio Processing, vol. 8, pp. 585–596, Sept. 2000.
25. Li, Q., Juang, B.-H., Zhou, Q., and Lee, C.-H., "Verbal information verification," in Proceedings of EUROSPEECH, (Rhodes, Greece), pp. 839–842, Sept. 22–25, 1997.
26. Li, Q., Parthasarathy, S., and Rosenberg, A. E., "A fast algorithm for stochastic matching with application to robust speaker verification," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, (Munich), pp. 1543–1547, April 1997.
27. Liu, C.-S., Lee, C.-H., Juang, B.-H., and Rosenberg, A., "Speaker recognition based on minimum error discriminative training," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1994.
28. Martin, A., Doddington, G., Kamm, T., Ordowski, M., and Przybocki, M., "The DET curve in assessment of detection task performance," in Proceedings of Eurospeech, (Rhodes, Greece), pp. 1899–1903, Sept. 1997.
29. Matsui, T. and Furui, S., "Comparison of text-independent speaker recognition methods using VQ-distortion and discrete/continuous HMM's," IEEE Trans. on Speech and Audio Processing, vol. 2, pp. 456–459, 1994.
30. Monrose, F., Reiter, M. K., Li, Q., and Wetzel, S., "Using voice to generate cryptographic keys: a position paper," in Proceedings of Speaker Odyssey, June 2001.
20
1 Introduction
31. Monrose, F., Reiter, M. K., Q. Li, D. L., and Shih, C., “Toward speech generated cryptographic keys on resource constrained devices,” in Proceedings of the 11th USENIX Security Symposium, August 2002. 32. O’Gorman, L., “Comparing passwords, tokens, and biometrics for user authentication,” Proceedings of the IEEE, vol. 91, pp. 2021–2040, December 2003. 33. Parthasarathy, S. and Rosenberg, A. E., “General phrase speaker veriﬁcation using subword background models and likelihoodratio scoring,” in Proceedings of ICSLP96, (Philadelphia), October 1996. 34. Phillips, P. J., “Mbgc portal challenge version 2 preliminary results,” in Proceedings of MBGC Third Workshop, 2009. 35. Phillips, P. J., Scruggs, W. T., O’Toole, A. J., Flynn, P. J., Bowyer, K. W., Schott, C. L., and Sharpe, M., “Frvt 2006 and ice 2006 largescale results,” in NISTIR, March 2007. 36. Pruzansky, S., “Patternmatching procedure for automatic talker recognition,” J. Acoust. Soc. Amer., vol. 35, pp. 354–358, Mar. 1963. 37. Przybocki, M., Martin, A., and Le, A., “NIST speaker recognition evaluations utilizing the mixer corpora – 2004, 2005, 2006,” IEEE Trans. Audio, Speech and Language Processing, vol. 15, pp. 1951–1959, Sept. 2007. 38. Reynolds, D. and Rose, R. C., “Robust textindependent speaker identiﬁcation using Gaussian mixture speaker models,” IEEE Trans. on Speech and Audio Processing, vol. 3, pp. 72–83, 1995. 39. Rosenberg, A. E., “Automatic speaker veriﬁcation: a review,” Proceedings of the IEEE, vol. 64, pp. 475–487, April 1976. 40. Rosenberg, A. E. and DeLong, J., “HMMbased speaker veriﬁcation using a telephone network database of connected digital utterances,” Technical Memorandum BL0112693120623TM, AT&T Bell Laboratories, December 1993. 41. Rosenberg, A. E., DeLong, J., Lee, C.H., Juang, B.H., and Soong, F. K., “The use of cohort normalized scores for speaker veriﬁcation,” in Proceedings of the International Conference on Spoken Language Processing, (Banﬀ, Alberta, Canada), pp. 
599–602, October 1992. 42. Rosenberg, A. E., Lee, C.H., and Juang, B.H., “Subword unit talker veriﬁcation using hidden markov models,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 269–272, 1990. 43. Rosenberg, A. E. and Parthasarathy, S., “Speaker background models for connected digit password speaker veriﬁcation,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, (Atlanta), pp. 81–84, May 1996. 44. Rosenberg, A. and Soong, F., “Recent research in automatic speaker recognition,” in Advances in Speech Signal Processing, (Furui, S. and Sondhi, M., eds.), pp. 701–738, NY: Marcel Dekker, 1992. 45. Savic, M. and Gupta, S., “Variable parameter speaker veriﬁcation system based on hidden Markov modeling,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 281–284, 1990. 46. Schmidt, M. and Gish, H., “Speaker identiﬁcation via support vector machine,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 105–108, 1996. 47. Simmons, G. J., “A survey of information authentication,” Proceedings of the IEEE, vol. 76, pp. 603–620, May 1988. 48. Soong, F. K. and Rosenberg, A. E., “On the use of instantaneous and transitional spectral information in speaker recognition,” IEEE Tran. Acoust., Speech, Signal Processing, vol. ASSP36, pp. 871–879, June 1988.
References
21
49. Stocksdale, G., “Glossary of terms used in security and intrusion detection,” Online, NSA, 2009. http:/www.sans.org/newlook/resources/glossary.htm. 50. Tishby, N., “Information theoretic factorization of speaker and language in hidden markov models, with application to speaker recognition,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. v1, 97–90, April 1988. 51. Woodward, J. D., Webb, K. W., Newton, E. M., Bradley, M. A., Rubenson, D., Larson, K., Lilly, J., Smythe, K., Houghton, B., Pincus, H. A., Schachter, J., and Steinberg, P., Army Biometric Applications Identifying and Addressing Sociocultural Concerns. RAND Arrayo, 2001. 52. Yin, Y. and Li, Q., “Soft frame margin estimation of Gaussian mixture models for speaker recognition with sparse training data,” in ICASSP 2011, 2011.
Chapter 2 Multivariate Statistical Analysis and OnePass Vector Quantization
Current speaker authentication algorithms are largely based on multivariate statistical theory. In this chapter, we introduce the most important technical components and concepts of multivariate analysis as they apply to speaker authentication: the multivariate Gaussian (also called normal) distribution, principal component analysis (PCA), vector quantization (VQ), and segmental K-means. These fundamental techniques have been used for statistical pattern recognition and will be used in our further discussions throughout this book. Understanding the basic concepts of these techniques is essential for understanding and developing speaker authentication algorithms. Readers who are already familiar with multivariate analysis can skip most of the sections in this chapter; however, the one-pass VQ algorithm presented in Section 2.4 comes from the author and Swaszek's research [8] and may be unknown to the reader. The algorithm is useful in processing very large datasets. For example, when training a background model in speaker recognition, the dataset can be huge; the one-pass VQ algorithm can speed up the initialization of the training procedure. It can be used to initialize the centroids during Gaussian mixture model (GMM) or hidden Markov model (HMM) training for large datasets. Also, the concept will be applied in Chapter 3 to the sequential design of a classifier and in Chapter 14 to the sequential design of speaker authentication systems.
2.1 Multivariate Gaussian Distribution

The multivariate Gaussian distribution plays a fundamental role in multivariate analysis, and many real-world problems fall naturally within the framework of Gaussian theory. It is also important and popular in speaker recognition. The importance of the Gaussian distribution in speaker authentication rests on its extension to the mixture Gaussian distribution, or Gaussian mixture model (GMM). A mixture Gaussian distribution with enough Gaussian components
Fig. 2.1. An example of a bivariate Gaussian distribution with σ11 = 1.23, σ12 = σ21 = 0.45, and σ22 = 0.89.
can approximate the "true" population distribution of speech data. In addition, the EM (Expectation-Maximization) algorithm provides a convenient training procedure for the GMM. The algorithm is fast and guarantees convergence at every iteration. Third, based on the GMM, we can build a hidden Markov model (HMM) for speaker verification and verbal information verification. In this section we discuss the single Gaussian distribution, which is the basic component of a mixture Gaussian distribution. The mixture Gaussian distribution will be discussed together with models for speaker recognition in the following sections. A p-dimensional Gaussian density for the random vector X = [x_1, x_2, \ldots, x_p]' can be presented as:

f(x) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x-\mu)' \Sigma^{-1} (x-\mu) \right\}   (2.1)

where −∞ < x_i < ∞, i = 1, 2, \ldots, p. The p-dimensional normal density is denoted N_p(μ, Σ) [6]. For example, when p = 2, the bivariate Gaussian density has the individual parameters μ_1 = E(X_1), μ_2 = E(X_2), σ_{11} = Var(X_1), and σ_{22} = Var(X_2); the covariance matrix is:

\Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{pmatrix}   (2.2)
Fig. 2.2. The contour of the Gaussian distribution in Fig. 2.1.
and the inverse of the covariance matrix is

\Sigma^{-1} = \frac{1}{\sigma_{11}\sigma_{22} - \sigma_{12}^2} \begin{pmatrix} \sigma_{22} & -\sigma_{12} \\ -\sigma_{12} & \sigma_{11} \end{pmatrix}.   (2.3)

Introducing the correlation coefficient \rho_{12} = \sigma_{12} / \sqrt{\sigma_{11}\sigma_{22}}, we have the expression for the bivariate (p = 2) Gaussian density:

f(x_1, x_2) = \frac{1}{2\pi \sqrt{\sigma_{11}\sigma_{22}(1-\rho_{12}^2)}} \exp\left\{ -\frac{1}{2(1-\rho_{12}^2)} \left[ \left(\frac{x_1-\mu_1}{\sqrt{\sigma_{11}}}\right)^2 + \left(\frac{x_2-\mu_2}{\sqrt{\sigma_{22}}}\right)^2 - 2\rho_{12} \left(\frac{x_1-\mu_1}{\sqrt{\sigma_{11}}}\right)\left(\frac{x_2-\mu_2}{\sqrt{\sigma_{22}}}\right) \right] \right\}   (2.4)

For illustration, a bivariate Gaussian distribution and its contour are plotted in Figs. 2.1 and 2.2.
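As a sanity check on the algebra above, the following sketch (Python with NumPy — an assumption, since the book itself contains no code) evaluates a bivariate density both through the general form (2.1) and through the correlation form (2.4); the two must agree:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """General p-dimensional Gaussian density, Eq. (2.1)."""
    p = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.inv(sigma) @ diff
    return np.exp(-0.5 * quad) / ((2 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(sigma)))

def bivariate_pdf(x1, x2, mu1, mu2, s11, s22, rho12):
    """Bivariate Gaussian density in correlation form, Eq. (2.4)."""
    z1 = (x1 - mu1) / np.sqrt(s11)
    z2 = (x2 - mu2) / np.sqrt(s22)
    norm = 2 * np.pi * np.sqrt(s11 * s22 * (1 - rho12 ** 2))
    quad = (z1 ** 2 + z2 ** 2 - 2 * rho12 * z1 * z2) / (1 - rho12 ** 2)
    return np.exp(-0.5 * quad) / norm

# Covariance entries as in Fig. 2.1 (interpreted as sigma_11, sigma_12, sigma_22)
s11, s12, s22 = 1.23, 0.45, 0.89
rho12 = s12 / np.sqrt(s11 * s22)
sigma = np.array([[s11, s12], [s12, s22]])
mu = np.zeros(2)
x = np.array([0.5, -0.3])
# The two forms give the same density value
print(gaussian_pdf(x, mu, sigma), bivariate_pdf(x[0], x[1], 0.0, 0.0, s11, s22, rho12))
```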
2.2 Principal Component Analysis

Principal component analysis (PCA) is concerned with explaining the variance-covariance structure through the most important linear combinations of the original variables [6]. In speaker authentication, principal component analysis is a useful tool for data interpretation, reduction, analysis, and visualization. When a feature or data set is represented in a p-dimensional data space, the actual data variability may lie largely in a smaller number of dimensions k, where k < p; the k principal components can then represent the data in the p dimensions, and the original data set can be reduced from an n × p matrix to an n × k matrix, where k is the number of principal components.
Fig. 2.3. An illustration of a constant-density ellipse and the principal components for a normal random vector X. The largest eigenvalue is associated with the long axis of the ellipse, the second eigenvalue with the short axis, and the eigenvectors with the directions of the axes.
When a random vector X = [X_1, X_2, \ldots, X_p]' has a covariance matrix Σ with eigenvalue-eigenvector pairs (λ_1, e_1), (λ_2, e_2), \ldots, (λ_p, e_p), where λ_1 ≥ λ_2 ≥ \ldots ≥ λ_p ≥ 0, the ith principal component is represented as

Y_i = e_i' X = e_{1i} X_1 + e_{2i} X_2 + \ldots + e_{pi} X_p, \quad i = 1, 2, \ldots, p   (2.5)

and the variance of Y_i is

\mathrm{Var}(Y_i) = e_i' \Sigma e_i = \lambda_i, \quad i = 1, 2, \ldots, p.   (2.6)
A proof of this property can be found in [6]. An illustration of the concept of principal components is given in Fig. 2.3. The principal components coincide with the axes of the constant-density ellipse. The largest eigenvalue is associated with the long axis and the second eigenvalue with the short axis. The eigenvectors are associated with the directions of the axes. In speaker authentication, PCA can be used to reduce the feature space for efficient computation without significantly affecting the recognition or classification accuracy. It can also be used to project multidimensional data samples
onto a selected two-dimensional space, so that the data can be visualized for analysis. Readers can refer to [6] to gain more knowledge of multivariate statistical analysis.
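The projection in Eq. (2.5) and the variance property in Eq. (2.6) can be sketched as follows (Python/NumPy assumed; the data are synthetic, generated so that most variability lies along one direction):

```python
import numpy as np

rng = np.random.default_rng(0)
# n samples in p dimensions, with most of the variability in one direction
n, p = 500, 4
X = rng.standard_normal((n, p)) @ np.diag([3.0, 1.0, 0.3, 0.1])

# Eigendecomposition of the sample covariance matrix
Sigma = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(Sigma)      # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]             # sort so lambda_1 >= ... >= lambda_p
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project onto the first k principal components: n x p -> n x k, Eq. (2.5)
k = 2
Y = (X - X.mean(axis=0)) @ eigvecs[:, :k]

# Var(Y_1) = e_1' Sigma e_1 = lambda_1, Eq. (2.6)
print(Y.shape, np.var(Y[:, 0], ddof=1), eigvals[0])
```

Here the n × p data matrix is reduced to an n × k matrix, as described above, and the sample variance of the first component matches the largest eigenvalue.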
2.3 Vector Quantization

Vector quantization (VQ) is a fundamental tool for pattern recognition, classification, and data compression. It has been applied widely to speech and image processing. VQ has become the first step in training the GMM and the HMM, the two most popular models used in speaker authentication. The task of VQ is to partition an m-dimensional space into multiple cells represented by quantized vectors. These vectors are variously called centroids, codebook vectors, or codewords. The VQ training criterion is to minimize the overall average distortion across all cells when the centroids are used to represent the data in the cells. The most popular algorithm for VQ training is the K-means algorithm. In a quick overview, the K-means algorithm is an iterative process with the following steps: First, initialize the centroids by an adequate method, such as the one which will be discussed in the next section. Second, partition each training data vector into one of the cells by looking for the nearest centroid. Third, update each centroid using the data grouped into the corresponding cell. Last, repeat the second and third steps until the centroids converge within required ranges. We note that the K-means algorithm can only converge to a local optimum [12]. Different initial centroids may lead to different local optima. This also affects the HMM training results when VQ is used for HMM training initialization. When using HMMs for speaker recognition, one can observe slightly different recognition accuracies each time the recognizer is retrained. The LBG (Linde, Buzo, and Gray) algorithm [12] is another popular algorithm for VQ. Instead of selecting all initial centroids at one time, the LBG algorithm determines the initial centroids through a splitting process. The vector quantization (VQ) method can be used directly for speaker identification. In [17], Soong, Rosenberg, and Juang used a speaker-dependent codebook to represent the characteristics of a speaker's voice.
The codebook is generated by a clustering procedure based upon a predeﬁned objective distortion measure, which computes the dissimilarity between any two given vectors [17]. The codebook can also be considered an implicit representation of a mixture distribution used to describe the statistical properties of the source, i.e., the particular talker. In the test session, input vectors from the unknown talker are compared with the nearest codebook entry and the corresponding distortions are accumulated to form the basis for a classiﬁcation decision.
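The VQ-based speaker-identification scheme described above — train one codebook per speaker, then accumulate nearest-codeword distortions for the unknown talker — can be sketched as below (Python/NumPy assumed). The K-means trainer follows the four steps listed earlier; the two "speakers" are synthetic stand-ins:

```python
import numpy as np

def kmeans_vq(data, R, iters=20, seed=0):
    """Train a VQ codebook of R centroids with the K-means algorithm."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), R, replace=False)]
    for _ in range(iters):
        # Partition: assign each vector to its nearest centroid
        d = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update: recompute each centroid as the mean of its cell
        for c in range(R):
            if np.any(labels == c):
                centroids[c] = data[labels == c].mean(axis=0)
    return centroids

def accumulated_distortion(test_vectors, codebook):
    """Sum of nearest-codeword distortions: the basis for the classification decision."""
    d = np.linalg.norm(test_vectors[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).sum()

rng = np.random.default_rng(1)
spk_a = rng.normal(0.0, 1.0, (400, 2))   # stand-in for speaker A's feature vectors
spk_b = rng.normal(4.0, 1.0, (400, 2))   # stand-in for speaker B's feature vectors
book_a = kmeans_vq(spk_a, R=8)
book_b = kmeans_vq(spk_b, R=8)
test = rng.normal(0.0, 1.0, (50, 2))     # unknown talker (actually speaker A)
# The smaller accumulated distortion identifies the talker
print(accumulated_distortion(test, book_a), accumulated_distortion(test, book_b))
```

With these synthetic data, the test utterance yields the smaller accumulated distortion against speaker A's codebook, so the classifier correctly identifies A.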
2.4 One-Pass VQ

The initial centroids or codebook are critical to the final VQ results. Achieving high performance usually requires high computational complexity in both VQ encoding and codebook design. The methods most often employed for designing the codebook, such as the K-means and LBG algorithms, are iterative and require a large amount of CPU time to reach a locally optimal solution. In speaker recognition, it is often necessary to pool all available data together to train a background model or cohort model. The first step in the training is to initialize the centroids for VQ. The traditional initialization algorithm simply selects the centroids at random, which is inefficient and wastes a lot of computation time when the dataset is huge. To speed up the model training procedure, an algorithm that provides better initial centroids and needs fewer iterations is required. In [8], we proposed a one-pass VQ algorithm for this purpose. We introduce it in detail in this section.

2.4.1 The One-Pass VQ Algorithm

The one-pass VQ algorithm is based on a sequential statistical analysis of the local characteristics of the training data and a sequential pruning technique. The work was originally proposed by the author and Swaszek in [8]. It was inspired by a constructive neural network design concept for classification and pattern recognition as described in Chapter 3 [10, 11, 9, 22]. The one-pass VQ algorithm sequentially selects a subset of the training vectors in a high-density area of the training data space and computes a VQ codebook vector. Next, a sphere is constructed about the code vector, whose radius is chosen such that the encoding error for points within the sphere is acceptable. In the next stage, the data within the sphere is pruned (deleted) from the training data set. This procedure continues on the remainder of the training set until the desired number of centroids has been located.
To reach a local optimum, one or a few K-means-style iterations can then be applied. The basic steps of the one-pass VQ algorithm are illustrated in Figure 2.4, which shows how a code vector is determined for a two-dimensional training data set (the algorithm is developed and implemented in N dimensions). We refer to this figure in the following discussion. The one-pass VQ algorithm has been compared with several benchmark results for uncorrelated Gaussian, correlated Gaussian, and Laplace sources; its performance is as good as or better than the benchmark results with much less computation. Furthermore, the one-pass initialization needs only slightly more computation than a single iteration of the K-means algorithm when the data set is large with high dimension. The algorithm is also robust: it can be made invariant to outliers. The algorithm can be applied
Fig. 2.4. The method to determine a code vector: (a) select the highest density cell; (b) examine a group of cells around the selected one; (c) estimate the center of the data subset; (d) cut a “hole” in the training data set.
directly in classification, pattern recognition, and data compression, or indirectly as the first step in training Gaussian mixture models or hidden Markov models.

Design Specifications

The training data set X is assumed to consist of M training vectors of dimension N. The total number of regions R is determined by the desired bit rate. Let D represent the maximum allowable distortion,

\| x_m - y_c \|^2 \le D   (2.7)

where x_m ∈ X is a data vector and y_c is the nearest code vector to x_m. The value of D can be determined either from the application (such as from human vision experiments, etc.) or estimated from the required data compression ratio.

Source Histogram

The one-pass design first partitions the input space by a cubic lattice to compute a histogram. Each histogram cell is an N-dimensional hypercube.
(In general, we could allow different side lengths for the cells.) The number of cells in each dimension is calculated based on the range of the data in that dimension and the maximum allowable distortion D. The method employed for calculating the number of cells in the jth dimension, L_j, is

L_j = \frac{x_{j,\max} - x_{j,\min}}{2D/3}   (2.8)

where x_{j,max} and x_{j,min} are, respectively, the maximal and minimal values of X in dimension j. The term 2D/3 is the recommended size of the cell in that dimension. The probability mass function f(k_1, k_2, \ldots, k_N) with k_i ∈ {1, 2, \ldots, L_i} is determined by counting the number of training data vectors within each cell. The sequential design algorithm starts from the cell which currently has the largest number of training vectors (the maximum of f(···) over the k_i). In Figure 2.4(a) we assume that the highlighted region is that cell. Then, as shown in Figure 2.4(b), a group of contiguous cells around the selected one is chosen. This region X_s ⊂ X (highlighted) will be used to locate a VQ code vector.

Locating a Code Vector

Two methods are employed for locating a code vector: a principal component method and a direct-median method. For the results presented here, the medians of X_s in each of the N dimensions are computed as the components of the code vector. Medians are employed to provide some robustness with respect to variation in the data subset X_s. The median is marked in Figure 2.4(c) by the "+" symbol.

Principal Component (PC) Method: As shown in Fig. 2.5, this method first solves an eigenvector problem on the data set X_s,

R_{X_s} E = \lambda E,   (2.9)

where R_{X_s} is the covariance matrix of X_s, λ is a diagonal matrix of eigenvalues, and E = [e_1, e_2, \ldots, e_N] contains the eigenvectors. X_s is then projected onto the direction of each of the eigenvectors:

Y_j = X_s e_j.   (2.10)

Medians are then determined from each of the Y_j. The centroid is obtained by taking the inverse transform of the medians of the Y_j in each dimension.

Direct-Median Method: This method is straightforward: it simply uses the medians of X_s in each of the original dimensions as the estimated centroid. Clearly, the second method avoids the eigenvector calculation, so it is faster than the first. For evaluation, both methods have been tested on codebook design for Gaussian and Laplace sources. The mean square errors from the two methods are very close; the PC method is slightly better for the Laplace source, and the direct-median method is better for the Gaussian source.
Fig. 2.5. The Principal Component (PC) method to determine a centroid.
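The two centroid-location methods can be sketched as follows (Python/NumPy assumed; the local data subset X_s here is synthetic, so both estimates should land near its true center):

```python
import numpy as np

def direct_median_centroid(Xs):
    """Direct-median method: medians of Xs in each original dimension."""
    return np.median(Xs, axis=0)

def pc_median_centroid(Xs):
    """PC method: medians along the eigenvector directions, transformed back."""
    R = np.cov(Xs, rowvar=False)     # covariance matrix of the local subset
    _, E = np.linalg.eigh(R)         # eigenvectors, as in Eq. (2.9)
    Y = Xs @ E                       # project onto eigen-directions, Eq. (2.10)
    med = np.median(Y, axis=0)       # medians in the rotated coordinates
    return E @ med                   # inverse transform (E is orthogonal)

rng = np.random.default_rng(2)
Xs = rng.multivariate_normal([1.0, -2.0], [[1.0, 0.6], [0.6, 1.0]], size=1000)
print(direct_median_centroid(Xs), pc_median_centroid(Xs))
```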
Pruning the Training Data

Once the code vector is located, the next step is to determine a sphere around it, as shown in Figure 2.4(c). (We note that for classification applications it could be an ellipse.) One way to determine the size of the sphere is to estimate the maximal number of data vectors to be included inside the sphere. The set of vectors within the sphere, X_c, is a subset of X_s, X_c ⊂ X_s. To determine the sphere for the cth code vector, the total number of data vectors T^c within the sphere is estimated by

T^c = W^c \frac{M^c}{R + 1 - c}   (2.11)

where M^c is the total number of data vectors in the current training data set (M^1 = M and M^c < M when c > 1) and R + 1 − c is the number of code vectors which have yet to be located. The term W^c is a variable weight,

W^c = W^{c-1} - \Delta W   (2.12)

where ΔW is the change of the weight variable between each of the algorithm's R cycles. The weight is employed to keep the individual spheres from growing too large. From the experience of the design examples in this chapter, we note that the resulting performance is not very sensitive to either W^1 or ΔW. After the number of vectors in the sphere X_c is determined, the data subset X_c is pruned from the current training data set. As shown in Figure 2.4(d), a "hole" is cut and the data vectors within the hole are pruned. The entries of the mass function f(···) associated with the highlighted cells are then updated. The design of the next code vector starts by selecting the cell which currently contains the largest number of training vectors.
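A minimal sketch of the pruning step, implementing the T^c estimate of Eq. (2.11), the weight schedule of Eq. (2.12), and the sphere-diameter limit of 2D (Python/NumPy assumed; the data and parameter values are illustrative):

```python
import numpy as np

def prune_sphere(data, centroid, c, R, W_c, D):
    """Prune up to T^c vectors nearest to the centroid (Eq. 2.11),
    never beyond radius D (the sphere diameter is limited to 2D).
    Returns (remaining data, pruned subset X_c)."""
    M_c = len(data)
    T_c = int(W_c * M_c / (R + 1 - c))        # Eq. (2.11)
    dist = np.linalg.norm(data - centroid, axis=1)
    inside = np.argsort(dist)[:T_c]           # the T^c nearest vectors
    inside = inside[dist[inside] <= D]        # enforce the radius limit
    keep = np.ones(M_c, dtype=bool)
    keep[inside] = False
    return data[keep], data[~keep]

# Illustrative setting; W^1 = 1.4, delta_W = 0.005, D = 0.65 as used later in this chapter
rng = np.random.default_rng(3)
data = rng.standard_normal((4000, 2))
W, dW = 1.4, 0.005
remaining, pruned = prune_sphere(data, np.zeros(2), c=1, R=16, W_c=W, D=0.65)
W = W - dW                                    # Eq. (2.12): W^2 = W^1 - delta_W
print(len(pruned), len(remaining))
```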
Due to the nature of the design procedure, the diameters of the spheres might grow larger and larger as the design progresses. To alleviate this, each sphere's diameter is limited to 2D. This value is chosen to avoid the situation in which one sphere becomes entirely a subset of another sphere (some overlap is common).

Updating the Designed Code Vectors Once

After locating all R code vectors and cutting the R "holes" in X, there often remains a residual data set X_l ⊂ X from the training set. The last step of our one-pass design is to update the designed code vectors by calculating the centroids of the nearest vectors around each code vector. This is equivalent to one iteration of the K-means algorithm.

2.4.2 Steps of the One-Pass VQ Algorithm

The one-pass VQ algorithm is summarized below:

Step 1. Initialization
1.1 Give the training data set X (an M × N data matrix), the number of designed centroids C (or the number of regions, since R = C), and the maximum allowable distortion D, which can be either given or estimated from X and C.
1.2 Set the weight values W^1 and ΔW.

Step 2. Computing the Rectangular Histogram
2.1 Determine the number of cells in each dimension n, n = 1, 2, \ldots, N, as in formula (2.8).
2.2 Compute the probability mass function f by counting the training vectors within each cell of the histogram.

Step 3. Sequential Codebook Design
3.1 From f(·), select the histogram cell which currently contains the most training vectors.
3.2 Group the histogram cells adjacent to the selected one to obtain the data set X_s.
3.3 Determine a centroid for X_s by the PC or direct-median method.
3.4 Calculate the maximal number of vectors T^c for the cth "hole" as in (2.11).
3.5 Determine and prune (delete) T^c vectors from X_s. Update the entries of f(·) associated with the cells of X_s only.
3.6 Update the parameter W^{c+1} as in (2.12) and set R^{c+1} = R^c − 1.
3.7 Set c = c + 1 and go to Step 3.1 if R^{c+1} > 0.

Step 4. Improve the Designed Centroids
4.1 Update the VQ centroids once by recalculating the means of those vectors from X nearest to each centroid (one iteration of the K-means algorithm).

Step 5. Stop

2.4.3 Complexity Analysis

This section is concerned with the complexity of the sequential selection of codebook vectors (principal features). The one-pass VQ comprises one sequential selection and one LBG iteration. Complexity is measured by counting the number of real-number operations required by the sequential selection. We define the following notation for the analysis.
• R — the number of VQ regions
• M — the number of training vectors
• N — the data dimension
• L — the number of histogram bins per dimension
• k — the number of data vectors in a highlighted window is k times the average number, M/R
• Bmax — the maximum number of histogram cells (boxes) with a nonzero count of training data vectors
The VQ design requires some initialization:
1) Computing the initial histogram requires N M \log_2(L) ops.
This is followed by R repetitions of finding a centroid of the local window about the histogram maximum and pruning the data set:
2) Sorting the histogram counts (to find a high-density cell) requires B_{\max} \log_2(B_{\max}) ops.
3) Finding a centroid by calculating medians requires kN \frac{M}{R} \log_2(k\frac{M}{R}) ops.
4) Computing the distance of each training vector in the window to the centroid (N multiplications plus 2N − 1 additions/subtractions per vector) requires k\frac{M}{R}(3N - 1) ops.
5) Sorting the distances to determine the cutoff point requires k\frac{M}{R} \log_2(k\frac{M}{R}) ops.
6) Updating the histogram requires k\frac{M}{R} ops.
Summing all of this yields a total of

N M \log_2(L) + R\left[ B_{\max} \log_2(B_{\max}) + kN\frac{M}{R}\log_2\left(k\frac{M}{R}\right) + k\frac{M}{R}(3N-1) + k\frac{M}{R}\log_2\left(k\frac{M}{R}\right) + k\frac{M}{R} \right]   (2.13)

operations, which is equal to

N M \log_2(L) + R\left[ B_{\max} \log_2(B_{\max}) + 3Nk\frac{M}{R} + (N+1)\,k\frac{M}{R}\log_2\left(k\frac{M}{R}\right) \right].   (2.14)
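The operation count of Eq. (2.14) can be evaluated directly and compared against the 3NMR cost of one LBG iteration; the sketch below (Python assumed) reproduces the counts used in the image example of this section:

```python
from math import log2

def one_pass_ops(M, N, R, k, L, Bmax):
    """Total operation count of the sequential selection, Eq. (2.14)."""
    w = k * M / R                                   # vectors in a highlighted window
    return N * M * log2(L) + R * (Bmax * log2(Bmax)
                                  + 3 * N * w
                                  + (N + 1) * w * log2(w))

def lbg_ops(M, N, R):
    """Approximate operation count of one LBG iteration (3NMR)."""
    return 3 * N * M * R

# A 512 x 512 image coded in 4x4 blocks: M = 16384 vectors, N = 16, codebook R = 64
print(one_pass_ops(16384, 16, 64, k=4, L=8, Bmax=16384 // 2) / 1e6)  # ≈ 21.89 Mops
print(lbg_ops(16384, 16, 64) / 1e6)                                  # ≈ 50.33 Mops
```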
For comparison, one LBG iteration needs 3NMR operations. Suppose the data source is a 512 × 512 image and the codebook size is 64 (M = 16384, R = 64, and N = 16). If we use worst-case parameters for the one-pass algorithm, k = 4, L = 8, and B_max = M/2, the sequential selection will need 21.89 Mops, while one LBG iteration will need 50.33 Mops.

2.4.4 Codebook Design Examples

In this section the one-pass VQ algorithm is tested by designing codebooks for uncorrelated Gaussian, correlated Gaussian, and independent Laplace sources, since many benchmark results on these sources have been reported [2, 3, 14, 15, 21, 19, 20, 18, 24]. To facilitate the comparison of CPU time and flops (floating-point operations) across different systems, we use the well-known LBG algorithm as a common reference and define two kinds of speedup:

\text{Speedup-in-time} = \frac{\text{CPU time for LBG}}{\text{CPU time for algorithm}}   (2.15)

\text{Speedup-in-flops} = \frac{\text{Flops for LBG}}{\text{Flops for algorithm}}.   (2.16)

Since we use a high-level, interpreted language for simulation on a multiuser system, we prefer speedup-in-flops as the comparison measurement.

Two-Dimensional Uncorrelated Gaussian Source

In this example, the data vectors are drawn from a zero-mean, unit-variance, independent Gaussian distribution. The joint density is given by

f(x_1, x_2) = \frac{1}{2\pi} \exp\left[ -(x_1^2 + x_2^2)/2 \right], \quad -\infty < x_1, x_2 < \infty.   (2.17)
Fig. 2.6. Left: Uncorrelated Gaussian source training data. Right: The residual data after four code vectors have been located.
The training set X consisted of 4,000 two-dimensional vectors, as shown in Figure 2.6. The goal is to design a codebook of size R = 16 that makes the MSE (mean-squared error) as small as possible. To set the design parameters we estimate D by D = 5.2/(2 × log_2(16)) = 0.65, since most of the data vectors are within a circular region of diameter 5.2. We used the weights W^1 = 1.4 and ΔW = 0.005. The simulation showed that the first "hole" was selected and cut in the center of the Gaussian distribution; the 2nd to 4th holes were selected around the first one, as shown in Figure 2.6. The algorithm continued until all 16 code vectors had been located and 16 holes had been cut. Figure 2.7 shows these code vectors and the residual data. The code vectors were then updated by the nearest vectors of X. The "+" signs in Figure 2.7 are the final centroids generated by the one-pass algorithm.

Table 2.1. Quantizer MSE Performance

1. One-pass VQ (introduced method)           0.218
2. One-pass VQ + 2 LBG (introduced)          0.211
3. Linde-Buzo-Gray (LBG, 20 iterations)      0.211
4. Strictly polar quantizer (from [24])      0.240
5. Unrestricted polar quantizer (from [24])  0.218
6. Scalar (Max) quantizer (from [14])        0.236
7. Dirichlet polar VQ (from [21])            0.239
8. Dirichlet rotated polar VQ (from [21])    0.222
In order to further compare the one-pass algorithm with the LBG algorithm, we used the centroids designed by the one-pass algorithm as the initial
Fig. 2.7. Left: The residual data after all 16 code vectors have been located. Right: The "+" and "◦" are the centroids after one and three iterations of the LBG algorithm, respectively.

Table 2.2. Comparison of One-Pass and LBG Algorithms

Type              Iterations  Mflops  CPU Time (s)  MSE
One-pass          1           1.5     35            0.218
One-pass + 2 LBG  1+2         3.0     67            0.211
LBG               9           7.2     166           0.218
LBG               20          15.1    333           0.211
centroids for the LBG algorithm, and then ran two iterations of LBG (called "one-pass + 2 LBG"). The centroids designed by one-pass + 2 LBG are shown in Figure 2.7 (Right), denoted by the "◦"s. They are very close to the one-pass centroids ("+"s). This suggests another application of the one-pass algorithm: it can be used to determine the initial centroids for the LBG algorithm for a fast, high-performance design. The MSE of the one-pass design is compared in Table 2.1 with the MSEs from other methods on an uncorrelated Gaussian source with 16 centroids. The MSE of the one-pass algorithm is equal to or better than the benchmark results. Table 2.2 shows the CPU time and Mflops used by the one-pass and LBG algorithms. For the same MSE, 0.218, the one-pass algorithm has a speedup-in-flops of 7.2/1.5 = 4.8.

Multidimensional Uncorrelated Gaussian Source

As shown in Table 2.3, the one-pass VQ algorithm was compared with five other algorithms on an N = 4 i.i.d. Gaussian source of 1,600 training vectors
and a codebook size of R = 16. The speedup-in-flops figures in Table 2.3, items 1 and 2, are from our simulations. The speedup-in-time figures in items 3 to 7 were calculated based on the training times provided by Huang and Harris in [4]. Again, the one-pass algorithm shows a higher speedup.

Table 2.3. Comparison of Different VQ Design Approaches

VQ Design Approaches                      MSE per dimension  Speedup
1 One-pass VQ (introduced)                0.35054            2.12
2 LBG with 5 iterations                   0.34919            1.00
3 LBG (from [4])                          0.41791            1.00
4 Directed-search binary-splitting [4]    0.42202            1.50
5 Pairwise nearest neighbor [4]           0.42975            2.00
6 Simulated annealing [4]                 0.41166            0.0023
7 Fuzzy c-means clustering [4]            0.51628            0.22
The speedups in items 1 and 2 are in flops. All others are in time.
Correlated Gaussian Source

The training set is 4,000 two-dimensional Gaussian distributed vectors with zero means, unit variances, and correlation coefficient 0.8. The codebook size is R = 16. The results are compared with a benchmark result in Table 2.4. The one-pass VQ algorithm yields a lower MSE.

Table 2.4. Comparison for the Correlated Gaussian Source

Type                  Iterations  Mflops  CPU (seconds)  MSE
One-pass              1           1.51    58             0.1351
One-pass + LBG        1+2         2.95    110            0.1279
Block VQ (from [20])                                     0.1478
Laplace Source

In this example, the 4,000 two-dimensional training vectors have an independent Laplace distribution

    f(x1, x2) = (1/2) e^{−√2|x1|} e^{−√2|x2|}.    (2.18)
The training data is shown in Figure 2.8 (left); the one-pass VQ centroids ("+"s) as well as the residual data are shown in Figure 2.8 (right). The centroids improved by one-pass+2LBG are also denoted as "◦"s in Figure 2.8 (right). Table 2.5 contains a comparison of the one-pass algorithm with several other algorithms on independent Laplace sources. The one-pass algorithm has a speedup-in-flops of 5.8/1.5 = 3.9 and lower MSE than the benchmark results.
Fig. 2.8. Left: The Laplace source training data. Right: The residual data, one-pass designed centroids "+", and one-pass+2LBG centroids "◦".
Table 2.5. Comparison on the Laplace source

  Type                  Iterations   Mflops   CPU (sec.)   MSE
  One-pass              1            1.5      60           0.262
  One-pass + LBG        1+2          3.0      111          0.259
  LBG (this chapter)    7            5.8      208          0.260
  UDQ (from [18])                                          0.302
  MAX (from [14])                                          0.352
  LBG (from [18])                                          0.264
2.4.5 Robustness Analysis

For robust speaker and speech recognition, the selected VQ algorithm itself needs to be robust; that is, outliers in the training data should
have little effect on the VQ training results. The one-pass VQ algorithm has the necessary robustness, because the sequential process can be made invariant to outliers. Let us assume that the residual data vectors Xo in Figure 2.8 (right) are outliers. The entire training data set X is the union of the removed set Xr and the outlier set Xo, X = Xr ∪ Xo. Due to the low density of the outlier area, the one-pass algorithm does not assign codebook vectors to them. If we do not want to include the outliers in training, in order to improve the robustness of the designed codebook, we can use only the data set Xr in the last step of centroid update (Step 4 in the list of the algorithm). The outliers Xo are then not included in the VQ design, and the designed codebook is therefore robust with regard to these outliers. In summary, the experimental results for different data sources demonstrate that the one-pass algorithm yields near-optimal MSE while its CPU time and Mflops are only slightly more than those of a single iteration of the LBG algorithm. High performance, fast training, and robustness are the advantages of the one-pass algorithm.
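The "one-pass + LBG" refinement used in the experiments above can be sketched as follows. This is an illustrative implementation, not the book's code; the function names `lbg_refine` and `mse` are my own.

```python
import numpy as np

def lbg_refine(data, centroids, iterations=2):
    """Refine initial centroids (e.g., from the one-pass design) with LBG
    iterations: assign each vector to its nearest centroid, then move each
    centroid to the mean of its assigned vectors."""
    centroids = np.array(centroids, dtype=float)
    for _ in range(iterations):
        # squared distance of every vector to every centroid
        d = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for k in range(len(centroids)):
            members = data[labels == k]
            if len(members) > 0:       # leave empty cells unchanged
                centroids[k] = members.mean(axis=0)
    return centroids

def mse(data, centroids):
    """Mean squared error of the codebook over the training data."""
    d = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d.min(axis=1).mean()
```

With `iterations=2`, this corresponds to the "one-pass+2LBG" setting used for Figures 2.7 and 2.8; each LBG iteration cannot increase the MSE.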
2.5 Segmental K-Means

In the above discussions, we addressed the segmentation problem for a stationary process, where the joint probability distribution does not change when shifted in time. A speech signal, however, is a nonstationary process, whose joint probability distribution does change when shifted in time. The current approach to segmentation of a speech signal is to divide the nonstationary speech sequence into a sequence of small segments; within each small segment, we can assume the data is stationary. The segmental K-means algorithm was developed for this purpose. It has been applied to Markov chain modeling and hidden Markov model (HMM) parameter estimation [13, 16, 7], and it is one of the fundamental algorithms for HMM training. The K-means algorithm involves iterating between two kinds of computations: segmentation and optimization. In the beginning, the model parameters, such as the centroids or means, are initialized by random numbers in meaningful ranges. In the segmentation stage, the sequentially observed nonstationary data are partitioned into multiple sequential states. There is no overlap among the states. Within each state, the joint probability distribution is assumed to be stationary; thus, an optimization process can follow to estimate the model parameters from the data within each state. The segmentation process is equivalent to a sequential decoding procedure and can be performed using the Viterbi algorithm [23]. The iteration between optimization and segmentation may need to be repeated several times. The details of the segmental K-means algorithm are available in [13, 16, 7], while the details of the decoding process will be discussed in Chapter 6.
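For a left-to-right model, the two alternating steps can be sketched as below. This is a minimal illustration under my own assumptions (Gaussian states with identity covariance, so the Viterbi pass reduces to a squared-distance dynamic program), and the function names are hypothetical, not from the references.

```python
import numpy as np

def segment(seq, means):
    """Partition seq into len(means) contiguous states by a Viterbi-style
    dynamic program, minimizing the total squared distance to state means."""
    T, K = len(seq), len(means)
    cost = np.array([[np.sum((x - m) ** 2) for m in means] for x in seq])
    D = np.full((T, K), np.inf)
    back = np.zeros((T, K), dtype=int)
    D[0, 0] = cost[0, 0]               # the path must start in state 0
    for t in range(1, T):
        for k in range(K):
            stay = D[t - 1, k]
            move = D[t - 1, k - 1] if k > 0 else np.inf
            back[t, k] = k if stay <= move else k - 1
            D[t, k] = min(stay, move) + cost[t, k]
    states = np.zeros(T, dtype=int)
    states[-1] = K - 1                 # ... and end in the last state
    for t in range(T - 1, 0, -1):
        states[t - 1] = back[t, states[t]]
    return states

def segmental_kmeans(seq, K, iterations=5):
    """Alternate Viterbi segmentation and per-state mean re-estimation."""
    seq = np.asarray(seq, dtype=float)
    # initialize by cutting the sequence into K equal blocks
    means = [seq[i * len(seq) // K:(i + 1) * len(seq) // K].mean(axis=0)
             for i in range(K)]
    for _ in range(iterations):
        states = segment(seq, means)
        means = [seq[states == k].mean(axis=0) for k in range(K)]
    return np.array(means), states
```

The segmentation step is the decoding pass; the mean update is the optimization pass. In full HMM training the squared distance is replaced by state log-likelihoods.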
2.6 Conclusions

In this chapter, we introduced the multivariate Gaussian distribution because it will be used often in the rest of this book. We also introduced the concept of principal component analysis, because this concept is helpful for reading the rest of the book and for thinking intuitively about data representations when developing new algorithms. We briefly reviewed the popular K-means and LBG algorithms for VQ. Readers can consult other textbooks (e.g., [6, 1, 5]) to understand these traditional algorithms in more detail. We presented the one-pass VQ algorithm in detail. The one-pass VQ algorithm is useful in training background models with very large datasets, such as those used in speaker recognition. The concept and method of sequential data processing and pruning used in the one-pass VQ will be used in Chapter 3 and Chapter 14 for pattern recognition and speaker authentication. Finally, we discussed the segmental K-means algorithm, which will be used in HMM training throughout the book.
References

1. Duda, R. O., Hart, P. E., and Stork, D. G., Pattern Classification, Second Edition. New York: John Wiley & Sons, 2001.
2. Fischer, T. R. and Dicharry, R. M., "Vector quantizer design for Gaussian, gamma, and Laplacian sources," IEEE Transactions on Communications, vol. COM-32, pp. 1065–1069, September 1984.
3. Gray, R. M. and Linde, Y., "Vector quantizers and predictive quantizers for Gauss-Markov sources," IEEE Transactions on Communications, vol. COM-30, pp. 381–389, September 1982.
4. Huang, C. M. and Harris, R. W., "A comparison of several vector quantization codebook generation approaches," IEEE Transactions on Image Processing, vol. 2, pp. 108–112, January 1993.
5. Huang, X., Acero, A., and Hon, H.-W., Spoken Language Processing. NJ: Prentice Hall PTR, 2001.
6. Johnson, R. A. and Wichern, D. W., Applied Multivariate Statistical Analysis. New Jersey: Prentice Hall, 1988.
7. Juang, B.-H. and Rabiner, L. R., "The segmental k-means algorithm for estimating parameters of hidden Markov models," IEEE Trans. Acoustics, Speech and Signal Processing, vol. 38, pp. 1639–1641, Sept. 1990.
8. Li, Q. and Swaszek, P. F., "One-pass vector quantizer design by sequential pruning of the training data," in Proceedings of International Conference on Image Processing, (Washington, DC), October 1995.
9. Li, Q. and Tufts, D. W., "Improving discriminant neural network (DNN) design by the use of principal component analysis," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, (Detroit, MI), pp. 3375–3379, May 1995.
10. Li, Q. and Tufts, D. W., "Synthesizing neural networks by sequential addition of hidden nodes," in Proceedings of the IEEE International Conference on Neural Networks, (Orlando, FL), pp. 708–713, June 1994.
11. Li, Q., Tufts, D. W., Duhaime, R., and August, P., "Fast training algorithms for large data sets with application to classification of multispectral images," in Proceedings of the IEEE 28th Asilomar Conference, (Pacific Grove), October 1994.
12. Linde, Y., Buzo, A., and Gray, R. M., "An algorithm for vector quantizer design," IEEE Transactions on Communications, vol. COM-28, pp. 84–95, 1980.
13. MacQueen, J., "Some methods for classification and analysis of multivariate observations," in Proc. 5th Berkeley Symp. Math. Stat., Prob., pp. 281–296, 1967.
14. Max, J., "Quantizing for minimum distortion," IEEE Transactions on Information Theory, vol. IT-6, pp. 7–12, March 1960.
15. Paez, M. D. and Glisson, T. H., "Minimum mean-squared-error quantization in speech PCM and DPCM systems," IEEE Transactions on Communications, vol. COM-20, pp. 225–230, April 1972.
16. Rabiner, L. R., Wilpon, J. G., and Juang, B.-H., "A segmental k-means training procedure for connected word recognition," AT&T Technical Journal, vol. 65, pp. 21–31, May/June 1986.
17. Soong, F. K., Rosenberg, A. E., and Juang, B.-H., "A vector quantization approach to speaker recognition," AT&T Technical Journal, vol. 66, pp. 14–26, March/April 1987.
18. Swaszek, P. F., "Low dimension / moderate bit-rate vector quantizers for the Laplace source," in Abstracts of IEEE International Symposium on Information Theory, p. 74, 1990.
19. Swaszek, P. F., "Vector quantization for image compression," in Proceedings of Princeton Conference on Information Sciences and Systems, (Princeton, NJ), pp. 254–259, March 1986.
20. Swaszek, P. F. and Narasimhan, A., "Quantization of the correlated Gaussian source," in Proceedings of Princeton Conference on Information Sciences and Systems, (Princeton, NJ), pp. 784–789, March 1988.
21. Swaszek, P. F. and Thomas, J. B., "Optimal circularly symmetric quantizers," Journal of Franklin Institute, vol. 313, pp. 373–384, June 1982.
22. Tufts, D. W. and Li, Q., "Principal feature classification," in Neural Networks for Signal Processing V, Proceedings of the 1995 IEEE Workshop, (Cambridge, MA), August 1995.
23. Viterbi, A. J., "Error bounds for convolutional codes and an asymptotically optimal decoding algorithm," IEEE Transactions on Information Theory, vol. IT-13, pp. 260–269, April 1967.
24. Wilson, S. G., "Magnitude/phase quantization of independent Gaussian variates," IEEE Transactions on Communications, vol. COM-28, pp. 1924–1929, November 1980.
Chapter 3
Principal Feature Networks for Pattern Recognition
Pattern recognition is one of the fundamental technologies in speaker authentication. Understanding the concepts of pattern recognition is important in developing speaker authentication algorithms and applications. There are already many books and tutorial papers on pattern recognition and neural networks (e.g., [6, 1]). Instead of repeating a similar introduction to fundamental pattern recognition and neural network techniques, we introduce a different approach to neural network training and construction that was developed by the author and Tufts and named the principal feature network (PFN) [13, 14, 15, 12, 20]; it is an analytical method for constructing a classifier or recognizer. Through this chapter, readers will gain a better understanding of pattern recognition methods and neural networks and their relation to multivariate statistical analysis. The PFN uses fundamental methods of multivariate statistics as its core techniques and applies them sequentially to construct a neural network for classification or pattern recognition. The PFN can be considered a fast neural network design algorithm for pattern recognition, speaker authentication, and speech recognition. Due to its data pruning method, the PFN algorithm is efficient in processing large databases. This chapter also discusses the relationships among different hidden node design methods, as well as the relationship between neural networks and decision trees. The PFN algorithm has been used in real-world pattern recognition applications.
3.1 Overview of the Design Concept

A neural network consists of input, hidden, and output nodes and one or more hidden layers. Each hidden layer can have multiple hidden nodes. Popular approaches, like the backpropagation algorithm [21], first define a network structure in terms of the number of hidden layers and the number of hidden nodes at each layer, and then train the hidden node parameters. In contrast, the PFN is constructed sequentially. It starts by training the first hidden node based on multivariate statistical theory, then adds more hidden nodes until reaching the design specification; finally, it combines the hidden nodes to construct the output layer and the entire network. Such an approach provides very fast training due to a data pruning method. This chapter is intended to help readers understand neural networks for pattern recognition and the functions of layers and hidden nodes.

We define a principal feature as a discriminant function which is intended to provide the maximum contribution to correct classification using the current training data set. Our principal feature classification (PFC) algorithm is a sequential procedure of finding principal features and pruning classified data. When a new principal feature is found, the correctly classified data of the current training dataset are pruned so that the next principal feature can be constructed with the subset of data which has not yet been classified, rather than redundantly reclassifying or considering training vectors that are already well classified. The PFC can also be considered a nonparametric statistical procedure for classification, while the hidden node design uses multivariate statistics. The network constructed by the PFC algorithm is called the principal feature network (PFN). The PFC does not need gradient-descent-based training algorithms, which require very long training times, as in backpropagation and many other similar training algorithms. On the other hand, the PFC does not need backward node pruning as in CART [2] and the neural tree network [17], or node "pruning" by retraining (see [16] for a survey). A designed PFN can be implemented in the structure of a decision tree or a neural network. For these reasons, the PFC can be considered a fast and efficient algorithm for both neural network and decision tree design for classification or pattern recognition.

Q. Li, Speaker Authentication, Signals and Communication Technology, DOI: 10.1007/978-3-642-23731-7_3, © Springer-Verlag Berlin Heidelberg 2012
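The find-and-prune loop described above can be sketched concretely for two classes. This is a simplified illustration under my own assumptions (LDA features only, with thresholds placed at the boundaries of the pure one-class regions); the function names are hypothetical, not from the book.

```python
import numpy as np

def lda_direction(X1, X2):
    """Fisher direction w proportional to Sw^{-1}(m1 - m2); see Section 3.3.2."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    return np.linalg.solve(Sw, m1 - m2)

def grow_pfn(X1, X2, max_nodes=5):
    """Sequential PFC design loop: find a principal feature, set the outer
    thresholds bounding pure one-class regions, prune the vectors those
    regions classify, and repeat on the residual data."""
    nodes = []
    while len(X1) and len(X2) and len(nodes) < max_nodes:
        w = lda_direction(X1, X2)
        p1, p2 = X1 @ w, X2 @ w
        # projections above t_hi (below t_lo) belong to one class only
        t_hi = min(p1.max(), p2.max())
        t_lo = max(p1.min(), p2.min())
        nodes.append((w, t_lo, t_hi))
        keep1 = (p1 >= t_lo) & (p1 <= t_hi)
        keep2 = (p2 >= t_lo) & (p2 <= t_hi)
        if keep1.all() and keep2.all():
            break                      # no progress; stop adding nodes
        X1, X2 = X1[keep1], X2[keep2]
    return nodes
```

On linearly separable data a single node classifies everything and the residual sets become empty after one pass.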
We use the following example to illustrate the concept of the PFC algorithm and the PFN design.

Example 1 – An Illustrative Example of the PFN Design Procedure

We use the two labeled classes of artificial training data illustrated in Fig. 3.1(a) to better specify the procedure of finding principal features and pruning the training data. We sequentially find principal features and associated hidden nodes at each stage by selecting the better of the following two methods for choosing the next feature: (1) Fisher's linear discriminant analysis (LDA) [8, 10, 5], or (2) maximal signal-to-noise-ratio (SNR) discriminant analysis (see Section 3.4). A method for determining multiple thresholds associated with these features is used in evaluating the effectiveness of candidate features. Although more complicated features can be used, such as those from a multivariate Gaussian [10], a multivariate Gaussian mixture [19], or radial basis functions [3], the above two features are simple, complementary, and efficient. The details of designing the nodes will be introduced in the following sections.
Fig. 3.1. An illustrative example to demonstrate the concept of the PFN: (a) The original training data of two labeled classes which are not linearly separable. (b) The hyperplanes of the ﬁrst hidden node (LDA node). (c) The residual data set and the hyperplanes of the second hidden node (SNR node). (d) The input space partitioned by two hidden nodes and four thresholds designed by the principal feature classiﬁcation (PFC) method.
In the ﬁrst step, we use all the training data in the input space of Fig. 3.1(a) to ﬁnd a principal feature. In this step, Fisher’s LDA provides a good result. The corresponding feature can be calculated by an inner product of the data vector with the LDA weight vector. The hyperplanes perpendicular to the vector are shown in Fig. 3.1(b). It is important to note that multiple threshold values can be used with each feature. Then, the data vectors which have been classiﬁed at this step of the design procedure are pruned oﬀ. Here two threshold values have been used. Thus the unclassiﬁed data between the two corresponding hyperplanes is used to train the second hidden node. The residual training data set for the next design stage is shown in Fig. 3.1(c). This is used to determine the second feature and second hidden node. Since the mean vectors of the two classes are very close now, Fisher’s LDA does not give a satisfactory principal feature. In the second hidden node design, maximum SNR analysis provides a better candidate for a principal feature and
the threshold-setting procedure in [11] gives us two associated hyperplanes, which are also shown in Fig. 3.1(c). The overall partitioned regions are shown in Fig. 3.1(d). All of the training data vectors have now been correctly classified. The size of the training data, the performance specifications, and the need to generalize to new test data influence the threshold settings and the stopping point of the design. For this simple classification problem, the backpropagation (BP) training method [4] takes hundreds of seconds to hours, and one still does not obtain satisfactory classification using a multilayer perceptron (MLP) network with 5 hidden nodes and one output node (sum-squared error (SSE) = 4.35). A radial basis function (RBF) network with common kernel functions [4] can converge to acceptable performance in 35 seconds, but it needs 56 nodes (SSE = 0.13). On the same problem, principal feature classification takes only 0.2 seconds on the same machine and needs only two hidden nodes in a sequential implementation, Fig. 3.5(b). The performance of the PFN is better than both BP and RBF (SSE = 0.00).
3.2 Implementations of Principal Feature Networks

A principal feature network (PFN) is a decision network in which each hidden node computes the value of a principal feature and compares this value with one or more threshold values. A PFN can be implemented as a neural network through a parallel implementation or as a decision tree through a sequential implementation. A parallel implementation of the PFN is shown in Fig. 3.2. The outputs of the hidden layer are binary words. Each word represents one partitioned region in the input data space. Each class may have more than one hidden-layer word. The outputs of the output layer are binary words too, which are logic functions of the hidden-layer words, but each class is represented by only one unique output binary word. Each hidden node threshold is labeled with one class, and the associated hyperplane partitions the input space into classified and unclassified regions for that class. All or nearly all of the training vectors within a classified region belong to the labeled class. The corresponding unclassified region includes the training vectors which have not yet been correctly classified. Generally speaking, each hidden node is a subclassifier in the PFN. It classifies part of the training vectors and leaves the unclassified vectors to other hidden nodes. A decision-tree implementation is shown in Fig. 3.3. We note that each hidden node can have multiple thresholds; thus more than one classification decision can be made in one node of the tree, e.g., Fig. 3.5(b). In sequence, the hidden nodes are evaluated in the order in which they were trained. Since each hidden node's binary threshold output has been associated with one class in training, the sequential calculation stops as soon as a decision can be made. Since the first few hidden nodes are designed using the highest-density regions
Fig. 3.2. A parallel implementation of PFC by a Principal Feature Network (PFN).
Fig. 3.3. A sequential implementation of PFC by a Principal Feature Tree (PFT).
in the input training space, it is very likely that a decision can be made early in the procedure.

Example 1 (continued) – Parallel and Sequential Implementations

Parallel and sequential implementations of the designed PFC are shown in Fig. 3.4(b) and Fig. 3.5(b). The corresponding partitioned input spaces are shown in Fig. 3.4(a) and 3.5(a). For details on parallel and processor-array implementations, please refer to [11, 14].
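The early-stopping evaluation of a sequentially implemented PFN can be sketched as follows; the node representation (a weight vector plus a list of labeled threshold intervals) is my own simplification of the tree in Fig. 3.3.

```python
import numpy as np

def classify_sequential(x, nodes, fallback=None):
    """Evaluate hidden nodes in the order they were trained; return a class
    as soon as the projection falls inside a labeled threshold region.
    Each node is (w, regions), where regions is a list of (lo, hi, label)."""
    for w, regions in nodes:
        p = float(np.dot(x, w))
        for lo, hi, label in regions:
            if lo <= p <= hi:
                return label
        # otherwise p fell in the unclassified band: defer to the next node
    return fallback
```

Because the earliest-trained nodes cover the densest regions of the input space, most vectors are classified by the first node or two and the remaining nodes are never evaluated.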
Fig. 3.4. (a) Partitioned input space for parallel implementation. (b) Parallel implementation.
3.3 Hidden Node Design

With the PFN construction procedure explained, we can now introduce the hidden node design. The single hidden node design algorithms are inspired by the optimal multivariate Gaussian classification rule [10]. The rule is implemented as a Gaussian discriminant node [12] to facilitate theoretical analysis. Two practical hidden nodes are further defined: a Fisher's node is a linear node trained by Fisher's linear discriminant analysis [8, 10, 14, 13] and used for training classes with separable mean vectors; the principal component (PC) node is trained based on a criterion of maximizing a discriminant signal-to-noise ratio [12] and is for training classes with common mean vectors. Both nodes are designed for non-Gaussian and non-linearly-separable cases in multidimensional and multiclass classification. When the training vectors are from more than two classes, the node design
Fig. 3.5. (a) Partitioned input space for sequential implementation. (b) Sequential (tree) implementation.
algorithms can figure out which class to separate first. This class normally has more separable data vectors in the input data space. Connections between a Fisher's node, a PC node, and a Gaussian node are also discussed below. It should be noted that there are many other algorithms for single hidden node training [9, 22, 17]. Most of them use gradient-descent or iterative methods. We prefer the statistical approaches for single node design because they are optimal and much faster.

3.3.1 Gaussian Discriminant Node

When two training data populations, Class 1 and Class 2, are described as multivariate Gaussian distributions with sample mean vectors and covariance matrices μ1, Σ1 and μ2, Σ2, respectively, the minimum-cost classification rule is defined as [10]:

    Class 1: L(x) ≥ θ;  Class 2: L(x) < θ;    (3.1)

where x is an observed data vector or feature vector of N components, θ is a threshold determined by the cost ratio, the prior probability ratio, and the determinants of the covariance matrices, and

    L(x) = x^t (Σ1^{-1} − Σ2^{-1}) x − 2 (μ1^t Σ1^{-1} − μ2^t Σ2^{-1}) x    (3.2)
         = Σ_{i=1}^{N} λi (x^t wi)^2 − 2 w0 x,    (3.3)

where w0 = (μ1^t Σ1^{-1} − μ2^t Σ2^{-1}) and, for i > 0, λi and wi are the i-th eigenvalue and eigenvector of the matrix Σ1^{-1} − Σ2^{-1}. We define the above equation as a Gaussian discriminant node [12]. Its implementation is shown in Fig. 3.6(a).
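Equation (3.2) can be evaluated directly; the sketch below is my own illustration (the function name is hypothetical), computing L(x) from the two sample means and covariance matrices.

```python
import numpy as np

def gaussian_discriminant(x, mu1, S1, mu2, S2):
    """L(x) of equation (3.2): a quadratic term in the difference of inverse
    covariances minus a linear term in the means."""
    S1i, S2i = np.linalg.inv(S1), np.linalg.inv(S2)
    quad = x @ (S1i - S2i) @ x
    lin = 2.0 * (mu1 @ S1i - mu2 @ S2i) @ x
    return quad - lin
```

When the two covariance matrices are equal, the quadratic term vanishes and the node reduces to Fisher's linear discriminant, as noted for Fig. 3.6(b).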
Fig. 3.6. (a) A single Gaussian discriminant node. (b) A Fisher’s node. (c) A quadratic node. (d) An approximation of the quadratic node.
When the covariance matrices in (3.3) are the same, the first quadratic term is zero, and the classifier computes Fisher's linear discriminant; the general node becomes a Fisher's node, as shown in Fig. 3.6(b). When the second term is ignored, the formula retains only the quadratic term. Due to the sequential design procedure, in each PFN design step we use only the eigenvector associated with the largest eigenvalue (or a small number of principal eigenvectors); thus, we have the quadratic node shown in Fig. 3.6(c). The threshold squaring function can be further approximated by two thresholds, as shown in Fig. 3.6(d). This gives us a theoretical reason to allow more than one threshold on each hidden node, as shown in Fig. 3.2.
3.3.2 Fisher’s Node Design Suppose that we have two classes of vectors in the ndimensional input space and wish to use the current hidden node to separate these two classes. We need to ﬁnd a direction for w, so that the projected data from the classes can be separated as far as possible. This problem has been studied by Fisher [8, 5, 10]. The solution is called linear discriminant analysis (LDA). For a twoclass classiﬁcation problem, the criterion function J in Fisher’s linear discriminant [5] can be written as: Jmax (w) = where
wt SB w , wt SW w
SB = (m1 − m2 )(m1 − m2 )t ,
and SW =
(x − m1 )(x − m1 )t +
x∈X1
(x − m2 )(x − m2 )t ,
(3.4)
(3.5) (3.6)
x∈X2
where X1 and X2 are the data matrices of two diﬀerent classes, each row represents one training data vector. The m1 and m2 are the sample means of the two classes. And SW is a linear combination of the sample covariance matrices of the two classes. We assume that (m1 −m2 ) is not a zero vector. If it is zero or close to zero, then a diﬀerent training method (principle component discriminant analysis) [12] should be applied to design the hidden nodes. The problem of the formula (3.4) is the well known generalized Rayleigh quotient problem. The weight vector w to maximize J is the eigenvector associated with the largest eigenvalue of the following generalized eigenvalue problem. SB w = λSW w (3.7) When SW is nonsingular, the above equation can be written as the following conventional eigenvalue problem: −1 SW SB w = λw
(3.8)
As pointed out in [5], since SB has rank one and SB w is always in the direction of m1 − m2 , there is only one nonzero eigenvalue and the weight vector w can be solved directly as: −1 w = αSW (m1 − m2 ) (3.9) in which α is a constant which can be chosen for normalization or to make inner products with w computationally simple in implementation. For multiple class problems, the equation (3.5) becomes SB =
c i=1
ri (mi − m)(mi − m)t ,
(3.10)
52
3 Principal Feature Networks for Pattern Recognition
where, mi is the mean of class i, ri is the number of training data vectors in class i, and m is the mean vector of all classes, and the equation (3.6) becomes SW =
c
(x − mi )(x − mi )t .
(3.11)
i=1 x∈Xi
We then solve the generalized eigenvalue problem the same as in equation (3.7). To save ﬂoating point operations (Flops), the problem in (3.4) can also be converted to the conventional eigenvalue problem by changing the variable: 1/2
w = SW w. then, Jmax (w ) = where
−1/2
S = SW
(3.12)
t
w S w w t w −1/2
SB SW
,
(3.13) (3.14)
The w will be the eigenvector associated to the largest eigenvalue in solving a standard eigenvalue problem. Finally, the weight vector w can be obtained by: −1/2
w = SW
w .
(3.15)
Example 1 (continued) For the two classes of data X1 and X2 as shown in Fig. 3.1(a), use (3.6) to calculate the SW and Fig. (3.9) to obtain the weight vector w. Then project the X1 and X2 onto the weight vector. Two hyperplanes perpendicular to the weight vector w (as in Fig. 3.1(b)) can be determined. The details in determining the thresholds will be discussed in Section 3.7.
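Equations (3.6) and (3.9) translate directly into code. Below is an illustrative sketch (my own function name, taking α = 1); the accompanying check also verifies the defining property of (3.7), namely that S_B w is parallel to S_W w.

```python
import numpy as np

def fisher_node(X1, X2):
    """Fisher weight vector w = Sw^{-1}(m1 - m2), equation (3.9) with α = 1.
    Sw is the pooled within-class scatter of equation (3.6)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    return np.linalg.solve(Sw, m1 - m2)
```

Since S_B has rank one, this w is the single generalized eigenvector of (3.7) with a nonzero eigenvalue, so no eigensolver is needed in the two-class case.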
3.4 Principal Component Hidden Node Design

When the mean vectors of the training classes are far enough apart, Fisher's node is effective. However, when the mean vectors of the training classes are too close, Fisher's LDA does not provide good classification. Here we should use the quadratic node, or approximate it by using a PC node [12].
3.4.1 Principal Component Discriminant Analysis

To design a quadratic node directly for non-Gaussian data, our criterion is to choose a weight vector w that maximizes a discriminant signal-to-noise ratio J:

    J = E{(X_l w)^t (X_l w)} / E{(X_c w)^t (X_c w)} = (w^t Σ_l w) / (w^t Σ_c w),    (3.16)

where X_l is a matrix of row vectors of training data from class l, and X_c is the matrix of training data from all classes except class l. Class l is the class which has the largest eigenvalue among the eigenvalues calculated from the data matrices of each class, respectively. Σ_l and Σ_c are the estimated covariance matrices, and w is the weight vector. In the case where the mean vectors are the same but not zero, the criterion can still be used. The weight vector w can be determined by solving the following generalized eigenvalue and eigenvector problem:

    Σ_l w = λ Σ_c w.    (3.17)

The eigenvector associated with the largest eigenvalue provides the maximum value of J. However, more than one weight vector can be selected to improve the discriminant analysis; in other words, more than one quadratic hidden node can be trained by solving the eigenvalue problem once.

Example 1 (continued)

For the residual data set in Fig. 3.1(c), the mean vectors of the two classes are so close that Fisher's node is not effective. A principal component node was designed by solving the eigenvalue problem in (3.17), where the covariance matrices Σ_l and Σ_c are calculated from the residual data vectors in Fig. 3.1(c). The determined hyperplanes perpendicular to the weight vector w are shown in Fig. 3.1(c). The input space partitioned by one Fisher's node and one principal component node is shown in Fig. 3.7. If we had used Fisher's method to design the second node, the residual data set would not be totally partitioned by the second Fisher's node, and one more node would be needed to totally partition the input space, as shown in Fig. 3.7. Even though that inefficient extra node can be removed by lossless simplification (see Section 3.8 and [14, 13]) at the end of the design, it would slow down the training procedure.
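The generalized eigenvalue problem (3.17) can be solved with standard linear algebra. The sketch below is my own illustration (hypothetical function name), reducing Σ_l w = λ Σ_c w to an ordinary eigenproblem for Σ_c^{-1} Σ_l.

```python
import numpy as np

def pc_node(Xl, Xc):
    """Weight vector maximizing J = (w^t Σl w)/(w^t Σc w), equation (3.16):
    the dominant eigenvector of Σc^{-1} Σl, from equation (3.17)."""
    Sl = np.cov(Xl, rowvar=False)
    Sc = np.cov(Xc, rowvar=False)
    vals, vecs = np.linalg.eig(np.linalg.solve(Sc, Sl))
    return np.real(vecs[:, np.argmax(np.real(vals))])
```

The returned direction is the one along which class l has the most variance relative to the pooled remaining classes, which is exactly the SNR criterion (3.16).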
3.5 Relation Between the PC Node and the Optimal Gaussian Classifier

The principal component node of the previous section is intended for non-Gaussian common-mean-vector classes. We can prove that it approximates the discriminant capability of a Gaussian classifier (Gaussian node) when the
Fig. 3.7. When using only Fisher’s nodes, three hidden nodes and six thresholds are needed to ﬁnish the design.
class data are from Gaussian distributions with zero mean vectors by the following: −1/2 1/2 Let w = Σc V and V = Σc w. The equation (3.16) can be further written as −t/2 −1/2 V t Σc Σl Σc V V t SV J= = , (3.18) t V V V tV −t/2
−1/2
where S = Σc Σl Σc . Its singular value decomposition is S = UΛUt . The Λ is a diagonal matrix of eigenvalues. The maximum occurs when V = U1 , where U1 is the eigenvector associated with the largest eigenvalue λ1 of Λ. Thus, the weight vector for which J of (3.16) is a maximum is w = Σ−1/2 U1 . c
(3.19)
The classiﬁcation functions of this quadratic node are Class l : xt Σ−1/2 Ut1 2 > θ c
(3.20)
Class c : xt Σ−1/2 Ut1 2 ≤ θ , c
(3.21)
where θ is a classiﬁcation threshold. This can provide a good approximation to the performance of the Gaussian classiﬁer. The classiﬁcation rule in (3.3) can now be written as [18] −1 L(x) = xt (Σ−1 c − Σl )x
= =
t
x Σ−t/2 (I c t −t/2 x Σc (I
− −
−1 1/2 −1/2 Σt/2 x, c Σl Σc )Σc −1 −1/2 S )Σc x,
(3.22) (3.23) (3.24)
3.5 Relation Between Pc Node and the Optimal Gaussian Classifier
55
= xt Σ−t/2 (I − UΛ−1 Ut )Σ−1/2 x, c c
(3.25)
−Λ
(3.26)
t
=x =
Σ−t/2 U(I c
N
−1
t
)U
Σ−1/2 x, c
(1 − 1/λi )xt Σc −1/2 Uti 2 ,
(3.27)
i=1
Taking the first principal component ($i = 1$) from the above formula, the discriminant function becomes the same as the quadratic node in (3.20) and (3.21). Furthermore, if we use two thresholds to approximate the square function in Fig. 3.6(c), the classification rules of (3.20) and (3.21) become
$$\text{Class } l: \quad \theta_1 \le x^t \Sigma_c^{-1/2} U_1 \le \theta_2 \qquad (3.28)$$
$$\text{Class } c: \quad x^t \Sigma_c^{-1/2} U_1 > \theta_2 \;\text{ or }\; x^t \Sigma_c^{-1/2} U_1 < \theta_1. \qquad (3.29)$$
The implementation of (3.28) and (3.29) is called the principal component node as shown in Fig. 3.6(d). The structure is the same as the normal PFN hidden node deﬁned in Fig. 3.2.
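As a concrete illustration, the following sketch builds a principal component node from synthetic data with numpy. The data, dimensions, and quantile-based thresholds are illustrative assumptions, not the book's design procedure: it computes $\Sigma_c^{-1/2}$, forms $S$, takes the top eigenvector $U_1$ to get the weight vector of Eq. (3.19), and applies the two-threshold rule of (3.28)-(3.29).

```python
import numpy as np

# Hedged sketch of the principal component node; synthetic zero-mean classes.
rng = np.random.default_rng(0)
Xc = rng.normal(size=(2000, 2))                         # class c, identity covariance
Xl = rng.normal(size=(2000, 2)) * np.array([5.0, 1.0])  # class l, larger spread, same mean

Sigma_c = np.cov(Xc, rowvar=False)
Sigma_l = np.cov(Xl, rowvar=False)

# Sigma_c^{-1/2} via the eigendecomposition of the symmetric Sigma_c.
ev, E = np.linalg.eigh(Sigma_c)
Sigma_c_inv_sqrt = E @ np.diag(ev ** -0.5) @ E.T

# S = Sigma_c^{-1/2} Sigma_l Sigma_c^{-1/2}; weight w = Sigma_c^{-1/2} U1, Eq. (3.19).
S = Sigma_c_inv_sqrt @ Sigma_l @ Sigma_c_inv_sqrt
s_ev, s_E = np.linalg.eigh(S)
U1 = s_E[:, -1]                       # eigenvector of the largest eigenvalue
w = Sigma_c_inv_sqrt @ U1

# Two-threshold rule (3.28)-(3.29): class c inside [theta1, theta2], class l outside.
proj_c, proj_l = Xc @ w, Xl @ w
theta1, theta2 = np.quantile(proj_c, [0.01, 0.99])      # from the class-c histogram
acc_c = np.mean((proj_c >= theta1) & (proj_c <= theta2))
acc_l = np.mean((proj_l < theta1) | (proj_l > theta2))
```

Because the two classes share a mean vector, a Fisher's node would fail here, while the projection onto $w$ separates them by spread alone.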
3.6 Maximum Signal-to-Noise-Ratio (SNR) Hidden Node Design

When the mean vectors of the training classes are far apart, the Fisher's node is effective. However, when the mean vectors of the training classes are too close, the Fisher's LDA node does not provide good classification. One can then use the quadratic Gaussian discriminant [10], or the following simple discriminant, which can be used for non-Gaussian data and often has almost the same discriminant capability as the quadratic Gaussian discriminant when the data vector is multivariate Gaussian; a proof is given in [12]. To design a robust quadratic node for possibly non-Gaussian data, we choose a weight vector w to maximize a discriminant signal-to-noise ratio:
$$J = \frac{w^t \Sigma_l w}{w^t \Sigma_c w}, \qquad (3.30)$$
where $\Sigma_l$ is the covariance matrix calculated from Class $l$, the class whose sample covariance matrix has the largest eigenvalue among all classes, and $\Sigma_c$ is the covariance matrix calculated from the data pooled from all other classes. The weight vector w can be determined by solving a generalized eigenvalue and eigenvector problem, i.e., $\Sigma_l w = \lambda \Sigma_c w$. The maximum SNR node has been used to classify overlapped Gaussian data [12].
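A minimal numerical sketch of the maximum-SNR node, under assumed synthetic data: the generalized eigenproblem $\Sigma_l w = \lambda \Sigma_c w$ of Eq. (3.30) is reduced to an ordinary eigenproblem of $\Sigma_c^{-1} \Sigma_l$ with plain numpy.

```python
import numpy as np

# Two zero-mean classes differing only in spread (made-up data), so a
# mean-based Fisher's node would not separate them.
rng = np.random.default_rng(1)
X_l = rng.normal(size=(3000, 3)) * np.array([4.0, 1.0, 1.0])  # Class l, largest spread
X_c = rng.normal(size=(3000, 3))                              # pooled other classes

Sigma_l = np.cov(X_l, rowvar=False)
Sigma_c = np.cov(X_c, rowvar=False)

# Solve Sigma_l w = lambda Sigma_c w as an ordinary eigenproblem of
# Sigma_c^{-1} Sigma_l; the maximizer of J is the top eigenvector.
lam, W = np.linalg.eig(np.linalg.solve(Sigma_c, Sigma_l))
lam = np.real(lam)
w = np.real(W[:, np.argmax(lam)])

# J of Eq. (3.30) evaluated at w equals the largest generalized eigenvalue.
J = (w @ Sigma_l @ w) / (w @ Sigma_c @ w)
```

Substituting $\Sigma_l w = \lambda \Sigma_c w$ into (3.30) shows directly that $J = \lambda$ at each eigenvector, which the last line verifies numerically.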
3.7 Determining the Thresholds from Design Specifications

After a weight vector is obtained by LDA or by maximum SNR analysis, all current training vectors are projected onto the weight vector, and the histograms of the projected data of all classes can then be evaluated. The classes at the far right and far left of the projection axis can be separated by determining thresholds, and the thresholds and separated regions are then labeled with the separated classes. A technique for determining thresholds from design specifications, i.e., performance requirements for every class, was developed by the author and Dr. Tufts and has been applied successfully in all of the examples and applications of this chapter. Interested readers are referred to [11] for the details of this procedure.
3.8 Simplification of the Hidden Nodes

Due to the PFN architecture and the training procedure, pruning the PFN hidden nodes is simpler than the pruning algorithms for multilayer perceptron (MLP) networks. We developed two kinds of pruning algorithms for different applications: lossless and lossy simplification. Lossless simplification yields a minimal implementation; lossy simplification improves the ability of the network to generalize, that is, to perform well on new data sets. To avoid confusion with the data pruning described in the above sections, we use the term simplification. Generally speaking, lossy simplification is needed for most applications; readers are referred to [11] for lossless simplification.

During PFN training, each threshold is labeled with the class it separates. Recall that each hidden node can have more than one threshold associated with separated classes. The percentage of the training vectors of each class classified by each threshold in the sequential design can be saved in an array, called the contribution array, which is then used for simplification analysis. We use the following example to illustrate the details.
3.9 Application 1 – Data Recognition

In a real signal recognition application, a large set of multidimensional training vectors from 10 classes was completely classified by a PFN using 49 hidden nodes and 98 thresholds. The contribution of each threshold to its labeled class, in terms of percentage of classification rate, was saved in a contribution array. The array was sorted and plotted in Fig. 3.8(a). From Fig. 3.8(a), we can see that only a few of the thresholds contribute significantly to the full recognition of their classes. The accumulated network performance, in the order of the sorted thresholds, is shown in Fig. 3.8(b). The more thresholds
3.9 Application 1 – Data Recognition
57
we keep, the higher the network accuracy we can obtain on the training data set; but keeping those thresholds which provide little contribution can affect the ability of the designed network to generalize to new or testing data.
Fig. 3.8. Application 1: (a) (bottom) The sorted contribution of each threshold in the order of its contribution to the class separated by the threshold. (b) (top) Accumulated network performance in the order of the sorted thresholds.
In the simplification procedure we seek to attain a desired network performance, which comes from the design specifications; this value is used to prune thresholds. In this example, the desired network performance is 92% correct decisions. A horizontal dash-dot line in Fig. 3.8(b) marks the desired 92% accuracy. The line intersects the curve of the accumulated network performance. Projecting the intersection onto Fig. 3.8(a), shown as the vertical broken line in both Fig. 3.8(a) and (b), determines the number of thresholds necessary to meet the desired network performance. For this example, the first 38 thresholds in Fig. 3.8(a) meet the 92% network accuracy requested in the design specifications; thus thresholds 39 to 98 in the sorted contribution array can be deleted. If all of the thresholds associated with a hidden node have been deleted, then that hidden node should also be deleted. After this lossy simplification, the designed PFN achieves 91% on the training set and 88% on the test set using 31 hidden nodes and 38 thresholds. Thus, the performance on the test set is close to the performance on the training set.
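The pruning step above reduces to a cumulative-sum search over the sorted contribution array. The sketch below uses a synthetic contribution array (the values are made up, not the book's 98 measured contributions) to show the mechanics.

```python
import numpy as np

# Lossy simplification sketch: keep the fewest thresholds whose accumulated
# contribution meets the design specification.
rng = np.random.default_rng(2)
contrib = rng.exponential(scale=1.0, size=98)    # hypothetical contributions
contrib = np.sort(contrib)[::-1]                 # sort in decreasing order
contrib = contrib / contrib.sum() * 100.0        # contributions sum to 100%

desired = 92.0                                   # desired network performance (%)
cum = np.cumsum(contrib)                         # accumulated performance curve
n_keep = int(np.searchsorted(cum, desired)) + 1  # first point where cum >= 92%

# Thresholds n_keep+1 .. 98 (and any hidden node that loses all of its
# thresholds) would be deleted in the lossy simplification.
kept = contrib[:n_keep]
```

With the book's measured array, this search returns 38 thresholds; with the synthetic values here the count differs, but the procedure is the same.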
3.10 Application 2 – Multispectral Pattern Recognition

We applied principal feature classification to recognize categories of land cover from three images of Block Island, Rhode Island, corresponding to three spectral bands: two visible and one infrared. Each complete image has 4591 × 7754 pixels. Each pixel has a resolution of 1.27 m and belongs to one of 14 categories of land cover. The training data set is a matrix formed from a subset of pixels which have been labeled. Each row is one training vector with 9 feature elements associated with one pixel [7], and each of these vectors was labeled with one of the 14 land cover categories. The 9 features consist of the pixel intensity in the three color bands; three local standard deviations of intensity in a 10 m diameter floating window around the row-designated pixel (one for each color); and three additional features from the side information of a soil database: the degree of local slope at the designated pixel, the aspect of the slope at the designated pixel, and the drainage class of the soil. In [7], Duhaime identified the 14 categories of land cover for supervised training.

The computer experiments on the multispectral image features started with the backpropagation and RBF algorithms [4]. However, neither of them could achieve the needed classification results in a reasonable amount of time, as estimated from their convergence speeds. The PFN and a modified radial basis function (MRBF) algorithm [15] were then applied to solve the problem. The experimental results are listed and compared in Table 3.1.

Table 3.1. Comparison of Three Algorithms in the Land Cover Recognition

Algorithm        Mflops   CPU Time (sec.)   No. of Nodes   Accuracy on Test Sets
PFN (proposed)    37.64         58                77              72%
MRBF [15]        221.93        518               490              60%
LDA [7]             –            –                 –              55%

PFN: principal feature network; MRBF: modified radial basis function network; LDA: linear discriminant analysis.
The MRBF method used a training data set of 140 sample vectors (limited by memory space), 10 from each category, and was tested on a test data set of 700 samples, 50 from each category. The method achieved an average accuracy of 60% on the test set over all 14 categories defined above. The training took 518 seconds of CPU time on a Sun Sparc IPX workstation. The PFN was trained with 700 training vectors, since the PFN can be trained with much less memory space. It took only 58 seconds of CPU time and reached an average performance of 72% on the same test set over all 14 categories. (The performance is 65% when using the 140-sample set for training.) The performance of LDA, as reported in [7], was 55% on an average of 11 categories out of the
all 14 categories based on diﬀerent training and test data sets. The simulation software was written in an interpretive language for both PFN and MRBF.
3.11 Conclusions

The principal feature network (PFN) has been compared in experiments with popular neural networks, such as BP and RBF networks, and with many constructive algorithms [11], such as the cascade-correlation architecture and decision tree algorithms. Generally speaking, the PFN possesses the advantages of the constructive algorithms. By applying multivariate statistical analysis to define and train the hidden nodes, the classifier can be trained much faster than by gradient-descent or other iterative algorithms. The overfitting problem results from requiring a higher classification accuracy than the system can actually achieve; it is solved by appropriately pruning thresholds using the design specifications, so that generalization to new test data can be realized through lossy simplification. Compared with other algorithms, the PFN needs much less computation time in training and uses simpler structures for implementation while achieving the same or better classification performance than the traditional neural network approaches. Due to these advantages, the PFN has been selected and implemented in important real-world applications. Through this chapter, we hope readers have gained a better understanding of the concepts of multivariate statistics and neural networks.
References

1. Bishop, C., Neural Networks for Pattern Recognition. New York: Oxford Univ. Press, 1995.
2. Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J., Classification and Regression Trees. Belmont, CA: Wadsworth International Group, 1984.
3. Chen, S., Cowan, C. F. N., and Grant, P. M., "Orthogonal least squares learning algorithm for radial basis function networks," IEEE Transactions on Neural Networks, vol. 2, March 1991.
4. Demuth, H. and Beale, M., Neural Network Toolbox User's Guide. Natick, MA: The MathWorks Inc., 1994.
5. Duda, R. O. and Hart, P. E., Pattern Classification and Scene Analysis. New York: John Wiley & Sons, 1973.
6. Duda, R. O., Hart, P. E., and Stork, D. G., Pattern Classification, Second Edition. New York: John Wiley & Sons, 2001.
7. Duhaime, R. J., The Use of Color Infrared Digital Orthophotography to Map Vegetation on Block Island, Rhode Island. Master's thesis, University of Rhode Island, Kingston, RI, May 1994.
8. Fisher, R. A., "The statistical utilization of multiple measurements," Annals of Eugenics, vol. 8, pp. 376–386, 1938.
9. Gallant, S. I., Neural Network Learning and Expert Systems. Cambridge, MA: The MIT Press, 1993.
10. Johnson, R. A. and Wichern, D. W., Applied Multivariate Statistical Analysis. New Jersey: Prentice Hall, 1988.
11. Li, Q., Classification Using Principal Features with Application to Speaker Verification. PhD thesis, University of Rhode Island, Kingston, RI, October 1995.
12. Li, Q. and Tufts, D. W., "Improving discriminant neural network (DNN) design by the use of principal component analysis," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, (Detroit, MI), pp. 3375–3379, May 1995.
13. Li, Q. and Tufts, D. W., "Principal feature classification," IEEE Trans. Neural Networks, vol. 8, pp. 155–160, Jan. 1997.
14. Li, Q. and Tufts, D. W., "Synthesizing neural networks by sequential addition of hidden nodes," in Proceedings of the IEEE International Conference on Neural Networks, (Orlando, FL), pp. 708–713, June 1994.
15. Li, Q., Tufts, D. W., Duhaime, R., and August, P., "Fast training algorithms for large data sets with application to classification of multispectral images," in Proceedings of the IEEE 28th Asilomar Conference, (Pacific Grove, CA), October 1994.
16. Reed, R., "Pruning algorithms – a survey," IEEE Transactions on Neural Networks, vol. 4, pp. 740–747, September 1993.
17. Sankar, A. and Mammone, R. J., "Growing and pruning neural tree networks," IEEE Transactions on Computers, vol. C-42, pp. 291–299, March 1993.
18. Scharf, L. L., Statistical Signal Processing. Reading, MA: Addison-Wesley, 1990.
19. Streit, R. L. and Luginbuhl, T. E., "Maximum likelihood training of probabilistic neural networks," IEEE Transactions on Neural Networks, vol. 5, September 1994.
20. Tufts, D. W. and Li, Q., "Principal feature classification," in Neural Networks for Signal Processing V, Proceedings of the 1995 IEEE Workshop, (Cambridge, MA), August 1995.
21. Werbos, P. J., The Roots of Backpropagation: From Ordered Derivatives to Neural Networks and Political Forecasting. New York: J. Wiley & Sons, 1994.
22. Zurada, J. M., Introduction to Artificial Neural Systems. New York: West Publishing Company, 1992.
Chapter 4
Non-Stationary Pattern Recognition
So far, we have discussed pattern recognition for stationary signals. In this chapter, we will discuss pattern recognition for both stationary and non-stationary signals. In speaker authentication, some tasks, such as speaker identification, are treated as stationary pattern recognition, while others, such as speaker verification, are treated as non-stationary pattern recognition. We will introduce the stochastic modeling approach for both stationary and non-stationary pattern recognition. We will also introduce the Gaussian mixture model (GMM) and the hidden Markov model (HMM), two popular models that will be used throughout the book.
4.1 Introduction

Signals or feature vectors extracted from speech are regarded as a stochastic process, which can be either stationary or non-stationary. A stationary process is a stochastic process whose joint probability distribution does not change when shifted in time. Stationary pattern recognition is used to recognize patterns which can be characterized as a stationary process, such as still image recognition; the model used for recognition can be a feedforward neural network, the principal feature network, or a Gaussian mixture model (GMM). A non-stationary process, on the other hand, is a stochastic process whose joint probability distribution does change when shifted in time. Non-stationary pattern recognition is used to recognize patterns which change over time, such as video images or speech signals; in this case, the model used for recognition can be a hidden Markov model (HMM) or a recurrent neural network.

In this chapter, we introduce the GMM and the HMM. In speaker authentication, the GMM is used for context-independent speaker identification, while the HMM is used for speaker verification and verbal information verification. Here, we focus on the basic GMM and HMM concepts and Bayesian decision
theory. In subsequent chapters, we present real applications of the HMM and GMM in speaker authentication.
4.2 Gaussian Mixture Models (GMM) for Stationary Process

The GMM is defined to represent stochastic data distributions:
$$p(o_t|C_j) = p(o_t|\lambda_j) = \sum_{i=1}^{I} c_i \, \mathcal{N}(o_t; \mu_i, \Sigma_i), \qquad (4.1)$$
where $\lambda_j$ is the GMM for class $C_j$, $c_i$ is a mixture weight which must satisfy the constraint $\sum_{i=1}^{I} c_i = 1$, $I$ is the total number of mixture components, and $\mathcal{N}(\cdot)$ is a Gaussian density function:
$$\mathcal{N}(o_t; \mu_i, \Sigma_i) = \frac{1}{(2\pi)^{d/2} |\Sigma_i|^{1/2}} \exp\left\{ -\frac{1}{2} (o_t - \mu_i)^T \Sigma_i^{-1} (o_t - \mu_i) \right\}, \qquad (4.2)$$
where $\mu_i$ and $\Sigma_i$ are the $d$-dimensional mean vector and covariance matrix of the $i$th component, $t = 1, \ldots, T$, and $o_t$ is the $t$th sample or observation.

Given observed feature vectors, the GMM parameters can be estimated iteratively using a hill-climbing algorithm or the expectation-maximization (EM) algorithm [3]. As has been proved, the algorithm ensures a monotonic increase in the log-likelihood during the iterative procedure until a fixed-point solution is reached [21, 7]. In most applications, model parameter estimation can be accomplished in just a few iterations. At each step of the iteration, the parameter estimation formulas for mixture $i$ are:
$$\hat{c}_i = \frac{1}{T} \sum_{t=1}^{T} p(i|o_t, \lambda) \qquad (4.3)$$
$$\hat{\mu}_i = \frac{\sum_{t=1}^{T} p(i|o_t, \lambda) \, o_t}{\sum_{t=1}^{T} p(i|o_t, \lambda)} \qquad (4.4)$$
$$\hat{\Sigma}_i = \frac{\sum_{t=1}^{T} p(i|o_t, \lambda) (o_t - \hat{\mu}_i)(o_t - \hat{\mu}_i)^T}{\sum_{t=1}^{T} p(i|o_t, \lambda)} \qquad (4.5)$$
where
$$p(i|o_t, \lambda) = \frac{c_i \, p(o_t|i, \lambda)}{\sum_{j=1}^{I} c_j \, p(o_t|j, \lambda)}. \qquad (4.6)$$
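The re-estimation formulas (4.3)-(4.6) can be sketched directly in numpy. The example below is a hedged, minimal 1-D, two-component version on synthetic data (the data, initializations, and iteration count are all illustrative); the E-step computes the responsibilities of Eq. (4.6) and the M-step applies (4.3)-(4.5).

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy 1-D data from two Gaussians; fit a 2-component GMM by EM.
O = np.concatenate([rng.normal(-2.0, 1.0, 500), rng.normal(3.0, 1.0, 500)])
T = len(O)

c = np.array([0.5, 0.5])          # mixture weights
mu = np.array([-1.0, 1.0])        # initial means (deliberately rough)
var = np.array([1.0, 1.0])        # initial variances

def log_lik():
    dens = c * np.exp(-0.5 * (O[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    return np.sum(np.log(dens.sum(axis=1)))

ll_before = log_lik()
for _ in range(20):
    # E-step: responsibilities p(i | o_t, lambda), Eq. (4.6)
    dens = c * np.exp(-0.5 * (O[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    p = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimation formulas (4.3)-(4.5)
    c = p.sum(axis=0) / T
    mu = (p * O[:, None]).sum(axis=0) / p.sum(axis=0)
    var = (p * (O[:, None] - mu) ** 2).sum(axis=0) / p.sum(axis=0)
ll_after = log_lik()
```

Consistent with the monotonicity property cited above [21, 7], the log-likelihood after the EM iterations is never lower than at initialization, and the component means migrate toward the generating means.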
One application of the above model is context-independent speaker identification, where we assume that each speaker's speech characteristics manifest only acoustically and are represented by one (model) class. When a spoken utterance is long enough, it is reasonable to assume that the acoustic characteristics are independent of its content. For a group of M speakers, in the enrollment phase, we train M GMMs, $\lambda_1, \lambda_2, \ldots, \lambda_M$, using the re-estimation algorithm. In the test phase, given an observation sequence $O$, the objective is to find, in the prescribed speaker population, the speaker model that achieves the maximum posterior probability. From Eq. (4.19), and assuming the priors are the same for all speakers, the decision rule is: Take action $\alpha_k$, where
$$k = \arg\max_{1 \le i \le M} \sum_{t=1}^{T} \log p(o_t|\lambda_i), \qquad (4.7)$$
where $\alpha_k$ is the action of deciding that the observation $O$ is from speaker $k$. In summary, the decision on authentication is made by computing the likelihood based on the probability density functions (pdf's) of the feature vectors; the parameters that define these pdf's have to be estimated a priori.

4.2.1 An Illustrative Example

In this example, we artificially generated three classes of two-dimensional data. The distributions were of Gaussian-mixture type, with three components in each class. For each class, 1,500 tokens were drawn from each of the three components; therefore, there were 4,500 tokens in total per class. The data density distributions of the three classes are shown in Fig. 4.1 to Fig. 4.3. The contours of the distributions of classes 1, 2, and 3 are shown in Fig. 4.4, where the means are represented as +, ∗, and boxes, respectively.

In real applications, the number of mixture components is unknown; therefore, we assumed that the GMMs to be trained have two mixture components with full covariance matrices for each class. Maximum likelihood (ML) estimation was applied to train the GMMs with four iterations on the training data drawn from the ideal models. The contours that represent the pdf's of each of the GMMs after ML estimation are plotted in Fig. 4.5. The testing data, with 4,500 tokens for each class, were obtained in the same way as the training data. The ML classifier provided accuracies of 76.07% and 75.97% on the training and testing data sets, respectively. The decision boundary is shown in Fig. 4.6 as the dashed line.

For comparison purposes, suppose we know the real models and therefore use the same models that generated the training data to do the testing, i.e., we use three mixtures in each model and the ML classifier. The performances are 77.19% and 77.02% for the training and testing data sets. This is the optimal performance in the sense of minimizing the Bayes error. The ideal boundary is plotted in Fig. 4.6 as the solid line. This example will be continued when we discuss discriminative training in Chapter 13.
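The identification rule (4.7) itself is a one-liner once the per-model log-likelihoods are available. The sketch below is illustrative only: each "speaker GMM" is reduced to a single 1-D Gaussian (a one-component GMM) with made-up means, and an utterance generated by one of the speakers is scored against all models.

```python
import numpy as np

rng = np.random.default_rng(4)

# Three hypothetical speaker models, each a one-component 1-D GMM.
means = np.array([-2.0, 0.0, 2.0])
std = 1.0

def log_gmm(O, m):
    # Total log-likelihood sum_t log N(o_t; m, std^2) over the utterance.
    return np.sum(-0.5 * ((O - m) / std) ** 2 - np.log(std * np.sqrt(2 * np.pi)))

# Test utterance actually produced by speaker index 2 (mean 2.0).
O = rng.normal(2.0, 1.0, 200)

# Decision rule (4.7): pick the model with the largest total log-likelihood.
scores = [log_gmm(O, m) for m in means]
k = int(np.argmax(scores))
```

With 200 observations the correct model dominates by a wide margin, which is why the content-independence assumption above only needs the utterance to be "long enough."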
Fig. 4.1. Class 1: a bivariate Gaussian distribution with m1 = [0 5], m2 = [−3 3], and m3 = [−5 0]. Σ1 = [1.41 0; 0 1.41], Σ2 = [1.22 0.09; 0.09 1.22], and Σ3 = [1.37 0.37; 0.27 1.37]
Fig. 4.2. Class 2: a bivariate Gaussian distribution with m1 = [2 5], m2 = [−1 3], and m3 = [0 0]. Σ1 = [1.41 0; 0 1.41], Σ2 = [0.77 1.11; 1.11 1.09], and Σ3 = [1.41 0.04; 0.04 1.41]
Fig. 4.3. Class 3: a bivariate Gaussian distribution with m1 = [−3 −1], m2 = [−2 −2], and m3 = [−5 −2]. Σ1 = [1.41 0; 0 1.41], Σ2 = [0.76 0.11; 0.11 1.09], and Σ3 = [1.41 0.04; 0.04 1.41]
Fig. 4.4. Contours of the pdf's of 3-mixture GMMs: the models are used to generate 3 classes of training data.
Fig. 4.5. Contours of the pdf's of 2-mixture GMMs: the models are trained by ML estimation using 4 iterations.
Fig. 4.6. Enlarged decision boundaries for the ideal 3-mixture models (solid line) and 2-mixture ML models (dashed line).
4.3 Hidden Markov Model (HMM) for Non-Stationary Process

A speech signal is non-stationary. For many applications, such as speaker verification and speech recognition, we have to consider the temporal information of the non-stationary speech signal; therefore, a more powerful model, the HMM, is applied to characterize both the temporal structure and the corresponding statistical variations along a sequence of feature vectors or observations of an utterance.
In speech and speaker verification, an HMM is trained to represent the acoustic pattern of a subword, a word, or a whole pass-phrase. There are many variants of HMMs. The simplest and most popular one is an N-state, left-to-right model without state skips, as shown in Fig. 4.7; it is widely used in speaker authentication. The figure shows a Markov chain with a sequence of states, representing the evolution of the speech signal. Within each state, a GMM is used to characterize the observed speech feature vectors as a multivariate distribution.
Fig. 4.7. Left-to-right hidden Markov model.
An HMM, denoted as $\lambda$, can be completely characterized by three sets of parameters: the state transition probabilities $A$, the observation densities $B$, and the initial state probabilities $\Pi$, as shown in the following notation:
$$\lambda = \{A, B, \Pi\} = \{a_{i,j}, b_i, \pi_i\}, \quad i, j = 1, \ldots, N, \qquad (4.8)$$
where $N$ is the total number of states. Given an observation sequence $O = \{o_t\}_{t=1}^{T}$, the model parameters $\{A, B, \Pi\}$ of $\lambda$ can be trained by an iterative method to optimize a prescribed performance criterion, e.g., ML estimation. In practice, the segmental K-means algorithm [16] with ML estimation has been widely used. Following model initialization, the observation sequence is segmented into states based on the current model parameter set $\lambda$. Then, within each state, a new GMM is trained by the EM algorithm to maximize the likelihood. The new HMM $\hat{\lambda}$ is then used to re-segment the observation sequence by the Viterbi algorithm (see Chapter 6), followed by re-estimation of the model parameters. The iterative procedure usually converges within a few iterations.

In addition to the ML criterion, the model can also be trained by optimizing a discriminative objective. For example, the minimum classification error (MCE) criterion [9] was proposed along with a corresponding generalized probabilistic descent (GPD) training algorithm [8, 2] to minimize an objective function that closely approximates the error rate. Other criteria, like maximum mutual information (MMI) [1, 15], have also been attempted. Instead of modeling only the distribution of the data set of the target class, the
criteria also incorporate data from other classes. A discriminative model is thus constructed to implicitly model the underlying distribution of the target class, with an explicit emphasis on minimizing the classification error or maximizing the mutual information between the target class and the others. Discriminative training algorithms have been applied successfully to speech recognition, and the MCE/GPD algorithm has also been applied to speaker recognition [12, 10, 17, 18]. Generally speaking, models trained with discriminative objective functions yield better recognition and verification performance, but the long training time makes them less attractive for real applications. We will study the objectives of discriminative training in Chapter 12 and a new discriminative training algorithm for speaker recognition in Chapter 13.
4.4 Speech Segmentation

Given an HMM $\lambda$ and a sequence of observations $O = \{o_t\}_{t=1}^{T}$, the optimal state segmentation can be determined by evaluating the maximum joint state-observation probability, $\max_s P(O, s|\lambda)$, conventionally called maximum-likelihood decoding. One popular algorithm that accomplishes this objective efficiently is the Viterbi algorithm [19, 5]. When fast decoding and forced alignment are desired, a new reduced-search-space algorithm [11] can be employed. Details of our detection-based decoding algorithm are presented in Chapter 6.
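A minimal log-domain Viterbi sketch for a 3-state left-to-right HMM follows; the transition matrix and the pre-evaluated emission log-probabilities are made-up numbers chosen so that the observations clearly favor states 1, then 2, then 3.

```python
import numpy as np

# Illustrative left-to-right HMM: 3 states, no state skips, log domain.
logA = np.log(np.array([[0.6, 0.4, 0.0],
                        [0.0, 0.6, 0.4],
                        [0.0, 0.0, 1.0]]) + 1e-12)
logPi = np.log(np.array([1.0, 0.0, 0.0]) + 1e-12)
# log b_j(o_t) for a sequence of 6 observations (hypothetical values):
logB = np.log(np.array([
    [0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.2, 0.7, 0.1],
    [0.1, 0.8, 0.1], [0.1, 0.2, 0.7], [0.1, 0.1, 0.8]]))

T_len, N = logB.shape
delta = logPi + logB[0]                   # best partial path scores
psi = np.zeros((T_len, N), dtype=int)     # backpointers
for t in range(1, T_len):
    cand = delta[:, None] + logA          # cand[i, j]: best path ending i -> j
    psi[t] = np.argmax(cand, axis=0)
    delta = cand[psi[t], np.arange(N)] + logB[t]

# Backtrack the maximum-likelihood state sequence, max_s P(O, s | lambda).
path = [int(np.argmax(delta))]
for t in range(T_len - 1, 0, -1):
    path.append(int(psi[t][path[-1]]))
path = path[::-1]
```

For this toy input the decoded segmentation assigns two frames to each state, and the left-to-right topology guarantees the state index never decreases along the path.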
4.5 Bayesian Decision Theory

In an M-class recognition problem, we are 1) given an observation (or a feature vector) $o$ in a $d$-dimensional Euclidean space $R^d$ and a set of classes designated as $\{C_1, C_2, \ldots, C_M\}$, and 2) asked to make a decision, for example, to classify $o$ into, say, class $C_i$, where one class can be one speaker or one acoustic unit. We denote this as an action $\alpha_i$. By Bayes' formula, the probability of class $C_i$ given $o$ is the posterior (or a posteriori) probability:
$$P(C_i|o) = \frac{p(o|C_i) P(C_i)}{p(o)}, \qquad (4.9)$$
where $p(o|C_i)$ is the conditional probability, $P(C_i)$ is the prior probability, and
$$p(o) = \sum_{j=1}^{M} p(o|C_j) P(C_j) \qquad (4.10)$$
can be viewed as a scale factor that guarantees that the posterior probabilities sum to one.
Let $L(\alpha_i|C_j)$ be the loss function describing the loss incurred for taking action $\alpha_i$ when the true class is $C_j$. The expected loss (or risk) associated with taking action $\alpha_i$ is
$$R(\alpha_i|o) = \sum_{j=1}^{M} L(\alpha_i|C_j) P(C_j|o). \qquad (4.11)$$
This leads to the Bayes decision rule: to minimize the overall risk, compute the above risk for $i = 1, \ldots, M$ and then select the action $\alpha_i$ for which $R(\alpha_i|o)$ is minimum. For speaker authentication, we are interested in the zero-one loss function:
$$L(\alpha_i|C_j) = \begin{cases} 0 & i = j \\ 1 & i \ne j \end{cases} \qquad i, j = 1, \ldots, M. \qquad (4.12)$$
It assigns no loss to a correct decision and a unit loss to an error, equivalent to counting the errors. The risk for this specific loss function is
$$R(\alpha_i|o) = \sum_{j=1}^{M} L(\alpha_i|C_j) P(C_j|o) \qquad (4.13)$$
$$= \sum_{j \ne i} P(C_j|o) = 1 - P(C_i|o). \qquad (4.14)$$
Thus, to minimize the risk or error rate, we take the action $\alpha_k$ that maximizes the posterior probability $P(C_i|o)$:
$$\text{Take action } \alpha_k, \text{ where } k = \arg\max_{1 \le i \le M} P(C_i|o). \qquad (4.15)$$
When the expected value of the loss function is equivalent to the error rate, this is called minimum-error-rate classification [4]. Recalling the Bayes formula in Eq. (4.9), when the density $p(o|C_i)$ has been estimated for all classes and the prior probabilities are known, we can rewrite the above decision rule as:
$$\text{Take action } \alpha_k, \text{ where } k = \arg\max_{1 \le i \le M} p(o|C_i) P(C_i). \qquad (4.16)$$
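The chain from (4.9)-(4.10) to the minimum-error-rate rule (4.15)-(4.16) can be traced in a few lines. The sketch below uses three 1-D Gaussian class densities with assumed, illustrative means and priors; it computes the evidence, the posteriors, and the MAP decision.

```python
import math

# Illustrative parameters: three 1-D unit-variance Gaussian classes.
means, priors = [-2.0, 0.0, 2.0], [0.2, 0.5, 0.3]

def gauss(o, m, s=1.0):
    # Class-conditional density p(o | C_i) = N(o; m, s^2).
    return math.exp(-0.5 * ((o - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))

o = 1.6
evidence = sum(gauss(o, m) * P for m, P in zip(means, priors))      # p(o), Eq. (4.10)
posteriors = [gauss(o, m) * P / evidence for m, P in zip(means, priors)]  # Eq. (4.9)
k = max(range(3), key=lambda i: posteriors[i])                      # action alpha_k, Eq. (4.15)
```

Note that the argmax over posteriors and the argmax over $p(o|C_i)P(C_i)$ in (4.16) always agree, since the evidence $p(o)$ is the same scale factor for every class.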
So far, we have only considered the case of a single observation (or feature vector) $o$. In speaker authentication, we always encounter or employ a sequence of observations $O = \{o_t\}_{t=1}^{T}$, where $T$ is the total number of observations. After speech segmentation (which will be discussed later), we assume that during a short time period these sequential observations are produced by the same speaker and belong to the same acoustic class or unit, say $C_i$. Furthermore, if we assume that the observations are independent and identically distributed (i.i.d.), the joint posterior probability $P(C_i|O)$ is merely the product of the component probabilities:
$$P(C_i|O) = \prod_{t=1}^{T} P(C_i|o_t). \qquad (4.17)$$
From Eq. (4.16), the decision rule for the compound decision problem is
$$\alpha_k = \arg\max_{1 \le i \le M} \prod_{t=1}^{T} p(o_t|C_i) P(C_i). \qquad (4.18)$$
In practice, the decision is usually based on the log-likelihood score:
$$\alpha_k = \arg\max_{1 \le i \le M} \sum_{t=1}^{T} \log p(o_t|C_i) P(C_i). \qquad (4.19)$$
4.6 Statistical Verification

Statistical verification, as applied to speaker verification and utterance verification, can be considered a two-class classification problem: whether a spoken utterance is from the true speaker (the target source) or from an impostor (the alternative source). Given an observation $o$, a decision $\alpha_i$ is taken based on the following conditional risks derived from Eq. (4.9):
$$R(\alpha_1|o) = L(\alpha_1|C_1) P(C_1|o) + L(\alpha_1|C_2) P(C_2|o) \qquad (4.20)$$
$$R(\alpha_2|o) = L(\alpha_2|C_1) P(C_1|o) + L(\alpha_2|C_2) P(C_2|o). \qquad (4.21)$$
The action $\alpha_1$ corresponds to the decision of positive verification if
$$R(\alpha_1|o) < R(\alpha_2|o). \qquad (4.22)$$
Bringing (4.20) and (4.21) into (4.22) and rearranging the terms, we take action $\alpha_1$ if
$$\frac{P(C_1|o)}{P(C_2|o)} > \frac{L(\alpha_1|C_2) - L(\alpha_2|C_2)}{L(\alpha_2|C_1) - L(\alpha_1|C_1)} = T_1, \qquad (4.23)$$
where $T_1 > 1$ is a prescribed threshold. Furthermore, by applying the Bayes formula, we have
$$\frac{p(o|C_1)}{p(o|C_2)} > \frac{P(C_2)}{P(C_1)} T_1 = T_2. \qquad (4.24)$$
For a sequence of observations $O = \{o_t\}_{t=1}^{T}$ which are assumed to be independent and identically distributed (i.i.d.), we have the likelihood-ratio test:
$$r(O) = \frac{P(O|C_1)}{P(O|C_2)} = \frac{\prod_{t=1}^{T} p(o_t|C_1)}{\prod_{t=1}^{T} p(o_t|C_2)} > T_3. \qquad (4.25)$$
The same result can also be derived from the Neyman-Pearson decision formulation, hence the name Neyman-Pearson test [14, 13, 20]. It can be shown that the likelihood-ratio test minimizes the verification error for one class while keeping the verification error for the other class constant [6, 13]. In practice, we compute a log-likelihood ratio for verification:
$$R(O) = \log P(O|C_1) - \log P(O|C_2). \qquad (4.26)$$
A decision is made according to the rule:
$$\text{Acceptance: } R(O) \ge T; \qquad \text{Rejection: } R(O) < T, \qquad (4.27)$$
where $T$ is a threshold value, which can be determined theoretically or experimentally. There are two types of error in a test: false rejection, i.e., rejecting the hypothesis when it is actually true, and false acceptance, i.e., accepting it when it is actually false. The equal error rate (EER) is defined as the error rate at the operating point chosen to achieve equal error probabilities for the two types of error. The EER has been widely used as a verification performance indicator.

In utterance verification, we assume that the expected word or subword sequence is known, and so the task is to verify whether the input spoken utterance matches it. Similarly, in speaker verification, the text of the pass-phrase is known; the task is to verify whether the input spoken utterance matches the given sequence, using the model trained with the speaker's voice.
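The log-likelihood-ratio test (4.26)-(4.27) and the EER can be sketched on simulated trials. Everything below is an assumed toy setup: the target and impostor sources are unit-variance Gaussians with made-up means, each trial contains 20 i.i.d. observations, and the EER is read off by sweeping the threshold $T$.

```python
import numpy as np

rng = np.random.default_rng(5)

def loglik(O, m):
    # log P(O|C) under an i.i.d. N(m, 1) observation model, per trial (row).
    return np.sum(-0.5 * (O - m) ** 2 - 0.5 * np.log(2 * np.pi), axis=1)

true_trials = rng.normal(1.0, 1.0, (500, 20))   # utterances from C1 (true speaker)
imp_trials = rng.normal(0.0, 1.0, (500, 20))    # utterances from C2 (impostor)

# Scores R(O) = log P(O|C1) - log P(O|C2), Eq. (4.26).
R_true = loglik(true_trials, 1.0) - loglik(true_trials, 0.0)
R_imp = loglik(imp_trials, 1.0) - loglik(imp_trials, 0.0)

# Sweep the threshold T of Eq. (4.27); the EER is the operating point where
# the false rejection and false acceptance rates are (nearly) equal.
Ts = np.sort(np.concatenate([R_true, R_imp]))
fr = np.array([np.mean(R_true < T) for T in Ts])   # false rejection rate
fa = np.array([np.mean(R_imp >= T) for T in Ts])   # false acceptance rate
i = int(np.argmin(np.abs(fr - fa)))
eer = 0.5 * (fr[i] + fa[i])
```

Lengthening the trials or separating the two source means further pushes the score distributions apart and drives the EER toward zero, mirroring the intuition that longer utterances verify more reliably.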
4.7 Conclusions

In this chapter, we introduced the basic techniques for modeling stationary and non-stationary speech signals for speaker authentication. For speaker identification, the GMM is often used as the model for classification; for speaker verification, we often use the HMM as the model for recognition. The models and methods introduced in this chapter provide the foundation for developing baseline speaker authentication systems. In the following chapters, we will introduce advanced algorithms, using these baseline systems as benchmarks against which the new algorithms are compared. Although we will keep the GMM and HMM models, our training method will be extended from maximum likelihood to discriminative training, and our decision methods will be extended from Bayesian decision to detection-based decision.

This chapter completes the first part of this book and its goal of introducing the basic theory and models for pattern recognition and multivariate statistical analysis. In the following chapters, we will introduce advanced speaker authentication systems. Most of this work is from the author's previous research in collaboration with his colleagues.
References

1. Bahl, L. R., Brown, P. F., de Souza, P. V., and Mercer, R. L., "Maximum mutual information estimation of hidden Markov model parameters for speech recognition," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, (Tokyo), pp. 49–52, 1986.
2. Chou, W., "Discriminant-function-based minimum recognition error rate pattern-recognition approach to speech recognition," Proceedings of the IEEE, vol. 88, pp. 1201–1222, August 2000.
3. Dempster, A. P., Laird, N. M., and Rubin, D. B., "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, vol. 39, pp. 1–38, 1977.
4. Duda, R. O., Hart, P. E., and Stork, D. G., Pattern Classification, Second Edition. New York: John Wiley & Sons, 2001.
5. Forney, G. D., "The Viterbi algorithm," Proceedings of the IEEE, vol. 61, pp. 268–278, March 1973.
6. Fukunaga, K., Introduction to Statistical Pattern Recognition, Second Edition. New York: Academic Press, 1990.
7. Juang, B.-H., "Maximum-likelihood estimation for mixture multivariate stochastic observations of Markov chains," AT&T Technical Journal, vol. 64, pp. 1235–1249, July/August 1985.
8. Juang, B.-H., Chou, W., and Lee, C.-H., "Minimum classification error rate methods for speech recognition," IEEE Trans. on Speech and Audio Processing, vol. 5, pp. 257–265, May 1997.
9. Juang, B.-H. and Katagiri, S., "Discriminative learning for minimum error classification," IEEE Transactions on Signal Processing, vol. 40, pp. 3043–3054, December 1992.
10. Korkmazskiy, F. and Juang, B.-H., "Discriminative adaptation for speaker verification," in Proceedings of Int. Conf. on Spoken Language Processing, (Philadelphia), pp. 28–31, 1996.
11. Li, Q., "A detection approach to search-space reduction for HMM state alignment in speaker verification," IEEE Trans. on Speech and Audio Processing, vol. 9, pp. 569–578, July 2001.
12. Liu, C. S., Lee, C.-H., Chou, W., Juang, B.-H., and Rosenberg, A. E., "A study on minimum error discriminative training for speaker recognition," Journal of the Acoustical Society of America, vol. 97, pp. 637–648, January 1995.
13. Neyman, J. and Pearson, E. S., "On the problem of the most efficient tests of statistical hypotheses," Phil. Trans. Roy. Soc. A, vol. 231, pp. 289–337, 1933.
14. Neyman, J. and Pearson, E. S., "On the use and interpretation of certain test criteria for purposes of statistical inference," Biometrika, vol. 20A, Pt. I, pp. 175–240; Pt. II, 1928.
15. Normandin, Y., Cardin, R., and De Mori, R., "High-performance connected digit recognition using maximum mutual information estimation," IEEE Trans. on Speech and Audio Processing, vol. 2, pp. 299–311, April 1994.
16. Rabiner, L. R., Wilpon, J. G., and Juang, B.-H., "A segmental k-means training procedure for connected word recognition," AT&T Technical Journal, vol. 65, pp. 21–31, May/June 1986.
17. Rosenberg, A. E., Siohan, O., and Parthasarathy, S., "Speaker verification using minimum verification error training," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, (Seattle), pp. 105–108, May 1998.
18. Siohan, O., Rosenberg, A. E., and Parthasarathy, S., "Speaker identification using minimum verification error training," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, (Seattle), pp. 109–112, May 1998.
19. Viterbi, A. J., "Error bounds for convolutional codes and an asymptotically optimal decoding algorithm," IEEE Transactions on Information Theory, vol. IT-13, pp. 260–269, April 1967.
20. Wald, A., Sequential Analysis. New York: Chapman & Hall, 1947.
21. Wu, C. F. J., "On the convergence properties of the EM algorithm," The Annals of Statistics, vol. 11, pp. 95–103, 1983.
Chapter 5 Robust Endpoint Detection
Often the first step in speech signal processing is endpoint detection, which separates speech from silence for further processing. This topic has been studied for several decades. However, as wireless communications and VoIP phones become more and more popular, more background and system noises affect the communication channels, which poses a challenge to existing algorithms; new and robust algorithms are therefore needed. When speaker authentication is applied in adverse acoustic environments, endpoint detection and energy normalization can be crucial to the functioning of real systems. In low signal-to-noise ratio (SNR) and non-stationary environments, conventional approaches to endpoint detection and energy normalization often fail, and speaker and speech recognition performance usually degrades dramatically. The purpose of this chapter is to address this endpoint problem. The goal is to develop endpoint detection algorithms that are invariant to different SNR levels. For different types of applications, we developed two approaches: a real-time approach and a batch-mode approach. We focus on the real-time approach in this chapter; the batch-mode approach is available in [18]. The real-time approach uses an optimal filter plus a three-state decision diagram for endpoint detection. The filter is designed using several criteria to ensure accuracy and robustness, and its response is almost invariant at various background noise levels. The detected endpoints are then applied to energy normalization sequentially. Evaluation results show that the real-time algorithm significantly reduces string error rates in low-SNR situations; the error-reduction rates even exceed 50% on several evaluated databases. The algorithms presented in this chapter can also be applied to speech recognition and voice communication systems, networks, and devices, and can be implemented in either hardware or software.
This work was originally reported by the author, Zheng, Tsai, and Zhou in [18].
Q. Li, Speaker Authentication, Signals and Communication Technology, DOI: 10.1007/9783642237317_5, Ó SpringerVerlag Berlin Heidelberg 2012
5.1 Introduction

In speaker authentication and many other voice applications, we need to process signals in utterances consisting of speech, silence, and other background noise. The detection of the presence of speech embedded in various types of non-speech events and background noise is called endpoint detection, speech detection, or speech activity detection. In this chapter, we address endpoint detection by sequential processing to support real-time recognition (in which the recognition response is as fast as or faster than recording an utterance). The sequential process is often used in automatic speech recognition (ASR) [21], while batch-mode processing is often allowed in speaker recognition [17], name dialing [16], command control, and embedded systems, where utterances are usually as short as a few seconds and the delay in response is usually small. Endpoint detection has been studied for several decades. The first application was in a telephone transmission and switching system developed at Bell Labs for time assignment of communication channels [5]. The principle was to use the free channel time to interpolate additional speakers by speech activity detection. Since then, various speech-detection algorithms have been developed for ASR, speaker verification, echo cancellation, speech coding, and other applications. In general, different applications need different algorithms to meet their specific requirements in terms of accuracy, computational complexity, robustness, sensitivity, response time, etc. The approaches include those based on energy thresholds (e.g. [26]), pitch detection (e.g. [8]), spectrum analysis, cepstral analysis [11], zero-crossing rate [22, 12], periodicity measures, hybrid detection [13], fusion [24], and many other methods. Furthermore, similar issues have been studied in other research areas, such as edge detection in image processing [6, 20] and change-point detection in theoretical statistics [7, 3, 25, 15, 4].
As is well known, endpoint detection is crucial to both ASR and speaker recognition because it often affects a system's accuracy and speed, for several reasons. First, cepstral mean subtraction (CMS) [2, 1, 10], a popular algorithm for robust speaker and speech recognition, needs accurate endpoints to compute the mean of the speech frames precisely in order to improve recognition accuracy. Second, if silence frames can be removed prior to recognition, the accumulated utterance likelihood scores will focus on the speech portion of an utterance instead of on both noise and speech, which has the potential to increase recognition accuracy. Third, it is hard to model noise and silence accurately in changing environments; this effect can be limited by removing background noise frames in advance. Fourth, removing non-speech frames can significantly reduce the computation time when the number of non-speech frames is large. Finally, for open speech recognition systems, such as open-microphone desktop applications and audio transcription of broadcast news, it is necessary to segment utterances from the continuing audio input.
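The dependence of CMS on good endpoints can be seen in a small sketch (illustrative variable names; assumes NumPy): the channel mean is estimated only over frames flagged as speech by the endpoint detector.

```python
import numpy as np

def cepstral_mean_subtraction(cepstra, speech_mask):
    """Remove a stationary channel offset by subtracting the mean
    cepstrum, computed over detected speech frames only so that
    silence/noise frames do not bias the estimate.

    cepstra     : (num_frames, num_coeffs) array of cepstral vectors
    speech_mask : boolean array marking frames inside the endpoints
    """
    channel_mean = cepstra[speech_mask].mean(axis=0)
    return cepstra - channel_mean

# Toy data: speech frames carry a constant "channel" offset of 3.0.
rng = np.random.default_rng(1)
cep = rng.normal(0.0, 0.1, size=(100, 12))
mask = np.zeros(100, dtype=bool)
mask[20:80] = True
cep[mask] += 3.0
cleaned = cepstral_mean_subtraction(cep, mask)
```

By construction, the mean over the speech frames is zero after subtraction; if silence frames were wrongly included in the mask, the estimated channel mean would be biased toward the noise floor.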
In applications of speech and speaker recognition, non-speech events and background noise complicate the endpoint detection problem considerably. For example, the endpoints of speech are often obscured by speaker-generated artifacts such as clicks, pops, and heavy breathing, or by dial tones. Long-distance telephone transmission channels introduce similar types of artifacts and background noise. In recent years, as wireless, hands-free, and Voice over Internet Protocol (VoIP) phones have become more and more popular, endpoint detection has become even more difficult, since the signal-to-noise ratios (SNRs) of these kinds of communication devices and channels are usually lower than those of traditional landline telephones. Also, the noise in wireless and VoIP phones is more strongly non-stationary than in traditional telephones. The noise in today's telecommunications may come from the background, such as car noise, room reflections, street noise, and background talking, or from the communication systems themselves, through coding, transmission, packet loss, etc. In these cases, ASR or speaker authentication performance often degrades dramatically due to unreliable endpoint detection. Another problem related to endpoint detection is real-time energy normalization. In both ASR and speaker recognition, we usually normalize the energy feature such that the largest energy level in a given utterance is close to, or slightly below, a constant of zero or one. This is not a problem in batch-mode processing, but it can be a crucial problem in real-time processing, since it is difficult to estimate the maximal energy in an utterance from just a short-time data buffer while the acoustic environment is changing. It becomes especially hard in adverse acoustic environments. A look-ahead approach to energy normalization can be found in [8]. As we will point out later in this study, real-time energy normalization and endpoint detection are two related problems.
The more accurately we can detect endpoints, the better we can do real-time energy normalization. We note that endpoint detection is normally used as a front-end module before ASR or speaker recognition. When a real-world application allows running ASR on the entire utterance, the ASR decoder may provide more accurate endpoints if the silence models in the ASR system are trained properly. In general, however, the ASR-based approach to endpoint detection consumes more computational resources and takes longer than the detection-based approach described in this chapter. A good detection-based system must meet the following requirements: accurate location of detected endpoints; robust detection at various noise levels; low computational complexity; fast response time; and simple implementation. The real-time energy normalization problem is addressed together with endpoint detection. The rest of the chapter is organized as follows: In Section 5.2, we introduce a filter for endpoint detection. In Section 5.3, we present a sequential algorithm combining endpoint detection and energy normalization for speech recognition in adverse environments and provide experimental results from large database evaluations.
5.2 A Filter for Endpoint Detection

The feature vector used in ASR could be used for endpoint detection directly; however, to meet the low-complexity requirement, we use only the one-dimensional (1-D) short-term energy from the cepstral feature as the feature for endpoint detection:

g(t) = 10 \log_{10} \sum_{j=n_t}^{n_t+I-1} o(j)^2        (5.1)

where o(j) is a data sample, t is a frame number, g(t) is the frame energy in dB, I is the window length, and n_t is the index of the first data sample in the window.
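Eq. (5.1) amounts to the following computation (a sketch assuming NumPy; the small floor added inside the logarithm is my addition to avoid log 0 on digital silence):

```python
import numpy as np

def frame_log_energy(samples, frame_len=200, frame_shift=80):
    """Short-term log energy per Eq. (5.1): for each frame starting at
    sample n_t, g(t) = 10*log10( sum of o(j)^2 over I samples ).
    Defaults of I = 200 samples with an 80-sample shift correspond to
    25 ms windows every 10 ms at an 8 kHz sampling rate."""
    g = []
    for n_t in range(0, len(samples) - frame_len + 1, frame_shift):
        frame = np.asarray(samples[n_t:n_t + frame_len], dtype=float)
        g.append(10.0 * np.log10(np.sum(frame ** 2) + 1e-10))
    return np.array(g)

# A 500 Hz tone sampled at 8 kHz; each frame yields a similar energy value.
tone = np.sin(2 * np.pi * 500 / 8000 * np.arange(800))
print(frame_log_energy(tone))
```

Computing g(t) once per frame, rather than per sample, is what reduces the detector's workload from the sampling rate to the frame rate, as noted in the text.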
Thus, the detected endpoints are automatically aligned to the ASR feature vectors, and the computation is reduced from the speech sampling rate to the frame rate. For accurate and robust endpoint detection, we need a detector that can detect all possible endpoints from the energy feature. Since the output of the detector contains false acceptances, a decision module is then needed to make the final decisions based on the detector's output. Here, we assume that one utterance may have several speech segments separated by possible pauses. Each segment is delimited by a pair of endpoints, named the segment beginning and ending points. On the energy contour of an utterance, there is always a rising edge following a beginning point and a descending edge preceding an ending point. We call them beginning and ending edges, respectively, as shown in Fig. 5.4 (A). Since endpoints always come with edges, our approach is first to detect the edges and then to find the corresponding endpoints. The foundation of the theory of optimal edge detection was first established by Canny [6], who developed an optimal step-edge detector. Spacek [23] formed a performance measure combining the three quantities derived by Canny and provided the solution of the optimal filter for a step edge. Petrou and Kittler then extended the work to ramp-edge detection [20]. Since the edges corresponding to endpoints in the energy feature are closer to ramp edges than to ideal step edges, the author and Tsai applied Petrou and Kittler's filter to endpoint detection for speaker verification in [17]. In summary, we need a detector that meets the following general requirements: 1) invariant outputs at various background energy levels; 2) capability of detecting both beginning and ending points;
3) short time delay or look-ahead; 4) limited response level; 5) maximum output SNR at endpoints; 6) accurate location of detected endpoints; and 7) maximum suppression of false detection.
We then need to convert the above criteria into a mathematical representation. As discussed, it is reasonable to assume that the beginning edge in the energy contour is a ramp edge that can be modeled by the following function:

c(x) = 1 - e^{-sx}/2  for x >= 0;    c(x) = e^{sx}/2  for x <= 0        (5.2)

where x represents the frame number of the feature, and s is a positive constant which can be adjusted for different kinds of edges, such as beginning or ending edges, and for different sampling rates. The detector is a one-dimensional filter f(x) which can be operated as a moving-average filter on the energy feature. From the above requirements, the filter should have the following properties, which are similar to those in [20]:

1. It must be antisymmetric, i.e., f(x) = -f(-x), and thus f(0) = 0. This follows from the fact that we want it to detect antisymmetric features [6], i.e., to be sensitive to both beginning and ending edges according to requirement 2), and to have near-zero response to background noise at any level, i.e., to be invariant to background noise according to requirement 1).

2. According to requirement 3), it must be of finite extent, going smoothly to zero at its ends: f(±w) = 0, f'(±w) = 0, and f(x) = 0 for |x| >= w, where w is the half width of the filter.

3. According to requirement 4), it must have a given maximum amplitude k: f(x_m) = k, where x_m is defined by f'(x_m) = 0 and x_m is in the interval (-w, 0).

If we represent requirements 5), 6), and 7) as S(f(x)), L(f(x)), and C(f(x)), respectively, the combined objective function has the following form:

J = \max_{f(x)} F{S(f(x)), L(f(x)), C(f(x))}        (5.3)

subject to properties 1, 2, and 3. It aims at finding the filter function f(x) such that the value of the objective function F is maximal subject to properties 1 to 3. Fortunately, the objective function is very similar to that for optimal edge detection in image processing, and its details have been derived by Petrou and Kittler [20], following Canny [6], as below. Assume that the beginning or ending edge in the log energy is a ramp edge as defined in (5.2), and assume that the edges are embedded in white Gaussian
noise. Following Canny's criteria, Petrou and Kittler [20] derived the SNR for the filter f(x) as being proportional to

S = \frac{\int_{-w}^{0} f(x)(1 - e^{sx})\,dx}{\sqrt{\int_{-w}^{0} f(x)^2\,dx}}        (5.4)

where w is the half width of the actual filter. They consider a good locality measure to be inversely proportional to the standard deviation of the distribution of detected endpoints around the place where the edge is supposed to be. It was defined as

L = \frac{s^2 \int_{-w}^{0} f(x)e^{sx}\,dx}{\sqrt{\int_{-w}^{0} f'(x)^2\,dx}}.        (5.5)

Finally, the measure for the suppression of false edges is proportional to the mean distance between the neighboring maxima of the response of the filter to white Gaussian noise,

C = \frac{1}{w}\sqrt{\frac{\int_{-w}^{0} f'(x)^2\,dx}{\int_{-w}^{0} f''(x)^2\,dx}}.        (5.6)

Therefore, the combined objective function of the filter is:

J = \max_{f(x)} \{(S \cdot L \cdot C)^2\} = \max_{f(x)} \frac{s^4}{w^2} \cdot \frac{\left[\int_{-w}^{0} f(x)(1 - e^{sx})\,dx\right]^2 \left[\int_{-w}^{0} f(x)e^{sx}\,dx\right]^2}{\int_{-w}^{0} f(x)^2\,dx \,\int_{-w}^{0} f''(x)^2\,dx}.        (5.7)

After applying the method of Lagrange multipliers, the solution for the filter function is [20]:

f(x) = e^{Ax}[K_1 \sin(Ax) + K_2 \cos(Ax)] + e^{-Ax}[K_3 \sin(Ax) + K_4 \cos(Ax)] + K_5 + K_6 e^{sx}        (5.8)
where A and the K_i are filter parameters. Since f(x) is only half of the filter, when w = W the actual filter coefficients are obtained by extending f antisymmetrically:

h(i) = f(i) for -W <= i <= 0,    h(i) = -f(-i) for 1 <= i <= W        (5.9)

where i is an integer. The filter can then be operated as a moving-average filter:

F(t) = \sum_{i=-W}^{W} h(i)\,g(t + i)        (5.10)
where g(·) is the energy feature and t is the current frame number. An example of the designed optimal filter is shown in Fig. 5.1. Intuitively, the shape of the filter indicates that it must have a positive response to a beginning edge, a negative response to an ending edge, and a near-zero response to silence. Its response is essentially invariant to different background noise levels, since they all produce near-zero responses.
Fig. 5.1. Shape of the designed optimal ﬁlter.
5.3 Real-Time Endpoint Detection and Energy Normalization

The approach of using endpoint detection for real-time ASR or speaker authentication is illustrated in Fig. 5.2 [19]. We use an optimal filter, as discussed in the last section, to detect all possible endpoints, followed by three-state decision logic that decides the real endpoints. The information from the detected endpoints is also utilized for real-time energy normalization. Finally, all silence frames are removed, and only the feature vectors, including the cepstrum and the normalized energy corresponding to speech frames, are sent to the recognizer.
[Block diagram: the energy feature passes through the optimal filter and the decision logic to produce endpoints; the endpoints drive silence removal and energy normalization; the cepstrum and normalized energy form the ASR feature.]

Fig. 5.2. Endpoint detection and energy normalization for real-time ASR.
5.3.1 A Filter for Both Beginning- and Ending-Edge Detection

After evaluating the shapes of both beginning and ending edges, we chose the filter size W = 13 to meet requirements 2) and 3). For W = 7 and s = 1, the filter parameters were provided in [20] as: A = 0.41, [K_1 ... K_6] = [1.583, 1.468, -0.078, -0.036, -0.872, -0.56]. For W = 13 in our application, we just need to rescale: s = 7/W = 0.5385 and A = 0.41s = 0.2208, while the K_i remain as above. The shape of the designed filter is shown in Fig. 5.1 with a simple normalization, h/13. For real-time detection, let H(i) = h(i - 13); then the filter has 25 points in total with a 24-frame look-ahead, since both H(1) and H(25) are zeros. The filter operates as a moving-average filter:

F(t) = \sum_{i=2}^{24} H(i)\,g(t + i - 2)        (5.11)
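The design above can be reproduced numerically. The following sketch builds the half filter f(x) of Eq. (5.8) with the rescaled parameters, extends it antisymmetrically, and checks its response on a synthetic energy contour. The sign convention of the extension is my reading of Eq. (5.9) and Fig. 5.1 (chosen so a beginning edge yields a positive response), so treat it as an assumption:

```python
import numpy as np

# Rescaled parameters for W = 13 (Sec. 5.3.1); K_i from Petrou and Kittler.
W = 13
s = 7.0 / W                 # 0.5385
A = 0.41 * s                # 0.2208
K = [1.583, 1.468, -0.078, -0.036, -0.872, -0.56]

def f(x):
    """Half filter of Eq. (5.8), defined on [-W, 0]."""
    x = np.asarray(x, dtype=float)
    return (np.exp(A * x) * (K[0] * np.sin(A * x) + K[1] * np.cos(A * x))
            + np.exp(-A * x) * (K[2] * np.sin(A * x) + K[3] * np.cos(A * x))
            + K[4] + K[5] * np.exp(s * x))

# Antisymmetric extension (Eq. (5.9)) with the h/13 normalization of Fig. 5.1.
# This gives 2W+1 = 27 taps whose two end taps are ~0, i.e., effectively
# the 25-point filter described in the text.
h = np.concatenate([f(np.arange(-W, 1)), -f(-np.arange(1, W + 1))]) / 13.0

# Moving-average filtering of a synthetic energy contour (Eq. (5.10)):
# 40 dB silence, 70 dB speech, 40 dB silence.
g = np.concatenate([np.full(50, 40.0), np.full(60, 70.0), np.full(50, 40.0)])
F = np.array([np.dot(h, g[t - W:t + W + 1]) for t in range(W, len(g) - W)])
# F peaks positive near the beginning edge, negative near the ending edge,
# and stays near zero over constant background regardless of its level.
```

Because the taps sum to zero (antisymmetry), a constant background of any level produces a near-zero output, which is exactly the invariance property required of the detector.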
where g(·) is the energy feature and t is the current frame number. The output F(t) is then evaluated in a three-state transition diagram for the final endpoint decisions.

5.3.2 Decision Diagram

The endpoint decision is made by comparing the value of F(t) with some predetermined thresholds. Due to the sequential nature of the detector and the complexity of the decision procedure, we use a three-state transition diagram to make the final decisions. As shown in Fig. 5.3, the three states are: silence, in speech, and leaving speech. Either the silence or the in-speech state can be the starting state, and any state can be a final state. In the following discussion, we assume that the silence state is the starting state. The input is F(t), and the output is the detected frame numbers of beginning and ending points. The transition conditions are labeled on the edges between states, and the actions are listed in parentheses. "Count" is a frame counter, T_U and T_L are two thresholds with T_U > T_L, and "Gap" is an integer indicating the required number of frames from a detected endpoint to the actual end of speech. We use Fig. 5.4 as an example to illustrate the state transitions. The energy for a spoken digit "4" is plotted in Fig. 5.4 (A), and the filter output is shown in Fig. 5.4 (B). The state diagram stays in the silence state until F(t) reaches point A in Fig. 5.4 (B), where F(t) >= T_U means that a beginning point is detected. The actions are to output a beginning point (corresponding to the left vertical solid line in Fig. 5.4 (A)) and to move to the in-speech state. It stays in the in-speech state until reaching point B in Fig. 5.4 (B), where F(t) < T_L. The diagram then moves to the leaving-speech state and sets Count = 0. The counter resets several times until reaching point B'. At point C, Count = Gap = 30. An actual endpoint is detected as the left vertical
dashed line in Fig. 5.4 (A). If, before Count reaches Gap, F(t) > T_U, this means that a beginning edge is coming, and we should move back to the in-speech state. The 30-frame gap corresponds to the period of descending energy before reaching a real ending point. We note that the thresholds T_U and T_L are set on the filter outputs instead of on absolute energy. Since the filter output is stable across noise levels, the detected endpoints are more reliable. The constants Gap, T_U, and T_L can be determined empirically by plotting several utterances and the corresponding filter outputs. As we will show in the database evaluation, the algorithm is not very sensitive to the values of T_U and T_L, since the same values were used across different databases. Also, in some applications, two separate filters can be designed for beginning- and ending-point detection. The beginning filter can be smaller than 25 points, while the ending filter can be larger than 25 points. This approach may further improve accuracy; however, it will have a longer delay and use more computation. The 25-point filter used in this section was designed for both beginning- and ending-point detection at an 8 kHz sampling rate. Also, in the case that an utterance starts with an unvoiced phoneme, it is practical to back up about ten frames from the detected beginning point.

5.3.3 Real-Time Energy Normalization

Suppose that the maximal energy value in an utterance is g_max. The purpose of energy normalization is to normalize the utterance energy g(t) such that the
Fig. 5.4. Example: (A) Energy contour of digit “4”. (B) Filter outputs and state transitions.
largest value of the energy is close to zero, by performing g~(t) = g(t) - g_max. In real-time mode, we have to estimate the maximal energy g_max sequentially while the data are being collected. Here, the estimated maximal energy becomes a variable, denoted g^_max(t). Nevertheless, we can use the detected endpoints to obtain a better estimate. We first initialize the maximal energy to a constant g_0, which is selected empirically, and use it for normalization until we detect the first beginning point at M as in Fig. 5.4, i.e., g^_max(t) = g_0, for all t < M. If the average energy

\bar{g}(t) = E\{g(t);\, M <= t <= M + 2W\} >= g_m        (5.12)

where g_m is a preselected threshold to ensure that the new g^_max is not caused by a single click, we then estimate the maximal energy as:

g^_max(t) = \max\{g(t);\, M <= t <= M + 2W\}        (5.13)

where 2W + 1 = 25 is the length of the filter and 2W the length of the look-ahead window. At point M, the look-ahead window is from M to N, as shown in Fig. 5.4. From then on, we update g^_max(t) as:

g^_max(t) = \max\{g(t + 2W),\, g^_max(t - 1)\},  for all t > M.        (5.14)
Parameter g_0 may need to be adjusted for different systems. For example, the value of g_0 could differ between telephone and desktop systems. Parameter g_m is relatively easy to determine.
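Eqs. (5.12)-(5.14) can be sketched as follows. This is an illustration, not the book's implementation: the energy stream is buffered as a list for brevity (a real system would see one frame at a time plus the 2W-frame look-ahead), and the default constants follow the values given in Sec. 5.3.4.

```python
def normalize_energy(g, beginnings, g0=80.0, g_m=60.0, W=12):
    """Sequential energy normalization using detected beginning points.

    g          : frame energies in dB
    beginnings : frame indices of detected beginning points
    Returns g~(t) = g(t) - g^_max(t), where g^_max starts at g0 and is
    refined from the look-ahead window per Eqs. (5.12)-(5.14).
    """
    g_hat = g0                                     # g^_max(t) = g0 for t < M
    first = beginnings[0] if beginnings else None  # M, first beginning point
    out = []
    for t, e in enumerate(g):
        if first is not None and t == first:
            window = g[t:t + 2 * W + 1]            # look-ahead from M to N
            if sum(window) / len(window) >= g_m:   # Eq. (5.12): not a click
                g_hat = max(window)                # Eq. (5.13)
        elif first is not None and t > first:
            ahead = g[min(t + 2 * W, len(g) - 1)]  # newest look-ahead frame
            g_hat = max(ahead, g_hat)              # Eq. (5.14)
        out.append(e - g_hat)                      # g~(t) = g(t) - g^_max(t)
    return out

# 30 frames of 40 dB silence, 30 frames of 75 dB speech, 30 frames of silence.
g = [40.0] * 30 + [75.0] * 30 + [40.0] * 30
norm = normalize_energy(g, beginnings=[30])
# Before the beginning point the constant g0 is used; from the beginning
# point onward the speech peak is mapped to (approximately) zero.
```

Note how the estimate only improves over time: once a higher energy enters the look-ahead window, g^_max never decreases, matching the max-update of Eq. (5.14).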
For the example in Fig. 5.5, the energy features of two utterances with 20 dB SNR (bottom) and 5 dB SNR (top) are plotted in Fig. 5.5 (A). The 5 dB utterance was generated by artificially adding car noise to the 20 dB one. The filter outputs are shown in Fig. 5.5 (B) for the 20 dB (solid line) and 5 dB (dashed line) SNRs, respectively. The detected endpoints and normalized energy for the 20 and 5 dB SNRs are plotted in Fig. 5.5 (C) and Fig. 5.5 (D), respectively. We note that the filter outputs for the 20 and 5 dB cases are almost invariant around T_L and T_U, although their background energy levels differ by 15 dB. This ensures robustness in endpoint detection. We also note that the normalized energy profiles are almost the same as the originals, although the normalization is done in real-time mode.
Fig. 5.5. (A) Energy contours of “4327631Z214” from original utterance (bottom, 20 dB SNR) and after adding car noise (top, 5 dB SNR). (B) Filter outputs for 5 dB (dashed line) and 20 dB (solid line) SNR cases. (C) Detected endpoints and normalized energy for the 20 dB SNR case, and (D) for the 5 dB SNR case.
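Tying Sections 5.3.1-5.3.3 together, the three-state decision logic of Fig. 5.3 can be sketched as a small state machine over the filter output F(t). This is a simplified illustration of my reading of the diagram: the ending frame is reported as t - Gap, and refinements such as the ten-frame back-up for unvoiced beginnings are omitted.

```python
def detect_endpoints(F, T_U=3.6, T_L=-3.0, gap=30):
    """Three-state decision logic (silence / in speech / leaving speech)
    applied to the filter output F; threshold defaults follow Sec. 5.3.4.
    Returns a list of (beginning, ending) frame pairs."""
    state, count, begin, segments = "silence", 0, None, []
    for t, v in enumerate(F):
        if state == "silence":
            if v > T_U:                 # beginning edge: output a point
                begin, state = t, "in_speech"
        elif state == "in_speech":
            if v < T_L:                 # possible ending edge
                count, state = 0, "leaving_speech"
        else:                           # leaving speech
            count += 1
            if v > T_U:                 # a new beginning edge follows
                state = "in_speech"
            elif v < T_L:               # another ending edge: restart gap
                count = 0
            elif count > gap:           # quiet for Gap frames: end of speech
                segments.append((begin, t - gap))
                state = "silence"
    if state != "silence" and begin is not None:
        segments.append((begin, len(F) - 1))    # utterance ended in speech
    return segments

# Synthetic filter output: a positive pulse (beginning edge) and,
# 60 frames later, a negative pulse (ending edge).
F = [0.0] * 40 + [5.0] * 3 + [0.0] * 60 + [-5.0] * 3 + [0.0] * 40
print(detect_endpoints(F))
```

The counter-reset behavior mirrors the B-to-B' episode in Fig. 5.4: repeated dips below T_L keep restarting the gap, so a pause shorter than Gap frames never splits a speech segment.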
5.3.4 Database Evaluation

The introduced real-time algorithm was compared with a baseline endpoint detection algorithm on one noisy database and several telephone databases.
Baseline Endpoint Detection

The baseline system is a real-time, energy-contour-based adaptive detector developed from the algorithm introduced in [21, 26]. It is used in research and commercial speech recognizers. In the baseline system, a six-state decision diagram is used to detect endpoints. The states are named the initializing, silence, rising, energy, fell-rising, and fell states. In total, eight counters and 24 hard-limit thresholds are used for the state-transition decisions, and two adaptive threshold values are used in most of the thresholds. We note that all the thresholds are compared with raw energy values directly. Energy normalization in the baseline system is done separately, by estimating the maximal and minimal energy values and then comparing their difference to a fixed threshold for the decision. Since the energy values change with the acoustic environment, the baseline approach leads to unreliable endpoint detection and energy normalization, especially in low-SNR and non-stationary environments.

Noisy Database Evaluation

In this experiment, a database was first recorded from a desktop computer at a 16 kHz sampling rate, then down-sampled to 8 kHz. Later, car and other background noises were artificially added to the original database at SNR levels of 5, 10, 15, and 20 dB. The original database has 39 utterances and 1738 digits in total. Each utterance has 3, 7, or 11 digits. Linear predictive coding (LPC) features and the short-term energy were used, and hidden Markov models (HMMs) in a head-body-tail (HBT) structure were employed to model each of the digits [9, 14]. The HBT structure assumes that context-dependent digit models can be built by concatenating a left-context-dependent unit (head) with a context-independent unit (body) followed by a right-context-dependent unit (tail). We used three HMM states to represent each "head" and "tail", and four HMM states to represent each "body".
Sixteen mixtures were used for each body state, and four mixtures for each head or tail state. The real-time recognition performance at various SNRs is shown in Fig. 5.6. Compared to the baseline algorithm, the introduced real-time algorithm significantly reduced word error rates. The baseline algorithm failed to work in the low-SNR cases because it uses raw energy values directly to detect endpoints and to perform energy normalization. The real-time algorithm makes decisions on the filter output instead of raw energy values and therefore provides more robust results. An example of error analysis is shown in Fig. 5.7.

Telephone Database Evaluation

The introduced real-time algorithm was further evaluated on 11 databases collected from telephone networks with 8 kHz sampling rates in various
Fig. 5.6. Comparison of real-time connected digit recognition at various signal-to-noise ratios (SNRs). From 5 to 20 dB SNR, the introduced real-time algorithm provided word error rate reductions of 90.2%, 93.4%, 57.1%, and 57.1%, respectively.
acoustic environments. LPC parameters and the short-term energy were used. The acoustic model consists of one silence model, 41 monophone models, and 275 head-body-tail units for digit recognition. It has a total of 79 phoneme symbols, 33 of which are for digit units. Eleven databases, DB1 to DB11, were used for the evaluation. DB1 to DB5 contain digit, alphabet, and word strings; finite-state grammars were used to specify the valid forms of recognized strings. DB6 to DB11 contain pure digit strings. In all the evaluations, both endpoint detection and energy normalization were performed in real-time mode, and only the detected speech portions of an utterance were sent to the recognition back-end. In the real-time endpoint detection system, we set the parameters to g_0 = 80.0, g_m = 60.0, T_U = 3.6, T_L = -3.0, and Gap = 30. These parameters were kept unchanged throughout the evaluation on all 11 databases to show the robustness of the algorithm, although they can be adjusted to the signal conditions of different applications. The evaluation results are listed in Table 5.1. They show that the real-time algorithm works very well on regular telephone data as well: it provides word-error reduction in most of the databases, and the word-error reductions even exceed 30% in DB2, DB6, and DB9. To analyze the improvement, the original energy feature of an utterance, "1 Z 4 O 5 8 2", in DB6 is plotted in Fig. 5.7 (A). The detected endpoints and normalized energy using the conventional approach are shown in Fig. 5.7 (B), while the results of the real-time algorithm are shown in Fig. 5.7 (C). The filter output is plotted in Fig. 5.7 (D). From Fig. 5.7 (B), we can observe that
5 Robust Endpoint Detection

Table 5.1. Database Evaluation Results (%)

Database ID (strings, words)   Baseline WER   Proposed WER   Word-Error Reduction
DB1  (232, 1393)               13.7           11.8           13.9
DB2  (671, 1341)               14.6            7.9           45.9
DB3  (1957, 1957)               4.5            4.4            2.2
DB4  (272, 1379)               10.0            9.6            4.0
DB5  (259, 2632)               15.8           15.7            0.6
DB6  (576, 1738)                2.8            1.1           60.7
DB7  (583, 1743)                1.7            1.5           11.8
DB8  (664, 2087)                0.9            0.7           22.2
DB9  (619, 8194)                1.0            0.7           30.0
DB10 (651, 8452)                5.7            5.6            1.8
DB11 (707, 9426)                1.6            1.4           12.5
the normalized maximal energy of the conventional approach is about 10 dB below zero, which caused an incorrect recognition result: "1 Z 4 O 5 8". On the other hand, the introduced algorithm normalized the maximal energy to approximately zero, and the utterance was recognized correctly as "1 Z 4 O 5 8 2".
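In batch form, this zero-peak normalization is a one-liner: subtract the maximum of the log-energy contour so the peak sits at 0 dB. The sketch below is a hypothetical batch illustration only; the chapter's system must instead estimate the maximum causally, in real time.

```python
import numpy as np

def normalize_energy(log_energy_db):
    """Shift a log-energy contour so its maximum sits at 0 dB.

    Batch mode: the true maximum is known.  The real-time system
    described in this chapter must track a causal estimate of the
    maximum and update the normalization as new frames arrive.
    """
    log_energy_db = np.asarray(log_energy_db, dtype=float)
    return log_energy_db - log_energy_db.max()

# A contour whose peak is 10 dB below zero, as in Fig. 5.7 (B):
e = np.array([-40.0, -25.0, -10.0, -12.0, -35.0])
shifted = normalize_energy(e)   # peak moves from -10 dB to 0 dB
```

With the peak pinned at 0 dB, a fixed recognition threshold behaves consistently across utterances of different overall levels.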
5.4 Conclusions

In this chapter, we presented a robust, real-time endpoint detection algorithm. In the algorithm, a filter with a 24-frame look-ahead detects all possible endpoints. A three-state transition diagram then evaluates the output from the filter for final decisions. The detected endpoints are then applied to real-time energy normalization. Since the entire algorithm uses only a one-dimensional energy feature, it has low complexity and is computationally very fast. The evaluation on a noisy database showed significant string error reduction, over 50% in all 5 to 20 dB SNR conditions. The evaluations on telephone databases showed over 30% reductions in 4 out of 12 databases. The algorithm has been implemented in real-time ASR systems. Its contributions are not only improved recognition accuracy but also improved robustness of the entire system in low signal-to-noise ratio environments. The presented algorithm can be applied to both speaker recognition and speech recognition. Since the presented endpoint detection algorithm is fast and has very low computational complexity, it can be used directly in communication and speech processing applications. For example, it can be implemented in embedded systems, such as wireless phones or portable devices, to save cost and speed up processing. Another example is a web or computer server supporting multiple users, such as a speaker verification server for millions of users. Such a server normally requires low computational complexity to reduce cost and increase
Fig. 5.7. (A) Energy contour of the 523rd utterance in DB5: "1 Z 4 O 5 8 2". (B) Endpoints and normalized energy from the baseline system; the utterance was recognized as "1 Z 4 O 5 8". (C) Endpoints and normalized energy from the real-time endpoint-detection system; the utterance was recognized correctly as "1 Z 4 O 5 8 2". (D) The filter output.
response speed. For these cases, a solution is to use the above endpoint detector to remove all silence; therefore, we can significantly reduce the number of frames for decoding or recognition.
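The pipeline summarized in the conclusions — a 1-D energy feature, an edge-detecting filter with look-ahead, and a three-state decision diagram — can be sketched roughly as follows. Beyond the parameter names TU, TL, and Gap, everything here is an illustrative assumption: the difference-of-means filter stands in for the book's 24-frame look-ahead filter, and the particular state transitions are a schematic reconstruction, not the published algorithm.

```python
import numpy as np

def edge_filter(energy_db, w=5):
    """Difference-of-means filter over the 1-D energy contour.

    Positive output at energy rises (speech onsets), negative at falls
    (offsets).  A schematic stand-in for the book's look-ahead filter;
    the window length w is a hypothetical choice.
    """
    e = np.asarray(energy_db, dtype=float)
    out = np.zeros_like(e)
    for t in range(len(e)):
        left = e[max(0, t - w):t + 1]
        right = e[t + 1:t + 1 + w]
        if len(right):
            out[t] = right.mean() - left.mean()
    return out

def detect_endpoints(energy_db, TU=3.6, TL=-3.0, gap=30):
    """Three-state logic: 0 = silence, 1 = in speech, 2 = possible end.

    An end is confirmed only if the filter output stays low for `gap`
    frames; threshold names follow the chapter (TU, TL, Gap) but the
    transitions are an illustrative reconstruction.
    """
    f = edge_filter(energy_db)
    state, start, segments = 0, None, []
    low_count, tentative_end = 0, None
    for t, v in enumerate(f):
        if state == 0 and v >= TU:            # rising edge: speech begins
            state, start = 1, t
        elif state == 1 and v <= TL:          # falling edge: maybe an end
            state, tentative_end, low_count = 2, t, 0
        elif state == 2:
            if v >= TU:                        # speech resumed: cancel end
                state = 1
            else:
                low_count += 1
                if low_count >= gap:           # end confirmed after the gap
                    segments.append((start, tentative_end))
                    state = 0
    if state == 1:                             # utterance ended in speech
        segments.append((start, len(f) - 1))
    elif state == 2:
        segments.append((start, tentative_end))
    return segments
```

Only the frames inside the returned segments would then be sent to the recognition back end.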
References

1. Atal, B. S., "Automatic recognition of speakers from their voices," Proceedings of the IEEE, vol. 64, pp. 460–475, 1976.
2. Atal, B. S., "Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification," Journal of the Acoustical Society of America, vol. 55, pp. 1304–1312, 1974.
3. Bansal, R. K. and Papantoni-Kazakos, P., "An algorithm for detecting a change in stochastic process," IEEE Trans. Information Theory, vol. IT-32, pp. 227–235, March 1986.
4. Brodsky, B. and Darkhovsky, B. S., Nonparametric Methods in Change-Point Problems. Boston: Kluwer Academic, 1993.
5. Bullington, K. and Fraser, J. M., "Engineering aspects of TASI," Bell Syst. Tech. J., pp. 353–364, March 1959.
6. Canny, J., "A computational approach to edge detection," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. PAMI-8, pp. 679–698, Nov. 1986.
7. Carlstein, E., Müller, H.-G., and Siegmund, D., Change-Point Problems. Hayward, CA: Institute of Mathematical Statistics, 1994.
8. Chengalvarayan, R., "Robust energy normalization using speech/non-speech discriminator for German connected digit recognition," in Proceedings of Eurospeech'99, (Budapest), pp. 61–64, Sept. 1999.
9. Chou, W., Lee, C.-H., and Juang, B.-H., "Minimum error rate training of inter-word context dependent acoustic model units in speech recognition," in Proceedings of Int. Conf. on Spoken Language Processing, pp. 432–439, 1994.
10. Furui, S., "Cepstral analysis techniques for automatic speaker verification," IEEE Trans. Acoust., Speech, Signal Processing, vol. 27, pp. 254–277, April 1981.
11. Haigh, J. A. and Mason, J. S., "Robust voice activity detection using cepstral features," in Proceedings of IEEE TENCON, (China), pp. 321–324, 1993.
12. Junqua, J. C., Reaves, B., and Mak, B., "A study of endpoint detection algorithms in adverse conditions: Incidence on a DTW and HMM recognizer," in Proceedings of Eurospeech, pp. 1371–1374, 1991.
13. Lamel, L. F., Rabiner, L. R., Rosenberg, A. E., and Wilpon, J. G., "An improved endpoint detector for isolated word recognition," IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. ASSP-29, pp. 777–785, August 1981.
14. Lee, C.-H., Giachin, E., Rabiner, L. R., Pieraccini, R., and Rosenberg, A. E., "Improved acoustic modeling for large vocabulary speech recognition," Computer Speech and Language, vol. 6, pp. 103–127, 1992.
15. Li, Q., "A detection approach to search-space reduction for HMM state alignment in speaker verification," IEEE Trans. on Speech and Audio Processing, vol. 9, pp. 569–578, July 2001.
16. Li, Q. and Tsai, A., "A language-independent personal voice controller with embedded speaker verification," in Eurospeech'99, (Budapest, Hungary), Sept. 1999.
17. Li, Q. and Tsai, A., "A matched filter approach to endpoint detection for robust speaker verification," in Proceedings of IEEE Workshop on Automatic Identification, (Summit, NJ), Oct. 1999.
18. Li, Q., Zheng, J., Tsai, A., and Zhou, Q., "Robust endpoint detection and energy normalization for real-time speech and speaker recognition," IEEE Trans. on Speech and Audio Processing, vol. 10, pp. 146–157, March 2002.
19. Li, Q., Zheng, J., Zhou, Q., and Lee, C.-H., "A robust, real-time endpoint detector with energy normalization for ASR in adverse environments," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, (Salt Lake City), May 2001.
20. Petrou, M. and Kittler, J., "Optimal edge detectors for ramp edges," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 13, pp. 483–491, May 1991.
21. Rabiner, L. and Juang, B.-H., Fundamentals of Speech Recognition. Englewood Cliffs, NJ: PTR Prentice Hall, 1993.
22. Rabiner, L. R. and Sambur, M. R., "An algorithm for determining the endpoints of isolated utterances," The Bell System Technical Journal, vol. 54, pp. 297–315, Feb. 1975.
23. Spacek, L. A., "Edge detection and motion detection," Image Vision Comput., vol. 4, p. 43, 1986.
24. Tanyer, S. G. and Özer, H., "Voice activity detection in nonstationary noise," IEEE Trans. on Speech and Audio Processing, vol. 8, pp. 478–482, July 2000.
25. Wald, A., Sequential Analysis. NY: Chapman & Hall, 1947.
26. Wilpon, J. G., Rabiner, L. R., and Martin, T., "An improved word-detection algorithm for telephone-quality speech incorporating both syntactic and semantic constraints," AT&T Bell Laboratories Technical Journal, vol. 63, pp. 479–498, March 1984.
Chapter 6
Detection-Based Decoder
Decoding, or searching, is an important task in both speaker and speech recognition. In speaker verification (SV), given a spoken password and a speaker-dependent hidden Markov model (HMM), the task of decoding or searching is to find optimal state alignments in the sense of the maximum likelihood score of the entire utterance. Currently, the most popular decoding algorithm is the Viterbi algorithm with a predefined beam width to reduce the search space; however, it is difficult to determine a suitable beam width beforehand. A small beam width may miss the optimal path, while a large one may slow down the process. To address this problem, the author has developed a non-heuristic algorithm to reduce the search space [12, 14]. The details are presented in this chapter. Following the definition of the left-to-right HMM, we first detect the possible change-points between HMM states in a forward-and-backward scheme, then use the change-points to enclose a subspace for searching. The Viterbi algorithm, or any other search algorithm, can then be applied to the subspace to find the optimal state alignment. In SV tasks, compared to a full-search algorithm, the proposed algorithm is about four times faster, while the accuracy is still slightly better; compared to the beam-search algorithm, the search-space reduction algorithm provides better accuracy with even lower complexity. In short, for an HMM with S states, the computational complexity can be reduced by up to a factor of S/3 with slightly better accuracy than in a full-search approach. The applications of the search-space reduction algorithm are significant. As wireless phones and portable devices become more and more popular, SV is needed for portable devices where computational resources and power supply are limited. Simply stated, a fast algorithm equates to longer battery life.
On the other hand, a network or web server may need to support millions of users over telephone lines, wireless channels, and network computers when using SV for access control. In both cases, a fast decoding algorithm is necessary to achieve robust, accurate, and fast speaker authentication while
Q. Li, Speaker Authentication, Signals and Communication Technology, DOI: 10.1007/978-3-642-23731-7_6, © Springer-Verlag Berlin Heidelberg 2012
at the same time using limited computational resources and power in order to minimize system and hardware cost and extend the battery life.
6.1 Introduction

The hidden Markov model (HMM) has been widely used in speech and speaker recognition, where the nonstationary speech signal is represented as a sequence of states. In automatic speech recognition (ASR), given an utterance and a set of HMM's, a decoding algorithm is needed to search for the optimal state and word path such that the overall likelihood score of the utterance is maximal. In speaker verification (SV), given a spoken password and a speaker-dependent HMM, the task is to find the optimal state alignment in the sense of maximum likelihood. This is called HMM state alignment; we refer to it simply as alignment in this chapter. As the technology of SV is ready for use in real applications, a fast and accurate alignment algorithm with low complexity is needed to support both large-scale and portable applications. For example, a portable device, such as a wireless phone handset or a smart phone, usually has limited computational resources and power. A fast algorithm with low complexity will allow SV to be implemented in such a device at lower cost and lower power consumption. On the other hand, a telephone or web server for SV may need to support millions of users. A fast algorithm will allow the same hardware to support more telephone lines and reduce the cost per service. Our research on alignment can also benefit the similar decoding problem in ASR. Generally speaking, there are two basic requirements for an alignment algorithm: accuracy and speed. In the last few decades, several search algorithms have been developed based on dynamic programming [4] or heuristic search to pursue these requirements, such as the Viterbi algorithm [26], stack decoders [1], multi-pass search (e.g. [6, 20]), forward-backward search [6, 20], state-detection search [13, 14], etc., which have been applied to both speech and speaker recognition (e.g. [11, 19, 8]).
Fortunately, in ASR, language information can be applied to prune the search path and reduce the search space, such as language model pruning (or word-end pruning), language model look-ahead, etc. (see [18] and [19] for a survey). However, in alignment, since the whole model is just one word or one phrase, no word or language information can be applied to pruning. The current technique for alignment is the Viterbi algorithm with a state-level beam search. The Viterbi algorithm is optimal in the sense of maximum likelihood [26, 9]; therefore, it meets the first requirement, accuracy. However, a full Viterbi search is impractical due to the large search space. There are two major approaches to address the search speed problem. One approach changes the optimal algorithm to a near-optimal one in order to increase the alignment speed (e.g. [13]), but it may lose some accuracy. Another approach keeps the optimal alignment algorithm while trying to reduce the
search space. The most popular approach is the beam-search algorithm (e.g. [17, 19, 8]) applied at the state level. It reduces the search space by pruning the search paths with low likelihood scores using a predetermined beam width. Obviously, it improves alignment speed due to the search-space reduction, but it is difficult to determine the beam width beforehand. When the value of the beam width is too large, alignment can provide better accuracy, but it slows down the speed; when the beam width is too small, alignment is faster, but it may give poor accuracy. Therefore, we present a search-space reduction algorithm which can detect a subspace from the constraints of the left-to-right HMM states without using a beam width for alignment. As is well known, the HMM is a parametric, statistical model with a set of states characterizing the evolution of a nonstationary process in speech through a set of short-time stationary events. Within each state, the distribution of the stochastic process is usually modeled by Gaussian mixture models (GMM). Given a sequence of observations, the process moves from one state to another sequentially. Between every pair of connected states, there is a change-point. The purpose of the search-space reduction algorithm is to detect the possible change-points in a forward-and-backward scheme, then use the change-points to enclose a subspace for searching. In the case that an utterance matches the HMM's, the algorithm will not miss the optimal path; in an impostor case, the algorithm limits the search space and therefore has the potential to decrease the impostor's likelihood scores. Once a subspace is detected, a search algorithm, such as the Viterbi algorithm, can be applied to find the optimal path in the subspace. The problem of detecting a change in the characteristics of stochastic processes, random sequences, and fields is referred to as the change-point problem [5].
There are two kinds of approaches to the problem: parametric methods and nonparametric methods. The parametric method is based on full a priori information, i.e. the probabilistic model. The model is constructed from training data where the change-points (e.g. state segments) are given to guarantee statistical homogeneity. If the segmentation is not available, a nonparametric method can be applied [5] for preliminary change-point detection on each interval of homogeneity; then, a parametric method might be applied for more accurate change-point detection. In our application, since the HMM has been trained with a priori information, we only consider the parametric method of change-point detection. In order to detect change-points quickly and reliably, we need a sequential detection approach. Sequential testing was first studied by Wald [27] and is known as the sequential probability ratio test (SPRT). The SPRT was designed to decide between two simple hypotheses sequentially. Using the SPRT to detect the change-points in distributions was first proposed by Page for memoryless processes [21, 22]. Its asymptotic properties were studied by Lorden [16]. The general form of the test was proposed by Bansal [2], and Bansal and Papantoni-Kazakos [3], who also studied its asymptotic properties for stationary and ergodic processes under some general regularity conditions.
It has been proven that the SPRT is asymptotically optimal in the sense that it requires the minimum expected sample size for a decision, subject to a false-alarm constraint [16, 3, 10]; however, the Page algorithm needs a predetermined threshold value for a decision. This is not critical if only one change-point between two density functions needs to be determined, but, for alignment, we have to detect the changes between many different density functions, and the threshold values are usually not available. Our solution is to extend the Page algorithm to a general test which removes the need for a specific threshold [12]. The rest of the chapter is organized as follows: In Section 6.2, we discuss how to extend the Page algorithm to a sequential test which can detect the change-points in data distributions without using likelihood thresholds. In Section 6.3, we apply the change-point detection algorithm to HMM state detection. In Section 6.4, we present the search-space reduction algorithm; we then compare it with the Viterbi beam-search algorithm on an SV database in Section 6.5.
6.2 Change-Point Detection

Let o_t denote an observation of the d-dimensional feature vector at time t, and p_1(o_t) and p_2(o_t) be the d-dimensional density functions of two well-known, distinct, discrete-time, and mutually-independent stochastic processes; i.e. the two stochastic processes are not the same, and their statistical distributions are known. (See [23] for detailed definitions.) Given a sequence of observation vectors, O = {o_t; t ≥ 1}, and the density functions, p_1(o_t) and p_2(o_t), the objective is to detect a possible p_1 to p_2 change as reliably and quickly as possible. Since the change can happen at any time, Page proposed a sequential detection scheme [21, 22, 10] as follows: Given a predetermined threshold δ > 0, observe data sequentially, and decide that the change from p_1 to p_2 has occurred at the first time t such that

  T(o_t) = \sum_{i=1}^{t} R(o_i) - \min_{1 \le k \le t} \sum_{i=1}^{k} R(o_i) \ge \delta,   (6.1)

where

  R(o_i) = \log \frac{p_2(o_i)}{p_1(o_i)}.   (6.2)

When T(o_t) ≥ δ, the endpoint of p_1 is

  \hat{k} = \arg \min_{1 \le k \le t} \sum_{i=1}^{k} R(o_i).   (6.3)
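The test (6.1)–(6.3) can be run recursively, accumulating g_t = max(0, g_{t−1} + R(o_t)) and alarming when g_t reaches δ. A minimal sketch, using the closed-form log-likelihood ratio R(o) = 2o − 2 for two unit-variance Gaussians N(0, 1) and N(2, 1) as a worked example:

```python
def page_test(observations, log_ratio, delta):
    """Recursive Page (CUSUM) test for a change from p1 to p2.

    g_t = max(0, g_{t-1} + R(o_t)) equals the statistic T(o_t) of (6.1);
    an alarm is raised at the first t with g_t >= delta.  Returns
    (alarm_time, endpoint_of_p1) or (None, None) if no change is found.
    `log_ratio(o)` computes R(o) = log(p2(o)/p1(o)) as in (6.2).
    """
    g, k_hat = 0.0, 0
    for t, o in enumerate(observations, start=1):
        g = max(0.0, g + log_ratio(o))
        if g == 0.0:
            k_hat = t          # running argmin of the cumulative sum, eq. (6.3)
        if g >= delta:
            return t, k_hat
    return None, None

# p1 = N(0, 1), p2 = N(2, 1): R(o) = 2*o - 2 (standard Gaussian LLR).
obs = [0.1, -0.2, 0.0, 2.1, 1.9, 2.2, 2.0]   # change after the 3rd sample
alarm, endpoint = page_test(obs, lambda o: 2.0 * o - 2.0, delta=3.9)
```

The endpoint estimate is simply the last time the accumulator was reset to zero, which coincides with the argmin in (6.3).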
It is straightforward to implement the above test in a recursive form. As pointed out by Page [21], the above test breaks up into a repeated Wald sequential test [27] with boundaries at (0, δ) and a zero initial score. It is asymptotically optimal in the sense that it requires the minimum expected sample size for decisions, subject to a false-alarm constraint. The related theorems and proofs can be found in [16, 3, 10]; however, in many applications, it is impractical to determine the threshold value δ. For example, in speech segmentation, we may have over one thousand subword HMM's and each HMM has several states. Due to different speakers and different spoken contents, it is impractical to predetermine all of the threshold values for every possible combination of connected states or every possible speaker. To apply the sequential scheme to speech applications, we modify the above detection scheme as follows [13]: Select a time threshold t_δ > 0. Observe data sequentially, and decide that the change from p_1 to p_2 occurs if

  t - \hat{k} \ge t_\delta   (6.4)

and

  T(o_t) = \sum_{i=1}^{t} R(o_i) - \min_{1 \le k \le t} \sum_{i=1}^{k} R(o_i) > \varepsilon,   (6.5)

where ε ≥ 0 is a small number or can be just zero, and R(o_i) is defined as in (6.2). The endpoint of p_1 can be calculated by (6.3). Here, we assume that the duration of p_2 is not less than t_δ. Fig. 6.1 (a) illustrates the scheme, where t_δ as in (6.4) is a time threshold representing a time duration, and δ as in (6.1) represents a threshold value of the accumulated log-likelihood ratio. It is much easier to determine t_δ than δ in speech and speaker recognition. A common t_δ can be applied to different HMM's and different states. Generally speaking, a larger t_δ can give a more reliable change-point with less false acceptance, but it may increase false rejection, delay the decision, and cost more in computation. To avoid false rejection, we can just let t_δ = 1 as discussed in the next section.
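The modified scheme can be coded almost directly from (6.4)–(6.5). One simplification in this sketch: the running minimum also admits k = 0 (the empty partial sum), which differs negligibly from the 1 ≤ k ≤ t range of (6.5).

```python
def detect_change(observations, log_ratio, t_delta, eps=0.0):
    """Sequential change detection with a time threshold, after (6.4)-(6.5).

    Declares a p1 -> p2 change at the first t where the current index is
    at least t_delta past the running argmin k_hat of the cumulative
    log-likelihood ratio (eq. 6.4) and T(o_t) > eps (eq. 6.5).
    Returns (decision_time, endpoint_of_p1) or (None, None).
    """
    cum, min_cum, k_hat = 0.0, 0.0, 0
    for t, o in enumerate(observations, start=1):
        cum += log_ratio(o)                # sum_{i=1}^{t} R(o_i)
        if cum < min_cum:
            min_cum, k_hat = cum, t        # eq. (6.3): endpoint of p1
        T = cum - min_cum                  # the statistic of eq. (6.5)
        if t - k_hat >= t_delta and T > eps:
            return t, k_hat
    return None, None
```

With t_delta = 2 and the same Gaussian example as before (R(o) = 2o − 2), a change inserted after the third sample is declared two samples later, with the endpoint of p_1 correctly placed at sample 3.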
6.3 HMM State Change-Point Detection

We have introduced the scheme for detecting the change-point between two stochastic processes. Now, we can apply the scheme to HMM state change-point detection. Since the left-to-right HMM is the most popular HMM in speech and speaker recognition, we focus our discussion on it. Nevertheless, the change-point detection algorithm can also be extended to other HMM configurations. A left-to-right HMM without state skip is shown in Fig. 6.2. It is a Markov chain with a sequence of states which characterizes the evolution of a
Fig. 6.1. The scheme of the change-point detection algorithm with t_δ = 2: (a) the endpoint detection for state 1; (b) the endpoint detection for state 2; and (c) the grid points involved in p_1, p_2, and p_3 computations (dots).
Fig. 6.2. Left-to-right hidden Markov model.
nonstationary process in speech through a set of short-time stationary events (states). Within each state, the probability density functions (pdf's) of speech data are modeled by Gaussian mixtures. An HMM can be completely characterized by a matrix of state-transition probabilities, A = {a_{i,j}}; observation densities, B = {b_j}; and initial state probabilities, Π = {π_i}, as

  λ = {A, B, Π} = {a_{i,j}, b_j, π_i; i, j = 1, ..., S},   (6.6)
where S is the total number of states. Given an observation vector o_t, the continuous observation density for state j is

  b_j(o_t) = \sum_{m=1}^{M} c_{jm} N(o_t, \mu_{jm}, \Sigma_{jm}),   (6.7)

where M is the total number of Gaussian components N(·), and c_{jm}, μ_{jm}, and Σ_{jm} are the mixture coefficient, mean vector, and covariance matrix of the mth mixture at state j, respectively. As we presented in [13], detecting the change-point between states is similar to detecting the change-point between two data distributions. For a left-to-right HMM, it can be implemented by repeating the following procedure until obtaining the last change-point, between states S − 1 and S: Select a time threshold t_δ > 0, observe data sequentially at time t, and decide that the change from state s to s + 1 occurs if

  t - \hat{k}_s \ge t_\delta

and

  T(o_t) = \sum_{i=\hat{k}_{s-1}+1}^{t} R(o_i) - \min_{\hat{k}_{s-1} < k \le t} \sum_{i=\hat{k}_{s-1}+1}^{k} R(o_i) > \varepsilon,   (6.8)

where \hat{k}_{s-1} is the previously detected change-point between states s − 1 and s, and \hat{k}_s is the candidate change-point between states s and s + 1 given by the argmin, as in (6.3).
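The per-state density (6.7) is a plain Gaussian mixture; its value for two neighboring states feeds the log-likelihood ratio R(o_i) of the change-point test. A minimal sketch assuming diagonal covariances (the chapter does not require Σ_{jm} to be diagonal; that is a simplification here):

```python
import numpy as np

def gmm_density(o, weights, means, covs):
    """Observation density b_j(o) of eq. (6.7): a Gaussian mixture.

    `weights`, `means`, `covs` hold c_{jm}, mu_{jm} and the DIAGONAL of
    Sigma_{jm} for the M mixtures of one state; diagonal covariances
    are an illustrative simplification.
    """
    o = np.asarray(o, dtype=float)
    total = 0.0
    for c, mu, var in zip(weights, means, covs):
        mu, var = np.asarray(mu, float), np.asarray(var, float)
        d = len(mu)
        norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.prod(var))
        expo = -0.5 * np.sum((o - mu) ** 2 / var)
        total += c * norm * np.exp(expo)
    return total
```

For state change-point detection, R(o_i) would then be computed as log(gmm_density for state s + 1) minus log(gmm_density for state s).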
where α > 0 and β > 0, u(t) is the unit step function, u(t) = 1 for t ≥ 0 and 0 otherwise. The value of θ should be selected such that (7.1) is satisfied. b is the time-shift variable, and a is the scale variable. The value of a can be determined by the current filter central frequency f_c and the lowest central frequency f_L in the cochlear filter bank:
7 Auditory-Based Time Frequency Transform
  a = f_L / f_c.   (7.10)
Since we contract ψ_{a,b}(t) from its lowest-frequency representation along the time axis, the value of a is in the range of 0 < a ≤ 1. If we stretch ψ, the value of a > 1. The frequency distribution of the cochlear filters and f_c can follow linear or nonlinear scales such as ERB (equivalent rectangular bandwidth) [36], Bark [58], Mel [6], log, etc. Note that the value of a needs to be precalculated for the required central frequency of each cochlear filter. Fig. 7.6 shows the impulse responses of 5 cochlear filters and Fig. 7.7 (A) their corresponding frequency responses. We note that the impulse responses of the BM in the AT are very similar to the results reported in hearing research, such as the figures in [48], [21], [37] (Fig. 1.12), [53], etc. Normally, we use α = 3. The value of β controls the filter bandwidth, i.e. the Q-factor. We used β = 0.2 for noise reduction and β = 0.035, or around that value, for feature extraction [25], where higher frequency resolution is needed. In most applications, parameters α and β may need to be determined by experiments. A study of the relation between β and speaker recognition is given in [24]. The author derived the function in (7.9) directly from psychoacoustic experiment results, such as the impulse responses plotted in [48, 53]. In fact, Eq. (7.9) as defined by the author is different from the Gammatone function: in the standard Gammatone function, the Q-factor is fixed, while in our definition, the Q-factor can be adjusted by changing parameter β. Further comparison is given in Section 7.6. Using the speech data in Fig. 7.1 as input, the AT can output a bank of decomposed traveling waves in different frequency bands as shown in Fig. 7.8. An enlarged plot, expanding the time frame from 0.314 to 0.324 seconds, is shown in Fig. 7.9. We note that, unlike the FFT, all the numbers in the AT are real, which is a significant advantage in real-time applications.
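For a concrete filter bank, the scale factors of (7.10) follow once the center frequencies are placed on a chosen scale. A sketch using ERB-rate spacing (the Glasberg–Moore formula), one of the scale options the text lists:

```python
import numpy as np

def erb_scale_factors(f_low, f_high, n_filters):
    """Center frequencies and scale factors a_i = f_L / f_ci, eq. (7.10).

    Center frequencies are spaced uniformly on the ERB-rate scale
    (Glasberg-Moore): E(f) = 21.4 * log10(4.37 * f / 1000 + 1).
    ERB is only one of the scale choices the text allows (linear, Bark,
    Mel, and log are equally valid).
    """
    erb = lambda f: 21.4 * np.log10(4.37 * f / 1000.0 + 1.0)
    erb_inv = lambda e: (10.0 ** (e / 21.4) - 1.0) * 1000.0 / 4.37
    fc = erb_inv(np.linspace(erb(f_low), erb(f_high), n_filters))
    return fc, f_low / fc      # 0 < a_i <= 1, since f_ci >= f_L

fc, a = erb_scale_factors(80.0, 5000.0, 32)   # e.g. a 32-band bank, 80 Hz-5 kHz
```

Because a_i shrinks as f_ci grows, the high-frequency kernels are strongly contracted copies of the base kernel, exactly as the contraction argument above describes.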
From the output of the forward AT, we can construct a spectrogram similar to the FFT spectrogram. There are many ways to compute and plot an AT spectrogram. One way, which is similar to the cochlea, is first to apply a rectifier function to the output of the AT. This removes the negative part of the AT output. Then, we can use a shifting window and compute the energy of each band in the window. Fig. 7.10 shows the spectrograms computed from the AT using the same speech data as in Fig. 7.2. The window size is 30 ms, and the window shifts every 20 ms with 10 ms overlap. For different applications, the spectrogram computation can be different. The spectrum of the AT at the 1.15 second time frame is shown in Fig. 7.11. We will compare it with the FFT spectrum in Section 7.6.
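The rectify-then-window-energy spectrogram described above can be sketched directly; the 30 ms window and 20 ms shift follow the text, while the array layout is an assumption:

```python
import numpy as np

def at_spectrogram(bands, fs, win_ms=30.0, shift_ms=20.0):
    """Spectrogram from AT band outputs, as described in the text.

    Each band is half-wave rectified (the negative part removed), then
    the energy in a sliding window (30 ms long, shifted every 20 ms,
    i.e. 10 ms overlap) is accumulated per band.  `bands` is assumed to
    be an array of shape (n_bands, n_samples).
    """
    bands = np.asarray(bands, dtype=float)
    rect = np.maximum(bands, 0.0)              # rectifier, as in the cochlea
    win = int(fs * win_ms / 1000.0)
    shift = int(fs * shift_ms / 1000.0)
    n_frames = max(0, 1 + (rect.shape[1] - win) // shift)
    spec = np.empty((rect.shape[0], n_frames))
    for j in range(n_frames):
        seg = rect[:, j * shift: j * shift + win]
        spec[:, j] = np.sum(seg ** 2, axis=1)  # per-band window energy
    return spec
```

Plotting spec in dB against band index (e.g. Bark rate) and frame time gives displays like Fig. 7.10.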
7.3 The Inverse Auditory Transform

Just as the Fourier transform requires an inverse transform, a similar inverse transform is also needed for the auditory transform. The need arises when
Fig. 7.6. Impulse responses of the BM in the AT when α = 3 and β = 0.2 (filter center frequencies: 507, 1000, 2014, 3046, and 4050 Hz). They are very similar to the results reported in hearing research.
the processed frequency-decomposed signals need to be converted back to real signals, such as in the applications of speech and music synthesis or noise reduction. If we can prove the inverse transform, it also means that no information is lost through the proposed forward transform. This is important when we use the transform for feature extraction and other applications where we must make sure that the transform does not lose any information. If (7.3) is satisfied, the inverse transform exists:

  f(t) = \frac{1}{C} \int_{a=0}^{\infty} \int_{b=0}^{\infty} \frac{1}{a^2} T(a, b) \psi_{a,b}(t) \, da \, db.   (7.11)

The derivation of the above transform is similar to that of the inverse continuous wavelet transform, such as in [41] and others. Equation (7.6) can be written in the form of a convolution with f(b):

  T(a, b) = f(b) * \psi_{a,0}(-b).   (7.12)

Taking the Fourier transform of both sides, the convolution becomes multiplication:

  \int_{b=-\infty}^{\infty} T(a, b) e^{-j\omega b} \, db = \sqrt{a} \, F(\omega) \Psi^{*}(a\omega),   (7.13)
Fig. 7.7. The frequency responses of the cochlear filters when α = 3: (A) β = 0.2; and (B) β = 0.035.
where F(ω) and Ψ(ω) represent the Fourier transforms of f(t) and ψ(t), respectively. We now multiply both sides of the above equation by Ψ(aω)/a^{3/2} and integrate over a:

  \int_{a=-\infty}^{\infty} \int_{b=-\infty}^{\infty} \frac{1}{a^{3/2}} T(a, b) \Psi(a\omega) e^{-j\omega b} \, da \, db = F(\omega) \int_{a=-\infty}^{\infty} \frac{|\Psi(a\omega)|^{2}}{a} \, da.   (7.14)

The integral on the right-hand side can be further written as

  \int_{a=-\infty}^{\infty} \frac{|\Psi(a\omega)|^{2}}{a} \, da = \int_{\omega=-\infty}^{\infty} \frac{|\Psi(\omega)|^{2}}{\omega} \, d\omega = C,   (7.15)

which meets the admissibility condition in (7.3), where C is a constant. Rearranging (7.14), we then have

  F(\omega) = \frac{1}{C} \int_{a=-\infty}^{\infty} \int_{b=-\infty}^{\infty} \frac{1}{a^{3/2}} T(a, b) \Psi(a\omega) e^{-j\omega b} \, da \, db.   (7.16)

We can now take the inverse Fourier transform of both sides of the above equation to obtain (7.11). The above procedure is similar to the derivation of the inverse wavelet transform.
Fig. 7.8. The traveling wave generated by the auditory transform from the speech data in Fig. 7.1.
7.4 The Discrete-Time and Fast Transform

In practical applications, a discrete-time auditory transform is necessary. The forward discrete auditory transform can be written as

  T[a_i, b] = \sum_{n=0}^{N} f[n] \frac{1}{\sqrt{|a_i|}} \psi\left( \frac{b-n}{a_i} \right),   (7.17)

where a_i = f_L / f_{c_i} is the scaling factor for the ith frequency band f_{c_i} and N is the length of the signal f[n]. The scaling factor a_i can follow a linear or nonlinear scale; for the discrete transform, a_i can also be in ERB, Bark, log, or other nonlinear scales. We derived the corresponding discrete-time inverse transform:

  \tilde{f}[n] = \frac{1}{C} \sum_{a_i=a_1}^{a_k} \sum_{b=1}^{N} \frac{1}{\sqrt{|a_i|}} T[a_i, b] \, \psi\left( \frac{b-n}{a_i} \right),   (7.18)

where 0 ≤ n ≤ N, a_1 ≤ a_i ≤ a_k, and 1 ≤ b ≤ N. We note that \tilde{f}[n] approximates f[n] given the limited number of decomposed frequency bands. The above formulas have been verified by the following experiments. Also note that (7.18) can be applied to compute the inverse continuous WT as well, where just ψ needs to be replaced.
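A direct implementation of the forward transform (7.17) is one convolution per band. Since the kernel definition (7.9) is given earlier in the chapter and not reproduced in this section, the sketch below substitutes a generic Gammatone-style shape with the chapter's α and β parameters — an explicit assumption, not the book's exact ψ — and folds the dilation a_i = f_L/f_{c_i} into the per-band center frequency:

```python
import numpy as np

def gammatone_kernel(fc, fs, alpha=3.0, beta=0.2, max_len=4096):
    """One cochlear-filter kernel, sampled at rate fs.

    ASSUMPTION: a stand-in Gammatone-style shape
        psi(t) = t**(alpha-1) * exp(-2*pi*beta*fc*t) * cos(2*pi*fc*t), t >= 0
    with the dilation a_i = f_L/f_ci of (7.10) absorbed into fc.
    Kernels are truncated where the envelope falls below 1e-4 of its
    peak and normalized to unit energy (a convenience, not part of 7.17).
    """
    t = np.arange(max_len) / fs
    env = t ** (alpha - 1.0) * np.exp(-2 * np.pi * beta * fc * t)
    k = env * np.cos(2 * np.pi * fc * t)
    last = np.nonzero(env > 1e-4 * env.max())[0][-1]
    k = k[: last + 1]
    return k / np.sqrt(np.sum(k ** 2))

def forward_at(signal, center_freqs, fs):
    """Forward discrete AT of (7.17): one convolution per band, trimmed
    to the input length so T has shape (n_bands, n_samples)."""
    signal = np.asarray(signal, dtype=float)
    return np.stack([np.convolve(signal, gammatone_kernel(fc, fs))[: len(signal)]
                     for fc in center_freqs])
```

A quick sanity check is to feed a pure tone through a small bank: the band whose center frequency matches the tone should carry by far the most energy.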
Fig. 7.9. A section of the traveling wave generated by the auditory transform.
Just as the Fourier transform has a fast algorithm, the FFT, a fast algorithm for the auditory transform also exists. The most computationally intensive components in (7.17) and (7.18), the convolutions of both the forward and the inverse transforms, can be implemented by the FFT and inverse FFT. Also, depending on the application, the data resolution of the lower frequency bands can be reduced to save computation. In our recent research, the AT and inverse AT have been implemented as a real-time system for audio signal processing, speaker recognition, and speech recognition.
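The FFT path for the convolutions in (7.17)–(7.18) is standard fast convolution: zero-pad, multiply spectra, invert. A minimal sketch:

```python
import numpy as np

def fft_convolve(x, h):
    """Linear convolution via the FFT, the fast path mentioned in the text.

    Zero-pads to at least len(x) + len(h) - 1 (rounded up to a power of
    two), multiplies the real-input spectra, and inverts; equivalent to
    np.convolve(x, h) up to rounding error.
    """
    n = len(x) + len(h) - 1
    n_fft = 1 << max(0, (n - 1).bit_length())   # next power of two >= n
    X = np.fft.rfft(x, n_fft)
    H = np.fft.rfft(h, n_fft)
    return np.fft.irfft(X * H, n_fft)[:n]
```

For a long signal convolved with many band kernels, this reduces the per-band cost from O(N·L) to O(N log N), which is what makes a real-time AT implementation practical.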
7.5 Experiments and Discussions

In this section, we present experimental results to compare the AT with the FFT. We also verify the inverse AT by experiments to validate the above mathematical derivations.

7.5.1 Verifying the Inverse Auditory Transform

In addition to the theoretical proof, we also validated the AT on real speech data. Fig. 7.12 (A) shows the speech waveform of a male voice speaking the words "two, zero, five," recorded with a sampling rate of 16 kHz. The inverse AT results are shown in Fig. 7.12 (B), which matches the original waveform perfectly. This result verifies the inverse AT defined in (7.11). To further verify the inverse AT, we computed the correlation coefficients, σ_{12}^2 [17], between the original speech signals as shown in Fig. 7.1 (top) and
Fig. 7.10. Spectrograms from the output of the cochlear transform for the speech data in Fig. 7.1, respectively. The spectrogram at top is from the data recorded by the close-talking microphone, while the spectrogram at bottom is from the hands-free microphone.
Fig. 7.11. The spectrum of the AT at the 1.15 second time frame from Fig. 7.10: The solid line represents the speech from a close-talking microphone. The dashed line is from a hands-free microphone mounted on the visor of a moving car. Both speech files were recorded simultaneously.
the synthesized speech signals from the inverse AT. The original signals were first decomposed into different numbers of channels using the AT and then synthesized back into speech signals using the inverse AT. The more cochlear filters used, the higher the correlation coefficients, which means the synthesized speech has less distortion. The relation is shown in Table 7.1.
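For readers who wish to reproduce this kind of check, the correlation coefficient between an original and a resynthesized signal can be computed as below; the toy signals are illustrative stand-ins, not the book's recordings:

```python
import numpy as np

def correlation_coefficient(x, y):
    """Pearson correlation between an original and a resynthesized signal."""
    x = x - np.mean(x)
    y = y - np.mean(y)
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

# Toy check: a mildly distorted copy of a tone still correlates highly,
# mimicking the behavior reported in Table 7.1 for large filter banks.
t = np.linspace(0.0, 1.0, 8000, endpoint=False)
original = np.sin(2 * np.pi * 200 * t)
resynthesized = original + 0.05 * np.random.default_rng(1).standard_normal(t.size)
rho = correlation_coefficient(original, resynthesized)
```

A value near 1.0 indicates that the resynthesized waveform is a nearly undistorted copy of the original, which is the criterion the table uses to compare filter-bank sizes.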
7 Auditory-Based Time Frequency Transform
Fig. 7.12. Comparison of speech waveforms: (A) The original waveform of a male voice speaking the words “two, zero, five.” (B) The waveform synthesized by the inverse AT with a bandwidth of 80 Hz to 5 kHz. When the filter numbers are 8, 16, 32, and 64, the correlation coefficients σ₁₂² for the two speech data sets are 0.74, 0.96, 0.99, and 0.99, respectively.

Table 7.1. Correlation Coefficients, σ₁₂², for Different Sizes of Filter Bank in AT/Inverse AT
Number of Filters in AT                 8      16     32     64     128
Clean speech as in Fig. 7.1 (top)       0.74   0.96   0.99   0.99   0.99
Noisy speech as in Fig. 7.1 (bottom)    0.76   0.94   0.97   0.97   0.97
The experimental results indicate that 32 filters are sufficient for most applications needing the inverse AT.

7.5.2 Applications

The auditory transform can be applied to any application where the FT or WT has been used, especially in audio and speech signal processing. We describe two applications briefly.

Noise Reduction: Fig. 7.13 shows an example of applying the AT to noise reduction. The original speech, as shown in Fig. 7.13 (A), was first decomposed into 32 frequency bands using (7.17). An endpoint detector was used to separate noise from speech [28]. A denoising function was then applied to each of the decomposed frequency bands. The function suppresses more on
Fig. 7.13. (A) and (B) are speech waveforms simultaneously recorded in a moving car. The microphones are located on the car visor (A) and the speaker's lapel (B), respectively. (C) is the result after noise reduction using the AT applied to the waveform in (A); the result is very similar to (B).
the noise waveforms and less on the speech waveforms. The inverse transform in (7.18) is then applied to convert the processed signals back into clean speech, as shown in Fig. 7.13 (C), which is similar to the original close-talking data in Fig. 7.13 (B).

New feature for speaker and speech recognition: Recently, based on the AT, we developed a new feature extraction algorithm for speech signal processing, named cochlear filter cepstral coefficients (CFCC). Our experiments show that the new CFCC features are significantly more robust to noise than traditional MFCC features in a speaker recognition task [25, 24]. The details are available in Chapter 8.
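The band-wise denoising scheme described above — decompose, estimate the noise level from endpoint-detected noise frames, suppress each band, then invert — can be sketched as follows. The magnitude-based gain rule here is a generic assumption for illustration, not the author's exact denoising function:

```python
import numpy as np

def denoise_bands(bands, noise_frames, floor=0.1):
    """Attenuate each decomposed band with a gain driven by its
    estimated noise level: suppress more where the envelope is near
    the noise floor, less where speech energy dominates."""
    out = []
    for band in bands:
        noise_rms = np.sqrt(np.mean(band[noise_frames] ** 2)) + 1e-12
        env = np.abs(band)
        gain = np.clip(1.0 - noise_rms / (env + 1e-12), floor, 1.0)
        out.append(gain * band)
    return out

# Toy example: two bands; the first 1000 samples are assumed noise-only
# (as an endpoint detector would report).
rng = np.random.default_rng(2)
t = np.arange(8000) / 8000.0
speech = np.sin(2 * np.pi * 300 * t) * (t > 0.25)
bands = [speech + 0.1 * rng.standard_normal(t.size),
         0.1 * rng.standard_normal(t.size)]
clean = denoise_bands(bands, noise_frames=slice(0, 1000))
```

After per-band suppression, the inverse transform (7.18) would recombine the bands into a cleaned waveform.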
7.6 Comparisons to Other Transforms

In this section, we compare the AT with the FFT and the Gammatone filter bank.
Fig. 7.14. Comparison of FT and AT spectra: (A) The FFT spectrogram of a male voice saying “2 0 5,” warped into the Bark scale from 0 to 6.4 Barks (0 to 3500 Hz). (B) The spectrogram from the cochlear filter output for the same male voice. The AT is harmonic-free and has less computational noise.
Comparison between AT and FFT

The FFT is the major tool for time-frequency transforms in speech signal processing. We use Fig. 7.14 to illustrate the differences between the spectrograms generated by the Fourier transform and by our auditory transform [22]. The original speech file was recorded from a male voice saying “2 0 5.” We then calculated the FFT spectrogram, shown in Fig. 7.14 (A), with a 30 ms Hamming window shifted every 10 ms. To facilitate the comparison, we then warped the frequency axis from the linear scale to the Bark scale using the method in [26]. The spectrogram of the AT on the same speech data is shown in Fig. 7.14 (B). It was generated from the output of the cochlear filter bank as defined in (8.5), plus a window to compute the average densities for each band. Comparing the two spectrograms in Fig. 7.14, we can observe that the spectra generated by the AT contain no pitch harmonics and less computational noise while retaining all formant information. This could be due to the variable length of the cochlear filters and the selection of parameter β in (8.5). Furthermore, we compared the spectra shown in Fig. 7.15. A male voice was recorded in a moving car using two different microphones: a close-talking microphone placed on the speaker's lapel and a hands-free microphone placed on the car visor. Fig. 7.15 is associated with a cross section
Fig. 7.15. Comparison of AT (top) and FFT (bottom) spectra at the 1.15 second time frame for robustness: The solid line represents speech from a close-talking microphone. The dashed line represents speech from a hands-free microphone mounted on the visor of a moving car. Both speech files were recorded simultaneously. Compared to the AT spectrum, the FFT spectrum has pitch harmonics and about 30 dB more distortion at low-frequency bands due to background noise.
of Fig. 7.14 at the 1.15 second mark. The solid line represents speech recorded by the close-talking microphone, while the dashed line corresponds to speech recorded by the hands-free microphone. Fig. 7.15 (top) is the spectrum from our AT [22] and Fig. 7.15 (bottom) is from the FFT. From Fig. 7.15, we can observe the following in the FFT spectrum, none of which are as significant in the AT spectrum. First, the FFT spectrum shows a 30 dB distortion at low-frequency bands due to the car background noise. Second, the FFT spectrum shows significant pitch harmonics, which are due to the fixed length of the FFT window. Last, the noise displayed as “snow” in Fig. 7.14 (A) was generated by the FFT computation. All of these may affect the performance of a feature extraction algorithm. For robust speaker identification, we need a time-frequency transform more robust than the FFT as the foundation for feature extraction. The transform should generate less distortion from background noise and less computational noise, such as pitch harmonics, while retaining the useful information. Here, the AT provides a robust solution to replace the FFT.
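For reference, the FFT side of the comparison — a 30 ms Hamming window shifted every 10 ms, with the frequency axis mapped to Bark — can be sketched as below. The Hz-to-Bark formula is a common Zwicker-style approximation and is an assumption here, not necessarily the exact warping method of [26]:

```python
import numpy as np

def fft_spectrogram(x, fs, win_ms=30, hop_ms=10):
    """Magnitude spectrogram with a fixed-length Hamming window; the
    fixed window is what produces the pitch harmonics discussed above."""
    win = int(fs * win_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    w = np.hamming(win)
    frames = [x[i:i + win] * w for i in range(0, len(x) - win + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1))

def hz_to_bark(f):
    """One common approximation of the critical-band (Bark) scale."""
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

fs = 16000
t = np.arange(fs) / fs
spec = fft_spectrogram(np.sin(2 * np.pi * 440 * t), fs)
freqs = np.fft.rfftfreq(int(fs * 0.03), 1 / fs)
barks = hz_to_bark(freqs)
```

The `barks` array gives the nonlinear position of each linear FFT bin, which is how the linear-scale spectrogram in Fig. 7.14 (A) is warped for display.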
Fig. 7.16. The Gammatone filter bank: (A) The frequency responses of the Gammatone filter bank generated by (7.19). (B) The frequency responses of the Gammatone filter bank generated by (7.19) plus an equal-loudness function.
Comparison between AT and Gammatone

In comparison, the filter bank constructed by the AT in (7.9) is different from the Gammatone filter bank. The Gammatone function is:

G_fc(t) = t^(N−1) exp[−2π b(fc) t] cos(2π fc t + φ) u(t),    (7.19)

where N is the filter order, fc is the filter center frequency, φ is the phase, u(t) is the unit step function, b(fc) = 1.019 ERB(fc) [49], and the bandwidth is locked to fc, as shown in Fig. 7.16, where (A) is plotted directly from (7.19) and (B) is after applying a loudness function to the filters plotted in (A) [49]. Comparing Fig. 7.7 and Fig. 7.16 and ignoring the filter gains, we find that in the AT the filter bandwidths can be easily adjusted by β, which can be independent of fc, while in the Gammatone function the filter bandwidth is fixed and locked to fc, i.e., the Q-factor is fixed. The flexible filter bandwidth of the AT provides greater freedom in developing applications. For different applications, we can select a different β for the best performance. As we will discuss in Chapter 8, a feature extraction algorithm developed based on the AT can achieve the best performance by adjusting the filter bandwidth through β. Compared to the FT, the AT uses real-number computations and is more robust to background noise. The frequency distribution of the AT can be in Bark, ERB, or any nonlinear scale. Also, its time-frequency resolution is
adjustable. Our experiments have shown that the AT is more robust to background noise and free of pitch harmonics. Compared to the WT, the AT filter bank is significantly closer to the impulse responses of the basilar membrane than any existing wavelets. The filter in (7.9) is different from any existing wavelet, and its frequency distribution follows an auditory-based scale. Compared to the discrete-time WT, the frequency response of the AT is not limited to the dyadic scale; it can follow any linear or nonlinear scale. The derived formula in (7.18) can also be used to compute the discrete-time inverse continuous WT.
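To make the fixed-Q property of (7.19) concrete, the following sketch generates Gammatone impulse responses. The ERB formula is the common Glasberg–Moore approximation, and the order, duration, and normalization are illustrative choices, not parameters taken from this book:

```python
import numpy as np

def erb(fc):
    """Equivalent rectangular bandwidth (Glasberg-Moore approximation)."""
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)

def gammatone(fc, fs, n_order=4, phase=0.0, dur=0.05):
    """Gammatone impulse response per (7.19). The bandwidth b(fc) is
    locked to the center frequency through the ERB, so the Q-factor is
    fixed -- in contrast to the AT, where beta tunes bandwidth freely."""
    t = np.arange(int(dur * fs)) / fs
    b = 1.019 * erb(fc)
    g = t ** (n_order - 1) * np.exp(-2.0 * np.pi * b * t) \
        * np.cos(2.0 * np.pi * fc * t + phase)
    return g / np.max(np.abs(g))

fs = 16000
bank = [gammatone(fc, fs) for fc in (200.0, 800.0, 3200.0)]
```

Because b(fc) grows with fc, higher-frequency Gammatone filters are automatically wider; in the AT, the same widening can be applied, suppressed, or retuned via β alone.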
7.7 Conclusions

A robust and invertible time-frequency transform named the auditory transform (AT) was presented in this chapter. The forward and inverse transforms were proven in theory and validated in experiments. The AT is an alternative to the FFT and WT for robust audio and speech signal processing. Inspired by the study of the human hearing system, the AT was designed to mimic the impulse responses of the basilar membrane and its nonlinear frequency distribution. The AT is an ideal solution for decomposing input signals into frequency bands for audio and speech signal processing. As demonstrated, the AT has significant advantages due to its noise robustness and its freedom from harmonic distortion and computational noise. Also, compared to the FFT, the AT uses only real-number computation, which saves computational resources in frequency-domain signal processing. These advantages can lead to many new applications, such as robust feature extraction algorithms for speech and speaker recognition, new algorithms or devices for noise reduction and denoising, speech and music synthesis, audio coding, new hearing aids, new cochlear implants, speech enhancement, and general audio signal processing. In Chapter 8, based on the AT, we will present a new feature extraction algorithm for robust speaker recognition.
References

1. Allen, J., "Cochlear modeling," IEEE ASSP Magazine, pp. 3–29, Jan. 1985.
2. Barbour, D. L. and Wang, X., "Contrast tuning in auditory cortex," Science, vol. 299, pp. 1073–1075, Feb. 2003.
3. Bruce, I., Sachs, M., and Young, E., "An auditory-periphery model of the effects of acoustic trauma on auditory nerve responses," J. Acoust. Soc. Am., vol. 113, pp. 369–388, 2003.
4. Choueiter, G. F. and Glass, J. R., "An implementation of rational wavelets and filter design for phonetic classification," IEEE Trans. on Audio, Speech, and Language Processing, vol. 15, pp. 939–948, March 2007.
5. Daubechies, I. and Maes, S., "A nonlinear squeezing of the continuous wavelet transform based on auditory nerve models," in Wavelets in Medicine and Biology (A. Aldroubi and M. Unser, eds.), CRC Press, pp. 527–546, 1996.
6. Davis, S. B. and Mermelstein, P., "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. ASSP-28, pp. 357–366, August 1980.
7. Evans, E. F., "Frequency selectivity at high signal levels of single units in cochlear nerve and cochlear nucleus," in Psychophysics and Physiology of Hearing (E. F. Evans and J. P. Wilson, eds.), London, UK: Academic Press, pp. 195–192, 1977.
8. Flanagan, J. L., Speech Analysis, Synthesis and Perception. New York: Springer-Verlag, 1972.
9. Fletcher, H., Speech and Hearing in Communication. Acoustical Society of America, 1995.
10. Furui, S., "Cepstral analysis techniques for automatic speaker verification," IEEE Trans. Acoust., Speech, Signal Processing, vol. 27, pp. 254–277, April 1981.
11. Gelfand, S. A., Hearing: An Introduction to Psychological and Physiological Acoustics, 3rd edition. New York: Marcel Dekker, 1998.
12. Ghitza, O., "Auditory models and human performance in tasks related to speech coding and speech recognition," IEEE Trans. on Speech and Audio Processing, vol. 2, pp. 115–132, January 1994.
13. Goldstein, J. L., "Modeling rapid waveform compression on the basilar membrane as multiple-bandpass-nonlinear filtering," Hearing Res., vol. 49, pp. 39–60, 1990.
14. Hermansky, H. and Morgan, N., "RASTA processing of speech," IEEE Trans. Speech and Audio Proc., vol. 2, pp. 578–589, Oct. 1994.
15. Hohmann, V., "Frequency analysis and synthesis using a Gammatone filterbank," Acta Acustica united with Acustica, vol. 88, pp. 433–442, 2002.
16. Johannesma, P. I. M., "The pre-response stimulus ensemble of neurons in the cochlear nucleus," in Proceedings of the Symposium on Hearing Theory, IPO, pp. 58–69, June 1972.
17. Johnson, R. A. and Wichern, D. W., Applied Multivariate Statistical Analysis. New Jersey: Prentice Hall, 1988.
18. Kates, J. M., "Accurate tuning curves in a cochlea model," IEEE Trans. on Speech and Audio Processing, vol. 1, pp. 453–462, Oct. 1993.
19. Kates, J. M., "A time-domain digital cochlea model," IEEE Trans. on Signal Processing, vol. 39, pp. 2573–2592, December 1991.
20. Khanna, S. M. and Leonard, D. G. B., "Basilar membrane tuning in the cat cochlea," Science, vol. 215, pp. 305–306, Jan. 1982.
21. Kiang, N. Y.-S., Discharge Patterns of Single Fibers in the Cat's Auditory Nerve. MA: MIT Press, 1965.
22. Li, Q., "An auditory-based transform for audio signal processing," in Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, NY, Oct. 2009.
23. Li, Q., "Solution for pervasive speaker recognition," SBIR Phase I Proposal, submitted to NSF IT.F4, Li Creative Technologies, Inc., NJ, June 2003.
24. Li, Q. and Huang, Y., "An auditory-based feature extraction algorithm for robust speaker identification under mismatched conditions," IEEE Trans. on Audio, Speech and Language Processing, Sept. 2011.
25. Li, Q. and Huang, Y., "Robust speaker identification using an auditory-based feature," in ICASSP 2010, 2010.
26. Li, Q., Soong, F. K., and Siohan, O., "An auditory system-based feature for robust speech recognition," in Proc. 7th European Conf. on Speech Communication and Technology, Denmark, pp. 619–622, Sept. 2001.
27. Li, Q., Soong, F. K., and Siohan, O., "A high-performance auditory feature for robust speech recognition," in Proceedings of 6th Int'l Conf. on Spoken Language Processing, Beijing, pp. III-51–54, Oct. 2000.
28. Li, Q., Zheng, J., Tsai, A., and Zhou, Q., "Robust endpoint detection and energy normalization for real-time speech and speaker recognition," IEEE Trans. on Speech and Audio Processing, vol. 10, pp. 146–157, March 2002.
29. Lin, J., Ki, W.-H., Edwards, T., and Shamma, S., "Analog VLSI implementations of auditory wavelet transforms using switched-capacitor circuits," IEEE Trans. on Circuits and Systems I: Fundamental Theory and Applications, vol. 41, pp. 572–583, Sept. 1994.
30. Liu, W., Andreou, A. G., and Goldstein, M. H., Jr., "Voiced-speech representation by an analog silicon model of the auditory periphery," IEEE Trans. on Neural Networks, vol. 3, pp. 477–487, May 1992.
31. Lyon, R. F. and Mead, C., "An analog electronic cochlea," IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. 36, pp. 1119–1134, July 1988.
32. Mak, B., Tam, Y.-C., and Li, Q., "Discriminative auditory features for robust speech recognition," IEEE Trans. on Speech and Audio Processing, vol. 12, pp. 27–36, Jan. 2004.
33. Misiti, M., Misiti, Y., Oppenheim, G., and Poggi, J.-M., Wavelet Toolbox User's Guide. MA: MathWorks, 2006.
34. Møller, A. R., "Frequency selectivity of single auditory-nerve fibers in response to broadband noise stimuli," J. Acoust. Soc. Am., vol. 62, pp. 135–142, July 1977.
35. Moore, B., Peters, R. W., and Glasberg, B. R., "Auditory filter shapes at low center frequencies," J. Acoust. Soc. Am., vol. 88, pp. 132–148, July 1990.
36. Moore, B. C. J. and Glasberg, B. R., "Suggested formula for calculating auditory-filter bandwidth and excitation patterns," J. Acoust. Soc. Am., vol. 74, pp. 750–753, 1983.
37. Moore, B. C. J., An Introduction to the Psychology of Hearing. NY: Academic Press, 1997.
38. Nedzelnitsky, V., "Sound pressures in the basal turn of the cat cochlea," J. Acoust. Soc. Am., vol. 68, pp. 1676–1680, 1980.
39. Patterson, R. D., "Auditory filter shapes derived with noise stimuli," J. Acoust. Soc. Am., vol. 59, pp. 640–654, 1976.
40. Pickles, J. O., An Introduction to the Physiology of Hearing, 2nd edition. New York: Academic Press, 1988.
41. Rao, R. and Bopardikar, A., Wavelet Transforms. MA: Addison-Wesley, 1998.
42. Sellami, L. and Newcomb, R. W., "A digital scattering model of the cochlea," IEEE Trans. on Circuits and Systems I: Fundamental Theory and Applications, vol. 44, pp. 174–180, Feb. 1997.
43. Sellick, P. M., Patuzzi, R., and Johnstone, B. M., "Measurement of basilar membrane motion in the guinea pig using the Mössbauer technique," J. Acoust. Soc. Am., vol. 72, pp. 131–141, July 1982.
44. Shaw, E. A. G., "The external ear," in Handbook of Sensory Physiology (W. D. Keidel and W. D. Neff, eds.). New York: Springer-Verlag, 1974.
45. Teich, M. C., Heneghan, C., and Khanna, S. M., "Analysis of cellular vibrations in the living cochlea using the continuous wavelet transform and the short-time Fourier transform," in Time Frequency and Wavelets in Biomedical Signal Processing (M. Akay, ed.), pp. 243–269, 1998.
46. Torrence, C. and Compo, G. P., "A practical guide to wavelet analysis," Bulletin of the American Meteorological Society, vol. 79, pp. 61–78, January 1998.
47. Volkmer, M., "Theoretical analysis of a time-frequency-PCNN auditory cortex model," International J. of Neural Systems, vol. 15, pp. 339–347, 2005.
48. von Békésy, G., Experiments in Hearing. New York: McGraw-Hill, 1960.
49. Wang, D. and Brown, G. J., "Fundamentals of computational auditory scene analysis," in Computational Auditory Scene Analysis (D. Wang and G. J. Brown, eds.). NJ: IEEE Press, 2006.
50. Wang, K. and Shamma, S. A., "Spectral shape analysis in the central auditory system," IEEE Trans. on Speech and Audio Processing, vol. 3, pp. 382–395, Sept. 1995.
51. Weintraub, M., A Theory and Computational Model of Auditory Monaural Sound Separation. PhD thesis, Stanford University, CA, August 1985.
52. Wilson, J. P. and Johnstone, J., "Basilar membrane and middle-ear vibration in guinea pig measured by capacitive probe," J. Acoust. Soc. Am., vol. 57, pp. 705–723, 1975.
53. Wilson, J. P. and Johnstone, J., "Capacitive probe measures of basilar membrane vibrations," in Hearing Theory, 1972.
54. Yost, W., Fundamentals of Hearing: An Introduction, 3rd edition. New York: Academic Press, 1994.
55. Zhou, B., "Auditory filter shapes at high frequencies," J. Acoust. Soc. Am., vol. 98, pp. 1935–1942, October 1995.
56. Zilany, M. and Bruce, I., "Modeling auditory-nerve responses for high sound pressure levels in the normal and impaired auditory periphery," J. Acoust. Soc. Am., vol. 120, pp. 1447–1466, Sept. 2006.
57. Zweig, G., Lipes, R., and Pierce, J. R., "The cochlear compromise," J. Acoust. Soc. Am., vol. 59, pp. 975–982, April 1976.
58. Zwicker, E. and Terhardt, E., "Analytical expressions for critical-band rate and critical bandwidth as a function of frequency," J. Acoust. Soc. Am., vol. 68, pp. 1523–1525, 1980.
Chapter 8
Auditory-Based Feature Extraction and Robust Speaker Identification
In the previous chapter, we introduced a robust auditory transform (AT). In this chapter, we present an auditory-based feature extraction algorithm based on the AT and apply it to robust speaker identification. Usually, the performance of acoustic models trained on clean speech drops significantly when tested on noisy speech. The presented features, however, have shown strong robustness in this kind of situation. We present a typical text-independent speaker identification system in the experiment section. Under all three mismatched testing conditions, with white noise, car noise, or babble noise, the auditory features consistently perform better than the baseline Mel frequency cepstral coefficient (MFCC) features. The auditory features are also compared with perceptual linear predictive (PLP) and RASTA-PLP features; they consistently perform much better than PLP. Under white noise, the auditory features are also much better than RASTA-PLP; under car and babble noises, the performance is similar.
8.1 Introduction

In automatic speaker authentication, feature extraction is the first crucial component. To ensure the performance of a speaker authentication system, successful front-end features should carry enough discriminative information for classification or recognition, fit well with the back-end modeling, and be robust to changes in acoustic environments. After decades of research and development, speaker authentication under various operating modes is still a challenging problem, especially when acoustic training and testing environments are mismatched. Since the human hearing system is robust to mismatched conditions, we developed an auditory-based feature extraction algorithm based on the auditory transform introduced in Chapter 7. We name the new features cochlear filter cepstral coefficients (CFCC). The auditory-based feature extraction algorithm was originally proposed by the author and Huang in [11, 10].
Q. Li, Speaker Authentication, Signals and Communication Technology, DOI: 10.1007/978-3-642-23731-7_8, © Springer-Verlag Berlin Heidelberg 2012
Speaker authentication uses the same features as speech recognition. In general, most speech feature extraction methods fall into two categories: modeling the human voice production system or modeling the peripheral auditory system. For the first approach, one of the most popular features is a group of cepstral coefficients derived from linear prediction, known as the linear prediction cepstral coefficients (LPCC) [3, 14]. LPCC feature extraction uses an all-pole filter to model the human vocal tract, with speech formants captured by the poles of the filter. The narrow-band (e.g., up to 4 kHz) LPCC features work well in a clean environment. However, in our previous experiments, the linear predictive spectral envelope showed large spectral distortion in noisy environments [13, 12], which results in significant performance degradation. For the second approach, there are two groups of features, based on either Fourier transforms (FT) or auditory-based transforms. The representative of the first group is the MFCC (Mel frequency cepstral coefficients), where the fast Fourier transform (FFT) is applied to generate the spectrum on the linear scale, and then a bank of bandpass filters is placed along a Mel frequency scale on top of the FFT output [4]. Alternatively, the FFT output is warped to a Mel or Bark scale and then a bank of bandpass filters is placed linearly on top of the warped FFT output [13, 12]. The algorithm presented in this chapter belongs to the second group, where the auditory-based transform (AT) is defined as an invertible time-frequency transform to replace the FFT. The output of the AT can be in any frequency scale (e.g., linear, Bark, ERB, etc.). Therefore, there is no need to place the bandpass filters on a Mel scale as in MFCC, or to warp the frequency distribution as in [13, 12]. In MFCC, two steps are needed to obtain the filter-bank output: the FFT and then the Mel filter bank. With the AT, only one step is needed.
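The two-step MFCC path just described — an FFT, then a bank of triangular filters placed on a Mel scale over the FFT bins — can be sketched as follows. This is a generic textbook construction for illustration, not a specific implementation from [4]:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, nfft, fs):
    """Triangular filters spaced uniformly on the Mel scale, applied on
    top of FFT bins -- the second step of the two-step MFCC path."""
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0),
                                  n_filters + 2))
    bins = np.floor((nfft + 1) * edges / fs).astype(int)
    fb = np.zeros((n_filters, nfft // 2 + 1))
    for i in range(n_filters):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        fb[i, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
        fb[i, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)
    return fb

fb = mel_filterbank(26, 512, 16000)
```

Multiplying an FFT magnitude frame by `fb.T` yields the Mel band energies; with the AT, the cochlear filter outputs play this role directly, without the intermediate FFT.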
The outputs of the auditory filters are the filter-bank outputs directly; no FFT is needed. MFCC features [4], in the first group, are among the most popular features for speech and speaker recognition. Like the LPCC features, the MFCC features perform well in clean environments but not in adverse environments or under mismatched training and testing conditions. Perceptual linear predictive (PLP) analysis is another peripheral auditory-based approach. Based on the FFT output, it uses several perceptually motivated transforms, including Bark frequency, equal-loudness pre-emphasis, and a masking curve [6]. The relative spectral technique, known as RASTA, was further developed to filter the time trajectory and suppress constant factors in the spectral components [7]. It is often cascaded with PLP feature extraction to form the RASTA-PLP features. Comparisons between MFCC and RASTA-PLP have been reported in [5]. Further experimental comparisons among the PLP, RASTA-PLP, and CFCC features will be given at the end of this chapter. Both MFCC and RASTA-PLP features are based on the FFT. As mentioned above, the FFT has a fixed time-frequency resolution and a well-defined inverse transform. Fast algorithms exist for both the forward transform and
the inverse transform. Despite its simplicity and efficient computational algorithms, we believe that when applied to speech processing, the time-frequency decomposition mechanism of the FT is different from the mechanism of the hearing system. First, it uses fixed-length windows, which generate pitch harmonics across the entire speech band. Second, its individual frequency bands are distributed linearly, which differs from the distribution in the human cochlea; further warping is needed to convert to the Bark, Mel, or other scales. Finally, in our recent study [9, 8], as shown in Chapter 7, we found that the FFT spectrogram has more noise distortion and computational noise than the auditory-based transform we recently developed. Thus, we find it promising to develop a new feature extraction algorithm based on the new auditory-based time-frequency transform [8] to replace the FFT in speech feature extraction.
Based on the study of the human hearing system, the author proposed an auditory-based time-frequency transform (AT) [9, 8] in Chapter 7. The new transform is a pair of a forward transform and an inverse transform. Through the forward transform, the speech signal is decomposed into a number of frequency bands using a bank of cochlear filters. The frequency distribution of the cochlear filters is similar to the distribution in the cochlea, and the impulse response of the filters is similar to that of the traveling wave. Through the inverse transform, the original speech signal can be reconstructed from the decomposed bandpass signals. The AT has been proven in theory and validated in experiments [8], as presented in detail in Chapter 7. Compared to the FFT, the AT has flexible time-frequency resolution, and its frequency distribution can take any linear or nonlinear form. Therefore, it is easy to define the distribution to be similar to the Bark, Mel, or ERB scale, approximating the frequency distribution function of the basilar membrane. Most importantly, the AT has significant advantages in noise robustness and can be free of pitch harmonic distortion, as plotted in [8] and Fig. 7.14. Therefore, the AT provides a new platform for feature extraction research and forms the foundation for our robust feature extraction. The ultimate goal of this study is to develop a practical speech front-end feature extraction algorithm that conceptually emulates the human peripheral hearing system and thus achieves superior noise robustness under mismatched training and testing conditions. The remainder of the chapter is organized as follows: Section 8.2 presents the auditory feature extraction algorithm and provides an analytic study and discussion, and Section 8.3 studies the feature parameters on a development dataset and presents experimental results of the CFCC in comparison to other front-end features on a testing dataset.
8.2 Auditory-Based Feature Extraction Algorithm

In this section, we describe the structure of the auditory-based feature extraction algorithm and provide details of its computation. The human hearing system is very complex. Although we would like to emulate the human peripheral hearing system in detail, our computers may not be able to meet the requirements of real-time applications; therefore, we simulate only the most important characteristics of the human peripheral hearing system. An illustrative block diagram of the algorithm is shown in Fig. 8.1. The algorithm is intended to replicate the hearing system conceptually at a high level and consists of the following modules: cochlear filter bank, hair-cell function with windowing, cubic-root nonlinearity, and discrete cosine transform (DCT). A detailed description of each module follows.
Fig. 8.1. Schematic diagram of the auditorybased feature extraction algorithm named cochlear ﬁlter cepstral coeﬃcients (CFCC).
8.2.1 Forward Auditory Transform and Cochlear Filter Bank

The cochlear filter bank in Fig. 8.1 is an implementation of the invertible auditory transform (AT) as defined and described in Chapter 7. For feature extraction, we use only the forward transform, which is the foundation of the auditory-based feature. The forward transform models the traveling wave in the cochlea, where the sound waveform is decomposed into a set of subband signals. Let f(t) be the speech signal. A transform of f(t) with respect to a cochlear filter ψ(t), representing the basilar membrane (BM) impulse response in the cochlea, is defined as:

T(a, b) = ∫_{−∞}^{∞} f(t) (1/√a) ψ((b − t)/a) dt,    (8.1)

where a and b are real, both f(t) and ψ(t) belong to L²(R), and T(a, b), representing the traveling waves on the BM, is the decomposed signal and filter output. The above equation can also be written as:
T(a, b) = f(t) ∗ ψ_{a,b}(t),    (8.2)

where

ψ_{a,b}(t) = (1/√a) ψ((t − b)/a).    (8.3)
Like in the wavelet transform, the factor a is a scale or dilation variable. By changing a, we can shift the central frequency of an impulse response function to obtain a band of decomposed signals. The factor b is a time shift or translation variable. For a given value of a, factor b shifts the function ψ_{a,0}(t) by an amount b along the time axis. Note that 1/√a is an energy normalizing factor. It ensures that the energy stays the same for all a and b; therefore, we have:

∫_{−∞}^{∞} |ψ_{a,b}(t)|² dt = ∫_{−∞}^{∞} |ψ(t)|² dt.    (8.4)
The cochlear filter, as the most important part of the transform, is defined by the author as:

ψ_{a,b}(t) = (1/√a) ψ((t − b)/a)
           = (1/√a) [(t − b)/a]^α exp[−2π f_L β (t − b)/a] cos[2π f_L (t − b)/a + θ] u(t − b),    (8.5)

where α > 0 and β > 0, and u(t) is the unit step function, i.e., u(t) = 1 for t ≥ 0 and 0 otherwise. Parameters α and β determine the shape and width of the cochlear filter in the frequency domain. They can be empirically optimized as shown in our experiments in Section 8.3. The value of θ should be selected such that (8.6) is satisfied:

∫_{−∞}^{∞} ψ(t) dt = 0.    (8.6)

This is required by the transform theory to ensure no information is lost during the transform [8]. The value of a can be determined by the current filter central frequency, f_c, and the lowest central frequency, f_L, in the cochlear filter bank:

a = f_L / f_c.    (8.7)
Since we contract ψ_{a,b}(t) relative to the lowest frequency along the time axis, the value of a lies in 0 < a ≤ 1. If we stretch ψ instead, then a > 1. The frequency distribution of the cochlear filters can follow a linear or nonlinear scale such as ERB (equivalent rectangular bandwidth) [15], Bark [21], Mel [4], log, etc. For a particular band number i, the corresponding value of a is denoted a_i, which needs to be precalculated for the required central frequency of the cochlear filter at band i.
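A sketch of one cochlear filter following (8.5) is given below. The values of α, β, and θ are illustrative placeholders, not the tuned parameters studied in Section 8.3, and θ should strictly be chosen to satisfy the zero-mean condition (8.6) rather than fixed at zero:

```python
import numpy as np

def cochlear_filter(fc, fs, f_low=80.0, alpha=3.0, beta=0.2,
                    theta=0.0, dur=0.128):
    """Impulse response of one cochlear filter per (8.5), evaluated at
    b = 0 so that (t - b) reduces to t. The scale factor a = f_L / f_c
    from (8.7) contracts the prototype toward higher frequencies."""
    a = f_low / fc
    t = np.arange(int(dur * fs)) / fs
    ta = t / a
    psi = (1.0 / np.sqrt(a)) * ta ** alpha \
        * np.exp(-2.0 * np.pi * f_low * beta * ta) \
        * np.cos(2.0 * np.pi * f_low * ta + theta)
    return psi

fs = 16000
low_band = cochlear_filter(100.0, fs)
high_band = cochlear_filter(1600.0, fs)
# Higher center frequency -> smaller a -> a contracted impulse response
# whose energy concentrates earlier in time.
peak_low = int(np.argmax(np.abs(low_band)))
peak_high = int(np.argmax(np.abs(high_band)))
```

Precomputing one such response per band number i (one a_i per central frequency) yields the filter bank whose outputs T(a, b) feed the hair-cell stage in Section 8.2.2.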
8 Auditory-Based Feature Extraction and Robust Speaker Identification
8.2.2 Cochlear Filter Cepstral Coefficients (CFCC)

The cochlear filter bank is intended to emulate the impulse response in the cochlea. However, there are other operations in the ear. The inner hair cells act as a transducer, converting the mechanical movements of the BM into neural activities. When the BM moves up and down, a shearing motion is created between the BM and the tectorial membrane [16]. This causes the displacement of the uppermost hair cells, which generates neural signals; however, the hair cells only generate neural signals in one direction of the BM movement. When the BM moves in the opposite direction, there is neither excitation nor neuron output. We studied different implementations of the hair cell function. The following hair cell output function provides the best performance in our evaluated task:

h(a, b) = T(a, b)²,  ∀ T(a, b),   (8.8)

where T(a, b) is the filter-bank output. Here, we assume that all other detailed functions in the outer ear, middle ear, and the feedback control of the auditory system and brain to the cochlea have been ignored or have been included in the auditory filter responses. In the next step, the hair cell output for each band is converted into a representation of nerve spike count density over a duration associated with the current band's central frequency. We use the following equation to mimic this concept:

S(i, j) = (1/d) Σ_{b=ℓ}^{ℓ+d−1} h(i, b),  ℓ = 1, L, 2L, ⋯ ;  ∀ i, j,   (8.9)

(ℓ denotes the starting sample of each averaging window),
where d = max{3.5τ_i, 20 ms} is the window length, τ_i is the period of the ith band, and L = 10 ms is the window shift duration. We set the computations and the parameters empirically, but they may need to be adjusted for different datasets. Instead of using a fixed-length window as in computing FFT spectrograms, we use a variable-length window for different frequency bands: the higher the frequency, the shorter the window. This prevents the high-frequency information from being smoothed out by a long window duration. The output of the above equation and the spectrogram of the cochlear filter bank can be used for both feature extraction and analysis. Furthermore, we apply the scales-of-loudness function suggested by Stevens [19, 20] to the hair cell output:

y(i, j) = S(i, j)^{1/3}.   (8.10)
In the last step, the discrete cosine transform (DCT) is applied to decorrelate the feature dimensions and generate the cochlear filter cepstral coefficients (CFCC), so that the features can work with existing back-ends.
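The post-filter-bank steps (8.8)–(8.10) plus the DCT can be sketched as follows. The windowing constants follow the text (d = max{3.5τ_i, 20 ms}, L = 10 ms), while the band centers and the random stand-in for the filter-bank output T are assumptions for demonstration only.

```python
import numpy as np

def cfcc_from_bands(T, fs, centers, n_ceps=None):
    """Hair-cell model (8.8), windowed spike-density average (8.9),
    cubic-root loudness (8.10), then a DCT-II over the band axis."""
    if n_ceps is None:
        n_ceps = len(centers)
    shift = int(0.010 * fs)                       # L = 10 ms window shift
    n_frames = T.shape[1] // shift
    h = T ** 2                                    # Eq. (8.8): hair-cell output
    S = np.zeros((len(centers), n_frames))
    for i, fc in enumerate(centers):
        d = int(max(3.5 * fs / fc, 0.020 * fs))   # Eq. (8.9): variable window
        for j in range(n_frames):
            seg = h[i, j * shift : j * shift + d]
            S[i, j] = seg.mean() if seg.size else 0.0
    y = np.cbrt(S)                                # Eq. (8.10): Stevens' power law
    k = np.arange(n_ceps)[:, None]                # DCT-II matrix (decorrelation)
    n = np.arange(len(centers))[None, :]
    dct = np.cos(np.pi * k * (2 * n + 1) / (2 * len(centers)))
    cep = dct @ y
    return cep[1:]                                # drop c0 (energy) for speaker ID

fs, centers = 8000, [200, 500, 1000, 2000]
rng = np.random.default_rng(0)
T = rng.standard_normal((len(centers), fs))       # stand-in filter-bank output
feat = cfcc_from_bands(T, fs, centers)
print(feat.shape)
```

In a full system, T would come from the cochlear filter bank of Section 8.2.1 rather than random data, and many more bands would be used.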
8.2.3 Analysis and Comparison

The FFT is the major tool for the time-frequency transform used in speech signal processing, and most features were developed based on the FFT. A comparison between the AT used in the CFCC and the FFT used in MFCC, PLP, and other features is provided in Chapter 7; the reader can refer to that chapter for details. We now focus our comparison on the feature level. The analysis and discussion are intended to help the reader understand the CFCCs; further comparisons in experiments are made in the next section.

Comparison between CFCC and MFCC

Since the MFCC features are popular in both speaker and speech recognition, we compare the CFCCs with the MFCCs in this section. The MFCC features use the FFT to convert the time-domain speech signal to a frequency-domain spectrum, represented by complex numbers, and then apply triangle filters on top of the spectrum. The triangle filters are distributed on the Mel scale. The CFCC features instead use a bank of cochlear filters to decompose the speech signal into multiple bands. The frequency response of a cochlear filter has a bell-like shape rather than a triangle shape, and the shape and bandwidth of the filter in the frequency domain can be adjusted by parameters α and β in (8.5). In each of the bands, the decomposed signal is still a time-domain signal, represented by real numbers. The central frequencies of the cochlear filters can follow any distribution, including Mel, ERB, Bark, log, etc. When using the FFT to compute a spectrogram, the window size must be fixed for all frequency bands, due to the fixed-length FFT. When we compute a spectrogram from the decomposed signals generated by the cochlear filters, the window size can differ across frequency bands. For example, we use a longer window for a lower frequency band to average out the background noise, and a shorter window for a higher frequency band to protect high-frequency information.
Furthermore, the MFCCs use a logarithm as the nonlinearity, while the CFCCs use a cubic root. In addition, linear and/or nonlinear operations can be applied to the decomposed multiband signals to mimic signal processing in the peripheral hearing system, or to tune for the best performance in a particular application. The operations in this section are just one feasible configuration; to achieve the best performance, different applications may require different configurations or adaptations. To this end, the auditory transform can be considered a new platform for future feature extraction research. While the MFCC has been used for several decades, the CFCC is new, and further improvement may be required to finalize all the details. We introduce the CFCC as a platform with the hope that the CFCC features will be further improved through research on various tasks and databases.
Comparison between CFCCs and Gammatone-Based Features

Gammatone-based features named Gammatone frequency cepstral coefficients (GFCC) [18] are also auditory-based speech features. An exact implementation following the description in [18] did not give us reasonable experimental results; we therefore replaced the "downsampling" procedure in [18] with an average of the absolute values of the Gammatone filter-bank output, computed over a 20 ms window shifted every 10 ms, followed by a cubic root function and the DCT. This procedure gave us the best results in our experiments; because these features differ from the original GFCCs, we have named them modified GFCC (MGFCC) features. Comparisons in experiments are provided in the next section.
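The replaced step can be sketched as below: an average of absolute values over a 20 ms window every 10 ms. The sinusoidal stand-in for a Gammatone band output is an assumption; a real implementation would apply this to each Gammatone filter channel before the cubic root and DCT.

```python
import numpy as np

def mgfcc_envelope(band_out, fs):
    """The modified step from the text: average |output| over a 20 ms window
    every 10 ms (replacing the original GFCC downsampling). A sketch only."""
    win, hop = int(0.020 * fs), int(0.010 * fs)
    n = 1 + (len(band_out) - win) // hop
    return np.array([np.abs(band_out[j * hop : j * hop + win]).mean()
                     for j in range(n)])

fs = 8000
x = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)   # stand-in Gammatone band output
env = mgfcc_envelope(x, fs)
print(env.shape)
```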
8.3 Speaker Identification and Experimental Evaluation

In this section, we use a closed-set, text-independent speaker identification task to evaluate the new auditory feature extraction algorithm. In a training session, after feature extraction, a set of Gaussian mixture models (GMMs) is trained; each model is associated with one speaker. In a testing session, given a test utterance, the log-likelihood scores are calculated for each of the GMMs. The GMM with the largest score is selected, and the associated speaker is the estimated speaker of the given utterance. Since we are addressing the robustness problem, the CFCC front-end and GMM back-end (CFCC/GMM) system was evaluated in a task where the acoustic conditions of training and testing are mismatched, i.e. the training data set was recorded under clean conditions while the testing data sets were mixed with different types of background noise at various noise levels.

8.3.1 Experimental Datasets

The Speech Separation Challenge database contains speech recorded from a closed set of 34 speakers (18 male and 16 female). All speech files are single-channel data sampled at 25 kHz, and all material is endpointed (i.e. there is little or no initial or final silence) [1]. The training data was recorded under clean conditions. The testing sets were obtained by mixing clean testing utterances with white noise at different SNR levels; in total there are five testing conditions in the database (noisy speech at 12 dB, 6 dB, 0 dB, and −6 dB SNR, and clean speech). We find this database ideal for the study of noise robustness when training and testing conditions do not match. In particular, since all the noisy testing data is generated from the same speech with only the noise level changing, performance fluctuations due to variations other than noise type and mixing level are largely reduced.
Table 8.1. Summary of the training, development, and testing sets.

Data Set    # of Spks.   # of Utters / Spk.   Dur. (sec) / Spk.
Training        34               20                 36.8
Develop.        34               10                 18.3
Testing         34             10–20                29.6
As shown in Table 8.1, the Speech Separation Challenge database was partitioned into training, development, and testing sets, with no overlap among them. In our experiments, speaker models were first trained using the clean training set and then tested on noisy speech at four SNR levels. We created three disjoint subsets from the database as the training, development, and testing sets; each has 34 speakers. The training set has 20 utterances per speaker, 680 utterances in total, and the average duration of training data per speaker is 36.8 seconds of speech. The development set has 1700 utterances in total. There are five testing conditions (noisy speech at 12 dB, 6 dB, 0 dB, and −6 dB SNR, and clean speech); each condition has 10 utterances per speaker, with an average utterance duration of 1.8 seconds. The development set uses white noise only. The testing set has the same five testing conditions. Each condition has 10 to 20 utterances per speaker, and each testing utterance is about 2 to 3 seconds of speech. The testing set has about 2500 utterances for each noise type; for the three types of noise (white, car, and babble), we have about 7500 utterances in total for testing. Note that the training set consists of only clean speech, while both the development set and the testing set consist of clean speech and noisy speech at the different SNR levels. We mainly focused on the 0 dB and −6 dB conditions in our feature analysis and comparisons, because below −6 dB the performance of all features is close to random chance. In addition to the white noise testing conditions provided in the Speech Separation Challenge database, we also generated two more sets of testing conditions with car noise or babble noise at 6 dB, 0 dB, and −6 dB SNR. The car noise and babble noise were recorded under real-world conditions and mixed with the clean test speech from the database.
These test sets were used as additional material to further test the robustness of the auditory features. The sizes of the testing sets with different types of noise are the same.

8.3.2 The Baseline Speaker Identification System

Our baseline system uses the standard MFCC front-end features and Gaussian mixture models (GMMs). Twenty-dimensional MFCC features (c1–c20) were extracted from the speech audio based on a 25 ms window with a frame rate of 10 ms; the frequency analysis range was set to 50 Hz–8000 Hz. Note that the delta and double-delta MFCCs were not used here, since they were not found helpful in discerning between speakers in
our experiments. We also found that cepstral mean subtraction was not helpful; therefore it was not used in our baseline system. The back-end of the baseline system is a set of standard GMMs trained using maximum likelihood estimation (MLE) [17]. Let M_i represent the GMM for the ith speaker, where i is the speaker index. During testing, a test utterance u is matched against all hypothesized speaker models M_i, and the speaker identification decision J is made by:

J = arg max_i Σ_k log p(u_k | M_i),   (8.11)

where u_k is the kth frame of utterance u and p(· | M_i) is the probability density function. Thirty-two Gaussian mixture components were used in the speaker GMMs. To obtain a fair comparison of the different front-end features, only the front-end feature extraction was varied; the configuration of the back-end remained the same in all the experiments throughout this chapter. When we developed the CFCC features, we analyzed the effects of the following items on speaker ID performance using a development dataset: the filter bandwidth (β), the equal-loudness function, various windowing schemes, and the nonlinearity. The results were reported in [10]. Here, we only introduce the parameters and operations which provide the best results. Based on our analytic study, the CFCC feature extraction can be summarized as follows. First, the speech audio file is passed through the band-pass filter bank; the filter width parameter β was set to 0.035, the Bark scale is used for the filter-bank distribution, and equal-loudness weighting is applied to the different frequency bands. Second, the traveling waves generated by the cochlear filters are windowed and averaged by the hair cell function; the window length is 3.5 epochs of the band central frequency or 20 ms, whichever is longer. Third, a cubic root is applied. Finally, since most back-end systems adopt diagonal Gaussians, the discrete cosine transform (DCT) is used to decorrelate the features. The 0th component, corresponding to the energy, is removed from the DCT output for the speaker ID task (it is needed for speech recognition). Table 8.2 shows a comparison of the speaker identification accuracy of the optimized CFCC features with the MGFCCs and MFCCs tested on the development set.

Table 8.2. Comparison of MFCC, MGFCC, and CFCC features tested on the development set.

Testing SNR   MFCC    MGFCC   CFCC
−6 dB          6.8%    9.1%   12.6%
 0 dB         15.9%   45.0%   57.9%
 6 dB         42.1%   88.8%   90.3%
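The decision rule (8.11) can be sketched with a small diagonal-covariance GMM scorer. The toy single-component models and data below are hypothetical; a real system would use 32-component GMMs trained with MLE as described above.

```python
import numpy as np

def gmm_loglik(frames, weights, means, variances):
    """Per-frame log-likelihoods under a diagonal-covariance GMM.
    frames: (T, D); weights: (M,); means, variances: (M, D)."""
    diff = frames[:, None, :] - means[None, :, :]
    log_comp = (-0.5 * (diff ** 2 / variances).sum(-1)
                - 0.5 * np.log(2 * np.pi * variances).sum(-1)
                + np.log(weights))                      # (T, M)
    m = log_comp.max(axis=1, keepdims=True)             # log-sum-exp over mixtures
    return m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1))

def identify(frames, models):
    """Eq. (8.11): pick the speaker whose GMM maximizes the summed log-likelihood."""
    scores = [gmm_loglik(frames, *m).sum() for m in models]
    return int(np.argmax(scores))

# toy example: two single-component "GMMs" around different means (hypothetical)
rng = np.random.default_rng(1)
m0 = (np.array([1.0]), np.zeros((1, 2)), np.ones((1, 2)))
m1 = (np.array([1.0]), np.full((1, 2), 4.0), np.ones((1, 2)))
utt = rng.standard_normal((50, 2)) + 4.0    # frames near speaker 1's mean
print(identify(utt, [m0, m1]))
```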
8.3.3 Experiments

Using the optimized CFCC feature parameters selected from the development set, we conducted speaker identification experiments on the testing set, with the results depicted in Fig. 8.2. As we can see from Fig. 8.2, in clean testing conditions the CFCC features generate results comparable to the MFCCs, over 96%. As white noise is added to the clean testing data at increasing intensity, the performance of the CFCCs is significantly better than both the MGFCCs and the MFCCs. For example, when the SNR of the testing condition drops to 6 dB, the accuracy of the MFCC system drops to 41.2%; in comparison, the parallel system using the CFCC features still achieves 88.3% accuracy, more than twice that of the MFCC features. Similarly, the MGFCC features have an accuracy of 85.1%, better than the MFCC features but not as good as the CFCC features. The CFCC performance on the testing set is similar to its performance on the development set. Overall, the CFCC features significantly outperform both the widely used MFCC features and the related auditory-based MGFCC features in this speaker identification task.

Fig. 8.2. Comparison of MFCC, MGFCC, and the CFCC features tested on noisy speech with white noise. [Plot of identification accuracy vs. test condition: −6 dB, 0 dB, 6 dB, clean.]
To further test the noise robustness of the CFCCs, we conducted more experiments on noisy speech data with two kinds of real-world noise (car noise and babble noise), as described in Section 8.3.1, using the same experimental setup. Figure 8.3 and Figure 8.4 present the experimental results for the car noise and the babble noise at 6 dB, 0 dB, and −6 dB levels, respectively. The
auditory features consistently outperform the baseline MFCC system and the MGFCC system under both real-world car noise and babble noise testing conditions.

Fig. 8.3. Comparison of MFCC, MGFCC, and CFCC features tested on noisy speech with car noise. [Plot of identification accuracy vs. test condition: −6 dB, 0 dB, 6 dB, clean.]
Fig. 8.4. Comparison of MFCC, MGFCC, and CFCC features tested on noisy speech with babble noise. [Plot of identification accuracy vs. test condition: −6 dB, 0 dB, 6 dB, clean.]

8.3.4 Further Comparison with PLP and RASTA-PLP

We conducted further experiments with PLP and RASTA-PLP features using the same experimental setup as described before. The comparative results on white noise, car noise, and babble noise are depicted in Fig. 8.5, Fig. 8.6, and Fig. 8.7, respectively. We are not surprised to observe that the CFCC features outperform the PLP features in all three testing conditions. The PLP features minimize the differences between speakers while preserving important speech information via the spectral warping technique [6]; as a consequence, PLP has never been a preferred speech feature for speaker recognition. It is interesting to observe that the CFCCs perform significantly better than RASTA-PLP on white noise testing conditions at all levels; however, for car noise and babble noise the performance of the CFCCs and RASTA-PLP is fairly close. As previously stated, RASTA is a technique that uses a band-pass filter to smooth out the variations of short-term noise and remove the constant offset due to static spectral coloration in the speech channel [7]. It is typically used in combination with PLP, which is referred to as RASTA-PLP [2]. Our experiments show that RASTA filtering largely improves the performance of PLP features in speaker identification under mismatched training and testing conditions. It is particularly helpful when tested under car noise and babble noise, but it is not as effective for flatly distributed white noise. In comparison, the CFCCs consistently generate superior performance in all three conditions.
8.4 Conclusions

In this chapter, we presented a new auditory-based feature extraction algorithm, named CFCC, and applied it to robust speaker identification under mismatched conditions. Our research was motivated by studies of the signal processing functions in the human peripheral auditory system. The CFCC features are based on the flexible time-frequency transform (the AT) presented in the previous chapter, in combination with several components that emulate the human peripheral hearing system. The analytic study for feature optimization was conducted on a separate development set. The optimized CFCC features were then tested under a variety of mismatched testing conditions, which included white noise, car noise, and babble noise. Our experiments in speaker identification tasks show that under mismatched conditions, the new CFCCs consistently perform better than both the MFCC and MGFCC features. Further comparison with PLP and RASTA-PLP features shows that although RASTA-PLP can generate comparable results when tested on car noise or babble noise, it does not perform as well on flatly distributed white noise. In comparison, the CFCCs generate superior results under all
three noise conditions. The presented feature can be applied to other speaker authentication tasks. For the best performance, some of the parameters, such as β in the auditory transform, can be adjusted. Also, for a different sampling rate, the auditory filter distribution should be adjusted accordingly.

Fig. 8.5. Comparison of PLP, RASTA-PLP, and the CFCC features tested on noisy speech with white noise. [Plot of identification accuracy vs. test condition: −6 dB, 0 dB, 6 dB, clean.]

Fig. 8.6. Comparison of PLP, RASTA-PLP, and the CFCC features tested on noisy speech with car noise. [Plot of identification accuracy vs. test condition: −6 dB, 0 dB, 6 dB, clean.]
Fig. 8.7. Comparison of PLP, RASTA-PLP, and the CFCC features tested on noisy speech with babble noise. [Plot of identification accuracy vs. test condition: −6 dB, 0 dB, 6 dB, clean.]
We note that this chapter is just an example of using the auditory transform as the platform for feature extraction. Diﬀerent versions of CFCC can be developed for diﬀerent tasks and noise environments. The presented parameters and operations may not be the best for all applications. The reader can modify and tune the conﬁgurations to achieve the best results on speciﬁc tasks. Currently, the author is extending the auditory feature extraction algorithm to speech recognition.
References

1. http://www.dcs.shef.ac.uk/~martin/SpeechSeparationChallenge/
2. http://www.icsi.berkeley.edu/~dpwe/projects/sprach/sprachcore.html
3. Atal, B. S., "Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification," Journal of the Acoustical Society of America, vol. 55, pp. 1304–1312, 1974.
4. Davis, S. B. and Mermelstein, P., "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. ASSP-28, pp. 357–366, August 1980.
5. Grimaldi, M. and Cummins, F., "Speaker identification using instantaneous frequencies," IEEE Trans. on Audio, Speech, and Language Processing, vol. 16, pp. 1097–1111, August 2008.
6. Hermansky, H., "Perceptual linear predictive (PLP) analysis of speech," J. Acoust. Soc. Am., vol. 87, pp. 1738–1752, 1990.
7. Hermansky, H. and Morgan, N., "RASTA processing of speech," IEEE Trans. Speech and Audio Proc., vol. 2, pp. 578–589, Oct. 1994.
8. Li, Q., "An auditory-based transform for audio signal processing," in Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, (New Paltz, NY), Oct. 2009.
9. Li, Q., "Solution for pervasive speaker recognition," SBIR Phase I Proposal, Submitted to NSF IT.F4, Li Creative Technologies, Inc., NJ, June 2003.
10. Li, Q. and Huang, Y., "An auditory-based feature extraction algorithm for robust speaker identification under mismatched conditions," IEEE Trans. on Audio, Speech and Language Processing, Sept. 2011.
11. Li, Q. and Huang, Y., "Robust speaker identification using an auditory-based feature," in ICASSP 2010, 2010.
12. Li, Q., Soong, F. K., and Olivier, S., "An auditory system-based feature for robust speech recognition," in Proc. 7th European Conf. on Speech Communication and Technology, (Denmark), pp. 619–622, Sept. 2001.
13. Li, Q., Soong, F. K., and Siohan, O., "A high-performance auditory feature for robust speech recognition," in Proceedings of 6th Int'l Conf. on Spoken Language Processing, (Beijing), pp. III-51–54, Oct. 2000.
14. Makhoul, J., "Linear prediction: a tutorial review," Proceedings of the IEEE, vol. 63, pp. 561–580, April 1975.
15. Moore, B. C. J. and Glasberg, B. R., "Suggested formulae for calculating auditory-filter bandwidths and excitation patterns," J. Acoust. Soc. Am., vol. 74, pp. 750–753, 1983.
16. Moore, B. C. J., An Introduction to the Psychology of Hearing. NY: Academic Press, 1997.
17. Reynolds, D. and Rose, R. C., "Robust text-independent speaker identification using Gaussian mixture speaker models," IEEE Trans. on Speech and Audio Processing, vol. 3, pp. 72–83, 1995.
18. Shao, Y. and Wang, D., "Robust speaker identification using auditory features and computational auditory scene analysis," in Proceedings of IEEE ICASSP, pp. 1589–1592, 2008.
19. Stevens, S. S., "On the psychophysical law," Psychol. Rev., vol. 64, pp. 153–181, 1957.
20. Stevens, S. S., "Perceived level of noise by Mark VII and decibels (E)," J. Acoust. Soc. Am., vol. 51, pp. 575–601, 1972.
21. Zwicker, E. and Terhardt, E., "Analytical expressions for critical-band rate and critical bandwidth as a function of frequency," J. Acoust. Soc. Am., vol. 68, pp. 1523–1525, 1980.
Chapter 9
Fixed-Phrase Speaker Verification
As introduced in Chapter 1, speaker recognition includes speaker identification and speaker verification. The previous chapters discussed speaker identification; in this chapter, we introduce speaker verification. From an application point of view, speaker verification has more commercial applications than speaker identification.
9.1 Introduction

Among the different speaker authentication systems introduced in Chapter 1, we focus here on the fixed-phrase speaker verification system for open-set applications. Here, fixed-phrase means that the same passphrase is used for one speaker in both training and testing sessions, and the text of the passphrase is known to the system through registration. The system introduced in this chapter was first proposed by Parthasarathy and Rosenberg in [6], and a system with stochastic matching using the same database was then reported in [2] by the author together with the above authors. A fixed-phrase speaker verification system has the following advantages. First, a short, user-selected phrase, also called a passphrase, is easy to remember and use; for example, it is easier to remember "open sesame" as a passphrase than a 10-digit phone number. Based on our experiments, the selected passphrase can be short, less than two seconds in duration, and still yield good performance. Second, a fixed-phrase system usually performs better than a text-prompted system [3]. Last, a fixed-phrase system can easily be implemented as a language-independent system, i.e. a user can create a passphrase in a language of choice, which has the potential to increase the security level.
Q. Li, Speaker Authentication, Signals and Communication Technology, DOI: 10.1007/978-3-642-23731-7_9, © Springer-Verlag Berlin Heidelberg 2012
9.2 A Fixed-Phrase System

In this chapter, we introduce a fixed-phrase system which has generated superior performance in several evaluations using different databases. As shown in Fig. 1.2, the developed fixed-phrase system has two phases, enrollment and test. In the enrollment phase, a speaker-dependent model is trained; in the test phase, the trained model is used to verify given utterances. The final decision is based on a likelihood score of the speaker-dependent model, or on a likelihood-ratio score calculated from the speaker-dependent model and a speaker-independent background model. For feature extraction, the speech signal is sampled at 8 kHz and pre-emphasized using a first-order filter with a coefficient of 0.97. The samples are blocked into overlapping frames of 30 ms in duration, updated at 10 ms intervals. Each frame is windowed with a Hamming window, followed by a 10th-order linear predictive coding (LPC) analysis. The LPC coefficients are then converted to cepstral coefficients, of which only the first 12 are retained for computing the feature vector. The feature vector consists of 24 features: the 12 cepstral coefficients and 12 delta cepstral coefficients [6]. A simple and useful technique for robustness in feature extraction is cepstral mean subtraction (CMS) [1]: the mean of the cepstral feature vectors of an utterance is calculated and then subtracted from each cepstral vector. During enrollment, the LPC cepstral feature vectors corresponding to the non-silence portion of the enrollment passphrases are used to train a speaker-dependent (SD), context-dependent, left-to-right HMM to represent the voice pattern in the utterance. This model is called a whole-phrase model [6]. In addition to model training, the text of the passphrase collected in the enrollment session is transcribed into a sequence of phonemes, {S_k}, k = 1, …, K, where S_k is the kth phoneme and K is the total number of phonemes in the passphrase.
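The front end described above can be sketched as follows. The LPC-to-cepstrum recursion is the standard one (sign conventions vary between texts), and the synthetic test signal is an assumption; the book's system would additionally compute delta cepstra and remove silence frames before CMS.

```python
import numpy as np

def lpc(frame, order=10):
    """LPC coefficients a_1..a_p via the autocorrelation (normal-equation) method."""
    r = np.correlate(frame, frame, "full")[len(frame) - 1:][:order + 1]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:order + 1])

def lpc_to_cepstrum(a, n_cep=12):
    """Standard LPC-to-cepstrum recursion; sign conventions vary between texts."""
    c = np.zeros(n_cep + 1)
    for m in range(1, n_cep + 1):
        acc = a[m - 1] if m <= len(a) else 0.0
        for k in range(1, m):
            if 1 <= m - k <= len(a):
                acc += (k / m) * c[k] * a[m - k - 1]
        c[m] = acc
    return c[1:]

def cms(cepstra):
    """Cepstral mean subtraction: remove the per-utterance mean of each coefficient."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)

# synthetic one-second "utterance": a tone plus a little noise (an assumption)
fs = 8000
rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * 300 * np.arange(fs) / fs) + 0.01 * rng.standard_normal(fs)
# 30 ms Hamming-windowed frames with a 10 ms shift, as in the text
frames = [x[s:s + 240] * np.hamming(240) for s in range(0, fs - 240, 80)]
ceps = np.array([lpc_to_cepstrum(lpc(f)) for f in frames])
norm = cms(ceps)
print(norm.shape)
```

After CMS, each cepstral dimension has zero mean over the utterance, which removes a constant channel offset in the log-spectral domain.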
The models and the transcription are used later for training a background model and for computing model scores. A detailed block diagram of a test session is shown in Fig. 9.1. After a speaker claims his or her identity, the system expects the user to utter the same phrase as in the enrollment session. The voice waveform is first converted to the prescribed feature representation. In the forced-alignment block, a sequence of speaker-independent phoneme models is constructed according to the phonemic transcription of the passphrase. The sequence of phonemic models is then used to segment and align the feature vector sequence using the Viterbi algorithm. In the cepstral mean subtraction block, silence frames are removed, and a mean vector is computed from the remaining speech frames; the mean is then subtracted from all speech frames [1]. This is an important step for channel compensation: it makes the system more robust to changes in the operating environment as well as in the transmission channel. We note that the forced-alignment block is also used for accurate endpoint detection. For a faster response and better performance, the real-time endpoint detection algorithm introduced in the previous chapter can be implemented here [4, 5].

Fig. 9.1. A fixed-phrase speaker verification system. [Block diagram: given an identity claim, feature vectors are forced-aligned against speaker-independent phoneme models using the phoneme transcription; after cepstral mean subtraction, a target score L(O, Λt) is computed with the speaker-dependent model and a background score L(O, Λb) with the background models; their difference is compared to a threshold for the decision.]

In the target score computation block of Fig. 9.1, the speech feature vectors are decoded into states by the Viterbi algorithm using the trained whole-phrase model. A log-likelihood score for the target model, i.e. the target score, is calculated as:

L(O, Λt) = (1/Nf) log P(O | Λt),   (9.1)

where O is the feature vector sequence, Nf is the total number of vectors in the sequence, Λt is the target model, and P(O | Λt) is the likelihood score resulting from Viterbi decoding. Using background models is usually a useful approach to improve the robustness of a speaker verification or identification system. Intuitively, channel distortion or background noise will affect the scores of both the speaker-dependent model and the background models, due to the mismatch between the training and current acoustic conditions. However, since the scores of the speaker-dependent model and the background models change together, their ratio changes relatively less; therefore, the likelihood ratio is a more robust score for the decision. The speaker-independent background models can be phoneme-dependent or phoneme-independent. In the first case, the background model can be a hidden Markov model (HMM); in the second case, it can be a Gaussian mixture model (GMM). As reported in [6], a phoneme-dependent background model provides better results. This is because a phoneme-dependent model can model the speech more accurately than a phoneme-independent model and can be more sensitive to background noise and channel distortion. In the background (non-target) score computation block, a set of speaker-independent HMMs in the order of the transcribed phoneme sequence, Λb = {λ1, ..., λK}, is applied to align the input utterance with the expected transcription using the Viterbi decoding algorithm. The segmented
utterance is O = {O1, ..., OK}, where Oi is the set of feature vectors corresponding to the ith phoneme, Si, in the phoneme sequence. The background, or non-target, likelihood score is then computed by:

L(O, Λb) = (1/Nf) Σ_{i=1}^{K} log P(Oi | λi),   (9.2)

where Λb = {λi}, i = 1, …, K, is the set of SI phoneme models in the order of the transcribed phoneme sequence, P(Oi | λi) is the corresponding phoneme likelihood score, and K is the total number of phonemes. The target and background scores [7] are then used in the likelihood-ratio test:

R(O; Λt, Λb) = L(O, Λt) − L(O, Λb),   (9.3)

where L(O, Λt) and L(O, Λb) are defined in (9.1) and (9.2), respectively.
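The duration-normalized likelihood-ratio test of (9.1)–(9.3) can be sketched as below. The per-frame log-likelihood arrays and the threshold value are synthetic assumptions standing in for the outputs of Viterbi decoding against the whole-phrase and background models.

```python
import numpy as np

def verify(target_logliks, background_logliks, threshold):
    """Duration-normalized likelihood-ratio test, Eqs. (9.1)-(9.3).
    Inputs are per-frame log-likelihoods over the same Nf frames;
    the threshold is application-dependent."""
    nf = len(target_logliks)
    L_t = np.sum(target_logliks) / nf          # Eq. (9.1)
    L_b = np.sum(background_logliks) / nf      # Eq. (9.2), summed over phonemes
    R = L_t - L_b                              # Eq. (9.3)
    return R, R >= threshold

rng = np.random.default_rng(2)
# hypothetical scores: the target model fits the true speaker better per frame
tgt = rng.normal(-40.0, 2.0, size=200)
bkg = rng.normal(-43.0, 2.0, size=200)
score, accept = verify(tgt, bkg, threshold=0.0)
print(round(score, 2), accept)
```

Because both scores are shifted similarly by channel distortion, the difference R is more stable across sessions than either raw likelihood, which is the motivation for the ratio test.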
9.3 An Evaluation Database and Model Parameters

The above system was tested on a database consisting of fixed-phrase utterances [6, 2]. The database, recorded over a long-distance telephone network, consists of 100 speakers, 51 male and 49 female. The fixed phrase, common to all speakers, is "I pledge allegiance to the flag," with an average utterance length of two seconds. Five repetitions of the passphrase from each speaker, recorded in one enrollment session (one telephone call), are used to construct an SD target HMM. For testing, the data is divided into true-speaker and impostor groups. The true-speaker group has two repetitions from each of 25 test sessions, recorded over different telephone channels and handsets, at different times, and with different background noise; this gives 50 test utterances per true speaker. The impostor group consists of four utterances from each speaker of the same gender as the true speaker, about 200 utterances in total. An open-set evaluation is closer to real applications: for example, a large-scale telebanking system usually involves a large user population that changes on a daily basis, so we have to test the system as an open-set problem. The SD target models for the phrases are left-to-right HMMs. The number of states depends on the total number of phonemes in the phrase: the more phonemes, the more states. Four Gaussian mixture components are associated with each state [6]. The background models are concatenated speaker-independent (SI) phone HMMs trained on a telephone speech database from different speakers and text [7]. There are 43 HMMs,
corresponding to the 43 phonemes, and each model has three states with 32 Gaussian components per state. Again, due to unreliable variance estimates from a limited amount of speaker-specific training data, a global variance estimate was used as the common variance for all Gaussian components in the target models [6].
9.4 Adaptation and Reference Results

To further improve the SD HMM, a model adaptation/re-estimation procedure is employed. The second, fourth, sixth, and eighth test utterances from the true speaker, which were recorded at different times, are used to update the means and mixture weights of the SD HMM for verifying subsequent test utterances. For the above database, the average individual equal-error rate over 100 speakers is 3.03% without adaptation and 1.96% with adaptation [2], as shown in Table 9.1. The performance can be further improved to 2.61% and 1.80%, respectively, with the stochastic matching technique introduced in the next chapter. In general, the longer the passphrase, the higher the accuracy. The response time depends on the hardware/software configuration; for most applications, a real-time response can be achieved using a personal computer.

Table 9.1. Experimental results in average equal-error rates of all tested speakers

                                            Without Adaptation   With Adaptation
Fixed passphrase                            3.03%                1.96%
Fixed passphrase with stochastic matching   2.61%                1.80%

Note: Tested on 100 speakers. All speakers used a common passphrase, and all impostors were of the same gender as the true speaker.
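The chapter does not give the exact update formulas for this adaptation step. As an illustration only, the following sketch interpolates the Gaussian means and mixture weights toward statistics of a newly accepted utterance; the function name, the occupancy-based update, and the rate `tau` are our assumptions, not the book's procedure.

```python
import numpy as np

def adapt_means(means, weights, frames, posteriors, tau=0.2):
    """Interpolate HMM Gaussian means and mixture weights toward the
    statistics of a newly verified utterance (illustrative sketch; the
    chapter does not specify its exact update rule).

    means:      (M, D) current mixture means
    weights:    (M,) current mixture weights
    frames:     (N, D) cepstral frames of the accepted test utterance
    posteriors: (N, M) frame-to-component occupancy (e.g. from alignment)
    tau:        adaptation rate in [0, 1] (assumed parameter)
    """
    occ = posteriors.sum(axis=0)                       # soft count per component
    new_means = means.copy()
    nz = occ > 0
    emp = (posteriors.T @ frames)[nz] / occ[nz, None]  # empirical component means
    new_means[nz] = (1.0 - tau) * means[nz] + tau * emp
    new_w = (1.0 - tau) * weights + tau * occ / occ.sum()
    return new_means, new_w / new_w.sum()
```

With `tau = 0` the models are unchanged; larger values move the models further toward the new session's acoustic condition, which is consistent with the observation in Section 10.6 that adapted models partially absorb channel differences.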
We note that the same passphrase was used for all speakers in our evaluation, and all impostors were of the same gender as the true speaker. These results therefore represent a worst case and were obtained for research purposes; the actual system equal-error rate (EER) would be better when users choose their own, distinct passphrases. Also, to preserve the open-test nature of the evaluation, none of the impostors' data was used to discriminatively train the SD target models in the above experiments.
9.5 Conclusions

In this chapter, we introduced a useful speaker verification system. In many research papers, the performance of speaker verification systems is reported as an EER measured while all the speakers in the database selected the same
passphrase. We note that this is a worst-case scenario, which should not occur in real applications. When different speakers use different passphrases, the EER can be less than or equal to 0.5%. To further improve system performance, other advanced techniques introduced in this book can be applied to a real system design to achieve a lower EER, faster response time, a user-friendly interface, and noise robustness. These techniques include endpoint detection, auditory features, fast decoding, discriminative training, and combined system design with verbal information verification (VIV); readers may refer to the corresponding chapters when developing a real speaker authentication application. The algorithm presented in this chapter can be applied to any language, and a speaker verification system can be developed as a language-independent system; in that case, the background model can be a language-independent universal phoneme model. In summary, the passphrase can be in any language or accent, and the SV system can be language dependent or language independent.
References

1. Furui, S., "Cepstral analysis techniques for automatic speaker verification," IEEE Trans. Acoust., Speech, Signal Processing, vol. 27, pp. 254–277, April 1981.
2. Li, Q., Parthasarathy, S., and Rosenberg, A. E., "A fast algorithm for stochastic matching with application to robust speaker verification," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, (Munich), pp. 1543–1547, April 1997.
3. Li, Q., Parthasarathy, S., Rosenberg, A. E., and Tufts, D. W., "Normalized discriminant analysis with application to a hybrid speaker-verification system," in IEEE International Conference on Acoustics, Speech, and Signal Processing, (Atlanta), May 1996.
4. Li, Q. and Tsai, A., "A matched filter approach to endpoint detection for robust speaker verification," in Proceedings of IEEE Workshop on Automatic Identification, (Summit, NJ), Oct. 1999.
5. Li, Q., Zheng, J., Tsai, A., and Zhou, Q., "Robust endpoint detection and energy normalization for real-time speech and speaker recognition," IEEE Trans. on Speech and Audio Processing, vol. 10, pp. 146–157, March 2002.
6. Parthasarathy, S. and Rosenberg, A. E., "General phrase speaker verification using sub-word background models and likelihood-ratio scoring," in Proceedings of ICSLP-96, (Philadelphia), October 1996.
7. Rosenberg, A. E. and Parthasarathy, S., "Speaker background models for connected digit password speaker verification," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, (Atlanta), pp. 81–84, May 1996.
Chapter 10 Robust Speaker Verification with Stochastic Matching
In today's telecommunications environment, which includes wireless, landline, VoIP, and computer networks, the mismatch between training and testing environments poses a major challenge to speaker authentication systems. In Chapter 8, we addressed the mismatch problem from a feature extraction point of view; in this chapter, we address it from an acoustic modeling point of view. The two approaches can be used independently or jointly. Speaker recognition performance degrades when a hidden Markov model (HMM) trained under one set of conditions is used to evaluate data collected from different wireless or landline channels, microphones, networks, etc. The mismatch can be approximated as a linear transform in the cepstral domain for any type of feature extraction algorithm. In this chapter, we present a fast, efficient algorithm that estimates the parameters of this linear transform for real-time applications. Using the algorithm, test data are transformed toward the training conditions by rotation, scale, and translation without destroying the detailed speaker-dependent characteristics of speech; speaker-dependent HMM's can then evaluate those details under a transformed condition similar to the original training condition. Compared to cepstral mean subtraction (CMS) and other bias-removal techniques, the linear transform is more general, since CMS and the others consider only translation; compared to maximum-likelihood approaches for stochastic matching, the algorithm is simpler and faster, since iterative techniques are not required. The fast stochastic matching algorithm improves the performance of a speaker verification system in the experiments reported in this chapter. This approach was originally reported by the author, Parthasarathy, and Rosenberg in [4].
Q. Li, Speaker Authentication, Signals and Communication Technology, DOI: 10.1007/978-3-642-23731-7_10, © Springer-Verlag Berlin Heidelberg 2012

10.1 Introduction

For speaker recognition, a speaker-dependent hidden Markov model (HMM) for a true speaker is usually trained on data collected in an enrollment session. The HMM therefore closely matches the probability density functions (pdf's) of the training data in the acoustic environment of the training data. In a verification session, however, test data are very often collected through a different communication channel and handset. Since the acoustic condition differs from that of the enrollment session, there is usually a mismatch between the test data and the trained HMM, and speaker recognition performance is degraded by this mismatch. The mismatch can be represented as a linear transform in the cepstral domain:

$$y = Ax + b, \qquad (10.1)$$

where x is a cepstral frame vector of a test utterance; A and b are the matrix and vector to be estimated for every test utterance; and y is the transformed vector. Geometrically, b represents a translation, and A represents both scale and rotation; when A is diagonal, it is a pure scaling operation. Cepstral mean subtraction (CMS) [2, 1, 3] is a fast, efficient technique for handling mismatches in both speaker and speech recognition: it estimates b and assumes A to be an identity matrix. In [9], the vector b was estimated by a long-term average, a short-term average, and a maximum-likelihood approach. In [10, 11], maximum-likelihood (ML) approaches were used to estimate b, a diagonal A, and the HMM parameters for stochastic matching. A least-squares solution of the linear transform parameters was briefly introduced in [5]. In this chapter we model the mismatch with a general linear transform, i.e., A is a full matrix and b is a vector. The approach is to make the overall distribution of the test data match the overall distribution of the training data; a speaker-dependent (SD) HMM trained on the training data is then applied to evaluate the details of the test data. This is based on the assumption that differences between speakers lie mainly in the details, which have been characterized by the HMM's. A fast algorithm with stochastic matching for fixed-phrase speaker verification is presented in this chapter.
Compared to CMS and other bias-removal techniques [9, 8], the introduced linear transform approach is more general, since CMS and the others consider only translation; compared to the ML approaches [9, 8, 10, 11], the algorithm is simpler and faster, since iterative techniques are not required and the estimation of the linear transform parameters is separated from HMM training and testing.
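To make the comparison concrete, here is a minimal sketch of the two special cases the paragraph mentions: CMS (A = I, with b chosen so the utterance mean becomes zero) and translation-only bias removal (A = I, b = m_train − m_test). The function names are ours, not from the book.

```python
import numpy as np

def cms(frames):
    """Cepstral mean subtraction: the special case of y = A x + b in
    (10.1) with A = I and b = -(utterance mean), so the transformed
    utterance has zero mean."""
    return frames - frames.mean(axis=0)

def bias_removal(frames, m_train):
    """Translation-only matching (A = I): shift the test-utterance mean
    onto the training mean, as in the bias-removal techniques [9, 8]."""
    return frames + (m_train - frames.mean(axis=0))
```

Both leave the shape of the test distribution untouched; the full transform of this chapter additionally rescales and rotates it.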
10.2 A Fast Stochastic Matching Algorithm

We use Fig. 10.1 as a geometric interpretation of the fast stochastic matching algorithm. In Fig. 10.1 (a), the dashed line is a contour of the training data; in Fig. 10.1 (b), the solid line is a contour of the test data. Due to different channels, noise levels, and telephone transducers, the mean of the test data is translated from that of the training data, and the distribution is shrunk [6] and rotated
Fig. 10.1. A geometric interpretation of the fast stochastic matching. (a) The dashed line is the contour of the training data. (b) The solid line is the contour of the test data. The crosses are the means of the two data sets. (c) The test data are scaled and rotated toward the training data. (d) The test data are translated to the same location as the training data; the two contours overlap.
from the HMM training condition. The mismatch may cause a wrong decision when the trained HMM is used to score the mismatched test data. Applying the algorithm, we first find a covariance matrix, R_train, from the training data, which approximately characterizes the overall distribution. Then, we find a covariance matrix, R_test, from the test data and estimate the parameters of the matrix A for the linear transform in (10.1). After this first transform, the overall distribution of the test data is scaled and rotated, A R_test A^T, to be the same as that of the training data except for the difference of the means, as shown in Fig. 10.1 (c). Next, we find the difference between the means and translate the test data to the location of the training data, as shown in Fig. 10.1 (d), where the contour of the transformed test data overlaps the contour of the training data. We note that, as a linear transform, the proposed algorithm does not destroy the details of the pdf of the test data; the details will be measured and evaluated by the trained SD HMM. If the test data from a true speaker mismatch the HMM training condition, the data will be transformed to match the trained HMM approximately. If the
test data from a true speaker match the training condition, the calculated A and b are close to an identity matrix and a zero vector, respectively, so the transform will not have much effect on the HMM scores. This technique attempts to reduce mismatch whether the mismatch occurs because test and training conditions differ or because the test and training data originate from different speakers. It is reasonable to suppose that speaker characteristics are found mainly in the details of the representation. However, to the extent that they are also found in global features, this technique would increase the matching scores between true-speaker models and impostor test utterances; performance could then possibly degrade, particularly when other sources of mismatch are absent, that is, when test and training conditions are actually matched. The experiments in this chapter, however, show that performance overall does improve. If the test data from an impostor do not match the training condition, the overall distribution of the data will be transformed to match it, but the details of the distribution still do not match the true speaker's HMM, because the transform criterion does not target the details and no nonlinear transform is involved.
10.3 Fast Estimation for a General Linear Transform

In a speaker verification training session, we collect multiple utterances with the same content and use a covariance matrix R_train and a mean vector m_train to represent the overall distribution of the training data of all the training utterances in the cepstral domain. They are defined as follows:

$$R_{train} = \frac{1}{U} \sum_{i=1}^{U} \frac{1}{N_i} \sum_{j=1}^{N_i} (x_{i,j} - m_i)(x_{i,j} - m_i)^T, \qquad (10.2)$$

and

$$m_{train} = \frac{1}{U} \sum_{i=1}^{U} m_i, \qquad (10.3)$$

where x_{i,j} is the jth non-silence frame in the ith training utterance, U is the total number of training utterances, N_i and m_i are the total number of non-silence frames and the mean vector of the ith training utterance, respectively, and m_train is the average mean vector of the non-silence frames of all training utterances. In a test session, only one utterance is collected and verified at a time. The covariance matrix for the test data is

$$R_{test} = \frac{1}{N_f} \sum_{j=1}^{N_f} (y_j - m_{test})(y_j - m_{test})^T, \qquad (10.4)$$
where y_j and m_test are a non-silence frame and the mean vector of the test data, and N_f is the total number of non-silence frames. The criterion for parameter estimation is to make R_test match R_train through a rotation, scale, and translation (RST) of the test data. For rotation and scale, we have the following equation:

$$R_{train} - A R_{test} A^T = 0, \qquad (10.5)$$

where A is defined as in (10.1), and R_train and R_test are defined as in (10.2) and (10.4). Solving (10.5) gives the matrix A for (10.1):

$$A = R_{train}^{1/2} R_{test}^{-1/2}. \qquad (10.6)$$

Then, the translation term b of (10.1) can be obtained by

$$b = m_{train} - m_{rs} = m_{train} - \frac{1}{N_f} \sum_{j=1}^{N_f} A x_j, \qquad (10.7)$$

where m_train is defined as in (10.3), m_rs is the mean vector of the rotated and scaled frames, N_f is the total number of non-silence frames of a test utterance, and x_j is the jth cepstral frame vector. To verify a given test utterance against a set of true-speaker models (consisting of an SD HMM plus R_train and m_train), first R_test, A, and b are calculated using (10.4), (10.6), and (10.7); then all test frames are transformed by (10.1) to reduce the mismatch.
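Assuming the non-silence frames are available as NumPy arrays, the estimation steps (10.2)-(10.7) can be sketched as follows. The symmetric matrix square roots are computed by eigendecomposition; the helper and function names are ours, not from the book.

```python
import numpy as np

def training_stats(utterances):
    """R_train and m_train per (10.2)-(10.3) from the non-silence
    cepstral frames of U enrollment utterances (list of (N_i, D) arrays)."""
    covs, means = [], []
    for x in utterances:
        m = x.mean(axis=0)
        d = x - m
        covs.append(d.T @ d / len(x))   # per-utterance covariance (1/N_i)
        means.append(m)
    return np.mean(covs, axis=0), np.mean(means, axis=0)

def _sqrtm_psd(R):
    # Symmetric square root of a positive semi-definite matrix.
    vals, vecs = np.linalg.eigh(R)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

def rst_transform(frames, R_train, m_train):
    """Fast stochastic matching (10.4)-(10.7): estimate A and b for the
    current test utterance, then return the transformed frames
    y = A x + b of (10.1)."""
    m_test = frames.mean(axis=0)
    d = frames - m_test
    R_test = d.T @ d / len(frames)                               # (10.4)
    A = _sqrtm_psd(R_train) @ np.linalg.inv(_sqrtm_psd(R_test))  # (10.6)
    rotated = frames @ A.T                                       # A x_j per frame
    b = m_train - rotated.mean(axis=0)                           # (10.7)
    return rotated + b
```

By construction, the transformed frames have sample mean m_train and sample covariance R_train, while the frame-level details are only linearly remapped, consistent with the geometric picture of Fig. 10.1.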
10.4 Speaker Verification with Stochastic Matching

The above stochastic matching algorithm has been applied to a text-dependent speaker verification system using general phrase passwords; the system was discussed in the previous chapter [7]. Stochastic matching is included in the front-end processing to further improve system robustness and performance. The system block diagram with stochastic matching is shown in Fig. 10.2. After a speaker claims an identity (ID), the system expects the same phrase obtained in the associated training session. First, a speaker-independent (SI) phone recognizer segments the input utterance into a sequence of phones by forced decoding, using the transcription saved from the enrollment session. Since the SD models are trained on a small amount of data from a single session, they cannot provide reliable and consistent phone segmentations, so the SI phone models are used instead. Meanwhile, the cepstral coefficients of the test utterance are transformed to match the training data distribution by computing Eqs. (10.4), (10.6), (10.7), and (10.1). Then, the transformed cepstral coefficients, decoded phone sequence,
Fig. 10.2. A phrase-based speaker verification system with stochastic matching.
and associated phone boundaries are transmitted to a verifier. In the verifier, a log-likelihood-ratio score is calculated from the log-likelihood scores of the target and background models:

$$L_R(O; \Lambda_t, \Lambda_b) = L(O, \Lambda_t) - L(O, \Lambda_b), \qquad (10.8)$$

where O is the observation sequence over the whole phrase, and Λ_t and Λ_b are the target and background models, respectively. The background model is a set of HMM's for phones; the target model is one HMM with multiple states for the whole phrase. As reported in [7], this configuration provides the best results in experiments. Furthermore,

$$L(O, \Lambda_t) = \frac{1}{N_f} P(O|\Lambda_t), \qquad (10.9)$$

where P(O|Λ_t) is the log-likelihood of the phrase evaluated by one HMM, Λ_t, using Viterbi decoding, and N_f is the total number of non-silence frames in the phrase; and

$$L(O, \Lambda_b) = \frac{1}{N_f} \sum_{i=1}^{N_p} P(O_i|\Lambda_{b_i}), \qquad (10.10)$$

where P(O_i|Λ_{b_i}) is the log-likelihood of the ith phone, O_i is the segmented observation sequence over the ith phone, Λ_{b_i} is the HMM for the ith phone, N_p is the total number of decoded non-silence phones, and N_f is as above. A final decision to reject or accept is made by comparing the L_R score with a threshold. If a significantly different phrase is given, it can be rejected by the SI phone recognizer before the verifier is used.
10.5 Database and Experiments

The feature vector in this chapter is composed of 12 cepstral and 12 delta-cepstral coefficients. The cepstrum is derived from a 10th-order linear predictive coding (LPC) analysis over a 30-ms window, and the feature vectors are updated at 10-ms intervals. The experimental database consists of fixed-phrase utterances recorded over long-distance telephone networks by 100 speakers, 51 male and 49 female. The fixed phrase, common to all speakers, is "I pledge allegiance to the flag," with an average length of 2 seconds. Five utterances of each speaker recorded in one session are used to train an SD HMM plus R_train and m_train for the linear transform. For testing, we used 50 utterances recorded from a true speaker in different sessions (different telephone channels at different times) and 200 utterances recorded from 50 impostors of the same gender in different sessions. For model adaptation, the second, fourth, sixth, and eighth test utterances from the tested true speaker are used to update the associated HMM plus R_train and m_train for verifying succeeding test utterances. The target models for the phrases are left-to-right HMM's. The number of states is 1.5 times the total number of phones in the phrase, and four Gaussian components are associated with each state. The background models are concatenated phone HMM's trained on a telephone speech database from different speakers and texts; each phone HMM has three states with 32 Gaussian components per state. Due to unreliable variance estimates from a limited amount of training data, a global variance estimate is used as a common variance for all Gaussian components [7] in the target models. The experimental results are listed in Table 10.1; these are averages of individual equal-error rates (EERs) over the 100 evaluation speakers. The baseline results are obtained with log-likelihood-ratio scores using a phrase-based target model and phone-based speaker background models.
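The book does not spell out how the delta-cepstral coefficients are computed. A common choice, shown here only as an assumption, is the regression formula d_t = Σ_k k (c_{t+k} − c_{t−k}) / (2 Σ_k k²) over a small window:

```python
import numpy as np

def delta(cep, K=2):
    """Delta-cepstrum by the common regression formula (an assumption;
    the chapter does not specify its exact delta computation).
    cep: (T, D) cepstral frames, e.g. D = 12."""
    T = len(cep)
    padded = np.pad(cep, ((K, K), (0, 0)), mode="edge")  # replicate edges
    denom = 2 * sum(k * k for k in range(1, K + 1))
    d = np.zeros_like(cep, dtype=float)
    for k in range(1, K + 1):
        d += k * (padded[K + k : K + k + T] - padded[K - k : K - k + T])
    return d / denom

def features(cep):
    """24-dimensional vector per frame: 12 cepstra plus 12 deltas."""
    return np.hstack([cep, delta(cep)])
```

Appending the deltas captures the local trajectory of the cepstrum, which the static coefficients alone miss.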
The EERs without and with adaptation are 5.98% and 3.94%, respectively. With CMS, the EERs are 3.03% and 1.96%; with the algorithm introduced in this chapter, they are 2.61% and 1.80%.

Table 10.1. Experimental results in average equal-error rates (%)

Algorithm        No Adaptation   With Adaptation
Baseline         5.98            3.94
CMS              3.03            1.96
RST (presented)  2.61            1.80
10.6 Conclusions

A simple, fast, and efficient algorithm for robust speaker verification with stochastic matching was presented and applied to a general-phrase speaker verification system. In the experiments without model adaptation, the algorithm improves the relative EER by 56% compared with a baseline system without any stochastic matching, and by 14% compared with a system using CMS. When model adaptation is applied, the improvements are 54% and 8%, respectively; less improvement is obtained because the SD models are updated to fit the different acoustic conditions. The presented algorithm can also be applied to speaker identification and other applications to improve system robustness.
References

1. Atal, B. S., "Automatic recognition of speakers from their voices," Proceedings of the IEEE, vol. 64, pp. 460–475, 1976.
2. Atal, B. S., "Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification," Journal of the Acoustical Society of America, vol. 55, pp. 1304–1312, 1974.
3. Furui, S., "Cepstral analysis techniques for automatic speaker verification," IEEE Trans. Acoust., Speech, Signal Processing, vol. 27, pp. 254–277, April 1981.
4. Li, Q., Parthasarathy, S., and Rosenberg, A. E., "A fast algorithm for stochastic matching with application to robust speaker verification," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, (Munich), pp. 1543–1547, April 1997.
5. Mammone, R. J., Zhang, X., and Ramachandran, R. P., "Robust speaker recognition," IEEE Signal Processing Magazine, vol. 13, pp. 58–71, Sept. 1996.
6. Mansour, D. and Juang, B.-H., "A family of distortion measures based upon projection operation for robust speech recognition," IEEE Trans. Acoust., Speech, Signal Processing, vol. 37, pp. 1659–1671, November 1989.
7. Parthasarathy, S. and Rosenberg, A. E., "General phrase speaker verification using sub-word background models and likelihood-ratio scoring," in Proceedings of ICSLP-96, (Philadelphia), October 1996.
8. Rahim, M. G. and Juang, B.-H., "Signal bias removal by maximum likelihood estimation for robust telephone speech recognition," IEEE Transactions on Speech and Audio Processing, vol. 4, pp. 19–30, January 1996.
9. Rosenberg, A. E., Lee, C.-H., and Soong, F. K., "Cepstral channel normalization techniques for HMM-based speaker verification," in Proceedings of Int. Conf. on Spoken Language Processing, (Yokohama, Japan), pp. 1835–1838, 1994.
10. Sankar, A. and Lee, C.-H., "A maximum-likelihood approach to stochastic matching for robust speech recognition," IEEE Transactions on Speech and Audio Processing, vol. 4, pp. 190–202, May 1996.
11. Surendran, A. C., Maximum-likelihood stochastic matching approach to non-linear equalization for robust speech recognition. PhD thesis, Rutgers University, Busch, NJ, May 1996.
Chapter 11 Randomly Prompted Speaker Verification
In the previous chapters, we introduced algorithms for fixed-phrase speaker verification. In this chapter, we introduce an algorithm for randomly prompted speaker verification: instead of having the user select and remember a passphrase, the system randomly displays a phrase and asks the user to read it, so the user does not need to remember a phrase at all. In our approach, a modified linear discriminant analysis technique, referred to here as normalized discriminant analysis (NDA), is presented. Using this technique, it is possible to design an efficient linear classifier with very limited training data and to generate normalized discriminant scores with comparable magnitudes across different classifiers. The NDA technique is applied to a speaker verification classifier based on speaker-specific information obtained when utterances are processed with speaker-independent models. The algorithm has shown significant improvement in speaker verification performance. This research was originally reported by the author, Parthasarathy, and Rosenberg in [7].
Q. Li, Speaker Authentication, Signals and Communication Technology, DOI: 10.1007/978-3-642-23731-7_11, © Springer-Verlag Berlin Heidelberg 2012

11.1 Introduction

As introduced in Chapter 1, speaker recognition is categorized into two major areas: speaker identification and speaker verification. Speaker identification is the process of associating an unknown speaker with a member of a known population of speakers, while speaker verification is the process of verifying whether an unknown speaker is the same as the speaker whose identity is claimed. Another distinguishing feature of speaker recognition is whether it is text dependent or text independent: text-dependent recognition requires that speakers utter a specific phrase or a given password, while text-independent recognition does not require a specific utterance. Furthermore, text-dependent speaker verification systems can be divided into two categories: fixed-phrase and randomly prompted speaker verification. A voice password such as "open sesame" is a fixed phrase that the user must remember. On the other hand, a bank ATM or a speaker verification system can display a sequence of random numbers, or even a random phrase, to prompt the user to read; this kind of system is called randomly prompted speaker verification, and the user does not need to remember a passphrase. Randomly prompted utterances can provide a higher level of security than fixed passwords or passphrases. The advantages of a randomly prompted system are that users do not need to remember a passphrase and that the prompts can be changed easily from time to time. This chapter focuses on text-dependent, randomly prompted speaker verification. We use connected digits as the prompted phrases, with an 11-word vocabulary comprising the digits "0" through "9" plus "o". In training, 11 connected-digit utterances are recorded in one session for each speaker; the training utterances are designed so that each of the 11 words appears five times in different contexts, giving five tokens of each digit per speaker. In testing, either a fixed 9-digit test utterance or randomly selected 4-digit test utterances are recorded in different sessions. As reported in previous chapters and the references therein, conventional training algorithms for speaker verification are based on maximum-likelihood estimation using only training data recorded from the speakers to be modeled. Gradient-descent-based discriminative training algorithms [12], neural tree algorithms [3, 11], and linear discriminant analysis (LDA) [16, 17, 13] have been used for speaker verification. The discriminant approaches provide a potential modeling advantage because they account for the separation of each designated speaker from a group of other speakers. A text-dependent, connected-digit speaker verification system often consists of different classifiers for different words for each speaker.
The concept of principal feature classification introduced in Chapter 3 [9, 10, 8, 18] can be applied in the design of these classifiers, and linear discriminant analysis [2, 5, 4] can be used to find the principal features. However, two problems arise when LDA is used to design these classifiers: the amount of training data is usually small, and the discriminant scores obtained from different classifiers are scaled differently, making them hard to compare and combine. In this chapter, we therefore introduce a normalized discriminant analysis (NDA) technique to address these problems. The NDA is applied to design a hybrid speaker verification (HSV) system. As described by Setlur et al. [16, 17], the system combines two types of word models, or classifiers. (We use the term classifier when decoding is not used.) The first type of classifier is a speaker-dependent, continuous-density hidden Markov model (HMM); this representation has been shown to provide good performance for connected-digit password speaker verification [15]. The second type of classifier is based on speaker-specific information that can be obtained when password utterances are processed with speaker-independent (SI) HMM's. The mixture components of an SI Gaussian
mixture HMM are found by clustering training data from a wide variety of speakers and recording conditions. For a particular state of a particular word, each such component is representative of some subset of the training data. When a test utterance is processed, the score for each test vector is calculated as a weighted sum of mixture-component likelihoods. Because of the way the mixture-component parameters are trained, it is reasonable to expect that each test speaker will have a different characteristic distribution of likelihood scores across these components. The second type of classifier is based on these characteristic mixture-component likelihood distributions obtained over the training utterances for each speaker.
Fig. 11.1. The structure of a hybrid speaker verification (HSV) system.
Although the second type of speaker classifier yields significantly lower performance than the first type, it has been shown in [16, 17] that, when combined, the two representations yield significantly better verification performance than either one by itself. The aspects that distinguish our study from Setlur et al. [16, 17] fall in three areas: first, NDA is used instead of Fisher linear discriminant analysis; second, verification is carried out on a database recorded over a long-distance telephone network under a variety of recording and channel conditions; third, only small amounts of training data are available per speaker. The HSV system, similar to [16, 17], is shown in Figure 11.1. It consists of three modules: a Type 1 HMM classifier, a Type 2 discriminant analysis classifier, and a data fusion layer. We note that an HSV system could include more classifiers, as long as each individual classifier provides independent information. The structure of the HSV system is similar to the mixture decomposition discrimination (MDD) system in [16], but there are some differences between the two: in HSV, a modified discriminant analysis improves the performance of the NDA classifier; the database used in this research is noisier; the amount of training data is much smaller; and the use of the data is restricted to long-distance telephone applications.
11.2 Normalized Discriminant Analysis

To classify segmented words between a true speaker and impostors, we could in general apply the principal feature classification (PFC) presented in Chapter 3 [8, 18] to train classifiers for each word of each speaker. However, for this training problem, the true-speaker class has only 5 training data vectors, each with 96 elements, and the training data sets are linearly separable. As a special case of the PFC, we therefore only need to find the first principal feature by LDA. Word-level discrimination between a true speaker and impostors is a two-class classification problem. We found, from analysis of the training data, that the data is linearly separable into the speaker classes, so Fisher's LDA is a simple and effective tool for this classification problem. Using LDA, a principal feature, the weight vector w, is found such that the projected data from the true speaker and impostors are maximally separated. In brief, for two-class LDA, w can be solved directly as

w = S_W^{-1} (m_T - m_I),      (11.1)

where m_T and m_I are the sample means of the two classes, true speaker and impostors, and S_W is usually defined as

S_W = \sum_{x \in X_T} (x - m_T)(x - m_T)^t + \sum_{x \in X_I} (x - m_I)(x - m_I)^t,      (11.2)
where X_T and X_I are the data matrices of the true speaker and impostors; S_W must be nonsingular. Each row in the matrices represents one training data vector. More details on LDA can be found in [2, 4, 5]. However, in practical speaker verification applications, there are usually only a few training vectors for each true speaker; for example, only five vectors are available in our experiments. To compensate for this lack of training data, we redefine S_W in (11.2) as

\hat{S}_W = R_T + \gamma R_{CT} + R_I + \delta R_{CI},      (11.3)
where R_T and R_I are the sample covariance matrices from the true speaker and impostors, and R_{CT} and R_{CI} are compensating covariance matrices from another available group of speakers (not used in the evaluation). R_{CI} is the sample covariance matrix of the additional speakers with their data pooled. Actually, R_I and R_{CI} could be combined, except that we may want to weight the associated data sets differently. R_{CT} is defined as

R_{CT} = \frac{1}{T_s} \sum_{i=1}^{T_s} R_i,      (11.4)
where R_i is the sample covariance matrix of Speaker i in the other group, and T_s is the total number of speakers in the group. The weight factors \gamma and \delta are determined experimentally. An LDA score p of a data vector x is obtained by projecting x onto the weight vector w: p = w^t x. To make the scores comparable across different words and different speakers, we use the following normalization:

\hat{p} = \alpha w^t x + \beta,      (11.5)

where \alpha = 2/d, \beta = -1 - 2\mu_I/d, and d = \mu_T - \mu_I. Here \mu_T and \mu_I are the means of the projected data from the true speaker and impostors. After normalization, \mu_T and \mu_I are located at +1 and -1, respectively. \hat{p} is the NDA score.
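The NDA training procedure of (11.1), (11.3), and (11.5) can be sketched in a few lines of code. This is a minimal illustration, not the chapter's implementation: the dimensions, the random data, the stand-in compensating matrices, and the values of γ and δ are all hypothetical.

```python
import numpy as np

def train_nda(X_true, X_imp, R_CT, R_CI, gamma=0.5, delta=0.5):
    """Train a word-level NDA classifier (a sketch of Eqs. 11.1-11.5).

    X_true, X_imp: rows are training vectors for the true speaker / impostors.
    R_CT, R_CI: compensating covariance matrices from a held-out speaker group.
    gamma, delta: experimentally determined weights (hypothetical values here).
    """
    m_T, m_I = X_true.mean(axis=0), X_imp.mean(axis=0)
    # Scatter matrices, i.e., the per-class sums in Eq. (11.2)
    R_T = np.cov(X_true, rowvar=False, bias=True) * len(X_true)
    R_I = np.cov(X_imp, rowvar=False, bias=True) * len(X_imp)
    S_hat = R_T + gamma * R_CT + R_I + delta * R_CI        # Eq. (11.3)
    w = np.linalg.solve(S_hat, m_T - m_I)                  # Eq. (11.1)
    mu_T, mu_I = w @ m_T, w @ m_I                          # projected class means
    d = mu_T - mu_I
    alpha, beta = 2.0 / d, -1.0 - 2.0 * mu_I / d           # Eq. (11.5)
    return w, alpha, beta

def nda_score(x, w, alpha, beta):
    return alpha * (w @ x) + beta                          # normalized NDA score

rng = np.random.default_rng(0)
dim = 8                                                    # 96 in the chapter
X_true = rng.normal(1.0, 0.3, size=(5, dim))               # 5 vectors per true speaker
X_imp = rng.normal(0.0, 0.5, size=(100, dim))
R_comp = np.eye(dim)                                       # stand-in compensating matrices
w, alpha, beta = train_nda(X_true, X_imp, R_comp, R_comp)
print(nda_score(X_true.mean(axis=0), w, alpha, beta))      # projected true-speaker mean -> +1
print(nda_score(X_imp.mean(axis=0), w, alpha, beta))       # projected impostor mean -> -1
```

Note that with only 5 true-speaker vectors in 8 dimensions, R_T alone is rank-deficient; the compensating terms in (11.3) are what make \hat{S}_W invertible, which mirrors the motivation given in the text.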
11.3 Applying NDA in the Hybrid Speaker-Verification System

NDA can be used to design Type 2 classifiers for speaker verification. The classifiers can be used separately or as a module in the HSV system [16, 17].

11.3.1 Training of the NDA System

As described earlier, the Type 2 classifier features are determined from speaker-independent (SI) HMM's. Each training or test utterance is first segmented into words and states. As shown in Figure 11.2, we use the averaged outputs of the Gaussian components on the HMM states as one fixed-length feature vector for the NDA training. The elements of the feature vector are defined as follows:

x_{jm} = \frac{1}{T_j} \sum_{t=1}^{T_j} \log N(o_t, \mu_{jm}, R_{jm}),   j = 1, ..., J;  m = 1, ..., M_j,      (11.6)

where o_t is the cepstral feature vector at time frame t, \mu_{jm} and R_{jm} are the mean and covariance of the mth mixture component for state j, N(.) is a Gaussian function, and T_j is the total number of frames segmented into state j.
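The feature mapping of (11.6) can be sketched as follows. The toy frame data, the state alignment, and the Gaussian parameters are invented for illustration, and diagonal covariances are assumed for simplicity; only the 6-state, 16-component, 24-dimensional layout matches the chapter.

```python
import numpy as np

def log_gaussian(o, mu, var):
    """Log density of a diagonal-covariance Gaussian N(o; mu, diag(var))."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (o - mu) ** 2 / var)

def nda_features(frames, state_of_frame, means, variances):
    """Map a word's frame sequence to a fixed-length NDA feature vector (Eq. 11.6).

    frames:         (T, D) cepstral vectors o_t for one segmented word
    state_of_frame: (T,) state index j assigned to each frame by the SI HMM
    means, variances: (J, M, D) per-state, per-mixture Gaussian parameters
    Returns a vector of length J*M, one element x_{jm} per Gaussian component.
    """
    J, M, _ = means.shape
    x = np.zeros((J, M))
    for j in range(J):
        seg = frames[state_of_frame == j]           # frames segmented into state j
        for m in range(M):
            x[j, m] = np.mean([log_gaussian(o, means[j, m], variances[j, m])
                               for o in seg])       # average over the T_j frames
    return x.ravel()                                # fixed length J x M

# Toy example: 6 states x 16 mixtures on 24-dim cepstra, as in the chapter
rng = np.random.default_rng(1)
J, M, D = 6, 16, 24
frames = rng.normal(size=(30, D))                   # 30 frames for one word
state_of_frame = np.repeat(np.arange(J), 5)         # 5 frames aligned to each state
feat = nda_features(frames, state_of_frame,
                    rng.normal(size=(J, M, D)), np.ones((J, M, D)))
print(feat.shape)   # (96,), matching the 6 x 16 = 96 elements in the text
```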
Fig. 11.2. The NDA feature extraction. For each HMM state j and each mixture component m = 1, ..., M_j, the log Gaussian outputs log N(o_t, \mu_{jm}, R_{jm}) are averaged (1/T \sum) over the frames o_1, ..., o_T segmented into that state, where o_t is the cepstral feature vector at time frame t; the state-level features feed a word-level classification stage that produces the decision output.
Thus, a sequence of cepstral feature vectors associated with one segmented word is mapped onto a fixed-length feature vector. The length of the feature vector is equal to the total number of Gaussian components of the word HMM (J × M). For example, if a word HMM has 6 states and each state has 16 components, the length of the NDA feature vector is 6 × 16 = 96. Feature extraction is almost identical to the technique in [16, 17] except that the HMM mixture weights are omitted, since they are absorbed in the NDA calculation. The structure of the word and utterance verification for one speaker is shown in Figure 11.3. There is an NDA classifier for each word. An utterance score S_{NDA}(O) is a weighted sum of the NDA scores of all words in the utterance:

S_{NDA}(O) = \frac{1}{L} \sum_{i=1}^{L} u_{k_i} \hat{p}_{k_i},   k_i \in \{1, ..., 11\},      (11.7)
Fig. 11.3. The Type 2 classifier (NDA system) for one speaker. The feature vectors of each word ("1", "2", ..., "oh") are scored by a per-word NDA classifier ("Word k vs. all others"); the word scores, weighted by u_1, ..., u_{11}, are averaged (1/L \sum u_k d_k) to produce S_{NDA}.
where L is the length of the utterance O (the total number of words in the utterance) and \hat{p}_{k_i} is the NDA score for the ith word. Equation (11.7) specifies a linear node with associated weights u_{k_i} that can be determined by optimal data fusion [1] to equalize the performance across words if sufficient training data is available.

11.3.2 Training of the HMM System

For the Type 1 classifier, the HMM scores are calculated from speaker-dependent (SD) HMM models. Cohort normalization is applied by selecting five scores from a group of speakers not in the evaluation group. Cepstral mean normalization is applied in both the HMM and NDA classifiers. Usually, the verification score S for word W and Speaker I is calculated by

S\{O|W, I\} = \frac{1}{T} \max \left\{ \sum_{t=1}^{T} \log b_j(o_t) \right\},   j = 1, ..., J,      (11.8)
where O = \{o_1, o_2, ..., o_T\} is the sequence of T raw data vectors segmented for word W, J is the total number of states, and b_j is the mixture Gaussian likelihood for state j:

b_j(o_t) = Pr(o_t|j) = \sum_{m=1}^{M} c_{jm} N(o_t, \mu_{jm}, R_{jm}),      (11.9)
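The state likelihood of (11.9) is a weighted sum of Gaussian component densities. A minimal sketch follows; the component weights, means, and variances are made-up toy values, and a diagonal covariance is assumed for simplicity.

```python
import numpy as np

def log_gauss(o, mu, var):
    """Log of a diagonal-covariance Gaussian density N(o; mu, diag(var))."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (o - mu) ** 2 / var)

def mixture_likelihood(o, weights, means, variances):
    """b_j(o_t) of Eq. (11.9): weighted sum of the state's M Gaussian components."""
    return sum(c * np.exp(log_gauss(o, mu, var))
               for c, mu, var in zip(weights, means, variances))

# Toy state with M = 2 components in one dimension; weights c_{jm} sum to one
w = np.array([0.4, 0.6])
mu = [np.array([0.0]), np.array([2.0])]
var = [np.array([1.0]), np.array([1.0])]
o = np.array([1.0])
b = mixture_likelihood(o, w, mu, var)
print(b)   # 0.4*N(1; 0, 1) + 0.6*N(1; 2, 1)
```

At the midpoint o = 1 both components contribute the same density, so b equals the standard normal density at 1, about 0.242.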
which is a weighted sum of all Gaussian components at state j, where M is the total number of components and \mu_{jm} and R_{jm} are the mean vector and covariance matrix of the mth component at state j.

Scores for Classification

The HMM recognition is based on the log-likelihood scores calculated from the speech portion of the recording [14, 15]. We use SD HMM's for the scores, with word segmentations and labels provided by SI HMM's. For a sequence of T feature vectors for a word W, O = \{o_1, o_2, ..., o_T\}, the likelihood of the sequence given the model for word W and speaker I, \lambda_{WI}, is

Pr\{O|\lambda_{WI}\} = \max_{\{s_t(W)\}} \prod_{t=1}^{T} a_{s_t s_{t+1}} b_{s_t}(o_t),      (11.10)

where \max_{\{s_t(W)\}} implies a Viterbi search to obtain the optimal segmentation of the vectors into states [6], and a_{s_t s_{t+1}} is the state-transition probability from state s_t to state s_{t+1}. In this implementation, all state-transition probabilities are set to 0.5 for verification, so they play no role in classification. b_j is the mixture Gaussian likelihood for state j. It is defined as

b_j(o_t) = Pr(o_t|j) = \sum_{m=1}^{M} c_{jm} N(o_t, \mu_{jm}, R_{jm}),      (11.11)
which is a weighted sum of all Gaussian components for state j, where M is the total number of components and \mu_{jm} and R_{jm} are the mean vector and covariance matrix of the mth component at state j. Subscripts for W and I are omitted from a and b for clarity. Thus, the verification score using the model of word W for Speaker I is calculated by

S\{O|\lambda_{WI}\} = \frac{1}{T} \max_{\{s_t(W)\}} \sum_{t=1}^{T} \log b_{s_t}(o_t).      (11.12)
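The Viterbi computation behind (11.12) can be sketched as follows, assuming a left-to-right HMM. Since all transition probabilities are equal (0.5 in the chapter), they add only a constant and are dropped; the emission matrix and its values are invented for illustration.

```python
import numpy as np

def viterbi_log_score(log_b):
    """Verification score of Eq. (11.12): best state path over log emission scores.

    log_b: (T, J) array with log_b[t, j] = log b_j(o_t) for a left-to-right HMM.
    With uniform transition probabilities, only the emission terms matter.
    """
    T, J = log_b.shape
    delta = np.full(J, -np.inf)
    delta[0] = log_b[0, 0]                              # path must start in state 0
    for t in range(1, T):
        stay = delta                                    # remain in the same state
        move = np.concatenate(([-np.inf], delta[:-1]))  # advance one state
        delta = np.maximum(stay, move) + log_b[t]
    return delta[J - 1] / T                             # end in last state, averaged

# Toy example: 4 frames, 2 states; frames 0-1 fit state 0, frames 2-3 fit state 1
log_b = np.log(np.array([[0.9, 0.1],
                         [0.8, 0.2],
                         [0.2, 0.8],
                         [0.1, 0.9]]))
print(viterbi_log_score(log_b))   # best path is 0,0,1,1
```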
Cohort Normalization Scores

The concept of cohort normalization comes from the Neyman-Pearson classification rule; it brings discriminant information into the HMM classification:

\frac{Pr\{O|\lambda_I\}}{Pr\{O|\lambda_{\bar{I}}\}} > 1,      (11.13)

where Pr\{O|\lambda_I\} is the likelihood score of the utterance O given the model for speaker I, and Pr\{O|\lambda_{\bar{I}}\} is the likelihood from speaker models other than I.
Applying the rule to log-likelihood scores, we have the cohort-normalized score for an utterance O as

S_{HMM}\{O|I\} = S\{O|\lambda_I\} - \frac{1}{K} \sum_{k=1, k \neq I}^{K} S\{O|\lambda_k\},      (11.14)

where we associate S\{O|\lambda_I\} with \log Pr\{O|\lambda_I\} and \frac{1}{K} \sum_{k=1, k \neq I}^{K} S\{O|\lambda_k\} with \log Pr\{O|\lambda_{\bar{I}}\}. The models \lambda_1 to \lambda_K are a group of selected cohort models; they are similar in some sense to \lambda_I. The similarity measurement was defined in [14].

11.3.3 Training of the Data Fusion Layer

The final decision on a given utterance O, d(O), is made by combining the NDA score S_{NDA}(O) and the HMM score S_{HMM}(O) and applying a hard-limiter threshold:

S = v_1 S_{NDA}(O) + v_2 S_{HMM}(O),      (11.15)

d(O) = \begin{cases} 1, & S > \theta, \text{ true speaker;} \\ 0, & S \leq \theta, \text{ impostor,} \end{cases}      (11.16)
where v_1 and v_2 are weight values trained by LDA in the same way that w in (11.1) and (11.2) is determined, with X_T, m_T and X_I, m_I replaced by the HMM and NDA scores and their associated means. The scores are obtained from a group of speakers not used in the evaluation. This is an SI output node of the HSV system.
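The cohort normalization of (11.14) and the fusion decision of (11.15)-(11.16) are simple enough to sketch together. All numeric values below are hypothetical; in the actual system v_1 and v_2 would be trained by LDA on held-out scores, as described above.

```python
import numpy as np

def cohort_normalized_score(s_target, s_cohorts):
    """Eq. (11.14): target log score minus the mean of K cohort log scores."""
    return s_target - np.mean(s_cohorts)

def hsv_decision(s_nda, s_hmm, v1, v2, theta):
    """Eqs. (11.15)-(11.16): fuse the two scores and apply a hard limiter."""
    S = v1 * s_nda + v2 * s_hmm
    return 1 if S > theta else 0     # 1: true speaker, 0: impostor

# Hypothetical scores: the target model fits the utterance better than the
# five cohort models, so the normalized score is positive.
s_hmm = cohort_normalized_score(-1.2, [-2.0, -1.9, -2.1, -1.8, -2.2])
print(s_hmm)                                        # -1.2 - (-2.0) = 0.8
print(hsv_decision(0.9, s_hmm, v1=0.6, v2=0.4, theta=0.5))
```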
11.4 Speaker Verification Experiments

In this section, we introduce the database and our experimental results for a randomly prompted speaker verification task.

11.4.1 Experimental Database

The database consists of approximately 6000 connected-digit utterances recorded over dialed-up telephone lines. The vocabulary includes eleven words: the digits "0" through "9" plus "oh". The database is partitioned into four subsets as shown in Table 11.1. There are 43 speakers in Roster A and 42 in Roster B. For each speaker, eleven 5-digit utterances designated for training were recorded in a single session over a single channel in A_s and B_s. These utterances are designed so that each digit appears five times in different contexts. Each speaker has a group of test utterances in A_m and B_m. These utterances were recorded over a series of sessions with a variety of
Table 11.1. Segmentation of the Database

                      Roster A   Roster B
Training utterances   A_s        B_s
Test utterances       A_m        B_m
handsets and channels. The test utterances in A_m and B_m are either fixed 9-digit utterances or randomly selected 4-digit utterances. An SI HMM-based digit recognizer [15] is used to segment each utterance into words (digits) and to generate raw feature vectors. In the digit recognizer, 10th-order autocorrelation vectors are computed over a 45 ms window shifted every 15 ms through the utterance. Each set of autocorrelation coefficients is converted to a set of 12 cepstral coefficients via linear predictive coding (LPC) coefficients. These cepstral coefficients are further augmented by a set of 12 delta cepstral coefficients calculated over a five-frame window of cepstral coefficients. Each "raw" data vector thus has 24 elements: the 12 cepstral coefficients and the 12 delta cepstral coefficients [15].

11.4.2 NDA System Results

Experiments were conducted first to test the NDA classifier. The SI HMM's used to obtain the NDA features as in (11.6) were trained from a distinct database of connected-digit utterances. These HMM's have six states for the words "0" through "9" and five states for the word "oh". Each state has 16 Gaussian components, so, for a six-state HMM, the NDA features have 6 × 16 = 96 elements. For each true speaker in Roster A, R_T in (11.3) was calculated using utterances from A_s; R_I was obtained from B_s, R_{CT} from both B_s and B_m, and R_{CI} from B_m. The results are not very sensitive to the γ and δ parameters for these data sets. To calculate (11.7), we use u_{k_i} = 1 due to the lack of training data. The results in terms of averaged individual equal-error rates (EER's) are listed in Table 11.2. An EER of 6.13% was obtained with NDA using both score normalization (11.5) and pooled covariance matrices (11.3). With only score normalization (11.5), the EER is 10.12%. Without score normalization and compensating covariance matrices (as in [16, 17]), the equal-error rate was 18.18%. The NDA techniques provided an 82.78% improvement.
11.4.3 Hybrid Speaker-Verification System Results

For the Type 1 classifier, SD HMM's were trained using the utterances in A_s. Five cohort models were constructed from utterances in Roster B. The utterances in A_m were used for testing. The HMM scores in Table 11.3 were
Table 11.2. Results on Discriminant Analysis

Algorithms             Scores         Cov. Matrices   EER %
NDA                    Normalized     Pooled          6.13
NDA                    Normalized     Unpooled        10.12
LDA (as in [16, 17])   Unnormalized   Unpooled        18.18

(1,514 true-speaker utterances; 23,730 impostor utterances)
obtained from the experiments in [15]; the NDA scores were obtained from the current experiments. To obtain the common weight values v_1 and v_2 in (11.15) for all speakers, both Type 1 and Type 2 classifiers were trained using the data set B_s; v_1 and v_2 were then formed by LDA using the output scores from the data set B_m. The major results are listed in Table 11.3.

Table 11.3. Major Results

                 Equal-error rates (%)
Systems          Mean    Median
HSV with NDA     4.32    3.14
HMM-cohort       5.30    4.35
HMM              9.41    7.42
NDA              8.68    8.15

(1,514 true-speaker utterances; 11,620 impostor utterances)
The HSV system reduced the verification EER's by 18.43% (mean) and 27.88% (median) relative to the HMM classifier with cohort normalization. With respect to storage requirements, the HMM classifier needs 51.56 Kb per speaker for model parameters. The NDA classifier needs 4 × [(96 − 1) × 10 + (80 − 1)] = 4,116 bytes ≈ 4.116 Kb per speaker, so the HSV system needs only slightly more storage than the HMM system.
11.5 Conclusions

In this chapter, we introduced a randomly prompted speaker verification system through a real system design. In our experiments, the NDA technique showed an 82.78% relative improvement in performance over the classifier using Fisher's LDA. Furthermore, when the NDA is used in a hybrid speaker verification system combining information from speaker-dependent and speaker-independent models, speaker verification performance improved by 18% relative to the HMM classifier with cohort normalization.

From the author's experience, whenever we demonstrate a speaker verification system, someone in the audience asks the following question: if an impostor pre-records a user's passphrase and plays it back to a speaker verification system, will the impostor be accepted? We now have two answers to this question. First, a fixed-phrase speaker verification system with special techniques can prevent impostors from using pre-recorded speech to break the system. Second, the randomly prompted speaker verification system introduced in this chapter can fully address this security concern. We note that although the system introduced here uses connected digits as the passphrases, the proposed algorithm can be extended to connected words or sentences; however, the training utterances need to be carefully designed to cover all phonemes that may appear in the randomly prompted passphrases.

The LDA used in this chapter is a discriminative training algorithm; however, its objective is only to separate the classes as much as possible. The LDA objective does not consider optimizing system performance directly, such as minimizing error rates. In the next two chapters, we will discuss discriminative training objectives and algorithms which can optimize speaker recognition performance.
References

1. Chair, Z. and Varshney, P. K., "Optimal data fusion in multiple sensor detection systems," IEEE Transactions on Aerospace and Electronic Systems, vol. AES-22, pp. 98-101, January 1986.
2. Duda, R. O. and Hart, P. E., Pattern Classification and Scene Analysis. New York: John Wiley & Sons, 1973.
3. Farrell, K. R., Mammone, R. J., and Assaleh, K. T., "Speaker recognition using neural networks and conventional classifiers," IEEE Transactions on Speech and Audio Processing, vol. 2, Part II, January 1994.
4. Fisher, R. A., "The statistical utilization of multiple measurements," Annals of Eugenics, vol. 8, pp. 376-386, 1938.
5. Johnson, R. A. and Wichern, D. W., Applied Multivariate Statistical Analysis. New Jersey: Prentice Hall, 1988.
6. Lee, C.-H. and Rabiner, L. R., "A frame-synchronous network search algorithm for connected word recognition," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, pp. 1649-1658, November 1989.
7. Li, Q., Parthasarathy, S., Rosenberg, A. E., and Tufts, D. W., "Normalized discriminant analysis with application to a hybrid speaker-verification system," in IEEE International Conference on Acoustics, Speech, and Signal Processing, (Atlanta), May 1996.
8. Li, Q. and Tufts, D. W., "Improving discriminant neural network (DNN) design by the use of principal component analysis," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, (Detroit, MI), pp. 3375-3379, May 1995.
9. Li, Q. and Tufts, D. W., "Synthesizing neural networks by sequential addition of hidden nodes," in Proceedings of the IEEE International Conference on Neural Networks, (Orlando, FL), pp. 708-713, June 1994.
10. Li, Q., Tufts, D. W., Duhaime, R., and August, P., "Fast training algorithms for large data sets with application to classification of multispectral images," in Proceedings of the IEEE 28th Asilomar Conference, (Pacific Grove), October 1994.
11. Liou, H. S. and Mammone, R. J., "A subword neural tree network approach to text-dependent speaker verification," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, (Detroit, MI), pp. 357-360, May 1995.
12. Liu, C. S., Lee, C.-H., Chou, W., Juang, B.-H., and Rosenberg, A. E., "A study on minimum error discriminative training for speaker recognition," Journal of the Acoustical Society of America, vol. 97, pp. 637-648, January 1995.
13. Netsch, L. P. and Doddington, G. R., "Speaker verification using temporal decorrelation post-processing," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1992.
14. Rosenberg, A. E. and DeLong, J., "HMM-based speaker verification using a telephone network database of connected digit utterances," Technical Memorandum BL0112693-120623TM, AT&T Bell Laboratories, December 1993.
15. Rosenberg, A. E., DeLong, J., Lee, C.-H., Juang, B.-H., and Soong, F. K., "The use of cohort normalized scores for speaker verification," in Proceedings of the International Conference on Spoken Language Processing, (Banff, Alberta, Canada), pp. 599-602, October 1992.
16. Setlur, A. R., Sukkar, R. A., and Gandhi, M. B., "Speaker verification using mixture likelihood profiles extracted from speaker independent hidden Markov models," submitted to International Conference on Acoustics, Speech, and Signal Processing, 1996.
17. Sukkar, R. A., Gandhi, M. B., and Setlur, A. R., "Speaker verification using mixture decomposition discrimination," Technical Memorandum NQ832030095013001TM, AT&T Bell Laboratories, January 1995.
18. Tufts, D. W. and Li, Q., "Principal feature classification," in Neural Networks for Signal Processing V, Proceedings of the 1995 IEEE Workshop, (Cambridge, MA), August 1995.
Chapter 12 Objectives for Discriminative Training
The first step in discriminative training is to define an objective function. In this chapter, the relations among a class of discriminative training objectives are derived through theoretical analysis. The objectives selected for our discussion are minimum classification error (MCE), maximum mutual information (MMI), minimum error rate (MER), and generalized minimum error rate (GMER). The author's analysis shows that all of these objectives can be related to both minimum error rates and the maximum a posteriori probability [10]. In theory, the MCE and GMER objectives are more general and flexible than the MMI and MER objectives, and MCE and GMER go beyond Bayesian decision theory. The results and the analytical methods used in this chapter can help in judging and evaluating discriminative objectives, and in defining new objectives for different tasks and better performance. We note that although our discussion is based on speaker recognition applications, the analysis can be further extended to speech recognition tasks.
12.1 Introduction

In previous chapters, we have applied the expectation-maximization (EM) and linear discriminant analysis (LDA) algorithms in training acoustic models. It has been reported that discriminative training techniques provide significant improvements in recognition performance compared to the traditional maximum-likelihood (ML) objective in speech and speaker recognition as well as in language processing. These discriminative objectives include minimum classification error (MCE) [6, 7], maximum mutual information (MMI) [1, 15, 16], minimum error rate (MER) [3], and a recently proposed generalized minimum error rate (GMER) objective [11, 12, 13], as well as other related and newer versions. The most important task in these discriminative training algorithms is to define the objective function. Understanding the existing discriminative
Q. Li, Speaker Authentication, Signals and Communication Technology, DOI: 10.1007/978-3-642-23731-7_12, © Springer-Verlag Berlin Heidelberg 2012
objectives will help in defining new objectives for different applications and achieving better performance. Among these objectives, MCE and MMI have been used for years, and both have shown performance gains over the ML objective in speech and speaker recognition experiments (e.g., [5, 8, 16, 19]). Consequently, research has been conducted to compare their performance through experiments or theoretical analysis (e.g., [14, 17, 18]); however, the experimental comparisons are limited to particular tasks, and the results do not help in understanding the theory. On the other hand, the previous theoretical analyses are not conclusive or adequate to show the relations among these objectives. In this chapter, we intend to derive the relations among the discriminative objectives theoretically and conclusively, without bias toward any particular task. The analytical method used here can be applied to define, judge, evaluate, and compare other discriminative objectives as well. In the following sections, we will first review the relations between error rates and the a posteriori probability, then derive the relations between the discriminative objectives and either the a posteriori probability or the error rates, and finally distinguish the relations among the different objectives.
12.2 Error Rates vs. Posterior Probability

In an M-class classification problem, we are asked to make a decision in order to identify a sequence of observations, x, as a member of a class, say, C_i. The true identity of x, say, C_j, is unknown, except in the design or training phase, in which observations of known identity are used as references for parameter optimization. We denote by a_i the action of identifying an observation as class C_i. The decision is correct if i = j; otherwise, it is incorrect. It is natural to seek a decision rule that minimizes the probability of error or, empirically, the error rate, which entails a zero-one loss function:

L(a_i|C_j) = \begin{cases} 0, & i = j \\ 1, & i \neq j, \end{cases}   i, j = 1, ..., M.      (12.1)

It assigns no loss to a correct decision and a unit loss to an error. The probabilistic risk of a_i corresponding to this loss function is

R(a_i|x) = \sum_{j=1}^{M} L(a_i|C_j) P(C_j|x) = \sum_{j \neq i} P(C_j|x)      (12.2)
         = 1 - P(C_i|x),      (12.3)
where P(C_i|x) is the a posteriori probability that x belongs to C_i. Thus, the zero-one loss function links the error rates to the a posteriori probability. To
minimize the probability of error, one should therefore maximize the a posteriori probability P(C_i|x). This is the basis of Bayes' maximum a posteriori (MAP) decision theory, and it is also referred to as minimum error rate (MER) [3] in an ideal setup. We note that the a posteriori probability P(C_i|x) is often modeled as P_{\lambda_i}(C_i|x), a function defined by a set of parameters \lambda_i. Since the parameter set \lambda_i has a one-to-one correspondence with C_i, we write P_{\lambda_i}(C_i|x) = P(\lambda_i|x) and other similar expressions without ambiguity. If we consider all M classes and all data samples, an objective for MER can be defined as

\max J(\Lambda) = \frac{1}{N} \sum_{k=1}^{M} \sum_{i=1}^{N_k} P(\lambda_k|x_{k,i}),      (12.4)

where N_k is the total number of training data of class k, N = \sum_{k=1}^{M} N_k, and x_{k,i} is the ith observation (one or a sequence of feature vectors) of class k. \Lambda is the set of model parameters, \Lambda = \{\lambda_k\}_{k=1}^{M}. Since multilayer neural networks perform a similar task in pattern classification, we note that it has been shown that neural networks trained by backpropagation on a sum-squared error objective can approximate the true a posteriori probability in a least-squares sense [3]. In this chapter, we focus on the objectives that have been defined for speech and speaker recognition or other dynamic pattern-recognition problems.
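The MER objective of (12.4) is just the average posterior probability of each training sample's true class. A minimal numeric sketch follows; the likelihood values are illustrative, not from any experiment.

```python
import numpy as np

def posteriors(likelihoods, priors):
    """P(lambda_k | x) via Bayes' rule from class-conditional likelihoods."""
    joint = likelihoods * priors
    return joint / joint.sum()

def mer_objective(all_likelihoods, labels, priors):
    """MER objective of Eq. (12.4): average posterior of each sample's true class.

    all_likelihoods: (N, M) array of p(x_i | lambda_k); labels: true class of x_i.
    """
    return np.mean([posteriors(lik, priors)[k]
                    for lik, k in zip(all_likelihoods, labels)])

# Toy 3-class example with uniform priors (illustrative numbers only)
lik = np.array([[0.8, 0.1, 0.1],    # sample of class 0, classified confidently
                [0.2, 0.5, 0.3],    # sample of class 1, less confidently
                [0.3, 0.3, 0.4]])   # sample of class 2, near the boundary
labels = [0, 1, 2]
print(mer_objective(lik, labels, np.ones(3) / 3))
```

Maximizing this quantity pushes the posterior of each true class toward 1, which by (12.3) minimizes the probability of error.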
12.3 Minimum Classification Error vs. Posterior Probability

The MCE objective was derived through a systematic analysis of classification errors. It introduced a misclassification measure to embed the decision process in an overall MCE formulation. During the derivation, it was also required that the misclassification measure be continuous with respect to the classifier parameters. The empirical average cost, the typical objective in the MCE algorithm, was defined as [6]

\min L(\Lambda) = \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{M} \ell_k(d_k(x_i); \Lambda) \, 1(x_i \in C_k),      (12.5)
where M and N are the total numbers of classes and training data, and \Lambda = \{\lambda_k\}_{k=1}^{M}. It can be rewritten as

\min L(\Lambda) = \frac{1}{N} \sum_{k=1}^{M} \sum_{i=1}^{N_k} \ell_k(d_k(x_{k,i}); \Lambda),      (12.6)
where N_k is the total number of training data of class k, N = \sum_{k=1}^{M} N_k, and x_{k,i} is the ith observation of class k. \ell_k is a loss function, for which a sigmoid is often used:

\ell_k(d_k) = \frac{1}{1 + e^{-\zeta d_k + \alpha}},   \zeta > 0,      (12.7)

where d_k is a class misclassification measure defined as

d_k(x) = -g_k(x; \Lambda) + \left[ \frac{1}{M-1} \sum_{j, j \neq k} g_j(x; \Lambda)^{\eta} \right]^{1/\eta},      (12.8)

where \eta > 0 and g_j(x; \Lambda), j = 1, 2, ..., M and j \neq k, is a set of class conditional-likelihood functions. The second term is also called the Holder norm. A practical class misclassification measure for hidden Markov model (HMM) training was defined as [5]

d_k(x) = -g_k(x; \Lambda) + \log \left[ \frac{1}{M-1} \sum_{j \neq k} \exp[\eta \, g_j(x; \Lambda)] \right]^{1/\eta},      (12.9)
where x = x_{k,i}, and the function g(.) is defined as [5]

g_k(x; \Lambda) = \log p(x|\lambda_k).      (12.10)

Thus, we can rewrite the class misclassification measure in (12.9) as

d_k(x) = -\log p(x|\lambda_k) + \log \left[ \frac{1}{M-1} \sum_{j \neq k} p(x|\lambda_j)^{\eta} \right]^{1/\eta}.      (12.11)

When \eta = 1, we have

d_k(x) = -\log \frac{p(x|\lambda_k)}{\frac{1}{M-1} \sum_{j \neq k} p(x|\lambda_j)}.      (12.12)

It can be further presented as

d_k(x) = -\log \frac{p(x|\lambda_k) P_k}{\sum_{j \neq k} p(x|\lambda_j) P_j},      (12.13)

where P_k = 1 and P_j = \frac{1}{M-1}; these are similar to a priori probabilities if we conduct a normalization. To facilitate the comparison, we convert the minimization problem to a maximization problem. Let

\tilde{d}_k(x) = -d_k(x)      (12.14)
             = \log \frac{p(x|\lambda_k) P_k}{\sum_{j, j \neq k} p(x|\lambda_j) P_j},      (12.15)
and substitute it into the sigmoid function in (12.7). Assuming \zeta = 1 and \alpha = 0, we have

\ell_k(\tilde{d}_k) = \frac{1}{1 + e^{-\tilde{d}_k}}      (12.16)
                   = \frac{p(x|\lambda_k) P_k}{p(x|\lambda_k) P_k + \sum_{j \neq k} p(x|\lambda_j) P_j}      (12.17)
                   = \frac{p(x|\lambda_k) P_k}{\sum_{j=1}^{M} p(x|\lambda_j) P_j}.      (12.18)
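The identity in (12.16)-(12.18) can be checked numerically: with ζ = 1, α = 0, P_k = 1, and P_j = 1/(M−1), the sigmoid of \tilde{d}_k equals the weighted posterior of class k. The likelihood values below are invented for illustration.

```python
import numpy as np

def d_tilde(lik, k, P):
    """Eq. (12.15): log of the true-class weighted likelihood over the competitors."""
    others = np.sum(np.delete(lik * P, k))
    return np.log(lik[k] * P[k] / others)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

M, k = 4, 0
lik = np.array([0.5, 0.2, 0.2, 0.1])            # illustrative p(x | lambda_j)
P = np.full(M, 1.0 / (M - 1)); P[k] = 1.0       # P_k = 1, P_j = 1/(M-1)
lhs = sigmoid(d_tilde(lik, k, P))               # Eq. (12.16)
rhs = lik[k] * P[k] / np.sum(lik * P)           # Eq. (12.18)
print(lhs, rhs)                                 # the two sides agree
```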
Thus, the MCE objective in (12.6) is simplified to

\max \tilde{L}(\Lambda) = \frac{1}{N} \sum_{k=1}^{M} \sum_{i=1}^{N_k} \frac{p(x_{k,i}|\lambda_k) P_k}{\sum_{j=1}^{M} p(x_{k,i}|\lambda_j) P_j}      (12.19)
                       = \frac{1}{N} \sum_{k=1}^{M} \sum_{i=1}^{N_k} P(\lambda_k|x_{k,i}).      (12.20)
This demonstrates that the MCE objective can be simplified to MER as defined in (12.4) and linked to the a posteriori probability if we make the following assumptions:

P_k = 1      (12.21)
P_j = \frac{1}{M-1}      (12.22)
\eta = 1      (12.23)
\zeta = 1      (12.24)
\alpha = 0.      (12.25)

Among the parameters, P_k = 1 and P_j \leq 1 imply that the MCE objective weights the true class at least as highly as the competing classes. The parameter \eta plays the role of the Holder norm in (12.9); by changing \eta, the weights between the true class and the competing classes can be further adjusted. The remaining parameters, \zeta and \alpha, are related to the sigmoid function. \alpha represents the shift of the sigmoid function; since other parameters can play a similar role, \alpha is usually set to zero. \zeta is related to the slope of the sigmoid function; for different tasks and data distributions, different values of \zeta can be selected to achieve the best performance. \zeta is one of the most important parameters in the MCE objective, and it makes the MCE objective flexible and adjustable to different tasks and data distributions.
12.4 Maximum Mutual Information vs. Minimum Classification Error

The objective of MMI was defined in [1] as

I(k) = \log \frac{p(x_{k,i}|\lambda_k) P_k}{\sum_{j=1}^{M} p(x_{k,i}|\lambda_j) P_j}.      (12.26)
In [16], the MMI objective was presented in the form

r(\Lambda) = \prod_{n=1}^{N} p(\lambda_n|x_n)      (12.27)
          = \prod_{i=1}^{N_k} \frac{p(x_{k,i}|\lambda_k) \, p(\lambda_k)}{\sum_{j=1}^{M} p(x_{k,i}|\lambda_j) \, p(\lambda_j)}.      (12.28)
As discussed in [16], MMI increases the a posteriori probability of the model corresponding to the training data; therefore, MMI relates to MER. If we consider all M models and all data as in the above discussion, the complete objective for MMI training is

\max I(\Lambda) = \sum_{k=1}^{M} \sum_{i=1}^{N_k} \log \frac{p(x_{k,i}|\lambda_k) P_k}{\sum_{j=1}^{M} p(x_{k,i}|\lambda_j) P_j}      (12.29)
               = \sum_{k=1}^{M} \sum_{i=1}^{N_k} \log P(\lambda_k|x_{k,i}).      (12.30)
By applying the power series expansion \log z \approx z - 1 (for z near 1), we have

\max I(\Lambda) \approx \sum_{k=1}^{M} \sum_{i=1}^{N_k} \left( P(\lambda_k|x_{k,i}) - 1 \right).      (12.31)
Furthermore, since a constant does not affect the objective, the objective can be written as

\max \hat{I}(\Lambda) = \sum_{k=1}^{M} \sum_{i=1}^{N_k} P(\lambda_k|x_{k,i}).      (12.32)

Since N is a constant, the MMI objective in (12.30) is thus equivalent to

\max \tilde{I}(\Lambda) = \frac{1}{N} \sum_{k=1}^{M} \sum_{i=1}^{N_k} P(\lambda_k|x_{k,i}),      (12.33)

which is equivalent to the simplified version of the MCE objective in (12.20) and to the MER objective in (12.4). In other words, a procedure for optimizing MMI is equivalent to a procedure for optimizing MER, or the simplified version of MCE under the above assumptions.
12.5 Generalized Minimum Error Rate vs. Other Objectives

In order to derive a set of closed-form formulas for fast discriminative training, in [11, 12] we defined a GMER objective as

\max \tilde{J}(\Lambda) = \frac{1}{N} \sum_{m=1}^{M} \sum_{n=1}^{N_m} \ell(d_{m,n}),      (12.34)

where \ell(d_{m,n}) = \frac{1}{1 + e^{-\zeta d_{m,n}}} is a sigmoid function and

d_{m,n} = \log p(x_{m,n}|\lambda_m) P_m - L_m \log \sum_{j \neq m} p(x_{m,n}|\lambda_j) P_j,      (12.35)
where 0 < L_m \leq 1 is a weighting scalar. Intuitively, L_m represents the weighting between the true class m and the competing classes j \neq m. When L_m < 1, the true class m is more important than the competing classes; when L_m = 1, the true class and the competing classes are equally important. The exact value of L_m can be determined during estimation, based on the constraint that the estimated covariance matrices must be positive-definite. The sigmoid function plays a role similar to its role in the MCE objective: it provides different weights to different training data. For data whose classification is hardly ambiguous, the weight is close to 0 (i.e., decisively wrong) or 1 (i.e., decisively correct); for data near the classification boundary, the weight is in between. The slope of the sigmoid function is controlled by the parameter \zeta > 0; its value can be adjusted based upon the data distributions of specific tasks. When L_m = 1 and \zeta = 1, \tilde{J} in (12.34) is equivalent to MER in (12.4) and MMI in (12.33). The GMER objective in (12.34) is equivalent to the simplified version of the MCE objective in (12.20) if we have:

L_m = 1      (12.36)
P_k = 1      (12.37)
P_j = \frac{1}{M-1}      (12.38)
\zeta = 1      (12.39)
\alpha = 0.      (12.40)
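The GMER objective of (12.34)-(12.35) can be sketched as follows. This is only an illustration: the likelihoods, the uniform weighting P_j, and the values of L_m and ζ are all hypothetical, not taken from the chapter's experiments.

```python
import numpy as np

def gmer_objective(all_likelihoods, labels, priors, L_m=0.8, zeta=1.0):
    """GMER objective of Eqs. (12.34)-(12.35), with toy numbers.

    all_likelihoods: (N, M) array of p(x_n | lambda_j); labels: true class m of
    each sample. L_m and zeta are the weighting scalar and the sigmoid slope.
    """
    total = 0.0
    for lik, m in zip(all_likelihoods, labels):
        competitors = np.sum(np.delete(lik * priors, m))
        d = np.log(lik[m] * priors[m]) - L_m * np.log(competitors)  # Eq. (12.35)
        total += 1.0 / (1.0 + np.exp(-zeta * d))                    # sigmoid loss
    return total / len(labels)                                      # Eq. (12.34)

# Two 3-class samples with uniform weighting P_j (illustrative)
lik = np.array([[0.7, 0.2, 0.1],
                [0.3, 0.6, 0.1]])
obj = gmer_objective(lik, labels=[0, 1], priors=np.ones(3))
print(obj)   # a value in (0, 1); higher means better-separated classes
```

Raising the true-class likelihoods (or lowering the competitors') increases the objective, which is exactly the behavior a discriminative training procedure would exploit.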
The new GMER objective is more general and flexible than both the MMI and MER objectives. It also retains the most important parameter, \zeta, from the MCE objective, and the function of the weighting parameter \eta in MCE is taken over by L_m in the GMER objective; thus, the GMER objective is as flexible as the MCE objective. Most importantly, the GMER objective is concise, so we can derive a new set of closed-form formulas for fast parameter estimation in discriminative training [11, 12].
12.6 Experimental Comparisons

The above theoretical analysis is sufficient to establish the relations among the different objectives. It is not surprising that all the experimental results we could find from different research sites also support the theoretical analysis. For a fair comparison, we cite results from other sites in addition to our own experiments.

In speaker verification, the ML (maximum likelihood), MMI, and MCE objectives were compared on the NIST 1996 evaluation dataset by Ma et al. [14]. There were 21 male target speakers and 204 male impostors. The reported relative equal-error-rate reductions compared to the ML objective were 3.2% for MMI and 7.0% for MCE.

In speech recognition, the ML, MMI, and MCE objectives were compared on a common database by Reichl and Ruske [17]. It was found that both the MMI and MCE objectives improve speech recognition performance over the ML objective. The absolute error-rate reduction was 2.5% for the MMI objective versus 5.3% for the MCE objective.

In speaker identification, we compared the ML and GMER objectives using an 11-speaker group from the NIST 2000 dataset [11]. For testing durations of 1, 5, and 10 seconds, the ML objective had error rates of 31.4%, 6.59%, and 1.39%, while the GMER objective had error rates of 26.81%, 2.21%, and 0.00%. The relative error-rate reductions were 14.7%, 66.5%, and 100%, respectively. For the best results, the weighting scalar L_m was determined by the optimization algorithm over the iterations; its value changed from iteration to iteration, but in every iteration it held that L_m \neq 1.0. The slope of the sigmoid function for the best results was \zeta = 0.8 \neq 1.0. Based on the above analysis, these results imply that the GMER objective outperforms the MMI and MER objectives.
12.7 Discussion

The results from this chapter can also help in defining new objectives for parameter optimization. In general, in order to optimize a desired performance requirement, the defined objective must be related to the requirement as closely as possible, either mathematically or at least intuitively. For example, in order to minimize a recognition error rate, the objective should be defined in relation to the recognition error rate. If the requirement is to minimize an equal error rate, the objective should be defined in relation to the equal error rate. From this point of view, representing the desired objective, such as a particular error rate, is a necessary and sufficient condition for defining a useful objective, while the discriminative property is just a necessary condition, because a discriminative objective may not necessarily relate to error rates. This concept can help in evaluating and judging objectives and in predicting the corresponding performances. For example, if a likelihood
Table 12.1. Comparisons on Training Algorithms

Objectives | Optimization Algorithms | Learning Parameters | Determine Learn. Para. | Relation to Post. Prob.
ML         | EM (closed form)        | None                | –                      | Not same
MMI        | Closed form             | Yes                 | Experiments            | Same
MCE        | GPD/gradient descent    | Yes                 | Experiments            | Extended
GMER       | Closed form             | Yes                 | Automatic              | Extended
ratio is used as an objective, it is discriminative but is not related to error rates, because the ratio cannot be presented as the a posteriori probability; thus, the likelihood ratio cannot be used as a training objective. Furthermore, optimization algorithms should also be considered when defining an objective. When an objective is simple, it may not represent the expected training objective exactly, but it may be easy to derive a fast and efficient training algorithm from it. On the other hand, when an objective is complicated, it can represent the training objective well, but one cannot easily derive a fast training algorithm. From this point of view, a parameter optimization algorithm should be considered when defining an objective.
12.8 Relations between Objectives and Optimization Algorithms

For pattern recognition or classification, objectives and optimization algorithms for parameter estimation are related to each other, and both play important roles in solving real-world problems, in terms of recognition accuracy, speed of convergence, and time spent adjusting learning rates or other parameters. We summarize the factors of the discussed objectives in Table 12.1. Regarding optimization methods, in general, closed-form formulas, like those of the expectation-maximization (EM) algorithm in maximum-likelihood parameter re-estimation, are more efficient than gradient-descent approaches. However, not every objective has closed-form formulas. When an objective is complicated, such as the MCE objective, there is less chance of deriving closed-form formulas; thus, many algorithms have to rely on gradient-descent methods. For the MMI objective, a closed-form parameter estimation algorithm was derived in [4] through an inequality; however, there is a constant D in the algorithm whose value needs to be predetermined for parameter estimation. Like the learning rate in gradient-descent methods, it is difficult
to determine the value of D, as reported in the literature [16]. The GMER algorithm was developed under our belief that, for the best performance in terms of recognition accuracy and training speed, the objective and the optimization method should be developed jointly [12, 11]. The GMER's recognition accuracy is similar to that of MCE, while its training speed is close to that of the EM algorithm used in ML estimation. We will discuss the GMER algorithm in Chapter 13. If we further investigate the differences between the MCE objective in (12.6) and the MMI objective in (12.30), the differences are mainly in the parameter set listed from (12.21) to (12.25). In theory, those parameters provide the flexibility to adjust the MCE objective for different recognition tasks and data distributions; therefore, the MCE objective is more general than the MMI and MER objectives. In practice, from many reported experiments, we know that some of the parameters can play an important role in recognition or classification performance. For example, η and ζ can be adjusted to achieve better performance.
12.9 Conclusions

In recent years, the objectives of discriminative training algorithms have been extended from statically separating different data classes, as in LDA, to more specific or detailed tasks, such as minimum classification error (MCE) [6, 2, 20], generalized minimum error rate (GMER) [13], maximum mutual information (MMI) [1], maximum decision margin [21], and soft margin [9, 22]. In this chapter, we demonstrated that all four objectives which we have discussed for discriminative training in speaker recognition can be related to both minimum error rates and maximum a posteriori probability under some assumptions. While MMI and MER were directly defined for maximum a posteriori probability, the MCE and GMER objectives can be equivalent to maximum a posteriori probability under some assumptions and simplifications. While MCE was directly defined for minimum error rates, MMI, MER, and GMER can also be related to error rates through the zero-one loss function and some assumptions. In real applications, the distribution of testing data is not exactly the same as the distribution of training data. Since MCE and GMER are more general and flexible, by adjusting the slope of the sigmoid function, MCE and GMER can weight the data near the decision boundary differently, and this property is not available in MMI and MER. Furthermore, by adjusting the weighting parameters for classes, MCE and GMER can weight classes differently instead of just weighting them by their a priori probabilities, as in MMI and MER. From these points of view, MCE and GMER may have the potential to provide more robust recognition or classification performance on testing data and in real applications. In fact, the MCE and GMER
objectives are beyond the traditional Bayes decision theory. The GMER will be discussed in detail in Chapter 13.
References

1. Bahl, L. R., Brown, P. F., de Souza, P. V., and Mercer, R. L., "Maximum mutual information estimation of hidden Markov model parameters for speech recognition," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, (Tokyo), pp. 49–52, 1986.
2. Chou, W., "Discriminant-function-based minimum recognition error rate pattern-recognition approach to speech recognition," Proceedings of the IEEE, vol. 88, pp. 1201–1222, August 2000.
3. Duda, R. O., Hart, P. E., and Stork, D. G., Pattern Classification, Second Edition. New York: John Wiley & Sons, 2001.
4. Gopalakrishnan, P. S., Kanevsky, D., Nadas, A., and Nahamoo, D., "An inequality for rational functions with applications to some statistical estimation problems," IEEE Trans. on Information Theory, vol. 37, pp. 107–113, Jan. 1991.
5. Juang, B.-H., Chou, W., and Lee, C.-H., "Minimum classification error rate methods for speech recognition," IEEE Trans. on Speech and Audio Process., vol. 5, pp. 257–265, May 1997.
6. Juang, B.-H. and Katagiri, S., "Discriminative learning for minimum error classification," IEEE Transactions on Signal Processing, vol. 40, pp. 3043–3054, December 1992.
7. Katagiri, S., Lee, C.-H., and Juang, B.-H., "New discriminative algorithm based on the generalized probabilistic descent method," in Proceedings of IEEE Workshop on Neural Networks for Signal Processing, (Princeton), pp. 299–309, September 1991.
8. Korkmazskiy, F. and Juang, B.-H., "Discriminative adaptation for speaker verification," in Proceedings of Int. Conf. on Spoken Language Processing, (Philadelphia), pp. 28–31, 1996.
9. Li, J., Yuan, M., and Lee, C. H., "Soft margin estimation of hidden Markov model parameters," in Proc. ICSLP, pp. 2422–2425, 2007.
10. Li, Q., "Discovering relations among discriminative training objectives," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, (Montreal), May 2004.
11. Li, Q. and Juang, B.-H., "Fast discriminative training for sequential observations with application to speaker identification," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, (Hong Kong), April 2003.
12. Li, Q. and Juang, B.-H., "A new algorithm for fast discriminative training," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, (Orlando, FL), May 2002.
13. Li, Q. and Juang, B.-H., "Study of a fast discriminative training algorithm for pattern recognition," IEEE Trans. on Neural Networks, vol. 17, pp. 1212–1221, Sept. 2006.
14. Ma, C. and Chang, E., "Comparison of discriminative training methods for speaker verification," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. I-192–I-195, 2003.
15. Nadas, A., Nahamoo, D., and Picheny, M. A., "On a model-robust training method for speech recognition," IEEE Transactions on Acoust., Speech, Signal Processing, vol. 36, pp. 1432–1436, Sept. 1988.
16. Normandin, Y., Cardin, R., and Mori, R. D., "High-performance connected digit recognition using maximum mutual information estimation," IEEE Trans. on Speech and Audio Processing, vol. 2, pp. 299–311, April 1994.
17. Reichl, W. and Ruske, G., "Discriminant training for continuous speech recognition," in Proceedings of Eurospeech, 1995.
18. Schluter, R. and Macherey, W., "Comparison of discriminative training criteria," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 493–497, 1998.
19. Siohan, O., Rosenberg, A. E., and Parthasarathy, S., "Speaker identification using minimum verification error training," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, (Seattle), pp. 109–112, May 1998.
20. Siohan, O., Rosenberg, A., and Parthasarathy, S., "Speaker identification using minimum classification error training," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 109–112, 1998.
21. Vapnik, V. N., The Nature of Statistical Learning Theory. New York: Springer, 1995.
22. Yin, Y. and Li, Q., "Soft frame margin estimation of Gaussian mixture models for speaker recognition with sparse training data," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 2011.
Chapter 13 Fast Discriminative Training
A good training algorithm for pattern recognition needs to satisfy two criteria: first, the objective function is associated with the desired performance; and second, the parameter estimation process derived from the objective is easy to compute with available computation resources and can converge in the required time. For example, the expectation-maximization (EM) algorithm guarantees convergence, but its objective is not to minimize the error rate, which is what most applications desire. On the other hand, many new objective functions are defined to associate directly with the desired performance, but they are often too computationally complicated and may not produce the desired results in a reasonable amount of time. Therefore, for real applications, defining an objective and deriving an estimation algorithm is a joint design process. This chapter presents an example in which a discriminative objective was defined together with its fast training algorithm.

Many discriminative training algorithms for nonlinear classifier design are based on gradient-descent (GD) methods for parameter optimization. These algorithms are easy to derive and effective in practice, but they are slow in training and have difficulty in selecting the learning rates. These drawbacks prevent them from meeting the needs of many speaker recognition applications. To address the problem, we present a fast discriminative training algorithm. The algorithm initializes the parameters by the EM algorithm, and then uses a set of closed-form formulas to further optimize an objective of minimizing the error rate. Experiments in speech applications show that the algorithm provides better recognition accuracy than the EM algorithm and much faster training speed than GD approaches. This work was originally reported by the author and Juang in [11].
13.1 Introduction

As we have discussed in Chapter 3 and Chapter 4, the construct of a pattern classifier can be linear, such as a single-layer perceptron, or nonlinear,
Q. Li, Speaker Authentication, Signals and Communication Technology, DOI: 10.1007/978-3-642-23731-7_13, © Springer-Verlag Berlin Heidelberg 2012
such as a multilayer perceptron (MLP), a Gaussian mixture model (GMM), or a hidden Markov model (HMM) if the event to be recognized involves nonstationary signals. A linear classifier uses a hyperplane to partition the data space. A nonlinear classifier uses nonlinear kernels to model the data distribution or the a posteriori probability, and may be better matched to the statistical behavior of the data than a linear classifier [6]. Another approach is to use a set of hyperplanes to partition the data space, as described in Chapter 3. The parameters of the classifier or recognizer need to be optimized based on given, labeled data. General methodologies for optimizing the classifier parameters fall into two broad classes: distribution estimation and discriminative training.

The distribution-estimation approach to classifier training is based on Bayes decision theory, which suggests estimation of the data distribution as the first and most imperative step in the design of a classifier. The most commonly used criterion for distribution estimation is maximum likelihood (ML) estimation [6]. For the complex distributions used in many nonlinear classifiers, the EM algorithm for ML estimation [5] is, in general, very efficient because, while it is a hill-climbing algorithm, it guarantees a net gain in the optimization objective at every iteration, leading to uniform convergence to a solution.

The concept of discriminative training is simple: when training a classifier for one class, discriminative training also considers separating the trained class from the other classes as much as possible. One example of discriminative training is the principal feature networks introduced in Chapter 3, which solve the discriminative training problem in a sequential procedure with data pruning. In speech and speaker recognition, one often defines a cost function or objective commensurate with the performance of the pattern classification or recognition system, and then minimizes the cost function.
As discussed in Chapter 12, the cost functions proposed for discriminative training include the conventional squared error used in the backpropagation algorithm [20, 23, 21], the minimum classification error (MCE) criterion in the generalized probabilistic descent (GPD) algorithm [8, 4], the maximum mutual information (MMI) criterion [1, 7, 17, 2], and other versions [15, 13, 14, 16]. Variations of the gradient-descent (GD) algorithm [3, 8, 19], as well as simulated- and deterministic-annealing algorithms [9], are used for optimizing these objectives. In general, these optimization algorithms converge slowly, particularly for large-scale or real-time problems such as automatic speech and speaker recognition, which employ HMM's with hundreds of GMM's and tens of thousands of parameters; this hinders their application. In this chapter, we show an approach of jointly defining a discriminative objective and deriving its parameter estimation algorithm. We define a generalized minimum error rate (GMER) objective for discriminative training, and thus name the algorithm fast GMER estimation. It is a batch-mode approach using an approximate closed-form solution for optimization.
13.2 Objective for Fast Discriminative Training

As the first step, we define an objective for generalized minimum error rate (GMER) estimation. When defining the objective, we considered: first, it is discriminative; and, second, a fast algorithm can be derived from the objective. In an M-class classification problem, we are asked to make a decision, to identify an observation x as a member of a class, say, C_i. The true identity of x, say C_j, is not known, except in the design or training phase, in which observations of known identity are used as references for parameter optimization. We denote event α_i as the action of identifying an observation as class i. The decision is correct if i = j; otherwise it is incorrect. It is natural to seek a decision rule that minimizes the probability of error, or, empirically, the error rate, which entails a zero-one loss function:

\[
L(\alpha_i \mid C_j) = \begin{cases} 0, & i = j \\ 1, & i \neq j, \end{cases} \qquad i, j = 1, \ldots, M \tag{13.1}
\]

The function assigns no loss to a correct decision, and assigns a unit loss to an error. The probabilistic risk of α_i corresponding to this loss function is

\[
R(\alpha_i \mid x) = \sum_{j=1}^{M} L(\alpha_i \mid C_j)\, P(C_j \mid x) = 1 - P(C_i \mid x) \tag{13.2}
\]
where P(C_i | x) is the a posteriori probability that x belongs to C_i. To minimize the probability of error, one should therefore maximize the a posteriori probability P(C_i | x). This is the basis of Bayes' maximum a posteriori (MAP) decision theory, and is also referred to as minimum error rate (MER) [6]. The a posteriori probability P(C_i | x) is often modeled as P_{λ_i}(C_i | x), a function defined by a set of parameters λ_i. Since the parameter set λ_i has a one-to-one correspondence with C_i, we write P_{λ_i}(C_i | x) = P(λ_i | x) and other similar expressions without ambiguity. We further define an aggregate a posteriori (AAP) probability for the set of design samples {x_{m,n}; n = 1, 2, ..., N_m, m = 1, 2, ..., M}:

\[
J = \frac{1}{M} \sum_{m=1}^{M} \sum_{n=1}^{N_m} P(\lambda_m \mid x_{m,n})
  = \frac{1}{M} \sum_{m=1}^{M} \sum_{n=1}^{N_m} \frac{p(x_{m,n} \mid \lambda_m)\, P_m}{p(x_{m,n})} \tag{13.3}
\]

where x_{m,n} is the nth training token from class m, N_m is the total number of tokens for class m, and P_m is the corresponding prior probability. The above AAP objective for maximum a posteriori probability or minimum error rate can be further extended to a more general and flexible objective, which we name the GMER objective:
\[
\max \tilde{J} = \max \frac{1}{M} \sum_{m=1}^{M} \sum_{n=1}^{N_m} \ell(d_{m,n})
             = \max \frac{1}{M} \sum_{m=1}^{M} \sum_{n=1}^{N_m} \ell_{m,n} \tag{13.4}
\]

where

\[
\ell(d_{m,n}) = \frac{1}{1 + e^{-\alpha d_{m,n}}} \tag{13.5}
\]

is a sigmoid function, and

\[
d_{m,n} = \log p(x_{m,n} \mid \lambda_m)\, P_m - \log \sum_{j \neq m} p(x_{m,n} \mid \lambda_j)\, P_j \tag{13.6}
\]
represents a log probability ratio between the true class m and the competing classes j ≠ m. The sigmoid function provides different weighting effects to different training data: for data that has been well classified, the weighting is close to 1 or 0; for data near the classification boundary, the weighting is near 0.5. The slope of the sigmoid function is controlled by the parameter α, where α > 0; thus, the value of α can affect the training performance and convergence. By adjusting the value for different recognition tasks, the GMER objective can provide better performance than the MER objective. We introduce a weighting scalar L_m, 0 < L_m ≤ 1, into (13.6) in the objective to ensure that the estimated covariance is positive definite:

\[
d_{m,n} = \log p(x_{m,n} \mid \lambda_m)\, P_m - L_m \log \sum_{j \neq m} p(x_{m,n} \mid \lambda_j)\, P_j \tag{13.7}
\]

For simplicity, we denote L_m = L. Intuitively, L represents the weighting between the true class m and the competing classes j ≠ m. When L < 1, the true class is more important than the competing classes; when L = 1, the true class and the competing classes are equally important. The range of the values of L can be determined during estimation. When L = 1 and α = 1, we have J̃ = J.
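As a concrete illustration, the misclassification measure (13.7) and the sigmoid weighting (13.5) can be sketched in a few lines of Python. This is a toy sketch assuming the per-class log-likelihoods log p(x | λ_j) are already available; the function names are ours, not the book's:

```python
import math

def sigmoid(d, alpha=1.0):
    # Eq. (13.5): smoothed zero-one error count; alpha controls the slope.
    return 1.0 / (1.0 + math.exp(-alpha * d))

def d_score(log_lik, log_priors, m, L=1.0):
    # Eq. (13.7): log p(x|lam_m)P_m - L * log sum_{j != m} p(x|lam_j)P_j.
    true_term = log_lik[m] + log_priors[m]
    competing = sum(math.exp(log_lik[j] + log_priors[j])
                    for j in range(len(log_lik)) if j != m)
    return true_term - L * math.log(competing)
```

For a well-classified token, d is large and sigmoid(d) saturates near 1, so its gradient weight ℓ(1 − ℓ) vanishes; tokens near the decision boundary (d ≈ 0) receive the largest weight, which is the behavior described above.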
For simplicity, we denote Lm = L. Intuitively, L represents the weighting between the true class m and competing classes j = m. When L < 1, it means that the true class is more important than the competing classes. When L = 1, it means the true class and competing classes are equally important. The range of the values of L can be determined during estimation. When L = 1 and α = 1, we have J˜ = J. For testing, based on the Bayes decision rule to minimize risk and average probability of error, for a given observation x, we should select the action or class i that maximizes the posterior probability: i = arg max {P (λm x)}. 1≤m≤M
(13.8)
Since P (x) in (13.3) is the same for all classes, we have i = arg max {P (xλm )Pm }. 1≤m≤M
(13.9)
Thus, although the training procedures can be different, the decision procedure is the same for both ML estimation and discriminative estimation. The decision boundary between classes i and j is the line that satisfies the condition p(x | λ_i)P_i = p(x | λ_j)P_j.
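The MAP decision rule (13.9) is direct to code once per-class likelihoods p(x | λ_m) and priors P_m are available; a hedged sketch with our own function name:

```python
def map_decide(likelihoods, priors):
    # Eq. (13.9): choose the class maximizing p(x | lam_m) * P_m.
    scores = [p * pr for p, pr in zip(likelihoods, priors)]
    return max(range(len(scores)), key=scores.__getitem__)
```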
13.3 Derivation of Fast Estimation Formulas

We now derive the fast estimation formulas for GMM parameter estimation based on the objective. Let

\[
p(x_{m,n} \mid \lambda_m) = \sum_{i=1}^{I} c_{m,i}\, p(x_{m,n} \mid \lambda_{m,i}) \tag{13.10}
\]

where p(x_{m,n} | λ_m) is the mixture density, p(x_{m,n} | λ_{m,i}) is the component density, the c_{m,i} are mixing parameters subject to \(\sum_{i=1}^{I} c_{m,i} = 1\), and I is the number of mixture components that constitute the conditional probability density. The parameters of the component density are a subset of the parameters of the mixture density, i.e., λ_{m,i} ⊂ λ_m. In most applications, the component density is defined as a Gaussian kernel:

\[
p(x_{m,n} \mid \lambda_{m,i}) = \frac{1}{(2\pi)^{d/2} |\Sigma_{m,i}|^{1/2}}
 \exp\!\left(-\frac{1}{2} (x_{m,n} - \mu_{m,i})^T \Sigma_{m,i}^{-1} (x_{m,n} - \mu_{m,i})\right) \tag{13.11}
\]
where μ_{m,i} and Σ_{m,i} are, respectively, the mean vector and covariance matrix of the ith component of the mth GMM, d is the dimension of the observation vectors, and T denotes the vector or matrix transpose. Let ∇_{θ_{m,i}} J̃ be the gradient of J̃ with respect to θ_{m,i} ⊂ λ_{m,i}. Making the gradient vanish for maximizing J̃, we have:

\[
\nabla_{\theta_{m,i}} \tilde{J} = \sum_{n=1}^{N_m} \omega_{m,i}(x_{m,n})\, \nabla_{\theta_{m,i}} \log p(x_{m,n} \mid \lambda_{m,i})
 - L \sum_{j \neq m} \sum_{\bar{n}=1}^{N_j} \bar{\omega}_{j,i}(x_{j,\bar{n}})\, \nabla_{\theta_{m,i}} \log p(x_{j,\bar{n}} \mid \lambda_{m,i}) = 0 \tag{13.12}
\]

where

\[
\omega_{m,i}(x_{m,n}) = \ell_{m,n} (1 - \ell_{m,n}) \frac{c_{m,i}\, p(x_{m,n} \mid \lambda_{m,i})\, P_m}{p(x_{m,n} \mid \lambda_m)\, P_m} \tag{13.13}
\]

\[
\bar{\omega}_{j,i}(x_{j,\bar{n}}) = \ell_{j,\bar{n}} (1 - \ell_{j,\bar{n}}) \frac{c_{m,i}\, p(x_{j,\bar{n}} \mid \lambda_{m,i})\, P_m}{\sum_{k \neq j} p(x_{j,\bar{n}} \mid \lambda_k)\, P_k} \tag{13.14}
\]

where ℓ is the sigmoid value computed using (13.5), representing the (unregulated) error rate. (This is deliberately set to separate, in concept, the influence of a token from the relative importance of various parameters of the classifier upon the performance of the classifier.) To find the solution to (13.12), we assume that ω_{m,i} and ω̄_{j,i} can be approximated as constants around θ_{m,i}. Discussions regarding this assumption are given in [11].
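The mixture density (13.10)–(13.11) and the token weights (13.13)–(13.14) translate directly into code. Below is a dependency-free Python sketch; for brevity it uses the diagonal-covariance special case of (13.11) (the full-covariance form adds only the matrix inverse and determinant), and all function names are ours, not the book's:

```python
import math

def gauss_logpdf_diag(x, mu, var):
    # Diagonal-covariance special case of the Gaussian kernel (13.11).
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - mi) ** 2 / v
                      for xi, mi, v in zip(x, mu, var))

def gmm_pdf(x, weights, means, variances):
    # Eq. (13.10): mixture density, a weighted sum of Gaussian components.
    return sum(c * math.exp(gauss_logpdf_diag(x, mu, var))
               for c, mu, var in zip(weights, means, variances))

def omega(l, c_mi, comp_lik, mix_lik):
    # Eq. (13.13): own-class token weight; the prior P_m cancels between
    # numerator and denominator, so it is omitted here.
    return l * (1.0 - l) * c_mi * comp_lik / mix_lik

def omega_bar(l, c_mi, comp_lik, prior_m, competing_sum):
    # Eq. (13.14): competing-class token weight; competing_sum holds
    # sum_{k != j} p(x | lam_k) P_k.
    return l * (1.0 - l) * c_mi * comp_lik * prior_m / competing_sum
```

Note the common factor ℓ(1 − ℓ) in both weights: it is the derivative shape of the sigmoid (13.5), which concentrates the training effort on tokens near the decision boundary.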
13.3.1 Estimation of Covariance Matrices

From the Gaussian component in (13.10), we have

\[
\log p(x_{m,n} \mid \lambda_{m,i}) = -\log\!\left[(2\pi)^{d/2} |\Sigma_{m,i}|^{1/2}\right]
 - \frac{1}{2} (x_{m,n} - \mu_{m,i})^T \Sigma_{m,i}^{-1} (x_{m,n} - \mu_{m,i}).
\]

For optimization of the covariance matrix, we take the derivative with respect to the matrix Σ_{m,i}:

\[
\nabla_{\Sigma_{m,i}} \log p(x_{m,n} \mid \lambda_{m,i}) = -\frac{1}{2} \Sigma_{m,i}^{-1}
 + \frac{1}{2} \Sigma_{m,i}^{-1} (x_{m,n} - \mu_{m,i})(x_{m,n} - \mu_{m,i})^T \Sigma_{m,i}^{-1} \tag{13.15}
\]

where ∇_Σ is defined as the matrix operator

\[
\nabla_\Sigma \equiv \left[\frac{\partial}{\partial s_{i,j}}\right]_{i,j=1}^{d}, \tag{13.16}
\]

where s_{i,j} is an entry of the matrix Σ, and d is the dimension of the observation vectors. Bringing (13.15) into (13.12) and rearranging the terms, we have:

\[
\Sigma_{m,i} = \frac{A - LB}{D} \tag{13.17}
\]

where

\[
A = \sum_{n=1}^{N_m} \omega_{m,i}(x_{m,n})\, (x_{m,n} - \mu_{m,i})(x_{m,n} - \mu_{m,i})^T \tag{13.18}
\]

\[
B = \sum_{j \neq m} \sum_{\bar{n}=1}^{N_j} \bar{\omega}_{j,i}(x_{j,\bar{n}})\, (x_{j,\bar{n}} - \mu_{m,i})(x_{j,\bar{n}} - \mu_{m,i})^T \tag{13.19}
\]

and

\[
D = \sum_{n=1}^{N_m} \omega_{m,i}(x_{m,n}) - L \sum_{j \neq m} \sum_{\bar{n}=1}^{N_j} \bar{\omega}_{j,i}(x_{j,\bar{n}}). \tag{13.20}
\]
Both A and B are matrices and D is a scalar. For simplicity, we ignore the subscripts m, i for A, B, and D.

13.3.2 Determination of Weighting Scalar

The estimated covariance matrix, Σ_{m,i}, must be positive definite. We use this requirement to determine the upper bound of the weighting scalar L. Using the eigenvectors of A^{-1}B, we can construct an orthogonal matrix U such that (1) A − LB = U^T(Ã − LB̃)U, where both Ã and B̃ are diagonal, and (2) A − LB and Ã − LB̃ have the same eigenvalues. These claims have been proven in Theorems 1 and 2 in [11]. L can then be determined as

\[
L < \min_{k=1}^{d} \left\{ \frac{\tilde{a}_k}{\tilde{b}_k} \right\}, \tag{13.21}
\]

where ã_k > 0 and b̃_k > 0 are the diagonal entries of Ã and B̃, respectively. L also needs to satisfy

\[
D(L) > 0 \quad \text{and} \quad 0 < L \le 1. \tag{13.22}
\]

Thus, for the ith mixture component of model m, we can determine L_{m,i}. If model m has I mixture components, we need to determine one L_m that satisfies all mixture components in the model. Therefore, the upper bound of L_m is

\[
L_m \le \min\{L_{m,1}, L_{m,2}, \ldots, L_{m,I}\}. \tag{13.23}
\]

In numerical computation, we need an exact value of L; therefore, we set

\[
L_m = \eta \min\{L_{m,1}, L_{m,2}, \ldots, L_{m,I}\} \tag{13.24}
\]
where 0 < η ≤ 1 is a preselected constant; it is much easier to determine than the learning rate in gradient-descent algorithms.

13.3.3 Estimation of Mean Vectors

We take the derivative of (13.15) with respect to the vector μ_{m,i}:

\[
\nabla_{\mu_{m,i}} \log p(x_{m,n} \mid \lambda_{m,i}) = \Sigma_{m,i}^{-1} (x_{m,n} - \mu_{m,i}), \tag{13.25}
\]

where ∇_μ is defined as the vector operator

\[
\nabla_\mu \equiv \left[\frac{\partial}{\partial \nu_i}\right]_{i=1}^{d}, \tag{13.26}
\]

where ν_i is an entry of the vector μ, and d is the dimension of the observation vectors. Bringing (13.25) into (13.12) and rearranging the terms, we obtain the solution for the mean vectors:

\[
\mu_{m,i} = \frac{E - LF}{D} \tag{13.27}
\]

where

\[
E = \sum_{n=1}^{N_m} \omega_{m,i}(x_{m,n})\, x_{m,n} \tag{13.28}
\]

\[
F = \sum_{j \neq m} \sum_{\bar{n}=1}^{N_j} \bar{\omega}_{j,i}(x_{j,\bar{n}})\, x_{j,\bar{n}} \tag{13.29}
\]

and D is defined in (13.20). Again, for simplicity, we ignore the subscripts m, i for E, F, and D. We note that both E and F are vectors, and that the scalar L has already been determined when estimating Σ_{m,i}.
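Because Ã and B̃ in (13.21) are diagonal, the bound on L reduces to a minimum of elementwise ratios. A Python sketch for the diagonal case, folding in the D(L) > 0 and 0 < L ≤ 1 conditions of (13.22) and the back-off of (13.24); the helper name and argument layout are our own assumptions:

```python
def upper_bound_L(a_diag, b_diag, sum_w, sum_wbar, eta=0.5):
    # Eq. (13.21): keep A - L*B positive definite (diagonal case).
    bound = min(a / b for a, b in zip(a_diag, b_diag))
    # Eq. (13.22): D(L) = sum_w - L*sum_wbar > 0, and 0 < L <= 1.
    if sum_wbar > 0:
        bound = min(bound, sum_w / sum_wbar)
    # Eq. (13.24): back off from the strict bound by eta.
    return eta * min(bound, 1.0)
```

Because η < 1, the returned L stays strictly below every constraint, which is what keeps the covariance update (13.17) positive definite.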
13.3.4 Estimation of Mixture Parameters

The last step is to compute the mixture parameters c_{m,i}, subject to \(\sum_{i=1}^{I} c_{m,i} = 1\). Introducing Lagrange multipliers γ_m, we have

\[
\hat{J} = \tilde{J} + \sum_{m=1}^{M} \gamma_m \left( \sum_{i=1}^{I_m} c_{m,i} - 1 \right). \tag{13.30}
\]

Taking the first derivative and making it vanish for maximization, we have

\[
\frac{\partial \hat{J}}{\partial c_{m,i}} = \frac{1}{c_{m,i}} D + \gamma_m = 0. \tag{13.31}
\]

Rearranging the terms, we have

\[
c_{m,i} = -\frac{1}{\gamma_m} D. \tag{13.32}
\]

Summing over c_{m,i}, for i = 1, ..., I, we can solve for γ_m as

\[
\gamma_m = -(G - LH) \tag{13.33}
\]

where

\[
G = \sum_{n=1}^{N_m} \sum_{i=1}^{I_m} \omega_{m,i}(c_i, x_{m,n}) \tag{13.34}
\]

and

\[
H = \sum_{j \neq m} \sum_{\bar{n}=1}^{N_j} \sum_{i=1}^{I_j} \bar{\omega}_{j,i}(c_i, x_{j,\bar{n}}). \tag{13.35}
\]

Bringing (13.33) into (13.32), we have

\[
c_{m,i} = \frac{D}{G - LH}. \tag{13.36}
\]
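To make the closed-form updates concrete, here is a scalar (1-D) sketch of (13.17)–(13.20) and (13.27)–(13.29). The lists `own` and `comp` hold (weight, sample) pairs carrying the ω and ω̄ values from (13.13)–(13.14); the function name and data layout are our own, not the book's:

```python
def update_component(own, comp, mu, L):
    # own:  (omega, x) pairs for same-class tokens, Eq. (13.13)
    # comp: (omega_bar, x) pairs for competing-class tokens, Eq. (13.14)
    A = sum(w * (x - mu) ** 2 for w, x in own)                 # Eq. (13.18)
    B = sum(w * (x - mu) ** 2 for w, x in comp)                # Eq. (13.19)
    D = sum(w for w, _ in own) - L * sum(w for w, _ in comp)   # Eq. (13.20)
    E = sum(w * x for w, x in own)                             # Eq. (13.28)
    F = sum(w * x for w, x in comp)                            # Eq. (13.29)
    return (E - L * F) / D, (A - L * B) / D                    # (13.27), (13.17)
```

With L chosen by (13.21)–(13.24), both D and the returned variance stay positive; with L = 0 the update collapses to a weighted ML-style re-estimation, which matches the intuition that L controls how much the competing classes push the parameters.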
13.3.5 Discussions

We have discussed the necessary conditions for optimization, i.e., ∇_{θ_{m,i}} J̃ = 0. In theory, we also need to meet the following sufficient conditions: (1) ∇²_{θ_{m,i}} J̃ < 0, in order to ensure a maximum solution; and (2) ‖∇_{θ_{m,i}} ω_{m,i}‖ ≈ 0 and ‖∇_{θ_{m,i}} ω̄_{j,i}‖ ≈ 0 around θ_{m,i}. This is to ensure that ω_{m,i}(θ_{m,i}) and ω̄_{j,i}(θ_{m,i}) in (13.12) are approximately constant; therefore, the independence assumption is sound. Further discussions on the sufficient conditions are available in [11]. Also, in the above derivations, one observation or training token, x, represents one feature vector, and a decision is made based on the single feature
vector. In speech and speaker recognition, one spoken phoneme can be represented by several feature vectors, and a short sentence can have over one hundred feature vectors. For such applications, decisions are usually made on a sequence of continually observed (extracted) feature vectors, and we have to make corresponding changes in the objective and estimation formulas. Given the nth observation of class m as a sequence of feature vectors, the observation (token) can be presented as:

\[
X_{m,n} = \{x_{m,n,q}\}_{q=1}^{Q_n} \tag{13.37}
\]

where the nth token has a sequence of Q_n feature vectors. To deal with this kind of problem, one usually assumes that the feature vectors are independent and identically distributed (i.i.d.); therefore, the probability or likelihood p(X_{m,n} | λ_m) can be calculated as:

\[
p(X_{m,n} \mid \lambda_m) = \prod_{q=1}^{Q_n} p(x_{m,n,q} \mid \lambda_m). \tag{13.38}
\]
The GMER objective can be rewritten as:

\[
\max \tilde{J} = \frac{1}{M} \sum_{m=1}^{M} \sum_{n=1}^{N_m} \ell(d_{m,n})
              = \frac{1}{M} \sum_{m=1}^{M} \sum_{n=1}^{N_m} \ell_{m,n} \tag{13.39}
\]

where \(\ell(d_{m,n}) = 1/(1 + e^{-\alpha d_{m,n}})\) is a sigmoid function, and

\[
d_{m,n} = \log p(X_{m,n} \mid \lambda_m)\, P_m - \log \sum_{j \neq m} p(X_{m,n} \mid \lambda_j)\, P_j \tag{13.40}
\]
represents a log probability ratio between the true class m and the competing classes j ≠ m. Using the same method demonstrated above, readers can derive a set of re-estimation formulas, or refer to the results in [10]. We omit the derivations here since the procedures and results are very similar to the above. Given the nth observation, a sequence of feature vectors X_n, the decision on the observation can be made as:

\[
i = \arg\max_{1 \le m \le M} \{ p(X_n \mid \lambda_m)\, P_m \} \tag{13.41}
\]
\[
  = \arg\max_{1 \le m \le M} \Big\{ P_m \prod_{q=1}^{Q_n} p(x_{n,q} \mid \lambda_m) \Big\}. \tag{13.42}
\]
In practice, the above computation is often conducted in the log-likelihood domain:

\[
i = \arg\max_{1 \le m \le M} \Big\{ \log P_m + \sum_{q=1}^{Q_n} \log p(x_{n,q} \mid \lambda_m) \Big\}. \tag{13.43}
\]
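The sequence decision rule (13.43) is a one-liner once per-frame log-likelihoods are available; a hedged sketch (function and argument names are ours):

```python
import math

def classify(frame_loglik, log_priors):
    # Eq. (13.43): frame_loglik[m] holds log p(x_q | lam_m) for each
    # frame q under model m; pick the class maximizing
    # log P_m + sum of the per-frame log-likelihoods.
    scores = [lp + sum(frames)
              for lp, frames in zip(log_priors, frame_loglik)]
    return max(range(len(scores)), key=scores.__getitem__)
```

Summing log-likelihoods rather than multiplying likelihoods avoids numerical underflow over the hundreds of frames typical of a spoken utterance.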
13.4 Summary of Practical Training Procedure

For programming, the practical training procedure for the fast GMER estimation can be summarized as follows:

1. Initialize all model parameters for all classes by ML estimation.
2. For every mixture component i in model m, compute ω_{m,i} and ω̄_{j,i} using (13.13) and (13.14), and compute A, B, and D using (13.18), (13.19), and (13.20).
3. Determine the weighting scalar L by (13.24).
4. For every mixture component i, compute Σ_{m,i}, μ_{m,i}, and c_{m,i} using (13.17), (13.27), and (13.36).
5. Evaluate the performance using (13.4) and (13.6) for model m. If the performance is improved, save the best model parameters.
6. Repeat Steps 4 and 5 if the performance is improved.
7. Repeat Steps 2 to 6 for the required number of iterations for model m.
8. Use the saved model for class m and repeat the above procedure for all untrained models.
9. Output the saved models for testing and applications.
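The steps above can be exercised end to end on a toy problem. The following self-contained Python sketch trains one Gaussian per class for a 1-D, two-class, equal-prior case, so (13.13) and (13.14) both reduce to ℓ(1 − ℓ) and the bound (13.21) becomes a scalar ratio. It is an illustration of the procedure under those simplifying assumptions, not the book's full GMM implementation; Steps 5–8 (evaluating (13.4) and keeping the best model) are omitted for brevity:

```python
import math

def sigmoid(d, alpha=1.0):
    return 1.0 / (1.0 + math.exp(-alpha * d))

def gmer_train_1d(data, iters=3, alpha=1.0, eta=0.5):
    # Step 1: ML initialization (sample means and variances).
    mus = [sum(xs) / len(xs) for xs in data]
    vrs = [sum((x - mu) ** 2 for x in xs) / len(xs)
           for mu, xs in zip(mus, data)]

    def logpdf(x, m):
        return -0.5 * (math.log(2 * math.pi * vrs[m])
                       + (x - mus[m]) ** 2 / vrs[m])

    for _ in range(iters):
        new = []
        for m in (0, 1):
            j = 1 - m
            # Step 2: token weights; with one component and equal priors,
            # (13.13) and (13.14) both reduce to l * (1 - l).
            own, comp = [], []
            for x in data[m]:
                l = sigmoid(logpdf(x, m) - logpdf(x, j), alpha)
                own.append((l * (1 - l), x))
            for x in data[j]:
                l = sigmoid(logpdf(x, j) - logpdf(x, m), alpha)
                comp.append((l * (1 - l), x))
            A = sum(w * (x - mus[m]) ** 2 for w, x in own)
            B = sum(w * (x - mus[m]) ** 2 for w, x in comp)
            sw = sum(w for w, _ in own)
            swb = sum(w for w, _ in comp)
            # Step 3: scalar version of the bound (13.21)-(13.24).
            L = eta * min(A / B if B > 0 else 1.0,
                          sw / swb if swb > 0 else 1.0, 1.0)
            # Step 4: closed-form updates (13.17)-(13.20), (13.27)-(13.29).
            D = sw - L * swb
            E = sum(w * x for w, x in own)
            F = sum(w * x for w, x in comp)
            new.append(((E - L * F) / D, (A - L * B) / D))
        mus = [mu for mu, _ in new]
        vrs = [v for _, v in new]
    return mus, vrs
```

Because L is backed off by η from the strict bound, the denominators D and the updated variances remain positive at every iteration, mirroring the positive-definiteness argument of Section 13.3.2.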
13.5 Experiments

The fast GMER estimation has been applied to several pattern recognition, speech recognition, and speaker recognition projects. We have selected two examples to present here.

13.5.1 Continuing the Illustrative Example

In Chapter 11, we used a two-dimensional data classification problem to illustrate the GMM classifier. Here, we continue the example and apply the GMER estimation to the classification problem. The data distributions were of Gaussian-mixture type, with three components in each class. Each token was two-dimensional. For each class, 1,500 tokens were drawn from each of the three components; therefore, there were 4,500 tokens in total. The contours of the ideal distributions of classes 1, 2, and 3 are shown in Fig. 13.1, where the means are represented as +, ∗, and boxes, respectively. In order to simulate real applications, we assumed that the number of mixture components is unknown; therefore, the GMM's to be trained were given two mixture components with full covariance matrices for each class. In the first step, ML estimation was applied to initialize the GMM's with four iterations based on the training data drawn from the ideal models. The contours representing the pdf's of each of the GMM's after ML estimation are plotted in Fig. 13.2.
Fig. 13.1. Contours of the pdf's of 3-mixture GMM's: the models are used to generate three classes of training data.

Fig. 13.2. Contours of the pdf's of 2-mixture GMM's: the models are from ML estimation using four iterations.
In the next step, we used the fast GMER estimation to further train new GMM's with two iterations, based on the parameters estimated by ML estimation. The contours representing the pdf's of each of the new GMM's after the GMER estimation are plotted in Fig. 13.3. All the contours in Figures 13.2 and 13.3 are plotted on the same scale. From Fig. 13.2 and Fig. 13.3, we can observe that GMER training significantly reduced the overlaps among the three classes. The decision boundaries of the three cases are plotted in Fig. 13.4. After GMER training, the boundaries from ML estimation shifted toward the decision boundaries from the ideal models. We note that both the ML and
Fig. 13.3. Contours of the pdf's of 2-mixture GMM's: the models are from the fast GMER estimation with two iterations on top of the ML estimation results. The overlaps among the three classes are significantly reduced.

Fig. 13.4. Enlarged decision boundaries for the ideal 3-mixture models (solid line), 2-mixture ML models (dashed line), and 2-mixture GMER models (dash-dotted line): after GMER training, the boundary of ML estimation shifted toward the decision boundary of the ideal models. This illustrates how GMER training improves decision accuracies.
GMER models were trained from a limited set of training data drawn from the ideal model, and the shifted areas have high data density. This illustrates how GMER training improves classification accuracies. We summarize the experimental results in Table 13.1. The testing data, with 4,500 tokens for each class, were obtained using the same method as the
Fig. 13.5. Performance improvement versus iterations using the GMER estimation: the initial performances were from the ML estimation with four iterations.

Table 13.1. Three-Class Classification Results of the Illustration Example

Algorithms   Iterations   Training Set   Testing Set
MLE          4            76.07%         75.97%
MER          1            76.18%         76.61%
MER          2            76.26%         76.70%
MER          3            76.48%         76.68%
Ideal Case   -            77.19%         77.02%
training data. The ML estimation provided accuracies of 76.07% and 75.97% for the training and testing datasets, respectively, while the GMER estimation improved the accuracies to 76.26% and 76.70% after two iterations. The control parameters were set to η = 0.5 and α = 0.01. If we use the same models that generated the training data to do the testing, the ideal performances are 77.19% and 77.02% for training and testing. These ideal cases represent the ceiling of this example. To evaluate the behavior of the GMER estimation, we plot the training and testing accuracies at each iteration in Fig. 13.5. On testing, the relative improvement of the GMER estimation against the ceiling is significant.
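One way to make "relative improvement against the ceiling" concrete is to ask how much of the testing-set gap between the ML baseline and the ideal-model ceiling the GMER estimation closes. The ratio below is our reading of Table 13.1, not a number stated in the text:

```python
# Back-of-the-envelope check of the testing-set numbers in Table 13.1:
# what fraction of the MLE-to-ideal gap does GMER (2 iterations) close?
mle, gmer_2, ideal = 75.97, 76.70, 77.02   # testing accuracies, percent
gap_closed = (gmer_2 - mle) / (ideal - mle)
print(f"fraction of the MLE-to-ideal gap closed by GMER: {gap_closed:.1%}")
```

By this reading, two GMER iterations recover roughly 70% of the accuracy gap between the ML models and the ideal generating models.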
Table 13.2. Comparison on Speaker Identification Error Rates

                                            Test Length
Algorithms                 Iterations       1 sec     5 sec     10 sec
ML Estimation              5                31.41%    6.59%     1.39%
GMER (Proposed)            MLE 5 + MER 1    26.81%    2.21%     0.00%
Relative Error Reduction                    14.65%    66.46%    100.00%
13.5.2 Application to Speaker Identification

We also used a text-independent speaker identification task to evaluate the GMER algorithm on sequentially observed feature vectors. The experiment included 11 speakers. Given a sequence of feature vectors extracted from a speaker's voice, the task was to identify the true speaker from the group of 11 speakers. Each speaker had 60 seconds of training data and 30 to 40 seconds of testing data with a sampling rate of 8 kHz. These speakers were randomly picked from the 2000 NIST Speaker Recognition Evaluation Database. The speech data were first converted into 12-dimensional (12-D) Mel-frequency cepstral coefficients (MFCC's) through a 30 ms window shifted every 10 ms [18]. Thus, for every 10 ms, we had one 12-D MFCC feature vector. The silence frames were then removed by a batch-mode endpoint detection algorithm [12]. The testing performance was evaluated based on segments of 1, 5, and 10 seconds of testing speech. Each speech segment was constructed by moving a window of the length of 10, 50, or 100 vectors at every feature vector on the testing data collected sequentially. A detailed introduction to speaker identification and a typical ML estimation approach can be found in [18] and previous chapters. We first constructed GMM's with 8 mixture components for every speaker using ML estimation. Each GMM was then further trained discriminatively using the sequential GMER estimation described above. During the test, for every segment, we computed the likelihood scores of all trained GMM's over the selected test length. The speaker with the highest score was labeled as the owner of the segment. The experimental results are listed in Table 13.2. For 1, 5, and 10 seconds of testing data, the sequential GMER algorithm provided 14.65%, 66.46%, and 100.00% relative error rate reductions, respectively, compared to ML estimation, which is the most popular algorithm in speaker identification.
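The segment-based identification step described above can be sketched as follows. Synthetic features and scikit-learn GMMs stand in for the book's NIST-2000 data and models; the speaker count, means, and segment length here are illustrative assumptions:

```python
# Sketch of GMM-based speaker identification: sum per-frame log-likelihoods
# over a window of feature vectors under each speaker's model, and pick the
# highest-scoring speaker. Data and models are SYNTHETIC stand-ins.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
n_speakers, dim = 3, 12        # book uses 11 speakers and 12-D MFCCs

# Train one 8-mixture GMM per speaker on that speaker's (synthetic) features.
models = []
for s in range(n_speakers):
    feats = rng.normal(loc=s, scale=1.0, size=(600, dim))
    models.append(GaussianMixture(n_components=8, covariance_type="diag",
                                  random_state=0).fit(feats))

def identify(segment):
    """Return the index of the model with the highest total log-likelihood."""
    scores = [m.score_samples(segment).sum() for m in models]
    return int(np.argmax(scores))

# 100 frames at a 10 ms shift correspond to about 1 second of speech.
test_segment = rng.normal(loc=2, scale=1.0, size=(100, dim))
print("identified speaker:", identify(test_segment))
```

In the book's setup, the window would slide forward one feature vector at a time over the sequentially collected test data, producing one identification decision per window position.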
13.6 Conclusions

In this chapter, we showed an example of jointly defining an objective function and its training algorithm, so that we can obtain a closed-form solution for
fast estimation. Although the algorithm does not guarantee analytical convergence at each iteration, we empirically demonstrated that the GMER algorithm can train a classifier or recognizer in only a few iterations, much faster than gradient-descent-based methods, while also providing better recognition accuracy due to the generalization of the principle of error minimization. Our experimental results indicated that the fast GMER algorithm is efficient and effective. So far, our discussion on discriminative training has focused on error-rate-related objectives. The author and his colleagues' recent work focuses on decision-margin-related objectives and uses a convex optimization approach to estimate the GMM parameters. That approach has achieved good performance results as well. Interested readers are referred to [22] for more detailed information.
References

1. Bahl, L. R., Brown, P. F., de Souza, P. V., and Mercer, R. L., "Maximum mutual information estimation of hidden Markov model parameters for speech recognition," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, (Tokyo), pp. 49–52, 1986.
2. Ben-Yishai, A. and Burshtein, D., "A discriminative training algorithm for hidden Markov models," IEEE Trans. on Speech and Audio Processing, May 2004.
3. Bishop, C., Neural Networks for Pattern Recognition. NY: Oxford Univ. Press, 1995.
4. Chou, W., "Discriminant-function-based minimum recognition error rate pattern-recognition approach to speech recognition," Proceedings of the IEEE, vol. 88, pp. 1201–1222, August 2000.
5. Dempster, A. P., Laird, N. M., and Rubin, D. B., "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, vol. 39, pp. 1–38, 1977.
6. Duda, R. O., Hart, P. E., and Stork, D. G., Pattern Classification, Second Edition. New York: John Wiley & Sons, 2001.
7. Gopalakrishnan, P. S., Kanevsky, D., Nadas, A., and Nahamoo, D., "An inequality for rational functions with applications to some statistical estimation problems," IEEE Trans. on Information Theory, vol. 37, pp. 107–113, Jan. 1991.
8. Juang, B.-H. and Katagiri, S., "Discriminative learning for minimum error classification," IEEE Transactions on Signal Processing, vol. 40, pp. 3043–3054, December 1992.
9. Kirkpatrick, S., Gelatt, C. D., Jr., and Vecchi, M. P., "Optimization by simulated annealing," Science, vol. 220, pp. 671–680, 1983.
10. Li, Q. and Juang, B.-H., "Fast discriminative training for sequential observations with application to speaker identification," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, (Hong Kong), April 2003.
11. Li, Q. and Juang, B.-H., "Study of a fast discriminative training algorithm for pattern recognition," IEEE Trans. on Neural Networks, vol. 17, pp. 1212–1221, Sept. 2006.
12. Li, Q., Zheng, J., Tsai, A., and Zhou, Q., "Robust endpoint detection and energy normalization for real-time speech and speaker recognition," IEEE Trans. on Speech and Audio Processing, vol. 10, pp. 146–157, March 2002.
13. Markov, K. and Nakagawa, S., "Discriminative training of GMM using a modified EM algorithm for speaker recognition," in Proc. ICSLP, 1998.
14. Markov, K., Nakagawa, S., and Nakamura, S., "Discriminative training of HMM using maximum normalized likelihood algorithm," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 497–500, 2001.
15. Mak, B., Tam, Y.-C., and Li, Q., "Discriminative auditory features for robust speech recognition," IEEE Trans. on Speech and Audio Processing, vol. 12, pp. 27–36, Jan. 2004.
16. Mora-Jimenez, I. and Cid-Sueiro, J., "A universal learning rule that minimizes well-formed cost functions," IEEE Trans. on Neural Networks, vol. 16, pp. 810–820, July 2005.
17. Normandin, Y., Cardin, R., and Mori, R. D., "High-performance connected digit recognition using maximum mutual information estimation," IEEE Trans. on Speech and Audio Processing, vol. 2, pp. 299–311, April 1994.
18. Reynolds, D. and Rose, R. C., "Robust text-independent speaker identification using Gaussian mixture speaker models," IEEE Trans. on Speech and Audio Processing, vol. 3, pp. 72–83, 1995.
19. Robinson, M., Azimi-Sadjadi, M. R., and Salazar, J., "Multi-aspect target discrimination using hidden Markov models and neural networks," IEEE Trans. on Neural Networks, vol. 16, pp. 447–459, March 2005.
20. Werbos, P. J., The Roots of Backpropagation: From Ordered Derivatives to Neural Networks and Political Forecasting. New York: J. Wiley & Sons, 1994.
21. Wu, W., Feng, G., Li, Z., and Xu, Y., "Deterministic convergence of an online gradient method for BP networks," IEEE Trans. on Neural Networks, vol. 16, pp. 533–540, May 2005.
22. Yin, Y. and Li, Q., "Soft frame margin estimation of Gaussian mixture models for speaker recognition with sparse training data," in Proc. ICASSP, 2011.
23. Yu, X., Efe, M. O., and Kaynak, O., "A general backpropagation algorithm for feedforward neural networks learning," IEEE Trans. on Neural Networks, vol. 13, pp. 251–254, Jan. 2002.
Chapter 14 Verbal Information Veriﬁcation
So far in this book we have focused on speaker recognition, which includes speaker verification (SV) and speaker identification (SID). Both tasks are accomplished by matching a speaker's voice with his or her registered and modeled speech characteristics. In this chapter, we present another approach to speaker authentication: verbal information verification (VIV), in which spoken utterances of a claimed speaker are automatically verified against the key (usually confidential) information in the speaker's registered profile to decide whether the claimed identity should be accepted or rejected. Using a sequential VIV procedure involving three question-response turns, we achieved an error-free result in a telephone speaker authentication experiment with 100 speakers. This work was originally reported by the author, Juang, Zhou, and Lee in [5, 6].
14.1 Introduction

As we have discussed in previous chapters, to ensure proper access to private information, personal transactions, and security of computer and communication networks, automatic user authentication is necessary. Among various kinds of authentication methods, such as voice, password, personal identification number (PIN), signature, fingerprint, iris, hand shape, etc., voice is the most convenient one because it is easy to produce, capture, and transmit over telephone or wireless networks. It can also be supported by existing services without requiring special devices. Speaker recognition is a voice authentication technique that has been studied for several decades. There are, however, still several problems which affect real-world applications, such as acoustic mismatch, quality of the training data, inconvenience of enrollment, and the creation of a large database to store all the enrolled speaker patterns. In this chapter we present a different approach to speaker authentication called verbal information verification (VIV) [5, 6]. VIV can be used independently
Q. Li, Speaker Authentication, Signals and Communication Technology, DOI: 10.1007/9783642237317_14, Ó SpringerVerlag Berlin Heidelberg 2012
or can be combined with speaker recognition to provide convenience to users while achieving a higher level of security. As introduced in Chapter 1, VIV is the process of verifying spoken utterances against the information stored in a given personal data profile. A VIV system may use a dialogue procedure to verify a user by asking questions. An example of a VIV system is shown in Fig. 14.1. It is similar to a typical telebanking procedure: after an account number is provided, the operator verifies the user by asking for some personal information, such as mother's maiden name, birth date, address, home telephone number, etc. The user must answer the questions correctly in order to gain access to his/her account. To automate the whole procedure, the questions can be prompted by a text-to-speech (TTS) system or by prerecorded messages.
[Figure 14.1 is a flowchart: the system asks "In which year were you born?", then "In which city/state did you grow up?", then "May I have your telephone number, please?" Each answer utterance is verified in turn; a wrong answer leads to rejection, and acceptance requires correct answers to all three questions.]

Fig. 14.1. An example of verbal information verification by asking sequential questions. (Similar sequential tests can also be applied in speaker verification and other biometric or multi-modality verification.)
We note that the major diﬀerence between speaker recognition and VIV in speaker authentication is that a speaker recognition system utilizes a speaker’s voice characteristics represented by the speech feature vectors while a VIV system mainly inspects the verbal content in the speech signal. The diﬀerence
14.1 Introduction
209
can be further addressed in the following three aspects. First, in a speaker recognition system, for either SID or SV, we need to train speaker-dependent (SD) models, while in VIV we usually use speaker-independent models with associated acoustic-phonetic identities or subwords. Second, a speaker recognition system needs to enroll a new user and to train the SD model, while a VIV system does not need such an enrollment. A user's personal data profile is created when the user's account is set up. Finally, in speaker recognition, the system has the ability to reject an impostor when the input utterance contains a legitimate passphrase but fails to match the pretrained SD model. In VIV, it is solely the user's responsibility to protect his or her own personal information because no speaker-specific voice characteristics are used in the verification process. However, in real applications, there are several ways to prevent impostors from using a speaker's personal information obtained by monitoring a particular session. A VIV system can ask for some information that may not be constant from one session to another, e.g. the amount or date of the last deposit; or a subset of the registered personal information, e.g. a VIV system can require a user to register N pieces of personal information (N > 1) and each time randomly ask only n questions (1 ≤ n < N). Furthermore, as we will present in Section 15.2, a VIV system can be migrated to an SV system, and VIV can be used to facilitate automatic enrollment for SV, which will be discussed in Chapter 15.
14.2 Single Utterance Verification

Using speech processing techniques, we have two ways to verify a single spoken utterance for VIV: by automatic speech recognition (ASR) or by utterance verification. With ASR, the spoken input is transcribed into a sequence of words. The transcribed words are then compared to the information pre-stored in the claimed speaker's personal profile. With utterance verification, the spoken input is verified against an expected sequence of word or subword models which is taken from the personal data profile of the claimed individual. Based on our experience [6] and the analysis in Section 14.3, the utterance verification approach can give much better performance than the ASR approach. Therefore, we focus our discussion only on the utterance verification approach in this study. The idea of utterance verification for computing confidence scores was used in keyword spotting and non-keyword rejection (e.g. [13, 14, 2, 8, 17, 18, 16]). A similar concept can also be found in fixed-phrase speaker verification [15, 11, 7] and in VIV [6, 4]. A block diagram of a typical utterance verification system for VIV is shown in Fig. 14.2. The three key modules, utterance segmentation by forced decoding, subword testing, and utterance-level confidence scoring, are described in detail in the following subsections. For an access control application, when a user opens an account, some of his or her key information is registered in a personal profile. Each piece of
[Figure 14.2 is a block diagram: an identity claim selects the phone/subword transcription S_1, ..., S_m of the pass-utterance (e.g. "Murray Hill"); forced decoding of the utterance with the speaker-independent HMM's λ_1, ..., λ_m for the transcription produces phone boundaries and target likelihoods P(O_1|λ_1), ..., P(O_m|λ_m); anti-HMM's for the transcription give anti-likelihoods P(O_1|λ̄_1), ..., P(O_m|λ̄_m); a confidence measure computation combines the two sets of likelihoods into scores.]

Fig. 14.2. Utterance verification in VIV.
the key information is represented by a sequence of words, S, which in turn is equivalently characterized by a concatenation of a sequence of phones or subwords, {S_n}_{n=1}^{N}, where S_n is the nth subword and N is the total number of subwords in the key word sequence. Since the VIV system prompts only one question at a time, the system knows the expected key information for the prompted question and the corresponding subword sequence S. We then apply the subword models λ_1, ..., λ_N in the same order as the subword sequence S to decode the answer utterance. This process is known as forced decoding or forced alignment, in which the Viterbi algorithm is employed to determine the maximum-likelihood segmentation of the subwords, i.e.

    P(O|S) = max_{t_1, t_2, ..., t_N} P(O_1^{t_1}|S_1) P(O_{t_1+1}^{t_2}|S_2) ... P(O_{t_{N-1}+1}^{t_N}|S_N),    (14.1)

where

    O = {O_1, O_2, ..., O_N} = {O_1^{t_1}, O_{t_1+1}^{t_2}, ..., O_{t_{N-1}+1}^{t_N}}    (14.2)

is a set of segmented feature vectors associated with the subwords, t_1, t_2, ..., t_N are the end frame numbers of the subword segments, and O_n = O_{t_{n-1}+1}^{t_n} is the segmented sequence of observations corresponding to subword S_n, from frame number t_{n-1} + 1 to frame number t_n, where t_1 ≥ 1 and t_i > t_{i-1}. Given an observed speech segment O_n, we need a decision rule by which we assign the subword to either hypothesis H_0 or H_1. Following the definition in [17], H_0 means that the observed speech O_n consists of the actual sound of subword S_n, and H_1 is the alternative hypothesis. For this binary testing problem, one of the most useful tests for decision making is the Neyman-Pearson lemma [10, 9, 19]. For a given number of observations K, the most powerful
test, which minimizes the error for one class while maintaining the error for the other class constant, is the likelihood ratio test

    r(O_n) = P(O_n|H_0) / P(O_n|H_1) = P(O_n|λ_n) / P(O_n|λ̄_n),    (14.3)

where λ_n and λ̄_n are the target HMM and the corresponding anti-HMM for subword unit S_n, respectively. The target model, λ_n, is trained using the data of subword S_n; the corresponding anti-model, λ̄_n, is trained using the data of a set of subwords S̄_n which are highly confusable with subword S_n [17], i.e. S̄_n ⊂ {S_i}, i ≠ n. The log likelihood ratio (LLR) for subword S_n is

    R(O_n) = log P(O_n|λ_n) − log P(O_n|λ̄_n).    (14.4)

For normalization, an average frame LLR, R_n, is defined as

    R_n = (1 / l_n) [ log P(O_n|λ_n) − log P(O_n|λ̄_n) ],    (14.5)

where l_n is the length of the speech segment. For each subword, a decision can be made by

    Acceptance: R_n ≥ T_n;    Rejection: R_n < T_n,    (14.6)

where either a subword-dependent threshold value T_n or a common threshold T can be determined numerically or experimentally. In the above test, we assume independence among all HMM states and among all subwords. Therefore, the above test can be interpreted as applying the Neyman-Pearson lemma in every state, then combining the scores together as the final average LLR score.

14.2.1 Normalized Confidence Measures

A confidence measure M for a key utterance O can be represented as

    M(O) = F(R_1, R_2, ..., R_N),    (14.7)

where F is the function to combine the LLR's of all subwords in the key utterance. To make a decision at the subword level, we need to determine the threshold for each of the subword tests. If we have the training data for each subword model and the corresponding anti-subword model, this is not a problem. However, in many cases, the data may not be available. Therefore, we need to define a test which allows us to determine the thresholds without using the training data. For subword S_n, which is characterized by a model λ_n, we define

    C_n = [ log P(O_n|λ_n) − log P(O_n|λ̄_n) ] / | log P(O_n|λ_n) |,    (14.8)
where log P(O_n|λ_n) ≠ 0. C_n > 0 means the target score is larger than the anti-score, and vice versa. Furthermore, we define a normalized confidence measure for an utterance with N subwords as

    M = (1/N) Σ_{n=1}^{N} f(C_n),    (14.9)

where

    f(C_n) = { 1, if C_n ≥ θ; 0, otherwise.    (14.10)

M lies in a fixed range, 0 ≤ M ≤ 1. Due to the normalization in Eq. (14.8), θ is a subword-independent threshold which can be determined separately. A subword is accepted and counted as part of the utterance confidence measure only if its C_n score is greater than or equal to the threshold value θ. Thus, M can be interpreted as the percentage of acceptable subwords in an utterance; e.g. M = 0.8 implies that 80% of the subwords in the utterance are acceptable. Therefore, an utterance threshold can be determined or adjusted based on the specifications of system performance and robustness.
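The normalized confidence measure can be sketched in a few lines. The per-subword target and anti-model log-likelihoods below are made-up placeholders, and the denominator of C_n is taken in absolute value (our reading of Eq. (14.8), so that C_n > 0 corresponds to the target score exceeding the anti-score):

```python
# Sketch of the normalized confidence measure of Eqs. (14.8)-(14.10): each
# subword's (target, anti) log-likelihood pair yields a normalized score C_n,
# and M is the fraction of subwords whose C_n clears the threshold theta.
import numpy as np

def subword_confidence(log_p_target, log_p_anti):
    """C_n: normalized target-vs-anti log-likelihood difference, Eq. (14.8)."""
    return (log_p_target - log_p_anti) / abs(log_p_target)

def utterance_confidence(pairs, theta):
    """M: fraction of subwords with C_n >= theta, Eqs. (14.9)-(14.10)."""
    scores = [subword_confidence(t, a) for t, a in pairs]
    return float(np.mean([1.0 if c >= theta else 0.0 for c in scores]))

# Hypothetical per-subword (target, anti) log-likelihoods for one utterance.
pairs = [(-42.0, -55.0), (-30.0, -29.0), (-61.0, -80.0), (-25.0, -33.0)]
M = utterance_confidence(pairs, theta=0.05)
print(f"M = {M:.2f}")   # 3 of 4 subwords clear theta, so M = 0.75
```

With these placeholder scores, the second subword scores slightly worse under its target model than under its anti-model, so only 75% of the subwords are counted as acceptable.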
14.3 Sequential Utterance Verification

For a verification with multiple utterances, the above single-utterance test strategy can be extended to a sequence of subtests, similar to the step-down procedure in statistics [1]. Each of the subtests is an independent single-utterance verification. As soon as a subtest calls for rejection, H_1 is chosen and the procedure is terminated; if no subtest leads to rejection, H_0 is accepted, i.e. the user is accepted. We define H_0 to be the target hypothesis in which all the answered utterances match the key information in the profile. We have

    H_0 = ∩_{i=1}^{J} H_0(i),    (14.11)

where J is the total number of subtests and H_0(i) is a component target hypothesis in the ith subtest corresponding to the ith utterance. The alternative hypothesis is

    H_1 = ∪_{i=1}^{J} H_1(i),    (14.12)
where H_1(i) is a component alternative hypothesis corresponding to the ith subtest. We assume independence among the subtests. On the ith subtest, a decision can be made by

    Acceptance: M(i) ≥ T(i);    Rejection: M(i) < T(i),    (14.13)
where M(i) and T(i) are the confidence score and the corresponding threshold for utterance i, respectively. As is well known, when performing a test, one may commit one of two types of error: rejecting the hypothesis when it is true, a false rejection (FR), or accepting it when it is false, a false acceptance (FA). We denote the FR and FA error rates as ε_r and ε_a, respectively. An equal-error rate (EER), ε, is defined when the two error rates are equal in a system, i.e. ε_r = ε_a = ε. For a sequential test, we extend the definitions of error rates as follows. A false rejection error on J utterances (J ≥ 1) is the error when the system rejects a correct response in any one of the J hypothesis subtests. A false acceptance error on J utterances (J ≥ 1) is the error when the system accepts an incorrect set of responses after all of the J hypothesis subtests. The equal-error rate on J utterances is the rate at which the false rejection error rate and the false acceptance error rate on J utterances are equal. For convenience, we denote the above FR and FA error rates on J utterances as E_r(J) and E_a(J), respectively. Let Ω_i = R_1(i) ∪ R_0(i) be the region of confidence scores of the ith subtest, where R_0(i) is the region of confidence scores which satisfy M(i) ≥ T(i), from which we accept H_0(i), and R_1(i) is the region of scores which satisfy M(i) < T(i), from which we accept H_1(i). The FR and FA errors for subtest i can be represented as the conditional probabilities

    ε_r(i) = P( M(i) ∈ R_1(i) | H_0(i) ),    (14.14)

    ε_a(i) = P( M(i) ∈ R_0(i) | H_1(i) ),    (14.15)
respectively. Furthermore, the FR error on J utterances can be evaluated as

    E_r(J) = P( ∪_{i=1}^{J} {M(i) ∈ R_1(i)} | H_0 ) = 1 − ∏_{i=1}^{J} (1 − ε_r(i)),    (14.16)

and the FA error on J utterances is

    E_a(J) = P( ∩_{i=1}^{J} {M(i) ∈ R_0(i)} | H_1 ) = ∏_{i=1}^{J} ε_a(i).    (14.17)
Equations (14.16) and (14.17) indicate an important property of the sequential test defined above: the more subtests there are, the smaller the FA error and the larger the FR error. Therefore, we can have the following strategy in a
VIV system design: starting from the first subtest, we first set the threshold value such that the FR error rate for the subtest, ε_r, is close to zero or to a small number corresponding to the design specifications, then add more subtests in the same way until meeting the required system FA error rate, E_a, or reaching the maximum number of allowed subtests. In a real application, it may save verification time to arrange the subtests in order of descending importance and decreasing subtest error rates; thus, the system first prompts users with the most important question, or with the subtest which we know has a high FR error ε_r(i). Then, if a speaker is falsely rejected, the session can be restarted right away with little inconvenience to the user. Equation (14.16) also indicates the reason that an ASR approach would not perform very well in a sequential test. Although ASR can give us a low FR error, ε_r(i), on each of the individual subtests, the overall FR error on J utterances, E_r(J), J > 1, can still be very high. In the proposed utterance verification approach, we make the FR on each individual subtest close to zero by adjusting the threshold value while controlling the overall FA error by adding more subtests until reaching the design specifications. We use the following examples to illustrate the above concept.

14.3.1 Examples in Sequential-Test Design

We use two examples to show sequential-test design based on required error rates.

Example 1: Adding an Additional Subtest

A bank operator asks two kinds of personal questions while verifying a customer. When automatic VIV is applied to the procedure, the average individual error rates on these two subtests are ε_r(1) = 0.1%, ε_a(1) = 5%; and ε_r(2) = 0.2%, ε_a(2) = 6%, respectively. Then, from Eqs. (14.16) and (14.17), we know that the system FR and FA errors on the sequential test are E_r(2) = 0.3% and E_a(2) = 0.3%.
If the bank wants to further reduce the FA error, one additional subtest can be added to the sequential test. Suppose the additional subtest has ε_r(3) = 0.3% and ε_a(3) = 7%. The overall system error rates become E_r(3) = 0.6% and E_a(3) = 0.021%.

Example 2: Determining the Number of Subtests

A security system requires E_r(J) ≤ 0.03% and E_a(J) ≤ 0.2%. It is known that each subtest can have ε_r ≤ 0.01% and ε_a ≤ 12% by adjusting the thresholds. In this case, we need to determine the number of subtests, J, that meets the design specifications. From Eq. (14.17), we have
    J = ⌈ log E_a / log ε_a ⌉ = ⌈ log 0.002 / log 0.12 ⌉ = 3.
Then, the actual system FA rate on three subtests is E_a(3) = 0.17% ≤ 0.2%, and the FR rate on three subtests is E_r(3) = 0.03%. Therefore, three subtests can meet the required performance on both FR and FA.
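The error arithmetic of the two examples above can be checked directly from Eqs. (14.16) and (14.17):

```python
# Sequential-test error rates: a false rejection occurs if ANY subtest
# falsely rejects (Eq. 14.16); a false acceptance occurs only if EVERY
# subtest falsely accepts (Eq. 14.17). Rates are given as fractions.
import math

def overall_fr(fr_rates):
    """E_r(J) = 1 - prod(1 - eps_r(i))."""
    p = 1.0
    for e in fr_rates:
        p *= (1.0 - e)
    return 1.0 - p

def overall_fa(fa_rates):
    """E_a(J) = prod(eps_a(i))."""
    p = 1.0
    for e in fa_rates:
        p *= e
    return p

# Example 1: two subtests, then a third is added.
print(round(overall_fr([0.001, 0.002]) * 100, 2))       # E_r(2) in percent
print(round(overall_fa([0.05, 0.06]) * 100, 2))         # E_a(2) in percent
print(round(overall_fa([0.05, 0.06, 0.07]) * 100, 3))   # E_a(3) in percent

# Example 2: smallest J with eps_a <= 12% per subtest giving E_a <= 0.2%.
J = math.ceil(math.log(0.002) / math.log(0.12))
print(J)
```

Running this reproduces the figures in the examples: E_r(2) ≈ 0.3%, E_a(2) = 0.3%, E_a(3) = 0.021%, and J = 3.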
14.4 VIV Experimental Results

In the following experiments, the VIV system verifies speakers by three sequential subtests, i.e. J = 3. The system performance with various decision thresholds will be evaluated and compared. The experimental database includes 100 speakers. Each speaker gave three utterances as the answers to the following three questions: "In which year were you born?" "In which city and state did you grow up?" "May I have your telephone number, please?" This is a biased database. Twenty-six percent (26%) of the speakers have a birth year in the 1950's, and 24% are in the 1960's; there is only a one-digit difference among those birth years. Regarding place of birth, 39% were born in "New Jersey", with 5% born in the exact same city and state: "Murray Hill, New Jersey". Thirty-eight percent (38%) of the telephone numbers provided start with "908 582 ...", which means that at least 60% of the digits in their answers for their telephone numbers are identical. In addition, some of the speakers have foreign accents, and some cities and states are in foreign countries. In the experiments, a speaker was considered a true speaker when the speaker's utterances were verified against his or her own data profile. The same speaker was considered an impostor when the utterances were verified against other speakers' profiles. Thus, for each true speaker, we have three utterances from the speaker and 99 × 3 utterances from the other 99 speakers as impostors. The speech signal was sampled at 8 kHz and pre-emphasized using a first-order filter with a coefficient of 0.97. The samples were blocked into overlapping frames of 30 ms in duration and updated at 10 ms intervals. Each frame was windowed with a Hamming window. The cepstrum was derived from a 10th-order LPC analysis. The LPC coefficients were then converted to cepstral coefficients, of which only the first 12 were kept.
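The front-end preprocessing just described (pre-emphasis, framing, windowing) can be sketched as follows; the LPC-cepstrum computation itself is omitted, and random noise stands in for speech:

```python
# Sketch of the described front end: first-order pre-emphasis with
# coefficient 0.97, then 30 ms Hamming-windowed frames updated every 10 ms
# at an 8 kHz sampling rate.
import numpy as np

FS = 8000                       # sampling rate, Hz
FRAME = int(0.030 * FS)         # 30 ms -> 240 samples
SHIFT = int(0.010 * FS)         # 10 ms -> 80 samples

def preemphasize(x, a=0.97):
    """y[n] = x[n] - a * x[n-1] (first-order pre-emphasis filter)."""
    return np.append(x[0], x[1:] - a * x[:-1])

def frame_signal(x):
    """Slice into overlapping frames and apply a Hamming window."""
    n_frames = 1 + (len(x) - FRAME) // SHIFT
    w = np.hamming(FRAME)
    return np.stack([x[i * SHIFT:i * SHIFT + FRAME] * w
                     for i in range(n_frames)])

x = np.random.default_rng(0).standard_normal(FS)   # 1 s of noise as a stand-in
frames = frame_signal(preemphasize(x))
print(frames.shape)   # (98, 240): 98 frames of 240 samples each
```

Each resulting frame would then be passed to the 10th-order LPC analysis and cepstral conversion described in the text.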
The feature vector consisted of 39 features, including 12 cepstral coefficients, 12 delta cepstral coefficients, 12 delta-delta cepstral coefficients, energy, delta energy, and delta-delta energy [12]. The models used in evaluating the subword verification scores were a set of 1117 right-context-dependent HMM's as the target phone models [3] and a set of 41 context-independent anti-phone HMM's as anti-models [17]. For a VIV system with multiple subtests, either one global threshold, i.e. T = T(i), or multiple thresholds, i.e. T(i) ≠ T(j), i ≠ j, can be used. The
thresholds can be either context (key information) dependent or context independent. They can also be either speaker dependent or speaker independent.

Two Speaker-Independent Thresholds

For robust sequential verification, we define the logic of using two speaker-independent and context-dependent thresholds for a multiple-question trial as follows:

    T(i) = { T_L, when T_L ≤ M(i) < T_H for the first time; T_H, otherwise,    (14.18)

where T_L and T_H are two threshold values, and M(i) and T(i) are the values of the confidence measure and threshold, respectively, for the ith subtest. Eq. (14.18) means that T_L can be used only once during the sequential trial. Thus, if a true speaker has only one lower score in a sequential test, the speaker still has a chance to pass the overall verification trial. This is useful in noisy environments or for speakers who may not speak consistently. When the above two thresholds were applied to VIV testing, the system performance improved over the single-threshold test, as shown in Table 14.1. The minimal FA rates in the table were obtained by adjusting the thresholds while maintaining the FR rates at 0%. As we can see from the table, the thresholds for M have limited ranges, 0.0 ≤ T_L, T_H ≤ 1.0, and clear physical meanings: i.e. T_L = 0.69 and T_H = 0.84 imply that 69% and 84% of phones are acceptable, respectively. A comparison of the single- and two-threshold tests is listed in Table 14.2.

Table 14.1. False Acceptance Rates when Using Two Thresholds and Maintaining False Rejection Rates at 0.0%

Confidence measure   FA on three utterances   Thresholds
M                    0.57%                    T_L = 0.69; T_H = 0.84
M_2                  0.79%                    T_L = −0.262; T_H = 0.831
Table 14.2. Comparison of Two- and Single-Threshold Tests

No. of SI thresholds   Error rates on three utterances   Threshold values
Two                    FA = 0.57%, FR = 0.0%             T_L = 0.69; T_H = 0.84
Single                 FA = 0.75%, FR = 1.0%             T = 0.89
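The two-threshold decision logic of Eq. (14.18) can be sketched as a small function; the scores below are hypothetical:

```python
# Sketch of the two-threshold sequential decision of Eq. (14.18): the lower
# threshold T_L may be used at most once per trial, giving a true speaker
# one chance to survive a single low-scoring utterance.
def sequential_verify(scores, t_low=0.69, t_high=0.84):
    """Accept iff every utterance clears its threshold, where one score in
    [t_low, t_high) is tolerated per trial."""
    low_used = False
    for m in scores:
        if m >= t_high:
            continue
        if m >= t_low and not low_used:
            low_used = True      # T_L spent on this utterance
            continue
        return False             # below T_L, or T_L already used
    return True

print(sequential_verify([0.90, 0.75, 0.88]))   # one dip; T_L absorbs it
print(sequential_verify([0.90, 0.75, 0.80]))   # two dips below T_H: reject
```

With the thresholds of Table 14.1, the first (hypothetical) trial is accepted and the second is rejected, since T_L can be spent only once.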
Robust Intervals

A speaker may have fluctuating test scores, even for utterances of the same text, due to variations in voice characteristics, channels, and acoustic environment. We therefore define a robust interval, τ, to characterize the variation and the system robustness:

    T′(i) = T(i) − τ,    0 ≤ τ < T(i),    (14.19)

where T(i) is the original context-dependent utterance threshold as defined in Eq. (14.13), and T′(i) is the adjusted threshold value. The robust interval, τ, is equivalent to the tolerance in the test score to accommodate fluctuation due to variations in environments or a speaker's conditions. In a system evaluation, τ can be reported with error rates as an allowed tolerance, or it can be used to determine the thresholds based on system specifications. For example, a bank authentication system may need a smaller τ to ensure a lower FA rate for a higher security level, while a voice messaging system may select a larger τ for a lower FR rate to avoid user frustration.

Speaker-Dependent Thresholds

To further improve the performance, a VIV system can start from a speaker-independent threshold, then switch to speaker- and context-dependent thresholds after the system has been used several times by a user. To ensure no false rejection, the upper bound of the threshold for subtest i of a speaker can be selected as

    T(i) ≤ min_j {M(i, j)},    j = 1, ..., I,    (14.20)
where M(i, j) is the confidence score for utterance i on the jth trial, and I is the total number of trials that the speaker has performed on the same context of utterance i. In this case, we have three thresholds associated with the three questions for each speaker. Following the design strategy proposed in Section 14.3, the thresholds were determined by first estimating T(i) as in Eq. (14.20) to guarantee a 0% FR rate. Then, the thresholds were shifted to evaluate the FA rate at different robust intervals τ as defined in Eq. (14.19). The relation between the robust interval and the false acceptance rates on three questions using the normalized confidence measure is shown in Fig. 14.3, where the horizontal axis indicates the changes in the value of the robust interval τ. The three curves represent the performance of a VIV system using one to three questions for speaker authentication while maintaining an FR rate of 0%. An enlarged graph of the performance for the cases of two and three subtests is shown in Fig. 14.4. We note that the conventional ROC plot cannot be applied here since the FR is 0%. We also note that the threshold adjustment is made on a per-speaker, per-question basis, although the plot in Fig. 14.4 shows the overall performance for all speakers.
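Eqs. (14.19) and (14.20) can be combined into a small helper that selects a speaker-dependent threshold from a speaker's earlier scores and shifts it by the robust interval. This is an illustrative sketch under assumed names, not the book's implementation.

```python
def sd_threshold(past_scores, tau=0.0):
    """Speaker-dependent threshold for one question (Eqs. 14.19 and 14.20).

    The upper bound is the minimum confidence score the speaker produced
    on earlier trials of the same question; tau is the robust interval.
    """
    t = min(past_scores)          # Eq. (14.20): T(i) <= min_j M(i, j)
    assert 0.0 <= tau < t         # Eq. (14.19) requires 0 <= tau < T(i)
    return t - tau                # shift the threshold by the robust interval

# With a 6% robust interval, the threshold sits 0.06 below the worst past score.
assert abs(sd_threshold([0.91, 0.84, 0.88], tau=0.06) - 0.78) < 1e-9
```

Choosing T(i) at this upper bound guarantees that none of the speaker's past utterances would have been falsely rejected.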
14 Verbal Information Veriﬁcation
From the figures, we can see that when using a one-question test, we cannot obtain a 0% EER. Using two questions, we have a 0% equal-error rate but with no tolerance (i.e. a robust interval of τ = 0). With three questions, the VIV system gave a 0% EER with a 6% robust interval, which means that when a true speaker's utterance scores are 6% lower (e.g. due to variations in telephone quality), the speaker can still be accepted while all impostors in the database are rejected correctly. This robust interval gives room for variation in the true speaker's score to ensure robust performance of the system. Fig. 14.4 also implies that three questions are necessary to obtain a 0% FA in the experiment.

In real applications, a VIV system may apply SI thresholds to a new user and switch to SD thresholds after the user has accessed the system successfully a few times. The thresholds can also be updated based on recent scores to accommodate changes in a speaker's voice and environment. An updated SD threshold can be determined as

    T(i) < α min_j {M(i, j)},   1 ≤ I − k ≤ j ≤ I,  k ≥ 1,          (14.21)

where M(i, j) is the confidence score for utterance i on the jth trial, I is the total number of trials that the speaker has produced on the same context of utterance i, and k is the update duration, i.e. the updated threshold is determined based on the last k trials.
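The sliding-window update of Eq. (14.21) might be sketched as follows. The function name and the value of α are assumptions for illustration only; the text does not specify a particular α.

```python
def updated_sd_threshold(scores, k, alpha=0.95):
    """Update the SD threshold from the last k trials (Eq. 14.21).

    scores holds M(i, j) for j = 1..I on the same question; alpha < 1
    (an assumed value here) keeps the threshold strictly below the
    minimum of the recent scores.
    """
    recent = scores[-k:]                  # the last k trials, j = I-k+1 .. I
    return alpha * min(recent)

# The early low score (0.70) ages out of the k = 3 window.
scores = [0.70, 0.85, 0.90, 0.88]
assert abs(updated_sd_threshold(scores, k=3, alpha=0.95) - 0.95 * 0.85) < 1e-9
```

Basing the update on the last k trials lets the threshold track slow drift in a speaker's voice or channel conditions.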
Fig. 14.3. False acceptance rate as a function of robust interval with SD threshold for a 0% false rejection rate. The horizontal axis indicates the shifts of the values of the robust interval τ .
A summary of VIV for speaker authentication is shown in Table 14.3. In the utterance veriﬁcation approach, when SD thresholds are set for each key
Fig. 14.4. An enlarged graph of the system performance using two and three questions.

Table 14.3. Summary of the experimental results on verbal information verification approaches
    Approach                          | False rejection | False acceptance | Accuracy | Robust interval
    Sequential utterance verification | 0%              | 0%               | 100%     | 6%

    (Note: tested on 100 speakers with three questions while speaker-dependent thresholds were applied.)
information ﬁeld, we achieved 0% average individual EER with a 6% robust interval.
14.5 Conclusions

In this chapter we presented an automatic verbal information verification technique for user authentication. VIV authenticates speakers by verbal content instead of voice characteristics. We also presented a sequential utterance verification solution to VIV together with a system design procedure. Given the number of test utterances (subtests), the procedure can help us design a system with a minimal overall error rate; given a limit on the error rate, the procedure can determine how many subtests are needed to obtain the expected accuracy. In a VIV experiment with three questions prompted and tested sequentially, the proposed VIV system achieved a 0% equal-error rate with a 6% robust interval on 100 speakers when SD utterance thresholds were applied. However, since VIV verifies the verbal content only and not a speaker's voice characteristics, it is the user's responsibility to protect his or her personal information. The sequential verification technique can also be applied to other biometric verification systems or multi-modality verification systems in which more than one verification method is employed, such as voice plus fingerprint verification, or other configurations. For real-world applications, a practical and useful speaker authentication system can be a combination of speaker recognition and verbal information verification. In the next chapter, Chapter 15, we provide an example of a real speaker authentication design using the techniques outlined in this book.
Chapter 15 Speaker Authentication System Design
In this book, we have introduced various speaker authentication techniques. Each technique can be considered a technical component, such as speaker identification, speaker verification, verbal information verification, and so on. In real-world applications, a speaker authentication system can be designed by combining these technical components to construct a useful and convenient system that meets the requirements of a particular application. In this chapter we provide an example of a speaker authentication system design. Following this example, readers can design their own systems for their particular applications to improve the security level of the protected system. This design example was originally reported in [2].
15.1 Introduction

Consider the following real-world scenario: a bank would like to provide convenient services to its customers while retaining a high level of security. The bank would like to use a speaker verification technique to verify customers during an online banking service, but it does not want to bother customers with an enrollment procedure for speaker verification. The bank wants a customer to be able to use the online banking service right after opening a bank account, without any acoustic enrollment. The bank also wants to use biometrics to enhance the banking system's security. How do we design a speaker authentication system that meets the bank's requirements based on the techniques introduced in this book?

A feasible solution is to combine speaker verification (SV) with verbal information verification (VIV). In such a system design, a user is verified by VIV in the first 4 to 5 accesses to the online banking system, usually from different acoustic environments, e.g., different ATM locations or different telephones, landline or wireless. The VIV system verifies the user's personal information and collects and verifies the passphrase utterances for use as training data for speaker-dependent model
Q. Li, Speaker Authentication, Signals and Communication Technology, DOI: 10.1007/978-3-642-23731-7_15, © Springer-Verlag Berlin Heidelberg 2012
Fig. 15.1. A conventional speaker veriﬁcation system
construction. The user's answers to the VIV questions can include a passphrase which will be used for SV later. After a speaker-dependent (SD) model is constructed, the system migrates from a VIV system to an SV system. This approach avoids the inconvenience of a formal enrollment procedure, ensures the quality of the training data for SV, and mitigates the mismatch caused by different acoustic environments between training and testing. In this chapter we describe such a system in detail. Experiments in this chapter show that such a system can improve SV performance by over 40% in relative equal-error rate (EER) reduction compared to a conventional SV system.
15.2 Automatic Enrollment by VIV

A conventional SV system is shown in Fig. 15.1. It involves two kinds of sessions: enrollment and test. In an enrollment session, an identity, such as an account number, is assigned to a speaker, and the speaker is asked to select a spoken passphrase, e.g. a connected digit string or a phrase. The system then prompts the speaker to repeat the passphrase several times, and a speaker-dependent HMM is constructed from the utterances collected in the enrollment session. In a test session, the speaker's test utterance is compared against the pre-trained, speaker-dependent HMM. The speaker is accepted if the likelihood-ratio score exceeds a preset threshold; otherwise the speaker is rejected.

When applying current speaker recognition technology to real-world applications, several problems were encountered which motivated our research on VIV [1, 2], introduced in Chapter 14. A conventional speaker recognition system needs an enrollment session to collect data for training an SD model. Enrollment is inconvenient to the user as well as to the system developer, who often has to supervise and ensure the quality of the collected data. The accuracy of the collected training data is critical to the performance of an SV system; even a true speaker might make a mistake when repeating the training utterances/passphrases several times. Furthermore, since the enrollment and testing voices may come from different telephone handsets and networks, there may be an acoustic mismatch between the training and testing environments. SD models trained on data collected in one enrollment session may not perform well when the test session is in a different environment or via a different transmission channel. The mismatch significantly affects SV performance.

To alleviate the above problems, we propose using VIV. The combined SV and VIV system [1, 2] is shown in Fig. 15.2, where VIV is involved in the enrollment, and one of the key utterances in VIV is the passphrase which will be used in SV later. During the first 4 to 5 accesses, the user is verified by a VIV system. The verified passphrase utterances are recorded and later used to train a speaker-dependent HMM for SV. At this point, the authentication process can be switched from VIV to SV.

Fig. 15.2. An example of speaker authentication system design: combining verbal information verification with speaker verification

There are several advantages to the combined system. First, the approach is convenient to users since it does not need a formal enrollment session and a user can start to use the system right after his/her account is opened. Second, the acoustic mismatch problem is mitigated since the training data come from different sessions, potentially via different handsets and channels. Third, the quality of the training data is ensured since the training phrases are verified
Fig. 15.3. A fixed-phrase speaker verification system
by VIV before establishing the SD HMM for the passphrase. Finally, once the system switches to SV, it would be diﬃcult for an impostor to access the account even if the impostor knows the true speaker’s passphrase.
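The migration policy described in this section can be sketched as a small controller. This is a hypothetical interface: `viv`, `train_hmm`, and `sv` are placeholder callables standing in for the verbal-information-verification step, HMM training, and the speaker verifier described in the text; none of these names come from the book.

```python
class AuthController:
    """Sketch of the VIV-to-SV migration policy of Fig. 15.2 (assumed API)."""

    def __init__(self, viv, train_hmm, sv, n_enroll=5):
        self.viv, self.train_hmm, self.sv = viv, train_hmm, sv
        self.n_enroll = n_enroll          # first 4-5 accesses use VIV
        self.utterances = []              # verified passphrase utterances
        self.model = None                 # speaker-dependent HMM, once trained

    def authenticate(self, utterance, answers):
        if self.model is None:            # still in the VIV enrollment phase
            if not self.viv(answers):
                return False              # verbal content failed: reject
            self.utterances.append(utterance)   # keep verified training data
            if len(self.utterances) >= self.n_enroll:
                self.model = self.train_hmm(self.utterances)  # migrate to SV
            return True
        return self.sv(self.model, utterance)   # SV phase: score the voice
```

Once the model exists, knowing the passphrase is no longer sufficient: the utterance must also match the speaker-dependent model.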
15.3 Fixed-Phrase Speaker Verification

The details of a fixed-phrase SV system can be found in Chapter 9. A block diagram of the test session used in our evaluation is shown in Fig. 15.3. After the speaker claims the identity, the system expects the same passphrase obtained in the training session. First, a speaker-independent phone recognizer is applied to find the endpoints by forced alignment. Then, cepstral mean subtraction (CMS) is conducted to reduce the acoustic mismatch. In general, to improve SV performance and robustness, the general stochastic matching algorithm discussed in Chapter 10 and the endpoint detection algorithm discussed in Chapter 5 can also be applied.

In the target score computation block of Fig. 15.3, the feature vectors are decoded into states by the Viterbi algorithm, using the whole-phrase model trained on the VIV-verified utterances. A log-likelihood score for the target model, i.e. the target score, is calculated as

    L(O, Λ_t) = (1/N_f) log P(O | Λ_t),          (15.1)

where O is a set of feature vectors, N_f is the total number of vectors, Λ_t is the target model, and P(O | Λ_t) is the likelihood score from the Viterbi decoding.

In the background score computation block, a set of speaker-independent (SI) HMMs in the order of the transcribed phoneme sequence, Λ_b = {λ_1, ..., λ_K}, is applied to align an input utterance with the expected transcription using the Viterbi decoding algorithm. The segmented utterance
is O = {O_1, ..., O_K}, where O_i is the set of feature vectors corresponding to the ith phoneme, S_i, in the phoneme sequence. There are different ways to compute the likelihood score for the background (alternative) model. Here, we apply the background score proposed in [3]:

    L(O, Λ_b) = (1/N_f) Σ_{i=1}^{K} log P(O_i | λ_{b_i}),          (15.2)

where Λ_b = {λ_{b_i}}_{i=1}^{K} is the set of SI phoneme models in the order of the transcribed phoneme sequence, P(O_i | λ_{b_i}) is the corresponding phoneme likelihood score, and K is the total number of phonemes. The SI models are trained from a different database by the EM algorithm [3]. In real implementations, the SI model can be the same one used in VIV. The target and background scores are then used in the following likelihood-ratio test [3]:

    R(O; Λ_t, Λ_b) = L(O, Λ_t) − L(O, Λ_b),          (15.3)

where L(O, Λ_t) and L(O, Λ_b) are defined in Eqs. (15.1) and (15.2), respectively. A final decision on rejection or acceptance is made by comparing R in Eq. (15.3) with a threshold. As pointed out in [3], if a significantly different phrase is given, the phrase can be rejected by the SI phoneme alignment before using the verifier.
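Eqs. (15.1)-(15.3) reduce to simple frame-normalized arithmetic once the Viterbi log-likelihoods are available. The sketch below assumes those log-likelihoods are given; the function names and the numbers in the example are illustrative only.

```python
def likelihood_ratio_score(target_loglik, phone_logliks, n_frames):
    """Likelihood-ratio score of Eqs. (15.1)-(15.3).

    target_loglik: log P(O | Lambda_t) from Viterbi decoding of the
    whole-phrase target model; phone_logliks: log P(O_i | lambda_bi) for
    each aligned SI phoneme segment; n_frames: total number of feature
    vectors N_f (both scores are frame-normalized by it).
    """
    l_target = target_loglik / n_frames               # Eq. (15.1)
    l_background = sum(phone_logliks) / n_frames      # Eq. (15.2)
    return l_target - l_background                    # Eq. (15.3)

def verify(r_score, threshold):
    """Accept the identity claim iff R meets the preset threshold."""
    return r_score >= threshold

# Illustrative numbers: 200 frames, three aligned phoneme segments.
r = likelihood_ratio_score(-1200.0, [-300.0, -450.0, -500.0], 200)
assert abs(r - 0.25) < 1e-9
assert verify(r, 0.2)
```

Frame normalization by N_f makes scores comparable across utterances of different lengths, which is what allows a single threshold to be used.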
15.4 Experiments

In this section, we conduct experiments to verify the design of the speaker authentication system.

15.4.1 Features and Database

The feature vector for SV is composed of 12 cepstral and 12 delta-cepstral coefficients, since it is not necessary to use 39 features for SV. The cepstrum is derived from a 10th-order LPC analysis over a 30 ms window, and the feature vectors are updated at 10 ms intervals [3]. The experimental database consists of fixed-phrase utterances recorded over the long-distance telephone network by 100 speakers, 51 male and 49 female. The fixed phrase, common to all speakers, is "I pledge allegiance to the flag," with an average length of two seconds. We assume the fixed phrase is one of the verified utterances in VIV. Five utterances of the passphrase recorded in five separate VIV sessions are used to train an SD HMM; thus the training data are collected from different acoustic environments and telephone channels at different times. We assume all the collected utterances have been verified by VIV to ensure the quality of the training data.
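The 12 + 12 feature layout and the CMS step can be sketched as follows. This is an illustrative sketch, not the book's feature extractor: the delta estimate via `np.gradient` is one simple choice, and the exact delta computation used in the experiments may differ.

```python
import numpy as np

def cms(cepstra):
    """Cepstral mean subtraction over one utterance (frames x coeffs)."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)

def add_deltas(cepstra):
    """Append first-order delta coefficients, giving the 12 cepstral +
    12 delta-cepstral layout described in the text."""
    return np.hstack([cepstra, np.gradient(cepstra, axis=0)])

# A 2 s phrase at a 10 ms frame shift gives about 200 frames.
rng = np.random.default_rng(0)
c = rng.standard_normal((200, 12))       # stand-in for LPC cepstra
feats = add_deltas(cms(c))
assert feats.shape == (200, 24)
assert np.allclose(feats[:, :12].mean(axis=0), 0.0)
```

CMS removes the per-utterance cepstral mean, which absorbs a fixed channel (handset) characteristic and is one of the mismatch-reduction steps in Fig. 15.3.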
For testing, we used 40 utterances recorded from a true speaker in different sessions, and 192 utterances recorded from 50 impostors of the same gender in different sessions. For model adaptation, the second, fourth, sixth, and eighth test utterances from the tested true speaker are used to update the associated HMM for verifying subsequent test utterances incrementally [3]. The SD target models for the phrases are left-to-right HMMs. The number of states depends on the total number of phonemes in the phrase. There are four Gaussian components associated with each state [3]. The background models are concatenated SI phone HMMs trained on a telephone speech database from different speakers and texts [4]. There are 43 phoneme HMMs, and each model has three states with 32 Gaussian components associated with each state. Due to unreliable variance estimates from the limited amount of speaker-specific training data, a global variance estimate was used as the common variance for all Gaussian components in the target models [3].

15.4.2 Experimental Results on Using VIV for SV Enrollment

In Chapter 14, we reported the experimental results of VIV on 100 speakers. The system had 0% error rates when three questions were tested by sequential utterance verification. Since we were using a pre-verified database, we assume that all the training utterances collected by VIV are correct. The results show the improvement obtained by reducing the acoustic mismatch through using VIV for enrollment. The SV experimental results with and without adaptation are listed in Table 15.1 and Table 15.2, respectively, for the 100 speakers. The numbers are average individual EERs in percent. The first data column lists the EERs using individual thresholds, and the second lists the EERs using common (pooled) thresholds for all tested speakers. The baseline system is the conventional SV system in which a single enrollment session is used.
The proposed system is the combined system in which VIV is used for automatic enrollment for SV. After the VIV system has been used five times, collecting training utterances from five different sessions, it switches over to an SV system. The test utterances for the baseline and the proposed system are the same. Without adaptation, the baseline system has EERs of 3.03% and 4.96% for individual and pooled thresholds, respectively, while the proposed system has EERs of 1.59% and 2.89%, respectively. With adaptation as defined in the last subsection, the baseline system has EERs of 2.15% and 3.12%, while the proposed system has EERs of 1.20% and 1.83%, respectively. The proposed system without adaptation has an even lower EER than the baseline system with adaptation. This is because the SD models in the proposed system were trained using data from different sessions, while the baseline system performs only incremental adaptation without reconstructing the models after collecting more data.
Table 15.1. Experimental results without adaptation, in average equal-error rates

    Algorithm         | Individual thresholds | Pooled thresholds
    SV (baseline)     | 3.03%                 | 4.96%
    VIV+SV (proposed) | 1.59%                 | 2.89%

Table 15.2. Experimental results with adaptation, in average equal-error rates

    Algorithm         | Individual thresholds | Pooled thresholds
    SV (baseline)     | 2.15%                 | 3.12%
    VIV+SV (proposed) | 1.20%                 | 1.83%
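The EER figures in these tables are the operating point where the false rejection rate equals the false acceptance rate. A minimal threshold-sweep estimate of the EER from two score lists might look like this; it is an illustrative sketch, not the evaluation code used for the tables.

```python
def equal_error_rate(true_scores, impostor_scores):
    """Estimate the EER by sweeping candidate thresholds.

    For each candidate threshold, FR is the fraction of true-speaker
    scores below it and FA is the fraction of impostor scores at or
    above it; the EER is taken where |FR - FA| is smallest.
    """
    best_gap, eer = float("inf"), None
    for t in sorted(set(true_scores + impostor_scores)):
        fr = sum(s < t for s in true_scores) / len(true_scores)
        fa = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        if abs(fr - fa) < best_gap:
            best_gap, eer = abs(fr - fa), (fr + fa) / 2.0
    return eer

# One overlapping impostor score and one low true score give a 25% EER.
assert abs(equal_error_rate([0.9, 0.8, 0.85, 0.2],
                            [0.1, 0.3, 0.15, 0.82]) - 0.25) < 1e-9
```

The "individual" versus "pooled" threshold columns correspond to running this computation per speaker versus once over all speakers' scores.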
The experimental results indicate several advantages of the proposed system design method. First, since VIV can provide training data from different sessions representing different channel environments, we can do significantly better than with one training session. Second, although we can adapt models originally trained on data collected in one session, the proposed system still does better. This is because a new model constructed from multi-session training data is more accurate than one obtained by incremental adaptation using the multi-session data. Lastly, in real-world applications, all the utterances used in training and adaptation can be verified by VIV before training or adaptation. Although this advantage cannot be observed in this database evaluation, it is critical in any real-world application, since even a true speaker may make a mistake while uttering a passphrase. The mistake can never be corrected once it is involved in model training or adaptation; VIV can protect the system from using wrong training data. In this section, we proposed only one configuration combining VIV with SV. For different applications, different combinations of speaker authentication techniques can be integrated to meet different specifications. For example, VIV can be employed in SV to verify a user before the user's data is used for SD model adaptation, or the VIV and SV systems can share the same set of speaker-independent models, with the decoding scores from VIV used in SV as the background score.
15.5 Conclusions

In this chapter, we presented a design example for a speaker authentication system. To improve both user convenience and system performance, we combined verbal information verification (VIV) and speaker verification (SV) to construct a convenient speaker authentication system. In the system, VIV is used to verify users during the first few accesses. Simultaneously, the system collects verified training data for constructing speaker-dependent models. Later, the system migrates from a VIV to an SV system for authentication. The combined
system is convenient for users since they can start to use it without going through a formal enrollment session and waiting for model training. However, it is still the user's responsibility to protect his or her personal information from impostors until the speaker-dependent model is trained and the system has migrated to an SV system. After the migration, an impostor would have difficulty accessing the account even if the passphrase is known. Since the training data can be collected over different channels in different VIV sessions, the acoustic mismatch problem is mitigated, potentially leading to better system performance in test sessions. The speaker-dependent HMMs can be updated to cover different acoustic environments while the system is in use, further improving system performance. Our experiments have shown that the combined speaker authentication system improves SV performance by more than 40% compared to a conventional SV system, simply by mitigating the acoustic mismatch. Furthermore, VIV can be used to ensure the quality of the training data for SV. To design a real-world speaker authentication system, it may be necessary to combine several speaker authentication techniques because each technique has its own advantages and limitations. If we combine them through careful design, the combined system has a better chance of meeting the design specifications and being useful in real-world applications.
References

1. Li, Q. and Juang, B.-H., "Speaker verification using verbal information verification for automatic enrollment," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Seattle, May 1998.
2. Li, Q., Juang, B.-H., Zhou, Q., and Lee, C.-H., "Automatic verbal information verification for user authentication," IEEE Transactions on Speech and Audio Processing, vol. 8, pp. 585-596, Sept. 2000.
3. Parthasarathy, S. and Rosenberg, A. E., "General phrase speaker verification using subword background models and likelihood-ratio scoring," in Proceedings of ICSLP-96, Philadelphia, October 1996.
4. Rosenberg, A. E. and Parthasarathy, S., "Speaker background models for connected digit password speaker verification," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Atlanta, pp. 81-84, May 1996.
Index
A posteriori probability, 180, 181, 192 Accessibility, 3 Acoustic mismatch, 9, 11 Adaptation, 228 AntiHMM, 211 Antimodel, 211 Antisymmetric features, 79 ASR, 8 AT, 111, 118, 124 ATM, 223 Auditory features, 135, 136 ﬁlters, 116 Auditory transform, 111 dilation variable, 139 discretetime, 123 fast inverse transform, 123 fast transform, 123 inverse, 120, 124 scale variable, 139 shift variable, 139 translation variable, 139 Auditorybased feature extraction, 135, 136, 138 transform, 111, 118 Authentication, 1 biometricbased, 2, 3 humanhuman, 2 humanmachine, 2 informationbased, 2, 5, 6 machinehuman, 2 machinemachine, 2 tokenbased, 2
Automatic enrollment, 224 Automatic speech recognition, 8 Average individual equalerror rates, 174 Background model, 153 Background noise, 114 Backpropagation, 44, 181, 192 Band number, 139 Bank application, 223 Bark scale, 113, 116, 128, 131 Basilar membrane, 116, 138 Batchmode process, 76 Baye decision rule, 69 Bayes, 181 Bayesian decision theory, 68, 179 Beam search, 93 width, 93 Benchmark, 38 Central frequency, 139 Centroids, 27 Cepstral domain, 158 Cepstral mean subtraction, 152, 158 CFCC, 127, 140, 147 Changepoint detection, 76, 95–97 problem, 95 state detection, 97 Classiﬁcation rules, 55 Closed test, 9
Q. Li, Speaker Authentication, Signals and Communication Technology, DOI: 10.1007/9783642237317, Ó SpringerVerlag Berlin Heidelberg 2012
231
232
Index
CMS, 152, 158, 226 Cochlea, 116 Cochlear ﬁlter, 120, 139 ﬁlter bank, 138 ﬁlter cepstral coeﬃcients, 140 Cochlear ﬁlter cepstral coeﬃcients, 127 Code vector, 30 Codewords, 27 Cohort normalization, 172 Combined system, 225 Common thresholds, 228 Communications, 75 Computation noise, 114 Conditionallikelihood functions, 182 Conﬁdence measure, 211 Cryptographic protocols, 5 Cubic lattice, 29 Cubic root, 141 Data space, 28 Decision boundary, 63 diagram, 75 parallel, 5 sequential, 5 Decision procedure, 2 parallel, 2 sequential, 2 Decoder complexity, 93 complexity analysis, 103 detectionbased , 93 forwardbackward, 95 optimal path, 95 search space alignment, 94 searchspace reduction, 93 subspace, 95 Decoding algorithm, 94 Decomposed frequency bands, 127 Decryption, 2 Denoising, 127 DET, 4 Detection, 93 Dialogue procedure, 10 Direct median method, 30 Direct method, 10 Discriminant analysis, 44 Discriminant function, 44
Discriminative training, 192 training objectives, 179 Distinctiveness, 3 Distortion, 30 Distribution estimation, 192 Eardrum, 116 Eavesdropping, 11 EER, 71, 155, 156, 213 Eigenvalue, 26, 50 Eigenvector, 30 EM algorithm, 62, 191 Encoding error, 28 Encryption, 2 key, 4 voice key, 4 Endpoint detection, 75, 76, 86 beginning edge, 78 beginning point, 78 decision diagram, 82 ending edge, 78 ending point, 78 energy normalization, 83 ﬁlter, 78, 82 realtime, 81 rising edge, 78 Energy feature, 78 normalization, 75, 86 Enrollment, 1, 224 session, 4, 7, 154, 157 Equal loudness function, 130 Equalerror rate, 155, 156, 213 Equivalent rectangular bandwidth, 120 ERB, 116, 120, 131, 139 Error rate, 180 Estimation covariance matrices, 196 mean vectors, 197 mixture parameters, 198 Evolution, 95 Expectationmaximization algorithm, 191 FA, 213 False acceptance, 71, 213 rejection, 71, 213
Index Fast discriminative training, 191 Fast Fourier transform, 111 Fast training algorithm, 193 FFT, 111 Filter bandwidth, 111 bank, 136 Finitestate grammars , 87 Fisher’s LDA, 44 Fixed pass phrase, 8 phrase, 8 Flops, 34 Forced alignment, 152 Forward auditory transform, 138 Fourier transform, 111 FR, 213 Full search, 93 Gammatone ﬁlter, 130 ﬁlter bank, 117, 142 function, 111, 117, 130 GFCC features, 142 MGFCC features, 142 transform, 117 Gaussian classiﬁer, 54 correlated, 29 node, 54 uncorrelated, 29 Gaussian mixture model, 23, 62, 153 GD, 191 Generalized minimum error rate, 179 Generalized probabilistic descent, 67 Generalized Rayleigh quotient problem, 51 GMER, 179, 185, 191 GMM, 23, 62, 153 GPD, 67 Gradient descent, 44, 191 Hair cell, 116, 140 Hearing system, 115 Hidden Markov model, 66 lefttorightl, 97 SD, 155 Searchspace reduction, 100 Hiden nodes, 43
233
High density, 28
Histogram, 29
HMM, 66
  left-to-right, 97
  SD, 155
  search-space reduction, 100
Hölder norm, 182
HSV, 166
Hypercube, 29
Hyperplane, 53

Identification, 4
Impostor, 9
Impulse response
  cochlear filters, 120
  plot, 120
Indirect method, 10
Inner ear, 116
Internet, 5
Invariant, 29
Inverse AT, 124
Inverse auditory transform, 124
Inverse transform, 30

Joint probability distribution, 61

K-means, 23, 28, 32
  algorithm, 27
Key information, 207
Keyword spotting, 209

Laplace
  distribution, 37
  source, 30, 37
LBG, 27, 34
LDA, 46, 51, 166
Left-to-right HMM, 67, 152
Likelihood
  ratio, 71, 152
  ratio test, 154
Linear discriminant analysis, 51, 166
Linear prediction cepstral coefficients, 136
Linear separable, 168
Linear transform, 157, 158, 160
  fast estimation, 160
LLR, 211
Local optimum, 28
Log likelihood
  ratio, 162, 211
  score, 70
LPC, 152
LPCC, 136

MAP, 181
Maximum a posteriori, 181
Maximum allowable distortion, 29
Maximum mutual information, 183
MCE, 179, 188
Mean-squared error, 35
Measurability, 3
Mel frequency cepstral coefficients, 135, 136, 204
Mel scale, 116
MER, 179, 188
MFCC, 135, 136, 144, 204
MGFCC, 142, 144
Middle ear, 116
  transfer function, 116
Minimum classification error, 179
Minimum error rate, 179, 181
  classification, 69
Mismatch, 157, 158, 160
ML, 179
MMI, 179, 183, 188
Modified GFCC, 142
Modified radial basis function network, 58
Mother's maiden name, 5, 10
MRBF, 58
MSE, 35–37
Multidimensional, 36
Multiple paths, 101
Multispectral pattern recognition, 58
Multivariate
  Gaussian, 23
  Gaussian distribution, 23
  statistical analysis, 27
  statistics, 23

NDA, 165
Neural networks, 44
Neyman-Pearson, 211
  lemma, 210
  test, 70
NIST, 4
Non-keyword rejection, 209
Nonstationary
  pattern recognition, 61
  process, 61, 95
Normalized confidence measures, 211
Normalized discriminant analysis, 165, 166, 168

Objective, 179
  function, 179
  generalized minimum error rate, 179, 185
  GMER, 179, 193
  HMM training, 182
  maximum mutual information, 179, 183
  maximum-likelihood, 179
  MCE, 179, 181
  MER, 179
  minimum classification error, 179, 181, 183
  minimum error rate, 179
  MMI, 179
One-pass VQ, 23, 28
  algorithm, 32
  codebook design, 34
  complexity analysis, 33
  data pruning, 28
  direct-median method, 30
  principal component method, 30
  pruning, 31
  robustness, 38
  updating, 32
One-to-many, 4
One-to-one, 4
Open test, 9
Open-set evaluation, 154
Operator, 9
Optimal Gaussian classifier, 53
Outer ear, 116
  transfer function, 116
Outlier, 29, 39
Oval window, 116

Parametric model, 95
Passphrase, 8
Pattern recognition, 43
PCA, 23, 25
Perceptual linear predictive, 135, 136
Personal data profile, 11
Personal identification number, 2
Personal information, 11
PFC, 44
PFN, 43, 44
  design procedure, 44
Phoneme alignment, 227
PIN, 2
Pitch harmonics, 114
PLP, 135, 136, 147
Pooled thresholds, 228
Posterior probability, 180, 181
Pre-emphasize, 152
Principal component, 25
  analysis, 23, 25
  discriminant analysis, 53
  hidden node design, 52
  method, 30
Principal feature, 44
  classification, 44
Principal feature network, 43, 44
  classified region, 46
  construction procedure, 48
  contribution array, 56
  decision-tree implementation, 46
  Fisher's node, 48, 51
  Gaussian discriminant node, 49
  hidden node, 46
  hidden node design, 48
  lossy simplification, 56
  maximum SNR hidden node, 55
  parallel implementation, 46, 48
  PC node, 53
  principal component node, 48
  sequential implementation, 48
  simplification, 56
  thresholds, 56
  unclassified region, 46
Pruning, 28, 44, 95

Q-factor, 118, 120

Radial basis network, 46
Radius, 28
RASTA, 147
RASTA-PLP, 135, 136, 147
RBF, 46
Rectifier function, 120
Recurrent neural network, 61
Recursive, 97
Registration, 1
Repeatability, 3
Resynthesis, 117
Robust interval, 217
Robust speaker identification, 135
Robustness, 3
Rotation, 158, 161
RST, 161

Scale, 158, 161
SD, 7, 9, 154, 209
SD target model, 228
Search path, 95
Searching, 93
Segmental K-means, 39, 67
Sequential
  design algorithm, 30
  detection, 95
  detection scheme, 96
  observations, 69
  probability ratio test, 95
  procedure, 44
  process, 76
  questions, 208
  utterance verification, 212
SI, 154, 209
SID, 7
Single Gaussian distribution, 24
Single path, 101
Single utterance
  ASR, 209
  verification, 209
SNR, 44, 142
Source histogram, 29
Speaker authentication, 6
  progressive integrated, 11
  system, 223
Speaker dependent, 7, 209
Speaker identification, 7, 204
Speaker independent, 209
Speaker recognition, 6, 7
  text constrained, 8
Speaker verification, 7
  background score, 154
  connected digit, 166
  context-dependent, 152
  enrollment phase, 152
  fixed phrase, 151, 152, 158, 226
  fixed-phrase, 165
  general phrase, 161
  global variance, 163
  hybrid, 174
  hybrid system, 166
  language-dependent, 156
  language-independent, 151, 156
  passphrase, 151, 155
  randomly prompted, 165, 166
  robust, 157
  speaker dependent, 152
  speaker independent, 152
  stochastic matching, 161
  target model, 163
  target score, 154
  test phase, 152
  text dependent, 161
  text independent, 8
  text prompted, 8
  whole-phrase model, 152
Speaker-dependent threshold, 217
Speaker-independent models, 9
Spectrogram, 113
  AT, 120
  auditory transform, 120
  FFT, 120
Spectrum, 129
Speech activity detection, 76
Speech detection, 76
Speech recognition, 6
Speech segmentation, 68
Speech segments, 78
Speech-sampling rate, 78
Speed up, 34, 37
SPRT, 95
SR, 6, 7
State change-point detection, 104
State transition diagram, 82
Stationary
  pattern recognition, 61
  process, 61
Statistical properties, 3
  nonstationary, 3
  stationary, 3
Statistical verification, 70
Step-down procedure, 212
Stochastic matching, 157
  fast, 158
  fast algorithm, 158
  geometric interpretation, 158
  linear transform, 158
  rotation, 158
  scale, 158
Stochastic process, 61
Subband signals, 138
Subtest, 212
Subword
  hypothesis testing, 210
  model, 210
  sequence, 210
  test, 211
SV, 7
Synthesized speech, 126
System design, 223

Target model, 154
Test session, 7, 154
Testing, 1
  session, 4
Text to speech, 10, 208
Time-frequency
  analyses, 117
  auditory-based analysis, 137
  transforms, 111
Training procedure, 200
Translation, 158, 161
Traveling wave, 116
TTS, 10, 208

Uncorrelated Gaussian source, 34
User authentication, 2
Utterance verification, 71, 209

Variable length window, 140
Vector quantization, 27
Verbal information verification, 6, 9, 207
Verification, 1, 4
Verification session, 158
Viterbi, 67, 68
  algorithm, 93
  search, 94
VIV, 6, 9, 207, 208, 223, 224
Voice characteristics, 6
VoIP, 75, 77
VQ, 23, 27, 28

Warped frequency, 128
Wavelet transform, 117
Window shifting, 128
Wireless communication, 5
Word error rates, 86
WT, 117