Ramón López-Cózar Delgado • Tetsunori Kobayashi Editors
Proceedings of the Paralinguistic Information and its Integration in Spoken Dialogue Systems Workshop
Editors

Ramón López-Cózar Delgado
Department of Languages and Computer Systems
University of Granada
Granada, Spain
[email protected]

Tetsunori Kobayashi
Department of Computer Science and Engineering
Waseda University
Okubo 3-4-1, 169-8555 Tokyo, Japan
[email protected]

ISBN 978-1-4614-1334-9
e-ISBN 978-1-4614-1335-6
DOI 10.1007/978-1-4614-1335-6
Springer New York Dordrecht Heidelberg London
Library of Congress Control Number: 2011935537

© Springer Science+Business Media, LLC 2011
All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Preface
The 3rd International Workshop on Spoken Dialogue Systems (IWSDS2011) was held in Granada, Spain, on 1-3 September 2011, as a satellite event of Interspeech 2011. This annual workshop brings together researchers from all over the world working in the field of spoken dialogue systems. It provides an international forum for the presentation of research and applications, and for lively discussions among researchers as well as industrialists.

Following the success of IWSDS2009 (in Irsee, Germany) and IWSDS2010 (in Gotemba Kogen Resort, Japan), IWSDS2011 designated "Paralinguistic Information and its Integration in Spoken Dialogue Systems" as its special theme for discussion, addressed in three Special Tracks:

• Spoken dialogue systems for robotics.
• Emotions and spoken dialogue systems.
• Spoken dialogue systems for real-world applications.

We also encourage discussions on common issues of spoken dialogue systems, including but not limited to:

• Speech recognition and semantic analysis.
• Dialogue management.
• Recognition of emotions from speech, gestures, facial expressions and physiological data.
• User modelling.
• Planning and reasoning capabilities for coordination and conflict description.
• Conflict resolution in complex multi-level decisions.
• Multi-modality such as graphics, gesture and speech for input and output.
• Fusion and information management.
• Learning and adaptability.
• Visual processing and recognition for advanced human-computer interaction.
• Databases and corpora.
• Evaluation strategies and paradigms.
• Prototypes and products.
The workshop programme consists of 35 papers and two invited keynote talks by Prof. Kristiina Jokinen and Prof. Roger K. Moore. We thank the Scientific Committee members for their efficient contributions and for completing the review process on time. Moreover, we express our gratitude to the Steering Committee for their guidance and suggestions, and to the Local Committee for their help with the organisational arrangements. In addition, we thank the Organising Committee of IWSDS2009 for their support in updating the workshop's web page, and the Organising Committee of IWSDS2010 for sharing with us their knowledge on local arrangements. Furthermore, we must mention that this workshop would not have been possible without the support of:

• ACL (Association for Computational Linguistics).
• Dept. of Languages and Computer Systems, University of Granada, Spain.
• ISCA (International Speech Communication Association).
• Korean Society of Speech Scientists.
• SIGDIAL (ACL Special Interest Group on Discourse and Dialogue).
• University of Granada, Spain.
Last, but not least, we thank the authors for their contributions. We hope all the attendees benefit from the workshop and enjoy their stay in Granada.

Granada, Spain                                        Ramón López-Cózar
Tokyo, Japan                                          Tetsunori Kobayashi
September 2011
Organisation
IWSDS2011 was organised by the Dept. of Languages and Computer Systems (University of Granada, Spain) and the Dept. of Computer Science and Engineering (Waseda University, Japan).

Chairs
Ramón López-Cózar
Tetsunori Kobayashi

Local Committee
Zoraida Callejas
David Griol
Gonzalo Espejo
Nieves Ábalos
Research Group on Spoken and Multimodal Dialogue Systems (SISDIAL)

Scientific Committee
Jan Alexandersson - DFKI, Saarbrücken, Germany
Masahiro Araki - Interactive Intelligence Lab, Kyoto Institute of Technology, Japan
André Berton - Daimler R&D, Ulm, Germany
Heriberto Cuayáhuitl - DFKI, Saarbrücken, Germany
Sadaoki Furui - Tokyo Institute of Technology, Tokyo, Japan
Joakim Gustafson - KTH, Stockholm, Sweden
Tobias Heinroth - Ulm University, Germany
Paul Heisterkamp - Daimler Research, Ulm, Germany
Kristiina Jokinen - University of Helsinki, Finland
Tatsuya Kawahara - Kyoto University, Japan
Hong Kook Kim - Gwangju Institute of Science and Technology, Korea
Lin-Shan Lee - National Taiwan University, Taiwan
Mike McTear - University of Ulster, UK
Mikio Nakano - Honda Research Institute, Japan
Elmar Nöth - University of Erlangen, Germany
Norbert Reithinger - DFKI, Berlin, Germany
Gabriel Skantze - KTH, Stockholm, Sweden
Alexander Schmitt - Ulm University, Germany
Kazuya Takeda - Graduate School of Information Science, Nagoya University, Japan
Hsin-min Wang - Academia Sinica, Taiwan

Steering Committee
Gary Geunbae Lee - POSTECH, Pohang, Korea
Joseph Mariani - LIMSI-CNRS and IMMI, Orsay, France
Wolfgang Minker - Ulm University, Germany
Satoshi Nakamura - NICT, Kyoto, Japan
Contents

Part I  Keynote Talks

Looking at the Interaction Management with New Eyes - Conversational Synchrony and Cooperation using Eye Gaze . . . . 3
Kristiina Jokinen

Interacting with Purpose (and Feeling!): What Neuropsychology and the Performing Arts Can Tell Us About 'Real' Spoken Language Behaviour . . . . 5
Roger K. Moore

Part II  Speech Recognition and Semantic Analysis

Accessing Web Resources in Different Languages Using a Multilingual Speech Dialog System . . . . 9
Hansjörg Hofmann, Andreas Eberhardt and Ute Ehrlich

New Technique for Handling ASR Errors at the Semantic Level in Spoken Dialogue Systems . . . . 17
Ramón López-Cózar, Zoraida Callejas, David Griol and José F. Quesada

Combining Slot-based Vector Space Model for Voice Book Search . . . . 31
Cheongjae Lee, Tatsuya Kawahara and Alexander Rudnicky

Preprocessing of Dysarthric Speech in Noise Based on CV-Dependent Wiener Filtering . . . . 41
Ji Hun Park, Woo Kyeong Seong and Hong Kook Kim

Conditional Random Fields for Modeling Korean Pronunciation Variation . . . . 49
Sakriani Sakti, Andrew Finch, Chiori Hori, Hideki Kashioka and Satoshi Nakamura

An Analysis of the Speech Under Stress Using the Two-Mass Vocal Fold Model . . . . 57
Xiao Yao, Takatoshi Jitsuhiro, Chiyomi Miyajima, Norihide Kitaoka and Kazuya Takeda

Domain-Adapted Word Segmentation for an Out-of-Domain Language Modeling . . . . 63
Euisok Chung, Hyung-Bae Jeon, Jeon-Gue Park and Yun-Keun Lee

Part III  Multi-Modality for Input and Output

Analysis on Effects of Text-to-Speech and Avatar Agent in Evoking Users' Spontaneous Listener's Reactions . . . . 77
Teruhisa Misu, Etsuo Mizukami, Yoshinori Shiga, Shinichi Kawamoto, Hisashi Kawai and Satoshi Nakamura

Development of a Data-driven Framework for Multimodal Interactive Systems . . . . 91
Masahiro Araki and Yuko Mizukami

Multiparty Conversation Facilitation Strategy Using Combination of Question Answering and Spontaneous Utterances . . . . 103
Yoichi Matsuyama, Yushi Xu, Akihiro Saito, Shinya Fujie and Tetsunori Kobayashi

Conversational Speech Synthesis System with Communication Situation Dependent HMMs . . . . 113
Kazuhiko Iwata and Tetsunori Kobayashi

An Event-Based Conversational System for the Nao Robot . . . . 125
Ivana Kruijff-Korbayová, Georgios Athanasopoulos, Aryel Beck, Piero Cosi, Heriberto Cuayáhuitl, Tomas Dekens, Valentin Enescu, Antoine Hiolle, Bernd Kiefer, Hichem Sahli, Marc Schröder, Giacomo Sommavilla, Fabio Tesser and Werner Verhelst

Towards Learning Human-Robot Dialogue Policies Combining Speech and Visual Beliefs . . . . 133
Heriberto Cuayáhuitl and Ivana Kruijff-Korbayová

Part IV  User Modelling

JAM: Java-based Associative Memory . . . . 143
Robert Pröpper, Felix Putze and Tanja Schultz

Conversation Peculiarities of People with Different Verbal Intelligence . . . . 157
Kseniya Zablotskay, Umair Rahim, Sergey Zablotskiy, Steffen Walter and Wolfgang Minker

Merging Intention and Emotion to Develop Adaptive Dialogue Systems . . . . 165
Zoraida Callejas, David Griol, Ramón López-Cózar, Gonzalo Espejo and Nieves Ábalos

All Users Are (Not) Equal - The Influence of User Characteristics on Perceived Quality, Modality Choice and Performance . . . . 175
Ina Wechsung, Matthias Schulz, Klaus-Peter Engelbrecht, Julia Niemann and Sebastian Möller

Part V  Dialogue Management

Parallel Computing and Practical Constraints when applying the Standard POMDP Belief Update Formalism to Spoken Dialogue Management . . . . 189
Paul A. Crook, Brieuc Roblin, Hans-Wolfgang Loidl and Oliver Lemon

Ranking Dialog Acts using Discourse Coherence Indicator for Language Tutoring Dialog Systems . . . . 203
Hyungjong Noh, Sungjin Lee, Kyungduk Kim, Kyusong Lee and Gary Geunbae Lee

On-line detection of task incompletion for spoken dialog systems using utterance and behavior tag N-gram vectors . . . . 215
Sunao Hara, Norihide Kitaoka and Kazuya Takeda

Integration of Statistical Dialog Management Techniques to Implement Commercial Dialog Systems . . . . 227
David Griol, Zoraida Callejas and Ramón López-Cózar

A Theoretical Framework for a User-Centered Spoken Dialog Manager . . . . 241
Stefan Ultes, Tobias Heinroth, Alexander Schmitt and Wolfgang Minker

Using probabilistic logic for dialogue strategy selection . . . . 247
Ian O'Neill, Philip Hanna, Anbu Yue and Weiru Liu

Starting to Cook a Coaching Dialogue System in the Olympus framework . . . . 255
Joana Paulo Pardal and Nuno J. Mamede

Part VI  Evaluation Strategies and Paradigms

Performance of an Ad-hoc User Simulation in a Formative Evaluation of a Spoken Dialog System . . . . 271
Klaus-Peter Engelbrecht, Stefan Schmidt and Sebastian Möller

Adapting Dialogue to User Emotion - A Wizard-of-Oz study for adaptation strategies . . . . 285
Gregor Bertrand, Florian Nothdurft, Wolfgang Minker, Harald Traue and Steffen Walter

SpeechEval: A Domain-Independent User Simulation Platform for Spoken Dialog System Evaluation . . . . 295
Tatjana Scheffler, Roland Roller and Norbert Reithinger

Evaluating User-System Interactional Chains for Naturalness-oriented Spoken Dialogue Systems . . . . 301
Etsuo Mizukami and Hideki Kashioka

Evaluation of Spoken Dialogue System that uses Utterance Timing to Interpret User Utterances . . . . 315
Kazunori Komatani, Kyoko Matsuyama, Ryu Takeda, Tetsuya Ogata and Hiroshi G. Okuno

How context determines perceived quality and modality choice. Secondary task paradigm applied to the evaluation of multimodal interfaces . . . . 327
Ina Wechsung, Robert Schleicher and Sebastian Möller

Part VII  Prototypes and Products

Design and Implementation of a Toolkit for Evaluation of Spoken Dialogue Systems Designed for AmI Environments . . . . 343
Nieves Ábalos, Gonzalo Espejo, Ramón López-Cózar, Zoraida Callejas and David Griol

A Dialogue System for Conversational NPCs . . . . 357
Tina Klüwer, Peter Adolphs, Feiyu Xu and Hans Uszkoreit

Embedded Conversational Engine for Natural Language Interaction in Spanish . . . . 365
Marcos Santos-Pérez, Eva González-Parada and José Manuel Cano-García

Adding Speech to a Robotics Simulator . . . . 375
Graham Wilcock and Kristiina Jokinen

Index . . . . 381
List of Contributors
´ Nieves Abalos Dept. of Languages and Computer Systems, CITIC-UGR, University of Granada, Spain, e-mail:
[email protected] Peter Adolphs DFKI, Berlin, Germany, e-mail:
[email protected] Masahiro Araki Kyoto Institute of Technology, Kyoto, Japan, e-mail:
[email protected] Georgios Athanasopoulos IBBT, Vrije Universiteit Brussel, Dept. ETRO-DSSP, Belgium, e-mail:
[email protected] Aryel Beck School of Computer Science, University of Hertfordshire, UK, e-mail:
[email protected] Gregor Bertrand Institute of Information Technology, University of Ulm, Germany, e-mail:
[email protected] Zoraida Callejas Dept. of Languages and Computer Systems, CITIC-UGR, University of Granada, Spain, e-mail:
[email protected] Jos´e Manuel Cano-Garc´ıa Electronic Technology Department, University of M´alaga, Spain, e-mail:
[email protected] Euisok Chung Speech Processing Team, ETRI, Daejeon, Korea, e-mail:
[email protected] xxi
xxii
List of Contributors
Piero Cosi Istituto di Scienze e Tecnologie della Cognizione, ISTC, C.N.R., Italy, e-mail:
[email protected] Paul A. Crook School of Mathematical and Computer Sciences, Heriot-Watt University, Edinburgh, UK, e-mail:
[email protected] Heriberto Cuay´ahuitl DFKI GmbH, Language Technology Lab, Saarbr¨ucken, Germany, e-mail:
[email protected] Tomas Dekens IBBT, Vrije Universiteit Brussel, Dept. ETRO-DSSP, Belgium, e-mail:
[email protected] Andreas Eberhardt BitTwister IT GmbH, Senden, Germany, e-mail:
[email protected] Ute Ehrlich Daimler AG, Ulm, Germany, e-mail:
[email protected] Valentin Enescu IBBT, Vrije Universiteit Brussel, Dept. ETRO-DSSP, Belgium, e-mail:
[email protected] Klaus-Peter Engelbrecht Quality and Usability Lab, Deutsche Telekom Laboratories, TU Berlin, Germany, e-mail:
[email protected] Gonzalo Espejo Dept. of Languages and Computer Systems, CITIC-UGR, University of Granada, Spain, e-mail:
[email protected] Andrew Finch National Institute of Information and Communications Technology (NICT), Kyoto, Japan, e-mail:
[email protected] Shinya Fujie Waseda Institute of Advanced Study, Japan, e-mail:
[email protected] Eva Gonz´alez-Parada Electronic Technology Department, University of M´alaga, Spain, e-mail:
[email protected] David Griol Dept. of Computer Science, Carlos III University of Madrid, Spain, e-mail:
[email protected] List of Contributors
xxiii
Philip Hanna School of Electronics, Electrical Engineering and Computer Science, Queen’s University Belfast, Northern Ireland, e-mail:
[email protected] Sunao Hara Graduate School of Information Science, Nagoya University, Aichi, Japan, e-mail:
[email protected] Tobias Heinroth Institute of Information Technology, University of Ulm, Germany, e-mail:
[email protected] Antoine Hiolle School of Computer Science, University of Hertfordshire, UK, e-mail:
[email protected] Hansj¨org Hofmann Daimler AG, Ulm, Germany, e-mail:
[email protected] Chiori Hori National Institute of Information and Communications Technology (NICT), Kyoto, Japan, e-mail:
[email protected] Kazuhiko Iwata Waseda University, Tokyo, Japan, e-mail:
[email protected] Hyung-Bae Jeon Speech Processing Team, ETRI, Daejeon, Korea, e-mail:
[email protected] Takatoshi Jitsuhiro Department of Media Informatics, Aichi University of Technology, Gamagori Japan e-mail:
[email protected] Kristiina Jokinen University of Helsinki, Finland, e-mail:
[email protected] Hideki Kashioka National Institute of Information and Communications Technology (NICT), Kyoto, Japan, e-mail:
[email protected] Tatsuya Kawahara Academic Center for Computing and Media Studies, Kyoto University, Kyoto, Japan, e-mail:
[email protected] Hisashi Kawai National Institute of Information and Communications Technology (NICT), Kyoto, Japan, e-mail:
[email protected] Shinichi Kawamoto National Institute of Information and Communications Technology (NICT), Kyoto, Japan, e-mail:
[email protected] xxiv
List of Contributors
Bernd Kiefer DFKI GmbH, Language Technology Lab, Saarbr¨ucken, Germany, e-mail:
[email protected] Hong Kook Kim School of Information and Communications, Gwangju Institute of Science and Technology, Gwangju 500-712, Korea, e-mail:
[email protected] Kyungduk Kim POSTECH, San 31, Hyoja-Dong, Pohang, 790-784, South Korea, e-mail:
[email protected] Norihide Kitaoka Graduate School of Information Science, Nagoya University, Aichi, Japan, e-mail:
[email protected] Tina Kl¨uwer DFKI, Berlin, Germany, e-mail:
[email protected] Tetsunori Kobayashi Department of Computer Science and Engineering, Waseda University, Japan, e-mail:
[email protected] Kazunori Komatani Graduate School of Engineering, Nagoya University, Japan, e-mail:
[email protected] Ivana Kruijff-Korbayov´a DFKI GmbH, Language Technology Lab, Saarbr¨ucken, Germany, e-mail:
[email protected] Cheongjae Lee Academic Center for Computing and Media Studies, Kyoto University, Kyoto, Japan, e-mail:
[email protected] Gary Geunbae Lee POSTECH, San 31, Hyoja-Dong, Pohang, 790-784, South Korea, e-mail:
[email protected] Kyusong Lee POSTECH, San 31, Hyoja-Dong, Pohang, 790-784, South Korea, e-mail:
[email protected] Sungjin Lee POSTECH, San 31, Hyoja-Dong, Pohang, 790-784, South Korea, e-mail:
[email protected] Yun-Keun Lee Speech Processing Team, ETRI, Daejeon, Korea, e-mail:
[email protected] List of Contributors
xxv
Oliver Lemon School of Mathematical and Computer Sciences, Heriot-Watt University, Edinburgh, UK, e-mail:
[email protected] Weiru Liu School of Electronics, Electrical Engineering and Computer Science, Queen’s University Belfast, Northern Ireland, e-mail:
[email protected] Hans-Wolfgang Loidl School of Mathematical and Computer Sciences, Heriot-Watt University, Edinburgh, UK, e-mail:
[email protected] Ram´on L´opez-C´ozar Dept. of Languages and Computer Systems, CITIC-UGR, University of Granada, Spain, e-mail:
[email protected] Nuno J. Mamede Spoken Language Systems Laboratory, L2 F – INESC-ID and IST, Technical University of Lisbon, Portugal, e-mail:
[email protected] Kyoko Matsuyama Graduate School of Informatics, Kyoto University, Japan, e-mail:
[email protected] Yoichi Matsuyama Department of Computer Science and Engineering, Waseda University, Japan, e-mail:
[email protected] Wolfgang Minker Institute of Information Technology, University of Ulm, Germany, e-mail:
[email protected] Teruhisa Misu National Institute of Information and Communications Technology (NICT), Kyoto, Japan, e-mail:
[email protected] Chiyomi Miyajima Graduate School of Information Science, Nagoya University, Aichi, Japan, e-mail:
[email protected] Etsuo Mizukami National Institute of Information and Communications Technology (NICT), Kyoto, Japan, e-mail:
[email protected] Yuko Mizukami Kyoto Institute of Technology, Kyoto, Japan, e-mail:
[email protected] Sebastian M¨oller Quality and Usability Lab, Deutsche Telekom Laboratories, TU Berlin, Germany, e-mail:
[email protected] xxvi
List of Contributors
Roger K. Moore Department of Computer Science, University of Sheffield, UK, e-mail:
[email protected] Satoshi Nakamura Nara Institute of Science and Technology (NAIST), Nara, Japan, e-mail:
[email protected] Julia Niemann Quality and Usability Lab, Deutsche Telekom Laboratories, TU Berlin, Germany, e-mail:
[email protected] Hyungjong Noh POSTECH, San 31, Hyoja-Dong, Pohang, 790-784, South Korea, e-mail:
[email protected] Florian Nothdurft Institute of Information Technology, University of Ulm, Germany, e-mail:
[email protected] Tetsuya Ogata Graduate School of Informatics, Kyoto University, Japan, e-mail:
[email protected] Hiroshi G. Okuno Graduate School of Informatics, Kyoto University, Japan, e-mail:
[email protected] Ian O’Neill School of Electronics, Electrical Engineering and Computer Science, Queen’s University Belfast, Northern Ireland, e-mail:
[email protected] Joana Paulo Pardal Spoken Language Systems Laboratory, L2 F – INESC-ID and IST, Technical University of Lisbon, Portugal, e-mail:
[email protected] Jeon-Gue Park Speech Processing Team, ETRI, Daejeon, Korea, e-mail:
[email protected] Ji Hun Park School of Information and Communications, Gwangju Institute of Science and Technology, Gwangju 500-712, Korea, e-mail:
[email protected] Robert Pr¨opper Cognitive Systems Lab (CSL), Karlsruhe Institute of Technology (KIT), Germany, e-mail:
[email protected] Felix Putze Cognitive Systems Lab (CSL), Karlsruhe Institute of Technology (KIT), Germany, e-mail:
[email protected] List of Contributors
xxvii
Jos´e F. Quesada Dept. of Computer Science and Artificial Intelligence, University of Seville, Spain, e-mail:
[email protected] Umair Rahim Institute of Information Technology, University of Ulm, Germany, e-mail:
[email protected] Norbert Reithinger Deutsches Forschungszentrum für Künstliche Intelligenz, Projektbüro Berlin, Germany, e-mail:
[email protected] Brieuc Roblin School of Mathematical and Computer Sciences, Heriot-Watt University, Edinburgh, UK, e-mail:
[email protected] Roland Roller Deutsches Forschungszentrum für Künstliche Intelligenz, Projektbüro Berlin, Germany, e-mail:
[email protected] Alexander Rudnicky Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, USA, e-mail:
[email protected] Hichem Sahli IBBT, Vrije Universiteit Brussel, Dept. ETRO-DSSP, Belgium, e-mail:
[email protected] Akihiro Saito Department of Computer Science and Engineering, Waseda University, Japan, e-mail:
[email protected] Sakriani Sakti Nara Institute of Science and Technology (NAIST), Nara, Japan, e-mail:
[email protected] Marcos Santos-Pérez Electronic Technology Department, University of Málaga, Spain, e-mail:
[email protected] Tatjana Scheffler Deutsches Forschungszentrum für Künstliche Intelligenz, Projektbüro Berlin, Germany, e-mail:
[email protected] Robert Schleicher Quality and Usability Lab, Deutsche Telekom Laboratories, TU Berlin, Germany, e-mail:
[email protected] Stefan Schmidt Quality and Usability Lab, Deutsche Telekom Laboratories, TU-Berlin, Germany e-mail:
[email protected] xxviii
List of Contributors
Alexander Schmitt Institute of Information Technology, University of Ulm, Germany, e-mail:
[email protected] Marc Schröder DFKI GmbH, Language Technology Lab, Saarbrücken, Germany, e-mail:
[email protected] Tanja Schultz Cognitive Systems Lab (CSL), Karlsruhe Institute of Technology (KIT), Germany, e-mail:
[email protected] Matthias Schulz Quality and Usability Lab, Deutsche Telekom Laboratories, TU Berlin, Germany, e-mail:
[email protected] Woo Kyeong Seong School of Information and Communications, Gwangju Institute of Science and Technology, Gwangju 500-712, Korea, e-mail:
[email protected] Yoshinori Shiga National Institute of Information and Communications Technology (NICT), Kyoto, Japan, e-mail:
[email protected] Giacomo Sommavilla Istituto di Scienze e Tecnologie della Cognizione, ISTC, C.N.R., Italy, e-mail:
[email protected] Kazuya Takeda Graduate School of Information Science, Nagoya University, Aichi, Japan, e-mail:
[email protected] Ryu Takeda Graduate School of Informatics, Kyoto University, Japan, e-mail:
[email protected] Fabio Tesser Istituto di Scienze e Tecnologie della Cognizione, ISTC, C.N.R., Italy, e-mail:
[email protected] Harald C. Traue Medical Psychology, University Clinic for Psychosomatic Medicine and Psychotherapy, Ulm University, Germany, e-mail:
[email protected] Stefan Ultes Institute of Information Technology, University of Ulm, Germany, e-mail:
[email protected] Hans Uszkoreit DFKI, Berlin, Germany, e-mail:
[email protected] List of Contributors
xxix
Werner Verhelst IBBT, Vrije Universiteit Brussel, Dept. ETRO-DSSP, Belgium, e-mail:
[email protected] Steffen Walter Medical Psychology, University Clinic for Psychosomatic Medicine and Psychotherapy, Ulm University, Germany, e-mail:
[email protected] Ina Wechsung Quality and Usability Lab, Deutsche Telekom Laboratories, TU Berlin, Germany, e-mail:
[email protected] Graham Wilcock University of Helsinki, Finland, e-mail:
[email protected] Feiyu Xu DFKI, Berlin, Germany, e-mail:
[email protected] Yushi Xu MIT Computer Science and Artificial Intelligence Laboratory, USA, e-mail:
[email protected] Xiao Yao Graduate School of Information Science, Nagoya University, Aichi, Japan, e-mail:
[email protected] Anbu Yue School of Electronics, Electrical Engineering and Computer Science, Queen’s University Belfast, Northern Ireland, e-mail:
[email protected] Kseniya Zablotskaya Institute of Information Technology, University of Ulm, Germany, e-mail:
[email protected] Sergey Zablotskiy Institute of Information Technology, University of Ulm, Germany, e-mail:
[email protected]
Acronyms
AFE Audio Front End
AIML Artificial Intelligence Mark-up Language
ALICE Artificial Linguistic Internet Computer Entity
AmI Ambient Intelligence
ANOVA Simple One-Way Analysis of Variance
AODE Averaged One-Dependence Estimators
AOST Automatic Orthographic Transcriber
ASR Automatic Speech Recognition
AST Automatic Semantic Transcriber
AVP Attribute-Value-Pair
BAS Behavioral Approach System
BC Back Channel
BFI-S Big Five Inventory-Short
BIS Behavioral Inhibition System
CCG Combinatory Categorial Grammar
CER Concept Error Rate
CM Correction Model
CRF Conditional Random Field
CV Consonant-Vowel
CVC Consonant-Vowel-Consonant
DAMSL Dialog Act Markup in Several Layers
DCI Discourse Coherency Indicator
DD Decision-Directed
DM Dialogue Manager
DR Dialogue Register
EMG Electromyography
ERQ Emotion Regulation Questionnaire
ERR Error Reduction Rate
FLOPS Floating Point Operations Per Second
FOD First-Order Difference
FSM Finite State Machine
GPGPU General Purpose Graphics Processing Unit
GRU Gesture Recognition and Understanding
GUI Graphical User Interface
HAWIE Hamburg Wechsler Intelligence Test for Adults
HCI Human-Computer Interaction
HDD Hard Disk Drive
HIS Hidden Information State
HIT Human Intelligence Task
HMM Hidden Markov Model
HSV Hue Lightness Saturation Model
HVSM Hybrid Vector Space Model
IBBT Interdisciplinary institute for BroadBand Technology
ICT Information and Communication Technology
IE Information Extraction
IPA Intelligent Procedure Assistant
IPTV Internet-based Television
IPU Inter-Pausal Unit
IR Implicit Recovery
IRS Information Retrieval Systems
ISS International Space Station
ISTC Institute of Cognitive Sciences and Technologies
ITSCJ Information Technology Standards Commission of Japan
ITU International Telecommunication Union
ITU-T ITU Telecommunication Standardization Sector
IVR Interactive Voice Response
JSGF Java Speech Grammar Format
KMS Known-word Mono-Syllable
KWA KeyWord Accuracy
LARRI Language-based Agent for Retrieval of Repair Information
LIWC Linguistic Inquiry and Word Count
LP Linear Prediction
LRT Likelihood Ratio Test
LVCSR Large Vocabulary Continuous Speech Recognition
MARY Modular Architecture for Research on speech sYnthesis
MC Motor Control
MDL Minimum Description Length
MDP Markov Decision Process
MEP Maximum Entropy Probability
MFCC Mel-Frequency Cepstral Coefficients
MILM Multimodal Interaction Markup Language
MMG Multi Motive Grid
MMI MultiModal Interaction
MOS Mean Opinion Score
MRA Multiple Regression Analysis
MRDA Meeting Recorder Dialog Act
MRR Mean Reciprocal Rank
MRT Multiple Resource Theory
MSE Mean Square Error
MT Machine Translation
MTurk Amazon Mechanical Turk
MVC Model View Controller
MVSM Multiple Vector Space Model
NLG Natural Language Generation
NLU Natural Language Understanding
NPC Non Player Character
NVBP Non-verbal Behavior Planning
ODB Object-oriented DataBase
ODP Ontology-based Dialogue Platform
OOG Out-Of-Grammar
OOV Out-Of-Vocabulary
OWL Web Ontology Language
PDLM Prompt-Dependent Language Model
PILM Prompt-Independent Language Model
PLP Probabilistic Logic Program
POMDP Partially Observable Markov Decision Process
POS Part-Of-Speech
PSAT Probabilistic Satisfiability
PSTN Public Switched Telephone Network
RDB Relational DataBase
RDF Resource Description Framework
RL Reinforcement Learning
ROC Receiver Operating Characteristic
SAT Satisfiability
SDC Spoken Dialog Challenge
SDM Spoken Dialogue Manager
SDS Spoken Dialogue System
SEA-scale Subjektiv Erlebte Anstrengung (Subjectively Perceived Effort) Scale
SEM Stochastic Expectation-Maximization (algorithm)
SIMD Single Instruction, Multiple Data
SLDS Spoken Language Dialogue System
SLU Spoken Language Understanding
SNR Signal-to-Noise Ratio
SR Sentence Recognition
SRGS Speech Recognition Grammar Specification
SSE Streaming SIMD Extension
SSS Successive State Splitting
STRAIGHT Speech Transformation and Representation using Adaptive Interpolation of weiGHTed spectrogram
SU Sentence Understanding
SVM Support Vector Machine
SVSM Single Vector Space Model
S.ERR Sentence Error Rate
TALK Tools for Ambient Linguistic Knowledge
TC Task Completion
TTS Text-To-Speech (synthesis)
UAH Universidad al Habla (University on the Line)
URBI Universal Robot Body Interface
UMS Unknown-word Mono-Syllable
UW Unknown Word
VAD Voice Activity Detection
VoiceXML Voice Extensible Markup Language
VOP Voice Onset Point
VSM Vector Space Model
VUB Free University of Brussels
VXML VoiceXML
WA Word Accuracy
WER Word Error Rate
WFST Weighted Finite State Transducer
WoZ Wizard-of-Oz
ZODB Zope Object-oriented DataBase
Part I
Keynote Talks
Looking at the Interaction Management with New Eyes - Conversational Synchrony and Cooperation using Eye Gaze Kristiina Jokinen
Abstract Human conversations are surprisingly fluent concerning the interlocutors' turn-taking and feedback behaviour. Many studies have shown the accurate timing of utterances and pointed out how the speakers synchronize and align their behaviour to produce smooth and efficient communication. Especially such paralinguistic aspects as gesturing, eye-gaze, and facial expressions provide important signals for interaction management: they allow coordination and control of interaction in an unobtrusive manner, besides also displaying the interlocutor's attitudes and emotional state. In the context of Interaction Technology, realistic models of interaction and synchronization are also important. The system is regarded as one of the participating agents, in particular when dealing with applications like robot companions. The key concept in such interaction strategies is linked to the notion of affordance: interaction should readily suggest to the user the appropriate ways to use the interface. The challenges for Interaction Technology thus do not deal with enabling interaction in the first place, but rather with designing systems that support rich multimodal communication possibilities and human-technology interfacing that is more conversational in style. This talk explored various prerequisites and enablers of communication, seen as cooperative activity which emerges from the speakers' capability to synchronize their intentions. We sought to address some of the main challenges related to the construction of shared knowledge and, most notably, focused on eye-gaze and discussed its use in interaction coordination: providing feedback and taking turns. We also discussed issues related to collecting and analysing eye-tracking data in natural human-human conversations, and presented preliminary experiments concerning the role of eye-gaze in interaction management. The discussion also extended towards other paralinguistic aspects of communication: in multiparty dialogues head movement and gesturing also play a crucial role in signalling the person's intention to take, hold, or yield the turn.
Kristiina Jokinen University of Helsinki, Finland, e-mail:
[email protected] R.L.-C. Delgado and T. Kobayashi (eds.), Proceedings of the Paralinguistic Information and its Integration in Spoken Dialogue Systems Workshop, DOI 10.1007/978-1-4614-1335-6_1, © Springer Science+Business Media, LLC 2011
Interacting with Purpose (and Feeling!): What Neuropsychology and the Performing Arts Can Tell Us About ’Real’ Spoken Language Behaviour Roger K. Moore
Abstract Recent years have seen considerable progress in both the technical capabilities and the market penetration of spoken language dialogue systems. Performance has clearly passed a threshold of usability which has triggered the mass deployment of effective interactive voice response systems, mostly based on the now firmly established VXML standard. In the research laboratories, next-generation spoken language dialogue systems are being investigated which employ statistical modelling techniques (such as POMDPs) to handle uncertainty and paralinguistic behaviours (such as back-channeling and emotion) to provide more 'natural' voice-based interaction between humans and artificial agents. All of these developments suggest that the field is moving in a positive direction, but to what extent is it simply accumulating a battery of successful short-term engineering solutions as opposed to developing an underlying long-term theory of vocal interaction? This talk attempted to address this issue by drawing attention to results in research fields that are quite distinct from speech technology, but which may give some useful insights into potential generic principles of human (and even animal) behaviour. In particular, inspiration was drawn from psychology, the neurosciences and even the performing arts, and a common theme emerged that focused on the need to model the drives behind communicative behaviour, their emergent consequences and the appropriate characterisation of advanced communicative agents (such as robots). It was concluded that future developments in spoken language dialogue systems stand to benefit greatly from such a transdisciplinary approach, and that fields outside of speech technology will also benefit from the empirical grounding provided by practical engineered solutions.
Roger K. Moore Department of Computer Science, University of Sheffield, UK e-mail:
[email protected] R.L.-C. Delgado and T. Kobayashi (eds.), Proceedings of the Paralinguistic Information and its Integration in Spoken Dialogue Systems Workshop, DOI 10.1007/978-1-4614-1335-6_2, © Springer Science+Business Media, LLC 2011
Part II
Speech Recognition and Semantic Analysis
Accessing Web Resources in Different Languages Using a Multilingual Speech Dialog System Hansj¨org Hofmann and Andreas Eberhardt and Ute Ehrlich
Abstract While travelling in a foreign country by car, drivers would like to retrieve instant local information. However, accessing local web sites while driving causes two problems: First, browsing the Web while driving puts the driver's safety at risk. Second, the information may only be available in the respective foreign language. We present a multilingual speech dialog system which enables the user to extract topic-related information from web resources in different languages. The system extracts and understands information from web sites in various languages by parsing HTML code against a predefined semantic net where special topics are modelled. After extraction, the content is available in a meta language representation which makes a speech interaction in different languages possible. To minimize driver distraction, an intuitive and driver-convenient generic speech dialog has been designed.
1 Introduction Opening the borders within the European Union led to a lot of transnational cross-cultural interactions and made travelling within Europe easier. While travelling in a foreign country by car, drivers would like to retrieve instant local information from the World Wide Web (e.g. beach weather at the destination, currently playing movies in the local cinema, etc.). However, accessing the Internet while driving is not possible yet, because smartphones and similar existing technologies cannot be used for safety reasons. Reports from the National Highway Traffic Safety Administration (NHTSA) for the year 2009 showed that 20 percent of injury crashes involved distracted driving [8]. While driving a vehicle, browsing the Web by using the car's head unit would distract the user and put the driver's safety at risk. Therefore, a speech-based interface which provides a driver-convenient, audible representation of the content needs to be developed. Moreover, the user might not be able to understand the content of the local web site since it is presented in a foreign language. To support the user, the speech dialog system (SDS) must be able to extract information Hansjörg Hofmann Daimler AG, Ulm, Germany, e-mail:
[email protected] Andreas Eberhardt BitTwister IT GmbH, Senden, Germany e-mail:
[email protected] Ute Ehrlich Daimler AG, Ulm, Germany e-mail:
[email protected] R.L.-C. Delgado and T. Kobayashi (eds.), Proceedings of the Paralinguistic Information and its Integration in Spoken Dialogue Systems Workshop, DOI 10.1007/978-1-4614-1335-6_3, © Springer Science+Business Media, LLC 2011
from web sites in different languages and present the content in the driver's native language. Currently, browsing the World Wide Web is only achieved by using haptic input modalities and a visual browser representation, which is not feasible in a driving environment. Attempts to access the World Wide Web by speech have been made in different ways. Speech interfaces to the Web have been introduced by Lin et al. [7] and Poon et al. [9]. The HTML Speech Input API proposed as a W3C draft [10] allows developers to extend web pages with specific speech tags to make them accessible by voice. However, performing only the navigation by speech is not applicable to the car environment, because the content representation is not appropriate. With the SmartWeb project [1], a step in the right direction was taken: The user can ask questions about the Soccer World Cup and the system provides answers retrieved from a knowledge base, consisting of, amongst others, extracted information from the official World Cup web site [3]. The mentioned approaches only support a single language or have to be developed separately for each language they support. Within the FAME project1 an intelligent room with multilingual speech access has been developed. Holzapfel [6] proposes different methods for multilingual grammar specification to design a multilingual SDS or to port existing systems to new languages. However, previous multilingual SDS approaches access content which is already available in a database, is represented in a specific (meta) language and does not undergo any changes dynamically. Multilingual speech access to web sites in different languages has not been addressed yet. In this paper, we present a multilingual SDS which enables the user to extract topic-related information from web resources available in different languages. The remainder of the paper is structured as follows: In Section 2, an overview of the proposed system architecture is given. Section 3 describes the developed web scraping algorithm which extracts and interprets content-relevant information from semi-structured web sites. Section 4 is devoted to explaining the generic speech dialog, followed by the description of the developed prototype system and finally, conclusions are drawn.
2 System Architecture The system architecture of our proposed SDS is illustrated in Figure 1. The core component is a semantic net which models special topics (like “beach weather”, “cinema”, etc.) and is defined in KL-ONE[2] because of its simplicity and sufficient modeling abilities. The speech dialog is modelled according to this ontology. For automatic speech recognition (ASR) and text-to-speech (TTS), Daimler employs software from different leading speech technology companies which support, amongst others, Europe’s main languages. Thus, the user can interact with the SDS in various languages. We assume that the user pre-configures his interaction language in a user profile, so that the system knows which language to expect. The 1
The FAME project: http://isl.ira.uka.de/fame
dialog manager has access to a database which relates locations with countries and associated web sites where the required information can be retrieved from. Depending on the location of the requested information, the associated web site is chosen and the information extraction (IE) algorithm is triggered with the corresponding language parameters. The IE component parses HTML code against the predefined semantic net. As the semantic model is language independent, content from web sites in different languages can be mapped onto the meta language.
Fig. 1 Architecture of the SDS.
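As a rough illustration of this dispatching step, the following Python sketch shows how a location lookup could select the web site and the language parameter before triggering the IE component. It is only a toy rendering of the idea; the table contents, the function names and the run_information_extraction stub are hypothetical and not part of the Daimler system.

# Hypothetical sketch of the dispatch step: given a location, pick the
# associated web site and language, then trigger information extraction (IE).
LOCATION_DB = {
    "Cannes":        {"country": "France", "language": "fr", "site": "http://www.maplage.fr"},
    "San Sebastian": {"country": "Spain",  "language": "es", "site": "http://www.eltiempo.es"},
}

def run_information_extraction(url, language, topic):
    # Placeholder: the real component parses the site's HTML against the semantic net.
    return {"url": url, "language": language, "topic": topic}

def dispatch_request(location, topic):
    """Choose the web site associated with the location and run the IE
    component with the matching language parameter."""
    entry = LOCATION_DB[location]
    return run_information_extraction(url=entry["site"],
                                      language=entry["language"],  # selects the parser grammar
                                      topic=topic)                  # selects the semantic net

print(dispatch_request("Cannes", "beach_weather"))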
Since the focus of this report is on the multilingual capabilities of this approach, only a brief description of the web scraping algorithm is given in Section 3.
3 Information Extraction from Semi-structured Web Sites The IE from web sites is performed in four steps illustrated in Figure 2. The different steps are explained briefly in the following [4].

Fig. 2 Overview of the web scraping algorithm: the web site is processed by the HTML parser into a graph, the text parser adds concept hypotheses, a graph transformation yields a standardized graph representation including the concept hypotheses, and the matching algorithm produces the extracted information matching the semantic model.
The HTML parser analyzes the web site and generates a preliminary internal graph representation. Embedded frames and referenced web pages are taken into consideration. In the second step, the text parser analyzes the textual content. By applying topic-oriented grammar definitions, the text parser generates initial concept hypotheses of the semantic model for each textual content. The text parser is responsible for supporting various languages and therefore, a separate parser grammar for each language has to be defined. The required grammar is loaded according to the forwarded language parameters. Each parser grammar contains the rules for mapping the content presented in the respective language onto the concepts of the semantic model. In Figure 3, excerpts from web sites in different languages and language-specific grammar rules are illustrated. The rule "$water_temp_descr" maps the corresponding content onto the semantic concept "water_temperature_descr". Since the following matching algorithm is computationally intensive, the current cyclic graph needs to be transformed into a simple and standardized internal representation to accelerate the matching process.
public $water_temp_descr= „Température de l'eau:“ {out=new water_temperature_descr;};

public $water_temp_descr= „En el agua:“ {out=new water_temperature_descr;};

Fig. 3 Language-specific grammar rules for mapping content of web sites in different languages onto semantic concepts.
Finally, the matching algorithm maps the graph structure onto the concepts of the predefined semantic model. The result of the matching algorithm is a set of instances containing the most probable concept instance hypotheses. The value of the instances is the extracted information from the web site. The result is available in the structure of the semantic model and therefore, can be presented by a speech dialog which has been modelled explicitly for the respective topic.
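To make the role of the language-specific rules more concrete, the following Python sketch mimics, in a very reduced form, how phrases such as those in Fig. 3 can be mapped onto language-independent concepts of the "beach weather" net. The regular expressions and concept names are illustrative stand-ins, not the actual parser grammars.

import re

# Simplified stand-ins for the language-specific grammar rules of Fig. 3: each
# pattern maps a phrase of the source language onto a language-independent
# concept of the semantic model.
RULES = {
    "fr": [(re.compile(r"Température de l'eau\s*:\s*(\d+)"), "water_temperature")],
    "es": [(re.compile(r"En el agua\s*:\s*(\d+)"), "water_temperature")],
}

def parse_text(text, language):
    """Return concept hypotheses (concept -> value) for one text node."""
    hypotheses = {}
    for pattern, concept in RULES.get(language, []):
        match = pattern.search(text)
        if match:
            hypotheses[concept] = int(match.group(1))
    return hypotheses

# The same language-independent concept is instantiated from different languages:
print(parse_text("Température de l'eau: 17", "fr"))  # {'water_temperature': 17}
print(parse_text("En el agua: 20", "es"))            # {'water_temperature': 20}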
4 Generic Speech Dialog The generic speech dialog is adapted to the ontology and modelled for each topic explicitly. To reduce driver distraction, the dialog has to be designed in an intuitive and natural manner. By keyphrase spotting, the user's input is understood and mapped onto the corresponding concept in the predefined ontology. The Daimler SDS dialog control follows a user-adaptive dialog strategy without having a globally fixed "plan" of the dialog flow. Here, the dialog is modelled as a hierarchy of sub-tasks including roles which can trigger a system reaction if the corresponding user input is given. Thus, the dialog becomes very flexible and adapts to the user's input [5]. Since XML is well structured and straightforward to parse, a particular XML specification is used to model the dialogs and to define the integration of semantic contents into the dialog specification. For each supported user language the dialog has to be specified in a natural manner. Since every language has its own peculiarities, a generic word-by-word substitution is not applicable. Therefore, the keyphrases have to be designed for every language separately.
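A minimal sketch of such per-language keyphrase spotting is given below; the phrase tables are invented for illustration only and do not reproduce the system's XML dialog specification.

# Toy keyphrase spotting: the user's utterance is mapped onto a concept of the
# ontology via language-specific keyphrases (illustrative entries only).
KEYPHRASES = {
    "en": {"water temperature": "water_temperature",
           "how warm is the water": "water_temperature"},
    "de": {"wassertemperatur": "water_temperature",
           "wie warm ist das wasser": "water_temperature"},
}

def spot_concepts(utterance, language):
    utterance = utterance.lower()
    return {concept for phrase, concept in KEYPHRASES[language].items()
            if phrase in utterance}

print(spot_concepts("What's the water temperature today?", "en"))  # {'water_temperature'}
print(spot_concepts("Wie warm ist das Wasser heute?", "de"))       # {'water_temperature'}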
5 System Prototype A first prototype has been developed which is described in the following. For the topic "beach weather" a semantic net has been modelled, the text parser grammars for French, German, Spanish and Portuguese have been defined, and speech dialogs for each language have been specified. In Figure 4, two local weather web sites and corresponding sample dialogs for the user languages German and English are presented.
As illustrated in Figure 4, each weather web site has its own layout. However, the IE algorithm is able to extract information from unknown web page structures. Due to the availability of the different parser grammars, content in different languages can be extracted. Although the users ask about the same semantic concept in different languages and in a natural manner, the system understands each request, since the keyphrases are modelled separately for each language.
U: What's the water temperature today? S: At which location? U: In Cannes in France. S: The water temperature is 17 °C.
U: Wie warm ist das Wasser heute? S: An welchem Ort? U: Am Strand von San Sebastian in Spanien. S: Die Wassertemperatur ist 20 °C.
Fig. 4 Screenshots of a French and Spanish web site and sample dialogs in English and German.
To prove the multilingual IE capabilities, the web scraping algorithm has been evaluated. We defined a beach weather standard format which consists of the following required data:
• Forecast for two days: Today and tomorrow
• Weather description ("sunny", "cloudy", "rainy", etc.) per day
• Air and water temperature value per day
• Wind strength per day
• UV index per day
Some web sites provide more detailed information than required for the weather standard (e.g. air temperature values for each time of day), which is also taken into account in the semantic model. Such data is extracted, but does not count towards the evaluation result. For each language a local web site has been used to evaluate the performance of the web scraping algorithm.

Table 1 Results of the evaluation.
Web Site                       Language     Available data w.r.t. the standard weather definition    Extracted data w.r.t. available data
1 www.maplage.fr               French       70%                                                       71%
2 www.wetter.at                German       70%                                                       100%
3 www.eltiempo.es              Spanish      100%                                                      100%
4 meteo.turismodoalgarve.pt    Portuguese   80%                                                       100%
The coverage of the extracted data w.r.t. the standard weather definition has been computed and is shown in Table 1. As can be seen from the table, not every web site provides all the required data w.r.t. the standard weather definition. For three out of four web sites the algorithm succeeds in extracting all the available information (93% on average). It often occurs that web sites provide images without alternative texts for the weather description (here: web site no. 1), whose contents cannot be extracted with the current implementation of the IE algorithm. If a web site does not provide all the necessary data, or if the algorithm fails to extract all the information, the missing content could be merged with data extracted from other topic-related web sites.
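The two percentages of Table 1 can be reproduced with a few lines of arithmetic. The sketch below assumes, purely for illustration, that the standard format amounts to ten fields (five required items for each of the two forecast days); the exact field count is an assumption, not stated above.

# Illustrative computation of the two coverage figures reported in Table 1.
STANDARD_FIELDS = 10        # assumption: 5 required items x 2 forecast days

def coverage(available_fields, extracted_fields):
    available_pct = 100.0 * available_fields / STANDARD_FIELDS
    extracted_pct = 100.0 * extracted_fields / available_fields if available_fields else 0.0
    return round(available_pct), round(extracted_pct)

# A site exposing 7 of the 10 standard fields, 5 of which were extracted,
# yields roughly the first row of Table 1 (70% available, ~71% extracted):
print(coverage(available_fields=7, extracted_fields=5))   # (70, 71)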
6 Conclusions We presented a multilingual SDS which enables the user to extract topic related information from web resources available in different languages. The system extracts and understands information from web sites in various languages by parsing HTML code against a predefined semantic net where special topics are modelled. After extraction the content is available in a meta language representation which makes a speech interaction in different languages possible. A generic and intuitive speech dialog has been designed which is adapted to the semantic net and accessible in various languages. A preliminary prototype system for the topic “beach weather” has been developed and evaluated. The results are promising and open the possibility for future research and improvements of the SDS. In future, we will improve the web scraping algorithm, extend the evaluation and increase the number of supported languages.
References
[1] Ankolekar, A., Buitelaar, P., Cimiano, P., Hitzler, P., Kiesel, M., Krötzsch, M., Lewen, H., Neumann, G., Sintek, M., Tserendorj, T., Studer, R.: Smartweb: Mobile access to the semantic web. In: Proceedings of ISWC - Demo Session (2006)
[2] Brachman, R.J., Schmolze, J.G.: An overview of the KL-ONE knowledge representation system. Cognitive Science 9, 171–216 (1985)
[3] Buitelaar, P., Cimiano, P., Frank, A., Racioppa, S.: SOBA: Smartweb ontology-based annotation. In: Proceedings of ISWC - Demo Session (2006)
[4] Eberhardt, A.: Konzepte für sprachbedientes Internet-Browsing im Fahrzeug. Diploma thesis, University of Ulm (2009)
[5] Ehrlich, U.: Task hierarchies representing sub-dialogs in speech dialog systems. In: Proceedings Sixth European Conference on Speech Communication and Technology (1999)
[6] Holzapfel, H.: Towards development of multilingual spoken dialogue systems. In: Proceedings of Second Language and Technology Conference (2005)
[7] Lin, D., Bigin, L., Bao-Zong, Y.: Using Chinese spoken-language access to the WWW. In: Proceedings of International Conference on WCCC-ICSP, vol. 2, pp. 1321–1324 (2000)
[8] National Highway Traffic Safety Administration: Traffic safety facts - an examination of driver distraction as recorded in NHTSA databases. Tech. rep., U.S. Department of Transportation (2009)
[9] Poon, J., Nunn, C.: Browsing the web from a speech-based interface. In: Proceedings of INTERACT, pp. 302–309 (2001)
[10] Sampath, S., Bringert, B.: Speech input API specification. W3C editor's draft, W3C (2010)
New Technique for Handling ASR Errors at the Semantic Level in Spoken Dialogue Systems Ramón López-Cózar, Zoraida Callejas, David Griol and José F. Quesada
Abstract This paper proposes a new technique to develop more robust spoken dialogue systems, which aims to repair incorrect semantic representations obtained by the systems due to ASR errors. To do so, it relies on a training stage that takes into account previous system misunderstandings for each dialogue state. Experiments have been carried out employing two systems (Saplen and Viajero) previously developed in our lab, which employ a prompt-independent language model and several prompt-dependent language models for ASR. The results, obtained for a corpus of 20,000 simulated dialogues, show that the technique enhances system performance for both kinds of language model, especially for the prompt-independent language model.
1 Introduction In many cases, the performance of current spoken dialogue systems (SDSs) is strongly affected by speech recognition errors. The automatic speech recognition (ASR) technology has improved notably since the 1980s, when the first speech Ramón López-Cózar Dept. of Languages and Computer Systems, CITIC-UGR, University of Granada (Spain), e-mail:
[email protected] Zoraida Callejas Dept. of Languages and Computer Systems, CITIC-UGR, University of Granada (Spain), e-mail:
[email protected] David Griol Dept. of Computer Science, Carlos III University of Madrid (Spain), e-mail: [email protected] José F. Quesada Dept. of Computer Science and Artificial Intelligence, University of Seville (Spain), e-mail:
[email protected] R.L.-C. Delgado and T. Kobayashi (eds.), Proceedings of the Paralinguistic Information and its Integration in Spoken Dialogue Systems Workshop, DOI 10.1007/978-1-4614-1335-6_4, © Springer Science+Business Media, LLC 2011
recognisers were made commercially available. However, state-of-the-art speech recognisers are very sensitive to a variety of factors, such as acoustic, linguistic and speaker variability, co-articulation effects, vocabulary size, out-of-vocabulary words, background noise and speech disfluencies, among others [1] [2]. As a result, recognised sentences are not always correct and, as these comprise the input for the spoken language understanding (SLU) component of the systems, the semantic interpretations obtained by the latter are sometimes incorrect. To try to avoid this problem, robust ASR, speech understanding and dialogue management techniques must be employed. The work that we present in this paper focuses on speech understanding. The goal is to enable the development of more robust SDSs by taking into account knowledge about previous system misunderstandings, learnt in a training stage, to correct potentially incorrect semantic representations obtained by the systems due to ASR errors. This knowledge is used to inspect the semantic representation obtained at each dialogue state. If this representation is assumed to be incorrect, the technique replaces it with the one that is assumed to be correct given the state.
2 Related Work Many studies can be found in the literature addressing how to build more robust SDSs. Some of them work only at the ASR level, with the goal of avoiding ASR errors [7] [8] [17]. Others consider not only the ASR level but the semantic level. For example, [3] used word confusion networks, where the network arcs had associated weights computed from acoustic and language models. Using these networks in analysis stages after ASR, the authors achieved much better results than from 1-best recognition. Following the opposite direction, [12] employed semantic knowledge to guide the speech recogniser towards the most likely sentences. To do this the authors proposed a stochastic model that estimated the joint probability of a sentence together with its semantic annotation. Employing a different approach, [27] used language models not interpolated with word n-grams in order to retain good word accuracy, but optimised for speech understanding. This resulted in a decrease in word accuracy but a remarkable increase in understanding accuracy. This effect should not be considered a disadvantage as obtaining a high rate of correct semantic interpretations is a key point for robust SDSs. A number of studies employ confidence scores to measure the reliability of each word in the recognised sentence [5] [15] [16]. These scores can be computed at the phonetic, word or sentence levels, and help determine whether the words are correct. A score lower than a threshold suggests that the recognised word is incorrect, in which case the system’s dialogue manager will normally either reject it or generate a confirmation prompt. A problem is that these scores are not fully reliable because some incorrectly recognised words may have scores greater than the confidence threshold, while some correctly recognised words may have scores below the
threshold [26] [6] [11]. Therefore, the policy employed to detect ASR errors based on confidence scores might fail. Other authors have focused on dialogue management in building robust SDSs. Their goal has been to allow flexibility in user interaction in an attempt to handle unexpected contributions and to interpret these correctly within the dialogue context. For example, [10] proposed a multithreaded dialogue in which threads related to concurrent and collaborative tasks allow the user to initiate, extend or correct threads at any time during the dialogue. [13] proposed adaptable and adaptive transparency strategies to help users respond appropriately to system errors; these strategies were applied to a system that provided plane and train schedules. [20] [22] applied Markov Decision Processes (MDPs) and Reinforcement Learning (RL) to adapt dialogue strategies to model dialogue problems. [24] examined the application of both techniques to the learning of optimal dialogue strategies in the presence of ASR errors, and proposed a method that dramatically reduced the amount of training data required. [4] used RL to automatically optimise the dialogue management policy used in the NJFun dialogue system. Finally, [25] [13] proposed an approach based on Partially Observable Markov Decision Processes (POMDPs) to generate dialogue strategies, observing that as ASR performance degrades, a POMDP-based dialogue manager makes fewer mistakes than an MDP-based manager.
3 The Proposed Technique Fig. 1 shows the architecture of the robust speech understanding module that implements the technique that we propose in this paper. This module is comprised of: i) the SLU module of an SDS, which uses frames to represent the semantics of the recognised sentence, and ii) a new module that carries out frame corrections. The latter uses the current system prompt and what we call a correction model (CM) to decide how to correct misunderstandings caused by ASR errors. The correction is carried out by replacing an incorrect frame generated by the system's SLU module with another frame that is assumed to be correct in the dialogue state.
Fig. 1 Robust speech understanding module.
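The runtime behaviour of the frame correction module can be pictured as a simple lookup: the correction model associates a system prompt type and an obtained frame with the frame learnt as correct for that dialogue state, and a hit triggers the replacement. The following fragment is only a schematic Python sketch; frames are reduced to strings and the example entries are invented.

# Sketch of the frame correction step: if the correction model records that the
# obtained frame is a known misunderstanding for the current prompt type, it is
# replaced by the frame learnt as correct for that dialogue state.
def correct_frame(prompt_type, f_obtained, correction_model):
    """correction_model: {(prompt_type, obtained_frame): reference_frame}."""
    return correction_model.get((prompt_type, f_obtained), f_obtained)

cm = {("PostCode?", "TelephoneNumber:958111222"): "PostCode:18001"}
print(correct_frame("PostCode?", "TelephoneNumber:958111222", cm))  # replaced
print(correct_frame("PostCode?", "PostCode:18002", cm))             # left unchanged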
The output of the frame correction module is a frame, possibly corrected, that is the input to the dialogue manager of the system. Hence, some ASR errors may be unnoticed by the user provided that the frame that finally makes up the input to the dialogue manager has been corrected. One advantage of this technique is that it is independent of the task performed by the system. Hence, it can be easily applied to systems designed for any application domain, for example, fast food ordering (Saplen) [21] and bus travel booking (Viajero) [23], both used in the experiments presented in this paper. To set up the technique for a given dialogue system, we must create the appropriate correction model, which is done by carrying out two tasks: creation of an initial correction model, and optimisation of this model, as will be discussed in the following sections.
3.1 Creation of an initial correction model The initial correction model can be created either from the interaction between the system and real users, or from the interaction between the system and a user simulator. The latter is the approach followed in our experiments. The model stores incorrect frames generated by the SLU module of the system as it processes user utterances. It is comprised of tuples of the form: (T, fR, fO), where T denotes a prompt type generated by the system, fR represents the reference frame associated with the sentence uttered to answer the prompt, and fO denotes the frame obtained by the SLU module of the system as it analyses the recognised sentence. We consider that fO is correct if it exactly matches fR, and incorrect otherwise. To create the correction model for our experiments, we used a simple procedure that takes as input the current prompt type (T), the reference frame (fR) associated with the sentence generated by the user simulator, and the frame obtained (fO) from the analysis of the recognised sentence. If fO did not match fR, the procedure included the tuple (T, fR, fO) in the correction model. All the information about the frames used in a particular application domain is stored in a model that we call Σ. This model contains information about types of frame (valid vs. invalid), types of frame slots (mandatory vs. optional), valid values for the slots, and relationships between slots.
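A schematic version of this procedure is shown below. Frames are reduced to hashable strings and the logged interactions are invented; counting duplicates is not required by the procedure itself, but it makes the compaction step described next a one-liner.

from collections import Counter

# Sketch of building the initial correction model from (simulated) dialogues:
# a tuple (prompt type, reference frame, obtained frame) is recorded whenever
# the obtained frame differs from the reference frame.
def build_initial_cm(interactions):
    """interactions: iterable of (prompt_type, f_reference, f_obtained) triples."""
    cm = Counter()
    for prompt_type, f_ref, f_obt in interactions:
        if f_obt != f_ref:                       # misunderstanding observed
            cm[(prompt_type, f_ref, f_obt)] += 1
    return cm

log = [
    ("PostCode?", "PostCode:18001", "PostCode:18001"),             # correct, ignored
    ("PostCode?", "PostCode:18001", "TelephoneNumber:958111222"),  # misunderstanding
    ("PostCode?", "PostCode:18001", "TelephoneNumber:958111222"),  # repeated tuple
]
cm = build_initial_cm(log)
compacted = set(cm)          # compaction: keep each distinct tuple only once
print(cm, compacted)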
3.2 Optimisation of the initial correction model The second task to implement the proposed technique is to optimise the initial correction model to obtain the model that the frame correction module will finally use. This task is performed by carrying out three sub-tasks: compaction, removal of inadequate tuples, and expansion.
3.2.1 Compaction The first sub-task is to compact the initial correction model as it may contain too many repeated tuples. The goal is to reduce it as much as possible to avoid any unnecessary processing delay, given that for each input sentence uttered to answer a system prompt type T , with associated reference frame fR and obtained frame fO , the frame correction module will check whether the tuple (T , fR , fO ) is in the model to decide whether to correct fO . This sub-task can be easily automated by means of a simple procedure that takes each tuple (T , fR , f O ) and looks for the same tuple in the model, removing the duplicates if found.
3.2.2 Removal of inadequate tuples The second sub-task is to analyse the tuples in the already compacted correction model to prevent the frame correction module from replacing frames incorrectly. This can be done automatically following the algorithm shown in Fig. 2, which takes as input the already compacted correction model (CM), the Σ model discussed in Sect. 3.1, and the Π model that will be explained below, producing as output an enhanced version of the correction model. The Σ and Π models must be created in advance by the system designers, applying their knowledge about the application domain and system performance. The Π model contains information about pairs of the form: (promptType, typeOfObtainedFrame), which represent expected prompt-answer pairs in the application domain. For example, as it is expected that a user utters their address when the system prompts to do so, the pair (Address?, Address) should be included in the Π model. The algorithm uses this knowledge to decide whether an obtained frame fO is in agreement with a system prompt type T, given that an obtained frame may be incorrect regarding the reference frame but correct for other input utterances. To remove inadequate tuples, the algorithm firstly analyses each tuple (T, fR, fO) in the correction model employing a function that we have called Spurious, which takes T and fO and, using the Σ model, decides whether fO is spurious. If it is spurious, the tuple is removed from the model because it might be obtained from a variety of different inputs, and thus allowing it into the model may lead to incorrect frame replacements. Secondly, the algorithm uses a function that we have called Agreement to decide whether the type of fO matches the prompt type T taking into account the Π model. If they match, the tuple is removed from the model to avoid incorrect frame replacements. The final part of the algorithm checks whether there are tuples with the same prompt type T and obtained frame fO that differ in the reference frame fR, as this would mean that the obtained frame could be corrected in several ways. If such tuples are found, they are removed from the correction model to prevent incorrect frame replacements.
Fig. 2 Algorithm to remove inadequate tuples from the correction model.
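Rendered as Python, the three filtering criteria of this algorithm might look roughly as follows. The Spurious and Agreement checks are reduced to a callback and a set of expected (prompt type, frame type) pairs standing in for the Σ and Π models, and the frame encoding "Type:value" is an invented simplification, not the system's actual frame format.

from collections import defaultdict

def frame_type(frame):                    # toy frame encoding "Type:value"
    return frame.split(":", 1)[0]

def remove_inadequate_tuples(cm, is_spurious, expected_pairs):
    """Filter a compacted correction model (a set of (T, f_ref, f_obt) tuples).

    is_spurious(T, f_obt) -> bool : stands in for the Spurious check on the Sigma model
    expected_pairs                : set of (prompt type, frame type) pairs (the Pi model)
    """
    kept = set()
    for prompt_type, f_ref, f_obt in cm:
        if is_spurious(prompt_type, f_obt):                        # criterion 1
            continue
        if (prompt_type, frame_type(f_obt)) in expected_pairs:     # criterion 2
            continue
        kept.add((prompt_type, f_ref, f_obt))

    # Criterion 3: drop ambiguous corrections, i.e. the same (T, f_obt)
    # associated with more than one reference frame.
    corrections = defaultdict(set)
    for prompt_type, f_ref, f_obt in kept:
        corrections[(prompt_type, f_obt)].add(f_ref)
    return {t for t in kept if len(corrections[(t[0], t[2])]) == 1}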
3.2.3 Expansion The third sub-task is to expand the correction model, now free from inadequate tuples, to generalise the behaviour of the frame correction module so that it can deal appropriately with misunderstandings not observed in the training. For example, if the model does not have any correction for the prompt: Do you want to cancel the conversation and start again?, the goal of this sub-task is to include corrections for the possible misunderstanding of responses to this prompt, thus enabling the frame correction module to correct the errors. This sub-task is implemented according to the algorithm shown in Fig. 3, which takes as input the already available correction model and a model that we call Ψ which contains classes of prompt types. To create this model the system designers must take into account all the possible prompt types that the system can generate, and group them into classes considering as a classification criterion that all the prompt types in a class must have the same expected kind of response from the user. The algorithm firstly copies the tuples in the correction model CM to a new model CM' that is initially empty. Next, it takes into account each tuple (T, fR, fO) in CM and uses a function that we have called Class to determine the class of prompt type (K) that contains T. The algorithm then checks whether tuples of the form (T', fR, fO) with T' ≠ T are in CM', where T' represents each prompt type in K. If a tuple (T', fR, fO) is not in CM' then it is added to this model. Finally, a new correction model CM is created containing the tuples in CM'. This model is the input to the frame correction module, as shown in Fig. 1.
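The expansion step can likewise be sketched in a few lines: every correction learnt for one prompt type is copied to the other prompt types of its class in the Ψ model. The class and frame names below are invented examples in the spirit of the Saplen domain, not the actual prompt inventory.

# Sketch of the expansion step over the classes of prompt types (Psi model).
def expand_cm(cm, prompt_classes):
    """cm: set of (T, f_ref, f_obt); prompt_classes: {class name: [prompt types]}."""
    siblings_of = {t: members for members in prompt_classes.values() for t in members}
    expanded = set(cm)
    for prompt_type, f_ref, f_obt in cm:
        for sibling in siblings_of.get(prompt_type, [prompt_type]):
            expanded.add((sibling, f_ref, f_obt))
    return expanded

classes = {"postcode_prompts": ["PostCode?", "RepeatPostCode?"]}
cm = {("PostCode?", "PostCode:18001", "TelephoneNumber:958111222")}
print(expand_cm(cm, classes))
# the learnt correction is now also attached to "RepeatPostCode?"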
Fig. 3 Algorithm to expand the frame correction model.

4 Experiments The goal of the experiments was to test whether the proposed technique was useful to enhance sentence understanding (SU), task completion (TC), implicit recovery of ASR errors (IR) [12] and word accuracy (WA) of the Saplen and Viajero systems.

4.1 Utterance corpora and scenarios We created two separate utterance corpora (for training and test) for each system, ensuring that no training utterances were included in the test corpus. Both corpora contained the orthographic transcriptions of the utterances as well as their corresponding reference frames (fR). The corpus for Saplen contained around 5,500 utterances; roughly 50% of them were used for training and the remaining for testing. The corpus for Viajero contained around 5,800 utterances, which were also divided into training (50%) and test (50%). To collect user interactions with the systems, we employed a user simulation technique that we developed in a previous study [7]. For these experiments, we
designed 250 scenarios for each system. The scenario goals were reference frames selected at random from the utterance corpora used for testing. In the case of Saplen, the frames corresponded to product orders, telephone numbers, post codes and addresses, whereas for Viajero they were concerned with greetings, travel bookings, telephone numbers and queries on travel schedules, price and duration.
4.2 Language models for ASR We employed for these experiments the two kinds of language models (word bigrams) that we used in a previous study [6]. For Saplen these models were 17 prompt-dependent language models (PDLMs) and one prompt-independent language model (PILM), whereas for Viajero they were 16 PDLMs and one PILM. For each system, there was a PDLM associated with each different dialogue state, which was used to recognise the utterance provided by the user simulator to answer the system prompt generated in the state. The goal of the PILM was to recognise any kind of sentence permitted in the application domain regardless of the current system prompt. In previous experiments we observed that the performance of the systems using this kind of language model deteriorates slightly, given the broader range of sentence types and the greater vocabulary to be considered. However, this language model was appropriate for users who tended to answer system prompts with any kind of sentence within the application domain. It was interesting for us to test the proposed technique using both kinds of language model (PDLMs and PILM) because we plan to study ways to let the systems automatically select one or the other as an attempt to adapt their performance to the kind of user (more or less experienced) and the success of system-user interaction.
4.3 Results 4.3.1 Experiments with the baseline systems In these experiments, the frames received by the dialogue manager of the system were not previously corrected. Employing the user simulator to interact with each system, we generated 10 dialogues for each scenario and language model. Hence, a total of 10 x 250 x 2 = 5,000 dialogues were generated for each system. Table 1 sets out the average results obtained in terms of sentence understanding (SU), task completion (TC), word accuracy (WA) and implicit recovery of ASR errors (IR) from the analysis of these dialogues. It can be observed that the systems worked slightly better when the PDLMs were used. The reason is that for analysing each response (utterance) provided by the user simulator, the speech recogniser employed a word bigram compiled from training sentences of the appropriate type. In addition,
the vocabulary considered using the PDLMs was much smaller than when the PILM was used.

Table 1 Performance of the baseline systems (%)
System     Language models    SU       TC       WA       IR
Saplen     PDLMs              72.66    65.83    81.83    39.40
Saplen     PILM               69.71    63.11    80.65    33.38
Viajero    PDLMs              83.91    78.04    87.95    58.83
Viajero    PILM               80.25    75.21    84.19    52.58
4.3.2 Experiments with the proposed technique In these experiments, Saplen and Viajero used the robust speech understanding module shown in Fig. 1. Therefore, the frames generated by the SLU module of the systems were replaced by the frame correction module if they were considered incorrect, which was done before they were used by the dialogue manager. We created an initial correction model for each dialogue system to obtain as much knowledge as possible regarding system misunderstandings. To do so, we used the correction model created in the worst case of the experiments described in Sect. 4.3.1, which corresponded to the usage of the PILM. We called CM1 the model obtained with Saplen, which contained 119,773 tuples, and called CM2 the model obtained with Viajero, which contained 101,602 tuples. We compacted the initial correction models for the two systems to remove repeated tuples, thus obtaining a model with 164 different tuples for Saplen, and another model with 147 tuples for Viajero. We removed the inadequate tuples from the compacted models employing a simple procedure that implements the algorithm described in Fig. 2. In the case of CM1 this procedure removed 87 tuples, whereas for CM2 it removed 75 tuples. We expanded the models to generalise the behaviour of the frame correction module to prompt types not observed in the training. To do this, we classified the 43 prompt types that Saplen can generate into 17 classes, and the 52 prompt types that Viajero can generate into 15 classes. Then, we employed a simple procedure that implements the algorithm described in Fig. 3. As a result, we obtained a correction model with 359 tuples for Saplen and another model for Viajero with 320 tuples. In order to get experimental results using the robust speech understanding module with these models, we employed again the user simulator and generated another 10 dialogues for each scenario and language model, i.e. 10 x 250 x 2 = 5,000 dialogues for each dialogue system. Table 2 shows the average results obtained.
Table 2 System performance employing the proposed technique (%)
System     Language models    SU       TC       WA       IR
Saplen     PDLMs              91.03    91.28    88.90    57.73
Saplen     PILM               89.25    89.36    88.18    53.68
Viajero    PDLMs              98.21    95.90    94.55    74.30
Viajero    PILM               95.18    93.27    91.17    68.21
5 Conclusions and Future Work A comparison of Table 1 and Table 2 shows that the proposed technique improves the performance of the baseline systems. Regarding Saplen, SU increases by 18.37% absolute for the PDLMs and by 19.54% absolute for the PILM. The improvement in terms of SU is reflected in a remarkable increment in terms of TC, which is 25.45% absolute for the PDLMs and 26.25% absolute for the PILM. Regarding WA, in the case of Saplen the increment is 7.07% absolute for the PDLMs and 7.53% absolute for the PILM, whereas for Viajero it is 6.6% absolute for the PDLMs and 6.98% absolute for the PILM. Future work includes studying methods to extract more information from the initial correction model. In the current implementation the technique removes the repeated tuples in the model to make it as small as possible to avoid any processing delay. However, knowing the number of duplicates of a tuple (T, fR, fO) can be important as it can provide useful information about how often the dialogue system misunderstands a sentence uttered to answer a prompt type T. Therefore, it could be interesting to make an analysis of the tuples in the initial correction model to obtain statistical information about system misunderstandings to be considered in the process for frame replacement. Acknowledgements This research has been funded by Spanish project ASIES TIN2010-17344.
References
[1] Rabiner, L., Juang, B. H.: Fundamentals of Speech Recognition. Prentice-Hall (1993)
[2] Huang, X., Acero, A., Hon, H.: Spoken Language Processing: A Guide to Theory, Algorithm and System Development. Prentice-Hall (2001)
[3] Hakkani-Tür, D., Béchet, F., Riccardi, G., Tur, G.: Beyond ASR 1-best: Using word confusion networks in spoken language understanding. Computer Speech and Language, 20(4), 495–514 (2006)
[4] Singh, S., Litman, D., Kearns, M., Walker, M.: Optimizing dialogue management with reinforcement learning: Experiments with the NJFun system. Journal of Artificial Intelligence Research, 16, 105–133 (2000)
[5] Rose, R. C., Yao, H., Riccardi, G., Wright, J.: Integration of utterance verification with statistical language modelling and spoken language understanding. Speech Communication, 34, 321–331 (2001)
[6] Hazen, T. J., Seneff, S., Polifroni, J.: Recognition confidence scoring and its use in speech understanding systems. Computer Speech and Language, 16, 49–67 (2002)
[7] Molau, S., Keysers, D., Ney, H.: Matching training and test data distributions for robust speech recognition. Speech Communication, 169, 579–601 (2003)
[8] Weber, K., Ikbal, S., Bengio, S., Bourlard, H.: Robust speech recognition and feature extraction using HMM2. Computer Speech and Language, 17, 195–211 (2003)
[9] López-Cózar, R., De la Torre, A., Segura, J. C., Rubio, A. J., Sánchez, V.: Assessment of dialogue systems by means of a new simulation technique. Speech Communication, 40(3), 387–407 (2003)
[10] Lemon, O., Gruenstein, A.: Multithreaded context for robust conversational interfaces: Context-sensitive speech recognition and interpretation of corrective fragments. ACM Transactions on Computer-Human Interaction, 11(3), 241–267 (2004)
[11] Feng, J., Sears, A.: Using confidence scores to improve hands-free speech based navigation in continuous dictation systems. ACM Transactions on Computer-Human Interaction, 11(4), 329–356 (2004)
[12] Erdogan, H., Sarikaya, R., Chen, S. F., Gao, Y., Picheny, M.: Using semantic analysis to improve speech recognition performance. Computer Speech and Language, 19, 321–343 (2005)
[13] Karsenty, L., Botherel, V.: Transparency strategies to help users handle system errors. Speech Communication, 45, 305–324 (2005)
[14] López-Cózar, R., Callejas, Z.: Combining language models in the input interface of a spoken dialogue system. Computer Speech and Language, 20, 420–440 (2005)
[15] Jiang, H.: Confidence measures for speech recognition: A survey. Speech Communication, 45, 455–470 (2005)
[16] Higashinaka, R., Sudoh, K., Nakano, M.: Incorporating discourse features into confidence scoring of intention recognition results in spoken dialogue systems. Speech Communication, 48, 417–436 (2006)
[17] Gemello, R., Mana, F., Albesano, D., de Mori, R.: Multiple resolution analysis for robust automatic speech recognition. Computer Speech and Language, 20, 2–21 (2006)
[18] Young, S., Gasic, M., Keizer, S., Mairesse, F., Schatzmann, J., Thomson, B., Yu, K.: The hidden information state model: A practical framework for POMDP-based spoken dialogue management. Computer Speech and Language, 24(2), 150–174 (2009)
[19] Danieli, M., Gerbino, E.: Metrics for evaluating dialogue strategies in a spoken language system. In: Proc. of AAAI Spring Symposium on Empirical Methods in Discourse Interpretation and Generation, pp. 34-39 (1995)
[20] Levin, E., Pieraccini, R.: A stochastic model of computer-human interaction for learning dialog strategies. In: Proc. of Eurospeech, pp. 1883-1886 (1997)
[21] López-Cózar, R., García, P., Díaz, J., Rubio, A.: A voice activated dialogue system for fast-food restaurant applications. In: Proc. of Eurospeech, pp. 1783-1786 (1997)
[22] Levin, E., Pieraccini, R., Eckert, W.: Using Markov decision process for learning dialog strategies. In: Proc. of ICASSP, pp. 201-204 (1998)
[23] López-Cózar, R., Rubio, A., García, P., Díaz, J. E., López-Soler, J. M.: Telephone-based service for bus traveller service. In: Proc. of 1st Spanish Meeting on Speech Technologies (2000)
[24] Goddeau, D., Pineau, J.: Fast reinforcement learning of dialog strategies. In: Proc. of ICASSP, pp. 1233-1236 (2000)
[25] Roy, N., Pineau, J., Thrun, S.: Spoken dialogue management using probabilistic reasoning. In: Proc. of 38th Annual Meeting of the ACL (2000)
[26] Zhang, R., Rudnicky, A. I.: Word level confidence annotation using combination of features. In: Proc. of Eurospeech, pp. 2105-2108 (2001)
[27] Wang, Y., Acero, A., Chelba, C.: Is word error rate a good indicator for spoken language understanding accuracy. In: Proc. of ASRU, pp. 577-582 (2003)
Combining Slot-based Vector Space Model for Voice Book Search
Cheongjae Lee, Tatsuya Kawahara, and Alexander Rudnicky
Abstract We describe a hybrid vector space model approach that improves accuracy in voice search for books. We compare different vector space approaches and demonstrate that the hybrid search model, which uses a weighted sub-space model smoothed with a general model together with a back-off scheme, provides the best search performance on natural queries obtained from the Web.
1 Introduction The book shopping domain poses interesting challenges for spoken dialog systems as the core interaction involves search for an often under-specified item, a book for which the user may have incomplete or incorrect information. Thus, the system needs to first identify a likely set of candidates for the target item, then efficiently reduce this set to match that item or items originally targeted by the user. This part of the process is characterized as “voice search” and several such systems have been described (Section 2). In this paper we focus on the voice search algorithm and specifically on two sources of difficulty: users not having an exact specification for a target, and queries being degraded through automatic speech recognition (ASR) and spoken language understanding (SLU) errors.
Cheongjae Lee, Academic Center for Computing and Media Studies, Kyoto University, Kyoto, Japan; e-mail: [email protected]
Tatsuya Kawahara, Academic Center for Computing and Media Studies, Kyoto University, Kyoto, Japan; e-mail: [email protected]
Alexander Rudnicky, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, USA; e-mail: [email protected]
One of the major problems is that users may not have exact information about a book. For example, if the correct book title is "ALICE'S ADVENTURES IN WONDERLAND AND THROUGH THE LOOKING-GLASS", the user may not remember the entire title and might say "I don't know the whole title but it's something like ALICE ADVENTURE". We previously found that 33% of 200 respondents in a survey did not have complete information about the book they wanted. Moreover, many titles are simply too long to say even when the user knows the exact title (in our database the longest title has 38 words). Thus, users often provide a few keywords instead of the exact title. There are additional peculiarities. For instance, "MISS PARLOA'S NEW COOK BOOK" is a book by Ms. Parloa in the cookbook category, so the title itself contains the author's name and the category. These problems may cause degradation in a system that attempts to parse the input. The problem is exacerbated by the large number of eBooks (currently 800,126 eBooks are available in the Amazon Kindle Store [http://www.amazon.com/Kindle-eBooks/, retrieved January 6th, 2011]) as well as by inconsistencies in the database. To address these problems, this paper presents a robust search algorithm based on sub-space models for voice search applications. Specifically, we propose a hybrid approach that combines slot-based models with a general model used as a back-off.
2 Related Work
Voice search [6] has been used in various applications, including automated directory assistance [8], consumer rating systems [9], multimedia search [4], and book search [3]. Early voice search systems primarily focused on the ASR and search problems involved in locating business or residential phone listings [8]; voice search has since been extended to general Web search, such as Google's, and recent systems have been applied to search entries in large multimedia databases [4]. While a simple string matching technique has been used to measure the similarity of an ASR output string to entity values in a database [3], vector space models (VSMs) have been more widely used [8, 9]. We also propose a hybrid search model using slot-based VSMs and a back-off scheme to improve search accuracy.
3 Data Collection
3.1 Backend Database
Our system contains a relational database (RDB) consisting of 15,088 eBooks sampled randomly from the Amazon Kindle Book website. For each book we harvested 17 attributes, including its title, authors, categories, price, and sales rank. Although many attributes can be used to search for appropriate books, it is not necessarily practical to handle all possible attributes. To define the set of slots for use in the system, we surveyed 221 people, all of whom had previously bought eBooks, about which information they typically have when they buy one. The top three answers were title (31.99%), authors (26.10%), and category (14.73%). Consequently, we focus on these three slots. Table 1 shows the statistics of the book database used in our system; these slots contribute 20,882 unique words to the system vocabulary.

Table 1  Statistics of the book database.

  Slots         Title    Author   Category
  Max. Length      38        10          5
  Avg. Length    6.99      2.25       1.53
  Voca. Size    13708      8159       1002
3.2 Query Collection using Amazon Mechanical Turk
A key challenge in building a voice search system is defining a habitable user language before a prototype system is available to collect actual user data. Often the procedure consists of the developer and a few other volunteers generating likely inputs as language data; this approach necessarily introduces a sampling bias. We sought to improve sample diversity by using the Amazon Mechanical Turk (MTurk) service to obtain user utterances at low cost. MTurk is an on-line marketplace for human workers (turkers) who perform "human intelligence tasks" (HITs) in exchange for small sums of money [2]. We created HITs to elicit utterances, providing metadata consisting of a title, authors, and a category. Turkers were asked to formulate a response to the question "How can I help you?" posed by a hypothetical human bookstore clerk. A typical query might be "I AM LOOKING FOR ALICE IN WONDERLAND BY CARROLL". Although we asked the turkers to think in terms of a spoken query, it is important to note that the collected queries were written, not spoken. In addition, although we have been developing a complete voice search system in which users find the target book by interacting with the system, the current queries were not collected in the context of a dialog; they correspond to the first turn of a dialog. We focus on the user's first turn because the first query must be processed well for efficient interaction.
4 Book Search Algorithm The search problem in our system is to return a relevant set of books given noisy queries. In this section, we describe how to search for relevant books in voice book search.
4.1 Baseline Vector Space Model (VSM)
Our vector-space search engine uses a term space, where each book is represented as a vector with specific weights in a high-dimensional space (v_i). A query is also represented as the same kind of vector (v_q). The retrieved list of books is created by calculating the cosine similarity s(v_q, v_i) between the two vectors as follows:
\[
s(v_q, v_i) = \frac{v_q \cdot v_i}{\lVert v_q \rVert\,\lVert v_i \rVert} \tag{1}
\]
If the vectors are normalized, the cosine similarity can be computed as the dot product between the unit vectors:
\[
s(v_q, v_i) = \hat{v}_q \cdot \hat{v}_i \tag{2}
\]
This formulation allows for rapid search, which is important because many vectors must be compared for each query. We use stemming to compact the representation, but we do not eliminate stop words, as some stop words are necessary and meaningful for identifying relevant books; for example, some titles consist only of stop words, such as "YOU ARE THAT" and "IT", and they would not be indexed correctly if stop words were filtered out. There are several different ways of assigning term weights, but not all are appropriate for this task. For example, TFxIDF does not work well for book search since most values and queries are too short to estimate reliable weights. We therefore use a simple term-count weight to represent term vectors, with each weight indicating the occurrence count of a given term. In the conventional single vector space model (here, SVSM), all terms from the different slots are indexed together over a single term space and every term is weighted equally regardless of its slot. In such a model, slot names are not needed for book search because all query terms are integrated into a single query vector. This model can be robust against SLU errors in which slot names are incorrectly extracted, and it may also be adequate for books whose title includes the author's name or category. However, it cannot capture inter-slot relationships; for example, it cannot exploit the slot structure of a mixed query such as "A MYSTERY BY CHRISTIE", which specifies both a category and an author.
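To make the single-space retrieval concrete, the following sketch (a minimal illustration, not the authors' implementation) builds term-count vectors over all slots jointly and ranks books by the normalized dot product of Eq. (2); the tokenization, the omission of stemming, and the data layout are simplifying assumptions.

```python
import math
from collections import Counter

def term_vector(text):
    # Simple term-count weights; stop words are kept, as in the SVSM above.
    return Counter(text.lower().split())

def unit(vec):
    # Normalize a term-count vector to unit length.
    norm = math.sqrt(sum(c * c for c in vec.values()))
    return {t: c / norm for t, c in vec.items()} if norm > 0 else dict(vec)

def cosine(u, v):
    # Dot product of unit vectors = cosine similarity (Eq. 2).
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def svsm_search(query, books, top_n=100):
    """books: list of dicts with 'title', 'author', 'category' strings (assumed schema)."""
    q = unit(term_vector(query))
    scored = []
    for book in books:
        # SVSM: index all slots together in one term space.
        doc = unit(term_vector(" ".join([book["title"], book["author"], book["category"]])))
        scored.append((cosine(q, doc), book))
    return sorted(scored, key=lambda x: x[0], reverse=True)[:top_n]

# Toy usage example
books = [{"title": "ALICE'S ADVENTURES IN WONDERLAND AND THROUGH THE LOOKING-GLASS",
          "author": "LEWIS CARROLL", "category": "FICTION"}]
print(svsm_search("I am looking for alice in wonderland by carroll", books)[0])
```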
4.2 Multiple VSM
We also consider a multiple VSM (MVSM) in which each slot j is independently indexed over its own sub-space, and each slot-based model is then interpolated with slot-specific weights w_j, as follows:
\[
l^{*} = \arg\max_{l} \sum_{j} w_j \cdot s_j(v_{q_j}, v_j) \tag{3}
\]
Fig. 1 The strategy of database search (IG: In-grammar, OOG: Out-of-grammar).
Each query is parsed using the Phoenix semantic parser [7]. The slot-based vector v_{q_j} for MVSM is generated from the slot values extracted by the parser (Figure 1). Although the interpolation weights w_j can usually be set empirically or by using held-out data, these weights can also be modified based on a user's preferences or on confidence scores derived from speech recognition; for instance, if the slots were unreliable, the slot values could be given less weight in the book search. In this work, the weights w_j were set based on the slot preferences that users expressed in our survey (see Section 3.1). For unfilled slots, the weight is set to 0 and the weights are renormalized dynamically according to the current slot-filling coefficient f_j, which is assigned a value of one if slot j is filled, as follows:
\[
\hat{w}_j = \frac{f_j \cdot w_j}{\sum_k f_k \cdot w_k} \tag{4}
\]
MVSM can be easily tuned to improve the quality of list generation. Nevertheless, incorrect word-to-slot mapping could degrade search performance relative to SVSM since MVSM relies on this mapping.
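A minimal sketch of the slot-wise scoring of Eqs. (3) and (4) follows; the slot weights are taken from the survey preferences quoted in Section 3.1, while the function names, tokenization, and data layout are assumptions rather than the authors' code.

```python
import math
from collections import Counter

SLOT_WEIGHTS = {"title": 0.3199, "author": 0.2610, "category": 0.1473}  # survey preferences (Sec. 3.1)

def cosine(a, b):
    # Cosine similarity between two term-count vectors (Counters).
    dot = sum(c * b.get(t, 0) for t, c in a.items())
    na = math.sqrt(sum(c * c for c in a.values()))
    nb = math.sqrt(sum(c * c for c in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def renormalized_weights(filled):
    # Eq. (4): zero the weight of unfilled slots and renormalize the rest.
    total = sum(SLOT_WEIGHTS[s] for s in filled)
    return {s: SLOT_WEIGHTS[s] / total for s in filled}

def mvsm_score(parsed_query, book):
    """parsed_query / book: dicts mapping slot name -> text (assumed layout)."""
    filled = [s for s, v in parsed_query.items() if v]
    w = renormalized_weights(filled)
    # Eq. (3): weighted sum of per-slot cosine similarities over separate sub-spaces.
    return sum(w[s] * cosine(Counter(parsed_query[s].lower().split()),
                             Counter(book[s].lower().split())) for s in filled)

print(mvsm_score({"title": "alice adventure", "author": "carroll", "category": ""},
                 {"title": "alice's adventures in wonderland", "author": "lewis carroll",
                  "category": "fiction"}))
```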
4.3 Hybrid VSM and a Back-off Scheme
Finally, we propose a hybrid search model (HVSM) in which SVSM and MVSM are linearly interpolated with a specific weight. We expect that HVSM can compensate for the individual drawbacks of the SVSM and MVSM models, though at the cost of additional computation. Some queries may be out-of-grammar (OOG) and do not result in a parse. In this case, SVSM can be used as a back-off search model in which all terms in the query are converted into an input vector without slot information, allowing the search to be carried out.
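The interpolation and back-off logic described above can be sketched as a small scoring rule; the interpolation weight and the function signature are illustrative assumptions.

```python
def hvsm_score(svsm_score, mvsm_score, parsed, lam=0.5):
    """svsm_score: SVSM cosine score on the whole query; mvsm_score: slot-based MVSM score
    (both computed as in the earlier sketches); parsed: True if the SLU parse succeeded."""
    if not parsed:
        # Back-off for out-of-grammar queries: rely on SVSM alone.
        return svsm_score
    # HVSM: linear interpolation of the two models with weight lam (an assumed, tunable value).
    return lam * svsm_score + (1.0 - lam) * mvsm_score
```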
5 Search Evaluation
5.1 Experiment Set-up
We use two evaluation metrics widely used in information retrieval. One is precision at n (P@n), which is the number of queries whose answer appears among the top n retrieved items divided by the total number of queries. The other is mean reciprocal rank (MRR), which is the average of the reciprocal ranks of the search results over a sample of queries [5]. In reality there may be multiple correct answers in a list when users do not have an exact book in mind; for example, some users may search for any fiction. Because it is difficult to automatically determine the relevance relationship between queries and lists, we identified a single correct book for each query.
We collected 948 textual queries with MTurk (Section 3.2). We then manually created a grammar covering the observed query patterns (e.g., "I'D LIKE A BOOK BY [AUTHOR]"), while the sub-grammars for the slot values were automatically generated from the book database. To define the sub-grammars, the book titles were tokenized into bags of words, since title queries combine content words (e.g., 'ALICE', 'ADVENTURE', 'WONDERLAND') in many orders and often omit functional words (e.g., 'IN', 'OF', 'THROUGH'). The author names were divided into first, middle, and last names because users may say either the full name or a partial name; for example, 'LEWIS', 'CARROLL', or 'LEWIS CARROLL' may all be spoken when a user is looking for books by 'LEWIS CARROLL'. Note that the 661 books used for collecting the queries were different from the books used to build the evaluation sub-grammars.
Parsing with the resulting grammar does not always map slot information correctly, even when a query is fully parsed; therefore, SLU introduces errors even for correct input. Out of 948 test queries, 392 had no parse result due to lack of coverage. The F1 score of the semantic parser on the 556 parsed queries is 75.20%, where slot values are counted as partial matches using the cosine similarity between reference and hypothesis, because the slot values do not necessarily match the database entries exactly. For instance, "ALICE" or "ADVENTURE" may be individually meaningful for retrieving relevant books even though "ALICE'S ADVENTURE" might not be parsed from the utterance "I AM LOOKING FOR ALICE'S ADVENTURE BY CARROLL". Table 2 shows the statistics of our test queries.

Table 2  Statistics of the queries collected in MTurk. #Queries is the number of queries of the given query type; Avg. Words and Avg. Slots are the average numbers of words and slots per query, respectively.

  Type       #Queries   Avg. Words   Avg. Slots
  Parsed          556        12.01         1.97
  Unparsed        392        15.77         2.10
  Total           948        13.56         2.02

Table 3  Evaluation results on textual queries (WER=0%). SVSM (naïve) refers to SVSM applied to the whole query.

  Query Type   SVSM (naïve)       SVSM (parsed)      MVSM               HVSM
               P@100    MRR       P@100    MRR       P@100    MRR       P@100    MRR
  Parsed       0.8849   0.6048    0.9137   0.7080    0.8327   0.6905    0.9335   0.7710
  Unparsed     0.8087   0.5386    -        -         -        -         -        -
  Total        0.8534   0.5774    0.8703   0.6380    0.8228   0.6296    0.8819   0.6763
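For reference, the two metrics defined in Section 5.1 can be computed as in the following sketch, assuming each query has exactly one correct book and that the retrieved identifiers are given in rank order.

```python
def precision_at_n(results, answers, n=100):
    # Fraction of queries whose single correct answer appears in the top-n list.
    hits = sum(1 for ranked_ids, gold in zip(results, answers) if gold in ranked_ids[:n])
    return hits / len(answers)

def mean_reciprocal_rank(results, answers):
    # Average of 1/rank of the correct answer (0 if it is not retrieved at all).
    rr = []
    for ranked_ids, gold in zip(results, answers):
        rr.append(1.0 / (ranked_ids.index(gold) + 1) if gold in ranked_ids else 0.0)
    return sum(rr) / len(rr)

# Usage example
print(precision_at_n([["b1", "b2"], ["b3"]], ["b2", "b9"], n=2))   # 0.5
print(mean_reciprocal_rank([["b1", "b2"], ["b3"]], ["b2", "b9"]))  # 0.25
```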
5.2 Evaluation on Textual Queries
To evaluate search performance on the textual queries collected through MTurk, we used the parsed, unparsed, and total query sets (Table 3). First, the results show that using the SLU output can improve search accuracy: although raw queries without SLU results can be used as input to the vector space model, parsed queries yield better performance across the different models. In addition, SLU results may be necessary to manage the subsequent dialog, such as confirming slot values and narrowing down the candidate items. Next, HVSM shows the best performance on the parsed and total queries. This suggests that the slot information in MVSM is useful for searching more precisely in an actual system, and that SVSM, which ignores slot names, is a necessary adjunct for overcoming SLU errors. Finally, the results also show that the back-off scheme is effective on unparsed inputs, since it can return relevant items even when the query is OOG.
Table 4  Evaluation results on noisy queries (WER=23.94%). SVSM (naïve) refers to SVSM applied to the whole query.

  Query Type   SVSM (naïve)       SVSM (parsed)      MVSM               HVSM
               P@100    MRR       P@100    MRR       P@100    MRR       P@100    MRR
  Parsed       0.8410   0.5526    0.8619   0.6519    0.8033   0.6211    0.8954   0.7037
  Unparsed     0.8152   0.5694    -        -         -        -         -        -
  Total        0.8217   0.5652    0.8270   0.5902    0.8122   0.5802    0.8354   0.6022
5.3 Evaluation on Noisy Queries
We generated noisy queries by applying a simple ASR error simulator [1] to the textual queries. These are not real spoken queries but artificially simulated ones, generated for a specific word error rate (WER) and error type distribution. We preliminarily evaluated the search performance on these queries (WER=23.94%), although simulated queries may differ from the output of a real ASR system (Table 4). Out of the 948 test queries, 239 had parse results, and the F1 score on the parsed queries was 64.30%. The decrease in the proportion of parsed queries had an adverse effect on the performance of the proposed method; however, HVSM still shows the best performance.
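The noisy queries were produced with the data-driven error simulator of [1]; purely as an illustration of the idea, a much simpler random-perturbation scheme targeting a given WER might look like the sketch below (the error-type split and vocabulary sampling are assumptions and not the method of [1]).

```python
import random

def corrupt(words, vocab, wer=0.2394, p_sub=0.6, p_ins=0.15):
    # Randomly substitute, insert, or delete words so that the expected
    # fraction of perturbed positions roughly approximates the target WER.
    out = []
    for w in words:
        if random.random() < wer:
            r = random.random()
            if r < p_sub:
                out.append(random.choice(vocab))        # substitution
            elif r < p_sub + p_ins:
                out.extend([w, random.choice(vocab)])   # insertion
            # else: deletion (emit nothing)
        else:
            out.append(w)
    return out

print(corrupt("i am looking for alice in wonderland".split(),
              vocab=["book", "alice", "the", "carroll"]))
```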
6 Conclusion and Discussion
We proposed a hybrid approach to voice search using an effective vector space model, HVSM. HVSM provides the best performance on natural queries sourced through MTurk; the approach can exploit slot information while overcoming OOG problems. Some issues have yet to be resolved. The main one is evaluating the book search model on spoken rather than typed queries. We have collected speech data as spoken queries using MTurk and are investigating various ASR hypothesis structures (e.g., n-best lists) as well as confidence scores that might be incorporated into our model to improve robustness.
References [1] Jung, S., Lee, C., Kim, K., Lee, G.G.: Data-driven user simulation for automated evaluation of spoken dialog systems. Computer Speech and Language 23(4), 479–509 (2009) [2] Marge, M., Banerjee, S., Rudnicky, A.I.: Using the Amazon Mechanical Turk for transcription of spoken language. In: Proc. ICASSP, pp. 5270–5273 (2010)
[3] Passonneau, R.J., Epstein, S.L., Ligorio, T., Gordon, J.B., Bhutada, P.: Learning about voice search for spoken dialogue systems. In: Proc. NAACL, pp. 840–848 (2010) [4] Song, Y.I., Wang, Y.Y., Ju, Y.C., Seltzer, M., Tashev, I., Acero, A.: Voice search of structured media data. In: Proc. IEEE ICASSP, pp. 3941–3944 (2009) [5] Voorhees, E.M., Tice, D.M.: The TREC-8 question answering track evaluation. In: Proc. Text Retrieval Conference TREC-8, pp. 83–105 (1999) [6] Wang, Y.Y., Yu, D., Ju, Y.C., Acero, A.: An introduction to voice search. IEEE Signal Processing Magazine 25(3), 29–38 (2008) [7] Ward, W., Issar, S.: Recent improvements in the CMU spoken language understanding system. In: Proc. ARPA Human Language Technology Workshop, pp. 213–216 (1994) [8] Yu, D., Ju, Y.C., Wang, Y.Y., Zweig, G., Acero, A.: Automated directory assistance system. In: Proc. INTERSPEECH, pp. 2709–2712 (2007) [9] Zweig, G., Nguyen, P., Ju, Y.C., Wang, Y.Y., Yu, D., Acero, A.: The voice-rate dialog system for consumer ratings. In: Proc. INTERSPEECH, pp. 2713–2716 (2007)
Preprocessing of Dysarthric Speech in Noise Based on CV–Dependent Wiener Filtering
Ji Hun Park, Woo Kyeong Seong, and Hong Kook Kim
Abstract In this paper, we propose a consonant–vowel (CV) dependent Wiener filter for dysarthric automatic speech recognition (ASR) in noisy environments. When a Wiener filter is applied to dysarthric speech in noise, it distorts the initial consonants of the dysarthric speech. This is because, compared to normal speech, the speech spectrum at a consonant–vowel onset in dysarthric speech is much more similar to that of noise, so speech at the onset is easily removed by Wiener filtering. In order to mitigate this problem, the transfer function of the Wiener filter is constructed differently depending on the result of a CV classification that is performed by combining voice activity detection (VAD) and vowel onset estimation. In this work, VAD is done by a statistical-model-based approach, and the vowel onset is estimated from the variation of the linear prediction residual signal. To demonstrate the effectiveness of the proposed CV–dependent Wiener filter on the performance of dysarthric ASR, we compare the performance of an ASR system employing the proposed method with that using a conventional Wiener filter for groups with different degrees of disability under different signal–to–noise ratio conditions. The ASR experiments show that the proposed Wiener filter achieves relative average word error rate reductions of 10.41%, 6.03%, and 0.94% for the mild, moderate, and severe disability groups, respectively, when compared to the conventional Wiener filter.
Ji Hun Park, Woo Kyeong Seong, and Hong Kook Kim
School of Information and Communications, Gwangju Institute of Science and Technology, Gwangju 500-712, Korea, e-mail: {jh_park,wkseong,hongkook}@gist.ac.kr

1 Introduction
Dysarthria comprises a family of motor speech disorders that arise from damage to the central or peripheral nervous system and is characterized by poor articulation [1]. In other words, slow, weak, imprecise, or uncoordinated movements of the speech production musculature result in reduced speech intelligibility [2]. Individuals with speech motor disorders also have physical disabilities caused by neuromotor impairment [3], so they cannot easily use typical interfaces such as a keyboard or a mouse. For dysarthric speakers, an automatic speech recognition (ASR) system is a very useful and practical tool for interacting with machines such as computers and mobile devices. Hence, an ASR-based human-computer interface would be highly desirable, despite the fact that ASR performance for dysarthric speech is dramatically degraded by reduced speech intelligibility. For this reason, a number of studies have investigated the feasibility of ASR for dysarthric speech [3, 4]. However, they have focused exclusively on recognition in quiet environments. Thus, in order to improve the usability of ASR for dysarthric speech in real environments, noise processing for dysarthric speech is necessary.
There are a number of techniques for reducing noise, including spectral subtraction, minimum mean square error estimation, and Wiener filtering [5]. Among them, Wiener filtering approaches have attracted significant attention due to their low complexity and relatively good performance. However, the transfer function of a Wiener filter is constructed recursively from the a priori and a posteriori signal-to-noise ratios (SNRs) estimated from the speech and noise spectra of the previous analysis frames. This can result in under-estimation of the a priori and a posteriori SNRs, especially at speech onset frames [6]. Many dysarthric individuals exhibit imprecise articulation of initial consonants [7]. Hence, compared to normal speech, the characteristics of initial consonants in dysarthric speech are much more similar to those of noise [8]. For this reason, initial consonants in dysarthric speech are easily distorted by Wiener filtering. Moreover, a consonant in dysarthric speech has a relatively longer duration than one in normal speech [7]. Consequently, an ASR system applied to dysarthric speech degrades more on initial consonants than on other phonemes.
In this paper, we propose a consonant-vowel (CV) dependent Wiener filter for preventing the distortions induced in the initial consonants of dysarthric speech. To this end, CV classification is first performed by combining voice activity detection (VAD) and vowel onset estimation. A Wiener filter is then designed differently according to the result of the CV classification.
2 CV-Dependent Wiener Filter
In this section, we describe how a CV-dependent Wiener filter for dysarthric speech can be constructed differently based on the result of CV classification. Fig. 1 shows a block diagram of the preprocessing of noisy dysarthric speech using the proposed CV-dependent Wiener filtering. As shown in the figure, the proposed method consists mainly of two blocks: one is CV classification based on VAD and vowel onset points (VOPs) estimated from linear prediction (LP) residual signals, and the other is Wiener filtering configured according to the result of the CV classification.
2.1 CV-Classified VAD
The VAD algorithm employed in this paper discriminates between speech and non-speech frames based on a likelihood ratio test (LRT). To this end, a statistical model is constructed using Gaussian distributions for the two hypotheses of speech absence and speech presence, represented as H0 and H1, respectively. In particular, the likelihood ratio between speech and non-speech is computed using the a priori and a posteriori SNRs, as follows [9]:
Fig. 1 Block diagram of the proposed CV-dependent Wiener filtering applied to noisy dysarthric speech: the input x(n) = s(n) + w(n) is analyzed by statistical-model-based VAD and LP-analysis-based VOP detection, whose outputs are combined into a CV-classified VAD flag; a CV-dependent Wiener filter is then designed and applied to produce the enhanced signal ŝ(n).
\[
R(t,k) = \frac{p(X(t,k)\mid H_1)}{p(X(t,k)\mid H_0)} = \frac{1}{1+\eta(t,k)}\exp\!\left(\frac{\gamma(t,k)\,\eta(t,k)}{1+\eta(t,k)}\right) \tag{1}
\]
where X(t,k) is the spectral component of the input noisy speech at the t-th time frame and k-th frequency bin (0 ≤ k < K). In addition, η(t,k) and γ(t,k) represent the a priori and a posteriori SNR, defined as η(t,k) = λ_S(t,k)/λ_W(t,k) and γ(t,k) = |X(t,k)|²/λ_W(t,k), respectively, where λ_S(t,k) and λ_W(t,k) denote the estimated power spectra of clean speech and noise. The likelihood ratio R(t,k) is derived under the assumption that the a priori SNR and the noise spectral components of a given noisy speech frame are known. In practice, however, these parameters are unknown because only the noisy speech is observed. Note that the noise spectral components are estimated from initial silent frames and updated from frames classified as non-speech, and the decision-directed (DD) method is employed to estimate the a priori SNR [9]. Assuming that the spectral components are statistically independent, the VAD flag in Fig. 1 is set from the average of the log likelihood ratios over all K frequency bins:
\[
VAD(t) = \begin{cases} 1 & \text{if } \frac{1}{K}\sum_{k} \log R(t,k) > \theta \\ 0 & \text{otherwise} \end{cases} \tag{2}
\]
where θ is a threshold for the LRT.
As shown in Fig. 1, we combine information about VOPs to classify each frame of noisy speech as a consonant or a vowel. According to a source-filter model of speech, the residual signal obtained from LP analysis corresponds to the excitation source [10]. In this paper, the variation of this LP residual signal is exploited for VOP detection. The characteristics of the excitation source change at both the fine and the gross level during speech production, and VOPs are the events associated with changes at the gross level. The LP residual signal therefore needs to be smoothed to obtain a better estimate of the VOPs; here, the envelope of the LP residual signal is smoothed by convolving it with a 50-ms Hamming window. As mentioned earlier, each VOP is associated with an instant at which there is a significant change in the smoothed envelope of the LP residual. To make such changes clear, the first-order difference (FOD) of the smoothed envelope is computed and further smoothed by convolving with a 20-ms Hamming window. Next, the local maxima of the FOD along the analysis frames are identified. Among the maxima, we eliminate those whose value is less than
a predefined value. Next, we search for the earliest remaining local maximum. Based on the assumption that it is very rare for two VOPs to occur within a 100-ms interval, this local maximum is declared a VOP if no other local maximum exists within 100 ms; otherwise, we select as the VOP whichever of the two maxima has the higher value. This search continues until the end of the utterance. Since most Korean syllables are of the CV or CVC type, each VOP can be interpreted as the junction between an initial consonant and a vowel. Therefore, the region immediately preceding a VOP is regarded as the initial consonant region. Let I_vop be the set of closed intervals constructed from the VOPs, I_vop = {[VOP(i) − ε, VOP(i)] | i = 1, ..., M}, where VOP(i) is the i-th VOP, M is the total number of VOPs in the utterance, and [x, y] is the closed interval including x and y. Finally, the CV-classified VAD result is defined by combining the VAD flag in Eq. (2) and the VOPs as
\[
CV(t) = \begin{cases} 0 & \text{if } VAD(t) = 0 \\ 1 & \text{if } VAD(t) = 1 \text{ and } t \in I_{vop} \\ 2 & \text{otherwise.} \end{cases} \tag{3}
\]
Note that we set ε to 10 in this paper, considering the average duration of consonants in dysarthric speech [7].
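A minimal sketch of the frame labeling in Eqs. (2) and (3) is given below, assuming the per-bin log likelihood ratios and the detected VOP frame indices are already available; the threshold and the value of ε are placeholders.

```python
import numpy as np

def cv_classify(log_R, vops, theta=0.5, eps=10):
    """log_R: (T, K) array of log likelihood ratios; vops: list of VOP frame indices.
    Returns CV(t) in {0: non-speech, 1: initial consonant, 2: vowel / other speech}."""
    T = log_R.shape[0]
    vad = (log_R.mean(axis=1) > theta).astype(int)       # Eq. (2)
    consonant = np.zeros(T, dtype=bool)
    for v in vops:                                        # I_vop = [VOP - eps, VOP]
        consonant[max(0, v - eps):v + 1] = True
    cv = np.zeros(T, dtype=int)
    cv[(vad == 1) & consonant] = 1                        # Eq. (3): initial consonant
    cv[(vad == 1) & ~consonant] = 2                       # Eq. (3): remaining speech
    return cv

# Usage example with random scores and one VOP at frame 40
print(cv_classify(np.random.randn(100, 64), vops=[40]))
```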
2.2 CV-Dependent Wiener Filter
In this subsection, we will describe how to design a Wiener filter using the CV classification result defined in Eq. (3). A noisy speech signal, x(n) = s(n) + w(n), is represented as
\[
X(t,k) = S(t,k) + W(t,k) \tag{4}
\]
where S(t,k) and W(t,k) are the spectral components, at the t-th time frame and k-th frequency bin, of the clean speech s(n) and the noise w(n), respectively. The spectral component of the denoised speech, Ŝ(t,k), is obtained by applying a Wiener filter as
\[
|\hat{S}(t,k)|^2 = H(t,k)\cdot|X(t,k)|^2 = \frac{\eta(t,k)}{1+\eta(t,k)}\,|X(t,k)|^2 \tag{5}
\]
where H(t,k) is the transfer function of the Wiener filter. A priori SNR, η(t,k), is estimated by employing the DD approach defined as
\[
\eta(t,k) = \alpha\,\frac{|\hat{S}(t-1,k)|^2}{\lambda_W(t-1,k)} + (1-\alpha)\cdot\max[\gamma(t,k)-1,\,0] \tag{6}
\]
where α denotes a weighting factor between the estimate of the a priori SNR and that of the a posteriori SNR, γ(t,k), defined as γ(t,k) = |X(t,k)|²/λ_W(t,k). In addition, λ_W(t,k) indicates the noise power spectrum at the k-th frequency bin of the t-th time frame, which is estimated depending on the CV classification in order to mitigate the underestimation of η(t,k) at speech onset frames. That is, λ_W(t,k) is estimated as
\[
\lambda_W(t,k) = \begin{cases} \beta\,\lambda_W(t-1,k) + (1-\beta)\,|X(t,k)|^2 & \text{if } CV(t) = 0 \\ \lambda_W(t-1,k)\,D(\eta(t-1,k)) & \text{if } CV(t) = 1 \\ \lambda_W(t-1,k) & \text{otherwise} \end{cases} \tag{7}
\]
where β is a forgetting factor and D(·) is a down-weighting function defined as D(x) = 1/{1 + exp(−a(x + b))}, in which a and b represent the gradient and displacement of the sigmoid function, respectively. As mentioned earlier, since most Korean syllables are of the CV or CVC type, we focus on the modification of λ_W(t,k) in the initial consonant intervals. In this work, we set α in Eq. (6), β in Eq. (7), and a and b in the down-weighting function to 0.98, 0.95, 0.2, and 5, respectively, based on preliminary experiments.
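The per-frame processing of Eqs. (5)-(7) can be sketched as follows; the parameter values follow the settings quoted above, while the framing, FFT handling, and initialization are assumptions.

```python
import numpy as np

def cv_wiener_frame(X, cv, lam_w_prev, S_prev, eta_prev,
                    alpha=0.98, beta=0.95, a=0.2, b=5.0):
    """One frame of CV-dependent Wiener filtering.
    X: complex spectrum of frame t; cv: CV(t) label; lam_w_prev: noise power lambda_W(t-1);
    S_prev: |S_hat(t-1)|^2; eta_prev: a priori SNR eta(t-1) (all (K,) arrays)."""
    P = np.abs(X) ** 2
    # Eq. (7): CV-dependent noise power update.
    if cv == 0:                      # non-speech: recursive smoothing with the current frame
        lam_w = beta * lam_w_prev + (1 - beta) * P
    elif cv == 1:                    # initial consonant: down-weight the noise estimate
        D = 1.0 / (1.0 + np.exp(-a * (eta_prev + b)))
        lam_w = lam_w_prev * D
    else:                            # vowel / other speech: keep the previous estimate
        lam_w = lam_w_prev
    gamma = P / lam_w                                                  # a posteriori SNR
    eta = alpha * S_prev / lam_w_prev + (1 - alpha) * np.maximum(gamma - 1.0, 0.0)  # Eq. (6)
    S_hat = (eta / (1.0 + eta)) * P                                    # Eq. (5)
    return S_hat, lam_w, eta
```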
3 Performance Evaluation
In this section, we evaluate the effect of the proposed CV-dependent Wiener filter on ASR performance and compare it with that of a conventional Wiener filter [5]. For the experiments, we recorded dysarthric speech utterances as a test database, composed of 100 Korean command words for device control. Each command word was spoken by 30 dysarthric speakers in three groups classified by degree of disability: severe, moderate, and mild, with each group composed of five male and five female speakers. To simulate noisy environments, we artificially added three different types of noise (babble, car, and home) at an SNR of 10 or 20 dB. As a training database, we prepared 18,240 isolated-word utterances from the Korean speech corpus [11]. The acoustic models were triphones represented by three-state left-to-right hidden Markov models (HMMs) with four Gaussian mixtures. For the language model, the lexicon size was 100 words and a finite state network grammar was employed. Fig. 2 compares the average word error rates (WERs) of the baseline ASR system and the ASR systems employing the conventional Wiener filter and the proposed CV-dependent Wiener filter according to the degree of disability under babble, car, and
home noise conditions. In the figure, the WER, depicted as a black bar, is averaged over all noise types and SNRs, and the relative WER reduction of the proposed Wiener filter against the baseline or the conventional Wiener filter is represented by a gray box. As shown in the figure, the proposed CV-dependent Wiener filter provided the smallest WER for all disability groups. In particular, for the mild group, a relative WER reduction of 10.41% was achieved by the proposed Wiener filter compared to the conventional Wiener filter. However, due to the very poor speech intelligibility, the WER for the severe group was extremely high regardless of the noise processing method.

Fig. 2 Comparison of WERs (%) of the baseline ASR system and the ASR systems employing the conventional and proposed Wiener filters for different groups of disability under babble, car, and home noise conditions: (a) mild group, (b) moderate group, (c) severe group, and (d) average over all groups.
4 Conclusion
In this paper, we proposed a CV-dependent Wiener filter for dysarthric speech recognition under noisy conditions. By incorporating CV-classified VAD, the proposed Wiener filter can be estimated differently according to the CV classification result. We performed ASR experiments under simulated noise conditions for three disability groups: mild, moderate, and severe. As a result, an ASR system employing the proposed CV-dependent Wiener filter achieved relative WER reductions of 10.41%, 6.03%, and 0.94% for the mild, moderate, and severe groups, respectively, when compared to the system using the conventional Wiener filter.
Acknowledgements This work was supported in part by the R&D Program of MKE/KEIT (10036461, Development of an embedded key-word spotting speech recognition system individually customized for disabled persons with dysarthria) and the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (No. 2010-0023888).
References [1] Haines D (2004) Neuroanatomy: an Atlas of Structures, Sections, and Systems. Lippincott Williams and Wilkins, Hagerstown [2] Platt LJ, Andrews G, Young M, Quinn PT (1980) Dysarthria of adult cerebral palsy: I. Intelligibility and articulatory impairment. Journal of Speech and Hearing Research 23(1):28–40 [3] Hasegawa–Johnson M, Gunderson J, Perlman A, Huang T (2006) HMM– based and SVM–based recognition of the speech of talkers with spastic dysarthria. in Proc. of International Conference on Acoustics, Speech, and Signal Processing 1:1060–1063 [4] Parker M, Cunningham S, Enderby P, Hawley, M, Green P (2006) Automatic speech recognition and training for severely dysarthric users of assistive technology: the STARDUST project. Clinical Linguistics and Phonetics 20(2/3):149–156 [5] Benesty J, Makino S, Chen J (2005) Speech Enhancement. Springer, Berlin [6] Erkelens JS, Heusdens R (2008) Tracking of nonstationary noise based on data–driven recursive noise power estimation. IEEE Trans. on Audio, Speech, and Language Processing 16(6):1112–1123 [7] Kent RD, Rosenbek JC (1983) Acoustic patterns of apraxia of speech. Journal of Speech and Hearing Research 26(2):231–249
[8] Platt LJ, Andrews G, Howie PM (1980) Dysarthria of adult cerebral palsy: II. Phonemic analysis of articulation errors. Journal of Speech and Hearing Research 23(1):41–55 [9] Sohn J, Kim NS, Sung W (1999) A statistical model based voice activity detection. IEEE Signal Processing Letters 6(1):1–3 [10] Prasanna SR, Reddy BV, Krishnamoorthy P (2009) Vowel onset point detection using source, spectral peaks, and modulation spectrum energies. IEEE Trans. on Audio, Speech, and Language Processing 17(4):556–565 [11] Kim S, Oh S, Jung HY, Jeong HB, Kim JS (2002) Common speech database collection. in Proc. Acoustical Society of Korea 21(1):21–24
Conditional Random Fields for Modeling Korean Pronunciation Variation
Sakriani Sakti, Andrew Finch, Chiori Hori, Hideki Kashioka, and Satoshi Nakamura
Abstract This paper addresses the problem of modeling Korean pronunciation variation as a sequential labeling task in which tokens in the source language (phonemic symbols) are labeled with tokens in the target language (orthographic Korean transcription). This is done by utilizing conditional random fields (CRFs), which are undirected graphical models that maximize the posterior probability of the target label sequence given the input source sequence. In this study, the proposed CRF-based pronunciation variation model is applied in our Korean LVCSR after we perform standard hidden Markov model (HMM)-based recognition of the phonemic syllables of the actual pronunciation (surface forms). The goal is then to output a sequence of Korean orthography given a sequence of phonemic syllable surface forms. Experimental results show that the proposed CRF model helps enhance our Korean large-vocabulary continuous speech recognition system.
S. Sakti, A. Finch, C. Hori, H. Kashioka, and S. Nakamura
National Institute of Information and Communications Technology (NICT), Japan; e-mail: {sakriani.sakti,andrew.finch,chiori.hori,hideki.kashioka,satoshi.nakamura}@nict.go.jp
S. Sakti and S. Nakamura are currently with the Nara Institute of Science and Technology (NAIST), Japan; e-mail: {ssakti,s-nakamura}@is.naist.jp

1 Introduction
It is well known that the pronunciation of a word is not always the same as its orthographic transcription. The effect of pronunciation variation on large vocabulary continuous speech recognition (LVCSR) systems has been widely studied [1]; the reports in [1] show that pronunciation variation is one of the most important factors affecting the construction of LVCSR systems. In the Korean language, a large proportion of word units are pronounced differently from their written forms, due to the language's agglutinative and highly inflective nature together with severe phonological phenomena and coarticulation effects. Manual construction of Korean pronunciation variation models has often been carried out in order to provide a high recognition rate, but this process is expensive and time-consuming. Many studies have proposed methods
for generating pronunciation variants automatically [2, 3]. The work in [4] provided a model that produces Korean pronunciation variants based on morphophonological analysis. However, that strategy requires a large dictionary and complex morphophonemic rules, and some exceptions still exist that cannot be covered by rules alone. In this paper, we address the problem of modeling Korean pronunciation variation as a sequential labeling task in which tokens in the source language (phonemic symbols) are labeled with tokens in the target language (orthographic Korean transcription). This is done by utilizing conditional random fields (CRFs) [5], which have been widely used for sequential learning problems in natural language processing tasks (e.g., part-of-speech tagging [5], morphological analysis [6], shallow parsing [7], and named entity recognition [8]). CRFs are undirected graphical models that maximize the posterior probability of the target label sequence given the input source sequence. In this study, the proposed CRF-based pronunciation variation model is applied to our Korean LVCSR after we perform standard hidden Markov model (HMM)-based recognition of the phonemic syllables of the actual pronunciation (surface forms). The goal is then to output a sequence of Korean orthography given a sequence of phonemic syllable surface forms.
2 Speech Recognition Framework
Given the feature vectors x = [x_1, x_2, ..., x_T] of the speech signal, the statistical speech recognition task is to find an orthographic Korean sequence w_e = [w_{e_1}, w_{e_2}, ..., w_{e_N}] that maximizes the conditional probability P(w_e|x). Here, the orthographic sequence w_e can be either an eumjeol or an eojeol sequence. However, choosing the eojeol as the basic recognition unit w_e leads to high OOV rates, whereas choosing the eumjeol unit results in high acoustic confusability due to severe phonological phenomena. In this study, we introduce an intermediate symbol, the phonemic syllable surface form s_p = [s_{p_1}, s_{p_2}, ..., s_{p_M}], where one phonemic syllable corresponds to exactly one possible pronunciation. The conditional probability becomes:
\[
\hat{w}_e = \arg\max_{w_e} P(w_e|x) = \arg\max_{w_e} \Big\{ \sum_{s_p} P(w_e, s_p|x) \Big\}
\approx \arg\max_{w_e} \Big\{ \max_{s_p} P(w_e|s_p)\,P(s_p|x) \Big\}
\approx \arg\max_{w_e} P(w_e|\hat{s}_p), \quad \text{where } \hat{s}_p = \arg\max_{s_p} P(s_p|x). \tag{1}
\]
This equation suggests that the speech recognition task can be constructed as a serial architecture composed of two independent parts:
1. The first part finds the most probable phonemic syllable sequence ŝ_p. This is performed by standard HMM-based speech recognition in which the phonemic syllable is used as the recognition unit:
\[
\hat{s}_p = \arg\max_{s_p} P(s_p|x) = \arg\max_{s_p} P(x|s_p)\,P(s_p), \tag{2}
\]
where P(s_p) denotes a language model (LM) over phonemic syllable units and P(x|s_p) denotes an acoustic model (AM). In this manner, the lexicon dictionary and OOV
rates can be kept small, while avoiding high acoustic confusability. Here, the Korean orthography of the written transcription has not yet been considered.
2. The second part then transforms the phonemic syllable surface forms ŝ_p into the desired orthographic recognition units w_e by utilizing the CRF framework. In this preliminary study, the goal is to output a sequence of Korean orthographic syllables (eumjeol) given a sequence of phonemic syllable surface forms. Details of the CRF-based pronunciation modeling are described in the following section.

Fig. 1 Generation of Korean orthographic eumjeols given the phonemic syllable surface forms.
3 Conditional Random Field Approach
Conditional random fields (CRFs) are undirected graphical models that maximize the posterior probability of the target label sequence given the input source sequence. The concept was first introduced by Lafferty et al. [5] for segmenting and labeling sequence data. Assuming that the graphical structure of our Korean pronunciation variation model is fixed and forms a simple first-order chain, as illustrated in Fig. 1, the CRF provides the conditional probability of the Korean orthographic syllable sequence w_e = [w_{e_1}, w_{e_2}, ..., w_{e_N}] given the phonemic syllable sequence s_p = [s_{p_1}, s_{p_2}, ..., s_{p_N}] by the fundamental theorem of random fields [9] as follows:
\[
\hat{w}_e = \arg\max_{w_e} P(w_e|s_p), \quad \text{where } P(w_e|s_p) = \frac{1}{Z(s_p)}\exp\!\Big(\sum_i \sum_j \lambda_j\,f_j(w_{e_{i-1}}, w_{e_i}, s_p, i)\Big). \tag{3}
\]
Z(s_p) is a normalization factor, λ_j is the weight of feature f_j, and f_j(w_{e_{i-1}}, w_{e_i}, s_p, i) is a feature function over the label sequence at positions i and i−1. This formulation can be efficiently estimated with dynamic programming using the Viterbi algorithm. Further details regarding CRFs can be found in [5].
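As an illustration of this labeling setup (using the sklearn-crfsuite toolkit rather than the CRF++ configuration described in Section 4), a linear-chain CRF mapping phonemic syllables to orthographic syllables could be trained roughly as follows; the feature window mirrors the Type-A setting of Section 4, and the romanized toy syllables are assumptions.

```python
import sklearn_crfsuite

def sp_features(sp, i):
    # Local observation features over sp[i-1 .. i+1] (comparable to the Type A feature set).
    f = {"sp[0]": sp[i]}
    if i > 0:
        f["sp[-1]"] = sp[i - 1]
        f["sp[-1]|sp[0]"] = sp[i - 1] + "|" + sp[i]
    if i < len(sp) - 1:
        f["sp[+1]"] = sp[i + 1]
        f["sp[0]|sp[+1]"] = sp[i] + "|" + sp[i + 1]
    return f

# Toy, romanized training pair (illustrative): surface syllables -> orthographic syllables.
surface = ["gwan", "ni", "da"]
orthography = ["gwan", "i", "da"]
X_train = [[sp_features(surface, i) for i in range(len(surface))]]
y_train = [orthography]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)
print(crf.predict(X_train))  # transition features (like F10) are learned automatically
```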
4 CRFs Feature Set
Figure 1 also shows an overall example of the generation process of the Korean orthographic syllable (eumjeol) sequence given the sequence of phonemic syllable surface forms. Each phonemic syllable in a sentence is transformed into the desired eumjeol using the proposed CRF model. The feature function f_j(w_{e_{i-1}}, w_{e_i}, s_p, i) (also shown in Eq. 3) is designed to reflect the association between the observed phonemic syllable sequence s_p = [s_{p_1}, s_{p_2}, ..., s_{p_N}]
and the previous and current orthographic syllables, w_{e_{i-1}} and w_{e_i}, respectively. However, too many sparse feature functions may be produced. For this reason, in practice we typically use binary-valued functions, which are local features that take only the subsequence s_{p_{i-k}}, ..., s_{p_{i+k}} into account. A feature function example from Fig. 1 is f_j(w_{e_{i-1}}, w_{e_i}, s_p, i) = 1 if w_{e_{i-1}} = "gwan", w_{e_i} = "i", and s_{p_i} = "ni", and f_j(w_{e_{i-1}}, w_{e_i}, s_p, i) = 0 otherwise. In our system, the CRF implementation was based on the well-known open-source software CRF++ [10]. Three types of CRF are used with different combinations of features (see Table 1). The feature functions are produced according to pre-defined feature templates, as described in Table 2. Templates F00-F09 define state feature functions between the current state w_{e_i} and the observation subsequence s_{p_{i-k}}, ..., s_{p_{i+k}}, and template F10 defines the transition feature function between the neighboring states w_{e_{i-1}} and w_{e_i}.

Table 1  The three types of CRF and their feature sets.

  CRF      Feature sets            s_p range                      w_e range
  Type A   F01-F03, F05-F06        s_{p_{i-1}}, ..., s_{p_{i+1}}  w_{e_i}
  Type B   F01-F03, F05-F06, F10   s_{p_{i-1}}, ..., s_{p_{i+1}}  w_{e_{i-1}}, w_{e_i}
  Type C   F00-F10                 s_{p_{i-2}}, ..., s_{p_{i+2}}  w_{e_{i-1}}, w_{e_i}

Table 2  Templates of the feature functions.

  ID    Template                                            ID    Template
  F00   f(w_{e_i}, s_{p_{i-2}}, i)                          F06   f(w_{e_i}, s_{p_i}, s_{p_{i+1}}, i)
  F01   f(w_{e_i}, s_{p_{i-1}}, i)                          F07   f(w_{e_i}, s_{p_{i-2}}, s_{p_{i-1}}, s_{p_i}, i)
  F02   f(w_{e_i}, s_{p_i}, i)                              F08   f(w_{e_i}, s_{p_{i-1}}, s_{p_i}, s_{p_{i+1}}, i)
  F03   f(w_{e_i}, s_{p_{i+1}}, i)                          F09   f(w_{e_i}, s_{p_i}, s_{p_{i+1}}, s_{p_{i+2}}, i)
  F04   f(w_{e_i}, s_{p_{i+2}}, i)                          F10   f(w_{e_{i-1}}, w_{e_i}, i)
  F05   f(w_{e_i}, s_{p_{i-1}}, s_{p_i}, i)
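For reference, the templates in Table 2 map closely onto a CRF++ template file; the sketch below is our illustrative rendering of that mapping (not the authors' actual configuration), where %x[r,0] denotes the phonemic syllable r positions from the current one and B adds the transition feature corresponding to F10.

```
# Unigram (state) feature templates, corresponding to F00-F09
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
U05:%x[-1,0]/%x[0,0]
U06:%x[0,0]/%x[1,0]
U07:%x[-2,0]/%x[-1,0]/%x[0,0]
U08:%x[-1,0]/%x[0,0]/%x[1,0]
U09:%x[0,0]/%x[1,0]/%x[2,0]

# Bigram (transition) feature template, corresponding to F10
B
```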
5 Experimental Evaluation
The experiments were conducted using the large-vocabulary continuous Korean speech database developed by the Speech Information Technology and Industry Promotion Center (SiTEC) [16]. There are about 200 speakers (100 male, 100 female) for the phonetically balanced sentences and 800 speakers (400 male, 400 female) for the dictation application sentences. Each speaker uttered about 100 sentences, resulting in a total of 100,000 utterances (about 70 hours of speech). Orthographic transcriptions annotated with pronunciation were available for the whole corpus. The last two prompt sets from each of Sent01, Dict01, and Dict02 were allocated to the test set, while the remaining data were used as the training set. A sampling frequency of 16 kHz, a frame length of a 20-ms Hamming window, a frame shift of 10 ms, and 25-dimensional feature parameters consisting of 12th-order MFCCs, ΔMFCCs, and Δ log power were used. The full phoneme set, as defined in [11], contains a total of 40 phoneme symbols, consisting of 19 consonants and 21 vowels (including nine monophthongs and 12 diphthongs). A three-state HMM was used as the initial model for each phoneme. Then, a state-level HMnet was obtained using a successive state splitting (SSS) algorithm based
on the minimum description length (MDL) criterion in order to obtain an optimal structure in which triphone contexts are shared and tied at the state level. Details about MDL-SSS can be found in [12]. The resulting context-dependent triphones had 2,231 states in total, with 5-20 Gaussian mixture components per state. The baseline pronunciation dictionary has orthographic syllables as its lexical entries, and these syllables have multiple pronunciations (4,252 lexical entries in total). The baseline orthographic syllable language models have trigram perplexities of 16.6, 20.6, and 31.2 on the Sent01, Dict01, and Dict02 test sets, respectively. The proposed pronunciation dictionary and language model unit are based on the phonemic syllable units of the actual pronunciation (surface forms). Thus no multiple pronunciations exist, and the resulting lexicon size is only one-third of the baseline lexicon (1,337 lexical entries in total). This language model has slightly higher trigram perplexities of 18.7, 22.4, and 31.3 on the Sent01, Dict01, and Dict02 test sets, respectively. The performance on eumjeol target unit sequences for CRF types A, B, and C, in comparison with the baseline system on the Sent01, Dict01, and Dict02 test sets, is shown in Fig. 2(a). Note that this performance can also be regarded as character accuracy, allowing fair comparison with other approaches or languages. The best system achieved an 8.76% eumjeol error rate (CRF type C on the Sent01 test set), yielding a 26.7% error rate reduction with respect to baseline orthographic syllable recognition. In addition, we also conducted experiments with another well-established approach: we applied a joint source-channel N-gram model [13, 14], in view of the close coupling between the phonemic syllable surface form s = [s_1, s_2, ..., s_M] as the source and the Korean orthographic syllable (eumjeol) sequence e = [e_1, e_2, ..., e_M] as the target. Figure 2(b) shows the eumjeol error rate on the Sent01 test set for both the joint source-channel N-gram model and the proposed model. The results reveal that the proposed model outperformed the joint source-channel N-gram model on all the tasks.
Fig. 2 Eumjeol error rates of the LVCSR system using the proposed CRF pronunciation model in comparison with (a) the baseline system and (b) the joint source-channel N-gram model.
6 Conclusions
We demonstrated the possibility of utilizing a CRF framework to model pronunciation variation in a Korean LVCSR system. The method transforms phonemic syllable surface forms into Korean orthographic symbols. The proposed CRF pronunciation variation model is applied after standard HMM-based recognition of the phonemic syllable surface forms; the goal is then to map the given phonemic syllable sequence into a Korean orthography sequence. The results reveal that incorporating the model into our Korean LVCSR system helps enhance its performance: our model outperformed both the baseline orthographic syllable recognition and the joint source-channel N-gram model. The entire process requires only annotated texts, without any linguistic knowledge, making it applicable to other agglutinative languages.
7 Acknowledgements The authors would like to thank Chooi-Ling Goh for her support and useful discussion regarding the CRF framework.
References [1] H. Strik and C. Cucchiarini, “Modeling pronunciation variation for ASR: A survey of the literature,” Speech Communication, vol. 29, pp. 225–246, 1999. [2] B. Kim, G.G. Lee, and J. Lee, “Morpheme-based grapheme to phoneme conversion using phonetic patterns and morphophonemic connectivity information,” ACM Transactions on Asian Language Information Processing, vol. 1, no. 1, pp. 6582, 2002. [3] J. Jeon, S. Wee, and M. Chung, “Generating pronunciation dictionary by analyzing phonological variations frequently found in spoken Korean,” in Proc. of International Conference on Speech Processing, 1997, pp. 519–524. [4] J. Jeon, S. Cha, M. Chung, and J. Park, “Automatic generation of Korean pronunciation variants by multistage applications of phonological rules,” in Proc. of ICSLP, Sydney, Australia, 1998, pp. 1943–1946. [5] J. Lafferty, A. McCallum, and F. Pereira, “Conditional random fields: Probabilistic models for segmenting and labeling sequence data,” in Proc. of ICML, Williamstown, MA, USA, 2001, pp. 282–289. [6] T. Kudo, K. Yamamoto, and Y. Matsumoto, “Applying conditional random fields to Japanese morphological analysis,” in Proc of EMNLP, 2004, pp. 230–237. [7] F. Sha and F. Pereira, “Shallow parsing with conditional random fields,” in Proc. of HLTNAACL, Edmonton, Canada, 2003, pp. 213–220. [8] J.R. Finkel and C.D. Manning, “Joint parsing and named entity recognition,” in Proc. of NAACL, Boulder, Colorado, USA, 2009, pp. 326–334. [9] J. Hammersley and P. Clifford, “Markov fields and finite graphs and lattices,” 1971. [10] T. Kudo, “CRF++: Yet another CRF toolkit,” http://crfpp.sourceforge.net/, 2005. [11] M. Kim, Y.R. Oh, and H.K. Kim, “Non-native pronunciation variation modeling using an indirect data driven method,” in Proc. of ASRU, Kyoto, Japan, 2007, pp. 231–236. [12] T. Jitsuhiro, T. Matsui, and S. Nakamura, “Automatic generation of non-uniform HMM topologies based on the MDL criterion,” IEICE Trans. Inf. & Syst., vol. E87-D, no. 8, pp. 2121–2129, 2004. [13] H. Li, M. Zhang, and J. Su, “A joint source-channel model for machine transliteration,” in Proc. of ACL, Barcelona, Spain, 2004, pp. 160–167.
[14] M. Bisani and H. Ney, “Joint-sequence models for grapheme-to-phoneme conversion,” Speech Communication, vol. 50, pp. 434–451, 2008.
An Analysis of the Speech Under Stress Using the Two-Mass Vocal Fold Model
Xiao Yao, Takatoshi Jitsuhiro, Chiyomi Miyajima, Norihide Kitaoka, and Kazuya Takeda
Abstract We focus on the glottal source of speech production, which is essential for understanding the behavior of the vocal folds when speech is produced under psychological stress. A spectral flatness measure (SFM) is introduced as a tool to evaluate stress levels in speech. Further, the relationship between the physical parameters of the two-mass vocal fold model and the proposed stress level measurement is established. The physical parameters of the two-mass model are examined and analyzed in comparison with the measurement, in order to estimate the state of the vocal folds of people experiencing stress in future work. In this paper, experiments are performed using stressed speech gathered from real telephone conversations to evaluate the stress level measurement. Results show that the SFM can detect stress and can be used as a measurement for differentiating stressed from neutral speech. Furthermore, changes in the physical parameters can be analyzed to understand the behavior of the vocal folds when stress occurs.
Xiao Yao, Graduate School of Information Science, Nagoya University, Aichi, Japan; e-mail: [email protected]
Takatoshi Jitsuhiro, Department of Media Informatics, Aichi University of Technology, Gamagori, Japan; e-mail: [email protected]
Chiyomi Miyajima, Graduate School of Information Science, Nagoya University, Aichi, Japan; e-mail: [email protected]
Norihide Kitaoka, Graduate School of Information Science, Nagoya University, Aichi, Japan; e-mail: [email protected]
Kazuya Takeda, Graduate School of Information Science, Nagoya University, Aichi, Japan; e-mail: [email protected]
1 Introduction
It has recently become much more important to estimate a person's mental condition, especially from speech. Particularly for call center systems, detection techniques applied to caller speech can significantly improve customer service. A study of speech under stress is therefore needed, both to improve recognition of people's mental state and to understand the context in which the speaker is communicating. Many scholars have devoted their research to the analysis of stress in speech. Stress is a psycho-physiological state characterized by subjective strain, dysfunctional physiological activity, and deterioration of performance [1]. Various external and internal factors may induce stress, which is likely to be detrimental to the performance of communication equipment and systems with vocal interfaces [2]. The performance of a speech recognition algorithm is significantly challenged when speech is produced in stressful environments. The influence of the Lombard effect on speech recognition was the focus of [3], and in some special environments workload task stress has been shown to have a significant impact on the performance of speech recognition systems [4]. Work has also been done on classifying stressed speech using a linear speech production model [5].
Our work concentrates mainly on the analysis of stressed speech based on a speech production model, rather than on observed speech features, in order to gain a deeper understanding of the working mechanisms of the vocal folds responsible for different speaking styles. Our final target is to simulate speech in several speaking styles with the two-mass model in order to detect stressed speech. For this purpose, we explore the properties of the underlying physical speech production system and search for essential factors related to stress. The glottal flow, which is the source of speech, is mainly examined in the attempt to explain changes in speech production under stress. As a result of this study, the characteristics of glottal flow can be related to physical parameters, and an explanation of how the two-mass model applies to real speech can be made.
In this paper, we describe our recent work: the physical parameters of the two-mass model are examined and analyzed in comparison with a stress measurement, and this measurement is introduced to evaluate the presence of stress in real-world speech. We concentrate on analyzing the ability of the model to detect the presence of stress by exploring the relationship between the stress level measurement and the parameters of the two-mass model; the variation in stiffness is then studied to estimate the behaviour of the vocal folds. The paper is structured as follows. Section 2 describes the stress level measurement used to detect stress in this paper. Section 3 describes simulation of speech using the two-mass model, which is used to find the relationship between the stress level measurement and the stiffness parameters; experimental results are shown and analyzed to demonstrate that such a relationship exists. Finally, Section 4 draws our conclusions.
2 Measuring stress using glottal source

2.1 Spectral flatness of the glottal flow

The glottal flow can be estimated from the speech pressure waveform by inverse filtering, using the iterative adaptive inverse filtering (IAIF) method. To evaluate stress from the glottal flow, the spectral flatness measure (SFM) can be applied. Spectral flatness is a measure that characterizes an audio spectrum, defined as the ratio of the geometric mean to the arithmetic mean of the power spectrum:

SFM = \frac{\sqrt[N]{\prod_{n=0}^{N-1} S(n)}}{\frac{1}{N}\sum_{n=0}^{N-1} S(n)},   (1)

in which S(n) is the magnitude of the power spectrum in the nth bin. A larger SFM indicates that the spectrum has a similar amount of power in all frequency bands, so the speech spectrum envelope is relatively flat, as for white noise. A smaller SFM indicates that the spectral power is concentrated in relatively narrow bands, so the envelope appears spiky, as is typical for tonal sounds.
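As a concrete illustration (not the authors' code), eq. (1) can be computed from a glottal-flow analysis frame roughly as in the sketch below; the windowing and FFT size are illustrative assumptions.

```python
import numpy as np

def spectral_flatness(frame, n_fft=1024, eps=1e-12):
    """Spectral flatness measure (eq. 1): geometric mean divided by the
    arithmetic mean of the power spectrum of one analysis frame."""
    windowed = frame * np.hanning(len(frame))
    power = np.abs(np.fft.rfft(windowed, n_fft)) ** 2 + eps
    geometric_mean = np.exp(np.mean(np.log(power)))
    arithmetic_mean = np.mean(power)
    return geometric_mean / arithmetic_mean
```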
2.2 Evaluation of the Spectral Flatness Measure

In the experiment, we used a database collected by the Fujitsu Corporation [13]. This database contains speech samples from 11 people: 4 male and 7 female subjects. In order to simulate mental pressure resulting in psychological stress, three tasks corresponding to different situations were introduced. These tasks were performed by a speaker having a telephone conversation with an operator, to simulate a situation involving pressure during telephone communication. The three tasks are (A) Concentration, (B) Time pressure, and (C) Risk taking. For each speaker, there are four dialogues with different tasks.

The speech data from the database is inverse filtered with 12th-order LPC. The frame size is 64 ms, with a 16 ms frame shift. The distributions of SFM are shown in Figure 1. From the results of our experiment, stressed speech can be separated from neutral speech by analysing the distribution of the stress level measurement (SFM): compared to the values for neutral speech, stressed speech results in smaller SFM values.
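As a hedged sketch of this front end (not the authors' implementation, which uses the full IAIF procedure), frame-wise LPC inverse filtering could look roughly as follows; the 12th-order LPC, 64 ms frames and 16 ms shift come from the text, while the 16 kHz sampling rate and windowing choices are assumptions.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc_coefficients(frame, order=12):
    """Autocorrelation-method LPC: solve the normal equations R a = r."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    return solve_toeplitz(r[:order], r[1:order + 1])  # a_1 .. a_p

def inverse_filter(speech, fs=16000, order=12, frame_ms=64, shift_ms=16):
    """Frame-wise LPC inverse filtering as a crude glottal-flow estimate."""
    flen, shift = int(fs * frame_ms / 1000), int(fs * shift_ms / 1000)
    residuals = []
    for start in range(0, len(speech) - flen, shift):
        frame = speech[start:start + flen] * np.hamming(flen)
        a = lpc_coefficients(frame, order)
        # A(z) = 1 - sum_k a_k z^{-k}; filtering by A(z) removes the vocal tract
        residuals.append(lfilter(np.r_[1.0, -a], [1.0], frame))
    return residuals
```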
3 Simulation using two-mass model

The two-mass vocal fold model was proposed by Ishizaka and Flanagan to simulate the process of speech production [6]. The relationship between the physical parameters
Fig. 1 Distribution of SFM for neutral and stressed speech
and the stress level measurement is essential for analysing stressed speech with the model. Therefore, the physical parameters of the two-mass model are examined and analysed in comparison with the stress measurement. This provides the basis for assumptions from which we can estimate the shape and movement of the vocal folds when speech is produced under stressful conditions.

The control parameters [7] are defined as Ps, k1, k2 and kc (sub-glottal pressure and stiffnesses), which influence the fundamental frequency. The initial areas Ag0i were set to 0.05 cm², and the masses and the viscous resistances were set to the typical values proposed by Ishizaka and Flanagan. To obtain a meaningful relationship between the control parameters and the proposed stress level measurement, different values of stiffness are analysed to study the variation in the SFM. Three stiffness parameters k1, k2 and kc are taken into account. The range for k1 is from 8,000 to 480,000, for k2 from 800 to 48,000, and for kc from 2,500 to 150,000. In this experiment, only the glottal flow, which represents the source of speech, is considered. For each fixed set of parameters, the glottal flow is obtained and the stress measurement is calculated from it.

First, kc is fixed and the variation in SFM is observed as a function of k1 and k2. The results are shown in Figure 2, where kc is fixed to different values: (a) 2,500, (b) 7,500, (c) 25,000. When kc is fixed to a small value (a), SFM depends on both k1 and k2; a clear decrease in SFM is observed as k1 and k2 become larger, which indicates that larger values of both k1 and k2 result in smaller values of SFM, i.e. stress. When kc increases to 7,500 (b), SFM is more strongly influenced by k1 than by k2. When a larger value of kc is fixed in (c), SFM depends only on k1. If kc continues to increase, a similar trend is observed.

Figure 3 shows the relation between SFM and each stiffness parameter. The variation range of SFM with each stiffness is k1: 0.3689, k2: 0.1680, kc: 0.1376. Therefore, SFM is more strongly influenced by k1 than by k2 and kc. The values of k1 and k2 are inversely related to SFM. The difference is that k1 is stable at first but declines rapidly afterwards, while
Fig. 2 The relationship between the estimated stress level measurement (SFM) and stiffness parameters: (a) kc = 2,500, (b) kc = 7,500, (c) kc = 25,000
Fig. 3 The relationship between the mean of the estimated stress level (SFM) and stiffness parameters
k2 always declines steadily. The value of SFM increases with the coupling stiffness kc at the beginning and decreases in the end, which means that both a smaller and a larger kc are better indicators for stress detection. Therefore, when stress occurs, kc would normally be either smaller or larger, while larger values of k1 and k2 would be obtained.
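To connect the stiffness sweep described above with the SFM of section 2, the experiment could be organized roughly as in the sketch below; `simulate_two_mass` is a hypothetical stand-in for an Ishizaka–Flanagan two-mass simulation (not provided here), and `spectral_flatness` refers to the earlier SFM sketch.

```python
import numpy as np

def simulate_two_mass(ps, k1, k2, kc):
    """Hypothetical placeholder for an Ishizaka-Flanagan two-mass vocal fold
    simulation that would return a glottal-flow waveform."""
    raise NotImplementedError("stand-in for a two-mass vocal fold simulator")

def sfm_over_stiffness_grid(ps, kc, k1_values, k2_values):
    """Sweep in the style of Fig. 2: fix kc, vary k1 and k2, and record the
    SFM of the simulated glottal flow at each grid point."""
    grid = np.zeros((len(k1_values), len(k2_values)))
    for i, k1 in enumerate(k1_values):
        for j, k2 in enumerate(k2_values):
            flow = simulate_two_mass(ps, k1, k2, kc)
            grid[i, j] = spectral_flatness(flow)  # SFM sketch defined earlier
    return grid

# e.g. kc fixed at 2,500 as in panel (a), k1 in [8e3, 4.8e5], k2 in [8e2, 4.8e4]
```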
4 Conclusion

We developed a method of analyzing stress in speech by using a speech production model to explore the underlying mechanisms of vocal fold behavior. In this work, a stress level measurement was applied, the relationship between the physical parameters and the SFM value was established, and the physical parameters of the two-mass model were examined and analyzed in comparison with the stress measurement. Further improvement requires a more detailed investigation from the viewpoint of vocal fold dynamics in order to build a theoretical basis for stressed speech production. Future work will focus on extracting the physical parameters of the vocal folds and vocal tract from real speech, and on building an explanatory relationship between the physical parameters and stress.
5 Acknowledgements

This work has been partially supported by the "Core Research for Evolutional Science and Technology" (CREST) project of the Japan Science and Technology Agency (JST). We are very grateful to Mr. Matsuo of the Fujitsu Corporation for the use of the database and for his valuable suggestions.
References

[1] Steeneken, H.J.M., Hansen, J.H.L.: Speech Under Stress Conditions: Overview of the Effect on Speech Production and on System Performance. In: Proc. ICASSP, 4, 2079-2082 (1999).
[2] Cairns, D., Hansen, J.H.L.: Nonlinear Analysis and Detection of Speech Under Stressed Conditions. The Journal of the Acoustical Society of America, 96, 6, 3392-3400 (1994).
[3] Junqua, J.C.: The Lombard reflex and its role on human listeners and automatic speech recognizers. J. Acoust. Soc. Amer., 1, 510-524 (1993).
[4] Bard, E.G., Sotillo, C., Anderson, A.H., Thompson, H.S., Taylor, M.M.: The DCIEM map task corpus: Spontaneous dialogue under sleep deprivation and drug treatment. Speech Commun., 20, 71-84 (1996).
[5] Hansen, J.H.L.: Analysis and compensation of speech under stress and noise for environmental robustness in speech recognition. Speech Commun., 20, 151-173 (1996).
[6] Ishizaka, K., Flanagan, J.L.: Synthesis of voiced sounds from a two-mass model of the vocal cords. Bell Syst. Tech. Journal, 51, 1233-1268 (1972).
[7] Teffahi, H.: A two-mass model of the vocal folds: determination of control parameters. Multimedia Computing and Systems (2009).
Domain-Adapted Word Segmentation for an Out-of-Domain Language Modeling Euisok Chung, Hyung-Bae Jeon, Jeon-Gue Park and Yun-Keun Lee
Abstract This paper introduces a domain-adapted word segmentation approach for text in which a word delimiter is not used regularly. It depends on an unknown word extraction technique. This approach is essential for language modeling when adapting to new domains, since the vocabulary set is activated in the word segmentation step. We achieved an ERR of 21.22% in Korean word segmentation. In addition, we show that incremental domain adaptation of the word segmentation gradually decreases the perplexity of the input text. This means that our approach supports out-of-domain language modeling.
1 Introduction

The concept of domain adaptation is closely associated with the language modeling approach. In general, a language model expands its vocabulary set with new-domain text when its application moves to that domain. In this setting, languages such as Korean, Chinese and Japanese have a word segmentation problem: these languages have no space between words, although Korean partially supports a word delimiter. Word segmentation determines the vocabulary set, which means that the word segmentation process should recognize all of the words in the new-domain text,
Euisok Chung, Speech Processing Team, ETRI, Daejeon, Korea, e-mail: [email protected]
Hyung-Bae Jeon, Speech Processing Team, ETRI, Daejeon, Korea, e-mail: [email protected]
Jeon-Gue Park, Speech Processing Team, ETRI, Daejeon, Korea, e-mail: [email protected]
Yun-Keun Lee, Speech Processing Team, ETRI, Daejeon, Korea, e-mail: [email protected]
which also includes unknown words. Therefore, we focus on domain adaptation through the word segmentation process. In the next section, we begin with a description of word segmentation and then raise the domain adaptation issue. We then turn to the main issue, unknown word extraction, in which mono-syllable detection must come first. In addition, an incremental domain adaptation approach is described. In section 3, we evaluate word segmentation error reduction and the incremental domain adaptation of word segmentation. Finally, in the discussion, we remark on a weakness of our approach and on future work.
2 Domain-Adapted Word Segmentation

Language modeling for automatic speech recognition needs a training text corpus tokenized by word. The word segmentation process can provide the standard for a vocabulary list. However, the process depends on a known-word list, although it can detect unknown words in an ad hoc way. When segmented text contains unknown words, those words tend to be split into short known words. Out-of-domain text is likely to raise this unknown word problem. Therefore, we need an unknown word extraction process for the domain adaptation of the word segmentation process.
2.1 Word Segmentation

The first step of word segmentation is to divide an input sentence s into a sequence of mono-syllables m_1 ... m_n (n > 0). Then, word splitting candidates w_1 ... w_k (1 ≤ k ≤ n) are generated by using a word dictionary d and a simple heuristic for unknown word prediction. The candidates are simply all possible segmentations of s. The heuristic is to pass the current m_i as a word w when we cannot find a word w_{m_i...m_j} (i ≤ j ≤ n) in the dictionary (w_{m_i...m_j} ∉ d). After passing m_i, we look for the next word w_{m_{i+1}...m_j} in the dictionary d.

The ranking process scores the segmentation candidates with a language model L; thus, P_L(w_1 ... w_k) scores the candidates into a word segmentation n-best list. We build the language model with a pseudo-morphologically analyzed corpus annotated with part-of-speech (POS) tags. In addition, we build a POS language model T and a word dictionary that includes POS tags. The POS tagging process generates tagging candidates and searches for the best POS sequence with the Viterbi algorithm. The tagging result t_1 ... t_k of w_1 ... w_k has a score P_T(t_1 ... t_k). Finally, we use the POS tagging score and the word segmentation score as in eq. (1) to rescore the n-best word segmentation list:

score(w_1 \dots w_k) = \alpha \cdot P_L(w_1 \dots w_k) + (1 - \alpha) \cdot P_T(t_1 \dots t_k)   (1)
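As a hedged illustration of eq. (1) (not the authors' code), an n-best rescoring step could be sketched as below; the interpolation weight of 0.7 is an arbitrary assumption, and the two scoring callbacks are placeholders for the word LM and the POS LM (in practice these typically return log-domain scores).

```python
def rescore_nbest(candidates, lm_score, pos_score, alpha=0.7):
    """Rescore an n-best word segmentation list following eq. (1):
    score = alpha * P_L(words) + (1 - alpha) * P_T(tags).
    `candidates` is a list of (words, tags) pairs; `lm_score` and
    `pos_score` are placeholder scoring functions for the two LMs."""
    rescored = []
    for words, tags in candidates:
        score = alpha * lm_score(words) + (1.0 - alpha) * pos_score(tags)
        rescored.append((score, words, tags))
    rescored.sort(reverse=True)
    return rescored
```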
2.2 Domain Adaptation

The word-segmented text corpus can be used as a training corpus for the language model of automatic speech recognition (ASR) or machine translation (MT). It is clear that an additional text corpus is required when we expand the domain of ASR or MT to new domains. In that case, a general approach is to build a text corpus for the new domain and interpolate the language model built from it with the original language model. However, the text corpus for a new domain may cause the unknown word problem for word segmentation. The unknown word problem obviously produces over-segmented text, in which the word splitting process segments a long unknown word into partially matched short known words. If the word segmentation recognizes the unknown words in the text corpus of the new domain, it can adapt to that domain, which also means that out-of-domain language modeling becomes possible. In the next section, we show the procedure of unknown word extraction.
2.3 Unknown Word Extraction

As mentioned before, word segmentation tends to over-segment text that includes unknown words. The over-segmented text may increase the perplexity of the language model, since the sequence of partially matched short words is sparse. In general, a partially matched short word is likely to be a mono-syllable because of the sparseness of syllable patterns. Therefore, the detection of unknown word mono-syllables should be the first step of unknown word extraction. Approaches of this kind have been reported in [4] and [2].

Ma and Chen proposed an unknown word extraction system that extracts all unknown words (UWs) from Chinese text [4]. The first step of the system is UW detection with detection rules trained from a corpus. From this, a string of tokens with UW mono-syllable marks is generated; UW extraction is then performed with a VCFG for UWs. Finally, UW candidates are added to the augmented dictionary, with which re-segmentation is performed.

In this paper, we adopt the UW extraction approach of [4], since we observed the phenomenon of over-segmented text containing UWs when we built the word segmentation process for Korean. Furthermore, the segmented result of a UW usually consists of a combination of mono-syllables and short words (multi-syllables) existing in the word dictionary. However, we use a different approach in each step. The first is UW mono-syllable detection, in which we use a CRF model to detect mono-syllabic morphemes. The second is the generation of UW candidates, where we use a simple heuristic to generate all possible UWs including the mono-syllable and its context. The third is the UW selection approach, where we extract only UWs that decrease the perplexity of the corpus. We describe each step in detail below.
2.3.1 Unknown word mono-syllable detection

Unknown word mono-syllable detection is essentially known-word mono-syllable detection: if a mono-syllable is not a known-word mono-syllable (KMS), then it is an unknown-word mono-syllable (UMS). The KMS can be learned from a part-of-speech (POS) tagged corpus. We build the UMS detector with a 300,000-word POS-tagged corpus.

Table 1 Syllable types for detection
UMS  unknown word monosyllable
KMS  known word monosyllable
KPS  known word multisyllable
We classify every word in the training corpus as KMS or KPS according to Table 1; there is no UMS, since every word is known. Then, we build the mono-syllable context patterns of Table 2, which are similar to the unknown word detection rule types in [4].

Table 2 Mono-syllable context patterns
mi;  mi mi+1;  mi mi+1 mi+2;  mi mi+1 ti+2;  ti ti+1;  ti−1 mi;  ti−2 ti−1 mi;
mi−1 mi;  mi−2 mi−1 mi;  ti−2 mi−1 mi;  ti ti−1;  ti mi;  ti mi ti+1;  ti+2
In the context patterns, mi is the ith morpheme and ti is the ith POS tag; the target morpheme is a mono-syllabic morpheme. As mentioned before, we use a CRF model, since it can build a probabilistic model integrating various feature types and shows good performance in sequential labelling tasks [3]. In eq. (2), x is the input string sequence and y is the sequence of syllable types. f_k is a feature function, which is 1 when the condition on x_t, y_{t−1} and y_t is satisfied. λ_k is the parameter value for that feature instance, computed by a parameter estimation procedure such as L-BFGS.

y^* = \arg\max_y P_\Lambda(y \mid x) = \arg\max_y \exp\left( \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k f_k(y_{t-1}, y_t, x_t) \right)   (2)
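For concreteness, a rough sketch of how context-pattern features in the spirit of Table 2 might be produced for such a sequence labeller is given below; the exact templates, feature names and padding symbol are illustrative assumptions, not the authors' feature set.

```python
def context_features(morphs, tags, i):
    """Build context-pattern features around position i from morphemes and
    POS tags, loosely following the templates of Table 2."""
    def m(j):
        return morphs[j] if 0 <= j < len(morphs) else "<pad>"
    def t(j):
        return tags[j] if 0 <= j < len(tags) else "<pad>"
    return {
        "m0": m(i),
        "m0_m+1": m(i) + "|" + m(i + 1),
        "m-1_m0": m(i - 1) + "|" + m(i),
        "m-2_m-1_m0": m(i - 2) + "|" + m(i - 1) + "|" + m(i),
        "t-1_m0": t(i - 1) + "|" + m(i),
        "t0_t+1": t(i) + "|" + t(i + 1),
        "t0_m0_t+1": t(i) + "|" + m(i) + "|" + t(i + 1),
    }
```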
Afterwards, when we apply the UMS detection process, POS tagging is required first. Then, we compute the probability of the target mono-syllabic morpheme according to eq. (2).
for i = 3 to n−2 do
    if ti−1 is s or ti is s then
        do segmentation
    else if mi−3 is not UMS and mi−2 is not UMS then
        do segmentation
    else if ti is in {etm, jx, jc, jm, co, xsm, xsn, xsv} then
        do segmentation
    end if
end for

Fig. 1 UW generation algorithm.
If the tagging result of the syllable type is not KMS, or its probability is below a threshold, the syllable is regarded as UMS.
2.3.2 Generation of unknown word candidates

After detecting UMS, we can generate UW candidates with a heuristic. The heuristic is composed of the POS tags in Table 3 and the UMS history, and it is described in Figure 1 (a rough code interpretation follows Table 3). The heuristic is not a general approach, since we built it empirically; our focus is on generating all possible UW candidates. The input of the algorithm is a word-segmented string w1 ... wn, where each word wi is marked as UMS or not. The output of the algorithm is a re-segmented string. From the result, we extract as unknown word candidates those words that are not in the word dictionary d.

Table 3 Part-of-speech tags used in UW candidate generation
tag   part of speech
s     symbol
etm   adnominalizing ending
jx    auxiliary particle
jc    case particle
jm    adnominal case particle
co    copula
xsm   adjective-derivational suffix
xsn   noun-derivational suffix
xsv   verb-derivational suffix
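The following is a hedged Python reading of the Figure 1 heuristic; interpreting "do segmentation" as placing a word boundary before position i, and the exact index handling of "for i = 3 to n−2", are assumptions rather than details given in the paper.

```python
SEG_TAGS = {"etm", "jx", "jc", "jm", "co", "xsm", "xsn", "xsv"}

def generate_uw_candidates(words, tags, is_ums, dictionary):
    """Hedged reading of the Fig. 1 heuristic: insert a segmentation boundary
    before position i when one of the three conditions fires, then keep the
    re-joined chunks that are not in the dictionary as UW candidates."""
    n = len(words)
    boundaries = {0, n}
    for i in range(3, n - 1):                  # approximates "for i = 3 to n-2"
        if tags[i - 1] == "s" or tags[i] == "s":
            boundaries.add(i)
        elif not is_ums[i - 3] and not is_ums[i - 2]:
            boundaries.add(i)
        elif tags[i] in SEG_TAGS:
            boundaries.add(i)
    cuts = sorted(boundaries)
    chunks = ["".join(words[a:b]) for a, b in zip(cuts, cuts[1:])]
    return [c for c in chunks if c not in dictionary]
```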
tws ← ws(t, d)
ppl0 ← ppl(tws)
for i = 1 to N do
    add uwi to d
    tws ← ws(t, d)
    ppluw ← ppl(tws)
    remove uwi from d
    if ppluw + δ < ppl0 then
        select uwi as an unknown word
    end if
end for

Fig. 2 UW selection procedure.
2.3.3 Unknown word selection

We select a UW when it improves the performance of word segmentation. The decision depends on the word segmentation results obtained with and without adding the UW to dictionary d. The measure for the comparison is perplexity, which is used to evaluate an LM and is defined as 2^{lp} [7]. The logprob lp_k is given in eq. (3). In this step, we use the LM of the word segmentation for the perplexity: the LM is fixed, and the input text varies depending on the UWs. When this variation decreases the perplexity by more than a threshold δ, we extract the UW.

lp_k = -\frac{1}{k} \sum_{i=1}^{k} \log_2 \tilde{p}(w_i \mid h_i)   (3)
The procedure for unknown word selection is described in Figure 2. First, the word segmentation process ws(t, d) transforms the input text t into the segmented text tws using dictionary d and the language model. Then, the perplexity of tws is computed by the function ppl(tws) as ppl0. Afterwards, for each uwi among the N unknown word candidates, the perplexity ppl(tws) is computed with uwi temporarily added to d. Finally, we select uwi as an unknown word when the resulting perplexity is smaller than ppl0 by more than δ.
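A hedged Python sketch of this selection loop is given below; `ws` and `ppl` are placeholders for the word segmentation and perplexity functions, and the default threshold value is an assumption.

```python
def select_unknown_words(text, dictionary, candidates, ws, ppl, delta=1.0):
    """Perplexity-based UW selection in the spirit of Fig. 2.
    `ws(text, dictionary)` segments text; `ppl(segmented_text)` returns the
    perplexity under the (fixed) segmentation LM; both are placeholders."""
    baseline = ppl(ws(text, dictionary))
    selected = []
    for uw in candidates:
        dictionary.add(uw)
        ppl_uw = ppl(ws(text, dictionary))
        dictionary.remove(uw)
        if ppl_uw + delta < baseline:      # perplexity drops by more than delta
            selected.append(uw)
    return selected
```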
2.4 Incremental Domain Adaptation

The domain adaptation of word segmentation may require a large corpus. The size of the input text t in the UW selection procedure can become a bottleneck, since every UW candidate requires word segmentation of the whole input text. Therefore, we divide the input text into subtexts. The sequence of subtexts accumulates the UW dictionary step by step, which means that each intermediate result of UW extraction is applied to the word segmentation of the next subtext. The incremental accumulation of unknown words is described in Figure 3. Finally, WSn contains all unknown words accumulated from all subtexts.
Fig. 3 Incremental accumulation of unknown words.
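Building on the selection sketch above, the incremental loop of Figure 3 might look roughly as follows; `candidate_fn` is a placeholder for UW candidate generation, `select_unknown_words` is the earlier sketch, and the `accumulate` flag distinguishes the 'UW accumulation' and 'no UW accumulation' conditions examined in section 3.2.

```python
def incremental_adaptation(subtexts, dictionary, candidate_fn, ws, ppl,
                           accumulate=True):
    """Sketch of the Fig. 3 loop: process subtexts in order and (optionally)
    carry extracted UWs forward into the dictionary used for the next subtext."""
    all_uws = []
    for subtext in subtexts:
        candidates = candidate_fn(subtext, dictionary)
        uws = select_unknown_words(subtext, dictionary, candidates, ws, ppl)
        all_uws.extend(uws)
        if accumulate:                      # the "UW accumulation" condition
            dictionary.update(uws)
    return all_uws
```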
3 Experiments

3.1 Word Segmentation Error Reduction

For the evaluations, we use a pseudo-morphologically analyzed corpus that has 300,000 eojeols with a part-of-speech mark on each morpheme. An eojeol is a sequence of morphemes written without spaces in Korean. We use 269,992 eojeols as training data and 24,380 eojeols as test data. There are 26,698 unique morphemes in the training data; the test data has 6,287 unique morphemes and 1,346 unknown morphemes that are not included in the training data. We build the LMs for word segmentation and the POS tagger with the SRILM toolkit, described in [6]. The LM for word segmentation has 26,700 unigrams, 104,917 bigrams and 32,417 trigrams. The LM for the POS tagger has 29 unigrams, 309 bigrams and 1,140 trigrams.

We evaluate whether the word segmentation segments an eojeol into the correct sequence of morphemes. The results are given in Table 4. The 1-best accuracy of word segmentation (wst) is 93.67%, and the 5-best result is 96.27%. Because of this gap, we try to reorder the 5-best word segmentation results with eq. (1). We built the POS tagger required by eq. (1); its accuracy is 91.5%. However, the gain is small: the rescoring result (wst + res) is 93.75%, which corresponds to an ERR of 1.29%.

Before the unknown word extraction experiment, we evaluate the unknown word mono-syllable detection. The number of unknown words in the test data is 2,077, counting duplicates. The number of unknown word mono-syllables produced by the word segmentation process is 1,858. The mono-syllable detection shows a recall of 92.08% and a precision of 25.64%.

After that, we evaluate the recall and precision of unknown word extraction. The recall is 20.65% and the precision is 68.39%. The extracted unknown words are added to the word dictionary of the word segmentation process as an adaptation. Thus, it is a
two-pass procedure for the same input text: the first pass is the unknown word extraction, and the second pass is word segmentation of the same input text with the unknown words added. The experiment (wst + uw) shows a 1-best accuracy of 94.98% and a 5-best accuracy of 97.78%. In the experiment (wst + uw + res), the 1-best accuracy is 95.08% and the 5-best accuracy is 97.78%. As a result, we achieve an ERR of 21.22% with unknown word extraction and POS rescoring.

Table 4 Result of word segmentation test
type             1best    5best    ERR(1best)
wst              93.67%   96.27%   -
wst + uw         94.98%   97.78%   20.68%
wst + res        93.75%   96.27%   -
wst + uw + res   95.08%   97.78%   21.22%
3.2 Incremental Domain Adaptation Experiment

The POS-tagged corpus for the language model of the word segmentation process is constructed from various domains. In this experiment, we try to adapt the word segmentation process to the Twitter domain. We crawled 100,000 tweets containing 300,000 unique words as a training corpus; it is divided into 100 subtexts. As test data, we collected 319 tweets containing 3,351 unique words. In each step, we count the number of unknown words and compute the perplexity for subtexti. Considering the effect of 'UW accumulation', we test two cases: one is 'UW accumulation', the other is 'no UW accumulation'. The latter corresponds to removing the path from UWi to WSi+1 in Figure 3.

The increase in unknown words is shown in Figure 4 and the decrease in perplexity in Figure 5. We found that the number of unknown words increases linearly. The unknown words allow the word segmentation process to decrease the perplexity of the test data, which means that unknown word extraction lets the word segmentation process adapt to the Twitter domain. In general, the performance of ASR is inversely proportional to the perplexity of the language model; we therefore suggest that our approach can support out-of-domain language modeling. In addition, 'UW accumulation' generates fewer unknown words than 'no UW accumulation'. The performance gain from the smaller number of UWs is likely to be preferable, although 'no UW accumulation' shows a better perplexity result.
Fig. 4 Increase of Unknown Words.
Fig. 5 Decrease of Perplexity.
4 Discussion

In unknown word extraction, we could not reach the performance of [4], where recall is 57% and precision is 76%. The reason lies in the difference in target language: we tested on Korean, whereas [4] addressed Chinese unknown word extraction. As reported in [4], recall is 99% when all mono-syllables are regarded as unknown word mono-syllables; however, the unknown word mono-syllable ratio in Korean is 89% in our test set. We will try to find other clues, in addition to the mono-syllable, for extracting unknown words in Korean. In the experiment on the increase of unknown words, it would be better if we could show convergence, but this requires
very large test data. We will investigate this after the performance of the word segmentation process has been saturated.

So far, we have described an approach to word segmentation for out-of-domain language modeling. The approach consists of mono-syllable detection, generation of unknown word candidates, and selection of unknown words. With this unknown word extraction, we achieved an ERR of 21.22% in Korean word segmentation. In addition, we have shown a positive result for incremental domain adaptation using only the in-domain text itself: a 15.09% reduction in perplexity, which is similar to the result of Seymore's approach [5], where external text supports language model adaptation. In the future, we will test ASR performance with the adapted language models. We will also try to adopt higher-level knowledge for word segmentation, such as the noun sense identification approach of [8].
5 Acknowledgements

This work was supported by the Industrial Strategic Technology Development Program, 10035252, "Development of dialog-based spontaneous speech interface technology on mobile platform", funded by the Ministry of Knowledge Economy (MKE, Korea).
References

[1] Chen, K.J., Bai, M.H.: "Unknown Word Detection for Chinese by a Corpus-based Learning Method", International Journal of Computational Linguistics and Chinese Language Processing, Vol. 3, pp. 27-44, 1998.
[2] Chen, K.J., Ma, W.Y.: "Unknown word extraction for Chinese documents", in Proceedings of COLING '02, the 19th International Conference on Computational Linguistics, Volume 1, 2002.
[3] Lafferty, J., McCallum, A., Pereira, F.: "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data", in Proceedings of the 18th International Conference on Machine Learning, pp. 282-289, 2001.
[4] Ma, W.Y., Chen, K.J.: "A bottom-up merging algorithm for Chinese unknown word extraction", in Proceedings of SIGHAN '03, the Second SIGHAN Workshop on Chinese Language Processing, Volume 17, 2003.
[5] Seymore, K., Rosenfeld, R.: "Using Story Topics for Language Model Adaptation", in Proceedings of Eurospeech, 1997.
[6] Stolcke, A.: "SRILM - An Extensible Language Modeling Toolkit", in Proceedings of the International Conference on Spoken Language Processing, Denver, Colorado, September 2002.
[7] Varile, G.B., Zampolli, A.: "Survey of the State of the Art in Human Language Technology", Cambridge University Press, pp. 32-33, 1997.
[8] Yang, S.I., Seo, Y.A., Kim, Y.K., Ra, D.: "Noun Sense Identification of Korean Nominal Compounds Based on Sentential Form Recovery", ETRI Journal, vol. 32, no. 5, Oct. 2010, pp. 740-749.
Part III
Multi-Modality for Input and Output
Analysis on Effects of Text-to-Speech and Avatar Agent in Evoking Users’ Spontaneous Listener’s Reactions Teruhisa Misu, Etsuo Mizukami, Yoshinori Shiga, Shinichi Kawamoto, Hisashi Kawai and Satoshi Nakamura
Abstract This paper reports an analysis of the effects of text-to-speech (TTS) and an avatar agent in evoking users' spontaneous backchannels. We construct an HMM-based dialogue-style TTS system that generates human-like cues that evoke users' backchannels. We also constructed an avatar agent that can produce several listener reactions. A spoken dialogue system for information navigation was implemented and evaluated in terms of evoked user backchannels. We conducted user experiments, and the results indicated that (1) the user backchannels evoked by our TTS are more informative for the system in detecting users' feelings than those evoked by a conventional reading-style TTS, and (2) the use of an avatar agent can invite more user backchannels.
1 Introduction

One of the most enduring problems in spoken dialogue systems research is realizing a natural dialogue in a human-human form. One direction researchers have taken is utilizing spontaneous nonverbal and paralinguistic information. For example, Bohus [4] applied users' spatiotemporal trajectories to measure their engagement in dialogues. We have developed a spoken dialogue system that senses users' interest based on their gaze information [10].

This paper focuses on backchannels, one of the most common forms of paralinguistic information in human-human dialogue. In particular, we focus on users' verbal feedback, such as "uh-huh" (called Aizuchi in Japanese), and non-verbal feedback in the form of nods. Such backchannels are very common phenomena and are considered to facilitate smooth human-human communication. In this regard, Maynard [13] indicated that such backchannels are listener's signals to
National Institute of Information and Communications Technology (NICT), e-mail: [email protected]
let the speaker continue speaking (continuer) and to indicate that the listener understands and consents. It has also been hypothesized that humans detect feelings expressed via backchannels, and the correlation between backchannel patterns and user interest has been examined [9]. These studies indicate that detection of spontaneous user backchannels can benefit spoken dialogue systems by providing informative cues that reflect the user's situation. For instance, if a spoken dialogue system can detect the user's backchannels, it can facilitate smooth turn-taking. The system can also detect the user's feelings and judge whether it should continue the current topic or change it.

Despite these previous studies and decades of analysis on backchannels, few practical dialogue systems have made use of them. This is probably due to the fact that users do not react as spontaneously to dialogue systems as they do to other humans. We presume one of the reasons for this is the unnatural intonation of synthesized speech: conventional speech synthesizers do not provide users with signs that elicit backchannels, i.e. an appropriate set of lexical, acoustic and prosodic cues (or backchannel-inviting cues [1]) that tends to precede the listener's backchannels in human-human communication. Though recorded human speech can provide such cues, it is costly to re-record the system's speech every time the system scripts are updated. Another reason is the lack of a metaphor of human-computer interaction as a conversation: without being aware of a listener, users will not feel the need for backchannels.

In this work, we therefore tackle the challenge of constructing a dialogue system with a dialogue-style text-to-speech (TTS) system and an avatar agent that inspire users to make spontaneous backchannels, under the following hypothesis, which is derived from the Media Equation [17]:

People will give more spontaneous backchannels to a spoken dialogue system that makes more spontaneous backchannel-inviting cues than to a spoken dialogue system that makes less spontaneous ones.

The main points of this paper are as follows.
1. We construct a spoken dialogue-style HMM-based TTS system and then analyze whether the synthesized speech has backchannel-inviting prosodic cues. The TTS is evaluated in terms of the listener reactions it evokes in users. We show that the system can estimate the user's degree of interest by using the evoked feedback.
2. We examine the effect of the presence of an avatar agent and show that the use of an avatar agent leads to more user backchannels.
2 Related Works

A number of studies have aimed at improving the naturalness of TTS. Though most of these have focused on means of realizing clear and easy-to-listen-to reading-style speech, some attempts have been made at spontaneous conversational speech. Andersson [3] and Marge [12] focused on lexical phenomena such as lexical fillers and acknowledgments in spontaneous speech, and showed that inserting them improves the naturalness of human-computer dialogues. In this work, we tackle constructing a natural dialogue-style TTS system focusing on prosodic phenomena such
as intonation and phoneme duration. Campbell constructed a conversational speech synthesizer using concatenative synthesis techniques [5], but only phrase-based concatenation was considered, and no evaluation was given in terms of listener reactions.

In the field of conversation analysis, many studies have analyzed backchannels in human-human dialogue focusing on lexical and non-verbal cues [1, 11, 19]. For instance, these cues were examined in preceding utterances, such as part-of-speech tags, length of pause, power contour pattern, and F0 contour pattern around the end of Inter-Pausal Units (IPUs). [1] showed that when several of the above cues occur simultaneously, the likelihood of occurrence of a backchannel increases. Our work aims to construct a TTS system that can imitate these features based on the findings of these studies. Several studies have also utilized the above findings for spoken dialogue systems: Okato [16] and Fujie [6] trained models to predict backchannels and implemented spoken dialogue systems that make backchannels. Our goal differs in that it is to inspire users to give backchannels.

As for the presence of an avatar agent (or a humanoid robot), many studies have examined its effects on human-computer interaction, focusing on aspects such as turn-taking [8] and communication activation [20]. However, few studies have examined its effect on evoking backchannels.
3 Construction of Spoken Dialogue TTS

3.1 Spoken Dialogue Data Collection and Model Training

In order to make a spontaneous dialogue-style TTS that can evoke backchannels, we construct a spontaneous dialogue-style speech corpus that contains backchannel-inviting cues, and then train an HMM acoustic model for synthesis. We collected our training data by dubbing a script of our Kyoto Sightseeing Guidance Spoken Dialogue Corpus [14], a set of itinerary-planning dialogues in Japanese. In the dialogue task, the expert guide makes recommendations on sightseeing spots and restaurants until the user has decided on a plan for the day. Given the guide's recommendations, many users give spontaneous backchannels.

We made a set of dialogue scripts from the corpus and asked voice actors to act them out. When preparing the dialogue script for dubbing, we first removed fillers and backchannels from the transcripts of the dialogue corpus. We then annotated with # those ends of the guide's IPUs where the user made backchannels. We asked two professional voice actresses to duplicate the spoken dialogue of the script, with one playing the role of the tour guide and the other the tourist, sitting face-to-face. During the recording, we asked the tour guide role to read the scenario with intonation such that the tourist role would spontaneously make backchannels at the points marked with #. The tourist was allowed to make backchannels at will at any pause
segments the guide made. We recorded 12 dialogue sessions in total. The speech data was manually labeled, and 239.3 minutes of tour guide utterances, which are used to train our HMM for the TTS system, were collected. The training data is complemented by the ATR 503 phonetically balanced sentence set [2], so as to cover deficiencies in the phoneme sequence. The sentence set is collected from news articles and consists of 43.1 minutes of reading-style speech. We trained the HMM for our TTS system, Ximera, using the HMM-based Speech Synthesis System (HTS) [21]. We adopted mel log spectrum approximation (MLSA) filter-based vocoding [18], a quint-phone-based phoneme set and five-state HMM-based acoustic modeling. All training data, including the reading-style speech data, were used for model training. Context labels, such as part-of-speech and the location of the word containing the phoneme, were also used in training the HMM.
3.2 Comparison Target

To compare the effectiveness of our TTS in evoking users' spontaneous backchannels, we constructed a comparison system that adopts a conventional reading-style TTS. An HMM model was trained using 10 hours of reading-style speech by another professional female narrator. Though it is possible to train a model using reading-style speech from the voice actor of the dialogue-style TTS, the clarity and naturalness of the resulting synthesized speech were much worse than those of the dialogue-style TTS due to an insufficient amount of training data. Thus, we adopted a readily available model trained on sufficient data from another speaker. The speech rate was set to the same speed as that of the dialogue-style TTS. Other settings, such as the descriptive text and the avatar agent, were the same as those of the base system.
3.3 Comparison of Prosodic Features of the Synthesized Speech

We investigated the prosodic features of the final phoneme of the IPUs in the synthesized explanations used in the dialogue system for the following user experiment, to confirm whether they contain backchannel-inviting cues. Following the findings of a previous study [11], we investigated the duration, F0 contour pattern and power contour pattern of the final phoneme of the IPUs1. In conversation analysis of Japanese, the F0 contour pattern label of the final phoneme is often used in a way similar to the ToBI intonation label in English. While the contour pattern is usually manually labeled, we roughly determined the patterns with the following procedure. We first normalized the log F0 scale using all utterances so that it has zero mean and unit standard deviation (z-score: z = (x − µ)/σ). We then divided each final phoneme of the IPU into former and latter parts, and calculated the F0
1 For this study, we define an IPU as a maximal sequence of words surrounded by silence longer than 200 ms. This unit usually coincides with one Japanese phrasal unit.
Table 1 Prosodic analysis of final phonemes of IPUs (dialogue-style TTS vs. reading-style TTS)

Duration of final phoneme [msec], average (± standard deviation): dialogue synth. 172.9 (± 29.6); reading synth. 126.1 (± 19.1)

pattern     F0 dialogue   F0 reading   power dialogue   power reading
rise-rise   5.4 %         0.0 %        0.0 %            0.0 %
rise-flat   2.0 %         0.0 %        1.7 %            0.0 %
rise-fall   23.5 %        0.0 %        46.3 %           5.3 %
flat-rise   5.0 %         0.0 %        0.0 %            0.0 %
flat-flat   1.7 %         0.0 %        4.0 %            9.2 %
flat-fall   15.8 %        0.0 %        22.8 %           18.1 %
fall-rise   15.8 %        0.0 %        0.7 %            0.0 %
fall-flat   3.4 %         0.0 %        7.0 %            0.0 %
fall-fall   27.5 %        100.0 %      17.4 %           76.5 %
slope of each segment by linear regression. By combining the following three patterns, we defined nine F0 contour patterns for the final phonemes of the IPUs. The pattern of a segment was judged as rise if the slope was larger than a threshold θ; if the slope was less than −θ, the pattern was judged as fall; otherwise, it was judged as flat. Here, θ was empirically set to 5.0. The power contour patterns of the IPUs were estimated by a similar procedure. The results are given in Table 1.

According to a study [11], in which the prosodic features of IPUs followed by a turn-hold with backchannel, a turn-hold without backchannel, and a turn-switch were compared, a long duration of the final phoneme is a speaker's typical sign to keep the floor. The same study also reported that the flat-fall and rise-fall patterns of F0 and power are more likely to be followed by a backchannel than by a turn-hold without a backchannel or a turn-switch. The synthesized dialogue-style speech contained many more rise-fall and flat-fall patterns in F0 and power than the speech generated by the reading-style TTS system. The average duration of the final phoneme was also longer. Considering that the speech data was generated from the same script, this indicates that the synthesized speech of the dialogue-style TTS system contains more backchannel-inviting features than that of the reading-style TTS system.
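A minimal sketch of this slope-based labelling (not the authors' code) is given below; θ = 5.0 is taken from the text, while the per-frame regression and the handling of short segments are assumptions.

```python
import numpy as np

def contour_pattern(values, theta=5.0):
    """Label the F0 (or power) contour of a final phoneme as one of nine
    patterns: split the z-scored track into former/latter halves, fit a line
    to each, and map each slope to rise/flat/fall using threshold theta."""
    def label(segment):
        x = np.arange(len(segment))
        slope = np.polyfit(x, segment, 1)[0]   # linear-regression slope
        if slope > theta:
            return "rise"
        if slope < -theta:
            return "fall"
        return "flat"
    half = len(values) // 2
    return label(values[:half]) + "-" + label(values[half:])
```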
4 Construction of Avatar Agent

We constructed an animated 3D desktop avatar named Hanna. The avatar can express its status through several motions; for example, when the user begins speaking, it can express the state of listening using a listener's motion, as shown in the figure. Though lip synching is not conducted, the agent makes lip movements during the IPUs. We thus assume that users can detect the end of IPUs from this visual cue.
Fig. 1 System setting
5 User Experiment

5.1 Dialogue System Used for the Experiment

To evaluate our TTS system based on users' reactions, a sightseeing guidance spoken dialogue system that assists users in making a decision was implemented using our dialogue system framework [7]. The system can explain six sightseeing spots in Kyoto and responds to user requests for an explanation of a certain spot2. Each descriptive text on a sightseeing spot consists of 500 (±1%) characters in 30 phrases. The text is synthesized using the section 3 TTS3. We set the speech rate of our TTS to nine phonemes per second. The duration of the synthesized descriptive speech per spot is about three minutes. A sample dialogue with the system is shown in Table 2. A video (with English subtitles) of a sample dialogue with a user can be seen at http://mastarpj.nict.go.jp/~xtmisu/video/TTS.wmv. A display is also used to present photos of the target sightseeing spot. Figure 2 shows the GUI the user sees.
5.2 Evaluation of TTS

5.2.1 Experimental Setup

We evaluated the TTS systems using 30 subjects (15 male and 15 female; 15 in their 30s and 15 in their 40s) who had not previously used spoken dialogue systems. Subjects were asked to use the dialogue system in two settings: the dialogue-style TTS system and the reading-style TTS system. The experiment was conducted in a small
2 The explanation comprises an introduction followed by explanation on origin, nature, buildings, events, and good future.
3 The descriptive texts are not included in the training data.
Fig. 2 Screen shot of the dialogue system

Table 2 Example dialogue with the system (translation of Japanese)
System: Hello, my name is Hanna, a tour guide for sightseeing in Kyoto. Nice to meet you. Please say "Okay" when you have decided which spot you would like to know about first.
User: Okay.
System: Which spot would you like to know about?
User: Tell me about Shinnyo-do temple.
System: Sure, I'll explain Shinnyo-do temple.
System: Shinnyo-do is a Buddhist Tendai Temple in Sakyo-ku, Kyoto City and the formal name is Shinsho Gokuraku-ji. You may take the bus directly from Kyoto Station for about thirty minutes, and the fee is two hundred twenty yen. . . .
System: Shall I continue?
User: Sure.
System: The temple is famous, especially for its coloring of autumn leaves; Red Maple stand on both sides of the stone-paved approach, and it appears as though the temple is placed inside a maple forest. . . .
...
System: That's all about this spot.
System: Which spot would you like to know about next?
User: Daitoku-ji please.
...
(about 2 m²) soundproof room with no one else present. Subjects sat in front of the system shown in Figure 1 and used the dialogue system. We instructed the subjects to speak with the avatar agent Hanna (not with the system). We also told them that the avatar agent was listening to their speech at all times using the microphone, and was observing their reactions using the camera above the display4. Subjects were given the task of acquiring information about three candidate sightseeing spots in Kyoto shown on the display and then selecting one that they liked5. An example dialogue with the system is shown in Table 2. In the middle of the explanation about the spot, the avatar confirmed whether it could continue
4 The system did not actually sense the subjects' reactions.
5 Subjects had freedom in choosing the order of the spots.
Table 3 Questionnaire items
1. Overall, which speech was better?
2. Which speech had easier-to-understand explanations?
3. For which speech did you feel compelled to give backchannels?
4. Which speech was more appropriate for this system?
5. Which speech had more human-like explanation?
Fig. 3 Questionnaire results
the explanation6, as shown in the example. A video (with English subtitles) showing a real user dialogue can be seen at http://mastarpj.nict.go.jp/~xtmisu/video/exp.avi. After the subject selected one of the candidate spots, we changed the TTS system setting and instructed the user to have another dialogue session, selecting one of another three spots. Considering order effects, the subjects were divided into four groups: the first group (Group 1) used the system in the order "Spot list A with dialogue-style speech → Spot list B with reading-style speech," and the second group (Group 2) worked in the reverse order. Groups 3 and 4 used the system with the order of the spot lists swapped.
5.2.2 Questionnaire Results

After the experiments, subjects were asked to fill in a questionnaire about the system. Table 3 shows the questionnaire items. The subjects selected (a) both are good, (b) dialogue-style speech was better, (c) reading-style speech was better, or (d) neither was good. Figure 3 shows the results.

The dialogue-style speech generally earned higher ratings, but the reading-style was rated slightly higher on items #2 and #5. This tendency is likely attributable to the fact that the dialogue-style speech had worse clarity and naturalness than the reading-style speech. The mean opinion score (MOS), which is often used to measure the clarity and naturalness of TTS, of the dialogue-style TTS was in fact 2.79, worse than the 3.74 of the reading-style TTS. The main cause appears to be a lack of training data (cf. five hours for the dialogue-style TTS and 10 hours for the reading-style TTS). We can improve the MOS by collecting more training data for the dialogue-style TTS.
6 Users are asked to reply positively to the system request.
Table 4 Percentages and average number of users who made backchannels (with avatar agent); values are total backchannels (verbal feedback [Aizuchi], nodding)

Group                                      TTS             % users made BCs        # average BCs taken
Group 1 (Dialogue → Reading, list A → B)   Dialogue-style  100.0% (50.0%, 100.0%)  30.4 (1.8, 28.6)
                                           Reading-style   100.0% (50.0%, 87.5%)   26.1 (3.1, 23.0)
Group 2 (Reading → Dialogue, list A → B)   Dialogue-style  75.0% (25.0%, 62.5%)    12.7 (0.5, 12.2)
                                           Reading-style   75.0% (25.0%, 62.5%)    12.9 (1.3, 11.6)
Group 3 (Dialogue → Reading, list B → A)   Dialogue-style  100.0% (28.6%, 100.0%)  14.0 (0.4, 13.6)
                                           Reading-style   100.0% (0%, 100.0%)     19.3 (0, 19.3)
Group 4 (Reading → Dialogue, list B → A)   Dialogue-style  87.5% (42.9%, 87.5%)    28.2 (4.7, 23.5)
                                           Reading-style   100.0% (71.4%, 87.5%)   24.8 (6.5, 18.3)
All                                        Dialogue-style  86.7% (36.7%, 86.7%)    21.1 (1.7, 19.4)
                                           Reading-style   90.0% (40.0%, 83.3%)    20.6 (2.4, 18.2)
5.2.3 Analysis of Frequency of Backchannels

We analyzed the number of backchannels that users made during the dialogue sessions. Using the recorded video, we manually annotated the subjects' verbal feedback, such as "uh-huh", and nodding of the head. Out of 30 subjects, 26 gave some form of backchannel to the system. Table 4 shows the percentages and average numbers of times subjects gave backchannels. Many users made more backchannels with the dialogue-style TTS system. Despite the significant difference in questionnaire item #3, there were no significant differences in the average number of users' backchannels.
5.2.4 Informativeness of Backchannels

We evaluated the TTS in terms of the informativeness of the evoked backchannels. The spontaneous prosodic pattern of the backchannels is expected to suggest positive or negative feelings regarding the recommended candidate. One promising use of backchannels in our application is for detecting the user's feelings about the currently focused spot, and choosing to continue the explanation on the current topic if the user seems interested, or otherwise to change the topic. We therefore label backchannels made during the system's explanation of the spot that the user finally selected7 as "positive" and those made during the explanations of the other two spots as "negative", and consider distinguishing between them8. In human-human dialogues, it was confirmed that when a user responds promptly, the majority of responses are positive, and that more backchannels also suggest positive responses [9]9.

We parameterize the backchannel information with two features representing timing and frequency. We assume that user backchannels can be inserted around the end of IPUs, and measure the time difference between the two events. Note that this
7 During the experiment, users were asked to select from the three explained sightseeing spots.
8 Ideally, positive/negative should be judged per IPU, but such an annotation is almost impossible.
9 Only verbal backchannels were examined in that work.
(Two bar charts of the number of backchannels for positive and negative responses.)