Advances in Natural Multimodal Dialogue Systems
Text, Speech and Language Technology VOLUME 30
Series Editors
Nancy Ide, Vassar College, New York
Jean Véronis, Université de Provence and CNRS, France

Editorial Board
Harald Baayen, Max Planck Institute for Psycholinguistics, The Netherlands
Kenneth W. Church, AT&T Bell Labs, New Jersey, USA
Judith Klavans, Columbia University, New York, USA
David T. Barnard, University of Regina, Canada
Dan Tufis, Romanian Academy of Sciences, Romania
Joaquim Llisterri, Universitat Autònoma de Barcelona, Spain
Stig Johansson, University of Oslo, Norway
Joseph Mariani, LIMSI-CNRS, France
The titles published in this series are listed at the end of this volume.
Advances in Natural Multimodal Dialogue Systems Edited by
Jan C.J. van Kuppevelt Waalre, The Netherlands
Laila Dybkjær University of Southern Denmark, Odense, Denmark
and
Niels Ole Bernsen University of Southern Denmark, Odense, Denmark
A C.I.P. Catalogue record for this book is available from the Library of Congress.
ISBN-10 1-4020-3934-4 (PB)
ISBN-13 978-1-4020-3934-8 (PB)
ISBN-10 1-4020-3932-8 (HB)
ISBN-13 978-1-4020-3932-4 (HB)
ISBN-10 1-4020-3933-6 (e-book)
ISBN-13 978-1-4020-3933-1 (e-book)
Published by Springer, P.O. Box 17, 3300 AA Dordrecht, The Netherlands. www.springer.com
Printed on acid-free paper
All Rights Reserved © 2005 Springer No part of this work may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, microfilming, recording or otherwise, without written permission from the Publisher, with the exception of any material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Printed in the Netherlands
Contents

Preface

1 Natural and Multimodal Interactivity Engineering - Directions and Needs
Niels Ole Bernsen and Laila Dybkjær
1. Introduction
2. Chapter Presentations
3. NMIE Contributions by the Included Chapters
4. Multimodality and Natural Interactivity
References

Part I Making Dialogues More Natural: Empirical Work and Applied Theory

2 Social Dialogue with Embodied Conversational Agents
Timothy Bickmore and Justine Cassell
1. Introduction
2. Embodied Conversational Agents
3. Social Dialogue
4. Related Work
5. Social Dialogue in REA
6. A Study Comparing ECA Social Dialogue with Audio-Only Social Dialogue
7. Conclusion
References

3 A First Experiment in Engagement for Human-Robot Interaction in Hosting Activities
Candace L. Sidner and Myroslava Dzikovska
1. Introduction
2. Hosting Activities
3. What is Engagement?
4. First Experiment in Hosting: A Pointing Robot
5. Making Progress on Hosting Behaviours
6. Engagement for Human-Human Interaction
7. Computational Modelling of Human-Human Hosting and Engagement
8. A Next Generation Mel
9. Summary
References

Part II Annotation and Analysis of Multimodal Data: Speech and Gesture

4 FORM
Craig H. Martell
1. Introduction
2. Structure of FORM
3. Annotation Graphs
4. Annotation Example
5. Preliminary Inter-Annotator Agreement Results
6. Conclusion: Applications to HLT and HCI?
Appendix: Other Tools, Schemes and Methods of Gesture Analysis
References

5 On the Relationships among Speech, Gestures, and Object Manipulation in Virtual Environments: Initial Evidence
Andrea Corradini and Philip R. Cohen
1. Introduction
2. Study
3. Data Analysis
4. Results
5. Discussion
6. Related Work
7. Future Work
8. Conclusions
Appendix: Questionnaire MYST III - EXILE
References

6 Analysing Multimodal Communication
Patrick G. T. Healey, Marcus Colman and Mike Thirlwell
1. Introduction
2. Breakdown and Repair
3. Analysing Communicative Co-ordination
4. Discussion
References

7 Do Oral Messages Help Visual Search?
Noëlle Carbonell and Suzanne Kieffer
1. Context and Motivation
2. Methodology and Experimental Set-Up
3. Results: Presentation and Discussion
4. Conclusion
References

8 Geometric and Statistical Approaches to Audiovisual Segmentation
Trevor Darrell, John W. Fisher III, Kevin W. Wilson, and Michael R. Siracusa
1. Introduction
2. Related Work
3. Multimodal Multisensor Domain
4. Results
5. Single Multimodal Sensor Domain
6. Integration
References

Part III Animated Talking Heads and Evaluation

9 The Psychology and Technology of Talking Heads: Applications in Language Learning
Dominic W. Massaro
1. Introduction
2. Facial Animation and Visible Speech Synthesis
3. Speech Science
4. Language Learning
5. Research on the Educational Impact of Animated Tutors
6. Summary
References

10 Effective Interaction with Talking Animated Agents in Dialogue Systems
Björn Granström and David House
1. Introduction
2. The KTH Talking Head
3. Effectiveness in Intelligibility and Information Presentation
4. Effectiveness in Interaction
5. Experimental Applications
6. The Effective Agent as a Language Tutor
7. Experiments and 3D Recordings for the Expressive Agent
References

11 Controlling the Gaze of Conversational Agents
Dirk Heylen, Ivo van Es, Anton Nijholt and Betsy van Dijk
1. Introduction
2. Functions of Gaze
3. The Experiment
4. Discussion
5. Conclusion
References

Part IV Architectures and Technologies for Advanced and Adaptive Multimodal Dialogue Systems

12 MIND: A Context-Based Multimodal Interpretation Framework in Conversational Systems
Joyce Y. Chai, Shimei Pan and Michelle X. Zhou
1. Introduction
2. Related Work
3. MIND Overview
4. Example Scenario
5. Semantics-Based Representation
6. Context-Based Multimodal Interpretation
7. Discussion
References

13 A General Purpose Architecture for Intelligent Tutoring Systems
Brady Clark, Oliver Lemon, Alexander Gruenstein, Elizabeth Owen Bratt, John Fry, Stanley Peters, Heather Pon-Barry, Karl Schultz, Zack Thomsen-Gray and Pucktada Treeratpituk
1. Introduction
2. An Intelligent Tutoring System for Damage Control
3. An Architecture for Multimodal Dialogue Systems
4. Activity Models
5. Dialogue Management Architecture
6. Benefits of ACI for Intelligent Tutoring Systems
7. Conclusion
References

14 MIAMM – A Multimodal Dialogue System using Haptics
Norbert Reithinger, Dirk Fedeler, Ashwani Kumar, Christoph Lauer, Elsa Pecourt and Laurent Romary
1. Introduction
2. Haptic Interaction in a Multimodal Dialogue System
3. Visual Haptic Interaction – Concepts in MIAMM
4. Dialogue Management
5. The Multimodal Interface Language (MMIL)
6. Conclusion
References

15 Adaptive Human-Computer Dialogue
Sorin Dusan and James Flanagan
1. Introduction
2. Overview of Language Acquisition
3. Dialogue Systems
4. Language Knowledge Representation
5. Dialogue Adaptation
6. Experiments
7. Conclusion
References

16 Machine Learning Approaches to Human Dialogue Modelling
Yorick Wilks, Nick Webb, Andrea Setzer, Mark Hepple and Roberta Catizone
1. Introduction
2. Modality Independent Dialogue Management
3. Learning to Annotate Utterances
4. Future work: Data Driven Dialogue Discovery
5. Discussion
References

Index
Preface
The chapters in this book jointly contribute to what we shall call the field of natural and multimodal interactive systems engineering. This is not yet a well-established field of research and commercial development but, rather, an emerging one in all respects. It brings together, in a process that, arguably, was bound to happen, contributors from many different, and often far more established, fields of research and industrial development. To mention but a few, these include speech technology, computer graphics and computer vision. The field's rapid expansion seems driven by a shared vision of the potential of new interactive modalities of information representation and exchange for radically transforming the world of computer systems, networks, devices, applications, etc. from the GUI (graphical user interface) paradigm into something which will enable a far deeper and much more intuitive and natural integration of computer systems into people's work and lives. Jointly, the chapters present a broad and detailed picture of where natural and multimodal interactive systems engineering stands today.

The book is based on selected presentations made at the International Workshop on Natural, Intelligent and Effective Interaction in Multimodal Dialogue Systems held in Copenhagen, Denmark, in 2002 and sponsored by the European CLASS project. CLASS was initiated at the request of the European Commission with the purpose of supporting and stimulating collaboration among Human Language Technology (HLT) projects as well as between HLT projects and relevant projects outside Europe. The purpose of the workshop was to bring together researchers from academia and industry to discuss innovative approaches and challenges in natural and multimodal interactive systems engineering.

The Copenhagen 2002 CLASS workshop was not just a very worthwhile event in an emerging field due to the general quality of the papers presented. It was also largely representative of the state of the art in the field. Given the increasing interest in natural interactivity and multimodality and the excellent quality of the work presented, it was felt to be timely to publish a book reflecting recent developments. Sixteen high-quality papers from the workshop were selected for publication. Content-wise, the chapters in this book illustrate most aspects of natural and multimodal interactive systems engineering: applicable
theory, empirical work, data annotation and analysis, enabling technologies, advanced systems, re-usability of components and tools, evaluation, and future visions. The selected papers have all been reviewed, revised, extended, and improved after the workshop. We are convinced that people who work in natural interactive and multimodal dialogue systems engineering – from graduate students and Ph.D. students to experienced researchers and developers, and no matter which community they come from originally – may find this collection of papers interesting and useful to their own work.

We would like to express our sincere gratitude to all those who helped us in preparing this book. We would especially like to thank all reviewers for their valuable and extensive comments and criticism, which have helped improve the quality of the individual chapters as well as the entire book.

THE EDITORS
Chapter 1
NATURAL AND MULTIMODAL INTERACTIVITY ENGINEERING - DIRECTIONS AND NEEDS
Niels Ole Bernsen and Laila Dybkjær
Natural Interactive Systems Laboratory, University of Southern Denmark, Campusvej 55, 5230 Odense M, Denmark
{nob, laila}@nis.sdu.dk

Abstract
This introductory chapter discusses the field of natural and multimodal interactivity engineering and presents the following 15 chapters in this context. A brief presentation of each chapter is given, their contributions to specific natural and multimodal interactivity engineering needs are discussed, and the concepts of multimodality and natural interactivity are explained along with an overview of the modalities investigated in the 15 chapters.
Keywords: Natural and multimodal interactivity engineering.
1. Introduction
Chapters 2 through 16 of this book present original contributions to the emerging field of natural and multimodal interactivity engineering (henceforth NMIE). A prominent characteristic of NMIE is that it is not yet an established field of research and commercial development but, rather, an emerging one in all respects, including applicable theory, experimental results, platforms and development environments, standards (guidelines, de facto standards, official standards), evaluation paradigms, coherence, ultimate scope, enabling technologies for software engineering, general topology of the field itself, "killer applications", etc. The NMIE field is vast and brings together practitioners from very many different, and often far more established, fields of research and industrial development, such as signal processing, speech technology, computer graphics, computer vision, human-computer interaction, virtual and augmented reality, non-speech sound, haptic devices, telecommunications, computer games, etc.
Table 1.1. Needs for progress in natural and multimodal interactivity engineering (NMIE).

General need: Understand issues, problems, solutions
Specific needs:
- Applicable theory: for any aspect of NMIE.
- Empirical work and analysis: controlled experiments, behavioural studies, simulations, scenario studies, task analysis on roles of, and collaboration among, specific modalities to achieve various benefits.
- Annotation and analysis: new quality data resources, coding schemes, coding tools, and standards.
- Future visions: visions, roadmaps, etc., general and per sub-area.

General need: Build systems
Specific needs:
- Enabling technologies: new basic technologies needed.
- More advanced systems: new, more complex, versatile, and capable system aspects.
- Make it easy: re-usable platforms, components, toolkits, architectures, interface languages, standards, etc.

General need: Evaluate
Specific needs:
- Evaluate all aspects: of components, systems, technologies, processes, etc.
It may be noted that even though a contributing field of research has been established in its own right over decades, many if not most of its practitioners may still be novices in NMIE. It follows that NMIE community formation is an ongoing challenge for all. Broadly speaking, the emergence of a new systems field, such as NMIE, requires understanding of issues, problems and solutions; knowledge and skills for building (or developing) systems and enabling technologies; and evaluation of any aspect of the process and its results. In the particular case of NMIE, these goals or needs could be made more specific as shown in Table 1.1.

Below, we start with a brief presentation of each of the following 15 chapters (Section 2). Taking Table 1.1 as point of departure, Section 3 provides an overview of, and discusses, the individual chapters' contributions to specific NMIE needs. Section 4 explains multimodality and natural interactivity and discusses the modalities investigated in the chapters of this book.
2. Chapter Presentations
We have found it appropriate to structure the 15 chapters into four parts under four headlines related to how the chapters contribute to the specific NMIE needs listed in Table 1.1. Each chapter has a main emphasis on issues that contribute to NMIE needs captured by the headline of the part of the book to which it belongs. The division cannot, of course, be a sharp one. Several chapters include discussions of issues that would make them fit into several parts of the book. Part one focuses on making dialogues more natural and has its main emphasis on experimental work and the application of theory. Part two concerns annotation and analysis of multimodal data, in particular, the modalities of speech and gesture. Part three addresses animated talking heads and related evaluation issues. Part four covers issues in building advanced multimodal dialogue systems, including architectures and technologies.
2.1 Making Dialogues More Natural: Empirical Work and Applied Theory
Two chapters have been categorised under this headline. They both aim at making interaction with a virtual or physical agent more natural. Experimental work is central in both chapters and so is the application of theory.

The chapter by Bickmore and Cassell (Chapter 2) presents an empirical study of the social dialogue of an embodied conversational real-estate agent during interaction with users. The study is a comparative one. In one setting, users could see the agent while interacting with it. In the second setting, users could talk to the agent but not see it. The chapter presents several interesting findings on social dialogue and interaction with embodied conversational agents.

Sidner and Dzikovska (Chapter 3) present empirical results from human-robot interaction in hosting activities. Their focus is on engagement, i.e., on the establishment and maintenance of a connection between interlocutors and the ending of it when desired. The stationary robot can point, make beat gestures, and move its eyes while conducting dialogue and tutoring on the use of a gas turbine engine shown to the user on a screen. The authors also discuss human-human hosting and engagement. Pursuing and building on theory of human-human engagement and using input from experiments, their idea is to continue adding capabilities to the robot that will make it better at showing engagement.
2.2 Annotation and Analysis of Multimodal Data: Speech and Gesture
The five chapters in this part have a common emphasis on data analysis. While one chapter focuses on annotation of already collected data, the other four chapters all describe experiments with data collection. In all cases, the data are subsequently analysed in order to, e.g., provide new knowledge on conversation or show whether a hypothesis was true or not. Martell (Chapter 4) presents the FORM annotation scheme and illustrates its use. FORM enables annotators to mark up the kinematic information of
gestures in videos. Although designed for gesture markup, FORM is also designed to be extensible to markup of speech and other conversational information. The goal is to establish an extensible corpus of annotated videos which can be used for research in various aspects of conversational interaction. In an appendix, Martell provides a brief overview of other tools, schemes and methods for gesture analysis.

Corradini and Cohen (Chapter 5) report on a Wizard-of-Oz study investigating how people use gesture and speech during interaction with a video game when they do not have access to standard input devices. The subjects' interaction with the game was recorded, transcribed, further coded, and analysed. The primary analysis of the data concerns the users' use of speech-only, gesture-only and speech and gesture in combination. Moreover, subjective data were collected by asking subjects after the experiment about their modality preferences during interaction. Although subjects' answers and their actual behaviour in the experiment did not always match, the study indicates a preference for multimodal interaction.

The chapter by Healey et al. (Chapter 6) addresses the analysis of human-human interaction. The motivation is the lack of support for the design of systems for human-human interaction. They discuss two psycholinguistic approaches and propose a third approach based on the detection and resolution of communication problems. This approach is useful for measuring the effectiveness of human-human interaction across tasks and modalities. A coding protocol for identification of repair phenomena across different modalities is presented, followed by evaluation results from testing the protocol on a small corpus of repair sequences. The presented approach has the potential to help in judging the effectiveness of multimodal communication.

Carbonell and Kieffer (Chapter 7) report on an experimental study which investigated whether oral messages facilitate visual search tasks on a crowded display. Using the mouse, subjects were asked to search and select visual targets in complex scenes presented on the screen. Before the presentation of each scene, the subject would either see the target alone, receive an oral description of the target and spatial information about its position in the scene, or get a combination of the visual and oral target descriptions. Analysis of the data collected suggests that appropriate oral messages can improve search accuracy as well as selection time. However, both objectively and subjectively, multimodal messages were most effective.

Darrell et al. (Chapter 8) discuss the problem of knowing who is speaking during multi-speaker interaction with a computer. They present two methods based on geometric and statistical source separation approaches, respectively. These methods are used for audiovisual segmentation of multiple speakers and have been used in experiments. One setup in a conference room used several stereo cameras and a ceiling-mounted microphone array grid. In this case a
geometric method was used to identify the speaker. In a second setup, involving use of a handheld device or a kiosk, a single camera and a single omnidirectional microphone were used and a statistical method applied for speaker identification. Data analysis showed that each approach was valuable in the intended domain. However, the authors propose that a combination of the two methods would be of benefit and initial integration efforts are discussed.
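As a rough, self-contained illustration of the audiovisual association idea behind this kind of speaker segmentation, the sketch below correlates an audio energy envelope with per-region visual motion and picks the best-matching face region. It is not the authors' method (their chapter uses geometric and statistical source-separation techniques); the feature extraction is assumed to have been done already, and all names and data are invented for illustration.

```python
import numpy as np

def active_speaker(audio_energy, region_motion):
    """Toy audio-visual association: pick the face region whose motion
    best tracks the audio energy envelope over a short window.

    audio_energy:  1-D array, one value per video frame.
    region_motion: dict mapping region id -> 1-D array of per-frame
                   motion energy (e.g., mean absolute pixel difference
                   inside a detected face region).
    Returns the region id with the highest positive correlation, or
    None if no region is positively associated with the audio.
    """
    best_id, best_score = None, 0.0
    a = audio_energy - audio_energy.mean()
    for region_id, motion in region_motion.items():
        m = motion - motion.mean()
        denom = np.linalg.norm(a) * np.linalg.norm(m)
        if denom == 0:
            continue
        score = float(np.dot(a, m) / denom)   # Pearson correlation
        if score > best_score:
            best_id, best_score = region_id, score
    return best_id

# Synthetic example: the "left" region moves in step with the audio.
t = np.linspace(0, 2 * np.pi, 100)
audio = np.abs(np.sin(t)) + 0.05 * np.random.rand(100)
regions = {
    "left":  np.abs(np.sin(t)) + 0.1 * np.random.rand(100),
    "right": 0.1 * np.random.rand(100),
}
print(active_speaker(audio, regions))   # expected: "left"
```

A real system would replace the plain correlation with a more robust association measure and would combine it with the geometric (microphone-array) cues discussed above.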
2.3 Animated Talking Heads and Evaluation
Three chapters address animated talking heads. These chapters all include experimental data collection and data analysis, and present evaluations of various aspects of talking head technology and its usability.

The chapter by Massaro (Chapter 9) concerns computer-assisted speech and language tutors for deaf, hard-of-hearing, and autistic children. The author presents an animated talking head for visible speech synthesis. The skin of the head can be made transparent so that one can see the tongue and the palate. The technology has been used in a language training program in which the agent guides the user through a number of exercises in vocabulary and grammar. The aim is to improve speech articulation and develop linguistic and phonological awareness in the users. The reported experiments show positive learning results.

Granström and House (Chapter 10) describe work on animated talking heads, focusing on the increased intelligibility and efficiency provided by the addition of the talking face which uses text-to-speech synthesis. Results from various studies are presented. The studies include intelligibility tests and perceptual evaluation experiments. Among other things, facial cues to convey, e.g., feedback, turn-taking, and prosodic functions like prominence have been investigated. A number of applications of the talking head are described, including a language tutor.

Heylen et al. (Chapter 11) report on how different eye gaze behaviours of a cartoon-like talking face affect the interaction with users. Three versions of the talking face were included in an experiment in which users had to make concert reservations by interacting with the talking face through typed input. One version of the face was designed to be a close approximation to human-like gaze behaviour. In a second version, gaze shifts were kept minimal, and in a third version, gaze shifts were random. Evaluation of data from the experiment clearly showed a better performance of, and preference for, the human-like version.
2.4 Architectures and Technologies for Advanced and Adaptive Multimodal Dialogue Systems
The last part of this book comprises five chapters which all have a strong focus on aspects of developing advanced multimodal dialogue systems. Several of the chapters present architectures which may be reused across applications, while others emphasise learning and adaptation.

The chapter by Chai et al. (Chapter 12) addresses the difficult task of interpreting multimodal user input. The authors propose to use a fine-grained semantic model that characterises the meaning of user input and the overall conversation, and an integrated interpretation approach drawing on context knowledge, such as conversation histories and domain knowledge. These two approaches are discussed in detail and are included in the multimodal interpretation framework presented. The framework is illustrated by a real-estate application in which it has been integrated.

Clark et al. (Chapter 13) discuss the application of a general-purpose architecture in support of multimodal interaction with complex devices and applications. The architecture includes speech recognition, natural language understanding, text-to-speech synthesis, an architecture for conversational intelligence, and use of the Open Agent Architecture. The architecture takes advantage of reusability and has been deployed in a number of dialogue systems. The authors report on its deployment in an intelligent tutoring system for shipboard damage control. Details about the tutoring system and the architecture are presented.

Reithinger et al. (Chapter 14) present a multimodal system for access to multimedia databases on small handheld devices. Interaction in three languages is supported. Emphasis is on haptic interaction via active buttons combined with spoken input and visual and acoustic output. The overall architecture of the system is explained and so is the format for data exchange between modules. Also, the dialogue manager is described, including its architecture and multimodal fusion issues.

Dusan and Flanagan (Chapter 15) address the difficult issue of ensuring sufficient vocabulary coverage in a spoken dialogue system. To overcome the problem that there may always be a need for additional words or word forms, the authors propose a method for adapting the vocabulary of a spoken dialogue system at run-time. The user performs the adaptation by adding new concepts to existing pre-programmed concept classes and by providing semantic information about the new concepts. Multiple input modalities are available for doing the adaptation. Positive results from preliminary experiments with the method are reported.

Wilks et al. (Chapter 16) discuss machine learning approaches to the modelling of human-computer interaction. They first describe a dialogue manager
built for multimodal dialogue handling. For representation, the dialogue manager uses stereotypical dialogue patterns called Dialogue Action Frames. The authors then describe an analysis module which learns to assign dialogue acts and semantic contents from corpora. The idea is to enable automatic derivation of Dialogue Action Frames, so that the dialogue manager will be able to use Dialogue Action Frames that are automatically learned from corpora.
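To make the dialogue-act assignment idea concrete, here is a deliberately tiny sketch of corpus-based dialogue-act classification. It is not the approach of Wilks et al. (who, as discussed in Section 3.5, use transformation-based learning); it simply trains an add-one-smoothed unigram model on a handful of invented (utterance, act) pairs and labels a new utterance. All utterances, act labels, and function names are made up for illustration.

```python
from collections import Counter, defaultdict
import math

# Tiny invented corpus of (utterance, dialogue act) pairs.
corpus = [
    ("what time does the concert start", "question"),
    ("where is the nearest station", "question"),
    ("book two tickets please", "request"),
    ("please reserve a table for four", "request"),
    ("yes that is fine", "accept"),
    ("no not that one", "reject"),
]

def train(pairs):
    """Count act priors and per-act word frequencies."""
    act_counts = Counter(act for _, act in pairs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for utt, act in pairs:
        for w in utt.split():
            word_counts[act][w] += 1
            vocab.add(w)
    return act_counts, word_counts, vocab

def classify(utterance, act_counts, word_counts, vocab):
    """Pick the act maximising prior * smoothed word likelihoods."""
    total = sum(act_counts.values())
    best_act, best_logp = None, float("-inf")
    for act, n in act_counts.items():
        logp = math.log(n / total)                                    # prior
        denom = sum(word_counts[act].values()) + len(vocab)
        for w in utterance.split():
            logp += math.log((word_counts[act][w] + 1) / denom)       # add-one smoothing
        if logp > best_logp:
            best_act, best_logp = act, logp
    return best_act

model = train(corpus)
print(classify("please book a seat", *model))   # expected: "request"
```

Real systems train on annotated corpora several orders of magnitude larger and use richer features than single words, but the input-output contract is the same: an utterance goes in, a dialogue-act label comes out.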
3. NMIE Contributions by the Included Chapters
Using the right-hand column entries of Table 1.1, Table 1.2 indicates how the 15 chapters in this book contribute to the NMIE field. A preliminary conclusion based on Table 1.2 is that, for an emerging field which is only beginning to be exploited commercially, the NMIE research being done world-wide today is already pushing the frontiers in many of the directions needed. In the following, we discuss the chapter contributions to each of the left-hand entries in Table 1.2.
3.1 Applicable Theory
It may be characteristic of the NMIE field at present that our sample of papers only includes a single contribution of a primarily theoretical nature, i.e. Healey et al., which applies a psycholinguistic model of dialogue to help identify a subset of communication problems in order to judge the effectiveness of multimodal communication. Human-machine communication problems, their nature and identification by human or machine, have recently begun to attract the attention of more than a few NMIE researchers, and it has become quite clear that we need far better understanding of miscommunication in natural and multimodal interaction than we have at present. However, the relative absence of theoretical papers does not mean that the field makes no use of, or has no need for, applicable theory. On the contrary, a large number of chapters actually do apply existing theory in some form, ranging from empirical generalisations to full-fledged theory of many different kinds. For instance, Bickmore and Cassell test generalisations on the effects on communication of involving embodied conversational agents; Carbonell and Kieffer apply modality theory; Chai et al. apply theories of human-human dialogue to the development of a fine-grained, semantics-based multimodal dialogue interpretation framework; Massaro applies theories of human learning; and Sidner and Dzikovska draw on conversation and collaboration theory.
Table 1.2. NMIE needs addressed by the chapters in this book (numbers in parentheses refer to chapters).

- Applicable theory: No new theory except (6), but plenty of applied theory, e.g. (2), (3).
- Empirical work and analysis: Effects on communication of animated conversational agents (2, 10, 11). Spoken input in support of visual search (7). Gesture and speech for video game playing (5). Multi-speaker speech recognition (8). Gaze behaviour for more likeable animated interface agents (2, 11). Audio-visual speech output (9, 10). Animated talking heads for language learning (9, 10). Tutoring (3, 9, 10, 13). Hosting robots (3).
- Annotation and analysis: Coding scheme for conversational interaction research (4). Standard for internal representation of NMIE data codings (4).
- Future visions: Many papers with visions concerning new challenges in their research.
- Enabling technologies: Interactive robotics: robots controlled multimodally, tutoring and hosting robots (3). Multi-speaker speech recognition (8). Audio-visual speech synthesis for talking heads (9, 10). Machine learning of language and dialogue act assignment (15, 16). Multilinguality (14). Ubiquitous (mobile) application (14).
- More advanced systems: On-line observation-based user modelling for adaptivity (12, 14). Complex natural interactive dialogue management (12, 13, 14, 16). Machine learning for more advanced dialogue systems (15, 16).
- Make it easy: Platform for natural interactivity (6). Re-usable components (many papers). Architectures for multimodal systems and dialogue management (3, 12, 13, 14, 16). Multimodal interface language (14). XML for data exchange (10, 14).
- Evaluate: Effects on communication of animated conversational agents (2, 10, 11). Evaluations of talking heads (9, 10, 11). Evaluation of audio-visual speech synthesis for learning (9, 10).

3.2 Empirical Work and Analysis
Novel theory tends to be preceded by empirical exploration and generalisation. The NMIE field is replete with empirical studies of human-human and human-computer natural and multimodal interaction [Dehn and van Mulken,
2000]. By their nature, empirical studies are closer to the process of engineering than is theory development. We build NMIE research systems not only from theory but, perhaps to a far greater extent, from hunches, contextual assumptions, extrapolations from previous experience and untried transfer from different application scenarios, user groups, environments, etc., or even Wizard of Oz studies, which are in themselves a form of empirical study, see, e.g., the chapter by Corradini and Cohen and [Bernsen et al., 1998]. Having built a prototype system, we are keen to find out how far those hunches, etc. got us. Since empirical testing, evaluation, and assessment are integral parts of software and systems engineering, all we have to do is to include "assumptions testing" in the empirical evaluation of the implemented system which we would be doing anyway.

The drawback of empirical studies is that they usually do not generalise much due to the multitude of independent variables involved. This point is comprehensively argued and illustrated for the general case of multimodal and natural interactive systems which include speech in [Bernsen, 2002]. Still, as we tend to work on the basis of only slightly fortified hunches anyway, the results could often serve to inspire fellow researchers to follow them up. Thus, best-practice empirical studies are of major importance in guiding NMIE progress.

The empirical chapters in this book illustrate well the points made above. One cluster of findings demonstrates the potential of audio-visual speech output by animated talking heads for child language learning (Massaro) and, more generally, for improving intelligibility and efficiency of human-machine communication, including the substitution of facial animation for the still-missing prosody in current speech synthesis systems (Granström and House). In counter-point, so to speak, Darrell et al. convincingly demonstrate the advantage of audio-visual input for tackling an important next step in speech technology, i.e. the recognition of multi-speaker spoken input. Jointly, the three chapters do a magnificent job of justifying the need for natural and multimodal (audiovisual) interaction independently of any psychological or social-psychological argument in favour of employing animated conversational agents. A key question seems to be: for which purpose(s), other than harvesting the benefits of using audio-visual speech input/output described above, do we need to accompany spoken human-computer dialogue with more or less elaborate animated conversational interface agents [Dehn and van Mulken, 2000]? By contrast with spoken output, animated interface agents occupy valuable screen real estate, do not necessarily add information of importance to the users of large classes of applications, and may distract the user from the task at hand. Whilst a concise and comprehensive answer to this question is still pending, Bickmore and Cassell go a long way towards explaining that the introduction of life-like animated interface agents into human-computer spoken dialogue is
a tough and demanding proposition. As soon as an agent appears on the display, users tend to switch expectations from talking to a machine to talking to a human. By comparison, the finding of Heylen et al. that users tend to appreciate an animated cartoon agent more if it shows a minimum of human-like gaze behaviour might speak in favour of preferring cartoon-style agents over life-like animated agents because the former do not run the risk of confronting our full expectations of human conversational behaviour. Sidner and Dzikovska do not involve a virtual agent but, rather, a robot in the dialogue with the user, so they do not have the problem of an agent occupying part of the screen. But they still have the behavioural problems of the robot to look into just as, by close analogy, do the people who work with virtual agents. The experiments by Sidner and Dzikovska show that there is still a long way to go before we fully understand and can model the subtle details of human behaviour in dialogue.

On the multimodal side of the natural interactivity/multimodality semi-divide, several papers address issues of modality collaboration, i.e., how the use of modality combinations could facilitate, or even enable, human-computer interaction tasks that could not be done easily, if at all, using unimodal interaction. Carbonell and Kieffer report on how combined speech and graphics output can facilitate display search, and Corradini and Cohen show how the optional use of different input modalities can improve interaction in a particular virtual environment.
3.3 Annotation and Analysis
It is perhaps not surprising that we are not very capable of predicting what people will do, or how they will behave, when interacting with computer systems using new modality combinations and possibly also new interactive devices. More surprising, however, is the fact that we are often just as ignorant when trying to predict natural interactive behaviours which we have the opportunity to observe every day in ourselves and others, such as: which kinds of gestures, if any, do people perform when they are listening to someone else speaking? This example illustrates that, to understand the ways in which people communicate with one another as well as the ways in which people communicate with the far more limited, current NMI systems, we need extensive studies of behavioural data. The study of data on natural and multimodal interaction is becoming a major research area full of potential for new discoveries. A number of chapters make use of, or refer to, data resources for NMIE, but none of them take a more general view on data resource issues. One chapter addresses NMIE needs for new coding schemes. Martell presents a kinematically-based gesture annotation scheme for capturing the kinematic information in gestures from videos of speakers. Linking the urgent issue of
new, more powerful coding tools with the equally important issue of standardisation, Martell proposes a standard for the internal representation of NMIE codings.
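One common way of representing such codings internally is as annotation graphs: labelled intervals anchored to a shared timeline. The small sketch below is a generic illustration of that idea, not the FORM scheme or the proposed standard; track names, attribute fields, and time stamps are invented.

```python
from dataclasses import dataclass, field

@dataclass
class Arc:
    start: float          # seconds into the video
    end: float
    track: str            # e.g. "right arm", "speech"
    label: dict = field(default_factory=dict)

# A miniature annotation graph: labelled intervals over a shared timeline.
annotations = [
    Arc(12.40, 13.10, "right arm", {"movement": "lift", "elbow": "bent"}),
    Arc(12.40, 13.10, "right hand", {"shape": "point", "target": "screen"}),
    Arc(12.35, 13.30, "speech", {"transcript": "over there"}),
]

def overlapping(arcs, t0, t1):
    """Return all annotations that overlap the interval [t0, t1]."""
    return [a for a in arcs if a.start < t1 and a.end > t0]

# Which annotations co-occur with the pointing gesture?
for a in overlapping(annotations, 12.40, 13.10):
    print(a.track, a.label)
```

The attraction of such a representation is that gesture, speech, and any other coded modality live on the same time axis, which is precisely what makes cross-modal queries and standardised interchange feasible.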
3.4 Future Visions
None of the chapters have a particular focus on future visions for the NMIE field. However, many authors touch on future visions, e.g., in their descriptions of future work and what they would like to achieve. This includes the important driving role of re-usable platforms, tools, and components for making rapid progress. Moreover, there are several hints at the future importance to NMIE of two generic enabling technologies which are needed to extend spoken dialogue systems to full natural interactive systems. These technologies are (i) computer vision for processing camera input, and (ii) computer animation systems. It is only recently that the computer vision community has begun to address issues of natural interactive and multimodal human-system communication, and there is a long way to go before computer vision can parallel speech recognition as a major input medium for NMIE. The chapters by Massaro and Granström and House illustrate current NMIE efforts to extend natural and multimodal interaction beyond traditional information systems to new major application areas, such as training and education, which have been around for a while already, notably in the US-dominated paradigm of tutoring systems using animated interface agents, but also to edutainment and entertainment. While the GUI, including the current WWW, might be said to have the edutainment potential of a schoolbook or newspaper, NMIE systems have the much more powerful edutainment potential of brilliant teachers, comedians, and exciting human-human games.
3.5 Enabling Technologies
An enabling technology is often developed over a long time by some separate community, such as by the speech recognition community from the 1950s to the late 1980s. Having matured to the point at which practical applications become possible, the technology transforms into an omnipresent tool for system developers, as is the case with speech recognition technology today. NMIE needs a large number of enabling technologies and these are currently at very different stages of maturity. Several enabling technologies, some of which are at an early stage and some of which are finding their way into applications, are presented in this book in the context of their application to NMIE problems, including robot interaction and agent technology, multi-speaker interaction and recognition, machine learning, and talking face technology. Sidner and Dzikovska focus on robot interaction in the general domain of "hosting", i.e., where a virtual or physical agent provides guidance, education,
or entertainment based on collaborative goals negotiation and subsequent action. A great deal of work remains to be done before robot interaction becomes natural in any approximate sense of the term. For instance, the robot's spoken dialogue capabilities must be strongly improved and so must its embodied appearance and global communicative behaviours. In fact, Sidner and Dzikovska draw some of the same conclusions as Bickmore and Cassell, namely that agents need to become far more human-like in all or most respects before they are really appreciated by humans.

Darrell et al. address the problem in multi-speaker interaction of knowing who is addressing the computer when. Their approach is to use a combination of microphones and computer vision to find out who is talking.

Developers of spoken dialogue applications must cope with problems resulting from vocabulary and grammar limitations and from difficulties in enabling much of the flexibility and functionality inherent in human-human communication. Despite having carried out systematic testing, the developer often finds that words are missing when a new user addresses the application. Dusan and Flanagan propose machine learning as a way to overcome part of this problem. Using machine learning, the system can learn new words and grammars taught to it by the user in a well-defined way. Wilks et al. address machine learning - or transformation-based learning - in the context of assigning dialogue acts as part of an approach to improved dialogue modelling. In another part of their approach, Wilks et al. consider the use of dialogue action frames, i.e., a set of stereotypical dialogue patterns which perhaps may be learned from corpus data, as a means for flexibly switching back and forth between topics during dialogue.

Granström and House and Massaro describe the gain in intelligibility that can be obtained by combining speech synthesis with a talking face. There is still much work to do both on synthesis and face articulation. For most languages, speech synthesis is still not very natural to listen to and if one wants to develop a particular voice to fit a certain animated character, this is not immediately possible with today's technology. With respect to face articulation, faces need to become much more natural in terms of, e.g., gaze, eyebrow movements, lip and mouth movements, and head movements, as this seems to influence users' perception of the interaction, cf. the chapters by Granström and House and Heylen et al.
3.6 More Advanced Systems
Enabling technologies for NMIE are often component technologies, and their description, including state of the art, current research challenges, and unsolved problems, can normally be made in a relatively systematic and focused manner. It is far more difficult to systematically describe the complexity
of the constant push in research and industry towards exploring and exploiting new NMIE application types and new application domains, addressing new user populations, increasing the capabilities of systems in familiar domains of application, exploring known technologies with new kinds of devices, etc. In general, the picture is one of pushing present boundaries in all or most directions. During the past few years, a core trend in NMIE has been to combine different modalities in order to build more complex, versatile and capable systems, getting closer to natural interactivity than is possible with only a single modality. This trend is reflected in several chapters. Part of the NMIE paradigm is that systems must be available whenever and wherever convenient and useful, making ubiquitous computing an important application domain. Mobile devices, such as mobile phones, PDAs, and portable computers of any (portable) size have become popular and are rapidly gaining functionality. However, the interface and interaction affordances of small devices require careful consideration. Reithinger et al. present some of those considerations in the context of providing access to large amounts of data about music. It can be difficult for users to know how to interact with new NMIE applications. Although not always very successful in practice, the classical GUI system has the opportunity to present its affordances in static graphics (including text) before the user chooses how to interact. A speech-only system, by contrast, cannot do that because of the dynamic and transitory nature of acoustic modalities. NMIE systems, in other words, pose radically new demands on how to support the user prior to, and during, interaction. Addressing this problem, several chapters mention user modelling or repositories of user preferences built on the basis of interactions with a system, cf. the chapters by Chai et al. and Reithinger et al. Machine learning, although another example of less-than-expected pace of development during the past 10 years, has great potential for increasing interaction support. In an advanced application of machine learning, Dusan and Flanagan propose to increase the system’s vocabulary and grammar by letting users teach the system new words and their meaning and use. Wilks et al. use machine learning as part of an approach to more advanced dialogue modelling. Increasingly advanced systems require increasingly complex dialogue management, cf. the chapters by Chai et al., Clark et al., and Wilks et al. Lifelikeness of animated interface agents and conversational dialogue are among the key challenges in achieving the NMIE vision. Multilinguality of systems is an important NMIE goal. Multilingual applications are addressed by Reithinger et al. In their case, the application is running on a handheld device. Multi-speaker input speech is mentioned by Darrell et al. For good reason, recognition of multi-speaker input has become a lively research topic. We
need solutions in order to, e.g., build meeting minute-takers, separate the focal speaker’s input from that of other speakers, exploit the huge potential of spoken multi-user applications, etc.
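Returning to the run-time vocabulary adaptation mentioned above (Dusan and Flanagan), the toy sketch below shows the general flavour of letting a user attach a new word to an existing concept class so that subsequent input is interpreted accordingly. It is an invented illustration, not the authors' system; the concept classes, words, and function names are all made up.

```python
# Illustrative run-time vocabulary adaptation: the user adds a new word to an
# existing concept class, and the interpreter immediately accepts it.
concept_classes = {
    "city": {"odense", "copenhagen", "aarhus"},
    "transport": {"train", "bus"},
}

def teach(word, concept):
    """User-driven adaptation: attach a new word to a pre-programmed class."""
    concept_classes.setdefault(concept, set()).add(word.lower())

def interpret(utterance):
    """Map each known word in the utterance to its concept class."""
    return {w: c for w in utterance.lower().split()
            for c, words in concept_classes.items() if w in words}

print(interpret("train to esbjerg"))     # 'esbjerg' is still unknown
teach("Esbjerg", "city")                 # the user teaches the new word
print(interpret("train to esbjerg"))     # now mapped to the 'city' class
```

In a deployed system the new word would, of course, also have to be added to the recogniser's vocabulary and language model, which is where most of the real difficulty lies.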
3.7 Make It Easy
Due to the complexity of multimodal natural interaction, it is becoming dramatically important to be able to build systems as easily as possible. It seems likely that no single research lab or development team in industry, even including giants such as Microsoft, is able to master all of the enabling technologies required for broad-scale NMIE progress. To advance efficiently, everybody needs access to those system components, and their built-in know-how, which are not in development focus. This implies strong attention to issues such as re-usable platforms, components and architectures, development toolkits, interface languages, data formats, and standardisation.

Clark et al. have used the Open Agent Architecture (OAA, http://www.ai.sri.com/~oaa/), a framework for integrating heterogeneous software agents in a distributed environment. What OAA and other architectural frameworks, such as CORBA (http://www.corba.org/), aim to do is provide a means for modularisation, synchronous and asynchronous communication, well-defined inter-module communication via some interface language, such as IDL (CORBA) or ICL (OAA), and the possibility of implementation in a distributed environment. XML (Extensible Markup Language) is a simple, flexible text format derived from SGML (ISO 8879) which has become popular as, among other things, a message exchange format, cf. Reithinger et al. and Granström and House. Using XML for wrapping inter-module messages is one way to overcome the problem of different programming languages used for implementing different modules (a minimal sketch of such a wrapped message is given at the end of this section).

Some chapters express a need for reusable components. Many of the applications described include off-the-shelf software, including components developed in other projects. This is particularly true for mature enabling technologies, such as speech recognition and synthesis components. As regards multimodal dialogue management, there is an expressed need for reuse in, e.g., the chapter by Clark et al. who discuss a reusable dialogue management architecture in support of multimodal interaction.

In conclusion, there are architectures, platforms, and software components available which facilitate the building of new NMIE applications, and standards are underway for certain aspects. There is still much work to be done on standardisation, new and better platforms, and improvement of component software. In addition, we need, in particular, more and better toolkits in support of system development and a better understanding of those components which
cannot be bought off-the-shelf and which are typically difficult to reuse, such as dialogue managers. Advancements such as these are likely to require significant corpus work. Corpora with tools and annotation schemes as described by Martell are exactly what is needed in this context.
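The sketch announced above illustrates the XML message-wrapping idea: one module serialises a result into a small XML document, and another module, possibly written in a different language, parses it back. Element names, attributes, and the message content are invented for illustration and do not reproduce any particular project's interface language (such as ICL, IDL, or MIAMM's MMIL).

```python
import xml.etree.ElementTree as ET

def wrap_recognition_result(words, confidence, source="speech_recogniser"):
    """Wrap a recogniser hypothesis in a small XML message so that a
    dialogue manager written in another language can consume it."""
    msg = ET.Element("message", {"from": source, "type": "recognition_result"})
    hyp = ET.SubElement(msg, "hypothesis", {"confidence": f"{confidence:.2f}"})
    hyp.text = " ".join(words)
    return ET.tostring(msg, encoding="unicode")

def read_message(xml_string):
    """The receiving module parses the message back into plain Python data."""
    msg = ET.fromstring(xml_string)
    hyp = msg.find("hypothesis")
    return {
        "from": msg.get("from"),
        "type": msg.get("type"),
        "words": hyp.text.split(),
        "confidence": float(hyp.get("confidence")),
    }

wire = wrap_recognition_result(["show", "me", "apartments"], 0.87)
print(wire)                 # the XML string that travels between modules
print(read_message(wire))   # the same content, reconstructed by the receiver
```

The point is not the particular tags but the discipline: as long as both sides agree on the message schema, the modules can be implemented, replaced, and distributed independently.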
3.8 Evaluate
Software systems and components evaluation is a broad area, ranging from technical evaluation through usability evaluation to customer evaluation. Customer evaluation has never been a key issue in research but has, rather, tended to be left to the marketing departments of companies. Technical evaluation and usability evaluation, including evaluation of functionality from both perspectives, are, on the other hand, considered important research issues. The chapters show a clear trend towards focusing on usability evaluation and comparative performance evaluation. It is hardly surprising that performance evaluation and usability issues are considered key topics today. We know little about what happens when we move towards increasingly multimodal and natural interactive systems, both as regards how these new systems will perform compared to alternative solutions and how the systems will be received and perceived by their users. We only know that a technically optimal system is not sufficient to guarantee user satisfaction.

Comparative performance evaluation objectively compares users' performance on different systems with respect to, e.g., how well they understand speech-only versus speech combined with a talking face or with an embodied animated agent, cf. Granström and House, Massaro, and Bickmore and Cassell. The usability issues evaluated all relate to users' perception of a particular system and include parameters such as life-likeness, credibility, reliability, efficiency, personality, ease of use, and understanding quality, cf. Heylen et al. and Bickmore and Cassell.

Two chapters address how the intelligibility of what is being said can be increased through visual articulation, cf. Granström and House and Massaro. Granström and House have used a talking head in several applications, including tourist information, real estate (apartment) search, aid for the hearing impaired, education, and infotainment. Evaluation shows a significant gain in intelligibility for the hearing impaired. Eyebrow and head movement enhance perception of emphasis and syllable prominence. Over-articulation may be useful as well when there are special needs for intelligibility. The findings of Massaro support these promising conclusions. His focus is on applications for the hard-of-hearing, children with autism, and child language learning more generally. Granström and House also address the increase in efficiency of communication/interaction produced by using an animated talking head. Probably,
naturalness is a key point here. This is suggested by Heylen et al., who conducted controlled experiments on the effects of different eye gaze behaviours of a cartoon-like talking face on the quality of human-agent dialogues. The most human-like agent gaze behaviour led to higher appreciation of the agent and more efficient task performance.

Bickmore and Cassell evaluate the effects on communication of an embodied conversational real-estate agent versus an over-the-phone version of the same system, cf. also [Cassell et al., 2000]. In each condition, two variations of the system were available. One would be fully task-oriented while the second version would include some small-talk options. In general, users liked the system better in the phone condition. In the phone condition, subjects appreciated the small-talk whereas, in the embodied condition, subjects wanted to get down to business. The implication is that agent embodiment has strong effects on the interlocutors. Users tend to compare their animated interlocutors with humans rather than machines. To work with users, animated agents need considerably more naturalness and personally attractive features communicated non-verbally. This sets a demanding research agenda on both speech and non-verbal output, requiring conversational abilities both verbally and non-verbally.

Jointly, the chapters on evaluation demonstrate a broad need for performance evaluation, comparative as well as non-comparative, that can inform us on the possible benefits and shortcomings of new natural interactive and multimodal systems. The chapters show a similar need for usability evaluation that can help us find out how users perceive these new systems, and a need for finding ways in which usability and user satisfaction might be correlated with technical aspects in order for the former to be derived from the latter.
4. Multimodality and Natural Interactivity
Conceptually, NMIE combines natural interactive and multimodal systems and components engineering. While both concepts, natural interactivity and multimodality, have a long history, it would seem that they continue to sit somewhat uneasily side by side in the minds of most of us. Multimodality is the idea of being able to choose any input/output modality or combination of input/output modalities for optimising interaction with the application at hand, such as speech input for many heads-up, hands-occupied applications, speech and haptic input/output for applications for the blind, etc. A modality is a particular way of representing input or output information in some physical medium, such as something touchable, light, sound, or the chemistry for producing olfaction and gustation [Bernsen, 2002], see also the chapter by Carbonell and Kieffer. The physical medium of the speech modalities, for instance, is sound or acoustics but this medium obviously enables the transmission
of information in many acoustic modalities other than speech, such as earcons, music, etc. The term multimodality thus refers to any possible combination of elementary or unimodal modalities.

Compared to multimodality, the notion of natural interactivity appears to be the more focused of the two. This is because natural interactivity comes with a focused vision of the future of interaction with computer systems as well as a relatively well-defined set of modalities required for the vision to become reality. The natural interactivity vision is that of humans communicating with computer systems in the same ways in which humans communicate with one another. Thus, natural interactivity specifically emphasises human-system communication involving the following input/output modalities used in situated human-human communication: speech, gesture, gaze, facial expression, head and body posture, and object manipulation as integral part of the communication (or dialogue). As the objects being manipulated may themselves represent information, such as text and graphics input/output objects, natural interaction subsumes the GUI paradigm. Technologically, the natural interactivity vision is being pursued vigorously by, among others, the emerging research community in talking faces and embodied conversational agents as illustrated in the chapters by Bickmore and Cassell, Granström and House, Heylen et al., and Massaro. An embodied conversational agent may be either virtual or a robot, cf. the chapter by Sidner and Dzikovska.

A weakness in our current understanding of natural interactivity is that it is not quite clear where to draw the boundary between the natural interactivity modalities and all those other modalities and modality combinations which could potentially be of benefit to human-system interaction. For instance, isn't pushing a button on the mouse or otherwise, although never used in human-human communication for the simple reason that humans do not have communicative buttons on them, as natural as speaking? If it is, then, perhaps, all or most research on useful multimodal input/output modality combinations is also research into natural interactivity even if the modalities addressed are not being used in human-human communication? In addition to illustrating the need for more and better NMIE theory, the point just made may explain the uneasy conceptual relationship between the two paradigms of natural interactivity and multimodality. In any case, we have decided to combine the paradigms and address them together as natural and multimodal interactivity engineering.

Finally, by NMI 'engineering' we primarily refer to software engineering. It follows that the expression 'natural and multimodal interactivity engineering' primarily represents the idea of creating a specialised branch of software engineering for the field addressed in this book. It is important to add, however, that NMIE enabling technologies are being developed in fields whose practitioners do not tend to regard themselves as doing software engineering, such as signal processing. For instance, the recently launched European Network of
Excellence SIMILAR (http://www.similar.cc) addresses signal processing for natural and multimodal interaction.
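For readers who want to operationalise the medium/modality distinction just introduced, the following sketch encodes it as a small data structure. It is purely illustrative: the type names, the three media listed, and the example modalities are our own simplification inspired by [Bernsen, 2002], not part of that taxonomy.

```python
from dataclasses import dataclass
from enum import Enum

class Medium(Enum):
    """Physical medium carrying the representation."""
    ACOUSTIC = "acoustic"   # sound: speech, earcons, music, ...
    GRAPHIC = "graphic"     # light: text, images, gesture, facial expression, ...
    HAPTIC = "haptic"       # touch: braille, force feedback, ...

@dataclass(frozen=True)
class Modality:
    """A particular way of representing input or output information in a medium."""
    name: str
    medium: Medium
    direction: str          # "input" or "output"

# A few unimodal modalities; a multimodal system combines two or more of them.
SPEECH_INPUT = Modality("speech", Medium.ACOUSTIC, "input")
EARCON_OUTPUT = Modality("earcon", Medium.ACOUSTIC, "output")
POINTING_INPUT = Modality("pointing gesture", Medium.GRAPHIC, "input")
TALKING_HEAD_OUTPUT = Modality("talking head", Medium.GRAPHIC, "output")
```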
4.1 Modalities Investigated in This Book
We argued above that multimodality includes all possible modalities for the representation and exchange of information among humans and between humans and computer systems, and that natural interactivity includes a rather vaguely defined, large sub-set of those modalities. Within this wide space of unimodal modalities and modality combinations, it may be useful to look at the modalities actually addressed in the following chapters. These are summarised by chapter in Table 1.3.

Table 1.3. Modalities addressed in the included chapters (listed by chapter number plus first listed author).

Chapter | Input | Output
2. Bickmore | speech, gesture (via camera) vs. speech-only | embodied conversational agent + images vs. speech-only + images
3. Sidner | speech, mouse clicks (new version includes face and gesture input via camera) | robot pointing and beat gestures, speech, gaze
4. Martell | gesture | N/A
5. Corradini | speech, gesture, object manipulation/manipulative gesture | video game
6. Healey | N/A | N/A
7. Carbonell | gesture (mouse) | speech, graphics
8. Darrell | speech, camera-based graphics | N/A
9. Massaro | mouse, speech | audio-visual speech synthesis, talking head, images, text
10. Granström | N/A | audio-visual speech synthesis, talking head
11. Heylen | typed text | talking head, gaze
12. Chai | speech, text, gesture (pointing) | speech, graphics
13. Clark | speech, gesture (pointing) | speech, text, graphics
14. Reithinger | speech, haptic buttons | music, speech, text, graphics, tactile rhythm
15. Dusan | speech, keyboard, mouse, pen-based drawing and pointing, camera | speech, graphics, text display
16. Wilks | speech and possibly other modalities | speech and possibly other modalities. Focus is on dialogue modelling so input/output modalities are not discussed in detail
Combined speech input/output which, in fact, means spoken dialogue almost throughout, is addressed in about half of the chapters (Bickmore and Cassell, Chai et al., Clark et al., Corradini and Cohen, Dusan and Flanagan, Reithinger et al., and Sidner and Dzikovska). Almost two thirds of the chapters address gesture input in some form (Bickmore and Cassell, Chai et al., Clark et al., Corradini and Cohen, Darrell et al., Dusan and Flanagan, Martell, Reithinger et al., and Sidner and Dzikovska). Five chapters address output modalities involving talking heads, embodied animated agents, or robots (Bickmore and Cassell, Granström and House, Heylen et al., Massaro, and Sidner and Dzikovska). Three chapters (Darrell et al., Bickmore and Cassell, and Sidner and Dzikovska) address computer vision. Dusan and Flanagan also mention that their system has camera-based input. Facial expression of emotion is addressed by Granström and House. Despite its richness and key role in natural interactivity, input or output speech prosody is hardly discussed. Granström and House discuss graphical ways of replacing missing output speech prosody by facial expression means. In general, the input and output modalities and their combinations discussed would appear representative of the state-of-the-art in NMIE. The authors make quite clear how far we are from mastering the very large number of potentially useful unimodal ”compounds” theoretically, in input recognition, in output generation, as well as in understanding and generation.
References

Bernsen, N. O. (2002). Multimodality in Language and Speech Systems - From Theory to Design Support Tool. In Granström, B., House, D., and Karlsson, I., editors, Multimodality in Language and Speech Systems, pages 93–148. Dordrecht: Kluwer Academic Publishers.

Bernsen, N. O., Dybkjær, H., and Dybkjær, L. (1998). Designing Interactive Speech Systems. From First Ideas to User Testing. Springer Verlag.

Cassell, J., Bickmore, T., Campbell, L., Vilhjálmsson, H., and Yan, H. (2000). Human Conversation as a System Framework: Designing Embodied Conversational Agents. In Embodied Conversational Agents, pages 29–63. Cambridge, MA: MIT Press.

Dehn, D. and van Mulken, S. (2000). The Impact of Animated Interface Agents: A Review of Empirical Research. International Journal of Human-Computer Studies, 52:1–22.
PART I
MAKING DIALOGUES MORE NATURAL: EMPIRICAL WORK AND APPLIED THEORY
Chapter 2

SOCIAL DIALOGUE WITH EMBODIED CONVERSATIONAL AGENTS

Timothy Bickmore
Northeastern University, USA
[email protected]

Justine Cassell
Northwestern University, USA
[email protected]

Abstract
The functions of social dialogue between people in the context of performing a task are discussed, as well as approaches to modelling such dialogue in embodied conversational agents. A study of an agent's use of social dialogue is presented, comparing embodied interactions with similar interactions conducted over the phone, assessing the impact these media have on a wide range of behavioural, task and subjective measures. Results indicate that subjects' perceptions of the agent are sensitive to both interaction style (social vs. task-only dialogue) and medium.
Keywords:
Embodied conversational agent, social dialogue, trust.
1. Introduction
Human-human dialogue does not just comprise statements about the task at hand, about the joint and separate goals of the interlocutors, and about their plans. In human-human conversation participants often engage in talk that, on the surface, does not seem to move the dialogue forward at all. However, this talk – about the weather, current events, and many other topics without significant overt relationship to the task at hand – may, in fact, be essential to how humans obtain information about one another's goals and plans and decide whether collaborative work is worth engaging in at all. For example, realtors use small talk to gather information to form stereotypes (a collection
of frequently co-occurring characteristics) of their clients – people who drive minivans are more likely to have children, and therefore to be searching for larger homes in neighbourhoods with good schools. Realtors – and salespeople in general – also use small talk to increase intimacy with their clients, to establish their own expertise, and to manage how and when they present information to the client [Prus, 1989].

Nonverbal behaviour plays an especially important role in such social dialogue, as evidenced by the fact that most important business meetings are still conducted face-to-face rather than on the phone. This intuition is backed up by empirical research; several studies have found that the additional nonverbal cues provided by video-mediated communication do not affect performance in task-oriented interactions, but in interactions of a more social nature, such as getting acquainted or negotiation, video is superior [Whittaker and O'Conaill, 1997]. These studies have found that for social tasks, interactions were more personalized, less argumentative and more polite when conducted via video-mediated communication, that participants believed video-mediated (and face-to-face) communication was superior, and that groups conversing using video-mediated communication tended to like each other more, compared to audio-only interactions.

Together, these findings indicate that if we are to develop computer agents capable of performing as well as humans on tasks such as real estate sales then, in addition to task goals such as reliable and efficient information delivery, they must have the appropriate social competencies designed into them. Further, since these competencies include the use of nonverbal behaviour for conveying communicative and social cues, our agents must have the capability of producing and recognizing nonverbal cues in simulations of face-to-face interactions. We call agents with such capabilities "Embodied Conversational Agents" or "ECAs."

The current chapter extends previous work which demonstrated that social dialogue can have a significant impact on a user's trust of an ECA [Bickmore and Cassell, 2001], by investigating whether these results hold in the absence of nonverbal cues. We present the results of a study designed to determine whether the psychological effects of social dialogue – namely to increase trust and associated positive evaluations – vary when the nonverbal cues provided by the embodied conversational agent are removed. In addition to varying medium (voice only vs. embodied) and dialogue style (social dialogue vs. task-only) we also assessed and examined effects due to the user's personality along the introversion/extroversion dimension, since extroversion is one indicator of a person's comfort level with face-to-face interaction.
2. Embodied Conversational Agents
Embodied conversational agents are animated anthropomorphic interface agents that are able to engage a user in real-time, multimodal dialogue, using speech, gesture, gaze, posture, intonation, and other verbal and nonverbal behaviours to emulate the experience of human face-to-face interaction [Cassell et al., 2000c]. The nonverbal channels are important not only for conveying information (redundantly or complementarily with respect to the speech channel), but also for regulating the flow of the conversation. The nonverbal channel is especially crucial for social dialogue, since it can be used to provide such social cues as attentiveness, positive affect, and liking and attraction, and to mark shifts into and out of social activities [Argyle, 1988].
2.1 Functions versus Behaviours
Embodiment provides the possibility for a wide range of behaviours that, when executed in tight synchronization with language, carry out a communicative function. It is important to understand that particular behaviours, such as the raising of the eyebrows, can be employed in a variety of circumstances to produce different communicative effects, and that the same communicative function may be realized through different sets of behaviours. It is therefore clear that any system dealing with conversational modelling has to handle function separately from surface-form or run the risk of being inflexible and insensitive to the natural phases of the conversation. Here we briefly describe some of the fundamental communication categories and their functional sub-parts, along with examples of nonverbal behaviour that contribute to their successful implementation. Table 2.1 shows examples of mappings from communicative function to particular behaviours and is based on previous research on typical North American nonverbal displays, mainly [Chovil, 1991; Duncan, 1974; Kendon, 1980].
Conversation initiation and termination Humans partake in an elaborate ritual when engaging and disengaging in conversations [Kendon, 1980]. For example, people will show their readiness to engage in a conversation by turning towards the potential interlocutors, gazing at them and then exchanging signs of mutual recognition typically involving a smile, eyebrow movement and tossing the head or waving of the arm. Following this initial synchronization stage, or distance salutation, the two people approach each other, sealing their commitment to the conversation through a close salutation such as a handshake accompanied by a ritualistic verbal exchange. The greeting phase ends when the two participants re-orient their bodies, moving away from a face-on orientation to stand at an angle. Terminating a conversation similarly moves through stages, starting with non-verbal cues, such as orientation shifts
or glances away, and culminating in the verbal exchange of farewells and the breaking of mutual gaze.

Table 2.1. Some examples of conversational functions and their behaviour realization [Cassell et al., 2000b].

Communicative Functions | Communicative Behaviour
Initiation and termination:
  Reacting | Short Glance
  Inviting Contact | Sustained Glance, Smile
  Distance Salutation | Looking, Head Toss/Nod, Raise Eyebrows, Wave, Smile
  Close Salutation | Looking, Head Nod, Embrace or Handshake, Smile
  Break Away | Glance Around
  Farewell | Looking, Head Nod, Wave
Turn-Taking:
  Give Turn | Looking, Raise Eyebrows (followed by silence)
  Wanting Turn | Raise Hands into gesture space
  Take Turn | Glance Away, Start talking
Feedback:
  Request Feedback | Looking, Raise Eyebrows
  Give Feedback | Looking, Head Nod
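Viewed computationally, Table 2.1 is a one-to-many lookup from communicative function to candidate surface behaviours, which a generation module can consult when realising a function (and an interpretation module can invert). The sketch below simply transcribes the table into such a lookup; the dictionary-based framing and the trivial realise() helper are our own illustration, not REA's implementation.

```python
# Candidate nonverbal realisations for each communicative function (from Table 2.1).
FUNCTION_TO_BEHAVIOURS = {
    "react":                ["short glance"],
    "invite contact":       ["sustained glance", "smile"],
    "distance salutation":  ["looking", "head toss/nod", "raise eyebrows", "wave", "smile"],
    "close salutation":     ["looking", "head nod", "embrace or handshake", "smile"],
    "break away":           ["glance around"],
    "farewell":             ["looking", "head nod", "wave"],
    "give turn":            ["looking", "raise eyebrows (followed by silence)"],
    "wanting turn":         ["raise hands into gesture space"],
    "take turn":            ["glance away", "start talking"],
    "request feedback":     ["looking", "raise eyebrows"],
    "give feedback":        ["looking", "head nod"],
}

def realise(function: str) -> list[str]:
    """Return candidate behaviours for a communicative function (empty if unknown)."""
    return FUNCTION_TO_BEHAVIOURS.get(function, [])

print(realise("give turn"))   # -> ['looking', 'raise eyebrows (followed by silence)']
```

Keeping the mapping in one place reflects the point made above: the same function can be realised by several behaviours, and the same behaviour can serve several functions, so function and surface form are best handled separately.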
Conversational turn-taking and interruption Interlocutors do not normally talk at the same time, thus imposing a turn-taking sequence on the conversation. The protocols involved in floor management – determining whose turn it is and when the turn should be given to the listener – involve many factors including gaze and intonation [Duncan, 1974]. In addition, listeners can interrupt a speaker not only with voice, but also by gesturing to indicate that they want the turn.

Content elaboration and emphasis Gestures can convey information about the content of the conversation in ways for which the hands are uniquely suited. For example, the two hands can better indicate simultaneity and spatial relationships than the voice or other channels. Probably the most commonly thought of use of the body in conversation is the pointing (deictic) gesture, possibly accounting for the fact that it is also the most commonly implemented for the bodies of animated interface agents. In fact, however, most conversations don't involve many deictic gestures [McNeill, 1992] unless the interlocutors are discussing a shared task that is currently present. Other conversational gestures also convey semantic and pragmatic information. Beat gestures are small, rhythmic, baton-like movements of the hands that do not change in form with the content of the accompanying speech. They serve a pragmatic function, conveying information about what is "new" in the speaker's discourse. Iconic and metaphoric gestures convey some features of the action or event being described. They can be redundant or complementary relative to the speech channel, and thus can convey additional information or provide robustness or emphasis with respect to what is being said. Whereas iconics convey information about spatial relationships or concepts, metaphorics represent concepts which have no physical form, such as a sweeping gesture accompanying "the property title is free and clear."
Feedback and error correction During conversation, speakers can nonverbally request feedback from listeners through gaze and raised eyebrows and listeners can provide feedback through head nods and paraverbals (“uh-huh”, “mmm”, etc.) if the speaker is understood, or a confused facial expression or lack of positive feedback if not. The listener can also ask clarifying questions if they did not hear or understand something the speaker said.
2.2 Interactional versus Propositional Behaviours
The mapping from form (behaviour) to conversational function relies on a fundamental division of conversational goals: contributions to a conversation can be propositional and interactional. Propositional information corresponds to the content of the conversation. This includes meaningful speech as well as hand gestures and intonation used to complement or elaborate upon the speech content (gestures that indicate the size in the sentence “it was this big” or rising intonation that indicates a question with the sentence “you went to the store”). Interactional information consists of the cues that regulate conversational process and includes a range of nonverbal behaviours (quick head nods to indicate that one is following) as well as regulatory speech (“huh?”, “Uhhuh”). This theoretical stance allows us to examine the role of embodiment not just in task- but also process-related behaviours such as social dialogue [Cassell et al., 2000b].
2.3 REA
Our platform for conducting research into embodied conversational agents is the REA system, developed in the Gesture and Narrative Language Group at the MIT Media Lab [Cassell et al., 2000a]. REA is an embodied, multi-modal real-time conversational interface agent which implements the conversational protocols described above in order to make interactions as natural as face-to-face conversation with another person. In the current task domain, REA acts as a real estate salesperson, answering user questions about properties in her database and showing users around the virtual houses (Figure 2.1).
Figure 2.1. User interacting with REA.
REA has a fully articulated graphical body, can sense the user passively through cameras and audio input, and is capable of speech with intonation, facial display, and gestural output. The system currently consists of a large projection screen on which REA is displayed and which the user stands in front of. Two cameras mounted on top of the projection screen track the user’s head and hand positions in space. Users wear a microphone for capturing speech input. A single SGI Octane computer runs the graphics and conversation engine of REA, while several other computers manage the speech recognition and generation and image processing. REA is able to conduct a conversation describing the features of the task domain while also responding to the users’ verbal and non-verbal input. When the user makes cues typically associated with turn taking behaviour such as gesturing, REA allows herself to be interrupted, and then takes the turn again when she is able. She is able to initiate conversational error correction when she misunderstands what the user says, and can generate combined voice, facial expression and gestural output. REA’s responses are generated by an incremental natural language generation engine based on [Stone and Doran, 1997] that has been extended to synthesize redundant and complementary gestures synchronized with speech output [Cassell et al., 2000b]. A simple discourse
model is used for determining which speech acts users are engaging in, and resolving and generating anaphoric references.
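To pull the division of labour just described into one place, here is a toy, self-contained sketch of a REA-style sense-interpret-respond cycle. It is a schematic reconstruction from the prose above, not the actual REA code; the class, its fields, and the canned responses are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class UserInput:
    speech: Optional[str]   # recognised utterance, or None if recognition failed
    gesturing: bool         # True if the cameras see the user's hands in gesture space

@dataclass
class AgentResponse:
    speech: str
    gestures: List[str]     # e.g. ["beat"] or ["point:kitchen"]

class MiniREA:
    """Toy stand-in for the behaviour described above; not the real REA system."""

    def __init__(self) -> None:
        self.speaking = False

    def step(self, user: UserInput) -> Optional[AgentResponse]:
        # Turn-taking: a user gesture while the agent holds the floor is treated
        # as an interruption cue, so the agent stops and yields the turn.
        if self.speaking and user.gesturing:
            self.speaking = False
            return None
        # Error correction: ask for a repeat when nothing was understood.
        if user.speech is None:
            return AgentResponse("Sorry, I didn't catch that.", ["raise eyebrows"])
        # Otherwise respond, holding the floor while speaking.
        self.speaking = True
        return AgentResponse("Let me show you the kitchen.", ["point:kitchen"])

if __name__ == "__main__":
    rea = MiniREA()
    print(rea.step(UserInput(speech="Do you have anything with two bedrooms?", gesturing=False)))
```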
3. Social Dialogue
Social dialogue is talk in which interpersonal goals are foregrounded and task goals – if existent – are backgrounded. One of the most familiar contexts in which social dialogue occurs is in human social encounters between individuals who have never met or are unfamiliar with each other. In these situations conversation is usually initiated by “small talk” in which “light” conversation is made about neutral topics (e.g., weather, aspects of the interlocutor’s physical environment) or in which personal experiences, preferences, and opinions are shared [Laver, 1981]. Even in business or sales meetings, it is customary (at least in American culture) to begin with some amount of small talk before “getting down to business”.
3.1 The Functions of Social Dialogue
The purpose of small talk is primarily to build rapport and trust among the interlocutors, provide time for them to "size each other up", establish an interactional style, and to allow them to establish their reputations [Dunbar, 1996]. Although small talk is most noticeable at the margins of conversational encounters, it can be used at various points in the interaction to continue to build rapport and trust [Cheepen, 1988], and in real estate sales, a good agent will continue to focus on building rapport throughout the relationship with a buyer [Garros, 1999].

Small talk has received sporadic treatment in the linguistics literature, starting with the seminal work of Malinowski, who defined "phatic communion" as "a type of speech in which ties of union are created by a mere exchange of words". Small talk is the language used in free, aimless social intercourse, which occurs when people are relaxing or when they are accompanying "some manual work by gossip quite unconnected with what they are doing" [Malinowski, 1923]. Jakobson also included a "phatic function" in his well-known conduit model of communication, that function being focused on the regulation of the conduit itself (as opposed to the message, sender, or receiver) [Jakobson, 1960]. More recent work has further characterized small talk by describing the contexts in which it occurs, topics typically used, and even grammars which define its surface form in certain domains [Cheepen, 1988; Laver, 1975; Schneider, 1988]. In addition, degree of "phaticity" has been proposed as a persistent goal which governs the degree of politeness in all utterances a speaker makes, including task-oriented ones [Coupland et al., 1992].
3.2 The Relationship between Social Dialogue and Trust
Figure 2.2 outlines the relationship between small talk and trust. REA's dialogue planner represents the relationship between her and the user using a multi-dimensional model of interpersonal relationship based on [Svennevig, 1999]:

Familiarity describes the way in which relationships develop through the reciprocal exchange of information, beginning with relatively nonintimate topics and gradually progressing to more personal and private topics. The growth of a relationship can be represented in both the breadth (number of topics) and depth (public to private) of information disclosed [Altman and Taylor, 1973].

Solidarity is defined as "like-mindedness" or having similar behaviour dispositions (e.g., similar political membership, family, religions, profession, gender, etc.), and is very similar to the notion of social distance used by Brown and Levinson in their theory of politeness [Brown and Levinson, 1978]. There is a correlation between frequency of contact and solidarity, but it is not necessarily a causal relation [Brown and Levinson, 1978; Brown and Gilman, 1972].

Affect represents the degree of liking the interactants have for each other, and there is evidence that this is an independent relational attribute from the above [Brown and Gilman, 1989].

The mechanisms by which small talk is hypothesized to effect trust include facework, coordination, building common ground, and reciprocal appreciation.
Facework The notion of "face" is "the positive social value a person effectively claims for himself by the social role others assume he has taken during a particular contact" [Goffman, 1967]. Interactants maintain face by having their social role accepted and acknowledged. Events which are incompatible with their line are "face threats" and are mitigated by various corrective measures if they are not to lose face. Small talk avoids face threat (and therefore maintains solidarity) by keeping conversation at a safe level of depth.

Coordination The process of interacting with a user in a fluid and natural manner may increase the user's liking of the agent, and user's positive affect, since the simple act of coordination with another appears to be deeply gratifying. "Friends are a major source of joy, partly because of the enjoyable things they do together, and the reason that they are enjoyable is perhaps the coordination." [Argyle, 1990]. Small talk increases coordination between the two participants by allowing them to synchronize short units of talk and nonverbal acknowledgement (and therefore leads to increased liking and positive affect).
Figure 2.2. How small talk effects trust [Cassell and Bickmore, 2003].
Building common ground Information which is known by all interactants to be shared (mutual knowledge) is said to be in the "common ground" [Clark, 1996]. The principal way for information to move into the common ground is via face-to-face communication, since all interactants can observe the recognition and acknowledgment that the information is in fact mutually shared. One strategy for effecting changes to the familiarity dimension of the relationship model is for speakers to disclose personal information about themselves – moving it into the common ground – and induce the listener to do the same. Another strategy is to talk about topics that are obviously in the common ground – such as the weather, physical surroundings, and other topics available in the immediate context of utterance. Small talk establishes common ground (and therefore increases familiarity) by discussing topics that are clearly in the context of utterance.

Reciprocal appreciation In small talk, demonstrating appreciation for and agreement with the contributions of one's interlocutor is obligatory. Performing this aspect of the small talk ritual increases solidarity by showing mutual agreement on the topics discussed.
3.3 Nonverbal Behaviour in Social Dialogue
According to Argyle, nonverbal behaviour is used to express emotions, to communicate interpersonal attitudes, to accompany and support speech, for self-presentation, and to engage in rituals such as greetings [Argyle, 1988]. Of these, coverbal and emotional display behaviours have received the most attention in the literature on embodied conversational agents and facial and character animation in general, e.g. [Cassell et al., 2000c]. Next to these, the most important use of nonverbal behaviour in social dialogue is the display of interpersonal attitude [Argyle, 1988]. The display of positive or negative attitude can greatly influence whether we approach someone or not and our initial perceptions of them if we do. The most consistent finding in this area is that the use of nonverbal "immediacy behaviours" – close conversational distance, direct body and facial orientation, forward lean, increased and direct gaze, smiling, pleasant facial expressions and facial animation in general, nodding, frequent gesturing and postural openness – projects liking for the other and engagement in the interaction, and is correlated with increased solidarity [Argyle, 1988; Richmond and McCroskey, 1995]. Other nonverbal aspects of "warmth" include kinesic behaviours such as head tilts, bodily relaxation, lack of random movement, open body positions, and postural mirroring, and vocalic behaviours such as more variation in pitch, amplitude, duration and tempo, reinforcing interjections such as "uh-huh" and "mm-hmmm", greater fluency, warmth, pleasantness, expressiveness, and clarity and smoother turn-taking [Andersen and Guerrero, 1998].

In summary, nonverbal behaviour plays an important role in all face-to-face interaction – both conveying redundant and complementary propositional information (with respect to speech) and regulating the structure of the interaction. In social dialogue, however, it provides the additional, and crucial, function of conveying attitudinal information about the nature of the relationship between the interactants.
4. Related Work

4.1 Related Work on Embodied Conversational Agents
Work on the development of ECAs, as a distinct field of development, is best summarized in [Cassell et al., 2000c]. The current study is based on the REA ECA (see Figure 2.1), a simulated real estate agent, who uses vision-based gesture recognition, speech recognition, discourse planning, sentence and gesture planning, speech synthesis and animation of a 3D body [Cassell et al., 1999]. Some of the other major systems developed to date are Steve [Rickel and Johnson, 1998], the DFKI Persona [André et al., 1996], Olga [Beskow and
McGlashan, 1997], and pedagogical agents developed by Lester et al. [1999]. Sidner and Dzikovska [2005] report progress on a robotic ECA that performs hosting activities, with a special emphasis on “engagement” – an interactional behaviour whose purpose is to establish and maintain the connection between interlocutors during a conversation. These systems vary in their linguistic generativity, input modalities, and task domains, but all aim to engage the user in natural, embodied conversation. Little work has been done on modelling social dialogue with ECAs. The August system is an ECA kiosk designed to give information about local restaurants and other facilities. In an experiment to characterize the kinds of things that people would say to such an agent, over 10,000 utterances from over 2,500 users were collected. It was found that most people tried to socialize with the agent, with approximately 1/3 of all recorded utterances classified as social in nature [Gustafson et al., 1999].
4.2 Related Studies on Embodied Conversational Agents
Koda and Maes [1996] and Takeuchi and Naito [1995] studied interfaces with static or animated faces, and found that users rated them to be more engaging and entertaining than functionally equivalent interfaces without a face. Kiesler and Sproull [1997] found that users were more likely to be cooperative with an interface agent when it had a human face (vs. a dog or cartoon dog). André, Rist and Muller found that users rated their animated presentation agent (“PPP Persona”) as more entertaining and helpful than an equivalent interface without the agent [André et al., 1998]. However, there was no difference in actual performance (comprehension and recall of presented material) in interfaces with the agent vs. interfaces without it. In another study involving this agent, van Mulken, André and Muller found that when the quality of advice provided by an agent was high, subjects actually reported trusting a text-based agent more than either their ECA or a video-based agent (when the quality of advice was low there were no significant differences in trust ratings between agents) [van Mulken et al., 1999]. In a user study of the Gandalf system [Cassell et al., 1999], users rated the smoothness of the interaction and the agent’s language skills significantly higher under test conditions in which Gandalf utilized limited conversational behaviour (gaze, turn-taking and beat gesture) than when these behaviours were disabled. In terms of social behaviours, Sproull et al. [1997] showed that subjects rated a female embodied interface significantly lower in sociability and gave it a significantly more negative social evaluation compared to a text-only interface. Subjects also reported being less relaxed and assured when interacting with the embodied interface than when interacting with the text interface.
Finally, they gave themselves significantly higher scores on social desirability scales, but disclosed less (wrote significantly less and skipped more questions in response to queries by the interface) when interacting with an embodied interface vs. a text-only interface. Men were found to disclose more in the embodied condition and women disclosed more in the text-only condition. Most of these evaluations have tried to address whether embodiment of a system is useful at all, by including or not including an animated figure. In their survey of user studies on embodied agents, Dehn and van Mulken conclude that there is no "persona effect", that is, a general advantage of an interface with an animated agent over one without an animated agent [Dehn and van Mulken, 2000]. However, they believe that lack of evidence and inconsistencies in the studies performed to date may be attributable to methodological shortcomings and variations in the kinds of animations used, the kinds of comparisons made (control conditions), the specific measures used for the dependent variables, and the task and context of the interaction.
4.3 Related Studies on Mediated Communication
Several studies have shown that people speak differently to a computer than to another person, even though there are typically no differences in task outcomes in these evaluations. Hauptmann and Rudnicky [1988] performed one of the first studies in this area. They asked subjects to carry out a simple information-gathering task through a (simulated) natural language speech interface, and compared this with speech to a co-present human in the same task. They found that speech to the simulated computer system was telegraphic and formal, approximating a command language. In particular, when speaking to what they believed to be a computer, subjects' utterances used a small vocabulary, often sounding like system commands, with very few task-unrelated utterances, and fewer filled pauses and other disfluencies. These results were extended in research conducted by Oviatt [Oviatt, 1995; Oviatt and Adams, 2000; Oviatt, 1998], in which she found that speech to a computer system was characterized by a low rate of disfluencies relative to speech to a co-present human. She also noted that visual feedback has an effect on disfluency: telephone calls have a higher rate of disfluency than co-present dialogue. From these results, it seems that people speak more carefully and less naturally when interacting with a computer.

Boyle et al. [1994] compared pairs of subjects working on a map-based task who were visible to each other with pairs of subjects who were co-present but could not see each other. Although no performance difference was found between the two conditions, when subjects could not see one another, they compensated by giving more verbal feedback and using longer utterances. Their conversation was found to be less smooth than that between mutually visible
partners, indicated by more interruptions, and less efficient, as more turns were required to complete the task. The researchers concluded that visual feedback improves the smoothness and efficiency of the interaction, but that we have devices to compensate for this when visibility is restricted. Daly-Jones et al. [1998] also failed to find any difference in performance between video-mediated and audio-mediated conversations, although they did find differences in the quality of the interactions (e.g., more explicit questions in the audio-only condition). Whittaker and O'Conaill [1997] survey the results of several studies which compared video-mediated communication with audio-only communication and concluded that the visual channel does not significantly impact performance outcomes in task-oriented collaborations, although it does affect social and affective dimensions of communication. Comparing video-mediated communication to face-to-face and audio-only conversations, they also found that speakers used more formal turn-taking techniques in the video condition even though users reported that they perceived many benefits to video conferencing relative to the audio-only mode.

In a series of studies on the effects of different media and activities on trust, Zheng, Veinott et al. have demonstrated that social interaction, even if carried out online, significantly increases people's trust in each other [Zheng et al., 2002]. Similarly, Bos et al. [2002] demonstrated that richer media – such as face-to-face, video-, and audio-mediated communication – lead to higher trust levels than media with lower bandwidth such as text chat.

Finally, a number of studies have been done comparing face-to-face conversations with conversations on the phone [Rutter, 1987]. These studies find that, in general, there is more cooperation and trust in face-to-face interaction. One study found that audio-only communication encouraged negotiators to behave impersonally, to ignore the subtleties of self-presentation, and to concentrate primarily on pursuing victory for their side. Other studies found similar gains in cooperation among subjects playing prisoner's dilemma face-to-face compared to playing it over the phone. Face-to-face interactions are also less formal and more spontaneous than conversations on the phone. One study found that face-to-face discussions were more protracted and wide-ranging while subjects communicating via audio-only kept much more to the specific issues on the agenda (the study also found that when the topics were more wide-ranging, changes in attitude among the participants were more likely to occur). Although several studies found increases in favourable impressions of interactants in face-to-face conversation relative to audio-only, these effects have not been consistently validated.
4.4 Trait-Based Variation in User Responses
Several studies have shown that users react differently to social agents based on their own personality and other dispositional traits. For example, Reeves and Nass have shown that users like agents that match their own personality (on the introversion/extraversion dimension) more than those which do not, regardless of whether the personality is portrayed through text or speech [Nass and Gong, 2000; Reeves and Nass, 1996]. Resnick and Lammers showed that in order to change user behaviour via corrective error messages, the messages should have different degrees of "humanness" depending on whether the user has high or low self-esteem ("computer-ese" messages should be used with low self-esteem users, while "human-like" messages should be used with high self-esteem users) [Resnick and Lammers, 1985]. Rickenberg and Reeves showed that different types of animated agents affected the anxiety level of users differentially as a function of whether users tended towards internal or external locus of control [Rickenberg and Reeves, 2000]. In our earlier study on the effects of social dialogue on trust in ECA interactions, we found that social dialogue significantly increased trust for extraverts, while it made no significant difference for introverts [Cassell and Bickmore, 2003]. In light of the studies summarized here, the question that remains is whether these effects continue to hold if the nonverbal cues provided by the ECA are removed.
5. Social Dialogue in REA
For the purpose of trust elicitation and small talk, we have constructed a new kind of discourse planner that can interleave small talk and task talk during the initial buyer interview, based on the relational model outlined above. An overview of the planner is provided here; details of its implementation can be found in Cassell and Bickmore [2003].
5.1 Planning Model
Given that many of the goals in a relational conversational strategy are nondiscrete (e.g., minimize face threat), and that trade-offs among multiple goals have to be achieved at any given time, we have moved away from static world discourse planning, and are using an activation network-based approach based on Maes’ Do the Right Thing architecture [Maes, 1989]. This architecture provides the capability to transition smoothly from deliberative, planned behaviour to opportunistic, reactive behaviour, and is able to pursue multiple, non-discrete goals. In our implementation each node in the network represents a conversational move that REA can make.
Thus, during task talk, REA may ask questions about users' buying preferences, such as the number of bedrooms they need. During small talk, REA can talk about the weather, events and objects in her shared physical context with the user (e.g., the lab setting), or she can tell stories about the lab, herself, or real estate. REA's conversational moves are planned in order to minimize the face threat to the user, and maximize trust, while pursuing her task goals in the most efficient manner possible. That is, REA attempts to determine the face threat of her next conversational move, assesses the solidarity and familiarity which she currently holds with the user, and judges which topics will seem most relevant and least intrusive to users. As a function of these factors, REA chooses whether or not to engage in small talk, and what kind of small talk to choose. The selection of which move should be pursued by REA at any given time is thus a non-discrete function of the following factors:

Closeness REA continually assesses her "interpersonal" closeness with the user, which is a composite representing depth of familiarity and solidarity, modelled as a scalar quantity. Each conversational topic has a predefined, pre-requisite closeness that must be achieved before REA can introduce the topic. Given this, the system can plan to perform small talk in order to "grease the tracks" for task talk, especially about sensitive topics like finance.

Topic REA keeps track of the current and past conversational topics. Conversational moves which stay within topic (maintain topic coherence) are given preference over those which do not. In addition, REA can plan to execute a sequence of moves which gradually transition the topic from its current state to one that REA wants to talk about (e.g., from talk about the weather, to talk about Boston weather, to talk about Boston real estate).

Relevance REA maintains a list of topics that she thinks the user knows about, and the discourse planner prefers moves which involve topics in this list. The list is initialized to things that anyone talking to REA would know about – such as the weather outside, Cambridge, MIT, or the laboratory that REA lives in.

Task goals REA has a list of prioritized goals to find out about the user's housing needs in the initial interview. Conversational moves which directly work towards satisfying these goals (such as asking interview questions) are preferred.

Logical preconditions Conversational moves have logical preconditions (e.g., it makes no sense for REA to ask users what their major is
until she has established that they are students), and are not selected for execution until all of their preconditions are satisfied.

One advantage of the activation network approach is that by simply adjusting a few gains we can make REA more or less coherent, more or less polite (attentive to closeness constraints), more or less task-oriented, or more or less deliberative (vs. reactive) in her linguistic behaviour. In the current implementation, the dialogue is entirely REA-initiated, and user responses are recognized via a speaker-independent, grammar-based, continuous speech recognizer (currently IBM ViaVoice). The active grammar fragment is specified by the current conversational move, and for responses to many REA small talk moves the content of the user's speech is ignored; only the fact that the person responded at all is enough to advance the dialogue.

At each step in the conversation in which REA has the floor (as tracked by a conversational state machine in REA's Reaction Module [Cassell et al., 2000a]), the discourse planner is consulted for the next conversational move to initiate. At this point, activation values are incrementally propagated through the network (following [Maes, 1989]) until a move is selected whose preconditions are satisfied and whose activation value is over a specified threshold. Within this framework, REA decides to do small talk whenever closeness with the user needs to be increased (e.g., before a task query can be asked), or the topic needs to be moved little-by-little to a desired topic and small talk contributions exist which can facilitate this. The activation energy from the user relevance condition described above leads to REA starting small talk with topics that are known to be in the shared environment with the user (e.g., talk about the weather or the lab).
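A minimal sketch of this selection scheme is given below. It mirrors the description above – moves carry activation, energy accrues from task goals, topic coherence, relevance and closeness constraints, and a move is executed once its preconditions hold and its activation crosses a threshold – but the particular weights, attribute names, and example moves are invented for illustration and are not the actual REA planner.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Move:
    name: str
    topic: str
    required_closeness: float                    # pre-requisite closeness for this topic
    task_priority: float = 0.0                   # how directly it serves an interview goal
    preconditions: List[str] = field(default_factory=list)   # facts that must already hold
    activation: float = 0.0

def select_move(moves, state, threshold=1.0, max_rounds=10):
    """Spread activation until some executable move crosses the threshold
    (loosely after Maes' activation network; the weights here are arbitrary)."""
    for _ in range(max_rounds):
        for m in moves:
            m.activation += m.task_priority                  # energy from task goals
            if m.topic == state["topic"]:
                m.activation += 0.5                          # topic coherence
            if m.topic in state["relevant_topics"]:
                m.activation += 0.3                          # relevance to the user
            if state["closeness"] < m.required_closeness:
                m.activation -= 0.8                          # face-threat penalty
        executable = [m for m in moves
                      if all(p in state["facts"] for p in m.preconditions)
                      and m.activation >= threshold]
        if executable:
            return max(executable, key=lambda m: m.activation)
    return None

# With low closeness, the face-threat penalty keeps the finance question below
# threshold longer than the weather small talk, so small talk is selected first.
moves = [
    Move("ask_affordability", topic="finance", required_closeness=0.7, task_priority=1.0),
    Move("comment_on_weather", topic="weather", required_closeness=0.0, task_priority=0.2),
]
state = {"topic": "greeting", "relevant_topics": {"weather", "lab"},
         "closeness": 0.2, "facts": set()}
print(select_move(moves, state).name)   # -> comment_on_weather
```

Adjusting the relative weights in such a scheme is what the text above refers to as tuning the gains: larger topic-coherence or closeness weights yield a more polite, coherent agent, while a larger task-goal weight yields a more task-oriented one.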
5.2 Interactional Behaviour during Social Dialogue
Shifts between small talk moves and task moves are marked by conventional contextualization cues – discourse markers and beat gestures. Discourse markers include “so” on the first small talk to task talk transition, “anyway” on resumption of task talk from small talk, and “you know” on transition to small talk from task talk [Clark, 1996]. Prior to producing lengthy utterances, REA gazes away briefly before she starts her turn, partly as a turn-taking and floor-holding move and partly to mask the processing delays in generating long utterances. Finally, REA smiles as soon as she detects that users have started their speaking turns (using audio thresholding of the user’s speech).
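These contextualization cues amount to a small rule table keyed on the type of move transition. A sketch of those rules follows; the function name, its arguments, and the dictionary format are our own, and only the markers and the use of beat gestures come from the description above.

```python
def contextualization_cues(prev_kind: str, next_kind: str, first_shift: bool = False) -> dict:
    """Discourse marker and gesture for a move-type transition, following the
    conventions described above (a simplification, not REA's actual code)."""
    cues = {"marker": None, "gesture": "beat"}        # shifts are marked with beat gestures
    if prev_kind == "small_talk" and next_kind == "task":
        cues["marker"] = "so" if first_shift else "anyway"
    elif prev_kind == "task" and next_kind == "small_talk":
        cues["marker"] = "you know"
    else:
        cues["gesture"] = None                        # no special marking within a phase
    return cues

print(contextualization_cues("small_talk", "task", first_shift=True))
# -> {'marker': 'so', 'gesture': 'beat'}
```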
5.3 Example Interaction
An interview between REA and a user typically proceeds as shown in the following dialogue. (User responses are only shown in positions in which they
affect the selection of subsequent moves; REA contributions are her actual utterances.)

1 That microphone is terrible, I hate using those things.
2 Sorry about my voice, this is some engineer's idea of natural sounding.
3 Are you one of our sponsors?
User: Yes
4 Were you at our last sponsor meetings?
5 I got so exhausted at the last sponsor meeting I think I was starting to lose my voice by the end.
6 So, where would you like to live?
7 How many bedrooms do you need?
8 Do you need access to the subway?
9 Is one bath enough?
10 You know, Boston is certainly more expensive than it used to be.
11 Anyway, what can you afford?
12 What kind of down payment can you make?
13 Let me see what I have available.

Dialogue 1. "Small Talk" REA.
In this example, REA opens with small talk moves regarding things in her shared physical environment with the user (1-2). She then proceeds to small talk related to sponsors (after establishing that the user is a sponsor). After a few turns, enough closeness has been established (simply by doing small talk) that REA can move into task talk (6-9). However, before bringing up the topic of finance – a topic that is potentially very face threatening for the user – REA decides that additional closeness needs to be established, and moves back into small talk (10). This small talk move not only increases closeness but shifts the topic to finance, enabling REA to then bring up the issue of how much the user is able to afford (11-12). If REA’s adherence to closeness preconditions is reduced, by decreasing the contributions of these preconditions to the activation of joint projects, this results in her engaging in less small talk and being more task goal oriented. If everything else is held constant (relative to the prior example) the following dialogue is produced.
1 So, where would you like to live?
2 What can you afford?
3 What kind of down payment can you make?
4 How many bedrooms do you need?
5 Do you need access to the subway?
6 Is one bath enough?
7 Let me see what I have available.

Dialogue 2. "Task-only REA".
In this example, REA does not perform any small talk and sequences the task questions in strictly decreasing order of priority.
6. A Study Comparing ECA Social Dialogue with Audio-Only Social Dialogue
The dialogue model presented above produces a reasonable facsimile of the social dialogue observed in service encounters such as real estate sales. But, does small talk produced by an ECA in a sales encounter actually build trust and solidarity with users? And, does nonverbal behaviour play the same critical role in human-ECA social dialogue as it appears to play in human-human social interactions? In order to answer these questions, we conducted an empirical study in which subjects were interviewed by REA about their housing needs, shown two “virtual” apartments, and then asked to submit a bid on one of them. For the purpose of the experiment, REA was controlled by a human wizard and followed scripts identical to the output of the planner (but faster, and not dependent on automatic speech recognition or computational vision). Users interacted with one of two versions of REA which were identical except that one had only task-oriented dialogue (TASK condition) while the other also included the social dialogue designed to avoid face threat, and increase trust (SOCIAL condition). A second manipulation involved varying whether subjects interacted with the fully embodied REA – appearing in front of the virtual apartments as a life-sized character (EMBODIED condition) – or viewed only the virtual apartments while talking with REA over a telephone. Together these variables provided a 2x2 experimental design: SOCIAL vs. TASK and EMBODIED vs. PHONE. Our hypotheses follow from the literature on small talk and on trust among humans. We expected subjects in the SOCIAL condition to trust REA more, feel closer to REA, like her more, and feel that they understood each other more
than in the TASK condition. We also expected users to think the interaction was more natural, lifelike, and comfortable in the SOCIAL condition. Finally, we expected users to be willing to pay REA more for an apartment in the SOCIAL condition, given the hypothesized increase in trust. We also expected all of these SOCIAL effects to be amplified in the EMBODIED condition relative to the PHONE-only condition.
6.1 Experimental Methods
This was a multivariate, multiple-factor, between-subjects experimental design, involving 58 subjects (69% male and 31% female).
6.1.1 Apparatus. One wall of the experiment room was a rear-projection screen. In the EMBODIED condition REA appeared life-sized on the screen, in front of the 3D virtual apartments she showed, and her synthetic voice was played through two speakers on the floor in front of the screen. In the PHONE condition only the 3D virtual apartments were displayed and subjects interacted with REA over an ordinary telephone placed on a table in front of the screen. For the purpose of this experiment, REA was controlled via a wizard-of-oz setup on another computer positioned behind the projection screen.

The interaction script included verbal and nonverbal behaviour specifications for REA (e.g., gesture and gaze commands as well as speech), and embedded commands describing when different rooms in the virtual apartments should be shown. Three pieces of information obtained from the user during the interview were entered into the control system by the wizard: the city the subject wanted to live in; the number of bedrooms s/he wanted; and how much s/he was willing to spend. The first apartment shown was in the specified city, but had twice as many bedrooms as the subject requested and cost twice as much as s/he could afford (the subject was also told the price was "firm"). The second apartment shown was in the specified city, had the exact number of bedrooms requested, but cost 50% more than the subject could afford (but this time, the subject was told that the price was "negotiable").

The scripts consisted of a linear sequence of utterances (statements and questions) that would be made by REA in a given interaction: there was no branching or variability in content beyond the three pieces of information described above. This helped ensure that all subjects received the same intervention regardless of what they said in response to any given question by REA. Subject-initiated utterances were responded to with either backchannel feedback (e.g., "Really?") for statements or "I don't know" for questions, followed by an immediate return to the script.
described in [Bickmore and Cassell, 2001]. The part of the script governing the dialogue from the showing of the second apartment through the end of the interaction was identical in both conditions.

Procedure. Subjects were told that they would be interacting with REA, who played the role of a real estate agent and could show them apartments she had for rent. They were told that they were to play the role of someone looking for an apartment in the Boston area. In both conditions subjects were told that they could talk to REA "just like you would to another person".
6.1.2 Measures. Subjective evaluations of REA – including how friendly, credible, lifelike, warm, competent, reliable, efficient, informed, knowledgeable and intelligent she was – were measured by single items on nine-point Likert scales. Evaluations of the interaction – including how tedious, involving, enjoyable, natural, satisfying, fun, engaging, comfortable and successful it was – were also measured on nine-point Likert scales. Evaluation of how well subjects felt they knew REA, how well she knew and understood them and how close they felt to her were measured in the same manner. All scales were adapted from previous research on user responses to personality types in embodied conversational agents [Moon and Nass, 1996].

Liking of REA was an index composed of three items – how likeable and pleasant REA was and how much subjects liked her – measured on nine-point Likert scales (Cronbach's alpha =.87).

Amount Willing to Pay was computed as follows. During the interview, REA asked subjects how much they were able to pay for an apartment; subjects' responses were entered as $X per month. REA then offered the second apartment for $Y (where Y = 1.5 X), and mentioned that the price was negotiable. On the questionnaire, subjects were asked how much they would be willing to pay for the second apartment, and this was encoded as Z. The task measure used was (Z - X) / (Y - X), which varies from 0% if the user did not budge from their original requested price, to 100% if they offered the full asking price (a worked example is given below).

Trust was measured by a standardized trust scale [Wheeless and Grotz, 1977] (alpha =.93). Although trust is sometimes measured behaviourally using a Prisoner's Dilemma game [Zheng et al., 2002], we felt that our experimental protocol was already too long and that game-playing did not fit well into the real estate scenario.

Given literature on the relationship between user personality and preference for computer behaviour, we were concerned that subjects might respond differentially based on predisposition. Thus, we also included composite measures for introversion and extroversion on the questionnaire.

Extrovertedness was an index composed of seven Wiggins [Wiggins, 1979] extrovert adjective items: Cheerful, Enthusiastic, Extroverted, Jovial, Outgoing, and Perky. It was used for assessment of the subject's personality (alpha =.87).

Introvertedness was an index composed of seven Wiggins [Wiggins, 1979] introvert adjective items: Bashful, Introverted, Inward, Shy, Undemonstrative, Unrevealing, and Unsparkling. It was used for assessment of the subject's personality (alpha =.84). Note that these personality scales were administered on the post-test questionnaire. For the purposes of this experiment, therefore, subjects who scored over the mean on introversion-extroversion were said to be extroverts, while those who scored under the mean were said to be introverts.
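As a worked example of the Amount Willing to Pay measure defined above (the dollar figures are invented for illustration):

```python
def amount_willing_to_pay(x: float, y: float, z: float) -> float:
    """Normalised concession: 0.0 = stayed at the stated budget X,
    1.0 = offered the full asking price Y; Z is the subject's final offer."""
    return (z - x) / (y - x)

# Hypothetical figures: budget X = $1000/month, asking price Y = $1500 (= 1.5 X),
# and the subject reports being willing to pay Z = $1250 on the questionnaire.
print(amount_willing_to_pay(1000, 1500, 1250))   # -> 0.5, i.e. a 50% concession
```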
6.1.3 Behavioural measures. Rates of speech disfluency (as defined in [Oviatt, 1995]) and utterance length were coded from the video data. Observation of the videotaped data made it clear that some subjects took the initiative in the conversation, while others allowed REA to lead. Unfortunately, REA is not yet able to deal with user-initiated talk, and so user initiative often led to REA interrupting the speaker. To assess the effect of this phenomenon, we therefore divided subjects into PASSIVE (below the mean on number of user-initiated utterances) and ACTIVE (above the mean on number of user-initiated utterances). To our surprise, these measures turned out to be independent of introversion/extroversion (Pearson r=0.042), and to not be predicted by these latter variables.
6.2 Results
Full factorial single measure ANOVAs were run, with SOCIALITY (Task vs. Social), PERSONALITY OF SUBJECT (Introvert vs. Extrovert), MEDIUM (Phone vs. Embodied) and INITIATION (Active vs. Passive) as independent variables.
6.2.1 Subjective assessments of REA. In looking at the questionnaire data, our first impression is that subjects seemed to feel more comfortable interacting with REA over the phone than face-to-face. Thus, subjects in the phone condition felt that they knew REA better (F=5.02; p