Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board David Hutchison Lancaster University, UK Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA Josef Kittler University of Surrey, Guildford, UK Jon M. Kleinberg Cornell University, Ithaca, NY, USA Alfred Kobsa University of California, Irvine, CA, USA Friedemann Mattern ETH Zurich, Switzerland John C. Mitchell Stanford University, CA, USA Moni Naor Weizmann Institute of Science, Rehovot, Israel Oscar Nierstrasz University of Bern, Switzerland C. Pandu Rangan Indian Institute of Technology, Madras, India Bernhard Steffen TU Dortmund University, Germany Madhu Sudan Microsoft Research, Cambridge, MA, USA Demetri Terzopoulos University of California, Los Angeles, CA, USA Doug Tygar University of California, Berkeley, CA, USA Gerhard Weikum Max Planck Institute for Informatics, Saarbruecken, Germany
6800
Anna Esposito Alessandro Vinciarelli Klára Vicsi Catherine Pelachaud Anton Nijholt (Eds.)
Analysis of Verbal and Nonverbal Communication and Enactment The Processing Issues COST 2102 International Conference Budapest, Hungary, September 7-10, 2010 Revised Selected Papers
Volume Editors Anna Esposito Second University of Naples and IIASS, Vietri sul Mare (SA), Italy E-mail:
[email protected] Alessandro Vinciarelli University of Glasgow, UK E-mail:
[email protected] Klára Vicsi Budapest University of Technology and Economics, Hungary E-mail:
[email protected] Catherine Pelachaud TELECOM ParisTech, Paris, France E-mail:
[email protected] Anton Nijholt University of Twente, Enschede, The Netherlands E-mail:
[email protected] ISSN 0302-9743 e-ISSN 1611-3349 ISBN 978-3-642-25774-2 e-ISBN 978-3-642-25775-9 DOI 10.1007/978-3-642-25775-9 Springer Heidelberg Dordrecht London New York Library of Congress Control Number: Applied for CR Subject Classification (1998): H.4, H.5, I.4, I.2, J.4 LNCS Sublibrary: SL 3 – Information Systems and Application, incl. Internet/Web and HCI
© Springer-Verlag Berlin Heidelberg 2011 This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper Springer is part of Springer Science+Business Media (www.springer.com)
Preface
This book is dedicated to: Luigi Maria Ricciardi for his 360-degree open mind. We will miss his guidance now and forever and to: what has never been, what was possible, and what could have been though we never know what it was.
This volume brings together the advanced research results obtained by the European COST Action 2102 “Cross Modal Analysis of Verbal and Nonverbal Communication,” primarily discussed at the PINK SSPnet-COST 2102 International Conference on “Analysis of Verbal and Nonverbal Communication and Enactment: The Processing Issues” held in Budapest, Hungary, September 7–10, 2010 (http://berber.tmit.bme.hu/cost2102/). The conference was jointly sponsored by COST (European Cooperation in Science and Technology, www.cost.eu) in the domain of Information and Communication Technologies (ICT) for disseminating the advances of the research activities developed within the COST Action 2102: “Cross-Modal Analysis of Verbal and Nonverbal Communication” (cost2102.cs.stir.ac.uk) and by the European Network of Excellence on Social Signal Processing, SSPnet (http://sspnet.eu/). The main focus of the conference was on methods to combine and build up knowledge through verbal and nonverbal signals enacted in an environment and in a context. In previous meetings, COST 2102 focused on the importance of uncovering and exploiting the wealth of information conveyed by multimodal signals. The next steps have been to analyze actions performed in response to multimodal signals and to study how these actions are organized in a realistic and socially believable context. The focus was on processing issues, since the new approach is computationally complex and the amount of data to be treated may be considered algorithmically infeasible. Therefore, data processing for gaining enactive knowledge must account for natural and intuitive approaches, based more on heuristics and experience than on symbols, as well as for the discovery of new processing possibilities that account for new approaches to data analysis, coordination of the data flow through synchronization, and temporal organization and optimization of the extracted features.
The conference had a special session for COST 2102 students. The idea was to select original contributions from early-stage researchers. To this aim all the papers accepted in this volume were peer reviewed. This conference also aimed at underlining the role that women have had in ICT and, to this end, the conference was named “First SSPnet-COST2102 PINK International Conference.” The International Steering Committee was composed of only women. The themes of the volume cover topics on verbal and nonverbal information in body-to-body communication, cross-modal analysis of speech, gestures, gaze and facial expressions, socio-cultural differences and personal traits, multimodal algorithms and procedures for the automatic recognition of emotions, faces, facial expressions, and gestures, audio and video features for implementing intelligent avatars and interactive dialogue systems, and virtual communicative agents and interactive dialogue systems. The book is arranged into two scientific sections according to a rough thematic classification, even though both sections are closely connected and both provide fundamental insights for cross-fertilization of different disciplines. The first section, “Multimodal Signals: Analysis, Processing and Computational Issues,” deals with conjectural and processing issues of defining models, algorithms, and heuristic strategies for data analysis, coordination of the data flow and optimal encoding of multi-channel verbal and nonverbal features. The second section, “Verbal and Nonverbal Social Signals,” presents original studies that provide theoretical and practical solutions to the modelling of timing synchronization between linguistic and paralinguistic expressions, actions, body movements, and activities in human interaction and on their assistance for effective human–machine interactions. The papers included in this book benefited from the live interactions among the many participants of the successful meeting in Budapest. Over 90 senior and junior researchers gathered for the event. The editors would like to thank the Management Board of the SSPnet and the ESF COST-ICT Programme for the support in the realization of the conference and the publication of this volume. Acknowledgements go in particular to the COST Science Officers Matteo Razzanelli, Aranzazu Sanchez, Jamsheed Shorish, and the COST 2102 reporter Guntar Balodis for their constant help, guidance, and encouragement. The event owes its success to more individuals than can be named, but notably the members of the local Steering Committee Klára Vicsi, György Szaszák, and Dávid Sztahó, who actively worked for the success of the event. Special appreciation goes to the president of the International Institute for Advanced Scientific Studies (IIASS), Gaetano Scarpetta, and to the Dean and the Director of the Faculty and the Department of Psychology at the Second University of Naples, Alida Labella and Giovanna Nigro, for
making available people and resources for the editing of this volume. The editors are deeply indebted to the contributors for making this book a scientifically stimulating compilation of new and original ideas and to the members of the COST 2102 International Scientific Committee for their rigorous and invaluable scientific revisions, dedication, and priceless selection process. July 2011
Anna Esposito Alessandro Vinciarelli Klára Vicsi Catherine Pelachaud Anton Nijholt
Organization
International Steering Committee
Anna Esposito, Second University of Naples and IIASS, Italy
Klára Vicsi, Budapest University of Technology and Economics, Hungary
Catherine Pelachaud, CNRS, TELECOM ParisTech, France
Zsófia Ruttkay, Pázmány Péter Catholic University, Hungary
Jurate Puniene, Kaunas University of Technology, Lithuania
Isabel Trancoso, INESC-ID Lisboa, Portugal
Inmaculada Hernaez, Universidad del País Vasco, Spain
Jerneja Žganec Gros, Ljubljana, Slovenia
Anna Pribilova, Slovak University of Technology, Slovak Republic
Kristiina Jokinen, University of Helsinki, Finland
COST 2102 International Scientific Committee Alberto Abad Samer Al Moubayed Uwe Altmann Sigr´ un Mar´ıa Ammendrup Hicham Atassi Nikos Avouris Martin Bachwerk Ivana Baldasarre Sandra Baldassarri Ruth Bahr G´erard Bailly Marena Balinova Marian Bartlett Dominik Bauer Sieghard Beller ˇ Stefan Be`ouˇs Niels Ole Bernsen Jonas Beskow Peter Birkholz Horst Bishof Jean-Francois Bonastre Marek Boh´ a`e Elif Bozkurt
INESC-ID Lisboa, Portugal Royal Institute of Technology, Sweden Friedrich Schiller University Jena, Germany School of Computer Science, Iceland Brno University of Technology, Czech Republic University of Patras, Greece Trinity College Dublin, Ireland Second University of Naples, Italy Zaragoza University, Spain University of South Florida, USA GIPSA-lab, Grenoble, France University of Applied Sciences, Austria University of California, San Diego, USA RWTH Aachen University, Germany Universit¨ at Freiburg, Germany Constantine the Philosopher University, Slovakia University of Southern Denmark, Denmark Royal Institute of Technology, Sweden RWTH Aachen University, Germany Technical University Graz, Austria Universit´e d’Avignon, France Technical University of Liberec, Czech Republic Ko¸c University, Turkey
Nikolaos Bourbakis Maja Bratani´c Antonio Calabrese Erik Cambria Paola Campadelli Nick Campbell Valent´ın Carde˜ noso Payo Nicoletta Caramelli Antonio Castro-Fonseca Aleksandra Cerekovic Peter Cerva Josef Chaloupka Mohamed Chetouani G´erard Chollet Simone Cifani Muzeyyen Ciyiltepe Anton Cizmar David Cohen Nicholas Costen Francesca D’Olimpio Vlado Deli´c C´eline De Looze Francesca D’Errico Angiola Di Conza Giuseppe Di Maio Marion Dohen Thierry Dutoit Laila DybkjÆr Jens Edlund Matthias Eichner Aly El-Bahrawy Ci˘ gdem Ero˘glu Erdem Engin Erzin Anna Esposito Antonietta M. Esposito Joan F` abregas Peinado Sascha Fagel Nikos Fakotakis Manuela Farinosi Marcos Fa´ undez-Zanuy Tibor Fegy´ o Fabrizio Ferrara Dilek Fidan Leopoldina Fortunati
ITRI, Wright State University, USA University of Zagreb, Croatia Istituto di Cibernetica – CNR, Naples, Italy University of Stirling, UK Universit` a di Milano, Italy University of Dublin, Ireland Universidad de Valladolid, Spain Universit` a di Bologna, Italy Universidade de Coimbra, Portugal Faculty of Electrical Engineering, Croatia Technical University of Liberec, Czech Republic Technical University of Liberec, Czech Republic Universit`e Pierre et Marie Curie, France CNRS URA-820, ENST, France Universit` a Politecnica delle Marche, Italy Gulhane Askeri Tip Academisi, Turkey Technical University of Koˇsice, Slovakia Universit´e Pierre et Marie Curie, Paris, France Manchester Metropolitan University, UK Second University of Naples, Italy University of Novi Sad, Serbia Trinity College Dublin, Ireland Universit`a di Roma 3, Italy Second University of Naples, Italy Second University of Naples, Italy ICP, Grenoble, France Facult´e Polytechnique de Mons, Belgium University of Southern Denmark, Denmark Royal Institute of Technology, Sweden Technische Universit¨at Dresden, Germany Ain Shams University, Egypt `ı Bah¸ce¸sehir University, Turkey Ko¸c University, Turkey Second University of Naples, Italy Osservatorio Vesuviano Napoli, Italy Escola Universitaria de Mataro, Spain Technische Universit¨at Berlin, Germany University of Patras, Greece University of Udine, Italy Universidad Polit´ecnica de Catalu˜ na, Spain Budapest University of Technology and Economics, Hungary University of Naples “Federico II”, Italy Ankara Universitesi, Turkey Universit` a di Udine, Italy
Todor Ganchev Carmen Garc´ıa-Mateo Vittorio Girotto Augusto Gnisci Milan Gnjatovi´c Bjorn Granstrom Marco Grassi Maurice Grinberg Jorge Gurlekian Mohand-Said Hacid Jaakko Hakulinen Ioannis Hatzilygeroudis Immaculada Hernaez Javier Hernando Wolfgang Hess Dirk Heylen Daniel Hl´adek R¨ udiger Hoffmann Hendri Hondorp David House Evgenia Hristova Stephan H¨ ubler Isabelle Hupont Amir Hussain Viktor Imre Ewa Jarmolowicz Kristiina Jokinen Jozef Juh´ar Zdravko Kacic Bridget Kane Jim Kannampuzha Maciej Karpinski Eric Keller Adam Kendon Stefan Kopp Jacques Koreman Theodoros Kostoulas Maria Koutsombogera Robert Krauss Bernd Kr¨oger Gernot Kubin Olga Kulyk Alida Labella
University of Patras, Greece University of Vigo, Spain Universit` a IUAV di Venezia, Italy Second University of Naples, Italy University of Novi Sad, Serbia Royal Institute of Technology, Sweden Universit` a Politecnica delle Marche, Italy New Bulgarian University, Bulgaria LIS CONICET, Argentina Universit´e Claude Bernard Lyon 1, France University of Tampere, Finland University of Patras, Greece University of the Basque Country, Spain Technical University of Catalonia, Spain Universit¨ at Bonn, Germany University of Twente, The Netherlands Technical University of Koˇsice, Slovak Republic Technische Universit¨at Dresden, Germany University of Twente, The Netherlands Royal Institute of Technology, Sweden New Bulgarian University, Bulgaria Dresden University of Technology, Gremany Aragon Institute of Technology, Spain University of Stirling, UK Budapest University of Technology and Economics, Hungary Adam Mickiewicz University, Poland University of Helsinki, Finland Technical University Koˇsice, Slovak Republic University of Maribor, Slovenia Trinity College Dublin, Ireland RWTH Aachen University, Germany Adam Mickiewicz University, Poland Universit´e de Lausanne, Switzeland University of Pennsylvania, USA University of Bielefeld, Germany University of Science and Technology, Norway University of Patras, Greece Institute for Language and Speech Processing, Greece Columbia University, USA RWTH Aachen University, Germany Graz University of Technology, Austria University of Twente, The Netherlands Second University of Naples, Italy
Emilian Lalev Yiannis Laouris Anne-Maria Laukkanen Am´elie Lelong Borge Lindberg Saturnino Luz Wojciech Majewski Pantelis Makris Kenneth Manktelow Raffaele Martone Rytis Maskeliunas Dominic Massaro Olimpia Matarazzo Christoph Mayer David McNeill Jiˇr´ı Mekyska Nicola Melone Katya Mihaylova P´eter Mihajlik Michal Miriloviˇc Izidor Mlakar Helena Moniz Tam´as Mozsolics Vincent C. M¨ uller Peter Murphy Antonio Natale Costanza Navarretta Eva Navas Delroy Nelson G´eza N´emeth Friedrich Neubarth Christiane Neuschaefer-Rube Giovanna Nigro Anton Nijholt Jan Nouza Michele Nucci Catharine Oertel Stanislav Ond´ aˇs Rieks Op den Akker
New Bulgarian University, Bulgaria Cyprus Neuroscience and Technology Institute, Cyprus University of Tampere, Finland GIPSA-lab, Grenoble, France Aalborg University, Denmark Trinity College Dublin, Ireland Wroclaw University of Technology, Poland Neuroscience and Technology Institute, Cyprus University of Wolverhampton, UK Second University of Naples, Italy Kaunas University of Technology, Lithuania University of California - Santa Cruz, USA Second University of Naples, Italy Technische Universit¨at M¨ unchen, Germany University of Chicago, USA Brno University of Technology, Czech Republic Second University of Naples, Italy University of National and World Economy, Bulgaria Budapest University of Technology and Economics, Hungary Technical University of Koˇsice, Slovakia Roboti c.s. d.o.o, Maribor, Slovenia INESC-ID Lisboa, Portugal Budapest University of Technology and Economics, Hungary Anatolia College/ACT, Greece University of Limerick, Ireland University of Salerno and IIASS, Italy University of Copenhagen, Denmark Escuela Superior de Ingenieros, Spain University College London, UK University of Technology and Economics, Hungary Austrian Research Inst. Artificial Intelligence, Austria RWTH Aachen University, Germany Second University of Naples, Italy Universiteit Twente, The Netherlands Technical University of Liberec, Czech Republic Universit`a Politecnica delle Marche, Italy Trinity College Dublin, Ireland Technical University of Koˇsice, Slovak Republic University of Twente, The Netherlands
Karel Paleˇcek Igor Pandzic Harris Papageorgiou Kinga Papay Paolo Parmeggiani Ana Pavia Paolo Pedone Tomislav Pejsa Catherine Pelachaud Bojan Petek Harmut R. Pfitzinger Francesco Piazza Neda Pintaric Mat´ uˇs Pleva Isabella Poggi Guy Politzer Jan Prazak Ken Prepin Jiˇrı Pˇribil Anna Pˇribilov´ a Emanuele Principi Michael Pucher Jurate Puniene Ana Cristina Quelhas Kari-Jouko R¨aih¨a Roxanne Raine Giuliana Ramella Fabian Ramseyer Jos`e Rebelo Peter Reichl Luigi Maria Ricciardi Maria Teresa Riviello Matej Rojc Nicla Rossini Rudi Rotili Algimantas Rudzionis Vytautas Rudzionis Hugo L. Rufiner Milan Rusko
Technical University of Liberec, Czech Republic Faculty of Electrical Engineering, Croatia Institute for Language and Speech Processing, Greece University of Debrecen, Hungary Universit` a degli Studi di Udine, Italy Spoken Language Systems Laboratory, Portugal Second University of Naples, Italy University of Zagreb, Croatia Universit´e de Paris, France University of Ljubljana, Slovenia University of Munich, Germany Universit`a degli Studi di Ancona, Italy University of Zagreb, Croatia Technical University of Koˇsice, Slovak Republic Universit` a di Roma 3, Italy University of Paris 8, France Technical University of Liberec, Czech Republic Telecom-ParisTech, France Academy of Sciences, Czech Republic Slovak University of Technology, Slovakia Universit` a Politecnica delle Marche, Italy Telecommunications Research Center Vienna, Austria Kaunas University of Technology, Lithuania Instituto Superior de Psicologia Aplicada, Portugal University of Tampere, Finland University of Twente, The Netherlands Istituto di Cibernetica – CNR, Naples, Italy University Hospital of Psychiatry Bern, Switzerland Universidade de Coimbra, Portugal FTW Telecommunications Research Center, Austria Universit` a di Napoli “Federico II”, Italy Second University of Naples and IIASS, Italy University of Maribor, Slovenia Universit`a del Piemonte Orientale, Italy Universit` a Politecnica delle Marche, Italy Kaunas University of Technology, Lithuania Kaunas University of Technology, Lithuania Universidad Nacional de Entre R´ıos, Argentina Slovak Academy of Sciences, Slovak Republic
Zs´ofia Ruttkay Yoshinori Sagisaka Bartolomeo Sapio Mauro Sarrica Gell´ert S´ arosi Gaetano Scarpetta Silvia Scarpetta Stefan Scherer Ralph Schnitker Jean Schoentgen Bj¨orn Schuller Milan Seˇcujski Stefanie Shattuck-Hufnagel Marcin Skowron Jan Silovsky Zdenˇek Sm´ekal Stefano Squartini Piotr Staroniewicz J´ an Staˇs Vojtˇech Stejskal Marian Stewart-Bartlett Xiaofan Sun Jing Su D´ avid Sztah´ o Jianhua Tao Bal´azs Tarj´an Jure F. Tasiˇc Murat Tekalp Kristinn Th´orisson Isabel Trancoso Luigi Trojano Wolfgang Tschacher Markku Turunen Henk Van den Heuvel Betsy van Dijk Giovanni Vecchiato Leticia Vicente-Rasoamalala Robert Vich Kl´ ara Vicsi
Pazmany Peter Catholic University, Hungary Waseda University, Japan Fondazione Ugo Bordoni, Italy University of Padova, Italy Budapest University of Technology and Economics, Hungary University of Salerno and IIASS, Italy Salerno University, Italy Ulm University, Germany Aachen University, Germany Universit´e Libre de Bruxelles, Belgium Technische Universit¨at M¨ unchen, Germany University of Novi Sad, Serbia MIT, Research Laboratory of Electronics, USA Austrian Research Institute for Artificial Intelligence, Austria Technical University of Liberec, Czech Republic Brno University of Technology, Czech Republic Universit` a Politecnica delle Marche, Italy Wroclaw University of Technology, Poland Technical University of Koˇsice, Slovakia Brno University of Technology, Czech Republic University of California, San Diego, USA University of Twente, The Netherlands Trinity College Dublin, Ireland Budapest University of Technology and Economics, Hungary Chinese Academy of Sciences, P.R. China Budapest University of Technology and Economics, Hungary University of Ljubljana, Slovenia Ko¸c University, Turkey Reykjav´ık University, Iceland Spoken Language Systems Laboratory, Portugal Second University of Naples, Italy University of Bern, Switzerland University of Tampere, Finland Radboud University Nijmegen, The Netherlands University of Twente, The Netherlands Universit` a “La Sapienza”, Italy Alchi Prefectural University, Japan Academy of Sciences, Czech Republic Budapest University, Hungary
Hannes H¨ogni Vilhj´ almsson Jane Vincent Alessandro Vinciarelli Laura Vincze Carl Vogel Jan Vol´ın Rosa Volpe Martin Vondra Pascal Wagner-Egger Yorick Wilks Matthias Wimmer Matthias Wolf Bencie Woll Bayya Yegnanarayana Vanda Lucia Zammuner ˇ Jerneja Zganec Gros Goranka Zoric
Reykjav´ık University, Iceland University of Surrey, UK University of Glasgow, UK Universit`a di Roma 3, Italy Trinity College Dublin, Ireland Charles University, Czech Republic Universit´e de Perpignan, France Academy of Sciences, Czech Republic Fribourg University, Switzerland University of Sheffield, UK Institute for Informatics Munich, Germany Technische Universit¨at Dresden, Germany University College London, UK International Institute of Information Technology, India University of Padova, Italy Alpineon, Development and Research, Slovenia Faculty of Electrical Engineering, Croatia
Sponsors The following organizations sponsored and supported the international conference: European COST Action 2102 “Cross-Modal Analysis of Verbal and Nonverbal Communication” (cost2102.cs.stir.ac.uk)
ESF provides the COST Office through an EC contract
COST is supported by the EU RTD Framework Programme
COST—the acronym for European Cooperation in Science and Technology—is the oldest and widest European intergovernmental network for cooperation in research. Established by the Ministerial Conference in November 1971, COST is presently used by the scientific communities of 36 European countries to cooperate in common research projects supported by national funds. The funds provided by COST—less than 1% of the total value of the projects—support the COST cooperation networks (COST Actions) through which, with EUR 30 million per year, more than 30,000 European scientists are involved in research having a total value which exceeds EUR 2 billion per year. This is the financial worth of the European added value which COST achieves. A “bottom–up approach” (the initiative of launching a COST Action comes from the European scientists themselves), “à la carte participation” (only countries interested in the Action participate), “equality of access” (participation is open also to the scientific communities of countries not belonging to the European Union) and “flexible structure” (easy implementation and light management of the research initiatives) are the main characteristics of COST. As precursor of advanced multidisciplinary research, COST plays a very important role in the realization of the European Research Area (ERA), anticipating and complementing the activities of the Framework Programmes, constituting a “bridge” toward the scientific communities of emerging countries, increasing the mobility of researchers across Europe and fostering the establishment of “Networks of Excellence” in many key scientific domains such as: biomedicine and molecular biosciences; food and agriculture; forests, their products and services; materials, physical and nanosciences; chemistry and molecular sciences and technologies; earth system science and environmental management; information
and communication technologies; transport and urban development; individuals, societies, cultures and health. It covers basic and more applied research and also addresses issues of a pre-normative nature or of societal importance.
Website: http://www.cost.eu

SSPnet: European Network on Social Signal Processing, http://sspnet.eu/
The ability to understand and manage the social signals of a person we are communicating with is the core of social intelligence. Social intelligence is a facet of human intelligence that has been argued to be indispensable and perhaps the most important for success in life. Although each one of us understands the importance of social signals in everyday life situations, and in spite of recent advances in machine analysis and synthesis of relevant behavioral cues like blinks, smiles, crossed arms, head nods, laughter, etc., the research efforts in machine analysis and synthesis of human social signals such as empathy, politeness, and (dis)agreement, are few and tentative. The main reasons for this are the absence of a research agenda and the lack of suitable resources for experimentation. The mission of the SSPNet is to create a sufficient momentum by integrating an existing large amount of knowledge and available resources in social signal processing (SSP) research domains including cognitive modeling, machine understanding, and synthesizing social behavior, and thus: – Enable the creation of the European and world research agenda in SSP – Provide efficient and effective access to SSP-relevant tools and data repositories to the research community within and beyond the SSPNet – Further develop complementary and multidisciplinary expertise necessary for pushing forward the cutting edge of the research in SSP The collective SSPNet research effort is directed toward integration of existing SSP theories and technologies, and toward identification and exploration of potentials and limitations in SSP. More specifically, the framework of the SSPNet will revolve around two research foci selected for their primacy and significance: human–human interaction (HHI) and human–computer interaction (HCI). A particular scientific challenge that binds the SSPNet partners is the synergetic combination of human–human interaction models, and automated tools for human behavior sensing and synthesis, within socially adept multimodal interfaces. School of Computing Science, University of Glasgow, Scotland, UK Department of Psychology, Second University of Naples, Caserta, Italy Laboratory of Speech Acoustics of Department of Telecommunication and Media Informatics, Budapest University for Technology and Economics, Budapest, Hungary
Complex Committee on Acoustics of the Hungarian Academy of Sciences, Budapest, Hungary
Scientific Association for Infocommunications, Budapest, Hungary
International Institute for Advanced Scientific Studies “E.R. Caianiello” (IIASS), www.iiassvietri.it/
Società Italiana Reti Neuroniche (SIREN), www.associazionesiren.org/
Regione Campania and Provincia di Salerno, Italy
Table of Contents
Multimodal Signals: Analysis, Processing and Computational Issues Real Time Person Tracking and Behavior Interpretation in Multi Camera Scenarios Applying Homography and Coupled HMMs . . . . . . . . . Dejan Arsi´c and Bj¨ orn Schuller Animated Faces for Robotic Heads: Gaze and Beyond . . . . . . . . . . . . . . . . Samer Al Moubayed, Jonas Beskow, Jens Edlund, Bj¨ orn Granstr¨ om, and David House RANSAC-Based Training Data Selection on Spectral Features for Emotion Recognition from Spontaneous Speech . . . . . . . . . . . . . . . . . . . . . . Elif Bozkurt, Engin Erzin, Ciˇ ¸ gdem Eroˇglu Erdem, and A. Tanju Erdem
1 19
36
Establishing Linguistic Conventions in Task-Oriented Primeval Dialogue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Martin Bachwerk and Carl Vogel
48
Switching Between Different Ways to Think: Multiple Approaches to Affective Common Sense Reasoning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Erik Cambria, Thomas Mazzocco, Amir Hussain, and Tariq Durrani
56
Efficient SNR Driven SPLICE Implementation for Robust Speech Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Stefano Squartini, Emanuele Principi, Simone Cifani, Rudi Rotili, and Francesco Piazza Study on Cross-Lingual Adaptation of a Czech LVCSR System towards Slovak . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Petr Cerva, Jan Nouza, and Jan Silovsky Audio-Visual Isolated Words Recognition for Voice Dialogue System . . . . Josef Chaloupka Semantic Web Techniques Application for Video Fragment Annotation and Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Marco Grassi, Christian Morbidoni, and Michele Nucci
70
81 88
95
Imitation of Target Speakers by Different Types of Impersonators . . . . . . Wojciech Majewski and Piotr Staroniewicz
104
Multimodal Interface Model for Socially Dependent People . . . . . . . . . . . . Rytis Maskeliunas and Vytautas Rudzionis
113
Score Fusion in Text-Dependent Speaker Recognition Systems . . . . . . . . . Jiˇr´ı Mekyska, Marcos Faundez-Zanuy, Zdenˇek Sm´ekal, and Joan F` abregas Developing Multimodal Web Interfaces by Encapsulating Their Content and Functionality within a Multimodal Shell . . . . . . . . . . . . . . . . . . . . . . . . Izidor Mlakar and Matej Rojc
120
133
Multimodal Embodied Mimicry in Interaction . . . . . . . . . . . . . . . . . . . . . . . Xiaofan Sun and Anton Nijholt
147
Using TTS for Fast Prototyping of Cross-Lingual ASR Applications . . . . Jan Nouza and Marek Boh´ aˇc
154
Towards the Automatic Detection of Involvement in Conversation . . . . . . Catharine Oertel, C´eline De Looze, Stefan Scherer, Andreas Windmann, Petra Wagner, and Nick Campbell
163
Extracting Sentence Elements for the Natural Language Understanding Based on Slovak National Corpus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ˇ zm´ Stanislav Ond´ aˇs, Jozef Juh´ ar, and Anton Ciˇ ar
171
Detection of Similar Advertisements in Media Databases . . . . . . . . . . . . . . Karel Palecek
178
Towards ECA’s Animation of Expressive Complex Behaviour . . . . . . . . . . Izidor Mlakar and Matej Rojc
185
Recognition of Multiple Language Voice Navigation Queries in Traffic Situations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Gell´ert S´ arosi, Tam´ as Mozsolics, Bal´ azs Tarj´ an, Andr´ as Balog, P´eter Mihajlik, and Tibor Fegy´ o Comparison of Segmentation and Clustering Methods for Speaker Diarization of Broadcast Stream Audio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jan Prazak and Jan Silovsky
199
214
Influence of Speakers’ Emotional States on Voice Recognition Scores . . . . Piotr Staroniewicz
223
Automatic Classification of Emotions in Spontaneous Speech . . . . . . . . . . D´ avid Sztah´ o, Viktor Imre, and Kl´ ara Vicsi
229
Modification of the Glottal Voice Characteristics Based on Changing the Maximum-Phase Speech Component . . . . . . . . . . . . . . . . . . . . . . . . . . . . Martin Vondra and Robert V´ıch
240
Verbal and Nonverbal Social Signals On Speech and Gestures Synchrony . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Anna Esposito and Antonietta M. Esposito Study of the Phenomenon of Phonetic Convergence Thanks to Speech Dominoes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Am´elie Lelong and G´erard Bailly Towards the Acquisition of a Sensorimotor Vocal Tract Action Repository within a Neural Model of Speech Processing . . . . . . . . . . . . . . . Bernd J. Kr¨ oger, Peter Birkholz, Jim Kannampuzha, Emily Kaufmann, and Christiane Neuschaefer-Rube Neurophysiological Measurements of Memorization and Pleasantness in Neuromarketing Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Giovanni Vecchiato and Fabio Babiloni Annotating Non-verbal Behaviours in Informal Interactions . . . . . . . . . . . . Costanza Navarretta
252
273
287
294 309
The Matrix of Meaning: Re-presenting Meaning in Mind Prolegomena to a Theoretical Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Rosa Volpe, Lucile Chanquoy, and Anna Esposito
316
Investigation of Movement Synchrony Using Windowed Cross-Lagged Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Uwe Altmann
335
Multimodal Multilingual Dictionary of Gestures: DiGest . . . . . . . . . . . . . . ˇ Milan Rusko and Stefan Beˇ nuˇs
346
The Partiality in Italian Political Interviews: Stereotype or Reality? . . . . Enza Graziano and Augusto Gnisci
355
On the Perception of Emotional “Voices”: A Cross-Cultural Comparison among American, French and Italian Subjects . . . . . . . . . . . . . . . . . . . . . . . Maria Teresa Riviello, Mohamed Chetouani, David Cohen, and Anna Esposito Influence of Visual Stimuli on Evaluation of Converted Emotional Speech by Listening Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jiˇr´ı Pˇribil and Anna Pˇribilov´ a Communicative Functions of Eye Closing Behaviours . . . . . . . . . . . . . . . . . Laura Vincze and Isabella Poggi Deception Cues in Political Speeches: Verbal and Non-verbal Traits of Prevarication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Nicla Rossini
368
378 393
406
Selection Task with Conditional and Biconditional Sentences: Interpretation and Pattern of Answer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Fabrizio Ferrara and Olimpia Matarazzo Types of Pride and Their Expression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Isabella Poggi and Francesca D’Errico
419 434
People’s Active Emotion Vocabulary: Free Listing of Emotion Labels and Their Association to Salient Psychological Variables . . . . . . . . . . . . . . Vanda Lucia Zammuner
449
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
461
Real Time Person Tracking and Behavior Interpretation in Multi Camera Scenarios Applying Homography and Coupled HMMs

Dejan Arsić¹ and Björn Schuller²

¹ Müller BBM Vibroakustiksysteme GmbH, Planegg, Germany
[email protected]
² Institute for Human-Machine Communication, Technische Universität München, Germany
[email protected]

Abstract. Video surveillance systems have been introduced in various fields of our daily life to enhance security and protect individuals and sensitive infrastructure. Up to now they have usually been utilized as a forensic tool for after-the-fact investigations and are commonly monitored by human operators. A further gain in safety can only be achieved by the implementation of fully automated surveillance systems which will assist human operators. In this work we will present an integrated real time capable system utilizing multiple camera person tracking, which is required to resolve heavy occlusions, to monitor individuals in complex scenes. The resulting trajectories will be further analyzed for so-called Low Level Activities, such as walking, running and stationarity, applying HMMs, which will be used for the behavior interpretation task along with motion features gathered throughout the tracking process. An approach based on coupled HMMs will be used to model High Level Activities such as robberies at ATMs and luggage related scenarios.
1 Introduction

Visual surveillance systems, which are quite common in urban environments, aim at providing safety in everyday life. Unfortunately most CCTV cameras are unmonitored, and the vast majority of benefits lies either in forensic use or in deterring potential offenders, as these might be easily recognized and detected [40]. Therefore it seems desirable to support human operators and implement automated surveillance systems that are able to react in time. In order to achieve this aim most systems are split into two parts, the detection and tracking application and the subsequent behavior interpretation part. As video material may contain various stationary or moving objects and persons whose behavior may be interesting, these have to be detected in the current video frame and tracked over time. As a single camera is usually not sufficient to cope with dense crowds and large regions, multiple cameras should be mounted to view defined regions from different perspectives. Within these perspectives corresponding objects have to be located. Appearance based methods, such as matching color [32], lead to frequent errors due to different color settings and lighting situations in the individual sensors.
Approaches based on geometrical information rely on geometrical constraints between views, using calibrated data [43] or homography between uncalibrated views, which e.g. Khan [25] suggested for localizing feet positions. However, as Khan's approach only localizes feet, it consequently tends to segment persons into further parts. In these respects a novel extension to this framework is presented herein, applying homography in multiple layers to successfully overcome the problem of aligning multiple segments belonging to one person. As a convenient side effect the localization performance will increase dramatically [6]. Nevertheless this approach still creates some errors in complex scenes and is computationally quite expensive. Therefore a real time capable alteration of the initial homography approach will be presented in sec. 2. The results of the applied tracking approaches will be presented using the multi camera tracking databases from the Performance Evaluation of Tracking and Surveillance Challenges (PETS) in the years 2006, 2007 and 2009 [37,3,28]. All these databases have been recorded in public places, such as train stations or airports, and show at least four views of the scene. Subsequently an integrated approach for behavior interpretation will be presented in sec. 3. Although a wide range of approaches already exists, this issue is not yet solved. Most of these operate on 2D level using texture information to extract behaviors or gait [39,15]. Unfortunately it is not possible to guarantee similar and non-obscured views in real world scenarios, which are required by these algorithms. Hence it is suggested to operate on trajectory level. Trajectories can be extracted robustly by the previously mentioned algorithm, easily be normalized, and compared to a baseline scenario with little to no changes and knowledge of the scene geometry. Nevertheless the positions of important landmarks and objects, which may be needed for the scenario recognition, should be collected. Other information is not required. Common approaches come at the cost of collecting a large amount of data to train Hidden Markov Models (HMM) [31] or behavioral maps [11]. Despite the scenarios' complexity and large inter class variance, some scenarios follow a similar scheme, which can be modeled by an HMM architecture in two layers, where the first layer is responsible for the recognition of Low Level Activities (LLA). In the second layer complex scenarios are furthermore analyzed, again applying HMMs, where only LLAs are used as features. High flexibility and robustness is achieved by the introduction of state transitions between High Level Activities (HLA), allowing a detailed dynamic scene representation. It will be shown that this approach provides a high accuracy at low computational effort.
2 Object Localization Using Homography

2.1 Planar Homographies

Homography [22] is a special case of projective geometry. It enables the mapping of points in spaces with different dimensionality $\mathbb{R}^n$ [17]. Hence, a point $p$ observed in a view can be mapped into its corresponding point $p'$ in another perspective or even another coordinate system. Fig. 1 illustrates this for the transformation of a point $p$ in world coordinates $\mathbb{R}^3$ into the image pixel $p'$ in $\mathbb{R}^2$:

$$p' = (x', y') \leftarrow p = (x, y, z). \qquad (1)$$
Fig. 1. The homography constraint visualized with a cylinder standing on a planar surface
Planar homographies, here the matching of image coordinates onto the ground plane, in contrast only require an affine transformation from $\mathbb{R}^2 \rightarrow \mathbb{R}^2$. This can be interpreted as a simple rotation with $R$ and translation with $T$:

$$p' = Rp + T. \qquad (2)$$

As has been shown in [25], projective geometry between multiple cameras and a plane in world coordinates can be used for person tracking. A point $p_\pi$ located on the plane is visible as $p_{i\pi}$ in view $C_i$ and as $p_{j\pi}$ in a second view $C_j$. $p_{i\pi}$ and $p_{j\pi}$ can be determined with

$$p_{i\pi} = H_{i\pi} p_\pi \quad \text{and} \quad p_{j\pi} = H_{j\pi} p_\pi, \qquad (3)$$

where $H_{i\pi}$ denotes the transformation between view $C_i$ and the ground plane $\pi$. The composition of both perspectives results in a homography [22]

$$p_{j\pi} = H_{j\pi} H_{i\pi}^{-1} p_{i\pi} = H_{ij} p_{i\pi} \qquad (4)$$

between the image planes. This way each pixel in a view can be transformed into another arbitrary view, given the projection matrices for the two views. A 3D point located off the plane $\pi$, visible at location $p_{i\pi}$ in view $C_i$, can also be warped into another image with $p_w = H p_{i\pi}$, with $p_w \neq p_{2\pi}$. The resulting misalignment is called plane parallax. As illustrated in fig. 1 the homography projects a ray from the camera center $C_i$ through a pixel $p$ and extends it until it intersects with the plane $\pi$, which is referred to as the piercing point of a pixel and the plane $\pi$. The ray is subsequently projected into the camera center of $C_j$, intersecting the second image plane at $p_w$. As can be seen, points on the plane do not have any plane parallax, whereas those off the plane have a considerable one. Each scene point $p_\pi$ located on an object in the 3D scene and on plane $\pi$ will therefore be projected into a pixel $p_{1\pi}, p_{2\pi}, \cdots, p_{n\pi}$ in all available $n$ views, if the projections are located in detected foreground regions $FG_i$ with

$$p_{i\pi} \in FG_i. \qquad (5)$$

Furthermore, each point $p_{i\pi}$ can be determined by a transformation between view $i$ and an arbitrarily chosen one indexed with $j$:

$$p_{i\pi} = H_{ij} p_{j\pi}, \qquad (6)$$
where $H_{ij}$ is the homography of plane $\pi$ from view $i$ to $j$. Given a foreground pixel $p_i \in FG_i$ in view $C_i$, with its piercing point located inside the volume of an object in the scene, the projection

$$p_j = H_{ij} p_i \in FG_j \qquad (7)$$

lies in the foreground region $FG_j$. This proposition, the so-called homography constraint, segments the pixels corresponding to ground plane positions of objects and helps resolve occlusions. The homography constraint is not necessarily limited to the ground plane and can be used in any other plane in the scene, as will be shown in sec. 2.2. For the localization of objects, the ground plane seems sufficient to find objects touching it. In the context of pedestrians a detection of feet is performed, which will be explained in the following sections. Now that it is possible to compute point correspondences from the 2D space to the 3D world and vice versa, it is also possible to determine the number of objects and their exact location in a scene. In the first stage a synchronized image acquisition is needed, in order to compute the correspondences of moving objects in the current frames $C_1, C_2, \ldots, C_n$. Subsequently, a foreground segmentation is performed in all available smart sensors to detect changes from the empty background $B(x, y)$ [25]:

$$FG_i(x, y, t) = I_i(x, y, t) - B_i(x, y), \qquad (8)$$

where the appropriate technique to update the background pixels, here based on Gaussian Mixture Models, is chosen for each sensor individually. It is advisable to set parameters, such as the update time, separately in all sensors to guarantee a high performance. Computational effort is reduced by masking the images with a predefined tracking area. Now the homography $H_{i\pi}$ between a pixel $p_i$ in the view $C_i$ and the corresponding location on the ground plane $\pi$ can be determined. In all views the observations $x_1, x_2, \ldots, x_n$ can be made at the pixel positions $p_1, p_2, \ldots, p_n$. Let $X$ denote the event that a foreground pixel $p_i$ has a piercing point within a foreground object, with probability $P(X|x_1, x_2, \ldots, x_n)$. With Bayes' law we have

$$p(X|x_1, x_2, \ldots, x_n) \propto p(x_1, x_2, \ldots, x_n|X)\, p(X). \qquad (9)$$

The first term on the right side is the likelihood of making an observation $x_1, x_2, \ldots, x_n$, given that an event $X$ happens. Assuming conditional independence, the term can be rewritten to

$$p(x_1, x_2, \ldots, x_n|X) = p(x_1|X) \cdot p(x_2|X) \cdot \ldots \cdot p(x_n|X). \qquad (10)$$

According to the homography constraint, a pixel within an object will be part of the foreground object in every view,

$$p(x_i|X) \propto p(x_i), \qquad (11)$$

where $p(x_i)$ is the probability of $x_i$ belonging to the foreground. An object is then detected in the ground plane when

$$p(X|x_1, x_2, \ldots, x_n) \propto \prod_{i=1}^{n} p(x_i) \qquad (12)$$
Fig. 2. a) Planar homography for object detection. b) Resolving occlusions by adding further views.
exceeds a threshold $\theta$. In order to keep the computational effort low, it is feasible to transform only regions of interest [3]. These are determined by thresholding the entire image, resulting in a binary image, before the transformation and the detection of blobs with a simple connected component analysis. This way only the binary blobs are transformed into the ground plane instead of the corresponding probability maps. Therefore eq. 12 can be simplified to

$$p(X|x_1, x_2, \ldots, x_n) \propto \sum_{i=1}^{n} p(x_i) \qquad (13)$$

without any influence on the performance. The value of $\theta_{low}$ is usually set dependent on the number $n$ of camera sensors to $\theta_{low} = n - 1$, in order to provide some additional robustness in case one of the views accidentally fails. The thresholding on sensor level has a further advantage compared to the so-called soft threshold [25,12], where the entire probability map is transformed and probabilities are actually multiplied as in eq. 12. A small probability or even $x_i = 0$ would affect the overall probability and set it to small values, whereas the thresholded sum is not affected. Using the homography constraint hence solves the correspondence problem in the views $C_1, C_2, \ldots, C_n$, as illustrated in fig. 2a) for a cubic object. In case the object is a human, only the feet of the person touching the ground plane will be detected. The homography constraint additionally resolves occlusions, as can be seen in fig. 2a). Pixel regions located within the detected foreground areas, indicated in dark gray on the ground plane, and representing the feet, will be transformed to a piercing point within the object volume. Foreground pixels not satisfying the homography constraint are located off the plane and are warped into background regions of other views. The piercing point is therefore located outside the object volume. All outliers indicate regions with high uncertainty, as there is no depth information available. This limitation can now be used to detect occluded objects. As visualized in fig. 2b), one cuboid is occluded by the other one in view $C_1$, as apparently foreground blobs are merged. The right object's bottom side is occluded by the larger object's body. Both objects are visible in view $C_2$, resulting in two detected foreground regions. A second set of foreground pixels, located on the ground plane $\pi$ in view $C_2$, will now satisfy the homography constraint and localize the occluded object. This process allows the localization of feet positions, although they are entirely occluded, by creating a kind of see-through effect.
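As a concrete illustration of eqs. (3) and (4), the short sketch below composes the ground-plane homographies of two views into the inter-view homography $H_{ij}$ and warps a single pixel from view $C_i$ into view $C_j$. This is only a minimal example under assumed values: the reference points and pixel coordinates are invented, and in a real setup the matrices would be derived from the calibration of each sensor.

```python
import numpy as np
import cv2

# Four ground-plane reference points (in metres) and their pixel positions in
# two views C_i and C_j. The numbers are purely illustrative placeholders for
# values that would normally come from camera calibration.
plane_pts = np.float32([[0, 0], [5, 0], [5, 5], [0, 5]])
pix_i = np.float32([[120, 400], [520, 410], [600, 180], [80, 170]])
pix_j = np.float32([[300, 450], [700, 380], [640, 150], [210, 190]])

# H_i_pi and H_j_pi map ground-plane coordinates to image pixels (eq. 3).
H_i_pi = cv2.getPerspectiveTransform(plane_pts, pix_i)
H_j_pi = cv2.getPerspectiveTransform(plane_pts, pix_j)

# Composing both plane homographies yields the inter-view homography
# H_ij = H_j_pi * H_i_pi^-1 (eq. 4), mapping pixels of C_i onto pixels of C_j.
H_ij = H_j_pi @ np.linalg.inv(H_i_pi)

# Warp one foreground pixel of view C_i into view C_j.
p_i = np.array([[[350.0, 300.0]]])              # shape (1, 1, 2), as OpenCV expects
p_j = cv2.perspectiveTransform(p_i, H_ij)[0, 0]
print("corresponding pixel in view C_j:", p_j)
```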
Fig. 3. Detection example applying homographic transformation in the ground plane. Detected object regions are subsequently projected into the third view of the PETS2006 data set. The regions in yellow represent intersecting areas. As can be seen, some objects are split into multiple regions. These are aligned in a subsequent tracking step.
Exemplary results of the object localization are shown in fig. 3, where the yellow regions on the left hand side represent possible object positions. For easier post-processing, the resulting intersections are interpreted as circular object regions $OR_j$ with center point $p_j(x, y, t)$ and radius $r_j(t)$, which is given by $r_j(t) = \sqrt{A_j(t)/\pi}$, where $A_j(t)$ is the size of the intersecting region.
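A possible implementation of the simplified fusion rule in eq. (13), including the circular object regions just described, is sketched below. The per-camera homographies mapping image pixels onto a discretised ground-plane grid, the grid resolution and the 5 cm cell size are assumptions made for the example; the GMM foreground segmentation uses OpenCV's standard cv2.createBackgroundSubtractorMOG2, and the function name is hypothetical.

```python
import math
import numpy as np
import cv2

def ground_plane_regions(frames, bg_subtractors, H_img_to_grid,
                         grid_size=(400, 400), cell_size=0.05):
    """Fuse per-view foreground masks on a discretised ground plane (eq. 13)
    and return circular object regions (centre in metres, radius = sqrt(A/pi)).

    frames         -- synchronised images, one per camera
    bg_subtractors -- one cv2.BackgroundSubtractorMOG2 instance per camera
    H_img_to_grid  -- 3x3 homographies mapping image pixels to grid cells
    """
    n = len(frames)
    votes = np.zeros(grid_size[::-1], np.uint16)

    for frame, bg, H in zip(frames, bg_subtractors, H_img_to_grid):
        fg = bg.apply(frame)                        # GMM foreground mask (eq. 8)
        fg = (fg > 127).astype(np.uint8)            # binarise, drop shadow label
        warped = cv2.warpPerspective(fg, H, grid_size, flags=cv2.INTER_NEAREST)
        votes += warped                             # sum of binary votes

    # theta_low = n - 1 keeps a detection alive if one view accidentally fails.
    occupancy = (votes >= n - 1).astype(np.uint8)
    num, labels, stats, centroids = cv2.connectedComponentsWithStats(occupancy)

    regions = []
    for j in range(1, num):                         # label 0 is the background
        area_m2 = stats[j, cv2.CC_STAT_AREA] * cell_size ** 2
        radius = math.sqrt(area_m2 / math.pi)       # r_j(t) = sqrt(A_j(t) / pi)
        cx, cy = centroids[j]
        regions.append(((cx * cell_size, cy * cell_size), radius))
    return regions
```

The background subtractors would be created once, e.g. one cv2.createBackgroundSubtractorMOG2() per camera, and reused from frame to frame, so that each sensor keeps its individually adapted background model, in line with the per-sensor parameter choice discussed above.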
2.2 3D Reconstruction of the Scene

The major drawback of planar homography is the restriction to the detection of objects touching the ground, which leads to some unwanted phenomena. Humans usually have two legs and therefore two feet touching the ground, but unfortunately not necessarily positioned next to each other. Walking people will show a distance between their feet of up to one meter. Computing intersections in the ground plane consequently results in two object positions per person. Fig. 3 illustrates the detected regions for all four persons present in the scene. As only the position of the feet is determined, remaining information on body shape and posture is dismissed. As a consequence distances between objects and individuals cannot be determined exactly. For instance, a person might try to reach an object with her arm and be just a few millimeters away from touching it, though the computed distance would be almost one meter. Furthermore, tracking is limited to objects located on a plane, while other objects, such as hands, birds, etc. cannot be considered. To resolve these limitations, it seems reasonable to try to reconstruct the observed scenery as a 3D model. Therefore various techniques have already been applied: Recent works mostly deal with the composition of so-called visual hulls from an ensemble of 2D images [27,26], which requires a rather precise segmentation in each smart sensor and the use of 3D constructs like voxels or visual cones. These are subsequently intersected in the 3D world. A comparison of scene reconstruction techniques can be found in [35]. An approach for 3D reconstruction of objects from multiple views applying homography has already been presented in [24]. All required information can be gathered by fusion of silhouettes in the image plane, which can be resolved by planar homography. With a large set of cameras or views a quite precise object reconstruction can be
Fig. 4. a) Computation of layer intersections using two points. b) Transformed blobs in multiple layers. c) 3D reconstruction of a cuboid.
achieved, which is not required for this work. This approach can be altered to localize objects and approximate the occupied space with low additional effort [6], which will improve the detection and tracking performance. The basic idea is to compute the intersections of transformed object boundaries in additional planes, as illustrated in fig. 4b). This transformation can be computed rapidly by taking epipolar geometry into account, which is computationally more efficient than computing the transformation for each layer. All possible transformations of an image pixel $I(x, y)$ are located on an infinite line $g$ in world coordinates $(x_w, y_w, z_w)$. This line can be described by two points $p_1$ and $p_2$. Therefore only two transformations, which can be precomputed, are required for the subsequent processing steps. This procedure is only valid for a linear stretch in space, which can be assumed in most applied sensor setups. The procedure described in sec. 2.1 is applied for each desired layer, resulting in intersecting regions at various heights, as illustrated in fig. 4 b) and c). The object's height is not required, as the polygons only intersect within the region above the person's position. In order to track humans it has been decided to use ten layers with a distance of 0.20 m covering the range of 0.00 m to 1.80 m, as this is usually sufficient to separate humans and only the head would be missing in case the person is by far taller. The ambiguities created by the planar homography approach are commonly solved by the upper body. Therefore the head, which is usually smaller than the body, is not required. The computed intersections have to be aligned in a subsequent step in order to reconstruct the objects' shapes. Assuming that an object does not usually float above another one, all layers can be stacked into one layer by projecting the intersections to the ground floor. This way a top view is simulated by applying a simple summation of the pixels $P = (x_w, y_w, z_w)$ in all layers into one common ground floor layer with:

$$GF(x_w, y_w) = \sum_{l=1}^{n} P(x_w, y_w, z_l). \qquad (14)$$
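A minimal sketch of this layer stacking (eq. 14), together with the ID propagation described next, could look as follows; the binary per-layer intersection maps are assumed to come from repeating the single-plane fusion of sec. 2.1 at each of the chosen heights, and the function name is hypothetical.

```python
import numpy as np
import cv2

def stack_layers(layer_maps):
    """Project per-layer intersection maps to the ground floor (eq. 14).

    layer_maps -- list of binary HxW arrays, one per height layer
                  (e.g. ten layers covering 0.0 m to 1.8 m, as in the text)
    """
    gf = np.zeros_like(layer_maps[0], dtype=np.uint8)
    for layer in layer_maps:
        gf += layer.astype(np.uint8)        # GF(x_w, y_w) = sum_l P(x_w, y_w, z_l)
    gf_bin = (gf > 0).astype(np.uint8)

    # Unique IDs for each object position in the simulated top view; the IDs
    # are then propagated back to every layer to align the per-layer segments.
    num_ids, ids = cv2.connectedComponents(gf_bin)
    per_layer_ids = [ids * layer.astype(ids.dtype) for layer in layer_maps]
    return ids, per_layer_ids
```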
Subsequently, a connected component analysis is applied in order to assign unique IDs to all possible object positions in the projected top view. Each ID is then propagated to the layers above the ground floor, providing a mapping of object regions in the single layers. Besides the exact object location, additional volumetric information, such as
Fig. 5. Detection example on PETS2007 data [3] projected in two camera views. All persons, except the lady in the ellipse, have been detected and labeled consistently in both views. The error occurred already in the foreground segmentation.
height, width, and depth, is extracted from the image data, providing a more detailed scene representation than the simple localization. Some localization examples are provided in fig. 5, where cylinders approximate the object volume. The operating area has been restricted to the predefined area of interest, which is the region with the marked up coordinate system. As can be seen, occlusions can be resolved easily without any errors. One miss, the lady marked with the black ellipse, appeared because of an error in the foreground segmentation. She had been standing in the same spot even before the background model was created, and had therefore not been detected.

2.3 Computational Optimization of the 3D Representation

The localization accuracy of the previously described approach comes at the cost of computational effort. Both the homography and the fusion in individual layers are quite demanding operations, although a simple mathematical model lies beneath them. Therefore a computationally more efficient variation will be presented in the following. As each detected foreground pixel is transformed into the ground plane, a vast amount of correspondences has to be post processed within the localization process. Instead of computing complex occupancy cones, the observed region is covered by a three dimensional grid with predefined edge lengths. Thus, we segment the observed space into a grid of volume elements, so-called voxels. In a first step corresponding voxel and pixel positions in the recorded image are computed. This can be done by computing homographies in various layers, using occupancy rays cast from each image pixel in each separate camera view. Each voxel passed by a ray originating from one pixel is henceforth associated with that pixel. Due to the rough quantization of the 3D space, multiple pixel positions will be matched to each voxel. While slightly decreasing precision, this will result in a larger tolerance to calibration errors. As we now have a precomputed lookup table of pixel to voxel correspondences, it is possible to calculate an occupancy grid quickly for each following observation. Each voxel is assigned a score which is set to zero at first. For each pixel showing a foreground object, all associated voxels' scores are incremented by one step. Going through all the foreground regions of all images, it is possible to compute the scores for each voxel in the occupancy grid. After all image pixels have been processed, a simple thresholding operation is performed on the scores of the voxels, excluding voxels with low scores and thus ambiguous regions. The remaining voxels with higher scores
Fig. 6. 3D reconstruction and detection results of a scene from the PETS2009 [18] dataset
then provide an approximated volume of the observed object. The threshold is usually set equal to the number of cameras, meaning that a valid voxel needs an associated foreground/object pixel in each camera view. After filling the individual grid elements, a connected component analysis, as commonly used in image processing, is applied to the 3D voxel grid in order to locate objects. The only significant difference to the 2D operation is the number of possible neighbor elements, which rises from 8 to 26. An exemplary detection result is illustrated in fig. 6, using a scene from the PETS2009 workshop [18]. Due to the rough quantization of the tracking region, calibration errors and unreliable foreground segmentation could be partially compensated, and a considerably higher tracking accuracy has been reached with this method, which has been evaluated on the PETS2007 database. While the multi-layer homography approach (MLH) and the presented voxel-based tracking achieve the same localization accuracy of 0.15 m, the number of ID changes has been decreased drastically from 18 to 3. This result is comparable to a combined MLH and 2D tracking approach, as presented in [8], where a graph-based representation using SIFT features [30] has been applied [28]. In terms of tracking accuracy, the performance has thus not risen drastically, but the computational effort has been decreased by a factor of seven at the same time, which makes this approach far more efficient than comparable ones.
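The voxel-based localization just described can be summarized in a short sketch. The following Python code is a minimal illustration under stated assumptions: a lookup table voxel_lut mapping each camera's pixels to the voxels crossed by their rays is assumed to be precomputed, and all names and shapes are illustrative rather than taken from the authors' implementation.

```python
import numpy as np
from scipy import ndimage

def voxel_occupancy(foreground_masks, voxel_lut, grid_shape):
    """Accumulate per-voxel scores from binary foreground masks.

    foreground_masks: list of HxW boolean arrays, one per camera.
    voxel_lut: voxel_lut[c][y, x] -> array of flat voxel indices hit by
               the ray of pixel (x, y) in camera c (precomputed once).
    grid_shape: (nx, ny, nz) of the voxel grid.
    """
    scores = np.zeros(np.prod(grid_shape), dtype=np.int32)
    for cam, mask in enumerate(foreground_masks):
        for y, x in zip(*np.nonzero(mask)):          # every foreground pixel
            scores[voxel_lut[cam][y, x]] += 1        # increment all voxels on its ray
    # a voxel is valid only if every camera saw foreground along its ray
    occupied = scores.reshape(grid_shape) >= len(foreground_masks)
    # 26-connected component analysis in 3D yields one label per object
    labels, n_objects = ndimage.label(occupied, structure=np.ones((3, 3, 3)))
    return labels, n_objects
```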
3 Behavior Interpretation

The created trajectories and changes in motion patterns can now be used by a behavior interpretation module, which subsequently either triggers an alarm signal or reacts to the observed activity by other appropriate means [23]. This module basically matches an unknown observation sequence against stored reference samples. The basic problem is to find a meaningful representation of human behavior, which is quite a challenging task even for highly trained human operators, who
indeed should be 'experts in the field'. A wide range of classifiers based on statistical learning theory has been employed in the past in order to recognize different behaviors. Probably the most popular approaches involve the use of dynamic classifiers, such as HMMs [31] or Dynamic Time Warping [36]. Nevertheless, static classifiers, e.g. Support Vector Machines (SVM) or Neural Networks (NN), are being further explored, as these may outperform dynamic ones [4]. All these approaches are data driven, which usually requires a vast amount of real-world training data. This is usually not available, as authorities often do not provide or simply do not have such data, and data preparation and model creation are quite time consuming. Therefore an effective solution has to be found to overcome this problem. In order to be able to pick up interesting events and derive so-called 'threat intentions', which may for instance include robberies or even the placement of explosives, a set of Predefined Indicators (PDI), such as loitering in a defined region, has been collected [13]. These PDIs have been assembled into complex scenarios, which can be interpreted as combinations and temporal sequences of so-called Low Level Activities. Hence, the entire approach consists of two subsequent steps: the Low Level Activity (LLA) detection and the subsequent scene analysis using the outputs of the LLA analysis.

3.1 Feature Extraction

The recognition of complex events on trajectory level requires a detailed analysis of temporal events. A trajectory can be interpreted as an object projected onto the ground plane, and therefore techniques from the 2D domain can be used. According to Francois [20] and Choi [16], the most relevant trajectory-related features are defined as follows: continue, appear, disappear, split, and merge. All of these can be handled by the tracking algorithm, where the object age, i.e. the number of frames a person has been visible, can also be determined reliably. Additionally, motion patterns, such as speed and stationarity, are analyzed.

– Motion Features: In order to be able to perform an analysis of LLAs from a wide range of recordings and setups, it is reasonable to discard the absolute position of the person in the first place. It is important to detect running, walking or loitering persons, for which the position only provides contextual information. Therefore only the persons' speed and acceleration are computed directly on trajectory level. The direction of movement can also be considered as contextual information, which leads to the conclusion that only changes in the direction of motion on the xy-plane should be recorded.
– Stationarity: For some scenarios, such as left luggage detection, objects not altering their spatial position have to be picked up in a video sequence. Due to noise in the video material or slight changes in the detector output, e.g. the median of a particle filter, the object location jitters slightly. A simple spatial threshold over time is usually not adequate, because the jitter might vary in intensity over time. Therefore the object position p_i(t) is averaged over the last N frames:

\bar{p}_i = \frac{1}{N} \sum_{t'=t-N}^{t} p_i(t')    (15)
Subsequently, the normalized variance in both the x- and y-direction,

\sigma_i(t) = \frac{1}{N} \sum_{t'=t-N}^{t} \left( p_i(t') - \bar{p}_i \right)^2    (16)
is computed [9,3]. This step is required to smooth noise created by the sensors and errors during image processing. Stationarity can then be assumed for objects whose variance is lower than a predefined threshold θ:

stationarity = \begin{cases} 1 & \text{if } \sigma_i(t) < \theta \\ 0 & \text{else} \end{cases}    (17)

where 1 indicates stationarity and 0 represents walking or running. Given only the location coordinates, this method does not discriminate between pedestrians and other objects, enabling stationarity detection for any given object in the scene. A detection example is illustrated in fig. 7.
– Detection of Splits and Mergers: According to Perera [33], splits and merges have to be detected in order to maintain IDs in the tracking task. Guler [21] tried to handle these as low-level events describing more complex scenarios, such as people getting out of cars or forming crowds. A merger usually appears in case two previously independent objects O_1(t) and O_2(t) unite into a normally bigger one:

O_{12}(t) = O_1(t-1) \cup O_2(t-1)    (18)
This observation is usually made in case two objects are either located extremely close to each other or touch one another in 3D, whereas in 2D a partial occlusion might be the reason for a merger. In contrast, two objects O_{11}(t) and O_{12}(t) can be created by a splitting object O_1(t-1), which might itself have been created by a previous merger. While others usually analyze object texture and luminance [38], the applied rule-based approach only relies on the object positions and the regions' sizes. Disappearing and appearing objects have to be recognized during the tracking process in order to incorporate a split or merge:
• Merge: one object disappears, and two objects from the previous frame can be mapped onto one and the same object during tracking. In an optimal case both surfaces would intersect with the resulting bigger surface:

O_1(t-1) \cap O_{12}(t) \neq \emptyset \;\wedge\; O_2(t-1) \cap O_{12}(t) \neq \emptyset    (19)
• Split: Analogously to the merge case, two objects at frame t are mapped to one object at time t-1, where both new objects intersect with the old, splitting one:

O_{11}(t) \cap O_1(t-1) \neq \emptyset \;\wedge\; O_{12}(t) \cap O_1(t-1) \neq \emptyset    (20)
– Proximity of Objects: As persons interact with each other in various cases, it seems reasonable to model combined motions. This can be done according to the direction of movement, the proximity of objects, and the velocity. As the direction of
Fig. 7. Exemplary recognition results for Walking, Loitering and Operating an ATM
motion can be computed simply, it is possible to elongate the motion vector v and compute intersections with interesting objects or other motion vectors. Further, the distance between object positions can easily be computed as

d_{ij} = \sqrt{ (x_i(t) - x_j(t))^2 + (y_i(t) - y_j(t))^2 }    (21)

Distances between persons and objects are usually computed in a scenario-specific manner and require contextual knowledge, as the positions of fixed objects are known beforehand and these objects cannot necessarily be detected automatically. In case interactions between persons are of interest, it is sufficient to analyze only the objects with the smallest mutual distance.

3.2 Low Level Activity Detection

The classification of Low Level Activities has been performed applying various techniques, among which rule-based approaches [6] and Bayesian Networks [14] have been quite popular. As it is hard to handle continuous data streams with both approaches and to set up a wide set of rules for each activity, dynamic data-driven classification should be preferred. Though it has previously been stated that data is hardly available, this holds only for complex scenarios, such as robberies or theft. It is therefore reasonable to collect LLAs from different data sources and additionally collect a large amount of normal data containing none of the desired LLAs, as this is the class that usually appears. Hidden Markov Models (HMMs) [34] are applied for the trajectory analysis task in the first stage, as these can cope with dynamic sequences of variable length. Neither the duration nor the start or end frames of the desired LLAs are known before the training phase; only the order and number of activities for each sample in the database are defined. Each action is represented by a four- or five-state, left-right, continuous HMM and trained using the Baum-Welch algorithm [10]. During the training process the activities are aligned to the training data via the Viterbi algorithm in order to find the start and end frames of the contained activities. The recognition task was also performed applying the Viterbi algorithm. For this task all features except the contextual information, such as position or proximity, have been applied. Table 1 illustrates the desired classes and the recognition results. This approach has been evaluated on a total of 2.5 h of video including the
Table 1. Detection (det) results and false positives (fpos) for all five LLAs within the databases. The HMM based approach obviously outperforms the followed static Bayesian Networks approach.
14 7 18 12 60
Running Stationarity Drop luggage Pick up luggage Loitering
[#] detBN detHMM fpos BN fpos HMM 10 7 0 0 60
13 0 15 10 60
1 0 12 0 3
0 0 1 2 1
Event
Fig. 8. a) Structure of the coupled HMMs
PETS2006, PETS2007, and the PROMETHEUS [1] datasets. As such a detailed analysis of these datasets has not yet been performed elsewhere, a comparison to competing approaches is not possible. Nevertheless, results applying Bayesian Networks, as presented in [7], are provided where available. Note that the activities of interest only cover a small part of the databases. It is remarkable that for all classes only a few misses can be reported and a very small number of false positives is detected. A confusion matrix is not provided, as misses were usually confused with neutral behavior, which in turn was usually responsible for the false positives. Walking is handled as neutral behavior and, due to the large amount of data, not especially considered for the evaluation task. Nevertheless, it can be recognized almost flawlessly, although longer sequences of walking are frequently segmented into shorter parts. This problem can be handled by merging continuous streams of walking.

3.3 Scenario Recognition

Having extracted and detected all required LLAs, either with HMMs or using the tracking algorithm, these can now be further analyzed by a scenario interpretation module. Recent approaches were frequently based on a so-called Scenario Description Language (SDL), which contains examples for each possible scenario [13]. Applying the SDL based approach can be interpreted as rule-based reasoning, which can be achieved with a simple set of rules [8]. Current approaches use a wide range of LLA features and perform the analysis of behaviors or emotions with Dynamic Bayesian Networks (DBN) [41], which usually require a vast amount of data to compute the inferences. A simple form of the DBN, also data driven, is the well-known HMM. It is capable of segmenting and classifying data streams at the same time. Current implementations usually analyze the trajectory created by one person, not allowing for the interaction of multiple persons.
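To make the HMM-based LLA detection of Sect. 3.2 more concrete, the sketch below shows how per-class continuous HMMs could be trained on trajectory feature sequences and used to label an unknown segment by maximum likelihood. This is a simplified illustration rather than the authors' implementation: the hmmlearn library, the feature layout, and all names are assumptions, and it classifies whole segments instead of performing the Viterbi-based alignment and segmentation described above.

```python
import numpy as np
from hmmlearn import hmm

def train_lla_models(training_data, n_states=5):
    """Train one continuous HMM per Low Level Activity.

    training_data: dict mapping an LLA label (e.g. 'running', 'loitering')
    to a list of feature sequences, each of shape (T_i, n_features),
    e.g. speed, acceleration and change of direction per frame.
    """
    models = {}
    for label, sequences in training_data.items():
        X = np.concatenate(sequences)            # stack all sequences
        lengths = [len(s) for s in sequences]    # hmmlearn needs per-sequence lengths
        model = hmm.GaussianHMM(n_components=n_states,
                                covariance_type="diag", n_iter=50)
        model.fit(X, lengths)                    # EM (Baum-Welch) re-estimation
        models[label] = model
    return models

def classify_segment(models, segment):
    """Return the LLA label whose model gives the highest log-likelihood."""
    scores = {label: m.score(segment) for label, m in models.items()}
    return max(scores, key=scores.get)
```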
Table 2. Detection (det) results and false positives (fpos) for all five complex scenarios within the evaluated databases. Rules obviously perform far worse than DBNs, which are in turn outperformed by HMMs.

Event             [#]   det DBN   det Rules   det HMM   fpos DBN   fpos Rules   fpos HMM
Left Luggage       11         9           5        10          3            6          2
Luggage Theft       6         2           0         4          3            1          2
Operate ATM        17        17          17        17          2            5          0
Wait at ATM        15        15          10        15          3            7          1
Robbery at ATM      3         3           2         3          0            4          0
Furthermore, it seems hard to compute transition probabilities when a wide range of states and orders is allowed and only little data is available. Therefore it has already been proposed to couple Markov chains [13]. A DBN-based approach has been presented in [7], where the outputs of individually classified trajectories have been combined into an overall decision. In contrast to the previously used simple Markovian structure, an HMM-based implementation is now used to allow for more complex models and scenarios. As fig. 8 illustrates, the applied implementation allows transitions between several HMMs that are run through in parallel. This has the advantage that not each and every scenario has to be modeled individually, and links between individually modeled trajectories or persons can be established. In a very basic implementation it can be assumed that these state transitions are simple triggers, which set a feature value that allows leaving the current state after it has been repeated a couple of times. One of the major issues with this approach is the need for real data. As this is not available in vast amounts, training has been performed using real data and an additional set of definitions by experts, where artificial variance has been included by insertions and deletions of observations. The trained models have once more been evaluated on the previously mentioned three databases, namely PETS2006, PETS2007 and PROMETHEUS. A brief overview of the results is given in table 2, which compares the HMM-based approach to previous ones applying either rules [3] or the previously mentioned Dynamic Bayesian Networks (DBN) [7]. Obviously both DBNs and HMMs perform better than rule-based approaches. The presented coupled HMM approach nevertheless performs slightly better than the previous DBN-based implementation, which only allowed state transitions from left to right and not between individual models. Especially the lower false positive rate of the coupled HMM approach is remarkable.
Fig. 9. Exemplary recognition of a robbery at an ATM
Two exemplary recognition results from the PROMETHEUS database are provided in fig. 7 and fig. 9, where a person is either operating an ATM or being robbed at an ATM. As can be seen, the activities in the scene are correctly picked up, assigned to the corresponding persons, and displayed in the figures.
4 Conclusion and Outlook

We have presented an integrated framework for the robust interpretation of complex behaviors utilizing multi-camera surveillance systems. The tracking part has been conducted in a voxel-based representation of the desired tracking regions, which is based on multi-layer homography. The rough quantization of space has improved this approach both in speed and in performance. Nevertheless, tracking performance can be further enhanced by creating a 3D model of the person using information retrieved from the original images, as proposed for Probabilistic Occupancy Maps [19]. Furthermore, the introduction of other sensors, such as 3D cameras or thermal infrared, could provide a more reliable segmentation of the scene [5]. Further, it has been demonstrated that complex behavior can be decomposed into multiple easy-to-detect LLAs, which can be recognized either during the tracking phase or by applying HMMs. The detected LLAs are subsequently fed into a behavior interpretation module, which uses coupled HMMs and allows transitions between concurrently running models. Applying this approach resulted in a high detection rate and a low false positive rate for all three evaluated databases. For future development it would be desirable to analyze persons in further detail, including the estimation of the person's pose [2,29], which would also allow the recognition of gestures [42]. Besides the introduction of further features and potential LLAs, the scenario interpretation needs further improvement. While a limited number of behaviors can be modeled with little data, ambiguities between classes with low variance may not be resolved that easily. Summing up, the presented methods can be used as assistance for human-operated CCTV systems, helping staff to focus their attention on noticeable events at a low false positive rate while at the same time ensuring minimal false negatives.
References

1. Ahlberg, J., Arsić, D., Ganchev, T., Linderhed, A., Menezes, P., Ntalampiras, S., Olma, T., Potamitis, I., Ros, J.: Prometheus: Prediction and interpretation of human behavior based on probabilistic structures and heterogeneous sensors. In: Proceedings 18th ECCAI European Conference on Artificial Intelligence, ECAI 2008, Patras, Greece, pp. 38–39 (2008)
2. Andriluka, M., Roth, S., Schiele, B.: Monocular 3d pose estimation and tracking by detection. In: Proceedings International IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2010), pp. 623–630 (2010)
3. Arsić, D., Hofmann, M., Schuller, B., Rigoll, G.: Multi-camera person tracking and left luggage detection applying homographic transformation. In: Proceedings Tenth IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, PETS 2007, Rio de Janeiro, Brazil, pp. 55–62 (2007)
4. Arsić, D., Hörnler, B., Schuller, B., Rigoll, G.: A hierarchical approach for visual suspicious behavior detection in aircrafts. In: Proceedings 16th IEEE International Conference on Digital Signal Processing, Special Session "Biometric Recognition and Verification of Persons and their Activities for Video Surveillance", DSP 2009, Santorini, Greece (2009)
5. Arsić, D., Hörnler, B., Schuller, B., Rigoll, G.: Resolving partial occlusions in crowded environments utilizing range data and video cameras. In: Proceedings 16th IEEE International Conference on Digital Signal Processing, Special Session "Fusion of Heterogeneous Data for Robust Estimation and Classification", DSP 2009, Santorini, Greece (2009)
6. Arsić, D., Lehment, N., Hristov, E., Hörnler, B., Schuller, B., Rigoll, G.: Applying multi layer homography for multi camera tracking. In: Proceedings Second ACM/IEEE International Conference on Distributed Smart Cameras, ICDSC 2008, Stanford, CA, USA, pp. 1–9 (2008)
7. Arsić, D., Lyutskanov, A., Kaiser, M., Rigoll, G.: Applying Bayes Markov chains for the detection of ATM related scenarios. In: Proceedings IEEE Workshop on Applications of Computer Vision (WACV), in Conj. with the IEEE Computer Society's Winter Vision Meetings, Snowbird, Utah, USA, pp. 1–8 (2009)
8. Arsić, D., Schuller, B., Rigoll, G.: Multiple camera person tracking in multiple layers combining 2d and 3d information. In: Proceedings Workshop on Multi-camera and Multi-modal Sensor Fusion Algorithms and Applications (M2SFA2), Marseille, France (2008)
9. Auvinet, E., Grossmann, E., Rougier, C., Dahmane, M., Meunier, J.: Left luggage detection using homographies and simple heuristics. In: Proceedings Ninth IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, PETS 2006, New York, NY, USA, pp. 51–59 (2006)
10. Baum, L.E.: An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. Inequalities 3, 1–8 (1972)
11. Berclaz, J., Fleuret, F., Fua, P.: Multi-camera tracking and atypical motion detection with behavioral maps. In: Proceedings 10th European Conference on Computer Vision, Marseille, France (2008)
12. Broadhurst, A., Drummond, T., Cipolla, R.: A probabilistic framework for space carving. In: Proceedings Eighth IEEE International Conference on Computer Vision, ICCV 2001, pp. 388–393 (2001)
13. Carter, N., Ferryman, J.: The SAFEE on-board threat detection system. In: Proceedings International Conference on Computer Vision Systems, pp. 79–88 (May 2008)
14. Carter, N., Young, D., Ferryman, J.: A combined Bayesian Markovian approach for behaviour recognition. In: Proceedings 18th International IEEE Conference on Pattern Recognition, ICPR 2006, Washington, DC, USA, pp. 761–764 (2006)
15. Chen, D., Liao, H.M., Shih, S.: Continuous human action segmentation and recognition using a spatio-temporal probabilistic framework. In: Proceedings Eighth IEEE International Symposium on Multimedia, ISM 2006, Washington, DC, USA, pp. 275–282 (2006)
16. Choi, J., Cho, Y., Cho, K., Bae, S., Yang, H.S.: A view-based multiple objects tracking and human action recognition for interactive virtual environments. The International Journal of Virtual Reality 7, 71–76 (2008)
17. Estrada, F., Jepson, A., Fleet, D.: Planar homographies. Lecture notes, Foundations of Computer Vision, University of Toronto, Department of Computer Science (2004)
18. Ferryman, J., Shahrokni, A.: An overview of the PETS 2009 challenge. In: Proceedings Eleventh IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, PETS 2009, Miami, FL, USA, pp. 1–8 (2009)
19. Fleuret, F., Berclaz, J., Lengagne, R., Fua, P.: Multi-camera people tracking with a probabilistic occupancy map. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 30(2), 267–282 (2008)
20. Francois, A.R.J.: Real-time multi-resolution blob tracking. IRIS Technical Report IRIS-04-422, University of Southern California, Los Angeles, USA (2004)
21. Guler, S.: Scene and content analysis from multiple video streams. In: Proceedings 30th IEEE Workshop on Applied Imagery Pattern Recognition, AIPR 2001, pp. 119–123 (2001)
22. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge (2003)
23. Hu, W., Tan, T., Wang, L., Maybank, S.: A survey on visual surveillance of object motion and behaviors. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 34(3), 334–352 (2004)
24. Khan, S.M., Yan, P., Shah, M.: A homographic framework for the fusion of multi-view silhouettes. In: Proceedings Eleventh IEEE International Conference on Computer Vision, ICCV 2007, Rio de Janeiro, Brazil, pp. 1–8 (2007)
25. Khan, S., Shah, M.: A multiview approach to tracking people in crowded scenes using a planar homography constraint. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3954, pp. 133–146. Springer, Heidelberg (2006)
26. Kutulakos, K., Seitz, S.: A theory of shape by space carving. Technical Report TR692, Computer Science Department, University of Rochester (1998)
27. Laurentini, A.: The visual hull concept for silhouette-based image understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence 16(2), 150–162 (1994)
28. Lehment, N., Arsić, D., Lyutskanov, A., Schuller, B., Rigoll, G.: Supporting multi camera tracking by monocular deformable graph tracking. In: Proceedings Eleventh IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, PETS 2009, Miami, FL, USA, pp. 87–94 (2009)
29. Lehment, N., Kaiser, M., Arsić, D., Rigoll, G.: Cue-independent extending inverse kinematics for robust pose estimation in 3d point clouds. In: Proceedings IEEE International Conference on Image Processing (ICIP 2010), Hong Kong, China, pp. 2465–2468 (2010)
30. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60, 91–110 (2004)
31. Oliver, N., Rosario, B., Pentland, A.: A Bayesian computer vision system for modeling human interactions. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(8), 831–843 (2000)
32. Orwell, J., Remagnino, P., Jones, G.: Multi-camera colour tracking. In: Proceedings Second IEEE Workshop on Visual Surveillance, VS 1999, Fort Collins, CO, USA, pp. 14–21 (1999)
33. Perera, A., Srinivas, C., Hoogs, A., Brooksby, G., Hu, W.: Multi-object tracking through simultaneous long occlusions and split-merge conditions. In: Proceedings 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2006, Washington, DC, USA, pp. 666–673 (2006)
34. Rabiner, L.: A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77, 257–286 (1989)
35. Seitz, S., Curless, B., Diebel, J., Scharstein, D., Szeliski, R.: A comparison and evaluation of multi-view stereo reconstruction algorithms. In: Proceedings IEEE Conference on Computer Vision and Pattern Recognition, CVPR, New York, NY, June 17-22, vol. 1, pp. 519–528 (2006)
36. Takahashi, K., Seki, S., Kojima, E., Oka, R.: Recognition of dexterous manipulations from time-varying images. In: Proceedings 1994 IEEE Workshop on Motion of Non-Rigid and Articulated Objects, pp. 23–28 (1994)
37. Thirde, D., Li, L., Ferryman, J.: Overview of the PETS2006 challenge. In: Proceedings Ninth IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, PETS 2006, pp. 1–8. IEEE, New York (2006)
38. Vigus, S., Bull, D., Canagarajah, C.: Video object tracking using region split and merge and a Kalman filter tracking algorithm. In: Proceedings International Conference on Image Processing, ICIP 2001, Thessaloniki, Greece, vol. x, pp. 650–653 (2001)
39. Wang, L.: Abnormal walking gait analysis using silhouette-masked flow histograms. In: Proceedings 18th International Conference on Pattern Recognition, pp. 473–476. IEEE Computer Society, Washington, DC (2006)
40. Welsh, B., Farrington, D.: Effects of closed circuit television surveillance on crime. Campbell Systematic Reviews 17, 110–135 (2008)
41. Wöllmer, M., Schuller, B., Eyben, F., Rigoll, G.: Combining long short-term memory and dynamic Bayesian networks for incremental emotion-sensitive artificial listening. IEEE Journal of Selected Topics in Signal Processing 4(5), 867–881 (2010); special issue on "Speech Processing for Natural Interaction with Intelligent Environments"
42. Wu, C., Aghajan, H.: Model-based human posture estimation for gesture analysis in an opportunistic fusion smart camera network. In: Proceedings IEEE Conference on Advanced Video and Signal Based Surveillance, AVSS 2007, pp. 453–458 (2007)
43. Yue, Z., Zhou, S., Chellappa, R.: Robust two-camera tracking using homography. In: Proceedings IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2004, vol. 3, pp. 1–4 (2004)
Animated Faces for Robotic Heads: Gaze and Beyond
Samer Al Moubayed, Jonas Beskow, Jens Edlund, Björn Granström, and David House
Department of Speech, Music and Hearing, KTH Royal Institute of Technology, Stockholm, Sweden
{sameram,beskow,davidh}@kth.se, {edlund,bjorn}@speech.kth.se
http://www.speech.kth.se
Abstract. We introduce an approach to using animated faces for robotics where a static physical object is used as a projection surface for an animation. The talking head is projected onto a 3D physical head model. In this chapter we discuss the different benefits this approach adds over mechanical heads. After that, we investigate a phenomenon commonly referred to as the Mona Lisa gaze effect. This effect results from the use of 2D surfaces to display 3D images and causes the gaze of a portrait to seemingly follow the observer no matter where it is viewed from. The experiment investigates the perception of gaze direction by observers. The analysis shows that the 3D model eliminates the effect, and provides an accurate perception of gaze direction. We discuss at the end the different requirements of gaze in interactive systems, and explore the different settings these findings give access to. Keywords: Facial Animation, Talking Heads, Shader Lamps, Robotic Heads, Gaze, Mona Lisa Effect, Avatar, Dialogue System, Situated Interaction, 3D Projection, Gaze Perception.
1
Introduction
During the last two decades, there has been ongoing research and impressive progress in facial animation. Researchers have been developing human-like talking heads that can have human-like interaction with humans [1], realize realistic facial expressions [2], express emotions [3], and communicate behaviors [4]. Several talking heads are made to represent personas embodied in 3D facial designs (referred to as ECAs, Embodied Conversational Agents), simulating human behavior and establishing interaction and conversation with a human interlocutor. Although these characters have been embodied in human-like 3D animated models, this embodiment has always been limited by how these characters are displayed in our environment. Traditionally, talking heads have been displayed using two-dimensional displays (e.g. flat screens, wall projections, etc.),
having no shared access to the three-dimensional environment where the interaction is taking place. Surprisingly, there is little research on the effects that displaying 3D ECAs on 2D surfaces has on the perception of the agent's embodiment and its natural interaction [5]. Moreover, 2D displays come with several usually undesirable illusions and effects, such as the Mona Lisa gaze effect. For a review of these effects, refer to [6]. In robotics, on the other hand, the complexity, robustness and high resolution of facial animation, which is achieved using computer graphics, is not exploited. This is due to the fact that the accurate, highly subtle and complicated control of computer models (such as eyes, eyelids, wrinkles, lips, etc.) does not map onto mechanically controlled heads. Such computer models require very delicate, smooth, and fast control of the motors, appearance and texture of a mechanical head. This fact has large implications for the development of robotic heads. Moreover, in a physical mechanical robot head, the design and implementation of anthropomorphic properties can be limited, highly expensive, time consuming and difficult to test until the final head is finished. In talking heads, on the other hand, changes in color, design, features, and even control of the face can be very easy and time efficient compared to mechanically controlled heads. There are few studies attempting to take advantage of the appearance and behavior of talking heads in the design of robotic heads. In [7], a flat screen is used as the head of the robot, displaying an animated agent. In [8], the movements of the motors of a mechanical head are driven by the control parameters of animated agents, in an attempt to generate facial trajectories that are similar to those of a 3D animated face. These studies, although showing the interest in and need to use the characteristics of animated talking agents in robot heads, are still limited by how the agent is represented: in the first case by a 2D screen that comes with detrimental effects and illusions but profits from the appearance of the animated face, and in the second case by a mechanical head that tries to benefit from the behavior but misses out on the appearance. In this chapter we present a new approach for using animated faces for robotic heads, where we attempt to guarantee both the physical dimensionality and embodiment of the robotic head and the appearance and behavior of the animated agents. After presenting our approach and discussing its benefits, we investigate and evaluate it by studying its accuracy in delivering gaze direction in comparison to two-dimensional display surfaces. Perhaps one of the most important effects of displaying three-dimensional scenes on two-dimensional surfaces is the Mona Lisa gaze effect, commonly described as an effect that makes it appear as if the Mona Lisa's gaze rests steadily on the viewer as the viewer moves through the room. This effect has important implications for situational and spatial interaction, since the gaze direction of a face displayed on a two-dimensional display does not point to an absolute location in the environment of the observer. In Section 2 we describe our proposal of using a 3D model of a human head as a projection surface for an animated talking head. In Section 3 we discuss
Fig. 1. The technical setup: the physical model of a human head used as a 3D projection surface, to the left; the laser projector in the middle; and a snapshot of the 3D talking head to the right.
the benefits of using our approach in comparison to a traditional mechanical robotic head. In Section 4 we describe an experimental setup and a user study on the perception of gaze targets using a traditional 2D display and the novel 3D projection surface. In Section 5 we discuss the properties of gaze in terms of faithfulness for different communication requirements and configurations. We discuss different applications that can capitalize on our approach as well as research and experimentation made possible by it in Section 6 and present final conclusions in Section 7.
2
Projected Animated Faces on 3D Head Models
Our approach is based on the idea of projecting an animated face onto a 3D surface: a static, physical model of a human head. The technique of manipulating static objects with light is commonly referred to as the Shader Lamps technique [9,10]. This technique is used to change the physical appearance of still objects by illuminating them with projections of static or animated textures or video streams. We implement this technique by projecting an animated talking head (seen to the right in figure 1) on an arbitrary physical model of a human head (seen to the left in figure 1) using a laser micro projector (SHOWWX Pico Projector, seen in the center of figure 1). The main advantage of using a laser projector is that the image is always in focus, even on curved surfaces. The talking head used in the studies is detailed in [11] and includes a face, eyes, tongue, and teeth, based on static 3D wireframe meshes that are deformed using direct parameterization by applying weighted transformations to their vertices according to principles first introduced by [12]. Figure 2 shows the 3D projection surface with and without a projection of the talking head.
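The core of any shader-lamps setup is rendering the animated face from the projector's point of view, so that the projected image lands correctly on the physical head. The following sketch illustrates only this general principle, assuming a calibrated pinhole projector model; it is not the authors' implementation, and all names are placeholders.

```python
import numpy as np

def project_to_projector(vertices, K, R, t):
    """Map 3D face-mesh vertices (N x 3, in head coordinates) to pixel
    coordinates in the projector image, using a pinhole projector with
    intrinsics K (3x3) and pose R (3x3), t (3,) relative to the head model."""
    cam_pts = vertices @ R.T + t                 # head frame -> projector frame
    img_pts = cam_pts @ K.T                      # apply projector intrinsics
    return img_pts[:, :2] / img_pts[:, 2:3]      # perspective division -> pixels
```

Rasterizing the textured face mesh at these coordinates yields the image that the laser projector throws onto the head model.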
3
Robotic Heads with Animated Faces
The capacity for adequate interaction is a key concern. Since a great proportion of human interaction is managed non-verbally through gestures, facial expressions
Fig. 2. A physical model of a human head, without projection (left) and complete with a projection of the talking head, a furry hat, and a camera (right)
and gaze, an important current research trend in robotics deals with the design of social robots. But what mechanical and behavioral compromises should be considered in order to achieve satisfying interaction with human interlocutors? In the following, we present an overview of the practical benefits of using an animated talking head projected on a 3D surface as a robotic head.

1. Optically based. Since the approach utilizes a static 3D projection surface, the actual animation is done completely using computer graphics projected onto the surface. This provides an alternative to mechanically controlled faces, reducing electrical consumption and avoiding complex mechanical designs and motor control. Computer graphics also offers many advantages over motor-based animation of robotic heads in speed, animation accuracy, resolution and flexibility.
2. Animation using computer graphics. Facial animation technology has shown tremendous progress over the last decade, and currently offers realistic, efficient, and reliable renditions. It is currently able to establish facial designs that are very human-like in appearance and behavior compared to the physical designs of mechanical robotic heads.
3. Facial design. The face design is done through software, which potentially provides the flexibility of having an unlimited range of facial designs for the same head. Even if the static projection surface needs to be re-customized to match a particularly unusual design, this is considerably simpler, faster, and cheaper than redesigning a whole mechanical head. In addition, the easily interchangeable face design offers the possibility to efficiently experiment with different aspects of facial designs and characteristics in robotic heads, for example to examine the anthropomorphic spectrum.
4. Light weight. The optical design of the face leads to a considerably more lightweight head, depending only on the design of the projection surface. This makes the design of the neck much simpler, and a more lightweight neck can be used, as it has to carry and move less weight. Ultimately, a lighter mobile robot is safer and saves energy.
5. Low noise level. The alternative of using light projection over a motor-controlled face avoids all motor noises generated by moving the face. This is
crucial for a robot interacting verbally with humans, and in any situation where noise generation is a problem.
6. Low maintenance. Maintenance is reduced to software maintenance and maintenance of the micro laser projector, which is very easily replaceable. In contrast, mechanical faces are complicated, both electronically and mechanically, and an error in the system can be difficult and time consuming to troubleshoot.

Naturally, there are drawbacks as well. Some robotic face designs cannot be achieved in full using light-projected animation alone, for example those requiring very large jaw openings, which cannot be easily and realistically delivered without mechanically changing the physical projection surface. For such requirements, a hybrid approach can be implemented which combines a motor-based physical animation of the head for the larger facial movements with an optically projected animation for the more subtle movements, for example changes in eyes, wrinkles and eyebrows. In addition, the animations are delivered using light, so the projector must be able to outshine the ambient light, which becomes an issue if the robot is designed to be used in very bright light, such as full daylight. The problem can be remedied by employing the ever more powerful laser projectors that are being brought to the market.
4
Gaze Perception and the Mona Lisa Gaze Effect
The importance of gaze in social interaction is well established. From a human communication perspective, Kendon's work in [13] on gaze direction in conversation is particularly important in inspiring a wealth of studies that singled out gaze as one of the strongest non-vocal cues in human face-to-face interaction (see e.g. [14]). Gaze has been associated with a variety of functions within social interaction; Kleinke's review article from 1986, for example, contains the following list: (a) provide information, (b) regulate interaction, (c) express intimacy, (d) exercise social control, and (e) facilitate service and task goals ([15]). These efforts, in turn, were shadowed by a surge of activity in the human-computer interaction community, which recognized the importance of modeling gaze in artificial personas such as embodied conversational agents (ECAs) (e.g. [16]; [17]). To date, these efforts have been somewhat biased towards the production of gaze behavior, whereas less effort has been expended on the perception of gaze. In light of the fact that an overwhelming majority of ECAs are either 2D or 3D models rendered on 2D displays, this is somewhat surprising: the perception of 2D renditions of 3D scenes is notoriously riddled with artefacts and illusions of many sorts; for an overview, see [18]. Perhaps the most important of these for using gaze behaviors in ECAs for communicative purposes is the Mona Lisa gaze effect or the Mona Lisa stare, commonly described as an effect that makes it appear as if the Mona Lisa's gaze rests steadily on the viewer as the viewer moves through the room (figure 3). The fact that the Mona Lisa gaze effect occurs when a face is presented on a 2D display has significant consequences for the use and control of gaze in communication. To the extent that gaze in a 2D face follows the observer, gaze does not
Fig. 3. Leonardo da Vinci’s Mona Lisa. Mona Lisa appears to be looking straight at the viewer, regardless of viewing angle. The painting is in the public domain.
point unambiguously at a point in 3D space. In the case of multiple observers, they all have the same perception of the image, no matter where they stand in relation to e.g. the painting or screen. This causes an inability to establish situated eye contact with one particular observer without simultaneously establishing it with all others, which leads to miscommunication if gaze is employed to support a smoothly flowing interaction with several human subjects: all human subjects will perceive the same gaze pattern. In the following experiment, we investigate the accuracy of perceived gaze direction in our 3D head model, discuss the different applications it can be used for, and contrast it with a traditional 2D display. The experiment detailed here was designed and conducted to confirm the hypothesis that a talking head projected on a 2D display is subject to the Mona Lisa gaze effect, while projecting it on a 3D surface inhibits the effect and enforces an eye-gaze direction that is independent of the subject's angle of view. Accordingly, the experiment measures the perception accuracy of gaze in these two configurations.
4.1
Setup
The experiment setup employs a set of subjects simultaneously seated on a circle segment centred on the stimulus point, a 2D or 3D projection surface, and facing it. Adjacent subjects are equidistant from each other and all subjects are equidistant from the projection surface, so that the angle between two adjacent subjects and the projection surface was always about 26.5 degrees. The positions are annotated as -53, -26.5, 0, 26.5, and 53, where 0 is the seat directly in front of the projection surface. The distance from the subjects to the projection surface was 1.80 meters (figure 4).
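For reference, the subjects' physical coordinates implied by this description can be reconstructed with a few lines of Python; the coordinate frame below (surface at the origin, 0 degrees pointing straight ahead) is an assumption for illustration and is not taken from the paper.

```python
import numpy as np

# Five seats on a circle of radius 1.80 m around the projection surface,
# spaced 26.5 degrees apart, as described above.
angles_deg = np.array([-53.0, -26.5, 0.0, 26.5, 53.0])
radius = 1.80
x = radius * np.sin(np.radians(angles_deg))   # lateral offset in metres
y = radius * np.cos(np.radians(angles_deg))   # distance in front of the surface
positions = np.stack([x, y], axis=1)          # one (x, y) pair per seat
```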
Fig. 4. Schematic of the experiment setup: five simultaneous subjects are placed at equal distances along the perimeter of a circle centred on the projection surface
Two identical sets of stimuli are projected on a 2D surface in the 2D condition (2DCOND) and on a 3D surface in the 3D condition (3DCOND). The stimulus sets contain the animated talking head with 20 different gaze angles. The angles are equally spaced between -25 degrees and +13 degrees in the 3D model's internal gaze angle (horizontal eyeball rotation in relation to the skull) with 2-degree increments, where a rotation of 0 degrees is when the eyes are looking straight forward. The angles between +13 degrees and +25 degrees were left out because of a programming error, but we found no indications that this asymmetry has any negative effects on the experimental results. Five subjects were simultaneously employed in a within-subject design, where each subject judged each stimulus in the experiment. All five subjects had normal or corrected-to-normal eyesight.
4.2
Method
Before the experiment, the subjects were presented with an answer sheet, and the task of the experiment was explained: to point out, for each stimulus, which subject the gaze of the animated head is pointing at. The advantage of using subjects as gaze targets is that this method provides perceptually and communicatively relevant gaze targets instead of using, for example, a spatial grid as in [19]. For each set of 20 stimuli, each of the seated subjects got an empty answer sheet with 20 answer lines indicating the positions of all subjects. The subject entered a mark on one of the subjects, indicating her decision. If the subject believed the head was looking beyond the rightmost or the leftmost subject, the subject entered the mark at the end of either of the two arrows to the right or left of the boxes that represent the subjects.
Fig. 5. Snapshots, taken over the shoulder of a subject, of the projection surfaces in 3DCOND (left) and 2DCOND (right)
The five subjects were then randomly seated at the five positions and the first set of 20 stimuli was projected in 3DCOND, as seen on the left of figure 5. Subjects marked their answer sheets after each stimulus. When all stimuli had been presented, the subjects were shifted to new positions and the process repeated, in order to capture any bias for subject/position combinations. The process was repeated five times, so that each subject sat in each position once, resulting in five sets of responses from each subject.
4.3
Analysis and Results
Figure 6 plots the raw data for all the responses over gaze angles. The size of the bubbles indicates the number of responses with the corresponding value for that angle; the bigger the bubble, the more subjects perceived gaze in that particular direction. It is clear that in 3DCOND the perception of gaze is more precise (i.e. fewer bubbles per row) compared to 2DCOND. Figure 7 shows bubble plots similar to those in figure 6, with responses for each stimulus. The figure differs in that the data plotted is filtered so that only those responses are plotted where the perceived gaze matched the responding subject, that is, when subjects responded that the gaze was directed directly at themselves, which is commonly called eye contact or mutual gaze. These plots show the location and the number of the subjects that perceived eye contact over the different gaze angles. In 2DCOND, the Mona Lisa gaze effect is very visible: for all the near-frontal angles, each of the five subjects, independently of where they were seated, perceived eye contact. The figure also shows that the effect is completely eliminated in 3DCOND, in which generally only one subject at a time perceived eye contact with the head.
4.4
Estimating the Gaze Function
In addition to investigating the gaze perception accuracy of projections on different types of surfaces, the experimental setup allows us to measure a psychometric
Fig. 6. Responses for all subject positions (X axis) over all internal angles (Y axis) for each of the conditions: 2DCOND to the left and 3DCOND to the right. Bubble size indicates number of responses. The X axis contains the responses for each of the five subject positions (from 1 to 5), where 0 indicates gaze perceived beyond the leftmost subject, and 6 indicates gaze perceived beyond the rightmost subject.
function for gaze which maps eyeball rotation in a virtual talking head to physical, real-world angles, an essential function for establishing eye contact between the real and the virtual world. We estimated this function by applying a first-order polynomial fit to the data to get a linear mapping from the real positions of the gaze targets perceived by the subjects to the actual internal eyeball angles in the projected animated talking head, for each condition. In 2DCOND, the estimated function that resulted from the linear fit to the data is:

Angle = -5.2 × Gaze Target    (1)
RMSE = 17.66    (2)
R square = 0.668    (3)
Fig. 7. Bubble plot showing only responses where subjects perceived eye-contact: subject position (X axis) over all internal angles (Y axis) for each of the conditions: 2DCond to the left and 3DCond to the right. Bubble size indicates number of responses.
And for 3DCOND:

Angle = -4.1 × Gaze Target    (4)
RMSE = 6.65    (5)
R square = 0.892    (6)
where R square represents the ability of the linear fit to describe the data. Although the resulting gaze functions from the two conditions are similar, the goodness of fit is markedly better in 3DCOND than in 2DCOND. The results provide a good estimation of a gaze psychometric function. If the physical target gaze point is known, the internal angle of eye rotation can be calculated. By reusing the experimental design, the function can be estimated for any facial design or display surface.
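As an illustration of this estimation procedure, the sketch below fits a through-origin linear gaze function with NumPy and computes the reported goodness-of-fit measures; the data arrays and function name are placeholders, not the study's actual measurements.

```python
import numpy as np

def fit_gaze_function(gaze_targets, internal_angles):
    """Least-squares fit of the through-origin model
       Angle = slope * GazeTarget,
    matching the form of Eqs. (1) and (4), plus RMSE and R square."""
    x = np.asarray(gaze_targets, dtype=float)
    y = np.asarray(internal_angles, dtype=float)
    slope = np.sum(x * y) / np.sum(x * x)            # closed-form LS solution
    residuals = y - slope * x
    rmse = np.sqrt(np.mean(residuals ** 2))
    r_square = 1.0 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)
    return slope, rmse, r_square

# Inverting the fitted function gives the eyeball rotation needed to look at
# a known physical target: internal_angle = slope * target_position.
```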
5
Spatial Faithfulness of Gaze and Situated Interaction
Armed with this distinction between perception of gaze in 2D and 3D displays, we now turn to how communicative gaze requirements are met by the two system types. Situated interaction requires a shared perception of spatial properties
where interlocutors and objects are placed, in which direction a speaker or listener turns, and at what the interlocutors are looking. Accurate gaze perception is crucial, but plays different roles in different kinds of communication, for example between co-located interlocutors, between humans in avatar- or video-mediated human-human communication, and between humans and ECAs or robots in spoken dialogue systems. We propose that it is useful to talk about three levels of gaze faithfulness, as follows. We define the observer as the entity perceiving gaze and a target point as an absolute position in the observer's space.

– Mutual Gaze. When the observer is the gaze target, the observer correctly perceives this. When the observer is not the gaze target, the observer correctly perceives this. In other words, the observer can correctly answer the question: Does she look me in the eye?
– Relative Gaze. There is a direct and linear mapping between the intended angle of the gaze relative to the observer and the observer's perception of that angle. In other words, the observer can correctly answer the question: How much to the left of/to the right of/above/below me is she looking?
– Absolute Gaze. A one-to-one mapping is correctly preserved between the intended target point of gaze and the observer's perception of that target point. In other words, the observer can accurately answer the question: At what exactly is she looking?

Whether a system can produce faithful gaze or not depends largely on four parameters. Two of these represent system capabilities: the type of display used, limited here to whether the system produces gaze on a 2D surface or on a 3D surface, and whether the system knows where relevant objects (including the interlocutor's head and eyes) are in physical space (e.g. through automatic object tracking or with the help of manual guidance). A special case of the second capability is the ability to know only where the head of the interlocutor is. The remaining two have to do with the requirements of the application: the first is what level of faithfulness is needed, as discussed above, and the second is whether the system is to interact with one or many interlocutors at the same time. We start by examining single-user systems with a traditional 2D display without object tracking. These systems are faithful in terms of mutual gaze: no matter where in the room the observer is, the system can look straight ahead to achieve mutual gaze and anywhere else to avoid it. They are faithful in terms of relative gaze: regardless of where in the room the observer is, the system can look to the left and be perceived as looking to the right of the observer, and so on. And they are unrealistic in terms of absolute gaze: the system can only be perceived as looking at target objects other than the observer by pure luck. Next, we note that single-user systems with a traditional 2D display with object tracking are generally the same as those without object tracking. It is possible, however, that the object tracking can help absolute gaze faithfulness, but it requires a fairly complex transformation involving targeting the objects in terms of angles relative to the observer. If the objects are targeted in absolute terms, the observer will not perceive gaze targets as intended.
Fig. 8. Faithful (+) or unrealistic (-) gaze behaviour under different system capabilities and application requirements. +* signifies that although faithfulness is most likely possible, it involves unsolved issues and additional transformations that are likely to cause complications.
Multi-user systems with a traditional 2D display and no object tracking perform poorly. They are unrealistic in terms of mutual gaze, as either all or none of the observers will perceive mutual gaze; they are unrealistic with respect to relative gaze, as all observers will perceive the gaze to be directed at the same angle relative to themselves; and they are unrealistic in terms of absolute gaze as well. Multi-user systems with a traditional 2D display and object tracking perform exactly as poorly as those without object tracking: regardless of any attempt to use the object tracking to help absolute faithfulness by transforming target positions into relative terms, all observers will perceive the same angle in relation to themselves, and at best only one will perceive the intended position. Turning to the 3D projection surface systems, both single- and multi-user systems with a 3D projection surface and no object tracking are unrealistic in terms of mutual gaze, relative gaze, and absolute gaze: without knowing where to direct its gaze in real space, such a system is lost. By adding head tracking, the systems can produce faithful mutual gaze, and single-user systems with head tracking can attempt faithful relative gaze by shifting the gaze angle relative to the observer's head. In contrast, both single- and multi-user systems with a 3D projection surface and object tracking, coupling the ability to know where objects and observers are with the ability to target any position, are faithful in terms of all of mutual gaze, relative gaze, and absolute gaze. Figure 8 presents an overview of how meeting the three levels of faithfulness depends on system capabilities and application requirements. Examining the table in the figure, we first note that in applications where more than one
participant is involved, using a 2D projection surface will result in a system that is unrealistic on all levels (lower left quadrant of the table), and secondly, that a system with a 3D projection surface and object tracking will provide faithful eye gaze regardless of application requirements (rightmost column). These are the perhaps unsurprising results of the Mona Lisa gaze effect being in place in the first case, causing the gaze perception of everyone in a room to be the same, and of mimicking the conditions under which a situated human interacts in the second case, with a physical presence in space and full perception of the environment and one's relation to it. Thirdly, we note that if no automatic or manual object or head tracking is available, the 3D projection surface is unrealistic in all conditions, as it requires information on where in the room to direct its gaze, and that head-only tracking improves the situation to some extent. Fourthly, and more interestingly, we note that in single-user cases where no tracking or only head tracking is available, the 2D surface is the most faithful one (upper left quadrant). In these cases, we can tame and harness the Mona Lisa gaze effect and make it work for us. This suggests that gaze experiments such as those described in [20] and [21] could not have been performed with a 3D projection surface unless sophisticated head trackers had been employed. In summary, it is worthwhile to have a clear view of the requirements of the application or investigation before designing the system. In some cases (i.e. single-user cases with no need for absolute gaze faithfulness), a simpler 2D display system without any tracking can give results similar to a more complex 3D projection surface system with head or object tracking facilities, at considerably lower cost and effort. On the other hand, if we are to study situated interaction with objects and multiple participants, we need to guarantee successful delivery of gaze at all levels, with a 3D projection surface that inhibits the Mona Lisa gaze effect and reliable object tracking, manual or automatic, to direct the gaze.
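As a compact, purely illustrative summary of the qualitative conclusions above (not a transcription of Fig. 8), the system configurations and the gaze-faithfulness levels they support could be encoded as a simple lookup table; "partial" marks the cases discussed as possible only with additional effort.

```python
# Keys: (display, tracking, users); values: supported faithfulness levels.
# This encodes only what the preceding discussion states explicitly.
FAITHFULNESS = {
    ("2D", "none",   "single"): {"mutual": True,  "relative": True,      "absolute": False},
    ("2D", "object", "single"): {"mutual": True,  "relative": True,      "absolute": "partial"},
    ("2D", "none",   "multi"):  {"mutual": False, "relative": False,     "absolute": False},
    ("2D", "object", "multi"):  {"mutual": False, "relative": False,     "absolute": False},
    ("3D", "none",   "single"): {"mutual": False, "relative": False,     "absolute": False},
    ("3D", "none",   "multi"):  {"mutual": False, "relative": False,     "absolute": False},
    ("3D", "head",   "single"): {"mutual": True,  "relative": "partial", "absolute": False},
    ("3D", "head",   "multi"):  {"mutual": True,  "relative": False,     "absolute": False},
    ("3D", "object", "single"): {"mutual": True,  "relative": True,      "absolute": True},
    ("3D", "object", "multi"):  {"mutual": True,  "relative": True,      "absolute": True},
}
```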
6 Applications and Discussions
As we have seen, the Mona Lisa gaze effect is highly undesirable in several communicative setups due to the manner in which it limits our ability to control gaze target perception. We have also seen that under certain circumstances the effect (a cognitive ability to perceive a depicted scene from the point of view of the camera or painter) can be harnessed to allow us to build relatively simple applications which would otherwise have required much more effort. A hugely successful example is the use of TV screens and movie theaters, where entire audiences perceive the same scene independently of where they are seated. If this were not the case, the film and TV industries might well have been less successful. There are also situations where an ECA can benefit from establishing eye contact either with all viewers simultaneously in a multiparty situation, as when delivering a message or taking the role of, e.g., a weather presenter, or when it is required to establish eye contact with one person whose position in
the room is unknown to the ECA, as is the case in most spoken dialogue system experiments involving an ECA to date. Although the Mona Lisa gaze effect can be exploited in some cases, it is an obstacle to be overcome in the majority of interaction scenarios, such as those where gaze is required to point exclusively to objects in the physical 3D space of the observer, or where multiple observers are involved in anything but the most basic interactions. In order to do controlled experiments investigating gaze in situated multiparty dialogues, the Mona Lisa effect must be overcome, and we can do this readily using the proposed technique. In other words, the technique opens possibilities for many applications which require absolute gaze perception and which would not have been possible with the use of a 2D display. In the following we present a short list of application families that we have recently begun to explore in the situated interaction domain, all of which require the levels of gaze perception afforded by 3D projection surfaces. The first family of applications is situated and multiparty dialogues with ECAs or social conversational robots. These systems need to be able to switch their attention among the different dialogue partners, while keeping the partners informed about the status of the dialogue and who is being addressed, and exclusive eye contact with single subjects is crucial for selecting an addressee. In such scenarios, a coherently shared and absolute perception of gaze targets is needed to achieve a smooth, human-like dialogue flow, a requirement that cannot be met unless the Mona Lisa gaze effect is eliminated. The second family involves any application where there is a need for a pointing device that points at objects in real space, the space of the human participant. Gaze is a powerful pointing device that can point from virtual space to real space while being completely non-mechanical (as opposed to, for example, fingers or arrows), and it is non-intrusive and subtle. A third family of applications is mediated interaction and tele-presence. A typical application in this family is video conferencing. In a traditional system, the remote partner cannot meaningfully gaze into the environment of the other partners, since the remote partner is presented through a 2D display subject to the Mona Lisa gaze effect. Establishing a one-to-one interaction through mutual gaze cannot be done, as there is no ability to establish exclusive eye contact. In addition, people look at the video presenting the other partners instead of looking into the camera, which is another obstacle to shared attention and mutual gaze, and no one can reliably estimate what the remote participant is looking at. If a 3D head is used to represent the remote subject, who is represented through mediation as an avatar, these limitations of video conferencing can, at least partially, be resolved.
7 Conclusions
To sum up, we have proposed two ways of taming Mona Lisa: firstly by eliminating the effect and secondly by harnessing and exploiting it.
En route to this conclusion, we have proposed an affordable way of eliminating the effect by projecting an animated talking head on a 3D projection surface (a generic physical 3D model of a human head), and verified experimentally that it allows subjects to perceive gaze targets in the room clearly from various viewing angles, meaning that the Mona Lisa effect is eliminated. In the experiment, the 3D projection surface was contrasted with a 2D projection surface, which clearly displayed the Mona Lisa gaze effect. In addition to eliminating the Mona Lisa gaze effect, the 3D setup allowed observers to perceive with very high agreement who was being looked at; the 2D setup showed no such agreement. We showed how the data serves to estimate a gaze psychometric function mapping actual gaze targets into eyeball rotation values in the animated talking head. Based on the experimental data and the working model, we proposed three levels of gaze faithfulness relevant to applications using gaze: mutual gaze faithfulness, relative gaze faithfulness, and absolute gaze faithfulness. We further suggested that whether a system achieves gaze faithfulness depends on several system capabilities (whether the system uses a 2D display or the proposed 3D projection surface, and whether it has some means of knowing where objects and interlocutors are), but also on the application requirements (whether the system is required to speak to more than one person at a time, and the level of gaze faithfulness it requires). One of the implications of this is that the Mona Lisa gaze effect can be exploited and put to work for us in some types of applications. Although perhaps obvious, this falls out neatly from the working model. Another implication is that the only way to robustly achieve all three levels of gaze faithfulness is to have some means of tracking objects in the room and to use an appropriate 3D projection surface; without knowledge of object positions, the 3D projection surface falls short. We close by discussing the benefits of 3D projection surfaces for human-robot interaction, where the technique can be used to create faces for robotic heads with a high degree of human-likeness, better design flexibility, more sustainable animation, low weight and noise levels and lower maintenance costs, and by discussing in some detail a few application types and research areas where the elimination of the Mona Lisa gaze effect through the use of 3D projection surfaces is particularly useful, such as when dealing with situated interaction or multiple interlocutors. We consider this work to be a stepping stone for several future investigations and studies into the role and employment of gaze in human-robot, human-ECA, and human-human mediated interaction. Acknowledgments. This work has been partly funded by the EU project IURO (Interactive Urban Robot), FP7-ICT-248314. The authors would like to thank the five subjects for participating in the experiment.
References
1. Beskow, J., Edlund, J., Granström, B., Gustafson, J., House, D.: Face-to-face interaction and the KTH Cooking Show. In: Esposito, A., Campbell, N., Vogel, C., Hussain, A., Nijholt, A. (eds.) Development of Multimodal Interfaces: Active Listening and Synchrony, pp. 157–168. Springer, Heidelberg (2010)
2. Ruttkay, Z., Pelachaud, C. (eds.): From Brows till Trust: Evaluating Embodied Conversational Agents. Kluwer, Dordrecht (2004)
3. Pelachaud, C.: Modeling Multimodal Expression of Emotion in a Virtual Agent. Philosophical Transactions of the Royal Society B: Biological Sciences 364, 3539–3548 (2009)
4. Granström, B., House, D.: Modeling and evaluating verbal and non-verbal communication in talking animated interface agents. In: Dybkjaer, L., Hemsen, H., Minker, W. (eds.) Evaluation of Text and Speech Systems, pp. 65–98. Springer, Heidelberg (2007)
5. Shinozawa, K., Naya, F., Yamato, J., Kogure, K.: Differences in effect of robot and screen agent recommendations on human decision-making. International Journal of Human Computer Studies 62(2), 267–279 (2005)
6. Todorović, D.: Geometrical basis of perception of gaze direction. Vision Research 45(21), 3549–3562 (2006)
7. Gockley, R., Simmons, J., Wang, D., Busquets, C., DiSalvo, K., Caffrey, S., Rosenthal, J., Mink, S., Thomas, W., Adams, T., Lauducci, M., Bugajska, D., Perzanowski, Schultz, A.: Grace and George: Social Robots at AAAI. In: Proceedings of AAAI 2004, Mobile Robot Competition Workshop, pp. 15–20. AAAI Press, Menlo Park (2004)
8. Sosnowski, S., Mayer, C., Kuehnlenz, K., Radig, B.: Mirror my emotions! Combining facial expression analysis and synthesis on a robot. In: Proceedings of the Thirty-Sixth Annual Convention of the Society for the Study of Artificial Intelligence and Simulation of Behaviour, AISB 2010 (2010)
9. Raskar, R., Welch, G., Low, K.-L., Bandyopadhyay, D.: Shader lamps: animating real objects with image-based illumination. In: Proc. of the 12th Eurographics Workshop on Rendering Techniques, pp. 89–102 (2001)
10. Lincoln, P., Welch, G., Nashel, A., Ilie, A., State, A., Fuchs, H.: Animatronic shader lamps avatars. In: Proc. of the 8th IEEE International Symposium on Mixed and Augmented Reality (ISMAR 2009). IEEE Computer Society, Washington, DC (2009)
11. Beskow, J.: Talking heads – Models and applications for multimodal speech synthesis. Doctoral dissertation, KTH (2003)
12. Parke, F.I.: Parameterized Models for Facial Animation. IEEE Computer Graphics and Applications 2(9), 61–68 (1982)
13. Kendon, A.: Some functions of gaze direction in social interaction. Acta Psychologica 26, 22–63 (1967)
14. Argyle, M., Cook, M.: Gaze and Mutual Gaze. Cambridge University Press, Cambridge (1976), ISBN: 978-0521208659
15. Kleinke, C.L.: Gaze and eye contact: a research review. Psychological Bulletin 100, 78–100 (1986)
16. Takeuchi, A., Nagao, K.: Communicative facial displays as a new conversational modality. In: Proc. of the INTERACT 1993 and CHI 1993 Conference on Human Factors in Computing Systems (1993)
17. Bilvi, M., Pelachaud, C.: Communicative and statistical eye gaze predictions. In: Proc. of the International Conference on Autonomous Agents and Multi-Agent Systems, Melbourne, Australia (2003)
18. Gregory, R.: Eye and Brain: The Psychology of Seeing. Princeton University Press, Princeton (1997)
19. Delaunay, F., de Greeff, J., Belpaeme, T.: A study of a retro-projected robotic face and its effectiveness for gaze reading by humans. In: Proc. of the 5th ACM/IEEE International Conference on Human-Robot Interaction, pp. 39–44. ACM, New York (2010)
20. Edlund, J., Nordstrand, M.: Turn-taking gestures and hour-glasses in a multimodal dialogue system. In: Proc. of ISCA Workshop on Multi-Modal Dialogue in Mobile Environments, Kloster Irsee, Germany (2002)
21. Edlund, J., Beskow, J.: MushyPeek – a framework for online investigation of audiovisual dialogue phenomena. Language and Speech 52(2-3), 351–367 (2009)
RANSAC-Based Training Data Selection on Spectral Features for Emotion Recognition from Spontaneous Speech
Elif Bozkurt¹, Engin Erzin¹, Çiğdem Eroğlu Erdem², and A. Tanju Erdem³
¹ Multimedia, Vision and Graphics Laboratory, College of Engineering, Koç University, 34450, Sariyer, Istanbul, Turkey
{ebozkurt,eerzin}@ku.edu.tr
² Department of Electrical and Electronics Engineering, Bahçeşehir University, 34349 Beşiktaş, Istanbul, Turkey
[email protected]
³ Department of Electrical and Electronics Engineering, Özyeğin University, 34662 Üsküdar, Istanbul, Turkey
[email protected]
Abstract. Training datasets containing spontaneous emotional speech are often imperfect due to the ambiguities and difficulties of labeling such data by human observers. In this paper, we present a Random Sample Consensus (RANSAC) based training approach for the problem of emotion recognition from spontaneous speech recordings. Our motivation is to insert a data cleaning process into the training phase of the Hidden Markov Models (HMMs) for the purpose of removing some suspicious instances of labels that may exist in the training dataset. Our experiments using HMMs with Mel Frequency Cepstral Coefficients (MFCC) and Line Spectral Frequency (LSF) features indicate that utilization of RANSAC in the training phase provides an improvement in the unweighted recall rates on the test set. Experimental studies performed over the FAU Aibo Emotion Corpus demonstrate that decision fusion configurations with LSF- and MFCC-based classifiers provide further significant performance improvements.
Keywords: Affect recognition, emotional speech classification, RANSAC, data cleaning, decision fusion.
1 Introduction
For supervised pattern recognition problems such as emotion recognition from spontaneous speech, large training sets need to be recorded and labeled to be used for the training of the classifier. The labeling of large training datasets is a tedious job, carried out by humans and hence prone to human mistakes. Mislabeled (or noisy) examples in the training data may result in a decrease in classifier performance. It is not easy to identify these contaminations or imperfections of the training data, since they may also simply be hard-to-learn examples.
In that respect, pointing out troublesome examples is a chicken-and-egg problem, since good classifiers are needed to tell which examples are noisy [1]. Spectral features play an important role in emotion recognition. The dynamics of the vocal tract can potentially change under different emotional states; hence the spectral characteristics of speech differ for various emotions [14]. Utterance-level statistics of spectral features have been widely used in speech emotion recognition and have demonstrated considerable success [13], [12]. In this work, we assume that outliers in the training set of emotional speech recordings mainly result from mislabeled or ambiguous data. Our goal is to remove such noisy samples from the training set to increase the performance of Hidden Markov Model based classifiers modeling spectral features.
1.1 Previous Work
Previous research on data cleaning, also called data pruning or decontamination of training data, shows that removing noisy samples is worthwhile [1] [2] [3]. Guyon et al. [9] have studied data cleaning in the context of discovering informative patterns in large databases. They mention that informative patterns are often intermixed with unwanted outliers, which are errors introduced unintentionally into the database. Informative patterns correspond to atypical or ambiguous data and are pointed out as the most "surprising" ones. On the other hand, garbage patterns, which correspond to meaningless or mislabeled patterns, are also surprising. The authors point out that automatically cleaning the data by eliminating patterns with suspiciously large information gain may result in the loss of valuable informative patterns. Therefore they propose a user-interactive method for cleaning a database of hand-written images, where a human operator checks those patterns that have the largest information gain and are therefore the most suspicious. Barandela and Gasca [2] report a cleaning process that removes suspicious instances from the training set or corrects their class labels and keeps them in the training set. Their method is based on the Nearest Neighbor classifier. Wang et al. [22] present a method to sample large and noisy multimedia data. Their method is based on a simple distance measure that compares the histograms of the sample set and the whole set in order to assess the representativeness of the sample set. The proposed method deals with noise in an elegant way, and has been shown to be superior to the simple random sample (SRS) method [8][16]. Angelova et al. [1] present a fully automatic algorithm for data pruning, and demonstrate its success for the problem of face recognition. They show that data pruning can improve the generalization performance of classifiers. Their algorithm has two components: the first component consists of multiple semi-independent classifiers learned on the input data, where each classifier concentrates on different aspects of the data, and the second component is a probabilistic reasoning machine for identifying examples which are in contradiction with most learners and are therefore noisy.
There are also other approaches for learning with noisy data, based on regularization [17] or on averaging the decisions of several functions, as in bagging [4]. However, these methods are not successful in high-noise cases.
1.2 Contribution and Outline of the Paper
In this paper, we propose an algorithm for automatic noise elimination from training data using Random Sample Consensus. RANSAC is a paradigm for fitting a model to noisy data and is utilized in many computer vision problems [21]. RANSAC performs multiple trials of selecting small subsets of the data to estimate the model. The final solution is the model with maximal support from the training data. The method is robust to considerable noise. In this paper, we adopt RANSAC for training HMMs for the purpose of emotion recognition from spontaneous emotional speech. To the best of our knowledge, RANSAC has not been used before for cleaning an emotional speech database. The outline of the paper is as follows. In Section 2, background information is provided describing the spontaneous speech corpus and the well-known RANSAC algorithm. In Section 3, the proposed method is described, including the speech features, the Hidden Markov Models, the RANSAC-based HMM fitting approach and the decision fusion method. In Section 4, our experimental results are provided, followed by conclusions and future work in Section 5.
2 Background
2.1 The Spontaneous Speech Corpus
The FAU Aibo corpus is used in this study [19]. The corpus consists of spontaneous, emotionally colored German recordings of children interacting with Sony's pet robot Aibo. The data was collected from 51 children and consists of 48,401 words. Each word was annotated independently as neutral or as belonging to one of ten other classes, namely: joyful (101 words), surprised (0), emphatic (2,528), helpless (3), touchy (i.e., irritated) (225), angry (84), motherese (1,260), bored (11), reprimanding (310), and rest (i.e., non-neutral but not belonging to the other categories) (3); 39,169 words were labeled neutral, and 4,707 words were not annotated since they did not satisfy the majority vote rule used in the labeling procedure. Five labelers were involved in the annotation process, and a majority vote approach was used to decide on the final label of a word, i.e., if at least three labelers agreed on a label, the label was attributed to the word. As the above numbers show, for 4,707 of the words the five labelers could not agree on a label. We can therefore say that labeling spontaneous speech data into emotion classes is not an easy task, since the emotions are not easily classified and an utterance may even contain a mixture of more than one emotion. This implies that the labels of the training data may be imperfect, which may adversely affect the recognition performance of the trained pattern classifiers.
In the INTERSPEECH 2009 emotion challenge, the FAU Aibo dataset was segmented into manually defined chunks consisting of one or more words, since that was found to be the best unit of analysis [19], [20]. A total of 18,216 chunks was used for the challenge, and the emotions were grouped into five classes, namely: Anger (including the angry, touchy, and reprimanding classes) (1,492), Emphatic (3,601), Neutral (10,967), Positive (including motherese and joyful) (889), and Rest (1,267). The data is highly unbalanced. Since the data was collected at two different schools, speaker independence is guaranteed by using the data of one school for training and the data of the other school for testing. This dataset is used in the experiments of this study.
2.2 The RANSAC Algorithm
Random Sample Consensus is a method for fitting a model to noisy data [7]. RANSAC is robust to significant percentages of errors. The main idea is to identify the outliers as the data samples with the greatest residuals with respect to the fitted model. These can be excluded and the model recomputed. The steps of the general RANSAC algorithm are as follows [21] [7]:
1. Suppose we have n training data samples X = x1, x2, ..., xn to which we hope to fit a model determined by (at least) m samples (m ≤ n).
2. Set an iteration counter k = 1.
3. Choose at random m items from X and compute a model.
4. For some tolerance ε, determine how many elements of X are within ε of the derived model. If this number exceeds a threshold t, re-compute the model over this consensus set and stop.
5. Set k = k + 1. If k < K for some predetermined K, go to 3. Otherwise, accept the model with the biggest consensus set so far, or fail.
There are possible improvements to this algorithm [21] [7]. The random subset selection may be improved if we have prior knowledge of the data and its properties, that is, if some samples are more likely to fit a correct model than others. There are three parameters that need to be chosen:
• ε, the acceptable deviation from a good model. It might be empirically determined by fitting a model to m points, measuring the deviations, and setting ε to some number of standard deviations above the mean error.
• t, the size of the consensus set. There are two purposes for this parameter: to represent enough sample points for a sufficient model, and to represent enough samples to refine the model to the final best estimate. For the first point, a value of t satisfying t − m > 5 has been suggested [7].
• K, the maximum number of iterations to run the algorithm while searching for a satisfactory fit. Values of K = 2ω^(−m) or K = 3ω^(−m) have been argued to be reasonable choices [7], where ω is the probability of a randomly selected sample being within ε of the model.
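As an illustration of these steps (not code from the paper), a generic RANSAC loop can be written in a few lines of Python; the parameter names m, eps, t and K follow the description above, while the model-fitting and residual functions are left abstract:

```python
import random

def ransac(data, fit_model, residual, m, eps, t, K):
    """Generic RANSAC loop following steps 1-5 above.

    data      : list of training samples
    fit_model : fits a model to a list of samples and returns it
    residual  : deviation of a single sample from a model
    m         : minimal number of samples needed to fit a model
    eps       : tolerated deviation from a good model
    t         : required size of the consensus set
    K         : maximum number of iterations
    """
    best_model, best_consensus = None, []
    for _ in range(K):
        subset = random.sample(data, m)             # step 3: random minimal subset
        model = fit_model(subset)
        consensus = [x for x in data if residual(x, model) < eps]
        if len(consensus) > len(best_consensus):    # keep the biggest consensus so far
            best_model, best_consensus = fit_model(consensus), consensus
        if len(consensus) >= t:                     # step 4: consensus is large enough, stop
            break
    return best_model, best_consensus
```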
3 RANSAC-Based Data Cleaning Method
3.1 Extraction of the Speech Features
We represent the spectral features of speech using mel-frequency cepstral coefficients (MFCC) and line spectral frequencies (LSF) with their first and second order derivatives.
MFCC features. Spectral features, such as mel-frequency cepstral coefficients (MFCC), are expected to model the varying nature of speech spectra under different emotions. We represent the spectral features of each analysis window of the speech data with a 13-dimensional MFCC vector consisting of energy and 12 cepstral coefficients, which will be denoted as f_C.
LSF features. Line spectral frequency (LSF) decomposition was first developed by Itakura [10] for robust representation of the coefficients of linear predictive (LP) speech models. LP analysis of speech assumes that a short stationary segment of speech can be represented by a linear time-invariant all-pole filter of the form H(z) = 1/A(z), which is a p-th order model for the vocal tract. LSF decomposition refers to expressing the p-th order inverse filter A(z) in terms of two polynomials P(z) = A(z) − z^(p+1) A(z^(−1)) and Q(z) = A(z) + z^(p+1) A(z^(−1)), which are used to represent the LP filter as

H(z) = \frac{1}{A(z)} = \frac{2}{P(z) + Q(z)}  (1)
The polynomials P(z) and Q(z) each have p/2 zeros on the unit circle, where the phases of the zeros are interleaved in the interval [0, π]. The phases of the p zeros of the P(z) and Q(z) polynomials form the LSF feature representation of the LP model. Extraction of the LSF features, i.e., finding the p zeros of the P(z) and Q(z) polynomials, is also computationally effective and robust. Note that the formant frequencies correspond to the zeros of A(z). Hence, P(z) and Q(z) will be close to zero at each formant frequency, which implies that neighboring LSF features will be close to each other around formant frequencies. This property relates the LSF features to the formant frequencies [15] and makes them good candidates to model emotion-related prosodic information in the speech spectra. We represent the LSF feature vector of each analysis window of speech as a p-dimensional vector f_L.
Dynamic features. Temporal changes in the spectra play an important role in human perception of speech. One way to capture this information is to use dynamic features, which measure the change in the short-term spectra over time. We compute the first and second time derivatives of the thirteen-dimensional MFCC features using the following regression formula:
\Delta f_C[n] = \frac{\sum_{k=-2}^{2} k\, f_C[n+k]}{\sum_{k=-2}^{2} k^2}  (2)
where f_C[n] is the MFCC feature vector at time frame n. The extended MFCC feature vector, including the first and second order derivative features, is then represented as f_CΔ = [f_C^T, Δf_C^T, ΔΔf_C^T]^T, where T is the vector transpose operator. Likewise, the extended LSF feature vector including dynamic components is denoted as f_LΔ.
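Equation (2) is the standard delta regression used in most ASR front ends. The following Python/NumPy sketch (an illustration, not the authors' code; boundary frames are handled by edge padding, which is one common choice) computes the dynamic features and stacks them into the extended vector described above:

```python
import numpy as np

def delta(features, width=2):
    """Delta regression of Eq. (2); features is a (num_frames, dim) array."""
    num_frames = len(features)
    padded = np.pad(features, ((width, width), (0, 0)), mode="edge")
    denom = 2 * sum(k * k for k in range(1, width + 1))        # = 10 for width 2
    return np.array([
        sum(k * (padded[n + width + k] - padded[n + width - k])
            for k in range(1, width + 1)) / denom
        for n in range(num_frames)
    ])

# Extended feature vector [f_C, delta f_C, delta-delta f_C]:
# mfcc = ...                                  # (num_frames, 13) MFCC + energy
# d1, d2 = delta(mfcc), delta(delta(mfcc))
# f_c_delta = np.hstack([mfcc, d1, d2])       # (num_frames, 39)
```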
3.2 Emotion Classification Using Hidden Markov Models
Hidden Markov models have been deployed with great success in automatic speech recognition to model temporal spectral information, and they have been used similarly for emotion recognition [18]. We model the temporal patterns of the emotional speech utterances using HMMs. We aim to make a decision for syntactically meaningful chunks of speech, where in each segment typically a single emotional evidence is expected. Furthermore, in each speech segment the emotional evidence may exhibit temporal patterns. Hence, we employ an N-state left-to-right HMM to model each emotion class. Feature observation probability distributions are modeled by M-mixture Gaussian density functions with diagonal covariance matrices. The structural parameters N and M are determined through a model selection method and are discussed under the experimental studies. In the emotion recognition phase, the likelihood of a given speech segment is computed with Viterbi decoding over the HMM of each emotion class. The utterance is then classified as expressing the emotion which yields the highest likelihood score.
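A minimal sketch of this classification scheme is shown below, using the hmmlearn library as a stand-in for the authors' HMM toolkit (the library, its GMMHMM class and the default ergodic topology are our assumptions; the left-to-right constraint described above would additionally require fixing the transition structure):

```python
import numpy as np
from hmmlearn.hmm import GMMHMM  # assumed library, not the toolkit used in the paper

EMOTIONS = ["Anger", "Emphatic", "Neutral", "Positive", "Rest"]

def train_models(train_data, n_states=2, n_mix=16):
    """train_data: dict mapping emotion name -> list of (num_frames, dim) arrays."""
    models = {}
    for emo in EMOTIONS:
        X = np.vstack(train_data[emo])
        lengths = [len(u) for u in train_data[emo]]
        models[emo] = GMMHMM(n_components=n_states, n_mix=n_mix,
                             covariance_type="diag").fit(X, lengths)
    return models

def classify(models, utterance):
    """Return the emotion whose HMM yields the highest log-likelihood."""
    return max(models, key=lambda emo: models[emo].score(utterance))
```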
3.3 RANSAC-Based Training of HMM Classifiers
Our goal is to train an HMM for each of the five emotion classes in the training set (Anger, Emphatic, Positive, Neutral and Rest). For each emotion class, we want to select a training set such that the fraction of the number of inliers (the consensus set) over the total number of utterances in the dataset is maximized. In order to apply the RANSAC algorithm for fitting an HMM model, we need to estimate suitable values for the parameters m, ε, t, K and ω, which were defined in Section 2.2. For determining the biggest consensus set (inliers) for each of the five emotions, we use a simple HMM structure with a single state and 16 Gaussian mixtures per state. The steps of the RANSAC-based HMM training method are as follows:
1. For each of the five emotions, suppose we have n training data samples X = x1, x2, ..., xn to which we hope to fit a model determined by (at least) m samples (m ≤ n). Initially, we randomly select m = 320 utterances, considering that 20 utterances per Gaussian mixture are sufficient for the training process.
2. Set an iteration counter k = 1.
3. Choose at random m items from X and compute an HMM with a given number of states and Gaussian mixtures per state. Estimate the normalized likelihood values for the rest of the training set using the trained HMM.
4. Set the tolerance level to ε = μ − 1.5σ, where the mean (μ) and standard deviation (σ) are calculated over the normalized likelihood values of the initial randomly selected m utterances. Determine how many elements of X are within ε of the derived model. If this number exceeds a threshold t, recompute the model over this consensus set and stop.
5. Increase the iteration counter k = k + 1. If k < K for some predetermined K and k < 200, go to step 3. Otherwise, accept the model with the biggest consensus set so far, or fail.
Here, we estimate K, the number of loops required for the RANSAC algorithm to converge, using the number of inliers [4]:

K = \frac{\ln(1-p)}{\ln(1-\omega^m)}  (3)

where we set ω = m_i/m, with m_i the number of inliers at iteration i, and p = 0.9 is the probability that at least one of the sets of random samples does not include an outlier.
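Putting the steps together, the selection loop for one emotion class might look as follows. This is only a schematic reconstruction built on the assumed hmmlearn-style models from the previous listing; in particular, the per-frame normalization of the likelihoods is our assumption, since the paper does not specify how the likelihoods are normalized.

```python
import random
import numpy as np

def ransac_select(utterances, train_hmm, m=320, max_iter=200, p=0.9):
    """Return the biggest consensus set found for one emotion class.

    utterances : list of (num_frames, dim) feature arrays of this emotion
    train_hmm  : fits an HMM to a list of utterances; the returned model
                 must provide a score(utterance) log-likelihood method
    """
    best_consensus, k, K = [], 0, float(max_iter)
    while k < K and k < max_iter:
        subset = random.sample(utterances, m)                     # step 3
        hmm = train_hmm(subset)
        subset_ll = [hmm.score(u) / len(u) for u in subset]       # normalized likelihoods
        eps = np.mean(subset_ll) - 1.5 * np.std(subset_ll)        # step 4 tolerance
        consensus = [u for u in utterances if hmm.score(u) / len(u) >= eps]
        if len(consensus) > len(best_consensus):
            best_consensus = consensus
            ratio = (len(consensus) / len(utterances)) ** m       # omega^m
            if 0.0 < ratio < 1.0:
                K = np.log(1 - p) / np.log(1 - ratio)             # Eq. (3)
        k += 1                                                    # step 5
    return best_consensus
```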
3.4 Decision Fusion for Classification of Emotions
Decision fusion is used to compensate for possible misclassification errors of a given modality's classifier with the other available modalities, where the scores resulting from each unimodal classification are combined to arrive at a conclusion. Decision fusion is especially effective when the contributing modalities are not correlated and the resulting partial decisions are statistically independent. We consider a weighted-summation-based decision fusion technique to combine the different classifiers [6] for emotion recognition. The HMM classifiers with MFCC and LSF features output likelihood scores for each emotion and utterance, which need to be normalized prior to the decision fusion process. First, for each utterance, the likelihood scores of both classifiers are mean-removed over emotions. Then, sigmoid normalization is used to map the likelihood values to the [0, 1] interval for all utterances [6]. After normalization, we have two likelihood score sets from the HMM classifiers for each emotion and utterance. Let us denote the normalized log-likelihoods of the MFCC- and LSF-based HMM classifiers as ρ̄_γe(C) and ρ̄_γe(L), respectively, for the emotion class e. The decision fusion then reduces to computing a single set of joint log-likelihood ratios, ρ_e, for each emotion class e. Assuming the two classifiers are statistically independent, we fuse them, denoted γe(C) ⊕ γe(L), by computing the weighted average of the normalized likelihood scores

\rho_e = \alpha\,\bar{\rho}_{\gamma_e(C)} + (1-\alpha)\,\bar{\rho}_{\gamma_e(L)}  (4)
where the parameter α is selected in the interval [0, 1] to maximize the recognition rate on the training set.
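The normalization and fusion described above amount to only a few lines of code. The sketch below (an illustration, not the authors' implementation; the unit sigmoid slope is our arbitrary choice) assumes each classifier provides one log-likelihood per utterance and emotion:

```python
import numpy as np

def normalize(loglik):
    """loglik: (num_utterances, num_emotions) array of log-likelihood scores.
    Mean removal over emotions followed by a sigmoid mapping to [0, 1]."""
    centered = loglik - loglik.mean(axis=1, keepdims=True)
    return 1.0 / (1.0 + np.exp(-centered))

def fuse(loglik_mfcc, loglik_lsf, alpha):
    """Weighted fusion of Eq. (4); returns the winning emotion index per utterance."""
    rho = alpha * normalize(loglik_mfcc) + (1.0 - alpha) * normalize(loglik_lsf)
    return rho.argmax(axis=1)

# alpha is selected on (a subset of) the training data, e.g. by a grid search:
# best_alpha = max(np.arange(0.0, 1.01, 0.01),
#                  key=lambda a: (fuse(tr_mfcc, tr_lsf, a) == tr_labels).mean())
```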
4 Experimental Results
In this section, we present our experimental results for the 5-class emotion recognition problem using the FAU Aibo speech database provided by the INTERSPEECH 2009 emotion challenge. The distribution of emotional classes in the database is highly unbalanced, so performance is measured as the unweighted average recall (UA) rate, which is the average recall over all classes. In Table 1 and Table 2, we list the UA rates for classifiers modeling MFCC and LSF features with 1-state and 2-state HMMs and with the number of Gaussian mixtures per state in the range [8, 160]. In the experiments, further increasing the number of states did not improve our results. We can see that incorporation of a RANSAC-based data cleaning procedure yields an increase in the unweighted recall rates in all cases. For the MFCC feature set, the highest improvement (2.84%) is seen for the 1-state HMM with 160 Gaussian mixtures, whereas for the LSF feature set the highest improvement, 2.73%, is obtained for the 1-state HMM with 80 Gaussian mixtures.
In this section, we present our experimental results for the 5-class emotion recognition problem using FAU-Aibo speech database provided by the INTERSPEECH 2009 emotion challenge. The distribution of emotional classes in the database is highly unbalanced that the performance is measured as unweighted recall (UA) rate which is the average recall of all classes. In Table 1 and Table 2, we list the UA rates for classifiers modeling MFCC and LSF features with 1-state and 2-state HMMs with number of Gaussian mixtures in the range [8, 160] per state. In the experiments further increasing number of states did not improve our results. We can see that incorporation of a RANSAC based data cleaning procedure yields an increase in the unweighted recall rates in all cases. For the MFCC feature set, the highest improvement (2.84%) is seen for the 1state HMM with 160 Gaussian mixtures, whereas for the LSF feature set the highest improvement is obtained as 2.73 % for 1-state HMM with 80 Gaussian mixtures. Table 1. Unweighted recall rates (UA) for 1- and 2- state HMMs modeling MFCC features with and without RANSAC
Number of mixtures   1 state: All-data   1 state: RANSAC   2 states: All-data   2 states: RANSAC
16                   38.39               39.51             38.46                38.63
56                   38.84               39.79             40.17                40.45
80                   38.63               40.62             40.18                40.95
160                  38.82               41.66             40.36                41.32
Table 2. Unweighted recall rates (UA) for 1- and 2-state HMMs modeling LSF features with and without RANSAC
Number of mixtures   1 state: All-data   1 state: RANSAC   2 states: All-data   2 states: RANSAC
16                   34.53               34.24             36.59                36.71
56                   36.69               38.39             35.38                37.54
80                   36.67               39.40             35.65                36.95
160                  36.82               39.30             35.98                37.50
We also provide plots of the unweighted recall rate versus the number of Gaussian mixtures per state for 1-state and 2-state HMMs with and without RANSAC cleaning in Figures 1 and 2 for the MFCC and LSF feature sets, respectively. If we compare the curves denoted by circles and squares for the two feature sets, we can say that the RANSAC-based data cleaning method brings significant improvements to the emotion recognition rate.
Fig. 1. Unweighted recall rate versus number of Gaussian mixtures per state for (a) 1-state and (b) 2-state HMMs modeling MFCC ΔΔ features with and without RANSAC
Comparison of the Classifiers. We would like to compare the accuracies of the HMM classifiers with and without RANSAC-based training data selection. There are various statistical tests for comparing the performances of supervised classification learning algorithms [5] [11]. McNemar's test assesses the significance of the differences in the performances of two classification algorithms that have been tested on the same test data. McNemar's test has been shown to have a low probability of incorrectly detecting a difference when no difference exists (type I error) [5]. We performed McNemar's test to show that the improvement achieved with the proposed RANSAC-based data cleaning method, as compared to employing all the available training data, is significant. The McNemar values for the MFCC feature set modeled by 1- and 2-state HMM classifiers with 160 Gaussian mixtures per state are computed as 231.246 and 8.917, respectively. Since these values are larger than the statistical significance threshold χ²(1, 0.95) = 3.8414, we can conclude that the improvement provided by RANSAC-based cleaning is statistically significant. The McNemar values for the LSF feature set modeled by 1- and 2-state HMMs with 160 Gaussian mixtures per state are calculated as 196.564 and 22.448, respectively. Again, since these values are greater than the statistical significance threshold, we can claim that the RANSAC-based classifier has a better accuracy, and that the difference is statistically significant. Note that the data we fed to the RANSAC-based training data selection algorithm consisted of chunks of one or more words for which three of the five labelers agreed on the emotional content. Using five labelers may not always be possible, and if only one labeler is present, the training data is expected to be noisier. In such cases, the proposed RANSAC-based training data selection algorithm has the potential to bring even higher improvements to the performance of the classifier.
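For reference, the McNemar statistic used above can be computed directly from the two classifiers' per-utterance decisions. The continuity-corrected variant below is one common form; the paper does not state which exact variant was applied.

```python
import numpy as np

def mcnemar_statistic(correct_a, correct_b):
    """correct_a, correct_b: boolean arrays marking which test utterances each
    classifier got right. Returns the chi-square statistic with continuity
    correction; values above 3.8414 are significant at the 0.05 level."""
    a = np.asarray(correct_a, dtype=bool)
    b = np.asarray(correct_b, dtype=bool)
    n01 = int(np.sum(a & ~b))   # classifier A right, B wrong
    n10 = int(np.sum(~a & b))   # classifier A wrong, B right
    return (abs(n01 - n10) - 1) ** 2 / (n01 + n10)
```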
Fig. 2. Unweighted recall rate versus number of Gaussian mixtures per state for (a) 1-state and (b) 2-state HMMs modeling LSF ΔΔ features with and without RANSAC.
One drawback of the RANSAC algorithm that was observed during the experiments is that it is time consuming, since many random subset selections need to be tested.
Decision Fusion of the RANSAC-Based Trained Classifiers. Decision fusion of the RANSAC-based trained HMM classifiers is performed for various combinations of MFCC and LSF features. The fusion weight, α, is optimized over a subset of the training database prior to being used on the test data. The highest recall rate observed with classifier fusion is 42.22% for α = 0.84, when 1-state HMMs with 80 mixtures modeling RANSAC-cleaned MFCCs are fused with 2-state HMMs with 104 mixtures modeling RANSAC-cleaned LSF features.
5 Conclusions and Future Work
In this paper, we presented a random sample consensus based training data selection method for the problem of emotion recognition from a spontaneous emotional speech database. The experimental results show that the proposed method is promising for HMM-based emotion recognition from spontaneous speech data. In particular, we observed an improvement of up to 2.84% in the unweighted recall rates on the test set of the spontaneous FAU Aibo corpus, the significance of which has been shown by McNemar's test. Moreover, the decision fusion of the LSF features with the MFCC features resulted in improved classification rates over the state-of-the-art MFCC-only decision for the FAU Aibo database.
In order to increase the benefits of the data cleaning approach, and to decrease the training effort, the algorithm may be improved by using semi-deterministic subset selection methods. Further experimental studies are planned to include more speech features (e.g., prosodic features), more complicated HMM structures and other spontaneous datasets. Acknowledgments. This work was supported in part by the Turkish Scientific and Technical Research Council (TUBITAK) under projects 106E201, 110E056 and COST2102 action.
References
1. Angelova, A., Abu-Mostafa, Y., Perona, P.: Pruning training sets for learning of object categories. In: Proc. Int. Conf. on Computer Vision and Pattern Recognition, CVPR (2005)
2. Barandela, R., Gasca, E.: Decontamination of training samples for supervised pattern recognition methods. In: Amin, A., Pudil, P., Ferri, F., Iñesta, J.M. (eds.) SPR 2000 and SSPR 2000. LNCS, vol. 1876, pp. 621–630. Springer, Heidelberg (2000)
3. Ben-Gal, I.: Outlier Detection, Data Mining and Knowledge Discovery Handbook: A Complete Guide for Practitioners and Researchers. Kluwer Academic Publishers, Dordrecht (2005)
4. Breiman, L.: Bagging predictors. Machine Learning 24, 123–140 (1996)
5. Dietterich, T.G.: Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation 7, 1895–1924 (1998)
6. Erzin, E., Yemez, Y., Tekalp, A.M.: Multimodal speaker identification using an adaptive classifier cascade based on modality reliability. IEEE Transactions on Multimedia 7(5), 840–852 (2005)
7. Fischler, M.A., Bolles, R.C.: Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Graphics and Image Processing 24 (1981)
8. Gu, B., Hu, F., Liu, H.: Sampling and its applications in data mining: A survey. Tech. Rep., School of Computing, National University of Singapore (2000)
9. Guyon, I., Matić, N., Vapnik, V.: Discovering informative patterns and data cleaning. In: Workshop on Knowledge Discovery in Databases (1994)
10. Itakura, F.: Line spectrum representation of linear predictive coefficients of speech signals. Journal of the Acoustical Society of America 57(1), S35 (1975)
11. Kuncheva, L.I.: Combining Pattern Classifiers. John Wiley and Sons, Chichester (2004)
12. Kwon, O., Chan, K., Hao, J., Lee, T.: Emotion recognition by speech signals. In: Proc. of Eurospeech 2003, Geneva (September 2003)
13. Lee, C.M., Narayanan, S.S.: Toward detecting emotions in spoken dialogs. IEEE Transactions on Speech and Audio Processing 13, 293–303 (2005)
14. Lee, C.M., Yildirim, S., Bulut, M., Kazemzadeh, A., Busso, C., Deng, Z., Lee, S., Narayanan, S.: Emotion recognition based on phoneme classes. In: Proc. ICSLP 2004, pp. 889–892 (2004)
15. Morris, R.W., Clements, M.A.: Modification of formants in the line spectrum domain. IEEE Signal Processing Letters 9(1), 19–21 (2002)
16. Olken, F.: Random Sampling from Databases. Ph.D. Thesis, Department of Computer Science, University of California, Berkeley (1993)
17. Rätsch, G., Onoda, T., Müller, K.: Regularizing AdaBoost. Advances in Neural Information Processing Systems 11, 564–570 (2000)
18. Schuller, B., Rigoll, G., Lang, M.: Hidden Markov model based speech emotion recognition. In: Proc. Int. Conf. Acoustics, Speech and Signal Processing, ICASSP (2003)
19. Schuller, B., Steidl, S., Batliner, A.: The INTERSPEECH 2009 emotion challenge. In: Interspeech 2009, ISCA, Brighton, UK (2009)
20. Seppi, D., Batliner, A., Schuller, B., Steidl, S., Vogt, T., Wagner, J., Devillers, L., Vidrascu, L., Amir, N., Aharonson, V.: Patterns, prototypes, performance: Classifying emotional user states. In: Interspeech 2008, ISCA (2008)
21. Sonka, M., Hlavac, V., Boyle, R.: Image Processing, Analysis and Machine Vision. Thomson (2008)
22. Wang, S., Dash, M., Chia, L., Xu, M.: Efficient sampling of training set in large and noisy multimedia data. ACM Transactions on Multimedia Computing, Communications and Applications 3 (2007)
Establishing Linguistic Conventions in Task-Oriented Primeval Dialogue Martin Bachwerk and Carl Vogel Computational Linguistics Group, School of Computer Science and Statistics, Trinity College, Dublin 2, Ireland {bachwerm,vogel}@tcd.ie
Abstract. In this paper, we claim that language is likely to have emerged as a mechanism for coordinating the solution of complex tasks. To confirm this thesis, computer simulations are performed based on the coordination task presented by Garrod & Anderson (1987). The role of success in task-oriented dialogue is analytically evaluated with the help of performance measurements and a thorough lexical analysis of the emergent communication system. Simulation results confirm a strong effect of success mattering on both reliability and dispersion of linguistic conventions.
1 Introduction
In the last decade, the field of communication science has seen a major increase in the number of research programmes that go beyond the more conventional studies of human dialogue (e.g. [6,7]) in an attempt to reproduce the emergence of conventionalized communication systems in a laboratory (e.g. [4,8,10]). In his seminal paper, Galantucci proposed to refer to this line of research as experimental semiotics, which he sees as a more general form of experimental pragmatics. In particular, Galantucci states that the former "studies the emergence of new forms of communication", while the latter "studies the spontaneous use of pre-existing forms of communication" (p. 394, [5]). Experimental semiotics provides a novel way of reproducing the emergence of a conventionalized communication system under laboratory conditions. However, the findings from this field cannot be transferred to the question of the primeval emergence of language without the caveat that the subjects of present-day experiments are very much familiar with the concepts of conventions and communication systems (even if they are not allowed to employ any existing versions of these in the conducted experiments), while our ancestors who somehow managed to invent the very first conventionalized signaling system, by definition, could not have been aware of these concepts. Since experimental semiotics researchers cannot adjust the minds of their subjects in order to find out how they could discover the concept of a communication system, the most these experiments can realistically achieve is to make the subjects signal the 'signalhood' of some novel form of communication (see [13]). To go any further seems, at least for now, to require the use of computer models and simulations.
Consequently, we are interested in how a community of simulated agents can agree on a set of lexical conventions with a very limited amount of given knowledge about the notion of a communication system. In this paper, we address this issue by conducting several computer simulations that are meant to reconstruct the human experiments conducted by [6] and [7], which suggest that the establishment of new conventions requires at least some understanding to be experienced, for example measured by the success of the action performed in response to an utterance, and that differently organized communities can come up with communication systems of varying effectiveness. While the communities in the current experiments are in a way similar to the social structures implemented in [1], the focus here is on local coordination and the role of task-related communicative success, rather than on the effect of different higher-order group structures.
2 Modelling Approach
The experiments presented in this paper have been performed with the help of the Language Evolution Workbench (LEW) (see [16,1] for more detailed descriptions of the model). This workbench provides over 20 adjustable parameters and makes as few assumptions as possible about the agents' cognitive skills and their awareness of the possibility of a conventionalized communication system. The few cognitive skills that are assumed can be considered widely accepted (see [11,14] among others) as the minimal prerequisites for the emergence of language. These skills include the ability to observe and individuate events, the ability to engage in a joint attention frame fixed on an occurring event, and the ability to interact by constructing words and utterances from abstract symbols1 and transmitting these to one's interlocutor.2,3 During such interactions, one of the agents is assigned the intention to comment on the event, while a second agent assumes that the topic of the utterance relates in some way to the event and attempts to decode the meaning of the encountered symbols accordingly. From an evolutionary point of view, the LEW fits in with the so-called faculty of language in the narrow sense as proposed by [9], in that the agents are equipped with the sensory, intentional and concept-mapping skills at the start, and the simulations attempt to provide an insight into how these could be combined to produce a communication system with properties comparable to a human language. From a pragmatics point of view, our approach directly adopts the claim made by [12] that dialogue is the underlying form of communication. Furthermore, despite the agents in the LEW lacking any kind of embodiment, they are designed in a way that makes each agent individuate events according to its own perspective, which in most cases results in their situation models being initially non-aligned, thus providing the agents with the task of aligning their representations, similarly to the account presented in [12].
1 While we often refer to such symbols as 'phonemes' throughout the paper, there is no reason why these should not be representative of gestural signs.
2 Phenomena such as noise and loss of data during signal transmission are ignored in our approach for the sake of simplicity.
3 It is important to stress that hearers are not assumed to know the word boundaries of an encountered utterance. However, simulations with so-called synchronized transmission have been performed previously by [15].
3 Experiment Design
In the presented experiments, we aim to reproduce the two studies originally performed by Garrod and his colleagues, but in an evolutionary simulation performed on an abstract model of communication. Our reconstruction lies in the context of a simulated dynamic system of agents, which should provide us with some insights about how Garrod's findings can be transferred to the domain of language evolution. The remainder of this section outlines the configuration of the LEW used in the present study, together with an explanation of the three manipulated parameters. The results of the corresponding simulations are then evaluated in Section 4, with special emphasis being put on the communicative potential and general linguistic properties of the emergent communication systems.4
Garrod observed in his two studies that conventions have a better chance of getting established and reused if their utilisation appears to lead to one's interlocutor's understanding of one's utterance, either by explicitly signaling so or by performing an adequate action. Notably, in task-based communication, interlocutors may succeed in achieving a task with or without complete mutual understanding of the surrounding dialogue. Nevertheless, our simulations have focussed on a parameter of the LEW that defines the probability that communicative success matters, p_sm, in an interaction. From an evolutionary point of view, this parameter is motivated by the numerous theories that put cooperation and survival as the core function of communication (e.g. [2]). However, the abstract implementation of the parameter allows us to refrain from selecting any particular evolutionary theory as the target one, by generalizing over all kinds of possible success that may result from a communication bout, e.g. avoiding a predator, hunting down prey or battling off a rival gang. The levels of the parameter that defines whether success matters were varied between 0 and 1 (in steps of 0.25) in the presented simulations. To clarify the selected values of the parameter, p_sm = 0 means that communicative success plays no role whatsoever in the system, and p_sm = 1 means that only interactions satisfying a minimum success threshold will be remembered by the agents. The minimum success threshold is established by an additional parameter of the LEW and can be generally interpreted as the minimum amount of information that needs to be extracted by the hearer from an encountered utterance in order to be of any
4 We intentionally refrain from referring to the syntax-less communication systems that emerge in our simulations as 'language', as that would be seen as highly contentious by many readers. Furthermore, even though the term 'protolanguage' appears to be quite suited to our needs (cf. [11]), the controversial nature of that term does not really encourage its use either, prompting us to stick to more neutral expressions.
use. In our experiments, we varied the minimum success threshold between 0.25 and 1 (in steps of 0.25).5 The effects of this parameter will not be reported in this paper due to a lack of significance and space limitations. In addition to the above two parameters, the presented experiments also introduce two different interlocutor arrangements, similar to the studies in [6] and [7]. In the first of these, pairs of agents are partnered with each other for the whole duration of the simulation, meaning that they do not converse with any other agents at all. The second arrangement emulates the community setting introduced in [7] by successively alternating the pairings of agents, in our case after every 100 interaction 'epochs'.6 The introduction of the community setting was motivated by the hypothesis that a community of agents should be able to engage in a global coordination process, as opposed to local entrainment, resulting in more generalized and thus eventually more reliable conventions.
4 Results and Discussion
The experimental setup described above resulted in 34 different parameter combinations, for each of which 600 independent runs were performed in order to obtain empirically reliable data. The evaluation of the data has been performed with the help of a number of measures selected with the goal of describing the communicative usefulness of an evolved convention system, as well as comparing its main properties to those of languages as we know them now (see [1] for a more detailed account). In order to understand how well a communication system performs in a simulation, it is common to observe the understanding precision and recall rates, which can be combined into a single F-measure, F1 = 2 · (precision · recall)/(precision + recall). As can be seen from Figure 1(a), the results suggest that having a higher p_sm has a direct effect on the understanding rates of a community (t value between 26.68 and 210.63, p
GD models 11.5 10.6 12.2 11.6 10.3 12.1 11.4
GD models + SA 9.6 8.2 10.1 9.1 8.3 9.8 9.2
GD models + phon. dupl. + SA 8.7 7.3 8.4 8.0 7.3 8.6 8.1
The results of this evaluation are summarized in Table 4. The value of total WER (in the last row) was calculated as an average over all the test speakers according to the number of words in their test recordings. The presented numbers show that WER of the prior Czech-to-Slovak adapted GD models with phoneme mapping was reduced from 11.4 % to 9.2 % after speaker adaptation. It is also evident that the proposed approach for speaker dependent acoustic modeling yielded to additional reduction of WER to 8.1%.
5 Conclusion
Within this study, two-phase cross-lingual adaptation from Czech to Slovak was proposed and evaluated experimentally for an existing LVCSR system. The presented results showed that the resulting Czech-to-Slovak adapted system can operate with a WER of 8% in the voice dictation task. This value is only about 3% worse than the typical WER of the original Czech system in the same task using speaker-specific models. At this moment, a similar concept of cross-lingual adaptation is also being tested for another Slavic language, Polish. The plan for future research is to focus on collecting more data for AM training and on unsupervised cross-lingual adaptation approaches, which should allow for creating better SI or GD models without the need for manual phonetic transcriptions.
Acknowledgments. This work was supported by the Grant Agency of the Czech Republic within grant no. P103/11/P499 and grant no. 102/08/0707.
Audio-Visual Isolated Words Recognition for Voice Dialogue System Josef Chaloupka Institute of Information Technology, Technical University of Liberec, Studentska 2, 461 17 Liberec, Czech Republic
[email protected] Abstract. This contribution is about experiments in audio-visual isolated words recognition. The results of these experiments will be used to improve our voice dialogue system, where visual speech recognition will be added. The voice dialogue systems can be used in train or bus stations (or elsewhere), where noise levels are relatively high, therefore the visual part of speech can improve the recognition rate mainly in noisy conditions. The audio-visual recognition of isolated words in our experiments was based on the technique of two-stream Hidden Markov Models (HMM) and on the HMM of single Czech phonemes and visemes. Different visual speech features and a different number of states and mixtures of HMM were evaluated in single tests. In the following experiments, isolated words were being recognized after training of the HMM and babble noise was added in the successive steps to the acoustic speech signal. Keywords: Audio-visual speech recognition, visual speech parameterization, audio-visual voice dialogue system.
1 Introduction
Human lips, teeth and mimic muscles affect the production of speech. Visual speech information helps the hearing-impaired to understand, and it is also needed by all people in order to understand spoken information in noisy conditions. Therefore, it is beneficial to use visual information for automatic speech recognition, especially in noisy conditions. The utilization of the visual part of speech in speech recognition systems is still mostly at the stage of tests or prototypes, whereas the visual part of speech in audio-visual speech synthesis systems has been used in different communication-information or educational systems around the world for more than ten years. We have developed several multimodal voice dialogue systems in our lab where audio-visual speech synthesis (a talking head) is included [1]. We would like to add a subsystem for audio-visual automatic speech recognition (AV ASR) to our multimodal voice dialogue system; we have, therefore, developed an algorithm for visual speech parameterization in real time and tested several strategies for recognizing audio-visual speech signals.
2 Feature Extraction and Audio-Visual Speech Recognition

Feature extraction from the audio speech signal is well established at present (2010): LPCC (Linear Prediction Cepstral Coefficients) or MFCC (Mel Frequency Cepstral Coefficients) are very often used successfully. Therefore, only visual speech parameterization is described in this section.
Fig. 1. The principle of audio and visual speech features extraction
The extraction of visual features is as follows. In the first step, human faces are detected in the video images. The Viola-Jones face detector [2], based on Haar-like filters and the AdaBoost algorithm, was used in our parameterization system. If more than one human face is detected in a video recording, it is necessary to decide who is speaking. A visual voice activity detector solves this problem: the lip object is segmented from the bottom part of the detected face by image segmentation, and the static visual feature - the vertical opening of the lips - is taken from the segmented lip object. Dynamic features are computed from the static features for each detected face, and the sum of ten subsequent absolute values of the dynamic features is the parameter used to decide who is speaking. It is not a good idea to use only the static vertical lip opening to determine who is speaking, because someone may have a widely opened mouth (e.g., while yawning) without speaking. The sum of the dynamic features derived from DCT visual features was used in our previous work, but the problem was that a person with pronounced facial grimacing during a speech act could be wrongly selected as the speaker. Around the segmented lip object, the ROI (Region Of Interest) is selected and separated, and the visual features are extracted from the ROI in the last step.

At present [3], two main groups of visual speech features exist: shape visual features and appearance-based visual features. The shape visual features are extracted directly from the segmented lip object [4] - they include the horizontal and vertical opening of the lips, lip rounding, etc. It is difficult to find the exact border of human lips in some real video images (the color of the lips is sometimes very similar to the color of the skin), and it is therefore difficult to obtain exact shape features. Hence, appearance-based visual features are used more often. These visual features are computed from the ROI by means of a transform: DFT (Discrete Fourier Transform), DCT (Discrete Cosine Transform), LDA (Linear Discriminant Analysis) or PCA (Principal Component Analysis). DCT is chosen most often [5] because it can be computed very fast using an algorithm similar to the well-known FFT (Fast Fourier Transform) algorithm for computing the DFT. It is also possible to use methods and algorithms for stereovision [6] to extract visual features, and the results are better than with extraction from a single 2D video image, but their use requires more computation time. A relatively new method for the extraction of visual features is based on AAM (Active Appearance Models), but the visual features from AAM are quite speaker-dependent. It is, however, possible to transform the visual feature vectors [7] and use them in a speaker-independent audio-visual speech recognizer.

After the extraction of the audio and visual features, either the features themselves or the separate results of audio-only and visual-only speech recognition are combined (integrated) [3]. The early integration of visual and audio features and the middle integration by two-stream HMM were used in our work. In early integration, the audio and visual features are combined into one vector and these vectors are used for training the HMMs and for recognition based on these HMMs.
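As an illustration of the appearance-based parameterization described above, the sketch below detects a face, crops a crude lip region and keeps a few low-order 2-D DCT coefficients as static visual features. It assumes OpenCV, NumPy and SciPy are available; the fixed lower-third lip region and the function names are simplifications introduced here, not the segmentation actually used in the paper.

```python
# Minimal sketch: Viola-Jones face detection + DCT coefficients of a lip ROI.
import cv2
import numpy as np
from scipy.fftpack import dct

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def lip_roi(gray_frame):
    """Detect the largest face and return a crude lip region (lower third of the face box)."""
    faces = face_cascade.detectMultiScale(gray_frame, 1.2, 5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])
    return gray_frame[y + 2 * h // 3: y + h, x: x + w]

def dct_visual_features(gray_frame, n_coeffs=5, size=32):
    """Resize the lip ROI and keep the first n_coeffs 2-D DCT coefficients (raster order)."""
    roi = lip_roi(gray_frame)
    if roi is None:
        return None
    roi = cv2.resize(roi, (size, size)).astype(np.float64)
    coeffs = dct(dct(roi, axis=0, norm="ortho"), axis=1, norm="ortho")
    return coeffs.flatten()[:n_coeffs]   # low-frequency coefficients as static features
```

Delta and delta-delta features would then be computed from these static coefficients over time, as described above.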
The output function of state S for the two-stream HMM (middle integration) is:

b_S^{\gamma}(\vec{x}) = \prod_{t=1}^{2} \left( b_S(\vec{x}_t) \right)^{\gamma_t}    (1)

where \vec{x}_1 is the audio feature vector, \vec{x}_2 is the visual feature vector, \gamma_t is the weight of stream t, and b_S(\vec{x}_t) is the output state function:

b_S(\vec{x}_t) = \sum_{m=1}^{M} c_{sm} \frac{1}{\sqrt{(2\pi)^{P} \det \Sigma_{sm}}} \exp\left[ -0.5 \, (\vec{x}_t - \vec{x}_{sm})^{T} \Sigma_{sm}^{-1} (\vec{x}_t - \vec{x}_{sm}) \right]    (2)

where \vec{x}_t is the feature vector, P is the number of features, \vec{x}_{sm} is the mean vector, \Sigma_{sm} is the covariance matrix and M is the number of mixtures. The main task when using two-stream HMMs is to find the weights \gamma_t for the audio and visual streams for a given SNR (Signal to Noise Ratio). The only practical way to achieve this is to change the SNR (add noise to the audio signal) and the weights of the audio and visual streams in single steps and look for the best recognition rate, see Fig. 2.
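The following sketch shows how the state likelihood of Eqs. (1)-(2) can be combined from the two streams. It assumes SciPy's Gaussian density; the mixture parameters and the stream weights are placeholders, since a real recognizer would take them from the trained audio and visual HMMs.

```python
# Minimal sketch of the two-stream state likelihood, Eqs. (1)-(2).
import numpy as np
from scipy.stats import multivariate_normal

def gmm_likelihood(x, weights, means, covs):
    """b_S(x): Gaussian mixture output probability of one state (Eq. 2)."""
    return sum(c * multivariate_normal.pdf(x, mean=m, cov=S)
               for c, m, S in zip(weights, means, covs))

def two_stream_likelihood(x_audio, x_video, gmm_audio, gmm_video,
                          gamma_audio=0.7, gamma_video=0.3):
    """b_S^gamma(x): weighted product of the audio and visual stream likelihoods (Eq. 1)."""
    b_a = gmm_likelihood(x_audio, *gmm_audio)
    b_v = gmm_likelihood(x_video, *gmm_video)
    return (b_a ** gamma_audio) * (b_v ** gamma_video)
```

In practice the two stream weights would be tuned for each SNR level by the grid search over weight values described above.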
Fig. 2. Recognition rate of audio-visual speech recognition as a function of the audio and visual stream weights, for 5 dB SNR in the audio signal
3 Noisy Condition Simulation

It is very difficult or impossible to create an audio-visual speech database in which the audio signal has a given SNR; therefore, noisy conditions are simulated and noise is added to the original audio signal. A special algorithm was developed for adding noise to the audio signal at a given SNR. Babble noise from the NOISEX database [8] was chosen for our purpose. Prior to the experiments, it was necessary to estimate the Signal to Noise Ratio (SNR) in our audio speech signals. Several algorithms exist for this [9]. In our case, the SNR is calculated from the power of the signal P_s and from the power of the noise P_n (which is added to the signal):

SNR = 10 \log \frac{P_s}{P_n} = 10 \log \frac{\sum_{i=0}^{N-1} s^2[i]}{\sum_{i=0}^{N-1} n^2[i]}    (3)
where N is the number of samples in the signal, s[i] are the samples of the "clean" signal (without noise), and n[i] are the samples of the noise signal contained in the original audio signal x. Our assumption was that the noise n is additive, hence x[i] = s[i] + n[i]. P_s was estimated from the power of the audio signal P_x and from P_n, which was computed from the non-speech part of the audio signal; a speech/non-speech detector was used for this purpose:
P_s = P_x - P_n = \frac{\sum_{i=0}^{N-1} s^2[i]}{N} = \frac{\sum_{i=0}^{N-1} x^2[i]}{N} - \frac{\sum_{i=0}^{N-1} n^2[i]}{N}    (4)
The problem is that the SNR (3) is computed over the whole audio signal and does not capture the dynamic changes of the speech signal very well; the segmental signal-to-noise ratio (SSNR) therefore yields a better result:

SSNR = \frac{10}{F} \sum_{j=0}^{F-1} \log \frac{\sum_{i=0}^{M-1} s_j^2[i]}{\sum_{i=0}^{M-1} n_j^2[i]}    (5)
where F is the number of frames in the speech signal and M is the length of one frame in samples. The SSNR (SSNR_e) was estimated from each audio signal, and in the second step the noise signal was added according to a given relative change of the SSNR (\Delta SSNR):

SSNR_w = SSNR_e - \Delta SSNR    (6)
where SSNR_w is the new value of the SSNR in the audio signal:

SSNR_w = 10 \log \frac{P_x - P_n}{P_n + c \cdot P_{an}}    (7)
where P_{an} is the power of the additive noise and c is the gain coefficient:

c = \frac{P_x - P_n}{P_{an} \cdot 10^{SSNR_w / 10}} - \frac{P_n}{P_{an}}    (8)
The resulting audio signal x_n[i] is created from the original signal x[i] and the additive noise an[i]:

x_n[i] = x[i] + c \cdot an[i], \quad 0 \le i < N    (9)
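A minimal sketch of the noise-mixing procedure of Eqs. (6)-(9) is given below, assuming NumPy and a Boolean speech/non-speech mask from some voice activity detector (not shown). Note that Eq. (7) treats c·P_an as the power of the added noise, so the gain is applied here as an amplitude factor sqrt(c); this reading of the paper's notation is an assumption.

```python
# Sketch: add noise to a signal so that its segmental SNR drops to a target value.
import numpy as np

def add_noise_at_ssnr(x, an, speech_mask, ssnr_w_db):
    """Mix additive noise 'an' into signal 'x' for a target SSNR of ssnr_w_db (dB)."""
    p_x = np.mean(x ** 2)                   # power of the original (slightly noisy) signal
    p_n = np.mean(x[~speech_mask] ** 2)     # noise power estimated from non-speech parts
    p_an = np.mean(an ** 2)                 # power of the noise to be added
    # Gain coefficient c from Eq. (8).
    c = (p_x - p_n) / (p_an * 10 ** (ssnr_w_db / 10.0)) - p_n / p_an
    c = max(c, 0.0)                         # guard against a negative gain
    # The noise recording is assumed to be at least as long as x.
    return x + np.sqrt(c) * an[:len(x)]     # Eq. (9), with the power gain c applied as sqrt(c)
```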
4 Experiments

Our own Czech audio-visual speech database AVDBcz2 was used for the experiments with audio-visual word recognition in noisy conditions. Frontal-camera recordings of 35 people (speakers) were made for this database. Each speaker uttered 50 words and 50 sentences. Two experiments were carried out on this database. Whole-word HMMs were used in the first experiment, while HMMs of phonemes (40 Czech phonemes) and visemes (13 Czech visemes) with early integration of the audio and visual features were used in the second experiment. Fifty words (video recordings) from the first 25 speakers formed the training database for whole-word HMM training, and 50
words from the remaining 5 speakers were used for establishing the weights of the two-stream audio-visual HMMs. The last 50 words from the last 5 speakers formed the test database. The test database for the second experiment was the same as for the first experiment, but 50 sentences from the first 30 speakers were used for training the HMMs (3 states) of single phonemes and visemes. Fifteen visual features (5 static DCT features + 5 delta + 5 delta-delta) were extracted from the visual signal and 39 audio features (13 MFCC + 13 delta + 13 delta-delta) were obtained from the audio signal. The number of DCT visual features had been established in our previous tests on visual speech recognition, where 5 DCT visual features (plus dynamic features) yielded the best recognition rate (for 14-state HMMs). Babble noise was added to the original audio signal in single steps (for a given SNR), and the recognition rates for audio-only and audio-visual speech recognition were evaluated. The results of the first experiment are shown in Fig. 3 and the results of the second experiment can be seen in Fig. 4.
Fig. 3. Audio-visual speech recognition with two-stream HMM
Fig. 4. Audio-visual speech recognition with HMM of phonemes and visemes
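A minimal sketch of the feature configuration used in the second experiment (39 audio and 15 visual features, early integration) is shown below, assuming librosa for the audio part; the per-frame visual DCT features are assumed to be computed elsewhere and resampled to the audio frame rate. Names are illustrative.

```python
# Sketch: 13 MFCC + deltas on the audio side, concatenated with visual features.
import numpy as np
import librosa

def audio_features(wav_path):
    """13 MFCC + 13 delta + 13 delta-delta per frame (39 audio features)."""
    y, sr = librosa.load(wav_path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    return np.vstack([mfcc,
                      librosa.feature.delta(mfcc),
                      librosa.feature.delta(mfcc, order=2)])      # shape (39, T)

def early_integration(audio_feats, visual_feats):
    """Concatenate per-frame audio (39) and visual (15) features into one vector stream."""
    T = min(audio_feats.shape[1], visual_feats.shape[1])
    return np.vstack([audio_feats[:, :T], visual_feats[:, :T]])   # shape (54, T)
```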
The visual recognition rate was 45.2% when whole-word two-stream HMMs (middle integration) were used, and 30% when HMMs of single phonemes and visemes (early integration) were used.
5 Conclusion

Several experiments on audio-visual speech recognition in noisy conditions were carried out in this work. The results of audio-visual speech recognition in noisy conditions from the first experiment (two-stream HMMs, middle integration) are better than those obtained in the second experiment (HMMs of phonemes and visemes, early integration), but the use of HMMs of single phonemes and visemes in the recognizer of the voice dialogue system is more practical, particularly when the vocabulary is large (more than 1000 words). We would like to integrate the audio-visual speech recognizer based on the HMMs of single Czech phonemes and visemes into our multimodal voice dialogue system in the near future.
Acknowledgments. The research reported in this paper was partly supported by the grant MSMT OC09066 (project COST 2102) and by the Czech Science Foundation (GACR) through the project No. 102/08/0707.
References
1. Chaloupka, J., Chaloupka, Z.: Czech Artificial Computerized Talking Head George. In: Esposito, A., Vích, R. (eds.) Cross-Modal Analysis of Speech, Gestures, Gaze and Facial Expressions. LNCS (LNAI), vol. 5641, pp. 324–330. Springer, Heidelberg (2009)
2. Viola, P., Jones, M.J.: Robust Real-Time Face Detection. International Journal of Computer Vision 57(2), 137–154 (2004)
3. Potamianos, G., Neti, C., Gravier, G., Garg, A., Senior, A.W.: Recent advances in the automatic recognition of audiovisual speech. Proceedings of the IEEE 91(9), 1306–1326 (2003)
4. Liew, A.W.C., Wang, S.: Visual speech recognition – lip segmentation and mapping. Medical Information Science Reference Press, New York (2009)
5. Heckmann, M., Kroschel, K., Savariaux, C., Berthommier, F.: DCT-based video features for audio-visual speech recognition. In: Proc. Int. Conf. Spoken Lang. Process. (2002)
6. Goecke, R., Asthana, A.: A Comparative Study of 2D and 3D Lip Tracking Methods for AV ASR. In: Proceedings of the International Conference on Auditory-Visual Speech Processing (AVSP 2008), Australia, pp. 235–240 (2008), ISBN 978-0-646-49504-0
7. Lan, Y., Theobald, B.J., Harvey, R., Ong, E.J., Bowden, R.: Improving Visual Features for Lip-reading. In: The 9th International Conference on Auditory-Visual Speech Processing, AVSP 2010, Japan, pp. 142–147 (September 2010), ISBN 978-4-9905475-0-9
8. Varga, A.P., Steeneken, H.J.M., Tomlinson, M., Jones, D.: The NOISEX-92 study on the effect of additive noise on automatic speech recognition. Tech. Rep., Speech Research Unit, Defence Research Agency, Malvern, UK (1992)
9. Zhao, D.Y., Kleijn, W.B., Ypma, A., de Vries, B.: Online Noise Estimation Using Stochastic-Gain HMM for Speech Enhancement. IEEE Transactions on Audio, Speech, and Language Processing 16(4), 835–846 (2008)
Semantic Web Techniques Application for Video Fragment Annotation and Management Marco Grassi, Christian Morbidoni, and Michele Nucci Department of Biomedical, Electronic and Telecommunication Engineering, Università Politecnica delle Marche - Ancona, 60131, Italy {m.grassi,c.morbidoni}@univpm.it,
[email protected] http://www.semedia.dibet.univpm.it
Abstract. The amount of video loaded onto the Web every day is constantly growing, and in the near future videos will constitute the primary Web content. However, video resources are currently handled only through the use of plugins and are therefore poorly integrated into the World Wide Web. Standards such as HTML5 and Media Fragment URI, currently under development, promise to enhance video accessibility and to allow a more effective management of video fragments. On the other hand, the need for annotating digital objects, possibly at a low granularity level, is being highlighted in various scientific communities. User-created annotations, if properly structured and machine-processable, can enrich web content and enhance search and browsing capabilities. Providing full support for video fragment tagging, linking, annotation and retrieval therefore represents a key factor for the development of a new generation of Web applications. In this paper, we discuss the feasibility of applying Semantic Web techniques in this scenario and introduce a novel Web application for semantic multimodal video fragment annotation and management that we are currently developing.

Keywords: Video annotation, Semantic Web, Media Fragment.
1 Introduction
The advent of Web 2.0 has led to an explosion of user-generated Web content and has made tagging, linking and commenting on resources everyday activities for web users and a valuable source of metadata that can be exploited to drive resource ranking, classification and retrieval. The collaborative approach has therefore increasingly been understood as a key factor for resource annotation, which can be applied not only to user-generated data but also for scientific purposes when dealing with large collections of information, as in the case of digital libraries. User-created annotations, if properly structured and machine-processable, can enrich web content and enhance search and browsing capabilities. Semantic Web (SW) techniques are finding more and more application in web resource annotation. They in fact allow web resources to be univocally identified, exploiting an unambiguously interpretable format such as RDF for resource description and ontologies
to add semantics to the encoded information in order to make it effectively sharable between different users. In addition, the possibility of managing digital objects at a low granularity level, for example to access or annotate text excerpts or photo regions, is being highlighted in various scientific communities. This is particularly true for video resources, which until a few years ago constituted just a marginal part of Web content and are still handled only through the use of plugins, and are therefore poorly integrated into the World Wide Web. Moreover, since the spread of video sharing services such as YouTube, the amount of video loaded onto the web every day is growing constantly, and in the near future videos will constitute the primary content of the Web. Standards such as HTML5 and Media Fragment URI, currently under development, promise to enhance video accessibility and to allow a more effective management of video fragments. It is no coincidence that HTML5, the next major revision of the HTML standard, introduces a specific video tag and provides wider support for video functionalities. We believe that providing full support for video fragment tagging, linking, annotation and retrieval represents a key factor for the development of a new generation of Web applications. In this paper, we discuss the application of SW techniques in Web resource annotation, analyzing the common requirements of a general-purpose web annotator and focusing on video annotation, and we introduce a novel Web application for semantic video fragment annotation and management that we are currently developing. The paper is organized as follows. Section 2 briefly introduces the Semantic Web initiative. Section 3 provides an overview of the main general-purpose annotation systems, while Section 4 focuses on existing video annotation tools. Section 5 discusses the main requirements and implementation guidelines of a general-purpose annotation tool. Finally, Section 6 introduces the prototype video annotation application.
2 Semantic Web
The Semantic Web (SW; see http://www.w3.org/2001/sw/ for the initiative and its related technologies and standards) is an initiative by the W3C that aims to implement a next-generation Web in which information can be expressed in a machine-understandable format and can be processed automatically by software agents. The SW enables data interoperability, allowing data to be shared and reused across heterogeneous devices and applications. The SW is mainly based on the Resource Description Framework (RDF) to define relations among different data, creating semantic networks. In the SW, ontologies are used to organize information and formally describe the concepts of a domain of interest. An ontology is a vocabulary including a set of terms and the relations among them. Ontologies can be developed using specific ontology languages such as the RDF Schema Language (RDFS) or the Web Ontology Language (OWL) for inference and knowledge-base modeling. Semantic Web techniques are suitable for application in all scenarios that require advanced data integration, to link data coming from multiple sources
without preexisting schema, and powerful data-modeling to represent expressive semantic descriptions of application domains and to provide inferencing power for applications.
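As a brief illustration of the RDF data model mentioned above, the sketch below builds two triples with the rdflib library (an assumption; the paper does not prescribe a toolkit) and serializes them as Turtle. The example vocabulary namespace is hypothetical.

```python
# Sketch: two RDF statements about a video resource, serialized as Turtle.
from rdflib import Graph, Namespace, URIRef, Literal

EX = Namespace("http://example.org/vocab/")   # hypothetical vocabulary
g = Graph()
video = URIRef("http://example.org/videos/42")

g.add((video, EX.hasTitle, Literal("Interview, Budapest 2010")))
g.add((video, EX.depicts, URIRef("http://dbpedia.org/resource/Budapest")))

print(g.serialize(format="turtle"))
```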
3 Annotation Systems
Annotating web documents such as web pages, parts of web pages, images, audio and videos is one of the most widespread techniques for creating interconnected and structured metadata on the Web. In recent years many annotation systems have been proposed to ease and support the creation of annotations. Annotea is a web-based annotation protocol that uses an RDF-based annotation schema [1] to formally describe annotations. Annotea was implemented for the first time in the Amaya browser [2], but other applications based on the Annotea protocol are currently available. Some of these applications have extended and adapted the Annotea protocol to support additional use cases such as the annotation of audio and video material [3]. The EuropeanaConnect Media Annotation Prototype (ECMAP) [4] is an online media annotation suite based on Annotea that allows users to extend existing bibliographic information about digital items such as images, audio and videos. ECMAP provides free-text annotation and semantic tagging, integrates Linked Data resource linkage into the user annotation process, and provides shape-drawing tools for images, maps and video. It also provides special support for high-resolution map images, enabling tile-based rendering for faster delivery, geo-referencing and semantic tag suggestions based on geographic location. LORE (Literature Object Reuse and Exchange) [5] is a lightweight tool designed to enable scholars and teachers of literature to author, edit and publish compliant compound information objects that encapsulate related digital resources and bibliographic records. LORE provides a graphical user interface for creating, labeling and visualizing typed relationships between individual objects using terms from a bibliographic ontology. SWickyNotes [6] is an open-source desktop application for semantically annotating web pages and digital libraries. It is based on Semantic Web standards, which allow annotations to be more than simple textual sticky notes: they can contain semantically structured data that can later be used to meaningfully browse the annotated contents. One Click Annotator [7] is a WYSIWYG Web editor for enriching content with RDFa annotations, enabling non-experts to create semantic metadata. It allows annotating words and phrases with references to ontology concepts and creating relationships between annotated phrases. Automatic annotation systems have also been developed, which aim to create structured metadata while the user is creating content. OpenCalais [8] is a web annotation system based on a web service that is able to automatically create rich semantic metadata for text content submitted by users. COHSE [9] enables the automatic generation of metadata descriptions by analyzing the content of a Web page and comparing the analyzed parts with concepts described in a lexicon. These kinds of automatic annotation systems generally rely on Natural Language Processing (NLP), Machine Learning, Text Mining and other similar techniques.
4 Video Annotation Tools
Video annotation constitutes the starting point of many scientific studies, for example in the study of emotion and of the multimodality of human communication. Multimodal video annotation, in particular, is a highly time-consuming and difficult task. Several desktop tools have been developed in recent years to provide support for this activity. A complete review of such tools goes beyond the purpose of this work, which focuses on web applications, and can be found in [10]. In a previous work [11], we conducted a survey of multimodal video annotation tools and schemas and traced a roadmap toward the application of Semantic Web techniques to enhance multimodal video annotation, which, also according to the survey results, appear to be highly beneficial in this scenario. In recent years, as a result of the great spread of videos over the Web, video content management has been increasingly supported by web applications, and tools for general-purpose video annotation have started to be developed. YouTube itself [12] enables video uploaders to create textual annotations in the form of text bubbles or notes, also highlighting part of the screen, and to make these annotations visible to all YouTube users when the video is played. VideoAnt [13] is a Web application that also uses YouTube as its video source; it allows users to insert markers in the video timeline and to associate textual annotations with them. The created annotations can also be sent by e-mail so that they can be accessed by other users. Project Pad [14] is a project to build a web-based system for media annotation and collaboration for teaching, learning and scholarly applications. Project Pad provides an open-source Web application, distributed under the GPL license, developed in Java and Flash, and available both as a standalone application and as part of Sakai. The application allows selecting video segments and creating textual annotations, and it also provides a timeline visualization of the annotations. The Kaltura Advanced Editor [15] provides several functionalities for online video editing, supporting timeline-based editing and video and audio layers. It allows adding soundtracks and transitions, importing videos, images and audio while editing, and adding effects and textual annotations. The EuropeanaConnect Media Annotation Suite (ECMAS) [4] includes a client application for video annotation. It allows selecting video segments by adding markers and attaching textual annotations. The EuropeanaConnect video annotator also includes Semantic Web capabilities so that users can augment existing online videos with related resources on the Web, provided by the Linked Data cloud. This augmentation happens on the fly: while the users are writing their annotations, the application proposes related resources derived from DBpedia [16], a semantically compliant version of Wikipedia. The user has to verify the semantic validity of a link or to disambiguate between possible homonyms before they become part of the annotation. Such linked resources can then be exploited in the underlying search and retrieval infrastructure. Among the existing web video annotation tools, only ECMAS exploits some of the possibilities offered by the Semantic Web, and only for information augmentation. Annotations can only be added in the form of free text and there is no possibility to structure the annotation according to standardized domain
ontologies. Moreover, the management of video fragments, when provided, does not follow the Media Fragment URI standard, which limits interoperability and accessibility. In addition, apart from the Kaltura editor, existing applications for video annotation provide rather poor interfaces and limited performance in comparison with existing desktop tools, not supporting, for example, drag-and-drop functionality.
5 SemLib Annotation Tool
The SEMLIB European project [17], in which we are currently participating, aims to improve the current state of the art in digital libraries through the application of Semantic Web techniques in this scenario. Three main challenges need to be faced:

– improve the efficiency of searches, considering that, due to the high volume of data that digital libraries can store, it has become very difficult for users to find and retrieve relevant content;
– promote interoperability, allowing re-using, re-purposing and re-mixing digital objects in heterogeneous environments, taking into account that nowadays digital libraries are consumed and manipulated at the same time by human actors and by machines and other software applications;
– allow effective resource linking also outside the boundaries of a single digital repository.

The purpose is therefore to develop a modular and configurable web application based on Semantic Web technologies that can be plugged into other existing web applications and digital libraries and that can export/import semantic annotations from/to the Web of Data (Linked Data). The system shall allow common users of digital libraries, with no knowledge of Semantic Web techniques, to enrich the content, establish relations and be supported in their scholarly activities. In the next subsections, we discuss the main requirements of a web resource annotation system and the applicable technologies for their accomplishment.

5.1 Requirements Discussion
In order to accomplish its purposes, five main requirements have been identified for the application:

– Flexibility. The proposed application has to allow detailed annotations of heterogeneous resources (text, images, audio and video) in different application domains and has to be pluggable into other existing web applications.
– Interoperability. The possibility to share fully understandable information between different users, software agents and applications represents a fundamental requirement for creating richer applications, allowing the original knowledge base to be augmented by adding related information coming from different external sources.
– Collaborative annotations. The application has to provide support for collaborative annotation management, allowing every user to create their own annotations and to access existing annotations.
– Fine-grained annotations. Nowadays, links, tags and annotations are added at the resource level. For example, creators or users can add tags to classify a document according to its main topics, but they cannot specify in which part of the document each single topic is treated. The capability to fully implement the concept of bookmarks for web resources, identifying and providing access to a specific desired fragment of a resource, represents a key factor for creating the next generation of web applications, enhancing resource fruition and enabling more efficient automatic information aggregation.
– Ease of use. The application should expose an intuitive and engaging interface able to hide the underlying complexity of the system from users, who are not required to have any knowledge of SW techniques. In particular, the creation of well-structured annotations, according to the RDF (subject, property, value) model, should be accomplished in a fast and easy way.
Fig. 1. A simplified sketch of system architecture
5.2 Technical Solutions and Implementation Guidelines
In order to satisfy the requirements outlined in the previous subsection, several technical solutions and implementation guidelines have been identified for the implementation of the proposed system, which can be synthesized as follows:

– Standards compliance. In order to provide maximum support for interoperability, the system implementation relies on standards both in data encoding and in resource identification. RDF is used as the data model to encode information in a univocally interpretable standard. XPointer [18] and Media Fragment URI [19] are used, respectively, to unambiguously identify text excerpts in web pages and subparts of images and audio/video resources, providing support for addressing and retrieving resources as well as for the automated processing of such subparts for reuse.
– Stand-off markup. This paradigm is applied for annotation management, which means that annotations reside in a location different from the location of the data being described. Exploiting URIs, XPointers and Media Fragment URIs, which allow resources and fragments of them to be univocally identified, a resolution mechanism can be implemented that allows annotations to be accessed and stored independently from the original resources while still remaining unambiguously associated with them (a minimal sketch of such a stand-off annotation is given at the end of this subsection). This approach is particularly suitable in the considered scenario, allowing on one side the annotation of every resource on the Web, even if read-only, secured or located on a remote server, and on the other providing maximum freedom to users both in creating their own annotations and in filtering and visualizing existing annotations.
– Pluggable ontologies. The use of ontologies allows providing semantically rich and structured descriptions of resources in specific knowledge domains. It therefore constitutes a fundamental requirement both for the flexibility (the possibility to create detailed descriptions in different domains) and for the interoperability (the possibility to rely on standardized vocabularies in the annotations) of the created annotations. The proposed application should provide a basic set of default general-purpose descriptors. In addition, it should allow importing external ontologies, as plug-in vocabularies, for enabling effective structured descriptions of any knowledge domain.
– Modularity. This represents a fundamental requirement of the system architecture, both to provide support for the management of different resource formats and to allow the system to be pluggable into other existing applications.

Figure 1 provides a simplified sketch of the core system architecture. Annotation creation is separated from annotation visualization, and different handlers are provided to supply specific management for the different functionalities required by the different supported media formats.
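The sketch below, referred to in the stand-off markup item, illustrates how such a stand-off annotation of a temporal video fragment could be expressed, assuming rdflib and the Media Fragment URI temporal syntax (#t=start,end in seconds). The annotation vocabulary is a hypothetical placeholder, not the ontology actually used by the system.

```python
# Sketch: a stand-off annotation that only points at a video fragment URI.
from rdflib import Graph, Namespace, URIRef, Literal

ANNO = Namespace("http://example.org/annotation/")   # hypothetical vocabulary
g = Graph()

# The annotation lives apart from the video; it references the fragment by URI.
fragment = URIRef("http://www.example.org/video.ogv#t=10,20")
annotation = URIRef("http://example.org/annotations/1")

g.add((annotation, ANNO.target, fragment))
g.add((annotation, ANNO.body, Literal("Speaker greets the audience")))
g.add((annotation, ANNO.author, URIRef("http://example.org/users/jdoe")))

print(g.serialize(format="turtle"))
```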
6 SemTube: Semantic YouTube Video Annotation Prototype
As a proof of concept, we are currently developing a web application for semantic annotation of YouTube videos. Besides being the main video sharing service on the Web, YouTube offers powerful APIs [20] that make video embedding in Web pages an easy task and also provides Player APIs that give control over YouTube video playback. In particular, the JavaScript chromeless player API has been used to create a custom player that, in addition to the common playback functionality, allows frames and segments to be selected for annotation. A custom video progress bar has been created in JavaScript, allowing markers to be placed for selecting frames and segments. Once the frame or segment to be annotated has been selected, the annotation can be performed using free text, tags, or the descriptors provided by an ontology that can be retrieved in real time from a SPARQL endpoint using AJAX technology. The created annotations are both displayed in the web page and stored into a Sesame triplestore for later retrieval and querying.
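A hedged sketch of how stored annotations might be retrieved for one fragment over SPARQL is given below, assuming the SPARQLWrapper package. The endpoint URL and the vocabulary are placeholders; the deployed system would use its own repository endpoint and schema.

```python
# Sketch: query a SPARQL endpoint for the annotations attached to one video fragment.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://example.org/openrdf-sesame/repositories/semtube")
sparql.setQuery("""
    PREFIX anno: <http://example.org/annotation/>
    SELECT ?annotation ?body WHERE {
        ?annotation anno:target <http://www.example.org/video.ogv#t=10,20> ;
                    anno:body ?body .
    }
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["annotation"]["value"], "-", row["body"]["value"])
```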
Fig. 2. A screenshot of SemTube (Semantic YouTube Video Annotator)
7 Conclusions
The capability of providing full support for video fragment annotation and management represents a key factor for the development of a new generation of Web applications. In this paper, we discussed the application of SW techniques in this scenario, analyzing the main requirements of a general-purpose web annotator and focusing on video annotation. We also introduced a novel Web application for semantic video fragment annotation and management that we are currently developing.

Acknowledgments. This work has been supported by COST 2102, SSPNet and the SEMLIB project (SEMLIB - 262301 - FP7-SME-2010-1).
References
1. Kahan, J., Koivunen, M.R.: Annotea: An Open RDF Infrastructure for Shared Web Annotations. In: Proceedings of the 10th International Conference on World Wide Web, pp. 623–632 (2001)
2. Amaya Web Browser, http://www.w3.org/Amaya/
3. Schroeter, R., Hunter, J., Kosovic, D.: FilmEd - Collaborative Video Indexing, Annotation and Discussion Tools Over Broadband Networks. In: Proceedings of the Multimedia Modelling Conference 2004, pp. 346–353 (January 2004)
4. Haslhofer, B., Momeni, E., Gay, M., Simon, R.: Augmenting Europeana Content with Linked Data Resources. In: 6th International Conference on Semantic Systems (I-Semantics) (September 2010)
5. Gerber, A., Hunter, J.: Authoring, Editing and Visualizing Compound Objects for Literary Scholarship. Journal of Digital Information 11 (2010)
6. SWickyNotes: Sticky Web Notes with Semantics, http://dbin.org/swickynotes/
7. Heese, R., Luczak-Rösch, M.: One Click Annotation. In: 6th Workshop on Scripting and Development for the Semantic Web (2010)
8. OpenCalais, http://www.opencalais.com/
9. Goble, C., Bechhofer, S., Carr, L., De Roure, D., Hall, W.: Conceptual Open Hypermedia = The Semantic Web? In: The Second International Workshop on the Semantic Web, Hong Kong, p. 4450 (May 2001)
10. Rohlfing, K., et al.: Comparison of multimodal annotation tools - workshop report. Gesprächsforschung - Online-Zeitschrift zur verbalen Interaktion 7(7), 99–123 (2006)
11. Grassi, M., Morbidoni, C., Piazza, F.: Towards Semantic Multimodal Video Annotation. In: Esposito, A., Esposito, A.M., Martone, R., Müller, V.C., Scarpetta, G. (eds.) COST 2010. LNCS, vol. 6456, pp. 305–316. Springer, Heidelberg (2011)
12. YouTube Video Annotations, http://www.youtube.com/it/annotations_about
13. VideoANT, http://ant.umn.edu/
14. Project Pad, http://dewey.at.northwestern.edu/ppad2/
15. Kaltura Video Editing and Annotation, http://corp.kaltura.com/video_platform/video_editing
16. Bizer, C., Lehmann, J., Kobilarov, G., Auer, S., Becker, C., Cyganiak, R., Hellmann, S.: DBpedia - A Crystallization Point for the Web of Data. Journal of Web Semantics: Science, Services and Agents on the World Wide Web 7, 154–165 (2009)
17. SEMLIB - Semantic Tools for Digital Libraries. SEMLIB - 262301 - FP7-SME-2010-1, http://www.semlib.org/
18. XML Pointer Language (XPointer), http://www.w3.org/TR/xptr
19. Media Fragments URI 1.0. W3C Working Draft, June 24 (2010), http://www.w3.org/TR/media-frags/
20. YouTube APIs and Tools, http://code.google.com/apis/youtube/overview
Imitation of Target Speakers by Different Types of Impersonators Wojciech Majewski and Piotr Staroniewicz Wroclaw University of Technology, Institute of Telecommunications, Teleinformatics and Acoustics, Wybrzeze Wyspianskiego 27, 50-370 Wroclaw, Poland
[email protected]

Abstract. Vowel formant frequency planes obtained from speech samples of three well-known Polish personalities and from their imitations performed by three impersonators of different types (professional, semi-professional and amateur) have been compared. The vowel formant planes for the imitations were generally, but not always, placed between the impersonator's natural voice and the target. The largest resemblance between the formant planes of the imitation and the target was obtained for the amateur, whose imitations were, however, subjectively evaluated as the worst ones. Thus, besides acoustical parameters, other factors, such as the qualification and experience of the impersonator, are very important in the realization of impersonation tasks.

Keywords: vowel formant planes, impersonators.
1 Introduction

Impersonation of a person by means of voice may occur in two very different situations. The first situation concerns public entertainment, when a professional impersonation artist amuses the public by imitating the voices of well-known personalities. The second situation concerns forensic voice identification, when an automatic speaker recognition system used for security purposes may be cheated by a skilful impostor imitating the voice of an authorized person. Thus, the problem of voice mimicry seems to be very interesting for the general public and is very important for law enforcement agencies. In spite of this, the problem of voice imitation has been studied to a rather limited extent. The first study on voice imitation was published in 1971 by Endres, Bambach and Flösser [1]. In this study, vowel formant frequencies and fundamental frequency in original and imitated voices were compared. Although the imitators managed to change their formant and fundamental frequencies in the direction of the target values, they were not able to match or closely approach those of the imitated people. In 1997 Ericson and Wretling [2] examined the timing, fundamental frequency and vowel formant frequencies of three Swedish politicians and their imitations performed by a professional impersonator. The global speech rate and fundamental frequency were mimicked very closely and the vowel space for two of the three target
voices was intermediate between that of the artist's own voice and the target, but for the third target there was no apparent reduction in the distance. In the same year Schlichting and Sullivan [3] published the results of subjective speaker recognition indicating that listeners are able to discriminate between a real voice and a professional imitation. However, the imitation led to 100 per cent misidentification in the worst case. In 2006 Zetterholm [4] examined one impersonator and his different voice imitations to gain some insight into the flexibility of the human voice and speech. The results indicated that the impersonator was able to adopt a range of articulatory-phonetic configurations to approach the target speakers. The authors of the present paper have examined selected aspects of voice imitation. In the first study, published in 2005 [5], the results of aural-perceptual voice recognition of Polish personalities and their imitations performed by cabaret entertainers were presented. It was shown that the impersonators were able to fool the listeners, i.e., to convince them that they had heard the target speakers. At the same time, however, a similar number of listeners recognized the imitation. In subsequent studies, in 2006 [6] mel frequency cepstral coefficients of original speakers and their imitators were presented, while in 2007 [7] the speaking fundamental frequency under similar conditions was examined. Finally, in the last study, published in 2008 [8], selected acoustical parameters obtained from speech samples of well-known Polish personalities and their imitations performed by cabaret entertainers were presented and discussed. In the present study, the influence of the impersonator's type on voice mimicry is examined. Only one kind of parameter, i.e., the formant frequencies of Polish vowels, was utilized. To be more specific, the two lowest formant frequencies, F1 and F2, of Polish vowels were measured, drawn in F1-F2 planes and compared to find out how flexible the human voice apparatus is and whether the person of the impersonator and his professional qualifications have an influence on the distribution of formant frequencies in the F1-F2 planes and on the distances between these planes for different speakers and different ways of speech production. F1-F2 planes have been applied in the experiment since such patterns are widely used in research on speech acoustics, and F1 and F2 are considered the most important parameters for speech and speaker recognition.
2 Experimental Procedure

As the target speakers, three well-known Polish personalities were selected, whose characteristic voices are relatively easy for impersonators to imitate. The selected target speakers were: Lech Walesa, former president of Poland; Jerzy Urban, editor-in-chief of the Polish weekly Nie; and Adam Michnik, editor-in-chief of the Polish daily Gazeta Wyborcza. Speech samples of over one minute in duration were obtained for Lech Walesa from recordings available in the archives of the Polish radio and for Adam Michnik and Jerzy Urban from the internet. Three different types of impersonators were employed. The first one was a professional, Waldemar Ochnia, one of the best Polish impersonators, who imitated the voices of Lech Walesa and Jerzy Urban. The second one was a
semi-professional, Piotr Gumulec, a cabaret artist, who imitated the voices of Lech Walesa and Adam Michnik. The third one was Lukasz Likus, an amateur, who also imitated the voices of Lech Walesa and Adam Michnik. The test material consisted of speech samples produced by the target speakers and by each impersonator under two speaking conditions: 1) while imitating a given target speaker, 2) while speaking the same text in his natural voice. Thus, for a given target-impersonator pair, the semantic content of all three speech samples was the same. From the audio files recorded under the given speaking conditions, all six Polish oral vowels, i.e., u, o, a, e, i and y [I], were extracted. Each vowel was represented by nine segments of 40 ms duration taken from particular words spoken in the same context. The vowel segments were extracted from the beginning, middle and end of particular words, from short and long words, and from the first and second parts of the spoken text. Such a variety of vowel selection made it possible to consider the influence of the position of the vowels in words, the influence of short and long words, and the influence of the first and second parts of the text on the vowel parameters. The formant frequencies of the two lowest formants, F1 and F2, were obtained by means of the Praat program. For each vowel under the given speaking conditions, the mean values of F1 and F2 in Hz from all nine realizations of a given vowel were calculated, converted to the Bark scale and used to draw the F1-F2 planes of the six Polish vowels. In addition, an aural-perceptual speaker recognition test was carried out to find out how effective the voice imitations were. The test was performed by a group of 15 listeners with normal hearing who stated that they knew the target speakers. Speech samples of 15 seconds duration produced by the target speakers were presented to the listeners first, and next, after a short break, another four speech samples of similar duration for each of the target voices and their imitations were presented in a random order. The only restriction was that an original speech sample and its imitation could not be presented as adjacent stimuli. The task of the listeners was to state whether a given speech sample was an original or an imitation. In the case of an imitation, the quality of the imitation had to be evaluated on a five-point scale (1 - poor imitation, 5 - very good imitation).
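A minimal sketch of the formant post-processing described above (Hz-to-Bark conversion of the mean F1/F2 values and Euclidean distance in the F1-F2 plane) is given below. The paper does not state which Bark approximation was used; Traunmüller's formula is one common choice and is an assumption here.

```python
# Sketch: convert mean formant values to Bark and compare two vowels.
import math

def hz_to_bark(f_hz):
    """Traunmueller (1990) approximation of the Bark scale (an assumed choice)."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

def vowel_distance(f1_a, f2_a, f1_b, f2_b):
    """Euclidean distance between two vowels in the F1-F2 Bark plane."""
    d1 = hz_to_bark(f1_a) - hz_to_bark(f1_b)
    d2 = hz_to_bark(f2_a) - hz_to_bark(f2_b)
    return math.hypot(d1, d2)

# Example with illustrative formant values in Hz: a target /a/ versus its imitation.
print(round(vowel_distance(700, 1300, 650, 1450), 2))
```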
3 Results

In Figs. 1-6, F1-F2 planes are drawn on the basis of the mean formant frequency values obtained for all the realizations of the speech samples produced by the target speakers and by their impersonators, both when imitating the target speakers and when speaking naturally. In Figs. 1-3, the vowel formant plane for Walesa is presented, accompanied by the vowel formant parameters of the imitation and natural voice of Ochnia (Fig. 1), Gumulec (Fig. 2) and Likus (Fig. 3). Looking at Fig. 1, it may be observed that the professional impersonator was able to change his vowel formant frequencies, in comparison with his natural voice, to be closer to the target voice. His vowel formant plane for the imitation is larger and is generally placed between the planes for his natural voice and the target voice. The obtained results are similar to those presented by Ericson and Wretling [2].
Fig. 1. Vowel formant planes for Walesa (target) and Ochnia (imitation and natural)
Fig. 2. Vowel formant planes for Walesa (target) and Gumulec (imitation and natural)
Fig. 3. Vowel formant planes for Walesa (target) and Likus (imitation and natural)
Fig. 4. Vowel formant planes for Urban (target) and Ochnia (imitation and natural)
Fig. 5. Vowel formant planes for Michnik (target) and Gumulec (imitation and natural)
Fig. 6. Vowel formant planes for Michnik (target) and Likus (imitation and natural)
The situation for the semi-professional impersonator presented in Fig. 2 is not so clear. It may be seen that Gumulec changed his vowel formant frequencies, but it is difficult to say whether the formant plane for the imitation is more similar to the target plane or to his natural plane. Moreover, in contrast to the results obtained by the
professional, the plane for the imitation is smaller than that for the natural voice and shifted in the direction of larger values of F1. Still another situation is presented in Fig. 3. First of all, very large changes in formant frequency values between the natural voice and the imitation may be seen. The large effort of the impersonator to change his voice is confirmed by the frequent breaks he made during the imitation to give his voice production apparatus a rest. The F1-F2 plane for his natural voice is the smallest, and the plane for the imitation is very close to the plane for the target voice. In Fig. 4, the vowel formant plane for Urban is presented together with the planes for the imitation and natural voice of the professional impersonator (Ochnia). As in many other cases, in comparison with the imitator's natural voice, the imitation is shifted toward larger values of F1 and somewhat in the direction of larger values of F2. This time, however, the plane for the imitation does not seem to be closer to the plane of the target. In Figs. 5 and 6, the vowel formant plane for Michnik is accompanied by the vowel formant planes for the imitation and natural voice of the semi-professional (Gumulec) (Fig. 5) and the amateur (Likus) (Fig. 6). In Fig. 5 it may be seen again that the imitation is shifted in the direction of larger values of F1 and F2. In Fig. 6, tendencies similar to those in Fig. 3 may be seen: there is a large shift between the planes for the imitation and the natural voice of the impersonator, the plane for the natural voice is the smallest, and the planes for the imitation and the target are very close. Since one of the goals of the present study was a comparison of the achievements of different types of impersonators, in Fig. 7 the vowel formant planes have been plotted together for Walesa and all three impersonators employed. Visually, the most similar planes are those for Walesa and the professional (Ochnia). It is interesting to note that the plane for the semi-professional (Gumulec) was smaller than that for the amateur (Likus). The results shown in Fig. 7 have been used to calculate the Euclidean distances between particular vowels. The results of these calculations are plotted in Fig. 8. An interesting observation is that the mean imitation-target distances over all vowels are equal for the professional (Ochnia) and the amateur (Likus), while for the semi-professional (Gumulec) they are substantially larger. As already mentioned, subjective tests of the recognition of speech samples as originals or imitations and of the evaluation of the quality of impersonation were also carried out. The listeners' judgments of the perceived stimuli as originals or imitations are presented in Table 1. The evaluation of the quality of the imitations on a five-point scale is also given. The effectiveness of recognition of the original target voices by the listeners was high and, expressed in percent, reached 98.3% for Walesa, 90% for Urban and 65% for Michnik, which reflects the public popularity of a given speaker. The effectiveness of the imitations presented in the same table was much lower, ranging from 46.7% for the impersonation of Urban by Ochnia (professional) to only 6.7% for the impersonation of Michnik by Likus (amateur). On the basis of the results presented in Table 1, it may be said that, on average, the results of the imitation for the semi-professional (Gumulec) were twice as good as for the amateur (Likus) and twice as bad as for the professional (Ochnia).
Thus, the qualification and experience of the impersonator play a major role in impersonation tasks. This observation is confirmed by the evaluation of the quality of
imitation presented in the last six columns of Table 1. The impersonations performed by the amateur were generally judged as poor (1.5 and 1.6 points on average on a five-point scale), those by the semi-professional as satisfactory (2.3 and 2.4 points), and those by the professional as good (3.0 and 3.6 points).
Fig. 7. Vowel formant planes for Walesa (target) and all three of his impersonators
Fig. 8. Euclidean distances between the vowels of Walesa and all three of his impersonators
Table 1. Distribution of listeners' answers to perceived stimuli

Speech samples        Target   Imitation   Imitation evaluation            Mean
                                            1    2    3    4    5
Walesa-original        98.3      1.7        0    0    0    0    1           5.0
Michnik-original       65.0     35.0        1    2    8    2    1           3.7
Urban-original         90.0     10.0        0    0    1    3    0           4.2
Walesa by Ochnia       26.7     73.3        2    9   22   10    0           3.0
Urban by Ochnia        46.7     53.3        1    3   11    9    8           3.6
Walesa by Gumulec      15.0     85.0        8   22   18    3    0           2.3
Michnik by Gumulec     16.7     83.3       14   10   16   10    0           2.4
Walesa by Likus         8.3     91.7       32   18    4    1    2           1.5
Michnik by Likus        6.7     93.3       28   22    5    1    8           1.6
4 Conclusions

It has been shown that the speakers are able to modify their speech production apparatus in the desired direction. The vowel formant planes for the imitations were generally, but not always, placed between the impersonator's natural voice and the target. The largest resemblance between the vowel formant planes of the imitation and the target was obtained for the amateur, whose imitations were, however, subjectively evaluated as the worst ones. This indicates that, besides the examined acoustical parameters, other factors, such as the qualification and experience of the impersonator, are very important in the realization of impersonation tasks.

Acknowledgments. This work was partially supported by COST Action 2102 "Cross-Modal Analysis of Verbal and Non-verbal Communication" and by a grant from the Polish Minister of Science and Higher Education (decision no. 115/N-COST/2008/0).
References
1. Endres, W., Bambach, W., Flösser, G.: Voice spectrograms as a function of age, voice disguise and voice imitation. JASA 49, 1842–1848 (1971)
2. Ericson, A., Wretling, P.: How flexible is the human voice – a case study of mimicry. In: Proc. Eurospeech 1997, Rhodes, vol. 2, pp. 1043–1046 (1997)
3. Schlichting, F., Sullivan, K.: The imitated voice – a problem for line-ups? Int. J. Speech, Language and the Law 4, 148–165 (1997)
4. Zetterholm, E.: Same speaker – different voices. A study of one impersonator and some of his different imitations. In: Proc. 11th Australian Int. Conf. on Speech Sci. & Techn., Auckland, pp. 70–75 (2006)
5. Majewski, W.: Aural-perceptual voice recognition of original speakers and their imitators. Archives of Acoustics 30 (supplement), 183–186 (2005)
6. Majewski, W.: Mel frequency cepstral coefficients (MFCC) of original speakers and their imitators. Archives of Acoustics 31, 445–449 (2006)
7. Majewski, W.: Speaking fundamental frequency of original speakers and their imitators. Archives of Acoustics 31, 17–23 (2007)
8. Majewski, W., Staroniewicz, P.: Acoustical parameters of target voices and their imitators. Speech and Language Technology 11, 17–23 (2008)
Multimodal Interface Model for Socially Dependent People
Rytis Maskeliunas¹ and Vytautas Rudzionis²
¹ Kaunas University of Technology, Kaunas, Lithuania
² Vilnius University, Kaunas Faculty, Kaunas, Lithuania
[email protected]

Abstract. The paper presents an analysis of a multimodal interface model for socially dependent people. The general requirements for the interface were that it should be as simple and as natural as possible (in principle, such an interface should be the theoretical replacement of a typical "standard" one-finger "joystick" control). The experiments performed allowed us to identify the most often used commands and the expected accuracy level for the selected applications, and to carry out various usability tests.

Keywords: multimodal, speech recognition, touch-based GUI, human-machine interaction.
1 Introduction

Multimodal interfaces have many advantages compared with the more widely used "standard" (single-modality) interfaces. Domains such as health, education, e-government and e-commerce have a great potential for the application of multimodal interfaces, which could lead to higher efficiency, lower costs, better reliability, accessibility, quality of information content, decentralized communication, etc. [1]. The advantages of multimodal dialogues [2–4] could be exploited even better by designing applications whose primary users will be socially dependent people (the elderly, people with disabilities, technically naive people, etc.) [5–7]. These advantages first of all lie in the fact that such users are often technically naive and have some sort of fear of dealing with new technologies. Interfaces which are as similar as possible to real human-human communication are of great importance for such people. An additional inconvenience is related to the fact that mobile and small-sized portable devices usually have small keyboards and screens [8]. This unavoidable design compromise of such devices introduces an additional factor of inconvenience for the development of traditional GUI-type interfaces. Such inconvenience is felt particularly sharply by the elderly and other socially dependent people [9]. Many studies have proven that speech-recognition-centric interfaces used as the main modality for the control of mobile and portable devices have an enormous market potential and many usability advantages. Spoken commands may be the simplest and the most convenient way to replace the traditional keyboard-based control of portable
devices [10, 11]. In recent years a number of new multimodal services and prototypes have been presented and developed in various countries and in various areas of application [12–15]. Unfortunately, the design of multimodal interfaces is still not a straightforward task: there are no clear answers as to which spoken commands should be used to achieve the necessary naturalness of the Human-Computer Interface (HCI) and to minimize information access time, which additional modalities should be integrated and when, etc. Our research tries to find answers to some of these questions.
2 Multimodal Interface Model for Socially Dependent People In this study the main goal was to propose a multimodal interface model to potentially provide easier communication with more and more widespread technical devices and more and more complicated user interfaces. Interface model for socially dependent people has been formulated and several research tasks were established. Among the research tasks were the following ones: • • • •
How many voice commands should be recognized for the efficient performance and necessary naturalness; What is the optimum “length” of a command for the human – computer interaction; What control modalities should be chosen to perform selected tasks in a most efficient and easy way; The evaluation of the usability and naturalness of such interface.
The evaluation and consideration of these tasks, and the design of a human–machine interface based on the results, should in principle lead to a convenient, natural, user-oriented interface. It is possible that some of the results could lead to contradictory conclusions, so another aim of this study was to try to determine which factors should be treated as the more important ones when designing this type of human–machine interface. For the experimental evaluation a demo application for multimodal wheelchair control was developed (continuing from [16]), serving as the basis for the interface efficiency evaluation experiments. The target condition was that this demo application should satisfy speech recognition accuracy requirements (at least 95 % recognition accuracy for every voice command), meaning that if some particular voice command cannot guarantee the necessary speaker-independent recognition accuracy level, it should be replaced by another voice command. The design of the application was to provide voice-based input combined with a more traditional GUI-based touch-screen interface, with additional video input for added control (gaze and motion recognition). The targeted audience was socially dependent (mostly elderly and disabled) and not computer-literate people. It was decided that the control of such an application should be realized on a widespread portable device (a smartphone) utilizing traditional touch-based GUIs, providing the user with the possibility to use more than one type of interface for human–machine interaction. The information flowchart of this application is shown in Fig. 1.
Fig. 1. Information flowchart of the demo application
The application architecture was a typical client-side application serving as a connection point with the server-side speech and video processing engines. Users did not have the opportunity to train their voices for better recognition of voice commands or to adapt the speech recognition engine to the characteristic properties of their voices (no "learning curve"). The video processing was not used in this evaluation due to time constraints.
Fig. 2. The illustration of the demo application
Touchscreen and haptic (built-in vibration) input capabilities were used to provide a simple GUI for wheelchair control (Fig. 2). The user was given the possibility to confirm or reject a proposition, or to point out the direction. It was expected that this capability would be used mainly when speech recognition cannot provide an accurate enough recognition rate, as sketched below.
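A minimal sketch of this speech-first, touch-fallback control loop is given below. It is illustrative only: the server URL, request payload and response fields (command, score) are hypothetical stand-ins for the proprietary server-side engine, and the 0.95 threshold simply mirrors the 95 % accuracy target stated above.

```python
# Hypothetical sketch: endpoint, payload and response format are placeholders,
# not the actual interface of the server-side Lithuanian ASR engine.
import requests

ASR_URL = "http://example.org/asr"   # hypothetical server-side recognizer
CONFIDENCE_THRESHOLD = 0.95          # mirrors the 95 % accuracy requirement

def recognize_command(wav_bytes):
    """Send one utterance to the server-side engine, return (command, score)."""
    resp = requests.post(ASR_URL,
                         files={"audio": ("cmd.wav", wav_bytes, "audio/wav")},
                         timeout=5)
    resp.raise_for_status()
    result = resp.json()             # assumed to contain 'command' and 'score'
    return result.get("command"), float(result.get("score", 0.0))

def control_step(wav_bytes, ask_touch_confirmation):
    """Prefer speech; fall back to the touch GUI when recognition is unsure."""
    try:
        command, score = recognize_command(wav_bytes)
    except requests.RequestException:
        command, score = None, 0.0
    if command is None or score < CONFIDENCE_THRESHOLD:
        # low confidence: let the user confirm or pick the direction on screen
        return ask_touch_confirmation(command)
    return command
```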
3 The Experimental Evaluation
Several groups of experiments were carried out to evaluate the usability and the users' preferences in various modes of operation. The first group of experiments was performed to evaluate the speech recognition accuracy of the voice commands used in the demo application. 20 speakers of different age groups participated in the experiment (speakers of the older age groups prevailed). Two sets of voice commands were used in this experiment; each set contained 10 Lithuanian voice commands. The commands in each set had in principle the same semantic meanings, but the first set consisted of phonetically more complicated commands, while the second one was composed of "simpler" commands. Each speaker pronounced each utterance 50 times, so a total of 1000 phrases and sentences were used to test the voice command recognition accuracy. A proprietary Lithuanian ASR system based on HMM (restricted via GRXML rules) was used for the evaluation. The recognition accuracy of each voice command is presented in Fig. 3. The average recognition accuracy of the first voice command set was 77 %, while in the second one the recognition accuracy was 97 %.
Fig. 3. The recognition accuracy of the ten voice commands (series: Set nr. 1 accuracy and Set nr. 2 accuracy, in %)
These results show that proper design of the voice command vocabulary can lead to a substantial increase in the accuracy rate and in customer satisfaction (all users said that they preferred using the second command set, despite the fact that it contained commands composed of words used less frequently in everyday speech). We may conclude that users are ready to use voice input if it allows them to achieve a high enough recognition accuracy, rather than using more popular words but facing lower recognition accuracy and the necessity to repeat the same commands. It can also be seen that the first command set was characterized by a bigger accuracy deviation among the different commands (the worst recognition accuracy was only 19 %). Using the second set of commands, 95 % of all users expressed satisfaction with the control capabilities, while only 40 % of users expressed satisfaction using the first set of voice commands. The factor of low recognition accuracy was pointed out as the most irritating one by the users.
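The vocabulary-design rule described above (keep a command only if it reaches the 95 % speaker-independent accuracy target, otherwise substitute it) can be expressed with a short evaluation script such as the sketch below. The Lithuanian words in the example are placeholders; the paper does not list the actual command vocabulary.

```python
# Illustrative evaluation sketch; the command words below are invented examples.
from collections import defaultdict

def per_command_accuracy(trials):
    """trials: iterable of (reference_command, recognized_command) pairs."""
    correct, total = defaultdict(int), defaultdict(int)
    for ref, hyp in trials:
        total[ref] += 1
        correct[ref] += int(ref == hyp)
    return {cmd: correct[cmd] / total[cmd] for cmd in total}

def commands_to_replace(trials, target=0.95):
    """Commands whose accuracy falls below the target should be substituted."""
    return sorted(cmd for cmd, acc in per_command_accuracy(trials).items()
                  if acc < target)

# Dummy data: one command is misrecognized half of the time.
logs = ([("pirmyn", "pirmyn")] * 10 +
        [("atgal", "atgal")] * 5 + [("atgal", "pirmyn")] * 5)
print(per_command_accuracy(logs))   # {'pirmyn': 1.0, 'atgal': 0.5}
print(commands_to_replace(logs))    # ['atgal']
```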
Another test was carried out using long voice command strings (two different sets, one with higher recognition accuracy than the other), trying to determine the optimum "length" of a command. In this case the same 20 users needed to utter several voice commands in a row (continuous speech) to achieve a predefined task. If all voice commands were recognized correctly, the task was treated as solved; otherwise the task was treated as unsolved and the user was asked to repeat the erroneous commands. The first task required the use of 4 words (imitating a simple sentence), the second required 8 words (imitating the description of an action) and the third task required 11 words (imitating detailed instructions). The results of this experiment are shown in Fig. 4. In this experiment all users again preferred the second set of voice commands (composed of simpler, less popular, but better recognized words). As expected, the accuracy decreased and the complexity of use increased the longer the utterance became. An interesting observation was made: the longer the utterance time became, the lower was the satisfaction level (noticeable irritation was expressed by most users). Only 65 % preferred to use voice while still getting quite usable accuracy levels (~80 %) with the long set of 11 commands.
Fig. 4. The recognition accuracy of the voice command strings of 4, 8 and 11 commands (series: Set nr. 1 accuracy, Set nr. 2 accuracy, Set nr. 1 liked the voice input, Set nr. 2 liked the voice input, in %)
In the third group of experiments users had the possibility to freely choose to use voice commands to control the device or to use touch screen capabilities to navigate through the menu and to invoke the same control capabilities and to solve the same tasks. The same 20 speakers took part in these experiments. They were divided into the two groups with 10 participants in each group (similar composition of age in both groups). One group used a first set of voice commands (lower recognition accuracy) with the touch screen interface while the second group used the second set of voice commands (higher recognition accuracy) with the same touch screen GUI. Users in each group were given the same tasks (achievable either by voice commands or by some actions on screen) and were asked after the experiment which communication mode – spoken commands or touch screen navigation they would treat as a more preferable one. In the first group only 50 % of users said that spoken input was the preferable way of interaction while in the second group 90 % of users preferred the spoken input over the touch screen navigation. In both groups 85 % of users confirmed that combined use of voice commands with the touch screen navigation was helpful. 72 % of all elder participants selected the speech input as the most
attractive way for HCI control. These results are important in the light of our pilot test with more technically skilled younger users having some experience interacting with portable devices in general and touch-screen devices in particular. In this group the preference for spoken commands was not expressed so clearly: only 20 % of the younger (technically literate) users said that spoken input was the preferable way of interaction, while the others preferred the touch-based GUI. The overall results (Fig. 5) let us make the preliminary conclusion that a speech-centric multimodal interface is of particular importance and convenience for socially dependent users. A detailed evaluation of the users' satisfaction as a function of the WER will be obtained in the near future (more data will be gathered from a larger number of participants).
Fig. 5. The evaluation of the usability (number of people in each group who preferred touch, preferred voice, and appreciated the multimodal approach)
4 Conclusions
1. A prototype system of a multimodal interface (using speech recognition as the main modality) for socially dependent people has been proposed. The system uses voice commands combined with a touch-screen-based GUI-like interface intended to be used by socially dependent people. The experiments with two different sets of voice commands showed that nearly all users expressed satisfaction with the application when the average recognition accuracy was 97 %.
2. It was observed that the longer the utterance became, the lower was the satisfaction level. Only 65 % preferred to use voice while still getting quite usable accuracy levels (~80 %) with a long set of 11 words.
3. Most of the users (85 %) expressed satisfaction with having the possibility to use a multimodal interface (voice commands supplemented with the touch-screen-based GUI-like interface).
4. The technically naive and socially dependent people got more value from the multimodal and voice-based interface than the technically skilled users. Most of the elder users (72 %) said that control using voice commands is the most attractive way for human–machine interaction, while 80 % of the younger users preferred the touch-based GUI. As expected, lower speech recognition accuracy caused more frequent use of touch and vice versa.
Acknowledgments. This research was done under a grant from the Lithuanian Academy of Sciences for the research project "Dialogų modelių, valdomų lietuviškomis balso komandomis, panaudojimo telefoninėse klientų aptarnavimo sistemose analizė", No. 20100701–23.
References 1. Noyes, J.M.: Enhancing mobility through speech recognition technology. IEE Developments in Personal Systems, 4/1–4/3 (1995) 2. Pieraccini, M., Huerta, R.J.: Where do we go from here? Research and Commercial Spoken Dialog Systems. In: Proc. of 6th SIGdial Workshop on Discourse and Dialog, Lisbon, Portugal, pp. 1–10 (2005) 3. Acomb, K., et al.: Technical Support Dialog Systems, Issues, Problems, and Solutions. In: Proc. of the Workshop on Bridging the Gap: Academic and Industrial Research in Dialog Technologies, Rochester, New York, pp. 25–31 (2007) 4. Paek, T., Pieraccini, R.: Automating spoken dialogue management design using machine learning: An industry perspective. Speech Communication, Special Issue on Evaluating New Methods and Models for Advanced Speech-Based Interactive Systems 50(8-9), 716– 729 (2008) 5. Valles, M., et al.: Multimodal environmental control system for elderly and disabled people. In: Proc. of Engineering in Medicine and Biology Society, Amsterdam, vol. 2, pp. 516–517 (1996) 6. Perry, M., et al.: Multimodal and ubiquitous computing systems: supporting independentliving older users. IEEE Transactions on Information Technology in Biomedicine 8(3), 258–270 (2004) 7. Wai, A.A.P., et al.: Situation-Aware Patient Monitoring in and around the Bed Using Multimodal Sensing Intelligence. In: Proc. of Intelligent Environments, Kuala Lampur, pp. 128–133 (2010) 8. Ishikawa, S.Y., et al.: Speech-activated text retrieval system for multimodal cellular phones. In: Proc. of Acoustics, Speech, and Signal Processing, vol. 1, pp. I-453–I-456 (2004) 9. Verstockt, S., et al.: Assistive smartphone for people with special needs: The Personal Social Assistant. In: Proc. of Human System Interactions, Catania, pp. 331–337 (2009) 10. Oviatt, S.: User-centered modeling for spoken language and multimodal interfaces. IEEE Multimedia 3(4), 26–35 (1996) 11. Deng, L., et al.: A speech-centric perspective for human-computer interface. In: Proc. of Multimedia Signal Processing 2002, pp. 263–267 (2002) 12. Zhao, Y.: Speech-recognition technology in health care and special-needs assistance (Life Sciences). Signal Processing Magazine 26(3), 87–90 (2009) 13. Sherwani, J., et al.: Speech vs. touch-tone: Telephony interfaces for information access by low literate users. In: Proceedings of Information and Communication Technologies and Development, Doha, pp. 447–457 (2009) 14. Motiwalla, L.F.: Jialun Qin. Enhancing Mobile Learning Using Speech Recognition Technologies: A Case Study. In: Management of eBusiness 2007, Toronto, pp. 18–25 (2007) 15. Sherwani, J., et al.: HealthLine: Speech-based Access to Health Information by Lowliterate Users. In: Proc. of Information and Communication Technologies and Development, Bangalore, pp. 1–9 (2007) 16. Maskeliunas, R.: Modeling Aspects of Multimodal Lithuanian Human - Machine Interface. In: Esposito, A., Hussain, A., Marinaro, M., Martone, R. (eds.) Multimodal Signals, COST Seminar 2008. LNCS (LNAI), vol. 5398, pp. 75–82. Springer, Heidelberg (2009)
Score Fusion in Text-Dependent Speaker Recognition Systems
Jiří Mekyska¹, Marcos Faundez-Zanuy², Zdeněk Smékal¹, and Joan Fàbregas²
¹
Signal Processing Laboratory, Department of Telecommunications, Faculty of Electrical Engineering and Communication, Brno University of Technology Brno, Czech Republic
[email protected],
[email protected]
² Escola Universitària Politècnica de Mataró, Barcelona, Spain
{faundez,fabregas}@tecnocampus.com
Abstract. Owing to some significant advantages, text-dependent speaker recognition is still widely used in biometric systems. These systems are, in comparison with text-independent ones, more accurate and more resistant against replay attacks. There are many approaches to text-dependent recognition. This paper introduces a combination of classifiers based on fractional distances, the biometric dispersion matcher and dynamic time warping. The first two classifiers are based on a voice imprint. They have low memory requirements and the recognition procedure is fast. This is advantageous especially in low-cost biometric systems supplied by batteries. It is shown that, using the trained score fusion, it is possible to reach a successful detection rate equal to 98.98 % and 92.19 % in the case of microphone mismatch. During verification, the system reached an equal error rate of 2.55 % and 6.77 % when assuming the microphone mismatch. The system was tested using a Catalan database which consists of 48 speakers (three 3 s training samples per speaker). Keywords: Text-dependent speaker recognition, Voice imprint, Fractional distances, Biometric dispersion matcher, Dynamic time warping.
1 Introduction
Speaker identification is a task where the system tries to answer the question "Who is speaking?" Many behavioural biometric systems are based on this task because, due to the individual shape of the vocal tract and the manner of speaking (intonation, loudness, rhythm, accent, etc.), it is possible to distinguish between speakers. These systems can serve as gates to secured areas, as authentication systems in the field of banking, or as simple systems which provide access to private lifts (e.g. in hospitals). Generally it is possible to divide these systems into text-dependent and text-independent recognition systems. In the case of text-dependent systems, the speaker has to utter exactly the required phoneme, word or sentence. This utterance can be repeated during all recognitions, or it can be a randomly chosen utterance
(e.g. a sequence of digits). The utterance can also serve as a password, so that it is known only by the target speaker and the system. In the case of text-independent speaker recognition, the speaker is recognized independently of the utterance. This is advantageous in more general cases, where it is not possible to force the speaker to utter an exact sequence of phonemes. However, text-independent systems are not as accurate as text-dependent ones and they usually require a lot of training data, which are not always available. Moreover, text-dependent systems can be made resistant against replay attacks when using randomly chosen utterances from a large set.
1.1 Low-Cost Text-Dependent Speaker Recognition
The state-of-the-art text-independent speaker recognition systems are usually based on GMM–UBM (Gaussian Mixture Model – Universal Background Model) [14] or SVM [2], [3]. A deeper overview of these systems can be found in [10]. In [12] we introduced text-dependent speaker recognition in low-cost biometric systems based on the voice imprint. Although the prices of memories and the computational burden have rapidly decreased, there are still some cases where low-memory and low-computation requirements are necessary. For example, in biometric systems based on sensor nets there can be dozens of sensors which are switched on from the standby state just for the purpose of recognition and then switched off again. These sensors are usually supplied by batteries, therefore the computational burden during recognition must be very low. Moreover, these sensors do not have big memories for the storage of large data. During the recognition procedure, the system used the classifiers DTW (Dynamic Time Warping), FD (Fractional Distances) and BDM (Biometric Dispersion Matcher). It has been shown that, using FD, the system fulfilled the requirements of a low-cost biometric system:
1. low memory needed for the speakers' models and the procedure of recognition,
2. training using just a limited number of samples (i.e. 2–3 samples lasting approx. 3 s),
3. fast identification/verification (less than 100 ms for a database which consists of approx. 50 speakers' models).
This research was focused on the improvement of the recognition accuracy. In [12] it was shown that, using the fractional distances along with the voice imprint, it is possible to reach a successful detection rate equal to 96.94 % and 82.29 % in the case of microphone mismatch. During verification, the system reached an equal error rate of 3.93 % and 10.43 % when assuming the microphone mismatch. This research shows that, using a different combination of classifiers, it is possible to reach better results. This paper is organized as follows. Section 2 describes the process of calculating the voice imprint used by FD and BDM. Section 3 mentions a well-known classifier used for text-dependent speaker recognition and introduces a new application of other classifiers which are usually applied in other fields of recognition (e.g. hand-writing or face recognition). Section 4 is devoted to the main experimental results and conclusions.
2 Voice Imprint
To describe the procedure of the voice imprint calculation, first consider the feature matrix Λ, where each column is related to a signal segment (20–30 ms) and the n-th row to the n-th feature. In this work the imprint was based on MFCC [15], LPCC [17], PLP [9], CMS [11], ACW [5] and combinations of these features. These features were also extended by 1st order and 2nd order regression coefficients. The disadvantage of the matrix Λ is that it has two dimensions, some coefficients in the matrix can be irrelevant, and the number of vectors in the matrix can vary for each speaker and sentence. If the DCT is applied to Λ in the horizontal direction, it concentrates the energy on a few coefficients and, moreover, if the original coefficients were already less correlated in the vertical direction, the energy will be concentrated in one corner of the matrix.¹ This process is illustrated in Fig. 1. Picture b) shows the matrix of LPCC coefficients and picture c) its DCT.
Fig. 1. Procedure of voice imprint calculation: a) spectrogram; b) matrix Λ of c_lpcc[n]; c) DCT{Λ}; d) voice imprint c_prnt[n] (f_s = 16 kHz, N_FFT = 2048, p = 20, Hamming window with size 20 ms and overlap 10 ms)
¹ This effect is similar to the calculation of the JPEG image format, where the two-dimensional DCT concentrates the energy in one corner of the image matrix.
To obtain a one-dimensional signal from the matrix Λ, the coefficients can be read from the matrix in different ways: zig-zag as in JPEG, by columns, or by rows. Fig. 1 d) shows an example of coefficients read by columns. The first DC coefficient is not used, because it usually carries no important information about the speaker. To obtain the same length of this one-dimensional signal for each speaker, it is simply multiplied by a rectangular window. At this point the voice imprint c_prnt[n] can be defined. In the rest of this work, we will consider the voice imprint as

c_prnt[n] = w[n] · r(DCT{Λ}),    (1)

w[n] = 1 for n = 0, 1, 2, ..., N_v − 1, and w[n] = 0 otherwise,    (2)

where the function r(M) represents the reading from the matrix M and N_v is the length of the voice imprint. If r(DCT{Λ}) is shorter than N_v, it should be padded by zeros. By the value of N_v we can also limit the number of important coefficients for speaker recognition.
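A sketch of the voice-imprint computation of Eqs. (1)–(2) is given below, under the stated assumptions: the DCT is taken along the time (horizontal) axis, the coefficients are read column by column, the first (DC) coefficient is dropped, and the result is truncated or zero-padded to N_v values.

```python
# Sketch only; parameter choices follow the description above, not released code.
import numpy as np
from scipy.fftpack import dct

def voice_imprint(Lambda, n_v=350):
    """Lambda: feature matrix, one column per 20-30 ms frame (rows = features)."""
    # DCT in the horizontal (time) direction compacts each feature track's energy.
    C = dct(Lambda, type=2, norm="ortho", axis=1)
    # Read the matrix column by column, Eq. (1); drop the first (DC) coefficient.
    coeffs = C.flatten(order="F")[1:]
    # Rectangular window w[n] of Eq. (2): keep N_v values, zero-pad if shorter.
    imprint = np.zeros(n_v)
    n = min(n_v, coeffs.size)
    imprint[:n] = coeffs[:n]
    return imprint

# e.g. a 20 x 180 LPCC matrix (20 coefficients, 180 frames) -> 350-point imprint
rng = np.random.default_rng(0)
print(voice_imprint(rng.standard_normal((20, 180))).shape)   # (350,)
```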
3 Classifiers
The proposed system uses just a limited number of training samples, i.e. 2–3 samples, and it is considered that these samples last approx. 3 s. Under this assumption it is not suitable to use statistical methods like GMM (Gaussian Mixture Models) [1], [14], [13], HMM (Hidden Markov Models) [1] or ANN (Artificial Neural Networks) [18], because these classifiers need a lot of training data to achieve a good performance. However, one statistical classifier does not have this disadvantage: it is BDM (Biometric Dispersion Matcher) and it is described in detail in sec. 3.2 [6], [7].
3.1 Template Matching Methods
The voice imprint is a 1D signal with a fixed number of coefficients, thus it is possible to use a template matching method during the classification. One representative of these methods is a classifier based on fractional distances (FD). This classifier was successfully tested on on-line signature recognition in [16], and in [12] it was shown that this classifier also worked well in the field of text-dependent speaker recognition. Assume that we have one input voice imprint c_Iprnt[n] and one reference voice imprint c_Rprnt[n], and that their lengths are the same and equal to N_v. Then the distance between these two imprints, d(c_Iprnt, c_Rprnt), is calculated according to the equation [16]

d(c_Iprnt, c_Rprnt) = ( \sum_{n=0}^{N_v − 1} |c_Iprnt[n] − c_Rprnt[n]|^k )^{1/k},    (3)
where k is recommended in [16] to be set around 0.4. This has the effect that also the distances between small coefficients contribute significantly to the final distance d(c_Iprnt, c_Rprnt). It is useful especially when these small values are important for speaker recognition, which is typical for the voice imprint. In this case the Euclidean distance is not suitable, because its k = 2. Another template matching method is DTW (Dynamic Time Warping) [8]. However, it is important to highlight that in this work DTW was used along with the feature matrix Λ, not with the voice imprint c_prnt[n].
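A minimal sketch of the fractional distance of Eq. (3) and of its use for identification follows; k is a design parameter (around 0.4 in [16], 0.5 in the experiments reported later), and the template dictionary is an assumed data structure, e.g. one mean imprint per enrolled speaker.

```python
import numpy as np

def fractional_distance(c_input, c_reference, k=0.5):
    """Fractional distance of Eq. (3); both imprints have the same length N_v."""
    diff = np.abs(np.asarray(c_input) - np.asarray(c_reference))
    return float(np.sum(diff ** k) ** (1.0 / k))

def identify(c_input, templates):
    """templates: dict mapping speaker id -> template imprint (assumed layout)."""
    return min(templates, key=lambda spk: fractional_distance(c_input, templates[spk]))
```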
3.2 Biometric Dispersion Matcher
In [6] J. Fàbregas and M. Faundez-Zanuy proposed a new classifier called the biometric dispersion matcher (BDM). This classifier can be used with advantage in biometric systems where just a few training samples per person exist. Instead of using one model per person, BDM trains a quadratic discriminant classifier (QDC) that distinguishes only between two classes: E (pairs of patterns corresponding to the same class) and U (pairs of patterns corresponding to different classes) [6]. Using BDM, it is possible to solve the simple dichotomy: "Do the two feature vectors belong to the same speaker?" Consider that c is the number of speakers, m is the number of samples taken from each speaker, x_ij is the j-th sample feature column vector of speaker i², p is the dimension of each feature vector x_ij, and δ ∈ R^p is the difference of two feature vectors; then the quadratic discriminant function g(δ) that solves the dichotomy can be described as [6]

g(δ) = (1/2) δ^T (S_U^{−1} − S_E^{−1}) δ + (1/2) ln(|S_U| / |S_E|),    (4)
where S_U and S_E are the covariance matrices corresponding to the classes E and U. The matrices can be calculated according to the formulas [6]

S_E = ξ (c − 1) \sum_{i=1}^{c} \sum_{j,l=1}^{m} (x_ij − x_il)(x_ij − x_il)^T,    (5)

S_U = ξ \sum_{i,k=1, i ≠ k}^{c} \sum_{j,l=1}^{m} (x_ij − x_kl)(x_ij − x_kl)^T,    (6)

ξ = 1 / (c m² (c − 1)).    (7)
If g(δ) ≥ 0, then δ ∈ E, which means that the two patterns (or voice imprints) belong to the same speaker; otherwise they belong to different speakers. The BDM has three important advantages:
1. By comparing the dispersion of the distributions of E and U, BDM performs feature selection. Only features with the quotient of the standard deviations σ_U/σ_E smaller than a fixed threshold are selected.
2. When a new speaker is added to the system, no new model has to be trained. Only two models are trained at the beginning; according to these models, we decide whether two patterns come from the same speaker or not.
3. Most verification systems set the threshold θ a posteriori in order to minimize the equal error rate R_EER. This is an unrealistic situation, because systems need to fix θ in advance. BDM has the threshold set a priori and is still comparable to state-of-the-art classifiers [6], [7].
² In our case, each vector can be represented by the voice imprint c_prnt[n].
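A sketch of BDM training and scoring following Eqs. (4)–(7) is given below. The array layout (one imprint of dimension p per speaker i and sample j, stacked into an array of shape (c, m, p)) is an assumption made for the example; the nested loops are written for clarity, not speed.

```python
import numpy as np

def train_bdm(x):
    """x: array of shape (c, m, p) with the j-th training imprint of speaker i."""
    c, m, p = x.shape
    xi = 1.0 / (c * m**2 * (c - 1))                      # Eq. (7)
    S_E = np.zeros((p, p))
    S_U = np.zeros((p, p))
    for i in range(c):                                   # same-speaker pairs, Eq. (5)
        for j in range(m):
            for l in range(m):
                d = x[i, j] - x[i, l]
                S_E += np.outer(d, d)
    S_E *= xi * (c - 1)
    for i in range(c):                                   # different-speaker pairs, Eq. (6)
        for k in range(c):
            if i == k:
                continue
            for j in range(m):
                for l in range(m):
                    d = x[i, j] - x[k, l]
                    S_U += np.outer(d, d)
    S_U *= xi
    return S_E, S_U

def bdm_score(delta, S_E, S_U):
    """Quadratic discriminant g(delta) of Eq. (4); g >= 0 means 'same speaker'."""
    inv_U, inv_E = np.linalg.inv(S_U), np.linalg.inv(S_E)
    _, logdet_U = np.linalg.slogdet(S_U)
    _, logdet_E = np.linalg.slogdet(S_E)
    return 0.5 * delta @ (inv_U - inv_E) @ delta + 0.5 * (logdet_U - logdet_E)
```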
4 Experimental Results
The text-dependent recognition system based on FD, BDM and DTW was tested using a corpus which consists of 48 bilingual speakers (24 males and 24 females) who were recorded in 4 sessions. The delay between the first three sessions is one week; the delay between the 3rd and 4th sessions is one month. The speakers uttered digits, sentences and text in the Spanish and Catalan languages. The speech signals were sampled at f_s = 16 kHz and recorded using three microphones: AKG C420, AKG D46S and SONY ECM 66B. Each speech sample is labelled M1 – M8; the meaning of these labels is described in Tab. 1.

Table 1. Notation of speech corpus
Lab.  Sess.  Microphone        Lab.  Sess.  Microphone
M1    1      AKG C420          M5    2      AKG D46S
M2    2      AKG C420          M6    3      SONY ECM 66B
M3    3      AKG C420          M7    4      AKG C420
M4    1      AKG D46S          M8    4      SONY ECM 66B
During the evaluation, the classifier was trained using three samples and was tested by the remaining one. There are four different possibilities of testing:
1. Training by sessions 2, 3, 4 and testing by session 1.
2. Training by sessions 1, 3, 4 and testing by session 2.
3. Training by sessions 1, 2, 4 and testing by session 3.
4. Training by sessions 1, 2, 3 and testing by session 4.
After testing there are thus 4 confusion matrices. According to these matrices, the successful detection rate R_S [%], the equal error rate R_EE [%] and the minimum of the detection cost function min(F_DCF) [%] were calculated. The whole testing procedure was divided into two scenarios. In the first scenario, SC1, only the samples (sessions) recorded by the microphone AKG C420 were selected. To evaluate the system in mismatched conditions, samples from 4 different sessions recorded by 3 different microphones (AKG C420, AKG D46S, SONY ECM 66B) were selected in the second scenario, SC2; two of these sessions were recorded by the AKG C420. It was decided not to use just 3 sessions recorded by 3 different microphones, because the probability that the actual signal is recorded by the same microphone as the reference signal is higher at the beginning of the use of the system.
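The evaluation metrics can be computed from the resulting distance matrices as in the sketch below (not the authors' code): the identification rate R_S follows from the closest enrolled model, and the equal error rate is approximated by a simple threshold sweep over the genuine and impostor distances.

```python
import numpy as np

def identification_rate(distances, true_ids):
    """distances[t, s]: distance of test sample t to the model of speaker s."""
    predicted = np.argmin(distances, axis=1)
    return float(np.mean(predicted == np.asarray(true_ids)))

def equal_error_rate(genuine, impostor):
    """Distances: genuine trials should be small, impostor trials large."""
    genuine = np.asarray(genuine, dtype=float)
    impostor = np.asarray(impostor, dtype=float)
    best_gap, eer = np.inf, 1.0
    for th in np.sort(np.concatenate([genuine, impostor])):
        far = np.mean(impostor <= th)        # impostors accepted
        frr = np.mean(genuine > th)          # genuine speakers rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2.0
    return float(eer)
```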
4.1 Settings of Classifiers and Features
The classifiers' settings were found empirically so that they provide good results, but all these settings, along with the suitable selection of features, affect the final results, and it is possible that other settings are better for the classification. For this purpose it would be better to use some kind of optimization or genetic algorithm. The settings used in this work are listed below:
– FD: the coefficient used for the calculation of the distance (see sec. 3.1) was k = 0.5; the first DC coefficient of c_prnt[n] was removed; one template voice imprint was calculated as the mean of the three training imprints.
– BDM: the first DC coefficient of c_prnt[n] was removed; the threshold used for the feature selection (see sec. 3.2) was set to the value 0.24.
– DTW: in the case of this classifier the matrices of features Λ were used.
Regarding the features, it was decided to use these representatives and their combinations: MFCC, PLP, LPCC, CMS, ACW, MFCC+LPCC+ACW. Each signal was trimmed using a VAD and consequently filtered by a 1st order highpass filter with α = 0.95. During the feature extraction, a Hamming window with size 25 ms (400 samples) and overlap 10 ms (160 samples) was used. In the case of BDM, the length of the voice imprint was N_v = 257 (the first DC coefficient was removed, thus effectively N_v = 256, which was better for calculation). In the case of FD with MFCC, PLP, LPCC, CMS and ACW the length of the voice imprint was set to N_v = 350, and in the case of FD with MFCC+LPCC+ACW to N_v = 1050.
4.2 Score Fusion
Higher recognition accuracy can be reached using score fusion. As written in sec. 4, after the test procedure there are 4 matrices with particular distances as an output of each classifier. If these distances are considered as scores, we can combine them into the final matrix D according to the equation

d_{i,j} = \sum_{p=1}^{P} a_p · c^p_norm · d^p_{i,j},    (8)
where d_{i,j} is the final distance in the i-th row and j-th column of D, d^p_{i,j} is the distance in the i-th row and j-th column of the p-th score matrix, P is the number of all score matrices, a_p ∈ [0, 1] is the weight of the p-th score matrix and c^p_norm is a normalization coefficient of the p-th score matrix. Each classifier can generate a different range of distances. For example, distances calculated by DTW can generally be more than a thousand times higher than in the case of FD. Therefore it is necessary to first normalize all values in the matrices, so that distances from the different classifiers have similar weights. This can be done using the normalization coefficient c^p_norm. If we collect the diagonals of all score matrices D^p generated by the same p-th classifier into one vector v_p, then c^p_norm can be calculated according to

c^p_norm = 1 / mean(v_p).    (9)
The coefficient a_p determines how much the p-th distance contributes to the final score. If there are many classifiers, then the best values of a_p can be found using some optimization method (e.g. hill climbing, genetic algorithms). If there are only 2 classifiers, then it is possible to use one coefficient a according to

d_{i,j} = a · c^1_norm · d^1_{i,j} + (1 − a) · c^2_norm · d^2_{i,j}.    (10)
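A sketch of this fusion scheme (Eqs. (8)–(10)) and of the 0.01-step sweep over a described next is given below; the evaluate callback (e.g. the identification rate on a development set) is an assumption of the example, not part of the original method.

```python
import numpy as np

def normalisation_coefficient(score_matrices):
    """c_norm^p = 1 / mean of the diagonals of classifier p's matrices, Eq. (9)."""
    v_p = np.concatenate([np.diag(D) for D in score_matrices])
    return 1.0 / np.mean(v_p)

def fuse_two(D1, D2, a, c1, c2):
    """Weighted combination of two normalised score matrices, Eq. (10)."""
    return a * c1 * D1 + (1.0 - a) * c2 * D2

def best_weight(D1_list, D2_list, evaluate):
    """Sweep a in steps of 0.01 and return the weight with the best evaluate()."""
    c1, c2 = normalisation_coefficient(D1_list), normalisation_coefficient(D2_list)
    best_a, best_val = 0.0, -np.inf
    for a in np.arange(0.0, 1.0 + 1e-9, 0.01):
        val = np.mean([evaluate(fuse_two(D1, D2, a, c1, c2))
                       for D1, D2 in zip(D1_list, D2_list)])
        if val > best_val:
            best_a, best_val = a, val
    return best_a, best_val
```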
By changing the value of a from 0 to 1 with a step of 0.01, it is possible to find the best combination.

Table 2. System performance in both scenarios
Cl.   Features              (SC1) RS [%]  REE [%]  FDCF [%]   (SC2) RS [%]  REE [%]  FDCF [%]
FD    MFCC+Δ+Δ2             93.37   7.14   5.82              81.77  15.10  12.77
FD    PLP+Δ+Δ2              91.33   8.16   6.53              72.92  16.67  14.23
FD    LPCC+Δ+Δ2             94.90   5.10   4.35              77.08  13.54  12.32
FD    CMS+Δ+Δ2              89.80   8.16   7.54              80.21  14.06  12.27
FD    ACW+Δ+Δ2              96.94   4.08   3.93              81.77  10.94  10.43
FD    MFCC+LPCC+ACW+Δ+Δ2    96.43   5.61   4.62              82.29  13.54  11.78
BDM   MFCC+Δ                86.22   8.67   7.33              58.85  25.00  17.98
BDM   PLP+Δ                 86.73   7.14   7.08              53.65  25.52  17.92
BDM   LPCC+Δ                83.16   8.67   8.18              38.02  34.38  26.22
BDM   CMS+Δ                 58.16  25.51  18.45              35.42  37.50  27.38
BDM   ACW+Δ                 84.69   7.65   7.23              37.50  35.94  25.10
BDM   MFCC+LPCC+ACW         62.76  27.55  19.03              66.15  17.19  16.37
DTW   MFCC+Δ+Δ2             96.94   4.08   3.82              82.81  11.46  10.40
DTW   PLP+Δ+Δ2              92.86   8.67   5.71              76.04  13.54  11.73
DTW   LPCC+Δ+Δ2             96.94   3.06   2.52              85.94  12.50  10.77
DTW   CMS+Δ+Δ2              97.96   4.08   3.14              91.67   6.77   6.56
DTW   ACW+Δ+Δ2              98.47   2.55   1.95              89.06   9.90   9.42
DTW   MFCC+LPCC+ACW+Δ+Δ2    96.94   4.08   3.66              82.81  11.46  10.07
The values of FDCF are considered as the minimum of this function.
Table 3. Score fusion using the equal weights
Cl.          Features(a)      (SC1) RS [%]  REE [%]  FDCF(b) [%]   (SC2) RS [%]  REE [%]  FDCF [%]
BDM–DTW      MFCC             96.94   4.08   3.80                  82.81  11.46  10.39
BDM–DTW      PLP              92.86   8.67   5.71                  76.04  13.54  11.72
BDM–DTW      LPCC             96.94   3.06   2.52                  85.94  12.50  10.77
BDM–DTW      CMS              97.96   4.08   3.14                  91.67   6.77   6.56
BDM–DTW      ACW              98.47   2.55   1.96                  89.06   9.90   9.41
BDM–DTW      MFCC+LPCC+ACW    77.04  18.37  11.80                  82.81  11.46  10.11
DTW–FD       MFCC             97.45   4.08   3.84                  83.85  11.46  10.16
DTW–FD       PLP              96.94   6.12   4.88                  81.25  11.98  11.11
DTW–FD       LPCC             96.94   3.57   2.76                  85.42  10.42   9.71
DTW–FD       CMS              96.94   4.08   4.02                  89.06   9.90   8.23
DTW–FD       ACW              98.47   4.08   3.01                  89.06   9.90   8.58
DTW–FD       MFCC+LPCC+ACW    98.47   4.59   3.07                  85.42  10.42   9.77
FD–BDM       MFCC             93.37   7.14   5.81                  81.77  15.10  12.77
FD–BDM       PLP              91.33   8.16   6.52                  72.92  16.67  14.23
FD–BDM       LPCC             94.90   5.10   4.35                  77.08  13.54  12.32
FD–BDM       CMS              89.80   8.16   7.54                  80.21  14.06  12.27
FD–BDM       ACW              96.94   4.08   3.93                  81.77  10.94  10.43
FD–BDM       MFCC+LPCC+ACW    76.02  17.86  11.81                  82.29  13.54  11.78
FD–BDM–DTW   MFCC             97.45   4.08   3.84                  83.85  11.46  10.15
FD–BDM–DTW   PLP              96.94   6.12   4.88                  81.25  11.98  11.11
FD–BDM–DTW   LPCC             96.94   3.57   2.76                  85.42  10.42   9.71
FD–BDM–DTW   CMS              96.94   4.08   4.02                  89.06   9.90   8.23
FD–BDM–DTW   ACW              98.47   4.08   3.01                  89.06   9.90   8.58
FD–BDM–DTW   MFCC+LPCC+ACW    77.55  14.80  10.72                  85.42  10.42   9.77
(a) The features were extended by 1st and 2nd order regression coefficients.
(b) The values of FDCF are considered as the minimum of this function.

4.3 System Evaluation
Tab. 2 shows the results of identification and verification when only one classifier was used. Tab. 3 provides the results for the combinations BDM–DTW, DTW–FD, FD–BDM and FD–BDM–DTW; in this case all coefficients a_p were set to 1. Comparing the values in both tables, it is clear that BDM does not provide any improvement. Only DTW–FD can increase the accuracy, therefore the next part of the research focuses on this combination. Different values of the coefficient a were also used. Some characteristics showing the change of R_S, R_EE and min(F_DCF) depending on the weight a can be seen in Fig. 2 (equation (10) was used during the score fusion calculation). Fig. 2 takes into account only the score fusion of distances calculated using the same features, but it is also possible to take the first distance from DTW, based on one feature (e.g. MFCC), and the second distance from FD, based on another feature (e.g. ACW).
Fig. 2. Change of characteristics depending on weight a (combination DTW–FD): a) RS in SC1; b) RS in SC2; c) REE in SC1; d) REE in SC2; e) min(FDCF) in SC1; f) min(FDCF) in SC2 (curves for MFCC, PLP, LPCC, CMS, ACW and MFCC+LPCC+ACW)
The results of R_S, R_EE and min(F_DCF) for different feature combinations can be found in Tab. 4. If Tab. 2 is compared with Tab. 4, it can be seen that, using the score fusion of DTW and FD, it is possible to increase R_S from 98.47 % to 98.98 % and decrease min(F_DCF) from 1.95 % to 1.90 % in SC1. In the case of SC2, it is possible to increase R_S from 91.67 % to 92.19 % and decrease min(F_DCF) from 6.56 % to 6.36 %.
Table 4. Best results of score fusion using different features and different values of weight a (these results are obtained by combination of DTW and FD). The values of RS , REE and FDCF are expressed in % Feat. comb.a
RS MFCC–MFCC 97.45 MFCC–PLP 97.96 MFCC–LPCC 98.47 MFCC–CMS 98.47 MFCC–ACW 98.47 MFCC–MFCC+LPCC+ACW 98.47 PLP–MFCC 97.45 PLP–PLP 96.94 PLP–LPCC 97.96 PLP–CMS 96.94 PLP–ACW 97.45 PLP–MFCC+LPCC+ACW 97.96 LPCC–MFCC 97.96 LPCC–PLP 97.45 LPCC–LPCC 97.45 LPCC–CMS 97.96 LPCC–ACW 97.45 LPCC–MFCC+LPCC+ACW 97.45 CMS–MFCC 97.96 CMS–PLP 97.96 CMS–LPCC 97.96 CMS–CMS 97.96 CMS–ACW 98.47 CMS–MFCC+LPCC+ACW 97.96 ACW–MFCC 98.47 ACW–PLP 98.47 ACW–LPCC 98.98 ACW–CMS 98.98 ACW–ACW 98.98 ACW–MFCC+LPCC+ACW 98.47 MFCC+LPCC+ACW–MFCC 97.96 MFCC+LPCC+ACW–PLP 97.96 MFCC+LPCC+ACW–LPCC 98.47 MFCC+LPCC+ACW–CMS 98.47 MFCC+LPCC+ACW–ACW 98.47 MFCC+LPCC+ACW– 98.47 MFCC+LPCC+ACW a b
5
a 0.55 0.69 0.62 0.68 0.70 0.58 0.54 0.52 0.54 0.68 0.32 0.42 0.41 0.67 0.34 0.48 0.37 0.83 0.59 0.98 0.63 0.69 0.71 0.91 0.82 1.00 0.93 0.87 0.86 0.81 0.63 0.71 0.62 0.70 0.72 0.56
(SC1) REE a 4.08 0.01 4.08 0.01 3.57 0.00 3.57 0.02 3.57 0.26 4.08 0.05 5.10 1.00 5.10 1.00 3.57 1.00 5.10 1.00 3.57 1.00 4.59 1.00 2.55 0.01 2.55 0.01 2.55 0.00 3.06 0.02 2.55 0.05 2.55 0.05 3.06 0.01 3.06 0.01 3.06 0.00 3.06 0.02 2.55 0.11 3.06 0.06 2.55 0.01 2.55 0.01 2.55 0.00 2.55 0.02 2.55 0.26 2.55 0.02 4.08 0.01 4.08 0.01 3.57 0.00 3.57 0.02 3.57 0.50 4.08 0.05
FDCF b a 3.24 0.00 3.44 0.01 3.24 0.00 3.39 0.00 2.76 0.00 3.10 0.00 4.28 0.00 4.85 0.01 3.39 1.00 4.45 0.00 3.05 1.00 3.75 1.00 2.10 0.00 2.29 0.00 2.37 0.00 2.33 0.00 2.10 0.00 2.18 0.00 2.57 0.00 2.63 0.03 2.47 0.00 2.81 0.00 2.47 0.00 2.34 0.00 1.95 0.00 1.95 0.00 1.94 0.00 1.90 0.00 1.95 0.00 1.95 0.00 3.17 0.00 3.37 0.01 3.11 0.00 3.26 0.00 2.73 0.00 3.07 0.00
RS 86.46 85.42 85.94 84.90 86.46 88.02 83.33 82.29 84.38 84.90 86.98 84.90 87.50 87.50 86.46 86.98 86.46 87.50 92.19 91.67 91.67 92.19 92.19 92.19 90.10 89.58 89.58 90.10 90.10 89.58 86.46 85.42 85.94 85.42 86.98 88.02
a 0.58 0.61 0.57 0.51 0.49 0.58 0.49 0.57 0.46 0.55 0.41 0.51 0.64 0.71 0.75 0.73 0.61 0.57 0.59 0.94 0.93 0.85 0.62 0.76 0.77 0.83 0.80 0.79 0.75 0.81 0.58 0.65 0.64 0.50 0.31 0.60
(SC2) REE a 10.42 0.04 10.94 0.03 10.42 0.05 10.42 0.00 9.38 0.96 10.42 0.01 10.94 0.04 11.46 0.03 10.94 0.06 10.94 0.00 9.38 1.00 10.94 0.50 10.42 0.04 10.42 0.04 9.90 0.07 10.42 0.00 8.85 1.00 10.42 0.00 6.77 0.03 6.77 0.04 6.77 0.03 6.77 0.03 6.77 0.02 6.77 0.01 9.38 0.02 9.38 0.03 9.38 0.07 8.85 0.00 8.85 0.17 8.85 0.00 10.42 0.04 10.94 0.03 10.42 0.05 10.42 0.00 9.38 0.97 9.90 0.00
FDCF 9.88 10.01 9.34 9.17 8.37 9.57 9.62 10.49 9.38 8.93 8.65 9.47 9.42 9.70 9.26 8.58 8.47 9.21 6.42 6.54 6.52 6.49 6.36 6.55 8.17 8.51 8.50 7.83 8.11 8.32 9.93 9.96 9.38 9.16 8.37 9.62
a 0.02 0.00 0.00 0.00 0.06 0.02 0.00 0.00 0.00 0.00 0.98 0.50 0.00 0.00 0.00 0.00 1.00 0.05 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.04 0.02 0.00 0.00 0.00 0.01 0.03
The features were extended by 1st and 2nd order regression coefficients. The values of FDCF are considered as minimum of this function.
Conclusion
According to the results in sec. 4.3, it is shown that it is possible to improve the results of a text-dependent speaker recognition system using score fusion. The system can use the classifiers DTW, BDM and FD, but in the case of BDM there are no improvements. In the case of the combination DTW–FD using the same weights, it is possible to increase the accuracy (e.g. using PLP features). Nevertheless, to find the best score fusion, it is necessary to use different weights and different feature combinations. Tab. 4 shows that, using the score fusion of DTW and FD (when the features ACW and CMS are selected), it is possible to reach a successful detection rate equal to 98.98 % and 92.19 % in the case of microphone mismatch.
During verification, the system reached an equal error rate of 2.55 % and 6.77 % when assuming the microphone mismatch. Although it is possible to increase the identification accuracy of the system using score fusion, in the case of verification the improvement was not significant. Considering recording by the same microphone, it is probably difficult to reach better results, because the system already provided good results. In the case of microphone mismatch, better results can probably be obtained using more robust features. It is also possible to use more sophisticated methods to find suitable coefficients a_p and c^p_norm (e.g. genetic algorithms). For future work it is proposed to extend the segmental features by suprasegmental features like the fundamental frequency. The first three formants and features based on frequency tracking can also be used [4]. All these features can improve the accuracy of classification. The identification using DTW is, in comparison to FD, very slow, nearly a thousand times slower. Moreover, this time increases with the increasing number of speakers and samples in the database. On the other hand, DTW provides better results. Therefore it would be good to find a compromise between the accuracy and the identification time. For example, FD can very quickly find the first 10 candidates, and these candidates can consequently be processed by DTW. An extension to FD is also proposed: the distances can be calculated from the output of a floating window which returns the max, min, mean or standard deviation of the series selected by this window. As was already mentioned in sec. 4.1, the classifiers' settings were found empirically, but it is possible that better options exist. For this purpose, it would be good to apply some kind of optimization. Acknowledgments. This research has been supported by Project KONTAKTME 10123 (Research of Digital Image and Image Sequence Processing Algorithms), Project SIX (CZ.1.05/2.1.00/03.0072), Project VG20102014033, and projects MICINN and FEDER TEC2009-14123-C04-04.
References 1. BenZeghiba, M.F., Bourlard, H.: User-customized Password Speaker Verification Using Multiple Reference and Background Models. Speech Communication 8, 1200–1213 (2006), iDIAP-RR 04-41 2. Campbell, W.M., Campbell, J.P., Reynolds, D.A., Singer, E., Torres-Carrasquillo, P.A.: Support Vector Machines for Speaker and Language Recognition. Computer Speech & Language 20(2-3), 210–229 (2006), http://www.sciencedirect.com/ science/article/B6WCW-4GSSP9F-1/2/4aaea6467cc61ee4919a9b1c953316b1, odyssey 2004: The speaker and Language Recognition Workshop - Odyssey-04 3. Campbell, W.M., Sturim, D.E., Reynolds, D.A.: Support Vector Machines Using GMM Supervectors for Speaker Verification. IEEE Signal Processing Letters 13(5), 308–311 (2006)
4. Das., A., Chittaranjan, G., Srinivasan, V.: Text-dependent Speaker Recognition by Compressed Feature-dynamics Derived from Sinusoidal Representation of Speech. In: 16th European Signal Processing Conference (EUSIPCO 2008), Lausanne, Switzerland (2008) 5. Davis, S., Mermelstein, P.: Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences. IEEE Transactions on Acoustics, Speech and Signal Processing 28(4), 357–366 (1980) 6. F` abregas, J., Faundez-Zanuy, M.: Biometric Dispersion Matcher. Pattern Recogn. 41, 3412–3426 (2008), http://portal.acm.org/citation.cfm?id=1399656.1399907 7. F` abregas, J., Faundez-Zanuy, M.: Biometric Dispersion Matcher Versus LDA. Pattern Recogn. 42, 1816–1823 (2009), http://portal.acm.org/citation.cfm?id=1542560.1542866 8. Furui, S.: Cepstral Analysis Technique for Automatic Speaker Verification. IEEE Transactions on Acoustics, Speech and Signal Processing 29(2), 254–272 (1981) 9. Hermansky, H.: Perceptual Linear Predictive (PLP) Analysis of Speech. The Journal of the Acoustical Society of America 87(4), 1738–1752 (1990), http://link.aip.org/link/?JAS/87/1738/1 10. Kinnunen, T., Li, H.: An Overview of Text-independent Speaker Recognition: From Features to Supervectors. Speech Communication 52(1), 12–40 (2010), http://www.sciencedirect.com/science/article/B6V1C-4X4Y22C-1/2/ 7926da351ef5c650f2a1a37adcd839a1 11. Mammone, R.J., Zhang, X., Ramachandran, R.P.: Robust Speaker Recognition: a Feature-based Approach. IEEE Signal Processing Magazine 13(5), 58 (1996) 12. Mekyska, J., Faundez-Zanuy, M., Smekal, Z., F` abregas, J.: Text-dependent Speaker Recognition in Low-cost Systems. In: 6th International Conference on Teleinformatics, Dolni Morava, Czech Republic, pp. 154–158 (2011) 13. Reynolds, D.A.: Speaker Identification and Verification Using Gaussian Mixture Speaker Models. Speech Commun. 17, 91–108 (1995), http://portal.acm.org/ citation.cfm?id=211311.211317 14. Reynolds, D.A., Quatieri, T.F., Dunn, R.B.: Speaker Verification Using Adapted Gaussian Mixture Models. Digital Signal Processing (2000) 15. Swanson, A.L., Ramachandran, R.P., Chin, S.H.: Fast Adaptive Component Weighted Cepstrum Pole Filtering for Speaker Identification. In: Proceedings of the 2004 International Symposium on Circuits and Systems, ISCAS 2004, vol. 5, pp. 612–615 (May 2004) 16. Vivaracho-Pascual, C., Faundez-Zanuy, M., Pascual, J.M.: An Efficient Low Cost Approach for On-line Signature Recognition Based on Length Normalization and Fractional Distances. Pattern Recogn. 42, 183–193 (2009), http://portal.acm. org/citation.cfm?id=1412761.1413027 17. Wong, E., Sridharan, S.: Comparison of Linear Prediction Cepstrum Coefficients and Mel-requency Cepstrum Coefficients for Language Identification. In: Proceedings of 2001 International Symposium on Intelligent Multimedia, Video and Speech Processing, pp. 95–98 (2001) 18. Yegnanarayana, B., Kishore, S.P.: AANN: an Alternative to GMM for Pattern Recognition. Neural Networks 15(3), 459–469 (2002), http://www.sciencedirect.com/science/article/B6T08-459952R-2/2/ a53c123eaecb7ccb7b50baec88885192
Developing Multimodal Web Interfaces by Encapsulating Their Content and Functionality within a Multimodal Shell
Izidor Mlakar1 and Matej Rojc2
1 Roboti c.s. d.o.o, Tržaška cesta 23, Slovenia
[email protected]
2 Faculty of Electrical Engineering and Computer Science, University of Maribor, Smetanova ulica 17, Slovenia
[email protected]
Abstract. Web applications are a widely spread and widely used concept for presenting information. Their underlying architecture and standards, in many cases, limit their presentation/control capabilities to showing pre-recorded audio/video sequences. Highly dynamic text content, for instance, can only be displayed in its native form (as part of the HTML content). This paper provides concepts and answers that enable the transformation of dynamic web-based content into multimodal sequences generated by different multimodal services. Based on the encapsulation of the content into a multimodal shell, any text-based data can be transformed, dynamically and at interactive speeds, into multimodal visually-synthesized speech. Techniques for the integration of multimodal input (e.g. visioning and speech recognition) are also included. The concept of multimodality relies on mashup approaches rather than traditional integration. It can, therefore, extend any type of web-based solution transparently, with no major changes to either the multimodal services or the enhanced web application. Keywords: multimodal interfaces, multimodality, multimodal shell, web multimodal, ECA-based speech synthesis.
1 Introduction
Multimodal user interfaces offer several communication channels for use when interacting with the machine. These channels are usually categorized as input- and output-based. Multimodal applications running within the traditionally resource-limited, operating-system (OS)- and device-dependent pervasive computing environments started to transit to the web environment with the W3C's presentation of the multimodal web and the standardisation of the Extensible Multimodal Annotation markup language (EMMA) [1]. The web environment represents device- and operating-system-independent software architectures with almost resource-unlimited computational spaces. Due to the advances in the physical infrastructures of wide-area networks (e.g. stability of links, increased up/down link bandwidth, etc.), the intensive data exchange between services also became a solvable problem. Different distributive environments
started to form. These environments enable resource sharing, and increase the operability of traditional resource-heavy input/output processing technologies. Although such environments enable relatively device/OS independent multimodal services, the user interfaces (running on selected devices) still depend on basic device capabilities and therefore need to be adjusted, or even separately recoded for each device type. Web applications on the other hand provide easily and with HTML 5 even self-adjustable user interface that can run on any device that supports a web browser. Browser enabled-devices range from desktop to fast emerging mobile devices (e.g. smart-phones, iPhone, PDAs, etc.) that enable accessing the web content virtually from anywhere, anytime, and are becoming more and more popular every day. As a consequence, web services and applications have evolved and range from simple presentation pages to complex social networks, e-commerce, e-learning, and other business applications (e.g. B2B, CMS etc.). Although web applications offer literally limitless data and pattern gathering fields that can be extensively used with the aim of generating efficient personalized multimodal human-machine interaction interfaces, they can still prevent users from experiencing their full potential. Such cases usually refer either to the complexity of the web page, or the imparity of the user (e.g. modality mismatches for blind, or deaf users). Traditional assisting tools such as WebSpeak [2] and BrookesTalk [3] text-to-speech (providing an aural overview of Web content for the user) have been developed due to the fact that using speech is the most natural way of communication. In addition, interfaces have also been developed that rely on visual cues such as [5]. Speech technologies have the ability to further expand the convenience and accessibility of services, as well as lower the complexity of traditional unimodal GUI-based web interfaces [6]. State-of-the-art web applications, such as iGoogle1 MyYahoo2, Facebook3 offer, in essence, at least some personalization plug-ins (e.g. in the form o user-defined styles and content), yet in the context of natural human-machine interaction and personification still only offer a small number of multimodal interactive channels, usually limited to keyboard, mouse, touch-screen, and sometimes text-to-speech-synthesis and speech recognition. This paper presents a multimodal web-based user interface BQ-portal, which was developed on a flexible and distributive multimodal web platform. The BQ-portal is, in essence, a stand-alone web application, acting as an information kiosk for students. The application and core frameworks it is built upon, address the multiple-modality issues by enhancing the presented content via speech and non-speech related technologies (text-to-speech and visual speech synthesis using embodied conversational agents). Currently, the web-based user interface offers several communication channels that users can use, ranging from traditional keyboard/mouse/touch-screen setups, to complex content presentation using embodied conversational agents (ECAs). In the presented work ECA’s are regarded as “one-way” presentation channel. Their interaction capabilities are, therefore, limited to the presentation of the web-content. However, the architectural concepts presented in this paper, and their distributive and 1
1 iGoogle: http://www.google.com/ig
2 myYahoo: http://my.yahoo.com/
3 http://www.facebook.com/
service-oriented nature, will allow flexible implementation and integration of different Interactive Communication Management (ICM) and Own Communication Management (OCM) based tactics for simulating "human-like" interaction using reactive ECAs. The paper is structured as follows. First, it addresses related work on developing multimodal web interfaces. Section 3 presents a general concept of fusing existing web-based interfaces with different multimodal services. Section 4 presents the development of the multimodal web application BQ-portal, built on this novel fusion concept. The paper then presents an implementation of two mashup-based multimodal web services (an RSS feed reader and a language translator) that can be used by students while browsing the content of the BQ-portal web application. Finally, the paper concludes with a short discussion of future work, and the conclusion.
2 Multimodal Web-Based Interfaces – Related Works
As far as the field of multimodal research is concerned, new concepts and ideas emerge almost every day. Yet most of them are still task-oriented and placed within relatively closed user environments. Such, usually unimodal, concepts do provide vast insights into how human-human interaction works and also how man-machine interaction should work, but their usage and extension usually remain limited. Services extracted from different multimodal concepts and closed multimodal applications are limited to the usage of a few isolated modalities, such as text-to-speech synthesis, speech recognition, etc. Each service has its own underlying communication protocol. Therefore, it seems only natural to provide a common platform within which web and non-web based services could interact by using a set of compatible protocols. Web-based user interfaces allow a generalization of different services and device-independent implementations. Such interfaces implement standard communication protocols, such as HTTP, RTP, and TCP/IP (depending on the data). These protocols are understandable to most of the existing web browsers. Furthermore, with HTML and cascading style sheets (CSSs), web-based user interfaces also provide a means of interface adaptation to both the user and the device context. MONA [7] can be regarded as one of the environments for the development of multimodal web applications. MONA focuses mainly on middleware for providing complete multimodal web-based solutions running on diverse mobile clients (e.g. PDAs, smart phones, mobile phones, iPhone, etc.). The MONA concept involves a presentation server for mobile devices in different mobile phone networks. MONA's framework transforms a single multimodal web-based user interface (MWUI) specification into a device-adaptable multimodal web user interface. The speech (the main multimodal channel) data is transferred using either a circuit- or packet-switched voice connection. In addition to MONA, other researchers have also worked on providing multimodal interfaces in resource-limited pervasive environments, e.g. [8], [9]. Furthermore, the TERESA [10] and ICARE [11] projects reach beyond the unimodal world of web services. TERESA provides an authoring tool for the development of multimodal interfaces within multi-device environments. It automatically produces combined XHTML and VoiceXML multimodal web interfaces that are specific to a targeted
multimodal platform. ICARE provides a component-based approach for the development of web-based multimodal interfaces. The SmartWeb [12] is an extension project of the SmartKom (www.smartkom.org) project, one of the multimodal dialog systems that combine speech and gestures with facial expressions for both input and output. SmartWeb provides a context-aware user interface, and can support users in different roles. It is based on the W3C standards for the semantic web, the Resource Description Framework (RDF/S), and the Web Ontology Language (OWL). As fully operable examples of SmartWeb usage, researchers have provided a personal guide for the 2006 FIFA World Cup in Germany, and a P2P-based communication between a car and a motor bike. In contrast to complete frameworks for the development of web-based multimodal solutions, Microsoft and IBM provide toolkits, multimodal APIs, such as: Speech Application Language Tags – SALT4, and WebSphere Voice Toolkit from IBM5. These APIs contain several libraries for the development of multimodal services, and their integration within different web-applications. A more detailed study on the subject of multimodal web applications can be found in [13]. Most of these frameworks are directed towards ubiquitous computing, and provide approaches for development of new multimodal web applications, rather, than enabling the empowerment of existing web applications with different multimodal services and solutions. The BQ-portal web-based interface development, presented in this paper, follows the idea of implementing web browser-based solutions as multimodal interfaces. It is based on the distributive DATA platform [14] and the multimodal web platform (MWP) [15] concepts. The main purpose of these concepts is to allow as flexible fusion of different multimodal technologies and services provided by already developed non-multimodal web applications, as possible. The basic idea of BQ-portal web-based interface development and its infrastructure for deploying speech technologies (speech recognition and text-to-speech synthesis) has already been outlined in [15]. This paper, however, presents an in-depth presentation of the BQ-portal web-based user interface development, including fusion with advanced technologies needed for the visual synthesis of human-like behaviour characteristics. The BQ-portal web application does not adapt/transform the core technologies (such as ECA-service, or text-to-speech service). It is based on the idea of providing an extension to core technologies used within the existing web applications (similar as in [16]). The concept of BQ-portal application, therefore, strives to enhance different web-based services with different non-web based technologies. The fusion concept is based on a mashup approach and browser cross-domain communication, and does not require any major concept/functional/code/architectural changes to either the web or the non-web based services. By exploiting the features of web-based solutions (HTML 5.0, CSS, java-script) such an interface also enables a device-independent provision ranging from a common desktop computer to browser-enabled mobile devices. The multimodal web-based interfaces provided by such fusion enable the usage of different multimodal technologies in a transparent manner, regardless of their modality. Multimodal fusion concept is also quite general and can be extended to virtually any existing web-based solution. 4 5
3 Multimodal Web Application – General Concept
Let us assume that users can access web applications within different computer environments using different web-based browser interfaces. Normally, these web applications (with rich content) are based on GUIs and support user-machine interaction using traditional input/output devices (e.g. mouse, keyboard, touch screen, etc.). More advanced user interfaces should be developed and provided in order to integrate and enable additional modalities for human-machine interaction, and to further improve the user experience. Such a concept, which enables flexible integration of multimodal technologies into general web applications, is introduced in Figure 1. The concept presented in Figure 1 suggests the fusion of general web applications with several multimodal extensions, and the formation of a platform-independent multimodal service framework.
Fig. 1. Multimodal web-platform concept
On the one hand, we have core multimodal technologies and multimodal services; on the other hand, a web application providing a user interface, content, and several services that should/could be extended with additional input/output modalities. The two sides generally run on different platforms and use different communication protocols. It is assumed that web-based services use HTTP (or any other web-service protocol), and that the non-web services interact under a general, unified protocol understandable to all user devices. It is also assumed that both protocols are unrelated and cannot simply be merged. In order to fuse web and non-web based services in a flexible way, an intermediate object (a multimodal shell) is introduced. Such a multimodal shell allows both types of technologies to retain their base rules whilst, at the same time, complementing each other. Any multimodal web-based application can be viewed as a set of four relatively independent objects. The multimodal services (the first object) represent several input/output-based core technologies (text-to-speech synthesis, visual ECA synthesis, etc.). The second object represents a general web application (e.g. e-learning, e-business, an information kiosk, etc.) providing the core user interface, content, and core functionality. The third object represents a set of different browser-based user interfaces that can use web-based
services. And finally, the fourth object is a multimodal shell, an intermediate object that understands both the multimodal services and the web application, receives user requests, and responds to them (e.g. voice browsing). The implementation of the multimodal shell, presented in the following section, involves IFRAME-based cross-domain interaction. By using the presented multimodal shell (which understands both types of service protocols), general web services (or their results) can be flexibly transformed into multimodal services (or their results).

3.1 Multimodal Shell
The multimodal shell is an intermediate object that can be used for the transformation of web-based content into multimodal content, and for presenting such fused content according to the user/device context. Figure 2 presents the functional architecture of such an intermediate object.

Fig. 2. Multimodal shell's functional architecture

It is basically an application wrapper, implemented as a device-independent web application without any visual components. The transformation from web-based to multimodal-based and back to web-compatible data is implemented by using the concept of encapsulation. The core user interface is stored within an IFRAME of the multimodal shell's interface. Since it is assumed that the two web applications (the core application and the multimodal shell) are of different origins (different domains), a cross-domain API is integrated into both interfaces. This cross-domain API then handles any data transfer from the core application towards the multimodal services (e.g. data for text-to-speech synthesis), or data transfer from the multimodal services towards the web application (e.g. additional control options, such as speech recognition). In addition to data transformation, the multimodal shell also handles the presentation of multimodal data (e.g. playing the audio/video stream of the synthesized data), and the collection of device properties. The fusion of web and
non-web content, performed by the multimodal shell, is based on three components: the Web-platform interface, the Multimodal core, and the Multimodal user front-end. The Multimodal core component represents a group of non-web services that work under a unified protocol. These core services can be accessed by different applications in a general way, regardless of the web application context.

The Web-platform interface component. This component handles the adaptation of multimodal services to the user and device contexts. This software module registers the user interface with the Multimodal core and decides which core services to allow for each user interface and in what form these services should respond. The component also normalizes the web-based data (e.g. HTML-based textual information) to the input format required by the core. In the context of web-based services, this component additionally implements connections to different external, web-based services (e.g. a weather service, RSS feeds, etc.). The main task of this component is, therefore, to serve different context-dependent requests and to propagate different context-oriented responses. To achieve all these tasks, the component further implements low-level components used for processes such as the registration and adaptation of user interfaces and user services, web and non-web based data exchange, data normalization, protocol unification, etc. The Interface services component describes a set of services that are used as an access point to different external services (e.g. speech synthesis, RSS feeds, weather services, a tourist info service, etc.), and for data and protocol transformation and normalization (HTML/XML-based data to raw text, high-level HTML-based requests to low-level sequences of command packets, etc.). For instance, in the case of RSS feeds, this component forms an HTTP connection to the RSS feed provider, transforms its data into raw text, and redirects it to the different core services, e.g. the TTS service. The Device manager and Service manager components are used for the registration and adaptation of user interfaces and user services through the processing of different contextual information. The minimal context information that most user devices can provide includes: input/output capabilities (type, display size, presence of audio/video input and output devices, etc.), network interface, web-browser type, the preferred video stream player, etc. This information can be gained through services such as the Resource Description Framework (RDF) and different JavaScript-based client libraries that can be integrated within the Device manager component, or within the Multimodal shell's front-end component. Such contextual information is then used to model different multimodal services. For instance, the type of web browser, the screen dimensions, and the type of network connection define the availability of the embodied conversational agent, as well as the video stream properties (e.g. encoding and maximal video size), the video stream player, etc. The Service manager component also obtains/holds web properties, information about the web content, and its structure. The web properties include, among others, service descriptions and the HTML structure of web pages. The Dialog manager component models human-machine interaction and, by using the device, service, and user context, enables/disables the multimodal services. It also defines the communication paths for different user requests (e.g. text for visual synthesis). This component also transforms (normalizes) any input data in order to meet the requirements of each non-web based service.
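Much of the context information listed above can be gathered on the client side. The following TypeScript sketch shows one way to collect it in the browser; the property names, the single context object, and the idea of reporting it to the Device manager are illustrative assumptions, not the actual BQ-portal interface.

// Hedged sketch of client-side device-context collection.
// Field names and structure are assumptions for illustration only.
interface DeviceContext {
  userAgent: string;                          // web-browser type
  screen: { width: number; height: number };  // display size
  hasAudioOutput: boolean;
  hasVideoInput: boolean;
  connection?: string;                        // network type, where the browser exposes it
}

async function collectDeviceContext(): Promise<DeviceContext> {
  // enumerateDevices() is only available in secure contexts; guard accordingly
  const devices = navigator.mediaDevices
    ? await navigator.mediaDevices.enumerateDevices()
    : [];
  return {
    userAgent: navigator.userAgent,
    screen: { width: window.screen.width, height: window.screen.height },
    hasAudioOutput: devices.some(d => d.kind === "audiooutput"),
    hasVideoInput: devices.some(d => d.kind === "videoinput"),
    // navigator.connection is not implemented by every browser
    connection: (navigator as any).connection?.effectiveType,
  };
}

// The Device manager (server side) would receive such an object and use it to
// choose, e.g., a video encoding and a stream player for the ECA output.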
The Multimodal user front-end component. This component is an HTML-based user interface that merges the front-end of the targeted web application with the multimodal shell's front-end. It represents a web page containing an IFRAME without any visual elements. The IFRAME serves to present the content of the targeted web application, and forms a "domain" bridge between the provider of multimodal services and the targeted web application. This "domain" bridge is implemented with a JavaScript-based cross-domain API, which allows direct data transfer between the multimodal shell component and the targeted (i.e. encapsulated) web application. In other words, the cross-domain API allows direct interaction and data exchange between the two domains. Users can send data from the web-application interface directly to the Multimodal shell component, and the Multimodal shell component can remotely control the users' web-based interfaces. The Multimodal user front-end component also contains a Java-based client API that can be used for implementing a direct TCP/IP session to the multimodal core. The created session can then be used for management and data transfer. The data transferred during this session is assumed to be of a non-textual (non-web) nature (e.g. a video stream from a camera, an audio stream from a microphone, audio-visually synthesized text, etc.).
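One standard way to realize such a JavaScript-based "domain" bridge is the HTML5 postMessage mechanism. The TypeScript sketch below is only an assumption about how the exchange could look; the origins, the message types, and the forwardToMultimodalCore() helper are placeholders, not the actual BQ-portal API.

// Hedged sketch of an IFRAME "domain" bridge built on window.postMessage.
// APP_ORIGIN / SHELL_ORIGIN and the message format are assumptions.
const APP_ORIGIN = "https://bq-portal.example";    // encapsulated web application
const SHELL_ORIGIN = "https://shell.example";      // multimodal shell front-end

declare function forwardToMultimodalCore(text: string): void;   // placeholder helper

// --- inside the encapsulated application (runs in the IFRAME) ----------------
function requestSpeech(text: string): void {
  // send text up to the multimodal shell, which forwards it to the TTS service
  window.parent.postMessage({ type: "tts-request", text }, SHELL_ORIGIN);
}

// --- inside the multimodal shell front-end -----------------------------------
window.addEventListener("message", (event: MessageEvent) => {
  if (event.origin !== APP_ORIGIN) return;         // accept only the trusted application
  if (event.data?.type === "tts-request") {
    forwardToMultimodalCore(event.data.text);      // hand the text to the Multimodal core
  }
});

// the shell can also drive the embedded interface, e.g. after speech recognition
function sendCommandToApp(frame: HTMLIFrameElement, command: string): void {
  frame.contentWindow?.postMessage({ type: "voice-command", command }, APP_ORIGIN);
}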
4 BQ-Portal Web Application
The BQ-portal web application is an implementation of a mashup-based web application for the provision of multimodal services. It was developed by using the presented multimodal web application concept (Section 3), based on the MWP [15] and DATA [14] architectures. The BQ-portal web application can be regarded as a "targeted" web application with no direct implementation of non-web based services. It serves as an information kiosk for students. By fusing web-based services (e.g. RSS feeds and BQ-portal's own web-based services) and non-web based services (TTS, ECA and language translation services), it performs audio/visual content presentation using embodied conversational agents. Its functional architecture, based on the multimodal web application concept presented in Section 3, is shown in Figure 3.
Fig. 3. Functional architecture of the BQ-portal web application
The BQ-portal web application (Figure 3) implements its own web-based user interfaces, supporting cross-domain communication. The Multimodal shell component mediates between the device-adapted web-based user interfaces and the multimodal services within the Multimodal core component. It therefore performs the fusion of web-based and non-web-based technologies into multimodal, user-oriented services and multimodal, web-based user interfaces. The web-based content is presented in a format that is acceptable to web browsers, and the multimodal content (speech output and ECA animation) is presented as a video stream. Users can use all the services provided by the BQ-portal web application either in a regular (non-multimodal) fashion or as multimodal services. Figure 3 presents the two applicative domains suggested by the concept of the MWP platform. The first one is the multimodal domain (established by the services within the Multimodal core component). It combines different non-web based technologies (the PLATTOS TTS service [4], the EVA service [17]). The second one is the web domain. It combines web-related technologies, such as the different services provided by the BQ-portal web application, and the web-based services provided by other web applications, such as RSS feeds, news feeds, weather forecasts, traffic information, etc. The two applicative domains are assumed to be of different origins (e.g. they run on different application frameworks) and, therefore, cannot communicate directly with each other. Therefore, the cross-domain API (discussed in Section 3) is used for interfacing both domains and for the relevant data exchange performed directly through the user interface. The cross-domain API implements IFRAME encapsulation, and wraps the targeted (general) web application into the IFRAME of the Multimodal shell's web-based user interface. Additionally, a multimodal client interface API (Figure 2), a light Java client-based application, also runs automatically within the web-based user interface of the Multimodal shell component. This light client provides general device information, and enables communication between the user device and the Multimodal core component. The mashup principle, implemented by the cross-domain API, serves as a data exchange process between applications of different domains, for instance to generate audio/visual content based on the text presented within the user interface. The generated speech and ECA EVA responses are always presented together as a video stream played by a web-based stream player. ECA EVA can provide different formats for the output stream (e.g. RTP or HTTP), and different audio/video encodings (e.g. MPEG-2/4, H.264, FLV, etc.). The multimodal front-end will automatically generate a video stream player based on the device properties (e.g. a custom Java-based video player, or a web player encapsulated within the MWP's front-end). Some of the multimodal web-based services provided by the BQ-portal application are outlined in the following sections. The concepts of multimodality within the BQ-portal web application are quite general and can serve as a base for designing several new multimodal services. For instance, when speech recognition is available (among the multimodal services), the use of the web-based interface can also be voice-driven.
5 Multimodal Services – ECA-Enhanced Web Services
Based on the concept of multimodality, the BQ-portal web application already offers several multimodal services based on ECA EVA and the TTS system. These services
transform general web-based content into an audio-visual response (generated by the embodied conversational agent EVA and the PLATTOS TTS). Such services, therefore, unify web content, text-to-speech output, and the embodied conversational agent into responsive, audio-visually presented content. This section presents insights into how web-based content is transformed into a multimodal response, and describes two multimodal services provided by the BQ-portal web application.

5.1 Visualized RSS Feeds
RSS feeds are a common entity within today's web applications. Feed services are used to provide and present information from outside sources (external web applications). Commonly, these services can be accessed over a standardized XML interface, using a well-structured and constant XML scheme. If the structure is unknown, it can also be determined from the content of the XML. Enhancing RSS feeds with multimodal output services involves the following steps (a sketch in code follows the list):
- parsing the feed (to obtain the titles, the corresponding content, images, etc.),
- removing all HTML-related content, such as CSS styles, tags, etc.,
- redirecting the normalized content to the corresponding multimodal services,
- generating the multimodal output,
- presenting the multimodal output to the user.
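A minimal TypeScript sketch of these five steps is given below. The ttsSynthesize() and playStream() helpers stand in for the actual multimodal services and the stream player; their names and signatures are assumptions for illustration, and only the title-reading case is shown (content reading works analogously).

// Hedged sketch of the feed -> raw text -> multimodal output pipeline.
declare function ttsSynthesize(text: string): Promise<string>;  // assumed service call
declare function playStream(url: string): void;                 // assumed player helper

async function readFeedAloud(feedUrl: string): Promise<void> {
  const xml = await (await fetch(feedUrl)).text();              // 1. parse the feed
  const doc = new DOMParser().parseFromString(xml, "text/xml");
  const titles = Array.from(doc.querySelectorAll("item"))
    .map(item => item.querySelector("title")?.textContent ?? "");

  // 2. remove HTML/CSS mark-up, keep raw text only
  const rawText = titles
    .map(t => t.replace(/<[^>]+>/g, " "))
    .join(". ")
    .replace(/\s+/g, " ")
    .trim();

  // 3.-4. redirect the normalized text to the multimodal services, which
  //        generate the audio-visual output (TTS + ECA) as a video stream
  const streamUrl = await ttsSynthesize(rawText);

  // 5. present the multimodal output to the user
  playStream(streamUrl);
}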
Fig. 4. Enhanced RSS feeds by using multimodal output services
This process is implemented upon the user's request, either for the index of RSS titles or for the content of a title. If the index page is to be read, the TTS service is fed with the titles; otherwise the selected title is connected to its content, and the content is then fed to the TTS service. The TTS service (based on the PLATTOS TTS system, available in the DATA service cloud) generates the speech and the EVA-Script-based description of the text being synthesized. This description is a sequential set of phonemes that are assigned attributes such as duration, pitch, and prominence levels. The EVA-Script description is then transformed into animated lip movement, and sent as a video stream towards the user's multimodal front-end component. Figure 4 presents the functional architecture of the multimodal-output-based RSS feed reader. Three different RSS news feed providers are supported within the current version of the BQ-portal web application. The indexes and individual feeds are accessed using standard web-based RSS connectors. The RSS parser parses the feeds and generates raw text by using the RSS scheme descriptors (e.g. XML tree parsing). When the user accesses a feed, he/she actually accesses the RSS feed reader interface that stores both the raw and the styled data of the individual feed. A read-feed request initiates the feeding process and, in turn, launches the multimodal output process. With a delay within the norms expected for interactive use, the user hears and sees the generated multimodal output within his/her interface.

5.2 Visualized Translations
The BQ-portal also supports real-time language translations based on the Google translator API (http://code.google.com/apis/language/). In this way, users can translate either the content of the currently presented web page as a whole, or only fragments of the content at a time. The data source for the translation is always provided by the user interface, and the process of translation is always initiated upon the user's request. Figure 5 presents the functional architecture of the BQ-portal's multimodal-output-based text translator. The multimodal-output-based text translator sends cleaned data (raw text with no HTML, CSS, or XML elements) into the 'language discovery' and 'language translation' modules, and returns two types of output: the translation as text, and the video sequence of the generated multimodal output (synthesized and visualized speech). The BQ-portal's presentation layer provides the translation data in the form of raw text (text without any HTML/CSS-related information). This raw text is then translated. Firstly, it is passed through the 'language discovery' module. This module is native to the Google API and determines the language of the input raw text. If the 'language discovery' module fails to do this, the user is asked to define the input language manually. The target translation language is currently Slovenian (since a PLATTOS TTS for the Slovenian language is available). The real-time translation process then proceeds with the translation phase, where the raw text is translated from the detected input language into Slovenian by using the Google translator API. The translation result is raw text that is fed to the DATA service cloud. Within this cloud, the raw data are first redirected to the TTS service, which generates audio and EVA-Script-based descriptions. The TTS service output is then redirected towards the ECA service. The obtained data are used for the generation of the multimodal output (ECA-animated speech sequences).
The multimodal output is then transferred back to the user interface as a video stream.
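The same flow can be summarized in a short TypeScript sketch. The detectLanguage(), askUserForLanguage(), translate() and synthesizeAndAnimate() functions are placeholders standing in for the Google translator API and the DATA-cloud services; their signatures are assumptions, not the actual interfaces.

// Hedged sketch of the visualized translation flow described above.
declare function detectLanguage(text: string): Promise<string | null>;
declare function askUserForLanguage(): Promise<string>;
declare function translate(text: string, from: string, to: string): Promise<string>;
declare function synthesizeAndAnimate(text: string): Promise<string>;

async function translateAndVisualize(rawText: string): Promise<string> {
  // 1. language discovery, with a manual fallback when detection fails
  let source = await detectLanguage(rawText);
  if (!source) source = await askUserForLanguage();

  // 2. translation into the target language (Slovenian, for which PLATTOS TTS exists)
  const translated = await translate(rawText, source, "sl");

  // 3. TTS + ECA: the service cloud returns a video stream of the animated speech
  const videoStreamUrl = await synthesizeAndAnimate(translated);
  return videoStreamUrl;   // played back in the user's multimodal front-end
}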
Fig. 5. Enhanced text translations by using multimodal output services
The presented multimodal-output-based text translation process can easily be extended to any type of document that can be parsed and cleaned (transformed into raw text) by the BQ-portal web application, ranging from Word documents to PDF books. The quality of the multimodal-output-based translation, however, depends strongly on the quality of the translated text, i.e. on the quality of the Google translator API.
6 Conclusion
This paper presents a concept for developing multimodal web interfaces that could overcome the device/system dependences common to several existing multimodal web interfaces. The concepts outlined in this article enable the integration of multimodal technologies into different types of web-based solutions. By visualizing the textual content, embodied conversational agents can add more life to the content, and can also be used as supportive technologies providing additional meaning to the content. As a result, this paper has presented two services that are implemented within the BQ-portal web application. The RSS feed visualization service allows the BQ-portal to directly visualize the content of any provided RSS feed (it must have a known XML/HTML scheme). The visualized translations, on the other hand, present a service that incorporates both web and non-web services. By using the Google translator API (a web-based service) and TTS + ECA (non-web based services), the user can translate selected text and also visualize the translations.
The main focus of the presented MWP concept is to provide an interface that allows the fusion (not integration) of non-web based services with general web-based applications. By using the mashup-based principle of cross-domain interaction, it has been shown that non-web technologies, such as text-to-speech synthesis and embodied conversational agents, can be fused with different web services and web content. Such a fusion enriches and enhances existing general web content and presents it through multiple communication channels. In this paper, ECAs were regarded only as a one-way presentation channel, able to perform only OCM (e.g. visual speech synthesis, speech-related gesture synthesis, etc.) based on the TTS output. However, the BQ-portal web application's service-oriented architecture allows the development and integration of application-independent ICM. In order to form two-way interaction loops, different behaviour management techniques, as e.g. in [18], [19], can be integrated into the multimodal core component. These techniques can be further interfaced with the ECA and TTS services. In this way, ECAs would gain not only the ability to present the web content, but also the ability to respond to different user requests and to influence the interaction flow. In the future we plan to extend the presented multimodal concept by providing services such as speech recognition and vision (multimodal input), and to further enhance the BQ-portal web application with new input modalities. In addition, we plan to research and introduce different ICM tactics and dialog management systems in order to provide more human-like communicative dialog to general web-based applications or services. These research activities will allow us to transform the kiosk-based application into a more natural web-based interface that can be used within different intelligent environments.

Acknowledgements. This operation is partly financed by the European Union, European Social Fund.
References
1. EMMA: Extensible MultiModal Annotation Markup Language. W3C Recommendation (2009), http://www.w3.org/TR/2009/REC-emma-20090210/
2. Hakkinen, M., Dewitt, J.: WebSpeak: user interface design of an accessible web browser. White Paper, the Productivity Works Inc. (1996)
3. Zajicek, M., Powell, C., Reeves, C.: A web navigation tool for the blind. In: Proceedings of the 3rd ACM/SIGAPH on Assistive Technologies, pp. 204–206 (1998)
4. Rojc, M., Kačič, Z.: Time and Space-Efficient Architecture for a Corpus-based Text-to-Speech Synthesis System. Speech Communication 49(3), 230–249 (2007)
5. Yu, W., Kuber, R., Murphy, E., Strain, P., McAllister, G.: A novel multimodal interface for improving visually impaired people's web accessibility. Virtual Reality 9(2), 133–148 (2006)
6. Oviatt, S., Cohen, P.: Perceptual user interfaces: multimodal interfaces that process what comes naturally. Communications of the ACM 43(3), 45–53 (2000)
7. Niklfeld, G., Anegg, H.: Device independent mobile multimodal user interfaces with the MONA Multimodal Presentation Server. In: Proceedings of Eurescom Summit 2005 (2005)
8. Song, K., Lee, K.H.: Generating multimodal user interfaces for Web services. Interacting with Computers 20(4-5) (September 2008)
9. Chang, S.E., Minkin, B.: The implementation of a secure and pervasive multimodal Web system architecture. Information and Software Technology 48(6) (2006)
10. Berti, S., Paternò, F.: Migratory MultiModal Interfaces in MultiDevice Environments. In: Proc. of 7th Int. Conf. on Multimodal Interfaces, ICMI 2005. ACM Press, New York (2005)
11. Bouchet, J., Nigay, L., Ganille, T.: ICARE software components for rapidly developing multimodal interfaces. In: Conference Proceedings of ICMI 2004 (2004)
12. Wahlster, W.: SmartWeb: Mobile Applications of the Semantic Web. In: Biundo, S., Frühwirth, T., Palm, G. (eds.) KI 2004. LNCS (LNAI), vol. 3238, pp. 50–51. Springer, Heidelberg (2004)
13. Stanciulescu, A., Vanderdonckt, J.: Design Options for Multimodal Web Applications. In: Computer Aided Design of User Interfaces V, pp. 41–56 (2007)
14. Rojc, M., Mlakar, I.: Finite-state machine based distributed framework DATA for intelligent ambience systems. In: Proceedings of CIMMACS 2009. WSEAS Press (2009)
15. Mlakar, I., Rojc, M.: Platform for flexible integration of multimodal technologies into web application domain. In: Proceedings of E-ACTIVITIES 2009, International Conference on Information Security and Privacy (ISP 2009). WSEAS Press (2009)
16. Thang, M.D., Dimitrova, V., Djemame, K.: Personalised Mashups: Opportunities and Challenges for User Modelling. In: Conati, C., McCoy, K., Paliouras, G. (eds.) UM 2007. LNCS (LNAI), vol. 4511, pp. 415–419. Springer, Heidelberg (2007)
17. Mlakar, I., Rojc, M.: EVA: expressive multipart virtual agent performing gestures and emotions. International Journal of Mathematics and Computers in Simulation 5(1), 36–44 (2011), http://www.naun.org/journals/mcs/19-710.pdf
18. Morency, L.P., de Kok, I., Gratch, J.: A probabilistic multimodal approach for predicting listener backchannels. Autonomous Agents and Multi-Agent Systems 20(1), 70–84 (2010)
19. Wrede, B., Kopp, S., Rohlfing, K., Lohse, M., Muhl, C.: Appropriate feedback in asymmetric interactions. Journal of Pragmatics 42(9), 2369–2384 (2010)
Multimodal Embodied Mimicry in Interaction Xiaofan Sun and Anton Nijholt Human Media Interaction, University of Twente, P.O. Box 217, 7500 AE Enschede, The Netherlands {x.f.sun,a.nijholt}@ewi.utwente.nl
Abstract. Nonverbal behavior plays an important role in human-human interaction. One particular kind of nonverbal behavior is mimicry. Behavioral mimicry supports harmonious relationships in social interaction by creating affiliation, rapport, and liking between partners. Affective computing that employs mimicry knowledge and that is able to predict how mimicry affects social situations and relations can find immediate application in improving human-computer interaction. In this short paper we survey and discuss the mimicry issues that are important from that point of view: application in human-computer interaction. We designed experiments to collect mimicry data. Some preliminary analysis of the data is presented.
Keywords: Mimicry, affective computing, embodied agents, social robots.
1 Introduction
People come from different cultures and have different backgrounds while growing up. This is reflected in their verbal and nonverbal interaction behavior, speech and language use, attitudes, social norms and expectations. Sometimes harmonious communication is difficult to establish or continue because of these different cultures and backgrounds. This is also true when people are from the same culture and have the same background, but differ in opinions or are in competition. In designing user interfaces for human-computer interaction, including social robots and artificial embodied agents, in designing tools for computer-mediated interaction, and in designing tools or environments for training and simulation where interaction is essential, we should be aware of this. These interfaces, tools and environments need to be socially intelligent, capable of sensing or detecting information relevant to social interaction. Mimicry is often an automatic and unconscious process where, usually, the mimicker neither intends to mimic nor is consciously aware of doing so, but it may tend to activate a desire to affiliate. For example, mimicking behaviors even occur among strangers when no affiliation goal is present. Certainly, mimicking strangers assumes unconscious mimicry. In other cases, people often mimic each other without realizing they want to create similarity. This can also be assumed to be unconscious mimicry. Conversational partners may or may not be consciously engaged in mimicry, but no doubt one or both of the interactants take on the posture, mannerisms, and movements of the other during natural interaction [1].
Some instances of mimicry in daily life, and factors that affect them, are given below. People often mimic their bosses' behavior in a meeting or discussion; for example, they repeat what the boss said because of a desire to affiliate, even if there is no real agreement. As another example, meeting or discussion partners mimic each other to gain acceptance and agreement when they share, or want to share, an opinion in a discussion. Thus, it is worth noting that interactants mimic each other because of directly activated goals, though without consistent awareness. Mimicry occurs in our daily life all the time, and in most cases mimicry behavior implicitly or explicitly reflects the mimickee's and mimicker's actual attitudes, beliefs, and affects, and, moreover, their judgment of the current interaction situation as positive or negative. Nonconscious mimicry occurs widely in our daily life; for example, people unconsciously speak more softly when they are visiting a library. Mimicry is inherently sensitive to the actual social context; in other words, automatic mimicry changes with changing goals according to the realistic social situation. It is expected that human-computer interfaces that employ knowledge on mimicry can improve natural, human-like interaction behavior. This requires the detection and generation of mimicry behavior. It allows the interface to adapt to its human partner and to create affiliation and rapport. This can in particular be true when mimicry behavior is added to human-like computer agents with which users communicate. One of the important goals for future studies in embodied virtual agents and social robots is to use social strategies in order to make them more sociable and natural [2]. The sociable agent should have the capability of recognizing positive and negative situations, and its communicative behavior should be appropriate in the current situation. Then it can achieve desirable interaction results such as creating affiliation and rapport, gaining acceptance, increasing belongingness, and, of course, a better understanding of the conversational partner. Indeed, in recent research on humanoid agents the view that humans are "users" of a certain "tool" is shifting to that of a "partnership" with artificial, autonomous agents [3], [4]. Social agents need to have the capability to acquire various types of input from human users in verbal and non-verbal communication modalities. Also, social agents should be capable of understanding the input signals to recognize the current situation, and then, according to the desired goals in the conversational setting, of combining social strategies to determine what behavior is appropriate to express in response to the multimodal input information. Similarly, in the output phase, agents are expected to have the capability of mimicking users' facial expressions, eye contact, and postural or even verbal styles, to gain more closeness and natural communication.
2 Types of Mimicry Various types of mimicry can be distinguished. They range from almost directly mimicking facial expressions and slight head movements to long term effects of interaction such as convergence in attitudes [2]. When we look at automatic detection and generation, we confine ourselves to the directly observable and developing mimicry behavior during interactions and what can be concluded from that. Therefore, below we distinguish mimicry in facial expressions, in speech, in body behavior (including gestures and head movements) and emotions.
2.1 Facial Expression Mimicry
Interactants may express similar facial expressions during face-to-face interactions. When one of two interactants facing each other takes on a certain facial action, the partner may take on a congruent action [5], [6]. For instance, if one is smiling, the other may also smile. From previous mimicry experiments it is known that when images of a facial expression displaying a particular emotion are presented, people display similar expressions, even if those images are just static expressions [7], [8], [9].

2.2 Vocal Mimicry
Vocal behavior coordination occurs when people match the speech characteristics and patterns of their interaction partners [10]. They may neither intend to do so nor be consciously aware of doing so. This can be observed even if they are not facing each other [11].

2.3 Postural Mimicry
Body behavioral coordination involves taking on the postures, mannerisms, gestures, and motor movements of other people, such as rubbing the face, touching the hair, or moving the legs [12]. For instance, if one person crosses his legs with the right leg on top of the left, the other may also cross his legs, either with the left leg on top of the right or with the right leg on top of the left [13].

2.4 Emotional Mimicry
The perception of mimicry is not limited to the perception of behavioral expressions [14]. Emotional mimicry is another phenomenon that needs to be considered. It is more complicated and mostly based on personal feeling and perception. In [7] emotional mimicry is classified into positive mood mimicry, negative mood mimicry and counter-mimicry. In an actual social situation not all emotion expressions are mimicked equally. Normally, people are more likely to mimic positive emotions than negative emotions. This seems to be because negative emotional mimicry is less relevant and more costly [15]. Consider, for example, the situation where someone tells you a bad thing happened to him or her, and he or she, consciously or unconsciously, displays a sad face. Mimicking his or her sadness expression means signaling understanding, and maybe also a willingness to help. Hence, sadness mimicry only occurs between people who are close to each other, rather than between passing acquaintances [15]. In contrast, people mimic happiness regardless of the relationship with each other or the situational context, because mimicking positive emotion carries low risk and low cost [14]. Usually in a competitive condition, such as debates or negotiations, counter-mimicry is evoked to express differing attitudes or negative emotion in a polite and implicit way; it shows contrasting facial expressions and postural or vocal cues, such as a smile when the expresser winces in pain [7].
3 Mimicry as a Nonconscious Tool to Enhance Communication
Individuals may consciously engage in more mimicry with each other when they intend to affiliate during interaction. In contrast, they may also consciously engage in less mimicry when they prefer disaffiliation [16]. Hence, mimicry has the power to enhance social interaction and to express preferences. This is not really different in the case of unconscious mimicry. Unconscious mimicry shows a merging of the minds, such as creating more similar attitudes or sharing more viewpoints [12]. Moreover, in interpersonal interaction mimicry can be an unconsciously used 'tool' to create greater feelings of, e.g., rapport and affiliation [17]. Mimicry can be seen as an assessment of the current social interaction situation (e.g., a positive or negative environment). The connection between mimicry and the closeness of social interaction was shown by a study conducted by Jefferis, van Baaren and Chartrand [18]. To use mimicry as a tool to enrich social interaction, some important research issues are, first, to understand and explore how people experience and use mimicry; second, to examine the implications of explicit mimicry behaviors in terms of social perceptions of the mimickers; third, to analyze detected and classified mimicry behavior for cues about the characteristics of the interaction; and, finally, to examine to what extent mimicking should occur so that it enriches communication properly. Embodied automatic mimicry can be used as a social strategy to achieve the desired level of affiliation or disaffiliation. The key is to obtain an optimal level of embodied mimicry [2]; that is, mimicry should occur only to the proper degree, so that such mimicry behavior serves the affiliation goal and is not costly or risky.
4 Measuring of Mimicry Mimicry refers to the coordination of movement between individuals in both timing and form during interpersonal communication. These phenomena are observed in newborn infants [8], and it is reported that these phenomena are related to language acquisition [10] and, as mentioned before, rapport. Therefore, many researchers have been interested in investigating the nature of these phenomena and have introduced theories explaining these phenomena. Because of this broad range of theoretical applicability, interactional mimicry has been measured in many different ways [19]. These methodologies can be divided into two types: behavior coding and rating. Some research has resulted in illustrating the similarities and differences between using a coding method and a rating method for measuring mimicry. Some researchers have been studying interpersonal communication using both methods. Recently, Reidsma et al. [20] presented a quantitative method for measuring the level of nonverbal synchrony during interaction. First the amount of movement of a person as a function of time is measured by image difference computations. Then, with the help of the cross-correlation between the movement functions of two conversational partners, taking into account possible time delays, it is determined if they move synchronously. In research on judging rapport and affiliation, studies examined how people use objective cues, as measured by a coding method, or subjective cues, as measured by a rating method, when they perceive interpersonal communication.
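As an illustration of the quantitative approach of Reidsma et al. [20], the TypeScript sketch below computes a simple frame-difference movement function for each person and then the best lagged correlation between the two movement functions. The exact normalization and windowing used in [20] may differ; this is only a sketch under those assumptions.

// Hedged sketch: movement energy by image differencing, then lagged correlation.
function movementEnergy(frames: Uint8Array[]): number[] {
  // frames: grey-scale pixel buffers of one person's video, one buffer per frame
  const energy: number[] = [];
  for (let t = 1; t < frames.length; t++) {
    let diff = 0;
    for (let p = 0; p < frames[t].length; p++) {
      diff += Math.abs(frames[t][p] - frames[t - 1][p]);
    }
    energy.push(diff / frames[t].length);   // mean absolute pixel difference
  }
  return energy;
}

function laggedCorrelation(a: number[], b: number[], maxLag: number): number {
  // Pearson correlation of two movement functions, maximized over time delays
  const corr = (x: number[], y: number[]): number => {
    const n = Math.min(x.length, y.length);
    const mx = x.slice(0, n).reduce((s, v) => s + v, 0) / n;
    const my = y.slice(0, n).reduce((s, v) => s + v, 0) / n;
    let num = 0, dx = 0, dy = 0;
    for (let i = 0; i < n; i++) {
      num += (x[i] - mx) * (y[i] - my);
      dx += (x[i] - mx) ** 2;
      dy += (y[i] - my) ** 2;
    }
    return dx && dy ? num / Math.sqrt(dx * dy) : 0;
  };
  let best = -1;
  for (let lag = -maxLag; lag <= maxLag; lag++) {
    const xa = lag >= 0 ? a.slice(lag) : a;
    const xb = lag >= 0 ? b : b.slice(-lag);
    best = Math.max(best, corr(xa, xb));
  }
  return best;   // values near 1 suggest (possibly delayed) synchronous movement
}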
For automatic mimicry detection advanced learning techniques need to be employed to construct a model from both subjective knowledge and training data. Affect (e.g., disagreement/agreement) recognition is accomplished through probabilistic inference by systematically integrating mimicry measurements with mimicry behavior detection and a mimicry behavior organization model. In the model head movements, postural movements, and facial expressions can be explicitly modeled by different sub-modes in lower levels, while the higher level model represents the interaction between the modes. However, automatic selection of the sensory sources based on the information need is non-trivial; hence no operational systems exploit this. Individual sensors are integrated in sensor networks. Perceived data from single sensors need to be fused and integrated in the network. Moreover, the multimodal signals should be considered mutually dependent rather than be combined only at the end as is the case in decision-level fusion. And the same problem also appears in classifying features such as when and how to combine the features from various sensor models.
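The contrast drawn above between decision-level fusion and combining mutually dependent signals earlier can be sketched as follows; classifyJoint(), the per-modality scores and the score-averaging rule are illustrative placeholders, not a description of an actual system.

// Hedged sketch of two fusion strategies for multimodal affect/mimicry recognition.
type Features = number[];

declare function classifyJoint(joint: Features): number;   // assumed joint classifier

// decision-level fusion: each modality is classified alone and scores are merged late
function decisionLevelFusion(perModalityScores: number[]): number {
  return perModalityScores.reduce((s, v) => s + v, 0) / perModalityScores.length;
}

// feature-level fusion: features are combined before classification, so dependencies
// between, e.g., head movement and facial expression can be modelled jointly
function featureLevelFusion(face: Features, head: Features, posture: Features): number {
  return classifyJoint([...face, ...head, ...posture]);
}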
5 Collecting Data and Annotation
It is necessary to automatically detect mimicry and to recognize affect based on mimicry analysis. To achieve the ultimate goal of automatically analyzing mimicry, some sub-goals need to be achieved. First, a multimodal database of interactional mimicry in social interactions needs to be set up, and secondly, possible rules and algorithms of mimicry in interactions need to be explored based on experimental social psychology. The aims of setting up a multimodal database of interactional mimicry in social interactions are to (1) understand and explore how people consciously and unconsciously employ and display mimicry behavior, (2) develop methods and design tools to automatically detect synchrony and mimicry in social interactions, (3) examine and annotate the implications of mimicry detection in terms of the social perceptions and emotions of the mimickers, and (4) develop social mimicry algorithms to be utilized by embodied conversational agents. In sum, the goal is to understand when and why mimicry behavior happens and what the exact types of those non-verbal behaviors are in human face-to-face communication, by annotating, analyzing and modeling recorded data. Recently we finished the process of collecting data from a large number of face-to-face interactions in an experimental setting. The recordings were done at Imperial College London in collaboration with the iBUG group of Imperial College. The setting and the interaction scenarios aimed at extracting natural multimodal mimicry information, and at exploring the relationship between the occurrence of mimicry and human affect (see Section 2). The corpus was recorded using a wide range of devices, including close-talking and fixed microphones, and individual and room-view video cameras from different views, all of which produced auditory and visual output signals that are synchronized with each other. Two scenarios were followed in the experiments: a discussion on a political topic, and a role-playing game. More than 40 participants were recruited. They also had to fill in questionnaires to report their felt experiences. The recordings and ratings are stored in a database. The interactions are being manually annotated for
many different phenomena, including dialogue acts, turn-taking, affect, and some head and hand gestures, body movements and facial expressions. Annotation includes annotating the behavioural expressions of each participant separately, annotating the meaning expressed by the behavioural expressions, and annotating mimicry episodes. Some preliminary results on the automatic detection of mimicry episodes can be found in [21]. The corpus will be made available to the scientific community through a web-accessible database.
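A possible record layout for such annotations is sketched below in TypeScript; the field names and modality labels are illustrative assumptions, not the actual schema of the recorded corpus.

// Hedged sketch of annotation records for behavioural expressions and mimicry episodes.
interface BehaviourAnnotation {
  participant: string;                           // annotated separately per participant
  startMs: number;
  endMs: number;
  modality: "head" | "hand" | "body" | "face" | "speech";
  label: string;                                 // e.g. "nod", "smile", a dialogue act, ...
  meaning?: string;                              // interpretation layer, e.g. "agreement"
}

interface MimicryEpisode {
  mimicker: string;
  mimickee: string;
  startMs: number;
  endMs: number;
  behaviours: BehaviourAnnotation[];             // the paired expressions forming the episode
}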
6 Conclusion
Embodied mimicry can provide important clues for investigations of human-human and human-agent interactions: firstly, as an indicator of cooperativeness and empathy, and secondly, in its application as a means to enrich communication. The impact of a practical technology to mediate human interactions in real time would be enormous for society and individuals as a whole (improving business relations, cultural understanding, communication in relationships, etc.). It would find immediate applications in areas such as adapting interactions to help people with less confidence, training people for improved social interactions, or in specific tools for tasks such as negotiation. This technology would also strongly influence science and technology (providing a powerful new class of research tools for social science and anthropology, for example). While the primary goal of such an effort would be to facilitate direct mediated communication between people, advances here would also facilitate interactions between humans and machines. Moreover, given the huge advances in computer vision and algorithmic gesture detection, coupled with the propensity for more and more computers to utilize high-bandwidth connections and embedded video cameras, the potential for computer agents to detect, mimic, and implement human gestures and other behaviors is quite boundless and promising. Together with the early findings in [21], this suggests that mimicry can be added to computer agents to improve the user's experience unobtrusively, that is to say, without the user noticing. It is worth mentioning again that the first main issue in our research is to explore, and later to analyze automatically, in what situations and to what extent mimicking behaviors occur.

Acknowledgments. We gratefully acknowledge the useful comments of some anonymous referees. This work has been funded in part by FP7/2007-2013 under grant agreement no. 231287 (SSPNet).
References
1. Chartrand, T.L., Bargh, J.A.: The chameleon effect: the perception-behavior link and social interaction. Journal of Personality and Social Psychology 76(6), 893–910 (1999)
2. Kopp, S.: Social resonance and embodied coordination in face-to-face conversation with artificial interlocutors. Speech Communication 52(6), 587–597 (2010)
3. Bailenson, J.N., Yee, N.: Digital chameleons. Psychological Science 16(10), 814–819 (2005)
4. Bailenson, J.N., Yee, N., Patel, K., Beall, A.C.: Detecting digital chameleons. Computers in Human Behavior 24(1), 66–87 (2008)
5. Chartrand, T.L., Jefferis, V.E.: Consequences of automatic goal pursuit and the case of nonconscious mimicry, pp. 290–305. Psychology Press, Philadelphia (2003)
6. Nagaoka, C., Komori, M., Nakamura, T., Draguna, M.R.: Effects of receptive listening on the congruence of speakers' response latencies in dialogues. Psychological Reports 97, 265–274 (2005)
7. Hess, U., Blairy, S.: Facial mimicry and emotional contagion to dynamic emotional facial expressions and their influence on decoding accuracy. Int. J. Psychophysiology 40(2), 129–141 (2001)
8. Bernieri, F.J., Reznick, J.S., Rosenthal, R.: Synchrony, pseudosynchrony, and dissynchrony: Measuring the entrainment process in mother-infant interactions. Journal of Personality and Social Psychology 54(2), 243–253 (1988)
9. Yabar, Y., Johnston, L., Miles, L., Peace, V.: Implicit behavioral mimicry: Investigating the impact of group membership. Journal of Nonverbal Behavior 30(3), 97–113 (2006)
10. Giles, H., Powesland, P.F.: Speech style and social evaluation. Academic Press, London (1975)
11. Lakin, J.L., Chartrand, T.L., Arkin, R.M.: Exclusion and nonconscious behavioral mimicry: Mimicking others to resolve threatened belongingness needs (2004) (manuscript)
12. Bernieri, F.J.: Coordinated movement and rapport in teacher-student interactions. Journal of Nonverbal Behavior 12(2), 120–138 (1998)
13. LaFrance, M.: Nonverbal synchrony and rapport: Analysis by the cross-lag panel technique. Social Psychology Quarterly 42(1), 66–70 (1979)
14. Chartrand, T.L., Maddux, W., Lakin, J.L.: Beyond the perception-behavior link: The ubiquitous utility and motivational moderators of nonconscious mimicry. In: Hassin, R.R., Uleman, J.S., Bargh, J.A. (eds.) The New Unconscious, pp. 334–361. Oxford University Press, New York (2005)
15. Bourgeois, P., Hess, U.: The impact of social context on mimicry. Biol. Psychol. 77, 343–352 (2008)
16. Lakin, J., Chartrand, T.L.: Using nonconscious behavioral mimicry to create affiliation and rapport. Psychol. Sci. 14, 334–339 (2003)
17. Chartrand, T.L., van Baaren, R.: Human Mimicry. Advances in Experimental Social Psychology 41, 219–274 (2009)
18. Jefferis, V.E., van Baaren, R., Chartrand, T.L.: The functional purpose of mimicry for creating interpersonal closeness. Ohio State University (2003) (manuscript)
19. Gueguen, N., Jacob, C., Martin, A.: Mimicry in social interaction: Its effect on human judgment and behavior. European Journal of Sciences 8(2), 253–259 (2009)
20. Reidsma, D., Nijholt, A., Tschacher, W., Ramseyer, F.: Measuring Multimodal Synchrony for Human-Computer Interaction. In: Proceedings International Conference on Cyberworlds, pp. 67–71. IEEE Xplore, Los Alamitos (2010)
21. Sun, X.F., Truong, K., Nijholt, A., Pantic, M.: Automatic Visual Mimicry Expression Analysis in Interpersonal Interaction. In: Proceedings Fourth IEEE Workshop on CVPR for Human Communicative Behavior Analysis. IEEE Xplore, Los Alamitos (2011)
Using TTS for Fast Prototyping of Cross-Lingual ASR Applications Jan Nouza and Marek Boháč Institute of Information Technology and Electronics, Technical University of Liberec Studentská 2, 461 17 Liberec, Czech Republic {jan.nouza,marek.bohac}@tul.cz
Abstract. In this paper we propose a method that simplifies the initial stages in the development of speech recognition applications that are to be ported to other languages. The method is based on cross-lingual adaptation of the acoustic model. In the search for an optimal mapping between the target and original phonetic inventories, we utilize data generated in the target language by a high-quality TTS system. The data are analyzed by an ASR module that serves as a partly restricted phoneme recognizer. We demonstrate the method on the Czech-to-Polish adaptation of two prototype systems, one aimed at handicapped persons and another prepared for fluent dictation with a large vocabulary.
Keywords: Speech recognition, speech synthesis, cross-lingual adaptation.
1 Introduction
As the number and variety of voice technology applications increases, the demand to port them to other languages becomes acute. One of the crucial issues in the localization of already developed products for another language is the cost of the transfer. In ASR (Automatic Speech Recognition) systems, the major costs are related to the adaptation of their two language-dependent layers: the acoustic-phonetic one and the linguistic one. Usually, the latter task is easier to automate because it is based on the statistical processing of texts, which are now widely available in digital form (e.g. on the internet [1]). The former task takes significantly more human work, since it requires a large amount of annotated speech recordings and some deeper phonetic knowledge. These costs may be prohibitive if we aim at porting applications for special groups of clients, such as handicapped persons, where the number of potential users is small and the price of the products should be kept low. The research described in this paper has had three major goals. First, we were asked to transfer the voice tools developed for Czech handicapped persons to similar target groups in countries where these tools are not available. Second, we wanted to find a methodology that would make the transfer as rapid and cheap as possible. And third, we wished to explore the limits of the proposed approach to see whether it is applicable also to more challenging tasks. Our initial attempt was to allow for porting the MyVoice and MyDictate tools into other (mainly Slavic) languages. The two programs were developed in our lab between 2004 and 2006. They enabled Czech motor-handicapped people to work with a PC in a
hands-free manner, with a large degree of flexibility and customization [2]. Very soon, a demand to port MyVoice to Slovak occurred, and a few years later the software was transferred also to Spanish. The adaptation of the acoustic-phonetic layer of the MyVoice engine was done in a simple and straightforward way, by mapping the phonemes of the target language to the original Czech ones [3]. In both cases, the mapping was conducted by experts who knew the phonetics of the target and original languages. As the demand for porting the voice tools to several other languages increases, we are searching for an alternative approach in which the expert can be (at least partly) replaced by a machine. In this paper, we investigate a method where a TTS system together with an ASR-based tool tries to play the role of a 'skilled phonetician' whose aim is to find the optimal acoustic-phonetic mapping. The approach has been proposed and successfully tested on Polish. Our experiments show that the scheme yields promising results not only for small-vocabulary applications but also for a more challenging task, such as the fluent dictation of professional texts. In the following sections, we briefly introduce the ASR systems developed for Czech. Then we focus on the issues related to their transfer to Polish. We mention the main differences between the two languages on the phonetic level, and propose a method that utilizes the output from a Polish TTS system for creating an objective mapping of Polish orthography to the Czech phonetic inventory. The proposed solution is simple and cheap, because it does not require human-made recordings or an expert in phonetics, and yet it seems applicable in the desired area.
2 ASR Systems Developed for Czech Language
During the last decade we have developed two types of ASR engines: one for voice-command input and discrete-speech dictation, and another for fluent speech recognition with very large vocabularies. The former proved to be useful mainly in applications where robust hands-free performance is the highest priority. This is the case, for example, with voice-controlled aids developed for motor-handicapped people. Voice commands and voice typing can help them very much if they are reliable, flexible, customizable and do not require high-cost computing power. The speed of typing, on the other hand, is slower, but this is not the crucial factor. The engine we have developed can operate with small vocabularies (tens or hundreds of commands) as well as with very large lexicons (up to 1 million words). It has recently been used in the MyVoice tool and in the MyDictate program. Both programs can run not only on low-cost PCs but also on mobile devices [4]. The latter engine is a large-vocabulary continuous-speech recognition (LVCSR) decoder. It has been developed for voice dictation and speech transcription tasks with regard to the specific needs of highly inflected languages, like Czech and other Slavic languages [5]. The recent version operates in real time with lexicons whose size goes up to 500 thousand words. Both engines use the same signal processing and acoustic modeling core. The speech signal is sampled at 16 kHz and parameterized every 10 ms into 39 MFCC features per frame. The acoustic model (AM) employs CDHMMs that are based either on context-independent phonetic units (monophones) or on context-dependent triphones. The latter yield slightly better performance, though the former are more compact,
require less memory and are more robust against pronunciation deviations. The last aspect is especially important if we consider using the AM for the speech recognition of another language. The linguistic part of the systems consists of the lexicon (which can be general or application-oriented) and the corresponding language model (LM). The LM used in the simpler systems has the form of a fixed grammar; in the dictation and transcription systems it is based on bigrams. The final applications (e.g. the programs MyVoice, MyDictate and FluentDictate) have been developed for Czech. Yet the engines themselves are language-independent. If the above programs are to be used in another language, we need to provide them with a new lexicon, a corresponding LM, and an AM that fits the coding used for the pronunciation part of the lexicon.
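The front-end parameters stated above imply the following feature-matrix dimensions. The TypeScript sketch below is only a back-of-the-envelope illustration; the common 13 static + 13 delta + 13 delta-delta decomposition of the 39 coefficients is an assumption and is not stated in the paper.

// Framing implied by the stated front-end: 16 kHz input, one feature vector every 10 ms,
// 39 MFCC-based coefficients per frame (their internal split is assumed, not stated).
const SAMPLE_RATE_HZ = 16000;
const FRAME_SHIFT_MS = 10;
const FEATURES_PER_FRAME = 39;   // commonly 13 static + 13 delta + 13 delta-delta

function featureMatrixShape(numSamples: number): [number, number] {
  const durationMs = (numSamples / SAMPLE_RATE_HZ) * 1000;
  const frames = Math.floor(durationMs / FRAME_SHIFT_MS);
  return [frames, FEATURES_PER_FRAME];   // e.g. 3 s of speech -> [300, 39]
}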
3 Case Study: Adapting an ASR System to the Polish Language
In the following part, we present a method that allows us to adapt the acoustic model of an ASR system to a new language. Its main benefit is that only a minimum amount of speech data needs to be recorded and annotated for the target language. Instead of recording human-produced speech, we employ data generated by a high-quality TTS system. Then, we analyze it with an ASR system in order to find the optimal mapping between the phonemes of the target language and the phonetic inventory of the original acoustic model. Moreover, the ASR system serves as an automatic transducer that proposes and evaluates the rules for transcribing the orthographic form of words in the target language into pronunciation forms based on the phonemes of the original language. It is evident that the AM created for the new language using the above approach cannot perform as well as an AM trained directly on target-language data. Therefore, we need to evaluate how good this adapted AM is. For this purpose we utilize the TTS again. This time we employ it as a generator of test data, perform speech recognition tests, and compare the results with those achieved for the same utterances produced by human speakers. In the next sections, we will illustrate the method with a case study in which two Czech ASR systems have been adapted for Polish.

3.1 Czech vs. Polish Phonology
Czech and Polish belong to the same West branch of Slavic languages; however, they differ significantly on the lexical as well as the phonetic level. The phonetic inventory of Czech consists of 10 vowels (5 short and 5 long ones, plus the very rare schwa) and 30 consonants. All are listed in Table 1, where each phoneme is represented by its SAMPA symbol [6]. (In this text, we prefer to use SAMPA notation rather than IPA because it is easier to type and read.) Polish phonology [8] recognizes 8 vowels and 29 consonants. Their list with SAMPA symbols [9] is in Table 2. By comparing the two tables we can see that there are 3 vowels (I, e~, o~) and 5 consonants (ts', dz', s', z', w) that are specific to Polish. All the other phonemes have their counterparts in Czech. (Note that the symbol n' used in Polish SAMPA is equivalent to J in Czech SAMPA.)
Table 1. Czech phonetic inventory

Groups            SAMPA symbols
Vowels (11)       a, e, i, o, u, a:, e:, i:, o:, u:, @ (schwa)
Consonants (30)   p, b, t, d, c, J\, k, g, ts, dz, tS, dZ, f, v, s, z, S, Z, X, h\, Q\, P\, j, r, l, m, n, N, J, F

Table 2. Polish phonetic inventory

Groups            SAMPA symbols
Vowels (8)        a, e, i, o, u, I, e~, o~
Consonants (29)   p, b, t, d, k, g, ts, dz, tS, dZ, f, v, s, z, S, Z, X, ts', dz', s', z', w, j, r, l, m, n, N, n' (equivalent to Czech J)
3.2 How to Map Polish Phonemes to the Czech Phoneme Inventory?

In our previous research on cross-lingual adaptation, we transferred a Czech ASR system to Slovak [10] and to Spanish [3]. In both cases, the Czech acoustic model was used and the language-specific phonemes were mapped to the Czech ones. The mapping was designed by experts who knew both the original and the target language. An alternative to this expert-driven method is a data-driven approach, e.g. the one described in [11], where the similarity between phonemes in two languages is measured by the Bhattacharyya distance. However, this method requires quite a lot of recorded and annotated data in both languages. In this paper, we propose an approach in which the data for the target language are generated by a TTS system and the mapping is controlled by an ASR system. The main advantage is that the data can be produced automatically, on demand, and in the amount and structure that is needed.

3.3 Phonetic Mapping Based on TTS Output Analyzed by an ASR System

The key component is a high-quality TTS system. For Polish, we have chosen the IVONA software [12]. It employs an algorithm that produces almost natural speech by concatenating properly selected units from a large database of recordings. The software has won several awards in TTS competitions [13, 14]. It currently offers 4 different voices (2 male and 2 female), which - for our purpose - introduces an additional degree of voice variety. The software can be tested via its web pages [12]: any text typed in its input box is immediately converted into an utterance. The second component is an ASR system operating with the given acoustic model (the Czech one in this case). It is arranged so that it works as a partly restricted phoneme recognizer. The ASR module takes a recording, transforms it into a series of feature vectors X = x(1), ... x(t), ... x(T) and outputs the most probable sequence of phonemes p1, p2, ... pN. The output includes the phonemes, their times and their likelihoods. The module is called with several parameters, as shown in the example below:
Recording_name:      maslo-Ewa.wav
Recorded_utterance:  masło
Pronunciation:       mas?o
Variants:            ? = u | l | uv
In the above example, the recognizer takes the given sound file, processes it and evaluates which of the proposed pronunciations fits best. The output looks like this:

1. masuo    avg. likelihood = -77.417
2. masuvo   avg. likelihood = -77.956
3. maslo    avg. likelihood = -78.213
We can see that for the given recording and the given AM, it is the Czech phoneme 'u' that fits best to the Polish letter 'ł' (and the corresponding phoneme 'w'). The module also provides rich information from the phonetic decoding process (phoneme boundaries, likelihoods in frames, etc.), which can be used for detailed study, as shown in Fig. 1.
Fig. 1. Diagrams showing log likelihoods in frames of speech generated by TTS system (voices Ewa, Jacek, Jan, Maja). Different pronunciation variants of Polish letter ‘ł‘ in word ‘masło‘ can be compared.
Using the TTS software we recorded more than 50 Polish words, each spoken by the four available voices. The words were selected so that all the Polish-specific phonemes occurred at various positions and in various contexts (e.g. at the start, in the middle, at
the end of words, in specific phonetic clusters, etc.). For each word, we offered the phoneme recognizer several pronunciation alternatives to choose from. In most cases, the output from the recognizer was consistent in the sense that the same best phonemes were assigned to the Polish ones across various words and the four TTS voices. In some cases, however, the mapping turned out to be dependent on the context; e.g. Polish 'rz' was mapped either to Czech phoneme 'Z', 'Q\' or 'P\'. The results are summarized in Table 3. We can see that the resulting map covers not only the phoneme-to-phoneme relations but also the grapheme-to-phoneme conversion. It is also interesting to compare these objectively derived mappings with subjective perception. Since Poland and the Czech Republic are neighboring countries, Czech people have many chances to hear spoken Polish and to use some Polish words, such as proper and geographical names. The subjective perception of some Polish-specific phonemes seems to differ from what has been found by the objective investigation. For example, Czech people tend to perceive Polish 'I' as 'i' (the reason being that the letters 'i' and 'y' are pronounced in the same way in Czech, namely as 'i'). Also, the Polish pair of letters 'rz' is usually considered equivalent to Czech 'ř', which is not always true. The method described above shows that the ASR machine (equipped with the given acoustic model) has a different perception. In any case, this perception is the relevant, objective one, because it is the ASR system that has to perform the recognition task.

Table 3. Polish orthography and phonemes mapped to the Czech phonetic inventory
Letter(s) in Polish orthography   Polish phoneme(s) (SAMPA)   Mapping to Czech phoneme(s) (SAMPA)
y                                 I                           e, (schwa)
ó                                 u                           u
ę                                 e~                          e+n, (e+N)
ą                                 o~                          o+n, (o+N)
dz                                dz                          dz
ź / z(i)                          z'                          Z
ś / s(i)                          s'                          S
dź / dz(i)                        dz'                         dZ
ć / c(i)                          ts'                         tS
ż                                 Z                           Z
rz                                Z                           Z (Q\ or P\ in clusters trz, drz)
sz                                S                           S
dż                                dZ                          dZ
cz                                tS                          tS
ń / n(i)                          n'                          J
h, ch                             X                           X
ł                                 w                           u
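To make the use of Table 3 concrete, the following Python sketch applies a simplified subset of the rules by greedy longest match. The identity fall-back for unlisted letters and the extra 'w' -> 'v' rule are assumptions added for this illustration, and the context-dependent cases (e.g. 'rz' in 'trz'/'drz' clusters) are ignored; this is not the rule engine actually used in the described system.

def polish_to_czech_sampa(word):
    # Greedy longest-match application of a simplified subset of the Table 3 rules.
    rules = {
        "dzi": "dZ", "dź": "dZ", "dż": "dZ", "dz": "dz",
        "rz": "Z", "sz": "S", "cz": "tS", "ch": "X",
        "ci": "tS", "si": "S", "zi": "Z", "ni": "J",
        "ó": "u", "ę": "e n", "ą": "o n", "y": "e",
        "ż": "Z", "ź": "Z", "ś": "S", "ć": "tS", "ń": "J",
        "ł": "u", "h": "X",
        "w": "v",                       # assumption, not listed in Table 3
    }
    out, i = [], 0
    word = word.lower()
    while i < len(word):
        for length in (3, 2, 1):        # longest match first
            chunk = word[i:i + length]
            if len(chunk) == length and chunk in rules:
                out.append(rules[chunk])
                i += length
                break
        else:
            out.append(word[i])         # identity fall-back (assumption)
            i += 1
    return " ".join(out)

print(polish_to_czech_sampa("masło"))   # -> "m a s u o", cf. the example in Section 3.3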
3.4 Evaluation on a Small-Vocabulary Task

The first task in which we tested the proposed method and evaluated the resulting mapping was Polish voice-command control, the same as in the MyVoice tool. The basic lexicon in this application consists of 256 commands, such as names of letters, digits, keys on the PC keyboard, mouse actions, names of computer programs, etc. These commands were translated into Polish, their pronunciations were created automatically using the rules in Table 3, and after that they were recorded by the IVONA TTS (all four voices) and by two Polish speakers. All the recordings were passed to the MyVoice ASR module operating with the original Czech AM. The experiment was to show us how well this cross-lingual application performs, and whether there is a significant difference in the recognition of synthetic and human speech. It also allowed us to compare the objectively derived mapping with the subjective phoneme conversion mentioned in Section 3.3. The results are included in Table 4. We can observe that the performance measured by the Word Recognition Rate (WRR) is considerably high both for the TTS data and for the human speakers. The results are comparable to those achieved for Czech, Slovak and Spanish [3].

3.5 Evaluation on Fluent Speech Dictation with a Large Lexicon

The second task was to build a very preliminary version of a Polish voice dictation system for radiology. In this case, we used data (articles, annotations, medical reports) available online at [15]. For this purpose, we collected a small corpus (approx. 2 MB) of radiology texts and created a lexicon made of the 23,060 most frequent words. Their pronunciations were derived automatically using the rules in Table 3. The bigram language model was computed on the same corpus. To test the prototype system, we selected three medical reports not included in the training corpus. They were recorded again by the IVONA software (four times, with the four different voices) and by two native speakers. The results from this experiment are also part of Table 4. The WRR values are about 8-10% lower compared to the Czech dictation system for radiology, but it should be noted that our main aim was to test the proposed fast prototyping technique. The complete design of this demo system took just one week. It is also interesting to compare the results achieved with the TTS data to the human-produced ones. We can see that the TTS speech yielded slightly better recognition rates. This is not surprising, as we had already observed it in our previous investigations [16]. In any case, we can see that the TTS utterances can be used during the development process as a cheap source of benchmarking data.

Table 4. Results from speech recognition experiments in the Polish language
Task                                           Lexicon size   WRR [%]
Voice commands – TTS data                      256            97.8
Voice commands – human speech                  256            96.6
Fluent dictation (radiology) – TTS data        23060          86.4
Fluent dictation (radiology) – human speech    23060          83.7
4 Discussion and Conclusions

The results of the two experiments show that the proposed combination of TTS data and ASR-driven mapping is applicable to the rapid prototyping of programs that are to be transferred to other languages. The TTS system for the target language should, of course, be of high quality, and it is an advantage if it offers multiple voices. If this is the case, we can obtain not only the required L2-L1 phonetic mapping but also the grapheme-to-phoneme conversion table that helps us in generating pronunciations for the lexicon in the target application. Moreover, the TTS system can serve as a cheap source of test data needed for preliminary evaluations. The results we obtained in the first experiment show that the created lexicon (with its automatically derived pronunciations) could be immediately used in the Polish version of the MyVoice software. Even though the internal acoustic model is Czech, we can expect the overall system performance to be at a level similar to that for Czech users. The most important point is that during the prototype development no Polish data needed to be recorded and annotated, and thus the whole process could be fast and cheap. Furthermore, we showed that the phonetic mapping generated via the combination of TTS and ASR systems leads to more objective and better results than those based on subjective perception. In the second experiment we demonstrated that the same automated approach can also be utilized in a more challenging task, during the initial phase of the development of a dictation system. Within a very short time we were able to create a Polish version of the program that can be used for demonstration purposes, for getting potential partners interested and for allowing at least initial testing with future users.

Acknowledgments. The research was supported by the Grant Agency of the Czech Republic (grant no. 102/08/0707).
References 1. Vu, N.T., Schlippe, T., Kraus, F., Schultz, T.: Rapid Bootstrapping of five Eastern European Languages using the Rapid Language Adaptation Toolkit. In: Proc. of Interspeech 2010, Japan, Makuhari, pp. 865–868 (2010) 2. Cerva, P., Nouza, J.: Design and Development of Voice Controlled Aids for MotorHandicapped Persons. In: Proc. of Interspeech 2007, Antwerp, pp. 2521–2524 (2007) 3. Callejas, Z., Nouza, J., Cerva, P., López-Cózar, R.: Cost-Efficient Cross-Lingual Adaptation of a Speech Recognition System. In: Advances in Intelligent and Soft Computing, vol. 57, pp. 331–338. Springer, Heidelberg (2009) 4. Nouza, J., Cerva, P., Zdansky, J.: Very Large Vocabulary Voice Dictation for Mobile Devices. In: Proc. of Interspeech 2009, UK, Brighton, pp. 995–998 (2009) 5. Nouza, J., Zdansky, J., Cerva, P., Silovsky, J.: Challenges in Speech Processing of Slavic Languages (Case Studies in Speech Recognition of Czech and Slovak). In: Esposito, A., Campbell, N., Vogel, C., Hussain, A., Nijholt, A. (eds.) Development of Multimodal Interfaces, COST Seminar 2009. LNCS, vol. 5967, pp. 225–241. Springer, Heidelberg (2010) 6. Czech SAMPA, http://noel.feld.cvut.cz/sampa/ 7. Nouza, J., Psutka, J., Uhlir, J.: Phonetic Alphabet for Speech Recognition of Czech. Radioengineering 6(4), 16–20 (1997)
8. Gussman, E.: The Phonology of Polish. Oxford University Press, Oxford (2007) 9. Polish SAMPA, http://www.phon.ucl.ac.uk/home/sampa/polish.htm 10. Nouza, J., Silovsky, J., Zdansky, J., Cerva, P., Kroul, M., Chaloupka, J.: Czech-to-Slovak Adapted Broadcast News Transcription System. In: Proc. of Interspeech 2008, Australia, Brisbane, pp. 683–2686 (September 2008) 11. Kumar, S.C., Mohandas, V.P., Li, H.: Multilingual Speech Recognition: A Unified Approach. In: Proc. of Interspeech 2005, Portugal, Lisboa, pp. 3357–3360 (2005) 12. IVONA TTS system, http://www.ivona.com/ 13. Kaszczuk, M., Osowski, L.: Evaluating Ivona Speech Synthesis System for Blizzard Challenge 2006. In: Blizzard Workshop, Pittsburgh (2006) 14. Kaszczuk, M., Osowski, L.: The IVO Software Blizzard 2007 Entry: Improving Ivona Speech Synthesis System. In: Sixth ISCA Workshop on Speech Synthesis, Bonn (2007) 15. http://www.openmedica.pl/ 16. Vich, R., Nouza, J., Vondra, M.: Automatic Speech Recognition Used for Intelligibility Assessment of Text-to-Speech Systems. In: Esposito, A., Bourbakis, N.G., Avouris, N., Hatzilygeroudis, I. (eds.) HH and HM Interaction. LNCS (LNAI), vol. 5042, pp. 136–148. Springer, Heidelberg (2008)
Towards the Automatic Detection of Involvement in Conversation

Catharine Oertel1, Céline De Looze1, Stefan Scherer2, Andreas Windmann3, Petra Wagner3, and Nick Campbell1

1 Speech Communication Laboratory, Trinity College Dublin, Ireland
2 University of Ulm, Germany
3 Bielefeld University, Germany
Abstract. Although an increasing amount of research has been carried out into human-machine interaction in the last century, even today we are not able to fully understand the dynamic changes in human interaction. Only when we achieve this will we be able to go beyond a one-to-one mapping between text and speech and be able to add social information to speech technologies. Social information is expressed to a high degree through prosodic cues and movement of the body and the face. The aim of this paper is to use those cues to make one aspect of social information more tangible, namely participants' degree of involvement in a conversation. Our results for voice span and intensity, and our preliminary results on the movement of the body and face, suggest that these cues are reliable indicators of distinct levels of participants' involvement in conversation. This will allow for the development of a statistical model which is able to classify these stages of involvement. Our data indicate that involvement may be a scalar phenomenon. Keywords: Social involvement, multi-modal corpora, discourse prosody.
1 Introduction
Language and speech, and later, writing systems, have evolved to serve human communication. In today’s society human-machine interaction is becoming more and more ubiquitous. However, despite more than half a century of research in speech technology, neither computer scientists, linguists nor phoneticians have yet reached a full understanding of how the variations in speech function as a means of human communication and social interaction. A one-to-one mapping between text and speech is not sufficient to treat the social information exchanged in human interaction. What makes a conversation a naturally interactive dialogue are the dynamic changes involved in spoken interaction. We propose that these changes might be explained by the concept of involvement. Following Antil [1] we define involvement as “the level of perceived personal importance and/or interest evoked by a stimulus (or stimuli) within a specific situation” [1].
Moreover, we consider involvement in our study to be a scalar phenomenon. Contrary to Wrede & Shriberg [2] who define involvement as a binary phenomenon, we agree with Antil in that “involvement must be conceptualized and operationalized as a continuous variable, not as a dichotomous variable” [1]. Similar to Dillon [3], who uses a slider to let participants indicate their level of emotional engagement, we used a scale from 1-10 in our annotation scheme to indicate distinct levels of involvement. Studies on involvement [2], [4], or related concepts such as emotional engagement [5] [6], interest [7], or interactional rapport [8] reported that these phenomena are conveyed by specific prosodic cues. For example, Wrede and Shriberg [2], in their study on involvement found that there was an increase in mean and range of the fundamental frequency (F0) in more activated speech as well as tense voice quality. Moreover, Crystal and Davy [9] reported that, in live cricket commentaries, the more the commentator is involved in reporting the action (i.e. at the action peak), the quicker the speech rate.
2 Main Objectives and Hypotheses
In our study we looked at how prosodic parameters as well as visual cues may be used to indicate levels of involvement. A statistical model based on these cues would enable the automatisation of involvement detection. Automatic involvement detection allows for a time efficient search through multimodal corpora, and may be used for interactive speech synthesis. The prosodic parameters (i.e. F0, duration and intensity) include level and span of the voice, articulation rate (i.e. excluding pauses) and intensity of the voice. The visual parameter includes the participants’ amount of change in movement of the body and face. Based on studies [2–9] our hypotheses are: the higher the degree of involvement, (1) the higher the level and (2) the wider the span of the voice, (3) the quicker the articulation rate, (4) the higher the intensity and (5) the higher the amount of movement in the face and body of the participants.
3 Experiment

3.1 Data Collection: The D64 Corpus
We used the D64 corpus [10] for this study. It was recorded over two successive days in a rented apartment, resulting in a total of eight hours of multimodal recordings. Five participants took part on the first day and four on the second. Three of the participants were male and two female. They were colleagues and/or friends (with the exception of one naive participant), ranging in age from early twenties to early sixties. They were able to move freely around as well as to eat and drink refreshments as in normal daily life. The conversation was not directed and ranged widely over topics both trivial and technical.
3.2 Data Selection
For our analysis, all 5 speakers were included. Data were chosen from two different recording sessions, Session 1 and Session 2 (a total of 1 hour of recording). For Session 1, there was no predefined topic, and the conversation was allowed to meander freely. For Session 2, the first author's Master's research was among the topics of discussion. Speaking time per speaker varies between 1 and 15 minutes (mean = 9 min; sd = 5.15).
3.3 Data Annotation
We developed an annotation scheme based on hearer-independent, intuitive impressions [11] and annotated approximately 1 hour of video recordings for levels of involvement. The annotation scheme was validated perceptively and was combined with acoustic analysis and movement data. Our measure of involvement comprises the joint involvement of the entire group. Involvement annotations are based on the following criteria. Involvement level 1 is reserved for cases in which virtually no interaction is taking place and in which interlocutors are not taking notice of each other at all and are engaged in completely different pursuits. Involvement level 2 is a less extreme variant of involvement level 1. Involvement level 3 is annotated when subgroups emerge; for example, in a conversation with four participants, this would mean that two subgroups of two interlocutors each would be talking about different subjects and ignoring the respective other subgroup. Involvement level 4 is annotated when only one conversation is taking place, while for involvement level 5 interlocutors also need to show mild interest in the conversation. Involvement level 6 is annotated when the conditions for involvement level 5 are fulfilled and interlocutors encourage the turn-holder to carry on. Involvement level 7 is annotated when interlocutors show increased interest and actively contribute to the conversation. For involvement level 8, interlocutors must fulfil the conditions for involvement level 7 and contribute even more actively to the conversation; they might, for example, jointly and wholeheartedly laugh or totally freeze following a remark of one of the participants. Involvement level 9 is annotated when interlocutors show absolute, undivided interest in the conversation and each other and vehemently emphasise the points they want to make; participants signal that they either strongly agree or disagree with the turn-holder. Involvement level 10 is an extreme variant of involvement level 9. A ten-point scale was chosen for annotation, but only values 4-9 were actually used in the annotations. This fact might be explained by the calm and friendly nature of the conversation. The numbers of times in which involvement levels 4 and 9 were annotated were statistically not sufficient, and these levels were thus excluded from further analysis.
3.4 Measurements and Statistical Analyses
Acoustic measurements were obtained using the software Praat [12]. The level and span of the voice were measured by calculating the F0 median (the mean
being too sensitive to erroneous values) and the log2(F0max − F0min) respectively. The F0 level is given on a linear scale (i.e. Hertz) while the F0 span is given on a logarithmic scale (i.e. octaves). In order to avoid possible pitch tracking errors, the pitch floor and pitch ceiling were set to the values q15 · 0.83 (where 'q' stands for percentile) and q65 · 1.92 (De Looze [13]). Articulation rate was calculated in terms of the number of syllables per second; syllables were detected automatically using a prominence detection tool developed by Tamburini [14]. In order to neutralise speaker differences in voice level and span, articulation rate and intensity, the data were normalised by a z-score transformation. For the movement extraction, an algorithm was chosen which is not restricted to calculating movement changes for the whole picture but can do so for individual people (note that movement measurements were calculated in this study for only two speakers). From the video data, the coordinates of the faces and bodies at each frame, given by the exact positions of the top left and the bottom right corner of the face, are extracted as in Scherer et al. [15] by utilising the standard Viola-Jones algorithm [16]. Since these coordinates are highly dependent on the distance of the person to the camera, normalisation is carried out in order to obtain relative movement over the size of the detected face and body. A moving average is calculated only in the case where a face is recognised. ANOVA analyses were carried out for the above-mentioned cues.
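A minimal Python sketch of the phrase-level F0 measures and the z-score normalisation described above is given below. The percentile-based floor/ceiling (q15 · 0.83, q65 · 1.92) is applied here as a post-hoc filter on an already extracted F0 track, whereas in the study these values parameterised Praat's pitch extraction; the span follows the log2(F0max − F0min) definition given in the text.

import numpy as np

def f0_level_and_span(f0_hz):
    # Phrase-level F0 measures: level = median (Hz), span = log2(F0max - F0min).
    f0 = np.asarray(f0_hz, dtype=float)
    f0 = f0[f0 > 0]                                   # keep voiced frames only
    floor = np.percentile(f0, 15) * 0.83
    ceiling = np.percentile(f0, 65) * 1.92
    f0 = f0[(f0 >= floor) & (f0 <= ceiling)]          # suppress likely tracking errors
    return np.median(f0), np.log2(f0.max() - f0.min())

def zscore(values):
    # Speaker-wise z-score normalisation used to neutralise speaker differences
    v = np.asarray(values, dtype=float)
    return (v - v.mean()) / v.std()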
3.5 Results
Level and Span of the Voice. As illustrated in Figure 1, involvement level 6 is significantly higher than involvement level 5 (F(3,1041)=8.843; p=0.006370) and involvement level 8 is significantly higher than involvement level 7 (F(1,440)=6.58; p=0.0106). Involvement level 7 is, however, not significantly higher than level 6 (F(2,830)=4.899; p=0.35040). The acoustic cue F0-max/min, also illustrated in Figure 1, increases with involvement. While involvement level 7 is significantly higher than involvement level 6 (F(2,831)=22.82; p=7.96e-08), involvement level 6 is not significantly higher than involvement level 5 (F(3,1041)=18.31; p=0.6325) and involvement level 8 is not significantly higher than involvement level 7 (F(1,440)=21.2; p=0.274).

Fig. 1. Boxplots of F0-median and F0 max/min according to four levels of involvement

Articulation Rate. The acoustic cue articulation rate does not show any significant changes. The articulation rate of the individual speakers stays approximately the same over the various involvement levels.

Intensity. The acoustic cue intensity shows an increasing slope, as can be seen in Figure 2. While involvement level 6 is significantly higher than involvement level 5 (F(3,1130)=139.5; p=1.62e-05) and involvement level 7 is significantly higher than involvement level 6 (F(2,889)=121; p= 1
(4)
The noise index was calculated in the same way as described above for the individual noises, and the final noise index is the average of these items. The total index is then

\text{total index} = \tfrac{3}{4}\cdot\text{speech index} + \tfrac{1}{4}\cdot\text{noise index}.    (5)
2.3 Results

At the beginning of the test series the following classes were prepared for training: b (speech), u (silence/pause), a (noise of car), g (gesture), k (background speech), s (noise of wind), t (telephone signal), r (creaking), i (hooter). The sound of the hooter was removed immediately after the first test, because it appeared only in one sound file, and only for a short time. The markings of p (clatter of paper) and h (hitting) sounds were merged with creaking because of the acoustic similarity of these sounds. During the tests we introduced a "breathing" label, which consisted of breathing noises produced by the speakers on the phone. In test series 1 the acoustic parameters were the following: mel-frequency cepstral coefficients calculated with different window lengths (100, 250, 500 and 750 milliseconds), intensity and fundamental frequency values, and their first and second derivatives. We achieved the best results with MFCC parameters calculated with a window size of 500 ms (Table 5). For the sound files of the worst quality (recorded in a car with a hands-free phone, marked with * in Table 5), the classification results were also poor; the system was not able to recognize almost any speech in these files. To improve these results, we introduced a "noisy speech" class (marked with "z"). The results achieved this way, and the results achieved with the original models, can be seen in Table 5.

Table 4. Length of Markov models assigned to classes
Number of states    Labels (classes)
11-state model      b, k
5-state model       a, g, s, t, r, u, l
Table 5. The best classification results achieved with a time window of 500 ms according to different indexes, given in [%]

                     In case of original models                After the introduction of a noisy model
Record identifier    Speech index  Noise index  Total index    Speech index  Noise index  Total index
01*                  0,69          63,95        16,51          46,81         63,3         50,93
02*                  11,36         24,29        14,59          32,74         24,29        30,6
03                   100           33,7         83,42          100           35,58        83,89
04                   83,62         29,39        70,07          68,43         29,07        58,59
05                   100           15,34        78,84          82,64         9,8          64,43
06                   98,75         22,9         79,79          98,88         23,34        79,99
07                   67,22         33,4         58,76          76,8          33,28        65,92
08                   83,61         33,1         70,98          84,22         32,71        71,34
09                   76,31         0,46         57,35          80,06         0,58         60,19
10                   84,55         36,79        72,61          88,82         38,24        76,17
Table 6. Result of modified class grouping

Number of states    Classes
14                  b, z, k
11                  s, a, u
5                   g, r
4                   l, t
Table 7. Recognition results achieved with modified groups of classes given in [%]
Record identifier    Speech index    Noise index    Total index
01                   49,65           57,96          51,73
02                   16,75           28,95          19,79
03                   100             38,34          84,58
04                   87,23           17,75          69,86
05                   82,64           8,61           64,13
06                   100             29,2           82,3
07                   65,2            30,09          56,42
08                   86,91           37,24          74,49
09                   83,24           0,58           62,57
10                   88,1            36,89          75,3
Fig. 2. An example of the result of automatic classification
For the sake of further improvement of the classification, we tried to modify the models in several ways. On the basis of the assumed difficulty (complexity of the acoustic classes), of labels thought to be falsely classified, and of the average length of the individual sound patterns, we generated different groups of models and assigned Markov models with different numbers of states to them. The groups of classes obtained with this method, and the corresponding recognition results, are shown in Tables 6 and 7. The best recognition resulted from four different groups of Markov model state
numbers. Short Markov models were applied to short noises, like hit and crash noises, and to sounds with a low rate of spectral change, like dial tones. Longer Markov models were applied to speech and to longer noises with a higher rate of spectral change, like wind and car sounds. These changes increased the recognition performance in the case of almost every recording.
3 Emotion Recognition

3.1 Database of Emotions

For the realization of emotion recognition, spontaneous telephone speech containing continuous conversations and recordings of different talk shows were collected and annotated. The recordings consist of spontaneous speech material and improvisation played by actors. The continuous speech was divided into phrase units, the phrases were annotated with emotions, and the most characteristic emotional parts were marked. During the annotation it turned out that the emotional classification of the phrase units was not obvious to the human listeners: the annotators assigned different emotions to the same segments. To solve this, the persons making the annotation had to mark only the borders of the segments filled with emotion, and the classification of the segments was made by multiple listeners in a separate subjective test series using a predefined set of emotions, without any scale of intensity of a given emotion (Table 8). The listeners did not have to consider the intensity of the heard emotion. The subjective listening tests of the 2540 emotional segments were thus made by 30 persons, after which we chose 985 emotional segments from 43 speakers covering 6 emotions. Only those voice patterns were selected where there was 70% agreement in the decisions. The emotions were the following: neutral, sad, surprised, angry/nervous, laughing during speech, and happy. The distribution among the categories is shown in Table 8.

Table 8. Number of emotional patterns selected by 30 monitoring persons
Type of emotion           Number of selected phrases (70% agreement in the decisions at the subjective test)
Neutral                   517
Nervous/angry             290
Happy                     39
Laughing during speech    42
Sad                       54
Surprised                 43
3.2 Emotion Recognition Process

During the emotion recognition experiments we had to reduce the set of emotion categories, since not all of them had enough samples for proper training. Four emotion categories were selected; according to Tables 9 and 10 they are the following: neutral, angry/nervous, happy together with laughing during speech, and sad. In order to achieve proper training, a balanced set of emotion samples was selected. The
neutral and anger categories were reduced to the size of the happy category. We applied Support Vector Machines for the automatic classification, using the freely downloadable LIBSVM toolkit [7] with the C# programming language. The aim of these experiments was to examine which acoustic parameters are necessary for the recognition of emotions. We examined the following features:
average, maximum, range and standard deviation of the fundamental frequency (marking: F0) average, maximum, range and standard deviation of derivative of the fundamental frequency (marking: ΔF0) average, maximum, range and standard deviation of intensity (marking: EN) average, maximum, range and standard deviation of derivative of the intensity (marking: ΔEN) average, maximum, range and standard deviation of 12 mel-frequency cepstral coefficients (marking: MFCCi) average, maximum, range and standard deviation of harmonicity values (marking: HARM)
Every characteristic was computed with a 10-ms timestep, and then we calculated the proper statistic characteristic by every phrase-length unit. Thus every phrase had a value from the above enumeration, and all of these features were put into the feature vector related to the given phrase. 3.3 Results Four emotion marks - angry/nervous: A, happy: J, neutral: N, sad: S - were used during the tests. Table 9 contains the results of four experiments prepared with different types of feature vectors. Table 9. Results of automatic recognitions in [%], in case of different groups of feature vectors
Feature vector: F0, ΔF0, EN, ΔEN
      A    J    N    S
A    51   18    6   15
J    15   32    9    4
N     5   17   57   13
S     4    2    3    7
Recognition result: 56,98

Feature vector: F0, ΔF0, EN, ΔEN, HARM
      A    J    N    S
A    46   17    7   12
J    13   30    8    7
N    10   16   56   12
S     6    6    4    8
Recognition result: 54,26

Feature vector: F0, ΔF0, EN, ΔEN, MFCCi
      A    J    N    S
A    57   12    4    5
J    13   37   12   17
N     4   13   55    5
S     1    7    4   12
Recognition result: 62,40

Feature vector: F0, ΔF0, EN, ΔEN, HARM, MFCCi
      A    J    N    S
A    61   11    3    5
J     9   41   12   16
N     4   11   56    5
S     1    6    4   13
Recognition result: 66,27

Table 10. Results of automatic emotion recognition in [%] for female and male voice samples, using the feature vector that gave the best result
male speakers
      A    J    N    S
A    17    1    2    1
J     0    7    2    5
N     4    2   18    0
S     1    7    0   14
Recognition result: 69,14

female speakers
      A    J    N    S
A    46    9    1    3
J     6   31    9    8
N     1   11   40    6
S     0    1    3    2
Recognition result: 67,28

The recognition results show that, beyond the basic prosodic parameters that can be found in the literature (fundamental frequency, intensity), the mel-frequency cepstral parameters play an important role in automatic recognition. This means that spectral
features also carry important information for emotion recognition. Harmonicity values can improve it further, but since the number of samples is not yet sufficient, their effect is not proven; it is, however, worth examining in the future, and there is a need for continued database collection. It is also worth looking at the results when the voice patterns are separated into female and male samples; the result of this can be seen in Table 10. Although the recognition shows a slight improvement, the difference is small, partly because of the insufficient number of voice samples.
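As a sketch of the phrase-level feature extraction and SVM classification described in Sections 3.2 and 3.3, the following Python fragment uses scikit-learn's SVC (which wraps LIBSVM) instead of the C# toolkit used in the experiments; the arrays X_train, y_train and X_test are hypothetical placeholders, and the harmonicity features are omitted.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def phrase_features(f0, energy, mfcc):
    # Phrase-level statistics (average, maximum, range, standard deviation) of
    # F0, dF0, intensity, d-intensity and the 12 MFCCs, cf. Section 3.2.
    # Inputs are per-10-ms-frame tracks of one phrase (mfcc has shape (12, T)).
    def stats(x):
        x = np.atleast_2d(x)
        return np.concatenate([x.mean(1), x.max(1), x.max(1) - x.min(1), x.std(1)])
    return np.concatenate([stats(f0), stats(np.diff(f0)),
                           stats(energy), stats(np.diff(energy)),
                           stats(mfcc)])

# Training and classification on hypothetical feature matrices X_train, X_test
# and label vector y_train (one row per phrase):
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
# clf.fit(X_train, y_train)
# predicted = clf.predict(X_test)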
4 A Quasi-Real-Time Emotion Recognition Process for Spontaneous Speech

During speech communication, especially in the case of a long conversation, the emotional state of the speaker can change continuously. To follow the mental state of the speaker, we have to separate the continuous speech into sections. In the present case we chose the phrase as the basic unit of segmentation. In the construction of our real-time recognizer, the automatic phrase-level segmentation is realized by the speech detector described in Section 2. The block diagram of the real-time automatic emotion recognizer is shown in Figure 3, where the speech detector-segmenter and the emotion recognizer are built together. The acoustic processing of the two independent recognizers is shown separately in the figure, because the system currently uses two different methods; however, we plan to use only one module for this in the future.
Fig. 3. Block diagram of the automatic emotion recognizer in case of spontaneous speech
5 Conclusion

In this article a method was presented for the automatic emotion recognition task which is able to recognize emotions in real time and in a noisy environment on the basis of prosodic and spectral parameters of speech. We have developed a process based on Hidden Markov Models which segments the audio signal into phrase-sized speech parts and acoustic environment noise, solving the speech/non-speech detection and the phrase-level segmentation. Evaluating the results of the speech detection, it can be concluded that this method can be applied to spontaneous speech. The achieved speech index can reach 80% for recordings that are not too noisy. This is an acceptable performance, as can be seen in Figure 2. The speech detection and phrase segmentation process is followed by the emotion recognition process. When trained with voice samples of four emotions selected by subjective monitoring, the automatic recognizer based on Support Vector Machines can classify emotional voice samples of phrase-length units with 66% correctness.

Acknowledgement. This research was prepared in the framework of the Jedlik project No. OM-00102/2007 named "TELEAUTO" and the TÁMOP-4.2.2-08/1/KMR-2008-0007 and TÁMOP 4.2.2-08/1-2008-0009 projects.
References 1. Tóth, S.L., Sztahó, D., Vicsi, K.: Speech Emotion Perception by Human and Machine. In: Proceedings of COST Action 2102 International Conference. Patras, Greece, October 29-31 (2007); Revised Papers in Verbal and Nonverbal Features of Human-Human and HumanMachine Interaction 2008. LNCS, vol. 5042, pp. 213–224. Springer, Heidelberg (2008) 2. Hozjan, V., Kacic, Z.: A rule-based emotion-dependent feature extraction method for emotion analysis from speech. The Journal of the Acoustical Society of America 119(5), 3109–3120 (2006) 3. Navas, E., Hernáez, I., Luengo, I.: An Objective and Subjective Study of the Role of Semantics and Prosodic Features in Building Corpora for Emotional TTS. IEEE Transactions on Audio, Speech, and Language Processing 14(4), 1117–1127 (2006) 4. Klára, V., Dávid, S.: Ügyfél érzelmi állapotának detektálása telefonos ügyfélszolgálati dialógusban. VI. Magyar Számítógépes Nyelvészeti Konferencia, Szeged, pp. 217-225 (2009) 5. Boersma, P., Weenink, D.: Praat: doing phonetics by computer (Computer program), http://www.praat.org (retrieved) 6. The Hidden Markov Model Toolkit (HTK), http://htk.eng.cam.ac.uk/ 7. Chang, C.C., Lin, C.-J.: LIBSVM: a library for support vector machines (2001), Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
Modification of the Glottal Voice Characteristics Based on Changing the Maximum-Phase Speech Component

Martin Vondra and Robert Vích

Institute of Photonics and Electronics, Academy of Sciences of the Czech Republic, Chaberska 57, CZ 18251 Prague 8, Czech Republic
{vondra,vich}@ufe.cz
Abstract. Voice characteristics are influenced especially by the vocal cords and by the vocal tract. Characteristics known as voice type (normal, breathy, tense, falsetto etc.) are attributed to vocal cords. Emotion influences among others the tonus of muscles and thus influences also the vocal cords behavior. Previous research confirms a large dependence of emotional speech on the glottal flow characteristics. There are several possible ways for obtaining the glottal flow signal from speech. One of them is the decomposition of speech using the complex cepstrum into the maximum- and minimum-phase components. In this approach the maximum-phase component is considered as the open phase of the glottal flow signal. In this contribution we present experiments with the modification of the maximum-phase speech signal component with the aim to obtain synthetic emotional speech.
1 Introduction

The present research in speech synthesis focuses especially on the ability to change the expressivity of the produced speech. In the case of unit concatenation synthesis, where synthetic speech achieves almost natural sounding quality, the only possibility is to construct several new speech corpora, one for each expressive style [1]. This is a very time- and resource-consuming procedure and greatly increases the memory demands of the speech synthesis system. From this perspective, it would be better if we could directly influence or modify the individual characteristics of the speech related to the expressive style of speaking. This can be achieved by a suitable speech model that allows the individual parameters to be manipulated. The basic speech production model is based on the source-filter theory (Fig. 1). In the simplest case the source, or excitation, is represented by Dirac unit impulses with a period equal to the fundamental period of speech for voiced sounds, and by white noise for unvoiced speech. The vocal tract model is represented by a time-varying digital filter, which performs the convolution of the excitation with its impulse response. The vocal tract model can be based on linear prediction [2], on an approximation of the inverse cepstral transformation [3], etc. The main speech characteristics related to expressivity are the prosody (pitch, intensity and timing variation) and the voice quality (the speech timbre). Voice quality is determined both by the vocal tract and by the vocal cord oscillation.
Fig. 1. Source-filter speech model
For the given speaker, the vocal tract is primarily responsible for creating the resonances realizing the corresponding speech sounds. The voice quality characteristics are given mainly by the excitation of the vocal tract, i.e. by the glottal signal. The vocal cords influence the speech in a way that can be described as modal, breathy, pressed or lax phonation. Several papers confirm that the source speech parameters are influenced by the expressive content of speech [4, 5]. If we want to achieve a modification of speech based on the source-filter model, we must first perform speech deconvolution into the source and vocal tract components. There are several possibilities for doing this. If we have an estimate of the vocal tract model parameters, we can use filtering by the inverse vocal tract filter [6]. Research on the Zeros of the Z-Transform (ZZT) [7] of speech frames showed that the deconvolution into the source and filter components can be done by separating the ZZT into zeros inside and outside the unit circle in the z-plane. The same can be done by separating the complex cepstrum into its anticipative and causal parts [8]. Our approach is based on speech deconvolution using the complex cepstrum, which allows us to obtain the glottal signal from the speech. However, practical experiments show that the cepstral deconvolution does not lead to the true glottal signal in all voiced speech frames. Our solution to this issue lies in the estimation of the parameters of the glottal signal from the source magnitude spectrum obtained by deconvolution, where the basic glottal parameters are usually preserved. These parameters are the glottal formant and its bandwidth. Based on these parameters we can design a linear anticausal IIR model of the glottal signal [9]. First, the deconvolution based on the complex speech cepstrum will be introduced, and several methods of complex cepstrum computation will be described. Further, some examples of reliable and poor estimation of the glottal signal will be shown. In the following part, the design of a 2nd-order anticausal IIR glottal model will be described. The speech deconvolution can then be performed with the inverse model of the glottal signal, which leads to the vocal tract impulse response. If we save this vocal tract impulse response and change the glottal model parameters, we obtain a modified speech signal after convolution of the saved vocal tract and the modified glottal model impulse responses.
2 Complex Cepstrum Deconvolution of Speech

There are several possibilities for estimating the glottal signal from speech. A list of the most common methods is given in [6]. The majority of methods include inverse filtering by the vocal tract model. A new and attractive method, which can lead to an
estimation of the glottal signal, is based on the ZZT [7], which is a method for complex cepstrum computation of exponential sequences. In this method the glottal signal is estimated from the maximum-phase component of the speech frame. This is done by computing the roots of the speech frame Z-transform and by separating the roots into zeros inside and outside the unit circle in the z-plane. From the properties of the complex cepstrum and the ZZT, the same can also be performed by separation into the anticipative and causal parts of the complex cepstrum.

2.1 Methods of Complex Cepstrum Calculation

The complex cepstrum x̂[n] of the windowed speech frame x[n], n = 0, ..., N − 1, where N is the speech frame length, can be computed by several methods.

2.1.1 Calculation Using the Complex Logarithm

The complex cepstrum x̂[n] is given by the inverse Fast Fourier Transform (FFT) of X̂[k]:
\hat{x}[n] = \frac{1}{M}\sum_{k=0}^{M-1}\hat{X}[k]\,e^{j2\pi kn/M},    (1)
where n = −M/2 + 1, ..., 0, ..., M/2 and M is the dimension of the applied FFT algorithm; to minimize cepstral aliasing, M > N. Here

\hat{X}[k] = \ln X[k] = \ln|X[k]| + j\,\arg X[k]    (2)
is the logarithmic complex spectrum of the speech frame, where the real part is given by the logarithm of the spectrum magnitude and the imaginary part by the phase spectrum in radians. The phase is an ambiguous function with an uncertainty of 2π; for this reason we must perform phase unwrapping before the inverse FFT (1). The spectrum is efficiently computed by the FFT

X[k] = \sum_{n=0}^{M-1} x[n]\,e^{-j2\pi kn/M}.    (3)
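A minimal Python sketch of this computation is given below; M is assumed to be well above the frame length, and unlike Matlab's cceps() the linear phase component is not removed.

import numpy as np

def complex_cepstrum(frame, M=2048):
    # Complex cepstrum of a windowed speech frame, following eqs. (1)-(3).
    X = np.fft.fft(frame, M)                                   # eq. (3)
    X_hat = np.log(np.abs(X)) + 1j * np.unwrap(np.angle(X))    # eq. (2), with phase unwrapping
    x_hat = np.fft.ifft(X_hat).real                            # eq. (1)
    # indices 1 ... M/2 hold positive quefrencies (causal part);
    # indices above M/2 correspond to negative quefrencies (anticipative part)
    return x_hat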
The phase unwrapping can be rather difficult, especially in cases where the speech frame Z-transform has zeros close to the unit circle in the z-plane, which cause sudden phase changes [10]. For this reason we have also tried other methods for complex cepstrum computation.

2.1.2 Using the Logarithmic Derivative
\hat{x}[n] = -\frac{1}{jnM}\sum_{k=0}^{M-1}\frac{X'[k]}{X[k]}\,e^{j2\pi kn/M}, \qquad n = 1, \dots, M-1,    (4)
where

X'[k] = -j\sum_{n=0}^{M-1} n\,x[n]\,e^{-j2\pi kn/M}    (5)
is the logarithmic derivative of the speech frame spectrum. The first cepstral coefficient can be computed as the mean value of the logarithmic speech magnitude spectrum,

\hat{x}[0] = \frac{1}{M}\sum_{k=0}^{M-1}\log|X[k]|.    (6)
The advantage of this method is that it does not need phase unwrapping. However, if the speech frame Z-transform has zeros close to the unit circle in the z-plane and the dimension of the FFT is low, this method gives wrong, useless results. Moreover, the formula (4), adopted from [11], p. 793, is not appropriate for practical implementation. If we substitute the fraction X'[k]/X[k] by X_d[k], then a practical implementation of (4) can be based on the following formulae:

\hat{x}[n] = -\frac{1}{jn}\,\mathrm{IFFT}[X_d[k]], \qquad n = 1, \dots, M/2,
\hat{x}[n] = \frac{1}{j(M-n)}\,\mathrm{IFFT}[X_d[k]], \qquad n = M/2+1, \dots, M-1.    (7)
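A Python sketch of this logarithmic-derivative variant is shown below; zero-padding the frame to the FFT dimension M is an implementation assumption, and no special handling of spectral zeros is included.

import numpy as np

def complex_cepstrum_logderiv(frame, M=2048):
    # Complex cepstrum via the logarithmic derivative, following eqs. (4)-(7);
    # no phase unwrapping is needed. The frame is zero-padded to length M.
    x = np.zeros(M)
    x[:len(frame)] = frame
    n = np.arange(M)
    X = np.fft.fft(x)
    Xd = np.fft.fft(-1j * n * x) / X          # X'[k] / X[k], eqs. (4)-(5)
    c = np.fft.ifft(Xd)
    x_hat = np.zeros(M, dtype=complex)
    x_hat[0] = np.mean(np.log(np.abs(X)))     # eq. (6)
    k = np.arange(1, M // 2 + 1)
    x_hat[k] = -c[k] / (1j * k)               # eq. (7), n = 1, ..., M/2
    k = np.arange(M // 2 + 1, M)
    x_hat[k] = c[k] / (1j * (M - k))          # eq. (7), n = M/2 + 1, ..., M - 1
    return x_hat.real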
2.1.3 Using ZZT

The Z-transform of the windowed speech frame x[n] can be written as

X(z) = \sum_{n=0}^{N-1} x[n]\,z^{-n} = z^{-N+1}\sum_{n=0}^{N-1} x[n]\,z^{N-1-n} = x[0]\,z^{-N+1}\prod_{m=1}^{N-1}(z - Z_m).    (8)
Setting X(z) = 0 leads to the solution of a high-degree polynomial equation. A numerical method must be used; we utilize the Matlab roots function, which is based on the eigenvalues of the associated companion matrix. The zeros Z_m of (8) can lie inside or outside the unit circle in the z-plane. If we denote the zeros inside the unit circle as a_k and the zeros outside the unit circle as b_k, we can compute the complex cepstrum based on the relationship [11]
\hat{x}[n] = \log|A|, \qquad n = 0,
\hat{x}[n] = -\sum_{k=1}^{M_i}\frac{a_k^{\,n}}{n}, \qquad n > 0,    (9)
\hat{x}[n] = \sum_{k=1}^{M_o}\frac{b_k^{-n}}{n}, \qquad n < 0,
where A is a real constant, M_i is the number of zeros inside the unit circle in the z-plane and M_o is the number of zeros outside the unit circle in the z-plane. The complex cepstrum computed by this technique is called the root cepstrum. The disadvantage of this method is its relatively high computational requirement in contrast to the previous methods, especially for higher sampling frequencies, where the speech frame has a relatively high number of samples.
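A Python sketch of the root cepstrum is given below; it uses numpy's roots in place of the Matlab function, the sign conventions follow the standard derivation for zeros inside and outside the unit circle, and frame[0] is assumed to be non-zero.

import numpy as np

def root_cepstrum(frame, n_ceps=200):
    # Complex (root) cepstrum from the zeros of the frame's Z-transform (ZZT),
    # following eqs. (8)-(9).
    zeros = np.roots(frame)                        # the zeros Z_m of eq. (8)
    a = zeros[np.abs(zeros) < 1.0]                 # inside the unit circle (causal part)
    b = zeros[np.abs(zeros) >= 1.0]                # outside the unit circle (anticipative part)
    m = np.arange(1, n_ceps)
    causal = np.array([-np.sum(a**k).real / k for k in m])         # x_hat[k], k > 0
    anticausal = np.array([-np.sum(b**(-k)).real / k for k in m])  # x_hat[-k], k > 0
    gain = np.log(np.abs(frame[0] * np.prod(-b)))                  # x_hat[0] = log|A|
    return anticausal, gain, causal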
2.2 Complex Cepstrum Speech Deconvolution
The steps of complex cepstrum speech deconvolution are shown in Figs. 2 and 3. The first step is the speech segmentation. The frames must be chosen pitch-synchronously, with a length of two pitch periods and with one pitch period of overlap. It is important to have the Glottal Closure Instant (GCI) in the middle of the frame, as shown in [7]. The frame weighting is also of high importance. The Hamming window, which is usually used in speech analysis, is not the best choice; in [8] a new parameterized window is proposed, with the Hann and the Blackman windows as particular cases, and the optimum parameter for deconvolution is given. The segmentation and the chosen window make the magnitude spectrum very smooth: the periodicity of the voiced excitation is totally destroyed and the magnitude spectrum approximates the spectral envelope. The second step is the complex cepstrum computation. For this we can use (1), (7) or (9). Matlab has the function cceps(), which realizes (1) or (9). If we use (1), the phase unwrapping must be performed, and the resulting complex cepstrum is very sensitive to the phase unwrapping algorithm used. In any case, it is better to use a sufficiently high dimension for the FFT algorithm, since the phase unwrapping is then less ambiguous (for 8 kHz sampling frequency, M = 2048 FFT points is usually sufficient). If we compute the complex cepstrum using the logarithmic derivative (7), the results are quite inconsistent; their sensitivity to the FFT dimension is even higher than in the computation using phase unwrapping. The most reliable method for computing the complex cepstrum seems to be the ZZT. A comparison of speech deconvolution using the complex cepstrum computed using the FFT and the ZZT is shown in Fig. 3.
Fig. 2. Signal, spectra and complex cepstrum of the stationary part of the vowel a
Fig. 3. Anticipative and causal cepstra computed using FFT and ZZT and the corresponding spectra and impulse responses for the vowel a
The last step in complex cepstrum deconvolution is the inverse cepstral transformation, carried out separately for the anticipative and for the causal parts of the complex cepstrum. This leads to the anticipative (maximum-phase) or causal (minimum-phase) spectrum and further to the anticipative (maximum-phase) or causal (minimum-phase) impulse responses. From Fig. 3 it is evident that the maximum-phase part of the speech can be considered as the glottal signal, which is proved in [7]. The reconstruction of the speech frame can be performed by convolution of the anticipative and causal impulse responses [12]. This reconstructed speech is of mixed phase and has higher quality than the classical parametric speech models, which employ the Dirac unit impulse excitation and a minimum-phase vocal tract model based e.g. on linear prediction or on Padé approximation.
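A Python sketch of this decomposition and reconstruction chain is given below, assuming the frame is already two pitch periods long, GCI-centred and windowed; the handling of the cepstral gain term and of the linear phase component is simplified, and the reconstruction is done as a product of the two spectra, which corresponds to the convolution of the two impulse responses mentioned above.

import numpy as np

def split_and_reconstruct(frame, M=4096):
    # Complex cepstrum of the windowed frame, as in eq. (1)
    X = np.fft.fft(frame, M)
    X_hat = np.log(np.abs(X)) + 1j * np.unwrap(np.angle(X))
    c = np.fft.ifft(X_hat)
    # Split into causal (n > 0) and anticipative (n < 0) parts; the n = 0 term is shared
    c_min = np.zeros(M, dtype=complex)
    c_max = np.zeros(M, dtype=complex)
    c_min[0] = c_max[0] = 0.5 * c[0]
    c_min[1:M // 2] = c[1:M // 2]          # positive quefrencies -> minimum-phase part
    c_max[M // 2:] = c[M // 2:]            # negative quefrencies -> maximum-phase part
    # Inverse cepstral transformation of each part
    h_min = np.fft.ifft(np.exp(np.fft.fft(c_min))).real   # minimum-phase (vocal tract-like) response
    h_max = np.fft.ifft(np.exp(np.fft.fft(c_max))).real   # maximum-phase (glottal-like) response
    # Product of the two spectra restores the original (zero-padded) frame
    recon = np.fft.ifft(np.exp(np.fft.fft(c_min)) * np.exp(np.fft.fft(c_max))).real[:len(frame)]
    return h_max, h_min, recon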
2.3 Problematic Frames in Complex Cepstral Deconvolution

At first we used the maximum-phase impulse response directly for the modification of the glottal signal [13], but we observed that for some voiced speech segments, after the complex cepstral speech deconvolution, the maximum-phase speech components are not similar to the typical glottal signal; see Fig. 4. This occurs more often for a higher
sampling frequency than 8 kHz. The anticipative impulse response in Fig. 4 is more similar to an AM modulation of the glottal signal. This is probably caused by the noise component in the excitation of natural speech, which may have a negative impact on the separability into the minimum- and maximum-phase speech components. It might be interesting to perform some harmonic-noise decomposition [14] before the complex cepstrum deconvolution, which would then be applied only to the harmonic component.
3 Design of a 2nd-Order Anticausal IIR Model of the Glottal Signal

Our first experiment with the modification of the glottal signal [13] was based on extension or shortening of the maximum-phase impulse response. For the speech frame analyzed in Fig. 3 this is appropriate and we can achieve a modification of the open quotient of the glottal signal. However, this technique cannot be used for the speech frame analyzed in Fig. 4. In this case, when we perform the convolution of the original maximum- and minimum-phase impulse responses, we obtain the original speech frame, but after the extension or shortening of the maximum-phase component, the convolution produces a signal that differs from a typical speech signal.
Fig. 4. Anticipative and causal cepstra, the corresponding spectra and impulse responses for the problematic voiced speech frame estimated using ZZT
We decided to solve this problem by designing a model of the glottal signal whose parameters can be reliably estimated from the anticipative (maximum-phase) magnitude spectrum. The first peak in the maximum-phase magnitude spectrum, near zero frequency, corresponds to the glottal formant. The glottal formant is not caused by a resonance like a classical vocal tract formant; it is a property of the glottal impulse. This glottal formant is usually visible also in the problematic frames (see Figs. 3 and 4) and can be estimated by peak picking. In our experience, the formant is more visible when the spectrum is computed from the anticipative root cepstrum. The glottal formant is one of the main parameters of the glottal signal and is coupled with the open quotient of the glottal impulse [9]. The model of the glottis can be represented by two complex conjugate poles in the z-plane at a frequency equal to the frequency of the glottal formant; this is a property of a 2nd-order IIR filter. The magnitude of the pole pair can be estimated from the glottal formant bandwidth. For agreement with the phase properties of the glottal signal, this model must have poles outside the unit circle. Such a filter is unstable, but the model can be designed as anticausal, which means that the time response of the filter is calculated in the reverse time direction. The response of such a filter can be computed as the time-reversed response of a causal filter with poles at the conjugate reciprocal positions of the original poles. The frequency and impulse responses of the glottal model for the speech frame in Fig. 4 are depicted in Fig. 5.

3.1 Speech Deconvolution with the 2nd-Order Anticausal Model of the Glottal Signal
If we want to use the described 2nd-order anticausal glottal model for the modification of the speech signal, we must integrate this model into the speech deconvolution. The deconvolution can be performed by filtering the windowed speech frame with the inverse model of the glottal signal, which is a simple FIR filter with zeros at the positions of the glottal model poles. The resulting signal can be considered as the vocal tract impulse response. The described deconvolution is schematically shown in Fig. 6.
Fig. 5. Frequency and impulse responses of the 2nd order anticausal glottal model together with the pole plot in the z-plane
Fig. 6. Speech deconvolution with the inverse glottal model
4 Glottal Signal Modification

The glottal signal modification can be performed by changing the estimated parameters, i.e. the glottal formant and its bandwidth. These parameters are obtained from the maximum-phase magnitude spectrum, which is estimated by the complex cepstrum deconvolution. The simplest modification is an increase or decrease of the open quotient of the glottal signal. According to [15], the open quotient is inversely proportional to the glottal formant. The bandwidth of the glottal formant is coupled with the asymmetry coefficient of the glottal flow. Fig. 7 shows the influence of varying the glottal formant and its bandwidth by the same multiplication factor. This results in a modification of the open quotient only; the asymmetry quotient is the same in all cases. Fg_a = 3/2 Fg_o, Bfg_a = 3/2 Bfg_o and Fg_b = 2/3 Fg_o, Bfg_b = 2/3 Bfg_o, where Fg_o is the frequency of the original
Fig. 7. Example of the responses for 2nd order anticausal model of the glottal signal for the modification of the glottal formant and its bandwidth
The modified speech frames, obtained by convolving the modified glottal signal with the impulse response of the vocal tract (itself obtained by deconvolution using the inverse glottal model with the original glottal formant and bandwidth), are shown in Fig. 8. Finally, Fig. 9 shows the original speech together with both conversions, using the glottal signal modifications from the previous example.
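Continuing the sketch introduced in Sect. 3.1, the hypothetical helper below ties the steps together: deconvolve with the original parameters, rebuild the glottal response with scaled parameters, and re-convolve. It reuses the functions from the previous sketch; the scale factor k and the simple truncation of the convolution result are our own simplifications, not part of the paper.

```python
import numpy as np

def modify_glottal_characteristics(frame, fg, bg, fs, k):
    """Resynthesize a windowed speech frame with the glottal formant frequency
    and its bandwidth both scaled by k (e.g. k = 3/2 or k = 2/3 as in Fig. 7)."""
    vocal_tract = deconvolve_frame(frame, fg, bg, fs)               # inverse filtering with original parameters
    new_glottis = anticausal_glottal_model(k * fg, k * bg, fs, n=len(frame))
    modified = np.convolve(new_glottis, vocal_tract)                # re-convolution with the modified source
    return modified[:len(frame)]                                    # crude truncation back to frame length
```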
Fig. 8. Example of the modified speech impulse responses for the cases in Fig. 7
Fig. 9. Example of the modified speech using the change of the glottal model parameters (see Fig. 7)
5 Conclusion
In this contribution, experience with complex cepstrum speech deconvolution was described and a glottal signal modification based on a 2nd order anticausal glottal model was proposed. Complex cepstral speech deconvolution is sensitive above all to speech segmentation: the speech frame must be two pitch periods long with the GCI in the middle of the frame, and a proper weighting window must be used. The method of complex cepstrum estimation, or the use of a robust phase unwrapping algorithm, is also of high importance. Even when all of these criteria are fulfilled, complex cepstral deconvolution does not give adequate results, especially for sampling frequencies higher than 8 kHz. This is probably caused by the noise present in the higher frequency band of the source speech signal. For this reason we developed a 2nd order anticausal model of the glottal signal, whose parameters can be reliably estimated by complex cepstral deconvolution even for problematic speech frames. The proposed model has two basic parameters: the frequency of the glottal formant and its bandwidth. Cepstral deconvolution is then used only for the estimation of these parameters. A modification of the glottal signal is achieved by filtering the original windowed speech frame with the inverse model of the glottal signal, which yields the impulse response of the vocal tract. The parameters of the glottal model are then changed and the modified speech frame is obtained by convolving the vocal tract impulse response with the new glottal model response. Preliminary listening tests showed that an increase of the glottal formant and of its bandwidth (i.e., a decrease of the open quotient of the glottal signal) leads to a tense-sounding voice, whereas a decrease of the glottal formant and of its bandwidth (i.e., an increase of the open quotient) leads to a lax-sounding voice. It is clear, however, that changing the emotional speech style also requires conversion of the vocal tract model and of the prosody. The modification of the glottal signal alone is not sufficient for the generation of emotional speech, but it can boost the speech style given mainly by the prosody.
Acknowledgments. This paper has been supported within the framework of COST 2102 by the Ministry of Education, Youth and Sports of the Czech Republic, project number OC08010, and by the research project 102/09/0989 of the Grant Agency of the Czech Republic.
References
1. Iida, A., Campbell, N., Higuchi, F., Yasumura, M.: A corpus-based speech synthesis system with emotions. Speech Communication 40, 161–187 (2003) 2. Vích, R.: Pitch Synchronous Linear Predictive Czech and Slovak Text-to-Speech Synthesis. In: Proc. of the 15th International Congress on Acoustics, ICA 1995, Trondheim, Norway, vol. III, pp. 181–184 (1995)
3. Vích, R.: Cepstral Speech Model, Padé Approximation, Excitation and Gain Matching in Cepstral Speech Synthesis. In: Jan, J. (ed.) BIOSIGNAL 2000, VUTIUM, Brno, pp. 77–82 (2000) 4. Gobl, C., Ní Chasaide, A.: The role of voice quality in communicating emotion, mood and attitude. Speech Communication 40, 189–212 (2003) 5. Airas, M., Alku, P.: Emotions in Vowel Segments of Continuous Speech: Analysis of the Glottal Flow Using the Normalized Amplitude Quotient. Phonetica 63, 26–46 (2006) 6. Walker, J., Murphy, P.: A Review of Glottal Waveform Analysis. In: Stylianou, Y., Faundez-Zanuy, M., Esposito, A. (eds.) COST 277. LNCS, vol. 4391, pp. 1–21. Springer, Heidelberg (2007) 7. Bozkurt, B.: Zeros of the z-transform (ZZT) representation and chirp group delay processing for the analysis of source and filter characteristics of speech signals. Ph.D. Thesis, Faculté Polytechnique de Mons, Belgium (2005) 8. Drugman, T., Bozkurt, B., Dutoit, T.: Complex Cepstrum-based Decomposition of Speech for Glottal Source Estimation. In: INTERSPEECH 2009, Brighton, U.K., pp. 116–119 (2009) 9. Doval, B., d’Alessandro, C., Henrich, N.: The voice source as a causal/anticausal linear filter. In: Proc. of ISCA Tutorial and Research Workshop on Voice Quality (VOQUAL), Geneva, pp. 15–19 (2003) 10. Tribolet, J.: A new phase unwrapping algorithm. IEEE Transactions on Acoustics, Speech and Signal Processing 25(2), 170–177 (1977) 11. Oppenheim, A.V., Schafer, R.W.: Discrete-Time Signal Processing, pp. 768–825. Prentice Hall, Englewood Cliffs (1989) 12. Vích, R.: Nichtkausales Cepstrales Sprachmodell. In: Proc. 20th Electronic Speech Processing Conference – ESSV 2009, Dresden, Germany, pp. 107–114 (2009) 13. Vondra, M., Vích, R.: Speech Conversion Using a Mixed-phase Cepstral Vocoder. In: Proc. of 21st Electronic Speech Processing Conference – ESSV 2010, Berlin, Germany, pp. 112–118 (2010) 14. Stylianou, Y.: Decomposition of speech signals into a deterministic and a stochastic part. In: Proc. of Fourth International Conference on Spoken Language, ICSLP 1996, Philadelphia, pp. 1213–1216 (1996) 15. Doval, B., d’Alessandro, C., Henrich, N.: The spectrum of glottal flow models, http://rs2007.limsi.fr/index.php/PS:Page_2
On Speech and Gestures Synchrony
Anna Esposito 1,2 and Antonietta M. Esposito 3
1 Dep. of Psychology, Second University of Naples, Via Vivaldi 43, 81100 Caserta, Italy
2 IIASS, Via Pellegrino 19, 84019 Vietri sul Mare, SA, Italy
3 Istituto Nazionale di Geofisica e Vulcanologia, sezione di Napoli, Osservatorio Vesuviano, Napoli, Italy
[email protected], [email protected]
Abstract. Previous research has proved the existence of synchronization between speech pauses and holds in adults and in 9-year-old children with a rich linguistic vocabulary and advanced language skills. When and how does this synchrony develop during child language acquisition? Could it also be observed in children younger than 9? The present work aims to answer these questions by reporting on the analysis of narrations produced by three different age groups of Italian children (9, 5 and 3 year olds). Measurements are provided of the amount of synchronization between speech pauses and holds in the three groups, as a function of the duration of the narrations. The results show that, as far as the reported data are concerned, in children as in adults holds and speech pauses are to a certain extent synchronized and play similar functions, suggesting that they may be considered a multi-determined phenomenon exploited by the speaker under the guidance of a unified planning process to satisfy a communicative intention. In addition, considering the role that speech pauses play in communication, we speculate on the possibility that holds may serve similar purposes, supporting the hypothesis that gestures, like speech, are an expressive resource that can take on different functions depending on the communicative demand. While speech pauses are likely to play the role of signalling mental activation processes aimed at replacing the “old spoken content” of the communicative plan with a new one, holds may signal mental activation processes aimed at replacing the “old visible bodily actions” with new ones reflecting the representational and/or propositional contribution of gestures to the new communicative plan.
Keywords: Speech pauses, holds, synchrony, child narrations.
1 Introduction
Humans communicate through a gestalt of actions which involve much more than the speech production system. Facial expressions, head, body and arm movements (grouped under the name of gestures) all potentially provide information to the communicative act, supporting (through different channels) the speaker’s communicative goal and also allowing the speaker to add a variety of other information to his/her messages, including (but not limited to) his/her psychological
state, attitude, etc. The complexity of the communicative act expression should be taken into account in human-computer interaction research aiming at modeling and improving such interaction by developing user-friendly applications which should simplify and enrich the average end user’s ability to use automatic systems. Psycholinguistic studies have confirmed the complementary nature of verbal and nonverbal aspects in human expressions [44, 56, 58], demonstrating how visual information processing integrates and supports speech comprehension [61]. In the field of human-machine interaction, research works on the mutual contribution of speech and gestures to communication have been carried out along three main axes. Some studies have been mainly devoted to modeling and synchronizing speech production and facial movements for implementing more natural “talking heads” or “talking faces” [23, 28, 33, 37, 40-41, 54], taking into account, in some cases, features that can also encode emotional states [15, 29, 64]. Other studies have exploited a combination of speech and gesture features (mainly related to the oral movements) with the aim of improving the performance of automatic speech recognition systems [11]. Some others have dealt with the modeling and synthesis of facial expressions (virtual agents), head, hand movements and body postures [3, 48] with the aim of improving the naturalness and effectiveness of interactive dialogue systems. Such studies are still at an early stage, even though some prototypes, which prove the efficacy of modeling gestural information, have already been developed for the American English language [6-7, 70]. A less investigated but crucial aspect for multimodal human-machine interaction is the relationship between paralinguistic and extra-linguistic information conveyed by speech and gestures in human body-to-body interaction (in this context, the term gestures mainly refers to facial expressions, head and hand/arm movements). Psycholinguistic studies have shown that humans convey meanings not only by using words (lexicon), and that there exists a set of non-lexical expressions carrying specific communicative values, expressing for example turn-taking and feedback regulation mechanisms, or signalling active cognitive processes (such as the recovery of lexical items from long-term memory) during speech production [5, 8-10]. Typical non-lexical but communicative events at the speech level are, for example, empty and filled pauses and other hesitation phenomena (by which the speaker signals his/her intention to keep the turn), vocalizations and nasalizations signalling positive or negative feedback, and the so-called “speech repairs” which convey information on the speaker’s cognitive state and the planning and re-planning strategies she/he is typically using in a discourse. All these non-lexical events are often included in the overall category of “disfluencies” and therefore considered (mostly in automatic speech recognition research) as similar to non-lexical and non-communicative speech events such as coughing or sneezing. On the other hand, seminal works have observed that such non-lexical acts are also communicative speech events and show gestural correlates, both for the English and the Italian language [21-22, 25].
Adding a representation of this gestural information to a mathematical model of human-machine interaction would lead to the implementation of more natural and user-friendly interactive dialog systems and may contribute to a general improvement of system performance. The present paper aims at contributing to the development of this research by reporting data on the synchronization between communicative entities in speech (in
particular empty and filled speech pauses and vowel lengthening) and in gestures (in particular holds). The data were collected both for adults and for three differently aged groups of children (3, 5 and 9 year olds) with the aim of assessing when synchrony between holds and speech pauses develops during child language acquisition.
2 Getting the Focus on Speech Pauses and Holds
2.1 The Role of Pausing Strategies in Dialogue Organization
A characteristic of spontaneous speech, as well as of other types of speech, is the presence of silent intervals (empty pauses) and vocalizations (filled pauses) that do not have a lexical meaning. Pauses seem to play a role in controlling the speech flow. Several studies have been conducted to investigate the system of rules that underlies speaker pausing strategies and their psychological bases. Research in this field has shown that pauses may play several communicative functions, such as building up tension or raising expectations in the listener about the rest of the story, assisting the listener in her/his task of understanding the speaker, signalling anxiety, emphasis, syntactic complexity, degree of spontaneity and gender, and transmitting educational and socio-economic information [1, 34, 36, 51]. Studies on speech pause distribution in language production have produced evidence of a relationship between pausing and discourse structure. Empty and filled pauses are more likely to coincide with boundaries, realized as a silent interval of varying length, at clause and paragraph level [68]. This is particularly true for narrative structures, where it has been shown that pausing marks the boundaries of narrative units [9-10, 18-20, 62-63]. Several cognitive psychologists have suggested that pausing strategies reflect the complexity of neural information processing. Pauses surface in the speech stream as the end product of a “planning” process that cannot be carried out during speech articulation, and the amount and length of pausing reflect the cognitive effort related to lexical choices and semantic difficulties in generating new information [4-5, 10, 18-20, 34]. We can conclude from the above considerations that pauses in speech are typically a multi-determined phenomenon attributable to physical, socio-psychological, communicative, linguistic and cognitive causes. Physical pauses are normally attributed to breathing or articulatory processes (i.e. pauses due to the momentary stoppage of the breath stream caused by the constrictors of the articulatory mechanism or the closure of the glottis). Socio-psychological pauses are caused by stress or anxiety [2]. Communicative pauses are meant to permit the listener to comprehend the message or to interrupt and ask questions or make comments. Linguistic pauses are used as a means for discourse segmentation. Finally, cognitive pauses are related to mental processes connected to the flow of speech, such as replacing the current mental structure with a new one in order to continue the production [8-10], or to difficulties in conceptualization [34]. Recent studies aimed at investigating the role of speech pauses, such as empty and filled pauses and phoneme lengthening, in child narrations have shown that children also exploit pausing strategies to shape their discourse structure (in Italian) [18-20]. Children pause, like
adults, to recover from their memory the new information (the added¹ one) they are trying to convey. The more complex (in terms of cognitive processing) the recovery effort, the longer the pausing time; and the longer the pauses, the lower the probability that they are associated with given information. Most of the long pauses (96% for females and 94% for males) are associated with a change of scene, suggesting that long pauses are favored by children for signalling discourse boundaries. The consistency in the distribution of speech pauses seems to suggest that, at least in Italian, both adults and children exploit a similar model of timing to regulate speech flow and discourse organization. In the light of these considerations it seems logical to ask what the role, if any, of gesture pauses (henceforth holds) in communication could be, and which functions they may play with respect to speech pauses. To this aim, in the reported data, socio-psychological, articulatory, and communicative pauses were excluded from the analysis. The first were assumed not to be a relevant factor, by virtue of the particular elicitation setting (see the next section for details). The second and third were identified during the speech analysis and eliminated from the dataset. The speech pauses considered in this work, therefore, are linguistic, cognitive and breathing pauses. On the basis of the assumption that breathing and linguistic pauses are part of the strategy the speakers adopt for grouping words into a sentence, in the following they are both considered part of the planned communication process.
2.2 The Role of Gestural Holds in Shaping the Interaction
In daily human-to-human interaction we usually encode the messages we want to transmit in a set of actions that go beyond the verbal modality. Nonverbal actions (grouped under the name of gestures) help to clarify meanings, feelings, and contexts, acting for the speaker as an expressive resource exploited in partnership with speech for appropriately shaping communicative intentions and satisfying the requirements of a particular message being transmitted. There is a considerable body of evidence attributing to gestures semantic and pragmatic functions similar to those of speech and rejecting the hypothesis that either gestures or speech alone might have the primary role in the communicative act [16, 44, 56], but there are also data suggesting that the role of gestures is secondary to speech, serving as support to the speaker's effort to encode his/her message [49-50, 52, 59-60, 65]. The latter hypothesis appears to be a reasonable position, since during our everyday interactions we are aware of generating verbal messages and of the meaning we attribute to them, thanks to a continuous auditory feedback. On the other hand, we are not endowed with a similar feedback for our gesticulation, posture, and facial expressions. In addition, most gesturing is made without conscious control, since we do not pay special attention to it while speaking; furthermore, humans carry out successful communication also in situations where they cannot see each other (on the telephone for example, see Short et al. [69] and Williams [74]). Conversely, it is really hard to infer the meaning of a message when only gestures and no speech are provided, and therefore it might appear obvious, if not trivial, to presume
¹ The present work interprets the concepts of “given” and “added” according to the definition proposed by Chafe [8], who considered as “added” any verbal material that produces a modification in the listener’s conscious knowledge, “given” verbal material being intended as material that does not produce such a modification.
that the role of gestures, if any, in communication is just to assist the listener and/or the speaker [31-32, 52-53, 60-61, 66]. Nonetheless, more in-depth analyses shed doubts on the above position and show that gestures and speech are partners in shaping communication and in giving kinetic and temporal (visual and auditory) dimensions to our thoughts. Some hints of these gestural functions can simply be experienced in our everyday life. Gestures resolve speech ambiguities and facilitate comprehension in noisy environments, they act as a language when verbal communication is impaired, and in some contexts they are not only preferred but also produce more effective results than speech in communicating ideas [35, 42-47, 55-58, 72]. More interestingly, it has been shown that gestures are used in semantic coherence with speech and may be coordinated with tone units and prosodic entities, such as pitch-accented syllables and boundary tones [17, 43, 71, 75]. Besides, gestures add an imagistic dimension to the phrasal contents [35, 39, 45, 47, 55] and are synchronized with speech pauses [4-5, 21-22, 25, 38, 43]. In the light of these considerations, gestures are to be regarded as an expressive system that, in partnership with speech, provides means for giving form to our thoughts [13, 42, 55-56].
Fig. 1. Distributions of Empty Pauses (1a) and Holds (1b) over the Clauses (yellow bars) in an action plan dialogue produced by an American English speaker (Esposito et al., 2001)
Have we been convincing? Are the available data able to definitely assess the role of gestures in communicative behaviours? Since the experimental data are somewhat conflicting, the question of how to integrate and evaluate the above different positions on the relevance of gestures in communication is still open, and the results we are going to present may be relevant in evaluating their relative merits. In previous works (see Esposito et al. [21-22, 25]), we adopted the theoretical framework that gestures, acting in partnership with speech, have similar semantic and pragmatic functions. Starting from these assumptions, we tried to answer the following questions about hand movements: assuming that speech and gestures are co-involved in the production of a message, is there any gestural equivalent to filled and empty pauses in speech? And assuming that we have found some equivalent gestural entities, to what degree do these synchronize with speech pauses? As an answer to our first question, in two pilot studies we identified a gestural entity that we called hold. A careful review of speech and gesture data showed that in fluent speech contexts, holds appear to be distributed similarly to speech pauses and to overlap with them, independently of the language (the gesture data were produced by Italian and American English speakers) and the context (there were two narrative contexts: an action plan and a narration dialogue). As an example supporting the above conclusions, the data in Figure 1 show the distribution of empty speech pauses (red bars, Figure 1a) and the distribution of holds (red bars, Figure 1b) over speech clauses (yellow bars) produced by an American English speaker during an action plan dialogue. The y-axis reports the number of clauses (also displayed as yellow bars) and the x-axis the durations of clauses, holds, and speech pauses. Figure 2 shows the amount of overlap between empty speech pauses and holds (red bars) and between locutions and holds (white bars) during the narration of an episode of a familiar cartoon (Silvester-Twitee) reported by an Italian (Figure 2a) and an American English speaker (Figure 2b). In a recent work [16] we found further support for our previous speculations through the analysis of narrative discourse data collected both from children and adults who participated in a similar elicitation experiment. Both adults and children were native speakers of Italian. There were two goals motivating this extension of the previous research: 1) if the relationships previously found between holds and speech pauses are robust, they should be independent of age, i.e., they should also be evident in child narrations; 2) if at least some aspects of speech and gesture reflect a unified planning process, these should be similar for all human beings, provided that the same expressive tools are available. The results of the above research work [16] are partially displayed in Figures 3 and 4. Figure 3 graphically shows the percentage of overlaps against the percentage of speech pauses that do not overlap with holds, in children (3a) and adults (3b).
Fig. 2. Percentage of overlaps between empty (EP) and filled (FP) speech pauses and holds (red bars) and between clauses and holds (white bars) during the narration of a cartoon episode (Silvester-Twitee) narrated by an Italian (Figure 2a) and an American English speaker (Figure 2b).
Figure 4 displays, for each subject in each group (children and adults), the hold and speech pause rates, computed as the ratios of the number of holds and/or speech pauses to the length of the subject’s narrations measured in seconds. Figure 4a is for children and Figure 4b is for adults. The Pearson correlation coefficient was computed as a descriptive statistic of the magnitude or amount of information that can be inferred about speech pause frequency from the known hold frequency. The Pearson correlation coefficient between holds and speech pauses for children was r = 0.97, and the proportion of the variation of speech pauses that is determined by the variation of holds (i.e. the coefficient of determination) was r² = 0.93, which means that 93% of the children’s speech pause variation is predictable from holds. For adults, r = 0.88 and r² = 0.78.
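For readers who want to reproduce this kind of descriptive statistic on their own annotations, the short Python sketch below computes the per-subject rates and the Pearson coefficient with SciPy. The numeric arrays are hypothetical placeholders, not the values underlying Figure 4.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical per-subject counts and narration durations (seconds).
n_holds  = np.array([31, 45, 22, 38, 29, 41])
n_pauses = np.array([28, 47, 20, 36, 31, 39])
duration = np.array([62.0, 80.5, 47.2, 70.3, 55.8, 74.1])

hold_rate  = n_holds / duration    # holds per second, as defined in the text
pause_rate = n_pauses / duration   # speech pauses per second

r, p_value = pearsonr(hold_rate, pause_rate)
print(f"Pearson r = {r:.2f}, coefficient of determination r^2 = {r**2:.2f}")
```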
Fig. 3. Percentage of overlaps and non-overlaps between speech pauses and holds in children (3a: 84.8% of speech pauses overlapping with holds, 15.2% not overlapping) and adults (3b: 83% overlapping, 17% not overlapping)
The two groups of speakers produced a similar distribution of hold and speech pause overlaps. The degree of synchronization was so high that further statistical analyses to assess its significance were not necessary, if the word “synchronization” is interpreted loosely to mean “the obtaining of a desired fixed relationship among corresponding significant instants of two or more signals” [www.its.bldrdoc.gov]. In summary, the reported data showed that the frequency of overlaps between holds and speech pauses was not only remarkably high but also much the same for adults and children (see Figures 3 and 4), clearly indicating that both children and adults tended to synchronize speech pauses with holds independently of their age. The hold and speech pause rates were also compared by a one-way ANOVA performed for each group, with rate type as a within-subject variable. The differences between hold and speech pause rates were not significant for children (F(1,10) = 1.09, p = 0.32), suggesting that holds and speech pauses were equally distributed along the children’s narrations. For adults, the differences were statistically significant (F(1,6) = 11.38, p = 0.01), suggesting that adults used holds more frequently than speech pauses.
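With only two within-subject conditions (hold rate vs. speech pause rate), a one-way repeated-measures ANOVA of the kind reported above is equivalent to a paired t-test (F = t²), so a sketch like the following reproduces the same type of test; the arrays are hypothetical rates, not the study’s data.

```python
import numpy as np
from scipy.stats import ttest_rel

# Hypothetical per-subject rates for one group (same subjects in both conditions).
hold_rate  = np.array([0.52, 0.61, 0.48, 0.55, 0.58, 0.50, 0.57])
pause_rate = np.array([0.45, 0.50, 0.44, 0.49, 0.47, 0.43, 0.46])

t, p = ttest_rel(hold_rate, pause_rate)        # paired comparison within subjects
df = len(hold_rate) - 1
print(f"F(1,{df}) = {t**2:.2f}, p = {p:.3f}")  # F of the equivalent repeated-measures ANOVA
```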
Fig. 4. Hold rates against speech pause rates for children (4a) and adults (4b)
From these results, considering the role that speech pauses play in communication, we speculate on the possibility that holds may serve similar purposes, supporting the view that gestures, like speech, are an expressive resource that can take on different functions depending on the communicative demand. The data discussed above seem to support this hypothesis, showing that 93% of the children’s and 78% of the adults’ speech pause variation is predictable from holds, suggesting that, at least to some extent, the function of holds may be thought to be similar to that of speech pauses. We further speculated that while speech pauses are likely to play the role of signalling mental activation processes aimed at replacing the “old spoken content” of an “utterance” with a new one, holds may signal mental activation processes aimed at replacing the “old visible bodily actions” (intimately involved in the semantic and/or pragmatic contents of the old “utterance”) with new bodily actions reflecting the representational and/or propositional contribution that gestures are engaged to convey in the new “utterance”. In order to further support the above results, in the present work we try to answer the following questions: When and how does this synchrony develop during child language acquisition? Could it also be observed in children younger than 9? To answer these questions, the following sections describe the analyses of
narrations produced by three different age groups of Italian children (9, 5 and 3 year olds) and provide measurements of the amount of speech pauses and holds as a function of word rates and narration durations.
3 Material
The video recordings on which our analysis is based are of narrations by three groups of children:
• 8 females, aged 9 years ± 3 months;
• 5 males and 5 females, aged 5 years ± 3 months;
• 3 males and 3 females, aged 3 years ± 3 months.
The children told the story of a 7-minute animated color cartoon they had just seen. The cartoon was of a type familiar to Italian children, involving a cat and a bird. The listener was the child’s teacher together with other children also participating in the experiment. The children’s recordings were made after the experimenter had spent two months with the children in order to become familiar to them, and after several preparatory recordings had been made in various contexts in order for the children to get used to the camera. This kept stranger-experimenter inhibitions out of the elicitation setting, i.e., factors that could result in stress and anxiety. Limiting these factors allowed us to rule out the “socio-psychological” type of pauses [2]. The cartoon had an episodic structure, each episode characterized by a “cat that tries to catch a bird and is foiled” narrative arc. The experimental set-up is the same as that reported in previous works of McNeill and Duncan [57], and the decision to use such a similar experimental set-up was made with the purpose of allowing future comparisons with other similar research works. Because of the cartoon’s episodic structure, children would typically forget entire episodes, and therefore only four episodes (those common to all the child narrations) were analyzed. None of the participants was aware that speech and gesture pauses were of interest. The video was analyzed using commercial video analysis software (VirtualDub™) that allows viewing video shots and moving forward and backward through the shots. The speech waves, extracted from the video, were sampled at 16 kHz and digitized at 16 bits. The audio was analyzed using Speechstation2™ from Sensimetrics. For the audio measurements the waveform, energy, spectrogram, and spectrum were considered together, in order to identify the beginnings and endings of utterances, filled and empty speech pauses, and phoneme lengthening. The details of the criteria applied to identify the boundaries in the speech waveform are described in [24, 26]. Both the video and audio data were analyzed perceptually, the former frame-by-frame and the latter clause-by-clause or locution-by-locution, where a “clause” or a “locution” is assumed to be “a sequence of words grouped together on a semantic or functional basis” [18-20].
3.1 Some Working Definitions
In this study, empty pauses are simply defined as a silence (or verbal inactivity) in the flow of speech equal to or longer than 120 milliseconds. Filled pauses are defined as
the lengthening of a vowel or consonant identified perceptually (and on the spectrogram) by the experimenter, or as one of the following expressions: “uh, hum, ah, ehm, ehh, a:nd, the:, so, the:n, con:, er, e:, a:, so:”². A hold is detected when the arms and hands remain still for at least three video frames (i.e., approximately 120 ms) in whatever position excluding the rest position. The latter is defined as the home position of the arms and hands when they are not engaged in gesticulation, typically at the lower periphery of the gesture space (see McNeill [58], p. 89). The holds associated with gesture rest were not included in the analysis by virtue of the particular elicitation setting (see the next section for details). Note that the absence of movement is judged perceptually by an expert human coder. Therefore, the concept of hold is ultimately a perceptual one. A hold may be thought to be associated with a particular level of discourse abstraction. In producing a sentence, the speaker may employ a metaphoric gesture with a hold spanning the entire utterance. However, the speaker may also engage in word search behaviour (characterized by a slight oscillatory motion centered around the original hold) without any change in hand shape (the Butterworth gesture cited in McNeill [58]). The speaker may also add emphatic beats coinciding with points of peak prosodic emphasis in the utterance. While the word search and emphatic beats may sit atop the original hold, most observers will still perceive the underlying gesture hold.
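As a rough illustration of how the 120 ms thresholds above could be operationalized automatically, the sketch below flags empty pauses from a short-time energy contour and candidate holds from per-frame hand displacement. This is only an assumed approximation: in the study both speech boundaries and holds were judged perceptually by expert coders, and the energy and motion thresholds used here are hypothetical.

```python
import numpy as np

def detect_empty_pauses(energy_db, frame_step_s, silence_db=-45.0, min_dur_s=0.120):
    """Return (start, end) times of low-energy stretches lasting at least 120 ms."""
    silent = np.append(energy_db < silence_db, False)  # sentinel closes a trailing run
    pauses, start = [], None
    for i, s in enumerate(silent):
        if s and start is None:
            start = i
        elif not s and start is not None:
            if (i - start) * frame_step_s >= min_dur_s:
                pauses.append((start * frame_step_s, i * frame_step_s))
            start = None
    return pauses

def detect_holds(hand_displacement, motion_eps=1.0, min_frames=3):
    """Return (start_frame, end_frame) runs where per-frame hand displacement stays
    below motion_eps (e.g. pixels) for at least min_frames consecutive video frames."""
    still = np.append(hand_displacement < motion_eps, False)
    holds, start = [], None
    for i, s in enumerate(still):
        if s and start is None:
            start = i
        elif not s and start is not None:
            if i - start >= min_frames:
                holds.append((start, i))
            start = None
    return holds
```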
4 Results
Figure 5 displays the percentage of holds overlapping with speech pauses in the three different groups. In this case rest positions were not included in the data, since it could have been objected that small children may produce a considerable number of rest positions that cannot be counted as holds (in the previous study rest positions were difficult to distinguish from holds and the two types of gestures were considered together). The Pearson product-moment correlation coefficient was computed as a descriptive statistic of the magnitude or amount of information that can be inferred about speech pause frequency from the known hold frequency. The Pearson correlation coefficient between holds and speech pauses for 3 year old children was r = 0.72, and the proportion of the variation of speech pauses that is determined by the variation of holds (i.e. the coefficient of determination) was r² = 0.52, which means that 52% of these children’s speech pause variation is predictable from holds. For 5 year old children r = 0.18 and r² = 0.03, which means that there was no correlation between pause variation and holds in 5 year old children. For 9 year old children r = 0.83 and r² = 0.70, which means that 70% of these children’s speech pause variation is predictable from holds. For adults it was previously found [16] that r = 0.88 and r² = 0.78, which means that 78% of the adults’ speech pause variation is predictable from holds.
² The notation “:” indicates vowel or consonant lengthening.
Fig. 5. Percentage of holds overlapping with speech pauses in the three different age groups of children (reported group averages: 84.6% ± 7.8, 71.6% ± 12.1, and 70.6% ± 8.1)
Fig. 6. Distribution of pause and hold rates in three year old children
Fig. 7. Distribution of pause and hold rates in five year old children
Fig. 8. Distribution of pause and hold rates in nine year old children
The data reported above show that a correlation between holds and speech pauses exists only for two of the three groups of children, and that for five year old children there was no synchronization between speech pauses and holds. In order to ascertain the causes of this discrepancy, the distribution of hold and speech pause rates was checked for each group and each subject. Figures 6, 7 and 8 display such distributions in the three different age groups; the labels F and M on the x-axis indicate female and male children respectively. The distribution of hold and speech pause rates does not seem to explain why for five year old children there was no correlation between holds and speech pauses and therefore, as a further control, the rest position rates (computed as the ratio of the number of rest positions to the duration, in seconds, of the episodes under analysis) were considered for the three groups of children (see Figures 9, 10, and 11).
Fig. 9. Rest position rates in three year old children
Fig. 10. Rest position rates in five year old children
Fig. 11. Rest position rates in nine year old children
The data show an unexpected trend and lay the basis for novel interpretations of the amount of gestural holds that can be predicted from speech pauses. In fact, Figure 10 shows that the rest position rates in five year old children are quite high and very different from those observed in three and nine year old children.
5 Discussion
The previous section presents three interesting results. Firstly, a great amount of the variation in speech pauses is highly correlated with holds, both in adults and in children, and there is a great amount of overlap between the two speech and gestural entities. Secondly, speech pauses are highly synchronized with holds and this synchronization
does not depend on the speaker’s age. Thirdly, five year old children do not seem to show the same amount of overlap and synchronization between hold and speech pause entities. What does this suggest about the speech and gestures partnership? To answer this question it is necessary to recall the role attributed to speech pauses, in particular to the cognitive speech pauses under examination in this work. As already pointed out in the introductory section, these speech pauses are used to “hold the floor”, i.e. to prevent interruption by the listener while the speaker searches for a specific word [27], but they can also serve other functions, such as reflecting the complexity of neural information processing. Pauses surface in the speech stream as the end product of a “planning” process that cannot be carried out during speech articulation, and the amount and length of pausing reflect the cognitive effort related to lexical choices and semantic difficulties in generating new information [4-5, 8-10, 34]. In summary, speech pauses seem likely to play the role of signalling mental activation processes aimed at replacing a particular attentional state with a new one. Given the great amount of overlap between holds and speech pauses, holds appear to be gestural entities with a function and behaviour similar to those of speech pauses. Therefore, these data appear to support the hypothesis that non-verbal modalities and speech have similar semantic and pragmatic functions, and hence that, at least in some respects, speech and gestures reflect a unified planning process, which is implemented synchronously in space and time thanks to the exploitation of two different avenues (the manual-visual versus the oral-auditory channel). As speech pauses seem likely to play the role of signalling mental activation processes aimed at replacing the “given spoken content” of a former utterance with an “added” one, holds may signal mental activation processes aimed at replacing “given visible bodily actions” (intimately involved in the semantic and/or pragmatic contents of the former “utterance”) with “added bodily actions” reflecting the new representational and/or propositional contribution that gestures are engaged to convey in the new “utterance”. Note that the meaning given here to the word “utterance” is the same as that used by Kendon (see chapter 1, page 5, [44]): “an object constructed for others from components fashioned from both spoken language and gesture”. As far as the reported data are concerned, in children, as in adults, holds and speech pauses are to a certain extent synchronized and play similar functions, suggesting that they may be considered a multi-determined phenomenon exploited by the speaker under the guidance of a unified planning process to satisfy a communicative aim. Under the above assumption about the meaning of the word “utterance” we can speculate about how to justify the second result reported in the present work, i.e. why hold rates in adults are significantly different from speech pause rates whereas this is not the case for 3 and 9 year old children. Our hypothesis is that, since gestures are the “kinetic” expression of our thoughts, the speaker may use them in many different ways, one of which could be to structure the spoken discourse when lexical access does not present difficulties. This is one of the functions played by holds in adults.
In fact, the holds performed by adults in their narrations that were not synchronized with speech pauses were made at the end of clauses, so as to mark the different components of the sentence and to emphasize or underline groups of words. Children, instead, being less skilled in assembling bodily and verbal information, tend to attribute to holds the same functions as speech pauses. These considerations may explain the differences in
hold and speech pause rates between the two groups. On the other hand, children may be less skilled in bodily actions than in language, since they start to experience visible actions after birth, whereas language feedback is experienced during pregnancy. Furthermore, sophisticated utterances, where verbal and nonverbal entities are put together to express thoughts with the purpose of maximizing the amount of information transmitted, are a prerogative of adult communication behaviour and may not be necessary in child utterances, limiting the functions and the use of gestures and consequently of holds. There is an inconsistency in the above discussion, since five year old children do not seem to synchronize speech pauses and holds. However, at the age of 5-6 children acquire social consciousness and care about their performance as well as about how they are perceived by others (according to several psychological theories, such as Theory of Mind [30, 73]). This includes a heightened sensitivity to criticism [14]. It could be this acquired sensitivity that prevented the children, even in a friendly environment, from reporting their narrations in a relaxed way, increasing their rest positions and decreasing the synchronization of their gestures with their speech. Although the present data may be relevant in assessing the partnership between speech and gestures, it should be emphasized that this is a pilot study, based on data restricted to a narration context, and that further work is needed to support the above assumptions as well as to assess the functions of holds in the production of utterances.
6 Conclusions
The present paper reports perceptual data showing that both adults and children make use of speech pauses in synchronization with holds, thereby supporting the hypothesis that, at least in some respects, speech and gestures reflect a unified communicative planning process in the production of utterances. The consistency among the subjects in the distribution of holds and speech pauses suggests that, at least in the Italian language, there is an intrinsic timing behaviour, probably a general pattern of rules that speakers (in narrations) use to regulate the speech flow in synchrony with the bodily actions for structuring the discourse organization. The synchrony we are speaking of is more specific than the synchrony discussed in Condon and Sander [12] as well as in several more recent papers published in the literature [5, 27, 42-44, 55-58]. Contrary to an objection raised by a reviewer of this paper, synchrony between holds and speech pauses of different typology was first observed and discussed by one of the authors in [16, 22, 25]. The importance of this synchrony is strongly related to the multi-determined nature of pauses in speech and may help in clarifying the role of gestures in communication. The authors will welcome new investigations in this direction. It would be interesting to conduct an analysis on a more extensive data set and to model this behaviour in mathematical terms. This might help to derive a deterministic algorithm that would be of great utility for applications in the field of human-machine interaction, favouring the implementation of more natural speech synthesis and interactive dialog systems. The analysis developed in this paper sheds light only on a subset of much richer and more subtle processes that are at the basis of the rules and procedures governing the dynamics of face-to-face communication. Among the phenomena not yet examined and worth investigating are:
• the relevance that the bodily actions of the speaker and the listener might have in guiding the dialogue;
• the exploitation of speech pauses and holds in mid turn or in signalling the engagement and disengagement of the participants within the turn;
• the functioning and positioning of speech pauses and holds at certain favourite sequential positions within conversations where they are more likely to be relaxed, such as at the end of clauses and paragraphs during a narration.
In the present study, the consequences of the listener’s actions on the speaker’s have not been considered. Interactions between speaker and listener are relevant and may surely lead to systematic changes in the emerging structure of the speaker’s utterance and in her/his distribution of speech pauses and holds along the utterance. How these dynamics are implemented during interaction is a central issue for the development of a theory of the speech and gestures partnership.
Acknowledgments. This work has been supported by the European projects COST 2102 “Cross Modal Analysis of Verbal and Nonverbal Communication” (http://cost2102.cs.stir.ac.uk/) and COST ISCH TD0904 “TMELY: Time in MEntal activitY” (http://w3.cost.eu/index.php?id=233&action_number=TD0904). Acknowledgments go to three anonymous reviewers for their helpful comments and suggestions and to Tina Marcella Nappi for her editorial help.
References 1. Abrams, K., Bever, T.G.: Syntactic Structure Modifies Attention During Speech Perception and Recognition. Quarterly Journal of Experimental Psychology 21, 280–290 (1969) 2. Beaugrande, R.: Text Production. Text Publishing Corporation, Norwood (1984) 3. Bryll, R., Quek, F., Esposito, A.: Automatic Hand Hold Detection in Natural Conversation. In: Proc. of IEEE Workshop on Cues in Communication, Hawai, December 9 (2001) 4. Butterworth, B.L., Hadar, U.: Gesture, Speech, and Computational Stages: A Reply to McNeill. Psychological Review 96, 168–174 (1989) 5. Butterworth, B.L., Beattie, G.W.: Gestures and silence as indicator of planning in speech. In: Campbell, R.N., Smith, P.T. (eds.) Recent Advances in the Psychology of Language, pp. 347–360. Olenum Press, New York (1978) 6. Cassell, J., Nakano, Y., Bickmore, T., Sidner, C., Rich, C.: Non-verbal Cues for Discourse Structure. In: Association for Computational Linguistics Joint EACL-ACL Conference (2001a) 7. Cassell, J., Vilhjalmsson, H., Bickmore, T.: BEAT: The Behavior Expression Animation Toolkit. In: Proc. of SIGGRAPH (2001b) 8. Chafe, W.L.: Language and Consciousness. Language 50, 111–133 (1974) 9. Chafe, W.L.: The Deployment of Consciousness in the Production of a Narrative. In: Chafe, W.L. (ed.) The Pear Stories, pp. 9–50. Ablex, Norwood (1980) 10. Chafe, W.L.: Cognitive Constraint on Information Flow. In: Tomlin, R. (ed.) Coherence and Grounding in Discourse, pp. 20–51. John Benjamins, Amsterdam (1987) 11. Chen, L., Liu, Y., Harper, M.P., Shriberg, E.: Multimodal Model Integration for Sentence Unit Detection. In: Proceedings of ICMI, State College Pennsylvania, USA, October 13-15 (2004)
12. Condon, W.S., Sander, L.W.: Synchrony Demonstrated between Movements of the Neonate and Adult Speech. Child Development 45(2), 456–462 (1974) 13. De Ruiter, J.P.: The Production of Gesture and Speech. In: McNeill, D. (ed.) Language and Gesture, pp. 284–311. Cambridge University Press, UK (2000) 14. Dunn, J.: Children as Psychologist: The Later Correlates of Individual Differences in Understanding Emotion and Other Minds. Cognition and Emotion 9, 187–201 (1995) 15. Esposito, A.: Affect in Multimodal Information. In: Tao, J., Tan, T. (eds.) Affective Information Processing, pp. 211–234. Springer, Heidelberg (2008) 16. Esposito, A., Marinaro, M.: What Pauses Can Tell Us About Speech and Gesture Partnership. In: Esposito, A., et al. (eds.) Fundamentals of Verbal and Nonverbal Communication and the Biometric Issue. NATO Publishing Sub-Series E: Human and Societal Dynamics, vol. 18, pp. 45–57. IOS Press, The Netherlands (2007) 17. Esposito, A., Esposito, D., Refice, M., Savino, M., Shattuck-Hufnagel, S.: A Preliminary Investigation of the Relationships between Gestures and Prosody in Italian. In: Esposito, A., et al. (eds.) Fundamentals of Verbal and Nonverbal Communication and the Biometric Issue. NATO Publishing Sub-Series E: Human and Societal Dynamics, vol. 18, pp. 65–74. IOS Press, The Netherlands (2007) 18. Esposito, A.: Children’s Organization of Discourse Structure Through Pausing Means. In: Faundez-Zanuy, M., Janer, L., Esposito, A., Satue-Villar, A., Roure, J., Espinosa-Duro, V., et al. (eds.) NOLISP 2005. LNCS (LNAI), vol. 3817, pp. 108–115. Springer, Heidelberg (2006) 19. Esposito, A.: Pausing Strategies in Children. In: Proceedings of the International Conference in Nonlinear Speech Processing, Cargraphics, Barcelona, Spain, April 19-22, pp. 42–48 (2005) 20. Esposito, A., Marinaro, M., Palombo, G.: Children Speech Pauses as Markers of Different Discourse Structures and Utterance Information Content. In: Proceedings of the International Conference: From Sound to Sense: +50 Years of Discoveries in Speech Communication, June 10-13, pp. C139–C144. MIT, Cambridge (2004) 21. Esposito, A., Natale, A., Duncan, S., McNeill, D., Quek, F.: Speech and Gestures Pauses Relationships: A Hypothesis of Synchronization. In: Proceedings of the V National Conference on Italian Psychology, AIP, Grafica80-Modugno, Bari, Italy, pp. 95–98 (2003) (in Italian) 22. Esposito, A., Duncan, S., Quek, F.: Holds as gestural correlates to empty and filled pauses. In: Proc. of ICSLP, Colorado, vol. 1, pp. 541–544 (2002a) 23. Esposito, A., Gutierrez-Osuna, R., Kakumanu, P., Garcia, O.N.: Optimal Data Encoding for Speech Driven Facial Animation. Wright State University Technical Report N. CSWSU-04-02, Dayton, Ohio, USA 1-11 (2002b) 24. Esposito, A.: On Vowel Height and Consonantal Voicing Effects: Data from Italian. Phonetica 9(4), 197–231 (2002c) 25. Esposito, A., McCullough, K.E., Quek, F.: Disfluencies in Gesture: Gestural Correlates to Filled and Unfilled Speech Pauses. In: Proc. of IEEE Workshop on Cues in Communication, Hawai (2001) 26. Esposito, A., Stevens, K.N.: Notes on Italian Vowels: An Acoustical Study (Part I). Research Laboratory of Electronic, Speech Communication Working Papers 10, 1–42 (1995) 27. Erbaugh, M.S.: A Uniform Pause and Error Strategy for Native and Non-native Speakers. In: Tomlin, R. (ed.) Coherence and Grounding in Discourse, pp. 109–130. John Benjamins, Amsterdam (1987)
28. Ezzat, T., Geiger, G., Poggio, T.: Trainable Video Realistic Speech Animation. In: Proc. of SIGGRAPH, San Antonio, Texas, pp. 388–397 (2002) 29. Fasel, B., Luettin, J.: Automatic Facial Expression Analysis: A Survey. Pattern Recognition 36(1), 259–275 (2003) 30. Flavell, J.H.: Cognitive development: Children’s Knowledge About the Mind. Annual Review of Psychology 50, 21–45 (1999) 31. Freedman, N.: The Analysis of Movement Behaviour During the Clinical Interview. In: Siegmann, A.W., Pope, B. (eds.) Studies in Dyadic Communication, pp. 177–208. Pergamon Press, Oxford (1972) 32. Freedman, N., Van Meel, J., Barroso, F., Bucci, W.: On the Development of Communicative Competence. Semiotica 62, 77–105 (1986) 33. Fu, S., Gutierrez-Osuna, R., Esposito, A., Kakumanu, P., Garcia, O.N.: Audio/Visual Mapping with Cross-Modal Hidden Markov Models. IEEE Transactions on Multimedia 7(2), 243–252 (2005) 34. Goldmar Eisler, F.: Psycholinguistic: Experiments in Spontaneous Speech. Academic Press, London (1968) 35. Goldin-Meadow, S.: Gesture: How Our Hands Help Us Think. Harvard University Press, Cambridge (2003) 36. Green, D.W.: The Immediate Processing of Sentence. Quarterly Journal of Experimental Psychology 29, 135–146 (1977) 37. Gutierrez-Osuna, R., Kakumanu, P., Esposito, A., Garcia, O.N., Bojorquez, A., Castello, J., Rudomin, I.: Speech-Driven Facial Animation with Realistic Dynamics. IEEE Transactions on Multimedia 7(1), 33–42 (2005) 38. Hadar, U., Butterworth, B.L.: Iconic Gestures, Imagery and Word Retrieval in Speech. Semiotica 115, 147–172 (1997) 39. Kähler, K., Haber, J., Seidel, H.: Geometry-based Muscle Modeling for Facial Animation. In: Proc. of Inter. Conf. on Graphics Interface, pp. 27–36 (2001) 40. Kakumanu, P., Esposito, A., Gutierrez-Osuna, R., Garcia, O.N.: Comparing Different Acoustic Data-Encoding for Speech Driven Facial Animation. Speech Communication 48(6), 598–615 (2006) 41. Kakumanu, P., Gutierrez-Osuna, R., Esposito, A., Bryll, R., Goshtasby, A., Garcia, O.N.: Speech Dirven Facial Animation. In: Proc. of ACM Workshop on Perceptive User Interfaces, Orlando, November 15-16 (2001) 42. Kendon, A.: Spacing and Orientation in Co-present Interaction. In: Esposito, A., Campbell, N., Vogel, C., Hussain, A., Nijholt, A. (eds.) Second COST 2102. LNCS, vol. 5967, pp. 1– 15. Springer, Heidelberg (2010) 43. Kendon, A.: Some Topic in Gesture Study. In: Esposito, A., et al. (eds.) Fundamentals of Verbal and Nonverbal Communication and the Biometric Issue. NATO Publishing SubSeries E: Human and Societal Dynamics, vol. 18, pp. 1–17. IOS Press, The Netherlands (2007) 44. Kendon, A.: Gesture: Visible Action as Utterance. Cambridge University Press, Cambridge (2004) 45. Kendon, A.: Sign Languages of Aboriginal Australia: Cultural, Semiotic and Communicative. Cambridge University Press, Cambridge (1988) 46. Kendon, A.: Current Issues in the Study of Gesture. In: Nespoulous, J.L., et al. (eds.) The Biological Foundations of Gestures: Motor and Semiotic Aspects, pp. 23–27. LEA Publishers, Hillsdale (1986)
47. Kendon, A.: Gesticulation and Speech: Two Aspects of the Process of Utterance. In: Ritchie Key, M. (ed.) The Relationship of Verbal and Nonverbal Communication, pp. 207–227. Mouton and Co., The Hague (1980) 48. Kipp, M.: From Human Gesture to Synthetic Action. In: Proc. of Workshop on Multimodal Communication and Context in Embodied Agents, Montreal, pp. 9–14 (2001) 49. Kita, S., Özyürek, A.: What Does Cross-Linguistic Variation in Semantic Coordination of Speech and Gesture Reveal? Evidence for an Interface Representation of Spatial Thinking and Speaking. Journal of Memory and Language 48, 16–32 (2003) 50. Kita, S.: How Representational Gestures Help Speaking. In: McNeill, D. (ed.) Language and Gesture, pp. 162–185. Cambridge University Press, UK (2000) 51. Kowal, S., O’Connell, D.C., Sabin, E.J.: Development of Temporal Patterning and Vocal Hesitations in Spontaneous Narratives. Journal of Psycholinguistic Research 4, 195–207 (1975) 52. Krauss, R., Chen, Y., Gottesman, R.F.: Lexical Gestures and Lexical Access: A Process Model. In: McNeill, D. (ed.) Language and Gesture, pp. 261–283. Cambridge University Press, UK (2000) 53. Krauss, R., Morrel-Samuels, P., Colasante, C.: Do Conversational Hand Gestures Communicate? Journal of Personality and Social Psychology 61(5), 743–754 (1991) 54. Lee, Y., Terzopoulos, D., Waters, K.: Realistic Modeling for Facial Animation. In: Proc. of SIGGRAPH, pp. 55–62 (1995) 55. McNeill, D.: Gesture and Thought. In: Esposito, A., et al. (eds.) Fundamentals of Verbal and Nonverbal Communication and the Biometric Issue. NATO Publishing Sub-Series E: Human and Societal Dynamics, vol. 18, pp. 18–31. IOS Press, The Netherlands (2007) 56. McNeill, D.: Gesture and Thought. University of Chicago Press, Chicago (2005) 57. McNeill, D., Duncan, S.: Growth Points in Thinking for Speaking. In: McNeill, D. (ed.) Language and Gesture, pp. 141–161. Cambridge University Press, UK (2000) 58. McNeill, D.: Hand and Mind: What Gesture Reveal about Thought. University of Chicago Press, Chicago (1992) 59. Morsella, E., Krauss, R.M.: Muscular Activity in the Arm During Lexical Retrieval: Implications for Gesture-Speech Theories. Journal of Psycholinguistic Research 34, 415– 437 (2005) 60. Morsella, E., Krauss, R.M.: Can Motor States Influence Semantic Processing? Evidence from an Interference Paradigm. In: Columbus, A. (ed.) Advances in Psychology Research, vol. 36, pp. 163–182. Nova, New York (2005a) 61. Munhall, K.G., Jones, J.A., Callan, D.E., Kuratate, T., Vatikiotis-Bateson, E.: Visual Prosody and Speech Intelligibility. Psychological Science 15(2), 133–137 (2004) 62. O’Shaughnessy, D.: Timing Patterns in Fluent and Disfluent Spontaneous Speech. In: Proceedings of ICASSP Conference, Detroit, Detroit, pp. 600–603 (1995) 63. Oliveira, M.: Pausing Strategies as Means of Information Processing Narratives. In: Proceedings of the International Conference on Speech Prosody, Aix-en-Provence, pp. 539–542 (2002) 64. Prinosil, J., Smekal, Z., Esposito, A.: Combining Features for Recognizing Emotional Facial Expressions in Static Images. In: Esposito, A., Bourbakis, N.G., Avouris, N., Hatzilygeroudis, I., et al. (eds.) HH and HM Interaction. LNCS (LNAI), vol. 5042, pp. 56– 69. Springer, Heidelberg (2008) 65. Rimé, B., Schiaratura, L.: Gesture and Speech. In: Feldman, R.S., Rimé, B. (eds.) Fundamentals of Nonverbal Behavior, Cambridge University Press, pp. 239–284. Cambridge University Press, Cambridge (1992)
272
A. Esposito and A.M. Esposito
66. Rimé, B.: The Elimination of Visible Behaviour from Social Interactions: Effects of Verbal, Nonverbal, and Interpersonal Variables. European Journal of Social Psychology 12, 113–129 (1982) 67. Rogers, W.T.: The Contribution of Kinesic Illustrators Towards the Comprehension of Verbal Behaviour Within Utterances. Human Communication Research 5, 54–62 (1978) 68. Rosenfield, B.: Pauses in Oral and Written Narratives. Boston University Press (1987) 69. Short, J., Williams, E., Christie, B.: The Social Psychology of Telecommunications. Wiley, New York (1976) 70. Stocky, T., Cassell, J.: Shared Reality: Spatial Intelligence in Intuitive User Interfaces. In: Proc. of Intelligent User Interfaces, San Francisco, CA, pp. 224–225 (2002) 71. Shattuck-Hufnagel, S., Yasinnik, Y., Veilleux, N., Renwick, M.: A Method for Studying the Time Alignment of Gestures and Prosody in American English: ‘Hits’ and Pitch Accents in Academic-Lecture-Style Speech. In: Esposito, A., et al. (eds.) Fundamentals of Verbal and Nonverbal Communication and the Biometric Issue. NATO Publishing SubSeries E: Human and Societal Dynamics, vol. 18, pp. 32–42. IOS Press, The Netherlands (2007) 72. Thompson, L.A., Massaro, D.W.: Evaluation and Integration of Speech and Pointing Gestures During Referential Understanding. Journal of Experimental Child Psychology 42, 144–168 (1986) 73. Wellman, H.M.: Early Understanding of Mind: The Normal Case. In: Baron-Cohen, S., et al. (eds.) Understanding Other Mind: Perspective from Children with Autism, pp. 10–39. Oxford Univ. Press, Oxford (1993) 74. Williams, E.: Experimental Comparisons of Face-to-Face and Mediated Communication: A Review. Psychological Bulletin 84, 963–976 (1977) 75. Yasinnik, Y., Renwick, M., Shattuck-Hufnagel, S.: The Timing of Speech-Accompanying Gestures with Respect to Prosody. In: Proceedings of the International Conference: From Sound to Sense: +50 Years of Discoveries in Speech Communication, June 10-13, pp. C97–C102. MIT, Cambridge (2004)
Study of the Phenomenon of Phonetic Convergence Thanks to Speech Dominoes Amélie Lelong and Gérard Bailly GIPSA-Lab, Speech & Cognition dpt., UMR 5216 CNRS/Grenoble INP/UJF/U. Stendhal, 38402 Grenoble Cedex, France {amelie.lelong,gerard.bailly}@gipsa-lab.grenoble-inp.fr
Abstract. During an interaction, people are known to mutually adapt. Phonetic adaptation has been studied notably for prosodic parameters such as loudness, speech rate or fundamental frequency. In most cases, results are contradictory and the effectiveness of phonetic convergence during an interaction remains an open issue. This paper describes an experiment based on a children's game known as speech dominoes that enabled us to collect several hundred syllables uttered by different speakers in different conditions: alone before any interaction vs. after it, and in a mediated vs. a face-to-face interaction. Speech recognition techniques were then applied to globally characterize a possible phonetic convergence. Keywords: face-to-face interaction, phonetic convergence, mutual adaptation.
1 Introduction
Communication Accommodation Theory (CAT), introduced by Giles et al. [1], postulates that individuals accommodate their communication behavior either by moving much closer to their interlocutor (convergence) or, on the contrary, by increasing their differences (divergence). People can adapt to each other in different ways. For example, conversational partners notably adapt to each other's choice of words and references [2] and also converge on certain syntactic choices [3]. Zoltan-Ford [4] has shown that users of dialog systems converge lexically and syntactically to the spoken responses of the system. Ward et al. [5] demonstrated that adaptive systems mimicking this behavior facilitate learning. This alignment [6] may have several benefits, such as easing comprehension [7], facilitating the exchange of messages whose meaning is highly context-dependent [8], disclosing the ability and willingness to perceive, understand or accept new information [9], and maintaining social glue or resonance [10]. Researchers have also examined the adaptation of phonetic dimensions such as pitch [11], speech rate [12], loudness [13] and dispersions of vocalic targets [14], as well as more global alignment such as turn-taking [15]. But the results of these different studies show weak convergence and, in some cases, no convergence at all. In the perceptual study conducted by Pardo [16], disparities between talkers have been attributed to various dimensions such as social settings, communication goals and varying roles in the conversation. Sex differences have also been put forward: female interlocutors show more convergence than males.
This emerging field of research is crucial to understanding adaptive behavior during unconstrained conversation on the one hand, and to versatile speech technologies that aim at substituting one partner with an artificial conversational agent on the other. The literature shows that two main challenges persist: (a) the need for original experiments that allow us to collect sufficient phonetic material to study and isolate the impact of the numerous factors influencing adaptation; (b) the use of automatic techniques for characterizing the degree of convergence, if any.
2 State of the Art
In the following section, several influential articles that thoroughly summarize research on phonetic adaptation are presented.
2.1 Convergence and Social Role
There are only a few studies that explain the role of convergence in a social interaction. Different interpretations have been given. First of all, convergence could be a consequence of the episodic memory system [17]. People keep a trace of all their multimodal experiences during social interaction. An exemplar-based retrieval of previous behavior given a similar social context is triggered, so that the current interaction benefits from previous attunement. Adaptation can also be used in a community to let a more stable form emerge across those present in the community [18], or to help people define their identity by categorizing others and themselves into groups that are constantly compared and evaluated [19]. Other studies have shown that convergence may help to accomplish a mutual goal [20], align representations [18], increase the quality of an interaction [21], and furthermore contribute to mutual comprehension by decreasing social distance [21]. According to Labov [22], convergence could be due to the need to add emphasis to expression and persist for the next interaction. Finally, adaptation could be interpreted as a behavioral strategy to achieve particular social goals such as approval [11, 23] or desirability [24].
2.2 Description of Key Studies on Phonetic Convergence
Pardo [16] examined whether pairs of talkers converged in their phonetic repertoire during a single conversational interaction called a map task. Six same-sex pairs were recruited to solve a series of 5 map tasks in which their roles – instruction giver or receiver – were exchanged. The advantage of the map task is that landmark names are uttered several times during the interaction by each interlocutor, since the receiver has to replicate the itinerary described by the giver. One or two weeks before any interaction, talkers read out the set of map-task landmark labels in order to obtain reference pronunciations. Just after the interaction, the same procedure was performed again to test the persistence of convergence, i.e. to distinguish stimulus-dependent mimicry from mimesis, which is supposed to originate from a deeper change of phonetic representations [25]. To measure convergence, 30 listeners were asked to judge the similarity between pronunciations of pre-, map- and post-task landmark labels in an AXB test, X being a map-task utterance and (A, B) pre-, map- or
post-task versions of the same utterance pronounced by the corresponding partner. Results of this forced choice showed significant main effects of exposure and persistence, but there was also a dependence on role and sex: givers' instructions converged more than receivers' instructions, particularly for female givers. This is in agreement with the results found by Namy [26]. Delvaux and Soquet [14] questioned the influence of ambient speech on the pronunciations of some keywords. These keywords were chosen in order to collect representatives of two sounds (the mid-open vowels [n] and [']) whose allophonic variations are typical of the two dialects of French spoken in Belgium. During these non-interactive experiments, subjects were asked to describe a simple scene: “C’est dans X qu’il y a N Y” (It’s in X that there are N Y), where X were locations, N numbers and Y objects. This description was uttered either by the speaker or by recorded speakers using the same or the other dialect. Pre- and post-tasks were also performed for the same reasons stated previously. The phonetic analysis focused on the production of the two sounds that were used in the two possible labels X. The authors looked for unintentional imitation. To characterize the amplitude of that change, they compared durations and spectral characteristics of the target sounds. In most cases, small but significant displacements towards the prototypes of the other ambient dialect were observed for both sounds (see the lowering of the canonical values in Tests 1 and 2 in Fig. 1). Similar unconscious imitation of characteristics of ambient speech has also been observed by Gentilucci et al. [27] for audiovisual stimulations.
Fig. 1. Results on spectral distance calculated by Delvaux and Soquet [14]. It can be seen that, during tests (Tests 1 & 2), subjects are getting away from their own reference (Pre-test) and closer to the other dialect (References 1 & 2).
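To make the kind of comparison reported in Fig. 1 concrete, the sketch below (Python) computes a simple spectral distance between a speaker's vowel tokens and a dialect prototype; MFCC features, the file lists and the Euclidean metric are assumptions of this illustration, not the parameterisation actually used by Delvaux and Soquet.

# Hedged sketch: spectral distance of vowel tokens from a dialect prototype.
import numpy as np
import librosa

def mean_mfcc(path, sr=16000, n_mfcc=13):
    # Average MFCC vector over all frames of one vowel token (one WAV file per token).
    y, _ = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).mean(axis=1)

def prototype(token_paths):
    # Dialect prototype = centroid of the mean MFCC vectors of its tokens.
    return np.mean([mean_mfcc(p) for p in token_paths], axis=0)

def spectral_distance(token_paths, proto):
    # Mean Euclidean distance of a set of tokens from a prototype.
    return float(np.mean([np.linalg.norm(mean_mfcc(p) - proto) for p in token_paths]))

# Hypothetical usage: convergence towards the other dialect would show up as
# spectral_distance(test_tokens, other_proto) < spectral_distance(pretest_tokens, other_proto).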
Aubanel and Nguyen [28] also conducted experiments to study the mutual influence between French accents, i.e. northern versus southern, that could be part of the subjects' experience. They proposed an original paradigm to collect dense interactive corpora made up of uncommon proper nouns. They defined some criteria in order to discriminate the two accents, i.e. schwa, back mid vowels, mid vowels in word-final syllables, coronal stops, and nasal vowels. Uncommon proper nouns containing these segments were chosen so as to maximize the coverage of alternative spellings. They chose
their subjects in a major high school and grouped them according to their sex and to a similar score on the Crowne-Marlowe [29] social desirability scale. One week before any interaction, subjects read out three sets of 16 names to get reference pronunciations. This session was repeated just after the interactions to measure mimesis. During the interaction, dyads were asked to associate names with photographs and the corresponding characters' statements. Aubanel and Nguyen used a Bayes classifier to automatically assign subjects to a group and tested different levels of convergence in the dyads (towards the interlocutor, the interlocutor's group and accent) using linear discriminant analysis performed on spectral targets. They found very few instances of convergence. Additionally, convergence was quite dependent on the critical segments analyzed, the sessions and the pairs.
2.3 Comments
These studies show that phonological and phonetic convergence is very weak. The experimental paradigms used so far either collect few instances (typically a dozen in Aubanel and Nguyen) of a few key segments or many instances of a very small set of key segments (two in Delvaux and Soquet). These segments are always produced in a controlled context within key words. Both studies have focused on inter-dialectal convergence and on segments that carry most of the dialectal variation. This a priori choice is questionable, since it remains to be shown that subjects negotiate these critical segments first, or more easily than others. Since convergence is segment-dependent, it is interesting to study the speakers' alignment on the common repertoire of their mother tongue. In our experiments, we will examine the convergence of the eight French peripheral oral vowels. In most studies, interlocutors or ambient speech are not known a priori by the subjects. The authors were certainly expecting to observe on-line convergence as the dialog proceeds. The hypothesis that adaptation and alignment are immediate and fast is questionable: in the following, we will compare the convergence of strangers with that of good friends.
Table 1. First speech dominoes used in the interactive scenario. Interlocutors have to choose and utter alternatively the rhyming words. Correct chainings of rhymes are highlighted with a dark background.
Tm+!c`r+ !>`Hm+!>Hr+!>Dr+ !g`s+ !>`Te+!lHs+!>`m+!mHB+!yHB+ !y`9s, !>HB+ !>Hl+ !>`Tr] ([!] indicates a stressed syllable). Typical frequent CVCC-syllables are [!>Tms+ !>Hrs+ü!>`kr+ !mHBs+!y`9js+ !ln9ms+ !>`ks+ !jNls]. Typical frequent CCV-syllables are [!srt9+sr?+ !jk`H+!sr`H+!Roh9]. Typical frequent CCVC-syllables are [!sr?m] and [!RtD5n].
Table 1. The ten most frequent words in the categories noun, verb, adjective/adverb and other (i.e. pronouns and particles; particles comprise prepositions, conjunctions, and interjections [11]), in our corpus of Standard German; N = frequency of occurrence of that word.
Nouns (N): “Mama” (mom) 392; “Bär” (bear) 278; “Papa” (dad) 235; “Mond” (moon) 217; “Kinder” (children) 190; “Katze” (cat) 147; “Frau” (wife) 145; “Bett” (bed) 106; “Mädchen” (girl) 105; “Wasser” (water) 104
Verbs (N): “ist” (is) 793; “hat” (has) 448; “sagt” (says) 413; “war” (was) 246; “kann” (can) 184; “wird” (will be) 159; “will” (want) 156; “sagte” (said) 131; “muss” (must) 120; “sieht” (sees) 112
Adj./Adv. (N): “kleine” (little) 287; “mehr” (more) 126; “schnell” (fast) 90; “viel” (much) 75; “kleinen” (little) 74; “fest” (fixed) 67; “genau” (exactly) 60; “großen” (large) 59; “einfach” (simple) 58; “große” (large) 58
Others (N): “und” (and) 2367; “die” (the) 1678; “der” (the) 1644; “sie” (she/it) 1391; “das” (the) 891; “den” (the) 831; “ein” (a) 781; “er” (he) 777; “es” (it) 764; “in” (in) 616
Table 2. Number N of most frequent syllables occurring at least M times within the corpus, and the percentage of text or speech which can be produced using only these N most frequent syllables.
N = 477 syllables, each occurring M >= 40 times: 75% of the sentences within the corpus
N = 856 (M >= 20): 85%
N = 1396 (M >= 10): 91%
N = 2139 (M >= 5): 96%
N = 2843 (M >= 3): 98%
N = 3475 (M >= 2): 99%
N = 4763 (M >= 1): 100%
The training of the phonetic map (P-MAP) was done in two steps. First, the training set (comprising phonemic, auditory, and motor plan states) was established for the 200 most frequent syllables. This was done by (i) choosing one acoustic realization of each syllable produced by one speaker of Standard German (33 years old, male), who uttered a selection of the sentences listed in the children's book corpus, and (ii) applying an articulatory-acoustic re-synthesis method [13] in order to generate the appropriate motor plans. Each auditory state is based on the acoustic realization and is represented in our model as a short-term memory spectrogram comprising 24 × 65 neurons, where 24 rows of neurons represent the 24 critical bands (20 to 16000 Hz) and where 65 columns represent successive time intervals of 12.5 ms each (overall length of short-term time interval: 812.5 ms). The degree of activation of each neuron represents the spectral energy within a time-frequency interval. Each motor plan state is based on the motor plan generated by our re-synthesis method [13] and is represented in the neural model by a vocal tract action score as introduced in [14]. The score is determined by considering (i) a specification of the temporal organization of vocal tract actions within each syllable (i.e. 11 action rows over the whole short-term time interval: 11 × 65 neurons) and (ii) a specification of each type of action (4 × 17 for consonantal and 2 × 15 for vocalic actions; assuming CCVCC as the maximally complex syllable structure). Each phonemic state is based on the discrete description of all segments (allophones) of each syllable: 159 neurons in total. In the second step, this syllabic sensorimotor training set, covering the 200 most frequent syllables, was applied in order to train three P-MAPs of different sizes, i.e. self-organizing neuron maps with 15 × 15, 20 × 20, and 25 × 25 neurons, respectively. 5000 incremental training cycles were computed using standard training conditions for self-organizing maps [3]. The training of the P-MAP can be called associative training, since phonemic, motor, and sensory states are presented synchronously to the network for each syllable. Each cycle comprised 703 incremental training steps, and each syllable was represented within the training set proportionally to the frequency of its occurrence in the children's book corpus; i.e. the most frequent syllable occurred 25 times per training cycle, while the least frequent syllable (number 200 in the ranking) occurred once per cycle. Thus, the least frequent syllable appeared 5000 times in total, and the most frequent syllable appeared 125000 times in total in the training.
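As an illustration of such an associative training loop, a minimal hand-rolled Kohonen update in Python/NumPy is sketched below; the joint state dimensionality follows the figures given above, but the random placeholder states, the frequency values and the learning-rate/neighbourhood schedules are simplifying assumptions, not the exact training conditions of [3].

# Hedged sketch: frequency-weighted self-organizing-map (P-MAP) training.
import numpy as np

rng = np.random.default_rng(0)
DIM = 159 + 24 * 65 + 11 * 65 + 4 * 17 + 2 * 15   # phonemic + auditory + motor-plan states
GRID = 25                                          # 25 x 25 P-MAP
weights = rng.random((GRID, GRID, DIM))
syllable_states = rng.random((200, DIM))           # placeholder for the real joint states
frequencies = rng.integers(1, 26, size=200)        # placeholder corpus frequencies (1..25)
coords = np.stack(np.meshgrid(np.arange(GRID), np.arange(GRID), indexing="ij"), axis=-1)

def train(cycles=5000, lr0=0.5, sigma0=GRID / 2.0):
    # One cycle presents every syllable proportionally to its corpus frequency
    # (the paper reports 703 incremental steps per cycle for the 200-syllable set).
    order = np.repeat(np.arange(200), frequencies)
    total = cycles * len(order)
    t = 0
    for _ in range(cycles):
        rng.shuffle(order)
        for idx in order:
            x = syllable_states[idx]
            lr = lr0 * np.exp(-t / total)                      # decaying learning rate
            sigma = max(sigma0 * np.exp(-t / total), 1e-3)     # shrinking neighbourhood
            win = np.unravel_index(np.argmin(np.linalg.norm(weights - x, axis=-1)),
                                   (GRID, GRID))               # best-matching neuron
            d2 = np.sum((coords - np.array(win)) ** 2, axis=-1)
            h = np.exp(-d2 / (2.0 * sigma ** 2))               # Gaussian neighbourhood
            weights[...] += lr * h[..., None] * (x - weights)  # standard Kohonen update
            t += 1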
4 Results
Our simulation experiments indicate that a P-MAP comprising at least 25 × 25 neurons is needed in order to represent all 200 syllables. 158 syllables were represented in the 15 × 15 phonetic map, and 176 syllables were represented in the 20 × 20 map (see Fig. 2) after training was complete.
Fig. 2. Organization of the 20 × 20 neuron P-MAP. Each box represents a neuron within the self-organizing neural map. A syllable appears only if the activation of its phonemic state is greater than 80% of maximum activation.
While most of the syllables are represented by only one neuron in the 15 × 15 map, approximately the 100 most frequent syllables are represented by two or more neurons in the 20 × 20 and 25 × 25 maps. This allows the map to represent more than one realization for each of these syllables (e.g. [!c`] is represented by 3 neurons, while [!c`m] and [!j`m] are represented by only one neuron each in the 20 × 20 map:
see Fig. 2). It should be noted that the syllables in Figure 2 are loosely ordered with respect to syllable structure (e.g. CV vs. CCV or CVC), vowel type (e.g. [i] vs. [a]) and consonant type (e.g. plosive vs. fricative or nasal).
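For reference, a small follow-up sketch shows how a map like the one in Fig. 2 could be labelled from a trained P-MAP; treating activation as inverse distance to the phonemic part of each neuron is an assumption of this illustration, not the activation function of the model itself.

# Hedged sketch: label each P-MAP neuron by its most strongly activating syllable,
# keeping a label only if the activation exceeds 80% of the map-wide maximum.
import numpy as np

def label_map(weights, syllable_states, names, threshold=0.8):
    phon = weights[..., :159]                                  # phonemic part of each neuron
    acts = np.empty(weights.shape[:2] + (len(names),))
    for k, state in enumerate(syllable_states):
        dist = np.linalg.norm(phon - state[:159], axis=-1)
        acts[..., k] = 1.0 / (1.0 + dist)                      # assumed inverse-distance activation
    peak = acts.max()
    labels = {}
    for i in range(weights.shape[0]):
        for j in range(weights.shape[1]):
            k = int(np.argmax(acts[i, j]))
            if acts[i, j, k] > threshold * peak:
                labels[(i, j)] = names[k]
    return labels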
5 Discussion
Our neural model of speech processing as developed thus far is capable of simulating the basic processes of acquiring the motor plan and sensory states of frequent syllables of a natural language by using unsupervised associative learning. This process is illustrated here on the basis of our Standard German children's book corpus, which shows that 96% of fluent speech can be produced using only the 2000 most frequent syllables. These frequent syllables are assumed to be produced directly by activating stored motor plans, without using complex motor processing routines. In our neural network model, the sensory and motor information about frequent syllables is stored by the dynamic link weights of the neural associations occurring between a self-organizing P-MAP and neural state maps for motor plan, auditory, somatosensory, and phonemic states. Thus, a neuron within the P-MAP represents a syllable, which, if activated, leads to a syllable-specific activation pattern within each neural state map. These neural activations represent “internal speech” or “verbal imagery” [15], i.e. “how to articulate a syllable” (motor plan state), “what a syllable sounds like” (auditory state), and “what a syllable articulation feels like” (somatosensory state), without actually articulating that syllable. While in earlier experiments our simulations were based on an artificial and completely symmetric model language, comprising five vowels [i, e, D, o, a] and nine consonants [b, d, g, p, t, k, m, n, l], all combinations of vowels and consonants as CV-syllables, and all combinations of the four CC-clusters [bl, gl, pl, kl] with all vowels as CCV-syllables, this paper gives the first results of simulation experiments based on a natural language, i.e. based on the 200 most frequent syllables of Standard German as they occur in our children's book corpus, including phonetic simplifications which typically occur in children's word production. While syllables are strictly ordered with respect to phonetic features in the P-MAP in the case of the model language (see [3], [7], and [8]), we can see here that syllables are ordered more “loosely” in the case of a natural language. This is due to the fact that natural languages are less symmetrical than the model language, owing to the gaps in syllable structure which are present in a natural language, i.e. not all combinations of vowels and consonants are equally likely to occur in a natural language as they are in a model language. Furthermore, our simulations indicate that the representation of 200 syllables within the P-MAP requires a minimum map size of 25 × 25 neurons. Phonetic maps of 15 × 15 or 20 × 20 neurons were not capable of representing all 200 syllables. In order to be able to account for the complete acquisition of a language, more than 200 syllables (up to 2000) must be included in the training set, so the size of the P-MAP and the S-MAP must be increased before this will be possible (cf. [9]).
Acknowledgments. We thank Cornelia Eckers and Cigdem Capaat for building the corpus. This work was supported in part by the German Research Council (DFG) grant Kr 1439/13-1 and grant Kr 1439/15-1, and in part by COST Action 2102.
References 1. Guenther, F.H., Ghosh, S.S., Tourville, J.A.: Neural modeling and imaging of the cortical interactions underlying syllable production. Brain and Language 96, 280–301 (2006) 2. Guenther, F.H., Vladusich, T.: A neural theory of speech acquisition and production. Journal of Neurolinguistics (in press) 3. Kröger, B.J., Kannampuzha, J., Neuschaefer-Rube, C.: Towards a neurocomputational model of speech production and perception. Speech Communication 51, 793–809 (2009) 4. Levelt, W.J.M., Roelofs, A., Meyer, A.: A theory of lexical access in speech production. Behavioral and Brain Sciences 22, 1–75 (1999) 5. Levelt, W.J.M., Wheeldon, L.: Do speakers have access to a mental syllabary? Cognition 50, 239–269 (1994) 6. Wade, T., Dogil, G., Schütze, H., Walsh, M., Möbius, B.: Syllable frequency effects in a context-sensitive segment production model. Journal of Phonetics 38, 227–239 (2010) 7. Kröger, B.J.: Computersimulation sprechapraktischer Symptome aufgrund funktioneller Defekte. Sprache-Stimme-Gehör 34, 139–145 (2010) 8. Kröger, B.J., Miller, N., Lowit, A.: Defective neural motor speech mappings as a source for apraxia of speech: Evidence from a quantitative neural model of speech processing. In: Lowit, A., Kent, R. (eds.) Assessment of Motor Speech Disorders. Plural Publishing, San Diego (in press) 9. Li, P., Farkas, I., MacWhinney, B.: Early lexical development in a self-organizing neural network. Neural Networks 17, 1345–1362 (2004) 10. Kohler, W.: Einführung in die Phonetik des Deutschen. Erich Schmidt Verlag, Berlin (1995) 11. Glinz, H.: Deutsche Syntax. Metzler Verlag, Stuttgart (1970) 12. Ferguson, C.A., Farwell, C.B.: Words and sounds in early language acquisition. Language 51, 419–439 (1975) 13. Bauer, D., Kannampuzha, J., Kröger, B.J.: Articulatory Speech Re-Synthesis: Profiting from natural acoustic speech data. In: Esposito, A., Vích, R. (eds.) Cross-Modal Analysis of Speech, Gestures, Gaze and Facial Expressions. LNCS (LNAI), vol. 5641, pp. 344–355. Springer, Heidelberg (2009) 14. Kröger, B.J., Birkholz, P., Lowit, A.: Phonemic, sensory, and motor representations in an action-based neurocomputational model of speech production (ACT). In: Maassen, B., van Lieshout, P. (eds.) Speech Motor Control: New Developments in Basic and Applied Research, pp. 23–36. Oxford University Press, Oxford (2010) 15. Ackermann, H., Mathiak, K., Ivry, R.B.: Temporal organization of “internal speech” as a basis for cerebellar modulation of cognitive functions. Behavioral and Cognitive Neuroscience Reviews 3, 14–22 (2004)
Neurophysiological Measurements of Memorization and Pleasantness in Neuromarketing Experiments Giovanni Vecchiato1,2 and Fabio Babiloni1,2 1 Dept. Physiology and Pharmacology, Univ. of Rome Sapienza, 00185, Rome, Italy 2 IRCCS Fondazione Santa Lucia, via Ardeatina 306, 00179, Rome, Italy
[email protected]
Abstract. The aim of this study was to analyze the brain activity occurring during the “naturalistic” observation of commercial ads. In order to measure both the brain activity and the emotional engagement, we used electroencephalographic (EEG) recordings and the high-resolution EEG technique to obtain an estimation of the cortical activity during the experiment. Results showed that the TV commercials proposed to the analyzed population increased the cortical activity, mainly in the theta band in the left hemisphere, when they were subsequently memorized and judged pleasant. A correlation analysis also revealed that the increase of the EEG Power Spectral Density (PSD) at left frontal sites is negatively correlated with the degree of pleasantness perceived. Conversely, the de-synchronization of left frontal alpha activity is positively correlated with judgments of high pleasantness. Moreover, our data also presented an increase of PSD related to the observation of unpleasant commercials. Keywords: Neuromarketing, EEG, EEG frontal asymmetry, high resolution EEG, TV commercials.
1 Introduction
In recent years we have witnessed an increased interest in the use of brain imaging techniques, based on hemodynamic or electromagnetic recordings, for the analysis of brain responses to commercial advertisements or for the investigation of the purchasing attitudes of subjects [1, 2, 3, 4]. The interest is justified by the possibility of correlating the particular observed brain activations with the characteristics of the proposed commercial stimuli, in order to derive conclusions about the adequacy of such ad stimuli to be interesting, or emotionally engaging, for the subjects. Standard marketing techniques employed so far involve the use of an interview and the compilation of a questionnaire by the subjects after exposure to novel commercial ads, before the mass launch of the ad itself (ad pre-test). However, it is now recognised that verbal advertising pre-testing is often flawed by the respondents' cognitive processes activated during the interview, since implicit memory and the subject's feelings are often inaccessible to the interviewer who uses
traditional techniques [5]. In addition, it has also been suggested that in these typical pre-testing interviews the interviewer has a great influence on what the respondent recalls and on the subjective experiencing of it [6, 7]. Taking all these considerations into account, researchers have attempted to investigate the signs of brain activity correlated with an increase of attention, memory or emotional engagement during the observation of such commercial ads. Researchers within the consumer neuroscience community promote the view that findings and methods from neuroscience complement and illuminate existing knowledge in consumer research in order to better understand consumer behaviour [8, 9]. The use of electroencephalographic (EEG) measurements allows the brain activity to be followed on a millisecond basis, but it has the problem that the recorded EEG signals are mainly due to the activity generated on the cortical structures of the brain. In fact, the electromagnetic activity elicited by the deep structures advocated for the generation of emotional processing in humans is almost impossible to gather from the usual superficial EEG electrodes [10, 11]. It has been underlined that a positive or negative emotional processing of the commercial ads is an important factor for the formation of stable memory traces [12]. Hence, it becomes relevant to infer the emotional engagement of the subject by using indirect signs for it. Indirect variables of emotional processing can also be gathered by tracking variations of the activity of other anatomical structures linked to emotional processing in humans, such as the prefrontal and frontal cortex (PFC and FC, respectively; [13, 8]). The PFC region is structurally and functionally heterogeneous, but its role in emotion is well recognized [14, 9]. EEG spectral power analyses indicate that the anterior cerebral hemispheres are differentially lateralized for approach and withdrawal motivational tendencies and emotions. Specifically, findings suggest that the left PFC is an important brain area in a widespread circuit that mediates appetitive approach, while the right PFC appears to form a major component of a neural circuit that instantiates defensive withdrawal [15, 16]. In this study we were interested in analysing the brain activity occurring during the “naturalistic” observation of commercial ads intermingled in a random order in a documentary. To measure both the brain activity and the emotional engagement, we used the EEG and the high-resolution EEG technique to obtain an estimation of the cortical activity during the experiment. The aim was to link significant variations of the EEG measurements with the memory and pleasantness of the stimuli presented, as assessed subsequently in the subject's verbal interview. In order to do that, different indexes were employed to summarize the cerebral measurements performed and used in the statistical analysis. In order to recreate, as much as possible, a “naturalistic” approach to the task, the observer watched the TV screen without particular goals in mind. In fact, the subjects were not instructed at all on the aim of the task, and they were not aware that an interview about the TV commercials intermingled with the documentary would be held at the end of the task. The experimental questions of the present study are the following:
1. In the particular task employed and for the analyzed population, are there particular EEG activities in the spectral domain that correlate with the memorization performed or the pleasantness perceived by the subjects?
2. Does there exist any EEG frontal asymmetrical activity when we are watching pleasant and unpleasant commercial advertisements?
3. Is it possible to extract from the EEG signals a descriptor which is strictly correlated with the degree of perceived pleasantness?
In the following pages, a detailed description of the two different experiments and the related methodologies employed will be presented. Subsequently, the results derived from the experiments will be described, and a general discussion of the significance of such results against the existing literature will close the scientific part of the work.
2 Materials and Methods
High-resolution EEG technologies have been developed to enhance the poor spatial information content of the EEG activity [17, 18, 10, 19, 20]. Basically, these techniques involve the use of a large number (64-256) of scalp electrodes. In addition, high-resolution EEG techniques rely on realistic MRI-constructed head models and spatial de-convolution estimations, which are usually computed by solving a linear-inverse problem based on Boundary-Element Mathematics [21, 22]. Subjects were comfortably seated on a reclining chair, in an electrically shielded, dimly lit room. In the present work, the cortical activity was estimated from scalp EEG recordings by using realistic head models whose cortical surface consisted of about 5000 uniformly disposed triangles. The current density estimation of each triangle, which represents the electrical dipole of the underlying neuronal population, was computed by solving the linear-inverse problem according to the techniques described in previous papers [23, 24, 25].
2.1 Experiment 1
Fifteen healthy volunteers (mean age 27.5±7.5 years; 7 women and 8 men) were recruited for this study. The experimental task consisted of watching a thirty-minute documentary in which we inserted three advertising breaks: the first one after eight minutes from the beginning, the second one in the middle and the last one at the end of the movie. Each interruption was formed by the same number of commercial videoclips of about thirty seconds. During the whole documentary, a total of six TV commercials was presented. The clips were related to standard international brands of commercial products, like cars, food, etc. and public service announcements (PSA) such as campaigns against violence. Randomization of the occurrence of the commercial videos within the documentary was performed to remove the factor “sequence” as a possible confounding effect in the following analysis. During the observation of the documentary and TV commercials, subjects were not aware that an interview would be held within a couple of hours from the end of the movie. They were simply told to pay attention to what they would watch, and no mention of the importance of the commercial clips was made. In the interview, subjects were asked to recall the commercial clips they remembered. In addition, a
question on the pleasantness of the advertisements was asked. According to the information acquired, the neurophysiologic activity recorded was divided into four different datasets. The first pool was related to the activity collected during the viewing of the commercial clips that the subjects had correctly remembered, and this dataset was named RMB. The second pool was related to the activity collected during the observation of the TV commercials that had been forgotten by the subjects, and this set was named FRG. The third pool is formed by the activity of subjects who affirmed that they liked the advertisement under examination. This group has been named LIKE. Analogously, the fourth and last group comprises all the cerebral and autonomic activity of subjects who answered the question on likeability in a negative way. We refer to this dataset as DISLIKE. In such a case, these two datasets (LIKE/DISLIKE) only take into account the emotional feeling of the subject, since he/she is asked to answer the question “Did you like the commercial you have seen in the movie?”. Hence, an advertisement could be labelled as DISLIKE even though the subject found it meaningful or interesting. In fact, the question does not investigate cognitive aspects but only the degree of pleasantness perceived. Finally, the neurophysiologic activity during the observation of the documentary was also analyzed, and a final pool of data related to this state was generated with the name REST. This REST period was taken as the period in which the subject looked at the documentary. We took into account a two-minute-long sequence of the documentary, immediately before the appearance of the first spot interruption, chosen in order to minimize the variations of the spectral responses owing to fatigue or loss of concentration. The cerebral activity was recorded by means of a portable 64-channel system (BE+ and Galileo software, EBneuro, Italy). Informed consent was obtained from each subject after explanation of the study, which was approved by the local institutional ethics committee. All subjects were comfortably seated on a reclining chair, in an electrically-shielded, dimly-lit room. Electrode positions were acquired in 3D space with a Polhemus device for the successive positioning on the head model employed for the analysis. Recordings were initially extra-cerebrally referred and then converted to an average reference off-line. We collected the EEG activity at a sampling rate of 256 Hz, while the impedances were kept below 5 kΩ. Each EEG trace was then converted into the Brain Vision format (BrainAmp, Brainproducts GmbH, Germany) in order to perform signal pre-processing such as artefact detection, filtering and segmentation. Raw EEG traces were first band-pass filtered (high pass = 2 Hz; low pass = 47 Hz) and Independent Component Analysis (ICA) was then applied to detect and remove components due to eye movements, blinks, and muscular artefacts. These EEG traces were then segmented to obtain the cerebral activity during the observation of the TV commercials and that associated with the REST period. Since we recorded such activity from fifteen subjects, for each proposed advertisement we collected fifteen trials, which were grouped and averaged to obtain the results illustrated in the following sections. This dataset has been used to evaluate the cortical activity and calculate the power spectral density (PSD) for each segment according to the Welch method [38].
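A minimal sketch of this last step, assuming the cleaned and segmented EEG is available as a NumPy array of shape (channels × samples) at 256 Hz and using SciPy's implementation of the Welch method, is given below; the fixed band limits are illustrative only, since Experiment 2 defines its bands from the Individual Alpha Frequency.

# Hedged sketch: Welch PSD per channel and mean band power for one EEG segment.
import numpy as np
from scipy.signal import welch

def band_power(segment, fs=256, band=(4.0, 8.0), nperseg=256):
    # segment: array of shape (n_channels, n_samples); returns one value per channel.
    freqs, psd = welch(segment, fs=fs, nperseg=nperseg, axis=-1)
    mask = (freqs >= band[0]) & (freqs < band[1])
    return psd[:, mask].mean(axis=-1)

# Hypothetical usage for one commercial-viewing segment:
# theta_power = band_power(eeg_segment, band=(4.0, 8.0))
# alpha_power = band_power(eeg_segment, band=(8.0, 12.0))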
2.2 Experiment 2
Eleven voluntary and healthy undergraduate students of our faculty participated in the study (age, 22–25 years; 8 males and 3 females). They had no personal history of neurological or psychiatric disorder. They were free from medications, or alcohol or drugs abuse. For the EEG data acquisition, subjects were comfortably seated on a reclining chair in an electrically shielded and dimly lit room. They were exposed to the vision of a film of about 30 minutes and asked to pay attention to the above stimuli; they were not aware of the aim of the experiment and did not know that an interview would be performed after the recording. The movie consisted of a neutral documentary. Three interruptions were generated: one at the beginning, the second at the middle and the last one at the end of the documentary. Each interruption was composed of six 30-second-long commercial video-clips. Eighteen commercials were shown during the whole documentary. The TV spots were related to standard international brands of commercial products, such as cars and food, and to non-profit associations, such as FAO and Greenpeace. They had never been broadcast in the country in which the experiment was performed. Hence, the advertising material was new to the subjects, as was the documentary they observed. Two hours after the end of the recording, each experimental subject was contacted and an interview was performed. In this questionnaire, the experimenter asked the subjects to recall the clips they remembered. Firstly, the operator verbally listed the sequence of advertisements presented within the documentary, asking them to tell which they remembered, one by one. Successively, the interviewer showed the subject several sheets, each presenting several frame sequences of each commercial inserted in the movie, in order to solicit the memory of the stimuli presented. Along with these pictures, we also showed an equal number of ads which we did not choose as stimuli. This was done to provide the subject with the same number of distractors as target pictures. Finally, for each advertisement the subjects remembered, we asked them to give a score ranging between 1 and 10 according to the level of pleasantness they perceived during the observation of the ad (1, lowly pleasant; 5, indifferent; 10, highly pleasant). The EEG signals were segmented and classified according to the rated pleasantness score in order to group, in different datasets, the neuroelectrical activity elicited during the observation of commercials. Moreover, for each subject, a two-minute EEG segment related to the observation of the documentary was further taken into account as baseline activity. In the following analysis, we considered only those pleasantness scores which had been expressed by at least three subjects in the analyzed population, in order to avoid outliers. According to this criterion, we discarded the EEG activity related to the ads that had been rated as 1, 2 and 10. The signals associated with the lowest pleasantness ratings, from 3 to 5, have been labelled as the DISLIKE dataset; conversely, the ones related to the higher ratings, from 7 to 9, have been labelled as the LIKE dataset. In such a case, these two datasets (LIKE/DISLIKE) only take into account the emotional feeling of the subject, since he/she is asked to answer the question “Did you like the
commercial you have seen in the movie?”. Hence, an advertisement could be labelled as DISLIKE even though the subject found it meaningful or interesting. In fact, the question does not investigate cognitive aspects but only the degree of pleasantness perceived. A 96-channel system with a sampling frequency of 200 Hz (BrainAmp, Brainproducts GmbH, Germany) was used to record the EEG electrical potentials by means of an electrode cap which was built according to an extension of the 10-20 international system to 64 channels. A linked-ears reference was used. Since a clear role of the frontal areas has been depicted for the phenomena we would like to investigate [13, 14, 15], we used the left and right frontal and prefrontal electrodes of the 10-20 international system to compute the following spectral analysis. In particular, we considered the following couples of homologous channels: Fp2/Fp1, AF8/AF7, AF4/AF3, F8/F7, F6/F5, F4/F3, F2/F1. The EEG signals were band-pass filtered at 1-45 Hz and depurated of ocular artefacts by employing Independent Component Analysis (ICA), in such a way that the components due to eye blinks and ocular movements detected by eye inspection were removed from the original signal. The EEG traces related to our datasets of interest were further segmented into one-second trials. Then, a semi-automatic procedure was adopted to reject trials presenting muscular and other kinds of artefacts. Only artefact-free trials were considered for the following analysis. The extra-cerebrally referred EEG signals were transformed by means of the Common Average Reference (CAR), and the Individual Alpha Frequency (IAF) was calculated for each subject in order to define four bands of interest according to the method suggested in the scientific literature [26]. Such bands are reported in the following as IAF+x, where IAF is the Individual Alpha Frequency, in Hertz, and x is an integer displacement in the frequency domain which is employed to define the band. In particular, we defined the following frequency bands: theta (IAF-6, IAF-2), i.e. theta ranges between IAF-6 and IAF-2 Hz, and alpha (IAF-2, IAF+2). The higher frequency ranges of the EEG spectrum were also analyzed, but we do not report any results since their variations were not significant. The spectral EEG scalp activity was calculated by means of the Welch method [38] for each segment of interest. In order to discard the single subject's baseline activity, we contrasted the EEG power spectra computed during the observation of the commercial video clips with the EEG power spectra obtained in different frequency bands during the observation of the documentary by using the z-score transformation [41]. In particular, for each frequency band of interest, the transformation used is described as follows:
Z = (X̄ − μ) / (σ / √N)    (1)
where X denotes the distribution of the PSD values (of cardinality N) elicited during the observation of the commercials and the bar denotes the mean operator, μ denotes the mean value of the PSD activity related to the documentary and σ its standard deviation [41]. By using the z-score transformation, we removed the variance due to the baseline differences in EEG power spectra among the subjects.
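A direct transcription of Eq. (1) in NumPy, assuming one vector of band-power values per condition (commercials vs. documentary baseline) for a given channel and band, might look as follows; variable names are illustrative.

# Hedged sketch of the z-score transformation of Eq. (1).
import numpy as np

def zscore_vs_baseline(commercial_power, baseline_power):
    # commercial_power: PSD values observed during the ads (cardinality N);
    # baseline_power: PSD values observed during the documentary (provides mu and sigma).
    X_bar = np.mean(commercial_power)
    N = len(commercial_power)
    mu = np.mean(baseline_power)
    sigma = np.std(baseline_power, ddof=1)
    return (X_bar - mu) / (sigma / np.sqrt(N))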
To study the EEG frontal activity, we compared the LIKE activity against the DISLIKE one by evaluating the difference of their average spectral values as follows:
Z = Z_LIKE − Z_DISLIKE    (2)
where Z_LIKE is the z-score of the power spectra of the EEG recorded during the observation of commercial videoclips rated pleasant (“liked”) by the analyzed subjects in a particular frequency band of interest, and Z_DISLIKE is the z-score value for the EEG recorded during the observation of commercial videoclips rated unpleasant by the subjects. This spectral index has been mapped onto a real scalp model in the two bands of interest. Moreover, in order to investigate the cerebral frontal asymmetry, for each couple of homologous channels we calculated the following spectral imbalance:
Z_IM = Z_dx − Z_sx    (3)
This index has been employed to calculate the Pearson product-moment correlation coefficient [41] between the pleasantness score and the neural activity, in the theta and alpha bands, for each couple of channels we analyzed. Finally, we adopted Student's t-test to compare the Z_IM index between the LIKE and DISLIKE conditions by evaluating the corresponding indexes.
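The imbalance index of Eq. (3), its correlation with the pleasantness scores, and the LIKE/DISLIKE comparison can be sketched as follows with SciPy; the independent-samples form of the t-test and the variable names are assumptions of this note, since the paper only states that Student's t-test was adopted.

# Hedged sketch: frontal imbalance index, its correlation with pleasantness,
# and a LIKE vs. DISLIKE comparison for one pair of homologous electrodes.
import numpy as np
from scipy.stats import pearsonr, ttest_ind

def imbalance(z_right, z_left):
    # Z_IM = Z_dx - Z_sx (Eq. 3), computed element-wise over trials or ads.
    return np.asarray(z_right) - np.asarray(z_left)

# Hypothetical usage:
# zim = imbalance(z_F4, z_F3)                      # e.g. the F4/F3 electrode pair
# r, p_corr = pearsonr(zim, pleasantness_scores)   # correlation with the ratings
# t, p_ttest = ttest_ind(zim_like, zim_dislike)    # LIKE vs. DISLIKE comparison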
3 Results
3.1 Experiment 1
The EEG signals gathered during the observation of the commercial spots were subjected to the estimation of the cortical power spectral density by using the techniques described in the Methods section. In each subject, the cortical power spectral density was evaluated in the different frequency bands adopted in this study and contrasted with the values of the power spectral density of the EEG during the observation of the documentary through the estimation of the z-score. These cortical distributions of the z-scores obtained during the observation of the commercials were then organized into two different populations: the first one was composed of the cortical z-scores relative to the observation of commercial videos that were remembered during the interview (RMB group), while the second was composed of the cortical distributions of the z-scores relative to the observation of commercial videos that were forgotten (FRG group). A contrast was made between the cortical z-score distributions of these two populations, and the resulting cortical distributions in the four frequency bands highlight the cortical areas in which the estimated power spectra statistically differ between the populations. Fig. 1 presents two cortical maps, in which the brain is viewed from a frontal perspective. The maps are relative to the contrast between the two populations in the theta (upper left) and alpha (upper right) frequency bands. The gray scale on the cortex codes the statistical significance.
Fig. 1 presents an increase of cortical activity in the theta band that is prominent over the left prefrontal and frontal hemisphere for the RMB group. The statistically significant activity in the alpha frequency band for the RMB group is still increased in the left hemisphere, although there are a few zones in the frontocentral and right prefrontal regions where the cortical activity was prominent for the FRG group.
Fig. 1. Two cortical z-score maps, in the two frequency bands employed. The gray scale represents the cortical areas in which increased statistically significant activity occurs.
60%) and by Formigli (“Controcorrente”, Sky TG24). A set of 6 broadcasts places itself within the range of 10 points around 50%. Among these programs, the toughest interviewer is Vespa (“Porta a Porta”, Rai 1), followed by the two interviewers of “Otto e Mezzo” (La7), by those of “Telecamere” and “Tg3 Primo Piano”, both of Rai 3, and then by Mentana (“Matrix”, Canale 5) and Annunziata (“In mezz’ora”, Rai 3). “Conferenza Stampa” (Rai 3) is the least tough television broadcast. Radio and digital broadcasts have the lowest toughness levels. The broadcasts' partiality is shown in Table 2 and refers to two political parties, namely Popolo della Libertà (PdL) and Partito Democratico (PD). “Radio anch’io” (Rai Radio 1), “Telecamere” (Rai 3), “Tg3 Primo Piano” (Rai 3) and “Ballarò” (Rai 3) support PD rather than PdL. Even “Matrix” (Canale 5) and “Controcorrente” (Sky TG24) support PD. Annunziata (“In mezz’ora”, Rai 3) and Santoro (“AnnoZero”, Rai 2) use more aggressive interruptions towards PD rather than towards PdL, but this trend is restrained (around 5%). A paired-sample t test was conducted to assess whether the two parties considered were interrupted in a different way during the Italian political broadcasts. The results (t(11) = −2.49, p
Fig. 2. Hillary Clinton's speech at the Democratic Convention (transcript fragment with gesture annotations: “... as a proud [democrat]”, accompanied by both hands raising, palm up and away from the body, eyebrow flicks and a head shake; “[And a proud supporter of Barack Obama //]”, accompanied by head nods).
Fig. 3. Comparison between spontaneous smile and the expression observable after the word “Obama”
Fig. 4. Silvio Berlusconi's expressions during a talk show, while in the position of listener
A similar phenomenon is recorded in Silvio Berlusconi's expressions during the talk show “Ballarò” (Fig. 4 and Fig. 5). In the segment presented here, Silvio Berlusconi is in the position of listener while Massimo D'Alema – a political
opponent – is speaking. Fig. 4 shows the beginning of Massimo D'Alema's speech (plate a) and the subsequent micro-expressive reaction as Massimo D'Alema leads his monologue towards unemployment rates among the young. Plate a) in Fig. 4 shows the smile with which Silvio Berlusconi presents himself. At first analysis, it is visible that one half of the face (the right one) is dramatically more active than the other. After the topic shift, a sudden change of expression is visible: the corners of the mouth lower, and blinking is more frequent (Fig. 4, plate b). The blinking rate of Silvio Berlusconi is double that observed for Massimo D'Alema. Because blinking is considered to be a signal of unease [4], it can be easily associated with the micro-expressive change and analysed as a cue to leakage, while Berlusconi's smile, shown in Figure 4 plate a), is probably deceptive: if, in fact, one focuses on the corners of the mouth (Fig. 5), it is evident that, while the right corner of the mouth points up, the left corner points down. The expression shown in plate b of Figure 5 is in fact a negative one.
Fig. 5. Silvio Berlusconi's smile. The expression in the left half of the mouth is in contrast with the expression displayed in the right half.
Turning to the expressions of Massimo D'Alema, it is possible to highlight a moment of particular emphasis that is addressed in the next section on voice quality. While evoking his efforts in describing the situation of the younger generation, Massimo D'Alema exhibits a sudden expression of anger, as shown in Figure 6. The fact that this expression of anger is not synchronized with the most prominent part of the speech signal, but rather follows it, is an index of deception (see also Scherer's work [19] for comparable results).
Fig. 6. Anger expression in Massimo D'Alema, starting with the word “lavoro” (utterance: “perché non hanno più neanche la speranza di trovare lavoro!”, “because they haven't even the hope of finding a job!”)
3.2 Voice Quality
The analysis of voice quality can be a reliable cue for the interpretation and analysis of emotional states [13]. We will here focus on the voice signal of Massimo D'Alema's speech in the same video clip presented in the previous section for the analysis of facial expressions. The video clip has been analysed by means of PRAAT, with simple parameters, such as utterance (reporting the speech signal), pauses (both silent and filled ones), pitch, and intensity. Figure 7 shows a screen capture of a PRAAT spectral display of the voice signal of Massimo D'Alema. As already stated, the speaker shifts the focus to employment rates among the young in Italy. D'Alema states that “this [shift] is important, so he [Silvio Berlusconi] understands the reason why [this phenomenon happens] // Because, on the basis of the statistics provided by Silvio Berlusconi, there should be a plebiscite” (the Italian version of the speech is provided in Figure 7).
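The same pitch and intensity tracks can also be extracted programmatically; the sketch below uses the praat-parselmouth Python interface to Praat with default analysis settings (the library, the file name and the way the peaks are located are assumptions of this note, not part of the author's procedure).

# Hedged sketch: pitch and intensity contours of a clip and the times of their maxima.
import numpy as np
import parselmouth

snd = parselmouth.Sound("dalema_segment1.wav")             # hypothetical file name
pitch = snd.to_pitch()                                     # default Praat pitch analysis
intensity = snd.to_intensity()                             # default Praat intensity analysis

f0 = pitch.selected_array["frequency"]                     # Hz; 0 where unvoiced
f0_times = pitch.xs()
voiced = f0 > 0
t_pitch_max = f0_times[voiced][np.argmax(f0[voiced])]      # time of the pitch peak

intensity_db = intensity.values[0]                         # dB contour
t_intensity_max = intensity.xs()[np.argmax(intensity_db)]  # time of the intensity peak

print(f"pitch peak at {t_pitch_max:.2f} s, intensity peak at {t_intensity_max:.2f} s")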
Fig. 7. Massimo D’Alema’s speech, first segment
The utterance proceeds with “but this doesn't happen, so I am providing him with a more convincing interpretative key than the demented one that [we lie] with our televisions” (Figure 8). In this segment, it is interesting to note that the highest peak of intensity is synchronised with the first three syllables of “più convincente” (more convincing), while the highest pitch is recorded on the word “fornisco” (“I provide”). It is also interesting to note that, when reporting the idea of a conspiracy set up by the televisions controlled by the left party, a chant-like pitch is observable (Figure 8).
Fig. 8. Massimo D’Alema’s speech. Second segment
D'Alema then explains that in the South of Italy seventy thousand jobs were lost in 2005, referring to the data provided by ISTAT, the National Institute for Statistics (Figure 9). The interesting phenomenon here is the congruence between the highest pitch and the highest intensity values, which here fall on the syllable “ta” of the number settantamila (seventy thousand).
Fig. 9. Massimo D’Alema’s speech. Third segment.
Afterwards, there is an intensification of emphasis, when D’Alema explains that the young people move from south to north because they do not even have the hope of finding a job at home. Figure 10 shows the most emphatic moment of the speech, followed by the expression of anger reported in Figure 6.
Fig. 10. Massimo D’Alema’s speech. Final segment
It is interesting to note that, while the highest pitch value is here synchronised with “Speranza” (Eng.: “hope”), the highest level of intensity is noted elsewhere. Moreover, as already stated, the most emphatic part of speech precedes the expression of anger recorded in figure 6.
If one compares this with an instance of genuine anger, the difference is evident. For this purpose, we will here examine a display of anger by the comedian Luca Bizzarri at the Sanremo Song Festival. The episode is preceded by a satirical song about Silvio Berlusconi and Gianfranco Fini, which had been presented together with Paolo Kessisoglu. The song in question had caused embarrassment to the board of directors of RAI (the Italian state-owned broadcasting service). On the occasion presented here, Luca and Paolo are introduced to the authorities in the first row. When forced to pay homage to the first row, both comedians adopt a sarcastic and falsely deferential behaviour. Figure 11 shows Luca Bizzarri's sentence, uttered towards one of the directors (“we won't touch your Berlusconi again”): as is visible, when the tone of the utterance is sarcastic and/or playful, the pitch accent is higher than the intensity peak of the sentence. Figures 12 and 13 show the expressions of Luca Bizzarri and the analysis of his voice after he hears the word “bipartisan”, uttered by Gianni Morandi (M in the transcripts). Figure 12 shows the transitions of the expressions in Luca (on the right), when Morandi says “bravo” (frame 1), and when he utters the word “bipartisan” (frame 2). It will hopefully be visible here that Luca's mouth changes from smile to anger before he replies. His voice is reported in Figure 13. As is visible here, in this case both pitch and maximum intensity tend to synchronise, while the angry speech slightly follows the expression shift.
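The pitch/intensity (mis)alignment discussed above can be summarised numerically; the measure below (peak lag plus a frame-wise correlation of the two contours) is an illustrative quantification, not one the author formalises.

# Hedged sketch: quantify pitch/intensity alignment for one utterance.
import numpy as np

def alignment(f0_times, f0, intensity_times, intensity_db):
    # Returns (peak lag in seconds, Pearson r between the contours on voiced frames).
    voiced = f0 > 0
    t_pitch_max = f0_times[voiced][np.argmax(f0[voiced])]
    t_intensity_max = intensity_times[np.argmax(intensity_db)]
    intensity_on_pitch = np.interp(f0_times[voiced], intensity_times, intensity_db)
    r = np.corrcoef(f0[voiced], intensity_on_pitch)[0, 1]
    return t_pitch_max - t_intensity_max, r

# Hypothetical usage with the contours extracted in the sketch of Sect. 3.2:
# lag, r = alignment(f0_times, f0, intensity.xs(), intensity_db)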
Fig. 11. Luca Bizzarri’s sarcastic salute to one of the RAI directors
When compared to Massimo D'Alema's performance (Figure 6), it is evident that the politician's anger was probably simulated, as indicated by the excessive emphasis in the expression, the inverted ordering of expression and speech, and the mismatch between pitch and intensity.
M.: Bravo! // Bipartisan /
L.: Ma che bipartisan! A me non me ne frega un cazzo né di uno né di e<eee>
(Bravo! Bipartisan / What bipartisan! I don’t give a fuck about either one or the other <eee>)
Fig. 12. Luca Bizzarri’s expressions
Fig. 13. Luca Bizzarri’s voice
4 Conclusions

The possibility of tracking deception in everyday interaction is an interesting hypothesis, although it requires a broader understanding of perception, on the one hand, and signal emission, on the other, than is normally brought into
play in current analysis. The study of deception, thus, while revealing intriguing possible applications in the fields of verbal and nonverbal communication, intercultural communication, negotiation, and even human-robot interaction, needs further field study aimed at addressing and unfolding the complexity of face-to-face human communication.

Acknowledgments. Thanks to Anna Esposito, Karl-Erik McCullough, and Catherine Pelachaud for their comments on this work, and to my students for their vivid interest in this topic.
Selection Task with Conditional and Biconditional Sentences: Interpretation and Pattern of Answer

Fabrizio Ferrara1 and Olimpia Matarazzo2

1 Department of Relational Sciences “G. Iacono”, University of Naples “Federico II”, Italy
[email protected]
2 Department of Psychology, Second University of Naples, Italy
[email protected]

Abstract. In this study we tested the hypothesis according to which sentence interpretation affects performance in the selection task, the task most widely used to investigate conditional reasoning. Through a between-subjects design, conditional (if p then q) and biconditional (if and only if p then q) sentences, whose truth-value participants had to establish, were compared. The selection task was administered together with a sentence-interpretation task. The results showed that the responses to the selection task largely depended on the sentence interpretation and that conditional and biconditional sentences were interpreted, at least in part, in an analogous way. The theoretical implications of these results are discussed.

Keywords: selection task, interpretation task, conditional reasoning, biconditional reasoning.
1 Introduction

One of the most widely used experimental paradigms in the study of conditional reasoning – that is, reasoning with sentences of the form “if... then” – is the selection task. It is a rule-testing task, devised by Wason in 1966 [1] in order to investigate the procedure people follow to test a hypothesis. The selection task consists in selecting the states of affairs (p, not-p, q, not-q) necessary to determine the truth-value of a conditional rule “if p then q”. In its original version, participants were presented with four cards: each card had a letter on one side and a number on the other side. The cards were presented so that two of them were visible only from the “letter” side and the other two were visible only from the “number” side (see fig. 1).

Fig. 1. The four cards (showing A, K, 2, 7) used in the original Wason experiment
The relationship between letters and numbers in the four cards was expressed through the following rule: “if there is a vowel on one side then there is an even number on the other side”. Participants had to select those cards they needed to turn over to determine whether the rule was true or false.¹

According to propositional logic, a conditional sentence² “if p then q” is conceived as a material implication between the two simple sentences p (antecedent) and q (consequent). The relationship of material implication means that a conditional statement is false only when the antecedent is true (p) and the consequent is false (not-q), while it is true in the remaining combinations of truth-values for p and q (p/q, not-p/not-q, not-p/q). Therefore, the logically correct answer to the selection task consists in selecting the cards “A” (p) and “7” (not-q), because they are the only ones that may present a letter/number combination able to falsify the rule (that is, a vowel on one side and an odd number on the other side). Selecting “K” (not-p) and “2” (q) is useless because any state of affairs associated with them makes the rule true. The selection of p & not-q makes it possible to establish both the truth and the falsity of the rule in closed-context tasks – where all the states of affairs covered by the rule can be explored – whereas it allows one to establish only the falsity of the rule in open-context tasks – where the rule concerns a set of cases that cannot be fully explored. In this case, the truth of the rule is indemonstrable. Originally, Wason devised the selection task as a closed-context task, but afterwards, in the countless studies based on this experimental paradigm, there has been no clear definition of the context of the task.

¹ Following what is customary in the literature, we will keep using the phrase “truth/falsity of a rule” and the terms “sentence” and “rule” interchangeably, although, strictly speaking, a rule cannot be called true or false because only the sentence describing it may take one of the two truth values.
² The terms “sentence” and “statement” are used interchangeably.

In the first experiments conducted by Wason (summarized in [2]) only 4% of participants gave the correct answer, selecting the p & not-q cards, whereas the most frequent answers were p & q (46%) or p alone (33%). These percentages have remained almost unchanged in studies using selection tasks with features analogous to the original ones, i.e. with instructions requiring participants to establish the truth-value of an arbitrary rule (for a review see [3]). Several hypotheses have been advanced to explain this recurring pattern of answers. Wason [1] assumed that the p & q answer resulted from a confirmation bias: participants tend to confirm the rule rather than refute it. Evans (see [4] for a review) posits that most participants are guided by a matching bias, a heuristic process leading them to select only the states of affairs directly mentioned in the rule (just p & q). In the framework of relevance theory, Sperber, Cara and Girotto [5] argue that people use unconscious inferential processes (called relevance mechanisms), specialized for discourse comprehension, to solve the selection task. These mechanisms lead to the selection of the cards containing the most relevant information, i.e. those producing high cognitive effects (new inferences) with low cognitive effort (processing cost): usually, in selection tasks with abstract content these cards are p and q. A different explanation is advanced by the information gain theory [6], [7]. The theory adopts a probabilistic conception of conditionals, grounded in Ramsey’s test [8], according to which people judge the truth-value of a conditional sentence on the
basis of the conditional probability of q given p, P(q|p). The selection task is seen as an inductive problem of optimal data selection rather than a deductive task. Participants would unconsciously interpret it as an open-context task, and select cards that have the greatest expected information gain in order to decide between two competing hypotheses: (H1) the rule is true and the p cases are always associated to the q ones, (H0) the rule is false and p and q cases are independent. Oaksford & Chater [6], [7], developed a model to calculate the expected information gain associated with the four cards, according to which the information gain of the cards is ordered as p > q > not-q > not-p. Since the participants’ responses conform to model predictions, they should no longer be viewed as biased, but as the most rational ones. A number of authors [9], [10], [11], [12], [13], [14], [15], who maintain the deductive view of conditionals, focused on the role played by the sentence interpretation on the responses people give to the selection task, and underline that the conditional sentence is often interpreted as a biconditional. In propositional logic a biconditional sentence “if and only if p then q” describes the relationship of double implication between two propositions: in this case not only p implies q, as in conditional sentences, but also q implies p. A biconditional is true when its antecedent and consequent are both true or both false (p/q or not-p/not-q) and is false when either the antecedent or the consequent is false (p/not-q or not-p/q). Unlike conditional sentence, the biconditional is logically equivalent to its converse sentence “if and only if q then p” and to its inverse sentence “if and only if not-p then not-q”. The logically correct answer to selection task with a biconditional sentence consists in selecting all the cards: indeed all of them may have a combination of states of affairs that falsifies the sentence. Nevertheless, the experimental instructions of the task, which require to select only the cards necessary to determine the truth value of the rule, pragmatically could discourage the production of this type of answer and favor the more economic selection of p & q cards. In natural language biconditional statements are often expressed with “if... then” sentences, their appropriate interpretation depending on the context. However, in conditional reasoning tasks, and especially in abstract selection tasks, the context is frequently not well defined; therefore, the biconditional interpretation of the conditional statement could be favoured. Moreover, this interpretation could be encouraged by the binary nature of the task’s materials [10]. For instance, in Wason’s original task the rule “if there is a vowel on one side, then there is an even number on the other side” could lead participants to believe that also the inverse rule “if there is a consonant then there is an odd number” holds. Margolis [11], [12] hypothesizes that performance in selection task is affected by wrong interpretations of the task. Participants, indeed, would unconsciously interpret the four cards not as individual cases, but as all-inclusive categories. For instance, the “A” card is not regarded as a single card, but as representative of all the possible “A” cards. So, the number found behind the single “A” card would be the same for all “A” cards. If there is, for example, an even number, it means that all “A” cards have an even number on the other side and that no “A” card has an odd number. 
Consequently, selecting only the p card is sufficient to establish the truth-value of a conditional rule because in this way all the states of affairs covered by the rule are
explored. For the same principle, in case of a biconditional interpretation it is sufficient to select p & q. In virtue of this misinterpretation of the cards, Margolis argues that p and p & q answers should not be considered as mistakes, but as the correct responses in conformity with a conditional and a biconditional interpretation of the rule, respectively. Laming and colleagues [14], [15] posit that most participants misunderstand the experimental instructions given in the selection task in different ways, the most typical being interpreting the conditional sentence as a biconditional and reading “one side/other side” as “top/underneath”. However, the participants’ responses are largely consistent with their understanding of the rule: the “top/underneath” interpretation leads to turn p card over, the biconditional interpretation to turn all the cards over, while the combination of the two misinterpretations leads to turn p & q cards over. So, these responses should be seen to be logical rather than erroneous. According to the Mental Model theory [9], [16], participants select only the cards that are exhaustively represented in their mental model of the rule. The theory assumes that people reason by constructing mental models of the possibilities compatible with the premises of an argument, from which they draw putative conclusions, successively validated by searching for counterexamples. However, usually people do not flesh out exhaustive models of the premises, but only those representing true possibilities. In selection task, when the rule is interpreted as a conditional, participants tend to construct only the model [p] q (the square brackets indicate that the state of affairs is exhaustively represented in the model) and select the p card. Instead, in the case of a biconditional interpretation, they construct the model [p] [q] and select the p & q cards. Only if participants are able to flesh out the exhaustive models of the rule – [p][q], [not-p][q], [not-p] [not-q] in the case of conditional interpretation; [p][q], [not-p][not-q] in the case of biconditional interpretation – they can infer the counter-example of the rule – p/not-q for conditionals, p/not-q and not-p/q for biconditionals – and select the logically correct answers. It is worth to note that also the relevance theorists (Sperber, Cara & Girotto [5]) advance the hypothesis of the biconditional interpretation of the rule, but they posit that in abstract tasks, p & q are viewed as the most relevant cards, regardless of the type of interpretation. Although the “sentence-interpretation” hypothesis has been shared by a number of authors, only few studies have explicitly assessed how people interpret the conditional rules presented in selection tasks. Laming and colleagues [14], [15] gave participants six sets of four cards and asked them to establish the truth-value of a conditional rule for each set by physically turning over the cards needed to do it. Green, Over and Pyne [17] administered a construction task after the selection task, in which participants were asked to imagine, supposing the truth of the rule, which state of affairs was depicted on the hidden side of the four cards. The study found that p & not-q responses are linked to a conditional interpretation. 
More recently, Wagner-Egger [13] showed that the conditional interpretation of the rule is associated with p & not-q and p alone responses, while the biconditional interpretation is linked to the p & q answer; in this study the actual interpretation of the rule was determined using a deductive task that, for each of the four cards, required participants to indicate the states of affairs compatible with the truth of the rule. To our knowledge, no studies have compared
conditional vs. biconditional sentences to investigate whether participants interpret them in the same or different manner and whether the responses they give to the selection task are affected by their sentence interpretation.
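The logic rehearsed in this introduction – only the p and not-q cards can falsify a conditional rule, whereas every card can falsify a biconditional one – can be spelled out mechanically. The short Python sketch below is merely illustrative (it is not part of the study): it enumerates, for each visible card, the possible hidden sides and checks whether any combination makes the rule false under a conditional or a biconditional reading.

```python
# Illustrative check: which cards are worth turning over under a conditional
# ("if p then q") versus a biconditional ("if and only if p then q") rule?

def conditional(p: bool, q: bool) -> bool:
    # material implication: false only for p and not-q
    return (not p) or q

def biconditional(p: bool, q: bool) -> bool:
    # double implication: true only when p and q have the same truth value
    return p == q

# The four visible cards: "antecedent" cards show the p side (letter/flower),
# "consequent" cards show the q side (number/shape); True/False = p/not-p or q/not-q.
cards = {
    "p (vowel / daisy)":            ("antecedent", True),
    "not-p (consonant / tulip)":    ("antecedent", False),
    "q (even number / square)":     ("consequent", True),
    "not-q (odd number / triangle)": ("consequent", False),
}

for rule_name, rule in [("conditional", conditional), ("biconditional", biconditional)]:
    informative = []
    for label, (side, value) in cards.items():
        # Enumerate both possible hidden sides and see whether the rule can be violated.
        if side == "antecedent":
            can_falsify = any(not rule(value, hidden) for hidden in (True, False))
        else:
            can_falsify = any(not rule(hidden, value) for hidden in (True, False))
        if can_falsify:
            informative.append(label)
    print(rule_name, "-> cards that can falsify the rule:", informative)

# Expected: the conditional can be falsified only via the p and not-q cards,
# whereas under the biconditional every one of the four cards can falsify it.
```

This reproduces, in a few lines, the normative analysis used throughout the paper: p & not-q for conditionals, all four cards for biconditionals.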
2 Experiment 1

This experiment aimed at further investigating the sentence-interpretation hypothesis in two ways:

1. by administering an interpretation task jointly with an abstract selection task, in order to ascertain how participants interpreted the sentence and whether the responses to the selection task were affected by their interpretation;
2. by comparing in both tasks a conditional vs. a biconditional sentence, in order to establish whether the sentence interpretation and the pattern of responses to the selection task differed as a function of the type of sentence.
Concerning 1., we must point out that, unlike the interpretation tasks used in other studies [13], [15], where it was required to take as given the truth of the rule, our task held it uncertain: participants were presented with some possible ways in which the open side of each card could be matched with the covered side and, for each pattern, they had to indicate whether it confirmed or falsified the hypothesis to test. In our opinion, this procedure should prevent participants from believing that the hypothesis presented in the selection task was true and that they should look for evidence in support of its truth. The order of the two tasks, selection and interpretation, was balanced across the participants. We expected that this variable would affect the results: in our opinion, the interpretation task, requiring to reason about the combinations of states of affairs able to confirm or falsify the hypothesis, would improve the performance in the selection task. So the number of correct responses should have increased when the interpretation task was administered before the selection task. Two versions of the interpretation task were built: in one the hidden side had only the same states of affairs represented on the visible side of the cards; in the other, the hidden side had also different states of affairs. This last version should avoid a binary interpretation of the states of affairs and therefore prevent a biconditional interpretation of the sentence. As to 2., this is the first study that explicitly compared conditional vs. biconditional rules. We had two reasons for this choice: a) to inspect whether a biconditional sentence elicits a specific - biconditional - pattern of answers; b) to find out whether the overlap between conditional and biconditional interpretation is limited to the “if... then” sentences or affects also the “if and only if... then” sentences. In other words, we wondered if also biconditional statements are misinterpreted, as is the case for conditionals. If so, we should infer that in natural language, the connectives used to introduce conditional or biconditional statements are ambiguous and undefined and that understanding the participants' interpretation of the sentences they are presented with should be a preliminary step to any reasoning task.
2.1 Design

The 2x2x2 research design involved the manipulation of three between-subjects variables: type of sentence (conditional vs. biconditional), order of administration of the tasks (interpretation task–selection task vs. selection task–interpretation task, henceforth: “IS” vs. “SI”), and type of materials used in the interpretation task (cards with same states of affairs vs. cards with different states of affairs, henceforth: same values vs. different values).

2.2 Participants

Two hundred and forty undergraduates of the Universities of Naples participated in the experiment as unpaid volunteers. They had no knowledge of logic or psychology of reasoning and their age ranged between 18 and 35 years (M=21,74; SD=3,59). Participants were randomly assigned to one of the eight experimental conditions (n=30 for each condition).

2.3 Materials and Procedure

The selection task and the interpretation task were presented together in a booklet. Participants were instructed to solve the tasks one by one, in the exact order in which they were presented: they could go to the next page only after completing the current page, and it was forbidden to return to a previous page. The “IS” version of the booklet showed on the first page a presentation of the states of affairs: four cards having the name of a flower on one side and a geometric shape on the other side. The cards were visible only from one side. A hypothesis about the relationship between names of flowers and geometric shapes was formulated: “if there is a daisy on one side then there is a square on the other side” (in the experimental conditions with the conditional sentence) or “if and only if there is a daisy on one side then there is a square on the other side” (in the experimental conditions with the biconditional sentence). The depiction of the four cards was presented (see fig. 2).
Fig. 2. The four cards used in the experiment (showing a daisy, a tulip, and two geometric shapes)
On the second page of the booklet there was the interpretation task. It presented four card patterns, in each of which four possible combinations of both sides of the four cards were depicted. In each pattern the four cards were depicted so that both sides were visible: the hidden side was colored in grey and placed beside the visible side (see figures 3 and 4). For each pattern, participants had to judge whether it confirmed or falsified the hypothesis. In the “same values” version of the task (see fig. 3), the hidden sides had the same states of affairs depicted on the visible sides of the four cards (daisy, tulip, square, triangle); in the “different values” version (see fig. 4), the hidden sides had also different states of affairs (i.e. sunflower, rose, orchid, circle, rectangle, pentagon).
The combinations presented in the four card patterns were the following:

1. p & q; not-p & not-q; q & p; not-q & not-p
2. p & not-q; not-p & not-q; q & p; not-q & not-p
3. p & q; not-p & not-q; q & not-p; not-q & not-p
4. p & q; not-p & not-q; q & p; not-q & p.
The first card pattern confirmed both conditional and biconditional statements, the second and the fourth patterns falsified both statements, the third pattern (see fig. 3 and 4) confirmed the conditional statement and falsified the biconditional one. So, this pattern was able to discriminate whether participants made a conditional or a biconditional interpretation of the hypothesis: if they answered “confirms”, judging the q & not-p combination compatible with the hypothesis, then they interpreted it as a conditional statement; on the contrary, if they answered ”falsifies”, judging the combination incompatible with the hypothesis, they interpreted it as a biconditional one. The order of the four configurations was randomized across the participants. The third and last page of the booklet included the selection task. The same four cards of page 1 were presented again, along with the hypothesis formulated about the relationship between the two sides: “if there is a daisy on one side then there is a square on the other side” (in experimental conditions with conditional rule) or “if and only if there is a daisy on one side then there is a square on the other side” (in experimental conditions with biconditional rule). Participants were asked to indicate which card or cards needed to be turned over in order to determine whether the hypothesis was true or false. The “SI” version of the booklet presented a different order of administration of the two tasks: participants had to solve first the selection task and then the interpretation task. The first page was very similar to that of the “IS” booklet, the only difference being that, after the presentation of the states of affairs and the formulation of the hypothesis, participants were asked to indicate which card or cards needed to be turned over in order to determine whether the hypothesis was true or false. On the second page the interpretation task was presented in the same way as in the “IS” version.
Fig. 3. “Same values” experimental condition – the critical configuration used to discern whether the hypothesis was interpreted as a conditional or a biconditional sentence: the “square” visible side is associated with a flower different from a daisy. (Prompt below the configuration: “Does this configuration confirm or falsify the hypothesis?” – confirms / falsifies)
Fig. 4. “Different values” experimental condition – the critical configuration used to discern whether the hypothesis was interpreted as a conditional or a biconditional sentence: the “square” visible side is associated with a flower different from a daisy. (Prompt below the configuration: “Does this configuration confirm or falsify the hypothesis?” – confirms / falsifies)
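To make the role of the critical configuration concrete, the following hypothetical Python sketch (not part of the experimental materials) classifies each of the four card patterns listed above as confirming or falsifying the hypothesis under a conditional and under a biconditional reading; only the third pattern separates the two readings.

```python
# Each pattern is a list of four (antecedent, consequent) pairs, one per card,
# written as booleans: True = daisy / square, False = other flower / other shape.
patterns = {
    1: [(True, True), (False, False), (True, True), (False, False)],   # p&q, not-p&not-q, q&p, not-q&not-p
    2: [(True, False), (False, False), (True, True), (False, False)],  # contains p & not-q
    3: [(True, True), (False, False), (False, True), (False, False)],  # contains q & not-p (critical pattern)
    4: [(True, True), (False, False), (True, True), (True, False)],    # contains not-q & p
}

def conditional(p, q):
    return (not p) or q

def biconditional(p, q):
    return p == q

for number, cards in patterns.items():
    verdicts = {}
    for name, rule in (("conditional", conditional), ("biconditional", biconditional)):
        ok = all(rule(p, q) for p, q in cards)
        verdicts[name] = "confirms" if ok else "falsifies"
    print(f"pattern {number}: {verdicts}")

# Expected output: pattern 1 confirms both readings, patterns 2 and 4 falsify both,
# and pattern 3 confirms the conditional but falsifies the biconditional.
```

A participant who answers “confirms” to pattern 3 is therefore treating the rule as a conditional, while one who answers “falsifies” is treating it as a biconditional, exactly as assumed in the scoring described above.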
2.4 Results

Sentence-interpretation task. The frequency of answers to the sentence-interpretation task in the eight experimental conditions is reported in table 1. We counted as “conditional interpretation” when participants answered “confirms” to the critical pattern and correctly to the other three patterns, as “biconditional interpretation” when they answered “falsifies” to the critical pattern and correctly to the other three patterns, and as “other interpretation” when participants, aside from their answer to the critical combination, made one or more mistakes in judging the other three patterns (that is, choosing “confirms” for one or both of the combinations that falsified the hypothesis and/or choosing “falsifies” for the combination that confirmed the hypothesis).

Table 1. Frequency of answers to the interpretation task in the eight experimental conditions

                          Conditional sentence                     Biconditional sentence
                          IS                  SI                   IS                  SI
Type of interpretation    Sa. val.  Di. val.  Sa. val.  Di. val.   Sa. val.  Di. val.  Sa. val.  Di. val.   Tot
Conditional                  13        6         9         6          7         6         9         6        62
Biconditional                13       12        17        14         17        11        16        14       114
Other                         4       12         4        10          6        13         5        10        64
Tot                          30       30        30        30         30        30        30        30       240

Sa. val. = same values; Di. val. = different values

Observing the marginal totals of table 1, one can note that the biconditional interpretation is the most frequent: it was given by 47,5% of participants, whereas the conditional one was given by 25,8% of them; other interpretations reached 26,7%. The inspection of table 1 also shows that these percentages are independent of the type of sentence (conditional vs. biconditional) and of the order of task administration (IS vs. SI). In particular, the conditional sentence was interpreted as
conditional by 28,3% of the participants and as biconditional by 46,7%, while the remaining 25% gave other interpretations; the biconditional sentence was interpreted as biconditional by 48,3% of the participants and as conditional by 23,3%, while the remaining 28,3% gave other interpretations. LOGIT analyses, conducted with the interpretation as dependent variable and the sentence, the order of administration and the type of materials (same values vs. different values) as independent variables, corroborated these considerations. The best model was the one in which the interpretation was affected only by the type of materials (G2 = 4,34; d.f. = 12; p = .98). Parameter estimates showed that in the “same values” condition both conditional and biconditional interpretations increased, whereas other interpretations increased in the “different values” condition (all p < .001).

Selection task. The answers retained for the analyses were: p & q, p (usually the most frequent ones), p & not-q, all cards (the logically correct responses according to a conditional and a biconditional interpretation, respectively); all the other types of answers were assembled in the other category. Table 2 presents the frequencies of responses as a function of the eight experimental conditions and of the type of interpretation. Inspecting table 2, it is possible to note that, regardless of the type of sentence, the order of administration and the type of materials, 84,6% of p & not-q answers and 62,7% of p answers are associated with the conditional interpretation of the sentence, while p & q responses (78,7%) and the selection of all cards (96%) are strongly linked to its biconditional interpretation. This observation was supported by LOGIT analyses, performed with the answer as dependent variable, and the sentence (conditional vs. biconditional), the order of administration (IS vs. SI), the type of materials used in the interpretation task (same values vs. different values), and the interpretation (conditional vs. biconditional vs. other) as factors. The best model was the one in which the response was affected only by the interpretation (G2 = 82,766; d.f. = 84; p = .518). Parameter estimates showed that p and p & not-q responses were associated with the conditional interpretation, while p & q and all cards were linked to the biconditional interpretation; other responses increased with other interpretations (all p < .001).
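For readers who want to run this kind of analysis on their own data, a log-linear (logit) model of the response frequencies can be fitted along the following lines. This is only a generic sketch with made-up counts and variable names – it is not the authors’ code, and a real analysis would build the data frame from the full frequency table.

```python
# Generic sketch of a log-linear analysis of a frequency table:
# one row per cell (answer x interpretation) with a 'count' column,
# fitted as a Poisson GLM.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical layout with arbitrary illustrative counts (not the study's data).
data = pd.DataFrame({
    "answer":         ["p&not-q", "p&q", "p", "all", "other"] * 2,
    "interpretation": ["conditional"] * 5 + ["biconditional"] * 5,
    "count":          [3, 1, 12, 0, 2, 1, 15, 2, 6, 3],
})

# A model in which the response distribution depends on the interpretation
# corresponds to including the answer x interpretation interaction.
model = smf.glm(
    "count ~ answer * interpretation",
    data=data,
    family=sm.families.Poisson(),
).fit()

print(model.summary())
# Model comparison (e.g. against the model without the interaction term) can be
# done with likelihood-ratio tests on the deviances, analogous to the G2
# statistics reported above.
```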
Table 2. Frequencies of answers to the selection task as a function of the eight experimental conditions and of the sentence interpretation (C = conditional interpretation; B = biconditional interpretation; O = other interpretation). Totals across all conditions: p & not-q = 13, p & q = 89, p = 59, all cards = 25, others = 54 (N = 240).

3 Experiment 2

In the interpretation task of experiment 1, four card patterns were presented: as to the conditional sentence, two of them confirmed it and the other two falsified it; as regards the biconditional sentence, one pattern confirmed it and the other three falsified it. However, it should be noted that, whereas the conditional statement is falsified only by the p & not-q combination and is confirmed by all other combinations of antecedent and consequent, the biconditional statement is falsified whenever the presence of the antecedent does not correspond to the presence of the consequent and vice versa. Thus, the card patterns presented in the first experiment did not include the one with the fourth combination able to falsify the biconditional, i.e. not-p & q. In this small-scale study, suggested by one of the reviewers of the first version of the work, experiment 1 was replicated by adding this fifth pattern (see fig. 5) to the interpretation task. Since the results of experiment 1 showed that the use of different values in the interpretation task increased other interpretations and decreased the conditional and biconditional ones, in this study the interpretation task was performed only with cards presenting the same values on both sides.

3.1 Design

The 2x2 research design involved the manipulation of two between-subjects variables: type of sentence (conditional vs. biconditional) and order of administration of the tasks (IS vs. SI).

3.2 Participants

Eighty undergraduates of the Universities of Naples participated in the experiment as unpaid volunteers. They had no knowledge of logic or psychology of reasoning and their age ranged between 18 and 30 years (M=22,41; SD=2,85). Participants were randomly assigned to one of the four experimental conditions (n=20 for each condition).

3.3 Materials and Procedure

The materials used in this study were the same as in experiment 1. However, unlike experiment 1, the interpretation task, presented only in the “same values” version, had five card patterns instead of four.
Fig. 5. The fifth card pattern, with the not-p & q combination (prompt: “Does this configuration confirm or falsify the hypothesis?” – confirms / falsifies)
3.4 Results

Since the number of participants in this study was smaller than in the first experiment, we preliminarily checked whether the order of task administration affected the responses, in order to drop this variable in case it had no influence and thus simplify the experimental design. The chi-square test showed no effect of the administration order either on the interpretation task (χ2 = .096; d.f. = 2; p = .95) or on the selection task (χ2 = 4.634; d.f. = 4; p = .33). So, the two orders of administration were aggregated in the subsequent analyses.
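A preliminary check of this kind can be run, for instance, with scipy; the contingency table below uses placeholder counts chosen only for illustration, not the study’s data.

```python
# Illustrative chi-square test of independence between order of administration
# (rows: IS, SI) and type of interpretation (columns: conditional, biconditional, other).
from scipy.stats import chi2_contingency

table = [[12, 20, 8],   # placeholder counts
         [10, 18, 12]]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.3f}, df = {dof}, p = {p:.3f}")
```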
Sentence-interpretation task. The frequency of answers to the sentence-interpretation task is reported in table 3.

Table 3. Frequencies of answers to the sentence-interpretation task

                    Sentence
Interpretation      Conditional   Biconditional   Tot
Conditional              11             8          19
Biconditional            18            20          38
Other                    11            12          23
Tot                      40            40          80
Examining table 3, it is possible to note that the biconditional interpretation is the most frequent, regardless of the type of sentence (conditional vs. biconditional). More specifically, the conditional sentence was interpreted as conditional by 27,5% of the participants and as biconditional by 45%, while the remaining 27,5% gave other interpretations; the biconditional sentence was interpreted as biconditional by 50% of the participants and as conditional by 20%, while the remaining 30% gave other interpretations. The sentence did not affect the interpretation (χ2 = .622; d.f. = 2; p = .73).

Selection task. In table 4 the frequency of responses as a function of the sentence and of the interpretation is presented. Observing table 4, it is possible to note that, regardless of the sentence, 84,2% of p answers were associated with the conditional interpretation of the sentence, while 83,3% of p & q responses were linked to its biconditional interpretation. This consideration was supported by LOGIT analyses, performed with the answer as dependent variable, and the sentence (conditional vs. biconditional) and the interpretation (conditional vs. biconditional vs. other) as factors. The best model was the one in which the response was affected only by the interpretation (G2 = 4,87; d.f. = 12; p = .96). Parameter estimates showed that p and p & not-q responses were associated with a conditional interpretation, while p & q and all-cards responses were linked to a biconditional interpretation; other responses increased with other interpretations (all p < .001).

Table 4. Frequencies of answers to the selection task as a function of the sentence and of the interpretation

              Conditional sentence                  Biconditional sentence
              Interpretation                        Interpretation
Answer        Conditional  Biconditional  Other     Conditional  Biconditional  Other     Tot
p & not-q          2             0          0            0             0          0         2
p & q              1            12          2            0            13          2        30
p                  8             0          2            8             0          1        19
All                0             3          0            0             5          0         8
Others             0             3          7            0             2          9        21
Tot               11            18         11            8            20         12        80
4 Discussion and Conclusions

The results of the interpretation task in both experiments showed that almost half of the participants interpreted both conditional and biconditional sentences as biconditionals, regardless of their linguistic formulation. On the other hand, about 28% of the participants appropriately interpreted the conditional statement, and more than 20% of them interpreted the biconditional statement as a conditional. Whereas the biconditional interpretation of conditionals is widely documented in the reasoning literature (see [13] for a review), to our knowledge the conditional reading of biconditional statements has not been documented yet. These findings suggest that many people assign a similar meaning to “if… then” and “if and only if… then” sentences with abstract content and that, consequently, the linguistic formulation alone is not sufficient to determine the meaning of a (bi)conditional sentence, without reference to its thematic content and context. Contrary to our predictions, the use of cards also presenting, on their hidden side, values different from those shown on the visible side did not prevent or discourage a biconditional interpretation of the sentence – which was our aim – but created a confounding effect that increased other interpretations. However, our findings largely support the “sentence-interpretation” hypothesis: the way the sentence is interpreted directly influences the pattern of answers. Aside from the conditional or biconditional formulation of the sentence presented to participants, p and p & not-q answers are associated with its conditional interpretation, while p & q and the selection of all cards are associated with its biconditional interpretation. The systematic link of the p & q response with the biconditional interpretation of the statement undermines the alternative theoretical perspectives that see this response either as the result of a confirmation [1] or of a matching bias [4], or as the most relevant [5] or the most rational [6], [7] response. We turn now to consider only the correct responses, given the conditional and the biconditional interpretations, respectively. Across the two experiments of this study, the percentage of p & not-q responses, given the conditional interpretation of the sentence, is 13,9%; the percentage of selections of all cards, given the biconditional interpretation, is 20,5%. Since the order of task administration (IS vs. SI), contrary to our hypothesis, did not affect the participants’ responses, one can infer that making the sentence interpretation explicit, through the interpretation task, does not improve performance in the selection task. Besides, the absence of differences between the results of the two experiments shows that presenting (in experiment 2) a further combination (not-p & q) able to falsify the biconditional neither affects the sentence interpretation nor increases the choice of all cards in the selection task. In fact, although our findings are analogous to those of similar studies [e.g. 13, experiment 1], we still have to address the question of why p and p & q responses are the most frequent given a conditional or a biconditional interpretation, respectively. The interpretation task showed that participants giving a correct interpretation (conditional or biconditional) recognized which combinations of states of affairs falsify the sentence and which cards may have these combinations, but they did not use this knowledge to select the logically correct cards in the selection task.
For instance, although participants giving the conditional interpretation understood that the not-q card, associated with p, falsified the rule, most of them tended to choose only the p card in the
selection task. The congruence between interpretation and selection found by Laming and colleagues [14-15] has only partly been replicated in this study, which rather suggests that the cognitive processes involved in the two tasks only partially overlap. Although many hypotheses have already been advanced to explain what might be called “incomplete selection” – p instead of p & not-q; p & q instead of all cards – (see [13] for a review), here we formulate a further hypothesis. We might speculate that in performing the selection task people tend to reason only in the forward direction, i.e. from the antecedent (p) to the consequent (q). In other words, they would consider it sufficient to reason about the p card (if p is associated with q then the hypothesis is true, whereas if p is associated with not-q then the hypothesis is false), and deem the more difficult backward reasoning about the not-q card to be needless, even if they are aware that it is able to falsify the hypothesis. The p & q answer would be the result of the same strategy when the sentence is interpreted as a biconditional; the selection of these two cards could be due to the reading of the biconditional as the conjunction of a conditional with its converse statement. The rarity of the p & not-p selection, the response corresponding to the interpretation of the biconditional as the conjunction of a conditional with its inverse statement, could be due to the well-documented difficulties in reasoning with negations. Further studies will be carried out to test this hypothesis.
References

1. Wason, P.C.: Reasoning. In: Foss, B.M. (ed.) New Horizons in Psychology I. Penguin, Harmondsworth (1966)
2. Evans, J.S.B.T.: Logic and human reasoning: An assessment of the deduction paradigm. Psychological Bulletin 128, 978–996 (2002)
3. Wason, P.C., Johnson-Laird, P.N.: Psychology of reasoning: Structure and content. Penguin, Harmondsworth (1972)
4. Evans, J.S.B.T.: Matching bias in conditional reasoning: Do we understand it after 25 years? Thinking and Reasoning 4, 45–110 (1998)
5. Sperber, D., Cara, F., Girotto, V.: Relevance theory explains the selection task. Cognition 57, 31–95 (1995)
6. Oaksford, M., Chater, N.: A rational analysis of the selection task as optimal data selection. Psychological Review 101, 608–631 (1994)
7. Oaksford, M., Chater, N.: Rational explanation of the selection task. Psychological Review 103, 381–391 (1996)
8. Ramsey, F.P.: General Propositions and Causality. In: Mellor, D.H. (ed.) Philosophical Papers, pp. 145–163. Cambridge University Press, Cambridge (1929/1990)
9. Johnson-Laird, P.N., Byrne, R.M.J.: Conditionals: A theory of meaning, pragmatics and inference. Psychological Review 109, 646–678 (2002)
10. Legrenzi, P.: Relation between language and reasoning about deductive rules. In: Flores D’Arcais, G.B., Levelt, W.J.M. (eds.) Advances in Psycholinguistics. North-Holland, Amsterdam (1970)
11. Margolis, H.: Patterns, thinking and cognition. University of Chicago Press, Chicago (1987)
12. Margolis, H.: Wason’s selection task with reduced array. PSYCOLOQUY 11(005), ftp://ftp.princeton.edu/pub/harnad/Psycoloquy/2000.volume.11/
13. Wagner-Egger, P.: Conditional reasoning and the Wason selection task: Biconditional interpretation instead of reasoning bias. Thinking and Reasoning 13, 484–505 (2007)
14. Gebauer, G., Laming, D.: Rational choice in Wason’s selection task. Psychological Research 60, 284–293 (1997)
15. Osman, M., Laming, D.: Misinterpretation of conditional statements in Wason’s selection task. Psychological Research 65, 121–144 (2001)
16. Johnson-Laird, P.N.: Mental Models: Towards a Cognitive Science of Language, Inference, and Consciousness. Cambridge University Press, Cambridge (1983)
17. Green, D.W., Over, D.E., Pyne, R.A.: Probability and choice in selection task. Thinking and Reasoning 3, 209–235 (1997)
Types of Pride and Their Expression

Isabella Poggi and Francesca D’Errico

Roma Tre University, Department of Education Sciences
{poggi,fderrico}@uniroma3.it

Abstract. The paper analyzes pride, its nature, expression and functions, as a social emotion connected to the areas of image and self-image and to power relations. Three types of pride, dignity, superiority and arrogance, are distinguished, their mental ingredients are singled out, and two experimental studies are presented showing that they are conveyed by different combinations of smile, eyebrow and eyelid positions, and head posture.

Keywords: pride, social emotion, social signal, facial expression.

1 Introduction

In the last decade a new research area has arisen at the interface between computer scientists and social scientists: the area of social signal processing. Whereas previous work on signal processing studied physical quantities in various modalities, from 2007 on Pentland [1, 2] launched the idea of analyzing physical signals that convey socially relevant information, such as activity level during an interaction, or mirroring between participants, and the like. The field of social signal processing is now becoming established as the area of research that analyzes the communicative and informative signals which convey information about social interactions, social relations, social attitudes and social emotions. Among emotions, we can distinguish “individual” from “social” emotions, and within the latter, three types [3]. First, those felt toward someone else: in this sense, happiness and sadness are individual emotions, while admiration, envy, contempt and compassion are social ones – I cannot admire without admiring someone, I cannot envy or feel contempt except toward someone, while I can be happy or sad by myself. Second, some emotions are “social” in that they are very easily transmitted from one person to another, like enthusiasm, panic, or anxiety. A third set are the so-called “self-conscious emotions” [4], like shame, pride and embarrassment, which we feel when our own image or self-image, an important part of our social identity, is at stake. They are triggered by our adequacy or inadequacy with respect to some standards and values, possibly imposed by the social context [5], that we want to live up to, and thus they concern and determine our relationships with others. In social signal processing, as well as in affective computing, a relevant objective is to build systems able to process and recognize signals of social emotions. In this paper we briefly overview some studies on the emotion of pride, trying to distinguish different types of it, and present two studies on the expression of this emotion aimed at recognizing the three types from the nuances of their display.
2 Authentic and Hubristic Pride

The emotion of pride has traditionally been an object of attention in myth, moral philosophy and religious speculation more than in psychology. Within the psychological literature, Darwin [6] and Lewis [4] include it among the “complex”, or “self-conscious”, emotions. Different from the so-called “primary” emotions, like joy or sadness, anger or disgust, the “self-conscious” emotions, like shame, guilt and embarrassment, have a less clear universal and biologically innate expressive pattern than the “primary” ones, and can be felt only by someone who has a concept of self, like a child of more than two years, or some great apes, since they entail the fulfilment and transgression of social norms and values. More recently, Tracy and Robins [7] investigated the nature, function and expression of pride, and distinguished two types of it, authentic and hubristic. Authentic pride, represented in words like accomplished and confident, is positively associated with personality traits of extraversion, agreeableness, conscientiousness, and with genuine self-esteem, whereas hubristic pride, related to words like arrogant and conceited, is related positively to self-aggrandizing narcissism and shame-proneness. Hubristic pride “may contribute to aggression, hostility and interpersonal problems” (p. 148), while authentic pride can favour altruistic action, since the most frequent behavioural responses to the experience of pride are seeking and making contact with others. Seen in terms of attribution theory [24], “authentic pride seems to result from attributions to internal but instable, specific, and controllable causes, such as (...) effort, hard work, and specific accomplishments” [8], whereas hubristic pride is felt when one attributes one’s success to “internal but stable, global, and uncontrollable causes” such as “talents, abilities, and global positive traits” [9].
[8] posit that the emotion of hubristic pride and its expression serve the function of dominance, while authentic pride serves the function of prestige, thus being a way to gain a higher status by demonstrating one’s real skills and social and
cooperative ability. To sum up, for Tracy and Robins [7], “Authentic pride might motivate behaviours geared toward long-term status attainment, whereas hubristic pride provides a ‘short cut’ solution, promoting status that is immediate but fleeting and, in some cases, unwarranted”; it may have “evolved as a ‘cheater’ attempt to convince others of one’s success by showing the same expression when no achievement occurred” (p. 150). The view of pride outlined by Tracy et al. [7, 8, 10], with its two contrasting facets and their functions, looks interesting and insightful. Yet, their distinction between authentic and hubristic pride suffers from the connotation of the very names: authentic sounds only positive, while hubristic sounds negative and, being contrasted with authentic, as typically implying “cheating”. In our view, it is one thing to distinguish types of pride in terms of their very nature, and another to see whether they can be expressed to cheat others (or oneself) about one’s worth. Actually, the two (or more?) facets of pride might all have a positive function, and all might be simulated and used to cheat. But what makes them different is the feeling they entail and the different function they serve in a person’s relationship with others.
3 Superiority, Arrogance and Dignity: Types of Pride and Their Mental Ingredients

In another work, following a model of mind, social actions and emotions in terms of goals and beliefs [7, 8, 11, 13, 16], pride was analyzed in terms of its “mental ingredients”, the beliefs and goals that are represented, whether in a conscious or an unconscious way¹, in a person who is feeling that emotion. In this analysis, some ingredients are common to all possible cases of pride, while others allow one to distinguish three types of pride, which we call “superiority”, “arrogance”, and “dignity” pride. All types of pride share the same core of ingredients:

1. A believes that ((A did p) or (A is p) or (p has occurred))
2. A believes p is positive
3. A believes p is connected to / caused by A
4. A wants to evaluate A as to p
5. A wants to evaluate A as valuable
6. A believes A is valuable (because of p)

These are the necessary conditions for a person to feel proud:

1. an event p has occurred (e.g., A’s party won the elections); or A did an action (she ran faster than others); or A has a property (she is stubborn, she has long dark hair);
2. A evaluates this action, property or event as positive, i.e., as something which fulfils some of her goals;
3. A sees p as caused by herself, or anyway as an important part of her identity. I can be proud of my son because I see what he is or does as something, in any case, stemming from myself; or proud of the good weather of my country because I feel it as my own country. In the prototypical cases of pride A can be proud only of things she attributes to internal controllable causes [10, 11]; but in other cases the action, property or event is simply connected to, not necessarily caused by, A;
4. the positive evaluation refers to something that makes part of the self-image A wants to have: something with respect to which A wants to evaluate herself positively;
5. A wants to evaluate herself positively as a whole;
6. the positive evaluation of p causes a more positive self-evaluation of A as a whole: it has a positive effect on A’s self-image.

¹ The hypothesis of the model adopted is that the ingredients may be unconscious, that is, not meta-represented (you have that belief and that goal, but you do not have a meta-belief about your having that belief), but one cannot say that you are feeling that emotion unless those ingredients are there.
Superiority pride. In cases entailing actions or properties, a possible ingredient is victory: doing or being p makes you win over someone else, and this implies that you are stronger or better than another. Further, if seen not as a single occurrence but as a steady property, this means you are superior to others:

7. A believes A has once been superior to B with respect to p
8. A believes A is always superior to B with respect to p

You have more power than another as to some p in a specific situation (ingredient 7), and you feel in general superior to others with respect to p (8). Sometimes, if a single fact or capacity is very relevant in your overall judgment of how people should be, believing yourself superior to another as to it can make you believe you are superior to others in general:

9. A believes judgment with respect to p is relevant for the overall judgment of people
10. A believes A is in general superior to B

Ingredients 7–10 are in a sense the bulk of “narcissism”: a high consideration of one’s capacities and of oneself as a whole, a very positive self-image. If added to ingredients 1–6, they make up “superiority pride”, which is typically felt when the event p is an action that makes one win in a competition. But one can also feel superior when event p is simply one’s belonging to a category (a social class, a nation, a group of people) that one thinks is superior to others. Superiority of an individual over another is relevant for adaptation because in case of competition it allows more frequent and effective access to resources. But this holds particularly when others are aware of one’s superiority. This leads to the necessity, for one who feels superior – in case he also wants his superiority to give him access to resources – to have others know and acknowledge it. In other words, one who is superior often does not only want to evaluate himself positively, but wants others to evaluate him as superior: he does not only want to have a positive self-image, but also to have a positive image before others:
11. A wants B to evaluate A as to p
12. A believes B believes A is valuable (because of p)
Often one is proud of something not only before himself but also before others. Yet, within the "core" ingredients of pride (1–6), the goal of projecting one's positive image to others is not a necessary condition. In this, pride is symmetrical to shame. One is sincerely ashamed before others only if one is ashamed before oneself [14], that is, only if the value one is evaluated against is part not only of the image one wants to have before others but also of the evaluation one wants to have of oneself (self-image). In conclusion, one who feels genuine "superiority pride" is proud of something that others evaluate positively only if one also evaluates it positively.

Arrogance pride. "Superiority pride" is generally felt when, in a competition between people on the same level, one wins the power comparison and thus becomes superior. But in other cases one is, at the start, on the "down" side of the power comparison: A has less power than B, but does not want to submit to B's superiority. Either he wants to challenge B's power and possibly become superior, or he does not long for superiority but wants his worth to be acknowledged and not to be considered inferior. We call the former "arrogance pride" and the latter "dignity pride". In arrogance the proud one challenges another person or institution that has more power than he does, and possibly power over him. Thus he climbs the pyramid of power: he does not acknowledge the other's power because he claims he has (or has the right to have) more power than the other. Here are the ingredients of "arrogance pride":

13. A wants to have power over B
14. A believes A can have power over B
15. A wants B to believe A can have power over B
A person feeling arrogance pride wants to have power over the other (13), believes he can achieve this (14), and further wants the other to know that he can overcome his power (15). But while "superiority pride" is sometimes not even communicated to others (you may feel superior to such an extent that you do not even bother to make others aware of your superiority), "arrogance pride", encompassing an ingredient of challenge (15), is by definition communicative. The arrogant person communicates: I am not afraid of you, though you claim to have more power than me and even power over me; but since I am superior to you (n. 10), I want to have power over you (n. 13) and want you to know that I have the power to do so (n. 15). Sometimes the challenge, at least apparently, does not come from the less powerful, but from the more powerful in a dyad. This is the case of the so-called "arrogance of power": one who is powerful is arrogant in that he abuses his power. For example, a politician from the government who insults an interviewer from a TV channel of the opposing side, or who blatantly violates general rules while displaying that he is not subject to any other power. Here the powerful one does more than he would be entitled to, according to the principle that rules and laws are for people who have no power, while one who has power can establish the rules himself. So even in this case there is, in a sense, a challenge to a power: the power of law.
Dignity pride. Let us take the other case of unbalanced power: A at a lower level than B. If A does not accept his inferiority, he feels "dignity pride": the pride of human dignity. One who feels this type of pride does not claim to be superior, but claims not to be inferior. He claims his right to be treated as a peer, with the same status, rights and freedom as the other: he wants his worth as a human being to be acknowledged, along with the consequent right to be addressed respectfully and not to be a slave to anybody. One who feels "dignity pride" attributes a higher value to his self-image than to his image before others, and primarily cares about a self-image of both self-sufficiency and self-regulation. Being self-sufficient means you do not depend on others, since you have all the resources necessary to achieve your goals by yourself; and not being dependent, you also do not want anyone to have power over you; you claim your right to autonomy, i.e. self-regulation: your right to be free.

16. A wants A/B to believe A has all the resources A needs
17. A wants A/B to believe A does not depend on B
18. A wants A/B to believe A has no less power than B
19. A wants A/B to believe B has no power over A
20. A wants B to believe A has the dignity of a human being
A wants to be considered by others and by himself as one who has all the resources he needs, i.e. he wants to have an image and self-image of an autonomous person (16) and of one who does not depend on B (17); he wants to be considered as not having less power than B (18) and as not being submitted to B (19); in short, to have his dignity as a human acknowledged (20). The three types of pride differ in the actual vs. ideal power relation aimed at by the proud person with respect to the other. In dignity pride, the proud one has less power than the other but wants to be considered his equal; in superiority pride, A wants (considers it right) to be considered superior, whether or not he actually is; in arrogance pride, A may be equal or inferior to B, but wants to become superior.
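As a purely illustrative operationalization of this summary (the coarse encoding of power relations as -1/0/+1 and the explicit challenge flag are assumptions introduced here, not the authors' model), the distinction between the three types might be sketched as follows.

```python
# Hypothetical sketch: distinguishing the three pride types by the actual vs.
# desired power relation between A and B. The -1/0/+1 encoding and the
# "challenges_other" flag (cf. ingredient 15) are illustrative simplifications.
def pride_type(actual: int, desired: int, challenges_other: bool = False) -> str:
    """
    actual, desired: A's power relative to B (-1 inferior, 0 equal, +1 superior).
    challenges_other: whether A communicates a challenge to B (ingredient 15).
    """
    if actual < 0 and desired == 0:
        return "dignity pride"      # less power, but wants to be treated as a peer
    if desired > 0 and actual <= 0 and challenges_other:
        return "arrogance pride"    # starts from below and challenges B's power
    if desired > 0:
        return "superiority pride"  # wants to be considered superior, whether or not he is
    return "none of the three types"

print(pride_type(actual=-1, desired=0))                          # dignity pride
print(pride_type(actual=-1, desired=+1, challenges_other=True))  # arrogance pride
print(pride_type(actual=+1, desired=+1))                         # superiority pride
```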
4 Different Pride, Different Signals?

As shown by Tracy and Robins [7], the emotion of pride is generally expressed by a small smile, expanded posture, head tilted backward, and arms extended out from the body, possibly with hands on hips. But notwithstanding their attempts, they did not find systematic differences in the expressions of "authentic" vs. "hubristic" pride. In this work we present two studies to test whether the three types of pride (superiority, arrogance and dignity pride) can be distinguished on the basis of subtle differences in their facial expression.

4.1 First Study

We conducted an observational study on the expressions of pride in six Italian political debates (six hours in total). After selecting the fragments in which the politicians expressed their pride through their verbal behaviour, we carried out a qualitative
analysis of the multimodal communication parallel to their words, through an annotation scheme that described the signals in various modalities (pauses, voice pitch, intensity and rhythm, gestures, posture, facial expression, gaze behavior) and attributed meanings to each of them. As argued by Poggi [18], for body behaviours too, if they are considered signals, it is by definition possible to attach meanings to them, and these meanings, just like those of verbal language, can be subject to introspection and can be paraphrased in words.

Hypothesis. Based on this analysis [22], three fragments were selected as prototypical expressions of the three types of pride. In these, dignity pride is characterized by gaze to the interlocutor, no smile, no conspicuous gestures, and a serious frown; superiority pride includes gazing down at the other, possibly with slightly lowered eyelids, either no smile or a smile with a head canting of ironic compassion, and a distant posture; arrogance entails ample gestures, gaze to the target, and a large smile, similar to a laugh of scorn. We then hypothesized that subjects can distinguish the three types of pride from their expression.

Experimental design and procedure. The experiment is a 3x3 within-subjects design, with the independent variables being facial display (Vendola, Scalfari and Brunetta) and type of pride (dignity, superiority and arrogance), and the dependent variable being participants' agreement, measured on a Likert scale, with interpreting the face as a specific type of pride. A forced-choice questionnaire was administered to 58 participants (all females, to avoid the gender issue; age range 18-32 years, mean age 22) with three pictures of speakers in political shows (Nichi Vendola, a former governor of an Italian region; Eugenio Scalfari, the founder of a famous newspaper; and Renato Brunetta, a minister), hypothesized as respectively expressing dignity, superiority and arrogance. Participants were asked to associate each picture with one of three sentences meaning dignity (voglio essere trattato da pari, "I want to be treated as an equal"), superiority (mi sento superiore, "I feel superior") and arrogance (sto lanciando una sfida, "I defy you"), by expressing their agreement on a Likert scale (1-5).

Results. As shown in Table 1, the results confirm the previous qualitative analysis [19]. An ANOVA [F(2, 114) = 14.36, p
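As a hedged illustration of the kind of analysis described in the experimental design above (the ratings below are simulated placeholders, and AnovaRM from statsmodels is only one plausible tool, not necessarily the authors' procedure), a one-way repeated-measures ANOVA on the mean agreement per pride type with 58 participants and three within-subject levels yields error degrees of freedom (3-1)x(58-1) = 114, consistent with the reported F(2, 114).

```python
# Hypothetical re-analysis sketch: one-way repeated-measures ANOVA on agreement
# ratings (one value per participant per pride type). Data are simulated; the
# original ratings are not available in the text.
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(0)
types = ["dignity", "superiority", "arrogance"]

rows = [
    {"participant": p, "pride_type": t,
     # Simulated Likert-like agreement, clipped to the 1-5 range
     "agreement": float(np.clip(rng.normal(3 + 0.5 * i, 1.0), 1, 5))}
    for p in range(58)            # 58 participants, as in the study
    for i, t in enumerate(types)  # 3 within-subject levels
]
df = pd.DataFrame(rows)

# With 58 subjects and 3 levels, the F test has df = (2, 114).
result = AnovaRM(data=df, depvar="agreement", subject="participant",
                 within=["pride_type"]).fit()
print(result)
```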