Karl-Friedrich Kraiss (Ed.)
Advanced Man-Machine Interaction Fundamentals and Implementation
With 280 Figures
Editor
Professor Dr.-Ing. Karl-Friedrich Kraiss RWTH Aachen University Chair of Technical Computer Science Ahornstraße 55 52074 Aachen Germany
[email protected] Library of Congress Control Number: 2005938888
ISBN-10 3-540-30618-8 Springer Berlin Heidelberg New York ISBN-13 978-3-540-30618-4 Springer Berlin Heidelberg New York This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable for prosecution under the German Copyright Law. Springer is a part of Springer Science+Business Media springer.com © Springer-Verlag Berlin Heidelberg 2006 Printed in Germany The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. Typesetting: Digital data supplied by editor Cover Design: design & production GmbH, Heidelberg Production: LE-TEX Jelonek, Schmidt & Vöckler GbR, Leipzig Printed on acid-free paper
To: Eva, Gregor, Lucas, Martin, and Robert.
Preface
The way in which humans interact with machines has changed dramatically during the last fifty years. Driven by developments in software and hardware technologies, new appliances never thought of a few years ago appear on the market. There are apparently no limits to the functionality of personal computers or mobile telecommunication appliances. Intelligent traffic systems promise formerly unknown levels of safety. Cooperation between factory workers and smart automation is under way in industrial production. Telepresence and teleoperation in space, under water, or in micro worlds are made possible. Mobile robots provide services on duty and at home.

As, unfortunately, human evolution does not keep pace with technology, we face problems in dealing with this brave new world. Many individuals have experienced minor or major calamities when using computers for business or in private. The elderly and people with special needs often refrain altogether from capitalizing on modern technology. Statistics indicate that in traffic systems at least two of three accidents are caused by human error. In this situation man-machine interface design is widely recognized as a major challenge. It represents a key technology which promises excellent pay-offs. As a major marketing factor it also secures a competitive edge over competitors.

The purpose of this book is to provide deep insight into novel enabling technologies of man-machine interfaces. This includes multimodal interaction by mimics, gesture, and speech, video-based biometrics, interaction in virtual reality and with robots, and the provision of user assistance. All these are innovative and still dynamically developing research areas.

The concept of this book is based on my lectures on man-machine systems, held regularly for students in electrical engineering and computer science at RWTH Aachen University, and on research performed during the last years at this chair. The compilation of the materials would not have been possible without the contributions of numerous people who worked with me over the years on various topics of man-machine interaction. Several chapters are authored by former or current doctoral students. Some chapters build in part on earlier work of former coworkers, e.g., Suat Akyol, Pablo Alvarado, Ingo Elsen, Kirsti Grobel, Hermann Hienz, Thomas Krüger, Dirk Krumbiegel, Roland Steffan, Peter Walter, and Jochen Wickel. Therefore their work deserves mentioning as well. To provide a well-balanced and complete view of the topic, the volume further contains two chapters by Professor Gerhard Rigoll from Technical University of Munich and by Professor Rüdiger Dillmann from Karlsruhe University. I am very much indebted to both for their contributions.

I also wish to express my gratitude to Eva Hestermann-Beyerle and to Monika Lempe from Springer Verlag: to the former for triggering the writing of this book, and to both for providing helpful support during its preparation.
I am deeply indebted to Jörg Zieren for administering a central repository for the various manuscripts and for making sure that the guidelines were accurately followed. Lars Libuda also deserves sincere thanks for undertaking the tedious work of composing and testing the book's CD. The assistance of both was a fundamental prerequisite for meeting the given deadline. Finally, special thanks go to my wife Cornelia for her unwavering support and, more generally, for sharing life with me.
Aachen, December 2005
Karl-Friedrich Kraiss
Table of Contents
1 Introduction . . . . . . 1
2 Non-Intrusive Acquisition of Human Action . . . . . . 7
2.1 Hand Gesture Commands . . . . . . 7
2.1.1 Image Acquisition and Input Data . . . . . . 8
2.1.1.1 Vocabulary . . . . . . 8
2.1.1.2 Recording Conditions . . . . . . 9
2.1.1.3 Image Representation . . . . . . 10
2.1.1.4 Example . . . . . . 11
2.1.2 Feature Extraction . . . . . . 12
2.1.2.1 Hand Localization . . . . . . 14
2.1.2.2 Region Description . . . . . . 20
2.1.2.3 Geometric Features . . . . . . 22
2.1.2.4 Example . . . . . . 27
2.1.3 Feature Classification . . . . . . 29
2.1.3.1 Classification Concepts . . . . . . 29
2.1.3.2 Classification Algorithms . . . . . . 31
2.1.3.3 Feature Selection . . . . . . 32
2.1.3.4 Rule-based Classification . . . . . . 33
2.1.3.5 Maximum Likelihood Classification . . . . . . 34
2.1.3.6 Classification Using Hidden Markov Models . . . . . . 37
2.1.3.7 Example . . . . . . 47
2.1.4 Static Gesture Recognition Application . . . . . . 48
2.1.5 Dynamic Gesture Recognition Application . . . . . . 50
2.1.6 Troubleshooting . . . . . . 56
2.2 Facial Expression Commands . . . . . . 56
2.2.1 Image Acquisition . . . . . . 59
2.2.2 Image Preprocessing . . . . . . 60
2.2.2.1 Face Localization . . . . . . 61
2.2.2.2 Face Tracking . . . . . . 67
2.2.3 Feature Extraction with Active Appearance Models . . . . . . 68
2.2.3.1 Appearance Model . . . . . . 73
2.2.3.2 AAM Search . . . . . . 75
2.2.4 Feature Classification . . . . . . 78
2.2.4.1 Head Pose Estimation . . . . . . 78
2.2.4.2 Determination of Line of Sight . . . . . . 83
2.2.4.3 Circle Hough Transformation . . . . . . 83
2.2.4.4 Determination of Lip Outline . . . . . . 87
2.2.4.5 Lip modeling . . . . . . 88
2.2.5 Facial Feature Recognition – Eye Localization Application . . . . . . 90
2.2.6 Facial Feature Recognition – Mouth Localization Application . . . . . . 90
References . . . . . . 92
3 Sign Language Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 3.1 Recognition of Isolated Signs in Real-World Scenarios . . . . . . . . . . . . . . . . 99 3.1.1 Image Acquisition and Input Data . . . . . . . . . . . . . . . . . . . . . . . . . 101 3.1.2 Image Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 3.1.2.1 Background Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 3.1.3 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 3.1.3.1 Overlap Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 3.1.3.2 Hand Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 3.1.4 Feature Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 3.1.5 Feature Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 3.1.6 Test and Training Samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 3.2 Sign Recognition Using Nonmanual Features . . . . . . . . . . . . . . . . . . . . . . . . 111 3.2.1 Nonmanual Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 3.2.2 Extraction of Nonmanual Features . . . . . . . . . . . . . . . . . . . . . . . . 113 3.3 Recognition of continuous Sign Language using Subunits . . . . . . . . . . . . . 116 3.3.1 Subunit Models for Signs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 3.3.2 Transcription of Sign Language . . . . . . . . . . . . . . . . . . . . . . . . . . . 118 3.3.2.1 Linguistics-orientated Transcription of Sign Language . 118 3.3.2.2 Visually-orientated Transcription of Sign Language . . . 121 3.3.3 Sequential and Parallel Breakdown of Signs . . . . . . . . . . . . . . . . 122 3.3.4 Modification of HMMs to Parallel Hidden Markov Models . . . . 122 3.3.4.1 Modeling Sign Language by means of PaHMMs . . . . . . 124 3.3.5 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 3.3.5.1 Classification of Single Signs by Means of Subunits and PaHMMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 3.3.5.2 Classification of Continuous Sign Language by Means of Subunits and PaHMMs . . . . . . . . . . . . . . . . . . . . . . . . . 126 3.3.5.3 Stochastic Language Modeling . . . . . . . . . . . . . . . . . . . . 127 3.3.6 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 3.3.6.1 Initial Transcription . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129 3.3.6.2 Estimation of Model Parameters for Subunits . . . . . . . . 131 3.3.6.3 Classification of Single Signs . . . . . . . . . . . . . . . . . . . . . . 132 3.3.7 Concatenation of Subunit Models to Word Models for Signs. . . 133 3.3.8 Enlargement of Vocabulary Size by New Signs . . . . . . . . . . . . . . 133 3.4 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134 3.4.1 Video-based Isolated Sign Recognition . . . . . . . . . . . . . . . . . . . . 135 3.4.2 Subunit Based Recognition of Signs and Continuous Sign Language . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 3.4.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . 137
4 Speech Communication and Multimodal Interfaces . . . . . . . . . . . . . . . . . 141 4.1 Speech Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 4.1.1 Fundamentals of Hidden Markov Model-based Speech Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142 4.1.2 Training of Speech Recognition Systems . . . . . . . . . . . . . . . . . . . 144 4.1.3 Recognition Phase for HMM-based ASR Systems . . . . . . . . . . . 145 4.1.4 Information Theory Interpretation of Automatic Speech Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147 4.1.5 Summary of the Automatic Speech Recognition Procedure . . . . 149 4.1.6 Speech Recognition Technology . . . . . . . . . . . . . . . . . . . . . . . . . . 150 4.1.7 Applications of ASR Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151 4.2 Speech Dialogs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153 4.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153 4.2.2 Initiative Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155 4.2.3 Models of Dialog . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156 4.2.3.1 Finite State Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156 4.2.3.2 Slot Filling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157 4.2.3.3 Stochastic Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158 4.2.3.4 Goal Directed Processing . . . . . . . . . . . . . . . . . . . . . . . . . 159 4.2.3.5 Rational Conversational Agents . . . . . . . . . . . . . . . . . . . . 160 4.2.4 Dialog Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162 4.2.5 Scripting and Tagging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163 4.3 Multimodal Interaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164 4.3.1 In- and Output Channels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165 4.3.2 Basics of Multimodal Interaction . . . . . . . . . . . . . . . . . . . . . . . . . . 166 4.3.2.1 Advantages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167 4.3.2.2 Taxonomy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167 4.3.3 Multimodal Fusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168 4.3.3.1 Integration Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169 4.3.4 Errors in Multimodal Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172 4.3.4.1 Error Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172 4.3.4.2 User Specific Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172 4.3.4.3 System Specific Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . 173 4.3.4.4 Error Avoidance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173 4.3.4.5 Error Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174 4.4 Emotions from Speech and Facial Expressions . . . . . . . . . . . . . . . . . . . . . . . 175 4.4.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175 4.4.1.1 Application Scenarios . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . 175 4.4.1.2 Modalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176 4.4.1.3 Emotion Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176 4.4.1.4 Emotional Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177 4.4.2 Acoustic Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177 4.4.2.1 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178 4.4.2.2 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179 4.4.2.3 Classification Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
4.4.3 Linguistic Information . . . . . . 180
4.4.3.1 N-Grams . . . . . . 181
4.4.3.2 Bag-of-Words . . . . . . 182
4.4.3.3 Phrase Spotting . . . . . . 182
4.4.4 Visual Information . . . . . . 183
4.4.4.1 Prerequisites . . . . . . 184
4.4.4.2 Holistic Approaches . . . . . . 184
4.4.4.3 Analytic Approaches . . . . . . 186
4.4.5 Information Fusion . . . . . . 186
4.4.6 Discussion . . . . . . 187
References . . . . . . 187
5 Person Recognition and Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191 5.1 Face Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193 5.1.1 Challenges in Automatic Face Recognition . . . . . . . . . . . . . . . . . 193 5.1.2 Structure of Face Recognition Systems . . . . . . . . . . . . . . . . . . . . . 196 5.1.3 Categorization of Face Recognition Algorithms . . . . . . . . . . . . . 200 5.1.4 Global Face Recognition using Eigenfaces . . . . . . . . . . . . . . . . . . 201 5.1.5 Local Face Recognition based on Face Components or Template Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206 5.1.6 Face Databases for Development and Evaluation . . . . . . . . . . . . 212 5.1.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214 5.1.7.1 Provided Images and Image Sequences . . . . . . . . . . . . . . 214 5.1.7.2 Face Recognition Using the Eigenface Approach . . . . . 215 5.1.7.3 Face Recognition Using Face Components . . . . . . . . . . . 221 5.2 Full-body Person Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224 5.2.1 State of the Art in Full-body Person Recognition . . . . . . . . . . . . 224 5.2.2 Color Features for Person Recognition . . . . . . . . . . . . . . . . . . . . . 226 5.2.2.1 Color Histograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226 5.2.2.2 Color Structure Descriptor . . . . . . . . . . . . . . . . . . . . . . . . 227 5.2.3 Texture Features for Person Recognition . . . . . . . . . . . . . . . . . . . 228 5.2.3.1 Oriented Gaussian Derivatives . . . . . . . . . . . . . . . . . . . . . 228 5.2.3.2 Homogeneous Texture Descriptor . . . . . . . . . . . . . . . . . . 229 5.2.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231 5.2.4.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231 5.2.4.2 Feature Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231 5.3 Camera-based People Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232 5.3.1 Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234 5.3.1.1 Foreground Segmentation by Background Subtraction . 234 5.3.1.2 Morphological Operations . . . . . . . . . . . . . . . . . . . . . . . . 236 5.3.2 Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237 5.3.2.1 Tracking Following Detection . . . . . . . . . . . . . . . . . . . . . 238 5.3.2.2 Combined Tracking and Detection . . . . . . . . . . . . . . . . . . 243 5.3.3 Occlusion Handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248 5.3.3.1 Occlusion Handling without Separation . . . . . . . . . . . . . 248
5.3.3.2 Separation by Shape . . . . . . 248
5.3.3.3 Separation by Color . . . . . . 249
5.3.3.4 Separation by 3D Information . . . . . . 250
5.3.4 Localization and Tracking on the Ground Plane . . . . . . 250
5.3.5 Example . . . . . . 253
References . . . . . . 259
6 Interacting in Virtual Reality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263 6.1 Visual Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265 6.1.1 Spatial Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265 6.1.2 Immersive Visual Displays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266 6.1.3 Viewer Centered Projection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268 6.2 Acoustic Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272 6.2.1 Spatial Hearing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272 6.2.2 Auditory Displays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273 6.2.3 Wave Field Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274 6.2.4 Binaural Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275 6.2.5 Cross Talk Cancellation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276 6.3 Haptic Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279 6.3.1 Rendering a Wall . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281 6.3.2 Rendering Solid Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283 6.4 Modeling the Behavior of Virtual Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . 286 6.4.1 Particle Dynamics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286 6.4.2 Rigid Body Dynamics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288 6.5 Interacting with Rigid Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292 6.5.1 Guiding Sleeves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294 6.5.2 Sensitive Polygons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296 6.5.3 Virtual Magnetism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296 6.5.4 Snap-In . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297 6.6 Implementing Virtual Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298 6.6.1 Scene Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298 6.6.2 Toolkits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301 6.6.3 Cluster Rendering in Virtual Environments . . . . . . . . . . . . . . . . . 303 6.7 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305 6.7.1 A Simple Scene Graph Example . . . . . . . . . . . . . . . . . . . . . . . . . . 307 6.7.2 More Complex Examples: SolarSystem and Robot . . . . . . 308 6.7.3 An Interactive Environment, Tetris3D . . . . . . . . . . . . . . . . . . . . . . 310 6.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310 6.9 Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
7 Interactive and Cooperative Robot Assistants . . . . . . . . . . . . . . . . . . . . . 315 7.1 Mobile and Humanoid Robots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316 7.2 Interaction with Robot Assistants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321 7.2.1 Classification of Human-Robot Interaction Activities . . . . . . . . . 323 7.2.2 Performative Actions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323 7.2.3 Commanding and Commenting Actions . . . . . . . . . . . . . . . . . . . . 325 7.3 Learning and Teaching Robot Assistants . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325 7.3.1 Classification of Learning through Demonstration . . . . . . . . . . . 326 7.3.2 Skill Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326 7.3.3 Task Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327 7.3.4 Interactive Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328 7.3.5 Programming Robots via Observation: An Overview . . . . . . . . . 329 7.4 Interactive Task Learning from Human Demonstration . . . . . . . . . . . . . . . . 330 7.4.1 Classification of Task Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 331 7.4.2 The PbD Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333 7.4.3 System Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335 7.4.4 Sensors for Learning through Demonstration . . . . . . . . . . . . . . . . 336 7.5 Models and Representation of Manipulation Tasks . . . . . . . . . . . . . . . . . . . 338 7.5.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338 7.5.2 Task Classes of Manipulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338 7.5.3 Hierarchical Representation of Tasks . . . . . . . . . . . . . . . . . . . . . . 340 7.6 Subgoal Extraction from Human Demonstration . . . . . . . . . . . . . . . . . . . . . 342 7.6.1 Signal Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343 7.6.2 Segmentation Step . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343 7.6.3 Grasp Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344 7.6.3.1 Static Grasp Classification . . . . . . . . . . . . . . . . . . . . . . . . 344 7.6.3.2 Dynamic Grasp Classification . . . . . . . . . . . . . . . . . . . . . 346 7.7 Task Mapping and Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348 7.7.1 Event-Driven Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348 7.7.2 Mapping Tasks onto Robot Manipulation Systems . . . . . . . . . . . 350 7.7.3 Human Comments and Advice . . . . . . . . . . . . . . . . . . . . . . . . . . . 354 7.8 Examples of Interactive Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354 7.9 Telepresence and Telerobotics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356 7.9.1 The Concept of Telepresence and Telerobotic Systems . . . . . . . . 356 7.9.2 Components of a Telerobotic System . . . . . . . . . . . . . . . . . . . . . . 359 7.9.3 Telemanipulation for Robot Assistants in Human-Centered Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
360 7.9.4 Exemplary Teleoperation Applications . . . . . . . . . . . . . . . . . . . . . 361 7.9.4.1 Using Teleoperated Robots as Remote Sensor Systems 361 7.9.4.2 Controlling and Instructing Robot Assistants via Teleoperation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364
8 Assisted Man-Machine Interaction . . . . . . 369
8.1 The Concept of User Assistance . . . . . . 370
8.2 Assistance in Manual Control . . . . . . 372
8.2.1 Needs for Assistance in Manual Control . . . . . . 373
8.2.2 Context of Use in Manual Control . . . . . . 374
8.2.3 Support of Manual Control Tasks . . . . . . 379
8.3 Assistance in Man-Machine Dialogs . . . . . . 386
8.3.1 Needs for Assistance in Dialogs . . . . . . 386
8.3.2 Context of Dialogs . . . . . . 387
8.3.3 Support of Dialog Tasks . . . . . . 390
8.4 Summary . . . . . . 394
References . . . . . . 395
A LTI-LIB — A C++ Open Source Computer Vision Library . . . . . . 399
A.1 Installation and Requirements . . . . . . 400
A.2 Overview . . . . . . 402
A.2.1 Duplication and Replication . . . . . . 402
A.2.2 Serialization . . . . . . 403
A.2.3 Encapsulation . . . . . . 403
A.2.4 Examples . . . . . . 403
A.3 Data Structures . . . . . . 404
A.3.1 Points and Pixels . . . . . . 405
A.3.2 Vectors and Matrices . . . . . . 406
A.3.3 Image Types . . . . . . 408
A.4 Functional Objects . . . . . . 409
A.4.1 Functor . . . . . . 410
A.4.2 Classifier . . . . . . 411
A.5 Handling Images . . . . . . 412
A.5.1 Convolution . . . . . . 412
A.5.2 Image IO . . . . . . 413
A.5.3 Color Spaces . . . . . . 414
A.5.4 Drawing and Viewing . . . . . . 416
A.6 Where to Go Next . . . . . . 418
References . . . . . . 421
B IMPRESARIO — A GUI for Rapid Prototyping of Image Processing Systems . . . . . . 423
B.1 Requirements, Installation, and Deinstallation . . . . . . 424
B.2 Basic Components and Operations . . . . . . 424
B.2.1 Document and Processing Graph . . . . . . 426
B.2.2 Macros . . . . . . 426
B.2.3 Processing Mode . . . . . . 429
B.2.4 Viewer . . . . . . 430
B.2.5 Summary and further Reading . . . . . . 431
B.3 Adding Functionality to a Document . . . . . . 432
B.3.1 Using the Macro Browser . . . . . . 432
B.3.2 Rearranging and Deleting Macros . . . . . . 434
B.4 Defining the Data Flow with Links . . . . . . 434
B.4.1 Creating a Link . . . . . . 434
B.4.2 Removing a Link . . . . . . 436
B.5 Configuring Macros . . . . . . 436
B.5.1 Image Sequence . . . . . . 437
B.5.2 Video Capture Device . . . . . . 440
B.5.3 Video Stream . . . . . . 442
B.5.4 Standard Macro Property Window . . . . . . 444
B.6 Advanced Features and Internals of IMPRESARIO . . . . . . 446
B.6.1 Customizing the Graphical User Interface . . . . . . 446
B.6.2 Macros and Dynamic Link Libraries . . . . . . 448
B.6.3 Settings Dialog and Directories . . . . . . 449
B.7 IMPRESARIO Examples in this Book . . . . . . 450
B.8 Extending IMPRESARIO with New Macros . . . . . . 450
B.8.1 Requirements . . . . . . 450
B.8.2 API Installation and further Reading . . . . . . 451
B.8.3 IMPRESARIO Updates . . . . . . 451
References . . . . . . 451
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 453
Contributors
Dipl.-Ing. Markus Ablaßmeier Institute for Human-Machine Communication Technische Universität München, Munich, Germany Email:
[email protected] Dr.-Ing. José Pablo Alvarado Moya Escuela de Electrónica Instituto Tecnológico de Costa Rica Email:
[email protected] Dipl.-Inform. Ingo Assenmacher Center for Computing and Communication, RWTH Aachen University, Aachen, Germany Email:
[email protected] Dr. rer. nat. Britta Bauer Gothaer Versicherungen, Köln, Germany Email:
[email protected] Dr.-Ing. Ulrich Canzler CanControls, Aachen Email:
[email protected] Prof. Dr.-Ing. Rüdiger Dillmann Institut für Technische Informatik (ITEC) Karlsruhe University, Germany Email:
[email protected] Dipl.-Ing. Peter Dörfler Chair of Technical Computer Science, RWTH Aachen University, Aachen, Germany Email: ammi@pdoerfler.com Dipl.-Ing. Holger Fillbrandt Chair of Technical Computer Science, RWTH Aachen University, Aachen, Germany Email: fi
[email protected] Dipl.-Ing. Michael Hähnel Chair of Technical Computer Science, RWTH Aachen University, Aachen, Germany Email:
[email protected] Dipl.-Ing. Lenka Jerábková Center for Computing and Communication, RWTH Aachen University, Aachen, Germany Email:
[email protected] Prof. Dr.-Ing. Karl-Friedrich Kraiss Chair of Technical Computer Science, RWTH Aachen University, Aachen, Germany Email:
[email protected] Dr. rer. nat. Torsten Kuhlen Center for Computing and Communication, RWTH Aachen University, Aachen, Germany Email:
[email protected] Dipl.- Ing. Lars Libuda Chair of Technical Computer Science, RWTH Aachen University, Aachen, Germany Email:
[email protected] Dipl.-Ing. Ronald Müller Institute for Human-Machine Communication Technische Universität München, Munich, Germany Email:
[email protected] Dipl.-Ing. Tony Poitschke Institute for Human-Machine Communication Technische Universität München, Munich, Germany Email:
[email protected] Dipl.-Ing. Stefan Reifinger Institute for Human-Machine Communication Technische Universität München, Munich, Germany Email: reifi
[email protected] Prof. Dr.-Ing. Gerhard Rigoll Institute for Human-Machine Communication Technische Universität München, Munich, Germany Email:
[email protected] Dipl.-Ing. Björn Schuller Institute for Human-Machine Communication Technische Universität München, Munich, Germany Email:
[email protected]
Dipl.-Ing. Jörg Zieren Chair of Technical Computer Science, RWTH Aachen University, Aachen, Germany Email:
[email protected] Dipl.-Inform. Raoul D. Zöllner Institut für Technische Informatik (ITEC) Karlsruhe University, Germany Email:
[email protected]
Chapter 1
Introduction
Karl-Friedrich Kraiss

Living in modern societies relies on smart systems and services that support our collaborative and communicative needs as social beings. Developments in software and hardware technologies, e.g. in microelectronics, mechatronics, speech technology, computer linguistics, computer vision, and artificial intelligence, are continuously driving new applications for work, leisure, and mobility. Man-machine interaction, which is ubiquitous in human life, is driven by technology as well. It is the gateway providing access to functions and services which, due to increasing complexity, threatens to turn into a bottleneck. Since for the user the interface is the product, interface design is a key technology, which accordingly is subject to significant national and international research efforts. Interfaces to smart systems exploit the same advanced technologies as the systems themselves. This textbook therefore introduces researchers, students, and practitioners not only to advanced concepts of interface design, but also gives insight into the related interface technologies.

This introduction first outlines the broad spectrum of applications in which man-machine interaction takes place. Then, following an overview of interface design goals, major approaches to interaction augmentation are identified. Finally, the individual chapters of this textbook are briefly characterized.
Application Areas of Man-Machine Interaction

According to their properties, man-machine systems may be classified into dialog systems and dynamic systems (Fig. 1.1). The former comprise small single-user appliances like mobile phones, personal digital assistants, and personal computers, but also multi-user systems like plant control stations or call centers. All of these require discrete user actions by, e.g., keyboard or mouse in either stationary or mobile mode. Representatives of the latter are land, air, or sea vehicles, robots, and master-slave systems. Following the needs of an aging society, rehabilitation robots, which provide movement assistance and restoration for people with disabilities, also increase steadily in importance. All these systems put the operator in a control loop and require continuous manual control inputs.

As dynamic systems become more complex, functions tend to be automated in part, which takes the operator out of the control loop and assigns supervisory control tasks to him or suggests cooperation with machines by task sharing. This approach has a long tradition in civil and military aviation and in industrial plants. Currently, smart automation embedded in the real world is making rapid progress in road traffic, where a multitude of driver assistance functions are under development.
Many of these require interaction with the driver. In the near future mobile service robots are also expected on the marketplace as companions or home helpers. While these will have built-in intelligence for largely autonomous operation, they will nevertheless need human supervision and occasional intervention. The characterization proposed in Fig. 1.1 very seldom occurs in pure form. Many systems are blended in one way or another, requiring discrete dialog actions, manual control, and supervisory control at the same time. A prototypical example is a commercial airliner, where the cockpit crew performs all these actions in sequence or, at times, in parallel.
Fig. 1.1. Classification of man-machine systems and associated application areas.
Design Goals for Man-Machine Systems

For an interaction between man and machine to run smoothly, the interface between both must be designed appropriately. Unfortunately, quite unlike technical transmission channels, no crisp evaluation measures are at hand to characterize quality of service. Looking from the user's perspective, the general goals to be achieved are usability, safety of use, and freedom from error.

Usability describes to what extent a product can be used effectively, efficiently, and with satisfaction, as defined by DIN EN ISO 9241. In this definition, effectiveness describes the accuracy and completeness with which a user can reach his goals. Efficiency means the effort needed to work effectively, while satisfaction is a measure of user acceptance. More recently it has been realized that user satisfaction also correlates with fun of use. The design of hedonistic interfaces that grant joy of use is already apparent in game consoles and music players, but also attracts the automotive industry, which advertises products with pertinent slogans that promise, e.g., "driving pleasure".

The concept of safety of use is complementary to usability. It describes to what extent a product is understandable, predictable, controllable, and robust. Understandability implies that the workings of system functions are understood by a user. Predictability describes to what extent a user is aware of system limits. Controllability indicates whether system functions can be switched off or overruled at the user's discretion. Robustness, finally, describes how a system reacts to unpredictable behavior of certain risk groups, due to complacency, habituation processes, reflexes in case of malfunctioning, or testing of the limits (e.g. when driving too fast).

Freedom from error is a third objective of interface design, since a product free of errors complies with product liability requirements, so that no warranty-related user claims must be apprehended. Freedom from error requires that an interface satisfies a user's security expectations, which may be based on manuals, instructions, warnings, or advertising. Bugs in this sense are construction errors due to inadequate consideration of all possible modes of use and abuse. Other sources of malfunction are instruction errors resulting from incomplete or baffling manuals or from mistakable marketing information. It is interesting to note here that a product is legally considered defective if it fails to meet the expectations of the least informed user. Since such a requirement is extremely difficult to meet, this legal situation prevents the introduction of more advanced functions with possibly dangerous consequences.

The next section addresses the question of how advanced interaction techniques can contribute to meeting the stated design goals.

Advanced Interaction Concepts

The starting point for the discussion of interface augmentation is the basic interaction scheme represented by the white blocks in Fig. 1.2, where the term machine stands for an appliance, vehicle, robot, simulation, or process. Man and machine are interlinked via two unidirectional channels, one of which transmits machine-generated information to the user, while the other feeds user inputs in the opposite direction. Interaction takes place with a real machine in the real world.
Fig. 1.2. Interaction augmented by multimodality, user assistance, and virtual reality.
There are three major approaches which promise to significantly augment interface quality: multimodal interaction, operation in virtual reality, and user assistance. They have been integrated into the basic MMI scheme of Fig. 1.2 as grey rectangles.

Multimodality is ubiquitous in human communication. We gesture and mimic while talking, even on the phone, when the addressee cannot see it. We nod or shake the head or change head pose to indicate agreement or disagreement. We also signal attentiveness by suitable body language, e.g. by turning towards a dialog partner. In so doing, conversation becomes comfortable, intuitive, and robust. It is exactly this lack of multimodality that makes current interfaces fail so often. It therefore is a goal of advanced interface design to make use of various modalities. Modalities applicable to interfaces are mimics, gesture, body posture, movement, speech, and haptics. They serve either information display or user input, sometimes even both purposes. The combination of modalities for input is sometimes called multimodal fusion, while the combination of different modalities for information display is labeled multimodal fission. For fission, modalities must be generated, e.g. by synthesizing speech or by providing gesturing avatars. This book focuses on fusion, which requires automatic recognition of multimodal human actions.

Speech recognition has been around for almost fifty years and is available for practical use. Further improvements of speech recognizers with respect to context-sensitive continuous speech recognition and robustness against noise are under way. Observation of lip movements during talking contributes to speech recognition under noise conditions. Advanced speech recognizers will also cope with incomplete grammar and spontaneous utterances. Interest in gesture and mimic recognition is more recent; in fact, the first related papers appeared only in the nineties. First efforts to record mimics and gesture in real time in laboratory setups and in movie studios involved intrusive methods with calibrated markers and multiple cameras, which are of no use for the applications at hand. Only recently has video-based recognition achieved an acceptable performance level in out-of-the-laboratory settings. Gesture, mimics, head pose, line of sight, and body posture can now be recognized from video recordings in real time, even under adverse real-world conditions. People may even be optically tracked or located in public places. Emotions derived from a fusion of speech, gestures, and mimics now open the door for as yet scarcely exploited emotional interaction.

The second column on which interface augmentation rests is virtual reality, which provides comfortable means of interactive 3D simulation, hereby enabling various novel approaches to interaction. Virtual reality allows, e.g., operating virtual machines in virtual environments such as flight simulators. It may also be used to generate virtual interfaces. By providing suitable multimodal information channels, realistic telepresence is achieved that permits teleoperation in worlds which are physically not accessible because of distance or scale (e.g. space applications or micro- and nano-worlds). In master-slave systems the operator usually acts on a local real machine, and the actions are repeated by a remote machine. With virtual reality, the local machine can be substituted by simulation. Virtual reality also has a major impact on robot exploration, testing, and training, since robot actions can be taught in and tested within a local system simulation before being transferred and executed in a remote and possibly dangerous and costly setting. Interaction may be further improved by making use of augmented or mixed reality, where information is presented non-intrusively on head-mounted displays as an overlay to virtual or real worlds.

Apart from multimodality and virtual reality, the concept of user assistance is a third ingredient of interpersonal communication lacking so far in man-machine interaction. Envision, e.g., how business executives often call upon a staff of personal assistants to get their work done. Based on familiarity with the context of the given task, i.e. the purpose and constraints of a business transaction, a qualified assistant is expected to provide correct information just in time, hereby reducing the workload of the executive and improving the quality of the decisions made. In man-machine interaction, context of task translates to context of use, which may be characterized by the state of the user, the state of the system, and the situation. In conventional man-machine interaction little of this kind is yet known. Neither user characteristics nor the circumstances of operation are made explicit for support of interaction; the knowledge about context resides with the user alone. However, if context of use were made explicit, assistance could be provided to the user, similar to that offered to the executive by his personal assistant. Hence knowledge of context of use is the key to exploiting interface adaptivity.

As indicated by vertical arrows in Fig. 1.2, assistance can augment interaction in three ways. First, information display may be improved, e.g. by providing information just in time and coded in the most suitable modality. Second, user inputs may be augmented, modified, or even suppressed as appropriate. A third method for providing assistance exists for dynamic systems, where system functions may be automated either permanently or temporarily. An appliance assisted in this way is supposed to appear simpler to the user than it actually is. It will also be easier to learn, more pleasant to handle, and less error-prone.

The Scope of This Book

Most publications in the field of human factors engineering focus on interface design, while leaving the technical implementation to engineers. While this book addresses design aspects as well, it additionally puts special emphasis on advanced interface implementation. A significant part of the book is concerned with the realization of multimodal interaction via gesture, mimics, and speech. The remaining parts deal with the identification and tracking of people, with interaction in virtual reality and with robots, and with the implementation of user assistance.

The second chapter treats image processing algorithms for the non-intrusive acquisition of human action. In the first section, techniques for identifying and classifying hand gesture commands from video are described. These techniques are extended to facial expression commands in the second section.

Sign language recognition, the topic of chapter 3, is the ultimate challenge in multimodal interaction. It builds on the basic techniques presented in chapter 2, but considers specialized requirements for the recognition of isolated and continuous
signs using manual as well as facial expression features. A method for the automatic transcription of signs into subunits is presented which, comparable to phonemes in speech recognition, opens the door to large vocabularies.

The role of speech communication in multimodal interfaces is the subject of chapter 4, which in its first section presents the basics of advanced speech recognition. Speech dialogs are treated next, followed by a comprehensive description of multimodal dialogs. A final section is concerned with the extraction of emotions from speech and gesture.

The fifth chapter deals with video-based person recognition and tracking. A first section is concerned with face recognition using global and local features. Then full-body person recognition is dealt with, based on the color and texture of the clothing a subject wears. Finally, various approaches to camera-based people tracking are presented.

Chapter 6 is devoted to problems arising when interacting with virtual objects in virtual reality. First, technological aspects of implementing visual, acoustic, and haptic virtual worlds are presented. Then procedures for virtual object modeling are outlined. The third section is devoted to various methods to augment interaction with virtual objects. Finally, the chapter closes with a description of software tools applicable to the implementation of virtual environments.

The role of service robots in the future society is the subject of chapter 7. First, a description of available mobile and humanoid robots is given. Then actual challenges arising from interacting with robot assistants are systematically tackled, e.g. learning, teaching, programming by demonstration, commanding procedures, task planning, and exception handling. As far as manipulation tasks are considered, relevant models and representations are presented. Finally, telepresence and telerobotics are treated as special cases of man-robot interaction.

The eighth chapter addresses the concept of assistance in man-machine interaction, which is systematically elaborated for manual control and dialog tasks. In both cases, various methods of providing assistance are identified based on user capabilities and deficiencies. Since context of use is identified as the key element in this undertaking, a variety of algorithms for real-time non-intrusive context identification are introduced and discussed.

For readers with a background in engineering and computer science, chapters 2 and 5 present a collection of programming examples, which are implemented with Impresario. This rapid prototyping tool offers an application programming interface for comfortable access to the LTI-Lib, a large public domain software library for computer vision and classification. A description of the LTI-Lib can be found in Appendix A, a manual of Impresario in Appendix B. For the examples, a selection of head, face, eye, and lip templates, hand gestures, sign language sequences, and people tracking clips is provided. The reader is encouraged to use these materials to start experimentation on his own. Chapter 6, finally, provides the Java3D code of some simple VR applications. All software, programming code, and templates come with the book CD.
Chapter 2
Non-Intrusive Acquisition of Human Action
Jörg Zieren (Sect. 2.1), Ulrich Canzler (Sect. 2.2)

This chapter discusses control by gestures and facial expression using the visual modality. The basic concepts of vision-based pattern recognition are introduced together with several fundamental image processing algorithms. The text puts low-level methods in context with actual applications and enables the reader to create a fully functional implementation. Rather than provide a complete survey of existing techniques, the focus lies on the presentation of selected procedures supported by concrete examples.
2.1 Hand Gesture Commands

The use of gestures as a means to convey information is an important part of human communication. Since hand gestures are a convenient and natural way of interaction, man-machine interfaces can benefit from their use as an input modality. Gesture control allows silent, remote, and contactless interaction without the need to locate knobs or dials. Example applications include publicly accessible devices such as information terminals [1], where problems of hygiene and strong wear of keys are solved, or automotive environments [2], where visual and mental distraction are reduced and mechanical input devices can be dispensed with.
Fig. 2.1. The processing chain found in many pattern recognition systems.
Many pattern recognition systems have a common processing chain that can roughly be divided into three consecutive stages, each of which forwards its output to the next (Fig. 2.1). First, a single frame is acquired from a camera, a file system, a network stream, etc. The feature extraction stage then computes a fixed number of scalar values that each describe an individual feature of the observed scene, such as the coordinates of a target object (e.g. the hand), its size, or its color. The image is
thereby reduced to a numerical representation called a feature vector. This step is often the most complex and computationally expensive in the processing chain. The accuracy and reliability of the feature extraction process can have significant influence on the system's overall performance. Finally, the extracted features of a single frame (static gesture) or a sequence of frames (dynamic gesture) are classified either by a set of rules or on the basis of a training phase that occurs prior to the actual application phase. During training the correct classification of the current features is always known, and the features are stored or learned in a way that later allows features computed in the application phase to be classified as well, even in the presence of noise and minor variations that inevitably occur in the real world. Moving along the processing chain, the data volume being processed continuously decreases (from an image of several hundred kB to a feature vector of several bytes to a single event description) while the level of abstraction increases accordingly (from a set of pixels to a set of features to a semantic interpretation of the observed scene).
The following sections 2.1.1, 2.1.2, and 2.1.3 explain the three processing stages in detail. En route, we will create example applications that recognize various static and dynamic hand gestures. These examples show the practical application of the processing concept and algorithms presented in this chapter. Since the C++ source code is provided on the accompanying CD, they can also serve as a basis for the reader's own implementations.
2.1.1 Image Acquisition and Input Data
Before looking into algorithms, it is important to examine those aspects of the planned gesture recognition system that directly affect the interaction with the user. Three important properties can be defined that greatly influence the subsequent design process:
• The vocabulary, i.e. the set of gestures that are to be recognized. The vocabulary size is usually identical with the number of possible system responses (unless multiple gestures trigger the same response).
• The recording conditions under which the system operates. To simplify system design, restrictions are commonly imposed on image background, lighting, the target's minimum and maximum distance to the camera, etc.
• The system's performance in terms of latency (the processing time between the end of the gesture and the availability of the recognition result) and recognition accuracy (the percentage or probability of correct recognition results).
The first two items are explained in detail below. The third involves performance measurement and code optimization for a concrete implementation and application scenario, which is outside the scope of this chapter. 2.1.1.1 Vocabulary Constellation and characteristics of the vocabulary have a strong influence on all processing stages. If the vocabulary is known before the development of the recog-
nition system starts, both design and implementation are usually optimized accordingly. This may later lead to problems if the vocabulary is to be changed or extended. In case the system is completely operational before the constellation of the vocabulary is decided upon, recognition accuracy can be optimized by simply avoiding gestures that are frequently misclassified. In practice, a mixture of both approaches is often applicable. The vocabulary affects the difficulty of the recognition task in two ways. First, with growing vocabulary size the number of possible misclassifications rises, rendering recognition increasingly difficult. Second, the degree of similarity between the vocabulary’s elements needs to be considered. This cannot be quantified, and even a qualitative estimate based on human perception cannot necessarily be transferred to an automatic system. Thus, for evaluating a system’s performance, vocabulary size and recognition accuracy alone are insufficient. Though frequently omitted, the exact constellation of the vocabulary (or at least the degree of similarity between its elements) needs to be known as well. 2.1.1.2 Recording Conditions The conditions under which the input images are recorded are crucial for the design of any recognition system. It is therefore important to specify them as precisely and completely as possible before starting the actual implementation. However, unless known algorithms and hardware are to be used, it is not always foreseeable how the subsequent processing stages will react to certain properties of the input data. For example, a small change in lighting direction may cause an object’s appearance to change in a way that greatly affects a certain feature (even though this change may not be immediately apparent to the human eye), resulting in poor recognition accuracy. In this case, either the input data specification has to be revised not to allow changes in lighting, or more robust algorithms have to be deployed for feature extraction and/or classification that are not affected by such variations. Problems like this can be avoided by keeping any kind of variance or noise in the input images to a minimum. This leads to what is commonly known as “laboratory recording conditions”, as exemplified by Tab. 2.1. While these conditions can usually be met in the training phase (if the classification stage requires training at all), they render many systems useless for any kind of practical application. On the other hand, they greatly simplify system design and might be the only environment in which certain demanding recognition tasks are viable. In contrast to the above enumeration of laboratory conditions, one or more of the “real world recording conditions” shown in Tab. 2.2 often apply when a system is to be put to practical use. Designing a system that is not significantly affected by any of the above conditions is, at the current state of research, impossible and should thus not be aimed for. General statements regarding the feasibility of certain tasks cannot be made, but from experience one will be able to judge what can be achieved in a specific application scenario.
Table 2.1. List of common "laboratory recording conditions".
Domain: Condition(s)
Image Content: The image contains only the target object, usually in front of a unicolored untextured background. No other objects are visible. The user wears long-sleeved, non-skin colored clothing, so that the hand can be separated from the arm by color.
Lighting: Strong diffuse lighting provides an even illumination of the target with no shadows or reflections.
Setup: The placement of camera, background, and target remains constant. The target's distance to the camera and its position in the image do not change. (If the target is moving, this applies only to the starting point of the motion.)
Camera: The camera hardware and camera parameters (resolution, gain, white balance, brightness etc.) are never changed. They are also optimally adjusted to prevent overexposure or color cast. In case of a moving target, shutter speed and frame rate are sufficiently high to prevent both motion blur and discontinuities. A professional quality camera with a high resolution is used.
Table 2.2. List of common "real world recording conditions".
Domain: Condition(s)
Image Content: The image may contain other objects (distractors) besides, or even in place of, the target object. The background is unknown and may even be moving (e.g. trees, clouds). A target may become partially or completely occluded by distractors or other targets. The user may be wearing short-sleeved or skin-colored clothing that prevents color-based separation of arm and hand.
Lighting: Lighting may not be diffuse, resulting in shadows and uneven illumination within the image region. It may even vary over time.
Setup: The position of the target with respect to the camera is not fixed. The target may be located anywhere in the image, and its size may vary with its distance to the camera.
Camera: Camera hardware and/or parameters may change from take to take, or even during a single take in case of automatic dynamic adaptation. Overexposure and color cast may occur. In case of a moving target, low shutter speeds may cause motion blur, and a low frame rate (or high target velocity) may result in discontinuities between successive frames. A consumer quality camera is used.
2.1.1.3 Image Representation The description and implementation of image processing algorithms requires a suitable mathematical representation of various types of images. A common representation of a rectangular image with a resolution of M rows and N columns is a discrete function
I(x, y)   with   x ∈ {0, 1, . . . , N − 1},  y ∈ {0, 1, . . . , M − 1}   (2.1)

The 2-tuple (x, y) denotes pixel coordinates with the origin (0, 0) in the upper-left corner. The value of I(x, y) describes a property of the corresponding pixel. This property may be the pixel's color, its brightness, the probability of it representing human skin, etc. Color is described as an n-tuple of scalar values, depending on the chosen color model. The most common color model in computer graphics is RGB, which uses a 3-tuple (r, g, b) specifying a color's red, green, and blue components:

I(x, y) = (r(x, y), g(x, y), b(x, y))ᵀ   (2.2)

Other frequently used color models are HSI (hue, saturation, intensity) or YCbCr (luminance and chrominance). For color images, I(x, y) is thus a vector function. For many other properties, such as brightness and probabilities, I(x, y) is a scalar function and can be visualized as a gray value image (also called intensity image). Scalar integers can be used to encode a pixel's classification. For gesture recognition an important classification is the discrimination between foreground (target) and background. This yields a binary valued image, commonly called a mask,

Imask(x, y) ∈ {0, 1}   (2.3)

Binary values are usually visualized as black and white. An image sequence (i.e. a video clip) of T frames can be described using a time index t as shown in (2.4). The amount of real time elapsed between two successive frames is the inverse of the frame rate used when recording the images.

I(x, y, t),   t = 1, 2, . . . , T   (2.4)
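To make the representations (2.1)-(2.4) concrete, the following self-contained C++ sketch stores an RGB image and a binary mask as plain arrays. The container and type names (Image, Rgb, Mask) are chosen for this illustration only and do not correspond to any particular library.

```cpp
#include <cstdint>
#include <vector>

// RGB color tuple as in (2.2).
struct Rgb { std::uint8_t r, g, b; };

// Minimal image container: I(x, y) with x in [0, N), y in [0, M), cf. (2.1).
template <typename Pixel>
class Image {
public:
    Image(int width, int height) : N(width), M(height), data(width * height) {}
    Pixel&       at(int x, int y)       { return data[y * N + x]; }
    const Pixel& at(int x, int y) const { return data[y * N + x]; }
    int width()  const { return N; }
    int height() const { return M; }
private:
    int N, M;
    std::vector<Pixel> data;
};

using ColorImage = Image<Rgb>;    // vector-valued I(x, y), cf. (2.2)
using GrayImage  = Image<float>;  // scalar-valued I(x, y), e.g. probabilities
using Mask       = Image<bool>;   // binary mask Imask(x, y), cf. (2.3)

// An image sequence I(x, y, t), t = 1..T, cf. (2.4), is simply a vector of frames.
using Sequence = std::vector<ColorImage>;
```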
The C++ computer vision library LTI-Lib (described in Annex A) uses the types channel and channel8 for gray value images, image for RGB color images, and rgbPixel for colors.
2.1.1.4 Example
We will now specify the input data to be processed by the two example recognition applications that will be developed in the course of this chapter, as well as recording conditions for static and dynamic gestures.
Static Gestures
The system shall recognize three different gestures – "left", "right", and "stop" – as shown in Fig. 2.2 (a-c). In addition, the event that the hand is not visible in the image or rests in an idle position (such as Fig. 2.2d) and does not perform any of these gestures shall be detected. Since this vocabulary is of course very small and its elements are sufficiently dissimilar (this is proven in Sect. 2.1.3.7), high recognition accuracy can be expected.
(a) “left”
(b) “right”
(c) “stop”
(d) none (idle)
Fig. 2.2. Static hand gestures for “left”, “right”, and “stop” (a-c), as well as example for an idle position (d).
Dynamic Gestures
The dynamic vocabulary comprises six gestures, where three are simply backwards executions of the other three. Tab. 2.3 describes these gestures, and Fig. 2.3 shows several sample executions. Since texture will not be considered, it does not matter whether the palm or the back of the hand faces the camera. Therefore, the camera may be mounted atop a desk facing downwards (e.g. on a tripod or affixed to the computer screen) to avoid background clutter.
Table 2.3. Description of the six dynamic gestures to be recognized.
Gesture: Description
"clockwise": circular clockwise motion of the closed hand, starting and ending at the bottom of an imaginary circle
"counterclockwise": backwards execution of "clockwise"
"open": opening motion of the flat hand by rotation around the arm's axis by 90° (fingers extended and touching)
"close": backwards execution of "open"
"grab": starting with all fingers extended and not touching, a fist is made
"drop": backwards execution of "grab"
Recording Conditions
As will be described in the following sections, skin color serves as the primary image cue for both example systems. Assuming an office or other indoor environment, Tab. 2.4 lists the recording conditions to be complied with for both static and dynamic gestures. The enclosed CD contains several corresponding examples that can also be used as input data in case a webcam is not available.
2.1.2 Feature Extraction
The transition from low-level image data to some higher-level description thereof, represented as a vector of scalar values, is called feature extraction. In this process,
(a) “clockwise” (left to right) and “counterclockwise” (right to left)
(b) “open” (left to right) and “close” (right to left)
(c) “grab” (left to right) and “drop” (right to left) Fig. 2.3. The six dynamic gestures to be recognized. In each of the three image sequences, the second gesture is a backwards execution of the first, i.e. corresponds to reading the image sequence from right to left.
irrelevant information (background) is discarded, while relevant information (foreground or target) is isolated. Most pattern recognition systems perform this step, because processing the complete image is computationally too demanding and introduces an unacceptably high amount of noise (images often contain significantly more background pixels than foreground pixels). Using two-dimensional images of the unmarked hand, it is at present not possible to create a three-dimensional model of a deformable object as complex and flexible as the hand in real-time. Since the hand has 27 degrees of freedom (DOF), this would require an amount of information that cannot be extracted with sufficient accuracy and speed. Model-based features such as the bending of individual fingers are therefore not available in video-based marker-free real-time gesture recognition. Instead, appearance-based features that describe the two-dimensional view of the hand are exploited. Frequently used appearance-based shape features include geometric properties of the shape’s border, such as location, orientation, size etc. Texture features shall not be considered here because resolution and contrast that can be achieved with consumer cameras are usually insufficient for their reliable and accurate computation. This section presents algorithms and data structures to detect the user’s hand in the image, describe its shape, and calculate several geometric features that allow the distinction of different static hand configurations and different dynamic gestures,
Table 2.4. Recording conditions for the example recognition applications.
Domain: Condition(s)
Image Content: Ideally, the only skin colored object in the image is the user's hand. Other skin colored objects may be visible, but they must be small compared to the hand. This also applies to the arm; it should be covered by a long-sleeved shirt if it is visible. In dynamic gestures the hand should have completely entered the image before recording starts, and not leave the image before recording ends, so that it is entirely visible in every recorded frame.
Lighting: Lighting is sufficiently diffuse so that no significant shadows are visible on the hand. Slight shadows, such as in Fig. 2.2 and Fig. 2.3, are acceptable.
Setup: The distance between hand and camera is chosen so that the hand fills approximately 10–25% of the image. The hand's exact position in the image is arbitrary, but no parts of the hand should be cropped. The camera is not rotated, i.e. its x axis is horizontal.
Camera: Resolution may vary, but should be at least 320 × 240. Aspect ratio remains constant. The camera is adjusted so that no overexposure and only minor color cast occurs. (Optimal performance is usually achieved when gain, brightness and shutter speed are set so that the image appears somewhat underexposed to the human observer.) For dynamic gestures, a frame rate of 25 per second is to be used, and shutter speed should be high enough to prevent motion blur. A consumer quality camera is sufficient.
such as those shown in Fig. 2.2 and 2.3. This process, which occurs within the feature extraction module in Fig. 2.1, is visualized in Fig. 2.4.
Fig. 2.4. Visualization of the feature extraction stage. The top row indicates processing steps, the bottom row shows examples for corresponding data.
2.1.2.1 Hand Localization The identification of foreground or target regions constitutes an interpretation of the image based on knowledge which is usually specific to the application scenario. This
knowledge can be encoded explicitly (as a set of rules) or implicitly (in a histogram, a neural network, etc.). Known properties of the target object, such as shape, size, or color, can be exploited. In gesture recognition, color is the most frequently used feature for hand localization since shape and size of the hand's projection in the two-dimensional image plane vary greatly. It is also the only feature explicitly stored in the image.
Using the color attribute to localize an object in the image requires a definition of the object's color or colors. In the RGB color model (and most others), even objects that one would call unicolored usually occupy a range of numerical values. This range can be described statistically using a three-dimensional discrete histogram hobject(r, g, b), with the dimensions corresponding to the red, green, and blue components. hobject is computed from a sufficiently large number of object pixels that are usually marked manually in a set of source images that cover all setups in which the system is intended to be used (e.g. multiple users, varying lighting conditions, etc.). Its value at (r, g, b) indicates the number of pixels with the corresponding color. The total sum of hobject over all colors is therefore equal to the number of considered object pixels nobject, i.e.

Σr Σg Σb hobject(r, g, b) = nobject   (2.5)
Because the encoded knowledge is central to the localization task, the creation of hobject is an important part of the system’s development. It is exemplified below for a single frame. Fig. 2.5 shows the source image of a hand (a) and a corresponding, manually generated binary mask (b) that indicates object pixels (white) and background pixels (black). In (c) the histogram computed from (a), considering only object pixels (those that are white in (b)), is shown. The three-dimensional view uses points to indicate colors that occurred with a certain minimum frequency. The three one-dimensional graphs to the right show the projection onto the red, green, and blue axis. Not surprisingly, red is the dominant skin color component.
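A sketch of how hobject could be accumulated from a source image and its object mask is given below. It reuses the illustrative Image container from the earlier sketch, already stores colors with 5-bit resolution per channel (as discussed under "Implementational Aspects" later in this section), and all names are hypothetical rather than LTI-Lib API.

```cpp
#include <array>
#include <cstdint>
#include <vector>

struct Rgb { std::uint8_t r, g, b; };   // as in the earlier sketch

// 3D color histogram with 32 bins per channel (5 bit per color component).
struct ColorHistogram {
    std::array<double, 32 * 32 * 32> bins{};
    double total = 0.0;

    static std::size_t index(const Rgb& c) {
        // Integer division by 8 downsamples 8-bit values to 5 bit.
        return (c.r / 8) * 32 * 32 + (c.g / 8) * 32 + (c.b / 8);
    }
    void add(const Rgb& c) { bins[index(c)] += 1.0; total += 1.0; }
    // Relative frequency of a color, i.e. P(r, g, b | object) as in (2.6) below.
    double probability(const Rgb& c) const {
        return total > 0.0 ? bins[index(c)] / total : 0.0;
    }
};

// Accumulate hobject from all pixels marked as object (white in Fig. 2.5b).
template <typename ColorImage, typename MaskImage>
ColorHistogram buildObjectHistogram(const ColorImage& img, const MaskImage& mask) {
    ColorHistogram h;
    for (int y = 0; y < img.height(); ++y)
        for (int x = 0; x < img.width(); ++x)
            if (mask.at(x, y))
                h.add(img.at(x, y));
    return h;
}
```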
(a) Source image
(b) Object mask
(c) Object color histogram
Fig. 2.5. From a source image (a) and a corresponding, manually generated object mask (b), an object color histogram (c) is computed.
On the basis of hobject, color-based object detection can now be performed in newly acquired images of the object. The aim is to compute, from a pixel's color, a probability or belief value indicating its likelihood of representing a part of the target object. This value is obtained for every pixel (x, y) and stored in a probability image Iobject(x, y) with the same dimensions as the analyzed image. The following paragraphs derive the required stochastic equations. Given an object pixel, the probability of it having a certain color (r, g, b) can be computed from hobject as

P(r, g, b | object) = hobject(r, g, b) / nobject   (2.6)

By creating a complementary histogram hbg of the background colors from a total number of nbg background pixels, the corresponding probability for a background pixel is obtained in the same way:

P(r, g, b | bg) = hbg(r, g, b) / nbg   (2.7)
Applying Bayes' rule, the probability of any pixel representing a part of the object can be computed from its color (r, g, b) using (2.6) and (2.7):

P(object | r, g, b) = P(r, g, b | object) · P(object) / [P(r, g, b | object) · P(object) + P(r, g, b | bg) · P(bg)]   (2.8)

P(object) and P(bg) denote the a priori object and background probabilities, respectively, with P(object) + P(bg) = 1. Using (2.8), the object probability image Iobj,prob is created from I as

Iobj,prob(x, y) = P(object | I(x, y))   (2.9)

To classify each pixel as either background or target, an object probability threshold Θ is defined. Probabilities equal to or above Θ are considered target, while all others constitute the background. A data structure suitable for representing this classification is a binary mask

Iobj,mask(x, y) = 1 if Iobj,prob(x, y) ≥ Θ (target), 0 otherwise (background)   (2.10)
Iobj,mask constitutes a threshold segmentation of the source image because it partitions I into target and background regions. Rewriting (2.8) as (2.11), it can be seen that varying P(object) and P(bg) is equivalent to a non-linear scaling of P(object | r, g, b) and does not affect Iobj,mask when Θ is scaled accordingly (though it may improve contrast in the visualization of Iobj,prob). In other words, for any prior probabilities P(object), P(bg) ≠ 0, all possible segmentations can be obtained by varying Θ from 0 to 1. It is therefore practical to choose arbitrary values, e.g. P(object) = P(bg) = 0.5. Obviously, the value computed for P(object | r, g, b) will then no longer reflect an actual probability, but a belief value (for simplicity, the term "probability" will still be used in either case). This is usually easier to handle than the computation of exact values for P(object) and P(bg).

P(object | r, g, b) = [1 + (P(r, g, b | bg) · P(bg)) / (P(r, g, b | object) · P(object))]^(−1)   (2.11)
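Combining (2.8)-(2.10), the per-pixel classification can be sketched as follows, reusing the illustrative ColorHistogram structure from the previous sketch and equal priors P(object) = P(bg) = 0.5. None of these names correspond to an existing library interface.

```cpp
// Belief value P(object | r, g, b) according to (2.8)/(2.11); with equal
// priors of 0.5 the prior factors cancel out of the fraction.
double objectProbability(const Rgb& c,
                         const ColorHistogram& hObject,
                         const ColorHistogram& hBackground) {
    const double pObj = hObject.probability(c);      // P(r,g,b | object), (2.6)
    const double pBg  = hBackground.probability(c);  // P(r,g,b | bg),     (2.7)
    const double denom = pObj + pBg;
    return denom > 0.0 ? pObj / denom : 0.0;
}

// Threshold segmentation (2.10): fill a probability image and a binary mask.
template <typename ColorImage, typename ProbImage, typename MaskImage>
void segment(const ColorImage& img, const ColorHistogram& hObj,
             const ColorHistogram& hBg, double theta,
             ProbImage& prob, MaskImage& mask) {
    for (int y = 0; y < img.height(); ++y)
        for (int x = 0; x < img.width(); ++x) {
            const double p = objectProbability(img.at(x, y), hObj, hBg); // (2.9)
            prob.at(x, y) = static_cast<float>(p);
            mask.at(x, y) = (p >= theta);                                // (2.10)
        }
}
```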
A suitable choice of Θ is crucial for an accurate discrimination between background and target. In case the recording conditions remain constant and are known in advance, Θ can be set manually, but for uncontrolled environments an automatic determination is desirable. Two different strategies are presented below. If no knowledge of the target object's shape and/or position is available, the following iterative low-level algorithm presented in [24] produces good results in a variety of conditions. It assumes the histogram of Iobj,prob to be bi-modal but can also be applied if this is not the case.
1. Arbitrarily define a set of background pixels (some usually have a high a priori background probability, e.g. the four corners of the image). All other pixels are defined as foreground. This constitutes an initial classification.
2. Compute the mean values for background and foreground, μobject and μbg, based on the most recent classification. If the mean values are identical to those computed in the previous iteration, halt.
3. Compute a new threshold Θ = ½ (μobject + μbg) and perform another classification of all pixels, then goto step 2.
Fig. 2.6. Automatic computation of the object probability threshold Θ without the use of high-level knowledge (presented in [24]).
In case the target's approximate location and geometry are known, humans can usually identify a suitable threshold just by observation of the corresponding object mask. This approach can be implemented by defining an expected target shape, creating several object masks using different thresholds, and choosing the one which yields the most similar shape. This requires a feature extraction as described in Sect. 2.1.2 and a metric to quantify the deviation of a candidate shape's features from those of the expected target shape.
An example for hand localization can be seen in Fig. 2.7. Using the skin color histogram shown in Fig. 2.5 and a generic background histogram hbg, a skin probability image (b) was computed from a source image (a) of the same hand (and some clutter) as shown in Fig. 2.5, using P(object) = P(bg) = 0.5. For the purpose of visualization, three different thresholds Θ1 < Θ2 < Θ3 were then applied, resulting in three different binary masks (c, d, e). In this example none of the binary masks
is entirely correct: For Θ1 , numerous background regions (such as the bottle cap in the bottom right corner) are classified as foreground, while Θ3 leads to holes in the object, especially at its borders. Θ2 , which was computed automatically using the algorithm described in Fig. 2.6, is a compromise that might be considered optimal for many applications, such as the example systems to be developed in this section.
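The iterative threshold selection of Fig. 2.6 translates almost directly into code. The sketch below operates on the object probability image and, as suggested in step 1 of the algorithm, uses the four image corners as the initial background samples; it is an illustration, not the LTI-Lib optimalThresholding implementation.

```cpp
#include <cmath>

// Iterative threshold selection on an object probability image (Fig. 2.6).
template <typename ProbImage>
double computeThreshold(const ProbImage& prob, double epsilon = 1e-4) {
    const int N = prob.width(), M = prob.height();

    // Step 1: treat the four corners as initial background samples and the
    // overall mean as an initial foreground estimate.
    double muBg = (prob.at(0, 0) + prob.at(N - 1, 0) +
                   prob.at(0, M - 1) + prob.at(N - 1, M - 1)) / 4.0;
    double sum = 0.0;
    for (int y = 0; y < M; ++y)
        for (int x = 0; x < N; ++x) sum += prob.at(x, y);
    double muObj = sum / (static_cast<double>(N) * M);

    double theta = 0.5 * (muObj + muBg);
    // Steps 2 and 3: reclassify and update the class means until they settle.
    for (int iter = 0; iter < 100; ++iter) {
        double sumObj = 0.0, sumBg = 0.0;
        long   cntObj = 0,   cntBg = 0;
        for (int y = 0; y < M; ++y)
            for (int x = 0; x < N; ++x) {
                const double p = prob.at(x, y);
                if (p >= theta) { sumObj += p; ++cntObj; }
                else            { sumBg  += p; ++cntBg;  }
            }
        const double newMuObj = cntObj ? sumObj / cntObj : muObj;
        const double newMuBg  = cntBg  ? sumBg  / cntBg  : muBg;
        const double newTheta = 0.5 * (newMuObj + newMuBg);
        if (std::fabs(newTheta - theta) < epsilon) return newTheta;
        theta = newTheta; muObj = newMuObj; muBg = newMuBg;
    }
    return theta;
}
```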
(a) Source Image
(c) Skin/Background classification using Θ1
(b) Skin Probability Image
(d) Skin/Background classification using Θ2
(e) Skin/Background classification using Θ3
Fig. 2.7. A source image (a) and the corresponding skin probability image (b). An object/background classification was performed for three different thresholds Θ1 < Θ2 < Θ3 (c-e).
While the human perception of an object's color is largely independent of the current illumination (an effect called color constancy), colors recorded by a camera are strongly influenced by illumination and hardware characteristics. This restricts the use of histograms to the recording conditions under which their source data was created, or necessitates the application of color constancy algorithms [7, 9]. (In many digital cameras, color constancy can be achieved by performing a white balance: the camera is pointed at a white object, such as a blank sheet of paper, and a transformation is performed so that this object appears colorless (r = g = b) in the image as well. The transformation parameters are then stored for further exposures under the same illumination conditions.) In [12], a skin and a non-skin color histogram are presented that were created from several thousand images found on the WWW, covering a multitude of skin
hues, lighting conditions, and camera hardware. These histograms thereby implicitly provide user, illumination, and camera independence, at the cost of a comparatively high probability of non-skin objects being classified as skin (false alarm). Fig. 2.8 shows the skin probability image (a) and binary masks (b, c, d) for the source image shown in Fig. 2.7a, computed using the histograms from [12] and three different thresholds Θ4 < Θ5 < Θ6. As described above, P(object) = P(bg) = 0.5 was chosen. Compared to Fig. 2.7b, the coins, the pens, the bottle cap, and the shadow around the text marker have significantly higher skin probability, with the pens even exceeding parts of the fingers and the thumb (d). This is a fundamental problem that cannot be solved by a simple threshold classification because there is no threshold Θ that would achieve a correct result. Unless the histograms can be modified to reduce the number of false alarms, the subsequent processing stages must therefore be designed to handle this problem.
(a) Skin Probability Image
(b) Skin/Background classification using Θ4
(c) Skin/Background classification using Θ5
(d) Skin/Background classification using Θ6
Fig. 2.8. Skin probability image (a) for the source image in Fig. 2.7a and object/background classification for three thresholds Θ4 < Θ5 < Θ6 (b, c, d).
Implementational Aspects
In practical applications the colors r, g, and b commonly have a resolution of 8 bit each, i.e.

r, g, b ∈ {0, 1, . . . , 255}   (2.12)

A corresponding histogram would have to have 256³ = 16,777,216 cells, each storing a floating point number of typically 4 bytes in size, totaling 64 MB. To reduce these memory requirements, the color values are downsampled to e.g. 5 bit through a simple integer division by 2³ = 8:

r′ = ⌊r/8⌋,  g′ = ⌊g/8⌋,  b′ = ⌊b/8⌋   (2.13)

This results in a histogram h′(r′, g′, b′) with 32³ = 32,768 cells that requires only 128 kB of memory. The entailed loss in accuracy is negligible in most
applications, especially when consumer cameras with high noise levels are employed. Usually the downsampling is even advantageous because it constitutes a generalization and therefore reduces the amount of data required to create a representative histogram.
The LTI-Lib class skinProbabilityMap comes with the histograms presented in [12]. The algorithm described in Fig. 2.6 is implemented in the class optimalThresholding.
2.1.2.2 Region Description
On the basis of Iobj,mask the source image I can be partitioned into regions. A region R is a contiguous set of pixels p for which Iobj,mask has the same value. The concept of contiguity requires a definition of pixel adjacency. As depicted in Fig. 2.9, adjacency may be based on either a 4-neighborhood or an 8-neighborhood. In this section the 8-neighborhood will be used. In general, regions may contain other regions and/or holes, but this shall not be considered here because it is of minor importance in most gesture recognition applications.
Fig. 2.9. Adjacent pixels (gray) in the 4-neighborhood (a) and 8-neighborhood (b) of a reference pixel (black).
Under laboratory recording conditions, Iobj,mask will contain exactly one target region. However, in many real world applications, other skin colored objects may be visible as well, so that Iobj,mask typically contains multiple target regions (see e.g. Fig. 2.8). Unless advanced reasoning strategies (such as described in [23, 28]) are used, the feature extraction stage is responsible for identifying the region that represents the user’s hand, possibly among a multitude of candidates. This is done on the basis of the regions’ geometric features. The calculation of these features is performed by first creating an explicit description of each region contained in Iobj,mask. Various algorithms exist that generate, for every region, a list of all of its pixels. A more compact and computationally more efficient representation of a region can be obtained by storing only its border points (which are, in general, significantly fewer). A border point is conveniently defined as a pixel p ∈ R that has at least one pixel q ∈ / R within its 4-neighborhood. An example region (a) and its border points (b) are shown in Fig. 2.10. In a counterclockwise traversal of the object’s border, every border point has a predecessor and a successor within its 8-neighborhood. An efficient data structure for the representation of a region is a sorted list of its border points. This can be
(a) Region pixels
(b) Border points
(c) Polygon
Fig. 2.10. Description of an image region by the set of all of its pixels (a), a list of its border pixels (b), and a closed polygon (c). Light gray pixels in (b) and (c) are not part of the respective representation, but are shown for comparison with (a).
interpreted as a closed polygon whose vertices are the centers of the border points (Fig. 2.10c). In the following, the object's border is defined to be this polygon. This definition has the advantages of being sub-pixel accurate and facilitating efficient computation of various shape-based geometric features (as described in the next section). It should be noted that the polygon border definition leads to the border pixels no longer being considered completely part of the object; compared to other, non-sub-pixel accurate interpretations, this may cause slight differences in object area that are noticeable only for very small objects.
Finding the border points of a region is not as straightforward as identifying its pixels. Fig. 2.11 shows an algorithm that processes Iobj,mask and computes, for every region R, a list of its border points

BR = {(x0, y0), (x1, y1), . . . , (xn−1, yn−1)}   (2.14)
Fig. 2.12 (a, b, c) shows the borders computed by this algorithm from the masks shown in Fig. 2.8 (b, c, d). In areas where the skin probability approaches the threshold Θ, the borders become jagged due to the random nature of the input data, as can be seen especially in (b) and (c). This effect causes an increase in border length that is random as well, which is undesirable because it reduces the information content of the computed border length value (similar shapes may have substantially different border lengths, rendering this feature less useful for recognition). Preceding the segmentation (2.10) by a convolution of Iobj,prob with a Gaussian kernel alleviates this problem by dampening high frequencies in the input data, providing smooth borders as in Fig. 2.12 (d, e, f).
The LTI-Lib uses the classes borderPoints and polygonPoints (among others) to represent image regions. The class objectsFromMask contains an algorithm that is comparable to the one presented in Fig. 2.11, but can also detect holes within regions, and further regions within these holes.
1. Create a helper matrix m with the same dimensions as Iobj,mask and initialize all entries to 0. Define (x, y) as the current coordinates and initialize them to (0, 0). Define (x′, y′) and (x″, y″) as temporary coordinates.
2. Iterate from left to right through all image rows successively, starting at y = 0. If m(x, y) = 0 ∧ Iobj,mask(x, y) = 1, goto step 3. If m(x, y) = 1, ignore this pixel and continue with the next pixel (m(x, y) = 1 indicates touching the border of a region that was already processed). If m(x, y) = 2, increase x until m(x, y) = 2 again, ignoring the whole range of pixels (m(x, y) = 2 indicates crossing the border of a region that was already processed).
3. Create a list B of border points and store (x, y) as the first element.
4. Set (x′, y′) = (x, y).
5. Scan the 8-neighborhood of (x′, y′) using (x″, y″), starting at the pixel that follows the last-but-one pixel stored in B in a counterclockwise orientation, or at (x′ − 1, y′ − 1) if B contains only one pixel. Proceed counterclockwise, skipping coordinates that lie outside of the image, until Iobj,mask(x″, y″) = 1. If (x″, y″) is identical with the first element of B, goto step 6. Else store (x″, y″) in B. Set (x′, y′) = (x″, y″) and goto step 5.
6. Iterate through B, considering, for every element (xi, yi), its predecessor (xi−1, yi−1) and its successor (xi+1, yi+1). The first element is the successor of the last, which is the predecessor of the first. If yi−1 = yi+1 ≠ yi, set m(xi, yi) = 1 to indicate that the border touches the line y = yi at xi. Otherwise, if yi−1 ≠ yi ∨ yi ≠ yi+1, set m(xi, yi) = 2 to indicate that the border intersects with the line y = yi at xi.
7. Add B to the list of computed borders and proceed with step 2.
Fig. 2.11. Algorithm to find the border points of all regions in an image.
2.1.2.3 Geometric Features From the multitude of features that can be computed for a closed polygon, only a subset is suitable for a concrete recognition task. A feature is suitable if it has a high inter-gesture variance (varies significantly between different gestures) and a low intra-gesture variance (varies little between multiple productions of the same gesture). The first property means that the feature carries much information, while the second indicates that it is not significantly affected by noise or unintentional variations that will inevitably occur. Additionally, every feature should be stable, meaning that small changes in input data never result in large changes in the feature. Finally the feature must be computable with sufficient accuracy and speed. A certain feature’s suitability thus depends on the constellation of the vocabulary (because this affects inter- and intra-gesture variance) and on the actual application scenario in terms of recording conditions, hardware etc. It is therefore a common approach to first compute as many features as possible and then examine each one with respect to its suitability for the specific recognition task (see also Sect. 2.1.3.3). The choice of features is of central importance since it affects system design throughout the processing chain. The final system’s performance depends significantly on the chosen features and their properties. This section presents a selection of frequently used features and the equations to calculate them. Additionally, the way each feature is affected by the camera’s perspective, resolution, and distance to the object is discussed, for these are often
(a)
(b)
(c)
(d)
(e)
(f)
Fig. 2.12. a, b, c: Borders of the regions shown in Fig. 2.8 (b, c, d), computed by the algorithm described in Fig. 2.11. d, e, f: Additional smoothing with a Gaussian kernel before segmentation.
important factors in practical applications. In the image domain, changing the camera's perspective results in rotation and/or translation of the object, while resolution and object distance affect the object's scale (in terms of pixels). Since the error introduced in the shape representation by the discretization of the image (the discretization noise) is itself affected by image resolution, rotation, translation, and scaling of the shape, features can, in a strict sense, be invariant of these transformations only in continuous space. All statements regarding transformation invariance therefore refer to continuous space. In discretized images, features declared "invariant" may still show small variance. This can usually be ignored unless the shape's size is on the order of several pixels.
Border Length
The border length l can trivially be computed from B, considering that the distance between two successive border points is 1 if either their x or y coordinates are equal, and √2 otherwise. l depends on scale/resolution, and is translation and rotation invariant.
Area, Center of Gravity, and Second Order Moments
In [25] an efficient algorithm for the computation of arbitrary moments νp,q of polygons is presented. The area a = ν0,0, as well as the normalized moments αp,q
and central moments μp,q up to order 2, including the center of gravity (COG) (xcog, ycog) = (α1,0, α0,1), are obtained as shown in (2.15) to (2.23). xi and yi with i = 0, 1, . . . , n − 1 refer to the elements of BR as defined in (2.14). Since these equations require the polygon to be closed, xn = x0 and yn = y0 is defined for i = n.

ν0,0 = a = (1/2) Σ_{i=1}^{n} (x_{i−1} y_i − x_i y_{i−1})   (2.15)

α1,0 = xcog = (1/(6a)) Σ_{i=1}^{n} (x_{i−1} y_i − x_i y_{i−1}) (x_{i−1} + x_i)   (2.16)

α0,1 = ycog = (1/(6a)) Σ_{i=1}^{n} (x_{i−1} y_i − x_i y_{i−1}) (y_{i−1} + y_i)   (2.17)

α2,0 = (1/(12a)) Σ_{i=1}^{n} (x_{i−1} y_i − x_i y_{i−1}) (x_{i−1}² + x_{i−1} x_i + x_i²)   (2.18)

α1,1 = (1/(24a)) Σ_{i=1}^{n} (x_{i−1} y_i − x_i y_{i−1}) (2 x_{i−1} y_{i−1} + x_{i−1} y_i + x_i y_{i−1} + 2 x_i y_i)   (2.19)

α0,2 = (1/(12a)) Σ_{i=1}^{n} (x_{i−1} y_i − x_i y_{i−1}) (y_{i−1}² + y_{i−1} y_i + y_i²)   (2.20)

μ2,0 = α2,0 − α1,0²   (2.21)

μ1,1 = α1,1 − α1,0 α0,1   (2.22)

μ0,2 = α0,2 − α0,1²   (2.23)
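The polygon formulas (2.15)-(2.23) can be implemented in a single pass over the border points; the sketch below also accumulates the border length l using the 1/√2 rule given above. Point and ShapeMoments are illustrative types, and a non-empty, counterclockwise border is assumed.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Point { double x, y; };

struct ShapeMoments {
    double a = 0, xcog = 0, ycog = 0;      // area and center of gravity
    double mu20 = 0, mu11 = 0, mu02 = 0;   // central moments up to order 2
    double borderLength = 0;               // l
};

// B is the sorted list of border points interpreted as a closed polygon (2.14).
ShapeMoments computeMoments(const std::vector<Point>& B) {
    ShapeMoments m;
    const std::size_t n = B.size();
    double a = 0, a10 = 0, a01 = 0, a20 = 0, a11 = 0, a02 = 0, l = 0;
    for (std::size_t i = 1; i <= n; ++i) {
        const Point& p = B[i - 1];
        const Point& q = B[i % n];                   // closes the polygon
        const double cross = p.x * q.y - q.x * p.y;  // (x_{i-1} y_i - x_i y_{i-1})
        a   += cross;                                                        // (2.15)
        a10 += cross * (p.x + q.x);                                          // (2.16)
        a01 += cross * (p.y + q.y);                                          // (2.17)
        a20 += cross * (p.x * p.x + p.x * q.x + q.x * q.x);                  // (2.18)
        a11 += cross * (2 * p.x * p.y + p.x * q.y + q.x * p.y + 2 * q.x * q.y); // (2.19)
        a02 += cross * (p.y * p.y + p.y * q.y + q.y * q.y);                  // (2.20)
        l   += std::hypot(q.x - p.x, q.y - p.y);   // 1 or sqrt(2) between neighbors
    }
    m.a = 0.5 * a;
    m.xcog = a10 / (6.0 * m.a);
    m.ycog = a01 / (6.0 * m.a);
    const double al20 = a20 / (12.0 * m.a);
    const double al11 = a11 / (24.0 * m.a);
    const double al02 = a02 / (12.0 * m.a);
    m.mu20 = al20 - m.xcog * m.xcog;   // (2.21)
    m.mu11 = al11 - m.xcog * m.ycog;   // (2.22)
    m.mu02 = al02 - m.ycog * m.ycog;   // (2.23)
    m.borderLength = l;
    return m;
}
```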
The area a depends on scale/resolution, and is independent of translation and rotation. The center of gravity (xcog, ycog) is obviously translation variant and depends on resolution. It is rotation invariant only if the rotation occurs around (xcog, ycog) itself, and affected by scaling (i.e. changing the object's distance to the camera) unless its angle to the optical axis remains constant. The second order moments (p + q = 2) are primarily used to compute other shape descriptors, as described below.
Eccentricity
One possible measure for eccentricity e that is based on central moments is given in [10]:

e = ((μ2,0 − μ0,2)² + 4 μ1,1²) / a   (2.24)

(2.24) is zero for circular shapes and increases for elongated shapes. It is rotation and translation invariant, but variant with respect to scale/resolution.
Another intuitive measure is based on the object's inertia along and perpendicular to its main axis, j1 and j2, respectively:

e = j1 / j2   (2.25)

with

j1 = (μ2,0 + μ0,2) / (2a) + √( ((μ2,0 − μ0,2) / (2a))² + (μ1,1 / a)² )   (2.26)

j2 = (μ2,0 + μ0,2) / (2a) − √( ((μ2,0 − μ0,2) / (2a))² + (μ1,1 / a)² )   (2.27)

Here e starts at 1 for circular shapes, increases for elongated shapes, and is invariant of translation, rotation, and scale/resolution.
Orientation
The region's main axis is defined as the axis of the least moment of inertia. Its orientation α is given by

α = (1/2) arctan( 2 μ1,1 / (μ2,0 − μ0,2) )   (2.28)

and corresponds to what one would intuitively call "the object's orientation". Orientation is translation and scale/resolution invariant. α carries most information for elongated shapes and becomes increasingly random for circular shapes. It is 180° periodic, which necessitates special handling in most cases. For example, the angle by which an object with α1 = 10° has to be rotated to align with an object with α2 = 170° is 20° rather than |α1 − α2| = 160°. When using orientation as a feature for classification, a simple workaround to this problem is to ensure 0 ≤ α < 180° and, rather than use α directly as a feature, compute two new features f1(α) = sin α and f2(α) = cos 2α instead, so that f1(0) = f1(180°) and f2(0) = f2(180°).
Compactness
A shape's compactness c ∈ [0, 1] is defined as

c = 4πa / l²   (2.29)

Compact shapes (c → 1) have short borders l that contain a large area a. The most compact shape is a circle (c = 1), while for elongated or frayed shapes, c → 0. Compactness is rotation, translation, and scale/resolution invariant.
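Given the moments from the previous sketch, the derived features (2.24)-(2.29) follow in a few lines. This is again only an illustration; in particular, atan2 is used for (2.28) to avoid an explicit division.

```cpp
#include <cmath>

struct ShapeFeatures { double eccentricity, inertiaRatio, orientation, compactness; };

// m is the ShapeMoments structure from the previous sketch.
ShapeFeatures deriveFeatures(const ShapeMoments& m) {
    const double pi = 3.14159265358979323846;
    ShapeFeatures f;
    // Moment-based eccentricity (2.24).
    f.eccentricity = ((m.mu20 - m.mu02) * (m.mu20 - m.mu02)
                      + 4.0 * m.mu11 * m.mu11) / m.a;
    // Inertia-based eccentricity (2.25)-(2.27).
    const double s = (m.mu20 + m.mu02) / (2.0 * m.a);
    const double d = std::sqrt(std::pow((m.mu20 - m.mu02) / (2.0 * m.a), 2.0)
                               + std::pow(m.mu11 / m.a, 2.0));
    f.inertiaRatio = (s + d) / (s - d);    // j1 / j2
    // Orientation of the main axis (2.28), in radians.
    f.orientation = 0.5 * std::atan2(2.0 * m.mu11, m.mu20 - m.mu02);
    // Compactness (2.29); borderLength is l.
    f.compactness = 4.0 * pi * m.a / (m.borderLength * m.borderLength);
    return f;
}
```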
Border Features
In addition to the border length l, the minimum and maximum pixel coordinates of the object, xmin, xmax, ymin and ymax, as well as the minimum and maximum distance from the center of gravity to the border, rmin and rmax, can be calculated from B. Minimum and maximum coordinates are not invariant to any transformation. rmin and rmax are invariant to translation and rotation, and variant to scale/resolution.
Normalization
Some of the above features can only be used in a real-life application if a suitable normalization is performed to eliminate translation and scale variance. For example, the user might perform gestures at different locations in the image and at different distances from the camera. Which normalization strategy yields the best results depends on the actual application and recording conditions, and is commonly found empirically. Normalization is frequently neglected, even though it is an essential part of feature extraction. The following paragraphs present several suggestions. In the remainder of this section, the result of the normalization of a feature f is designated by f′.
For both static and dynamic gestures, the user's face can serve as a reference for hand size and location if it is visible and a sufficiently reliable face detection is available. In case different resolutions with identical aspect ratio are to be supported, resolution independence can be achieved by specifying lengths and coordinates relative to the image width N (x and y must be normalized by the same value to preserve image geometry) and area relative to N². If a direct normalization is not feasible, invariance can also be achieved by computing a new feature from two or more unnormalized features. For example, from xcog, xmin, and xmax, a resolution and translation invariant feature xp can be computed as

xp = (xmax − xcog) / (xcog − xmin)   (2.30)

xp specifies the ratio of the longest horizontal protrusion, measured from the center of gravity, to the right and to the left. For dynamic gestures, several additional methods can be used to compute a normalized feature f′(t) from an unnormalized feature f(t) without any additional information:
• Subtracting f(1) so that f′(1) = 0:
  f′(t) = f(t) − f(1)   (2.31)
• Subtracting the arithmetic mean f̄ or median fmedian of f:
  f′(t) = f(t) − f̄   (2.32)
  f′(t) = f(t) − fmedian   (2.33)
• Mapping f to a fixed interval, e.g. [0, 1]:
  f′(t) = (f(t) − min f(t)) / (max f(t) − min f(t))   (2.34)
• Dividing f by max |f(t)| so that |f′(t)| ≤ 1:
  f′(t) = f(t) / max |f(t)|   (2.35)
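For a recorded feature trajectory f(1), ..., f(T), the normalizations (2.31), (2.32), (2.34), and (2.35) can be written as small helper functions, as sketched below. The code assumes a non-empty, non-constant trajectory, and the function names are illustrative.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

using Trajectory = std::vector<double>;   // f(t), t = 1..T

// (2.31): subtract the first value so that f'(1) = 0.
Trajectory subtractFirst(Trajectory f) {
    const double f1 = f.front();
    for (double& v : f) v -= f1;
    return f;
}

// (2.32): subtract the arithmetic mean.
Trajectory subtractMean(Trajectory f) {
    double mean = 0.0;
    for (double v : f) mean += v;
    mean /= static_cast<double>(f.size());
    for (double& v : f) v -= mean;
    return f;
}

// (2.34): map to the interval [0, 1] (requires max f(t) > min f(t)).
Trajectory mapToUnitInterval(Trajectory f) {
    const auto [mn, mx] = std::minmax_element(f.begin(), f.end());
    const double lo = *mn, range = *mx - *mn;
    for (double& v : f) v = (v - lo) / range;
    return f;
}

// (2.35): divide by the maximum absolute value so that |f'(t)| <= 1.
Trajectory divideByMaxAbs(Trajectory f) {
    double m = 0.0;
    for (double v : f) m = std::max(m, std::fabs(v));
    for (double& v : f) v /= m;
    return f;
}
```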
Derivatives
In features computed for dynamic gestures, invariance to a constant offset may also be achieved by derivation. Computing the derivative ḟ(t) of a feature f(t) and using it as an additional element in the feature vector to emphasize changes in f(t) can sometimes be a simple yet effective method to improve classification performance.
2.1.2.4 Example
To illustrate the use of the algorithms described above, the static example gestures introduced in Fig. 2.2 were segmented as shown in Fig. 2.13, using the automatic algorithms presented in Fig. 2.6 and 2.11, and features were computed as shown in Tab. 2.5, using the LTI-Lib class geometricFeatures. For resolution independence, lengths and coordinates are normalized by the image width N, and area is normalized by N². α specifies the angle by which the object would have to be rotated to align with the x axis.
(a) “left”
(b) “right”
(c) “stop”
(d) none
Fig. 2.13. Boundaries of the static hand gestures shown in Fig. 2.2. Corresponding features are shown in Tab. 2.5.
For the dynamic example gestures "clockwise", "open", and "grab", the features area a, compactness c, and x coordinate of the center of gravity xcog were computed. Since a depends on the hand's anatomy and its distance to the camera, it was normalized by its maximum (see (2.35)) to eliminate this dependency. xcog was divided by N to eliminate resolution dependency, and additionally normalized according to (2.32) so that the mean of the resulting x′cog is zero. This allows the gestures to be performed anywhere in the image. The resulting plots for normalized area a′, compactness c (which does not require normalization), and normalized x coordinate of the center of gravity x′cog are visualized in Fig. 2.14. The plots are scaled equally for each feature to allow a direct comparison. Since "counterclockwise", "close", and "drop" are backwards executions of the above gestures, their feature plots are simply temporally mirrored versions of these plots.
Table 2.5. Features computed from the boundaries shown in Fig. 2.13.

Feature                          | Symbol | "left" (2.13a) | "right" (2.13b) | "stop" (2.13c) | none (2.13d)
Normalized Border Length         | l      | 1.958          | 1.705           | 3.306          | 1.405
Normalized Area                  | a      | 0.138          | 0.115           | 0.156          | 0.114
Normalized Center of Gravity     | xcog   | 0.491          | 0.560           | 0.544          | 0.479
                                 | ycog   | 0.486          | 0.527           | 0.487          | 0.537
Eccentricity                     | e      | 1.758          | 1.434           | 1.722          | 2.908
Orientation                      | α      | 57.4°          | 147.4°          | 61.7°          | 58.7°
Compactness                      | c      | 0.451          | 0.498           | 0.180          | 0.724
Normalized Min./Max. Coordinates | xmin   | 0.128          | 0.359           | 0.241          | 0.284
                                 | xmax   | 0.691          | 0.894           | 0.881          | 0.681
                                 | ymin   | 0.256          | 0.341           | 0.153          | 0.325
                                 | ymax   | 0.747          | 0.747           | 0.747          | 0.747
Protrusion Ratio                 | xp     | 0.550          | 1.664           | 1.110          | 1.036
Fig. 2.14. Normalized area a′, compactness c, and normalized x coordinate of the center of gravity x′cog plotted over time t = 1, 2, . . . , 60 for the dynamic gestures "clockwise" (left column), "open" (middle column), and "grab" (right column) from the example vocabulary shown in Fig. 2.3.
2.1.3 Feature Classification
The task of feature classification occurs in all pattern recognition systems and has been the subject of considerable research effort. A large number of algorithms is available to build classifiers for various requirements. The classifiers considered in this section operate in two phases: in the training phase the classifier "learns" the vocabulary from a sufficiently large number of representative examples (the training samples). This "knowledge" is then applied in the following classification phase. Classifiers that continue to learn even in the classification phase, e.g. to automatically adapt to the current user, shall not be considered here.
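As an illustration of this two-phase operation, a minimal classifier interface might look as follows in C++. This is only a sketch; the names (Sample, Classifier) are chosen for this example and do not correspond to the LTI-Lib or the book's accompanying code.

```cpp
#include <vector>

using FeatureVector = std::vector<double>;

// A labeled training sample: feature vector plus class index (gesture).
struct Sample {
    FeatureVector features;
    int label;
};

// Minimal two-phase classifier interface: train once, then classify repeatedly.
class Classifier {
public:
    virtual ~Classifier() = default;
    // Training phase: learn the vocabulary from labeled examples.
    virtual void train(const std::vector<Sample>& trainingSet) = 0;
    // Classification phase: return the index k of the recognized class.
    virtual int classify(const FeatureVector& observation) const = 0;
};
```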
(2.36)
The elements of Ω are called classes. In gesture recognition Ω is the vocabulary, and each class ωi (i = 1, 2, . . . , n) represents a gesture. Note that (2.36) restricts the allowed inputs to elements of Ω. This means that the system is never presented with an unknown gesture, which influences the classifier’s design and algorithms. The classifier receives, from the feature extraction stage, an observation O, which, for example, might be a single feature vector for static gestures, or a sequence of feature vectors for dynamic gestures. Based on this observation it outputs a result ω ˆ ∈Ω
where ω ˆ = ωk ,
k ∈ {1, 2, . . . , n}
(2.37)
k denotes the index of the class of the event ω assumed to be the source of the observation O. If ω ˆ = ω then the classification result is correct, otherwise it is wrong. To account for the case that O does not bear sufficient similarity to any element in Ω, one may wish to allow a rejection of O. This is accomplished by introducing a ˆ as pseudo-class ω0 and defining a set of classifier outputs Ω ˆ = Ω ∪ {ω0 } Ω
(2.38)
Thus, (2.37) becomes ˆ ω ˆ ∈Ω
where ω ˆ = ωk ,
k ∈ {0, 1, . . . , n}
(2.39)
30
Non-Intrusive Acquisition of Human Action
Modeling Classification Effects Several types of classifiers can be designed to consider a cost or loss value L for classification errors, including rejection. In general, the loss value depends on the actual input class ω and the classifier’s output ω ˆ: L(ω, ω ˆ ) = cost incurred for classifying event of class ω as ω ˆ
(2.40)
The cost of a correct result is set to zero: L(ω, ω) = 0
(2.41)
This allows to model applications where certain misclassifications are more serious than others. For example, let us consider a gesture recognition application that performs navigation of a menu structure. Misclassifying the gesture for “return to main menu” as “move cursor down” requires the user to perform “return to main menu” again. Misclassifying “move cursor down” as “return to main menu”, however, discards the currently selected menu item, which is a more serious error because the user will have to navigate to this menu item all over again (assuming that there is no “undo” or “back” functionality). Simplifications Classifiers that consider rejection and loss values are discussed in [22]. For the purpose of this introduction, we will assume that the classifier never rejects its input (i.e. (2.37) holds) and that the loss incurred by a classification error is a constant value L: L(ω, ω ˆ) = L
for ω = ω ˆ
(2.42)
This simplifies the following equations and allows us to focus on the basic classification principles. Garbage Classes An alternative to rejecting the input is to explicitly include garbage classes in Ω that represent events to which the system should not react. The classifier treats these classes just like regular classes, but the subsequent stages do not perform any action when the classification result ω ˆ is a garbage class. For example, a gesture recognition system may be constantly observing the user’s hand holding a steering wheel, but react only to specific gestures different from steering motions. Garbage classes for this application would contain movements of the hand along the circular shape of the steering wheel.
Hand Gesture Commands
31
Modes of Classification We distinguish between supervised classification and unsupervised classification: •
•
In supervised classification the training samples are labeled, i.e. the class of each training sample is known. Classification of a new observation is performed by a comparison with these samples (or models created from them) and yields the class that best matches the observation, according to a matching criterion. In unsupervised classification the training samples are unlabeled. Clustering algorithms are used to group similar samples before classification is performed. Parameters such as the number of clusters to create or the average cluster size can be specified to influence the clustering process. The task of labeling the samples is thus performed by the classifier, which of course may introduce errors. Classification itself is then performed as above, but returns a cluster index instead of a label.
For gesture recognition, supervised classification is by far the most common method since the labeling of the training samples can easily be done in the recording process. Overfitting With many classification algorithms, optimizing performance for the training data at hand bears the risk of reducing performance for other (generic) input data. This is called overfitting and presents a common problem in machine learning, especially for small sets of training samples. A classifier with a set of parameters p is said to overfit the training samples T if there exists another set of parameters p that yields lower performance on T , but higher performance in the actual “real-world” application [15]. A strategy for avoiding overfitting is to use disjunct sets of samples for training and testing. This explicitly measures the classifier’s ability of generalization and allows to include it in the optimization process. Obviously, the test samples need to be sufficiently distinct from the training samples for this approach to be effective. 2.1.3.2 Classification Algorithms From the numerous strategies that exist for finding the best-matching class for an observation, three will be presented here: • • •
A simple, easy-to-implement rule-based approach suitable for small vocabularies of static gestures. The concept of maximum likelihood classification, which can be applied to a multitude of problems. Hidden Markov Models (HMMs) for dynamic gestures. HMMs are frequently used for the classification of various dynamic processes, including speech and sign language.
Further algorithms, such as artificial neural networks, can be found in the literature on pattern recognition and machine learning. A good starting point is [15].
32
Non-Intrusive Acquisition of Human Action
2.1.3.3 Feature Selection The feature classification stage receives, for every frame of input data, one feature vector from the feature extraction stage. For static gestures this constitutes the complete observation, while dynamic gestures are represented by a sequence of feature vectors. The constellation of the feature vector (which remains constant throughout the training and classification phases) is an important design decision that can significantly affect the system’s performance. Most recognition systems do not use all features that, theoretically, could be computed from the input data. Instead, a subset of suitable features is selected using criteria and algorithms described below. Reducing the number of features usually reduces computational cost (processing time, memory requirements) and system complexity in both the feature extraction and classification stage. In some cases it may also increase recognition accuracy by eliminating erratic features and thereby emphasizing the remaining features. In the feature selection process one or more features are chosen that allow the reliable identification of each element in the vocabulary. This is provided if the selected features are approximately constant for multiple productions of the same gesture (low intra-gesture variance) but vary significantly between different gestures (high inter-gesture variance). It is not required that different gestures differ in every element of the feature vector, but there must be at least one element in the feature vector that differs beyond the natural (unintentional) intra-gesture variance, segmentation inaccuracy, and noise. It is obvious that, in order to determine intra- and inter-gesture variance, a feature has to be computed for numerous productions of every gesture in the vocabulary. A representative mixture of permitted recording conditions should be used to find out how each feature is affected e.g. by lighting conditions, image noise etc. For person-independent gesture recognition this should include multiple persons. Based on these samples features are then selected either by manual inspection according to the criteria mentioned above, or by an automatic algorithm that finds a feature vector which optimizes the performance of the chosen classifier on the available data. Three simple feature selection algorithms (stepwise forward selection, stepwise backward elimination, and exhaustive search) that can be used for various classifiers, including Hidden Markov Models and maximum likelihood classification (see Sect. 2.1.3.5 and 2.1.3.6), are presented in Fig. 2.15. It is important to realize that although the feature selection resulting from either manual or automatic analysis may be optimal for the data upon which it is based, this is not necessarily the case for the actual practical application. The recorded samples often cannot cover all recording conditions under which the system may be used, and features that work well on this data may not work under different recording conditions or may even turn out to be disadvantageous. It may therefore be necessary to reconsider the feature selection if the actual recognition accuracy falls short of the expectation.
Stepwise Forward Selection
1. Start with an empty feature selection and a recognition accuracy of zero.
2. For every feature that is not in the current feature selection, measure recognition accuracy on the current feature selection extended by this feature. Add the feature for which the resulting recognition accuracy is highest to the current selection. If no feature increases recognition accuracy, halt.
3. If computational cost or system complexity exceeds the acceptable maximum, remove the last added feature from the selection and halt.
4. If recognition accuracy is sufficient or all features are included in the current selection, halt.
5. Goto step 2.

Stepwise Backward Elimination
1. Start with a feature selection containing all available features, and measure recognition accuracy.
2. For every feature in the current feature selection, measure recognition accuracy on the current feature selection reduced by this feature. Remove the feature for which the resulting recognition accuracy is highest from the current selection.
3. If computational cost or system complexity exceeds the acceptable maximum and the current selection contains more than one feature, goto step 2. Otherwise, halt.

Exhaustive Search
1. Let n be the total number of features and m the number of selected features.
2. For every m ∈ {1, 2, …, n}, measure recognition accuracy for all \binom{n}{m} possible feature combinations (this equals a total of \sum_{i=1}^{n} \binom{n}{i} combinations). Discard all cases where computational cost or system complexity exceeds the acceptable maximum, then choose the selection that yields the highest recognition accuracy.

Fig. 2.15. Three algorithms for automatic feature selection. Stepwise forward selection and stepwise backward elimination are fast heuristic approaches that are not guaranteed to find the global optimum. This is possible with exhaustive search, albeit at a significantly higher computational cost.
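To make the wrapper idea of Fig. 2.15 concrete, the following is a minimal sketch of stepwise forward selection. It is not taken from the accompanying software; the callback evaluate_accuracy, which is assumed to train the chosen classifier on a candidate feature subset and return its recognition accuracy, and the optional max_features limit are hypothetical names introduced here for illustration.

```python
# Minimal sketch of stepwise forward selection (Fig. 2.15, first algorithm).
# evaluate_accuracy() is a hypothetical callback that trains the chosen
# classifier on the given feature subset and returns its recognition accuracy.

def forward_selection(all_features, evaluate_accuracy, max_features=None):
    selected = []                 # current feature selection
    best_accuracy = 0.0           # recognition accuracy of `selected`

    while True:
        # Step 2: try extending the current selection by every unused feature.
        candidates = [f for f in all_features if f not in selected]
        if not candidates:
            break                 # all features included -> halt
        scored = [(evaluate_accuracy(selected + [f]), f) for f in candidates]
        accuracy, feature = max(scored, key=lambda pair: pair[0])
        if accuracy <= best_accuracy:
            break                 # no feature increases accuracy -> halt
        selected.append(feature)
        best_accuracy = accuracy
        # Step 3: optional complexity limit (here simply the number of features).
        if max_features is not None and len(selected) > max_features:
            selected.pop()
            break
    return selected, best_accuracy
```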
2.1.3.4 Rule-based Classification

A simple heuristic approach to classification is a set of explicit IF-THEN rules that refer to the target's features and require them to lie within a certain range that is typical of a specific gesture. The fact that an IF-THEN rule usually results in just a single line of source code makes this method straightforward to implement. For example, to identify the static "stop" gesture (Fig. 2.2c) among various other gestures, a simple solution would be to use the shape's compactness c:

IF c < 0.25 THEN the observed gesture is "stop".   (2.43)
The threshold of 0.25 for c was determined empirically from a data set of multiple productions of "stop" and all other gestures, performed by different users. On this data set, max c ≈ 0.23 and mean c ≈ 0.18 were observed for "stop", while c ≫ 0.23 for all other gestures.

Obviously, with increasing vocabulary size, the number and complexity of the rules grow, and they can quickly become difficult to handle. The manual creation of rules, as described in this section, is therefore only suitable for very small vocabularies such as the static example vocabulary, where appropriate thresholds are obvious. Algorithms for the automatic learning of rules are described in [15]. For larger vocabularies, rules are often used as a pre-stage to another classifier. For instance, a rule like

IF a < Θ_a N² THEN the object is not the hand.   (2.44)
discards noise and distractors in the background that are too small to represent the hand. The threshold Θa specifies the required minimum size relative to the squared image width. Another example is to determine whether an object in an image sequence is moving or idle. The following rule checks for a certain minimum amount of motion:
IF max_{i≠j} \sqrt{(x_cog(i) − x_cog(j))² + (y_cog(i) − y_cog(j))²} < Θ_motion · N THEN the hand is idle.   (2.45)
where i and j are frame indices, and Θ_motion specifies the minimum dislocation for the hand to qualify as moving, relative to the image width.

2.1.3.5 Maximum Likelihood Classification

Due to its simplicity and general applicability, the concept of maximum likelihood classification is widely used for all kinds of pattern recognition problems. An extensive discussion can be found in [22]. This section provides a short introduction. We assume the observation O to be a vector of K scalar features f_i:

O = (f_1, f_2, …, f_K)^T   (2.46)

Stochastic Framework

The maximum likelihood classifier identifies the class ω ∈ Ω that is most likely to have caused the observation O by maximizing the a posteriori probability P(ω|O):

ω̂ = argmax_{ω∈Ω} P(ω|O)   (2.47)

To solve this maximization problem we first apply Bayes' rule:
P(ω|O) = P(O|ω) P(ω) / P(O)   (2.48)
For simplification it is commonly assumed that all n classes are equiprobable:

P(ω) = 1/n   (2.49)

Therefore, since neither P(ω) nor P(O) depends on ω, we can omit them when substituting (2.48) into (2.47), leaving only the observation probability P(O|ω):

ω̂ = argmax_{ω∈Ω} P(O|ω)   (2.50)
Estimation of Observation Probabilities

For reasons of efficiency and simplicity, the observation probability P(O|ω) is commonly computed from a class-specific parametric distribution of O. This distribution is estimated from a set of training samples. Since the number of training samples is often insufficient to identify an appropriate model, a normal distribution is assumed in most cases. This is motivated by practical considerations and empirical success rather than by statistical analysis. Usually good results can be achieved even if the actual distribution significantly violates this assumption.

To increase modeling accuracy one may decide to use multimodal distributions. This decision should be backed by sufficient data and motivated by the nature of the regarded feature in order to avoid overfitting (see Sect. 2.1.3.1). Consider, for example, the task of modeling the compactness c for the "stop" gesture shown in Fig. 2.2c. Computing c for, e.g., 20 different productions of this gesture would not provide sufficient motivation for modeling it with a multimodal distribution. Furthermore, there is no reason to assume that c actually obeys a multimodal distribution. Even if measurements do suggest this, it is not backed by the biological properties of the human hand; adding more measurements is likely to change the distribution to a single mode. In this section only unimodal distributions will be used.

The multidimensional normal distribution is given by

N(O, μ, C) = \frac{1}{\sqrt{(2π)^K \det C}} \exp\left(−\frac{1}{2} (O − μ)^T C^{−1} (O − μ)\right)   (2.51)
where μ and C are the mean and covariance of the stochastic process {O}:

μ = E{O}   (mean of {O})   (2.52)
C = E{(O − μ)(O − μ)^T} = cov{O}   (covariance matrix of {O})   (2.53)
For the one-dimensional case with a single feature f this reduces to the familiar equation
N(f, μ, σ²) = \frac{1}{\sqrt{2π}\, σ} \exp\left(−\frac{1}{2}\left(\frac{f − μ}{σ}\right)^2\right)   (2.54)
Computing μ_i and C_i from the training samples for each class ω_i in Ω yields the desired n models that allow us to solve (2.50):

P(O|ω_i) = N(O, μ_i, C_i),   i = 1, 2, …, n   (2.55)
Evaluation of Observation Probabilities

Many applications require the classification stage to output only the class ω_i for which P(O|ω) is maximal, or possibly a list of all classes in Ω, sorted by P(O|ω). The actual values of the conditional probabilities themselves are not of interest. In this case, a strictly monotonic mapping can be applied to (2.55) to yield a score S that imposes the same order on the elements of Ω, but is faster to compute. S(O|ω_i) is obtained by multiplying P(O|ω_i) by \sqrt{(2π)^K} and taking the logarithm:

S(O|ω_i) = −\frac{1}{2} \ln(\det C_i) − \frac{1}{2} (O − μ_i)^T C_i^{−1} (O − μ_i)   (2.56)

Thus,

ω̂ = argmax_{ω∈Ω} S(O|ω)   (2.57)
It is instructive to examine this equation in feature space. The surfaces of constant score S(O|ω_i) form hyperellipsoids with center μ_i. This is caused by the fact that, in general, the elements of O show different variances. In gesture recognition, variance is commonly caused by the inevitable "natural" variations in execution, by differences in anatomy (e.g. hand size), segmentation inaccuracy, and by noise introduced in feature computation. Features are affected by this in different ways: for instance, scale-invariant features are not affected by variations in hand size.

If the covariance matrices C_i can be considered equal for i = 1, 2, …, n (which is often assumed for simplicity), i.e.

C_i = C,   i = 1, 2, …, n   (2.58)

we can further simplify (2.56) by omitting the constant term −½ ln(det C) and multiplying by −2:

S(O|ω_i) = (O − μ_i)^T C^{−1} (O − μ_i)   (2.59)
Changing the score's sign requires replacing the argmax operation in (2.57) with argmin. The right side of (2.59) is known as the Mahalanobis distance between O and μ_i, and the classifier itself is called the Mahalanobis classifier. Expanding (2.59) yields
S(O|ω_i) = O^T C^{−1} O − 2 μ_i^T C^{−1} O + μ_i^T C^{−1} μ_i   (2.60)

For efficient computation this can be further simplified by omitting the constant O^T C^{−1} O:

S(O|ω_i) = −2 μ_i^T C^{−1} O + μ_i^T C^{−1} μ_i   (2.61)
If all elements of O have equal variance σ², C can be written as the product of σ² and the identity matrix I:

C = σ² I   (2.62)

Replacing this in (2.59) and omitting the constant σ² leads to the Euclidean classifier:

S(O|ω_i) = (O − μ_i)^T (O − μ_i) = |O − μ_i|²   (2.63)
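As an illustration of Sect. 2.1.3.5, the following is a minimal sketch (not the accompanying implementation) of a Gaussian maximum likelihood classifier that estimates μ_i and C_i per class and evaluates the score of (2.56)/(2.57). The dictionary-based interface and helper names are assumptions made for this example, and the covariance estimate is assumed to be non-singular.

```python
import numpy as np

# Sketch of a maximum likelihood classifier with one Gaussian per class
# (Sect. 2.1.3.5). `training` maps each class name to an array of shape
# (num_samples, K) holding one feature vector per production.

def train(training):
    models = {}
    for name, samples in training.items():
        mu = samples.mean(axis=0)                   # class mean mu_i
        C = np.cov(samples, rowvar=False)           # covariance matrix C_i
        models[name] = (mu, np.linalg.inv(C), np.linalg.slogdet(C)[1])
    return models

def score(o, mu, C_inv, logdet):
    # Score of (2.56): -1/2 ln(det C_i) - 1/2 (O - mu_i)^T C_i^{-1} (O - mu_i)
    d = o - mu
    return -0.5 * logdet - 0.5 * d @ C_inv @ d

def classify(o, models):
    # (2.57): pick the class with the highest score.
    return max(models, key=lambda name: score(o, *models[name]))
```

With identical covariance matrices, the same structure reduces to the Mahalanobis or Euclidean classifier of (2.59) and (2.63) by swapping in the corresponding score function.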
2.1.3.6 Classification Using Hidden Markov Models

The classification of time-dependent processes such as gestures, sign language, or speech suggests explicitly modeling not only the samples' distribution in the feature space, but also the dynamics of the process from which the samples originate. In other words, a classifier for dynamic events should consider variations in execution speed, which means that we need to identify and model these variations. This requirement is met by Hidden Markov Models (HMMs), a concept derived from Markov chains.8

A complete discussion of HMMs is beyond the scope of this chapter. We will therefore focus on basic algorithms and accept several simplifications, yet enough information will be provided for an implementation and practical application. Further details, as well as other types and variants of HMMs, can be found e.g. in [8, 11, 13, 19–21]. The LTI-Lib contains multiple classes for HMMs and HMM-based classifiers that offer several advanced features not discussed here.

In the following the observation O denotes a sequence of T individual observations o_t:

O = (o_1 o_2 ⋯ o_T)   (2.64)
Each individual observation corresponds to a K-dimensional feature vector:

o_t = (f_1(t), f_2(t), …, f_K(t))^T   (2.65)

For the purpose of this introduction we will at first assume K = 1 so that o_t is scalar:

8 Markov chains describe stochastic processes in which the conditional probability distribution of future states, given the present state, depends only on the present state.
o_t = o_t   (2.66)
This restriction will be removed later.

Concept

This section introduces the basic concepts of HMMs on the basis of an example. In Fig. 2.16 the feature ẋ_cog(t) has been plotted for three samples of the gesture "clockwise" (Fig. 2.3a). It is obvious that, while the three plots have similar shape, they are temporally displaced. We will ignore this for now and divide the time axis into segments of 5 observations each (one observation corresponds to one frame). Since the samples have 60 frames this yields 12 segments, visualized by straight vertical lines.9 For each segment i, i = 1, 2, …, 12, mean μ_i and variance σ_i² were computed from the corresponding observations of all three samples (i.e. from 15 values each), as visualized in Fig. 2.16 by gray bars of length 2σ_i, centered at μ_i.
Fig. 2.16. Overlay plot of the feature x˙ cog (t) for three productions of the gesture “clockwise”, divided into 12 segments of 5 observations (frames) each. The gray bars visualize mean and variance for each segment.
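For illustration, the segment statistics of Fig. 2.16 could be computed along the following lines; this is a sketch under the assumption that the three productions are available as rows of a NumPy array, which is not how the accompanying software stores them.

```python
import numpy as np

# Sketch of the rigid segment model of Fig. 2.16: pool the observations of
# all productions within each segment of 5 frames and compute mean/variance.
# `samples` is assumed to have shape (num_productions, 60), one row per
# production of "clockwise" holding x_cog_dot for every frame.

def segment_statistics(samples, segment_length=5):
    num_frames = samples.shape[1]
    means, variances = [], []
    for start in range(0, num_frames, segment_length):
        segment = samples[:, start:start + segment_length]   # e.g. 3 x 5 values
        means.append(segment.mean())
        variances.append(segment.var())
    return np.array(means), np.array(variances)
```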
This represents a first crude model (though not an HMM) of the time-dependent stochastic process ẋ_cog(t), as shown in Fig. 2.17. The model consists of 12 states s_i, each characterized by μ_i and σ_i². As in the previous section we assume unimodal normal distributions, again motivated by empirical success. The model is rigid with respect to time, i.e. each state corresponds to a fixed set of observations. For example, state s_2 models ẋ_cog(t) for t = 6, 7, 8, 9, 10.

9 The segment size of 5 observations was empirically found to be a suitable temporal resolution.
Fig. 2.17. A simple model for the time dependent process x˙ cog (t) that rigidly maps 60 observations to 12 states.
This model allows us to compute the probability of a given observation O originating from the modeled process (i.e., ẋ_cog(t) for the gesture "clockwise") as

P(O|"clockwise") = \prod_{t=1}^{60} N(o_t, μ_{i(t)}, σ²_{i(t)})   with   i(t) = \left\lfloor \frac{t−1}{5} \right\rfloor + 1   (2.67)
where N is the normal distribution (see (2.54)).

Thus far we have modeled the first stochastic process, the amplitude of ẋ_cog(t), for fixed time segments. To extend the model to the second stochastic process inherent in ẋ_cog(t), the variation in execution speed and temporal displacement apparent in Fig. 2.16, we abandon the rigid mapping of observations to states. As before, s_1 and s_12 are defined as the initial and final state, respectively, but the model is extended by allowing stochastic transitions between the states. At each time step the model may remain in its state s_i or change its state to s_{i+1} or to s_{i+2}. A transition from s_i to s_j is assigned a probability a_{i,j}, with

\sum_j a_{i,j} = 1   (2.68)
and

a_{i,j} = 0 for j ∉ {i, i+1, i+2}   (2.69)
Fig. 2.18 shows the modified model for ẋ_cog(t), which now constitutes a complete Hidden Markov Model. Transitions from s_i to s_i allow modeling a "slower-than-normal" part of the execution, while those from s_i to s_{i+2} model a "faster-than-normal" part. Equation (2.69) is not a general requirement for HMMs; it is a property of the specific Bakis topology used in this introduction. One may choose other topologies, allowing e.g. a transition from s_i to s_{i+3}, in order to optimize classification performance. In this section we will stick to the Bakis topology because it is widely used in speech and sign language recognition, and because it is the simplest topology that models both delay and acceleration.10

Fig. 2.18. A Hidden Markov Model in Bakis topology. Observations are mapped dynamically to states.

Even though we have not changed the number of states compared to the rigid model, the stochastic transitions affect the distribution parameters μ_i and σ_i² of each state i. In the rigid model, the lack of temporal alignment among the training samples leads to μ_i and σ_i² not accurately representing the input data. Especially for the first six states σ is high and μ has a lower amplitude than the training samples (cf. Fig. 2.16). Since the HMM provides temporal alignment by means of the transition probabilities a_{i,j} we expect σ to decrease and μ to better match the sample plots (this is demonstrated in Fig. 2.21). The computation of a_{i,j}, μ_i and σ_i² will be discussed later. For the moment we will assume these values to be given.

In the following a general HMM with N states in Bakis topology is considered. The sequence of states adopted by this model is denoted by the sequence of their indices

q = (q_1 q_2 … q_T)   with   q_t ∈ {1, 2, …, N},  t = 1, 2, …, T   (2.70)
where q_t indicates the state index at time t and T equals the number of observations. This notation allows us to specify the probabilities a_{i,j} as

a_{i,j} = P(q_t = j | q_{t−1} = i)   (2.71)
For brevity, the complete set of transition probabilities is denoted by a square matrix of N rows and columns:

A = [a_{i,j}]_{N×N}   (2.72)
One may want to loosen the restriction that q_1 = s_1 by introducing an initial state probability vector π:

π = (π_1 π_2 … π_N)   with   π_i = P(q_1 = s_i)   and   \sum_i π_i = 1   (2.73)

10 If the number of states N is significantly smaller than the number of observations T, acceleration can also be modeled solely by the transition from s_i to s_{i+1}.
This can be advantageous when recognizing a continuous sequence of gestures (such as a sentence of sign language), where coarticulation effects might lead to the initial states of some models being skipped. In this section, however, we stick to π_1 = 1 and π_i = 0 for i ≠ 1.

To achieve a compact notation for a complete HMM we introduce a vector containing the states' distribution functions:

B = (b_1 b_2 … b_N)   (2.74)

where b_i specifies the distribution of the observation o in state s_i. Since we use unimodal normal distributions,

b_i = N(o, μ_i, σ_i²)   (2.75)

A Hidden Markov Model λ is thus completely specified by

λ = (π, A, B)   (2.76)
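To make the notation λ = (π, A, B) concrete, the following is a minimal sketch of such a model with Bakis topology and unimodal Gaussian output distributions. The class name and the uniform initialization of the allowed transitions are choices made for this example; the initialization of (2.87) discussed below could be substituted.

```python
import numpy as np

# Minimal sketch of an HMM lambda = (pi, A, B) in Bakis topology.
# Each state's distribution b_i is a univariate Gaussian (mu_i, sigma_i^2).

class BakisHMM:
    def __init__(self, means, variances):
        N = len(means)
        self.pi = np.zeros(N)
        self.pi[0] = 1.0                      # start in the first state (pi_1 = 1)
        self.A = np.zeros((N, N))
        for i in range(N):                    # allow jumps of 0, 1, or 2 states
            targets = [j for j in (i, i + 1, i + 2) if j < N]
            self.A[i, targets] = 1.0 / len(targets)
        self.means = np.asarray(means, dtype=float)
        self.variances = np.asarray(variances, dtype=float)

    def b(self, i, o):
        # Emission density N(o, mu_i, sigma_i^2) of state i, cf. (2.75).
        var = self.variances[i]
        return np.exp(-0.5 * (o - self.means[i]) ** 2 / var) / np.sqrt(2 * np.pi * var)
```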
Obviously, to use HMMs for classification, we again need to compute the probability that a given observation O originates from a model λ. In contrast to (2.67) this is no longer simply a product of probabilities. Instead, we have to consider all possible state sequences of length T, denoted by the set Q_T, and sum up the corresponding probabilities:

P(O|λ) = \sum_{q∈Q_T} P(O, q|λ) = \sum_{q∈Q_T} b_1(o_1) \prod_{t=2}^{T} a_{q_{t−1},q_t} b_{q_t}(o_t)   (2.77)

with q_1 = 1 ∧ q_T = N   (2.78)
Since, in general, the distributions b_i will overlap, it is not possible to identify a single state sequence q associated with O. This property is reflected in the name Hidden Markov Model.

The model λ̂ that best matches the observation O is found by maximizing (2.77) over the set Λ of all HMMs:

λ̂ = argmax_{λ∈Λ} P(O|λ)   (2.79)

Even though efficient algorithms exist to solve this equation, (2.79) is commonly approximated by

λ̂ = argmax_{λ∈Λ} P(O|λ, q*)   (2.80)
where q* represents the single most likely state sequence:

q* = argmax_{q∈Q_T} P(q|O, λ)   (2.81)
This is equivalent to approximating (2.77) by its largest summand, which may appear drastic, but (2.79) and (2.80) have been found to be strongly correlated [21], and practical success in terms of accuracy and processing speed warrants this approximation. Application of Bayes' rule to P(q|O, λ) yields

P(q|O, λ) = P(O, q|λ) / P(O|λ)   (2.82)
Since P(O|λ) does not depend on q we can simplify (2.81) to

q* = argmax_{q∈Q_T} P(O, q|λ)   (2.83)
In order to solve this maximization problem it is helpful to consider a graphical representation of the elements of QT . Fig. 2.19 shows the Trellis diagram in which all possible state sequences q are shown superimposed over the time index t.
Fig. 2.19. Trellis diagram for an HMM with Bakis topology.
The optimal state sequence q* can efficiently be computed using the Viterbi algorithm (see Fig. 2.20): instead of performing a brute-force search of Q_T, this method exploits the fact that from all state sequences q that arrive at the same state s_i at time index t, only the one for which the probability up to s_i is highest, q*_i, can possibly solve (2.83):

q*_i = argmax_{q∈{q|q_t=i}} b_1(o_1) \prod_{t'=2}^{t} a_{q_{t'−1},q_{t'}} b_{q_{t'}}(o_{t'})   (2.84)
All others need not be considered, because regardless of the subsequence of states following s_i, the resulting total probability would be lower than that achieved by following q*_i by the same subsequence. (A common application of this algorithm in everyday life is the search for the shortest path from a location A to another location B in a city: from all paths that have a common intermediate point C, only the one for which the distance from A to C is shortest has a chance of being the shortest path between A and B.)

Viterbi Search Algorithm
Define t as a loop index and Q_t as a set of state sequences of length t.
1. Start with t = 1, Q_1 = {(s_1)}, and Q_t = ∅ for t = 2, 3, …, T.
2. For each state s_i (i = 1, 2, …, N) that is accessible from the last state of at least one of the sequences in Q_t do: find the state sequence q with

q = argmax_{q∈Q_t} \left[ b_1(o_1) \prod_{t'=2}^{t} a_{q_{t'−1},q_{t'}} b_{q_{t'}}(o_{t'}) \right] a_{q_t,i} b_i(o_{t+1})   (2.85)

Append i to q, then add q to Q_{t+1}.
3. Increment t.
4. If t < T, goto step 2.
5. The state sequence q ∈ Q_T with q_T = N solves (2.83).

Fig. 2.20. The Viterbi algorithm for an efficient solution of (2.83).
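A compact sketch of the search in Fig. 2.20, written in log space (anticipating the scores introduced below) and reusing the hypothetical BakisHMM container sketched earlier, could look as follows; it is an illustration, not the LTI-Lib implementation.

```python
import numpy as np

# Sketch of the Viterbi search of Fig. 2.20 in log space (scores, see (2.86)),
# reusing the BakisHMM container sketched above. Returns the best state
# sequence q* (0-based indices) and its score, approximating (2.83).

def viterbi(hmm, observations):
    N, T = len(hmm.means), len(observations)
    log_A = np.log(hmm.A, out=np.full_like(hmm.A, -np.inf), where=hmm.A > 0)
    log_b = lambda i, o: np.log(hmm.b(i, o))

    delta = np.full((T, N), -np.inf)     # best score ending in state i at time t
    psi = np.zeros((T, N), dtype=int)    # back-pointers
    delta[0, 0] = log_b(0, observations[0])          # q_1 = s_1

    for t in range(1, T):
        for j in range(N):
            scores = delta[t - 1] + log_A[:, j]
            psi[t, j] = np.argmax(scores)
            delta[t, j] = scores[psi[t, j]] + log_b(j, observations[t])

    # Backtrack from the final state s_N (q_T = N).
    q = [N - 1]
    for t in range(T - 1, 0, -1):
        q.append(psi[t, q[-1]])
    return list(reversed(q)), delta[T - 1, N - 1]
```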
In an implementation of the Viterbi algorithm the value of (2.85) is efficiently stored to avoid repeated computation. In case multiple state sequences maximize (2.85), only one of them needs to be considered.

Implementational Aspects

The product of probabilities in (2.77) and (2.85) may become very small, causing numerical problems. This can be avoided by using scores instead of probabilities, with

Score(P) = ln(P)   (2.86)
Since ln(ab) = ln a + ln b, this allows us to replace the product of probabilities with a sum of scores. Because the logarithm is a strictly monotonic mapping, the inverse exp(·) operation may be omitted without affecting the argmax operation. In addition to increased numerical stability, the use of scores also reduces computational cost.11

Estimation of HMM Parameters

We will now discuss the iterative estimation of the transition probabilities A and the distribution functions B for a given number of states N and a set of M training samples O_1, O_2, …, O_M. T_m denotes the length of O_m (m ∈ {1, 2, …, M}). An initial estimate of B is obtained in the same way as shown in Fig. 2.16, i.e. by equally distributing the elements of all training samples to the N states.12 A is initialized to

A = \begin{pmatrix}
0.3 & 0.3 & 0.3 & 0 & 0 & \cdots & 0 \\
0 & 0.3 & 0.3 & 0.3 & 0 & \cdots & 0 \\
\vdots & & & \ddots & & & \vdots \\
0 & \cdots & & 0 & 0.3 & 0.3 & 0.3 \\
0 & \cdots & & 0 & 0 & 0.5 & 0.5 \\
0 & \cdots & & 0 & 0 & 0 & 1
\end{pmatrix}   (2.87)

We now apply the Viterbi algorithm described above to find, for each training sample O_m = (o_{1,m} o_{2,m} … o_{T_m,m}), the optimal state sequence q*_m = (q_{1,m} q_{2,m} … q_{T_m,m}). Each q*_m constitutes a mapping of observations to states, and the set of all q*_m, m = 1, 2, …, M, is used to update the model's distributions B and transition probabilities A accordingly: mean μ_i and variance σ_i² for each state s_i are recomputed from all sample values o_{t,m} for which q_{t,m} = i holds. The transition probabilities a_{i,j} are set to the number of times that the subsequence (…, q_i, q_j, …) is contained in the set of all q*_m, normalized to fulfill (2.68). Formally,

μ_i = E{o_{t,m} | q_{t,m} = i}   (2.88)
σ_i² = var{o_{t,m} | q_{t,m} = i}   (2.89)
a_{i,j} = ã_{i,j} / \sum_j ã_{i,j}   (2.90)

with

ã_{i,j} = |{q_{t,m} | q_t = i ∧ q_{t+1} = j}|   (2.91)
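One iteration of this re-estimation could be sketched as follows, reusing the hypothetical viterbi() and BakisHMM helpers from the earlier sketches; the small variance floor is a practical safeguard added here and is not part of the equations above.

```python
import numpy as np

# Sketch of one Viterbi-training (segmental k-means) iteration, cf. (2.88)-(2.91):
# align every training sequence with the current model, then re-estimate the
# per-state mean/variance and the transition probabilities from the alignments.

def reestimate(hmm, training_sequences):
    N = len(hmm.means)
    values = [[] for _ in range(N)]           # observations assigned to each state
    counts = np.zeros((N, N))                 # transition counts a~_{i,j}

    for O in training_sequences:
        q, _ = viterbi(hmm, O)                # optimal state sequence q*_m
        for t, (state, o) in enumerate(zip(q, O)):
            values[state].append(o)
            if t + 1 < len(q):
                counts[state, q[t + 1]] += 1

    for i in range(N):
        if values[i]:                                     # (2.88), (2.89)
            hmm.means[i] = np.mean(values[i])
            hmm.variances[i] = max(np.var(values[i]), 1e-6)
        row = counts[i].sum()
        if row > 0:                                       # (2.90), (2.91)
            hmm.A[i] = counts[i] / row
```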
This process is called segmental k-means or Viterbi training. Each iteration improves the model's match with the training data, i.e.

P(O, q*|λ^{(n−1)}) ≤ P(O, q*|λ^{(n)})   (2.92)

where λ^{(n)} denotes the model at iteration n.
11 To avoid negative values, the score is sometimes defined as the negative logarithm of P.
12 If N is not a divisor of T_m, an equal distribution is approximated.
The training is continued on the same training samples until λ remains constant. This is the case when the optimal state sequences q*_m remain constant for all M samples.13 A proof of convergence can be found in [13].

It should be noted that the number of states N is not part of the parameter estimation process and has to be specified manually. Lower values for N reduce computational cost at the risk of inadequate generalization, while higher values may lead to overspecialization. Also, since the longest forward transition in the Bakis topology is two states per observation,

N ≤ 2 · min{T_1, T_2, …, T_M} − 1   (2.93)

is required.

Revisiting the example introduced at the beginning of this section, we will now create an HMM for ẋ_cog from the three productions of "clockwise" presented in Fig. 2.16. To allow a direct comparison with the rigid state assignment shown in Fig. 2.17, the number of states N is set to 12 as well. Fig. 2.21 overlays the parameters of the resulting model's distribution functions with the same three plots of ẋ_cog.
Fig. 2.21. Overlay plot of the feature x˙ cog (t) for three productions of the gesture “clockwise”, and mean μ and variance σ of the 12 states of an HMM created therefrom. The gray bars are centered at μ and have a length of 2σ. Each represents a single state.
13 Alternatively, the training may also be aborted when the number of changes in all q*_m falls below a certain threshold.
Compared to Fig. 2.16, the mean values μ now represent the shape of the plots significantly better. This is especially noticeable for 10 ≤ t ≤ 30, where the differences in temporal offset previously led to μ having a much smaller amplitude than ẋ_cog. Accordingly, the variances σ are considerably lower. This demonstrates the effectiveness of the approach.

Multidimensional Observations

So far we have assumed scalar observations o_t (see (2.66)). In practice, however, we will rarely encounter this special case. The above concepts can easily be extended to multidimensional observations, removing the initial restriction of K = 1 in (2.65). (2.75) becomes

b_i = N(o, μ_i, C_i)   (2.94)

where μ_i and C_i are mean and covariance matrix of state s_i estimated in the Viterbi training:

μ_i = E{o_{t,m} | q_{t,m} = i}   (2.95)
C_i = cov{o_{t,m} | q_{t,m} = i}   (2.96)
These equations replace (2.88) and (2.89). In practical applications the elements of o are frequently assumed to be uncorrelated, so that C becomes a diagonal matrix, which simplifies computation.

Extensions

A multitude of extensions and modifications exist to improve the performance of the simple HMM variant presented here, two of which shall be briefly mentioned. Firstly, the distributions b may be chosen differently. This includes other types of distributions, such as Laplace, and multimodal models.14 The latter modification brings about a considerable increase in complexity. Furthermore, in the case of multidimensional observations one may want the temporal alignment performed by the HMM to be independent for certain parts of the feature vector. This can be achieved by simply using multiple HMMs in parallel, one for each set of temporally coupled features. For instance, a feature vector in a gesture recognition system typically consists of features that describe the hand's position, and others that characterize its posture. Using a separate HMM for each set of features facilitates independent temporal alignment (see Sect. 3.3.4 for a discussion of such Parallel Hidden Markov Models (PaHMMs)). Whether this is actually desirable depends on the concrete application.
14 For multimodal models, the considerations stated in Sect. 2.1.3.5 apply.
2.1.3.7 Example

Based on the features of the static and the dynamic vocabulary computed in Sect. 2.1.2.4, this section describes the selection of suitable features to be used in the two example applications. We will manually analyze the values in Tab. 2.5 and the plots in Fig. 2.14 to identify features with high inter-gesture variance. Since all values and plots describe only a single gesture, they first have to be computed for a number of productions to get an idea of the intra-gesture variance. For brevity, this process is not described explicitly. The choice of classification algorithm is straightforward for both the static and the dynamic case.

Static Gesture Recognition

The data in Tab. 2.5 shows that a characteristic feature of the "stop" gesture (Fig. 2.2c) is its low compactness: c is smaller than for any other shape. This was already mentioned in Sect. 2.1.3.4, and found to be true for any production of "stop" by any user.

Pointing gestures result in a roughly circular, compact shape with a single protrusion from the pointing finger. For the "left" and "right" gestures (Fig. 2.2a,b), this protrusion is along the x axis. Assuming that the protruding area does not significantly affect x_cog, the protrusion ratio x_p introduced in (2.30) can be used: shapes with x_p ≫ 1 are pointing to the right, while 1/x_p ≫ 1 indicates pointing to the left. Gestures for which x_p ≈ 1 are not pointing along the x axis. This holds regardless of whether the thumb or the index finger is used for pointing. Empirically, x_p > 1.2 and 1/x_p > 1.2 were found for pointing right and left, respectively.

Since the hand may not be the only skin-colored object in the image (see Tab. 2.4), the segmentation might yield multiple skin-colored regions. In our application scenario it is safe to assume that the hand is the largest object; however, it need not always be visible. The recording conditions in Tab. 2.4 demand a minimum shape size a_min of 10% of the image size, or

a_min = 0.1 N M   (2.97)

Therefore, all shapes with a < a_min may be discarded, and from the remainder, we simply choose the one with the largest area a. If no shape exceeds the minimum area, the image is considered empty. The feature vector for static gesture recognition is therefore

O = (a, c, x_p)^T   (2.98)

The above observations regarding a, c, and x_p suggest using a set of rules for classification. The required thresholds can easily be identified since different gestures (including typical "idle" gestures) scarcely overlap in the feature space. A suitable threshold Θ_c for identifying "stop" was already found in (2.43).
Rephrasing the above paragraphs as rules yields

1. Discard all objects with a < 0.1NM.   (2.99)
2. If no objects remain, then the image is empty.   (2.100)
3. Consider the object for which a is maximum.   (2.101)
4. If c < 0.25, then the observed gesture is "stop".   (2.102)
5. If x_p > 1.2, then the observed gesture is "right".   (2.103)
6. If 1/x_p > 1.2, then the observed gesture is "left".   (2.104)
7. Otherwise the hand is idle and does not perform any gesture.   (2.105)
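Expressed in code, the rule set could look like the following sketch; the dictionary keys for area, compactness, and protrusion ratio are assumptions for this illustration and do not correspond to the actual macro interface.

```python
# Sketch of rules (2.99)-(2.105). Each region is assumed to be a dict with the
# features 'a' (area), 'c' (compactness), and 'xp' (protrusion ratio); N and M
# are the image width and height.

def classify_static(regions, N, M):
    candidates = [r for r in regions if r['a'] >= 0.1 * N * M]   # rule 1
    if not candidates:
        return 'empty'                                           # rule 2
    hand = max(candidates, key=lambda r: r['a'])                 # rule 3
    if hand['c'] < 0.25:
        return 'stop'                                            # rule 4
    if hand['xp'] > 1.2:
        return 'right'                                           # rule 5
    if 1.0 / hand['xp'] > 1.2:
        return 'left'                                            # rule 6
    return 'idle'                                                # rule 7
```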
These rules conclude the design of the static gesture recognition example application. Sect. 2.1.4 presents the corresponding IMPRESARIO process graph.

Dynamic Gesture Recognition

The feature plots in Fig. 2.14 show that normalized area a, compactness c, and the normalized x coordinate of the center of gravity x_cog should be sufficient to discriminate among the six dynamic gestures. "clockwise" and "counterclockwise" can be identified by their characteristic sinusoidal x_cog feature. "open" and "grab" both exhibit a drop in a around T/2 but can be distinguished by compactness, which simultaneously drops for "open" but rises for "grab". "close" and "drop" are discerned correspondingly. We thus arrive at a feature vector of

O = (a, c, x_cog)^T   (2.106)

An HMM-based classifier is the obvious choice for processing a time series of such feature vectors. The resulting example application and its IMPRESARIO process graph are presented in Sect. 2.1.5.

2.1.4 Static Gesture Recognition Application

To run the example application for static gesture recognition, please install the rapid prototyping system IMPRESARIO from the accompanying CD as described in Sect. B.1. After starting IMPRESARIO, choose "File"→"Open..." to load the process graph named StaticGestureClassifier.ipg (see Fig. 2.22). The data flow is straightforward and corresponds to the structure of the sections in this chapter. All six macros are described below.

Video Capture Device: Acquires a stream of images from a camera connected e.g. via USB. In the following, each image is processed individually and independently. Other sources may be used by replacing this macro with the appropriate source macro (for instance, Image Sequence (see Sect. B.5.1) for a set of image files on disk).
Fig. 2.22. The process graph of the example application for static gesture recognition (file StaticGestureClassifier.ipg). The image shows the classification result.
ProbabilityMap: Computes a skin probability image as described in Sect. 2.1.2.1. The parameters Object color histogram file and Non-object color histogram file point to the files skin-32-32-32.hist and nonskin-32-32-32.hist (these are the histograms presented in [12], with a resolution of 32 bins per dimension as described in (2.13)). All parameters should be set to their default values.

GaussFilter: Smoothes the skin probability image to prevent jagged borders (see Sect. 2.1.2.2). The parameter Kernel size should be set to around 15 for a resolution of 320×240. Higher resolutions may require larger kernels. Setting Variance to zero computes a suitable Gauss kernel variance automatically.

ConvertChannel: The output of the preceding stage is a floating point gray value image I(x, y) ∈ [0, 1]. Since the next macro requires a fixed point gray value image I(x, y) ∈ {0, 1, …, 255}, a conversion is performed here.

ObjectsFromMask: Extracts contiguous regions in the skin probability image using an algorithm similar to that presented in Fig. 2.11. This macro outputs a list of all detected regions sorted descending by area (which may be empty if no regions were found). A value of zero for the parameter Threshold uses the algorithm shown in Fig. 2.6 for automatic threshold computation. If this fails, the threshold
may have to be set manually. The remaining parameters control further aspects of the segmentation process and should be left at their defaults.

StaticGestureClassifier: Receives the list of regions from the preceding stage and discards all but the largest one. The macro then computes the feature vector described in (2.98), using the equations discussed in Sect. 2.1.2.3, and performs the actual classification according to the rules derived in Sect. 2.1.3.7 (see (2.99) to (2.105)). The thresholds used in the rules can be changed to modify the system's behavior. For the visualization of the classification result, the macro also receives the original input image from Video Capture Device (if the output image is not shown, double-click on the macro's red output port). It displays the border of the hand and a symbol indicating the classification result ("left", "right", or "stop"). If the hand is idle or not visible, the word IDLE is shown.

Press the "Play" button or the F1 key to start the example. By double-clicking on the output port of GaussFilter the smoothed skin probability image can be displayed. The hand should be clearly distinguishable from the background. Make sure that the camera's white balance is set correctly (otherwise the skin color detection will fail), and that the image is never overexposed, but rather slightly underexposed. If required, change the device properties as described in Sect. B.5.2. The application recognizes the three example gestures "left", "right", and "stop" shown in Fig. 2.2. If the classification result is always IDLE or behaves erratically, see Sect. 2.1.6 for troubleshooting help.

2.1.5 Dynamic Gesture Recognition Application

The process graph of the example application for dynamic gesture recognition is stored in the file DynamicGestureClassifier.ipg (see Fig. 2.23). It is identical to the previous example (cf. Fig. 2.22), except for the last macro DynamicGestureClassifier. From the list of regions received from ObjectsFromMask, this macro considers only the largest one (just as StaticGestureClassifier) and computes the feature vector shown in (2.106). The feature vectors of subsequent frames are buffered, and when the gesture is finished, the resulting feature vector sequence is forwarded to an HMM-based classifier integrated in the macro.

The following explains the parameters and operation of DynamicGestureClassifier. When starting the example, the macro is in "idle" state: it performs a segmentation and visualizes the resulting hand border in the color specified by Visualization Color but does not compute feature vectors. This allows you to position the camera, set the white balance etc. Setting the State parameter to "running" activates the recording and recognition process. Whenever the macro's state is switched from "idle" to "running" it enters a preparation phase in which no features are computed yet, in order to give you some time to position your hand (which is probably on the mouse at that point) in the image. During this period a green progress bar is shown at the bottom of the image so that you know when the preparation phase is over and the actual recording starts. The duration of the preparation phase (in frames) is specified by the parameter Delay Samples.
Fig. 2.23. Process graph of the example application for dynamic gesture recognition (file DynamicGestureClassifier.ipg). The image was recorded with a webcam mounted atop the screen facing down upon the keyboard.
After the preparation phase, the recording phase starts automatically. Its duration can be configured by Gesture Samples. A red progress bar is shown during recording. The complete gesture must be made while this progress bar is running. The hand motion should be smooth and take up the entire recording phase.

The parameter Hidden Markov Model specifies a file containing suitable Hidden Markov Models.15 You can either use gestures.hmm, which is provided on the CD and contains HMMs for the six example gestures shown in Fig. 2.3, or generate your own HMMs using the external tool generateHMM.exe as described below. Since the HMMs are identified by numerical IDs (0, 1, …), the parameter Gesture Names allows you to set names for each model that are used when reporting classification results. The names are specified as a list of space-separated words, where the first word corresponds to ID 0, the second to ID 1, etc. Since the HMMs in gestures.hmm are ordered alphabetically, this parameter defaults to "clockwise close counterclockwise drop grab open".
15 The file actually contains an instance of lti::hmmClassifier, which includes instances of lti::hiddenMarkovModel.
The parameter Last Classified Gesture is actually not a parameter, but is set by the macro itself to reflect the last classification result (which is also dumped to the output pane).

Start the example application by pressing the "Play" button or the F1 key. Adjust white balance, shutter speed etc. as for StaticGestureRecognition (Sect. 2.1.4). Additionally, make sure that the camera is set to the same frame rate used when recording the training samples from which the HMMs were created (25 fps for gestures.hmm). When the segmentation of the hand is accurate, set State to "running" to start the example. Perform one of the six example gestures (Fig. 2.3) while the red progress bar is showing and check Last Classified Gesture for the result. If the example does not work as expected, see Sect. 2.1.6 for help.

Test/Training Samples Provided on CD

The subdirectory imageinput\Dynamic_Gestures in your IMPRESARIO installation contains three productions of each of the six example gestures. From these samples, gestures.hmm was created (this process is described in detail below). For testing purposes you can use them as input to DynamicGestureClassifier. However, conclusions regarding classification performance cannot be made by feeding training data back to the system, but only by testing with previously unseen input.

The process graph DynamicGestureClassifierFromDisk.ipg shown in Fig. 2.24 uses Image Sequence as the source macro. It allows you to select one of six image sequences from a drop-down list, where each sequence corresponds to one production of the gesture of the same name. A detailed description of Image Sequence is provided in Sect. B.5.1. In contrast to acquiring images from a camera, reading them from disk does not require a preparation phase, so the parameter Delay Samples of DynamicGestureClassifier is set to 0. Gesture Samples should exactly match the number of files to read (60 for the example gestures). Make sure that State is set to "running" every time before starting the example, otherwise one or more images would be ignored.

Generation of Hidden Markov Models

The command line utility generateHMM.exe creates Hidden Markov Models from a set of training samples and stores them in an .hmm file16 for use in DynamicGestureClassifier. These training samples may either be the above described samples provided on the CD, your own productions of the example gestures, or any other gestures that can be classified based on the feature vector introduced in (2.106). (To use a different feature vector, modify the class DynamicGR accordingly and recompile both the affected IMPRESARIO files and generateHMM.) The training samples must be arranged on disk as shown in Fig. 2.25. The macro RecordGestures facilitates easy recording of gestures and creates this directory
16 As above, this file actually contains a complete lti::hmmClassifier.
Fig. 2.24. Recognition of dynamic gestures read from disk frame by frame (process graph file DynamicGestureClassifierFromDisk.ipg). Each frame corresponds to a single file.
Fig. 2.25. Directory layout for training samples to be read by the generateHMM.exe command line utility.
structure automatically. Its process graph is presented in Fig. 2.26. Like DynamicGestureRecognition, it has an "idle" and a "running" state, and precedes the actual recording with a preparation phase.
The parameters State, Delay Samples, and Gesture Samples work as described above.
Fig. 2.26. Process graph RecordGestures.ipg for recording gestures to be processed by generateHMM.exe.
To record a set of training gestures, first set the parameter Destination Directory to specify where the above described directory structure shall be created. Ensure that Save Images to Disk is activated and State is set to "idle", then press "Play" or F1. For each gesture you wish to include in the vocabulary, perform the following steps:

1. Set the parameter Gesture Name (since the name is used as a directory name, it should not contain special characters).
2. Set Sequence Index to 0.
3. Set State to "running", position your hand appropriately in the preparation phase (green progress bar) and perform the gesture during the recording phase (red progress bar). After recording has finished, Sequence Index is automatically increased by 1, and State is reset to "idle".
4. Repeat step 3 several times to record multiple productions of the gesture.
5. Examine the results using an image viewer such as the free IrfanView17 or the macro Image Sequence (see Sect. B.5.1). If a production was incorrect or otherwise not representative, set Sequence Index to its index and repeat step 3 to overwrite it.
17 http://www.irfanview.com/
Each time step 3 is performed, a directory with the name of Sequence Index is created in a subdirectory of Destination Directory named according to Gesture Name, containing as many image files (frames) as specified in Gesture Samples. This results in a structure as shown in Fig. 2.25. All images are stored in BMP format.

The files created by RecordGestures can be used for both training and testing. For instance, in order to observe the effects of changes to a certain parameter while keeping the input data constant, replace the source macro Video Capture Device in the process graphs StaticGestureClassifier.ipg or DynamicGestureClassifier.ipg with Image Sequence and use the recorded gestures as image source.

To run generateHMM.exe on the recorded gestures, change to the directory where it resides (the IMPRESARIO data directory), ensure that the files skin-32-32-32.hist and non-skin-32-32-32.hist are also present there, and type
generateHMM.exe -d <directory> -e <extension> -f <filename>

where the options -d, -e, and -f specify the input directory, the input file extension, and the output filename, respectively. For use with RecordGestures, replace <directory> with the directory specified as Destination Directory, <extension> with bmp18, and <filename> with the name of the .hmm file to be created. The directories are then scanned and each production is processed (this may take several seconds). The resulting .hmm file can now be used in DynamicGestureClassifier.

To control the segmentation process, additional options as listed in Tab. 2.6 may be used. It is important that these values match those of the corresponding parameters of the ObjectsFromMask macro in DynamicGestureClassifier.ipg to ensure that the process of feature computation is consistent. The actual HMM creation can be configured as shown in Tab. 2.7.

Table 2.6. Options to control the skin color segmentation in generateHMM.exe.

-k  Size of the Gauss kernel used for smoothing the skin probability images.
-t  Segmentation threshold Θ.
Calling generateHMM.exe with no arguments shows a short help as well as the default values and allowed ranges for all options. For the provided file gestures.hmm the option -s 15 was used, and all other parameters were left at their defaults.
18 For the samples provided on CD, specify jpg instead.
Table 2.7. Options to control the HMM creation in generateHMM.exe.
-c      Type of the distribution functions B to be used.
-s      Number of states N.
-m, -M  Minimum and maximum number of states to jump forward per observation. These values specify a range, so -m 0 -M 2 allows jumps of zero, one, or two states (Bakis topology).
2.1.6 Troubleshooting

In case the classification performance of either example is poor, verify that the recording conditions in Tab. 2.4 are met. Wooden furniture may cause problems for the skin color segmentation, so none should be visible in the image. Point the camera to a white wall to be on the safe side. If the segmentation fails (i.e. the hand's border is inaccurate) even though no other objects are visible, recheck white balance, shutter speed, and gain/brightness. The camera may perform an automatic calibration that yields results unsuitable for the skin color histogram. Disable this feature and change the corresponding device parameters manually. The hand should be clearly visible in the output of the GaussFilter macro (double-click on its red output port if no output image is shown).
2.2 Facial Expression Commands

The human face is, besides speech and gesture, a primary channel of information exchange during natural communication. With facial expressions, emotions can be expressed faster, more subtly, and more efficiently than with words [3]. Facial expressions are used actively by a speaker to support communication, but also intuitively for mimic feedback by a dialog partner. In medicine, symptoms and progress in therapy may be read from the face. Sign languages make use of facial expressions as an additional communication channel supplementing the syntax and semantics of a conversation. Already a few minutes after birth, babies focus on faces and thereby differentiate between persons.

While the recognition of facial expressions is easy for humans, it is extremely difficult for computers. Many attempts are documented in the literature, which mostly concentrate on subparts of the face [17]. Eye blinks are, e.g., applied in the automotive industry to estimate driver fatigue, lip outline and motion contribute to the robustness of speech recognizers, and face recognition is used as a biometric method for identification [4]. Parallel registration of facial features currently takes place only in "motion capturing" for the realistic animation of computer-generated characters. Here the tiniest movements in the face are recorded with intricate optical measurement devices. Fig. 2.27 shows a selection of current hardware for facial expression recording.

The use of non-intrusive video-based registration techniques poses serious problems due to possible variations in user appearance, which may have intrinsic and
Fig. 2.27. Hardware for facial expression recording. Sometimes the reflections of Styrofoam balls are exploited (as in a and b). Other solutions use a head cam together with markers on the face (c) or a mechanical device (d). In medical diagnostics, line of sight is measured with piezo elements or with cameras fixed to the spectacles to measure skin tension.
extrinsic causes. Intrinsic variations result from individual face shape and from movements while speaking; extrinsic effects result from camera position, illumination, and possibly occlusions. Here we propose a generic four-stage facial feature analysis system that can cope with such influence factors. The processing chain starts with video stream acquisition and continues via image preprocessing and feature extraction to classification. During image acquisition, a camera image is recorded while camera adjustments are optimized continuously. During preprocessing, the face is localized and its appearance is improved in various respects. Subsequently, feature extraction results in the segmentation of characteristic facial regions, among which significant attributes are identified. A final classification stage assigns features to predefined facial expressions. An overview of the complete system is depicted in Fig. 2.28. The single processing stages are described in more detail in the following sections.

Fig. 2.28. Information processing stages of the facial expression recognition system.
2.2.1 Image Acquisition

During image acquisition, the constancy of color and luminance must be guaranteed even under variable illumination conditions. The commercial webcams used here usually permit the adjustment of various parameters, and mostly offer a feature for partial or full self-configuration. The algorithms implemented in hardware take care of white balance, shutter speed, and backlight compensation and yield, on average, satisfactory results. Camera parameter adjustment can, however, be further optimized if context knowledge is available, e.g. shadows from which the position of a light source can be estimated, or the color of a facial region.

Since lighting and reflection cannot be influenced for our purposes, only the adjustment of the camera parameters remains. The most important parameters of a digital camera for optimizing image quality are the shutter speed, the white balance, and the gain. In order to optimize these factors in parallel, an extended simplex algorithm is utilized, similar to the approach given in [16]. A simplex in n-dimensional space (here n = 3 for the three parameters) is a geometrical object with n+1 position vectors which are equidistant from each other. The position vector is defined as follows:

P = (s, w, g)^T   with s = shutter, w = white balance, g = gain   (2.107)

|P| is a scalar value that represents the number of skin-colored pixels in the face region, dependent on the three parameters s, w and g. The worst (smallest) vector is mirrored at the line given by the remaining vectors, resulting in a new simplex. The simplex algorithm iteratively tries to find an optimal position vector which comes closest to the desired target. In the present case the goal is to approximate the average color of the face region with an empirically determined reference value. For an optimal adjustment of the parameters, however, a suitable initialization is necessary, which can be achieved for example by first running the camera's automatic white balance.

Fig. 2.29 (left) shows an example of the two-dimensional case. To the position vector P two additional position vectors U and V are added. Since |P| produces too few skin-colored areas, P is mirrored at the straight line UV onto P′. This results in the new simplex S′. The procedure is repeated until it converges towards a local maximum (Fig. 2.29, right).

A disadvantage of the simplex algorithm is its rigid, stepwise and linear behavior. Therefore the approach was extended: by introducing a set of rules, a compression and/or stretching of the mirroring step was made possible. Fig. 2.30 shows the extended simplex algorithm for a two-dimensional parameter space. The rules are as follows:
Fig. 2.29. Simple simplex similar to the approach described in [16].
1. If |P′| > |U| ∧ |P′| > |V|, then repeat the mirroring with stretching factor 2, from P to P_2.
2. If (|P′| > |U| ∧ |P′| < |V|) ∨ (|P′| < |U| ∧ |P′| > |V|), then use P′.
3. If |P′| < |U| ∧ |P′| < |V| ∧ |P′| > |P|, then repeat the mirroring with stretching factor 0.5, from P to P_{0.5}.
4. If |P′| < |U| ∧ |P′| < |V| ∧ |P′| < |P|, then repeat the mirroring with stretching factor −0.5, from P to P_{−0.5}.
Fig. 2.30. Advanced simplex with dynamic stretching factor.
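A sketch of a single step of this extended mirroring, based on the rules as reconstructed above, is given below. The callback objective, assumed to return the number of skin-colored pixels for a given parameter vector, and the vertex naming are hypothetical choices for this two-dimensional illustration.

```python
import numpy as np

# Sketch of one step of the extended simplex search (Fig. 2.30) for two camera
# parameters. `objective` is a hypothetical callback returning the number of
# skin-colored pixels for a parameter vector; P is the worst vertex, U and V
# are the remaining vertices of the current simplex (all NumPy arrays).

def simplex_step(P, U, V, objective):
    centroid = (U + V) / 2.0
    reflect = lambda factor: centroid + factor * (centroid - P)

    P1 = reflect(1.0)                       # plain mirroring of the worst vertex
    f_P, f_P1 = objective(P), objective(P1)
    f_U, f_V = objective(U), objective(V)

    if f_P1 > f_U and f_P1 > f_V:           # rule 1: better than both -> stretch
        return reflect(2.0)
    if f_P1 > f_U or f_P1 > f_V:            # rule 2: better than one -> accept
        return P1
    if f_P1 > f_P:                          # rule 3: compress towards centroid
        return reflect(0.5)
    return reflect(-0.5)                    # rule 4: compress on the original side
```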
2.2.2 Image Preprocessing

The goal of image preprocessing is the robust localization of the facial region, which corresponds to the rectangle bounded by the bottom lip and the brows. With regard to processing speed, only skin-colored regions of an image are analyzed. Additionally, a general skin color model is adapted to each individual. Face finding relies on a holistic approach which takes the orientation of facial regions into consideration. Brows, e.g., are characterized by vertically alternating
bright and dark horizontal regions. A speed-up of processing is achieved by a so-called integral-image approach and an Ada-Boosting classification. Variations in head pose and occlusion by, e.g., hands are compensated by tracing characteristic face regions in parallel. Furthermore, a reduction of shadow and glare effects is performed as soon as the face has been found [5].

2.2.2.1 Face Localization

Face localization is simplified by exploiting a-priori knowledge, either with respect to the whole face or to parts of it. Analytical or feature-based approaches make use of local features like edges, intensity, color, movement, contours and symmetry, separately or in combination, to localize facial regions. Holistic approaches consider regions as a whole. The approach described here is holistic: it finds bright and dark facial regions and their geometrical relations. A search mask is devised to restrict the search to skin-colored regions with suitable movement patterns. For the exploitation of skin color information, the approach already described in Sect. 2.1.2.1 is utilized.

A movement analysis makes use of the fact that neighboring pixels of contiguous regions move in a similar fashion. There are several options for calculating optical flow. Here a Motion History Image (MHI), i.e. the weighted sum of subsequent binary difference images, is used, where a motion image is defined as:

I_motion(x, y, t_k) = \begin{cases} 1.0 & \text{if } \lVert I(x, y, t_k) − I(x, y, t_{k−1}) \rVert_2 > Θ \\ \max(0.0,\; I_B(x, y, t_{k−1}) − \frac{1}{T}) & \text{otherwise} \end{cases}   (2.108)

I_motion is incremented if the intensity difference of a pixel in two consecutive images is, according to the L2 norm, larger than a given threshold Θ. Otherwise the contribution of elapsed differences is linearly decreased by the constant T. Fig. 2.31 shows the result of this computation for head and background movements. The intensity of the visualized movement pattern indicates contiguous regions.
Fig. 2.31. Visualized movement patterns for movement of head and background (left), movement of the background alone (middle), and movement of the head alone (right).
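The update of (2.108) can be sketched for grayscale frames as follows; the threshold and decay values are placeholders, and the 1/T decay follows the reconstruction of the formula above.

```python
import numpy as np

# Sketch of the motion image update of (2.108). `prev_frame` and `frame` are
# consecutive grayscale images as float arrays, `motion` is the previous motion
# image; THETA and T are tunable constants (the values here are placeholders).

THETA = 15.0     # intensity difference threshold
T = 25           # decay constant: elapsed differences fade over T frames

def update_motion_image(motion, prev_frame, frame):
    changed = np.abs(frame - prev_frame) > THETA     # L2 norm for gray values
    decayed = np.maximum(0.0, motion - 1.0 / T)      # linear decay of old motion
    return np.where(changed, 1.0, decayed)
```

For color frames, the absolute difference would be replaced by the L2 norm over the color channels.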
For a combined skin color and movement search, the largest skin-colored object is selected and subsequently delimited by contiguous non-skin-colored regions (Fig. 2.32).
Fig. 2.32. Search mask composed of skin color (left) and motion filter (right).
Holistic face finding makes use of three separate modules. The first transforms the input image into an integral image for the efficient calculation of features. The second module supports the automatic selection of suitable features which describe variations within a face, using Ada-Boosting. The third module is a cascade classifier that sorts out insignificant regions and analyzes the remaining regions in detail. All three modules are subsequently described in more detail.

Integral Image

To compute features as easily as possible, Haar-features as presented by Papageorgiou et al. [18] are used. A Haar-feature results from calculating the luminance difference of equally sized neighboring regions. These are combined into the five constellations depicted in Fig. 2.33, where the feature M is determined by:

M(x_1, y_1, x_2, y_2) = abs(Intensity_{dark} − Intensity_{gray})   (2.109)
Fig. 2.33. The five feature constellations used.
With these features, regions showing maximum luminance difference can be found. E.g., the brow region mostly shows different intensity from the region above the brow. Hence, the luminance difference is large and the resulting feature is rated significant. However, in a 24x24 pixel region there are already 45396 options to form the constellations depicted in Fig. 2.33. This enormous number results from the fact that
the side ratio of the regions may vary. To speed up the search in such a large search space, an integral image is introduced which provides the luminance integral in an economical way. The integral image I_int at the coordinate (x, y) is defined as the sum over all intensities I(x′, y′) with x′ ≤ x and y′ ≤ y:

I_int(x, y) = \sum_{x′ ≤ x,\, y′ ≤ y} I(x′, y′)   (2.110)
By introducing the cumulative column sum S_row(x, y) two recursive calculations can be performed:

S_row(x, y) = S_row(x, y − 1) + I(x, y)     (2.111)
I_int(x, y) = I_int(x − 1, y) + S_row(x, y)     (2.112)
The application of these two equations enables the computation of a complete integral image by iterating just once through the original image. The integral image has to be calculated only once. Subsequently, the intensity integrals Fi of any rectangle are determined by executing a simple addition and two subtractions.
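A minimal NumPy sketch of the integral image and of the constant-time rectangle sum it enables is given below. The indexing convention (inclusive corners, zero-based) and the boundary handling are assumptions made for the example, not the book's implementation.

import numpy as np

def integral_image(img):
    """Cumulative sum over rows and columns, i.e. one pass through the image."""
    return img.astype(np.float64).cumsum(axis=0).cumsum(axis=1)

def rect_sum(iint, top, left, bottom, right):
    """Sum of intensities in img[top:bottom+1, left:right+1] from the integral image."""
    total = iint[bottom, right]
    if top > 0:
        total -= iint[top - 1, right]
    if left > 0:
        total -= iint[bottom, left - 1]
    if top > 0 and left > 0:
        total += iint[top - 1, left - 1]
    return total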
Fig. 2.34. Calculation of the luminance of a region with the integral image. The luminance of region F_D is I_int(x_D, y_D) + I_int(x_A, y_A) − I_int(x_B, y_B) − I_int(x_C, y_C).
In Fig. 2.34 we find:
F_A = I_int(x_A, y_A)     (2.113)
F_B = I_int(x_B, y_B) − I_int(x_A, y_A)     (2.114)
F_C = I_int(x_C, y_C) − I_int(x_A, y_A)     (2.115)
F_D = I_int(x_D, y_D) + I_int(x_A, y_A) − I_int(x_B, y_B) − I_int(x_C, y_C)     (2.116)
F_E = I_int(x_E, y_E) − I_int(x_C, y_C)     (2.117)
F_F = I_int(x_F, y_F) + I_int(x_C, y_C) − I_int(x_D, y_D) − I_int(x_E, y_E)     (2.118)
F_F − F_D = I_int(x_F, y_F) − I_int(x_A, y_A) + I_int(x_B, y_B) − I_int(x_E, y_E) + 2 I_int(x_C, y_C) − 2 I_int(x_D, y_D)     (2.119)

The approach is further exemplified in Fig. 2.35. The dark/gray constellation is the target feature. In a first step the integral image is constructed. This, in a second step, enables the economic calculation of a difference image. Each entry in this image represents the luminance difference of both regions with the upper left corner being the reference point. In the example there is a maximum at position (2, 3).
Fig. 2.35. Localizing a single feature in an integral image, here consisting of the luminance difference of the dark and gray region.
Ada-Boosting

Now the n most significant features are selected with the Ada-Boosting algorithm, a special kind of ensemble learning. The basic idea is to construct a diverse ensemble of "weak" learners and to combine them into a "strong" classifier.
The starting point is a learning algorithm A that generates a classifier h from the training data D, where:

A : D → h     (2.120)
D = {(x_i, y_i) ∈ X × Y | i = 1 . . . m}     (2.121)
h : X → Y     (2.122)

X is the set of all features; Y is the perfect classification result. The number of training examples is m. Here A was selected to be a perceptron:

h_j(x) = 1,  if p_j f_j(x) < p_j Θ_j;  0 otherwise     (2.123)

The perceptron h_j(x) depends on the feature f_j(x) and a given threshold Θ_j, whereby p_j serves for the correct interpretation of the inequality. During the search for a weak classifier a number of variants D_t with t = 1, 2, . . . , K is generated from the training data D. This results in an ensemble of classifiers h_t = A(D_t). The global classifier H(x) is then implemented as a weighted sum:

H(x) = Σ_{k=1}^{K} α_k h_k(x)     (2.124)
Ada-Boosting iteratively generates classifiers. During each iteration every training example is weighted according to its contribution to the classification error. Since the weight of "difficult" examples is steadily increased, the training of later generated classifiers concentrates on the remaining problematic cases. In the following the algorithm for the selection of suitable features by Ada-Boosting is presented.

Step 1: Initialization
• Let m_1 . . . m_n be the set of training images.
• Let b_i be a binary variable with b_i = 0 for m_i ∈ M_faces and b_i = 1 for m_i ∉ M_faces.
• Let m be the number of all faces and l the number of non-faces.
• The initialization weights are w_{1,i} = 1/(2m) for b_i = 0 and w_{1,i} = 1/(2l) for b_i = 1.

Step 2: Iterative search for the T most significant features
For t = 1, . . . , T:
• Normalize the weights, w_{t,i} = w_{t,i} / Σ_{j=1}^{n} w_{t,j}, such that w_t forms a probability distribution.
• For each feature j train a classifier h_j. The error depending on w_t is ε_j = Σ_i w_i · |h_j(x_i) − y_i|.
• Select the classifier h_t with the smallest error ε_t.
• Update the weights: w_{t+1,i} = w_{t,i} β_t^{1−e_i}, where e_i = 0 for correct and e_i = 1 for incorrect classification, and β_t = ε_t / (1 − ε_t).

Step 3: Estimation of the Resulting Classifier
• The resulting classifier finally is (a compact sketch of the whole selection loop follows below):

h(x) = 1,  if Σ_{t=1}^{T} α_t h_t(x) ≥ (1/2) Σ_{t=1}^{T} α_t;  0 otherwise,  with α_t = log(1/β_t)     (2.125)
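The following Python sketch mirrors the weight-update loop above with simple threshold ("perceptron") classifiers on precomputed feature values. The median-based threshold choice is an assumption made to keep the example short; it is an illustration of the algorithm, not the implementation described here.

import numpy as np

def adaboost_select(features, labels, T):
    """features: (n_samples, n_features) Haar feature values;
       labels: 0 for faces, 1 for non-faces (as in the text above).
       Returns T weak classifiers as (feature index, threshold, polarity, alpha)."""
    m = np.sum(labels == 0)                       # number of faces
    l = np.sum(labels == 1)                       # number of non-faces
    w = np.where(labels == 0, 1.0 / (2 * m), 1.0 / (2 * l))
    chosen = []
    for _ in range(T):
        w = w / w.sum()                           # normalize to a probability distribution
        best = None
        for j in range(features.shape[1]):
            theta = np.median(features[:, j])     # crude, illustrative threshold choice
            for p in (+1, -1):                    # polarity p_j
                h = (p * features[:, j] < p * theta).astype(float)
                err = np.sum(w * np.abs(h - labels))
                if best is None or err < best[0]:
                    best = (err, j, theta, p, h)
        err, j, theta, p, h = best
        beta = max(err, 1e-12) / (1.0 - err)
        e = (h != labels).astype(float)           # 0 if correct, 1 if misclassified
        w = w * beta ** (1.0 - e)                 # de-emphasize correctly classified examples
        chosen.append((j, theta, p, np.log(1.0 / beta)))
    return chosen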
Cascaded Classifier

This type of classifier is applied to reduce classification errors and simultaneously shorten processing time. To this end small, efficient classifiers are built that reject numerous parts of an image as non-face, thereby reducing the number of false positives, without incorrectly discarding faces (false negatives). Subsequently more complex classifiers are applied to further reduce the false-positive rate. A simple classifier for a two-region feature hereby reaches, e.g., in a first step a false-positive rate of only 40% while at the same time recognizing 100% of the faces correctly. The total system is equivalent to a truncated decision tree, with several classifiers arranged in layers called cascades (Fig. 2.36).
Fig. 2.36. Schematic diagram of the cascaded classifier: several classifiers in line are applied to each image region. Each classifier can reject that region. This approach strongly reduces the computational effort and at the same time provides very accurate results.
The false-positive rate F of a cascade is:

F = Π_{i=1}^{K} f_i     (2.126)
where f_i is the false-positive rate of one of the K layers. The recognition rate D is, respectively:

D = Π_{i=1}^{K} d_i     (2.127)
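To illustrate how these per-layer rates compound, here is a tiny numerical sketch with hypothetical values (not the rates of the system described here):

# Hypothetical per-layer rates for a K-layer cascade:
f = 0.4     # false-positive rate each layer is allowed
d = 0.99    # detection rate each layer must reach
K = 10      # number of layers

F = f ** K  # overall false-positive rate, here roughly 1e-4
D = d ** K  # overall detection rate, here roughly 0.90
print(F, D)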
The final goal for each layer is to compose a classifier from several features until a required recognition rate D is reached while the false-positive rate F is kept minimal. This is achieved by the following algorithm:
Step 1: Initialization
• The user defines the layer rates f and d as well as the overall target rate F_Target.
• Let M_faces be the set of all faces.
• Let M_¬faces be the set of all non-faces.
• F_0 = 1.0; D_0 = 1.0; i = 0

Step 2: Iterative Determination of all Classifiers of a Layer
• i = i + 1; n_i = 0; F_i = F_{i−1}
• While F_i > F_Target:
  – n_i = n_i + 1
  – Use M_faces and M_¬faces to generate with Ada-Boosting a classifier with n_i features.
  – Test the classifier of the actual layer on a separate test set to determine F_i and D_i.
  – Reduce the threshold of the i-th classifier until the classifier of the actual layer reaches a recognition rate of d × D_{i−1}.
• If F_i > F_Target, test the classifier of the actual layer on a set of non-faces and add each misclassified object to N.
2.2.2.2 Face Tracking

For side-view images of faces the described localization procedure yields only uncertain results. Therefore an additional module for tracking suitable points in the facial area is applied, using the algorithm of Tomasi and Kanade [26]. The tracking of point pairs in subsequent images can be partitioned into two subtasks. First, suitable features are extracted, which are then tracked in the following frames of a sequence. In case features are lost, new ones must be identified. If in subsequent images only small motions of the features to be tracked occur, the intensity of immediately following frames is:

I'(x, y) = I(x + ζ, y + η)     (2.128)
Here d := (ζ, η) describes the shift of point (x, y). Now the assumption holds that points whose intensity varies strongly in either vertical or horizontal direction in the first frame show similar characteristics in the second frame. Therefore, a feature is described by a small rectangular region W with center (x, y) that shows a large variance with respect to its pixel intensities. The region W is called feature window. The intensity variance is determined by the gradient g = (g_x, g_y) at the position (x, y):

G = [g_x²  g_x g_y;  g_x g_y  g_y²]     (2.129)

In case matrix G has large eigenvalues λ_1 and λ_2, the region establishes a good feature with sufficient variance. Consequently those windows W are selected for which the eigenvalue
λ = min(λ1 , λ2 )
(2.130)
is highest. In addition, further restrictions like the minimal distance between features or the minimal distance from the image border determine the selection. For the selected features, the shift in the next frame is determined based on translation models. If the intensity function is approximated linearly by I(x + ζ, y + η) = I(x, y) + gᵀ d
(2.131)
then the following equations have to be solved:

G · d = e     (2.132)

with

e = ∫_W (I(x, y) − I'(x, y)) gᵀ · ω dW     (2.133)
Here the integration runs over the feature region W, with ω being a weighting function. Since neither the translation model nor the linear approximation is exact, the equations are solved iteratively and the admissible shift d is assigned an upper limit. In addition a motion model is applied which enables the definition of a measure of feature dissimilarity by estimating the affine transform. In case this dissimilarity gets too large, the feature is substituted by a new one.

2.2.3 Feature Extraction with Active Appearance Models

For the analysis of single facial features the face is modeled by dynamic face graphs. Most common are Elastic Bunch Graphs (EBG) [27] and Active Appearance Models (AAM); the latter are treated here. An Active Appearance Model (AAM) is a statistical model combining textural and geometrical information into an appearance. The combination of outline and texture models is based on an eigenvalue approach which reduces the amount of data needed, hereby enabling real-time calculation.

Outline Model

The contour of an object is represented by a set of points placed at characteristic positions of the object. The point configuration must be stable with respect to shift, rotation and scaling. Cootes et al. [6] describe how marking points, i.e. landmarks, are selected consistently on each image of the training set. Hence, the procedure is restricted to object classes with a common basic structure. In a training set of triangles, e.g., typically the corner points are selected since they describe the essential property of a triangle and allow easy distinction of differences in the various images of a set. In general, however, objects have more complicated outlines, so that further points are needed to adequately represent the object in question as, e.g., the landmarks for a face depicted in Fig. 2.37. As soon as n points are selected in d dimensions, the outline (shape) is represented by a vector x which is composed of the single landmarks.
Fig. 2.37. 70 landmarks are used for the description of a face.
x = [x11 , . . . , x1n , x21 , . . . , x2n , . . . , xd1 , . . . , xdn ]T
(2.134)
If there are s examples in the training set, then s vectors x are generated. For statistical analysis images must be transformed to the same pose, i.e. the same position, scaling, and rotation. This is performed by a procrustes analysis which considers the outlines in a training set and minimizes the sum of distances with respect to the average outline (Fig. 2.38). For making adjustments, only outline-preserving operations must be applied in order to keep the set of points compact and to avoid unnecessary nonlinearities.
Fig. 2.38. Adjustment of a shape with respect to a reference shape by a procrustes analysis using affine transformation. Shape B is translated such that its center of gravity matches that of shape A. Subsequently B is rotated and scaled until the average distance of corresponding points is minimized (procrustes distance).
Now the s point sets are adjusted in a common coordinate system. They describe the distribution of the outline points of an object. This property is now used to construct new examples which are similar to the original training set. This means that a new example x can be generated from a parameterized model M in the form
x = M(p)     (2.135)

where p is the model parameter vector. By a suitable selection of p new examples are kept similar to the training set. The Principal Component Analysis (PCA) is a means for dimensionality reduction that first identifies the main axes of a cluster. It involves a mathematical procedure that transforms a number of possibly correlated parameters into a smaller number of uncorrelated parameters called principal components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible. For performing the PCA, first the mean vector x̄ of the s training set vectors x_i is determined by:

x̄ = (1/s) Σ_{i=1}^{s} x_i     (2.136)
Let the outline points of the model be described by n parameters x_i. The variance of a parameter x_i then describes how far its values deviate from the expectation value E(x_i). The covariance describes the linear coherence of pairs of parameters:

Cov(x_i, x_j) = E[(x_i − E(x_i)) · (x_j − E(x_j))]     (2.137)

The difference between each outline vector x_i and the average x̄ is ∂x_i. Arranging the differences as columns of a matrix yields:

D = (∂x_1 . . . ∂x_s)     (2.138)
From this the covariance matrix follows:

S = (1/s) D Dᵀ     (2.139)

from which the eigenvalues λ_i and Eigenvectors ϕ_i, which form an orthogonal basis, are calculated. Each eigenvalue represents the variance with respect to the mean value in the direction of the corresponding Eigenvector. Let the deformation matrix Φ_k contain the t largest Eigenvectors of S, i.e. those influencing the deviation from the average value most:
Φk = (ϕ1 , ϕ2 , . . . ϕt )
(2.140)
Then these t vectors form a subspace basis which again is orthogonal. Each point in the training set can now be approximated by

x ≈ x̄ + Φ_k · p_k     (2.141)
where p_k corresponds to the model parameter vector mentioned above, now with a reduced number of parameters. The number t of Eigenvectors may be selected so that the model explains the total variance to a certain degree, e.g. 98%, or so that each example in the training set is approximated to a desired accuracy. Thus an approximation for the n marked contour points with dimension d each is found for all s examples in the training set, as related to the deviations from the average outline. With this approximation, and considering the orthogonality of the Eigenvector matrix, the parameter vector p_k results in

p_k = Φ_kᵀ (x − x̄)     (2.142)
By varying the elements of pk the outline x may be varied as well. The average outline and exemplary landmark variances are depicted in (Fig. 2.39).
Fig. 2.39. Average outline and exemplary landmark variances.
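A compact NumPy sketch of this shape PCA follows. The landmark vectors are assumed to be already pose-normalized (e.g. by the Procrustes step described above); this is an illustrative sketch, not the book's implementation.

import numpy as np

def train_shape_model(X, t):
    """X: (s, 2n) matrix of aligned landmark vectors; t: number of modes to keep."""
    mean = X.mean(axis=0)
    D = (X - mean).T                          # columns are deviations from the mean
    S = D @ D.T / X.shape[0]                  # covariance matrix, cf. (2.139)
    eigval, eigvec = np.linalg.eigh(S)        # ascending eigenvalues
    order = np.argsort(eigval)[::-1][:t]      # keep the t largest modes
    return mean, eigvec[:, order], eigval[order]

def shape_params(x, mean, Phi):
    return Phi.T @ (x - mean)                 # p_k = Phi^T (x - mean), cf. (2.142)

def reconstruct_shape(mean, Phi, p):
    return mean + Phi @ p                     # x ~ mean + Phi p, cf. (2.141)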
The eigenvalue λ_i is the variance of the i-th parameter p_{k,i} over all examples in the training set. To make sure that a newly generated outline is similar to the training patterns, limits are set. Empirically it was found that the maximum deviation for the parameter p_{k,i} should be no more than ±3√λ_i (Fig. 2.40).

Texture Model

With the calculated mean outline and the principal components it is possible to reconstruct each example of the training data. This removes texture variances resulting from the various contours. Afterwards the RGB values of the normalized contour are extracted into a texture vector t_im. To minimize illumination effects, the texture vectors are normalized once more using a scaling factor α and an offset β, resulting in:
Fig. 2.40. Outline models for variations of the first three Eigenvectors φ_1, φ_2 and φ_3 between 3√λ, 0 and −3√λ.
t = (t_im − β · 1) / α     (2.143)

For the identification of t the average of all texture vectors has to be determined first. Additionally a scaling and an offset have to be estimated, because the sum of the elements must be zero and the variance should be 1. The values for α and β for normalizing t_im are then

α = t_im · t̄,    β = (t_im · 1) / n     (2.144)
where n is the number of elements in a vector. As soon as all texture vectors have been assigned scaling and offset, the average is calculated again. This procedure is iterated with the new average until the average converges to a stable value. From here the further processing is identical to that described in the outline section. Again a PCA is applied to the normalized data, which results in the following approximation for a texture vector in the training set:

t = t̄ + Φ_t · p_t     (2.145)
where Φt is again the Eigenvector matrix of the covariance matrix, now for texture vectors. The texture parameters are represented by pt .
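One possible implementation of the iterative photometric normalization described above is sketched below. The choice of the first texture as initial reference, the zero-sum/unit-norm normalization of the reference and the convergence tolerance are assumptions made for the sketch.

import numpy as np

def normalize_textures(T_raw, iters=20, tol=1e-6):
    """T_raw: (s, n) matrix of shape-free texture vectors."""
    s, n = T_raw.shape
    ref = T_raw[0] - T_raw[0].mean()
    ref = ref / np.linalg.norm(ref)             # zero-sum, unit-norm reference texture
    T = np.empty_like(T_raw, dtype=np.float64)
    for _ in range(iters):
        for i in range(s):
            beta = T_raw[i].mean()              # offset, cf. (2.144)
            alpha = T_raw[i] @ ref              # scaling, cf. (2.144)
            T[i] = (T_raw[i] - beta) / alpha    # normalized texture, cf. (2.143)
        new_ref = T.mean(axis=0)
        new_ref -= new_ref.mean()
        new_ref /= np.linalg.norm(new_ref)
        if np.linalg.norm(new_ref - ref) < tol: # stop when the average is stable
            break
        ref = new_ref
    return T, ref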
For texture description an artificial 3-band image is used which is composed of the intensity channel K_1, the I3 channel K_2 and the Y gradient K_3 (Fig. 2.41). It boosts edges and color information, facilitating the separation of the lips from the face. The color space with its three channels K_1, K_2 and K_3 is defined as:

K_1 = (R + G + B) / 3     (2.146)
K_2 = (2G − R − B) / 4     (2.147)
K_3 = (2G − R − B) / 4     (2.148)
Fig. 2.41. An artificial 3-band image (a) composed of intensity channel (b), I3 channel (c), and Y gradient (d).
2.2.3.1 Appearance Model

The parameter vectors p_k and p_t permit describing each example in the training set by outline and texture. To this end a concatenated vector p_a is generated:

p_a = [W_k p_k;  p_t]     (2.149)

where W_k is a diagonal matrix of weights for each outline parameter. This measure is needed since the elements of the outline parameter vectors p_k are pixel distances while those of the texture parameter vectors are pixel intensities; hence both cannot be compared directly. The weights are identified by systematically varying the values of the outline parameter vector and observing the effect of this measure on the texture. Since the Eigenvector matrix is orthogonal, its inverse is its transpose. Therefore:

p_k = Φ_kᵀ (x − x̄)     (2.150)
p_t = Φ_tᵀ (t − t̄)     (2.151)
By insertion we then get:

p_a = [W_k Φ_kᵀ (x − x̄);  Φ_tᵀ (t − t̄)]     (2.152)
A further PCA is applied to exploit possible correlations between the outline and texture parameters for data compression. Since the parameter vectors pk and pt describe variances, their mean is the zero vector. The PCA then results in the following approximation for an arbitrary appearance vector from the training set: pa ≈ Φa c
(2.153)
Here Φ_a denotes the appearance Eigenvector matrix and c its parameter vector. The latter controls the parameters of the outline as well as of the texture. The linearity of the model now permits applying linear algebra to formulate outline and texture as a function of c. Setting

Φ_p = [Φ_pk;  Φ_pt]     (2.154)
c = [c_pk;  c_pt]     (2.155)

results in

p_a = [W_k Φ_kᵀ (x − x̄);  Φ_tᵀ (t − t̄)] = [W_k Φ_kᵀ x − W_k Φ_kᵀ x̄;  Φ_tᵀ t − Φ_tᵀ t̄] = [Φ_pk c_pk;  Φ_pt c_pt]     (2.156)
⇔ [W_k Φ_kᵀ x;  Φ_tᵀ t] = [Φ_pk c_pk + W_k Φ_kᵀ x̄;  Φ_pt c_pt + Φ_tᵀ t̄]     (2.157)
⇔ [x;  t] = [Φ_k W_k⁻¹ Φ_pk c_pk + x̄;  Φ_t Φ_pt c_pt + t̄]     (2.158)
From this the following functions for the determination of an outline x and a texture t can be read:

x = x̄ + Φ_k W_k⁻¹ Φ_pk c_pk     (2.159)
t = t̄ + Φ_t Φ_pt c_pt     (2.160)
or, in shortened form,

x = x̄ + Q_k c_pk     (2.161)
t = t̄ + Q_t c_pt     (2.162)
with

Q_k = Φ_k W_k⁻¹ Φ_pk     (2.163)
Q_t = Φ_t Φ_pt     (2.164)
Using these equations an example model can be generated, as sketched below, by
• defining a parameter vector c,
• generating the outline-free texture image with vector t,
• deforming this texture by using the control points in x.
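The schematic sketch below generates such a model instance from the outline and texture parts of the appearance parameters. Q_k, Q_t, the mean shape and the mean texture are assumed to come from the training stage; the final warping step is only indicated in a comment, since it is not detailed here.

import numpy as np

def aam_instance(c_pk, c_pt, mean_shape, mean_texture, Qk, Qt):
    """Generate landmark positions and shape-free texture for given parameters."""
    x = mean_shape + Qk @ c_pk     # landmark positions, cf. (2.161)
    t = mean_texture + Qt @ c_pt   # shape-free texture, cf. (2.162)
    # A full implementation would now warp the texture t from the mean shape to the
    # generated landmark positions x (e.g. piecewise affine over the face triangulation).
    return x, t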
In Fig. 2.42 example appearance models for variations of the first five Eigenvectors between 3√λ, 0 and −3√λ are presented.

2.2.3.2 AAM Search

Image interpretation in general tries to recognize a known object in an a priori unknown image, even if the object is changed, e.g. by a variation in camera position. When Active Appearance Models are used for image interpretation, the model parameters have to be identified such that the difference between the model and the object in the image is minimized. Image interpretation is thus transformed into an optimization problem. By performing a PCA the amount of data is considerably reduced. Nevertheless the optimization process is rather time consuming if the search space is not limited. There are, however, possibilities to optimize the search algorithm. Each attempt to match the face graph to an actual image proceeds similarly, which means that changing one parameter of the configuration changes a specific pattern in the output image. One can prove the existence of linear dependencies in this process that can be utilized for speeding up the system. The general approach is as follows: The distance vector ∂I between a given image and the image generated by an AAM is:

∂I = I_i − I_m     (2.165)

For minimizing this distance, the error Δ = |∂I|² is considered. Optimization then concerns
• learning the relation between the image difference ∂I and deviations in the model parameters, and
• minimizing the distance error Δ by applying the learned relation iteratively.
Experience shows that a linear relation between ∂I and ∂c is a good approximation: ∂c = A · ∂I
(2.166)
To identify A a multiple multivariate linear regression is performed on a selection of model variations and corresponding difference images ∂I.
Fig. 2.42. Appearance models for variations of the first five Eigenvectors c_1, c_2, c_3, c_4 and c_5 between 3√λ, 0 and −3√λ.
The selection can be generated by carefully, randomly changing the model parameters for the given examples. For the determination of the error Δ the reference points have to be identical. Therefore the image outlines to be compared are made identical and their textures are compared. In addition to variations in the model parameters, small deviations in pose, i.e. translation, scaling and orientation, are introduced. These additional parameters are made part of the regression by simply adding them to the model parameter deviation vector ∂c. To maintain model linearity, pose
is varied only by scaling s, rotation Θ and translation in x-/y-direction (t_x, t_y). The corresponding parameters are s_x, s_y, t_x, t_y, where s_x = s · cos(Θ) − 1 and s_y = s · sin(Θ) perform scaling and rotation in combination. To identify the relation between ∂c, ∂I, and ∂t the following steps are performed for each image in the training set:
• Initialize the starting vector c_0 with the fixed appearance parameters c of the actually selected texture t_im.
• Define arbitrary deviations for the parameters in ∂c and compute new model parameters c = ∂c + c_0.
• With this new parameter vector c, generate the corresponding target contour, which is then used to normalize t_i such that its outline matches that of the modeled object.
• Then generate a texture image t_m free of outline by applying c. Now the difference to the current image induced by ∂c can be calculated as ∂t = t_n − t_m.
• Note ∂c and the corresponding ∂t and iterate several times for the same example image.
• Then iterate over all images in the training set.
The selected ∂c should be made as large as possible. Empirical evaluations have shown that the optimal region for the model parameters is 50% of the training set standard deviation, 10% scaling, and 2 pixels translation. For larger values the assumption of a linear relation is no longer warranted. The identified value pairs ∂c and ∂t are now submitted to a multiple multivariate regression to find A. In general a regression analysis examines the effect of influence variables on some target variable. Here, more specifically, the linear dependency of the single induced image differences of texture ∂t is examined, which in turn depends on the selected parameter deviations ∂c. So A encodes a-priori knowledge which is gained during the iteration described above.

By establishing the relation between model parameter deviations and their effect on generated images the first part of the optimization problem has been solved. What follows in a second step is error minimization. This is achieved by iterations, whereby the linear model described earlier predicts for each iteration the change of model parameters which lets the model converge to an optimum. Convergence is assumed as soon as the error reaches a given threshold. The estimated model parameters c_0 are initialized with those of the standard model, i.e. with the average values, and once again the normalized image t_n is involved. The algorithm proceeds as follows:
• Identify the image difference ∂t_0 = t_n − t_m and the error E_0 = |∂t_0|².
• Compute the estimated deviation ∂c = A · ∂t_0.
• Set k = 1 and c_1 = c_0 − k ∂c.
• Generate the image to estimate the parameter vector c_1, which results in a new t_m.
• Execute for the given normalized image a further normalization, this time for the outline given by the new parameter vector c_1.
• Compute the new error vector ∂t_1.
• Repeat the procedure until the error E_i converges. In case the error E_i does not converge, retry with smaller values for k, e.g., 0.5, 0.25, etc.
A schematic sketch of this iteration is given below.
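In the sketch below, model.sample_texture and model.texture are hypothetical helpers standing in for the image sampling/normalization and model generation steps described above; the damped step sizes follow the halving scheme just mentioned. This is an illustration of the search loop, not the system's actual code.

import numpy as np

def aam_search(image, c0, A, model, max_iter=30, tol=1e-6):
    """Iteratively refine appearance parameters c so the model matches the image."""
    c = c0.copy()
    t_n = model.sample_texture(image, c)       # normalized image texture under current shape
    t_m = model.texture(c)                     # shape-free model texture
    err = np.sum((t_n - t_m) ** 2)
    for _ in range(max_iter):
        dc = A @ (t_n - t_m)                   # predicted parameter correction, cf. (2.166)
        for k in (1.0, 0.5, 0.25, 0.125):      # damped update: retry with smaller k
            c_try = c - k * dc
            t_n_try = model.sample_texture(image, c_try)
            t_m_try = model.texture(c_try)
            err_try = np.sum((t_n_try - t_m_try) ** 2)
            if err_try < err:
                c, t_n, t_m, err = c_try, t_n_try, t_m_try, err_try
                break
        else:
            break                              # no step size improved the error: converged
        if err < tol:
            break
    return c, err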
This results in a unique configuration of model parameters for describing an object, here the face, in a given image. Hence the object can be assigned to a particular object class, i.e. to the one for which the model was constructed. In case the configuration coincides with a normalized example of the training set, the object is not only localized but also identified. To give an example: if the optimization algorithm determines a graph with a parameter configuration very similar to the parameters of a particular view in the training set, then the actually localized face could match that of the trained person.

2.2.4 Feature Classification

To localize areas of interest, a face graph is iteratively matched to the face using a modified Active Appearance Model (AAM) which exploits geometrical and textural information within a triangulated face. The AAM must be trained with data of many subjects. Facial features of interest which have to be identified in this processing stage are head pose, line of sight, and lip outline.

2.2.4.1 Head Pose Estimation

For the identification of the posture of the head two approaches are pursued in parallel. The first is an analytical method based on the trapezoid described by the outer corners of the eyes and the corners of the mouth. The four required points are taken from a matched face graph. The second, holistic method also makes use of a face graph. Here the face region is derived from the convex hull that contains all nodes of the face graph (Fig. 2.43). Then a transformation of the face region into the eigenspace takes place. The basis of the eigenspace is created by the PCA transformation of 60 reference views which represent different head poses. The Euclidean distance of the actual view's projection in the eigenspace allows the estimation of the head pose. Finally the results of the analytic and the holistic approach are compared. In case of significant differences the actual head pose is estimated by utilizing the last correct result combined with a prediction involving the optical flow.

Analytic Approach

In a frontal view the area between eye and mouth corners appears to be symmetrical. If, however, the view is not frontal, the face plane will be distorted. To identify the roll angle σ and the pitch angle τ the distorted face plane is transformed into a frontal view (Fig. 2.44). A point x on the distorted plane is transformed into an undistorted point x' by

x' = U x + b     (2.167)
where U is a 2x2 transformation matrix and b the translation. It can be shown that U can be decomposed to:
Fig. 2.43. Head pose is derived by a combined analytical and holistic approach.
U = λ1 R(Θ)P (λ, τ )
(2.168)
Here R(Θ) is a rotation around Θ and P(λ, τ) is the following symmetric matrix:

P(λ, τ) = R(τ) [λ  0;  0  1] R(−τ)     (2.169)

with

R(τ) = [cos τ  −sin τ;  sin τ  cos τ]     (2.170)
The decomposition of U shows the ingredients of the linear back transformation, consisting of an isotropic scaling by λ_1, a rotation around the optical axis of the virtual camera corresponding to R(Θ), and a scaling by factor λ in the direction of τ. With two orthogonal vectors a (COG of the eyes – outer corner of the left eye) and b (COG of the eyes – COG of the mouth corners) the transformation matrix U is:

U (a b) = [0  ½R_e;  −1  0]     (2.171)
⇔ U = [0  ½R_e;  −1  0] (a b)⁻¹     (2.172)
where R_e is the ratio of a and b. Since a and b are parallel and orthogonal with respect to the symmetry axes, it follows:
Fig. 2.44. Back transformation of a distorted face plane (right) on an undistorted frontal view of the face (left) which yields roll and pitch angle.
aT · U T · U · b = 0
(2.173)
Hence Uᵀ · U must be a symmetric matrix:

Uᵀ · U = [α  β;  β  γ]     (2.174)
Also it follows that:

tan(2τ) = 2β / (α − γ)     (2.175)

and

cos(σ) = 1 / (√μ + √(μ − 1))     (2.176)

with

μ = (α + γ)² / (4(α · γ − β²))

Finally the normal vector for the head pose is:

n = [sin(σ) · cos(τ);  sin(σ) · sin(τ);  −cos(σ)]     (2.177)
Due to the trigonometric functions involved, roll and pitch angles are always ambiguous, i.e. the trapezoid described by mouth and eye corners is identical for different head poses. This problem can be solved by considering an additional point such as, e.g., the nose tip, to fully determine the system.
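Given the entries α, β, γ of UᵀU, the angles and the normal vector follow directly from equations (2.175)-(2.177). The small sketch below uses arctan2 to pick a quadrant for 2τ and assumes μ ≥ 1; the disambiguation via an additional point such as the nose tip, mentioned above, is not included.

import numpy as np

def head_pose_normal(alpha, beta, gamma):
    tau = 0.5 * np.arctan2(2.0 * beta, alpha - gamma)           # cf. (2.175)
    mu = (alpha + gamma) ** 2 / (4.0 * (alpha * gamma - beta ** 2))
    sigma = np.arccos(1.0 / (np.sqrt(mu) + np.sqrt(mu - 1.0)))  # cf. (2.176)
    return np.array([np.sin(sigma) * np.cos(tau),
                     np.sin(sigma) * np.sin(tau),
                     -np.cos(sigma)])                           # normal vector, cf. (2.177)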
Fig. 2.45. Pose reconstruction from facial geometry distortion is ambiguous as may be seen by comparing the left and right head pose.
Holistic Approach

A second method for head pose estimation makes use of the Principal Component Analysis (PCA). It transforms the face region into a data space of Eigenvectors that results from different poses. The monochrome images used for reference are first normalized by subtracting the average intensity from the individual pixels and then dividing the result by the standard deviation. Hereby variations resulting from different illuminations are averaged out. For n views of a rotating head a pose eigenspace (PES) can be generated by performing the PCA. A reference image sequence distributed equally between −60 and 60 degrees yaw angle is collected, where each image has size m = w × h and is represented as an m-dimensional row vector x. This results in a set of such vectors (x_1, x_2, . . . , x_n). The average μ and the covariance matrix S are calculated by:

μ = (1/n) Σ_{i=1}^{n} x_i     (2.178)

S = (1/n) Σ_{i=1}^{n} [x_i − μ][x_i − μ]ᵀ     (2.179)
Let ϕ_j with j = 1 . . . n be the Eigenvectors of S with the largest corresponding eigenvalues λ_j:

S · ϕ_j = λ_j · ϕ_j     (2.180)
82
Non-Intrusive Acquisition of Human Action
The n Eigenvectors establish the basis for the PES. For each image of size w × h a pattern vector Ω(x) = [ω1 , ω2 , . . . , ωn ] may be found by projection with the Eigenvectors φj . ωj = ϕTj (x − μ) with j = 1, . . . , n
(2.181)
After normalizing the pattern vector with the eigenvalues the image can be projected into the PES:

Ω_norm(x) = [ω_1/λ_1, ω_2/λ_2, · · · , ω_n/λ_n]     (2.182)

The reference image sequence then corresponds to a curve in the PES. This is illustrated by Fig. 2.46 where only the first three Eigenvectors are depicted.
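Projection into the PES and pose lookup by the nearest reference point can be sketched as follows; the arrays mu, Phi, lam, ref_points and ref_angles are assumed to come from the reference sequence described above.

import numpy as np

def project_pes(x, mu, Phi, lam):
    """Project a flattened, intensity-normalized face patch into the pose eigenspace."""
    omega = Phi.T @ (x - mu)            # pattern vector, cf. (2.181)
    return omega / lam                  # normalized pattern vector, cf. (2.182)

def estimate_yaw(x, mu, Phi, lam, ref_points, ref_angles):
    """ref_points: PES projections of the reference views; ref_angles: their yaw angles."""
    q = project_pes(x, mu, Phi, lam)
    d = np.linalg.norm(ref_points - q, axis=1)
    return ref_angles[np.argmin(d)]     # nearest reference view determines the pose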
Fig. 2.46. top: Five of sixty synthesized views. middle: masked faces. bottom: cropped faces. right: Projection into the PES.
In spite of color normalization the sequence projection is influenced significantly by changes in illumination as illustrated by Fig. 2.47. Here three image sequences with different illumination are presented together with the resulting projection curves in the PES.
Fig. 2.47. Variations in the PES due to different illumination of the faces.
To compensate for this influence a Gabor Wavelet Transformation (GWT) is applied, which is the convolution with a Gabor wavelet in Fourier space. Here a combination of the four GWT responses corresponding to 0°, 45°, 90° and 135° is used. As a result the curves describing head pose for different illuminations lie closer together in the PES (Fig. 2.48).
Fig. 2.48. Here four Gabor wavelet transforms are used instead of image intensity. The influence of illumination variations is much smaller than in Fig. 2.47.
Now, if the pose of a new view is to be determined, this image is projected into the PES as well, and subsequently the reference point with the smallest Euclidean distance is identified.

2.2.4.2 Determination of Line of Sight

For iris localization the circle Hough transformation is used, which supports the reliable detection of circular objects in an image. Since the iris contains little red hue, only the red channel is extracted from the eye region; it contains a high contrast between skin and iris. By the computation of directed X and Y gradients in the red channel image the amplitude and phase between the iris and its environment are emphasized. The Hough space is created using an extended Hough transform. It is applied to a search mask which is generated based on a threshold segmentation of the red channel image. Local maxima then point to circular objects in the original image. The iris by its very nature represents a homogeneous area. Hence the local extremes in the Hough space are validated by verifying the filling degree of the circular area with the expected radius in the red channel. Finally the line of sight is determined by comparing the intensity distributions around the iris with trained reference distributions using a maximum likelihood classifier. In Fig. 2.49 the entire concept for line of sight analysis is illustrated.

2.2.4.3 Circle Hough Transformation

The Hough transformation (HT) represents digital images in a special parameter space called Hough space. For a line, e.g., this space is two dimensional and has the coordinates slope and shift. For representing circles and ellipses a Hough space of higher dimension is required.
Fig. 2.49. Line of sight is identified based on amplitude and phase information in the red channel image. An extended Hough transformation applied to a search mask finds circular objects in the eye region.
For the recognition of circles, the circle Hough transformation (CHT) is applied. Here the Hough space H(x, y, r) corresponds to a three dimensional accumulator that indicates the probability that there is a circle with radius r at position (x, y) in the input image. Let C_r be a circle with radius r:

C_r = {(i, j) | i² + j² = r²}     (2.183)

Then the Hough space is described by:

H(x, y, r) = Σ_{(i,j) ∈ C_r} I(x + i, y + j)     (2.184)
This means that for each position P of the input image where the gradient is greater than a certain threshold, a circle with a given radius is created in the Hough space. The contributions of the single circles are accumulated in Hough space. A local maximum at position P then points to a circle center in the input image, as illustrated in Fig. 2.50 for 4 and 16 nodes, if a particular radius is searched for. For the application at hand the procedure is modified such that the accumulator in Hough space is no longer incremented by full circles but by circular arc segments (Fig. 2.51). The sections of the circular arc are incremented orthogonally with respect to the phase, to both sides.
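A naive accumulation for a single radius can be sketched as follows; the gradient thresholding that produces the edge mask and the arc-segment refinement described above are omitted here, so this is an illustration of the basic CHT only.

import numpy as np

def hough_circle(edge_mask, r):
    """edge_mask: boolean image of strong-gradient pixels; r: circle radius in pixels."""
    h, w = edge_mask.shape
    acc = np.zeros((h, w))
    angles = np.linspace(0.0, 2.0 * np.pi, max(8, int(8 * r)), endpoint=False)
    di = np.round(r * np.sin(angles)).astype(int)
    dj = np.round(r * np.cos(angles)).astype(int)
    ys, xs = np.nonzero(edge_mask)
    for y, x in zip(ys, xs):
        ci, cj = y + di, x + dj                         # candidate circle centers
        ok = (ci >= 0) & (ci < h) & (cj >= 0) & (cj < w)
        acc[ci[ok], cj[ok]] += 1.0                      # accumulate votes
    return acc                                          # local maxima mark circle centers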
Fig. 2.50. In the Hough transformation all points in an edge-coded image are interpreted as centers of circles. In Hough space the contributions of single circles are accumulated. This results in local maxima which correspond to circle centers in the input image. This is exemplarily shown for 4 and 16 nodes.
Fig. 2.51. Hough transformation for circular arc segments which facilitates the localization of circles by reducing noise. This is exemplarily shown for 4 and 16 nodes.
For line of sight identification the eye's sclera is analyzed. Following iris localization, a concentric circle with twice the iris diameter is described around it. The resulting annulus is divided into eight segments. The intensity distribution of these segments then indicates the line of sight. This is illustrated by Fig. 2.52, where distributions for two different lines of sight are presented; they are similar for both eyes but dissimilar for different lines of sight. A particular line of sight associated with a distribution as shown in Fig. 2.52 is then identified with a maximum likelihood classifier.

The maximum likelihood classifier assigns a feature vector g = m(x, y) to one of t object classes. Here g consists of the eight intensities of the circular arc segments mentioned above. For classifier training a sample of 15 (5 × 3) different lines of sight was collected from ten subjects. From these data the average vector μ_i (i = 0 . . . 14), the covariance matrix S_i, and the inverse covariance matrix S_i⁻¹ were calculated. Based on this the statistical spread vector s_i could be calculated.
Fig. 2.52. Line of sight identification by analyzing intensities around the pupil.
From these quantities the rejection threshold d_z,i = s_iᵀ · S_i⁻¹ · s_i for each object class is determined. For classification the Mahalanobis distances d_i = (g − μ_i)ᵀ S_i⁻¹ (g − μ_i) of the actual feature vector to all classes are calculated. Afterwards a comparison with the pretrained thresholds takes place. The feature vector g is then assigned to the class k_j for which the Mahalanobis distance is minimal and lower than the rejection threshold of that class (Fig. 2.53).
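A sketch of this minimum-Mahalanobis-distance classification with rejection is given below; the per-class statistics are assumed to be precomputed as described above.

import numpy as np

def classify_gaze(g, means, inv_covs, rejection):
    """g: 8-dim intensity feature; means/inv_covs/rejection: per-class statistics."""
    best_class, best_d = None, np.inf
    for k, (mu, S_inv, dz) in enumerate(zip(means, inv_covs, rejection)):
        d = (g - mu) @ S_inv @ (g - mu)     # Mahalanobis distance to class k
        if d < best_d and d < dz:           # accept only below the rejection threshold
            best_class, best_d = k, d
    return best_class                       # None if every class rejected g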
Fig. 2.53. A maximum likelihood classifier for line of sight identification. A feature vector is assigned to that class to which the Mahalanobis distance is smallest.
2.2.4.4 Determination of Lip Outline

A point distribution model (PDM) in connection with an active shape model (ASM) is used for lip outline identification. For ASM initialization the lip borders must be segmented from the image as accurately as possible.

Lip Region Segmentation

Segmentation is based on four different feature maps which all emphasize the lip region against its surroundings by color and gradient information (Fig. 2.54).
Fig. 2.54. Identification of lip outline with an active shape model. Four different features contribute to this process.
The first feature represents the probability that a pixel belongs to the lips. The required ground truth, i.e. lip color histograms, has been derived from 800 images segmented by hand. The a posteriori probabilities for lips and background are then calculated as already described in Sect. 2.1.2.1. In the second feature the nonlinear LUX color space (Logarithmic hUe eXtention) is exploited to emphasize the contrast between lips and surrounding [14], where L is the luminance, and U and X incorporate chromaticity. Here the U channel is used as a feature.
L = (R + 1)^0.3 · (G + 1)^0.6 · (B + 1)^0.1 − 1     (2.185)

U = (256/2) · (R + 1)/(L + 1),  if R < L;   256 − (256/2) · (L + 1)/(R + 1),  otherwise     (2.186)

X = (256/2) · (B + 1)/(L + 1),  if B < L;   256 − (256/2) · (L + 1)/(B + 1),  otherwise     (2.187)
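A direct per-pixel implementation of these definitions, vectorized over an RGB image, could look as follows; the value range [0, 255] for the input channels is an assumption.

import numpy as np

def rgb_to_lux(rgb):
    """rgb: float array of shape (h, w, 3) with values in [0, 255]."""
    R, G, B = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    L = (R + 1) ** 0.3 * (G + 1) ** 0.6 * (B + 1) ** 0.1 - 1          # cf. (2.185)
    U = np.where(R < L, 128 * (R + 1) / (L + 1),
                 256 - 128 * (L + 1) / (R + 1))                        # cf. (2.186)
    X = np.where(B < L, 128 * (B + 1) / (L + 1),
                 256 - 128 * (L + 1) / (B + 1))                        # cf. (2.187)
    return np.stack([L, U, X], axis=-1)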
A third feature makes use of the I1 I2 I3 color space, which is also suited to emphasize the contrast between lips and surroundings. The three channels are defined as follows:

I_1 = (R + G + B) / 3     (2.188)
I_2 = (R − B) / 2     (2.189)
I_3 = (2G − R − B) / 4     (2.190)

The fourth feature utilizes a Sobel operator that emphasizes the edges between the lips and the skin or beard region:

[−1 0 1; −2 0 2; −1 0 1]   or   [−1 −2 −1; 0 0 0; 1 2 1]     (2.191)

The filter mask is convolved with the corresponding image region. The four different features need to be combined to establish an initialization mask. For fusion, a logical OR without individual weighting was selected, as weighting did not improve the results. While the first three features support the segmentation between lips and surroundings, the gradient map serves to unite single regions in cases where upper and lower lips are segmented separately due to dark corners of the mouth or to teeth.

Nevertheless, segmentation errors still occur when the color transition between lips and surroundings is gradual. This effect is observed if, e.g., shadow falls below the lips, making that region dark. Experiments show that morphological operators like erosion or dilation are inappropriate to smooth out such distortions. Therefore an iterative procedure is used which relates the center of gravity of the rectangle surrounding the lips, Cog_Lip, to the center of gravity of the segmented area, Cog_Area. If Cog_Lip lies above Cog_Area, this is most likely due to line-shaped artifacts. These are then erased line by line until a recalculation results in approximately identical centers of gravity (Fig. 2.55).

2.2.4.5 Lip Modeling
For the collection of ground truth data, mouth images were taken from 24 subjects with the head filling the full image format. Each subject had to perform 15 visually distinguishable mouth forms under two different illumination conditions. Subsequently upper and lower lip, mouth opening, and teeth were segmented manually
Fig. 2.55. The mask resulting from fused features (left) is iteratively curtailed until the distance of the center of gravity from the bounding box to the center of gravity of the area is minimized (right).
in these images. Then 44 points were equally assigned on the outline, with point 1 and point 23 being the mouth corners. It is possible to use vertically symmetrical and asymmetrical PDMs. Research into this shows that most lip outlines are asymmetrical, so asymmetrical models fit better. Since the segmented lip outlines vary in size, orientation, and position in the image, all points have to be normalized accordingly. The average form of the training lip outlines, their Eigenvector matrix and variance vector then form the basis for the PDMs, which are similar to the outline models of the Active Appearance Models. The resulting PDMs are depicted in Fig. 2.56, where the first four Eigenvectors have been varied in a range between 3√λ, 0 and −3√λ.
Fig. 2.56. Point distribution models for variations of the first four Eigenvectors φ_1, φ_2, φ_3, φ_4 between 3√λ, 0 and −3√λ.
2.2.5 Facial Feature Recognition – Eye Localization Application

To run the example application for eye localization, please install the rapid prototyping system IMPRESARIO from the accompanying CD as described in Sect. B.1. After starting IMPRESARIO, choose File → Open to load the process graph named FaceAnalysis_Eye.ipg (Fig. 2.57). The data flow is straightforward and corresponds to the structure of the sections in this chapter. All six macros are described below.

Image Sequence: Acquires a sequence of images from hard disk or other storage media. In the following, each image is processed individually and independently. Other sources may be used by replacing this macro with the appropriate source macro (for instance, Video Capture Device for a video stream from a USB cam).

SplitImage: Splits the image into the three basic channels Red – Green – Blue. It is possible to split all channels at the same time or to extract just one channel (here the Red channel is used).

Gradient: This is a simple wrapper for the convolution functor with some convenience parameterization to choose between different common gradient kernels. Not only the classical simple difference computation (right minus left for the x direction or bottom minus top for the y direction) and the classical Sobel, Prewitt, Robinson, Roberts and Kirsch kernels can be used, but also more sophisticated optimal kernels (see lti::gradientKernelX) and the approximation using oriented Gaussian derivatives. (Output here: an absolute channel computed from the X and Y gradient channels.)

ConvertChannel: The output of the preceding stage is a floating point gray value image I(x, y) ∈ [0, 1]. Since the next macro requires a fixed point gray value image I(x, y) ∈ {0, 1, . . . , 255}, a conversion is performed here.

EyeFinder: Creates the Hough space by utilizing the gradient information of the red input channel. Local maxima in the Hough space are taken as the centers of the eyes. There are two possibilities to visualize the results: first, the Hough space itself can be shown; second, the contours of the iris can be highlighted in green.

2.2.6 Facial Feature Recognition – Mouth Localization Application

To run the example application for mouth localization, please install the rapid prototyping system IMPRESARIO from the accompanying CD as described in Sect. B.1. After starting IMPRESARIO, choose File → Open to load the process graph named FaceAnalysis_Mouth.ipg (Fig. 2.58). The data flow is straightforward and corresponds to the structure of the sections in this chapter. All six macros are described below.
Fig. 2.57. Process graph of the example application for eye-localization.
Image Sequence: Acquires a sequence of images from hard disk or other storage media. In the following, each image is processed individually and independently. Other sources may be used by replacing this macro with the appropriate source macro (for instance, Video Capture Device for a video stream from a USB cam).

SplitImage: Splits the image into the three basic channels Red – Green – Blue. It is possible to split all channels at the same time or to extract just one channel (here the Red channel is used).

Gradient: This is a simple wrapper for the convolution functor with some convenience parameterization to choose between different common gradient kernels. Not only the classical simple difference computation (right minus left for the x direction or bottom minus top for the y direction) and the classical Sobel, Prewitt, Robinson, Roberts and Kirsch kernels can be used, but also more sophisticated optimal kernels (see lti::gradientKernelX) and the approximation using oriented Gaussian derivatives. (Output here: an absolute channel computed from the X and Y gradient channels.)

GrayWorldConstancy: Performs a color normalization on an image using a gray world approach in order to eliminate effects of illumination color. Gray world methods and white world methods assume that some low order statistics on the
colors of a real world scene lie on the gray line of a color space. The simplest approach uses only the mean color and maps it linearly onto one specific point on the gray line, assuming that a diagonal transform (a simple channel scaling) is valid. Normalization is done under the assumption of Mondrian worlds, constant point light sources and delta function camera sensors.

ASMInitializer: Creates an initialization mask for the MouthFinder. For this purpose three color-based maps are generated and combined with the gradient information. The result is a gray image that represents the probability for each pixel of belonging to the background or to the lips.

ObjectsFromMask: Outputs a list of all detected regions sorted descending by area (which may be empty if no regions were found). A value of zero for the parameter Threshold uses the algorithm shown in Fig. 2.6 for automatic threshold computation. If this fails, the threshold can be set manually. The remaining parameters control further aspects of the segmentation process and should be left at their defaults.

MouthFinder: Utilizes point distribution models to extract the mouth contour. First, a selectable number of points is extracted from the initialization mask given by the module ASMInitializer. Afterwards the points are interpolated by a spline. The resulting contour serves for the initialization of the Active Shape Models.
References

1. Akyol, S. and Canzler, U. An Information Terminal using Vision Based Sign Language Recognition. In Büker, U., Eikerling, H.-J., and Müller, W., editors, ITEA Workshop on Virtual Home Environments, VHE Middleware Consortium, volume 12, pages 61–68. 2002.
2. Akyol, S., Canzler, U., Bengler, K., and Hahn, W. Gesture Control for Use in Automobiles. In Proceedings of the IAPR MVA 2000 Workshop on Machine Vision Applications. 2000.
3. Bley, F., Rous, M., Canzler, U., and Kraiss, K.-F. Supervised Navigation and Manipulation for Impaired Wheelchair Users. In Thissen, W., Wierings, P., Pantic, M., and Ludema, M., editors, Proceedings of the IEEE International Conference on Systems, Man and Cybernetics. Impacts of Emerging Cybernetics and Human-Machine Systems, pages 2790–2796. 2004.
4. Canzler, U. and Kraiss, K.-F. Person-Adaptive Facial Feature Analysis for an Advanced Wheelchair User-Interface. In Drews, P., editor, Conference on Mechatronics & Robotics, volume Part III, pages 871–876. Sascha Eysoldt Verlag, September 2004.
5. Canzler, U. and Wegener, B. Person-adaptive Facial Feature Analysis. The Faculty of Electrical Engineering, page 62, May 2004.
6. Cootes, T., Edwards, G., and Taylor, C. Active Appearance Models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6):681–685, 2001.
7. Csink, L., Paulus, D., Ahlrichs, U., and Heigl, B. Color Normalization and Object Localization. In 4. Workshop Farbbildverarbeitung, Koblenz, pages 49–55. 1998.
Fig. 2.58. Process graph of the example application for mouth localization.

8. Dugad, R. and Desai, U. B. A Tutorial on Hidden Markov Models. Technical Report SPANN-96.1, Department of Electrical Engineering, Indian Institute of Technology, May 1996.
9. Finlayson, G. D., Schiele, B., and Crowley, J. L. Comprehensive Colour Image Normalization. Lecture Notes in Computer Science, 1406(1406), 1998.
10. Jain, A. K. Fundamentals of Digital Image Processing. Prentice-Hall, 1989.
11. Jelinek, F. Statistical Methods for Speech Recognition. MIT Press, 1998. ISBN 0-262-10066-5.
12. Jones, M. and Rehg, J. Statistical Color Models with Application to Skin Detection. Technical Report CRL 98/11, Compaq Cambridge Research Lab, December 1998.
13. Juang, B. H. and Rabiner, L. R. The Segmental k-means Algorithm for Estimating the Parameters of Hidden Markov Models. In IEEE Trans. Acoust., Speech, Signal Processing, volume ASSP-38, pages 1639–1641. 1990.
14. Lievin, M. and Luthon, F. Nonlinear Colour Space and Spatiotemporal MRF for Hierarchical Segmentation of Face Features in Video. IEEE Transactions on Image Processing, 13:63–71, 2004.
15. Mitchell, T. M. Machine Learning. McGraw-Hill, 1997.
16. Nelder, J. and Mead, R. A Simplex Method for Function Minimization. Computer Journal, 7(4):308–313, 1965.
17. Pantic, M. and Rothkrantz, L. J. M. Expert Systems for Automatic Analysis of Facial Expressions. In Image and Vision Computing Journal, volume 18, pages 881–905. 2000.
18. Papageorgiou, C., Oren, M., and Poggio, T. A General Framework for Object Detection. pages 555–562. 1998.
19. Rabiner, L. and Juang, B.-H. An Introduction to Hidden Markov Models. IEEE ASSP Magazine, 3(1):4–16, 1986.
20. Rabiner, L. R. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Readings in Speech Recognition, 1990.
21. Schukat-Talamazzini, E. Automatische Spracherkennung. Vieweg Verlag, Braunschweig/Wiesbaden, 1995.
22. Schürmann, J. Pattern Classification. John Wiley & Sons, 1996.
23. Sherrah, J. and Gong, S. Tracking Discontinuous Motion Using Bayesian Inference. In Proceedings of the European Conference on Computer Vision, pages 150–166. Dublin, Ireland, 2000.
24. Sonka, M., Hlavac, V., and Boyle, R. Image Processing, Analysis and Machine Vision. Brooks Cole, 1998.
25. Steger, C. On the Calculation of Arbitrary Moments of Polygons. Technical Report FGBV–96–05, Forschungsgruppe Bildverstehen (FG BV), Informatik IX, Technische Universität München, October 1996.
26. Tomasi, C. and Kanade, T. Detection and Tracking of Point Features. Technical Report CS-91-132, CMU, 1991.
27. Wiskott, L., Fellous, J., Krüger, N., and von der Malsburg, C. Face Recognition by Elastic Bunch Graph Matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):775–779, 1997.
28. Zieren, J. and Kraiss, K.-F. Robust Person-Independent Visual Sign Language Recognition. In Proceedings of the 2nd Iberian Conference on Pattern Recognition and Image Analysis, IbPRIA 2005, Lecture Notes in Computer Science. 2005.
Chapter 3
Sign Language Recognition

Jörg Zieren¹, Ulrich Canzler², Britta Bauer³, Karl-Friedrich Kraiss

¹ Sect. 3.1   ² Sect. 3.2   ³ Sect. 3.3

Sign languages are naturally developed and fully fledged languages used by the Deaf and Hard-of-Hearing for everyday communication among themselves. Information is conveyed visually, using a combination of manual and nonmanual means of expression (hand shape, hand posture, hand location, hand motion; head and body posture, gaze, mimic). The latter encode e.g. adjectives and adverbials, or provide specialization of general items. Some signs can be distinguished by manual parameters alone, while others remain ambiguous unless additional nonmanual information is made available. Unlike pantomime, sign language does not include its environment. Signing takes place in a 3D space close to the trunk and the head, called signing space.

The grammar of sign language is fundamentally different from spoken language. The structure of a sentence in spoken language is linear, one word followed by another, whereas in sign language, a simultaneous structure exists with a parallel temporal and spatial configuration. The configuration of a sign language sentence carries rich information about time, location, person, or predicate. Thus, even though the duration of a sign is approximately twice as long as the duration of a spoken word, the duration of a signed sentence is about the same.

Unfortunately, very few hearing people speak sign language. The use of interpreters is often prohibited by limited availability and high cost. This leads to problems in the integration of deaf people into society and conflicts with an independent and self-determined lifestyle. For example, most deaf people are unable to use the World Wide Web and communicate by SMS and email in the way hearing people do, since they commonly do not surpass the level of literacy of a 10-year-old. The reason for this is that hearing people learn and perceive written language as a visual representation of spoken language. For deaf people, however, this correspondence does not exist, and letters – which encode phonemes – are just symbols without any meaning.

In order to improve communication between Deaf and Hearing, research in automatic sign language recognition is needed. This work shall provide the technical requirements for translation systems and user interfaces that support the integration of deaf people into the hearing society. The aim is the development of a mobile system, consisting of a laptop equipped with a webcam, that visually reads a signer's gestures and mimic and performs a translation into spoken language. This device is intended as an interpreter in everyday life (e.g. at the post office or in a bank). Furthermore, it allows deaf people intuitive access to electronic media such as computers or the internet (Fig. 3.1).
Fig. 3.1. Laptop and webcam based sign language recognition system.
State of the Art in Automatic Sign Language Recognition

Since a two-dimensional video signal is significantly more complex than a one-dimensional audio signal, the current state in sign language recognition is roughly 30 years behind speech recognition. In addition, sign language is by far not fully explored yet. Little is known about syntax and semantics, and no dictionary exists. The lack of national media – such as radio, TV, and telephone for the hearing – leads to strong regional variation. For a large number of signs, there is not even a common definition.

Sign language recognition has become the subject of scientific publications only in the beginning of the 90s. Most presented systems operate near real-time and require 1 to 10 seconds of processing time after completion of the sign. For video-based approaches, details on camera hardware and resolution are rarely published, suggesting that professional equipment, high resolution, low noise, and optimal camera placement were used.

The method of data acquisition defines a user interface's quality and constitutes the primary feature for classification of different works. The most reliable, exact, and at the same time simplest techniques are intrusive: data gloves measure the
flexion of the finger joints; optical or magnetic markers placed on face and hands facilitate a straightforward determination of mimic and manual configuration. For the user, however, this is unnatural and restrictive. Furthermore, data gloves are unsuitable for practical applications due to their high cost. Also, most existing systems exploit manual features only; so far facial features were rarely used [17].

A recognition system's usability is greatly influenced by the robustness of its image processing stage, i.e. its ability to handle inhomogeneous, dynamic, or generally uncontrolled backgrounds and suboptimal illumination. Many publications do not explicitly address this issue, which – in connection with accordant illustration – suggests homogeneous backgrounds and strong diffuse lighting. Another common assumption is that the signer wears long-sleeved clothing that differs in color from his skin, allowing color-based detection of hands and face.

The majority of systems only support person-dependent operation, i.e. every user is required to train the system himself before being able to use it. Person-independent operation requires a suitable normalization of features early in the processing chain to eliminate dependencies of the features on the signer's position in the image, his distance from the camera, and the camera's resolution. This is rarely described in publications; instead, areas and distances are measured in pixels, which even for person-dependent operation would require an exact reproduction of the conditions under which the training material was recorded.

Similar to the early days of speech recognition, most researchers focus on isolated signs. While several systems exist that process continuous signing, their vocabulary is very small. Recognition rates of 90% and higher are reported, but the exploitation of context and grammar – which is sometimes rigidly fixed to a certain sentence structure – aids considerably in classification. As in speech recognition, coarticulation effects and the resulting ambiguities form the primary problem when using large vocabularies.

Tab. 3.1 lists several important publications and the described systems' features. When comparing the indicated performances, it must be kept in mind that, in contrast to speech recognition, there is no standardized benchmark for sign language recognition. Thus, recognition rates cannot be compared directly. The compilation also shows that most systems' vocabularies are in the range of 50 signs. Larger vocabularies have only been realized with the use of data gloves. All systems are person-dependent. All recognition rates are valid only for the actual test scenario. Information about robustness in real-life settings is not available for any of the systems. Furthermore, the exact constellation of the vocabularies is unknown, despite its significant influence on the difficulty of the recognition task.

In summary it can be stated that none of the systems currently found in literature meets the requirements for a robust real world application.

Visual Non-Intrusive Sign Language Recognition

This chapter describes the implementation of a video-based non-intrusive sign language classifier for real world applications. Fig. 3.2 shows a schematic of the concept that can be divided into a feature extraction stage and a classification stage.
Table 3.1. Characteristics of classifiers for user-dependent sign language recognition.

Author, Year          Features            Interface        Vocabulary  Language Level  Recognition Rate in %
Vamplew 1996 [24]     manual              data glove       52          word            94
Holden 2001 [10]      manual              optical markers  22          word            95.5
Yang 2002 [28]        manual              video            40          word            98.1
Murakami 1991 [15]    manual              data glove       10          sentence        96
Liang 1997 [14]       manual              data glove       250         sentence        89.4
Fang 2002 [8]         manual              data glove       203         sentence        92.1
Starner 1998 [20]     manual              video            40          sentence        97.8
Vogler 1999 [25]      manual              video            22          sentence        91.8
Parashar 2003 [17]    manual and facial   video            39          sentence        92
Fig. 3.2. Schematic of the sign language recognition system.
The input image sequence is forwarded to two parallel processing chains that extract manual and facial features using a priori knowledge of the signing process. Before the final sign classification is performed, a pre-classification module restricts the active vocabulary to reduce processing time. Manual and facial features are then classified separately, and both results are merged to yield a single recognition result.
The first section of this chapter deals with the person-independent recognition of isolated signs in real-world scenarios. The second section explains how facial features can be integrated into the sign language recognition process. Both parts build on related basic concepts already presented in Chapter 2. The third section describes how signs can be automatically divided into subunits by a data-driven procedure, and how these subunits enable the coding of large vocabularies and shorten training times. Finally, some performance measures for the described system are provided.
3.1 Recognition of Isolated Signs in Real-World Scenarios

Sign language recognition constitutes a challenging field of research in computer vision. Compared to gesture recognition in controlled environments as described in Sect. 2.1, recognition of sign language in real-world scenarios places significantly higher demands on feature extraction and processing algorithms.
First, sign language itself provides a framework that restricts the developer's freedom in the choice of vocabulary. In contrast to an arbitrary selection of gestures, common computer vision problems like overlap of both hands and the face (Fig. 3.3a4), ambiguities (Fig. 3.3b), and similarities between different signs (Fig. 3.3c, d) occur frequently and cannot be avoided.
(a) "digital"   (b) "conference"   (c) "food"   (d) "date"
Fig. 3.3. Difficulties in sign language recognition: Overlap of hands and face (a), tracking of both hands in ambiguous situations (b), and similar signs with different meaning (c, d).
Just like in spoken language, there are dialects in sign language. In contrast to the pronunciation of words, however, there is no standard for signs, and people may use an altogether different sign for the same word (Fig. 3.4). Even when performing identical signs, the variations between different signers are considerable (Fig. 3.5). Finally, when using color and/or motion to detect the user’s hands and face, uncontrolled backgrounds as shown in Fig. 3.6 may give rise to numerous false alarms 4
The video material shown in this section was kindly provided by the British Deaf Association (http://www.signcommunity.org.uk). All signs are British Sign Language (BSL).
Fig. 3.4. Two different signs for “actor”: Circular motion of hands in front of upper body (a) vs. dominant hand (here: right) atop non-dominant hand (b).
(a) "autumn"   (b) "recruitment"   (c) "tennis"   (d) "distance"   (e) "takeover"
Fig. 3.5. Traces of (x_cog, y_cog) for both hands from different persons (white vs. black plots) performing the same sign. The background shows one of the signers for reference.
that require high-level reasoning to tell targets from distractors. This process is computationally expensive and error-prone, because the knowledge and experience that comes naturally to humans is difficult to encode in machine-readable form.
Fig. 3.6. Examples of uncontrolled backgrounds that may interfere with the tracking of hands and face.
A multitude of algorithms has been published that aim at solving the above problems. This section describes several concepts that were successfully combined in a working sign language recognition system at RWTH Aachen University [29]. A comprehensive overview of recent approaches can be found in [6, 16]. This section focuses on the recognition of isolated signs based on manual parameters. The combination of manual and non-manual parameters such as facial expression and mouth shape, which constitute an integral part of sign language, is discussed in Sect. 3.2. Methods for the recognition of sign language sentences are described in Sect. 3.3.
The following text builds upon Sect. 2.1. It follows a similar structure since it is based on the same processing chain. However, the feature extraction stage is extended as shown in Fig. 3.7: an image preprocessing stage is added that applies low-level algorithms for image enhancement, such as the background modeling described in Sect. 3.1.2.1, and high-level knowledge is applied for the resolution of overlaps (Sect. 3.1.3.1) and to support hand localization and tracking (Sect. 3.1.3.2). Compared to Sect. 2.1, the presentation is less concerned with concrete algorithms; due to the complexity and diversity of the topic, the focus lies on general considerations that should apply to most, though certainly not all, applications.
Fig. 3.7. Feature extraction extended by an image preprocessing stage. High-level knowledge of the signing process is used for overlap resolution and hand localization/tracking.
3.1.1 Image Acquisition and Input Data

With regard to the input data, switching from gestures to sign language input brings about the following additional challenges:
1. While most gestures are one-handed, signs may be one- or two-handed. The system must therefore not only be able to handle two moving objects, but also to detect whether the sign is one- or two-handed, i.e. whether one of the hands is idle.
2. Since the hands' position relative to the face carries a significant amount of information in sign language, the face must be included in the image as a reference point. This reduces the image resolution available for the hands and poses an additional localization task.
3. Hands may occlude one another and/or the face. For occluded objects, the extraction of features through a threshold segmentation is no longer possible, resulting in a loss of information.
4. Signs may be extremely similar (or even identical) in their manual features and differ mainly (or exclusively) in non-manual features. This makes automatic recognition based on manual features difficult or, for manually identical signs, even unfeasible.
5. Due to dialects and lack of standardization, it may not be possible to map a given word to a single sign. As a result, either the vocabulary has to be extended to include multiple variations of the affected signs, or the user needs to know the signs in the vocabulary. The latter is problematic because native signers may find it unnatural to be forced to use a certain sign where they would normally use a different one.5

The last two problems cannot be solved by the approach described here. The following text presents solution strategies for the first three.

3.1.2 Image Preprocessing

The preprocessing stage improves the quality of the input data to increase the performance (in terms of processing speed and/or accuracy) of the subsequent stages. High-level information computed in those stages may be exploited, but this bears the usual risks of a feedback loop, such as instability or the reinforcement of errors. Low-level pixel-wise algorithms, on the other hand, do not have this problem and can often be used in a wide range of applications. A prominent example is the modeling of the image background.

3.1.2.1 Background Modeling

The detection of background (i.e. static) areas in dynamic scenes is an important step in many pattern recognition systems. Background subtraction, the exclusion of these areas from further processing, can significantly reduce clutter in the input images and decrease the amount of data to be processed in subsequent stages. If a view of the background without any foreground objects is available, a statistical model can be created in a calibration phase. This approach is described in Sect. 5.3.1.1. In the following, however, we assume that the only available input data is the signing user, and that the signing process takes about 1–10 seconds. Thus the 5
This problem also exists in speech recognition, e.g. for the words “zero” and “oh”.
background model is to be created directly from the corresponding 25–250 image frames (assuming 25 fps) containing both background and foreground, without prior calibration.

Median Background Model

A simple yet effective method to create a background model in the form of an RGB image I_bg(x, y) is to compute, for every pixel (x, y), the median color over all frames:

    I_bg(x, y) = median{ I(x, y, t) | 1 ≤ t ≤ T }                                     (3.1)

where the median of a set of vectors V = {v_1, v_2, ..., v_n} is the element for which the sum of Euclidean distances to all other elements is minimal:

    median V = argmin_{v ∈ V} Σ_{i=1}^{n} |v_i − v|                                    (3.2)

For the one-dimensional case this is equivalent to the 50th percentile, which is significantly faster to compute, e.g. by simply sorting the elements of V. Therefore, (3.2) is in practice often approximated by the channel-wise median

    I_bg(x, y) = ( median{ r(x, y, t) | 1 ≤ t ≤ T },
                   median{ g(x, y, t) | 1 ≤ t ≤ T },
                   median{ b(x, y, t) | 1 ≤ t ≤ T } )^T                                (3.3)

The median background model has the convenient property of not requiring any parameters and is consequently very robust. Its only requirement is that the image background must be visible at the considered pixel in more than 50% of the input frames, which is a reasonable assumption in most scenarios. A slight drawback is that all input frames need to be buffered in order to compute (3.1).
To apply the background model, i.e. to classify a given pixel as background or foreground, a suitable metric Δ is needed that quantifies the difference between a background color vector (r_bg, g_bg, b_bg)^T and a given color vector (r, g, b)^T:

    Δ( (r, g, b)^T, (r_bg, g_bg, b_bg)^T ) ≥ 0                                         (3.4)

Computing Δ for every pixel in an input image I(x, y, t) and comparing it with a motion sensitivity threshold Θ_motion yields a foreground mask I_fg,mask:

    I_fg,mask(x, y, t) = 1  if Δ(I(x, y, t), I_bg(x, y)) ≥ Θ_motion,  0 otherwise      (3.5)

Θ_motion is chosen just large enough so that camera noise is not classified as foreground. A straightforward implementation of Δ is the Euclidean distance:

    Δ( (r, g, b)^T, (r_bg, g_bg, b_bg)^T ) = sqrt( (r − r_bg)² + (g − g_bg)² + (b − b_bg)² )    (3.6)
However, (3.6) does not distinguish between differences in brightness and differences in hue. Separating the two is desirable since it permits ignoring shadows (which mainly affect brightness) cast on the background by moving foreground objects. Shadows can be modeled as a multiplication of I_bg with a scalar. We define

    Δ = I − I_bg                                                     (3.7)
    φ = ∠(I, I_bg)                                                   (3.8)

This allows the color difference vector Δ to be split into orthogonal components Δ_brightness and Δ_hue, and the contribution of shadows to the total difference to be weighted with a factor w ∈ [0, 1]:

    Δ_brightness = |Δ| cos φ                                         (3.9)
    Δ_hue        = |Δ| sin φ                                         (3.10)
    Δ = sqrt( (w Δ_brightness)² + Δ_hue² )                           (3.11)
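As an illustration, the channel-wise median model of (3.3) and the shadow-tolerant foreground mask of (3.5) and (3.9)–(3.11) might be implemented along the following lines. This is only a minimal sketch assuming NumPy and an input frame stack of shape (T, H, W, 3); the threshold and weight values are placeholders, not values used by the system described here.

import numpy as np

def median_background(frames):
    """Channel-wise median background model (3.3); frames has shape (T, H, W, 3)."""
    return np.median(frames.astype(np.float64), axis=0)

def foreground_mask(frame, background, theta_motion=25.0, w=0.5):
    """Shadow-tolerant foreground mask according to (3.5) and (3.9)-(3.11).

    w = 1.0 disables shadow handling (plain Euclidean distance, eq. 3.6),
    w = 0.0 yields maximum shadow tolerance.
    """
    I = frame.astype(np.float64)
    B = background.astype(np.float64)
    diff = I - B                                        # color difference vector (3.7)
    # angle between the current and the background color vector (3.8)
    cos_phi = np.sum(I * B, axis=-1) / np.maximum(
        np.linalg.norm(I, axis=-1) * np.linalg.norm(B, axis=-1), 1e-9)
    cos_phi = np.clip(cos_phi, -1.0, 1.0)
    sin_phi = np.sqrt(1.0 - cos_phi ** 2)
    d = np.linalg.norm(diff, axis=-1)                   # |Delta|
    delta_brightness = d * cos_phi                      # (3.9)
    delta_hue = d * sin_phi                             # (3.10)
    delta = np.sqrt((w * delta_brightness) ** 2 + delta_hue ** 2)   # (3.11)
    return (delta >= theta_motion).astype(np.uint8)     # (3.5)

For a complete video, the mask would be computed per frame against the single background model estimated once from all frames.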
A value of w = 1 disables shadow detection and is equivalent to (3.6), while w = 0 yields maximum shadow tolerance. No distinction is made between shadows that appear on the background and shadows that vanish from it. Obviously, the choice of w is a tradeoff between errors (foreground classified as background) and false alarms (background classified as foreground) and thus depends on which is handled better in the subsequent processing stages. Similar approaches to shadow detection are described e.g. in [13, 18]. Thus, while the computation of the median background model is itself parameter-free, its application involves the two parameters Θ_motion and w.
Fig. 3.8 shows four example images from an input video of 54 frames in total. Each of the 54 frames contains at least one person in the background. The resulting background model, as well as an exemplary application to a single frame, is visualized in Fig. 3.9.

Gaussian Mixture Models

Most background models, including the one presented above, assume a static background. However, backgrounds may also be dynamic, especially in outdoor scenarios. For example, a tree with green leaves moving in front of a white wall results in background pixels that alternate between green and white. Approaches that model dynamic backgrounds are described in [13, 18, 21]. Instead of computing a single background color as in (3.1), background and foreground are modeled together for each pixel as a mixture of Gaussian distributions, usually in RGB color space. The parameters of these distributions are updated online with every new frame. Simple rules are used to classify each individual Gaussian as background or foreground. Handling of shadows can be incorporated as described above.
Fig. 3.8. Four example frames from the sign “peel”. The input video consists of 54 frames, all featuring at least one person in the background.
Compared to simpler methods this is computationally more demanding and requires more input frames to compute an accurate model, because the number of parameters to be estimated is higher. Also, several parameters have to be specified manually to control the creation and update of the individual Gaussians and their background/foreground classification. Performance may degrade if these parameters are inappropriate for the input video. If a sufficient amount of input data is available, the resulting model performs well under a wide variety of conditions. For single signs of approximately two seconds duration the amount of data is usually a critical factor, while one or more sign language sentences may provide enough. For the recognition of isolated signs one might therefore consider to never stop capturing images and to continuously update the distribution parameters.

3.1.3 Feature Extraction

Systems that operate in real-world conditions require sophisticated feature extraction approaches. Rather than localizing a single hand, both hands and the face (which is needed as a reference point) have to be segmented. Since a sign may be one- or two-handed, it is impossible to know in advance whether the non-dominant hand will remain idle or move together with the dominant hand. Also, the resolution available for each target object is significantly smaller than in most gesture recognition appli-
(a) Background Model I_bg   (b) Example Frame   (c) Foreground Mask I_fg,mask
Fig. 3.9. Background model computed for the input shown in Fig. 3.8 (a), its application to an example frame (b), and the resulting foreground mask (c).
cations, where the hand typically fills the complete image. Background subtraction alone cannot be expected to isolate foreground objects without errors or false alarms. Also, the face – which is mostly static – has to be localized before background subtraction can be applied. Since the hands' extremely variable appearance prevents the use of shape or texture cues, it is common to exploit only color for hand localization. This leads to the restriction that the signer must wear long-sleeved, non-skin-colored clothing to facilitate the separation of the hand from the arm by means of color. By using a generic color model such as the one presented in [12], user and illumination independence can be achieved at the cost of a high number of false alarms. This method therefore requires high-level reasoning algorithms to handle these ambiguities. Alternatively, one may choose to either explicitly or automatically calibrate the system to the current user and illumination. This chapter describes the former approach.

3.1.3.1 Overlap Resolution

In numerous signs both hands overlap with each other and/or with the face. When two or more objects overlap in the image, the skin color segmentation yields only a single blob for all of them, rendering a direct extraction of meaningful features impossible. Low contrast, low resolution, and the hands' variable appearance usually
do not allow a separation of the overlapping objects by an edge-based segmentation either. Most of the geometric features available for unoverlapped objects can therefore not be computed for overlapping objects and have to be interpolated. However, a hand's appearance is sufficiently constant over several frames for template matching [19] to be applied. Using the last unoverlapped view of each overlapping object as a template, at least position features – which carry much information – can be reliably computed during overlap. The accuracy of this method decreases with increasing template age. Fortunately, the same technique can be applied twice for a single period of overlap, the second time starting with the first unoverlapped view after the cessation of the overlap and proceeding temporally backwards. This effectively halves the maximum template age and increases precision considerably.

3.1.3.2 Hand Tracking

Performing a simple hand localization in every frame, as described in Sect. 2.1.2.1, is not sufficiently reliable in complex scenarios. The relation between temporally adjacent frames has to be exploited in order to increase performance. Localization is thus replaced by tracking, using information from previous frames as a basis for finding the hand in the current frame.
A common problem in tracking is the handling of ambiguous situations where more than one interpretation is plausible. For instance, Fig. 3.10a shows the skin color segmentation of a typical scene (Fig. 3.9b). This observation does not allow a direct conclusion as to the actual hand configuration. Instead, there are multiple interpretations, or hypotheses, as visualized in Fig. 3.10 (b, c, d). Simple approaches weigh all hypotheses against each other and choose the most likely one in every frame on the basis of information gathered in preceding frames, such as a position prediction. All other hypotheses are discarded. This concept is error-prone because ambiguous situations that cannot reliably be interpreted occur frequently in sign language.
Robustness can be increased significantly by evaluating not only preceding, but also subsequent frames before any hypothesis is discarded. It is therefore desirable to delay the final decision on the hands' position in each frame until all available data has been analyzed and the maximum amount of information is available. These considerations lead to a multiple hypotheses tracking (MHT) approach [29]. In a first pass over the input data all conceivable hypotheses are created for every frame. Transitions are possible from each hypothesis at time t to all hypotheses at time t + 1, resulting in a state space as shown exemplarily in Fig. 3.11. The total number of paths through this state space equals ∏_t H(t), where H(t) denotes the number of hypotheses at time t. Provided that the skin color segmentation detected both hands and the face in every frame, one of these paths represents the correct tracking result. In order to find this path (or one as close as possible to it), probabilities are computed that indicate the likelihood of each hypothesized configuration, p_state, and the likelihood of each transition, p_transition (see Fig. 3.11).
Fig. 3.10. Skin color segmentation of Fig. 3.9b (a) and a subset of the corresponding hypotheses (b, c, d; correct: d).
The computation of p_state and p_transition is based on high-level knowledge, encoded as a set of rules or learned in a training phase. The following aspects, possibly extended by application-dependent items, provide a general basis:

• The physical configuration of the signer's body can be deduced from position and size of face and hands. Configurations that are anatomically unlikely or do not occur in sign language reduce p_state.
• The three phases of a sign (preparation, stroke, retraction), in connection with the signer's handedness (which has to be known in order to correctly interpret the resulting feature vector), should also be reflected in the computation of p_state.
• Even in fast motion, the hand's shape changes little between successive frames at 25 fps. As long as no overlap occurs, the shape at time t can therefore serve as an estimate for time t + 1. With increasing deviation of the actual from the expected shape, p_transition is reduced. Abrupt shape changes due to overlap require special handling.
• Similarly, hand position changes little from frame to frame (at 25 fps), so that coordinates at time t may serve as a prediction for time t + 1. Kalman filters [27] may increase prediction accuracy by extrapolating on the basis of all past measurements.
Fig. 3.11. Hypothesis space and probabilities for states and transitions.
• Keeping track of the hand's mean or median color can prevent confusion of the hand with nearby distractors of similar size but different color. This criterion affects p_transition.
To search the hypothesis space, the Viterbi algorithm (see Sect. 2.1.3.6) is applied in conjunction with pruning of unlikely paths at each step. The MHT approach ensures that all available information is evaluated before the final tracking result is determined. The tracking stage can thus exploit, at time t, information that becomes available only at a time t1 > t. Errors are corrected retrospectively as soon as they become apparent.

3.1.4 Feature Normalization

After successful extraction, the features need to be normalized. Assuming that resolution dependencies have been eliminated as described in Sect. 2.1.2.3, the area a still depends on the signer's distance to the camera. Hand coordinates (x_cog, y_cog) additionally depend on the signer's position in the image. Using the face position and size as a reference for normalization can eliminate both dependencies. If a simple threshold segmentation is used for the face, the neck is usually included. Different necklines may therefore affect the detected face area and position (COG). In this case the topmost point of the face and its width may be used instead to avoid this problem. If the shoulder positions can be detected or estimated, they may also provide a suitable reference.
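A minimal sketch of such a normalization, using the topmost face point and the face width as reference as suggested above (all names are illustrative and not part of the described system):

def normalize_hand_features(x_cog, y_cog, area, face_top_x, face_top_y, face_width):
    """Express hand position relative to the topmost face point in units of the
    face width, and the hand area relative to the squared face width, so that the
    features no longer depend on image resolution, signer position, or distance."""
    x_norm = (x_cog - face_top_x) / face_width
    y_norm = (y_cog - face_top_y) / face_width
    area_norm = area / float(face_width) ** 2
    return x_norm, y_norm, area_norm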
3.1.5 Feature Classification

The classification of the extracted feature vector sequence is usually performed using HMMs as discussed in Sect. 2.1.3.6. However, due to the nature of sign language, the following additional processing steps are advisable. Firstly, leading and/or trailing idle frames in the input data can be detected by a simple rule (e.g. (2.45)). This is done for each hand individually. Cropping frames in which both hands are idle speeds up classification and prevents the classifier from processing input data that carries no information. Obviously, this cropping must be applied in the training phase as well. If the sign is one-handed, i.e. one hand remains idle in all frames, all classes in the vocabulary that represent two-handed signs can be disabled (or vice versa). This further reduces computational cost in the classification stage.

3.1.6 Test and Training Samples

The accompanying CD contains a database of test and training samples that consists of 80 signs in British Sign Language (BSL), performed five times each, all by the same signer.6 The signs were recorded under laboratory conditions, with a unicolored blue background that allows arbitrary other backgrounds to be faded in using the common bluescreen method. Image resolution is 384 × 288, the frame rate is 25 fps, and the average sign duration is 73.4 frames.
Fig. 3.12. Example frame from the test and training samples provided on CD.
As shown in Fig. 3.12 the signer is standing. The hands are always visible, and their start/end positions are constant and identical, which simplifies tracking. The directory slrdatabase has five subdirectories named var00 to var04, each of which contains one production of all 80 signs. Each sign production is stored in the form of individual JPG files in a directory named accordingly. 6
We sincerely thank the British Deaf Association (http://www.signcommunity.org.uk) for these recordings.
This database can be used to test tracking and feature extraction algorithms, and to train and test a sign language classifier based e.g. on Hidden Markov Models.
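The following sketch shows one possible way to iterate over this directory layout. It only relies on the structure described above and makes no assumption about the individual sign directory names.

import os
from glob import glob

def iter_slr_database(root="slrdatabase"):
    """Yield (variation, sign_name, ordered list of frame paths) for every sign production."""
    for var in sorted(os.listdir(root)):                    # var00 ... var04
        var_dir = os.path.join(root, var)
        if not os.path.isdir(var_dir):
            continue
        for sign in sorted(os.listdir(var_dir)):            # one directory per sign
            sign_dir = os.path.join(var_dir, sign)
            if not os.path.isdir(sign_dir):
                continue
            frames = sorted(f for f in glob(os.path.join(sign_dir, "*"))
                            if f.lower().endswith((".jpg", ".jpeg")))
            yield var, sign, frames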
3.2 Sign Recognition Using Nonmanual Features

Since sign languages are multimodal languages, several channels are used at the same time to transfer information. One basically differentiates between the manual/gestural channels and the nonmanual/facial channels and their respective parameters [4]. In the following, the nonmanual parameters are first described in more detail; afterwards the extraction of facial features for sign language recognition is presented.

3.2.1 Nonmanual Parameters

Nonmanual parameters are indispensable in sign language. They code adjectives and contribute to grammar. In particular, some signs are identical with respect to manual gesturing and can only be differentiated by reference to nonmanual parameters [5]. This is, e.g., the case for the signs "not" and "to" in German Sign Language, which can only be distinguished by head motion (Fig. 3.13). Similarly, in British Sign Language (BSL) the signs "now" and "offer" require the lip outline for disambiguation (Fig. 3.14). In the following, the most important nonmanual parameters are described in more detail.
Fig. 3.13. In German Sign Language the signs "not" (A) and "to" (B) are identical with respect to manual gesturing but vary in head movement.
Upper Body Posture

The torso generally serves as the reference of the signing space. Spatial distances and textual aspects can be communicated by the posture of the torso.
Fig. 3.14. In British Sign Language the signs "now" (A) and "offer" (B) are identical with respect to manual gesturing but vary in lip outline.
The signs "rejecting" or "enticing", e.g., show a slight inclination of the torso towards the rear or in the forward direction. Likewise, grammatical aspects such as indirect speech can be coded by torso posture.

Head pose

The head pose also supports the semantics of sign language. Questions, affirmations, denials, and conditional clauses are, e.g., communicated with the help of head pose. In addition, information concerning time can be coded. Signs that refer to a short time lapse are, e.g., characterized by a minimal change of head pose, while signs referring to a long lapse are performed by turning the head clearly into the direction opposite to the gesture.

Line of sight

Two communicating deaf persons usually establish close visual contact. However, a brief change of the line of sight can be used to refer to the spatial meaning of a gesture. Likewise, the line of sight can be used in combination with torso posture to express indirect speech, e.g. when a conversation between two absent persons is represented, one of them being virtually placed behind the signer.

Facial expression

Facial expressions essentially serve the transmission of feelings (lexical mimics). In addition, grammatical aspects may be mediated as well. A change of head pose combined with a lifting of the eyebrows corresponds, e.g., to a subjunctive.
Lip outline

The lip outline represents the most pronounced nonmanual characteristic. It often differs from the voicelessly mouthed word in that part of the word is shortened. The lip outline resolves ambiguities between signs (brother vs. sister) and specifies expressions (meat vs. hamburger). It also provides information redundant to gesturing, which supports the differentiation of similar signs.

3.2.2 Extraction of Nonmanual Features

Fig. 3.15 depicts a general concept for nonmanual feature extraction. At the beginning of the processing chain, images are acquired from the camera. In the next step the Face Finder (Sect. 2.2.2.1) is used to locate the face region of the signer in the image, which covers the entire signing space. Afterwards this region is cropped and upscaled.
Fig. 3.15. Processing chain of nonmanual feature extraction.
The next step is the fitting of a user-adapted face graph by utilizing Active Appearance Models (Sect. 2.2.3). The face graph serves the extraction of the nonmanual parameters such as lip outline, eyes, and brows (Fig. 3.16). The nonmanual features used (Fig. 3.17) can be divided into three groups. First, the lip outline is described by width, height, and form features such as invariant moments, eccentricity, and orientation (Sect. 2.1.2.3). The second group comprises the distances between the eyes and the mouth corners, and the third group contains the distances between the eyes and the eyebrows. For each image of the sequence the extracted parameters are merged into a feature vector, which is subsequently used for classification (Fig. 3.18).

Overlap Resolution

When hands and face partially overlap, an accurate fitting of the Active Appearance Models is usually no longer possible. A further problem is that
Fig. 3.16. Processing scheme of the face region cropping and the matching of an adaptive face-graph.
the face graph is sometimes fitted with seemingly high precision even though not enough face texture is visible. In this case a determination of the features in the affected regions is no longer possible. To compensate for this effect, an additional module is introduced that evaluates the individual face regions separately with regard to overlaps by the hands. Fig. 3.19 presents two cases. In the first case one eye and the mouth are hidden, so the feature vector of the facial parameters is not used for classification. In the second case the overlap is not critical, so the facial features are still used in the classification process.
The hand tracker indicates a crossing of one or both hands with the face as soon as the skin-colored surfaces touch each other. In these cases it is necessary to decide whether the hands affect the shape substantially. Therefore, an ellipse for the hand is computed from the manual parameters. In addition, an oriented bounding box is drawn around the lip contour of the Active Appearance shape. If the hand ellipse touches the bounding box, the Mahalanobis distance of the shape fitting is decisive: if it is too large, the shape is marked as invalid. Since the Mahalanobis distance of the shapes depends substantially on the trained model, not an absolute value is
Fig. 3.17. Representation of nonmanual parameters suited for sign language recognition.
Fig. 3.18. Scheme of the extraction and classification of nonmanual features.
used here, but a proportional worsening relative to the unoccluded fit. Experiments showed that a good overlap detection can be achieved even if 25% of the face is hidden.
Fig. 3.19. During overlap of hands and face, the individual face regions are evaluated separately. If, e.g., the mouth and one eye are hidden (left), no facial features are considered. However, if eyes and mouth are located reliably, the extracted features can be used for classification (right).
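To make the feature grouping of Fig. 3.17 concrete, a reduced sketch of assembling a per-frame nonmanual feature vector from face-graph landmarks is given below. The landmark points are assumed to come from the fitted Active Appearance shape; which points are used, and the omission of the moment-based lip features, are simplifications for illustration only.

import numpy as np

def nonmanual_features(lip_points, eye_l, eye_r, mouth_l, mouth_r, brow_l, brow_r):
    """Per-frame nonmanual feature vector: lip outline (width, height),
    eye-to-mouth-corner distances, and eye-to-brow distances."""
    lip = np.asarray(lip_points, dtype=float)
    lip_width = lip[:, 0].max() - lip[:, 0].min()
    lip_height = lip[:, 1].max() - lip[:, 1].min()
    dist = lambda a, b: float(np.linalg.norm(np.asarray(a, float) - np.asarray(b, float)))
    return np.array([
        lip_width, lip_height,                        # group 1: lip outline
        dist(eye_l, mouth_l), dist(eye_r, mouth_r),   # group 2: eyes vs. mouth corners
        dist(eye_l, brow_l), dist(eye_r, brow_r),     # group 3: eyes vs. brows
    ])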
3.3 Recognition of Continuous Sign Language Using Subunits

Sign language recognition is a rather young research area compared to speech recognition, which has been investigated for about 50 years. The head start of speech recognition makes a comparison between the two fields worthwhile, especially for drawing conclusions on how to model sign language. While phoneme-based speech recognition systems represent today's state of the art, early speech recognizers used words as the modeling unit. A similar development can be observed for sign language recognition. Current systems are still based on word models; first steps towards subunit-based recognition systems have been undertaken only recently [2, 26]. The following section outlines a sign language recognition system that is based on automatically generated subunits of signs.

3.3.1 Subunit Models for Signs

It is not yet clear in sign language recognition which part of a sign sentence serves as a good underlying model. The theoretically best solution is a whole sign language sentence as one model. Due to their long duration, sentences are unlikely to be misclassified. However, the main disadvantage is obvious: there are far too many combinations to train all possible sentences in sign language. Consequently, most sign language recognition systems are based on word models, where one sign represents one model in the model database. The recognition of sign language sentences thus becomes more flexible. Nevertheless, even this method has some drawbacks:
• The training complexity increases with vocabulary size.
• A future enlargement of the vocabulary is problematic, as new signs are usually signed in the context of other signs for training (embedded training).
As an alternative approach, signs can also be modeled with subunits of signs, similar to modeling speech by means of phonemes. Subunits are segments of signs, which emerge from the subdivision of signs. Figure 3.20 shows an example of the different possibilities for modeling units in the model database mentioned above. Subunits should represent a finite set of models.
Fig. 3.20. The signed sentence TODAY I COOK (HEUTE ICH KOCHEN) in German Sign Language (GSL). Top: the recorded sign sequence. Center: vertical position of the right hand during signing. Bottom: different possibilities to divide the sentence. For better visualization the signer wears simple colored cotton gloves.
Their number should be chosen in such a way that any sign can be composed of subunits. The advantages are:
• The training period terminates when all subunit models are trained. An enlargement of the vocabulary is achieved by composing a new sign through concatenation of suitable subunit models.
• The general vocabulary size can be enlarged.
• Recognition may eventually become more robust (depending on the vocabulary size).

Modification of Training and Classification for Recognition with Subunits

A subunit-based recognition system needs an additional knowledge source in which the coding (also called transcription) of each sign into subunits is itemized. This knowledge source is called the sign-lexicon; it contains the transcriptions of all signs and their correspondence with subunits. Both the training and the classification process are based on a sign-lexicon.
Modification for Training

The aim of the training process is the estimation of the subunit model parameters. The example in Figure 3.21 shows that sign 1 consists of the subunits (SU) SU4, SU7, and SU3. The corresponding model parameters of these subunits are trained on the recorded data of sign 1 by means of the Viterbi algorithm (see Section 2.1.3.6).
Fig. 3.21. Components of the training process for subunit identification.
Modification for Classification

After completion of the training process, a database is filled with all subunit models, which now serve as the basis for the classification process. However, the aim of classification is not the recognition of subunits but of signs. Hence, the information which sign consists of which subunits is needed again; it is contained in the sign-lexicon. The corresponding modification for classification is depicted in Figure 3.22.

3.3.2 Transcription of Sign Language

So far we have assumed that a sign-lexicon is available, i.e. that it is already known which sign is composed of which subunits. This, however, is not the case. The subdivision of a sign into suitable subunits still poses difficult problems. In addition, the semantics of the subunits have yet to be determined. The following section gives an overview of possible approaches to linguistic subunit formation.

3.3.2.1 Linguistics-orientated Transcription of Sign Language

Subunits for speech recognition are mostly linguistically motivated and are typically syllables, half-syllables, or phonemes. The basis of this breakdown of speech is closely related to its notation system: written text with the corresponding orthography is the standard notation system for speech. Nowadays, huge speech lexica exist consisting
Fig. 3.22. Components of the classification process based on subunits.
of the transcription of speech into subunits. These lexica are usually the basis for today's speech recognizers. When transferring this concept of linguistic breakdown to sign language recognition, one is confronted with a variety of notation options, none of which is standardized as is the case for speech. A notation system as generally accepted as written text does not exist for sign language. Nevertheless, some known notation systems are examined in the following, especially with respect to their applicability in a recognition system. Corresponding to phonemes, the term cheremes (derived from the Greek word for 'manual') is used for subunits in sign language.

Notation System by Stokoe

Stokoe was one of the first to do research in the area of sign language linguistics, in the sixties [22]. He defined three different types of cheremes. The first type describes the configuration of the handshape and is called dez, for designator. The second type is sig, for signation, and describes the kind of movement of the performed sign. The third type is the location of the performed sign and is called tab, for tabula. Stokoe developed a lexicon for American Sign Language by means of the above-mentioned types of cheremes. The lexicon consists of nearly 2500 entries, where signs are coded with altogether 55 different cheremes (12 'tab', 19 'dez', and 24 different 'sig'). An example of a sign coded in the Stokoe system is depicted in Figure 3.23 [22].
The employed cheremes seem to qualify as subunits for a recognition system; their practical employment in a recognition system, however, turns out to be difficult. Even though Stokoe's lexicon is still in use today and consists of many entries, not all signs, and especially no German Sign Language signs, are included. Also, most of Stokoe's cheremes are performed in parallel, whereas a recognition system expects subunits in sequential order. Furthermore, none of the signation cheremes (encoding the movement of a performed sign) are necessary for a recognition system, as movements are modeled by Hidden Markov Models (HMMs). Hence, Stokoe's
Fig. 3.23. Left: The sign THREE in American Sign Language (ASL). Right: notation of this sign after Stokoe [23].
lexicon is a very good linguistic breakdown of signs into cheremes; without manual alterations, however, it is not useful as a basis for a recognition system.

Notation System by Liddell and Johnson

Another notation system was invented by Liddell and Johnson. They break signs into cheremes using a so-called Movement-Hold model, which was introduced in 1984 and has been further developed since then. Here, the signs are divided in sequential order into segments. Two different kinds of segments are possible: 'movements' are segments where the configuration of a hand is still in motion, whereas in 'hold' segments no movement takes place, i.e. the configuration of the hands is fixed. Each sign can thus be modeled as a sequence of movement and hold segments. In addition, each hold segment consists of articulatory features [3]. These describe the handshape, the position and orientation of the hand, movements of the fingers, and rotation and orientation of the wrist. Figure 3.24 depicts an example of the notation of a sign by means of movement and hold segments.
Fig. 3.24. Notation of the ASL sign FATHER by means of the Movement-Hold model (from [26]).
While Stokoe's notation system is based on a mostly parallel breakdown of signs, here a sequence of segments is produced, which is better suited for a recognition system. However, similar to Stokoe's notation system, no comprehensive lexicon exists in which all signs are encoded. Moreover, the detailed coding of the articulatory features might cause additional problems: the video-based feature extraction of the recognition system might not be able to reach such a high level of detail. Hence, the
Movement-Hold notation system is not suitable for employment as a sign-lexicon within a recognition system without manual modifications or even manual transcription of signs.

3.3.2.2 Visually-orientated Transcription of Sign Language

The visual7 approach to a notation (or transcription) system for sign language recognition does not rely on any linguistic knowledge about sign language – unlike the two approaches described in the previous sections. Here, the breakdown of signs into subunits is based on a data-driven process. In a first step each sign of the vocabulary is divided sequentially into different segments, which have no semantic meaning. A subsequent process determines similarities between the identified segments. Similar segments are then pooled and labeled; they are deemed to be one subunit. This process requires no knowledge source other than the data itself. Each sign can now be described as a sequence of the contained subunits, which are distinguished by their labels. This notation is also called a fenonic baseform [11]. Figure 3.25 depicts as an example the temporal horizontal progression (right hand) of two different signs.
Fig. 3.25. Example of different transcriptions of two signs.
7 For speech recognition the corresponding term is acoustic subunits; for sign language recognition the name is adapted accordingly.
The performed signs are similar in the beginning. Consequently, both signs are assigned to the same subunit (SU3). However, the further progression differs significantly: while sign 2 moves upwards, the slope of sign 1 decreases. Hence, the further transcription of the two signs differs.

3.3.3 Sequential and Parallel Breakdown of Signs

The example in Fig. 3.25 illustrates merely one aspect of the performed sign: the horizontal progression of the right hand. In terms of sign language recognition and its feature extraction, this aspect corresponds to the y-coordinate of the right hand's location. For a complete description of a sign, one feature is, however, not sufficient. In fact, a recognition system must handle many more features, which are merged into so-called feature groups. The composition of these feature groups must take the linguistic sign language parameters into account, which are location, handshape, and hand orientation. The further details in this section refer to an example separation of the feature vector into a feature group 'pos', in which all features regarding the position of both hands are grouped. Another group represents all features describing the size of the visible part of the hands ('size'), whereas the third group comprises all features regarding the distances between the fingers ('dist'). The latter two groups represent the sign parameters handshape and hand orientation. Note that these are only examples of how to model a feature vector and its feature groups; many other ways of modeling sign language are conceivable, and the number of feature groups may also vary. To demonstrate the general approach, this example makes use of the three feature groups 'pos', 'size', and 'dist' mentioned above.
Following the parallel breakdown of a sign, each resulting feature group is segmented in sequential order into subunits. The identification of similar segments is not carried out on the whole feature vector, but only within each of the three feature groups. Similar segments then stand for the subunits of one feature group. Pooled segments of the feature group 'pos', e.g., represent a certain location independent of any specific handshape and orientation. The parallel and sequential breakdown of the signs finally yields three different sign lexica, which are combined into one (see also Figure 3.26). Figure 3.27 shows examples of similar segments of different signs according to specific feature groups.

3.3.4 Modification of HMMs to Parallel Hidden Markov Models

Conventional HMMs are suited to handling sequential signals (see Section 2.1.3.6). However, after the breakdown into feature groups, we now deal with parallel signals, which can be handled by Parallel Hidden Markov Models (PaHMMs) [9]. Parallel Hidden Markov Models are conventional HMMs used in parallel. Each of the parallel combined HMMs is called a channel. The channels are independent of each other; the state probabilities of one channel do not influence any of the other channels.
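A minimal sketch of splitting one combined feature vector into the three feature groups is given below; the index ranges are purely illustrative assumptions, since the actual composition of the feature vector depends on the extraction stage.

import numpy as np

# Illustrative layout of the combined feature vector (index ranges are assumptions).
FEATURE_GROUPS = {
    "pos":  slice(0, 4),    # e.g. (x, y) positions of both hands
    "size": slice(4, 8),    # e.g. features describing the visible hand areas
    "dist": slice(8, 12),   # e.g. distances between the fingers
}

def split_feature_vector(x):
    """Split a combined feature vector into one sub-vector per PaHMM channel."""
    x = np.asarray(x)
    return {group: x[idx] for group, idx in FEATURE_GROUPS.items()}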
Fig. 3.26. Each feature group leads to its own sign-lexicon; these are finally combined into one sign-lexicon.
Fig. 3.27. Examples of signs assigned to the same subunit in the different feature groups (same position, same distance, same size).
A corresponding PaHMM is depicted in Figure 3.28. The last state, a so-called confluent state, combines the probabilities of the different channels into one probability that is valid for the whole sign. The combination of probabilities is determined by the following equation:

    P(O|λ) = ∏_{j=1}^{J} P(O^j | λ^j)                                          (3.12)
Here, O^j stands for the relevant observation sequence of one channel, which is evaluated for the corresponding segment. All probabilities of a sign computed in parallel finally
Fig. 3.28. Example of a PaHMM with Bakis topology.
result in the following equation:

    P(O|λ) = ∏_{j=1}^{J} ∏_{k=1}^{K} P(O_k^j | λ_k^j)                          (3.13)
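Since the channels are independent, (3.12) and (3.13) turn into simple sums in log-space. A minimal sketch, where the per-subunit scores are assumed to come from an ordinary HMM evaluation of the respective channel:

def sign_log_prob(per_channel_subunit_scores):
    """Combine channel probabilities of one sign according to (3.13).

    per_channel_subunit_scores: list over channels j, each a list of
    log P(O_k^j | lambda_k^j) over the subunits k of that channel.
    Returns log P(O | lambda) for the whole sign.
    """
    return sum(sum(channel) for channel in per_channel_subunit_scores)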
3.3.4.1 Modeling Sign Language by Means of PaHMMs

For recognition based on subunit models, each of the feature groups is modeled by one channel of the PaHMM (see also Section 3.3.3). The sequential subdivision into subunits is then conducted in each feature group separately. Figure 3.29 depicts the modeling of the GSL sign HOW MUCH (WIEVIEL) with its three feature groups. The figure shows the word model of the sign, i.e. the sign and all its contained subunits in all three feature groups. Note that it is possible, and highly probable, that the different feature groups of a sign contain different numbers of subunit models. This is the case if, as in this example, the position changes during the execution of the sign, whereas handshape and hand orientation remain the same.
Fig. 3.29. Modeling the GSL sign HOW MUCH (WIEVIEL) by means of PaHMMs.
Figure 3.29 also illustrates the specific topology for a subunit based recognition system. As the duration of one subunit is quite short, a subunit HMM consists merely
of two states. The connection of several subunits inside one sign, however, follows the Bakis topology (see also Equation 2.69 in Section 2.1.3.6).

3.3.5 Classification

Determining the most likely sign sequence Ŝ that best fits the feature vector sequence X results in a demanding search process. The recognition decision is carried out by jointly considering the visual and linguistic knowledge sources. Following [1], where the most likely sign sequence is approximated by the most likely state sequence, a dynamic programming search algorithm can be used to compute the probabilities P(X|S) · P(S). Simultaneously, the optimization over the unknown sign sequence is applied. Sign language recognition is then solved by matching the input feature vector sequence to all possible state sequences and finding the most likely sequence of signs using the visual and linguistic knowledge sources. The different steps are described in more detail in the following subsections.

3.3.5.1 Classification of Single Signs by Means of Subunits and PaHMMs

Before dealing with continuous sign language recognition based on subunit models, the general approach will be demonstrated by a simplified example of single sign recognition with subunit models. The extension to continuous sign language recognition is then given in the next section; the general approach is the same in both cases. The classification example is depicted in Figure 3.30. Here, the sign consists of three subunits for feature group 'size', four subunits for 'pos', and two for feature group 'dist'. It is important to note that the depicted HMM is neither a random sequence of subunits in each feature group nor a random parallel combination of subunits. The combination of subunits – in parallel as well as in sequence – corresponds to a trained sign, i.e. the sequence of subunits is transcribed in the sign-lexicon of the respective feature group. Furthermore, the parallel combination, i.e. the transcription in all three sign-lexica, codes the same sign. Hence, the recognition process does not search for 'any' best sequence of subunits independently.
The signal of the example sign of Figure 3.30 has a total length of 8 feature vectors. In each channel, the assignment of feature vectors (the part of the feature vector belonging to the respective feature group) to the different states happens entirely independently of the other channels by time alignment (Viterbi algorithm). Only at the end of the sign, i.e. after the 8th feature vector has been assigned, are the probabilities calculated so far in each channel combined. Here, the first and last states are confluent states. They do not emit any probability, as they serve as a common beginning and end state for the three channels. The confluent end state can only be reached via the corresponding end states of all channels. In the depicted example this is the case only after feature vector 8, even though the end state in channel 'dist' is already reached after 7 time steps. The corresponding equation for calculating the combined probability for one model is:

    P(X|λ) = P(X|λ_Pos) · P(X|λ_Size) · P(X|λ_Dist)                            (3.14)

The decision on the best model is reached by a maximization over the models of all signs in the vocabulary:
Fig. 3.30. Classification of a single sign by means of subunits and PaHMMs.
    Ŝ = argmax_i P(X|λ_i)                                                      (3.15)
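The decision rule (3.15) then reduces to a maximization over the word models of the vocabulary. A sketch in log-space, where score is assumed to be whatever PaHMM/Viterbi evaluation is used for P(X|λ_i):

def classify_single_sign(X, word_models, score):
    """Return the sign whose word model maximizes P(X | lambda_i), cf. (3.15).

    word_models: dict mapping sign name -> word model lambda_i
    score:       callable (X, model) -> log P(X | model)
    """
    return max(word_models, key=lambda sign: score(X, word_models[sign]))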
After completion of the training process, a word model λ_i exists for each sign S_i, consisting of the Hidden Markov Models of the corresponding subunits. This word model serves as the reference for recognition.

3.3.5.2 Classification of Continuous Sign Language by Means of Subunits and PaHMMs

In principle, the classification processes for continuous and isolated sign language recognition are identical. However, in contrast to the recognition of isolated signs, continuous sign language recognition has to cope with a number of further difficulties:

• A sign may begin or end anywhere in a given feature vector sequence.
• It is ambiguous how many signs are contained in each sentence.
• There is no specific order of the signs.
• All transitions between subsequent signs have to be detected automatically.
All these difficulties are strongly linked to the main problem: the detection of sign boundaries. Since these cannot be detected accurately, all possible beginning and
end points have to be accounted for. As introduced in the last section, the first and last state of a word model is a confluent state common to all three channels. Starting from the first state, the feature vector is divided into the feature groups for the different channels of the PaHMM. From the last joint confluent state of a model there are transitions to the first confluent states of all other models. This scheme is depicted in Figure 3.31.
Fig. 3.31. A network of PaHMMs for continuous sign language recognition with subunits.
Detection of Sign Boundaries

At the time of classification, the number of signs in the sentence as well as the transitions between these signs are unknown. To find the correct sign transitions, all sign models are combined as depicted in Figure 3.31. The generated model now constitutes one comprehensive HMM. Inside a sign model there are still three transitions between states (Bakis topology). The last confluent state of a sign model has transitions to all other sign models. The Viterbi algorithm is employed to determine the best state sequence of this three-channel PaHMM. Thereby the assignment of feature vectors to the different sign models becomes apparent, and with it the detection of the sign boundaries.

3.3.5.3 Stochastic Language Modeling

The classification of sign language depends on two knowledge sources: the visual model and the language model. Visual modeling is carried out using HMMs as described in the previous section; language modeling is discussed in this section. Without any language model, the transition probabilities between two successive signs are equal, and knowledge about a specific order of the signs in the training corpus is not utilized during recognition. In contrast, a statistical language model takes advantage of the knowledge that certain pairs of signs, i.e. two successive signs, occur more often than others. The following equation gives the probability of a so-called bigram model:
    P(w) = P(w_1) · ∏_{i=2}^{m} P(w_i | w_{i-1})                               (3.16)
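A minimal sketch of evaluating (3.16) in log-space for a hypothesized sign sequence. The probability tables are assumed to have been estimated from a training corpus; the sign-group variant described below would only change how these tables are indexed (by sign-group instead of by sign).

import math

def bigram_log_prob(signs, unigram_p, bigram_p):
    """log P(w) = log P(w_1) + sum_{i>=2} log P(w_i | w_{i-1}), cf. (3.16).

    unigram_p: dict sign -> P(sign)
    bigram_p:  dict (previous_sign, sign) -> P(sign | previous_sign)
    """
    logp = math.log(unigram_p[signs[0]])
    for prev, cur in zip(signs[:-1], signs[1:]):
        logp += math.log(bigram_p[(prev, cur)])
    return logp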
The equation describes the estimation based on sign pairs w_{i-1} and w_i. During the classification process the probability of a subsequent sign changes, depending on the classification result of the preceding sign. By this method, typical sign pairs receive a higher probability. The estimation of these probabilities, however, requires a huge training corpus. Since such training corpora do not exist for sign language, a simple but efficient enhancement of the usual statistical language model is introduced in the next subsection.

Enhancement of Language Models for Sign Language Recognition

The approach of an enhanced statistical language model for sign language recognition is based on the idea of dividing all signs of the vocabulary into different sign-groups (SG) [2]. Occurrence probabilities are then calculated between these sign-groups and not between specific signs. If a combination of different sign-groups is not seen in the training corpus, this is a hint that signs of these specific sign-groups do not follow each other. This approach does not require that all combinations of signs occur in the database. If the succession of two signs from two different sign-groups SG_i and SG_j is observed in the training corpus, any sign of sign-group SG_i followed by any other sign of sign-group SG_j is allowed for recognition. For instance, if the sign succession 'I EAT' is contained in the training corpus, the probability that a sign of the group 'verb' (SG_verb) occurs after a sign of the group 'personal pronoun' (SG_personal pronoun) is increased. Thereby, the succession 'YOU DRINK' receives a high probability even though it does not occur in the training corpus. On the other hand, if the training corpus does not contain a sample of two succeeding 'personal pronoun' signs (e.g. 'YOU WE'), this is a hint that this sequence is not possible in sign language. Therefore, the recognition of these two succeeding signs is excluded from the recognition process. By this modification, a good compromise between statistical and linguistic language modeling is found. The assignment to specific sign-groups is mainly motivated by the word categories known from German grammar. Sign-groups are 'nouns', 'personal pronouns', 'verbs', 'adjectives', 'adverbs', 'conjunctions', 'modal verbs', 'prepositions', and two additional groups which take the specific characteristics of sign language into account.

3.3.6 Training

This section finally details the training process for subunit models. The aim of the training phase is the estimation of all subunit model parameters λ = (Π, A, B) for each sign of the vocabulary. It is important to note that the subunit model parameters are estimated from data recorded as single signs. The signs are not signed in the context of
whole sign language sentences, which is usually the case for continuous sign language recognition and is then called embedded training. As only single signs serve as input for the training, this approach leads to a decreased training complexity. The training of subunits proceeds in four iterative steps. An overview is depicted in figure 3.32. In a first step a transcription of all signs has to be determined for the sign-lexicon. This step is called initial transcription (see section 3.3.6.1). Hereafter, the model parameters are estimated (estimation step, see section 3.3.6.2), which leads to preliminary models for subunits. A classification step follows (see section 3.3.6.3). Afterwards, the results have to be reviewed in an evaluation step. The result of the whole process is also known as singleton fenonic base forms or synthetic base forms [11]. However, before the training process starts, the feature vector has to be split to consider the different feature groups.
Split of the feature vector
The partition of the feature vector into three different feature groups (pos, size, dist) has to be considered during the training process. Hence, as a first step before the actual training starts, the feature vector is split into the three feature groups. Each feature group can be considered as a complete feature vector and is trained separately, since the groups are (mostly) independent of each other. Each training step is performed for all three feature groups. The training process eventually leads to subunit models λpos, λsize and λdist. The following training steps are described for one of the three feature groups; the same procedure applies to the two remaining feature groups. The training process eventually leads to three different sign-lexica, in which the transcriptions of all signs in the vocabulary are listed with respect to the corresponding feature group.
3.3.6.1 Initial Transcription
To start the training process, a sign-lexicon is needed which does not yet exist. Hence, an initial transcription of the signs for the sign-lexicon is needed before the training can start. The choice of an initial transcription is arbitrary and of little importance for the success of the recognition system, as the transcription is updated iteratively during training. Even with a poor initial transcription, the training process may lead to good models for subunits of signs [11]. Two possible approaches for the initial transcription are given in the following.
1. Clustering of Feature Vectors
Again, this step is undertaken for each feature group separately. The existing data is divided into K different clusters, and each vector is assigned to one of the K clusters. Ideally, all feature vectors belonging to the same cluster are similar, which is one of the goals of the initial transcription. A widespread clustering method is the K-means algorithm [7]. After the K-means algorithm has been run, an assignment of the feature vectors to the clusters is fixed. Each cluster is also associated
Fig. 3.32. The different processing steps for the identification of subunit models.
with a unique index number. For the initial transcription, these index numbers are coded into the sign-lexicon for each feature vector. An example is depicted in figure 3.33. Experiments show that an inventory of K = 200 is reasonable for sign recognition.
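To make this clustering step concrete, the following sketch outlines how such an initial transcription could be derived for one feature group. It is only a minimal illustration under simplifying assumptions: it uses the K-means implementation of scikit-learn, assumes the per-sign feature sequences are available as NumPy arrays, and collapses runs of identical cluster indices into one subunit entry (a convention suggested by the example in figure 3.33, not prescribed by the text). All names are illustrative.

```python
# Sketch of the initial transcription by clustering (one feature group, e.g. "pos").
# sign_sequences: dict mapping a sign name to an array of shape (T, D).
import numpy as np
from sklearn.cluster import KMeans

K = 200  # inventory size reported as reasonable in the experiments

def initial_transcription(sign_sequences, k=K):
    """Cluster all frames and code each sign as its sequence of cluster indices."""
    all_frames = np.vstack(list(sign_sequences.values()))
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(all_frames)
    lexicon = {}
    for sign, frames in sign_sequences.items():
        labels = kmeans.predict(frames)          # one cluster index per feature vector
        # Collapse runs of identical indices into one subunit each -- an assumption
        # suggested by the example of figure 3.33.
        transcription = [int(labels[0])]
        for lab in labels[1:]:
            if lab != transcription[-1]:
                transcription.append(int(lab))
        lexicon[sign] = transcription
    return lexicon
```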
Fig. 3.33. Example for clustering of feature vectors for a GSL sign language sentence THIS HOW MUCH COST (DAS KOSTEN WIEVIEL).
2. Even Division of Signs into Subunits
For this approach, each sign is divided into a fixed number of subunits, e.g. six, per sign and feature group. Accordingly, sign 1 is associated with subunits 1–6, sign 2 with subunits 7–12, and so forth. However, as it is not desirable to have six times more subunits than signs in the vocabulary, not all signs of the vocabulary are used. Instead only a subgroup of approximately 30 signs is chosen at random for the initial transcription, which finally leads to approximately 200 subunits.
3.3.6.2 Estimation of Model Parameters for Subunits
The estimation of model parameters is the core of this training process, as it eventually leads to the parameters of an HMM λi for each subunit. The data of single signs (each sign with m repetitions) serve as input. In addition, the transcription of the signs coded in the sign-lexicon is needed, which in the first iteration is the result of the initial transcription step. In all following iterations, the transcription of the previous iteration serves as input. Before the training starts, the number of states per sign has to be determined. The total number of states for a sign is equal to the number of subunits contained in the sign times two, as each subunit consists of two states. Viterbi training, which has already been introduced in section 2.1.3.6, is employed to estimate the model parameters for subunits. Thus, only the basic approach is outlined here; for further details and equations see section 2.1.3.6. The final result
of this step are HMMs for subunits. However, as data for single signs are recorded, we are again concerned with the problem of finding boundaries, here the boundaries of subunits within signs. Figure 3.34 illustrates how these boundaries are found.
Fig. 3.34. Example for the detection of subunit boundaries. After time alignment the assignment of feature vectors to states is obvious.
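As a rough illustration of this alignment step, the sketch below computes a Viterbi state alignment for one sign over the concatenated subunit states and reads the subunit boundaries off the resulting state path. It assumes diagonal-Gaussian emissions and that the alignment starts in the first state; function and variable names are purely illustrative and not taken from the system described here.

```python
# Sketch of Viterbi time alignment for detecting subunit boundaries within a single sign.
import numpy as np

def log_gauss(x, mean, var):
    """Log density of a diagonal Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def viterbi_alignment(features, states, trans_logprob):
    """features: (T, D) array; states: list of (subunit_id, mean, var) per state;
    trans_logprob: (S, S) matrix of log transition probabilities (-inf = forbidden)."""
    T, S = len(features), len(states)
    emit = np.array([[log_gauss(x, m, v) for (_, m, v) in states] for x in features])
    delta = np.full((T, S), -np.inf)
    psi = np.zeros((T, S), dtype=int)
    delta[0, 0] = emit[0, 0]                     # alignment starts in the first state
    for t in range(1, T):
        for s in range(S):
            scores = delta[t - 1] + trans_logprob[:, s]
            psi[t, s] = int(np.argmax(scores))
            delta[t, s] = scores[psi[t, s]] + emit[t, s]
    # Backtrack the best state path and map each state to its subunit.
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    path.reverse()
    subunit_per_frame = [states[s][0] for s in path]
    boundaries = [t for t in range(1, T) if subunit_per_frame[t] != subunit_per_frame[t - 1]]
    return subunit_per_frame, boundaries
```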
The figure details the situation after time alignment by the Viterbi algorithm performed on a single sign. Here, the abscissa represents the time and thereby the feature vector sequence, whereas the concatenated models of the subunits contained in the sign are depicted on the ordinate. After the alignment of the feature vector sequence to the states, the detection of subunit boundaries is obvious. In this example the first two feature vectors are assigned to the states of the model SU5, the following two are assigned to subunit SU19, one to SU1 and finally the last three to subunit SU7. Now the Viterbi algorithm has to be performed again, this time on the feature vectors assigned to each subunit, in order to identify the parameters of an HMM λi for one subunit.
3.3.6.3 Classification of Single Signs
The same data of single signs, which already served as input to the training process, are now classified to verify the quality of the subunit models estimated in the previous step. It is expected that most recognition results are similar to the transcription coded in the sign-lexicon. However, not all results will comply with the original transcription, which is especially true if data of several, e.g. m, repetitions of one sign exist. As the recognition results of all repetitions of a single sign are most probably not identical, the best representation for all repetitions of that specific sign has to be
found. To reach this goal, an HMM is constructed for each of the m repetitions. Each of these Hidden Markov Models is based on the recognition result of the previous step: the HMMs of the recognised subunit models are concatenated to a word model for that repetition of the sign. For instance, if the recognition result of a repetition of a sign is SU4, SU1, SU7 and SU9, the resulting word model for that repetition is the concatenation SU4 ⊕ SU1 ⊕ SU7 ⊕ SU9. λsi refers here to the word model of the ith repetition of a sign s. We start with the evaluation of the word model λs1 of the first repetition. For this purpose, the probability is determined that the feature vector sequence of a specific repetition, say s2, is produced by the model λs1. This procedure eventually leads to m different probabilities (one for each of the m repetitions) for the model λs1. The multiplication of all these probabilities finally indicates how 'good' this model fits all m repetitions. For the remaining repetitions s2, s3, ..., sm of that sign the procedure is analogous. Finally, the model of the repetition si that results in the highest overall probability is chosen. This best transcription is determined by:

C^{*}(w) = \arg\max_{1 \le i \le M} \prod_{j=1}^{M} P_{A_i(w)}\bigl(A_j(w)\bigr)    (3.17)
Eventually, the transcription of the most probable repetition si is coded in the sign-lexicon. If the transcription of this iteration step differs from that of the previous iteration step, another training iteration follows.
3.3.7 Concatenation of Subunit Models to Word Models for Signs
With the determination of the subunit models the basic training process is complete. For continuous sign language recognition, 'legal' sequences of subunits must be composed into signs. It may however happen that an 'illegal' sequence of subunits, i.e. a sequence of subunits which is not coded in the sign-lexicon, is recognised. To avoid this misclassification, the subunits are concatenated according to their transcription in the sign-lexicon. For example, the model λS for a sign S with a transcription of SU6, SU1, SU4 is then defined by λS = λSU6 ◦ λSU1 ◦ λSU4.
3.3.8 Enlargement of Vocabulary Size by New Signs
The data-driven approach to subunit models makes it possible to increase the vocabulary size by non-trained signs, as only a new entry in the sign-lexicon has to be determined. An additional training step is not needed for the new signs. The approach is similar to the training of subunit models described in section 3.3.6 [11]. For new signs, i.e. signs which have so far not been included in the vocabulary, it has to be determined which sequence of already trained subunit models is able to represent the specific sign. Figure 3.35 illustrates this process. To determine a new entry for the sign-lexicon, at least one performance of the sign is needed. However, for a better modeling, more repetitions of that sign are desirable. In a first step the new sign is recognised. For this recognition process, the
Fig. 3.35. Steps to enlarge the vocabulary by new, non-trained signs.
trained subunit models serve as reference for classification. However, if more than one repetition of the sign is available, it is most unlikely that all repetitions lead to the same recognition result. In the worst case, n different repetitions lead to n different results. To solve this problem, a further evaluation step, which is the same as the one in the training process, is required. Here, for all repetitions of the sign, the 'best' recognition result (in terms of all repetitions) is chosen and added to the sign-lexicon.
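The selection of this 'best' result can be sketched as follows, working in the log domain so that the product of probabilities of Eq. (3.17) becomes a sum. The helpers build_word_model and score_sequence are placeholders (assumptions) for concatenating the recognised subunit HMMs and for computing the log production probability of a feature sequence under such a word model, e.g. via the Viterbi score sketched above.

```python
# Sketch of choosing the best transcription among m repetitions of a sign,
# in the spirit of Eq. (3.17); all names are illustrative.
def select_best_repetition(repetitions, build_word_model, score_sequence):
    """repetitions: list of (transcription, feature_sequence) pairs, one per repetition."""
    best_transcription, best_total = None, float("-inf")
    for transcription_i, _ in repetitions:
        model_i = build_word_model(transcription_i)        # concatenate subunit HMMs
        # Product of probabilities over all repetitions == sum of log probabilities.
        total = sum(score_sequence(model_i, feats_j) for _, feats_j in repetitions)
        if total > best_total:
            best_transcription, best_total = transcription_i, total
    return best_transcription
```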
3.4 Performance Evaluation
In this section some performance data achieved with the video-based sign language recognizer described above are presented. Person-dependent as well
as person-independent recognition rates are given for single signs under controlled laboratory conditions as well as under real-world conditions. Recognition results are also presented regarding the applicability of subunits that were automatically derived from signs.
3.4.1 Video-based Isolated Sign Recognition
As a first result, Tab. 3.2 shows the person-dependent recognition rates from a leave-one-out test for four signers and various video resolutions under controlled laboratory conditions. The training resolution was always 384 × 288. Since the number of recorded signs varies slightly, the vocabulary size is specified separately for each signer. Only manual features were considered. Interestingly, COG coordinates alone accounted for approx. 95% recognition of a vocabulary of around 230 signs. On a 2 GHz PC, processing took an average of 11.79 s / 4.15 s / 3.08 s / 2.92 s per sign, depending on resolution. Low resolutions caused only a slight decrease in recognition rate but reduced processing time considerably. So far a comparably high performance has only been reported for intrusive systems.

Table 3.2. Person-dependent isolated sign recognition with manual features in controlled environments.

Video Resolution   Features      Ben          Michael      Paula        Sanchu       Average
                                 (235 signs)  (232 signs)  (219 signs)  (230 signs)  (229 signs)
384 × 288          all           98.7%        99.3%        98.5%        99.1%        98.9%
192 × 144          all           98.5%        97.4%        98.5%        99.1%        98.4%
128 × 96           all           97.7%        96.5%        98.3%        98.6%        97.8%
96 × 72            all           93.1%        93.7%        97.1%        95.9%        94.1%
384 × 288          x, ẋ, y, ẏ    93.8%        93.9%        95.5%        96.1%        94.8%
Under the same conditions head pose, eyebrow position, and lip outline were employed as non-manual features. The recognition rates achieved on the vocabularies presented in Tab. 3.2 varied between 49.3% and 72.4% among the four signers, with an average of 63.6%. Hence roughly two of three signs were recognized just from non-manual features, a result that emphasizes the importance of mimics for sign language recognition. Tab. 3.3 shows results for person-independent recognition. Since the signers used different signs for some words, the vocabulary has been chosen as the intersection of the test signs with the union of all training signs. No selection has been performed otherwise, and no minimal pairs have been removed. As expected, performance drops significantly. This is caused by strong interpersonal variance in signing, as visualized in Fig. 3.5 by COG traces for identical signs performed by different signers. Recognition rates are also affected by the exact constellation of training/test signers.
Table 3.3. Person-independent isolated sign recognition with manual features in controlled environments. The n-best rate indicates the percentage of results for which the correct sign was among the n signs deemed most similar to the input sign by the classifier. For n = 1 this corresponds to what is commonly specified as the recognition rate.

Training Signer(s)       Test Signer  Vocabulary Size  n = 1   n = 5   n = 10
Michael                  Sanchu       205              36.0%   58.0%   64.9%
Paula, Sanchu            Michael      218              30.5%   53.6%   63.2%
Ben, Paula, Sanchu       Michael      224              44.5%   69.3%   77.1%
Ben, Michael, Paula      Sanchu       221              54.2%   79.4%   84.5%
Ben, Michael, Sanchu     Paula        212              37.0%   63.6%   72.8%
Michael, Sanchu          Ben          206              48.1%   70.0%   77.4%
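For clarity, the n-best rate used in Tab. 3.3 can be computed with a minimal helper like the following (illustrative names; ranked_results is assumed to hold, per test sign, the correct gloss together with the classifier's ranking of candidate signs).

```python
# Small helper computing the n-best rate of Tab. 3.3.
def n_best_rate(ranked_results, n):
    """ranked_results: list of (correct_sign, ranked_candidates) pairs."""
    hits = sum(1 for correct, ranking in ranked_results if correct in ranking[:n])
    return hits / len(ranked_results)

# n = 1 corresponds to the usual recognition rate.
```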
Person-independent performance in uncontrolled environments is difficult to measure since it depends on multiple parameters (signer, vocabulary, background, lighting, camera). Furthermore, noise and outliers are inevitably introduced in the features when operating in real-world settings. Despite the large interpersonal variance, person-independent operation is feasible for small vocabularies, as can be seen in Tab. 3.4. Each test signer was recorded in a different real-life environment, and the selection of signs is representative of the complete vocabulary (it contains one-handed and two-handed signs, both with and without overlap).

Table 3.4. Person-independent recognition rates in real-life environments.

Vocabulary Size  Christian  Claudia  Holger  Jörg    Markus  Ulrich  Average
6                96.7%      83.3%    96.7%   100%    100%    93.3%   95.0%
18               90.0%      70.0%    90.0%   93.3%   96.7%   86.7%   87.8%
3.4.2 Subunit Based Recognition of Signs and Continuous Sign Language
Recognition of continuous sign language requires a prohibitive amount of training material if signed sentences are used. Therefore the automated transcription of signs to subunits described in Sect. 3.3 was used here, which is expected to reduce the effort required for vocabulary extension to a minimum. For 52 signs taken arbitrarily from the domain "shopping" an automatic transcription was performed, which resulted in 184 (Pos), 184 (Size) and 187 (Dist) subunits, respectively. Subsequently performed classification tests with signs synthesized from the identified subunits yielded over 90% correct recognition. This result shows that the automatic segmentation of signs produced reasonable subunits. Additionally, 100 continuous sign language sentences were created using only signs synthesized from subunits. Each sentence
consisted of two to nine signs. After eliminating coarticulation effects and applying bigrams according to the statistical probability of sign order, a recognition rate of 87% was achieved. This finding is essential, as it solves the problem of vocabulary extension without additional training.
3.4.3 Discussion
In this chapter a concept for video-based sign language recognition was described. Manual and non-manual features have been integrated into the recognizer to cover all aspects of sign language expressions. Excellent recognition performance has been achieved for person-dependent classification and medium-sized vocabularies. The presented system is also suitable for person-independent real-world applications where small vocabularies suffice, e.g. for controlling interactive devices. The results also show that a reasonable automatic transcription of signs to subunits is feasible. Hence the extension of an existing vocabulary is possible without the need for large amounts of training data. This constitutes a key feature in the development of sign language recognition systems supporting large vocabularies.
References
1. Bahl, L., Jelinek, F., and Mercer, R. A Maximum Likelihood Approach to Continuous Speech Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 5(2):179–190, March 1983.
2. Bauer, B. Erkennung kontinuierlicher Gebärdensprache mit Untereinheiten-Modellen. Shaker Verlag, Aachen, 2003.
3. Becker, C. Zur Struktur der deutschen Gebärdensprache. WVT Wissenschaftlicher Verlag, Trier (Germany), 1997.
4. Canzler, U. and Dziurzyk, T. Extraction of Non Manual Features for Videobased Sign Language Recognition. In Proceedings of the IAPR Workshop on Machine Vision Applications, pages 318–321. Nara, Japan, December 2002.
5. Canzler, U. and Ersayar, T. Manual and Facial Features Combination for Videobased Sign Language Recognition. In 7th International Student Conference on Electrical Engineering, page IC8. Prague, May 2003.
6. Derpanis, K. G. A Review of Vision-Based Hand Gestures. Technical report, Department of Computer Science, York University, 2004.
7. Duda, R., Hart, P., and Stork, D. Pattern Classification. Wiley-Interscience, New York, 2000.
8. Fang, G., Gao, W., Chen, X., Wang, C., and Ma, J. Signer-Independent Continuous Sign Language Recognition Based on SRN/HMM. In Revised Papers from the International Gesture Workshop on Gestures and Sign Languages in Human-Computer Interaction, pages 76–85. Springer, 2002.
9. Hermansky, H., Timberwala, S., and Pavel, M. Towards ASR on Partially Corrupted Speech. In Proc. ICSLP '96, volume 1, pages 462–465. Philadelphia, PA, 1996.
10. Holden, E. J. and Owens, R. A. Visual Sign Language Recognition. In Proceedings of the 10th International Workshop on Theoretical Foundations of Computer Vision, pages 270–288. Springer, 2001.
11. Jelinek, F. Statistical Methods for Speech Recognition. MIT Press, 1998. ISBN 0-262-10066-5.
12. Jones, M. and Rehg, J. Statistical Color Models with Application to Skin Detection. Technical Report CRL 98/11, Compaq Cambridge Research Lab, December 1998.
13. KaewTraKulPong, P. and Bowden, R. An Improved Adaptive Background Mixture Model for Realtime Tracking with Shadow Detection. In AVBS01, 2001.
14. Liang, R. H. and Ouhyoung, M. A Real-time Continuous Gesture Interface for Taiwanese Sign Language. In UIST '97: Proceedings of the 10th Annual ACM Symposium on User Interface Software and Technology. ACM, Banff, Alberta, Canada, October 14–17, 1997.
15. Murakami, K. and Taguchi, H. Gesture Recognition Using Recurrent Neural Networks. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 237–242. ACM Press, 1991.
16. Ong, S. C. W. and Ranganath, S. Automatic Sign Language Analysis: A Survey and the Future Beyond Lexical Meaning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6):873–891, June 2005.
17. Parashar, A. S. Representation and Interpretation of Manual and Non-manual Information for Automated American Sign Language Recognition. PhD thesis, Department of Computer Science and Engineering, College of Engineering, University of South Florida, 2003.
18. Porikli, F. and Tuzel, O. Human Body Tracking by Adaptive Background Models and Mean-Shift Analysis. Technical Report TR-2003-36, Mitsubishi Electric Research Laboratory, July 2003.
19. Sonka, M., Hlavac, V., and Boyle, R. Image Processing, Analysis and Machine Vision. Brooks Cole, 1998.
20. Starner, T., Weaver, J., and Pentland, A. Real-Time American Sign Language Recognition Using Desk and Wearable Computer Based Video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1371–1375, December 1998.
21. Stauffer, C. and Grimson, W. E. L. Adaptive Background Mixture Models for Real-time Tracking. In Computer Vision and Pattern Recognition 1999, volume 2, 1999.
22. Stokoe, W. Sign Language Structure: An Outline of the Visual Communication Systems of the American Deaf. (Studies in Linguistics, Occasional Paper 8) University of Buffalo, 1960.
23. Sutton, V. http://www.signwriting.org/, 2003.
24. Vamplew, P. and Adams, A. Recognition of Sign Language Gestures Using Neural Networks. In European Conference on Disabilities, Virtual Reality and Associated Technologies, 1996.
25. Vogler, C. and Metaxas, D. Parallel Hidden Markov Models for American Sign Language Recognition. In Proceedings of the International Conference on Computer Vision, 1999.
26. Vogler, C. and Metaxas, D. Toward Scalability in ASL Recognition: Breaking Down Signs into Phonemes. In Braffort, A., Gherbi, R., Gibet, S., Richardson, J., and Teil, D., editors, The Third Gesture Workshop: Towards a Gesture-Based Communication in Human-Computer Interaction, pages 193–204. Springer-Verlag Berlin, Gif-sur-Yvette (France), 2000.
27. Welch, G. and Bishop, G. An Introduction to the Kalman Filter. Technical Report TR 95-041, Department of Computer Science, University of North Carolina at Chapel Hill, 2004.
28. Yang, M., Ahuja, N., and Tabb, M. Extraction of 2D Motion Trajectories and Its Application to Hand Gesture Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24:1061–1074, 2002.
29. Zieren, J. and Kraiss, K.-F. Robust Person-Independent Visual Sign Language Recognition. In Proceedings of the 2nd Iberian Conference on Pattern Recognition and Image Analysis (IbPRIA 2005), Lecture Notes in Computer Science, 2005.
Chapter 4
Speech Communication and Multimodal Interfaces
Björn Schuller, Markus Ablaßmeier, Ronald Müller, Stefan Reifinger, Tony Poitschke, Gerhard Rigoll
Within the area of advanced man-machine interaction, speech communication has played a major role for several decades. The idea of replacing conventional input devices such as buttons and keyboards by voice control, and thus increasing comfort and input speed considerably, seems so attractive that even the quite slow progress of speech technology during those decades could not discourage people from pursuing that goal. However, nowadays this area is in a different situation than in those earlier times, and this shall also be considered in this book section: First of all, speech technology has reached a much higher degree of maturity, mainly through the technique of stochastic modeling, which shall be briefly introduced in this chapter. Secondly, other interaction techniques became more mature, too, and in the framework of that development, speech became one of the preferred modalities of multimodal interaction, e.g. as an ideal complementary mode to pointing or gesture. This is also reflected in the subsection on multimodal interaction. Another relatively recent development is the fact that speech is not only a carrier of linguistic information, but also one of emotional information, and emotions became another important aspect in today's advanced man-machine interaction. This will be considered in a subsection on affective computing, where this topic is also consequently investigated from a multimodal point of view, taking into account the possibilities for extracting emotional cues from the speech signal as well as from visual information. We believe that such an integrated approach to all the aspects mentioned above is appropriate in order to reflect the newest developments in that field.
4.1 Speech Recognition
This section is concerned with the basic principles of Automatic Speech Recognition (ASR). This research area has developed dynamically during the last decades: it was considered in its early stage an almost unsolvable problem, then went through several evolution steps during the 1970s and 1980s, and eventually became more mature during the 1990s, when extensive databases and evaluation schemes became available that clearly demonstrated the superiority of stochastic machine learning techniques for this task. Although the existing technology is still far from being perfect, today there is a speech recognition market with a number of existing products that are almost all based on the aforementioned technology.
Our goal here is neither to describe the entire history of this development, nor to provide the reader with a detailed presentation of the complete state-of-the-art of the fundamental principles of speech recognition (which would probably require a separate book). Instead, the aim of this section is to present a relatively compact overview of the currently predominant and most successful method for speech recognition, which is based on a probabilistic approach for modeling the production of speech using the technique of Hidden Markov Models (HMMs). Today, almost every speech recognition system, including laboratory as well as commercial systems, is based on this technology and therefore it is useful to concentrate in this book exclusively on this approach. Although this method makes use of complicated mathematical foundations in probability theory, machine learning and information theory, it is possible to describe its basic functionality using only a moderate amount of mathematical expressions, which is the approach pursued in this presentation.
4.1.1 Fundamentals of Hidden Markov Model-based Speech Recognition
The fundamental assumption of this approach is that each speech sound of a certain language (more commonly called phoneme) is represented as a Hidden Markov Model, which is nothing else but a stochastic finite state machine. Similarly to classical finite state machines, an HMM consists of a finite number of states, with possible transitions between those states. The speech production process is considered as a sequence of discrete acoustic events, where typically each event is characterized by the production of a vector of features that describes the produced speech signal at the corresponding discrete time step of the speech production process. At each of these discrete time steps, the HMM is assumed to be in one of its states, and at the next time step the HMM performs a transition into a state that can be reached from its current state according to its topology (including possible transitions back to its current state or formerly visited states).
Fig. 4.1. Example for a Hidden Markov Model.
Fig. 4.1 shows such an HMM, which has two major sets of parameters: The first set is the matrix of transition probabilities describing the probability p(s(k − 1) →
s(k)), where k is the discrete time index and s the notation for a state. These probabilities are represented by the parameters a in Fig. 4.1. The second set represents the so-called emission probabilities p(x|s(k)), which is the probability that a certain feature vector can occur while the HMM is currently in state s at time k. This probability is usually expressed by a continuous distribution function, denoted as function b in Fig. 4.1, which is in many cases a mixture of Gaussian distributions. If a transition into another state is assumed at discrete time k with the observed acoustic vector x(k), it is very likely that this transition will be into a state with a high emission probability for this observed vector, i.e. into a state that represents well the characteristics of the observed feature vector.
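The two parameter sets a and b can be pictured as a small data structure like the following sketch, which stores a log transition matrix and one Gaussian-mixture emission density per state, and evaluates the joint score of a transition together with an observed feature vector (the quantity formalized below in Eqn. 4.1). The class layout and names are illustrative assumptions, not part of any particular recognizer.

```python
# Minimal sketch of the HMM parameter sets of Fig. 4.1 (assumption:
# Gaussian-mixture emissions with diagonal covariances; names illustrative).
import numpy as np

class GaussianMixtureEmission:
    def __init__(self, weights, means, variances):
        self.weights = np.asarray(weights)        # (M,) mixture weights
        self.means = np.asarray(means)            # (M, D) component means
        self.variances = np.asarray(variances)    # (M, D) diagonal variances

    def log_prob(self, x):
        """log b_s(x): log-likelihood of feature vector x under the mixture."""
        comp = -0.5 * np.sum(
            np.log(2 * np.pi * self.variances) + (x - self.means) ** 2 / self.variances,
            axis=1,
        )
        return float(np.logaddexp.reduce(np.log(self.weights) + comp))

class HMM:
    def __init__(self, log_trans, emissions):
        self.log_trans = np.asarray(log_trans)    # parameter set a: (S, S) log transitions
        self.emissions = emissions                # parameter set b: one emission per state

    def joint_log_prob(self, x, prev_state, state):
        """log p(x, prev_state -> state) = log a[prev, state] + log b_state(x)."""
        return self.log_trans[prev_state, state] + self.emissions[state].log_prob(x)
```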
Fig. 4.2. Example for HMM-based speech recognition and training.
With these basic assumptions, it is already possible to formulate the major principles of HMM-based speech recognition, which can best be done by having a closer look at Fig. 4.2: In the lower part of this figure, one can see the speech signal of an utterance representing the word sequence "this was", also displayed in the upper part of this figure. This signal has been subdivided into time windows of constant length (typically around 10 ms) and for each window a vector of features has been generated, e.g. by applying a frequency transformation to the signal, such as a Fourier transform. This results in a sequence of vectors displayed just above the speech signal. This vector sequence is now "observed" by the Markov Model in Fig. 4.2, which has been generated by representing each phoneme of the underlying word sequence by a 2-state HMM and concatenating the separate phoneme HMMs into one larger single HMM. An important issue in Fig. 4.2 is represented by the black lines that visualize the assignment of each feature vector to one specific state of the HMM. We have thus as many assignments as there are vectors in the feature vector sequence X = [x(1), x(2), ..., x(K)], where K is the number of vectors,
i.e. the length of the feature vector sequence. This so-called state alignment of the feature vectors is one of the essential capabilities of HMMs and there are different algorithms for the computation of this alignment, which shall not be presented here in detail. However, the basic principle of this alignment procedure can be made clear by considering one single transition of the HMM at discrete time k from state s(k-1) to state s(k), where at the same time the occurrence of feature vector x(k) is observed. Obviously, the joint probability of this event, consisting of the mentioned transition and the occurrence of vector x(k), can be expressed according to Bayes' law as:

p(x(k), s(k-1) \to s(k)) = p(x(k) \mid s(k-1) \to s(k)) \cdot p(s(k-1) \to s(k)) = p(x(k) \mid s(k)) \cdot p(s(k-1) \to s(k))    (4.1)
and is thus composed exactly of the two parameter sets a and b mentioned before, which describe the HMM and are shown in Fig. 4.1. As already stated, a transition will likely be made into one of the next states that results in a high joint probability as expressed in the above formula. One can thus imagine that an algorithm for computing the most probable state sequence as alignment to the observed feature vector sequence must be based on the selection of a state sequence that maximizes the product of all probabilities according to the above equation for k = 1 to K. A few more details on this algorithm will be provided later. For now, let us assume that the parameters of the HMMs are all known and that the optimal state sequence has been determined, as indicated in Fig. 4.2 by the black lines that assign each vector to its most probable state. If that is the case, this approach has produced two major results: The first one is a segmentation result that assigns each vector to one state. With this, it is for instance possible to determine which vectors – and thus which section of the speech signal – have been assigned to a specific phoneme, e.g. to the sound /i/ in Fig. 4.2 (namely vectors no. 6–9). The second result is the aforementioned overall probability that the feature vector sequence representing the speech signal has been assigned to the optimal state sequence. Since this probability is the maximum possible probability and no other state sequence leads to a larger value, it can be considered as the overall production probability that the speech signal has been produced by the underlying hidden Markov model shown in Fig. 4.2. Thus, an HMM is capable of processing a speech signal and producing two important results, namely a segmentation of the signal into subordinate units (such as phonemes) and the overall probability that such a signal can have been produced at all by the underlying probabilistic model.
4.1.2 Training of Speech Recognition Systems
These results can be exploited in HMM-based speech recognition in the following manner: In the training phase, the HMM parameters in Fig. 4.2 are not known, but the transcription of the corresponding word sequence (here: "this was") is known. If some initial parameters for the large concatenated HMM of Fig. 4.2 are assumed
for the start of an iterative procedure, then it is of course possible to compute with these parameters the optimal corresponding state sequence, as outlined before. This will certainly not be an optimal sequence, since the initial HMM parameters might not have been selected very well. However, it is possible to derive from that state sequence a new estimate of the HMM parameters by simply exploiting its statistics. This is quite obvious for the transition probabilities, since one only needs to count the occurring transitions between the states of the calculated state sequence and divide that by the total number of transitions. The same is possible for the probabilities p(x(k)|s(k)), which can basically be (without details) derived by observing which kind of vectors have been assigned to the different states and calculating statistics of these vectors, e.g. their mean values and variances. It can be shown that this procedure can be repeated iteratively: With the HMM parameters updated in the way just described, one can now again compute a new, improved alignment of the vectors to the states, and from there again exploit the state sequence statistics for updating the HMM parameters. Typically, this indeed leads to a useful estimation of the HMM parameters after a few iterations. Moreover, at the end of this procedure, another important and final step can be carried out: By "cutting" the large concatenated HMM in Fig. 4.2 again into the smaller phoneme-based HMMs, one obtains a single HMM for each phoneme, which now has estimated parameters that represent well the characteristics of the different sounds. Thus, the single HMM for the phoneme /i/ in Fig. 4.2 will certainly have different parameters than the HMM for the phoneme /a/ in this figure, and serves well as a probabilistic model for this sound. This is basically the training procedure of HMMs, and it becomes obvious that the HMM technology allows the training of probabilistic models for each speech unit (typically the unit "phoneme") from the processing of entire long sentences without the necessity to phoneme-label or to pre-segment these sentences manually, which is an enormous advantage.
4.1.3 Recognition Phase for HMM-based ASR Systems
Furthermore, Fig. 4.2 can also serve as a suitable visualization for demonstrating the recognition procedure in HMM-based speech recognition. In this case – contrary to the training phase – the HMM parameters are given (from the training phase) and the speech signal represents an unknown utterance; therefore the transcription in Fig. 4.2 is now unknown and shall be reconstructed by the recognition procedure. From the previous description, we have to recall once again that very efficient algorithms exist for the computation of the state alignment between the feature vector sequence and the HMM states, as depicted by the black lines in the lower part of Fig. 4.2. So far we have not looked at the details of such an algorithm, but shall now go into a somewhat more detailed analysis of one of these algorithms by looking at Fig. 4.3, which displays in the horizontal direction the time axis with the feature vectors x appearing at discrete time steps, and the states of a 3-state HMM on the vertical axis. Assuming that the model starts in the first state, it is obvious that the first feature vector will be assigned to that initial state. Looking at the second feature vector,
Fig. 4.3. Trellis diagram for the Viterbi algorithm.
according to the topology of the HMM, it is possible that the model stays in state 1 (i.e. makes a transition from state 1 to state 1) or moves with a transition from state 1 into state 2. For both options, the probability can be computed according to Eqn. 4.1, and both options are shown as possible paths in Fig. 4.3. It is then obvious that from both path end points for time step 2, the path can be extended, in order to compute whether the model has made a transition from state 1 (into either state 1 or state 2) or a transition from state 2 (into either state 2 or state 3). Thus, for the 3rd time step, the model can be in state 1, 2 or 3, and all these options can be computed with a certain probability, by multiplying the probabilities obtained for time step 2 by the appropriate transition and emission probabilities for the occurrence of the third feature vector. In this way it should be easily visible that the possible state sequences can be displayed in a grid (also called trellis) as shown in Fig. 4.3, which displays all possible paths that can be taken from state 1 to the final state 3 in this figure, assuming the observation of five different feature vectors. The bold line in this grid shows one possible path through the grid, and it is clear that for each path a probability can be computed that this path has been taken, according to the previously described procedure of multiplying the appropriate production probabilities of each feature vector according to Eqn. 4.1. Thus, the optimal path can be computed with the help of the principle of dynamic programming, which is well-known from the theory of optimization. The above procedure describes the Viterbi algorithm, which can be considered as one of the major algorithms for HMM-based speech recognition. As already mentioned several times, the major outcome of this algorithm is the optimal state sequence as well as the "production probability" for this sequence, which is also the probability that the given HMM has produced the associated feature vector sequence X. Let's assume that the 3-state HMM in Fig. 4.3 represents a phoneme, as mentioned in the previous section on the training procedure, resulting in trained parameters for HMMs which typically represent all the phonemes in a given language. Typically, an unknown speech signal will either represent a spoken word or an entire spoken sentence. How can such a sentence then be recognized by the Viterbi algorithm as described above for the state-alignment procedure of an HMM representing a single
phoneme? This can be achieved by simply extending the algorithm so that it computes the most likely sequence of phoneme HMMs that maximizes the probability of emitting the observed feature vector sequence. That means that after the final state of the HMM in Fig. 4.3 has been reached (assuming a rather long feature vector sequence whose end is not yet reached), another HMM (and possibly more) will be appended and the algorithm is continued until all feature vectors are processed and a final state (representing the end of a word) is obtained. Since it is not known which HMMs have to be appended and what the optimal HMM sequence will be, it becomes obvious that this procedure implies a considerable search problem, and this process is therefore also called "decoding". There are however several ways to support this decoding procedure, for instance by considering the fact that the phoneme order within words is rather fixed and variation can basically only happen at word boundaries. Thus the phoneme search procedure is more a word search procedure, which is furthermore assisted by the so-called language model, which will be discussed later and assigns probabilities for extending the search path into a new word model if the search procedure has reached the final state of a preceding word model. Therefore, the algorithm can finally compute the most likely word sequence, and the final result of the recognition procedure is then the recognized sentence. It should be noted that, due to the special capabilities of the HMM approach, this decoding result can additionally be obtained together with the production probability of that sentence, and with the segmentation result indicating which part of the speech signal can be assigned to the word and even to the phoneme boundaries of the recognized utterance. Some extensions of the algorithms even make it possible to compute the N most likely sentences for the given utterance (for instance for further processing of that result in a semantic module), indicating once again the power and elegance of this approach.
4.1.4 Information Theory Interpretation of Automatic Speech Recognition
With this background, it is now possible to consider the approach to HMM-based speech recognition from an information theoretic point of view, by considering Fig. 4.4.
Fig. 4.4. Information theory interpretation of automatic speech recognition.
Fig. 4.4 can be interpreted as follows: A speaker formulates a sentence as a sequence of words denoted as W = [w(1), w(2), ..., w(N )]. He speaks that sentence into the microphone that captures the speech signal which is actually seen by the automatic speech recognizer. Obviously, this recognizer does not see the speaker’s
originally uttered word sequence, but instead sees the "encoded" version of it in form of the feature vector sequence X that has been derived from the acoustic waveform, as output of the acoustic channel as displayed in Fig. 4.4. There is a probabilistic relation between the word sequence W and the observed feature vector sequence X, and indeed this probabilistic relation is modeled by the Hidden Markov Models that represent the phoneme sequence implied by the word sequence W. In fact, this model yields the already mentioned "production probability" that the word sequence W, represented by the appropriate sequence of phoneme HMMs, has generated the observed feature vector sequence, and this probability can be denoted as p(X|W). The second part of Fig. 4.4 shows the so-called "linguistic decoder", a module that is responsible for decoding the original information W from the observed acoustic sequence X, by taking into account the knowledge about the model provided by the HMMs expressed in p(X|W). The decoding strategy of this module is to find the best possible word sequence W from observing the acoustic feature string X and thus to maximize p(W|X) according to Bayes' rule as follows:

\max_{W} p(W \mid X) = \max_{W} \left[ p(X \mid W) \cdot \frac{p(W)}{p(X)} \right]    (4.2)
And because finding the optimal word sequence W is independent of the probability p(X), the final maximization rule is:

\max_{W} \left[ p(X \mid W) \cdot p(W) \right]    (4.3)
Exactly this product of probabilities has to be maximized during the search procedure described before in the framework of the Viterbi algorithm. In this case, it should be noted that p(X|W) is nothing else but the probability of the feature vector sequence X under the assumption that it has been generated by the underlying word sequence W, and exactly this probability is expressed by the HMMs resulting from the concatenation of the phoneme-based HMMs into a model that represents the resulting word string. In this way, the above formula indeed expresses the aforementioned decoding strategy, namely to find the combination of phoneme-based HMMs that maximizes the corresponding emission probability. However, the above maximization rule contains an extra term p(W) that has to be taken into account in addition to the maximization procedure known so far: This term p(W) is in fact the "sentence probability" that the word sequence W will occur at all, independent of the acoustic observation X. This probability is described by the so-called language model, which simply expresses the occurrence probability of a word in a sentence given its predecessors, and can be expressed as:

p(w(n) \mid w(n-1), w(n-2), \ldots, w(n-m))    (4.4)
In this case, the variable m denotes the "word history", i.e. the number of predecessor words that are considered to be relevant for the computation of the current word's appearance probability. Then, the overall sentence probability can be expressed as the product of all single word probabilities in a sentence according to

p(W) = \prod_{n=1}^{N} p(w(n) \mid w(n-1), \ldots, w(n-m))    (4.5)
where N is the length of the sentence and m is the considered length of the word history. As already mentioned, the above word probabilities are completely independent of any acoustic observation and can be derived from statistics obtained e.g. from the analysis of large written text corpora by counting the occurrence of all words given a certain history of word predecessors.
4.1.5 Summary of the Automatic Speech Recognition Procedure
Fig. 4.5. Block diagram for HMM-based speech recognition.
Finally, to summarize the functioning of HMM-based speech recognition, the block diagram in Fig. 4.5 can be interpreted as follows: The speech signal is captured by a microphone, sampled and digitized. Preprocessing of the speech signal includes some filtering process and possible noise compensation. The next step in Fig. 4.5 is feature extraction, where the signal is split into windows of roughly 10 ms length and, for each window, a feature vector is computed that typically represents the windowed signal in the frequency domain. Then, in recognition mode, for each vector of the resulting feature vector sequence, state conditional probabilities are computed, basically by inserting the feature vector x(k) into the right-hand side of Eqn. 4.1 for each state which will be considered in the decoding procedure. According to Fig. 4.5, the gray-shaded table labeled as "acoustic phoneme models"
contains the parameters of the distribution functions that are used to compute these state conditional probabilities. This computation is integrated into the already mentioned search procedure that attempts to find the best possible string of concatenated HMM phoneme models maximizing the emission probability of the feature vector sequence. This search procedure is controlled by the gray-shaded table containing the phonological word models (i.e. how a word is composed of phonemes) and the table containing the language model, which delivers probabilities for examining the next likely word if the search procedure has reached the final HMM state of a previous word model. In this way, the above computation of state conditional probabilities does not need to be carried out for all possible states, but only for the states that are considered to be likely by the search procedure. The result of the search procedure is the most probable word model sequence, which is displayed as a transcription representing the recognized sentence to the user of the speech recognition system. This brief description of the basic functionality of HMM-based speech recognition can of course not cover this complicated subject in sufficient detail, and it is therefore not surprising that many interesting sub-topics in ASR have not been covered in this introduction. These include e.g. the area of discriminative training techniques for HMMs, different HMM architectures such as discrete, hybrid and tied-mixture approaches, the field of context-dependent modeling where acoustic models are created that model the coarticulation effects of phonemes in the context of neighboring phonemes, as well as clustering techniques that are required for representing the acoustic parameters of these extended models. Other examples include the use of HMM multi-stream techniques to handle different acoustic feature streams or the entire area of efficient decoding techniques, e.g. the inclusion of higher level semantics in decoding, fast search techniques, efficient dictionary structures or different decoder architectures such as stack decoders. The interested reader is advised to study the available literature on these topics and the large number of conference papers describing those approaches in more detail.
4.1.6 Speech Recognition Technology
As already mentioned, the HMM technology has become the major technique for Automatic Speech Recognition and has nowadays reached a level of maturity that makes it the dominating technology not only in laboratory systems but in commercial systems as well. The HMM technology is also so flexible that it can be deployed for almost every specialization in ASR. Basically, one can distinguish the following technology lines: Small-vocabulary ASR systems with a 10–50 word vocabulary, in speaker-independent mode, mainly used for telephone applications. Here, the capability of HMMs to capture the statistics of large training corpora obtained from many different speakers is exploited. The second line is represented by speaker-independent systems with medium-sized vocabularies, often used in automotive or multimedia application environments. Here, noise reduction technologies are often combined with the HMM framework, and the efficiency of HMMs for decoding entire sequences of phonemes and words is exploited for the recognition of continuously spoken utterances in adverse environments. The last
major line are dictation systems with very large vocabularies (up to 100,000 words), which often operate in speaker-dependent and/or speaker-adaptive mode. In this case, the special capabilities of HMMs lie in the area of efficient decoding techniques for searching very large trellis spaces, and especially in the field of context-dependent acoustic modeling, where coarticulation effects in continuous speech can be efficiently modeled by so-called triphones and clustering techniques for their acoustic parameters. One of the most recent trends is the development of so-called embedded systems, where medium to large vocabulary ASR systems are implemented on devices such as mobile phones, thin clients or other electronic devices. This has become possible with the availability of appropriate memory cards with sufficiently large storage capacity, so that the acoustic parameters and especially the memory-intensive language model with millions of word sequence probabilities can be stored directly on such devices. Due to these memory problems, another recent trend is so-called Distributed Speech Recognition (DSR), where only feature extraction is computed on the local device and the features are then transferred by wireless transmission to a large server where all remaining recognition steps are carried out, i.e. the computation of emission probabilities and the decoding into the recognized word sequence. For these steps, an arbitrarily large server can be employed, with sufficient computation power and memory for large language models and a large number of Gaussian parameters for acoustic modeling.
4.1.7 Applications of ASR Systems
It is not surprising that the above mentioned algorithms and available technologies have led to a large variety of interesting applications. Although ASR technology is far from being perfect, the currently achievable performance is in many cases satisfactory enough to create novel ideas for application scenarios or to revisit already established application areas with improved technology. Although current speech recognition applications are manifold, a certain structure can be established by identifying several major application areas as follows:
Telecommunications: This is probably still the most important application area, due to the fact that telecommunications is very much concerned with telephony applications naturally involving speech technology; in this case, speech recognition is a natural bridge between telecommunications and information technology by providing a natural interface for entering data via a communication channel into information systems. Thus, most application scenarios are in the area of speech recognition involving the telephone. Prominent scenarios include inquiry systems, where the user requests information by calling an automated system, e.g. for banking information or querying train and flight schedules. Dialing assistance, such as speaking the telephone number instead of dialing or typing it, also belongs to this application area. More advanced applications include telephony interpretation, i.e. the automatic translation of phone calls between partners from different countries, and all techniques involving mobile communications, such as embedded speech recognition on mobile clients and distributed speech recognition.
Office Automation: Similarly to telecommunications, office automation has been a traditional application area of ASR for several decades and has been one of the driving forces for very large vocabulary ASR. This area has been revived by the latest commercial large vocabulary speech recognition systems that really provide the required performance in terms of vocabulary size and recognition accuracy. Typical concrete applications in this area include the classical dictation task, where a secretary creates a letter directly via voice input, and other scenarios such as the use of ASR in CAD applications or direct command input for PCs.
Medical Applications: This area also has a quite long tradition and has been a favorite experimental scenario for ASR for at least 20 years. The most prominent field has been radiology, where the major idea has been to create a medical report directly from the visual analysis of an x-ray image by dictating into a microphone connected to an ASR system. Other scenarios include the control of microscopes via voice, and another major application area of speech technology in general, namely the area of handicapped users, where the malfunction of hands or arms can be compensated by voice control of devices and interfaces. The size of this potential user group and the variety of different applications for handicapped users make this one of the most important ASR application scenarios.
Production and Manufacturing: This area may be less popular than the previously mentioned application areas, but has also been investigated for a long time as a potentially very interesting application with a large industrial impact. Popular applications include data distribution via spoken ID codes, programming of NC machines via voice, or spoken commands for the control of large plants, such as power or chemical plants.
Multimedia Applications: Naturally, the rise of multimedia technology has also increased the demand for speech-based interfaces, as one major modality of multimodal interfaces. Most of these systems are still in the experimental phase, but some industrial applications are already underway, e.g. in smart environments or information kiosks that require user input via voice and pointing. Another interesting area in this field is voice-enabled web applications, where speech recognition is used for the access to web documents.
Private Sector: This field contains several very popular speech application areas with huge potential, such as the automotive sector, electronic devices and games. Without any doubt, these represent some of the most important application fields, where consumers are ready to make some extra investment in order to add more functionality to their user interface, e.g. in the case of luxury automobiles or expensive specialized electronic devices.
This overview demonstrates the huge potential of speech recognition technology for a large variety of interesting applications that can be well subdivided into the above mentioned major application areas, which will also in the future represent the most relevant domains for further improved ASR systems to come.
4.2 Speech Dialogs

4.2.1 Introduction

It is widely believed that, following the command line interfaces popular in the years 1960-80 and the graphical user interfaces of the years 1980-2000, the future lies in speech and multimodal user interfaces. A number of factors clearly speak for speech as a form of interaction:

• Speech is the most natural form of communication between humans.
• Only limited space is required: a microphone and a loudspeaker can be placed even in wearable devices. This does not, however, account for the computational effort.
• Hands and eyes are left free, which makes speech the number one interaction modality in many control situations such as driving a car.
• Approximately 1.3 billion telephones exist worldwide, more than five times the number of computers currently connected to the Internet. This provides a big market for automatic dialog systems in the future.
Using speech as an input and output form leads us to dialog systems. In general these are, in hierarchical order of complexity, systems that allow for control, e.g. of functions in a car, information retrieval such as flight data, structured transactions by voice, for example stock control, combined tasks of information retrieval and transaction such as booking a hotel according to flight availability, and finally complex tasks such as the care of elderly persons. More specifically, a dialog may be defined as an exchange of different aspects in a reciprocal conversation between at least two parties, which may be either human or machine, with at least one change of speaker. Within this chapter we focus on spoken dialog, although dialog may also exist in other forms, such as textual or combined manners. More concretely, we will deal with so-called Spoken Language Dialog Systems, abbreviated SLDS, which in general are a combination of speech recognition, natural language understanding, dialog act generation, and speech synthesis. It is not exactly defined which parts are mandatory for a SLDS, which contributes to their interdisciplinary nature, uniting the fields of speech processing, dialog design, usability engineering and process analysis within the target area. As an enhancement of Question and Answer (Q&A) systems, which allow natural language access to data by direct answers without reference to past context (e.g. by pronominal allusions), dialog systems also allow for anaphoric expressions. This makes them significantly more natural, as anaphora are frequently used as linguistic elements in human communication. Furthermore, a SLDS may also take over the initiative. What finally distinguishes them from sheer Q&A and Command and Control (C&C) systems in a more complex way is that user modeling may be included. However, both Q&A systems and SLDS already provide an important improvement over today's predominant systems, where users are asked to adhere to a given complex and artificial syntax. The following figure gives an overview of a typical circular pipeline architecture of a SLDS [23]. On top the user can be found, who controls an application
found at the bottom by use of a SLDS as interaction medium. While the components are mostly the same, other architectures exist, such as an organization around a central process [15]. In detail, the single components are:
Fig. 4.6. Overview of a SLDS.
• Automatic Speech Recognition (ASR): Spoken input analysis leading to hypotheses of linguistic units such as phonemes or words, often together with some form of confidence measure of their certainty (see Sect. 4.1). The output is mostly provided in so-called lattices resembling tree structures, or in n-best lists. Key factors of an ASR module or engine, as it is mostly referred to, are speaker (in)dependence, the vocabulary size of known words, and its general robustness. On the acoustic level a user's underlying affect may also be analyzed in order to include emotional aspects (see Sect. 4.4) [48].
• Natural Language Understanding (NLU): Interpretation of the intention or meaning of the spoken content. Here again several hypotheses may be forwarded to the dialog management, combined with certainties.
• Dialog Management (DM): The DM is arguably the central module of a voice interface, as it functions as an intermediate agent between user and application and is responsible for the interaction between them. In general it operates on an intention representation provided by the NLU, which models what the user (presumably) said. On the basis of this information, the DM has several options, such as changing the state of an underlying application in the case of voice control, or retrieving a piece of data from a database in the case of an information service. Furthermore the DM decides when and which type of system voice output is performed. In short, the DM's primary tasks are storage and analysis of context and dialog history, flow control, e.g. for active initiative or barge-in handling, direction of the course of a conversation, answer production in an abstract form, and database access or application control.
• Database (DB): Storage of information with respect to dialog content.
• Natural Language Generation (NLG): Formulation of the abstract answer provided by the DM. A variety of approaches exists for this task, ranging from probabilistic approaches with grammatical post-processing to pre-formulated utterances.
• Speech Synthesis (TTS): Audio production for the naturally formulated system answer. Such modules are in general called Text-to-Speech engines. The two major types are, on the one hand, formant synthesizers, which produce genuinely artificial audio by formant tract modeling, and, on the other hand, concatenative synthesis. In the latter, audio clips of diverse lengths, ranging from phonemes to whole words of recorded speech, are concatenated to produce new words or sentences. At the moment these synthesizers tend to sound more natural, depending on the type of modeling and post-processing such as pitch and loudness correction. More sophisticated approaches use bi- or trigrams to model phonemes in the context of their neighbors. Recently, prosodic cues such as emotional speech have also gained interest in the field of synthesis. However, the most natural form still remains prerecorded speech, at the price of less flexibility and higher cost for recording and storage.
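To give a concrete, if simplified, impression of how these components interact, the following sketch wires minimal stubs into the circular pipeline described above. All class names, method signatures and the toy data are hypothetical placeholders, not an actual implementation of any of the systems discussed in this chapter.

```python
# Minimal sketch of a circular SLDS pipeline (hypothetical component stubs).
class ASR:
    def recognize(self, audio):
        # Would return an n-best list of (word sequence, confidence) hypotheses.
        return [("book a flight to berlin", 0.82), ("book a fly to berlin", 0.11)]

class NLU:
    def interpret(self, nbest):
        # Map the best hypothesis to an intention representation.
        words, conf = nbest[0]
        return {"intent": "book_flight", "slots": {"destination": "berlin"}, "conf": conf}

class DialogManager:
    def __init__(self, database):
        self.db, self.history = database, []
    def step(self, intention):
        # Store context, query the database, and produce an abstract answer.
        self.history.append(intention)
        result = self.db.get(intention["slots"].get("destination"), "no data")
        return {"act": "inform", "content": result}

class NLG:
    def formulate(self, abstract_answer):
        return f"I found the following: {abstract_answer['content']}."

class TTS:
    def speak(self, text):
        print(f"[synthesized speech] {text}")

def dialog_turn(audio, asr, nlu, dm, nlg, tts):
    """One pass through the pipeline: ASR -> NLU -> DM -> NLG -> TTS."""
    tts.speak(nlg.formulate(dm.step(nlu.interpret(asr.recognize(audio)))))

# Example usage with a toy database.
dm = DialogManager({"berlin": "3 flights on Friday"})
dialog_turn(b"...raw audio...", ASR(), NLU(), dm, NLG(), TTS())
```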
Still, as it is disputed which parts besides the DM belong to a strict definition of a SLDS, we will focus on the DM in the following.

4.2.2 Initiative Strategies

Before getting into dialog modeling, we want to establish a classification with respect to initiative:

• system-driven: the system keeps the initiative throughout the whole dialog
• user-driven: the user keeps the initiative
• mixed initiative: the initiative changes throughout the dialog
Usually the kind of application, and thus the kind of voice interface, co-determines this general dialog strategy. For example, systems with a limited vocabulary tend to employ rigid, system-initiative dialogs. The system thereby asks very specific questions, and the user can do nothing but answer them. Such a strategy is required due to the highly limited set of inputs the system can cope with. C&C applications tend to have rigid, user-initiative dialogs: the system has to wait for input from the user before it can do anything. However, the ideal of many researchers and developers is a natural, mixed-initiative dialog: both counterparts have the possibility of taking the initiative when this is opportune given the current state of the dialog, and the user can converse with the system as (s)he would with another human. This is difficult to obtain in general, for at least two reasons: Firstly, it is technically demanding, as the user should have the freedom to basically say anything at any moment,
which is a severe complication from a speech recognition and language processing point of view. Secondly, apart from such technical hurdles, it is also difficult from a dialog point of view, as initiative has to be tracked and reactions have to be flexible.

4.2.3 Models of Dialog

Within the DM an incorporated dialog model is responsible for the structure of the communication. We want to introduce the most important such models in the following. They can be mainly divided into structural models, which are introduced first, and non-structural ones [23, 31]. In structural models a more or less predefined path is given. While these are quite practicable at first, their methodology is not very principled, and the quality of dialogs based on them is arguable. Trying to overcome this, the non-structural approaches rely rather on general principles of dialog.

4.2.3.1 Finite State Model

In the most basic graph-based, or Finite State Automaton (FSA), model the dialog structure represents a sequence of limited predetermined steps or states in the form of a state transition graph which models all legal dialogs. The graph's nodes represent system questions, system outputs or system actions, while its edges, labeled with the corresponding user expressions, show all possible paths in the network. As all paths in a deterministic finite automaton are fixed, one cannot speak of an explicit dialog control. The dialog therefore tends to be rigid, and the reactions to each user input are predetermined. We can denote such a model as the quintuple {S, s0, sf, A, τ}. Thereby S represents the set of states besides s0, the initial state, and sf, the final state, and A a set of actions including an empty action ε. Finally, τ is a transition function with τ: S × A → S. It specifies to which state an action leads, given the actual state. Likewise a dialog is defined as a path from s0 to sf within the space of possible states. Generally, deterministic FSA are the most natural way to represent system-driven dialogs. Furthermore they are well suited for completely predictable information exchange, since the entire dialog model is defined at development time. Often it is possible to represent the dialog flow graphically, which makes for a very natural development process. However, there is no natural order of the states, as each one can be followed by any other, resulting in a combinatorial explosion. Furthermore the model is unsuited for different abstraction levels of the exchanged information and for complex dependencies between information units. Systems based on FSA models are also inflexible, as they are completely system-led, and no path deviations are possible. The user's input is restricted to single words or phrases that provide responses to carefully designed system prompts. Tasks which require negotiation cannot be implemented with deterministic automata due to the uncertainty of the result. Grounding and repair is very rigid and must be applied after each turn. Later corrections by the user are hardly possible, and the context of the statements cannot be used. Summed up, such systems are appropriate for simple tasks with a flat menu structure and short option lists.
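To make the quintuple {S, s0, sf, A, τ} concrete, the following toy sketch encodes a small system-driven dialog as a transition table; the states, actions and prompts are invented solely for illustration.

```python
# Toy finite state dialog model {S, s0, sf, A, tau} (all states/actions invented).
S = {"ask_city", "ask_date", "confirm", "done"}
s0, sf = "ask_city", "done"
A = {"say_city", "say_date", "yes", "no", "eps"}  # "eps" stands in for the empty action

# Transition function tau: (state, action) -> next state.
tau = {
    ("ask_city", "say_city"): "ask_date",
    ("ask_date", "say_date"): "confirm",
    ("confirm", "yes"): "done",
    ("confirm", "no"): "ask_city",   # restart on rejection
}

prompts = {
    "ask_city": "Which city do you want to travel to?",
    "ask_date": "On which date?",
    "confirm": "Shall I book this trip?",
}

def run_dialog(user_actions):
    """Follow one path from s0 towards sf; reject actions not allowed in a state."""
    state = s0
    for action in user_actions:
        if state == sf:
            break
        if (state, action) not in tau:
            print(f"{prompts[state]} (input '{action}' not allowed here)")
            continue
        print(prompts[state])
        state = tau[(state, action)]
    return state == sf

print(run_dialog(["say_city", "say_date", "yes"]))  # True: a legal dialog path
```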
4.2.3.2 Slot Filling

In so-called frame-based systems the user is asked questions that enable the system to fill slots in a template in order to perform a task [41]. Such systems are more or less today's standard for database retrieval systems such as flight or cinema information. In contrast to automaton techniques the dialog flow is not predetermined but depends on the content of the user's input and the information that the system still has to elicit. If the user provides more input than is requested at a time, the system can accept this information and check whether any additional item is still required. A necessary component is therefore a frame that keeps track of the information fragments already expressed. If multiple slots are yet to be filled, the question of the optimal order of the system's questions remains. The idea is to constrain the originally large set of a-priori possible actions in order to speed up the dialog. One reasonable approach to establish a hierarchy of slots is to follow the order of highest information based on Shannon's entropy measure, as proposed in [24]. An item with high information might, e.g., be the one leaving most alternatives open. Let therefore ck be a random variable over items within the set C, and vj a value of attribute a. The attribute a that minimizes the following entropy measure H(c|a) will accordingly be selected:

H(c|a) = -\sum_{v_j \in a} \sum_{c_k \in C} P(c_k, a = v_j) \log_2 P(c_k | a = v_j)    (4.6)
Thereby the probability P(c_k, a = v_j) is set to 0 unless c_k matches the revised partial description; in that case it equals P(c_k). The remaining probability P(c_k | a = v_j) is calculated by the following equation, where C is the set of items that match the revised description including a = v_j, and P(c_k) can be approximated based on the assumption that all items are equally likely:

P(c_k | a = v_j) = \frac{P(c_k)}{\sum_{c_j \in C} P(c_j)}, \qquad P(c_k) = \frac{1}{|C|}    (4.7)
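Under the simplifying assumption that the remaining candidate items are equally likely, equation (4.6) can be evaluated directly, as in the following sketch that picks the slot to ask about next; the attribute names and the item table are invented for illustration.

```python
import math
from collections import defaultdict

# Toy candidate items still matching the dialog context (invented example data).
items = [
    {"destination": "berlin", "class": "economy"},
    {"destination": "berlin", "class": "business"},
    {"destination": "munich", "class": "economy"},
    {"destination": "hamburg", "class": "economy"},
]

def conditional_entropy(items, attribute):
    """H(c|a) of Eq. (4.6), assuming the matching items are equally likely."""
    n = len(items)
    counts = defaultdict(int)
    for item in items:
        counts[item[attribute]] += 1
    h = 0.0
    for value, count in counts.items():
        p_joint = count / n      # sum of P(c_k, a=v_j) over the items matching v_j
        p_cond = 1.0 / count     # P(c_k | a=v_j) for an equally likely matching item
        h -= p_joint * math.log2(p_cond)
    return h

def next_slot_to_ask(items, attributes):
    """Select the attribute that minimizes H(c|a)."""
    return min(attributes, key=lambda a: conditional_entropy(items, a))

print(next_slot_to_ask(items, ["destination", "class"]))  # -> "destination"
```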
Advantages of dialog control with frames are higher flexibility for the user and multiple slot filling. The duration of dialogs is shorter, and mixed-initiative control is possible. However, an extended recognition grammar is required for more flexible user expressions, and the dialog control algorithm must determine the next system actions on the basis of the available frames. The context that decides on the next action (last user input, status of the slots, simple priority ranking) is limited, and the knowledge level of the user, negotiation, or collaborative planning cannot be modeled, or only to a very limited degree. No handling of communication problems exists, and the form of the system questions is not specified. Still, there is an extension [53] whereby complex questions are asked as long as communication works well; in case of problems the system can switch to lower-level questions splitting up the high-level ones. Finally,
an overview of the system is difficult to keep due to the partially complex rules: which rule fires when?

4.2.3.3 Stochastic Model

In order to find a less hand-crafted approach, which otherwise relies mainly on a designer's input, data-driven methods successfully applied in machine learning have recently been used in the field of dialog modeling. They aim at overcoming the shortcomings of low portability to new domains and the limited predictability of potential problems within a design process. As the aim is to optimize a dialog, let us first define a cost function C. The numbers Nt of turns, Ne of errors, and Nm of missing values are summed up, weighted by individual weights wt, we, and wm:

C = w_t N_t + w_e N_e + w_m N_m    (4.8)
Now if we want to find an ideal solution minimizing this target function, we face |A|^{|S|} possible strategies. A variety of suitable approaches for a machine-based solution to this problem exists [57]. Here we follow [28] and choose the predominant Markov Decision Processes (MDP) as a probabilistic extension of the introduced FSA. MDP differ from FSA in that they introduce transition probabilities instead of the function τ. We denote by st the state and by at the action at time t. Given st and at we change to state st+1 with the conditional probability P(st+1 | st, at). Thereby we respect only events one time step back, known as the limited horizon Markov property. Next, we combine this with the introduced costs, where ct shall be the cost incurred if in state st the action at is performed, with probability P(ct | st, at). The cost of a complete dialog can then be denoted as:

C = \sum_{t=0}^{t_f} c_t    (4.9)
where tf denotes the instant when the final state sf is reached. Consequently, the best action V*(s) to take in a state s is the action that minimizes the incurred cost plus the expected cost of the succeeding state, assuming the best action is taken there as well:

V^*(s) = \arg\min_a \Big[ c(s,a) + \sum_{s'} P(s'|s,a)\, V^*(s') \Big]    (4.10)
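Assuming the transition probabilities and costs are known, the values behind equation (4.10), and the corresponding best actions, can be approximated by simple value iteration, as in the following sketch; the tiny state and action sets are invented for illustration.

```python
# Value iteration sketch for a toy dialog MDP (states, actions, costs invented).
states = ["greet", "ask_slot", "confirm", "final"]
actions = ["ask", "confirm", "close"]

# P[(s, a)] = list of (s_next, probability); c[(s, a)] = immediate cost.
P = {
    ("greet", "ask"): [("ask_slot", 1.0)],
    ("ask_slot", "ask"): [("ask_slot", 0.3), ("confirm", 0.7)],
    ("ask_slot", "confirm"): [("confirm", 0.5), ("ask_slot", 0.5)],
    ("confirm", "close"): [("final", 0.9), ("ask_slot", 0.1)],
}
c = {k: 1.0 for k in P}            # every turn costs 1
c[("confirm", "close")] = 0.5      # closing is cheap

def value_iteration(sweeps=100):
    V = {s: 0.0 for s in states}   # V("final") stays 0: no further cost
    for _ in range(sweeps):
        for s in states:
            candidates = [
                c[(s, a)] + sum(p * V[s2] for s2, p in P[(s, a)])
                for a in actions if (s, a) in P
            ]
            if candidates:
                V[s] = min(candidates)
    # Best action per state (the arg min of Eq. 4.10).
    policy = {
        s: min((a for a in actions if (s, a) in P),
               key=lambda a: c[(s, a)] + sum(p * V[s2] for s2, p in P[(s, a)]))
        for s in states if any((s, a) in P for a in actions)
    }
    return V, policy

V, policy = value_iteration()
print(V)
print(policy)
```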
Given all model parameters and a finite state space, the unique function V*(s) can be computed by value iteration techniques. Finally, the optimal strategy ψ* is the chain of actions that minimizes the overall costs. In the case of a SLDS the parameters are not known in advance, but have to be learned from data. Normally such data might be acquired by test users operating on a simulated system, known as Wizard-of-Oz experiments [33]. However, as large amounts of data are needed, and there is no data like more data, one solution to sparse data is to simulate the user by another stochastic process [27]. Based on annotated dialogs with
real users, probabilities are derived with which a simulated user responds to the system. Next, reinforcement learning, e.g. by Monte Carlo with exploring starts, is used to obtain an optimal state-action value function Q*(s, a). It represents the cost to be expected for a dialog starting in state s, moving to s' by action a, and continuing optimally to the end:

Q^*(s,a) = c(s,a) + \sum_{s'} P(s'|s,a)\, \min_{a'} Q^*(s',a')    (4.11)
Note that V*(s) = arg min_a Q*(s, a). Starting from an arbitrary initial Q*(s, a), the algorithm iteratively estimates the costs of a dialog session. In [27] it is shown that it converges to an optimal solution, and, for instance, that it is expensive to immediately cut a dialog short with "bye!". The goals set at the beginning of this subsection can thus be accomplished by stochastic dialog modeling. The need for data may seem a drawback, but often quite sparse data suffices, as even low accuracies of the initial transition probabilities may already lead to satisfying results. Still, the state space is hand-crafted, which strongly influences the learning process; ideally the state space itself would be learned as well. Furthermore the determination of the cost function is not trivial, especially assigning the right weights, while it also strongly influences the result. In the suggested cost function no account is taken of subjective costs such as user satisfaction.

4.2.3.4 Goal Directed Processing

The more or less finite state and action based approaches with predefined paths introduced so far are well suited for C&C and information retrieval, but less suited for task-oriented dialogs where the user and the system have to cooperate to solve a problem. Consider therefore a plan-based dialog between a user who has only sparse knowledge about the problem to solve and an expert system. The actual task mostly consists of sub-tasks and influences the structure of the dialog [16]. The system now has to be able to reason about the problem at hand, the application, and the user in order to solve the task [50]. This leads to a theorem prover that derives conclusions by the laws of logic from a set of axioms considered true, contained in a knowledge base. The idea is to prove whether a subgoal or goal, defined as a theorem, has been accomplished yet. User modeling is also done in the form of stored axioms describing the user's competence and knowledge. Whenever an axiom is missing, interaction with the user becomes necessary, which calls for an interruptible theorem prover that is able to initiate a question directed at the user. This is known as the missing axiom theory [50]. As soon as a subgoal is fulfilled, a new one can be selected according to the highest probability of success given the actual dialog state. However, the prover should be flexible enough to switch to different sub-goals if the user actively provides new information. This process is repeated iteratively until the main goal is reached. In the framework of Artificial Intelligence (AI) this can be described by the Beliefs, Desires, and Intentions (BDI) model [23], as shown in the following figure 4.7.
This model can be applied to conversational agents that have beliefs about the current state of the world and desires about how they want it to be. Based on these, they determine their intention, or goal, and build a plan consisting of actions to satisfy their desires. Note that utterances are thereby treated as (speech) actions.
Fig. 4.7. Beliefs, Desires, and Intentions Model.
To conclude, the advantages of goal directed processing are the ability to also deal with task-oriented communication and informational dialogs, a more principled approach to dialog based on a general theory of communication, and less domain dependency, at least for the general model. Yet, by applying AI methods, several problems are inherited: high computational cost, the frame problem, i.e. the specification of the parts of the world not influenced by actions, and the lack of proper belief and desire formalizations and of independently motivated rationality principles for agent behavior.

4.2.3.5 Rational Conversational Agents

Rational Conversational Agents directly aim at the imitation of natural conversational behavior by trying to overcome the plan-based approaches' lack of human-like rationality. The latter is therefore reformulated in a formal framework establishing an intelligent system that relies on a more general competence basis [43]. The beliefs, desires and intentions are logically formulated, and a rational unit is constructed based thereon that decides upon the actions to be taken. Let us denote an agent i believing in proposition p as B(i, p), and exemplary propositions as ϕ and ψ. We can next construct a set of logical rules, known as NKD45 or Weak S5, that apply for the B operator:

(N): always ϕ → always B(i, ϕ)    (4.12)
(K): B(i, ϕ) ∧ B(i, ϕ → ψ) → B(i, ψ)    (4.13)
(D): B(i, ϕ) → ¬B(i, ¬ϕ)    (4.14)
(4): B(i, ϕ) → B(i, B(i, ϕ))    (4.15)
(5): ¬B(i, ϕ) → B(i, ¬B(i, ϕ))    (4.16)
Let us furthermore denote desires, or goals, as G(i, p), meaning that the agent indexed i has the desire that p comes true. As an exercise, one can reflect why only (K) and (4) apply for goals:

(K): G(i, ϕ) ∧ G(i, ϕ → ψ) → G(i, ψ)    (4.17)
(4): G(i, ϕ) → G(i, B(i, ϕ))    (4.18)
We next connect goals and beliefs by a realism constraint, which states that an agent cannot have a goal he believes to be false:

B(i, ϕ) → G(i, ϕ)    (4.19)
Expected consequences of a goal furthermore also have to be a goal of the agent, which is known as the expected consequences constraint:

G(i, ϕ) ∧ B(i, ϕ → ψ) → G(i, ψ)    (4.20)
Finally, let us introduce the persistent goal constraint with the persistent goal PG(i, ϕ): an agent i should not give up the goal that ϕ be true in the future (1), which he believes not to be true at present (2), until he either believes it fulfilled or believes it to be necessarily unachievable (3):

(1): PG(i, ϕ) if and only if G(i, (Future ϕ)) ∧    (4.21)
(2): B(i, ¬ϕ) ∧    (4.22)
(3): [Before ((B(i, ϕ) ∨ B(i, (Necessary ¬ϕ))) ¬G(i, (Future ϕ)))]    (4.23)
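As a toy illustration of how such constraints can be checked mechanically, the following sketch represents an agent's beliefs and goals as finite sets of propositional atoms and implications and tests the realism constraint (4.19) and the expected consequences constraint (4.20). This is a didactic simplification with invented propositions, not the logical machinery of an actual rational agent.

```python
# Toy check of two BDI constraints over finite sets of propositions (didactic only).
# Beliefs contain atoms ("p") and implications given as pairs (p, q) meaning p -> q.
beliefs = {"hotel_available", ("hotel_available", "trip_possible")}
goals = {"hotel_available", "trip_possible"}

def realism_ok(beliefs, goals):
    """Constraint (4.19): every believed atom must also be a goal."""
    return all(b in goals for b in beliefs if isinstance(b, str))

def expected_consequences_ok(beliefs, goals):
    """Constraint (4.20): if p is a goal and p -> q is believed, q must be a goal."""
    implications = [b for b in beliefs if isinstance(b, tuple)]
    return all(q in goals for (p, q) in implications if p in goals)

print(realism_ok(beliefs, goals))                # True
print(expected_consequences_ok(beliefs, goals))  # True
```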
The rational unit of a SLDS, fed with the formalized notions of beliefs, desires and intentions, now has a selection of communication actions to choose from, which are associated with feasibility preconditions and rational effects [43]. Such an action is, e.g., that an agent i informs an agent j about ϕ, whereby the user is also understood as an agent. Now, if an agent has the intention to achieve a goal, it selects an appropriate action and thereby inherits the intention to fulfill the corresponding preconditions. This approach defines a planning algorithm demanding a theorem prover as in the previous section. However, user-directed questions do not directly concern missing axioms; rather it is only checked whether preconditions are fulfilled and whether the effects of system actions help with respect to the current goal. To sum up, the characteristics of rational conversational agents are full mixed initiative and the ability to implement theoretically complex systems which solve dynamic and (only) cooperative complex tasks. They are also much more oriented towards linguistic theory. However, there is no general definition of the agent term beyond the requirement that agents should be reactive, autonomous, social, rational and anthropomorphic. Furthermore, considerably more resources are needed both quantitatively (computing power, programming expenditure) and qualitatively (complexity of the problems which can be solved). Also, the formalization of all task-relevant domain knowledge is not trivial. So far only academic systems have been realized.
4.2.4 Dialog Design

Let us now turn our attention to the design of a dialog, as many factors besides the recognition performance of the ASR unit have a significant influence on the overall design goals of naturalness, efficiency, and effectiveness. As we have seen so far, initiative can, e.g., be mixed or held by one side only; furthermore confirmations can be given or spared, suggestions can be made in case of failure, etc. In the following, three main steps in the design of a dialog are outlined:

• Script writing: In this part, also known as call flow layout, the interaction between user and system is laid out step by step. Focus should be placed on the naturalness of the dialog, which is also significantly influenced by the quality of the dialog flow.
• Prompt Design: Similar to a prompt in console-based interfaces, a sound or announcement of the system signals the user when and, depending on the situation, what to speak. This acoustic prompt has to be well designed in order to be informative and clearly heard, but not disturbing. A frequent prompt might therefore be kept short. On the other hand, in the case of error handling, tutorial information or help provision, prompts may be more elaborate, as the quality of a user's answer depends strongly on the quality of the prompt. Finally, only by appropriate prompt crafting can the initiative throughout a dialog be controlled effectively.
• Grammar Writing: Within the grammar the possible user statements for a given dialog state are defined. A compromise between coverage and recognition accuracy has to be found, as too broad a coverage often decreases performance due to a large space of possible hypotheses. Also, phrases of multiple words should be handled. This is often realized by finite state grammars and triggering by keywords; a small sketch is given below.
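The following is a minimal sketch of such a keyword-triggered, state-specific grammar, here simply expressed with regular expressions; the dialog states and phrases are invented for illustration and do not correspond to any particular grammar formalism used in practice.

```python
import re

# Hypothetical per-state grammars: each dialog state accepts a few phrase patterns.
grammars = {
    "ask_destination": [
        (re.compile(r"\b(?P<city>berlin|munich|hamburg)\b"), "destination"),
        (re.compile(r"\b(?:to|towards)\s+(?P<city>\w+)"), "destination"),
    ],
    "confirm": [
        (re.compile(r"\b(yes|correct|right)\b"), "confirm_yes"),
        (re.compile(r"\b(no|wrong)\b"), "confirm_no"),
    ],
}

def parse_utterance(state, utterance):
    """Return (semantic tag, captured slots) for the first matching pattern, or None."""
    for pattern, tag in grammars.get(state, []):
        match = pattern.search(utterance.lower())
        if match:
            return tag, match.groupdict()
    return None  # out-of-grammar input: trigger a re-prompt or error handling

print(parse_utterance("ask_destination", "I would like to fly to Berlin please"))
print(parse_utterance("confirm", "yes that is right"))
```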
For an automatic system it seems crucial to understand the actual meaning of a user statement, which highly depends on the context. Furthermore it is important to design system announcements that are clear and understandable for the user. We therefore also want to take a brief linguistic view on dialog within this section. Three different aspects are important when considering the meaning of a verbally uttered phrase in the chain from sender to receiver: firstly, the locution, which represents the semantic or literal significance of the utterance; secondly, the illocution, the actual intention of the speaker; and finally the perlocution, which stands for how an utterance is received by the counterpart. Thus, to speak is to perform a locution, but to speak with an intent (ask, promise, request, assert, demand, apologize, warn, etc.) is to perform an illocution. The purpose, the illocutionary intent, is meaningful and will ordinarily be recognized by hearers. Within this context we want to have a look at the four well-known Gricean conversational maxims:

• Maxim of relevance: be relevant. Consider: 'He kicked the bucket' (we assume that someone died because that is what is relevant), or 'Do you know what time it is?' (we assume that the speaker wants to know the time because the "real" question is irrelevant).
• Maxim of quality: be truthful. E.g. 'If I hear that song again I'll kill myself' (we accept this as hyperbole and do not immediately turn the radio off), or 'The boss has lost his marbles' (we imagine a mental problem and not actual marbles).
• Maxim of quantity: be informative, say neither too much nor too little. (Asked for the date, we do not include the year.)
• Maxim of manner: be clear and orderly.
In order to obtain high overall dialog quality, some further aspects shall be outlined: Firstly, consistency and transparency are important to enable the user to form a mental model of the system. Secondly, social competence with respect to modeling user behavior and providing the system with a personality seems very important. Thirdly, error handling plays a key role, as speech recognition is prone to errors. The precondition is clarification, demanding that problems with the user input (no input, truncated input, word recognition errors, wrong interpretations, etc.) must become apparent to the system. In general it is said that users accept at most about five percent errors. Therefore trade-offs have to be made concerning vocabulary size and naturalness of the input. However, there is a chance to raise the accuracy by expert prompt modeling and allusion to the nature of the problem, or at least to cover errors and reduce user annoyance by careful design. Also, the danger of over-modeling a dialog with respect to world knowledge, social experience or the general complexity of human communication shall be mentioned. Finally, the main characteristic of a good voice interface is probably that it is usable. Usability is a widely discussed concept in the field of interfaces, and various operationalizations have been proposed. In [33] it is stated that usability is a multidimensional concept comprising learnability, efficiency, memorability, errors and satisfaction, and ways are described in which these can be measured. Defining when a voice interface is usable is one thing, developing one is quite another. It is by now received wisdom that usability design is an iterative process which should be integrated into the general development process.

4.2.5 Scripting and Tagging

We want to conclude this section with a short introduction of the two most important dialog mark-up languages with respect to scripting and tagging: VoiceXML and DAMSL. VoiceXML (VXML) is the W3C's standard XML format for specifying interactive voice dialogs between a human and a computer. VXML is fully analogous to HTML, and just as HTML documents are interpreted by a visual web browser, VXML documents are interpreted by a voice browser. A common architecture is to deploy banks of voice browsers attached to the public switched telephone network so that users can simply pick up a phone to interact with voice applications. VXML has tags that instruct the voice browser to provide speech synthesis, automatic speech recognition, dialog management, and sound file playback. Considering dialog management, a form in VXML consists of fields and control units: fields collect information from the user via speech input or DTMF (dual-tone multi-frequency) signaling. Control units are sequences of procedural statements, and the control of the dialog follows the form interpretation algorithm, which consists of at least one major loop with three phases:
• Select: The first form item in the active VXML document that is not yet filled and whose guard condition is open is selected in a top-down manner.
• Collect: The selected form item is visited and its prompts are played according to the prompt algorithm. Next the input grammar of the form is activated, and the algorithm waits for user input.
• Process: The input is evaluated against the active grammar and the form's fields are filled; filled elements may, for example, trigger input validation. The process phase ends if no more items can be selected or a jump point is reached.
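A minimal sketch of this select/collect/process loop, independent of any real VXML interpreter and with invented field names and a canned recognizer stub, could look as follows:

```python
# Simplified sketch of a VXML-style form interpretation loop (not a real interpreter).
form = {
    "departure": {"value": None, "prompt": "Where do you want to leave from?"},
    "destination": {"value": None, "prompt": "Where do you want to go?"},
}

def recognize_with_grammar(field_name):
    # Placeholder for ASR constrained by the field's grammar.
    canned = {"departure": "aachen", "destination": "berlin"}
    return canned[field_name]

def form_interpretation(form):
    while True:
        # Select: first field whose guard condition (value still empty) is open.
        pending = [name for name, item in form.items() if item["value"] is None]
        if not pending:
            break
        name = pending[0]
        # Collect: play the prompt, activate the field grammar, wait for input.
        print(form[name]["prompt"])
        user_input = recognize_with_grammar(name)
        # Process: fill the field and (optionally) validate it.
        form[name]["value"] = user_input
    return {name: item["value"] for name, item in form.items()}

print(form_interpretation(form))
```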
Typically, HTTP is used as the transport protocol for fetching VXML pages. While simpler applications may use static VXML pages, nearly all rely on dynamic VXML page generation using an application server. In a well-architected web application, the voice interface and the visual interface share the same back-end business logic.

Tagging of dialogs for machine learning algorithms, on the other hand, is mostly done using Dialogue Act Mark-up in Several Layers (DAMSL), an annotation scheme for communicative acts in dialog [9]. While different dialogs analyzed with different aims in mind will lead to diverse acts, it seems reasonable to agree on a common basis for annotation in order to enable database enlargement by integrating other databases. The scheme has three layers: Forward Communicative Functions, Backward Communicative Functions, and Information Level. Each layer allows multiple communicative functions of an utterance to be labeled. The Forward Communicative Functions consist of a taxonomy in a similar style to the actions of traditional speech act theory; the most important are statement (assert, reassert, other statement), influencing addressee future action (open-option, directive (info-request, action-directive)), and committing speaker future action (offer, commit). The Backward Communicative Functions indicate how the current utterance relates to the previous dialog: agreement (accept, accept-part, maybe, reject-part, reject, hold), understanding (signal non-understanding, signal understanding (acknowledge, repeat phrase, completion), correct misspeaking), and answer. Finally, the Information Level annotation encodes whether an utterance deals with the dialog task, the communication process, or meta-level discussion about the task.
4.3 Multimodal Interaction

The research field of Human-Computer Interaction (HCI) focuses on making the interaction with computers easier, safer, more effective and, to a high degree, seamless for the user. "Human-Computer Interaction is a discipline concerned with the design, evaluation and implementation of interactive computing systems for human use and with the study of major phenomena surrounding them." [18]

As this definition indicates, HCI is an interdisciplinary field of research in which many different disciplines are involved to reach the long-term objective of a natural and intuitive
way of interaction with computers. In general, the term interaction describes the mutual influence of several participants exchanging information that is transported bilaterally by the most diverse means. A major goal of HCI is therefore to bring the interface ever closer to the familiar, ordinary interpersonal way of interaction. In natural human communication several input and output channels are combined in a multimodal manner.

4.3.1 In- and Output Channels

The human being is able to gather, process and express information through a number of channels. The input channels can be described as sensory functions or perception, the processing of information as cognition, and the output channels as motor functions (see figure 4.8).
Fig. 4.8. Human information input and output channels.
Humans are equipped with six senses, or sense organs, to gather stimuli. These senses are defined by physiology [14]:

sense of sight: visual channel
sense of hearing: auditive channel
sense of smell: olfactory channel
sense of taste: gustatory channel
sense of balance: vestibular channel
sense of touch: tactile channel
The human being is also provided with a broad range of abilities for information output. The output channels can carry less information than the input channels. Outgoing information is transmitted by the auditive, the visual and the haptic channel (tactile as a perception modality is distinguished from haptic as an output modality). By these means, humans are able to communicate with each other in a simple, effective, and error-robust fashion. With an integrated and synchronized use of
different channels they can flexibly adapt to the specific abilities of the conversational partner and the current surroundings. However, HCI is still far from information transmission in this intuitive and natural manner. This is due to the limitations of technical systems with regard to the number and performance of their individual input and output modalities. The term modality refers to the type of communication channel used to convey or acquire information. One of the core difficulties of HCI lies in the divergent boundary conditions of computers and humans. Nowadays the user can transfer commands to the computer only by standard input devices, e.g. mouse or keyboard, and computer feedback is carried out via the visual or acoustic channel, e.g. monitor or speakers. Thus, today's technology uses only a few of the human interaction modalities. Whenever two or more of these modalities are involved, one speaks of multimodality.

4.3.2 Basics of Multimodal Interaction

The term "multimodal" is derived from "multi" (Latin: several, numerous) and "mode" (Latin: manner, method). The word "modal" may cover the notion of "modality" as well as that of "mode". Scientific research focuses on two central characteristics of multimodal systems:

• the user is able to communicate with the machine by several input and output modes
• the different information channels can interact in a sensible fashion
According to S. Oviatt [35], multimodal interfaces combine natural input modes such as speech, pen, touch, manual gestures, gaze, and head and body movements in a coordinated manner with multimedia system output. They are a new class of interfaces which aims to recognize naturally occurring forms of human language and behavior, and which incorporates one or more recognition-based technologies (e.g. speech, pen, vision). Benoit [5] expanded the definition to systems which represent and manipulate information from different human communication channels at multiple levels of abstraction. These systems are able to automatically extract meaning from multimodal raw input data and, conversely, produce perceivable information from symbolic abstract representations. In 1980, Bolt's "Put That There" demonstration [6] showed a new direction for computing by processing speech in parallel with touch-pad pointing. Multimodal systems benefit from the progress in recognition-based technologies which are capable of capturing naturally occurring forms of human language and behavior. The dominant theme in users' natural organization of multimodal input is actually complementarity of content, meaning that each input consistently contributes different semantic information. The partial information sequences must be fused and can only be interpreted together. Redundancy of information, by contrast, is much less common in human communication. Sometimes different modalities can carry concurrent content that has to be processed independently. Multimodal applications range from map-based (e.g. tourist information) and virtual reality systems, to person identification and verification systems, to medical and web-based transaction systems. Recent systems integrate two or more
recognition-based technologies such as speech and lip reading. A major aspect is the integration and synchronization of these multiple information streams, which is discussed in detail in Sect. 4.3.3.

4.3.2.1 Advantages

Multimodal interfaces are largely inspired by the goal of supporting more transparent, flexible, effective, efficient and robust interaction [29, 35]. The flexible use of input modes is an important design issue. This includes the choice of the appropriate modality for different types of information, the use of combined input modes, or the alternating use of modes. A more detailed differentiation is made in Sect. 4.3.2.2. Input modalities can be selected according to context and task by the user or the system. Especially for complex tasks and environments, multimodal systems permit the user to interact more effectively. Because there are large individual differences in abilities and preferences, it is essential to support selection and control for diverse user groups [1]. For this reason, multimodal interfaces are expected to be easier to learn and use. The continuously changing demands of mobile applications, e.g. in-vehicle applications, require the user to be able to switch between modalities. Many studies show that multimodal interfaces achieve higher levels of user preference. The main advantage is probably the efficiency gain that derives from the human ability to process input modes in parallel. The human brain is structured to gather a certain kind of information with specific sensory inputs. Multimodal interface design allows superior error handling, both to avoid errors and to recover from them, whether they have user-centered or system-centered causes. Consequently, such systems can function in a more robust and stable manner. In-depth information about error avoidance and graceful error resolution is given in Sect. 4.3.4. A future aim is to interpret continuous input from visual, auditory, and tactile input modes in everyday systems to support intelligent adaptation to user, task and usage environment.

4.3.2.2 Taxonomy

A basic taxonomy for the classification of multimodal systems was created in the MIAMI project [17]. As a basic principle there are three decisive degrees of freedom for multimodal interaction:

• the degree of abstraction
• the temporal manner of application
• the fusion of the different modalities
Figure 4.9 shows the resulting classification space and the consequential four basic categories of multimodal applications, depending on the parameters data fusion (combined/independent) and temporal usage (sequential/parallel). The exclusive case is the simplest variant of a multimodal system: such a system supports two or more interaction channels, but there is no temporal or content-related connection between them. A sequential application of the modalities with functional cohesion is denominated alternative multimodality. Besides a sequential application of the modalities there is the possibility of parallel operation of different modalities, as seen in figure 4.9. Thereby
Fig. 4.9. Classification space for classification of multimodal systems [44].
a distinction is made according to the manner of fusion of the interaction channels, namely simultaneous and synergistic multimodality. The third degree of freedom is the level of abstraction. This refers to the technical level on which the signals are processed, ranging from simple binary sequences to highly complex semantic terms.

4.3.3 Multimodal Fusion

As described above, in multimodal systems the information flow from human to computer occurs via different modalities. To be able to utilize the information transmitted by the different senses, it is necessary to integrate the input channels so as to create an appropriate command that is equivalent to the user's intention. The input data gathered from the single modalities is produced by single-mode recognizers. There are three basic approaches for combining the results of the individual recognition modules into one information stream [36, 56]:
• Early (signal) fusion: The earliest possible fusion of the sensor data is the combination of the sensor-specific raw data. The classification of the data is mostly achieved by Hidden Markov Models (HMM), temporal Neural Networks (NN) or Dynamic Bayesian Networks (DBN). Early fusion is well suited for temporally synchronized inputs. This approach to fusion only succeeds if the data provided by the different sources is of the same type and a strong correlation between the modalities exists, for example the fusion of images from a regular camera and an infrared camera for use in night vision systems. Furthermore, this type of fusion is applied in speech recognition systems supported by lip-reading technology, in which the viseme and phoneme progression can be registered collectively in one HMM (a viseme is the generic image of the face, especially the lip positioning, at the moment a certain sound is produced; visemes are thus the graphic counterpart of phonemes). A great problem of early fusion is the large amount of data necessary for training the HMMs.
• Late (semantic) fusion: Multimodal systems which use late fusion consist of several single-mode recognition devices as well as a downstream data fusion device. This approach contains a separate preprocessing, feature extraction and decision level for each modality. The results of the separate decision levels are fused into a total result. For each classification process, each individual decision level delivers a probability or confidence result for the choice of a class n. These confidence results are afterwards fused, for example by an appropriate linear combination (see the sketch after this list). The advantage of this approach is that the different recognition devices can be realized independently. Therefore the acquisition of multimodal data sets is not necessary; the separate recognition devices are trained with monomodal data sets. Because of this easy integration of new recognizers, systems that use late fusion scale up more easily than early fusion, both in the number of modalities and in the size of the command set [56].
• Soft decision fusion: A compromise between early and late fusion is so-called soft decision fusion. In this method the confidence of each classifier is taken into account, as well as an N-best list from each classifier.
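As a simple illustration of the late, confidence-based fusion mentioned above, the following sketch linearly combines the class confidences of two hypothetical single-mode recognizers; the class labels, scores and weights are invented and the weighting scheme is just one possible choice.

```python
# Late fusion sketch: weighted linear combination of per-modality confidences.
speech_scores = {"open_map": 0.7, "close_map": 0.2, "zoom_in": 0.1}    # hypothetical ASR output
gesture_scores = {"open_map": 0.15, "close_map": 0.05, "zoom_in": 0.8} # hypothetical gesture output

def late_fusion(score_lists, weights):
    """Combine class confidences of independent recognizers by linear combination."""
    classes = set().union(*(s.keys() for s in score_lists))
    fused = {
        c: sum(w * scores.get(c, 0.0) for scores, w in zip(score_lists, weights))
        for c in classes
    }
    return max(fused, key=fused.get), fused

best, fused = late_fusion([speech_scores, gesture_scores], weights=[0.5, 0.5])
print(best, fused)   # 'zoom_in' wins because the gesture channel dominates
```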
In general, multimodal systems consist of various modules (e.g. different single-mode recognizers, multimodal integration, user interface). Typically, these software components are developed and implemented independently of each other. Therefore, various requirements on the software architecture of a multimodal system arise. A common infrastructure approach adopted by the multimodal research community involves multi-agent architectures, where an agent can be any software process. In such architectures, the many modules needed to support the multimodal system may be written in different programming languages and run on several machines. One example of an existing multimodal framework is given in [30]. Its system architecture consists of three main processing levels: the input level, the integration level, and the output level. The input level contains any kind of interface that is capable of recognizing user inputs (e.g. mouse, buttons, speech recognizer). Dedicated command mappers (CMs) encode the information bits of the single independent modality recognizers and context sensors into a meta language based on a context-free grammar (CFG). In the integration level, the recognizer outputs and additional information from context sensors (e.g. information about the application environment or the user state) are combined in a late semantic fusion process. The output level provides the devices for adequate multimodal system feedback.

4.3.3.1 Integration Methods

For signal and semantic fusion, different integration methods exist. In the following, some of these methods are explained:
• Unification-based Integration: Typed feature structure unification is an operation that verifies the consistency of two or more representational structures and combines them into a single result. Typed feature structures are used in natural language processing and computational linguistics to enhance syntactic categories. They are very similar to frames from knowledge representation systems or records from programming languages like C, and have been used for grammar rules, lexical entries and meaning representation. A feature structure consists of a type, which indicates the kind of entity it represents, and an associated collection of feature-value or attribute-value pairs [7]. Unification-based integration allows different modalities to mutually compensate for each other's errors [22]. Feature structure unification can combine complementary and redundant input, but excludes contradictory input. For this reason it is well suited to integrate e.g. multimodal speech and gesture inputs.
• Statistical Integration: Every input device produces an N-best list with recognition results and probabilities. The statistical integrator produces a probability for every meaningful combination from the different N-best lists by calculating the cross product of the individual probabilities. The multimodal command with the best probability is then chosen as the recognized command. In a multimodal system with K input devices we assume that R_{i,j}, j = 1, ..., N_i, with the probabilities P_{i,j} = P_i(1), ..., P_i(N_i), are the N_i possible recognition results from interface number i, i ∈ (1, ..., K). The statistical integrator forms the combinations

  C_n, \quad n = 1, \ldots, \prod_{i=1}^{K} N_i    (4.24)

  with the probabilities

  P_n = \prod_{i=1}^{K} P_{i,j}, \quad j \in (1, \ldots, N_i)    (4.25)

  The integrator then chooses which combinations represent valid system commands. The choice could be based on a semantic approach or on a database with all meaningful combinations. Invalid results are deleted from the list, giving a new list C_{valid,n} with at most \prod_{i=1}^{K} N_i valid combinations. The valid combination with the maximum probability max[P_{valid,n}] is chosen as the recognized command (a minimal sketch of this procedure is given after this list). The results can be improved significantly if empirical data is integrated into the statistical process. Statistical integrators are easy to scale up, because all application knowledge and empirical data is integrated at the configuration level. Statistical integrators work very well with supplementary information and well with complementary information. One problem is how the system should react when no meaningful combinations are available.
  Furthermore, the system has to be programmed with all meaningful combinations in order to decide which ones represent valid system commands. Another possibility to decide whether a combination is valid or not is to use a semantic integration process after the statistical integration. This avoids the explicit programming of all valid combinations.
• Hybrid Processing: In hybrid architectures, symbolic unification-based techniques that integrate feature structures are combined with statistical approaches. In contrast to purely symbolic approaches, these architectures function very robustly. Associative Mapping and Members-Teams-Committee (MTC) are two main techniques. They develop and optimize the following factors with a statistical approach: the mapping structure between multimodal commands and their respective constituents, and the manner of combining posterior probabilities. The Associative Mapping defines all semantically meaningful mapping relations between the different input modes. It supports a table lookup process that excludes from consideration those feature structures that cannot be unified semantically. This table can be built by the user or automatically. The associative mapping approach basically helps to exclude conflicting inputs from the different recognizers and to quickly rule out impossible combinations. Members-Teams-Committee is a hierarchical technique with multiple members, teams and committees. Every recognizer is represented by a member. Every member reports its results to the team leader. The team leader applies its own function to weight the results and reports them to the committee. Finally, the committee chooses the best team and reports the recognition result to the system. The weightings at each level have to be trained.
• Rule-based Integration: Rule-based approaches are well established and applied in many integration applications. They are similar to temporal approaches, but are not strictly bound to timing constraints. The combination of different inputs is given by rules or lookup tables. For example, in a lookup table every combination is rated with a score: redundant inputs are represented with high scores, conflicting information with negative scores. This makes it possible to profit from redundant and complementary information while excluding conflicting information. A problem with rule-based integration is complexity: many of the rules have preconditions in other rules, and with an increasing number of commands these preconditions lead to exponential complexity. The rules are highly specialized and bound to the domains they were developed for. Moreover, application knowledge has to be integrated into the design process to a high degree. Thus, such systems are hard to scale up.
• Temporal Aspects: Overlapping inputs, or inputs that fall within a specific period of time, are combined by the temporal integrator. It checks whether two or more signals from different input devices occur within a specific period of time. The main use of temporal integrators is to determine whether a signal should be interpreted on its own or in combination with other signals. Programmed with results from user studies, the system can exploit knowledge about how users react in general. Temporal integration consists of two parts, microtemporal and macrotemporal integration: Microtemporal integration is applied to combine related information from the various recognizers in a parallel or pseudo-parallel manner; overlapping redundant or complementary inputs are fused, and a system command or a sequence of commands is generated. Macrotemporal integration is used to combine related information from the various recognizers in a sequential manner; redundant or complementary inputs which do not directly overlap but fall together within one timing window are fused by macrotemporal integration. The macrotemporal integrator has to be programmed with results from user studies to determine which inputs in a specified timing window belong together.
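As announced in the description of statistical integration above, the following is a minimal sketch of the cross-product combination of two N-best lists according to equations (4.24) and (4.25); the commands, probabilities and the set of valid combinations are invented for illustration only.

```python
import math
from itertools import product

# Hypothetical N-best lists of two recognizers: (result, probability) pairs.
speech_nbest = [("select", 0.7), ("delete", 0.2), ("zoom", 0.1)]
gesture_nbest = [("point_object_3", 0.6), ("point_object_7", 0.4)]

# Hypothetical database of semantically meaningful command combinations.
valid_combinations = {
    ("select", "point_object_3"), ("select", "point_object_7"),
    ("delete", "point_object_3"), ("delete", "point_object_7"),
}

def statistical_integration(*nbest_lists):
    """Build all combinations C_n (Eq. 4.24), score each by the product of the
    individual probabilities P_n (Eq. 4.25), and return the best valid one."""
    best = None
    for combo in product(*nbest_lists):
        command = tuple(result for result, _ in combo)
        p_n = math.prod(p for _, p in combo)
        if command in valid_combinations and (best is None or p_n > best[1]):
            best = (command, p_n)
    return best  # None if no meaningful combination exists

print(statistical_integration(speech_nbest, gesture_nbest))
# roughly (('select', 'point_object_3'), 0.42)
```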
4.3.4 Errors in Multimodal Systems

Generally, we speak of an error in HCI if the user does not reach her or his desired goal and mere chance cannot be held responsible for it. Independent of the domain, error robustness substantially influences the user acceptance of a technical system and is therefore indispensable. Errors can basically never be avoided completely. Thus, both passive (a-priori) and active (a-posteriori) error handling are of great importance. Undesired system reactions due to faulty operation as well as system-internal errors must be avoided as far as possible, and error resolution must be efficient, transparent and robust. The following scenario shows a familiar error-prone situation: A driver wants to change the current radio station using the speech command "listen to hot radio". The speech recognizer misinterprets the command as "stop radio" and the radio stops playing.

4.3.4.1 Error Classification

For a systematic classification of error types, we basically assume that either the user or the system can cause an error in the human-machine communication process.

4.3.4.2 User Specific Errors

The user interacting with the system is one error source. According to J. Reason [42], user-specific errors can be categorized on three levels:
• Errors on the skill-based level (e.g. slipping off a button): The skill-based level comprises smooth, automated, and highly integrated routine actions that take place without conscious attention or control. Human performance is governed by stored patterns of pre-programmed instructions represented as analog structures in a time-space domain. Errors at this level are related to the intrinsic variability of force, space, or time coordination. Only sporadically does the user check whether the action initiated by her or him runs as planned and whether the plan for reaching the focused goal is still adequate. Error
patterns on the skill-based level are execution or memory errors that result from inattention or overattention of the user.
Errors on the rule-based level (e.g., using an valid speech command, which is not permitted in this, but in another mode) Concerning errors on the rule-based level, the user violates stored prioritized rules (so-called productions). Errors are typically associated with the misclassification of situations leading to the application of the wrong rule or with the incorrect recall of procedures.
•
Errors on the knowledge-based level (e.g., using a speech command, which is unknown to the system) At the knowledge-based level, the user applies stored knowledge and analytical processes in novel situations in that actions must be planned on-line. Errors at this level arise from resource limitations (bounded rationality) and incomplete or incorrect knowledge.
4.3.4.3 System Specific Errors In the error taxonomy errors caused by the system are addressed. System specific errors can be distinguished in three categories: •
• Errors on the recognition level: Examples are misinterpretations, false recognition of a correct user input, or an unintended system-intrinsic activation of a speech recognizer (e.g. the user coincidentally utters the keyword which activates the speech recognizer during a conversation).
• Errors on the processing level: Timing problems or contradictory recognition results of different monomodal recognizers (e.g., the result of speech recognition differs from the gesture recognition input) cause processing errors.
• Errors on the technical level: System overflow or the breakdown of system components leads to technical errors.
4.3.4.4 Error Avoidance
According to [49] there are eight rules for designing user interfaces. These rules are derived from experience and applicable to most interactive systems. They do not serve error avoidance directly, but they simplify the user's interaction with the system and thereby prevent many potential errors. From these rules one can derive some guidelines for multimodal interfaces:
• Strive for consistency: Similar situations should consist of similar sequences of action and identical terminology, e.g., in menus or prompts. For multimodal interfaces, consistency is required in two ways. First, within one situation all commands should be the same for all modalities. Second, within one modality, all similar commands in different situations should be the same. E.g., the command to return to the main menu should be identical in all submenus, and all submenus should be accessible via all modalities by the same command (the menu name on the button is identical to the speech command).
• Offer informative feedback: Every step of interaction should be answered with a system feedback. This feedback should be modest for frequent and minor actions and more prominent for infrequent or major actions. Multimodal interfaces can exploit the advantages of different output modalities; feedback should be given in the same modality as the input modality used. E.g., a speech command should be answered with an acoustic feedback (see the sketch below).
• Reduce short-term memory load: Human information processing is limited in short-term memory, which calls for simple dialog systems. Multimodal systems should use the modalities in a way that reduces the user's memory load. E.g., selecting an object is easy by pointing at it, but difficult by using speech commands.
• Synchronize multiple modalities: Interaction by speech is highly temporal, while visual interaction is spatial. These input modalities need to be synchronized, e.g., when objects are selected by pointing at them with the finger and the selection is triggered by speech ("Select this item").
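As a toy illustration of the feedback guideline, the sketch below picks a feedback channel that mirrors the input modality and scales the feedback with the importance of the action; the modality table and the two intensity levels are invented for this example.

def choose_feedback(input_modality: str, frequent_action: bool) -> dict:
    """Return feedback channel and intensity for a recognized command:
    feedback mirrors the input modality; frequent/minor actions get
    modest feedback, infrequent/major actions a more prominent one."""
    channel = {"speech": "acoustic",    # spoken confirmation or tone
               "gesture": "visual",     # on-screen highlight
               "button": "visual"}.get(input_modality, "visual")
    return {"channel": channel,
            "intensity": "modest" if frequent_action else "major"}

print(choose_feedback("speech", frequent_action=True))
# {'channel': 'acoustic', 'intensity': 'modest'}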
There are also many other design aspects and guidelines that help to avoid errors. In comparison to monomodal interfaces, multimodal interfaces can even improve error avoidance by enabling the user to choose freely which input modality to use. In this way, the user can select the input modality that is the most comfortable and efficient way to achieve his aim. Also, when interacting over more than one channel, typical errors of a single modality can be compensated by fusing all inputs into one piece of information. Thus, providing more than one input modality increases the robustness of the system and helps to avoid errors in advance.
4.3.4.5 Error Resolution
When errors occur (system or user errors), the system tries to solve the upcoming problems by initiating dialogs with the user. Error resolution strategies can be divided into single-step and multi-level dialog strategies. In a single-step strategy, a system prompt is generated to which the user can react with an individual input. In a multi-level strategy, a more complex inquiry dialog is initiated in which the user is led step by step through the error resolution process. Especially the second approach offers enormous potential for adaptation to the current user and the momentary environment. For example, the following error handling strategies can be differentiated:
• Warning
• Asking for repetition of last input
• Asking to change the input modality
• Offering alternative input modalities
These strategies differ in characteristics such as the initialization of the error warning, the strength of context, individual characteristics of the user, the complexity of the error strategy, and the degree of user involvement. The choice of dialog strategy depends mainly on contextual parameters and the current state of the system (e.g., the input modality chosen, the state of the application). According to Sect. 4.3.3, the error management component is located in the integration level of a multimodal architecture [30]. The error management process consists of four steps: error feature extraction, error analysis, error classification, and error resolution. First, a module continuously extracts certain features from the stream of incoming messages and checks them for error potential. Then, the individual error patterns are classified; in this phase of the error management process, the resulting error type(s) are determined. Afterwards, for the selection of a dedicated dialog strategy, the current context parameters as well as the error types are analyzed. From the results, the strategy with the highest plausibility is chosen and finally helps to resolve the failure in a way that is comfortable for the user. Summarizing, the importance of error robustness for multimodal systems has been discussed and ways of avoidance and resolution have been presented.
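The four-step error management process could be organized roughly as in the following sketch. The feature indicators, error types, and strategy table are hypothetical placeholders chosen for illustration; they do not reproduce the components of the framework in [30].

def extract_error_features(messages):
    """Step 1: derive simple indicators from the incoming message stream."""
    return {"low_confidence": any(m.get("confidence", 1.0) < 0.4 for m in messages),
            "contradiction": len({m["command"] for m in messages}) > 1,
            "timeout": any(m.get("timeout", False) for m in messages)}

def classify_error(features):
    """Steps 2/3: map the extracted feature pattern to an error type."""
    if features["timeout"]:
        return "technical"
    if features["contradiction"]:
        return "processing"
    if features["low_confidence"]:
        return "recognition"
    return "none"

def select_strategy(error_type, context):
    """Step 4: choose the resolution dialog with the highest plausibility,
    taking contextual parameters (here only a noise level) into account."""
    table = {"recognition": "ask_repeat" if context["noise"] < 0.5 else "offer_other_modality",
             "processing": "ask_change_modality",
             "technical": "warning",
             "none": "no_action"}
    return table[error_type]

messages = [{"command": "stop radio", "confidence": 0.35},
            {"command": "hot radio", "confidence": 0.80}]
features = extract_error_features(messages)
print(select_strategy(classify_error(features), context={"noise": 0.7}))
# -> ask_change_modality (contradictory monomodal results)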
4.4 Emotions from Speech and Facial Expressions
Today the great importance of integrating emotional aspects as the next step toward more natural human-machine interaction is commonly accepted. Throughout this chapter we therefore want to give an overview of important existing approaches to recognizing human affect from the audio and video signal.
4.4.1 Background
Within this section we motivate emotion recognition and the modalities chosen here. Also, we introduce models of emotion and discuss databases.
4.4.1.1 Application Scenarios
Even though button pressing is starting to be substituted by more natural communication forms such as talking and gesturing, human-computer communication still feels somehow impersonal, insensitive, and mechanical. A comparative glance at human-human communication reveals the lack of the extra information sensed by humans concerning the affective state of the counterpart. This emotional information strongly influences the explicit information as recognized by today's human-machine communication systems, and with increasingly natural communication, respecting it will be expected. Throughout the design of next generation man-machine interfaces, the inclusion of this implicit channel therefore seems obligatory [10]. Automatic emotion recognition is nowadays already introduced experimentally in call centers, where an annoyed customer is handed over from a robot to a human call operator [11, 25, 39], and in first commercial lifestyle products such as fun software
intended to detect lies or the stress or love level of a telephoner. Besides these, more general fields of application are an improved comprehension of the user's intention, emotional accommodation in the communication (e.g., adaptation of acoustic parameters for speech synthesis if a user seems sad), behavioral observation (e.g., whether an airplane passenger seems aggressive), objective emotional measurement (e.g., as a guideline for therapists), transmission of emotion (e.g., sending laughing or crying images within text-based emails), affect-related multimedia retrieval (e.g., highlight spotting in a sports event), and affect-sensitive lifestyle products (e.g., a trembling cross-hair in video games if the player seems nervous) [3, 10, 38].
4.4.1.2 Modalities
Human emotion is basically observable within a number of different modalities. First attempts at automatic recognition applied invasive measurement of, e.g., skin conductivity, heart rate, or temperature [40]. While exploiting this information source provides a reliable estimation of the underlying affect, it is often felt to be uncomfortable and unnatural, as the user needs to be wired or at least has to stay in touch with a sensor. Modern emotion recognition systems therefore focus on video or audio based non-invasive approaches in the style of human emotion recognition: it is claimed that we communicate 55% visually through body language, 38% through the tone of our voice, and 7% through the actual spoken words [32]. In this respect the most promising approach clearly seems to be a combination of these sources; however, in some systems and situations only one may be available. Interestingly, contrary to most other modalities, speech allows the user to control the amount of emotion shown, which may play an important role if the user otherwise feels observed too closely. Speech-based emotion recognition already provides reasonable results today; nevertheless, the visual information certainly helps to enable a more robust estimation [38]. From an economic point of view, a microphone as sensor is standard hardware in many HCI systems today, and more and more cameras appear as well, as in current-generation cellular phones. In these respects we want to give an insight into acoustic, linguistic, and vision-based information analysis in search of affect within this chapter, and provide solutions for a fusion of these.
4.4.1.3 Emotion Model
Prior to recognizing emotion one needs to establish an underlying emotion model. In order to obtain a robust recognition performance it seems reasonable to limit the complexity of the model, e.g., the kind and number of emotion labels used, in view of the target application. This may be one of the reasons that no consensus about such a model exists in technical approaches yet. Two generally different views dominate the scene: on the one hand, an emotion sphere is spanned by two or three orthogonal axes: firstly arousal or activation, reflecting the readiness to take some action, secondly valence or evaluation, considering a positive or negative attitude, and finally control or power, analyzing the speaker's dominance or submission [10]. While this approach provides a good basis for emotional synthesis, it is often too complex for concrete application scenarios. The better known way therefore is to classify emotion
by a limited set of discrete emotion tags. A first standard set of such labels exists within the MPEG-4 standard, comprising anger, disgust, fear, joy, sadness, and surprise [34]. In order to discriminate a non-emotional state it is often supplemented by neutrality. While this model conflicts with many psychological approaches, it provides a feasible basis from a technical point of view. Further emotions such as boredom are, however, often added.
4.4.1.4 Emotional Databases
In order to train and test the intended recognition engines, a database of emotional samples is needed. Such a corpus should provide spontaneous and realistic emotional behavior from the field. The samples should ideally be of studio audio and video quality, but for analyzing robustness in noise, samples with known background noise conditions may also be desired. A database further has to consist of a high number of ideally equally distributed samples for each emotion, both of the same and of many different persons in total. These persons should form a representative sample with respect to gender, age group, ethnic background, and other factors. For further variability, the uttered phrases should differ in content, length, or even language. An unambiguous assignment of collected samples to emotion classes is especially hard in this discipline. Perception tests by human test persons are therefore very useful: as we know, it may be hard to rate one's emotion for sure. In this respect it seems obvious that comparatively lower recognition rates can be expected in this discipline than in related pattern recognition tasks. However, the named human performance provides a reasonable benchmark for the maximum expectation. Finally, a database should be made publicly available in view of international comparability, which seems problematic considering the lacking consensus about the emotion classes used and the privacy of the test persons. A number of methods exist to create a database, with arguably different strengths: the predominant ones among these are acting or eliciting of emotions in test set-ups, hidden or conscious long-term observations, and the use of clips from public media content. Most databases use acted emotions, which allow the named requirements to be fulfilled except for spontaneity, as there is doubt whether acted emotions are capable of representing the true characteristics of affect. Still, they provide a reasonable starting point, considering that databases of real emotional speech are hard to obtain. An overview of existing speech databases can be found in [54]. Among the most popular ones are the Danish Emotional Speech Database (DES), the Berlin Emotional Speech Database (EMO-DB), and the AIBO Emotional Speech Corpus (AEC) [4]. Audio-visual databases are, however, still sparse, especially in view of the named requirements.
4.4.2 Acoustic Information
Basically, two main information sources are exploited for emotion recognition from speech: the acoustic information, analyzing the prosodic structure, and the spoken content itself, namely the language information. The predominant aims, besides high reliability, are independence of the
speaker, the spoken language, the spoken content (when considering acoustic processing), and the background noise. Besides the underlying emotion model and the database size and quality, a number of parameters strongly influence the quality in these respects and will mostly be addressed in the following: the signal capturing, pre-processing, feature selection, classification method, and a reasonable integration into the interaction and application context.
4.4.2.1 Feature Extraction
In order to estimate a user's emotion from acoustic information one has to carefully select suited features. These have to carry information about the transmitted emotion, but they also need to fit the chosen modeling by means of classification algorithms. Feature sets used in existing works differ greatly, but the feature types used in acoustic emotion recognition may be divided into prosodic features (e.g., intensity, intonation, durations), voice quality features (e.g., the positions and bandwidths of formants 1-7, harmonic-to-noise ratio (HNR), spectral features, 12-15 Mel Frequency Cepstral Coefficients (MFCC)), and articulatory ones (e.g., the spectral centroid, and harder-to-compute ones such as the centralization of vowels). In order to calculate these, the speech signal is first weighted with a shifting soft window function (e.g., a Hamming window) of 10-30 ms length with a window overlap of around 50%. This is a common procedure in speech processing and is needed because the speech signal is only quasi-stationary. Next, a contour value is computed for every frame and every contour, leading to a multivariate time series. For intensity, mostly simple logarithmic frame energy is computed; it should be mentioned, however, that this does not reflect human perception. Spectral analysis mostly relies on the Fast Fourier Transform or MFCC – a standard homomorphic spectral transformation in speech processing aiming at de-convolution of the vocal tract transfer function and at perceptual modeling by the Mel frequency scale. First problems arise here, as the remaining feature contours pitch, HNR, and formants can only be estimated. Pitch and HNR in particular can either be derived from the spectrum or – more popularly – by peak search within the autocorrelation function of the speech signal. Formants may be obtained by analysis of the Linear Prediction Coefficients (LPC), which we will not dig into. Often backtracking by means of dynamic programming is used to ensure smooth feature contours and to reduce global rather than local costs. Mostly, higher order derivatives such as speed and acceleration are also included to better model temporal changes. In any case, filtering of the contours leads to a gain by noise reduction and is done with low-pass filters such as moving average or median filters [34, 46]. Next, two generally different approaches exist for the further acoustic feature processing in view of the succeeding classification: dynamic and static modeling. Within the dynamic approach the raw feature contours, e.g., the pitch or intensity contours, are directly analyzed frame-wise by methods capable of handling multivariate time series by dynamic programming (e.g., Hidden Markov Models (HMM) [34] or Dynamic Bayesian Nets (DBN)). The second way, by far more popular, is to systematically derive functionals from the time series by means of descriptive statistics. Mostly used are moments
such as mean and standard deviation, or extrema and their positions. Zero-crossing rates (ZCR), the number of turning points, and others are also often considered. As temporal information is thereby mostly lost, duration features are included, such as the mean length of pauses or voiced sounds; these are, however, more complex to estimate. All features should generally be normalized, by mean subtraction and division by the standard deviation or by the maximum, as some classifiers are susceptible to differing value ranges. In a direct comparison under constant test conditions, the static features outperformed the dynamic approach in our studies [46]. This is largely due to the unsatisfactory independence of the overall contour from the spoken content. In the following we therefore focus on the static approach.
4.4.2.2 Feature Selection
Now that we have generated a high order multivariate time series (approx. 30 dimensions), included delta regression coefficients (approx. 90 dimensions in total), and started to derive static features in a deterministic way, we end up with too high a dimensionality (>300 features) for most classifiers to handle, especially considering the typically sparse databases in this field. Also, we cannot expect every generated feature to actually carry important and non-redundant information about the underlying affect; still, each one costs extraction effort. We therefore aim at a dimensionality reduction by feature selection (FS) methods. An often chosen approach is to use Principal Component Analysis (PCA) in order to construct superposed features out of all features and to select the ones with the highest eigenvalues, corresponding to the highest variance [8]. This, however, does not save the original extraction effort, as all features are still needed for the computation of the artificial ones. A genuine reduction of the original features should therefore be favored, and can be achieved, e.g., by single feature relevance calculation (e.g., information gain ratio based on entropy calculation), known as filter-based selection. Still, the best single features do not necessarily result in the best set. This is why so-called wrapper-based selection methods usually deliver better overall results at lower dimensionality. The term wrapper alludes to the fact that the target classifier is used as the optimization target function, which helps to optimize not only a set, but rather the compound of features and classifier as a whole. A search function is needed, as exhaustive search is in general intractable. Mostly applied among these are hill climbing search methods (e.g., Sequential Forward Search (SFS) or Sequential Backward Search (SBS)), which start from an empty or full feature set and step-wise add the most important or remove the least relevant feature. If this is done in a floating manner, we obtain the Sequential Floating Search Methods (SFSM) [48]. Several other powerful methods of this kind exist, among which especially genetic search proves effective [55]. After such a reduction the feature vector may be reduced to approximately 100 features, and will be classified in a next step.
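The following sketch illustrates wrapper-based Sequential Forward Search with a linear SVM as the target classifier, using scikit-learn for cross-validation. The toy data, the classifier choice, and the stopping criterion (stop when the accuracy no longer improves or a maximum set size is reached) are assumptions made for demonstration only.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def sequential_forward_search(X, y, max_features=10, cv=5):
    """Greedy wrapper selection: repeatedly add the single feature that
    maximizes the cross-validated accuracy of the target classifier."""
    remaining = list(range(X.shape[1]))
    selected, best_score = [], 0.0
    while remaining and len(selected) < max_features:
        scored = []
        for f in remaining:
            cols = selected + [f]
            acc = cross_val_score(SVC(kernel="linear"), X[:, cols], y, cv=cv).mean()
            scored.append((acc, f))
        acc, f = max(scored)
        if acc <= best_score:          # no further improvement: stop
            break
        best_score = acc
        selected.append(f)
        remaining.remove(f)
    return selected, best_score

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 30))                 # stands in for normalized static functionals
y = (X[:, 3] + 0.5 * X[:, 7] > 0).astype(int)  # toy binary emotion labels
print(sequential_forward_search(X, y, max_features=5))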
4.4.2.3 Classification Methods
A number of factors influence the choice of the classification method. Besides high recognition rates and efficiency, economical aspects and a reasonable integration into the target application framework play a role. Ongoing research covers a broad spectrum reaching from rather basic classifiers such as instance based learners or Naive Bayes to more complex ones such as Decision Trees, Artificial Neural Nets (e.g., Multi Layer Perceptrons (MLP) or Radial Basis Function Networks (RBF)), and Support Vector Machines (SVM) [10, 38]. While the more complex classifiers tend to show better results, no general agreement has been found so far; SVM, however, tend to be among the most promising ones [45]. The power of such base classifiers can also be boosted or combined by methods of ensemble construction (e.g., Bagging, Boosting, or Stacking) [39, 48, 55].
4.4.3 Linguistic Information
Up to here we have described a considerable amount of research effort on feature extraction and classification algorithms for the investigation of vocal properties, with the purpose of inferring the probably expressed emotion from the sound. So the question about the information transmitted within the acoustic channel, "How was it said?", has been addressed with great success. Recently more attention has been paid to the interpretation of the spoken content itself, dealing with the related question "What was said?" in view of the underlying affect. Psychological studies claim that a connection between certain terms and the related emotion has been learned by the speaker [25]. As the speaker's expression of his emotion consists in the usage of certain phrases that are likely to be mixed with meaningful statements in the context of the dialog, an approach with abilities in spotting emotionally relevant information is needed. Consider for this the example "Could you please tell me much more about this awesome field of research". The ratio of affective words clearly depends on the underlying application background and the personal nature of the speaker, but it will mostly be very low. It therefore remains questionable whether linguistic information is sufficient when applied stand-alone. However, its integration has shown a clear increase in performance [8, 11, 25, 45], even though the conclusions drawn rely per definition on erroneous Automatic Speech Recognition (ASR) output. In order to reasonably handle the incomplete and uncertain data of the ASR unit, a robust approach should take acoustic confidences into account throughout the processing. Still, no existing system for emotional language interpretation calculates an output data certainty based upon the input data certainty, except for the one presented in [47]. The most trivial approach to linguistic analysis would be the spotting of single emotional terms w_j ∈ U within an utterance U labeled with an emotion e_i ∈ E out of the set of emotions E. All known emotional keywords w_k ∈ V would then be stored within a vocabulary V. In order to handle only emotional keywords, sorting out abstract terms that cannot carry information about the underlying emotion, such as names, helps, comparable to the feature space reduction for acoustic features. This is known as stopping within linguistics. A so-called stop-list
can thereby be obtained either by expert knowledge or by automated approaches such as the calculation of the salience of a word [25]. One can also cope with emotionally irrelevant information by a normalized log likelihood ratio between an emotion model and a general task specific model [11]. Additionally, by stemming, words of the same stem are clustered, which also reduces the vocabulary size while in general directly increasing performance. This is because hits within an utterance are crucial, and their number increases significantly if none is lost due to minor word differences such as plural forms or verb conjugations. However, such an approach does not model word order, or the fact that one term can represent several emotions, which leads us to more sophisticated approaches, as shown in the following.
4.4.3.1 N-Grams
A common approach to speech language modeling is the use of n-grams. Let us first assume the conditional probability of a word w_j given its predecessors from left to right within an utterance U is P(w_j | w_1, ..., w_{j-1}). Next, for language-interpretation-based emotion recognition, class-based n-grams are needed. Likewise, an emotion e_i within the emotion set E shall have the a-posteriori probability P(e_i | w_1, ..., w_j) given the word w_j and its predecessors in U. However, following Zipf's principle of least effort, which implies that irrelevant function words occur very frequently while terms of interest are rather sparse, we reduce the number of considered words to N in order to prevent over-modeling. Applying the first order Markov assumption we can therefore use the following estimation:
$$P(e_i \mid w_1, \dots, w_j) \approx P(e_i \mid w_{j-N+1}, \dots, w_j) \qquad (4.26)$$
However, mostly uni-grams have been applied so far [11, 25], besides bi-grams and tri-grams [2], which is due to the very limited typical corpus sizes in speech emotion recognition. Uni-grams provide the probability of an emotion under the condition of single known words, i.e., words contained in a vocabulary, without modeling of neighborhood dependencies. In a decision process, the actual emotion e can then be calculated as:

$$e = \arg\max_i \sum_{(w_j \in U) \wedge (w_j \in V)} P(e_i \mid w_j) \qquad (4.27)$$
Now, in order to calculate P(e_i | w_j), we can use simple Maximum Likelihood Estimation (MLE):

$$P_{\mathrm{MLE}}(e_i \mid w_j) = \frac{TF(w_j, e_i)}{TF(w_j, E)} \qquad (4.28)$$
Let TF(w_j, e_i) denote the frequency of occurrence of the term w_j tagged with the emotion e_i within the whole training corpus; TF(w_j, E) then equals the total frequency of occurrence of this particular term. One problem is that term/emotion couples that never occur lead to a probability of zero. As this is critical within the overall calculation, we assume that every word has a general
probability to appear under each emotion. This is realized by introduction of the Lidstone coefficient λ as shown in the following equation:

$$P_{\lambda}(e_i \mid w_j) = \frac{TF(w_j, e_i) + \lambda}{TF(w_j, E) + \lambda \cdot |E|}, \quad \lambda \in [0, 1] \qquad (4.29)$$
If λ equals one, this is also known as Maximum a-Posteriori (MAP) estimation.
4.4.3.2 Bag-of-Words
The so-called Bag-of-Words method is a standard representation form for text in automatic document categorization [21] that can also be applied to recognize emotion [45]. Each word w_k ∈ V in the vocabulary V adds a dimension to a linguistic vector x representing the logarithmic term frequency within the actual utterance U, known as logTF (other calculation methods exist, which we will not show here). A component x_{logTF,j} of the vector x_{logTF} with dimension |V| can be calculated as:

$$x_{\mathrm{logTF},j} = \log \frac{TF(w_j, U)}{|U|} \qquad (4.30)$$
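A small sketch of building the logTF vector of (4.30) after stopping and (very naive) stemming is given below; the stop-list, the suffix-stripping rule, and the mapping of zero counts to 0.0 (to avoid log 0) are illustrative assumptions.

import math

STOP_WORDS = {"i", "you", "the", "a", "to", "do", "at", "all", "not", "too"}

def stem(word):
    # extremely naive suffix stripping, standing in for a real stemmer
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def log_tf_vector(utterance, vocabulary):
    tokens = [stem(w) for w in utterance.lower().split() if w not in STOP_WORDS]
    n = max(len(tokens), 1)
    # zero counts are mapped to 0.0 instead of log(0)
    return [math.log(tokens.count(term) / n) if tokens.count(term) else 0.0
            for term in vocabulary]

vocabulary = ["good", "great", "awesome", "stupid", "hate"]
print(log_tf_vector("I do not feel too good at all", vocabulary))

As the example shows, only the isolated term "good" contributes after stopping and stemming; the negation is lost, which is exactly the weakness addressed by the phrase spotting approach in the next subsection.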
In (4.30) the term frequency is normalized by the phrase length. Since a high dimensionality may decrease the performance of the classifier, and since inflections of terms reduce performance especially within small databases, stopping, stemming, or further methods of feature reduction are mandatory. A classification can now be performed as with the acoustic features; preferably, one would choose SVM for this task, as they are well known to show high performance [21]. As with the uni-grams, word order is not modeled by this approach. One advantage, however, is the possibility of directly including the linguistic features within the acoustic feature vector.
4.4.3.3 Phrase Spotting
As mentioned, a key drawback of the approaches so far is their lacking view of the whole utterance. Consider the negation in the following example: "I do not feel too good at all," where the positively perceived term "good" is negated. Therefore, Bayesian Nets (BN) may be used as a mathematical background for the semantic analysis of spoken utterances, taking advantage of their capabilities in spotting and handling uncertain and incomplete information [47]. Each BN consists of a set of N nodes related to state variables X_i, each comprising a finite set of states. The nodes are connected by directed edges reaching from parent to child nodes and expressing quantitatively the conditional probabilities of nodes given their parent nodes. A complete representation of the network structure and conditional probabilities is provided by the joint probability distribution:

$$P(X_1, \dots, X_N) = \prod_{i=1}^{N} P(X_i \mid \mathrm{parents}(X_i)) \qquad (4.31)$$
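To make the factorization in (4.31) concrete, the toy sketch below evaluates the joint probability of a three-node net (one emotion root with two word-evidence children) and infers the root by enumeration. The structure, the probability tables, and the use of hard evidence are invented for illustration and are far simpler than the layered network with soft evidences described in the following.

# toy net: Emotion -> PosWord, Emotion -> NegWord (word nodes are binary)
P_emotion = {"joy": 0.5, "anger": 0.5}                    # root priors
P_pos = {("joy", True): 0.7, ("joy", False): 0.3,         # P(PosWord | Emotion)
         ("anger", True): 0.1, ("anger", False): 0.9}
P_neg = {("joy", True): 0.1, ("joy", False): 0.9,         # P(NegWord | Emotion)
         ("anger", True): 0.6, ("anger", False): 0.4}

def joint(e, pos, neg):
    """Eq. (4.31): product of each node's probability given its parents."""
    return P_emotion[e] * P_pos[(e, pos)] * P_neg[(e, neg)]

def root_posterior(pos, neg):
    """Infer the root (emotion) from hard evidence on the word nodes."""
    scores = {e: joint(e, pos, neg) for e in P_emotion}
    z = sum(scores.values())
    return {e: s / z for e, s in scores.items()}

print(root_posterior(pos=False, neg=True))
# {'joy': 0.05..., 'anger': 0.94...}: negative wording points to anger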
Methods of inferring the states of some query variables based on observations of evidence variables are provided by the network. Similar to a standard approach to natural speech interpretation, the aim here is to make the net maximize the probability of the root node modeling the specific emotion expressed by the speaker via his choice of words and phrases. The root probabilities are distributed equally in the initialization phase and resemble the priors of each emotion. If the emotional language information interpretation is used stand-alone, a maximum likelihood decision takes place; otherwise the root probability for each emotion is fed forward to a higher-level fusion algorithm. On the input layer a standard HMM-based ASR engine with zero-grams as language model, providing N-best hypotheses with single word confidences, may be applied. In order to deal with the acoustic certainties the traditional BN may be extended to handle soft evidences [47]. The approach discussed here is based on the integration and abstraction of semantically similar units into higher leveled units in several layers. On the input level the N-best recognized phrases are presented to the algorithm, which maps this input onto defined interpretations via its semantic model consisting of a BN. At the beginning, the spotting of items known to the semantic model is achieved by matching the words in the input level to word-nodes contained in the lowest layer of the BN. Within this step the knowledge about the uncertainty of the recognized words, represented by their confidences, is completely transferred into the interpretation model by accordingly setting soft evidence in the corresponding word-nodes. While stepping forward to any superior model layer, those units resembling each other in their semantic properties regarding the target interpretations are clustered into semantic super-units, until the final layer with the root-nodes of the network is reached. Thereby the evidences assigned to word-nodes due to corresponding appearances in the utterance finally result in changes of the probabilities of the root-nodes, representing the confidences of each specific emotion and their extent. Altogether, the BN approach allows for an entirely probabilistic processing of uncertain input to gain output afflicted with real probabilities. To illustrate what is understood as semantically similar, consider for instance some words expressing a positive attitude, such as "good", "well", "great", etc., being integrated into a super-word "Positive". The quantitative contribution P(e_i | w_j) of any word w_j to the belief in an emotion e_i is calculated in a training phase by its frequency of occurrence under the observation of the emotion on the basis of the speech corpus, as shown for the n-grams. Given a word order within a phrase, an important modification of the classic BN has to be carried out, as BNs are in general not capable of processing sequences due to their entirely commutative evidence assignment.
4.4.4 Visual Information
For a human spectator the visual appearance of a person provides rich information about his or her emotional state [32]. Several sources can be identified, such as body pose (upright, slouchy), hand gestures (waving about, folded arms), head gestures (nodding, inclining), and – especially in a direct close conversation – the variety of facial expressions (smiling, surprised, angry, sad, etc.). Very few
approaches exist towards an affective analysis of body pose and gestures, while several works report considerable efforts in investigating various methods for facial expression recognition. We therefore concentrate on the latter in this section, provide background information, necessary pre-processing stages, and algorithms for affective face analysis, and give an outlook on forthcoming developments.
4.4.4.1 Prerequisites
Similar to the acoustic analysis of speech for emotion estimation, the task of Facial Expression Recognition can be regarded as a common pattern recognition problem with the familiar stages of preprocessing, feature extraction, and classification. The addressed signal, i.e., the camera view of a face, provides a lot of information that is neither dependent on nor relevant to the expression. These are mainly ethnic and inter-cultural differences in the way feelings are expressed, inter-personal differences in the look of the face, gender, age, facial hair, hair cut, glasses, orientation of the face, and direction of gaze. All these influences are quasi disturbing noise with respect to the target source, the facial expression, and the aim is to reduce the impact of all noise sources while preserving the relevant information. This task, however, constitutes a considerable challenge located in the preprocessing and feature extraction stages, as explicated in the following. From the technical point of view another disturbing source should be minimized: the variation of the position of the face in the camera image. Therefore, the preprocessing comprises a number of modules required for robust head localization and estimation of the face orientation. As a matter of fact, facial expression recognition algorithms perform best when the face is localized most accurately with respect to the person's eyes; "best" here addresses both the quality and the execution time of the method. However, eye localization is a computationally even more complex task than face localization. For this reason a layered approach is proposed, where the search area for the eyes is limited to the output hypotheses of the preceding face localizer. Automatic Facial Expression Recognition has still not arrived in real world scenarios and applications. Existing systems postulate a number of restrictions regarding camera hardware, lighting conditions, facial properties (e.g., glasses, beard), allowed head movements of the person, and the view of the face (frontal, profile). Different approaches show different robustness with respect to the mentioned parameters. In the following we want to give an insight into the different basic ideas that are applied to automatic mimic analysis. Many of them were derived from the task of face recognition, which is the visual identification or authentication of a person based on his or her face. The applied methods can generally be categorized into holistic and non-holistic or analytic ones [37].
4.4.4.2 Holistic Approaches
Holistic methods (Greek: holon = the whole, all parts together) strive to process the entire face as it is, without any incorporated expert knowledge like geometric properties or special regions of interest for mimic analysis. One approach is to extract a comprehensive and respectively large set of features from the luminance representation or textures of the face image. Exemplarily,
Gabor wavelet coefficients proved to be an adequate parametrization of textures and edges in images [51]. The response of the Gabor filter can be written as a correlation of the input image I(x) with the Gabor kernel p_k(x):

$$a_k(\mathbf{x}_0) = \int I(\mathbf{x})\, p_k(\mathbf{x} - \mathbf{x}_0)\, d\mathbf{x} \qquad (4.32)$$

where the Gabor filter p_k(x) can be formulated as:

$$p_k(\mathbf{x}) = \frac{k^2}{\sigma^2} \exp\!\left(-\frac{k^2 \mathbf{x}^2}{2\sigma^2}\right) \left[\exp(i\,\mathbf{k}\mathbf{x}) - \exp\!\left(-\frac{\sigma^2}{2}\right)\right] \qquad (4.33)$$

with k being the characteristic wave vector. Most works [26] apply 5 spatial frequencies with k_i = π/2, π/4, π/8, π/16, π/32 and 8 orientations from 0 to π, differing by π/8, while σ is set to the value of π. Consequently, 40 coefficients are computed for each position of the image. Let I(x) be of height 150 pixels and width 100 pixels; then 150 · 100 · 40 = 6 · 10^5 features are computed. Subsequently, machine learning methods for feature selection identify the most relevant features, as described above, which allow for a best possible discrimination of the addressed classes during the training phase. A common algorithm applied to this problem was proposed by Freund and Schapire and is known as AdaBoost.M1 [13, 19]. AdaBoost and its derivatives are capable of operating on feature vectors of six-digit dimensionality while the execution time remains tolerable. For the assignment of the reduced static feature set to emotional classes, any statistical algorithm can be used. In the case of video processing and real-time requirements, the choice of classifiers might focus on linear methods or decision trees, depending on the computational effort needed to localize the face and extract the limited number of features. In Face Recognition the approach of Eigenfaces proposed by Turk and Pentland has been examined thoroughly [52]. The aim is to find the principal components of the distribution of two-dimensional face representations. This is achieved by the determination and selection of Eigenvectors of the covariance matrix of a set of representative face images; this set should cover different races. Each image of size N × M pixels, which can be thought of as an N × M matrix of 8-bit luminance values, is transformed into a vector of dimensionality N · M. Images of faces, being similar in overall configuration, are not randomly distributed in this very high-dimensional image space and can thus be described by a relatively low dimensional subspace spanned by the relevant Eigenvectors. This relevance is measured by the corresponding Eigenvalues, indicating the amount of variation among the faces. Each image pixel contributes more or less to each Eigenvector, so that each of them can be displayed, resulting in a kind of ghostly face – the so-called Eigenface. Finally, each individual face can be represented by a linear combination of these Eigenfaces, with a remaining error due to the reduced dimensionality of the new face space. The coefficients or weights of that linear combination which minimize the error between the original image and the face space representation now serve as features for any kind of classification. In case of Facial Expression Recognition the covariance matrix would be computed on
a set of faces expressing all categories of mimics. Subsequently, we apply the Eigenvector analysis and extract representative sets of weights for each mimic class to be addressed. During classification of an unknown face, the weights for this image are computed accordingly and are then compared to the weight vectors of the training set.
4.4.4.3 Analytic Approaches
These methods concentrate on the analysis of dominant regions of interest. Thus, pre-existing knowledge about geometry and facial movements is incorporated so that subsequent statistical methods benefit [12]. Ekman and Friesen introduced the so-called Facial Action Coding System in 1978, which consists of 64 Action Units (AU). Presuming that all facial muscles are relaxed in the neutral state, each AU models the contraction of a certain set of them, leading to deformations in the face. The focus lies on the predominant facial features, such as eyes, eyebrows, nose, and mouth; their shape and appearance contain most information about the facial expressions. One approach for analyzing shape and appearance of facial features is known as the Point Distribution Model (PDM). Cootes and Taylor gave a comprehensive introduction to the theory and implementation aspects of PDM. Subclasses of PDM are Active Shape Models (ASM) and Active Appearance Models (AAM), which have shown their applicability to mimic analysis [20]. Shape models are statistical descriptions of two-dimensional relations between landmarks positioned on dominant edges in face images. These relations are freed of all transformations like rotation, scaling, and translation. The different shapes that occur merely due to the various inter-personal proportions are modeled by PCA of the observed landmark displacements during the training phase. The search for the landmarks starts with an initial estimation in which the shape is placed over a face manually, or automatically when the preprocessing stages allow for it. During the search, the edges are approximated by an iterative approach that measures the similarity of the edges under the shape to the model. Appearance Models additionally investigate the textures or gray-value distributions over the face and combine this knowledge with the shape statistics. As mentioned before, works in the research community investigate a broad range of approaches and combinations of different holistic and analytic methods in order to arrive at algorithms that are robust to the broad range of different persons and real-life situations. One of the major problems is still the immense computational effort, and applications will possibly have to be distributed over multiple CPUs and include the power of GPUs to converge to real-time abilities.
4.4.5 Information Fusion
In this section we aim to fuse the acoustic, linguistic, and vision information obtained. This integration (see also Sect. 4.3.3) is often done in a late semantic manner, e.g., as (weighted) majority voting [25]. More elegant, however, is the direct fusion of the streams within one feature vector [45], known as early feature fusion; the advantage is that less knowledge is lost prior to the final decision. A compromise
between these two is the so-called soft decision fusion, whereby the confidence of each classifier is also respected (a small numerical sketch is given at the end of this section). The integration of an N-best list of each classifier is possible as well. A problem, however, is the synchronization of the video stream with the acoustic and linguistic audio streams. Especially if audio is classified on a global word or utterance level, it may become difficult to find video segments that correspond to these units. For this reason, dynamic audio processing may be preferred in view of an early fusion with a video stream.
4.4.6 Discussion
Automatic speech emotion recognition based on acoustic features already comes close to human performance [45], at somewhere around 80% recognition rate. However, usually very idealized conditions such as known speakers and studio recording conditions are considered so far. Video processing does not reach these regions at the time, and its test conditions are very idealized as well. When using emotion recognition systems outside the lab, a number of new challenges arise which have hardly been addressed within the research community up to now. Future research efforts will therefore have to lead to larger databases of spontaneous emotions, robustness under noisy conditions, less person-dependency, reliable confidence measurements, integration of further multimodal sources, integration of contextual knowledge, and acceptance studies of emotion recognition applied in everyday systems. In this respect we are looking forward to a flourishing human-like man-machine communication supplemented by emotion for utmost naturalness.
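The soft decision fusion mentioned in Sect. 4.4.5 can be sketched as a weighted combination of per-modality class posteriors, as in the following toy example; the modality weights and probability values are arbitrary assumptions.

import numpy as np

def soft_decision_fusion(posteriors, weights):
    """Weighted combination of per-modality class probabilities;
    returns the winning emotion label."""
    labels = sorted(next(iter(posteriors.values())))
    combined = np.zeros(len(labels))
    for modality, probs in posteriors.items():
        combined += weights.get(modality, 1.0) * np.array([probs[l] for l in labels])
    return labels[int(np.argmax(combined))]

posteriors = {"acoustic":   {"anger": 0.6, "joy": 0.3, "neutral": 0.1},
              "linguistic": {"anger": 0.2, "joy": 0.5, "neutral": 0.3},
              "visual":     {"anger": 0.7, "joy": 0.2, "neutral": 0.1}}
weights = {"acoustic": 0.4, "linguistic": 0.2, "visual": 0.4}
print(soft_decision_fusion(posteriors, weights))   # -> anger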
References
1. When Do We Interact Multimodally? Cognitive Load and Multimodal Communication Patterns. 2004.
2. Ang, J., Dhillon, R., Krupski, A., Shriberg, E., and Stolcke, A. Prosody-Based Automatic Detection of Annoyance and Frustration in Human-Computer Dialog. In Proceedings of the International Conference on Speech and Language Processing (ICSLP 2002). Denver, CO, 2002.
3. Arsic, D., Wallhoff, F., Schuller, B., and Rigoll, G. Video Based Online Behavior Detection Using Probabilistic Multi-Stream Fusion. In Proceedings of the International IEEE Conference on Image Processing (ICIP 2005). 2005.
4. Batliner, A., Hacker, C., Steidl, S., Nöth, E., Russel, S. D. M., and Wong, M. 'You Stupid Tin Box' - Children Interacting with the AIBO Robot: A Cross-linguistic Emotional Speech Corpus. In Proceedings of the LREC 2004. Lisboa, Portugal, 2004.
5. Benoit, C., Martin, J.-C., Pelachaud, C., Schomaker, L., and Suhm, B., editors. Audiovisual and Multimodal Speech Systems. In: Handbook of Standards and Resources for Spoken Language Systems - Supplement Volume. D. Gibbon, I. Mertins, R. K. Moore, Kluwer International Series in Engineering and Computer Science, 2000.
6. Bolt, R. A. "Put-That-There": Voice and Gesture at the Graphics Interface. In International Conference on Computer Graphics and Interactive Techniques, pages 262–270. July 1980.
7. Carpenter, B. The Logic of Typed Feature Structures. Cambridge, England, 1992.
8. Chuang, Z. and Wu, C. Emotion Recognition using Acoustic Features and Textual Content. In Proceedings of the International IEEE Conference on Multimedia and Expo (ICME) 2004. Taipei, Taiwan, 2004.
9. Core, M. G. Analyzing and Predicting Patterns of DAMSL Utterance Tags. In AAAI Spring Symposium Technical Report SS-98-01. AAAI Press, 1998. ISBN 1-57735046-4.
10. Cowie, R., Douglas-Cowie, E., Tsapatsoulis, N., Votsis, G., Kollias, S., Fellenz, W., and Taylor, J. G. Emotion Recognition in Human-Computer Interaction. IEEE Signal Processing Magazine, 18(1):32–80, January 2001.
11. Devillers, L. and Lamel, L. Emotion Detection in Task-Oriented Dialogs. In Proceedings of the International Conference on Multimedia and Expo (ICME 2003), IEEE, Multimedia Human-Machine Interface and Interaction, volume III, pages 549–552. Baltimore, MD, 2003.
12. Ekman, P. and Friesen, W. Facial Action Coding System. Consulting Psychologists Press, 1978.
13. Freund, Y. and Schapire, R. Experiments with a New Boosting Algorithm. In International Conference on Machine Learning, pages 148–156. 1996.
14. Geiser, G., editor. Mensch-Maschine-Kommunikation. Oldenbourg-Verlag, München, 1990.
15. Goldschen, A. and Loehr, D. The Role of the DARPA Communicator Architecture as a Human-Computer Interface for Distributed Simulations. In Simulation Interoperability Standards Organization (SISO) Spring Simulation Interoperability Workshop. Orlando, Florida, 1999.
16. Grosz, B. and Sidner, C. Attentions, Intentions and the Structure of Discourse. Computational Linguistics, 12(3):175–204, 1986.
17. Hartung, K., Münch, S., and Schomaker, L. MIAMI: Software Architecture, Deliverable Report 4. Report of ESPRIT III: Basic Research Project 8579, Multimodal Interface for Advanced Multimedia Interfaces (MIAMI). Technical report, 1996.
18. Hewett, T., Baecker, R., Card, S., Carey, T., Gasen, J., Mantei, M., Perlman, G., Strong, G., and Verplank, W., editors. Curricula for Human-Computer Interaction. ACM Special Interest Group on Computer-Human Interaction, Curriculum Development Group, 1996.
19. Hoch, S., Althoff, F., McGlaun, G., and Rigoll, G. Bimodal Fusion of Emotional Data in an Automotive Environment. In Proc. of the ICASSP 2005, IEEE Int. Conf. on Acoustics, Speech, and Signal Processing. 2005.
20. Jiao, F., Li, S., Shum, H., and Schuurmans, D. Face Alignment Using Statistical Models and Wavelet Features. In Conference on Computer Vision and Pattern Recognition. 2003.
21. Joachims, T. Text Categorization with Support Vector Machines: Learning with Many Relevant Features. Technical report, LS-8 Report 23, Dortmund, Germany, 1997.
22. Johnston, M. Unification-based Multimodal Integration. In Below, R. K. and Booker, L., editors, Proceedings of the 4th International Conference on Genetic Algorithms. Morgan Kaufmann, 1997.
23. Krahmer, E. The Science and Art of Voice Interfaces. Technical report, Philips Research, Eindhoven, Netherlands, 2001.
24. Langley, P., Thompson, C., Elio, R., and Haddadi, A. An Adaptive Conversational Interface for Destination Advice. In Proceedings of the Third International Workshop on Cooperative Information Agents. Springer, Uppsala, Sweden, 1999.
25. Lee, C. M. and Pieraccini, R. Combining Acoustic and Language Information for Emotion Recognition. In Proceedings of the International Conference on Speech and Language Processing (ICSLP 2002). Denver, CO, 2002.
26. Lee, T. S. Image Representation Using 2D Gabor Wavelets. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(10):959–971, 1996.
27. Levin, E., Pieraccini, R., and Eckert, W. A Stochastic Model of Human-Machine Interaction for Learning Dialog Strategies. IEEE Transactions on Speech and Audio Processing, 8(1):11–23, 2000.
28. Litman, D., Kearns, M., Singh, S., and Walker, M. Automatic Optimization of Dialogue Management. In Proceedings of the 18th International Conference on Computational Linguistics. Saarbrücken, Germany, 2000.
29. Maybury, M. T. and Stock, O. Multimedia Communication, including Text. In Hovy, E., Ide, N., Frederking, R., Mariani, J., and Zampolli, A., editors, Multilingual Information Management: Current Levels and Future Abilities. A study commissioned by the US National Science Foundation and also delivered to the European Commission Language Engineering Office and the US Defense Advanced Research Projects Agency, 1999.
30. McGlaun, G., Althoff, F., Lang, M., and Rigoll, G. Development of a Generic Multimodal Framework for Handling Error Patterns during Human-Machine Interaction. In SCI 2004, 8th World Multi-Conference on Systems, Cybernetics, and Informatics, Orlando, FL, USA. 2004.
31. McTear, M. F. Spoken Dialogue Technology: Toward the Conversational User Interface. Springer Verlag, London, 2004. ISBN 1-85233-672-2.
32. Mehrabian, A. Communication without Words. Psychology Today, 2(4):53–56, 1968.
33. Nielsen, J. Usability Engineering. Academic Press, Inc., 1993. ISBN 0-12-518405-0.
34. Nogueiras, A., Moreno, A., Bonafonte, A., and Marino, J. Speech Emotion Recognition Using Hidden Markov Models. In Eurospeech 2001 Poster Proceedings, pages 2679–2682. Scandinavia, 2001.
35. Oviatt, S. Ten Myths of Multimodal Interaction. Communications of the ACM, 42(11):74–81, 1999.
36. Oviatt, S., Cohen, P., Wu, L., Vergo, J., Duncan, L., Suhm, B., Bers, J., Holzman, T., Winograd, T., Landay, J., Larson, J., and Ferro, D. Designing the User Interface for Multimodal Speech and Pen-based Gesture Applications: State-of-the-Art Systems and Future Research Directions. Human Computer Interaction, 15(4):263–322, 2000.
37. Pantic, M. and Rothkrantz, L. Automatic Analysis of Facial Expressions: The State of the Art. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(12):1424–1445, 2000.
38. Pantic, M. and Rothkrantz, L. Toward an Affect-Sensitive Multimodal Human-Computer Interaction. Proceedings of the IEEE, 91:1370–1390, September 2003.
39. Petrushin, V. Emotion in Speech: Recognition and Application to Call Centers. In Proceedings of the Conference on Artificial Neural Networks in Engineering (ANNIE '99). 1999.
40. Picard, R. W. Affective Computing. MIT Press, Massachusetts, 2nd edition, 1998. ISBN 0-262-16170-2.
41. Pieraccini, R., Levin, E., and Eckert, W. AMICA: The AT&T Mixed Initiative Conversational Architecture. In Proceedings of the Eurospeech '97, pages 1875–1878. Rhodes, Greece, 1997.
42. Reason, J. Human Error. Cambridge University Press, 1990. ISBN 0521314194.
43. Sadek, D. and de Mori, R. Dialogue Systems. In de Mori, R., editor, Spoken Dialogues with Computers, pages 523–562. Academic Press, 1998.
44. Schomaker, L., Nijtmanns, J., Camurri, C., Morasso, P., and Benoit, C. A Taxonomy of Multimodal Interaction in the Human Information Processing System. Report of ESPRIT III: Basic Research Project 8579, Multimodal Interface for Advanced Multimedia Interfaces (MIAMI). Technical report, 1995.
45. Schuller, B., Müller, R., Lang, M., and Rigoll, G. Speaker Independent Emotion Recognition by Early Fusion of Acoustic and Linguistic Features within Ensembles. In Proceedings of the ISCA Interspeech 2005. Lisboa, Portugal, 2005.
46. Schuller, B., Rigoll, G., and Lang, M. Hidden Markov Model-Based Speech Emotion Recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2003), volume II, pages 1–4. 2003.
47. Schuller, B., Rigoll, G., and Lang, M. Speech Emotion Recognition Combining Acoustic Features and Linguistic Information in a Hybrid Support Vector Machine - Belief Network Architecture. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2004), volume I, pages 577–580. Montreal, Quebec, 2004.
48. Schuller, B., Villar, R. J., Rigoll, G., and Lang, M. Meta-Classifiers in Acoustic and Linguistic Feature Fusion-Based Affect Recognition. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2005, volume 1, pages 325–329. Philadelphia, Pennsylvania, 2005.
49. Shneiderman, B. Designing the User Interface: Strategies for Effective Human-Computer Interaction (3rd ed.). Addison-Wesley Publishing, 1998. ISBN 0201694972.
50. Smith, W. and Hipp, D. Spoken Natural Language Dialog Systems: A Practical Approach. Oxford University Press, 1994. ISBN 0-19-509187-6.
51. Tian, Y., Kanade, T., and Cohn, J. Evaluation of Gabor-Wavelet-Based Facial Action Unit Recognition in Image Sequences of Increasing Complexity. In Proceedings of the Fifth IEEE International Conference on Automatic Face and Gesture Recognition, pages 229–234. May 2002.
52. Turk, M. and Pentland, A. Face Recognition Using Eigenfaces. In Proc. of Conference on Computer Vision and Pattern Recognition, pages 586–591. 1991.
53. van Zanten, G. V. User-modeling in Adaptive Dialogue Management. In Proceedings of the Eurospeech '99, pages 1183–1186. Budapest, Hungary, 1999.
54. Ververidis, D. and Kotropoulos, C. A State of the Art Review on Emotional Speech Databases. In Proceedings of the 1st Richmedia Conference, pages 109–119. Lausanne, Switzerland, 2003.
55. Witten, I. H. and Frank, E. Data Mining: Practical Machine Learning Tools with Java Implementations. Morgan Kaufmann, San Francisco, CA, 2000. ISBN 1-558-60552-5.
56. Wu, L., Oviatt, S., and Cohen, P. Multimodal Integration - A Statistical Review. 1(4), pages 334–341. 1999.
57. Young, S. Probabilistic Methods in Spoken Dialogue Systems. Philosophical Transactions of the Royal Society, 358:1389–1402, 2000.
Chapter 5
Person Recognition and Tracking
Michael Hähnel1, Holger Fillbrandt2
Automatic person recognition is the process of determining the identity of a person by using information about a group of subjects. In general, recognition can be categorized into authentication (or verification) and identification processes. For authentication, a person has to claim his identity, which is then verified using the previously stored information about this person (e.g., PIN, face image); this is called a 1:1 comparison. Identification is performed in those applications where an unknown person needs to be identified from a set of people without any claimed identity (1:n comparison). Authentication systems have made everyday life easier in many ways. With a small plastic card one can get cash at nearly any corner of a city or even directly pay for goods and services. Access to this source of financial convenience, however, must be protected against unauthorized use. Information exchange via internet and email has made it necessary to personalize information, which needs to be protected against unauthorized access as well. Passwords and PIN codes are needed to gain access to mail accounts or personalized web pages, and just like the world wide web, local computer networks are secured using passwords. Apart from security aspects, authentication also plays an essential role in man-machine interaction. To make applications adaptive to the context of use, it is essential to know the user, his whereabouts, and his activities. Person tracking in combination with person authentication must be applied to acquire suitable information for interactive as well as assisted systems. Identification is usually performed in surveillance applications, where the persons to be recognized do not actively participate in the recognition process. The methods discussed in this section can be used for identification as well as for authentication purposes; therefore, the term "recognition" can stand for either identification or authentication. Recognition methods can be put into three categories: token-based (keys, passport, cards), knowledge-based (passwords, PINs), and biometric (Fig. 5.1). Knowledge-based recognition is based on information which is unique and identifies a person when fed into the identification system. Token-based methods imply the possession of an identifying object like a key, a magnetic card, or a passport. The advantages of authentication systems following these approaches are their relatively simple implementation, their ease of use, and their widespread application. However, tokens can be lost or stolen, and passwords and PINs might be forgotten. An
1 Sect. 5.1, 5.2
2 Sect. 5.3
Access to a system is then no longer possible, or at least requires additional effort and causes higher costs.

Fig. 5.1. Biometric and non-biometric person recognition methods. Recognition (identification or authentication) can be token-based (keys, passport, clothing (full-body)), knowledge-based (PIN, password), or biometric, the latter ranging from intrusive to non-intrusive and relying on physiological features (retina, hand, fingerprint, face) or behavioral features (speech, signature, gait). In the following two sections face recognition and camera-based full-body recognition are discussed.
Hence, in recent years much attention has been paid to biometric systems, which are either based on physiological features (face, hand shape and hand vein pattern, fingerprint, retina pattern) or depend on behavioral patterns that are unique for each person to be recognized (gait, signature stroke, speech). These systems are mostly intrusive because a minimum of active participation of the person to be identified is required: one has to place a thumb on a fingerprint sensor or look at a camera or scanner so that the necessary biometric features can be extracted properly. This condition is acceptable in many applications. Non-intrusive recognition can be applied in surveillance tasks, where subjects are usually observed by cameras and are not expected to participate in the recognition process. A second advantage of non-intrusive over intrusive recognition is the increased comfort of use: if a person can be identified while heading towards a security door, putting a thumb on a fingerprint sensor or inserting a magnetic card into a slot becomes unnecessary. Today, identification of a person from a distance is possible only by gait recognition, and even this biometric method is still a research issue. Face recognition has the potential to fulfill this task as well, but further research is needed to develop
reliable systems that recognize faces at a glance. This section cannot address all aspects of recognition; instead, selected topics of special interest are discussed.

Face recognition has evolved into commercial systems which control access to restricted areas, e.g. in airports, high-security rooms, or even computers. Although far from working reliably under all real-world conditions, many approaches have been developed that at least guarantee safe and reliable operation under well-defined conditions (Sect. 5.1). A new non-biometric method for non-intrusive person recognition is based on images of the full body, in applications where it can be assumed that clothing is not changed. The large variety of colors and patterns makes clothing a valuable indicator that adds information to an identification system, even if fashion styles and seasonal changes may affect the assumption of "constant clothing". An approach to this token-based method is discussed in Sect. 5.2. Like full-body recognition, gait recognition depends on a proper detection and segmentation of persons in a scene, which is a main issue of video-based person tracking. Hence, person tracking can also be considered a preprocessing step for these recognition methods. Beyond that, determining the position of persons in an observed scene is a main task for surveillance applications in buildings, shopping malls, airports, and other public places and is hence discussed as well (Sect. 5.3).
5.1 Face Recognition

For humans, the face is the most discriminating part of the body for determining a person's identity. Automatic face recognition, however, is still far away from human performance in this task. As many factors influence the appearance of the face, a general approach to model, detect, and recognize faces has not been found even after 30 years of research. For many applications, however, it is not necessary to deal with all factors. For those applications, commercial systems are already available, but they are highly intrusive and require active participation in the recognition process. Sect. 5.1.1 will outline factors influencing the recognition rate which need to be considered when developing a face recognition system. The general structure of a face recognition system is outlined in Sect. 5.1.2. Known approaches to automatic face recognition are categorized in Sect. 5.1.3. Afterwards the popular "eigenface" approach (Sect. 5.1.4) and a component-based approach to face recognition (Sect. 5.1.5) are described in more detail. The section closes with a brief overview of the most common face databases that are publicly available.

5.1.1 Challenges in Automatic Face Recognition

The challenge of (non-intrusive) automatic face recognition is to decrease the influence of factors that reduce face recognition performance. Some factors may have lower priority (e.g. illumination changes when fairly constant lighting is available), others can be very important (e.g. recognition when only a single image per person is available for training), depending on the application and
environment. These factors can be categorized into human and technical factors (Fig. 5.2).
Fig. 5.2. Factors mostly addressed by the research community that affect face recognition performance and their categorization: technical factors (number of sample images, resolution, illumination) and human factors, subdivided into biological factors (aging) and behavioral factors (occlusion, head pose, facial expression). Additionally, fashion (behavioral) and twin faces (biological) must be mentioned.
Technical factors arise from the technology employed and the environment in which a recognition system is operated. Any image-based object recognition task has to deal with these factors. Their influence can often easily be reduced by choosing suitable hardware, such as lighting (for illumination) and cameras (for illumination and resolution).

1. Image resolution: The choice of image resolution is a trade-off between the desired processing performance (⇒ low resolution) and the separability of the features used for recognition (⇒ high resolution). The highest resolution that still meets the given time restrictions should be used.

2. One-sample problem: Many methods can extract good descriptions of a person's face from a set of images taken under different conditions, e.g. illuminations, facial expressions, and poses. However, several images of the same person are often not available. Sometimes only a few, maybe even only one, sample image can
be used to train a recognition system. In these cases, generic a-priori knowledge (e.g. head models, databases of face images of other persons) needs to be added to model possible variations.

3. Illumination: Light sources from different directions and with different intensities, as well as objects between a light source and the observed face, let the same face appear in many ways due to cast shadows and (over-)illuminated regions. Important facial features or even the whole face may no longer be detected, or the extracted features no longer describe the face properly, leading to a misclassification or a rejection of the face image. Hence, two approaches can be followed to solve these problems: (1) use illumination-invariant features and classifiers, or (2) model the illumination changes and preprocess the image according to this model. Such a model has to integrate a lot of a-priori knowledge taken from images of faces of varying identity and illumination, possibly even under different poses and showing several facial expressions. These images are often not available (one-sample problem).

Human factors influence the recognition process due to the special characteristics and properties of the face as the object to be recognized. They can be subdivided into behavioral and biological factors.

1. Behavioral factors: As the human head is not a steady and rigid object, it can undergo certain changes due to (non-)intentional movements and augmentation with objects such as glasses.

a) Facial expressions: Besides speech, facial expressions are the most important information source for communication through the face. However, the ability to change the appearance of the face poses a major problem for image processing and recognition methods. Dozens of facial muscles allow very subtle movements of the skin, so that wrinkles and bulges appear and important texture regions move or are distorted.

b) Head pose: Extensive changes in appearance are due to 3D head rotation. This leads not only to textural changes but also to geometrical changes, as the human head is not rotationally invariant. When the face is rotated, important regions of the (frontal) face are no longer visible. The more the head is rotated, the more one half of the face becomes invisible and valuable information for the recognition process is lost.

c) Occlusion: A reduction of exploitable information is often the consequence of occlusion by objects, e.g. scarfs, glasses, or caps. Either the mouth region or even the more important eye region is covered in part and no longer allows the extraction of features. Occlusion can, however, be handled by local approaches to face recognition (Sect. 5.1.3). The variation of occlusion is infinite because the appearance of the occluding objects (e.g. glasses, newspapers, caps) is usually not known.

d) Fashionable appearance changes: Intentional changes of appearance due to, e.g., make-up, mustaches, or beards can complicate feature detection and
extraction if these textural changes are not considered by the detection algorithms.

2. Biological factors

a) Aging: Changes of facial appearance due to aging and their modeling are not yet well studied. Major changes of the skin surface and slight geometrical changes (over longer time spans) cause problems in the recognition process due to strong differences in the extracted features.

b) The twin problem: The natural "copy" of a face to be identified is the twin's face. Just as humans have difficulties distinguishing between two unknown twins, so do automatic systems. Humans are, however, able to recognize twins if they know them well.

In contrast to a non-intrusive face recognition system that has to deal with most of the above mentioned factors, an intrusive system can neglect many of them. If a so-called "friendly user" can be asked either to step into an area with controlled illumination or to look straight into the camera, it is unnecessary to handle strong variations in, e.g., illumination and pose. Operational conditions can often reduce the need for more sophisticated algorithms by setting up suitable hardware environments. However, friendly users usually cannot be found in surveillance tasks.

5.1.2 Structure of Face Recognition Systems

This section discusses the basic processing steps of a face recognition system (Fig. 5.3). Referring to chapter 2.1, the coarse structure of an image-based object recognition system consists of an (image) acquisition, a feature extraction, and a feature classification stage. These stages can be broken up into several steps: while acquisition, suitable preprocessing, face detection, and facial feature extraction are described in more detail in chapter 2.2, feature extraction, feature transformation, classification, and combination will be discussed in the following.

Acquisition

During acquisition the camera takes snapshots or video streams of a scene containing one or more faces. Already at this stage it should be considered where to set up a camera. If possible, one should use more than one camera to obtain several views of the same person, either to combine recognition results or to build a face model that is used to handle different influencing factors like pose or illumination. Besides positioning the camera, the placement of suitable lighting should be considered as well. If a constant illumination situation can be provided in the enrollment phase (training) as well as in the test phase (classification), less effort needs to be put into modeling illumination changes and the accuracy of the system will increase. If possible, camera parameters like shutter speed, amplification, and white balance should be adaptively controlled by a suitable
algorithm that evaluates the current illumination situation and adapts the parameters accordingly. For a more detailed discussion of this issue, please refer to Sect. 2.2.

Fig. 5.3. General structure of a face recognition system: the acquisition, feature extraction, and feature classification stages are broken up into acquisition, preprocessing, face detection, facial feature detection, feature extraction, feature transformation, classification, and combination, which yield the recognition result.

Preprocessing

In the preprocessing stage the images are usually prepared on a low-level basis to be further processed in later stages on a high-level basis. Preprocessing steps include, e.g., (geometrical) image normalization (Sect. 5.1.4), suitable enhancement of brightness and contrast, e.g. by brightness planes or histogram equalization [45] [52], or image filtering. Preprocessing always depends strongly on the subsequent processing step. For the face detection stage, it might be useful to determine face candidates, i.e. image regions that might contain a face, on a low-level basis. This can, e.g., be achieved using skin color histograms (Sect. 2.1.2.1). For the facial feature extraction
step, a proper gray-level or color normalization is needed (Sect. 2.2).

Face Detection

Face detection is a wide field of research and many different approaches have been developed with creditable results [5] [58] [50]. However, it is difficult to get these methods to work under real-world conditions. As all face detection algorithms are controlled by several parameters, generally suitable parameter settings can usually not be found. Additional image preprocessing is needed to obtain a working system that performs fast (maybe even in real time) and reliably, even under different kinds of influencing factors such as resolution, pose, and illumination (Sect. 5.1.1). The result of most face detection algorithms are rectangles containing the face area. Usually a good estimate of the face size, which is needed for normalization and facial feature detection, can be derived from these rectangles.

Facial Feature Detection

This processing step is only necessary for recognition methods using local information (Sect. 5.1.3). Normalizing the size of images is also often done by making use of the distance between the eyes, which needs to be detected for that purpose (Sect. 5.1.4). A holistic approach which performs feature detection quite well are the Active Appearance Models (AAM) [13]. With AAMs, however, a compromise between precision and independence from, e.g., illumination has to be accepted. For further details, please refer to Sect. 2.2.

Feature Extraction

Face recognition research usually concentrates on the development of feature extraction algorithms. Besides proper preprocessing, feature extraction is the most important step in face recognition because it builds a description of a face which must be sufficient to separate the face images of one person from the images of all other subjects. Substantial surveys about face recognition are available, e.g. [60]. Therefore, we refrain from giving an overview of face recognition methods and do not discuss all methods in detail, because much literature can be found on the most promising approaches. Some feature extraction methods are, however, briefly outlined in Sect. 5.1.5 for face recognition based on local regions. Those methods were originally applied globally, i.e. on the whole face image.

Feature Transformation

Feature extraction and feature transformation are very often combined in one single step. The term feature transformation, however, describes a set of mathematical methods that try to decrease the within-class distance while increasing the between-class distance of the features extracted from the image. Procedures like PCA (principal component analysis, Sect. 5.1.4) transform the feature space to a more
suitable representation that can be better processed by a classifier.

Classification

The result of a recognition process is determined in the classification stage, in which a classification method compares a given feature vector to the previously stored training data. Simple classifiers like the k-nearest-neighbor classifier (Sect. 5.1.4) are often surprisingly powerful, especially when only one sample per class is available. More sophisticated methods like neural networks [6] or Support Vector Machines [49] can determine more complex, non-linear feature space subdivisions (class borders), which are often needed if more than one sample per class was trained.

Combination

A hybrid classification system can apply different combination strategies, which in general can be categorized into three groups:

• Voting: The overall result is given by the ID which was rated best by the majority of all single classifiers.
• Ranking: If the classifiers can build a ranking for each test image, the ranks for each class are added. The winner is the class with the lowest sum.
• Scoring: If the rankings are annotated with a score that integrates information about how similar a test image is to the closest trained image of each class, more sophisticated combination methods can be applied that combine these scores into an overall score. The winner class is the one with the best score. Recent findings can be found in [33].
The input to a combination algorithm can be the results of classifications of different facial parts or the results of different types of recognition algorithms and classifiers (e.g. local and hybrid recognition methods, Sect. 5.1.3). It was argued in [39] that, e.g., PCA-based methods (global classifiers) should be combined with methods extracting local information. PCA-based methods are robust against noise because they estimate low-order modes with large eigenvalues, while local classifiers better describe high-order modes that amplify noise but are more discriminant concerning different persons. Fig. 5.4 shows some Eigenfaces (extracted by a PCA on gray-level images) ordered by their eigenvalues. The upper Eigenfaces (high eigenvalues) seem to code only basic structure, e.g. the general shape of a face, while in the Eigenfaces in the lower part of Fig. 5.4 structures can be identified that are more specific to individual persons. The results of the two methods (global and local) penalize each other while amplifying the correct result. In general, a higher recognition rate can be expected when different kinds of face recognition algorithms are combined [60].
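To make the voting and scoring strategies above more concrete, the following Python/NumPy sketch shows one possible way to combine the outputs of several single classifiers. It is only an illustration, not the implementation used in this chapter; the function names and the convention that larger scores mean higher similarity are assumptions made for the example.

```python
import numpy as np

def combine_by_voting(predicted_ids):
    """Majority voting: each single classifier contributes one predicted class ID."""
    ids, counts = np.unique(predicted_ids, return_counts=True)
    return ids[np.argmax(counts)]

def combine_by_scoring(score_lists):
    """Score fusion: score_lists[c][k] is the similarity reported by classifier c for
    class k (larger means more similar); the class with the best summed score wins."""
    total = np.sum(np.asarray(score_lists, dtype=float), axis=0)
    return int(np.argmax(total))

# Example: three component classifiers vote for person IDs 7, 7 and 3.
print(combine_by_voting([7, 7, 3]))          # -> 7
print(combine_by_scoring([[0.2, 0.9, 0.1],   # classifier 1, classes 0..2
                          [0.3, 0.6, 0.4],   # classifier 2
                          [0.5, 0.7, 0.2]])) # classifier 3 -> class 1
```

Ranking-based combination would proceed analogously, summing the rank positions per class instead of the raw scores.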
Fig. 5.4. Visualization of a mean face (upper left corner) and 15 Eigenfaces. The corresponding eigenvalues decrease from left to right and top to bottom (from [47]). Eigenfaces with high eigenvalues only code general appearance information like the face outline and global illumination (upper row).
5.1.3 Categorization of Face Recognition Algorithms

Face recognition algorithms can be categorized into four groups:
• Geometry-based methods depend only on the geometrical relations of different facial regions. These methods were developed in the early stages of face recognition research [27]. However, it has been shown that geometrical features are not sufficient when used exclusively [8] [21]. Today these kinds of methods are not relevant anymore. Experiments by Craw et al. [14] showed that shape-free representations are even more suitable than representations including geometrical information.
• Global or holistic methods use the whole image information of the face at once. Here geometrical information is not used at all, or only indirectly through the distribution of intensity values (e.g. eyes in different images). The most prominent representative of this class of algorithms is the eigenface approach [48], which will be discussed in Sect. 5.1.4.
• Local feature (analysis) methods only consider texture information of regions in the face without directly integrating their geometrical information, e.g. [39] [51]. The local regions contain just a few pixels but are distinctive for the specific face
image. These methods include an algorithm to detect distinctive regions, which is applied (a) during enrollment (training) to gain a proper description of the face and (b) during testing to find corresponding regions in the test image that are similar to the regions described in the training stage.
• Hybrid algorithms combine geometrical and appearance information to recognize a face. These methods usually incorporate some kind of local information, in the sense that they use appearance information from facial regions with a high entropy with respect to face recognition (e.g. the eye region), while disregarding regions with poor informational content, i.e. homogeneous textures (e.g. forehead, cheeks). Additionally, geometrical information about the relation of the locations where features were extracted is incorporated in the classification process. The use of local information reduces the influence of illumination changes and facial expressions and can handle occlusion if detected. Most modern recognition algorithms like Elastic Bunch Graph Matching [28] [55] and component-based face recognition [26] follow the hybrid approach.
5.1.4 Global Face Recognition using Eigenfaces

The eigenface approach [44] [48] is one of the earliest methods of face recognition that uses appearance information. It was originally applied globally and, though it lacks stability and good performance under general conditions, it is still a state-of-the-art method which is often used as a reference for new classification methods. The following sections describe the eigenface approach along the basic structure of face recognition algorithms discussed in Sect. 5.1.2.

Preprocessing

The eigenface approach performs badly if the training and input images do not undergo a preprocessing that normalizes them such that only the usable information in the face area of the image is kept. For this normalization step, the face and the positions of different facial features must be detected as described in Sect. 2.2. The coordinates of the eye centers and the nose tip are sufficient for a simple yet often applicable normalization (Fig. 5.5). In a first step, the image must be rotated by an angle α to make sure the eye centers have the same y-coordinate (Fig. 5.5b). The angle α is given by:

$$\alpha = \arctan\left(\frac{y_{\mathrm{lefteye}} - y_{\mathrm{righteye}}}{x_{\mathrm{lefteye}} - x_{\mathrm{righteye}}}\right)$$

where $(x_{\mathrm{lefteye}}, y_{\mathrm{lefteye}})$ and $(x_{\mathrm{righteye}}, y_{\mathrm{righteye}})$ are the coordinates of the left and right eye, respectively. Then the image is scaled in the x and y directions independently (Fig. 5.5c). The scaling factors are determined by:
$$\mathrm{scale}_x = \frac{x_{\mathrm{lefteye}} - x_{\mathrm{righteye}}}{d_{\mathrm{eyes}}}, \qquad \mathrm{scale}_y = \frac{y_{\mathrm{eye}} - y_{\mathrm{nose}}}{d_{\mathrm{nose}}}$$

where $x_{\mathrm{lefteye}}$ and $x_{\mathrm{righteye}}$ are the x-coordinates of the eyes and $y_{\mathrm{eye}}$ ($y_{\mathrm{nose}}$) is the y-coordinate of the eyes (nose tip) after the rotation step. The distance between the eyes $d_{\mathrm{eyes}}$ and the nose length $d_{\mathrm{nose}}$ must be constant for all considered images and must be chosen a-priori. Cropping the image, shifting the eyes to a previously determined position, and masking out the border regions of the face image (using a mask like in Fig. 5.5d) leads to an image that contains only the appearance information of the face (Fig. 5.5e). All training and test images must be normalized to yield images that have corresponding appearance information at the same positions.

Fig. 5.5. Face image normalization for recognition using Eigenfaces: rotate, scale, then shift, crop & mask. (a) original image, (b) rotated image (eyes lie on a horizontal line), (c) scaled image (eyes have a constant distance $d_{\mathrm{eyes}}$; the nose has a constant length $d_{\mathrm{nose}}$), (d) a mask with previously defined coordinates for the eyes and the nose tip to which the image is shifted, (e) normalized image.
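As an illustration of the geometric part of this normalization, the following Python/NumPy sketch computes the rotation angle and the scaling factors from given eye and nose coordinates. It is only a sketch: the actual cropping, shifting, and masking (as performed, e.g., by the IMPRESARIO macros of Sect. 5.1.7) is omitted, and the function name, the default values of $d_{\mathrm{eyes}}$ and $d_{\mathrm{nose}}$, and the example coordinates are assumptions chosen for the example.

```python
import numpy as np

def normalization_parameters(left_eye, right_eye, nose_tip, d_eyes=40.0, d_nose=40.0):
    """Rotation angle and scaling factors for the face normalization of Fig. 5.5.

    left_eye, right_eye, nose_tip: (x, y) coordinates detected in the input image.
    d_eyes, d_nose: desired eye distance and nose length, chosen a-priori.
    """
    lx, ly = left_eye
    rx, ry = right_eye
    # rotate so that both eye centers share the same y-coordinate
    # (arctan2 is used instead of arctan for numerical robustness)
    alpha = np.arctan2(ly - ry, lx - rx)

    # feature coordinates after rotating by -alpha around the origin
    c, s = np.cos(-alpha), np.sin(-alpha)
    rot = np.array([[c, -s], [s, c]])
    l_rot, r_rot, n_rot = (rot @ np.asarray(p, dtype=float)
                           for p in (left_eye, right_eye, nose_tip))

    # scaling factors as defined above (current distance over desired constant distance);
    # resizing the rotated image by 1/scale_x in x and 1/scale_y in y then yields an
    # eye distance of d_eyes and a nose length of d_nose
    scale_x = abs(l_rot[0] - r_rot[0]) / d_eyes
    scale_y = abs(l_rot[1] - n_rot[1]) / d_nose
    return alpha, scale_x, scale_y

# Example with hypothetical feature coordinates:
alpha, sx, sy = normalization_parameters((70, 60), (30, 55), (52, 95))
print(np.degrees(alpha), sx, sy)
```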
The "face space"

The eigenface approach considers an image as a point in a high-dimensional data space. As faces in general have a similar appearance, it is assumed that face images
are not uniformly distributed over the whole image space but are concentrated in a lower dimensional “face space” (Fig. 5.6).
Fig. 5.6. Simplified illustration of an image space. The face images are located in a sub-space of the whole image space and form the face space.
The basis of this face space can be used to code the images by their coordinates. This coding is a unique representation of each image. Assuming that images of the same person are grouped together allows applying a nearest-neighbor classifier (Sect. 5.1.4) or more sophisticated classification methods.

Feature Extraction and Transformation

To obtain the basis of the face space, a set of $N$ training images $I_i(x,y)$ of width $W$ and height $H$, $i = 1, \ldots, N$, is needed. Each of the 2D images must be sampled row-wise into a one-dimensional vector:

$$I_i(x,y) \longrightarrow J_i \qquad (5.1)$$

All $J_i$ have the dimension $j = W \cdot H$. This is the simplest feature extraction possible: just the grey levels of the image pixels are considered. Modified approaches preprocess the images, e.g. by filtering with Gabor functions³ [10].

³ A Gabor function is given by $g(\mathbf{x}) = \cos(\mathbf{k}\mathbf{x}) \cdot e^{-\frac{\mathbf{k}^2 \mathbf{x}^2}{2\sigma^2}}$. This is basically a cosine wave combined with an enveloping exponential function limiting the support of the cosine wave. $\mathbf{k}$ determines the direction ($\mathbf{k}/|\mathbf{k}|$) and the frequency ($|\mathbf{k}|$) of the wave in the two-dimensional space.
Now applying the principal component analysis (PCA or Karhunen-Loève expansion) to the $J_i$ will lead to a set of orthogonal Eigenvectors which constitute the basis of the images. First, the mean of the feature vectors (here the $J_i$) must be calculated:

$$\Psi = \frac{1}{N} \sum_i J_i \qquad (5.2)$$

This mean value is then used to determine the covariance matrix $C$:

$$C = \frac{1}{N} \sum_i (J_i - \Psi)(J_i - \Psi)^T \qquad (5.3)$$

Applying PCA to $C$ will yield a set of Eigenvectors $e_k$. The corresponding eigenvalue $\lambda_k$ of each $e_k$ can be obtained by:

$$\lambda_k = \frac{1}{N} \sum_i \left( e_k^T (J_i - \Psi) \right)^2 \qquad (5.4)$$

Considering only a subset of these Eigenvectors yields the lower-dimensional basis of the face space. Which Eigenvectors to consider depends on the value of the eigenvalues: the larger the eigenvalue, the larger the variance of the data in the direction of the corresponding Eigenvector. Therefore only the Eigenvectors with the largest eigenvalues are used, because smaller eigenvalues correspond to dimensions mainly describing noise. Fig. 5.4 (see above) shows a visualization of a set of these Eigenvectors. Because they have a face-like appearance, they were called "eigenpictures" [44] or Eigenfaces [48]. Belhumeur [4] suggested discarding some of the most significant Eigenfaces (high eigenvalues) because they mainly seem to code variations in illumination but no significant differences between subjects, which are coded by Eigenfaces with smaller eigenvalues. This leads to some stability, at least when global illumination changes appear.

There are basically two strategies for determining the number of Eigenvectors to be kept: (1) simply define a constant number $\tilde{N}$ of Eigenvectors ($\tilde{N} \leq N$)⁴, or (2) define a minimum amount of information to be kept by determining an information quality measure $Q$:

$$Q = \frac{\sum_{k=1}^{\tilde{N}} \lambda_k}{\sum_{k=1}^{N} \lambda_k}$$

Eigenvalues are added to the numerator until $Q$ is larger than a previously determined threshold $Q_{\mathrm{threshold}}$; hereby the number $\tilde{N}$ of dimensions is determined. Commonly used thresholds are $Q_{\mathrm{threshold}} = 0.95$ or $Q_{\mathrm{threshold}} = 0.98$.

⁴ Turk and Pentland claim that about 40 Eigenfaces are sufficient [48]. This, however, strongly depends on the variations (e.g. illumination, pose) in the images to be trained. The more variation is expected, the more Eigenfaces should be kept. The optimal number of Eigenfaces can only be determined heuristically.
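As a concrete illustration of Eqs. (5.2)–(5.4) and the Q-based dimension selection, the following NumPy sketch computes an eigenface basis from a matrix of vectorized training images. It is only an illustrative sketch, not the IMPRESARIO implementation of Sect. 5.1.7; the function name and the default threshold are chosen for the example. Note also that for realistic image sizes the full covariance matrix becomes very large, so practical implementations usually compute the Eigenvectors via the much smaller matrix of pairwise image products or via an SVD.

```python
import numpy as np

def eigenface_basis(J, q_threshold=0.95):
    """Eigenface basis from a matrix J of shape (N, W*H), one vectorized image per row.

    Returns the mean vector Psi, the kept Eigenvectors (one per row) and their eigenvalues.
    """
    N = J.shape[0]
    psi = J.mean(axis=0)                      # Eq. (5.2)
    A = J - psi
    C = (A.T @ A) / N                         # Eq. (5.3): covariance matrix
    lam, E = np.linalg.eigh(C)                # eigenvalues ascending, Eigenvectors as columns
    lam, E = lam[::-1], E[:, ::-1]            # reorder by decreasing eigenvalue

    # Q-based dimension selection: keep the smallest number of Eigenvectors whose
    # eigenvalues reach the fraction q_threshold of the total variance
    Q = np.cumsum(lam) / lam.sum()
    n_keep = int(np.searchsorted(Q, q_threshold)) + 1
    return psi, E[:, :n_keep].T, lam[:n_keep]
```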
The choice of $Q_{\mathrm{threshold}}$ is often determined by finding a compromise between processing performance (low $Q_{\mathrm{threshold}}$) and recognition rate (high $Q_{\mathrm{threshold}}$).

After calculating the PCA and performing the dimension reduction, we obtain a basis $B = \{e_1, \ldots, e_{\tilde{N}}\}$ of Eigenvectors and corresponding eigenvalues $B_\lambda = \{\lambda_1, \ldots, \lambda_{\tilde{N}}\}$. A new image $\tilde{I}(x,y) \longrightarrow \tilde{J}$ is then projected into the face space determined by the basis $B$. The projection onto each dimension of the face space basis is given by:

$$w_k = e_k^T (\tilde{J} - \Psi) \qquad (5.5)$$

for $k = 1, \ldots, \tilde{N}$. Each $w_k$ describes the contribution of eigenface $e_k$ to the currently examined image $\tilde{J}$. The vector $W = \{w_1, \ldots, w_{\tilde{N}}\}$ is hence a representation of the image and can be used to distinguish it from other images. The application of the PCA and the subsequent determination of the $w_k$ can be regarded as a feature transformation step which transforms the original feature vector (i.e. the grey-level values) into a differently coded feature vector containing the $w_k$.

Classification

A quite simple yet often very effective algorithm is the k-nearest-neighbor classifier (kNN), especially when only a few feature vectors per class are available. In this case, training routines of complex classification algorithms do not converge or do not achieve better recognition results (than a kNN) but are computationally more expensive. Introductions to more complex classification methods like Support Vector Machines or neural networks can be found in the literature [49] [16].

The k-nearest-neighbor classifier determines the $k$ trained feature vectors which lie closest to the examined feature vector. The winner class is determined by voting, e.g. by selecting the class to which most of the $k$ feature vectors belong. The property "closest" is defined by a distance measure that needs to be chosen depending on the structure of the feature vector distribution, i.e. the application. Training in the sense of calculating class borders or weights, as with neural networks, is not performed here; it basically just consists of storing the feature vectors and their corresponding class IDs⁵. If $W = (w_i)$ is the feature vector to be classified and $T_i = (t_i)$, $i = 1, \ldots, N$, are the trained (stored) vectors, commonly used distance measures $d$ are:

1. Angle: $d_{\mathrm{angle}} = \frac{W \cdot T_i}{|W| \cdot |T_i|} = \cos(\alpha)$
2. Manhattan distance ($L_1$-norm): $d_{\mathrm{Manh.}} = \sum |w_i - t_i|$
3. Euclidean distance ($L_2$-norm): $d_{\mathrm{Eukl.}} = \left( \sum |w_i - t_i|^2 \right)^{1/2}$
4. Mahalanobis distance: $d_{\mathrm{Mahal.}} = \left( (W - T_i)^T C^{-1} (W - T_i) \right)^{1/2}$, where $C$ is the covariance matrix of the $T_i$

⁵ The term "class ID" is usually used for a unique number that acts as a label for each class (person) to be identified.
Fig. 5.7 exemplifies the functionality of a kNN classifier using the Euclidean distance. The trained vectors are drawn as points in the two-dimensional feature space, while the vector to be classified is marked with a cross. For each of the training samples the corresponding class is stored (only shown for classes A, B, and C). The training sample that is closest to the examined feature vector determines the class to which this vector is assigned (here class "A", as $d_A$ is the smallest distance). Usually $k = 1$ is used because an exact class assignment is needed in object recognition tasks. The class ID is then determined by:

$$\text{class ID} = \operatorname{argmin}_p \, d(W, \{T_i\}_p)$$

where $\{T_i\}_p$ denotes the trained feature vectors of class $p$. For $k > 1$ an ordered list is built containing the class IDs of the $k$ closest feature vectors in increasing order of distance: class IDs $= \{a, b, \ldots\}$ with $d(W, T_a) \leq d(W, T_b) \leq \cdots$ and $|\text{class IDs}| = k$. From these $k$ results the winner class must be determined. Usually the voting scheme is applied; however, other combination strategies can be used as well (Sect. 5.1.2).
Fig. 5.7. Two-dimensional feature space. A 1NN classifier based on the Euclidean distance measure will return the point A as it is the closest to the point to be classified (marked with a cross). Class borders between A, B, and C are marked with dotted lines.
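The projection of Eq. (5.5) and the 1-nearest-neighbor classification with the Euclidean distance can be sketched as follows, given a mean vector and eigenface basis computed as in the previous sketch. This is an illustrative example, not the kNNClassifier macro of Sect. 5.1.7; the function names are assumptions.

```python
import numpy as np

def project_to_face_space(psi, eigenfaces, image_vector):
    """Eq. (5.5): coordinates w_k of a vectorized, normalized face image."""
    return eigenfaces @ (image_vector - psi)

def classify_1nn(psi, eigenfaces, train_vectors, train_ids, test_vector):
    """1-nearest-neighbor classification with the Euclidean distance in face space."""
    gallery = np.array([project_to_face_space(psi, eigenfaces, t) for t in train_vectors])
    w = project_to_face_space(psi, eigenfaces, test_vector)
    distances = np.linalg.norm(gallery - w, axis=1)
    return train_ids[int(np.argmin(distances))]
```

For $k > 1$, the IDs of the $k$ smallest distances would be collected and combined, e.g. by the voting scheme of Sect. 5.1.2.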
5.1.5 Local Face Recognition based on Face Components or Template Matching

Psychological experiments showed that human perception of a face is determined by face parts and not by the face as a whole [43]. Additionally, the good results of spatially restricted feature extraction have been demonstrated with different face recognition approaches (e.g. [26] [30]).
Besides modelling possible facial expressions, a common approach to achieving facial expression invariance is to ignore the face regions that are likely to change with different facial expressions and that contain no significant information usable for recognition (e.g. cheek, forehead, lips). Therefore, only certain face regions (face components) are considered. Experiments showed that especially the eye region contains information that codes the identity of a person [1]. Other facial regions can also be used for recognition; however, their influence on the overall recognition result should be limited by weighting the different facial regions according to their importance [1] [30]. It has further been argued that component-based approaches can cope better with slight rotations than holistic approaches, because variations within a component are much smaller than the overall variation of an image of the whole face. The reason is that the positions of the components can be adapted individually to the rotated face image, whereas in the whole face image the face regions can be regarded as rigidly connected and hence cannot be adapted.
Fig. 5.8. Overview of a component-based face recognition system: contour detection, determination of the component positions, and component extraction are followed, for each component, by a separate feature extraction (F.E.) and classification step (class.); a final combination yields the classification result.
Fig. 5.8 shows an overview of a component-based face recognition system. In the input image, face contours must be detected. From distinctive contour points the position of the components can be determined and the image information of the components can be extracted. For each component, feature extraction and classification is
performed separately. The single classification results are finally combined, leading to an overall classification result.

Component Extraction

As a preprocessing step for component-based face recognition, facial regions must be detected. Suitable approaches for the detection of distinctive facial points and contours were already discussed in chapter 2.2. As soon as facial contours are detected, facial components can be extracted by determining their positions according to distinctive points in the face (Fig. 5.9).
Fig. 5.9. Face component extraction. Starting from a mesh/shape (thin white line in (a)) that is adapted to the input image, the centers of the facial components (filled white dots) can be determined (b), and with pre-defined sizes the facial components can be extracted (white lines in (c)). The image was taken from the FERET database [41].
The centers $c_j(x,y)$ of the facial components are either determined by a single contour point, i.e. $c_j(x,y) = p_k(x,y)$ (e.g. for the corners of the mouth or the nose root), or are calculated from the points of a certain facial contour (e.g. the center of the points of the eye or nose contour) by:

$$c_j(x,y) = \frac{1}{\tilde{n}} \sum_{i \in C} p_i(x,y)$$

where $C$ is a contour containing the $\tilde{n}$ distinctive points $p_i(x,y)$ of, e.g., the eyes. Heisele et al. determined 10 face components to be the most distinctive (Fig. 5.10). Especially the eye and eyebrow components proved to contain the most important information [26].
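A minimal sketch of this component localization step might look as follows; the fixed component size and the contour representation are assumptions made for the example and do not correspond to the ComponentExtractor macro of Sect. 5.1.7.

```python
import numpy as np

def component_center(contour_points):
    """Center of a facial component as the mean of its contour points (equation above)."""
    return np.mean(np.asarray(contour_points, dtype=float), axis=0)

def extract_component(image, center, size=(32, 24)):
    """Crop a fixed-size component patch around a given (x, y) center (bounds not checked)."""
    w, h = size
    x0 = int(round(center[0] - w / 2))
    y0 = int(round(center[1] - h / 2))
    return image[y0:y0 + h, x0:x0 + w]

# Example: center of an eye contour given by four hypothetical points
eye_center = component_center([(40, 60), (52, 58), (64, 60), (52, 64)])
```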
Fig. 5.10. 10 most distinctive face components determined by [26].
Feature Extraction and Classification

In the feature extraction stage, any suitable feature vector can be determined for each component independently. However, it must be considered that the face components usually have only a fraction of the size of a face image. Hence, not all features are suitable for describing face components. Texture descriptors, for example, generally need image information of a certain size so that the texture can be described properly.

Features Based on Grey-Level Information (Template Matching)

Grey-level information can be used in two ways for object recognition purposes: (1) the image is transformed row-wise into a feature vector and the trained vector having the smallest distance to the test vector is determined, or (2) a template matching based on a similarity measure is applied. Given an image $I(x,y)$ and a template $T(x,y)$⁶, the similarity can be calculated by:

$$S(u,v) = \frac{1}{\sum_{(i,j) \in V} \left[ I(i+u, j+v) - T(i,j) \right]^2 + 1}$$

where $V$ is the set of all coordinate pairs of the template. The template with the highest similarity score $S_{\max}(u,v)$ determines the classified person.

⁶ A template is an image pattern showing the object to be classified. A face component can be regarded as a template as well.
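A direct transcription of this similarity measure into NumPy might look as follows; it is a sketch for illustration (a brute-force search over all offsets), not an optimized implementation, and the function names are assumptions.

```python
import numpy as np

def template_similarity(image, template, u, v):
    """Similarity S(u, v) between a template and the image patch at offset (u, v)."""
    h, w = template.shape
    patch = image[v:v + h, u:u + w].astype(float)
    ssd = np.sum((patch - template.astype(float)) ** 2)   # sum of squared differences
    return 1.0 / (ssd + 1.0)

def best_match(image, template):
    """Brute-force search for the offset with the highest similarity score S_max."""
    h, w = template.shape
    H, W = image.shape
    scores = [((u, v), template_similarity(image, template, u, v))
              for v in range(H - h + 1) for u in range(W - w + 1)]
    return max(scores, key=lambda s: s[1])                # ((u, v), S_max)
```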
Eigenfeatures (local Eigenfaces)

An extension of the grey-level features is the locally applied eigenface approach (Sect. 5.1.4). With $I_{FC_j}(x,y) \longrightarrow J_{FC_j}$ being the image information of face component $FC_j$, the mean

$$\Psi_{FC_j} = \frac{1}{N_{FC_j}} \sum_i J_{FC_j,i}$$

and the covariance matrix of the face component

$$C_{FC_j} = \frac{1}{N_{FC_j}} \sum_i (J_{FC_j,i} - \Psi_{FC_j})(J_{FC_j,i} - \Psi_{FC_j})^T$$

can be determined ($i$ being the index of the training images and $j$ being the index of the facial component, here $j \in \{1, \ldots, 10\}$). The Eigenvectors of $C_{FC_j}$ are often called "eigenfeatures", referring to the term "Eigenfaces", as they span a sub-space for specific facial features [40]. The classification stage provides a classification result for each face component. This is done as with any holistic method by determining a distance measure (Sect. 5.1.4).

Fourier coefficients for image coding

The 2D-DFT (Discrete Fourier Transform) of an image $I(x,y)$ is defined as:

$$g(m,n) = \sum_x \sum_y I(x,y) \cdot e^{j 2\pi \left( \frac{x \cdot m}{M} + \frac{y \cdot n}{N} \right)}$$

where $M$ and $N$ are the numbers of spectral coefficients in the x- and y-direction, respectively.
Fig. 5.11. Face image and a typical 2D-DFT (magnitude).
Fig. 5.11 shows a face image and the magnitude of a typical DFT spectrum in polar coordinates. Note that in face images mainly the coefficients representing low frequencies (close to the center) are significant. Therefore, only the $r$ most relevant Fourier coefficients $g(m,n)$ are determined and combined into an $r$-dimensional feature vector which is classified using, e.g., the Euclidean distance.
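The following sketch shows one possible way to build such a feature vector with NumPy's FFT; the value of r and the selection of the coefficients closest to the spectrum center are assumptions chosen for the example.

```python
import numpy as np

def dft_features(component_image, r=32):
    """Feature vector of the r lowest-frequency 2D-DFT magnitudes of a component image."""
    g = np.fft.fft2(component_image.astype(float))
    g = np.fft.fftshift(g)                    # move low frequencies to the center
    M, N = g.shape
    # frequency magnitude of every coefficient, measured from the spectrum center
    m, n = np.meshgrid(np.arange(M) - M // 2, np.arange(N) - N // 2, indexing="ij")
    order = np.argsort((m ** 2 + n ** 2).ravel())
    return np.abs(g).ravel()[order][:r]       # keep the r most central coefficients
```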
Discussion

To show the distinctiveness of facial components, we ran experiments with different combinations of facial components (Fig. 5.12) on a subset of the FERET database (Sect. 5.1.6). Only grey-level information was used as a feature, together with a 1NN classifier. Frontal images of a set of 200 persons were manually annotated with distinctive points/shapes. One image of each person was trained; another one, showing slight facial expression changes, was tested.
Fig. 5.12. Test sets: combinations of face components.
In table 5.1 the recognition rate of each single component is given. The results show that the eye region (eye and eye brow components) is the most discriminating part of the face. Just with a single eye component a recognition rate of about 86% could be achieved. The lower face part (mouth, nose) performs worse, with rates of up to only 50%.

Table 5.1. Classification rates of different face components (Fig. 5.10)

component             recognition rate
left eye                   86,9%
right eye                  85,4%
left eye brow              88,9%
right eye brow             85,4%
left temple                73,9%
right temple               68,3%
nose root                  76,4%
nose tip                   49,7%
left mouth corner          43,2%
right mouth corner         41,7%
The results of the combined tests are shown in table 5.2. The combination of all face components increased the recognition rate to 97%. Even though the mouth components alone do not perform very well, they still add valuable information to the classification, as the recognition rate drops by 1,5% when the mouth components are not considered (test set (b)). Further reduction (test set (c)) shows, however, that the nose tip could not add valuable information in our experiments; it actually worsened the result. A recognition rate of about 90% can be achieved under strong occlusion, simulated by considering only four components of the right face half. This shows that occlusion can be compensated reasonably well by the component-based approach.

Table 5.2. Classification rates of different combinations of face components (Fig. 5.12)

test set                                 recognition rate
(a) 10 components                             97,0%
(b) 8 components (no mouth)                   95,5%
(c) 7 components (eye region only)            96,5%
(d) 4 components (right face half)            90,1%
5.1.6 Face Databases for Development and Evaluation

To evaluate a face recognition algorithm it is essential to have an extensive image database containing images of faces under different conditions. Many databases are publicly available; some contain about 1000 persons, some only a few persons but with images under a large variety of conditions (e.g. pose, illumination).

Table 5.3. A selection of publicly available face databases for evaluation.

Color FERET            http://www.itl.nist.gov/iad/humanid/colorferet/home.html
AR Face Database       http://rvl1.ecn.purdue.edu/ aleix/ar.html
Yale Face Database     http://cvc.yale.edu/projects/yalefaces/yalefaces.html
Yale Face Database B   http://cvc.yale.edu/projects/yalefacesB/yalefacesB.html
The Color FERET Face Database [41] is the biggest database in terms of the number of different individuals. It contains images of 994 persons, including frontal images with a neutral and a randomly different facial expression (usually a smile) for all persons. For about 990 persons there are images of profiles, half profiles, and quarter profiles. Images where the head was turned more or less randomly are also included. The ground-truth information provides the year of birth (1924-1986), gender, race, and whether the person was wearing glasses, a beard, or a mustache. For the frontal images, the eye, mouth, and nose tip coordinates are also annotated. Older images of the first version of the FERET database are included in the newer Color FERET database; however, these images are supplied in grey-scale only.
Fig. 5.13. Example images of the four mentioned face databases (Color FERET, AR, Yale A, Yale B).
In comparison to the FERET database, the AR Face database [31] contains systematically annotated changes in facial expression (available are neutral, smile, anger, scream), illumination (ambient, left, right, direct light from all sides), and occlusion by sun glasses or scarfs. No pose changes are included. Images of 76 men and 60 women were taken in two sessions on different days. Fiducial point annotations are missing; for some of the images, however, the FGnet project provided annotations which are available on the internet⁷.

⁷ http://www-prima.inrialpes.fr/FGnet/data/05-ARFace/tarfd markup.html

There are two versions of the Yale Face Database. The original version [4] contains 165 frontal face images of 15 persons under different illuminations, with different facial expressions, with and without glasses. The second version (Yale Face Database B) [20] contains grey-scale images of 10 individuals (nine male and one female) under 9 poses and 64 illumination conditions with neutral facial expressions. No subject wears glasses and no other kind of occlusion is contained.

5.1.7 Exercises

The following exercises deal with face recognition using the Eigenface (Sect. 5.1.7.2) and the component-based approach (Sect. 5.1.7.3). Image processing macros for the IMPRESARIO software are explained and suitable process graphs are provided on the book CD.

5.1.7.1 Provided Images and Image Sequences

A few face images and ground truth data files (directories /imageinput/LTI-Faces-Eigenfaces and /imageinput/LTI-Faces-Components)⁸ are provided on the book CD with which the process graphs of this exercise can be executed. This image database is not extensive; however, it should be big enough to get a basic understanding of the face recognition methods. Additionally, we suggest using the Yale face database [4], which can be obtained easily through the internet⁹. Additional ground truth data like coordinates of facial feature points are provided on the CD shipped with this book and can be found after installation of IMPRESARIO in the directories /imageinput/Yale DB-Eigenfaces and /imageinput/Yale DB-Components, respectively.

⁸ The paths are relative to the installation directory on your hard disc.
⁹ http://cvc.yale.edu/projects/yalefaces/yalefaces.html

Set up of the Yale Face Database

To set up the image and ground truth data of the Yale face database follow these steps:

1. Download the Yale face database from the internet⁹ and extract the image files and their 15 sub-directories (one for each subject) to the following directories:
• /imageinput/Yale DB-Eigenfaces for the Eigenface exercise
• /imageinput/Yale DB-Components for the exercise dealing with the component-based face recognition approach

Each of the 15 sub-directories of the two above mentioned directories should now contain 11 ground truth files (as provided with the book CD) and 11 image files.

2. Convert all image files to the bitmap (BMP) format using any standard image processing software.

5.1.7.2 Face Recognition Using the Eigenface Approach

Face recognition using the Eigenface approach (Sect. 5.1.4) can be implemented in IMPRESARIO in three steps, which are modelled in separate process graphs¹⁰:

1. Calculating the transformation matrix (Eigenfaces_calcPCA.ipg): The PCA needs to be calculated from a set of face images which contain the possible variations in face appearance, i.e. mainly variations in illumination and facial expression.

2. Training (Eigenfaces_Train.ipg): Using the transformation matrix calculated in the first step, the face images of the persons to be recognized are projected into the eigenspace and their transformed coordinates are used to train a classifier.

3. Experiments/Testing (Eigenfaces_Test.ipg): The test images are projected into the eigenspace as well and are processed by the previously trained nearest-neighbor classifier. By comparing the classified ID with the target ID of a set of test images, a recognition rate can be determined by:

$$R = \frac{\text{correct classifications}}{\text{correct classifications} + \text{false classifications}}$$
In the following, these three process graphs are explained in more detail. All processes have in common that they perform a normalization of the face images, which is described in the next section.

¹⁰ The process graph files can be found on the book CD. The file names are given in brackets.

Calculating the Transformation Matrix by Applying PCA

Fig. 5.14 shows the first process graph. The ImageSequence source provides an image (left output pin) together with its file name (right output pin), which is needed in the LoadFaceCoordinates macro; this macro loads the coordinates of the eyes, the mouth, and the nose, which are stored in a file with the same name as the image file but a different extension (e.g. subject05.normal.asf for the image subject05.normal.bmp). The ExtractIntensityChannel macro extracts the intensity channel of the image, which is then passed to the FaceNormalizer macro.
Fig. 5.14. IMPRESARIO process graph for the calculation of the PCA transformation matrix. The grey background marks the normalization steps which can be found in all process graphs of this Eigenface exercise. (The face image was taken from the Yale face database [4].)
FaceNormalizer

The FaceNormalizer macro uses the facial feature coordinates to perform the normalization (Sect. 5.1.4) of the image. Five parameters must be set to perform a proper normalization:

width: The width of the result image. The face image will be cropped to this width.
height: The height of the result image. The face image will be cropped to this height.
norm width: The distance between the eye centers. The face image will be scaled horizontally to obtain this distance between the eyes.
norm height: The distance between the y-coordinates of the eye centers and the nose tip/mouth. The face image will be scaled vertically to obtain this distance.
normalization by: This parameter can be set to mouth or nose tip and determines which feature point is used when scaling to norm height.

At this point, the face image is normalized regarding scale and rotation and is cropped to a certain size (face image in the center of Fig. 5.14). Normalization of the intensity values is then performed in the macro IntensityNormalizer (cropped face image in the lower left corner of Fig. 5.14).

IntensityNormalizer

Performs a normalization of the intensity values. The current grey-level mean $\mathrm{avg}_{\mathrm{cur}}$ is determined and all pixel values are adapted to the new (normalized) grey-level mean $\mathrm{avg}_{\mathrm{norm}}$ by:

$$I_{\mathrm{norm}}(x,y) = I(x,y) \cdot \frac{\mathrm{avg}_{\mathrm{norm}}}{\mathrm{avg}_{\mathrm{cur}}}$$
After this normalization step, all face images have the same mean grey level $\mathrm{avg}_{\mathrm{norm}}$. Global linear illumination differences can be reduced by this step.

mean grey-value: The new mean value of all intensities in the image ($\mathrm{avg}_{\mathrm{norm}}$).

As some parts of the image might still contain background, the normalized face image is masked using a pre-defined mask image which is loaded by LoadImageMask. The macro MaskImage actually performs the masking operation. Finally, another ExtractIntensityChannel macro converts the image information into a feature vector which is then forwarded to the calculatePCAMatrix macro.

calculatePCAMatrix

This macro stores the feature vector of each image. After the last image is processed (the ExitMacro function of the macros is called), calculatePCAMatrix can calculate the mean feature vector, the covariance matrix, and their eigenvalues, which are needed to determine the transformation matrix (Sect. 5.1.4) that is then stored.

data file: Path to the file in which the transformation matrix is stored.
result dim: With this parameter the number of dimensions to which the feature vectors are reduced can be set explicitly. If this parameter is set to −1, a suitable dimension reduction is performed by dismissing all Eigenvectors whose eigenvalues are smaller than $\frac{\lambda_{\max}}{100000}$, where $\lambda_{\max}$ denotes the highest eigenvalue.
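The intensity normalization performed by IntensityNormalizer corresponds to the following small NumPy sketch; it merely illustrates the formula above and is not the macro's actual implementation. The default target mean of 128 and the clipping to 8-bit grey levels are assumptions made for the example.

```python
import numpy as np

def normalize_intensity(image, avg_norm=128.0):
    """Scale all pixel values so that the mean grey level of the image becomes avg_norm."""
    avg_cur = float(image.mean())
    normalized = image.astype(float) * (avg_norm / avg_cur)
    return np.clip(normalized, 0, 255).astype(np.uint8)   # keep a valid grey-level range
```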
Fig. 5.15. IMPRESARIO process graph for training an Eigenface classifier.
Eigenface Training

The normalization steps described above (Fig. 5.14) can be found in the training process graph as well (Fig. 5.15). Instead of calculating the PCA, the training graph contains the EigenfaceFeature macro, which loads the previously stored transformation matrix and projects the given feature vectors into the eigenspace (face space). The transformed feature vectors are then forwarded to the kNNTrainer macro, which stores all feature vectors with their corresponding person IDs; these are determined by the macro ExtractIDFromFilename.

EigenfaceFeature

Determines the eigenspace representation of the given image.

data file: File where the transformation matrix to the eigenspace (face space) was stored by the macro calculatePCAMatrix.
ignored Eigenfaces: Number of most relevant Eigenfaces which are ignored when determining the eigenspace representation of the given image.
width: The width of the images to be processed. The width is explicitly needed to display the eigenface determined by the parameter eigenface number.
height: The height of the images to be processed. The height is explicitly needed to display the eigenface determined by the parameter eigenface number.
eigenface number: An eigenface image can be displayed at the eigenface image output (the right one). The index of the eigenface to be displayed is set by this parameter.

ExtractIDFromFilename

This macro extracts the ID from the file name following one of two naming conventions.

naming convention: The naming conventions of two face databases can be chosen with this parameter:
• Yale face database (A): example file name: subject04.glasses.bmp ⇒ ID = 4
• FERET face database: example file name: 00002_930831_fa.jpg ⇒ ID = 2

kNNTrainer

This macro trains a kNN classifier and stores the trained information on disc.

training file: Path to the file where the trained kNN classifier is stored.

Recognition Experiments Using Eigenfaces

Fig. 5.16 shows the graph which can be used for running tests on the Eigenface approach. The kNNTrainer macro of the training graph is substituted by a kNNClassifier.

kNNClassifier

Classifies the given feature vector by loading the previously stored classifier information (see kNNTrainer above) from hard disc.
Fig. 5.16. IMPRESARIO process graph for testing the eigenface approach using a nearest-neighbor classifier.
training file: File in which the kNNTrainer macro stored the trained feature vectors.

RecognitionStatistics

Compares the classified ID (obtained by kNNClassifier) to the target ID, which is again obtained by the macro ExtractIDFromFilename. Additionally, it counts the correct and false classifications and determines the classification rate, which is displayed in the output window of IMPRESARIO (Appendix B).

label: A label which is displayed in the output window of IMPRESARIO to mark each message from the respective RecognitionStatistics macro in case more than one macro is used.

Annotations

For the normalization of the images, parameters must be set which affect the size of the result images. FaceNormalizer crops the image to a certain size (determined by the parameters width and height). These parameters must be set to the same values as in LoadImageMask and EigenfaceFeature (Fig. 5.17). If this convention is not followed, the macros will exit with an error because the dimensionalities of the vectors and matrices do not correspond, and hence the calculations cannot be performed.

Suggested experiments

To understand how the different parameters influence the performance of the eigenface approach, the reader might want to try changing the following parameters:
Fig. 5.17. Proper image size settings for the Eigenface exercise. The settings in all involved macros must be set to the same values, e.g., 120 and 180 for the width and height, respectively.
• Variation of the normalization parameters: Possible variations of the normalization parameters are:
– Image size: The parameters width and height of FaceNormalizer, LoadImageMask, and EigenfaceFeature must be set in all three processing graphs.
– Norm sizes: The distance between the eyes and the nose/mouth can be varied such that only certain parts of the face are visible, e.g. only the eye region.
• Manually determine the dimensionality of the subspace (face space): The dimensionality of the face space strongly determines the performance of the eigenface approach. In general, decreasing the number of dimensions will lead to a decreasing recognition rate. Increasing the number of dimensions at some point has no effect, as the additional dimensions only describe noise that does not improve the recognition performance. The number of dimensions can be set manually via the result dim parameter of the calculatePCAMatrix macro. All process graphs must be executed again; the same images can, however, be used.
• Varying the training set: The recognition is also determined by the images which are used to calculate the transformation matrix and to train the classifier. These two sets of images do not necessarily need to be the same. While varying the training set, the image set used to determine the face space can be kept constant. However, variations (e.g. illumination) present in the training set should be included in the image set for calculating the PCA as well. Tests on how, e.g., illumination affects the recognition rate can be performed
by training face images (e.g. of the Yale face database) under one illumination situation (e.g. the yale.normal.seq images) and testing under another illumination (the yale.leftlight.seq images). The same applies to training and testing images showing different facial expressions.

• Ignoring dominant Eigenfaces for illumination invariance: Another possibility to gain a certain independence from illumination changes is to ignore dominant Eigenfaces when transforming the feature vectors into the face space. The number of dominant Eigenfaces to be ignored can be set by changing the parameter ignored Eigenfaces (macro EigenfaceFeature). This parameter must be changed in the training as well as in the testing process graph. The transformation matrix does not need to be recalculated.
5.1.7.3 Face Recognition Using Face Components

This exercise deals with face recognition based on facial components. Two process graphs are needed for this exercise; they are shipped with the book CD. The IntensityFeatureExtractor macro is provided with C++ source code and can be used as a basis for new feature extraction macros for these processing graphs.
Fig. 5.18. IMPRESARIO process graph for training a component-based face recognizer.
Training of a Component-Based Face Recognizer

The training graph of the component-based classifier is quite simple (Fig. 5.18). The macro LoadComponentPositions loads a list of facial feature points from a file on disc that corresponds to each image file to be trained. The macro returns a list of feature points which is forwarded to the ComponentExtractor.
ComponentExtractor
Determines the positions and sizes of 12 face components from the facial feature points given by the macro LoadComponentPositions. The macro returns a list containing images of the components.
component: One can take a look at the extracted components by setting the component parameter to the ID of the desired component (e.g. 4 for the nose tip; all IDs are documented in the macro help; see the left of Fig. 5.18).

The list of face component images is then forwarded to a feature extractor macro that builds up a list of feature vectors containing one vector per component image. In Fig. 5.18 the IntensityFeatureExtractor is used, which only extracts the grey-level information from the component images. This macro must be substituted if other feature extraction methods are to be used. The list of feature vectors is then used to train a set of 12 kNN classifiers (macro ComponentskNNTrainer). For each classifier, a file is stored in the given directory (parameter training directory) that contains the information of all trained feature vectors.

ComponentskNNTrainer
Trains 12 kNN classifiers, one for each component. The ground truth ID is extracted by the macro ExtractIDFromFilename (Sect. 5.1.7.2).
training directory: Each trained classifier is stored in a file in the directory given here.
training file base name: The name of each of the 12 classifier files contains the base file name given here and the ID of the face component, e.g. TrainedComponent4.knn for the nose tip.

Face Recognition Experiments Using a Component-Based Classifier

For a test image the facial feature point positions are loaded and the face component images are extracted from the face image. Then, the features for all components are determined and forwarded to the classifier. The ComponentskNNTrainer is substituted by a ComponentskNNClassifier that loads the previously stored 12 classifier files to classify each component separately (Fig. 5.19).

ComponentskNNClassifier
For each component, this macro determines a winner ID and an outputVector11 containing information about the distances (the Euclidean distance is used) and the ranking of the different classes, i.e. a list containing a similarity value for each class. The winner ID can be used to determine the classification rate of a single component by feeding it into a RecognitionStatistics macro that compares the winner ID with the target ID. The outputVector can be used to combine the results of the single components into an overall result.
training directory: Directory where ComponentskNNTrainer stored the classification information for all components.

11 outputVector is an LTILib class. Please refer to the documentation of the LTILib for implementation details.
Fig. 5.19. IMPRESARIO process graph of a component-based face recognizer. Whether a component influences the overall recognition result is controlled by connecting the classifier output to the corresponding combiner input. Here, all components contribute to the overall result.
training file base name: The base name of the files in which the classification information is stored.

The ResultCombiner macro performs the combination step by linearly combining the outputs and determines an overall winner ID, which again can be examined in a RecognitionStatistics macro to get an overall classification rate. Classification results of single components can be considered in the overall recognition result by connecting the corresponding output of the ComponentskNNClassifier macro to the input of the ResultCombiner. In Fig. 5.19 all components are considered, as all are connected to the combination macro.

Suggested Experiments
In this exercise the reader is encouraged to implement their own feature extractors on the basis of the IntensityFeatureExtractor macro. By reimplementing the function CIntensityFeatureExtractor::featureExtraction (see Fig. 5.20) this macro can be extended to extract more sophisticated features. The basic structure of the macro can be kept because it already provides the necessary interface for the processing of all components for which this function is called. The reader is also encouraged to train and test different face images, as the performance of the component-based classifier still strongly depends on which variations are trained and tested, e.g. on differences in illumination, facial expressions, or facial accessories like glasses.
void CIntensityFeatureExtractor::featureExtraction()
{
  // set the size of the feature vector to the dimension of the face component
  if(m_Feature.size() != m_FaceComponent.rows()*m_FaceComponent.columns())
    m_Feature.resize(m_FaceComponent.rows()*m_FaceComponent.columns(), 0, false, false);

  // Implement your own feature extraction here and remove the
  // following code ...

  // copy the (intensity) channel to a lti::dvector to generate
  // a feature vector
  m_FaceComponentIter = m_FaceComponent.begin();
  int i = 0;
  while(m_FaceComponentIter != m_FaceComponent.end()) {
    m_Feature.at(i) = *m_FaceComponentIter;
    ++m_FaceComponentIter;
    ++i;
  }
}
Fig. 5.20. Code of the feature extractor interface for component-based face classification. Reimplement this function to build your own feature extraction method.
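As a starting point for such an extension, the following sketch computes block-averaged intensities instead of copying raw pixel values. It is deliberately written as a standalone C++ helper over a plain float buffer so that it does not rely on any particular LTILib call; the function name, the buffer layout, and the block size are illustrative assumptions, and the loop would have to be adapted to the m_FaceComponent/m_Feature members of the listing in Fig. 5.20.

#include <vector>
#include <cstddef>

// Block-mean feature: divides a rows x cols grayscale patch into
// blockSize x blockSize cells and returns the mean intensity of each cell.
// Compared to raw pixel values this yields a shorter feature vector that is
// slightly more tolerant against small misalignments of the component crop.
std::vector<double> blockMeanFeature(const std::vector<float>& patch,
                                     std::size_t rows, std::size_t cols,
                                     std::size_t blockSize)
{
  const std::size_t bRows = (rows + blockSize - 1) / blockSize;
  const std::size_t bCols = (cols + blockSize - 1) / blockSize;
  std::vector<double> feature(bRows * bCols, 0.0);
  std::vector<std::size_t> count(bRows * bCols, 0);

  for (std::size_t y = 0; y < rows; ++y)
    for (std::size_t x = 0; x < cols; ++x) {
      const std::size_t bin = (y / blockSize) * bCols + (x / blockSize);
      feature[bin] += patch[y * cols + x];   // accumulate intensities per cell
      ++count[bin];
    }
  for (std::size_t i = 0; i < feature.size(); ++i)
    feature[i] /= static_cast<double>(count[i]);   // cell mean
  return feature;
}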
5.2 Full-body Person Recognition

Video cameras are widely used for visual surveillance, which is still mainly performed by human observers. Fully automatic systems would have to perform (1) person detection, (2) person tracking (Sect. 5.3) and (3) person recognition. For this non-intrusive scenario neither knowledge- nor token-based recognition systems are applicable. Face recognition cannot be performed due to the lack of illumination and pose-variation invariance of state-of-the-art systems. On top of that, resolution is often poor. A possible solution might be to consider the overall outer appearance of a person. The big variety in colors and textures of clothes and the possibility of combining jackets, shirts, pants, shoes and accessories like scarves and hats offers a distinctive cue for identification. Different styles and typical color combinations can be associated with a single person and hence be used for identification (Fig. 5.21). For a suitable description of the appearance of persons, the following sections outline extraction algorithms for color (Sect. 5.2.2) and texture descriptors (Sect. 5.2.3). It is assumed that persons are already detected and the image accordingly segmented (Sect. 5.3).

5.2.1 State of the Art in Full-body Person Recognition

Until today, the description of the appearance of people for recognition purposes has only been addressed by the research community. Work on appearance-based full-body recognition was done by Yang et al. [57] and Nakajima et al. [37], whose work is briefly reviewed here. Yang et al. developed a multimodal person identification system which, besides face
Fig. 5.21. Different persons whose appearance differs because of their size and especially in texture and color of their clothing (a - d). However, sometimes similar clothing might not allow a proper separation of individuals (d + e).
and speaker recognition, also contains a module for recognition based on color appearance. Using the whole segmented image, they build up color histograms based on the r-g color space and the tint-saturation color space (t-s color space) to gain illumination independence to some extent (Sect. 5.2.2.1). From the histograms, which are built after the person is separated from the background, a smoothed probability distribution function is determined for each histogram. During training, for each color model and each person i, probability distribution functions $P_{i,\text{model}}(r,g)$ and $P_{i,\text{model}}(t,s)$ are calculated and stored. For classification, distributions $P_{i,\text{test}}(r,g)$ and $P_{i,\text{test}}(t,s)$ are determined and compared to all previously trained $P_{i,\text{model}}$ for the respective color space. For comparison the following measure (Kullback-Leibler divergence) is used:

$$ D(\text{model}_i \,\|\, \text{test}) = \sum_{s \in S} P_{i,\text{model}}(r,g) \cdot \log \frac{P_{i,\text{model}}(r,g)}{P_{i,\text{test}}(r,g)} $$

where S is the set of all histogram bins. For the t-s color space models, this is performed analogously. The model having the smallest divergence determines the recognition result. Tests on a database containing 5000 images of 16 persons showed that the t-s model performs slightly better than the r-g color model. Recognition is not influenced heavily by the size of the histograms for either of the two color spaces, and recognition rates are up to 70% depending on histogram size and color model. Nakajima et al. examined different features by performing feature extraction globally. Besides an RGB histogram and an r-g histogram (like Yang et al. [57]), a shape histogram and local shape features are used. The shape histogram simply contains the number of pixels in rows and columns of the image to be examined. This feature
was combined with an RGB histogram. The local shape features were determined by convolving the image with 25 shape pattern patches. The convolution is performed on the following color channels: R − G, R + G, and R + G − B. For classification a Support Vector Machine [49] was used. Tests were performed on a database of about 3500 images of 8 persons which were taken during a period of 16 days. All persons wore different clothes during the acquisition period, so that different features were assigned to the same person. Recognition rates of about 98% could be achieved. However, the ratio of training images to test images was up to 9 : 1. It could be shown that the simple shape features are not useful for this kind of recognition task, probably because of insufficient segmentation. The local shape features performed quite well, though the normalized color histogram produced the best results. This is in line with the results of Yang [57].
5.2.2 Color Features for Person Recognition

This section describes color features for person recognition and discusses their advantages and disadvantages.

5.2.2.1 Color Histograms

Histograms are commonly used to describe the appearance of objects by counting the number of pixels of each occurring color. Each histogram bin represents either a single color or a color range. With the number of bins, the size of the histogram increases and hence the need for memory resources. With bins representing a color range, the noise invariance increases because subtle changes in color have no effect, as the same bin is incremented when building the histogram. Therefore, a trade-off between computing performance, noise invariance (small number of bins), and differentiability (large number of bins) must be found. The dimensionality of a histogram is determined by the number of channels which are considered. A histogram on the RGB color space is three times bigger than a histogram of intensity values. A histogram of an image I(x, y) with height h and width w is given by:

$$ n_i = \left| \{ I(x,y) \mid I(x,y) = i \} \right|, \qquad i = 0, \ldots, K-1 $$

where K is the number of colors and $\sum_{i=0}^{K-1} n_i = w \cdot h$. Histograms can be based on any color space that suits the specific application. The RGBL-histogram is a four-dimensional histogram built on the three RGB channels of an image and the luminance channel, which is defined by:

$$ L = \frac{\min(R,G,B) + \max(R,G,B)}{2} $$

Yang et al. [57] and Nakajima et al. [37] used the r-g color space for person identification, where an RGB image is transformed into an r-channel and a g-channel based on the following equations applied to each pixel P(x, y) = (R, G, B):

$$ r = \frac{R}{R+G+B}, \qquad g = \frac{G}{R+G+B} $$
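As an illustration of this transformation, the following sketch builds a normalized r-g (chrominance) histogram from an interleaved 8-bit RGB buffer of a segmented person image; the function name and the default of 32 bins per channel (the value reported for Nakajima et al. below) are assumptions made for this example.

#include <vector>
#include <cstdint>
#include <cstddef>

// Normalized r-g chrominance histogram with 'bins' bins per channel.
// 'rgb' holds interleaved 8-bit R,G,B triples.
std::vector<double> chromaHistogram(const std::vector<std::uint8_t>& rgb,
                                    int bins = 32)
{
  std::vector<double> hist(bins * bins, 0.0);
  std::size_t numPixels = 0;
  for (std::size_t i = 0; i + 2 < rgb.size(); i += 3) {
    const double R = rgb[i], G = rgb[i + 1], B = rgb[i + 2];
    const double sum = R + G + B;
    if (sum <= 0.0) continue;               // r and g are undefined for black pixels
    const double r = R / sum, g = G / sum;  // chrominance coordinates in [0,1]
    int ri = static_cast<int>(r * bins); if (ri == bins) ri = bins - 1;
    int gi = static_cast<int>(g * bins); if (gi == bins) gi = bins - 1;
    ++hist[ri * bins + gi];
    ++numPixels;
  }
  if (numPixels > 0)
    for (double& h : hist) h /= static_cast<double>(numPixels);  // normalize
  return hist;
}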
Each combination $(\tilde{r}, \tilde{g})$ of specific colors is represented by a bin in the histogram. Nakajima used 32 bins for each channel, i.e. K = 32 · 32 = 1024 bins for the whole r-g histogram. The maximum size of the histogram (assuming the commonly used color resolution of 256 per channel) is $K_{\max}$ = 256 · 256 = 65536. So, just a fraction of the maximum size was used, leading to more robust classification. As the r-channel and the g-channel code the chrominance of an image, this histogram is also called chrominance histogram.

5.2.2.2 Color Structure Descriptor

The Color Structure Descriptor (CSD) was defined by the MPEG-7 standard for image retrieval applications and image-content-based database queries. It can be seen as an extension of a color histogram which does not just represent the color distribution but also considers the local spatial distribution of the colors. With the so-called structuring element, not only each pixel of the image is considered when building the histogram but also its neighborhood.
Fig. 5.22. 3 × 3 structuring elements located in two images to consider the neighborhood of the pixel under the element's center. In both cases, the histogram bins representing the colors “black” and “white” are increased by exactly 1.
Fig. 5.22 illustrates the usage of the structuring element (here with a size of 3 × 3, shown in grey) with two simple two-colored images. Both images have a size of 10 × 10 pixels, of which 16 are black. For building the CSD histogram the structuring element is moved row-wise over the whole image. A bin of the histogram is increased by exactly 1 if the corresponding color can be found within the structuring element window, even if more than one pixel has this color. At the position marked in Fig. 5.22 in the left image, one black pixel and eight white pixels are found. The values of the bins representing the black color and the white color are increased by 1 each, just like on the right side of Fig. 5.22, even though two black pixels and only seven white pixels are found within the structuring element.
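The sliding-window counting of the CSD can be sketched as follows for an image that has already been quantized to color indices; the function name and parameters are illustrative assumptions, and the window simply stops at the image border.

#include <vector>
#include <algorithm>

// CSD-style histogram: for every position of a winSize x winSize structuring
// element, each color index that occurs at least once inside the window
// increments its bin by exactly 1, regardless of how many pixels have it.
std::vector<int> csdHistogram(const std::vector<int>& quantized,  // one color index per pixel
                              int width, int height,
                              int numColors, int winSize)
{
  std::vector<int> hist(numColors, 0);
  std::vector<bool> present(numColors, false);
  for (int y = 0; y + winSize <= height; ++y)
    for (int x = 0; x + winSize <= width; ++x) {
      std::fill(present.begin(), present.end(), false);
      for (int dy = 0; dy < winSize; ++dy)
        for (int dx = 0; dx < winSize; ++dx)
          present[quantized[(y + dy) * width + (x + dx)]] = true;
      for (int c = 0; c < numColors; ++c)
        if (present[c]) ++hist[c];           // count presence, not pixel count
    }
  return hist;
}

For the 3 × 3 element of Fig. 5.22 the window visits 8 × 8 = 64 positions in a 10 × 10 image, so every bin value stays at or below 64, which is consistent with the numbers given in the following paragraph.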
By applying the structuring element, the histogram bins will have the values $n_{\text{Black}} = 36$ and $n_{\text{White}} = 60$ for the left image, and $n_{\text{Black}} = 62$ and $n_{\text{White}} = 64$ for the right. As can be seen for the black pixels, the more the color is distributed, the larger is the value of the histogram bin. A normal histogram built on these two images would have the bin values $n_{\text{Black}} = 16$ and $n_{\text{White}} = 84$. The images could not be separated by these identical histograms, while the CSD histograms still show a difference. The MPEG-7 standard suggests a structuring element size of 8 × 8 pixels for images smaller than 256 × 256 pixels, which is the case for most segmented images in our experiments.

It must be noted that the MPEG-7 standard also introduced a new color space [29]. Images are transformed into the HMMD color space (Hue, Min, Max, Diff) and then the CSD is applied. The hue value is the same as in the HSV color space, while Min and Max are the smallest and the biggest of the R, G, and B values and Diff is the difference between these extremes:

$$ \mathrm{Hue} = \begin{cases} \left(0 + \frac{G-B}{\mathrm{Max}-\mathrm{Min}}\right) \times 60, & \text{if } R = \mathrm{Max} \\ \left(2 + \frac{B-R}{\mathrm{Max}-\mathrm{Min}}\right) \times 60, & \text{if } G = \mathrm{Max} \\ \left(4 + \frac{R-G}{\mathrm{Max}-\mathrm{Min}}\right) \times 60, & \text{if } B = \mathrm{Max} \end{cases} $$

$$ \mathrm{Min} = \min(R,G,B), \qquad \mathrm{Max} = \max(R,G,B), \qquad \mathrm{Diff} = \max(R,G,B) - \min(R,G,B) $$

5.2.3 Texture Features for Person Recognition

Clothing is not only characterized by color but also by texture. So far, texture features have not been evaluated as much as color features, though they might give valuable information about a person's appearance. The following two features turned out to have good recognition performance, especially when combined with other features. In general, most texture features filter an image, split it up into separate frequency bands, and extract energy measures that are merged into a feature vector. The applied filters have different specific properties, e.g. they extract either certain frequencies or frequencies in certain directions.

5.2.3.1 Oriented Gaussian Derivatives

For a stable rotation-invariant feature we considered a feature based on the Oriented Gaussian Derivatives (OGD), which showed promising results in other recognition tasks before [54]. The OGDs are steerable, i.e. it can be determined how the information in a certain direction influences the filter result and therefore the feature vector. In general, a steerable filter $g^{\theta}(x,y)$ can be expressed as:
$$ g^{\theta}(x,y) = \sum_{j=1}^{J} k_j(\theta)\, g_j^{\theta}(x,y) $$

where $g_j^{\theta}(x,y)$ denote basis filters and $k_j(\theta)$ interpolation functions. For the first-order OGDs the filter can be defined as:

$$ g^{\theta}(x,y) = g^{0^{\circ}}(x',y') = \frac{\partial}{\partial x'} g(x',y') = -\frac{1}{2\pi\sigma^4}\,(x\cos\theta + y\sin\theta)\,\exp\left(-\frac{x^2+y^2}{2\sigma^2}\right) = \cos\theta\,\frac{\partial}{\partial x} g(x,y) + \sin\theta\,\frac{\partial}{\partial y} g(x,y) = \cos\theta\, g^{0^{\circ}}(x,y) + \sin\theta\, g^{90^{\circ}}(x,y) $$

where $x' = x\cos\theta$ and $y' = y\sin\theta$. Here the interpolation functions are $k_1(\theta) = \cos(\theta)$ and $k_2(\theta) = \sin(\theta)$. The basis functions are given by the derivatives of the Gaussian in the direction of x and y, respectively. A directed channel $C^{\theta}$ with its elements $c^{\theta}(x,y)$ can then be extracted by convolving the steerable filter $g^{\theta}(x,y)$ with the image I(x, y):

$$ c^{\theta}(x,y) = g^{\theta}(x,y) * I(x,y) $$

The power $p^{\theta}(x,y)$ of the channel $C^{\theta}$ is then determined by:

$$ p^{\theta}(x,y) = \left( c^{\theta}(x,y) \right)^2 $$

For an n-th order Gaussian derivative, the energy of a region R in a filtered channel $c^{\theta}(x,y)$ can then be expressed as [2]:

$$ E_R^{\theta} = \int_R p^{\theta}(x,y)\,dx\,dy = \sum_{(i,j)=1}^{J} k_i(\theta)\,k_j(\theta) \int_R c_i^{\theta}(x,y)\, c_j^{\theta}(x,y)\,dx\,dy = A_0 + \sum_{i=1}^{n} A_i \cos(2i\theta - \Theta_i) $$
where $A_0$ is independent of the steering angle $\theta$ and the $A_i$ are rotation-invariant, i.e. the energy $E_R^{\theta}$ is steerable as well. The feature vector is composed of all coefficients $A_i$. For the exact terms of the coefficients $A_i$ and more details the interested reader is referred to [2].

5.2.3.2 Homogeneous Texture Descriptor

The Homogeneous Texture Descriptor (HTD) describes texture information in a region by determining the mean energy and energy deviations from 30 frequency
channels non-uniformly distributed in a cylindrical frequency domain [38, 42]. The MPEG-7 specifications for the HTD suggest using Gabor functions to model the channels in the polar frequency domain, which is split into six bands in angular direction and five bands in radial direction (Fig. 5.23).
Fig. 5.23. (Ideal) HTD frequency domain layout. The 30 ideal frequency bands are modelled with Gabor filters.
In the literature an implementation of the HTD is proposed that applies the “central slice problem” and the Radon transformation [42]. However, an implementation using a 2D Fast Fourier Transformation (FFT) turned out to have a much higher computational performance in our experiments. The 2D-FFT is defined as:

$$ F_I(u,v) = \sum_{x=0}^{W-1} \sum_{y=0}^{H-1} I(x,y)\, e^{-2\pi i \left( \frac{ux}{W} + \frac{vy}{H} \right)}, \qquad u = 0,\ldots,W-1;\; v = 0,\ldots,H-1 $$
The resulting frequency domain is symmetric, i.e. only N/2 points of the discrete transformation need to be calculated and afterwards filtered by Gabor filters, which are defined as:

$$ G(\omega,\theta) = e^{\frac{-(\omega-\omega_s)^2}{2\sigma_{\omega_s}^2}} \cdot e^{\frac{-(\theta-\theta_r)^2}{2\sigma_{\theta_r}^2}} $$

where $\omega_s$ ($\theta_r$) determines the center frequency (angle) of the band i, where i = 6s + r + 1, with r denoting the radial index and s the angular index. From each of the 30 bands the energies $e_i$ and their deviations $d_i$ are defined as:

$$ e_i = \log(1 + p_i), \qquad d_i = \log(1 + q_i) $$
with

$$ p_i = \sum_{\omega=0}^{1} \sum_{\theta=0^{\circ}}^{360^{\circ}} \left[ G(\omega,\theta) \cdot F(\omega,\theta) \right]^2 $$

$$ q_i = \sqrt{ \sum_{\omega=0}^{1} \sum_{\theta=0^{\circ}}^{360^{\circ}} \left\{ \left[ G(\omega,\theta) \cdot F(\omega,\theta) \right]^2 - p_i \right\}^2 } $$

respectively, where $G(\omega,\theta)$ denotes the Gabor function in the frequency domain and $F(\omega,\theta)$ the 2D Fourier transformation in polar coordinates. The HTD feature vector is then defined as:

$$ HTD = \left[ f_{DC}, f_{SD}, e_1, e_2, \cdots, e_{30}, d_1, d_2, \cdots, d_{30} \right] $$

$f_{DC}$ is the intensity mean value of the image (or color channel), $f_{SD}$ the corresponding standard deviation. If $f_{DC}$ is not used, the HTD can be regarded as an intensity-invariant feature [42].

5.2.4 Experimental Results

This section shows some results which were achieved by applying the feature descriptors described above.

5.2.4.1 Experimental Setup

The tests were performed on a database of 53 persons [22]. For each individual, a video sequence was available showing the person standing upright. 10 images were chosen to train a person and 200 images were used for testing. Some examples of the used images are depicted in Fig. 5.21.

5.2.4.2 Feature Performance

The results achieved with the above described features are shown in Fig. 5.24. It can clearly be seen that the performance of the color features is superior to the performance of the texture features. Note the good stability of the CSD with an increasing number of classes (persons). The good performance of the chrominance histogram in [37] could not be verified. The chrominance histogram as well as the RGBL histogram on average have a lower performance than the CSD. The performance of the RGBL histogram is 3−4% lower, the performance of the chrominance histogram up to about 10%. Note that the combination of the CSD and the HTD12 still led to a slight performance improvement, though the performance of the HTD was worse than the performance of the CSD.

12 Combination was performed by adding the distance measures of the single features, $d(x) = d_{CSD}(x) + d_{HTD}(x)$, and then determining the class with the smallest overall distance.
Fig. 5.24. Recognition results of the examined features: recognition rate over the number of persons (10 to 50) for OGD, HTD, CSD, RGBL, chrominance, and CSD+HTD. Color features are marked with filled symbols. Texture features have outlined symbols.
This proves that the HTD describes information that cannot be completely covered by the CSD. Therefore, the combination of texture and color features is advisable.
5.3 Camera-based People Tracking

Numerous present and future tasks of technical systems increasingly require the perception and understanding of the real world outside the machine's metal box. Like the human's eyes, camera images are the preferred input to provide the required information, and therefore computer vision is the appropriate tool to approach these tasks. One of the most fundamental abilities in scene understanding is the perception and analysis of people in their natural environment, a research area often titled the “Looking at People” domain in computer vision. Many applications rely on this ability, like autonomous service robots, video surveillance, person recognition, man-machine interfaces, motion capturing, video retrieval and content-based low-bandwidth video compression. “People Tracking” is the image analysis element that all of these applications have in common. It denotes the task of detecting humans in an image sequence and keeping track of their movements through the scene. Often this provides merely the initial information for subsequent analysis, e.g. of the body posture, the position of people in the three-dimensional scene, the behaviour, or the identity. There is no general solution at hand that could process the images of any camera in any situation and deliver the positions and actions of the detected persons, because real situations are far too varied and complex (see Fig. 5.25) to be interpreted without any prior knowledge.
Fig. 5.25. Different situations require more or less complex approaches, depending on the number of people in the scene as well as the camera perspective and distance.
Furthermore, people have a wide variety of shape and appearance. A human observer solves the image analysis task by three-dimensional understanding of the situation in its entirety, using a vast background knowledge about every single element in the scene, physical laws and human behaviour. No present computer model is capable of representing and using that amount of world knowledge, so the task has to be specialized to allow for more abstract descriptions of the desired observations, e.g. “Humans are the only objects that move in the scene” or “People in the scene are walking upright and therefore have a typical shape”.
Fig. 5.26. Overview of the general structure of a people tracking system: segmentation of the camera image (people detection), tracking (updating the position of each person, new person detection, person description update, position prediction), and an optional feature extraction step (position in the 3D scene, body posture, identity).
People tracking is a high-level image processing task that consists of multiple processing steps (Fig. 5.26). In general, a low-level segmentation step, i.e. a search for image regions with a high probability of representing people, precedes the tracking step, which uses knowledge and observations from the previous images to assign each monitored person to the corresponding region and to detect new persons entering the scene. But as will be shown, there are also approaches that combine the segmentation and the tracking step. The following sections present various methods to solve the individual tasks of a tracking system, namely the initial detection and segmentation of the people in the image, the tracking in the image plane and, if required, the calculation of the real-world position of each person. Every method has advantages and disadvantages in different situations, so the choice of how to approach the development of a “Looking at People” system depends entirely on the specific requirements of the intended application.
Surveys and summaries of a large number of publications in this research area can be found in [19, 35, 36].

5.3.1 Segmentation

“Image segmentation” denotes the process of grouping pixels with a certain common property and can therefore be regarded as a classification task. In the case of people tracking, the goal of this step is to find pixels in the image that belong to persons. The segmentation result is represented by a matrix M(x, y) of Booleans that labels each image pixel I(x, y) as one of the two classes “background scene” or “probably person” (Fig. 5.27 b):

$$ M(x,y) = \begin{cases} 0 & \text{if background scene} \\ 1 & \text{if moving foreground object (person)} \end{cases} \qquad (5.6) $$

While this two-class segmentation suffices for numerous tracking tasks (Sect. 5.3.2.1), individual tracking of people during inter-person overlap requires further segmentation of the foreground regions, i.e. to label each pixel as one of the classes background scene, n-th person, or unrecognized foreground (probably a new person) (Fig. 5.27 c). This segmentation result can be represented by individual foreground masks $M_n(x,y)$ for every tracked person $n \in \{1, \ldots, N\}$ and one mask $M_0(x,y)$ to represent the remaining foreground pixels. Sect. 5.3.3 presents several methods to segment the silhouettes separately using additional knowledge about shape and appearance.
Fig. 5.27. Image segmentation; a) camera image, b) foreground segmentation, c) separate person segmentation.
5.3.1.1 Foreground Segmentation by Background Subtraction

Most people-tracking applications use one or more stationary cameras. The advantage of such a setup is that each pixel of the background image B(x, y), i.e. the camera image of the static empty scene without any persons, has a constant value as long as there are no illumination changes. The most direct approach to detect
moving objects would be to label all those pixels of the camera image I(x, y, t) at time t as foreground that differ by more than a certain threshold θ from the prerecorded background image:

$$ M(x,y,t) = \begin{cases} 1 & \text{if } |B(x,y) - I(x,y,t)| > \theta \\ 0 & \text{otherwise} \end{cases} \qquad (5.7) $$

θ has to be chosen greater than the average camera noise. In real camera setups,
Fig. 5.28. Foreground segmentation steps: a) background model, b) camera image, c) background subtraction, d) shadow reduction, e) morphological filtering.
however, the amount of noise is not only correlated to local color intensities (e.g. due to saturation effects), but there are also several additional types of noise that affect the segmentation result. Pixels at high contrast edges often have high noise variance due to micro-movements of the camera and “blooming/smearing” effects in the image sensor. Non-static background like moving leaves on trees in outdoor scenes also causes high local noise. Therefore it is more suitable to describe each background pixel individually by its mean value $\mu_c(x,y)$ and its noise variance $\sigma_c^2(x,y)$. In the following, it is assumed that there is no correlation between the noise of the separate color channels $B_c(x,y)$, $c \in \{r,g,b\}$. The background model is created from N images $B(x,y,i)$, $i \in \{1,\ldots,N\}$ of the empty scene:

$$ \mu_c(x,y) = \frac{1}{N} \sum_{i=1}^{N} B_c(x,y,i) \qquad (5.8) $$

$$ \sigma_c^2(x,y) = \frac{1}{N} \sum_{i=1}^{N} \left( B_c(x,y,i) - \mu_c(x,y) \right)^2 \qquad (5.9) $$

Assuming a Gaussian noise model, the Mahalanobis distance $\Delta_M$ provides a suitable measure to decide whether the difference between image and background is relevant or not.

$$ \Delta_M(x,y,t) = \sum_{c \in \{r,g,b\}} \frac{\left( I_c(x,y,t) - \mu_c(x,y) \right)^2}{\sigma_c^2(x,y)} \qquad (5.10) $$
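A compact sketch of this per-pixel background model and the Mahalanobis test of eqs. 5.8–5.10 is given below; the data structures and the simple frame layout (interleaved r,g,b floats, at least one frame) are assumptions made for illustration.

#include <vector>
#include <algorithm>
#include <cstddef>

struct BackgroundModel {
  std::vector<float> mean;  // per pixel and channel: mu_c(x,y)
  std::vector<float> var;   // per pixel and channel: sigma^2_c(x,y)
};

// Builds the model from N frames of the empty scene (eqs. 5.8, 5.9).
BackgroundModel buildBackground(const std::vector<std::vector<float>>& frames)
{
  const std::size_t n = frames[0].size();
  BackgroundModel bg{std::vector<float>(n, 0.f), std::vector<float>(n, 0.f)};
  for (const auto& f : frames)
    for (std::size_t i = 0; i < n; ++i) bg.mean[i] += f[i];
  for (float& m : bg.mean) m /= frames.size();
  for (const auto& f : frames)
    for (std::size_t i = 0; i < n; ++i) {
      const float d = f[i] - bg.mean[i];
      bg.var[i] += d * d;
    }
  for (float& v : bg.var) v = std::max(v / frames.size(), 1e-6f);  // avoid division by zero
  return bg;
}

// Foreground mask via the per-pixel Mahalanobis distance of eq. 5.10.
std::vector<bool> segmentForeground(const BackgroundModel& bg,
                                    const std::vector<float>& image,
                                    float threshold)
{
  std::vector<bool> mask(image.size() / 3, false);
  for (std::size_t p = 0; p < mask.size(); ++p) {
    float dM = 0.f;
    for (int c = 0; c < 3; ++c) {
      const std::size_t i = 3 * p + c;
      const float d = image[i] - bg.mean[i];
      dM += d * d / bg.var[i];
    }
    mask[p] = (dM > threshold);
  }
  return mask;
}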
element positioned at (x, y), the formal description of the morphological operations is given by:

$$ F(x,y) = \frac{\sum_{\delta_x=-d}^{+d} \sum_{\delta_y=-d}^{+d} M(x+\delta_x,\, y+\delta_y)\, E(\delta_x,\delta_y)}{\sum_{\delta_x=-d}^{+d} \sum_{\delta_y=-d}^{+d} E(\delta_x,\delta_y)} \qquad (5.14) $$

$$ \text{Erosion:} \quad M_E(x,y) = \begin{cases} 1 & \text{if } F(x,y) = 1 \\ 0 & \text{otherwise} \end{cases} \qquad (5.15) $$

$$ \text{Dilation:} \quad M_D(x,y) = \begin{cases} 1 & \text{if } F(x,y) > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (5.16) $$
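For a binary mask and the all-ones square structuring element assumed in eqs. 5.14–5.16, erosion and dilation can be sketched in a single small routine; the names and the simple border handling are illustrative choices.

#include <vector>

// Erosion/dilation of a binary mask with a (2d+1) x (2d+1) square structure
// element (E == 1 everywhere). For erosion a pixel stays foreground only if
// its whole neighbourhood is foreground (eq. 5.15); for dilation a single
// foreground pixel in the neighbourhood suffices (eq. 5.16).
std::vector<bool> morph(const std::vector<bool>& mask, int width, int height,
                        int d, bool erode)
{
  std::vector<bool> out(mask.size(), false);
  for (int y = 0; y < height; ++y)
    for (int x = 0; x < width; ++x) {
      int count = 0, total = 0;
      for (int dy = -d; dy <= d; ++dy)
        for (int dx = -d; dx <= d; ++dx) {
          const int xx = x + dx, yy = y + dy;
          if (xx < 0 || yy < 0 || xx >= width || yy >= height) continue;
          ++total;
          if (mask[yy * width + xx]) ++count;
        }
      out[y * width + x] = erode ? (count == total) : (count > 0);
    }
  return out;
}

Morphological opening and closing are then obtained by calling this routine twice, with the erosion/dilation flags in the respective order.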
The effects of both operations are shown in Fig. 5.29.
Fig. 5.29. Morphological operations. Black squares denote foreground pixels, grey lines mark the borderline of the original region.
Since the erosion reduces the size of the original shape by the radius of the structure element while the dilation expands it, the operations are always applied pairwise. An erosion followed by a dilation is called morphological opening. It erases foreground regions that are smaller than the structure element. Holes in the foreground are closed by morphological closing, i.e. a dilation followed by an erosion. Often opening and closing operations are performed subsequently, since both effects are desired.

5.3.2 Tracking

The foreground mask M(x, y) resulting from the previous section marks each image pixel either as “potential foreground object” or as “image background”; however, so far no meaning has been assigned to the foreground regions. This is the task of the
tracking step. “Tracking” denotes the process of associating the identity of a moving object (person) with the same object in the preceding frames of an image sequence, thus following the trajectory of the object either in the image plane or in real-world coordinates. To this end, features of each tracked object, like the predicted position or appearance descriptions, are used to classify and label the foreground regions. There are two different types of tracking algorithms: In the first one (Sect. 5.3.2.1), all human candidates in the image are detected before the actual tracking is done. Here, tracking is primarily a classification problem, using various features to assign each person candidate to one of the internal descriptions of all observed persons. The second type of tracking systems combines the detection and tracking step (Sect. 5.3.2.2). The predicted image area of each tracked person serves as initialization to search for the new position. This search can be performed either on the segmented foreground mask or on the image data itself. In the latter case, the process often also includes the segmentation step, using color descriptions of all persons to improve the segmentation quality or even to separate people during occlusion. This second, more advanced class of tracking algorithms is also preferred in model-based tracking, where human body models are aligned to the person's shape in the image to detect the exact position or body posture. In the following, both methods are presented and discussed. Inter-person overlap and occlusions by objects in the scene will be covered in Sect. 5.3.3.

5.3.2.1 Tracking Following Detection

The first step in algorithms of the tracking following detection type is the detection of person candidates in the segmented foreground mask (see Fig. 5.30 b). The most common solution for this problem is to analyze the size of connected foreground regions, keeping only those with a pixel number above a certain threshold. An appropriate algorithm is presented in Sect. 2.1.2.2 (lti::objectsFromMask, see Fig. 2.11). If necessary, additional criteria can be considered, e.g. a valid range for the width-to-height ratio of the bounding boxes surrounding the detected regions. Which criteria to apply depends largely on the intended application area. In most cases, simple heuristic rules are adequate to distinguish persons from other moving objects that may appear in the scene, like cars, animals or image noise (see also the example in Sect. 5.3.5). After detection, each tracked person from the last image is assigned to one of the person candidates, thus finding the new position. This classification task relies on the calculation of appropriate distance metrics using spatial or color features as described below. New appearing persons are detected if a person candidate cannot be identified. If no image region can be assigned to a tracked person, it is assumed that the person has left the camera field of view. Tracking following detection is an adequate approach in situations that are not too demanding. Examples are the surveillance of parking lots or similar set-ups fulfilling the following criteria (Fig. 5.25 b):
Fig. 5.30. Tracking following detection: a) camera image, b) foreground segmentation and person candidate detection, c) people assignment according to distance and appearance.
• The camera is mounted at a position significantly higher than a human. The steeper the tilt towards the scene, the smaller is the probability of inter-person overlap in the image plane.
• The distance between the camera and the monitored area is large, so that each person occupies a small region in the camera image. As a result, the variability of human shape and even grouping of persons have only negligible influence on the detected position.
• There are only few people moving simultaneously in the area, so the frequency of overlap is small.
• The application does not require separating people when they merge into groups (see Sect. 5.3.3).
If, on the other hand, the monitored area is narrow with many people inside, who can come close towards the camera or are likely to be only partially visible in the camera image (Fig. 5.25 a), it is more appropriate to use the combined tracking and detection approach as presented in Sect. 5.3.2.2. Various features have been applied in tracking systems to assign the tracked persons to the detected image areas. The most commonly used distance metric $d_{n,k}$ between a tracked person n and a person candidate k is the Euclidean distance between the predicted center position $(x_{p_n}, y_{p_n})$ of the person's shape and those of the respective segmented foreground region $(x_{c_k}, y_{c_k})$:

$$ d_{n,k} = \sqrt{ (x_{p_n} - x_{c_k})^2 + (y_{p_n} - y_{c_k})^2 } \qquad (5.17) $$

The prediction $(x_{p_n}, y_{p_n})$ is extrapolated from the history of positions in the preceding frames of the sequence, e.g. with a Kalman Filter [53]. A similar distance metric is the overlap ratio between the predicted bounding box of a tracked person and the bounding box of the compared foreground region. Here, the region with the largest area inside the predicted bounding box is selected. While assignment by spatial distance provides the basic framework of a tracking system, more advanced features are necessary for reliable identification whenever people come close to each other in the image plane or have to be re-identified after occlusions (Sect. 5.3.3). To this end, appearance models of all persons are created when they enter the scene and compared to the candidates in each new frame. There
are many different types of appearance models proposed in the literature, of which a selection is presented below. The most direct way to describe a person would be his or her full-body image. Theoretically, it would be possible to take the image of a tracked person in one frame and to compare it pixel-by-pixel with the foreground region in question in the next frame. This, however, is not suitable for re-identification after more than a few frames, e.g. after an occlusion, since the appearance and silhouette of a person changes continuously during the movement through the scene. A good appearance model has to be general enough to comprise the varying appearance of a person during the observation time, as well as being detailed enough to distinguish the person from others in the scene. To solve this problem for a large number of people under varying lighting conditions requires elaborate methods. The appearance models presented in the following sections assume that the lighting within the scene is temporally and spatially constant and that the clothes of the tracked persons look similar from all sides. Example applications using these models for people tracking can be found in [24, 59] (Temporal Texture Templates), [3, 32] (color histograms), [34, 56] (Gaussian Mixture Models) and [17] (Kernel Density Estimation). Other possible descriptions, e.g. using texture features, are presented in Sect. 5.2.

Temporal Texture Templates

A Temporal Texture Template consists of the average color or monochrome image $T(\tilde{x},\tilde{y})$ of a person in combination with a matrix of weights $W(\tilde{x},\tilde{y})$ that counts the occurrence of every pixel as foreground. The matrix is initialized with zeros. $(\tilde{x},\tilde{y})$ are the relative coordinates inside the bounding box of the segmented person, normalized to a constant size to cope with the changing scale of a person in the camera image. In the following, (x, y) will be considered as the image coordinates that correspond to $(\tilde{x},\tilde{y})$ and vice versa. In image frame I(x, y, t) with the isolated foreground mask $M_n(x,y,t)$ for the n-th person, the Temporal Texture Template is updated as follows:

$$ W_n(\tilde{x},\tilde{y},t) = W_n(\tilde{x},\tilde{y},t-1) + M_n(x,y,t) \qquad (5.18) $$

$$ T_n(\tilde{x},\tilde{y},t) = \frac{W_n(\tilde{x},\tilde{y},t-1)\, T_n(\tilde{x},\tilde{y},t-1) + M_n(x,y,t)\, I(x,y,t)}{\max\{W_n(\tilde{x},\tilde{y},t),\, 1\}} \qquad (5.19) $$

Figure 5.31 shows an example of how the template and the weight matrix evolve over time. Depicting the weights as a grayscale image shows the average silhouette of the observed person, with higher values towards the body center and lower values in the regions of moving arms and legs. Since these weights are also a measure for the reliability of the respective average image value, they are used to calculate the weighted sum of differences $d_{T_{n,k}}$ between the region of a person candidate k and the average appearance of the compared n-th person:

$$ d_{T_{n,k}} = \frac{\sum_x \sum_y M_k(x,y,t)\, W_n(\tilde{x},\tilde{y},t-1)\, \left| T_n(\tilde{x},\tilde{y},t-1) - I(x,y,t) \right|}{\sum_x \sum_y M_k(x,y,t)\, W_n(\tilde{x},\tilde{y},t-1)} \qquad (5.20) $$
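A sketch of the update of eqs. 5.18 and 5.19 for one incoming frame is given below; it assumes, for simplicity, that the normalized template and the rescaled bounding-box crop have the same resolution, so that no explicit mapping between (x, y) and the normalized coordinates is needed.

#include <vector>
#include <algorithm>
#include <cstddef>

struct TextureTemplate {
  std::vector<float> T;  // average intensity per template pixel
  std::vector<float> W;  // number of frames in which the pixel was foreground
};

// One update step of the Temporal Texture Template (eqs. 5.18, 5.19).
void updateTemplate(TextureTemplate& tpl,
                    const std::vector<float>& crop,  // normalized image crop I
                    const std::vector<bool>&  mask)  // normalized person mask M_n
{
  for (std::size_t i = 0; i < tpl.T.size(); ++i) {
    const float m = mask[i] ? 1.f : 0.f;
    const float wOld = tpl.W[i];
    tpl.W[i] = wOld + m;                               // eq. 5.18
    tpl.T[i] = (wOld * tpl.T[i] + m * crop[i])
               / std::max(tpl.W[i], 1.f);              // eq. 5.19
  }
}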
Fig. 5.31. Temporal Texture Templates calculated from k images: (a) k = 1, (b) k = 5, (c) k = 50.
Color Histograms

A color histogram $h(r',g',b')$ is a non-parametric representation of the color distribution inside an object (Fig. 5.32 a). The indices r', g' and b' quantize the (r,g,b) color space, e.g. in 32 intervals per axis. Each histogram bin represents the total number of pixels inside the corresponding color space cube. For a generalized description of a person, the histogram has to be generated from multiple images. The normalized histogram $\hat{h}(r',g',b')$ represents the relative occurrence frequency of each color interval:

$$ \hat{h}(r',g',b') = \frac{h(r',g',b')}{\sum_{\tilde{r}} \sum_{\tilde{g}} \sum_{\tilde{b}} h(\tilde{r},\tilde{g},\tilde{b})} \qquad (5.21) $$

To compare a person candidate to this description, a normalized color histogram of the corresponding foreground region $\hat{h}_k(r',g',b')$ is created, and the difference between both histograms is calculated:

$$ d_{H_{n,k}} = \sum_{r'} \sum_{g'} \sum_{b'} \left| \hat{h}_n(r',g',b') - \hat{h}_k(r',g',b') \right| \qquad (5.22) $$

An alternative method to compare two normalized histograms is the “histogram intersection” that was introduced by Swain and Ballard [46]. The higher the similarity between two histograms, the greater is the sum $s_{H_{n,k}}$:

$$ s_{H_{n,k}} = \sum_{r'} \sum_{g'} \sum_{b'} \min\left\{ \hat{h}_n(r',g',b'),\, \hat{h}_k(r',g',b') \right\} \qquad (5.23) $$
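Both comparison measures can be computed in a few lines on histograms that are already normalized according to eq. 5.21; the function names are illustrative.

#include <vector>
#include <cmath>
#include <algorithm>
#include <cstddef>

// L1 difference (eq. 5.22) between two normalized histograms of equal size.
double histogramDifference(const std::vector<double>& hn,
                           const std::vector<double>& hk)
{
  double d = 0.0;
  for (std::size_t i = 0; i < hn.size(); ++i) d += std::fabs(hn[i] - hk[i]);
  return d;
}

// Histogram intersection (eq. 5.23); larger values mean higher similarity.
double histogramIntersection(const std::vector<double>& hn,
                             const std::vector<double>& hk)
{
  double s = 0.0;
  for (std::size_t i = 0; i < hn.size(); ++i) s += std::min(hn[i], hk[i]);
  return s;
}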
Instead of the (r,g,b)-space, other color spaces are often used that are less sensitive regarding illumination changes (see Appendix A.5.3). Another expansion of this principle uses multiple local histograms instead of a global one to include spatial information in the description.

Gaussian Mixture Models

A Gaussian Mixture Model is a parametric representation of the color distribution of an image. The image is clustered into N regions of similar color (“blobs”), e.g.
using the Expectation Maximization (EM) algorithm [15]. Each v-dimensional blob i = 1 . . . N is described as a Gaussian distribution by its mean $\mu_i$ and covariance matrix $K_i$, either in the three-dimensional color space (c = (r, g, b), v = 3) or in the five-dimensional combined feature space of color and normalized image coordinates ($c = (r, g, b, \tilde{x}, \tilde{y})$, v = 5):

$$ p_i(c) = \frac{1}{(2\pi)^{v/2} \sqrt{\det K_i}} \cdot \exp\left( -\frac{1}{2} (c - \mu_i)^T K_i^{-1} (c - \mu_i) \right) \qquad (5.24) $$

The Mahalanobis distance between a pixel value c and the i-th blob is given as:

$$ \Delta_M(c, i) = (c - \mu_i)^T K_i^{-1} (c - \mu_i) \qquad (5.25) $$

For less calculation complexity, the uncorrelated Gaussian distribution is often used instead, see eqs. 5.8 . . . 5.10. The difference $d_{G_{n,k}}$ between the description of the n-th person and an image region $M_k(x,y)$ is calculated as the average minimal Mahalanobis distance between each image pixel c(x, y) inside the mask and the set of Gaussians:

$$ d_{G_{n,k}} = \frac{\sum_x \sum_y M_k(x,y)\, \min_{i_n = 1 \ldots N_n} \Delta_M(c(x,y), i_n)}{\sum_x \sum_y M_k(x,y)} \qquad (5.26) $$

While a parametric model is a compact representation of the color distribution of a person, it is in most cases only a rough approximation of the actual appearance. On the other hand, this results in high generalization capabilities, so that it is often sufficient to create the model from one image frame only. A difficult question, however, is to choose the right number of clusters for optimal approximation of the color density function.

Kernel Density Estimation

Kernel density estimation combines the advantages of both parametric Gaussian modeling and non-parametric histogram descriptions, and allows the approximation of any color distribution by a smooth density function. Given a set of N randomly chosen sample pixels $c_i$, i = 1 . . . N, the estimated color probability density function $p_K(c)$ is given by the sum of kernel functions which are centered at every color sample:

$$ p_K(c) = \frac{1}{N} \sum_{i=1}^{N} K(c - c_i) \qquad (5.27) $$

Multivariate Gaussians (eq. 5.24) are often used as an appropriate kernel. The variances of the individual axes of the chosen feature space have to be defined a priori. This offers the possibility to weigh the axes individually, e.g. with the purpose to increase the stability towards illumination changes. Fig. 5.32 illustrates the principle of kernel density estimation in comparison to histogram representation. To measure the similarity between a person described by kernel density estimation and a segmented image region, the average color probability according to eq. 5.27 is calculated over all pixels in that region.
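A sketch of the kernel density estimate of eq. 5.27 with an uncorrelated Gaussian kernel in (r,g,b) space is shown below; the a-priori per-axis standard deviations and all names are illustrative assumptions.

#include <vector>
#include <array>
#include <cmath>
#include <cstddef>

// Kernel density estimate of eq. 5.27 for a 3-channel color value, using a
// diagonal-covariance Gaussian kernel centered at every stored color sample.
double kernelDensity(const std::array<float, 3>& c,
                     const std::vector<std::array<float, 3>>& samples,
                     const std::array<float, 3>& sigma)  // a-priori std. deviations
{
  if (samples.empty()) return 0.0;
  const double pi = 3.14159265358979323846;
  double norm = 1.0;
  for (int k = 0; k < 3; ++k) norm *= std::sqrt(2.0 * pi) * sigma[k];
  double p = 0.0;
  for (const auto& s : samples) {
    double e = 0.0;
    for (int k = 0; k < 3; ++k) {
      const double d = (c[k] - s[k]) / sigma[k];
      e += d * d;
    }
    p += std::exp(-0.5 * e) / norm;  // one Gaussian kernel per sample
  }
  return p / static_cast<double>(samples.size());
}

Averaging this density over all pixels of a segmented region yields the similarity measure between the region and the stored person description.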
Fig. 5.32. Comparison of probability density approximation by histograms (a) and kernel density estimation (b).
5.3.2.2 Combined Tracking and Detection

In contrast to the tracking principle presented in the preceding section, the tracked persons here are not assigned to previously detected person candidates; instead, the predicted positions of their silhouettes are iteratively matched to image regions (Fig. 5.33). For robust matching, the initial positions have to be as close as possible to the target coordinates; therefore, high frame rates and fast algorithms are preferable. After the positions of all tracked people are identified, new persons entering the camera field of view are detected by analyzing the remaining foreground regions as explained in the previous section.
Fig. 5.33. Combined tracking and detection by iterative model adaptation: (a) initialization, (b) adaptation.
This approach to tracking can be split up into two different types of algorithms, depending on the chosen person modeling. The first type, shape-based tracking, aligns a model of the human shape to the segmented foreground mask, while the second type, color-based tracking, works directly with the camera image and combines segmentation and tracking by using a color description of each person. Both approaches are discussed in the following.
Shape-based Tracking

The silhouette adaptation process is initialized with the predicted image position of a person's silhouette, which can be represented either by the parameters of the surrounding bounding box (center coordinates, width and height) or by the parameters of a 2D model of the human shape. The simplest model is an average human silhouette that is shifted and scaled to approximate a person's silhouette in the image. If the frame rate is not too low, the initial image area overlaps with the segmented foreground region that represents the current silhouette of that person. The task of the adaptation algorithm is to update the model parameters to find the best match between the silhouette representation (bounding box, 2D model) and the foreground region.

Mean Shift Tracking

Using a bounding box representation, the corresponding rectangular region can be regarded as a search window that has to be shifted towards the local maximum of the density of foreground pixels. That density is approximated by the number of foreground pixels inside the search window at the current position. The Mean Shift Algorithm is an iterative gradient climbing algorithm to find the closest local density maximum [9, 11]. In each iteration i, the mean shift vector $\mathbf{dx}(i) = (dx(i), dy(i))$ is calculated as the vector pointing from the center $(x_m, y_m)$ of the search window (with width w and height h) to the center of gravity (COG) of all included data points. In addition, each data point can be weighted by a kernel function k(x, y) centered at the middle of the search window, e.g. a Gaussian kernel with $\sigma_x \sim w$ and $\sigma_y \sim h$:

$$ k(x,y) = \frac{1}{2\pi\sigma_x\sigma_y} \cdot \exp\left( -\frac{1}{2} \left( \frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2} \right) \right) \qquad (5.28) $$
As a result, less importance is given to the outer points of a silhouette, which include high variability and therefore unreliability due to arm and leg movements. Using the segmented foreground mask M(x, y) as source data, the mean shift vector $\mathbf{dx}(i)$ is calculated as

$$ \mathbf{dx}(i) = \frac{\sum_{y=-h/2}^{h/2} \sum_{x=-w/2}^{w/2} \binom{x}{y}\, M(x_m(i-1)+x,\, y_m(i-1)+y)\, k(x,y)}{\sum_{y=-h/2}^{h/2} \sum_{x=-w/2}^{w/2} M(x_m(i-1)+x,\, y_m(i-1)+y)\, k(x,y)} \qquad (5.29) $$

The position of the search window is then updated with the mean shift vector:

$$ \mathbf{x}_m(i) = \mathbf{x}_m(i-1) + \mathbf{dx}(i) \qquad (5.30) $$

This iteration is repeated until it converges, i.e. $\|\mathbf{dx}(i)\| < \Theta$, with Θ being a small threshold. Since the silhouette grows or shrinks if a person moves forwards or backwards in the scene, the size of the search window has to be adapted accordingly.
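One mean-shift update of the window center on the binary foreground mask (eqs. 5.28–5.30) might look as follows; the concrete kernel widths and the function name are illustrative choices, and the window resizing discussed next is omitted.

#include <vector>
#include <cmath>

// Shifts the window center (xm, ym) towards the kernel-weighted center of
// gravity of the foreground pixels inside the w x h search window and returns
// the length of the shift, so the caller can iterate until it falls below a
// small threshold. The kernel normalization constant cancels in the ratio.
double meanShiftStep(const std::vector<bool>& mask, int width, int height,
                     double& xm, double& ym, int w, int h)
{
  const double sx = 0.5 * w, sy = 0.5 * h;  // sigma_x ~ w, sigma_y ~ h (a choice)
  double sumW = 0.0, sumX = 0.0, sumY = 0.0;
  for (int y = -h / 2; y <= h / 2; ++y)
    for (int x = -w / 2; x <= w / 2; ++x) {
      const int xi = static_cast<int>(xm) + x, yi = static_cast<int>(ym) + y;
      if (xi < 0 || yi < 0 || xi >= width || yi >= height) continue;
      if (!mask[yi * width + xi]) continue;
      const double k = std::exp(-0.5 * (x * x / (sx * sx) + y * y / (sy * sy)));
      sumW += k; sumX += k * x; sumY += k * y;
    }
  if (sumW <= 0.0) return 0.0;              // no foreground inside the window
  const double dx = sumX / sumW, dy = sumY / sumW;
  xm += dx; ym += dy;                       // eq. 5.30
  return std::sqrt(dx * dx + dy * dy);
}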
To this end, the rectangle is resized after the last iteration according to the size of the enclosed foreground area (CAMShift - Continuously Adaptive Mean Shift Algorithm [7]):

$$ \tilde{w} \cdot \tilde{h} = C \cdot \sum_{y=-h/2}^{h/2} \sum_{x=-w/2}^{w/2} M(x_m(i)+x,\, y_m(i)+y)\, k(x,y) \qquad (5.31) $$
with C and the width-to-height ratio w/h being user-defined constants. An alternative method is to calculate the standard deviations $\sigma_x$ and $\sigma_y$ of the enclosed foreground and set the width and height to proportional values. One advantage of this method over tracking following detection is the higher stability in case of bad foreground segmentation: Where an advanced person detection algorithm would be necessary to recognize a group of isolated small foreground regions as one human silhouette, the mean shift algorithm solves this problem automatically by adapting to the density maximum.

Linear Regression Tracking

For further improvement of tracking stability, knowledge about the human shape is incorporated into the system. In the following, an average human silhouette is used for simplicity, of which the parameters are x- and y-position, scale and width-to-height ratio (Fig. 5.34 a). The task is to find the optimal combination of model parameters that minimizes the difference between the model and the target region.
Fig. 5.34. Iterative alignment (c) of a shape model (a) to the segmented foreground (b) using linear regression.
One approach to solve this problem is to define a measure for the matching error (e.g. the sum of all pixel differences) and to find the minimum of the error function in parameter space using global search techniques or gradient-descent algorithms. These approaches, however, disregard the fact that the high-dimensional vector of pixel differences between the current model position and the foreground region contains valuable information about how to adjust the parameters. For example, the locations of pixel differences in Fig. 5.34 c between the average silhouette and the
foreground region correspond to downscaling and shifting the model to the right. Expressed mathematically, the task is to find a transformation between two vector spaces: the w ∗ h-dimensional space of image differences $\mathbf{d}_i$ (by concatenating all rows of the difference image into one vector) and the space of model parameter changes $\mathbf{d}_p$, which has 4 dimensions here. One solution to this task is provided by linear regression, resulting in the regression matrix R as a linear approximation of the correspondence between the two vector spaces:

$$ \mathbf{d}_p = \mathbf{R}\, \mathbf{d}_i \qquad (5.32) $$
The matching algorithm works iteratively. In each iteration, the pixel difference vector $\mathbf{d}_i$ is calculated using the current model parameters, which are then updated using the parameter change resulting from eq. 5.32. To calculate the regression matrix R, a set of training data has to be collected in advance by randomly varying the model parameters and calculating the difference vector between the original and the shifted and scaled shape. All training data is written into two matrices $\mathbf{D}_i$ and $\mathbf{D}_p$, of which the corresponding columns are the difference vectors $\mathbf{d}_i$ or the parameter changes $\mathbf{d}_p$ respectively. The regression matrix is then calculated as

$$ \mathbf{R} = \mathbf{D}_p \boldsymbol{\Phi} \left( \boldsymbol{\Lambda}^{-1} \boldsymbol{\Phi}^T \mathbf{D}_i^T \right) \qquad (5.33) $$

where the columns of matrix Φ are the first m eigenvectors of $\mathbf{D}_i^T \mathbf{D}_i$ and Λ is the diagonal matrix of the corresponding eigenvalues λ. A derivation of this equation with application to human face modeling can be found in [12]. Models of the human shape increase the overall robustness of the tracking system, because they cause the system to adapt to similar foreground areas instead of tracking arbitrarily shaped regions. Furthermore, they improve people separation during overlap (Sect. 5.3.3).

Color-based Tracking

In this type of tracking algorithms, appearance models of all tracked persons in the image are used for the position update. Therefore no previous segmentation of the image into background and foreground regions is necessary. Instead, the segmentation and tracking steps are combined. Using a background model (Sect. 5.3.1) and the color descriptions of the persons at the predicted positions, each pixel is classified either as background, as belonging to a specific person, or as unrecognized foreground. An example is the Pfinder (“Person finder”) algorithm [56]. Here, each person n is represented by a mixture of m multivariate Gaussians $G_{n,i}(c)$, i = 1 . . . m, with c = (r, g, b, x, y) being the combined feature space of color and image coordinates (Sect. 5.3.2.1). For each image pixel c(x, y), the Mahalanobis distances $\Delta_{M,B}$ to the background model and $\Delta_{M,n,i}$ to each Gaussian are calculated according to eq. 5.10 or 5.25, respectively. A pixel is then classified using the following equation, where θ is an appropriate threshold:
$$ \Delta_{M,\min} = \min_{n,i} \left\{ \Delta_{M,B},\, \Delta_{M,n,i} \right\} \qquad (5.34) $$

$$ c(x,y) = \begin{cases} \text{unrecognized foreground} & \text{if } \Delta_{M,\min} \geq \theta \\ \text{background} & \text{if } \Delta_{M,B} = \Delta_{M,\min} \text{ and } \Delta_{M,B} < \theta \\ \text{person } n,\ \text{cluster } i & \text{if } \Delta_{M,n,i} = \Delta_{M,\min} \text{ and } \Delta_{M,n,i} < \theta \end{cases} \qquad (5.35) $$

The set of pixels assigned to each cluster is used to update the mean and variance of the respective Gaussian. The average translation of all clusters of one person defines the total movement of that person. The above algorithm is repeated iteratively until it converges.
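The per-pixel decision of eqs. 5.34 and 5.35 can be sketched as follows, assuming that the Mahalanobis distances to the background model and to every person cluster have already been computed (eqs. 5.10 and 5.25); the label structure is an illustrative assumption.

#include <vector>
#include <cstddef>

struct PixelLabel {
  int person  = -1;   // index of the assigned person, -1 if none
  int cluster = -1;   // index of the assigned Gaussian cluster, -1 if none
  bool background = false;
  bool unrecognizedForeground = false;
};

// distPerson[n][i] corresponds to Delta_{M,n,i}; distBackground to Delta_{M,B}.
PixelLabel classifyPixel(double distBackground,
                         const std::vector<std::vector<double>>& distPerson,
                         double theta)
{
  double dMin = distBackground;
  int bestN = -1, bestI = -1;
  for (std::size_t n = 0; n < distPerson.size(); ++n)
    for (std::size_t i = 0; i < distPerson[n].size(); ++i)
      if (distPerson[n][i] < dMin) {
        dMin = distPerson[n][i];
        bestN = static_cast<int>(n);
        bestI = static_cast<int>(i);
      }
  PixelLabel label;
  if (dMin >= theta)  label.unrecognizedForeground = true;   // eq. 5.35, first case
  else if (bestN < 0) label.background = true;               // background is closest
  else { label.person = bestN; label.cluster = bestI; }      // person n, cluster i
  return label;
}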
Fig. 5.35. Image segmentation with fixed thresholds or expected color distributions: (a) source image, (b) segmentation with a high threshold, (c) segmentation with a low threshold, (d) expectation-based segmentation.
One problem of common background subtraction (Sect. 5.3.1) is to find a global threshold that minimizes the segmentation error (see Sect. 2.1.2.1: optimal thresholding algorithm in Fig. 2.6), because people wear clothes whose colors are more or less similar to different background regions (Fig. 5.35 b,c). In theory, the optimal decision border between two Gaussians is defined by equal Mahalanobis distances to both distributions. The same principle is applied here to improve image segmentation (Fig. 5.35 d). An additional advantage of this method is the possibility to include the separation of people during occlusion in the segmentation process (Sect. 5.3.3.3). In a similar way, the method can be applied with other appearance models, e.g. with kernel density estimation (Sect. 5.3.2.1).
5.3.3 Occlusion Handling

In the previous sections, several methods have been presented to track people as isolated regions in the image plane. In real situations, however, people form groups or pass each other at different distances from the camera. Depending on the monitored scene and the camera perspective, it is more or less likely that the corresponding silhouettes in the image plane merge into one foreground region. How to deal with these situations is an elementary question in the development of a tracking system. The answer depends largely on the application area and the system requirements. In the surveillance of a parking lot from a high and distant camera position, it is sufficient to detect when and where people merge into a group and, after the group splits again, to continue tracking them independently (Sect. 5.3.3.1). Separate person tracking during occlusion is neither necessary nor feasible here due to the small image regions the persons occupy and the relatively small resulting position error. A counter-example is automatic video surveillance in public transport, like a subway wagon, where people partially occlude each other permanently. In complex situations like that, elaborate methods that combine person separation by shape, color and 3D information (Sect. 5.3.3.2 - 5.3.3.3) are required for robust tracking.

5.3.3.1 Occlusion Handling without Separation

As mentioned above, there are many surveillance tasks where inter-person occlusions are not frequent and therefore a tracking logic that analyzes the merging and splitting of foreground regions suffices. This method is applied particularly in tracking following detection systems as presented in Sect. 5.3.2.1 [24, 32]. Merging of two persons in the image plane is detected if both are assigned to the same foreground region. In the following frames, both are tracked as one group object until it splits up again or merges with other persons. The first case occurs if two or more foreground regions are assigned to the group object. The separate image regions are then identified by comparing them to the appearance models of the persons in question. This basic principle can be extended by a number of heuristic rules to cope with potential combinations of appearing/disappearing and splitting/merging regions. The example in Sect. 5.3.5 describes a tracking system of this type in detail.

5.3.3.2 Separation by Shape

When the silhouettes of overlapping people merge into one foreground region, the resulting shape contains information about how the people are arranged (Fig. 5.36). There are multiple ways to take advantage of that information, depending on the type of tracking algorithm used. The people detection method in tracking following detection systems (Sect. 5.3.2.1) can be improved by further analyzing the shape of each connected foreground region to derive a hypothesis about the number of persons it is composed of. A common approach is the detection and counting of head shapes that extend from the foreground blob (Fig. 5.36 b). Possible methods are the detection of local peaks in combination with local maxima of the extension in y-direction [59] or the analysis of the curvature of the foreground region's top boundary [23]. Since these methods
Fig. 5.36. Shape-based silhouette separation: (b) by head detection, (c) by silhouette alignment.
will fail in situations where people are standing directly behind each other, so that not all of the heads are visible, the results are primarily used to provide additional information for the tracking logic or for subsequent processing steps. A more advanced approach uses models of the human shape as introduced in Sect. 5.3.2.2. Starting with the predicted positions, the goal is to find an arrangement of the individual human silhouettes that minimizes the difference to the respective foreground region. Using the linear regression tracking algorithm presented above, the following principle causes the silhouettes to align only with the valid outer boundaries and to ignore the areas of overlap: in each iteration, the values of the difference vector di are set to zero at foreground points that lie inside the bounding box of another person (gray region in Fig. 5.36 c). As a result, the parameter update is calculated only from the remaining non-overlapping parts. If a person is completely occluded, the difference vector and therefore also the translation is zero, so that the predicted position is used as the final position estimate. Tracking during occlusion using only the shape of the foreground region requires a good initialization to converge to the correct positions. Furthermore, there is often more than one possible arrangement of the overlapping persons. Tracking robustness can be increased by combining this approach with separation by color or 3D information.

5.3.3.3 Separation by Color

The best results in tracking individual people during occlusions are achieved if their silhouettes are separated from each other using appearance models as described in Sect. 5.3.2.2. In contrast to the previous methods, which work entirely in the 2D image plane, reasoning about the relative depth of the occluding people is necessary here to cope with the fact that people in the back are only partially visible, which affects the position calculation. Knowing the order of the overlapping persons from front to back, in combination with their individually segmented image regions, a more advanced version of the shape-based tracking algorithm from Sect. 5.3.3.2 can be implemented that adapts only to the visible parts of a person and ignores the
corresponding occluded areas, i.e. image areas assigned to persons closer to the camera. The relative depth positions can be derived either by finding the person arrangement that maximizes the overall color similarity [17], by using 3D information, or by tracking people on the ground plane of the scene [18, 34].

5.3.3.4 Separation by 3D Information

While people occluding each other share the same 2D region in the image plane, they are positioned at different depths in the scene. Additional depth information therefore allows the combined foreground region to be split into areas of similar depth, which can then be assigned to the people in question using one of the above methods (e.g. color descriptions). One possibility to measure the depth of each image pixel is the use of a stereo camera [25]. It consists of a pair of identical cameras attached side by side. An image processing module searches the resulting image pair for corresponding feature points, whose depth is calculated from their horizontal offset. Drawbacks of this method are that the depth resolution is often coarse and that feature point detection in uniformly colored areas is unstable. An extension of the stereo principle is the use of multiple cameras that look at the scene from different viewpoints. All cameras are calibrated to transform image coordinates into 3D scene coordinates and vice versa, as explained in the following section. People are tracked on the ground plane, taking advantage of the least occluded view of each person. Such multi-camera systems are the most favored approach to people tracking in narrow indoor scenes [3, 34].

5.3.4 Localization and Tracking on the Ground Plane

In the preceding sections, various methods have been presented to track and describe people as flat regions in the 2D image plane. These systems provide the information where the tracked persons are located in the camera image. However, many applications need to know their location and trajectory in the real 3D scene instead, e.g., to evaluate their behaviour in sensitive areas or to merge the views from multiple cameras. Additionally, it has been shown that knowledge about the depth positions in the scene improves the segmentation of overlapping persons (Sect. 5.3.3.3). A mathematical model of the imaging process is used to transform 3D world coordinates into the image plane and vice versa. To this end, prior knowledge of the extrinsic (camera position, height, tilt angle) and intrinsic camera parameters (focal length, opening angle, potential lens distortions) is necessary. The equations derived in the following use so-called homogeneous coordinates to denote points in the world or image space:

\[
\mathbf{x} = \begin{pmatrix} wx \\ wy \\ wz \\ w \end{pmatrix} \tag{5.36}
\]
The homogeneous coordinate representation extends the original coordinates (x, y, z) by a scaling factor w, resulting in an infinite number of possible descriptions of each 3D point. The concept of homogeneous coordinates is basically a mathematical trick to represent fractional-linear (projective) transformations by linear matrix operations. Furthermore, it enables the computer to perform calculations with points lying at infinity (w = 0). A coordinate transformation with homogeneous coordinates has the general form x̃ = Mx, with M being the transformation matrix:

\[
\mathbf{M} = \begin{pmatrix}
r_{11} & r_{12} & r_{13} & x_\Delta \\
r_{21} & r_{22} & r_{23} & y_\Delta \\
r_{31} & r_{32} & r_{33} & z_\Delta \\
\frac{1}{d_x} & \frac{1}{d_y} & \frac{1}{d_z} & \frac{1}{s}
\end{pmatrix} \tag{5.37}
\]

The coefficients r_ij denote the rotation of the coordinate system, (x_Δ, y_Δ, z_Δ) the translation, (1/d_x, 1/d_y, 1/d_z) the perspective distortion, and s the scaling.
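As a small illustration of these operations, the following self-contained C++ fragment applies a 4x4 homogeneous transformation matrix to a point and converts the result back to Cartesian coordinates by dividing by w (cf. Eqs. 5.36 and 5.37). It is a generic sketch, not code taken from the LTI-Lib.

#include <array>

using Vec4 = std::array<double, 4>;
using Mat4 = std::array<std::array<double, 4>, 4>;

// x_tilde = M * x for homogeneous column vectors, cf. Eq. (5.37).
Vec4 transform(const Mat4& M, const Vec4& x) {
    Vec4 r{0.0, 0.0, 0.0, 0.0};
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            r[i] += M[i][j] * x[j];
    return r;
}

// Convert a homogeneous point (wx, wy, wz, w) back to Cartesian (x, y, z).
// Assumes w != 0, i.e. the point does not lie at infinity.
std::array<double, 3> dehomogenize(const Vec4& p) {
    return { p[0] / p[3], p[1] / p[3], p[2] / p[3] };
}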
Fig. 5.37. Elevated and tilted pinhole camera model.
Most cameras can be approximated by the pinhole camera model (Fig. 5.37). In case of pincushion or other image distortions caused by the optical system, additional normalization is necessary. The projection of 3D scene coordinates into the image plane of a pinhole camera located in the origin is given by the transformation matrix MH:

\[
\mathbf{M}_H = \begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & -\frac{1}{D_C} & 0
\end{pmatrix} \tag{5.38}
\]

DC denotes the focal length of the camera model. Since the image coordinates are measured in pixel units while other measures are used for the scene positions (e.g.
cm), measurement conversion is included in the coordinate transformation by expressing DC in pixel units, using the width or height resolution rx or ry of the camera image together with the respective opening angles θx or θy:

\[
D_C = \frac{r_x}{2\tan(\theta_x/2)} = \frac{r_y}{2\tan(\theta_y/2)} \tag{5.39}
\]

In a typical set-up, the camera is mounted at a certain height y = HC above the ground and tilted by an angle α. The corresponding transformation matrices, MR for the rotation around the x-axis and MT for the translation in the already rotated coordinate system, are given by:

\[
\mathbf{M}_R = \begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & \cos\alpha & \sin\alpha & 0 \\
0 & -\sin\alpha & \cos\alpha & 0 \\
0 & 0 & 0 & 1
\end{pmatrix};\qquad
\mathbf{M}_T = \begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & -H_C\cos\alpha \\
0 & 0 & 1 & H_C\sin\alpha \\
0 & 0 & 0 & 1
\end{pmatrix} \tag{5.40}
\]

The total transformation matrix M results from the concatenation of the three transformations:

\[
\mathbf{M} = \mathbf{M}_H\mathbf{M}_T\mathbf{M}_R = \begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & \cos\alpha & \sin\alpha & -H_C\cos\alpha \\
0 & -\sin\alpha & \cos\alpha & H_C\sin\alpha \\
0 & \frac{\sin\alpha}{D_C} & -\frac{\cos\alpha}{D_C} & -\frac{H_C\sin\alpha}{D_C}
\end{pmatrix} \tag{5.41}
\]

The camera coordinates (xC, yC) can now be calculated from the world coordinates (x, y, z) as follows:

\[
\begin{pmatrix} w_C x_C \\ w_C y_C \\ w_C z_C \\ w_C \end{pmatrix}
= \mathbf{M}\begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix}
= \begin{pmatrix}
x \\
y\cos\alpha + z\sin\alpha - H_C\cos\alpha \\
-y\sin\alpha + z\cos\alpha + H_C\sin\alpha \\
\frac{1}{D_C}\,(y\sin\alpha - z\cos\alpha - H_C\sin\alpha)
\end{pmatrix} \tag{5.42}
\]

By dividing the homogeneous coordinates by wC, the final equations for the image coordinates are derived:

\[
x_C = -D_C\,\frac{x}{z\cos\alpha + (H_C - y)\sin\alpha} \tag{5.43}
\]

\[
y_C = D_C\,\frac{(H_C - y)\cos\alpha - z\sin\alpha}{z\cos\alpha + (H_C - y)\sin\alpha} \tag{5.44}
\]
The value of zC is constant and denotes the image plane, zC = −DC. The inverse transformation of the image coordinates into the 3D scene requires prior knowledge of the value in one dimension due to the lower dimensionality of the 2D image. This has to be the height y above the ground, since the floor position (x, z) is unknown. In the analysis of the camera image, the head and feet coordinates of a person can be detected as the extrema of the silhouette or, in a more robust way, with the
help of body models. Therefore the height is either equal to zero or to the body height HP of the person. The body height can be calculated if the y-positions of the head yC,H and of the feet yC,F are detected simultaneously:

\[
H_P = H_C D_C\,\frac{y_{C,F} - y_{C,H}}{(y_{C,F}\cos\alpha + D_C\sin\alpha)(D_C\cos\alpha - y_{C,H}\sin\alpha)} \tag{5.45}
\]
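The complete mapping can be summarized in a short, self-contained C++ sketch that implements Eqs. (5.39) and (5.43)-(5.45) directly; the structure and function names are chosen freely for this illustration.

#include <cmath>

// Parameters of the elevated, tilted pinhole camera model (Fig. 5.37).
struct Camera {
    double Dc;     // focal length in pixel units, Eq. (5.39)
    double Hc;     // mounting height above the ground
    double alpha;  // tilt angle in radians
};

// Focal length in pixels from the image resolution rx and opening angle thetaX.
double focalLengthPixels(double rx, double thetaX) {
    return rx / (2.0 * std::tan(thetaX / 2.0));      // Eq. (5.39)
}

// World point (x, y, z) -> image coordinates (xC, yC), Eqs. (5.43) and (5.44).
void worldToImage(const Camera& c, double x, double y, double z,
                  double& xC, double& yC) {
    const double denom = z * std::cos(c.alpha) + (c.Hc - y) * std::sin(c.alpha);
    xC = -c.Dc * x / denom;
    yC =  c.Dc * ((c.Hc - y) * std::cos(c.alpha) - z * std::sin(c.alpha)) / denom;
}

// Body height from the image y-coordinates of head and feet, Eq. (5.45).
double bodyHeight(const Camera& c, double ycHead, double ycFeet) {
    return c.Hc * c.Dc * (ycFeet - ycHead) /
           ((ycFeet * std::cos(c.alpha) + c.Dc * std::sin(c.alpha)) *
            (c.Dc * std::cos(c.alpha) - ycHead * std::sin(c.alpha)));
}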
Besides improved tracking stability, especially during occlusions [3, 59], tracking on the ground plane also enables the inclusion of additional real-world knowledge into the system, like human walking speed limits, the minimum distance between two persons, or valid ground positions defined by a floor map. To deal with partial occlusions by objects in the scene, additional knowledge about the three-dimensional structure of the monitored area can be used to predict which image regions occlude a person's silhouette and therefore contain no valid information about the true shape of the person in the image plane [18].

5.3.5 Example

In this example, a people tracking system of the type tracking following detection is built and tested using IMPRESARIO and the LTI-Lib (see appendices A and B). As explained in Sect. 5.3.2.1, systems of this type are more appropriate in scenes where inter-person occlusion is rare, e.g. the surveillance of large outdoor areas from a high camera perspective, than in narrow or crowded indoor scenes. This is because the tracked persons need to be separated from each other most of the time to ensure stable tracking results. The general structure of the system as it appears in IMPRESARIO is shown in Fig. 5.38. The image source can be either a live stream from a webcam or a prerecorded image sequence. Included on the book CD are two example sequences, one of an outdoor parking lot scene (in the IMPRESARIO subdirectory ./imageinput/ParkingLot) and one of an indoor scene (directory ./imageinput/IndoorScene). The corresponding IMPRESARIO projects are PeopleTracking Parking.ipg and PeopleTracking Indoor.ipg. The first processing step is the segmentation of moving foreground objects by background subtraction (see Sect. 5.3.1.1). The background model is calculated from the first initialFrameNum frames of the image sequence, so it has to be ensured that these frames show only the empty background scene without any persons or other moving objects. Alternatively, a pre-built background model (created with the IMPRESARIO macro trainBackgroundModel) can be loaded, which must have been created under exactly the same lighting conditions. The automatic adjustment of the camera parameters has to be deactivated to prevent background colors from changing when people with dark or bright clothes enter the scene. After reducing the noise of the resulting foreground mask with a sequence of morphological operations, person candidates are detected using the objectsFromMask macro (see Fig. 2.11 in Sect. 2.1.2.2). The detected image regions are passed on to the peopleTracking macro, where they are used to update the internal list of tracked objects. A tracking logic handles the appearing and disappearing of people in
Fig. 5.38. IMPRESARIO example for people tracking.
the camera field of view as well as the merging and splitting of regions as people occlude each other temporarily in the image plane. Each person is described by a one- or two-dimensional Temporal Texture Template (Sect. 5.3.2.1) for re-identification after an occlusion or, optionally, after re-entering the scene. The tracking result is displayed using the drawTrackingResults macro (Fig. 5.39). The appearance models of individual persons can be visualized with the macro extractTrackingTemplates.
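The Temporal Texture Template itself is defined in Sect. 5.3.2.1; the following self-contained C++ sketch only illustrates one plausible way to maintain such an appearance model, namely exponential blending of each new, size-normalized person image with the stored template using the update ratio that appears below as the parameter Texture Template Weight. It is an illustrative assumption, not the implementation used by the peopleTracking macro.

#include <cmath>
#include <cstddef>
#include <vector>

// Simplified stand-in for a person's appearance model: a fixed-size grid of
// values that is blended with each new observation of the person.
struct TextureTemplate {
    int width = 50, height = 100;                    // cf. "Template Size x/y"
    std::vector<double> data;
    TextureTemplate() : data(static_cast<std::size_t>(width) * height, 0.0) {}

    // Exponential update with the given ratio (e.g. 0.1).
    void update(const std::vector<double>& observation, double weight) {
        for (std::size_t i = 0; i < data.size() && i < observation.size(); ++i)
            data[i] = (1.0 - weight) * data[i] + weight * observation[i];
    }

    // Sum of absolute differences as a simple dissimilarity score, e.g. for
    // re-identification after an occlusion.
    double distance(const std::vector<double>& observation) const {
        double d = 0.0;
        for (std::size_t i = 0; i < data.size() && i < observation.size(); ++i)
            d += std::abs(data[i] - observation[i]);
        return d;
    }
};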
Fig. 5.39. Example output of tracking system (a) before, (b) during and (c) after an overlap.
In the following, all macros are explained in detail and the options for varying their parameters are presented. The parameter values used in the example sequences
are given in parentheses after each description.
Background Subtraction

As presented in Sect. 5.3.1.1, the background model describes each pixel of the background image as a Gaussian color distribution. The additional shadow reduction algorithm assumes that a shadow is a small reduction in image intensity without a significant change in chromaticity. The shadow similarity s of a pixel is calculated as

\[
s = w_s(\Delta I)\,\frac{w_I\,\Delta I - \Delta_c}{w_I\,\Delta I}, \tag{5.46}
\]
where ΔI is the decrease of intensity between the background model and the current image, and Δc is the chromaticity difference, here calculated as the Euclidean distance between the u and v values in the CIELuv color space. The user-defined parameter intensityWeight wI can be used to give more or less weight to the intensity difference, depending on how stable the chromaticity values in the shadows are in the particular scene. To prevent dark foreground colors (e.g. black clothes) from being classified as shadows, the shadow similarity is weighted with a function ws(ΔI) that gives a higher weight to a small decrease in intensity (Fig. 5.40). This function is defined by the parameters minShadow and maxShadow. If a segmented foreground pixel has a shadow similarity s that is greater than the user-defined threshold shadowThreshold, the pixel is classified as background.
Fig. 5.40. Shadow similarity scaling factor ws as a function of the decrease of intensity ΔI .
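The shadow test of Eq. (5.46) can be written down directly in C++; the piecewise-linear shape of ws below is an assumption based on Fig. 5.40, since only the outline of the curve is shown.

// Weighting function ws(dI) of Fig. 5.40 (assumed piecewise linear): small
// intensity decreases get full weight, large decreases (more likely a dark
// foreground object) get none.
double shadowWeight(double dI, double minShadow, double maxShadow) {
    if (dI <= minShadow) return 1.0;
    if (dI >= maxShadow) return 0.0;
    return (maxShadow - dI) / (maxShadow - minShadow);
}

// Shadow similarity of a foreground pixel, Eq. (5.46).
// dI: intensity decrease w.r.t. the background model, dC: chromaticity
// difference (Euclidean distance of the u,v components in CIELuv),
// wI: intensityWeight.
double shadowSimilarity(double dI, double dC, double wI,
                        double minShadow, double maxShadow) {
    if (dI <= 0.0) return 0.0;   // guard: no intensity decrease, no shadow
    return shadowWeight(dI, minShadow, maxShadow) * (wI * dI - dC) / (wI * dI);
}

// A segmented foreground pixel is re-classified as background (shadow) if
// shadowSimilarity(...) exceeds shadowThreshold.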
The macro has the following parameters:

load background model: True: load the background model from file (created with the macro trainBackgroundModel); false: create the background model from the first initFrameNum frames of the image sequence. These frames have to show the empty background scene, without any moving persons or objects. (False)
model file: Filename to load the background model from if load background model is true.
initFrameNum: Number of first frames to create the background model from if load background model is false. (30)
threshold: Threshold for the Mahalanobis distance in color space to decide between foreground object and background. (30.0)
use adaptive model: True: update the background model in each frame at the background pixels; false: use a static background model for the entire sequence. (True)
adaptivity: Number of past images that define the background model. A new image is weighted with 1/adaptivity if the current frame number is lower than adaptivity, or with 1/frame number otherwise. "0" stands for an infinite number of frames, i.e. the background model is the average of all past frames. (100)
apply shadow reduction: True: reduce the effect of classifying shadows as foreground object. (True)
minShadow: Shadow reduction parameter, see explanation above. (0.02)
maxShadow: Shadow reduction parameter. (0.2)
shadowThreshold: Shadow reduction parameter. (0.1)
intensityWeight: Shadow reduction parameter. (3.0)

Morphological Operations

This macro applies a sequence of morphological operations like dilation, erosion, opening and closing to a binary or grayscale image as presented in Sect. 5.3.1.2. Input and output are always of the same data type, either a lti::channel8 or a lti::channel. The parameters are:

Morph. Op. Sequence: std::string that defines the sequence of morphological operations using the following abbreviations: "e" = erosion, "d" = dilation, "o" = opening, "c" = closing. ("ddeeed")
Mode: Process the input channel in "binary" or "gray" mode. ("binary")
Kernel type: Type of structure element used. (square)
Kernel size: Size of the structure element (3,5,7,...). (3)
apply clipping: Clip values that lie outside the valid range (only used in gray mode). (False)
filter strength: Factor for kernel values (only used in gray mode). (1)

ObjectsFromMask

This macro detects connected regions in the foreground mask.

Threshold: Threshold to separate between foreground and background values. (100)
Level: Iteration level to describe the hierarchy of objects and holes. (0)
Assume labeled mask: Is the input a labeled foreground mask (each region identified by a number)? (False)
Melt holes: Close holes in the object description (not necessary when Level=0). (False)
Minimum size: Minimum size of detected objects in pixels. (60)
Sort Objects: Sort objects according to Sort by Area. (False)
Sort by Area: Sort objects by their area size (true) or the size of the data structure (false). (False)

PeopleTracker

The tracking macro is a tracking following detection system (see Sect. 5.3.2.1) based on a merge/split logic (Sect. 5.3.3.1). The internal state of the system, which is also the output of the macro, is represented by an instance of the class trackedObjects (Fig. 5.41). This structure includes two lists: the first one is a list of the individual objects (here: persons), each described by a one-
Fig. 5.41. Structure of class trackedObjects. One or more individual objects (persons) are assigned to one visible foreground blob. Gray not-assigned objects have left the scene, but are kept for re-identification.
or two-dimensional Temporal Texture Template. The second list includes all current foreground regions ("blobs"), each represented by lti::areaPoints, a bounding box and a Kalman filter to predict the movement from the past trajectory. One or, in case of inter-person overlap, multiple individual objects are assigned to one visible blob. In each image frame, this structure is updated by comparing the predicted positions of the foreground blobs in the visibleBlob list with the newly detected foreground regions. The following cases are recognized (see Fig. 5.42):
Fig. 5.42. Possible cases during tracking. Dashed line = predicted bounding box of tracked blob, solid line and gray rectangle = bounding box of new detected foreground region.
a) Old blob does not overlap with new blob: The blob is deleted from the visibleBlob list, the assigned persons are deleted or deactivated in the individualObjects list.
b) One old blob overlaps with one new blob: Clear tracking case, the position of the blob is updated.
c) Multiple old blobs overlap with one new blob: Merge case; the new blob is added to the visibleBlob list, the old blobs are deleted, and all included persons are assigned to the new blob.
d) One old blob overlaps with multiple new blobs: To solve the ambiguity, each person included in the blob is assigned to one of the new blobs using the Temporal Texture Template description. Each assigned new blob is added to the visibleBlob list, the old blob is deleted. In the special case of only one included person and multiple overlapping blobs which do not overlap with other old blobs, a "false split" is assumed. This can occur if the segmented silhouette of a person consists of separate regions due to similar colors in the background. In this case, the separate regions are united to update the blob position.
e) A new blob does not overlap with any old blob: The blob is added to the visibleBlob list if its width-to-height ratio is inside the valid range for a person. If the parameter memory is true, the system tries to re-identify the person; otherwise a new person is added to the individualObjects list.

(A condensed code sketch of this merge/split logic follows the parameter descriptions below.)

The macro uses the following parameters:

Memory: True: persons that have left the field of view are not deleted but used for re-identification whenever a new person is detected. (True)
Threshold: Similarity threshold to re-identify a person. (100)
Use Texture Templates: True: use Temporal Texture Templates to identify a person; false: assign persons using only the distance between the predicted and detected position. (True)
Texture Template Weight: Update ratio for Temporal Texture Templates. (0.1)
Template Size x/y: Size of the Temporal Texture Template in pixels. (x=50, y=100)
Average x/y Values: Use a 2D Temporal Texture Template or use average values in x- or y-direction. (x: True, y: False)
Intensity weight: Weight of the intensity difference in contrast to the color difference when comparing templates. (0.3)
UseKalmanPrediction: True: use a Kalman filter to predict the bounding box positions in the next frame; false: use the old positions in the new frame. (True)
Min/Max width-to-height ratio: Valid range of the width-to-height ratio of newly detected objects. (Min = 0.25, Max = 0.5)
Text output: Write process information to the console output window. (False)

DrawTrackingResults

This macro draws the tracking result into the source image.

Text: Label objects. (True)
Draw Prediction: Additionally draw the predicted bounding box. (False)
Color: Choose clearly visible colors for each single object and for merged objects.

ExtractTrackingTemplates

This macro can be used to display the appearance model of a chosen person from the trackedObjects class.

Object ID: ID number of the tracked object (person) to display.
Display Numbers: True: display the ID number.
Color of the numbers: Color of the displayed ID number.
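The announced sketch of the merge/split logic is given below in self-contained C++. It is strongly simplified: bounding-box overlap replaces the Kalman prediction, and the appearance-based person-to-blob assignment of case d as well as the re-identification and ratio test of case e are only hinted at in comments.

#include <cstddef>
#include <utility>
#include <vector>

// Axis-aligned bounding box of a foreground region.
struct Box {
    int x0, y0, x1, y1;
    bool overlaps(const Box& o) const {
        return x0 <= o.x1 && o.x0 <= x1 && y0 <= o.y1 && o.y0 <= y1;
    }
};

struct Person { int id; };                 // appearance model omitted here

struct Blob {                              // one visible foreground region
    Box predicted;                         // predicted (or last) bounding box
    std::vector<Person> persons;           // persons assigned to this blob
};

// One update step of the merge/split logic (cases a-e).
void updateTracking(std::vector<Blob>& tracked, const std::vector<Box>& detected,
                    int& nextPersonId) {
    std::vector<Blob> next(detected.size());
    for (std::size_t i = 0; i < detected.size(); ++i) next[i].predicted = detected[i];

    for (const Blob& old : tracked) {
        int match = -1;
        for (std::size_t i = 0; i < detected.size(); ++i)
            if (old.predicted.overlaps(detected[i])) { match = static_cast<int>(i); break; }
        if (match < 0) continue;           // case a: persons deleted or deactivated
        // cases b/c: transfer the persons to the overlapping new blob. Case d
        // (one old blob, several new blobs) would instead distribute the persons
        // among the new blobs using their Temporal Texture Templates.
        next[match].persons.insert(next[match].persons.end(),
                                   old.persons.begin(), old.persons.end());
    }
    for (Blob& b : next) {
        if (b.persons.empty()) {
            // case e: new blob without predecessor -> add (or re-identify) a
            // person, after checking its width-to-height ratio (omitted here).
            b.persons.push_back(Person{nextPersonId++});
        }
    }
    tracked = std::move(next);
}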
References

1. Ahonen, T., Hadid, A., and Pietikäinen, M. Face Recognition with Local Binary Patterns. In Proceedings of the 8th European Conference on Computer Vision, pages 469–481. Springer, Prague, Czech Republic, May 11-14 2004.
2. Alvarado, P., Dörfler, P., and Wickel, J. AXON 2 - A Visual Object Recognition System for Non-Rigid Objects. In Hamza, M. H., editor, IASTED International Conference Signal Processing, Pattern Recognition and Applications (SPPRA), pages 235–240. Rhodes, Greece, July 3-6 2001.
3. Batista, J. Tracking Pedestrians Under Occlusion Using Multiple Cameras. In Int. Conference on Image Analysis and Recognition, volume Lecture Notes in Computer Science 3212, pages 552–562. 2004.
4. Belhumeur, P., Hespanha, J., and Kriegman, D. Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection. In IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 711–720. 1997.
5. Bileschi, S. and Heisele, B. Advances in Component-based Face Detection. In 2003 IEEE International Workshop on Analysis and Modeling of Faces and Gestures, AMFG 2003, pages 149–156. IEEE, October 2003.
6. Bishop, C. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
7. Bradski, G. Computer Vision Face Tracking for Use in a Perceptual User Interface. Intel Technology Journal, Q2, 1998.
8. Brunelli, R. and Poggio, T. Face Recognition: Features versus Templates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(10):1042–1052, 1993.
9. Cheng, Y. Mean Shift, Mode Seeking and Clustering. IEEE PAMI, 17:790–799, 1995.
10. Chung, K., Kee, S., and Kim, S. Face Recognition Using Principal Component Analysis of Gabor Filter Responses. In International Workshop on Recognition, Analysis, and Tracking of Faces and Gestures in Real-Time Systems. 1999.
11. Comaniciu, D. and Meer, P. Mean Shift: A Robust Approach toward Feature Space Analysis. IEEE PAMI, 24(5):603–619, 2002.
12. Cootes, T. and Taylor, C. Statistical Models of Appearance for Computer Vision. Technical report, University of Manchester, September 1999.
13. Cootes, T. and Taylor, C. Statistical Models of Appearance for Computer Vision. Technical report, Imaging Science and Biomedical Engineering, University of Manchester, March 2004.
14. Craw, I., Costen, N., Kato, T., and Akamatsu, S. How Should We Represent Faces for Automatic Recognition? IEEE Transactions on Pattern Recognition and Machine Intelligence, 21(8):725–736, 1999.
15. Dempster, A., Laird, N., and Rubin, D. Maximum Likelihood from Incomplete Data Using the EM Algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977.
16. Duda, R. and Hart, P. Pattern Classification and Scene Analysis. John Wiley & Sons, 1973.
17. Elgammal, A. and Davis, L. Probabilistic Framework for Segmenting People Under Occlusion. In IEEE ICCV, volume 2, pages 145–152. 2001.
18. Fillbrandt, H. and Kraiss, K.-F. Tracking People on the Ground Plane of a Cluttered Scene with a Single Camera. WSEAS Transactions on Information Science and Applications, 9(2):1302–1311, 2005.
19. Gavrila, D. The Visual Analysis of Human Movement: A Survey. Computer Vision and Image Understanding, 73(1):82–98, 1999.
20. Georghiades, A., Belhumeur, P., and Kriegman, D. From Few to Many: Illumination Cone Models for Face Recognition under Variable Lighting and Pose. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6):643–660, 2001.
21. Gutta, S., Huang, J., Singh, D., Shah, I., Takács, B., and Wechsler, H. Benchmark Studies on Face Recognition. In Proceedings of the International Workshop on Automatic Face- and Gesture-Recognition (IWAFGR). Zürich, Switzerland, 1995.
22. Hähnel, M., Klünder, D., and Kraiss, K.-F. Color and Texture Features for Person Recognition. In International Joint Conference on Neural Networks (IJCNN 2004). Budapest, Hungary, July 2004.
23. Haritaoglu, I., Harwood, D., and Davis, L. Hydra: Multiple People Detection and Tracking Using Silhouettes. In Proc. IEEE Workshop on Visual Surveillance, pages 6–13. 1999.
24. Haritaoglu, I., Harwood, D., and Davis, L. W4: Real-Time Surveillance of People and Their Activities. IEEE PAMI, 22(8):809–830, 2000.
25. Harville, M. and Li, D. Fast, Integrated Person Tracking and Activity Recognition with Plan-View Templates from a Single Stereo Camera. In IEEE CVPR, volume 2, pages 398–405. 2004.
26. Heisele, B., Ho, P., and Poggio, T. Face Recognition with Support Vector Machines: Global versus Component-based Approach. In ICCV01, volume 2, pages 688–694. 2001.
27. Kanade, T. Computer Recognition of Human Faces. Birkhauser, Basel, Switzerland, 1973.
28. Lades, M., Vorbrüggen, J., Buhmann, J., Lange, J., von der Malsburg, C., Würtz, R. P., and Konen, W. Distortion Invariant Object Recognition in the Dynamic Link Architecture. IEEE Transactions on Computers, 42(3):300–311, 1993.
29. Manjunath, B., Salembier, P., and Sikora, T. Introduction to MPEG-7. John Wiley and Sons, Ltd., 2002.
30. Martinez, A. Recognizing Imprecisely Localized, Partially Occluded, and Expression Variant Faces from a Single Sample per Class. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(6):748–763, June 2002.
31. Martinez, A. and Benavente, R. The AR face database. Technical Report 24, CVC, 1998.
32. McKenna, S., Jabri, S., Duric, Z., Wechsler, H., and Rosenfeld, A. Tracking Groups of People. CVIU, 80(1):42–56, 2000.
33. Melnik, O., Vardi, Y., and Zhang, C.-H. Mixed Group Ranks: Preference and Confidence in Classifier Combination. PAMI, 26(8):973–981, August 2004.
34. Mittal, A. and Davis, L. M2-Tracker: A Multi-View Approach to Segmenting and Tracking People in a Cluttered Scene. Int. Journal on Computer Vision, 51(3):189–203, 2003.
35. Moeslund, T. Summaries of 107 Computer Vision-Based Human Motion Capture Papers. Technical Report LIA 99-01, University of Aalborg, March 1999.
36. Moeslund, T. and Granum, E. A Survey of Computer Vision-Based Human Motion Capture. Computer Vision and Image Understanding, 81(3):231–268, 2001.
37. Nakajima, C., Pontil, M., Heisele, B., and Poggio, T. Full-body Person Recognition System. Pattern Recognition, 36:1997–2006, 2003.
38. Ohm, J. R., Cieplinski, L., Kim, H. J., Krishnamachari, S., Manjunath, B. S., Messing, D. S., and Yamada, A. Color Descriptors. In Manjunath, B., Salembier, P., and Sikora, T., editors, Introduction to MPEG-7, chapter 13, pages 187–212. Wiley & Sons, Inc., 2002.
39. Penev, P. and Atick, J. Local Feature Analysis: A General Statistical Theory for Object Representation. Network: Computation in Neural Systems, 7(3):477–500, 1996.
40. Pentland, A., Moghaddam, B., and Starner, T. View-based and Modular Eigenspaces for Face Recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR'94). Seattle, WA, June 1994.
41. Phillips, P., Moon, H., Rizvi, S., and Rauss, P. The FERET Evaluation Methodology for Face-Recognition Algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(10):1090–1104, October 2000.
42. Ro, Y., Kim, M., Kang, H., Manjunath, B., and Kim, J. MPEG-7 Homogeneous Texture Descriptor. ETRI Journal, 23(2):41–51, June 2001.
43. Schwaninger, A., Mast, F., and Hecht, H. Mental Rotation of Facial Components and Configurations. In Proceedings of the Psychonomic Society 41st Annual Meeting. New Orleans, USA, 2000.
44. Sirovich, L. and Kirby, M. Low-Dimensional Procedure for the Characterization of Human Faces. Journal of the Optical Society of America, 4(3):519–524, 1987.
45. Sung, K. K. and Poggio, T. Example-Based Learning for View-Based Human Face Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1):39–51, 1997.
46. Swain, M. and Ballard, D. Color Indexing. Int. Journal on Computer Vision, 7(1):11–32, 1991.
47. Turk, M. A Random Walk through Eigenspace. IEICE Transactions on Information and Systems, E84-D(12):1586–1595, 2001.
48. Turk, M. and Pentland, A. Eigenfaces for Recognition. Journal of Cognitive Neuroscience, 3(1):71–86, 1991.
49. Vapnik, V. Statistical Learning Theory. Wiley & Sons, Inc., 1998.
50. Viola, P. and Jones, M. Robust Real-time Object Detection. In ICCV01, 2, page 747. 2001.
51. Wallraven, C. and Bülthoff, H. View-based Recognition under Illumination Changes Using Local Features. In Proceedings of the International Conference on Computer Vision and Pattern Recognition (CVPR). Kauai, Hawaii, December 8-14 2001.
52. Waring, C. A. and Liu, X. Face Detection Using Spectral Histograms and SVMs. IEEE Transactions on Systems, Man, and Cybernetics - Part B: Cybernetics, 35(3):467–476, June 2005.
53. Welch, G. and Bishop, G. An Introduction to the Kalman Filter. Technical Report TR 95-041, Department of Computer Science, University of North Carolina at Chapel Hill, 2004.
54. Wickel, J., Alvarado, P., Dörfler, P., Krüger, T., and Kraiss, K.-F. Axiom - A Modular Visual Object Retrieval System. In Jarke, M., Koehler, J., and Lakemeyer, G., editors, Proceedings of 25th German Conference on Artificial Intelligence (KI 2002), Lecture Notes in Artificial Intelligence, pages 253–267. Springer, Aachen, September 2002.
55. Wiskott, L., Fellous, J., Krüger, N., and von der Malsburg, C. Face Recognition by Elastic Bunch Graph Matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):775–779, 1997.
56. Wren, C., Azerbayejani, A., Darell, T., and Pentland, A. Pfinder: Real-Time Tracking of the Human Body. IEEE PAMI, 19(7):780–785, 1997.
57. Yang, J., Zhu, X., Gross, R., Kominek, J., Pan, Y., and Waibel, A. Multimodal People ID for a Multimedia Meeting Browser. In Proceedings of the Seventh ACM International Conference on Multimedia, pages 159–168. ACM, Orlando, Florida, United States, 1999.
58. Yang, M., Kriegman, D., and Ahuja, N. Detecting Faces in Images: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24:34–58, 2002.
59. Zhao, T. and Nevatia, R. Tracking Multiple Humans in Complex Situations. IEEE PAMI, 26(9):1208–1221, 2004.
60. Zhao, W., Chellappa, R., Phillips, P. J., and Rosenfeld, A. Face recognition: A Literature Survey. ACM Computing Surveys, 35(4):399–458, 2003.
Chapter 6
Interacting in Virtual Reality

Torsten Kuhlen, Ingo Assenmacher, Lenka Jeřábková

The word "virtual" is widely used in computer science and in a variety of situations. In general, it refers to something that is merely conceptual rather than physically real. In this respect, the term "Virtual Reality", which was coined by the American artist Jaron Lanier in the late eighties, is a paradox in itself. Until today, there exists no standardized definition of Virtual Reality (VR). All definitions have in common, however, that they specify VR as a computer-generated scenario of objects (virtual world, virtual environment) a user can interact with in real time and in all three dimensions. The characteristics of VR are most often described by means of the I³: Interaction, Imagination, and Immersion:
• Interaction: Unlike an animation, which cannot be influenced by the user, real-time interaction is a fundamental and imperative feature of VR. Thus, although movies like "Jurassic Park" realize computer-generated virtual worlds, strictly speaking they are not a matter of VR. The interaction comprises navigation as well as the manipulation of the virtual scene. In order to achieve real-time interaction, two important basic criteria have to be fulfilled (a minimal timing-check sketch follows this list):
  1. Minimum Frame Rate: A VR application must not work with a frame rate lower than a threshold, which is defined by Bryson [6] as 10 Hz. A more restrictive limit is set by Kreylos et al. [9], who recommend a minimum rate of 30 frames per second.
  2. Latency: Additionally, Bryson as well as Kreylos have defined the maximum response time of a VR system after a user input. Both claim that a maximum latency of 100 milliseconds can be tolerated until feedback must take place.
  I/O devices, and especially methods and algorithms, have to meet these requirements. As a consequence, high-quality, photorealistic rendering algorithms including global illumination methods, or complex algorithms like the Finite Element Method for a physically authentic manipulation of deformable objects, typically cannot be used in VR without significant optimizations or simplifications.
• Imagination: The interaction within a virtual environment should be as intuitive as possible. In the ideal case, a user can perceive the virtual world and interact with it in exactly the same way as with the real world, so that the difference between both becomes blurred. Obviously, this goal has an impact on the man-machine interface design.
  First of all, since humans interact with the real world in all three dimensions, the interface must be three-dimensional as well. Furthermore, multiple senses should be included in the interaction, i.e., besides the visual sense, the integration of other senses like the auditory, the haptic/tactile, and the olfactory ones should be considered in order to achieve a more natural, intuitive interaction with the virtual world.
• Immersion: By means of special display technology, the user has the impression of being a part of the virtual environment, standing in the midst of the virtual scene, fully surrounded by it instead of looking at it from outside.
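The frame-rate and latency requirements quoted in the Interaction item can be monitored with a trivial, self-contained C++ helper; the thresholds below merely restate the 30 Hz and 100 ms figures, and the class itself is an illustrative sketch, not part of any particular VR toolkit.

#include <chrono>

// Checks whether a frame (or the reaction to a user input) stays within a
// given time budget, e.g. 1000/30 ms for rendering or 100 ms for feedback.
class FrameTimer {
    using Clock = std::chrono::steady_clock;
    Clock::time_point start_ = Clock::now();
public:
    void begin() { start_ = Clock::now(); }
    bool withinBudget(double budgetMs) const {
        const double elapsedMs =
            std::chrono::duration<double, std::milli>(Clock::now() - start_).count();
        return elapsedMs <= budgetMs;
    }
};

// Usage: timer.begin(); /* render frame */; bool ok = timer.withinBudget(1000.0 / 30.0);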
In recent years, VR has proven its potential to provide an innovative human-computer interface from which multiple application areas can profit. Whereas in the United States VR research is mainly pushed by military applications, the automotive industry has significantly promoted VR in Germany during the last ten years, prompting universities and research facilities like the Fraunhofer Institutes to make significant advances in VR technology and methodology. The reason for this precursor role is the goal of realizing the digital vehicle, where the whole development from the initial design process up to the production planning is implemented in the computer. With VR technology, so-called "virtual mockups" can to some extent replace physical prototypes, having the potential to shorten time to market, to reduce costs, and to lead to a higher quality of the final product. In particular, the following concrete applications are relevant for the automotive industry and can profit from VR as an advanced human-computer interface:
• Car body design
• Car body aerodynamics
• Design and ergonomical studies of car interiors
• Simulation of air conditioning technology in car interiors
• Crash simulations
• Car body vibrations
• Assembly simulations
• Motor development: simulation of airflow and fuel injection in the cylinders of combustion engines
• Prototyping of production lines and factories
Besides automotive applications, medicine has emerged as a very important field for VR. Obviously, data resulting from 3D computed tomography or magnetic resonance imaging suggest a visualization in virtual environments. Furthermore, the VR-based simulation of surgical interventions is a significant step towards the training of surgeons or the planning of complicated operations. The development of realistic surgical procedures poses a series of challenges, however, starting with the real-time simulation of (deformable) organs and tissue, and ending in a realistic simulation of liquids like blood and cerebrospinal fluid. Obviously, VR covers a large variety of complex aspects which cannot all be covered within the scope of a single book chapter. Therefore, instead of trying to cover
all topics, we decided to pick out single aspects and to describe them in adequate detail. Following the overall focus of this book, chapter 6 concentrates on the interaction aspects of VR. Technological aspects are only described when they contribute to an understanding of the algorithmic and mathematical issues. In particular, in sections 6.1 to 6.3 we address the multimodality of VR with emphasis on visual, acoustic, and haptic interfaces for virtual environments. Sect. 6.4 is about the modeling of realistic behavior of virtual objects. The methods described are restricted to rigid objects, since a treatment of deformable objects would go beyond the scope of this chapter. In Sect. 6.5, support mechanisms are introduced as an add-on to haptics and physically-based modeling, which facilitate the completion of elementary interaction tasks in virtual environments. Sect. 6.6 gives an introduction to algorithmic aspects of VR toolkits, and to scene graphs as an essential data structure for the implementation of virtual environments. Finally, Sect. 6.7 explains the examples which are provided on the CD that accompanies this book. For further reading, we recommend the book "3D User Interfaces – Theory and Practice" by D. Bowman et al. [5].
6.1 Visual Interfaces

6.1.1 Spatial Vision

The capability of the human visual system to perceive the environment in three dimensions originates from psychological and physiological cues. The most important psychological cues are:
• (Linear) perspective
• Relative size of objects
• Occlusion
• Shadows and lighting
• Texture gradients
Since these cues rely on monocular and static mechanisms only, they can be covered by conventional visual displays and computer graphics algorithms. The main difference between vision in VR and traditional computer graphics is that VR aims at emulating all cues. Thus, in VR interfaces the following physiological cues must be taken into account as well:
• Stereopsis: Due to the interocular distance, slightly different images are projected onto the retina of the two eyes (binocular disparity). Within the visual cortex of the brain, these two images are then fused into a three-dimensional scene. Stereopsis is considered the most powerful mechanism for seeing in 3D. However, it merely works for objects that are located at a distance of not more than a few meters from the viewer.
• Oculomotor factors: The oculomotor factors are based on muscle activities and can be further subdivided into accommodation and convergence.
  – Accommodation: In order to focus on an object and to produce a sharp image on the eye's retina, the ciliary muscles deform the eye lenses accordingly.
  – Convergence: The viewer rotates her eyes towards the object in focus so that the two images can be fused together.
• Motion parallax: When the viewer is moving from left to right while looking at two objects at different distances, the object which is closer to the viewer passes her field of view faster than the object which is farther away.
6.1.2 Immersive Visual Displays

While the exact imitation of psychological cues – especially photorealistic global illumination – is mainly limited by the performance of the computer system and its graphics hardware, the realization of physiological cues requires dedicated hardware. In principle, two different types of visual, immersive VR displays can be distinguished: head-mounted and room-mounted displays. In the early days of VR, a head-mounted display (HMD) usually formed the heart of every VR system. An HMD is a display worn like a helmet; it consists of two small monitors positioned directly in front of the viewer's eyes, which realize the stereopsis. The first HMD was developed as early as 1965 by Ivan Sutherland [19]. The first commercially available systems, however, did not appear until the late 1980s (see Fig. 6.1, left). Although HMDs have the advantage that they fully immerse the user into the virtual environment, they also suffer from major drawbacks. Besides ergonomic problems, they tend to isolate the user from the real environment, which makes the communication between two users sharing the same virtual environment more difficult. In particular, the user loses visual contact with her own body, which can lead to significant irritations if the body is not represented by an adequate computer-graphic counterpart. Today, HMDs are mainly used in low-cost or mobile and portable VR systems. Furthermore, a special variant of HMDs called see-through displays has only recently gained high relevance in Augmented Reality. Here, semi-transparent mirrors are used in order to supplement the real environment with virtual objects. In the early 90's, the presentation of the CAVE™ at the ACM SIGGRAPH conference [7] caused a paradigm change, away from head-mounted displays towards room-mounted installations based on a combination of large projection screens which surround the user in order to achieve the immersion effect. Fig. 6.2 shows a CAVE-like display installed at RWTH Aachen University, Germany, consisting of five projection screens. Although it is preferable to exclusively use rear projection screens to prevent the user from standing in the projection beam and producing shadows on the screens, the floor of such systems is most often realized as a front
Fig. 6.1. Head-mounted displays (left photograph courtesy of P. Winandy).
projection in order to achieve a more compact construction. In systems with four walls allowing a 360 degrees look-around, one of the walls is typically designed as a (sliding) door to allow entering and leaving the system. In room-mounted systems, a variety of techniques exist to realize stereopsis. One of the most common principles is based on a time multiplex method. Here, the images for the left eye and the right eye are projected on the screen one after another, while the right and the left glass of so-called shutter glasses are opened and closed respectively. Beside this ”active” technology, methods based on polarized or spectral filters (e.g., INFITECTM ) are in use. Unlike the shutter technique, these ”passive” methods do not require active electronic components in the glasses and work well
Fig. 6.2. Immersive, room-mounted displays: The CAVE-like display at RWTH Aachen University, Germany.
without an extremely precise time synchronization (known as "gen locking"). However, since the images for the left and the right eye are shown in parallel, they demand two projectors per screen.

6.1.3 Viewer Centered Projection

Head-mounted as well as room-mounted VR displays typically come along with a so-called tracking system, which captures the user's head position and orientation in real time. With this information, the VR software is capable of adapting the perspective of the visual scene to the user's current viewpoint and view direction. A variety of tracking principles for VR systems are in use, ranging from mechanical, acoustic (ultrasonic), and electromagnetic techniques up to inertial and opto-electronic devices. Recently, a combination of two or more infrared (IR) cameras and IR-light-reflecting markers attached to the stereo glasses has become the most popular tracking technology in virtual environments (Fig. 6.3) since, in comparison to other technologies, it works with higher precision and lower latency and is also nearly non-intrusive.
Fig. 6.3. Opto-electronical head tracking: (a) IR camera, (b) stereo shutter glasses with markers (courtesy of A.R.T. GmbH, Weilheim, Germany).
While in VR systems with HMDs the realization of motion parallax is obviously straightforward, since the viewpoint and view direction are simply set to the position and orientation of the tracking sensor (plus an offset indicating the position of the left/right eye relative to the sensor), the situation is a little more difficult with room-mounted displays. At first glance, it is not obvious that motion parallax is mandatory in such systems at all, since many stereo vision systems work quite well without any head tracking (cf., e.g., IMAX cinemas). However, since navigation is a key feature in VR, the user is typically walking around, looking at the scene from very different viewpoints. The left drawing in Fig. 6.4 shows what happens: since a static stereogram is only correct for one specific eye position, it shows a distorted image of virtual objects when the viewer starts moving in front of the screen. This is not only annoying, but also makes a direct, manual interaction with virtual objects in 3D
space impossible. Instead, the projection has to be adapted to the new viewpoint as shown in the right drawing of Fig. 6.4. By this adaptation, called Viewer Centered Projection (VCP), the user is also capable of looking around objects by just moving her head. In the example, the right side of the virtual cube becomes visible when the user moves to the right. Fig. 6.5 demonstrates the quasi-holographic effect of VCP by means of a tracking sensor attached to the camera. Like real objects, virtual objects are perceived as stationary in 3D space.
Fig. 6.4. The principle of viewer centered projection: (a) distortion of the virtual cube when moving from viewpoint 1 to viewpoint 2; (b) correction of perspective.
Fig. 6.5. The viewer centered projection evokes a pseudo-holographic effect. As with the lower, real cuboid, the right side of the upper, virtual cube becomes visible when the viewpoint moves to the right.
In the following, the mathematical background of VCP is briefly explained. In computer graphics, 3D points are typically described as 4D vectors, the so called
homogeneous coordinates (see, e.g., [10]). Assuming the origin of the coordinate system is in the middle of the projection plane with its x- and y-axes parallel to the plane, the following shearing operation has to be applied to the vertices of a virtual, polygonal object:

\[
SH_{xy}\!\left(-\frac{e_x}{e_z}, -\frac{e_y}{e_z}\right) = \begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
-\frac{e_x}{e_z} & -\frac{e_y}{e_z} & 1 & 0 \\
0 & 0 & 0 & 1
\end{pmatrix} \tag{6.1}
\]

For an arbitrary point P = (x, y, z, 1) in the object space, applying this matrix leads to the following new coordinates:

\[
(x, y, z, 1)\cdot SH_{xy}\!\left(-\frac{e_x}{e_z}, -\frac{e_y}{e_z}\right) = \left(x - \frac{e_x}{e_z}\,z,\; y - \frac{e_y}{e_z}\,z,\; z,\; 1\right) \tag{6.2}
\]

By this operation, the viewpoint is shifted to the z-axis, i.e., P = (e_x, e_y, e_z, 1) is transformed to P′ = (0, 0, e_z, 1). The result is a simple, central perspective projection (Fig. 6.6), which is most common in computer graphics and which can, e.g., be further processed into a parallel projection in the unit cube and handled by computer graphics hardware.
Fig. 6.6. Shearing the view volume.
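A compact C++ sketch of this step is given below; it builds the shear matrix of Eq. (6.1) from the tracked eye position and applies it to a point given as a homogeneous row vector, as in Eqs. (6.2) and (6.3). Names and conventions are chosen for this illustration only.

#include <array>

using Vec4 = std::array<double, 4>;
using Mat4 = std::array<std::array<double, 4>, 4>;

// Shear matrix of Eq. (6.1) for an eye position (ex, ey, ez) given in the
// coordinate system of the projection plane (origin in the plane's center,
// x- and y-axes parallel to the plane).
Mat4 viewShear(double ex, double ey, double ez) {
    Mat4 m{};                                         // zero-initialized
    m[0][0] = m[1][1] = m[2][2] = m[3][3] = 1.0;      // identity part
    m[2][0] = -ex / ez;                               // shear terms
    m[2][1] = -ey / ez;
    return m;
}

// p' = p * M for a homogeneous row vector, cf. Eqs. (6.2) and (6.3).
Vec4 applyRowVector(const Vec4& p, const Mat4& m) {
    Vec4 r{0.0, 0.0, 0.0, 0.0};
    for (int j = 0; j < 4; ++j)
        for (int i = 0; i < 4; ++i)
            r[j] += p[i] * m[i][j];
    return r;
}
// After the shear, the eye lies on the z-axis and a standard central
// perspective projection can be applied; points with z = 0, i.e. on the
// projection plane, remain unchanged.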
Of course, shearing the view volume as shown above must leave the pixels (picture elements) on the projection plane unaffected, which is the case when applying (6.1):

\[
(x, y, 0, 1)\cdot SH_{xy}\!\left(-\frac{e_x}{e_z}, -\frac{e_y}{e_z}\right) = (x, y, 0, 1) \tag{6.3}
\]

In CAVE-like displays with several screens not parallel to each other, VCP is absolutely mandatory. If the user is not tracked, perspective distortions arise, especially
at the boundaries of the screens. Since every screen has its own coordinate system, for each eye the number of different projection matrices is equal to the number of screens. Experience has shown that not every VR application needs to run in a fully immersive system. In many cases, semi-immersive displays work quite well, especially in the scientific and technical field. In the simplest case such displays only consist of a single, large wall (see Fig. 6.7 left) or a table-like horizontal screen known as the Responsive Workbench [12]. Workbenches are often complemented by an additional vertical screen in order to enlarge the view volume (see Fig. 6.7 right). Recent efforts towards more natural visual interfaces aim at high-resolution, tiled displays built from a matrix of multiple projectors [11].
Fig. 6.7. Semi-immersive, room-mounted displays: (a) large rear-projection wall; (b) L-shaped workbench (photograph courtesy of P. Winandy).
In summary, today's visual VR interfaces are able to emulate all physiological cues for 3D viewing. It must be taken into consideration, however, that accommodation and convergence work together naturally when looking at real objects. Since in all projection-based displays the eye lenses have to focus on the screen, this interplay is disturbed, which can cause eye strain and can be one of the reasons for simulator sickness. Physiological cues can only be realized by the use of dedicated hardware like HMDs or stereo glasses, and tracking sensors, which have to be attached to the user. Since a major goal of VR interfaces is to be as non-intrusive as possible, autostereoscopic and volumetric displays are under development (see, e.g., [8]); they are still in their infancy, however. Today's optical IR tracking systems represent a significant step forward, although they still rely on markers. Up to now, fully non-intrusive optical systems based on video cameras and image processing are still too slow for use in VR applications.
6.2 Acoustic Interfaces

6.2.1 Spatial Hearing

In contrast to the human visual system, the human auditory system can perceive input from all directions and has no limited field of view. As such, it provides valuable cues for navigation and orientation in virtual environments. With respect to immersion, the acoustic sense is a very important additional source of information for the user, as acoustic perception works particularly precisely for close auditory stimuli. This enhances the liveliness and credibility of the virtual environment. The ultimate goal thus would be the ability to place virtual sounds at any direction and distance around the user in real time. However, audio stimuli are not that common in today's VR applications. For visual stimuli, the brain compares the pictures from both eyes to determine the placing of objects in a scene. With this information, it creates a three-dimensional cognitive representation that humans perceive as a three-dimensional image. In direct analogy, the stimuli present at the eardrums are compared by the brain to determine the nature and the direction of a sound event [3]. In spatial acoustics, it is most convenient to describe the position of a sound source relative to the position and orientation of the listener's head by means of a coordinate system which has its origin in the head, as shown in Fig. 6.8. Depending on the horizontal angle of incidence, different time delays and levels between both ears consequently arise. In addition, frequency characteristics dependent on the angle of incidence are influenced by the interference between the direct signal and the reflections off the head, shoulders, auricle, and other parts of the human body. The interaction of these three factors permits humans to assign a direction to acoustic events [15].
Fig. 6.8. Dimensions and conventions of measurement for head movements.
The characteristics of the sound pressure at the eardrum can be described in the time domain by the Head Related Impulse Response (HRIR) and in the frequency domain by the Head Related Transfer Function (HRTF). These transfer functions can be measured individually with small in-ear microphones or with an artificial head. Fig. 6.9 shows a measurement with an artificial head at 45° relative to the frontal direction in the horizontal plane. The Interaural Time Difference (ITD) can
be assessed in the time domain plot. The Interaural Level Difference (ILD) is shown in the frequency domain plot and clarifies the frequency dependent level increase at the ear turned toward the sound source, and the decrease at the ear which is turned away from the sound source.
Fig. 6.9. HRIR and HRTF of a sound source under 45◦ measured using an artificial head. The upper diagram shows the Interaural Time Difference in the time domain plot. The Interaural Level difference is depicted in the frequency domain plot of the lower diagram.
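For the horizontal plane, the ITD can also be approximated in closed form by the classical spherical-head (Woodworth) model. This formula is a common textbook approximation and is not taken from the measured HRIR/HRTF data described above; the default head radius is likewise only a typical value.

#include <cmath>

// Spherical-head approximation of the Interaural Time Difference in seconds
// for a distant source at horizontal angle theta (radians, 0 = frontal,
// valid for |theta| <= pi/2). headRadius in meters, speedOfSound in m/s.
double interauralTimeDifference(double theta,
                                double headRadius = 0.0875,
                                double speedOfSound = 343.0) {
    return (headRadius / speedOfSound) * (std::sin(theta) + theta);
}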
6.2.2 Auditory Displays

In VR systems with head-mounted displays, auditory stimulation by means of headphones integrated into the HMD is suitable. For room-mounted displays, a solution based on loudspeakers is obviously preferable. Here, simple intensity panning is most often used to produce three-dimensional sound effects. However, intensity panning is not able to provide authentic virtual sound scenes. In particular, it is impossible to create virtual near-to-head sound sources, although these sources are of high importance in direct interaction metaphors for virtual environments, such as by-hand placements or manipulations. In CAVE-like displays, virtual objects which are to be directly manipulated are typically displayed with negative parallax, i.e., in front of the projection plane, rather near to the user's head and within grasping range, see Fig. 6.10. As a consequence, in high-quality, interactive VR systems, a technique has to be applied which is more complex than simple intensity panning. Creating a virtual sound scene with spatially distributed sources needs a technique for adding spatial cues to audio signals and an appropriate reproduction. There are mainly two different
Fig. 6.10. Typical interaction range in a room-mounted, CAVE-like display.
approaches for reproducing a sound event with a true spatial relation: wave field synthesis and binaural synthesis. While wave field synthesis reproduces the whole sound field with a large number of loudspeakers, binaural synthesis in combination with cross talk cancellation (CTC) is able to provide a spatial auditory representation using only a few loudspeakers. The following sections briefly introduce the principles and problems of these technologies with a focus on the binaural approach. Another problem can be found in the synchronization of the auditory and visual VR subsystems. Enhanced immersion dictates that the coupling between the two systems has to be tight and with small computational overhead, as each sub-system introduces its own lag and latency problems due to different requirements on the processing. If the visual and the auditory cues differ too much in time and space, this is directly perceived as a presentation error, and the immersion of the user ceases.

6.2.3 Wave Field Synthesis

Wave field synthesis is based on the Huygens principle. An array of loudspeakers (ranging from just a few to a few hundred in number) is placed in the same position a microphone array was placed in at the time of recording the sound event, in order to reproduce an entire real sound field. Fig. 6.11 shows this principle of recording and reproduction [2], [20]. In a VR environment the loudspeaker signals are then calculated for a given position of one or more virtual sources. By reproducing a wave field in the whole listening area, it is possible to walk around or turn the head while the spatial impression does not change. This is the big advantage of this principle. The main drawback, beyond the high demand for processing power, is the size of the loudspeaker array. Furthermore, mostly two-dimensional solutions have been presented so far. The placement of such arrays in semi-immersive, projection-based VR systems like a workbench or a single wall is only just possible, but in display systems with four to six surfaces, as in CAVE-like environments, it is nearly impossible. A wave field synthesis approach is thus not applicable, as the
Fig. 6.11. Setup for recording a sound field and reproducing it by wave field synthesis.
loudspeaker arrays have to be positioned around the user and outside of the environment's boundaries.

6.2.4 Binaural Synthesis

In contrast to wave field synthesis, a binaural approach does not deal with the complete reproduction of the sound field. It is convenient and sufficient to reproduce that sound field only at two points, the ears of the listener. In this case, only two signals have to be calculated for a complete three-dimensional sound scene. The procedure of convolving a mono sound source with an appropriate pair of HRTFs in order to obtain a synthetic binaural signal is called binaural synthesis. The binaural synthesis transforms a sound source without position information into a virtual source related to the listener's head. The synthesized signals contain the directional information of the source, which is provided by the information in the HRTFs. For the realization of a room-related virtual source, the HRTF must be changed when the listener turns or moves her head, otherwise the virtual source would move with the listener. Since a typical VR system includes head tracking for the adaptation of the visual perspective, the listener's position is always known and can also be used to realize a synthetic sound source with a fixed position corresponding to the room coordinate system. The software calculates the relative position and orientation of the listener's head with respect to the imaginary point where the source should be localized. Knowing the relative position and orientation, the appropriate HRTF can be chosen from a database (see Fig. 6.12) [1].

All in all, the binaural synthesis is a powerful and at the same time feasible method for an exact spatial imaging of virtual sound sources, which in principle only needs two loudspeakers. In contrast to panning systems, where the virtual sources are always on or behind the line spanned by the speakers, binaural synthesis can realize a source at any distance from the head, and in particular one close to the listener's head, by using an appropriate HRTF. It is also possible to synthesize many different sources and create a complex three-dimensional acoustic scenario.
Fig. 6.12. The principle of binaural synthesis.
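As an illustration of this principle, the following minimal sketch renders a mono source at a head-relative direction by selecting the nearest HRIR pair from a database and convolving the source signal with it. The database layout (`hrir_db` indexed by azimuth), the sampling rate and the nearest-neighbour selection are assumptions of this sketch; interpolation between measured directions, distance cues and the head-tracking update described above are omitted.

```python
import numpy as np

def select_hrir(hrir_db, azimuths_deg, azimuth_deg):
    """Pick the HRIR pair measured closest to the requested azimuth.

    hrir_db: array of shape (n_directions, 2, n_taps) with left/right
    head-related impulse responses (placeholder data in this sketch).
    """
    diff = ((azimuths_deg - azimuth_deg + 180.0) % 360.0) - 180.0
    idx = np.argmin(np.abs(diff))
    return hrir_db[idx, 0], hrir_db[idx, 1]

def binaural_synthesis(mono, hrir_left, hrir_right):
    """Convolve a mono signal with an HRIR pair to obtain a binaural signal."""
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    return np.stack([left, right], axis=0)

if __name__ == "__main__":
    fs = 44100                                  # assumed sampling rate
    n_taps = 256
    azimuths = np.arange(0, 360, 5)             # measurement grid (assumption)
    rng = np.random.default_rng(1)
    hrir_db = 0.01 * rng.standard_normal((len(azimuths), 2, n_taps))  # placeholder HRIRs
    mono = rng.standard_normal(fs)              # one second of a test signal
    # head tracking would update this angle continuously; here it is fixed
    hl, hr = select_hrir(hrir_db, azimuths, azimuth_deg=30.0)
    binaural = binaural_synthesis(mono, hl, hr)
    print(binaural.shape)                       # (2, len(mono) + n_taps - 1)
```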
6.2.5 Cross Talk Cancellation

The requirement for a correct binaural presentation is that the right channel of the signal is audible only in the right ear and the left one is audible only in the left ear. From a technical point of view, the presentation of binaural signals by headphones is the easiest way, since the acoustical separation between both channels is perfectly solved. While headphones may be quite acceptable in combination with HMDs, in room-mounted displays the user should wear as little active hardware as possible. Furthermore, when the ears are covered by headphones, the impression of a source located at a certain point and distance to the listener often does not match the impression of a real sound field. For these reasons, a reproduction by loudspeakers should be considered, especially in CAVE-like systems. Speakers can be mounted, e.g., above the screens. The placing of virtual sound sources is not restricted due to the binaural reproduction.

Using loudspeakers for binaural synthesis introduces the problem of cross talk. As sound waves intended for the right ear arrive at the left ear and vice versa, without an adequate CTC the three-dimensional cues of the binaural signal would be destroyed. In order to realize a proper channel separation at any point of the listening area, CTC filters have to be calculated on-line when the listener moves his head. When X_L and X_R are the signals coming from the binaural synthesis (see Fig. 6.12 and 6.13), and Z_L and Z_R are the signals arriving at the listener's left and right ear, filters must be found providing signals Y_L and Y_R, so that X_L = Z_L and X_R = Z_R. Following the path of the binaural signals to the listener's ears results in the two equations
Z_L = Y_L \cdot L \cdot H_{LL} + Y_R \cdot R \cdot H_{RL} \overset{!}{=} X_L    (6.4)

Z_R = Y_R \cdot R \cdot H_{RR} + Y_L \cdot L \cdot H_{LR} \overset{!}{=} X_R    (6.5)
where L, R are the transfer functions of the loudspeakers, and H_{pq} is the transfer function from speaker p to ear q. In principle, a simple analytic approach exists for this problem, leading to

Y_L = L^{-1} \left( \frac{H_{RR}}{H_{LL} H_{RR} - H_{RL} H_{LR}} X_L - \frac{H_{RL}}{H_{LL} H_{RR} - H_{RL} H_{LR}} X_R \right)    (6.6)
and an analogous solution for Y_R. However, since singularities arise, especially at low frequencies where H_{LL} ≈ H_{RL} and H_{RR} ≈ H_{LR}, this simple calculation is not feasible. Instead, an iterative solution has to be applied, whose principle is depicted in Fig. 6.13 for an impulse presented to the left ear.
Fig. 6.13. Principle of (static) cross-talk cancellation. Presenting an impulse addressed to the left ear (1) and the first steps of compensation (2, 3).
For a better understanding of the CTC filters, the first four iteration steps will be shown by example for a signal from the right speaker. In an initial step, the influence of the right speaker R and of H_{RR} is eliminated by

Block_{RR} := R^{-1} \cdot H_{RR}^{-1}    (6.7)
1. Iteration Step
The resulting signal X_R \cdot Block_{RR} \cdot R from the right speaker also arrives at the left ear via H_{RL} and must be compensated for by an adequate filter K_{1R}, which is sent from the left speaker L to the left ear via H_{LL}:

X_R \cdot Block_{RR} \cdot R \cdot H_{RL} - X_R \cdot K_{1R} \cdot L \cdot H_{LL} \overset{!}{=} 0    (6.8)

\Rightarrow K_{1R} = L^{-1} \cdot H_{RR}^{-1} \cdot \frac{H_{RL}}{H_{LL}} =: Block_{RL}    (6.9)
2. Iteration Step
The compensation signal -X_R \cdot K_{1R} \cdot L from the first iteration step arrives at the right ear via H_{LR} and is again compensated for by a filter K_{2R}, sent from the right speaker R to the right ear via H_{RR}:

-X_R \cdot \left[ L^{-1} \cdot H_{RR}^{-1} \cdot \frac{H_{RL}}{H_{LL}} \right] \cdot L \cdot H_{LR} + X_R \cdot K_{2R} \cdot R \cdot H_{RR} \overset{!}{=} 0    (6.10)

\Rightarrow K_{2R} = \underbrace{R^{-1} \cdot H_{RR}^{-1}}_{Block_{RR}} \cdot \underbrace{\frac{H_{RL} \cdot H_{LR}}{H_{LL} \cdot H_{RR}}}_{=:K}    (6.11)
3. Iteration Step
The cross talk produced by the preceding iteration step is again eliminated by a compensation signal from the left speaker:

X_R \cdot \left[ R^{-1} \cdot H_{RR}^{-1} \cdot K \right] \cdot R \cdot H_{RL} - X_R \cdot K_{3R} \cdot L \cdot H_{LL} \overset{!}{=} 0    (6.12)

\Rightarrow K_{3R} = \underbrace{L^{-1} \cdot H_{RR}^{-1} \cdot \frac{H_{RL}}{H_{LL}}}_{Block_{RL}} \cdot K    (6.13)
4. Iteration Step
The right speaker produces a compensation signal to eliminate the cross talk of the third iteration step:

-X_R \cdot \left[ L^{-1} \cdot H_{RR}^{-1} \cdot \frac{H_{RL}}{H_{LL}} \cdot K \right] \cdot L \cdot H_{LR} + X_R \cdot K_{4R} \cdot R \cdot H_{RR} \overset{!}{=} 0    (6.14)

\Rightarrow K_{4R} = \underbrace{R^{-1} \cdot H_{RR}^{-1}}_{Block_{RR}} \cdot K^2    (6.15)
With the fourth step, the regularity of the iterative CTC becomes clear: even iterations extend Block_{RR} and odd iterations extend Block_{RL}. All in all, the following compensation filters can be identified for the right channel (compare to Fig. 6.13):

CTC_{RR} = Block_{RR} \cdot (1 + K + K^2 + \dots)
CTC_{RL} = Block_{RL} \cdot (1 + K + K^2 + \dots)    (6.16)
For an infinite number of iterations, the solution converges to the simple analytic approach:

\sum_{i=0}^{\infty} K^i = \frac{1}{1 - K} \quad \text{with} \quad K = \frac{H_{RL} \cdot H_{LR}}{H_{LL} \cdot H_{RR}}    (6.17)
Note that convergence is guaranteed, since |K| < 1 because |H_{RL}| < |H_{LL}| and |H_{LR}| < |H_{RR}|. In principle, a complete binaural synthesis with CTC can be achieved with only two loudspeakers. However, a dynamic CTC is only possible in the angle spanned by the loudspeakers. When the ITD in one HRTF decreases toward zero – this happens when the listener faces one speaker – an adequate cancellation is impossible [14]. Since for CAVE-like displays a dynamic CTC is needed which allows the user a full turnaround, the well-known two-speaker CTC solution must be extended to a four-speaker setup in order to provide a full 360° rotation for the listener. An implementation of such a setup is described in [13].
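To make the filter computation more concrete, the following sketch evaluates the closed-form CTC filters of equation 6.6 per frequency bin for a set of measured transfer functions. It is a minimal, static example under several assumptions: the transfer functions are placeholder arrays, the loudspeaker responses are taken as ideal (L = R = 1), and a crude regularization term stands in for a proper treatment of the low-frequency singularities; head tracking and the four-speaker extension described above are omitted.

```python
import numpy as np

def ctc_filters(H_LL, H_RL, H_LR, H_RR, beta=1e-3):
    """Closed-form cross talk cancellation filters per frequency bin.

    H_pq is the (complex) transfer function from speaker p to ear q,
    sampled on a common frequency grid. beta is a small regularization
    term avoiding the singularities where H_LL ~ H_RL and H_RR ~ H_LR.
    """
    det = H_LL * H_RR - H_RL * H_LR
    det = det + beta * np.max(np.abs(det))        # crude regularization
    # Y_L = ( H_RR X_L - H_RL X_R) / det, cf. eq. (6.6) with L = R = 1
    # Y_R = (-H_LR X_L + H_LL X_R) / det (analogous solution)
    C = np.empty((2, 2) + det.shape, dtype=complex)
    C[0, 0], C[0, 1] = H_RR / det, -H_RL / det    # filters producing Y_L
    C[1, 0], C[1, 1] = -H_LR / det, H_LL / det    # filters producing Y_R
    return C

def apply_ctc(X_L, X_R, C):
    """Filter the binaural signals (given as spectra) for loudspeaker playback."""
    Y_L = C[0, 0] * X_L + C[0, 1] * X_R
    Y_R = C[1, 0] * X_L + C[1, 1] * X_R
    return Y_L, Y_R

if __name__ == "__main__":
    n = 512                                   # number of frequency bins (placeholder)
    rng = np.random.default_rng(0)
    # placeholder HRTFs: the direct paths dominate the cross paths
    H_LL = 1.0 + 0.1 * rng.standard_normal(n)
    H_RR = 1.0 + 0.1 * rng.standard_normal(n)
    H_RL = 0.4 + 0.1 * rng.standard_normal(n)
    H_LR = 0.4 + 0.1 * rng.standard_normal(n)
    C = ctc_filters(H_LL, H_RL, H_LR, H_RR)
    X_L = rng.standard_normal(n)              # binaural input spectra (placeholder)
    X_R = rng.standard_normal(n)
    Y_L, Y_R = apply_ctc(X_L, X_R, C)
    # check: the signal arriving at the left ear should equal the binaural input
    Z_L = Y_L * H_LL + Y_R * H_RL
    print(np.max(np.abs(Z_L - X_L)))          # small, up to the regularization error
```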
6.3 Haptic Interfaces

Haptics comes from the Greek haptesthai – to touch. Haptic refers to touching as visual refers to seeing and auditory to hearing. There are two types of human haptic sensing:

• tactile – the sensation arises from stimuli to the skin (skin receptors). Tactile haptics is responsible for sensing heat, pressure, vibration, slip, pain and surface texture.
• kinesthetic – the sensation arises from body movements. The mechanoreceptors are located in muscles, tendons and joints. Kinesthetic haptics is responsible for sensing limb positions, motion and forces.
VR research focuses mainly on kinesthetics, which is easier to realize technically. Kinesthetic human-computer interfaces are also called force feedback interfaces. Force feedback interfaces derive from robotics and teleoperation applications. In the early 1990s, the field of haptic force feedback began to develop an identity of its own. There were research projects specifically aimed at providing force feedback or tactile feedback rather than teleoperation. The first commercial device, the PHANToM (Fig. 6.14), went on sale in 1995. Designing and developing haptic interfaces is an intensely interdisciplinary area. Knowledge of human haptics, machine haptics and computer haptics has to be put together to accomplish a particular interaction task. Human haptics is the psychological study of the sensory and cognitive capacities relating to the touch sense, and of the interaction of touch with other senses. This knowledge is crucial to effective interface design. Machine haptics is where mechanical and electrical engineers get involved. It consists of the design of the robotic machine itself, including its kinematic configuration, electronics and sensing, and communication to the computer
Fig. 6.14. The PHANToM haptic device, Sensable Technologies.
controller. The computed haptic models and algorithms that run on a computer host belong to computer haptics. This includes the creation of virtual environments for particular applications, general haptic rendering techniques and control issues. The sensorimotor control guides our physical motion in coordination with our touch sense. There is a different balance of position and force control when we are exploring an environment (i.e. lightly touching a surface), versus manipulation (dominated by motorics), where we might rely on the touch sense only subconsciously. As anyone knows intuitively, the amount of force you generate depends on the way you hold something. Many psychological studies have been done to determine human capabilities in force sensing and control resolution. Frequency is an important sensory parameter. We can sense tactile frequencies much higher than kinesthetic. This makes sense, since skin can be vibrated more quickly than a limb or even a fingertip (10-10000 Hz vs. 20-30 Hz). The control bandwidth, i.e., how fast we can move our own limbs (5-10 Hz), is much lower than the rate of motion we can perceive. As a reference, table 6.1 provides more detail on what happens at different bandwidths. A force feedback device is typically a robotic device interacting with the user, who supplies physical input by imposing force or position onto the device. If the measured state is different from the desired state, a controller translates the desired signal (or the difference between the measured and the desired state) into a control action. The controller can be a computer algorithm or an electronic circuit. Haptic control systems can be difficult, mainly because the human user is a part of the system, and it is quite hard to predict what the user is going to do. Fig. 6.15 shows the haptic control loop schematically. The virtual environment takes as input the measured state, and supplies a desired state. The desired state depends on the particular environment, and its computation is called haptic rendering. The goal of haptic rendering is to enable a user to touch, feel, and manipulate virtual objects through a haptic device. A rendering algorithm will generally include a code loop that reads the position of the device, tests the virtual environment for
Haptic Interfaces
Table 6.1. Force sensing and control resolution.
1-2 Hz      The maximum bandwidth with which the human finger can react to unexpected force/position signals
5-10 Hz     The maximum bandwidth with which the human finger can apply force and motion commands comfortably
8-12 Hz     The bandwidth beyond which the human finger cannot correct for its positional disturbances
12-16 Hz    The bandwidth beyond which the human fingers cannot correct their grasping forces if the grasped object slips
20-30 Hz    The minimum bandwidth with which the human finger demands the force input signals to be present for meaningful perception
320 Hz      The bandwidth beyond which the human fingers cannot discriminate two consecutive force input signals
5-10 kHz    The bandwidth over which the human finger needs to sense vibration during skillful manipulative tasks
Fig. 6.15. The force feedback loop.
collisions, calculates collision responses and sends the desired device position to the device. The loop must execute in about 1 millisecond (1 kHz rate).

6.3.1 Rendering a Wall

A wall is the most common virtual object to render haptically. It involves two different modes (contact and non-contact) and a discontinuity between them. In non-contact mode, your hand is free (although holding the haptic device) and no forces are acting, until it encounters the virtual object. The force is proportional to the penetration depth determined by the collision detection, pushing the device out of the virtual wall (Fig. 6.16). The position of the haptic device in the virtual environment is called the haptic interaction point (HIP). Rendering a wall is actually one of the hardest tests for a haptic interface even though it is conceptually so simple. This is because of the discontinuity between a regime with zero stiffness and a regime with very high stiffness (usually the highest stiffness the system can manage). This can lead to computational and physical instability and cause vibrations. In addition to that, the contact might not feel very
(a) no contact, F = 0
(b) contact, F = kΔx
Fig. 6.16. Rendering a wall. In non-contact mode (a) no forces are acting. Once the penetration into the virtual wall is detected (b), the force is proportional to the penetration depth, pushing the device out of the virtual wall.
hard. When you first enter the wall, the force you are feeling is not very large: kΔx is close to zero when you are near the surface. Fig. 6.17 depicts changes in force
Fig. 6.17. The energy gain. Due to the finite sampling frequency of the haptic device a “staircase” effect is observed when rendering a virtual wall. The difference in the areas enclosed by the curves that correspond to penetrating into and moving out of the virtual wall is a manifestation of energy gain.
profile with respect to position for real and virtual walls of a given stiffness. Since the position of the probe tip is sampled with a certain frequency during the simulation, a “staircase” effect is observed. The difference in the areas enclosed by the curves that correspond to penetrating into and moving out of the virtual wall is a manifestation of energy gain. This energy gain leads to instabilities as the stiffness coefficient is increased (compare the energy gains for stiffness coefficients k1 and k2 ). On the
other hand, a low value of the stiffness coefficient generates a soft virtual wall, which is not desirable either. The hardness of a contact can be increased by applying damping when the user enters the wall, i.e., by adding a term proportional to the velocity v to the computation of the force:

F = k\Delta x + b v

This makes the force large right at entry if the HIP is moving with any speed. The damping term has to be turned off when the HIP is moving out of the wall, otherwise the wall would feel sticky. Adding either visual or acoustic reinforcement of the wall's entry can make the wall seem harder than relying on haptic feedback alone. Therefore, the wall penetration should never be displayed visually. The HIP is always displayed where it would be in the real world (on the object's surface). In the literature, the ideal position has different names, but all of them refer to the same concept (Fig. 6.18).
Fig. 6.18. The surface contact point (SCP) is also called the ideal haptic interaction point (IHIP), the god-object or the proxy point. Basically, it corresponds to the position of the HIP on the object's surface.
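The wall rendering rule can be sketched in a few lines in the style of the 1 kHz loop described above. The device interface (`read_position`, `send_force`) is a placeholder, the wall is the plane x = 0 with free space at x > 0, and the stiffness and damping values are arbitrary example numbers; a real implementation would run against an actual haptic API and a full collision detection step.

```python
import time

K_WALL = 800.0      # stiffness in N/m (example value)
B_WALL = 2.0        # damping in N*s/m, applied only while moving into the wall
DT = 0.001          # 1 kHz haptic update rate

def wall_force(x, v):
    """Spring-damper wall at x = 0; free space is x > 0."""
    if x >= 0.0:
        return 0.0                      # non-contact mode: no force
    force = -K_WALL * x                 # spring pushes the HIP out of the wall
    if v < 0.0:                         # damping only while penetrating deeper,
        force += -B_WALL * v            # otherwise the wall would feel sticky
    return force

def read_position():                    # placeholder for the device driver call
    return 0.0

def send_force(f):                      # placeholder for the device driver call
    pass

def haptic_loop(steps=1000):
    x_prev = read_position()
    for _ in range(steps):
        x = read_position()
        v = (x - x_prev) / DT           # crude velocity estimate
        send_force(wall_force(x, v))
        scp = max(x, 0.0)               # SCP: where the probe is drawn graphically
        x_prev = x
        time.sleep(DT)                  # placeholder for the 1 ms scheduling
    return scp

if __name__ == "__main__":
    haptic_loop(10)
```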
6.3.2 Rendering Solid Objects

Penalty based methods represent a naive approach to the rendering of solid objects. Similarly to the rendering of a wall, the force is proportional to the penetration depth. Fig. 6.19 demonstrates the limitations of the penalty based methods. The penalty based methods simply pull the HIP to the closest surface point without tracking the HIP history. This causes force discontinuities at the object's corners, where the probe is attracted to other surfaces than the one originally penetrated. When the user touches a thin surface, he feels a small force. As he pushes harder, he penetrates deeper into the object until he passes more than halfway through the object, where the force vector changes direction and shoots him out the other side. Furthermore, when multiple primitives touch or are allowed to intersect, it is often difficult to determine which exterior surface should be associated with a given internal volume. In the worst case, a global search of all primitives may be required to find the nearest exterior surface. To avoid these problems, constraint based methods have been proposed by [21] (the god-object method) and [17] (the virtual
(a) object corners
(b) thin objects
(c) multiple touching objects
Fig. 6.19. The limitations of the penalty based methods.
proxy method). These methods are similar in that they trace the position of a virtual proxy, which is placed where the haptic interface point would be if the haptic interface and the object were infinitely stiff (Figure 6.20).
Fig. 6.20. The virtual proxy. As the probe (HIP) is moving, the virtual proxy (SCP) is trying to keep the minimal distance to the probe by moving along the constrained surfaces.
The constraint based methods use the polygonal representation of the geometry to create one-way penetration constraints. The constraint is a plane determined by the polygon through which the geometry has been entered. For convex parts of objects, only one constraint is active at a time. Concave parts require up to three active constraints at a time. The movement is then constrained to a plane, a line or a point (Fig. 6.21). Finally, we are going to show how the SCP can be computed if the HIP and the constraint planes are known. Basically, we want to minimize the distance between the SCP and the HIP so that the SCP is placed on the active constraint plane(s). This is a typical application case for Lagrange multipliers. Lagrange multipliers are a mathematical tool for constrained optimization of differentiable functions. In the unconstrained situation, we have some (differentiable) function f that we want to minimize (or maximize). This can be done by finding the points where the gradient ∇f is zero, or, equivalently, each of the partial derivatives is zero. The constraints have to be given by functions g_i (one function for each constraint). The ith constraint is fulfilled if g_i(x) = 0. It can be shown that for the constrained case, the gradient of f has to be equal to a linear combination of the gradients of the constraints. The coefficients of the linear combination are the Lagrange multipliers λ_i.
Fig. 6.21. The principle of constraints. Convex objects require only one active constraint at a time. Concave objects require up to three active constraints at a time. The movement is then constrained to a plane, a line or a point.
\nabla f(x) = \lambda_1 \nabla g_1(x) + \dots + \lambda_n \nabla g_n(x)

The function we want to minimize is

|HIP - SCP| = \sqrt{(x_{HIP} - x_{SCP})^2 + (y_{HIP} - y_{SCP})^2 + (z_{HIP} - z_{SCP})^2}

Alternatively, we can use

f = \frac{1}{2} \left[ (x_{HIP} - x_{SCP})^2 + (y_{HIP} - y_{SCP})^2 + (z_{HIP} - z_{SCP})^2 \right]

which corresponds to the energy of a virtual spring between SCP and HIP with the spring constant k = 1. This function is easier to differentiate and leads to the same results. The constraint planes are defined by the plane equation

g_i = A_i x_{SCP} + B_i y_{SCP} + C_i z_{SCP} - D_i

The unknowns are x_{SCP}, y_{SCP}, z_{SCP} and the λ_i. Evaluating the partial derivatives and rearranging leads to the following system of 4, 5, or 6 linear equations (depending on the number of constraints):

\begin{bmatrix}
1 & 0 & 0 & A_1 & A_2 & A_3 \\
0 & 1 & 0 & B_1 & B_2 & B_3 \\
0 & 0 & 1 & C_1 & C_2 & C_3 \\
A_1 & B_1 & C_1 & 0 & 0 & 0 \\
A_2 & B_2 & C_2 & 0 & 0 & 0 \\
A_3 & B_3 & C_3 & 0 & 0 & 0
\end{bmatrix}
\begin{bmatrix} x_{SCP} \\ y_{SCP} \\ z_{SCP} \\ \lambda_1 \\ \lambda_2 \\ \lambda_3 \end{bmatrix}
=
\begin{bmatrix} x_{HIP} \\ y_{HIP} \\ z_{HIP} \\ D_1 \\ D_2 \\ D_3 \end{bmatrix}

The first three rows are always present; each active constraint adds one further row and column.
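A minimal sketch of this computation is given below: it assembles the linear system above for one, two or three active constraint planes and solves it with a standard linear solver. The plane parameters and the HIP position are made-up example values, and collision detection, which decides which constraints are active, is assumed to have happened already.

```python
import numpy as np

def surface_contact_point(hip, planes):
    """Compute the SCP for an HIP constrained by 1-3 planes.

    hip: array of shape (3,); planes: list of (A, B, C, D) tuples with unit
    normals (A, B, C), describing active constraints A x + B y + C z = D.
    Solves the Lagrange-multiplier system of 4, 5 or 6 equations.
    """
    n = len(planes)
    assert 1 <= n <= 3
    M = np.zeros((3 + n, 3 + n))
    rhs = np.zeros(3 + n)
    M[:3, :3] = np.eye(3)
    rhs[:3] = hip
    for i, (A, B, C, D) in enumerate(planes):
        M[:3, 3 + i] = (A, B, C)     # gradient of the i-th constraint
        M[3 + i, :3] = (A, B, C)     # the constraint equation itself
        rhs[3 + i] = D
    sol = np.linalg.solve(M, rhs)
    return sol[:3], sol[3:]          # SCP coordinates and Lagrange multipliers

if __name__ == "__main__":
    hip = np.array([0.2, -0.3, 0.1])         # penetrated HIP (example values)
    plane = (0.0, 1.0, 0.0, 0.0)             # single active constraint: plane y = 0
    scp, lams = surface_contact_point(hip, [plane])
    print(scp)                               # -> [0.2, 0.0, 0.1]
```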
6.4 Modeling the Behavior of Virtual Objects

Physically based modeling (PBM) became an important approach to computer animation and computer graphics modeling in the late 1980s. Rising computer performance allows physics to be simulated in real time, making PBM a useful technique for VR. A lot of effort is being invested to make virtual worlds look realistic. PBM is the next step toward making virtual objects behave realistically. Users are familiar with the behavior of objects in the real world, therefore providing "real physics" in the virtual environment allows for an intuitive interaction with the virtual objects. There is a wide range of VR applications that benefit from PBM. To mention some examples, think of assembly simulations, robotics, training and teaching (medical, military, sports), cloth simulation, hair simulation, and entertainment. In fact, anything that flies, rolls, slides or deforms can be modeled to create believable content of the virtual world.

Physics is a vast field of science that covers many different subjects. The subject most applicable to virtual worlds is mechanics, and especially a part of it called dynamics, which focuses on bodies in motion. Within the subject of dynamics, there are even more specific subjects to investigate, namely kinematics and kinetics. Kinematics focuses on the motion of bodies without regard to forces. Kinetics considers both the motion of bodies and the forces that affect their motion. PBM involves numerical methods for solving the differential equations describing the motion, collision detection and collision response, application of constraints (e.g., joints or contact) and, of course, the laws of mechanics.¹

6.4.1 Particle Dynamics

Particles are objects that have mass, position and velocity (the particle state), and respond to forces, but they have no spatial extent. Particles are the easiest objects to simulate. Particle systems are used to simulate natural phenomena, e.g., rain and dust, and nonrigid structures, e.g., cloth – particles connected by springs and dampers. The theory of mechanics is based on Newton's laws of motion, of which Newton's second law is of particular interest in the study of dynamics. It states that a force F acting on a body gives it an acceleration a, which is in the direction of the force and has a magnitude inversely proportional to the mass m of the body:

F = ma    (6.18)
Further, we can introduce the particle velocity v and acceleration a as the first and second time derivatives of position:

\dot{x} = \frac{dx}{dt} = v    (6.19)

\ddot{x} = \frac{d^2 x}{dt^2} = \frac{dv}{dt} = a    (6.20)
1 The Particle Dynamics and Rigid Body Dynamics sections are based on the Physically Based Modeling SIGGRAPH 2001 Course Notes by Andrew Witkin and David Baraff. http://www.pixar.com/companyinfo/research/pbm2001
Equation 6.18 is thus an ordinary differential equation (ODE) of second order, and solving it means finding a function x that satisfies the relation F = m\ddot{x}. Moreover, the initial value x_0 at some starting time t_0 is given and we are interested in following x in the time thereafter. There are various kinds of forces in the real world, e.g., the gravitational force causes a constant acceleration, a damping force is proportional to velocity, and a (static) spring force is proportional to the spring elongation. The left side of equation 6.18 has to be replaced with a specific formula describing the acting force. When solving the equation numerically it is easier to split it into two equations of first order by introducing the velocity. The state of the particle is then described by its position and velocity: [x, v]. At the beginning of the simulation, the particle is in the initial state [x_0, v_0]. A discrete time step Δt is used to evolve the state [x_0, v_0] → [x_1, v_1] → [x_2, v_2] → ... corresponding to the successive time steps. The most basic numerical integration method is the Euler method:

v = \frac{dx}{dt} \;\xrightarrow{\text{finite time step}}\; \Delta x = v \Delta t \;\xrightarrow{\text{explicit}}\; x_{t+\Delta t} = x_t + \Delta t\, v_t    (6.21)

\xrightarrow{\text{implicit}}\; x_{t+\Delta t} = x_t + \Delta t\, v_{t+\Delta t}    (6.22)

a = \frac{dv}{dt} \;\xrightarrow{\text{finite time step}}\; \Delta v = a \Delta t \;\xrightarrow{\text{explicit}}\; v_{t+\Delta t} = v_t + \Delta t\, a_t    (6.23)

\xrightarrow{\text{implicit}}\; v_{t+\Delta t} = v_t + \Delta t\, a_{t+\Delta t}    (6.24)
with the initial conditions x_{t=0} = x_0, v_{t=0} = v_0. Equations 6.21 and 6.23 correspond to the explicit (forward) Euler method, equations 6.22 and 6.24 to the implicit (backward) Euler method. For each time step t, the current state [x_t, v_t] is known, and the acceleration can be evaluated using equation 6.18:

a_t = \frac{F_t}{m}

Though simple, the explicit Euler method is not accurate. Moreover, the explicit Euler method can be unstable. For sufficiently small time steps we get reasonable behavior, but as the time step gets larger, the solution oscillates or even explodes toward infinity. The implicit Euler method is unconditionally stable, yet in general not necessarily more accurate. However, the implicit method leads to a system of equations that has to be solved in each time step, as x_{t+\Delta t} and v_{t+\Delta t} depend not only on the previous time step but also on v_{t+\Delta t} and a_{t+\Delta t}, respectively. The first few terms of an approximation of a_{t+\Delta t} by a Taylor series expansion can be used in equation 6.24:

a_{t+\Delta t} = a(x_{t+\Delta t}, v_{t+\Delta t}) = a(x_t, v_t) + \Delta x \frac{\partial a(x_t, v_t)}{\partial x} + \Delta v \frac{\partial a(x_t, v_t)}{\partial v} + \dots

v_{t+\Delta t} = v_t + \Delta t\, a_{t+\Delta t} \;\Rightarrow\; \Delta v = \Delta t \left[ a(x_t, v_t) + \Delta x \frac{\partial a(x_t, v_t)}{\partial x} + \Delta v \frac{\partial a(x_t, v_t)}{\partial v} \right]
Furthermore, Δx can be eliminated using equation 6.22:

x_{t+\Delta t} = x_t + \Delta t\, v_{t+\Delta t} \;\Rightarrow\; \Delta x = \Delta t\, v_{t+\Delta t} = \Delta t (v_t + \Delta v)

\Delta v = \Delta t \left[ a(x_t, v_t) + \Delta t (v_t + \Delta v) \frac{\partial a(x_t, v_t)}{\partial x} + \Delta v \frac{\partial a(x_t, v_t)}{\partial v} \right]

Finally, we get an equation for Δv:

\left[ I_{3\times 3} - \Delta t \frac{\partial a(x_t, v_t)}{\partial v} - \Delta t^2 \frac{\partial a(x_t, v_t)}{\partial x} \right] \Delta v = \Delta t \left[ a_t + \Delta t \frac{\partial a(x_t, v_t)}{\partial x} v_t \right]    (6.25)

This equation has to be solved for Δv in each time step. In general, it is a system of three equations. The quantities a, v and x are vectors, therefore ∂a/∂v and ∂a/∂x are 3 × 3 matrices. The I_{3×3} on the left side of the equation is a 3 × 3 identity matrix. However, in many practical cases, the matrices can be replaced by a scalar. When equation 6.25 is solved, v_{t+\Delta t} and x_{t+\Delta t} can be computed as

v_{t+\Delta t} = v_t + \Delta v
x_{t+\Delta t} = x_t + \Delta t\, v_{t+\Delta t}

The choice of the integration method depends on the problem. The explicit Euler method is very simple to implement and is sufficient for a number of problems. Other methods focus on improving accuracy, but of course the price for this accuracy is a higher computational cost per time step. In fact, more sophisticated methods can turn out to be worse than the Euler method, because their higher cost per step outweighs the increase of the time step they allow. For more numerical integration methods (e.g., midpoint, Runge-Kutta, Verlet, multistep methods), stability analysis and methods for solving systems of linear and nonlinear equations see textbooks on numerical mathematics or [16].

6.4.2 Rigid Body Dynamics

Simulating the motion of a rigid body is almost the same as simulating the motion of a particle. The location of a particle in space at time t can be described as a vector x_t, which describes the translation of the particle from the origin. Rigid bodies are more complicated, in that in addition to translating them, we can also rotate them. Moreover, a rigid body, unlike a particle, has a particular shape and occupies a volume of space. The shape of a rigid body is defined in terms of a fixed and unchanging space called body space. Given a geometric description of a body in body space, we use a vector x_t and a matrix R_t to translate and rotate the body in world space, respectively (Fig. 6.22). In order to simplify some equations, it is required that the origin of the body space lies in the center of mass of the body. If p_0 is an arbitrary point on the rigid body in body space, then the world space location p(t) of p_0 is the result of rotating p_0 about the origin and then translating it:

p(t) = R(t) p_0 + x(t)    (6.26)
Fig. 6.22. Body space vs. world space
We have defined the position and orientation of the rigid body. As the next step we need to determine how they change over time, i.e., we need expressions for the time derivatives \dot{x}(t) and \dot{R}(t). Since x(t) is the position of the center of mass in world space, \dot{x}(t) is the velocity of the center of mass in world space v(t) (also called linear velocity). The angular velocity ω(t) describes the spinning of a rigid body. The direction of ω(t) gives the direction of the spinning axis that passes through the center of mass. The magnitude of ω(t) tells how fast the body is spinning. Angular velocity ω(t) cannot
Fig. 6.23. Linear and angular velocity.
be \dot{R}(t), since ω(t) is a vector and \dot{R}(t) is a matrix. Instead:

\dot{r}(t) = \omega(t) \times r(t), \qquad \dot{R}(t) = \omega^* R(t), \qquad
\omega^* = \begin{bmatrix} 0 & -\omega_z(t) & \omega_y(t) \\ \omega_z(t) & 0 & -\omega_x(t) \\ -\omega_y(t) & \omega_x(t) & 0 \end{bmatrix}

where r(t) is a vector in world space fixed to the rigid body, r(t) = p(t) − x(t), and R(t) is the rotation matrix of the body.
To determine the mass of the rigid body, we can imagine that the rigid body is made up of a large number of small particles. The particles are indexed from 1 to N. The mass of the ith particle is m_i. The total mass of the body M is the sum

M = \sum_{i=1}^{N} m_i    (6.27)
Each particle has a constant location p_{0i} in body space. The location of the ith particle in world space at time t is therefore given by the formula

p_i(t) = R(t) p_{0i} + x(t)    (6.28)
The center of mass of a rigid body in world space is

c(t) = \frac{\sum_{i=1}^{N} m_i p_i(t)}{M} = \frac{\sum_{i=1}^{N} m_i \left( R(t) p_{0i} + x(t) \right)}{M} = R(t) \frac{\sum_{i=1}^{N} m_i p_{0i}}{M} + x(t)

When we are using the center of mass coordinate system for the body space, it means that

\frac{\sum_{i=1}^{N} m_i p_{0i}}{M} = 0

Therefore, the position of the center of mass in world space is x(t). Imagine F_i as an external force acting on the ith particle. The torque acting on the ith particle is defined as

\tau_i(t) = (p_i(t) - x(t)) \times F_i(t)

Torque differs from force in that it depends on the location of the particle relative to the center of mass. We can think of the direction of torque as being the axis the body would spin about due to F_i if the center of mass were held in place.

F = \sum_{i=1}^{N} F_i(t)    (6.29)

\tau = \sum_{i=1}^{N} \tau_i(t) = \sum_{i=1}^{N} (p_i(t) - x(t)) \times F_i(t)    (6.30)
The torque τ(t) tells us something about the distribution of forces over the body. The total linear momentum P(t) of a rigid body is the sum of the products of mass and velocity of each particle:

P(t) = \sum_{i=1}^{N} m_i \dot{p}_i(t)
which leads to

P(t) = M v(t)

Thus, the linear momentum of a rigid body is the same as if the body were a particle with mass M and velocity v(t). Because of this, we can also write

\dot{P}(t) = M \dot{v}(t) = F(t)

The change in linear momentum is equivalent to the force acting on a body. Since the relationship between P(t) and v(t) is simple, we can use P(t) as a state variable of the rigid body instead of v(t). However, P(t) tells us nothing about the rotational velocity of the body. While the concept of linear momentum P(t) is rather intuitive (P(t) = M v(t)), the concept of angular momentum L(t) for a rigid body is not. However, it will let us write simpler equations than using angular velocity. Similarly to the linear momentum, the angular momentum is defined as

L(t) = I(t) \omega(t)

where I(t) is a 3 × 3 matrix called the inertia tensor. The inertia tensor describes how the mass in a body is distributed relative to the center of mass. It depends on the orientation of the body, but does not depend on the body's translation.

I(t) = \sum_{i=1}^{N} m_i \left[ \left( r_i^T(t) \cdot r_i(t) \right) [1]_{3\times 3} - r_i(t) \cdot r_i^T(t) \right], \qquad r_i(t) = p_i(t) - x(t)

[1]_{3\times 3} is a 3 × 3 identity matrix. Using the body space coordinates, I(t) can be efficiently computed for any orientation:

I(t) = R(t) I_{body} R^T(t)

I_{body} is specified in the body space and is constant over the simulation:

I_{body} = \sum_{i=1}^{N} m_i \left[ \left( p_{0i}^T \cdot p_{0i} \right) [1]_{3\times 3} - p_{0i} \cdot p_{0i}^T \right]

To calculate the inertia tensor, the sum can be treated as a volume integral:

I_{body} = \rho \int_V \left[ \left( p_0^T \cdot p_0 \right) [1]_{3\times 3} - p_0 \cdot p_0^T \right] dV

ρ is the density of the rigid body (M = ρV). For example, the inertia tensor of a block a × b × c with origin in the center of mass is

I_{body} = \rho \int_V \begin{bmatrix} y^2 + z^2 & -yx & -zx \\ -xy & x^2 + z^2 & -zy \\ -xz & -yz & x^2 + y^2 \end{bmatrix} dx\, dy\, dz = \frac{M}{12} \begin{bmatrix} b^2 + c^2 & 0 & 0 \\ 0 & a^2 + c^2 & 0 \\ 0 & 0 & a^2 + b^2 \end{bmatrix}
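As a plausibility check for these formulas, the following sketch approximates I_body of a box by the particle sum above, using a regular grid of point masses, and compares it with the analytic result M/12 · diag(b²+c², a²+c², a²+b²). The grid resolution and the box dimensions are arbitrary example values.

```python
import numpy as np

def inertia_box_analytic(M, a, b, c):
    """Analytic inertia tensor of a homogeneous a x b x c block."""
    return M / 12.0 * np.diag([b * b + c * c, a * a + c * c, a * a + b * b])

def inertia_from_particles(points, masses):
    """I_body = sum_i m_i [(p_i . p_i) 1_{3x3} - p_i p_i^T] for point masses."""
    I = np.zeros((3, 3))
    for p, m in zip(points, masses):
        I += m * (np.dot(p, p) * np.eye(3) - np.outer(p, p))
    return I

if __name__ == "__main__":
    a, b, c, M = 2.0, 1.0, 0.5, 3.0          # example block and mass
    n = 20                                   # grid resolution per axis (example)
    xs = (np.arange(n) + 0.5) / n - 0.5      # cell centers in [-0.5, 0.5)
    grid = np.array([[x * a, y * b, z * c] for x in xs for y in xs for z in xs])
    masses = np.full(len(grid), M / len(grid))
    print(inertia_from_particles(grid, masses))
    print(inertia_box_analytic(M, a, b, c))  # the two results agree closely
```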
The relationship between the angular momentum L(t) and the total torque τ is very simple,

\dot{L}(t) = \tau(t)

analogously to the relation \dot{P}(t) = F(t). The simulation loop performs the following steps:

• determine the force F(t) and torque τ(t) acting on the rigid body
• update the state derivatives \dot{x}(t), \dot{R}(t), \dot{P}(t), \dot{L}(t)
• update the state, e.g., using the Euler method
Table 6.2 summarizes all important variables and formulas for particle dynamics and rigid body dynamics.

Table 6.2. The summary of particle dynamics and rigid body dynamics.

                            Particles          Rigid bodies
constants                   m                  M, I_body
state                       x(t), v(t)         x(t), R(t), P(t), L(t)
time derivatives of state   v(t), a(t)         v(t), ω*R(t), F(t), τ(t)
simulation input            F(t)               F(t), τ(t)
useful forms                F(t) = m a(t)      P(t) = M v(t)
                                               ω(t) = I^{-1}(t) L(t)
                                               I^{-1}(t) = R(t) I_body^{-1} R^T(t)
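To make the simulation loop concrete, here is a minimal explicit-Euler update of the rigid-body state (x, R, P, L) using the relations from Table 6.2. The force/torque input, the time step and the body constants are placeholder examples, and no re-orthonormalization of R or higher-order integrator is included, both of which a production simulation would need.

```python
import numpy as np

def skew(w):
    """The matrix w* with w* r = w x r."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def euler_step(state, M, I_body_inv, force, torque, dt):
    """One explicit Euler step for the state (x, R, P, L) of Table 6.2."""
    x, R, P, L = state
    v = P / M                                  # P(t) = M v(t)
    I_inv = R @ I_body_inv @ R.T               # I^-1(t) = R I_body^-1 R^T
    w = I_inv @ L                              # omega(t) = I^-1(t) L(t)
    x = x + dt * v                             # xdot = v
    R = R + dt * (skew(w) @ R)                 # Rdot = w* R
    P = P + dt * force                         # Pdot = F
    L = L + dt * torque                        # Ldot = tau
    return x, R, P, L

if __name__ == "__main__":
    M = 2.0
    I_body_inv = np.linalg.inv(np.diag([0.4, 0.3, 0.2]))   # example inertia tensor
    state = (np.zeros(3), np.eye(3), np.zeros(3), np.array([0.0, 0.0, 0.1]))
    gravity = np.array([0.0, -9.81 * M, 0.0])
    for _ in range(100):                        # simulate 0.1 s at a 1 kHz rate
        state = euler_step(state, M, I_body_inv, gravity, np.zeros(3), 0.001)
    print(state[0])                             # fallen by roughly 0.05 m
```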
6.5 Interacting with Rigid Objects

Multiple studies have been carried out which demonstrate that physically-based modeling (PBM) and haptics have the potential to significantly improve interaction in virtual environments. As an example, one such study is described in [18]. Here, three virtual elements in the form of the letters I, L, and T had to be positioned into a frame. The experiment consisted of three phases. In the first phase, subjects had to complete the task in a native "virtual environment", where collisions between a letter and the frame were only optically signaled by a wireframe presentation of the letter. During the second phase, the subjects were supported by haptic feedback. Finally, in the third phase, the subjects were asked to move the letters above their target positions and let them fall into the frame. The falling procedure, as well as the behaviour of the objects when touching the frame, had been physically modeled. The results depicted in Fig. 6.24 show that haptics as well as PBM significantly improve the user interface. However, the preceding sections about haptics and PBM should also have made clear that the ultimate goal of VR – an exact correlation between interaction in the virtual and in the real world – cannot be achieved in the near future for the following main reasons:

• Modeling
Fig. 6.24. Influence of haptics and PBM on the judgement of task difficulty (a) and user interface (b) [18].
  – Geometry: To achieve a graphical representation of the virtual scene in real time, the shapes of virtual objects are typically tessellated, i.e., they are modeled as polyhedrons. It is obvious that with such a polygonal representation, curved surfaces in the virtual world can only be an approximation of those in the real world. As a consequence, even very simple positioning tasks lead to undesired effects. As an example, it is nearly impossible for a user to put a cylindrical virtual bolt into a hole, since both bolt and hole are not perfectly round, so that inadequate collisions inevitably occur.
  – Physics: Due to the discrete structure of a digital computer, collisions between virtual objects and the resulting, physically-based collision reaction can only be computed at discrete time steps. As a consequence, not only the graphical, but also the "physical" complexity of a virtual scene is limited by the performance of the underlying computer system. As with the graphical representation, the physical behavior of virtual objects can only be an approximation of reality.
• Hardware Technology: The development of adequate interaction devices, especially force and tactile feedback devices, remains a challenging task. Hardware solutions suitable for use in industrial and scientific applications, like the PHANToM Haptic Device, typically only display a single force vector and cover a rather limited work volume. First attempts have also been made to add force feedback to instrumented gloves (see Fig. 6.25), thus providing kinesthetic feedback to complete hand movements. It remains questionable, however, whether such approaches will become accepted in industrial and technical applications. With regard to tactile devices, it seems altogether impossible to develop feasible solutions due to the large number of tactile sensors in the human skin that would have to be addressed.
In order to compensate for the problems arising from the non-exact modeling of geometry and behavior of virtual objects as well as from the inadequacy of available interaction devices, additional mechanisms should be made available to the user which facilitate the execution of manipulation tasks in virtual environments.
Fig. 6.25. The CyberForceTM Haptic Interface, Immersion Corporation.
As an example, we show here how elementary positioning tasks can be supported by artificial support mechanisms which have been developed at RWTH Aachen University [18]. Fig. 6.26 illustrates how the single mechanisms – guiding sleeves, sensitive polygons, virtual magnetism, and snap-in – are assigned to the three classical phases of a positioning process. On the book CD, movies are available which show these support mechanisms. In the following, the single mechanisms will be described in more detail.

6.5.1 Guiding Sleeves

At the beginning of a positioning task is the transport phase, i.e., the movement to bring a selected virtual object towards the target position. Guiding sleeves are a support mechanism for the transport phase which make use of a priori knowledge about possible target positions and orientations of objects. In the configuration phase of the virtual scenario, the guiding sleeves are generated in the form of a scaled copy or a scaled bounding box of the objects that are to be assembled, and then placed as invisible items at the corresponding target location. When an interactively guided object collides with an adequate guiding sleeve, the target position and orientation are visualized by means of a semi-transparent copy of the manipulated object. In addition, the motion path necessary to complete the positioning task is animated as a wireframe representation (see Fig. 6.27). When the orientations of objects are described as quaternions (see [10]), the animation path can be calculated on the basis of the position difference p_diff and the orientation difference q_diff between the current position p_cur and orientation q_cur, and the target position p_goal and orientation q_goal:
Fig. 6.26. Assignment of support mechanisms to the single interaction phases of a positioning task.
Fig. 6.27. Guiding Sleeves support the user during the transport phase of a positioning task by means of a wireframe animation of the correct path. A video of this scenario is available on the book CD.
p_{diff} = p_{cur} - p_{goal}    (6.31)
q_{diff} = q_{cur} - q_{goal}    (6.32)
The in-between representations are then computed according to the Slerp (Spherical Linear intERPolation) algorithm [4]:
Slerp(q_{n,cur}, q_{n,goal}, t) = \frac{\sin((1 - t)\theta)}{\sin(\theta)} q_{n,cur} + \frac{\sin(t\theta)}{\sin(\theta)} q_{n,goal},    (6.33)

\theta = \arccos(q_{n,cur} \cdot q_{n,goal}), \quad t \in [0.0, 1.0].    (6.34)
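A direct transcription of equations 6.33 and 6.34 is sketched below, with quaternions represented as plain 4-vectors. The normalization of the inputs and the fallback for nearly identical orientations (where sin θ approaches zero) are assumptions of this sketch; hemisphere handling (negating one quaternion when the dot product is negative) is left out.

```python
import numpy as np

def slerp(q_cur, q_goal, t):
    """Spherical linear interpolation between two unit quaternions (4-vectors)."""
    q0 = q_cur / np.linalg.norm(q_cur)
    q1 = q_goal / np.linalg.norm(q_goal)
    d = np.clip(np.dot(q0, q1), -1.0, 1.0)
    theta = np.arccos(d)
    if np.sin(theta) < 1e-6:                 # orientations (nearly) identical:
        return (1.0 - t) * q0 + t * q1       # fall back to linear interpolation
    return (np.sin((1.0 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)

if __name__ == "__main__":
    q_cur = np.array([1.0, 0.0, 0.0, 0.0])                               # identity (w, x, y, z)
    q_goal = np.array([np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)])  # 90 deg about z
    for t in np.linspace(0.0, 1.0, 5):                                   # animation path samples
        print(t, slerp(q_cur, q_goal, t))
```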
The interpolation step t = 0.0 describes the normalized current orientation q_{n,cur}, and t = 1.0 stands for the normalized goal orientation q_{n,goal}.

6.5.2 Sensitive Polygons

Sensitive polygons are created before simulation start and positioned at adequate positions on object surfaces. A function can be assigned to every sensitive polygon, which is automatically called when an interactively guided element collides with the sensitive polygon. Sensitive polygons are primarily designed to support the user during the coarse positioning phase of an assembly task. Among others, the polygons can be used to constrain the degrees of freedom of interactively guided objects. For instance, a sensitive polygon positioned at the port of a hole can restrict the motion of a pin to the axial direction of the hole and thus considerably facilitate the manipulation task. In the case of this classical peg-in-hole assembly task, a copy of the guided object is created and oriented in the direction of the hole's axis. The original object is switched to a semi-transparent representation (see Fig. 6.28, left) in order to provide the user at any time with visual feedback of the actual movement. The new position P_neu of the copy can be calculated from the current manipulator position P_act, the object's target position P_z, and the normal vector N of the sensitive polygon, which is identical to the object's target orientation:

P_{neu} = P_z + \left[ (P_{act} - P_z) \cdot N \right] N    (6.35)
Fig. 6.28. Sensitive polygons can be used to constrain the degrees of freedom for an interactively guided object.
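The projection of equation 6.35 can be written in a few lines; the sketch below constrains a guided object to the axis of a hole defined by a sensitive polygon. The variable names follow the equation, the vectors are example values, and the bookkeeping around the semi-transparent copy of the object is omitted.

```python
import numpy as np

def constrain_to_axis(P_act, P_z, N):
    """Eq. (6.35): project the manipulator position onto the hole axis.

    P_act: current manipulator position, P_z: target position,
    N: unit normal of the sensitive polygon (the hole's axis direction).
    """
    N = N / np.linalg.norm(N)
    return P_z + np.dot(P_act - P_z, N) * N

if __name__ == "__main__":
    P_z = np.array([0.0, 0.0, 0.0])          # port of the hole (example)
    N = np.array([0.0, 0.0, 1.0])            # hole axis along z (example)
    P_act = np.array([0.03, -0.02, 0.25])    # tracked manipulator position
    print(constrain_to_axis(P_act, P_z, N))  # -> [0. 0. 0.25], on the axis
```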
6.5.3 Virtual Magnetism

In assembly procedures, it is often necessary to position objects exactly, i.e., parallel to each other at arbitrary locations, without leaving any space between them. Since
in a virtual environment this task is nearly impossible to accomplish without any artificial support mechanisms, the principle of virtual magnetism has been introduced [18]. The left part of Fig. 6.29 illustrates the effect of virtual magnetism on movable objects. In contrast to other support mechanisms, virtual magnetism should not be activated automatically, in order to avoid undesired object movements. Instead, the user should actively initiate it during the simulation, e.g., by a speech command.
Fig. 6.29. Virtual magnetism to facilitate a correct alignment of virtual objects. A short video on the book CD shows virtual magnetism in action.
Before simulation start, an object's influence radius is calculated as a function of its volume. Whenever two such influence areas overlap during the simulation, an attraction force is generated that consists of two components (see Fig. 6.29, right). The first component is calculated from the distance between the two objects' centers of gravity, and the second component is a function of the polygon surfaces. To calculate the second component, a search beam is cast from every polygon midpoint in the direction of the surface normal. If such a beam intersects a polygon of the other object, a force directed along the surface normal is created, which is proportional to the distance between the polygons. Finally, the sum of the two force components is applied to move the smaller object towards the larger one.

6.5.4 Snap-In

In addition to virtual magnetism, a snap-in mechanism can be provided to support the final phase of a positioning task. The snap-in is activated when the position and orientation difference between a guided object and the target location falls below a specific threshold. As with guiding sleeves and sensitive polygons, snap-in mechanisms require a priori knowledge about the exact target positions. The following threshold conditions d and α for position and orientation can be used to initiate the snap-in process:

d \geq |P_{goal} - P_{cur}| \quad \text{and} \quad \alpha \geq \arccos \frac{N \cdot P_{cur}}{|N|\,|P_{cur}|}    (6.36)

Studies have shown that the additional support from guiding sleeves, sensitive polygons, virtual magnetism, and snap-in enhances user performance and acceptance significantly [18]. One of the experiments that have been carried out refers to a task,
in which virtual nails must be brought into the corresponding holes of a virtual block. The experiment consisted of three phases: In phase 1, subjects had to complete the task in a native virtual environment. In phases 2 and 3, they were supported by guiding sleeves and sensitive polygons, respectively. Every phase was carried out twice. During the first trial, objects snapped in as soon as the subject placed them within a certain tolerance of the target position. Fig. 6.30 depicts the qualitative analysis of this experiment. Subjects judged the task difficulty as much easier and the quality of the user interface as much better when the support mechanisms were activated.
Fig. 6.30. Average judgement of task difficulty and user interface with and without support mechanisms [18].
6.6 Implementing Virtual Environments

When implementing virtual environments, a number of obstacles have to be overcome. For a lively and vivid virtual world, there is the need for an efficient data structure that keeps track of hierarchies and dependencies between objects. The most fundamental data structure that is used for this purpose is called a scene graph. While scene graphs keep track of things in the virtual world, there is usually no standardized support for real world problems, e.g., the configuration of display and input devices and the management of user input. For tasks like these, a number of toolkits exist that assist in programming virtual environments. Basic assumptions and common principles about scene graphs and toolkits will be discussed in the following section.

6.6.1 Scene Graphs

The concept of scenes is derived from theatre plays, and describes units of a play that are smaller than an act or a picture. Usually, scenes define the changing of a place or the exchange of characters on the set. In direct analogy, for virtual environments scenes define the set of objects that make up the place where the user can act. This definition of course includes the objects that can be interacted with by the user. From a computer graphics point of view, several objects have to work together in order to display a virtual environment. The most prominent data is the geometry of the objects, usually defined as polygonal meshes. In addition to that, information about
the illumination of the scene is necessary. Trivially, positions and orientations of objects are needed in order to place objects in the scene. Looking closer, there often is a dependency between the geometrical transformations of certain objects, e.g., tires are always attached to the chassis of a car, and objects that reside inside the vehicle move along when the vehicle moves. For a programmer, it rapidly becomes a tedious task to keep these dependencies between objects correct. For that purpose, a scene graph is an established data structure that eases this task a lot. Basically, a scene graph can be modeled as a directed acyclic graph (DAG) where the nodes describe objects of the scene in various attributes, and edges define a "child of" dependency between these objects. In graphics processing, scene graphs are used for the description of a scene. As such, a scene graph provides a hierarchical framework for easily grouping objects spatially.

There are basically three processes that are covered by scene graph toolkits. The first is the management of objects in the scene, mainly creating, accessing, grouping and deleting them. The second one is the traversal of the scene, where distinct traversal paths can result in different scenes that are rendered. A further aspect covers operations that are called when encountering nodes, that is, realizing functionality on the basis of the types of objects that are contained in the scene.

As stated above, there exist different types of nodes that can be contained in a scene graph. In order to realize the spatial grouping of objects, group nodes define the root of a subtree of a scene graph. Nodes that do not have a child are called leaf nodes. Children of group nodes can be group nodes again or content nodes, that is, an abstract type of node that can have individual properties, such as a geometry to draw, a color to use, and so on. Depending on the toolkit you choose to implement the virtual environment, content nodes are never part of the scene graph directly, but are indirectly linked to traversal nodes in the scene graph. The scene graph nodes then only define the spatial relation of the content nodes. This enables the distinction of algorithms that do the traversal of a scene graph and the actual rendering of the contents. However, the following passages will assume that leaf nodes define the geometrical content to render. As all things have a beginning and an end, a special set of group nodes is designated as root nodes of the virtual world. While some toolkits allow only one root node, others may allow more than one and usually select a special root node for an upcoming rendering procedure. Fig. 6.31(a) depicts a naive example of a scene graph that could be used to render a car.

Generally speaking, the traversal of a scene graph is a concept of its own and thus the order in which a scene graph is traversed can be toolkit specific. In the subsequent passages, a left-order depth-first tree walk is assumed. That means that in every group node, the leftmost child is visited in order to determine the next traversal step. If this child is a group node itself, again the leftmost child will be visited. Once the traversal encounters a leaf node, the content that is linked to this leaf node will be rendered. After that, the right sibling of the rendered node will be visited until there is no right sibling. Once the traversal routine re-encounters the world's root node, another rendering traversal may begin as the whole scene is painted.
This process will terminate on a finite DAG. As stated above, any node defines a spatial relation; more precisely, transformations that are applied to the parent of a set of nodes affect the transformations of the child nodes.
(a) A naive scene graph example, where content nodes are modeled as leaf nodes.
(b) Still a simple but more elaborate scene graph for a car.
Fig. 6.31. Two very basic examples for a scene graph that describes a car with four tyres.
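A toy version of such a scene graph is sketched below: group nodes carry a local 4×4 transformation, leaf nodes stand in for content, and a left-order depth-first traversal accumulates the transformations from the root downwards before "rendering" each leaf. Node and method names are invented for this sketch and do not correspond to any particular toolkit; a real scene graph API additionally handles render state, culling and the content node types discussed in the surrounding text.

```python
import numpy as np

def translation(tx, ty, tz):
    """A 4x4 homogeneous translation matrix."""
    T = np.eye(4)
    T[:3, 3] = (tx, ty, tz)
    return T

class Node:
    """A toy scene graph node with a local transformation and children."""
    def __init__(self, name, local=None, children=()):
        self.name = name
        self.local = np.eye(4) if local is None else local
        self.children = list(children)

    def traverse(self, parent_world=None):
        """Left-order depth-first walk accumulating world transformations."""
        parent_world = np.eye(4) if parent_world is None else parent_world
        world = parent_world @ self.local
        if not self.children:                    # leaf node: "render" its content
            print(f"render {self.name} at {world[:3, 3]}")
        for child in self.children:              # visit the leftmost child first
            child.traverse(world)

if __name__ == "__main__":
    # the car of Fig. 6.31(b): a "car" group node with a chassis and four tyres
    tyres = [Node(f"tyre{i}", translation(x, 0.0, z))
             for i, (x, z) in enumerate([(1, 1), (1, -1), (-1, 1), (-1, -1)])]
    car = Node("car", translation(5.0, 0.0, 0.0),
               children=[Node("chassis")] + tyres)
    root = Node("world", children=[car])
    root.traverse()       # moving the car now means changing one local transform
```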
With a left-order-depth-first tree walk, transformations are processed "bottom up", that is, from leaf nodes to the root node. This corresponds to a matrix multiplication of a set of transformation matrices from right to left. In order to let the car from Fig. 6.31(a) drive, the programmer has to apply a total of nine transformations, touching every single entity in the graph with the exception of the root node. Fig. 6.31(b) therefore depicts a more elegant solution, where a group node for a parent object "car" is introduced. This node defines a spatial relation itself, called the local transformation, i.e., a translation of the node "car" translates all its child nodes by the same amount. This introduces the concept of a local coordinate system, which defines all spatial relations locally for every node and its children. The coordinate system of the world's root node is called the global coordinate system. The concrete definition of the global coordinate system is defined either by the application programmer or by the toolkit that is used. An example with quite the same structure as the car example depicted above can be found on the book CD. See Sec. 6.7 for more details on this simple example.

As stated above, content nodes can have different types and can influence what can be seen in the scene. This ranges from nodes that describe the polygonal mesh to be rendered, over application programmer's callback nodes, up to intelligent "pruning" nodes that can be used to simplify a scene and speed up rendering. This happens, e.g., when parts of the scene that are not visible from the current viewpoint get culled from the scene, or when highly detailed geometries that are at a great distance are replaced by simpler counterparts that still contain the same visual information for the viewer. The latter method is called level of detail rendering. Some content nodes describe the position of the camera in the current scene, or a "virtual platform" where the user is located in the virtual world. By using such nodes correctly, the application programmer can tie the user to an object. E.g., the camera that is attached as a child of the
car in Fig. 6.31(b) is affected by the transformation of the car node, as are its siblings.

6.6.2 Toolkits

Currently, the number of available toolkits for the implementation of virtual environments varies, as do their abilities and focuses. But some basic concepts have evolved that can be distinguished and are more or less equally followed by most environments. A list of some available toolkits and URLs to download them is given in the README.txt on the book CD in the vr directory. The following section will try to concentrate on these principles, without describing a special toolkit and without enumerating the currently available frameworks.

The core of any VR toolkit is the application model that can be followed by the programmer to create the virtual world. A common approach is to provide a complete and semi-open infrastructure that includes scene graphs, input/output facilities, device drivers and a tightly defined procedure for the application processing. Another strategy is to provide independent layers that augment existing low level facilities, e.g., scene graph APIs, with VR functionality, such as interaction metaphors, hardware abstraction, physics or artificial intelligence. While the first approach tends to be a static all-in-one approach, the latter one needs manual adaptations that can be tedious when using exotic setups. The application models can be event driven or procedural models, where the application programmer either listens to events propagated over an event bus or defines a number of callback routines that will be called by the toolkit in a defined order.

A common model for interactive applications is the frame-loop, depicted in Fig. 6.32. In this model, the application calculates its current state in between the rendering of two consecutive frames. After the calculation is done, the current scene is rendered onto the screen. This is repeated in an endless loop until the user breaks the loop and the application exits. A single iteration of the loop is called a frame. It consists of a calculation step for the current application state and the rendering of the resulting scene. User interaction is often dispatched after the rendering of the current scene, e.g., when no traversal of the scene graph is occurring. Any state change of an interaction entity is propagated to the system and the application using events. An event indicates that a certain state is present in the system; e.g., the press of a button on the keyboard defines such a state. All events are propagated over an event bus. Events usually trigger a change of state in the world's objects, which results in a changed rendering on the various displays.

As a consequence, the task of implementing and vitalizing a virtual environment deals with the representation of the world's objects for the various senses. For the visual sense, this is the graphical representation, that is, the polygonal meshes and descriptions that are needed for the graphics hardware to render an object properly. For the auditory sense, this comprises sound waves, radiation characteristics, and attributes that are needed to calculate filters that drive loudspeakers. Force feedback devices, which appeal to the haptic sense, usually need information about materials, such as stiffness and masses, together with forces and torques.
Fig. 6.32. A typical endless loop of a VR application.
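A skeleton of this frame-loop, reduced to its control flow, might look as follows. The event queue, the update and render placeholders and the exit condition are all assumptions of this sketch; a real toolkit hides most of this behind its own application model and dispatches real device events instead of the synthetic one used here.

```python
from collections import deque

event_bus = deque()                              # events from devices and the system

def poll_devices(frame):
    """Placeholder system update: generate a synthetic 'quit' event on frame 3."""
    if frame == 3:
        event_bus.append({"type": "quit"})

def update_application(events):
    """Placeholder application update: react to events, return whether to keep running."""
    for ev in events:
        print("handling", ev)
    return not any(ev["type"] == "quit" for ev in events)

def render_scene(frame):
    print(f"rendering frame {frame}")            # scene graph traversal would go here

def main_loop():
    frame, running = 0, True
    while running:                               # one iteration = one frame
        events = list(event_bus)                 # interaction update: drain the bus
        event_bus.clear()
        running = update_application(events)     # application update
        render_scene(frame)                      # rendering
        poll_devices(frame)                      # system update: new device input
        frame += 1

if __name__ == "__main__":
    main_loop()
```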
Real VR toolkits therefore provide algorithms for the transformation or storage of the data structures that are needed for the rendering of virtual objects. Usually a collection of algorithms for numerical simulation and physically based modeling has to be present, too. In addition to the structures that help to implement the virtual objects in a virtual world, a certain amount of real world housekeeping is addressed by proper VR toolkits. The most obvious issue is that a number of non-standard input devices have to be supported in order to process user input. This includes tracking devices, ranging from electromagnetic and ultrasonic trackers up to optical tracking systems. VR toolkits provide a layer that maps the broad range of devices to a standard input and output bus and allows the application programmer to process the output of the devices. The most fundamental concepts for input processing include strategies for real time dispatching of incoming data and the transformation from device dependent spatial data to spatial data for the virtual world. Another aspect of input processing is the ability to join spatially distributed users of a virtual environment in a collaborative setting. If that aspect is supported, the VR toolkit provides a means of differentiating between the "here" and "there" of user input and suitable locking mechanisms to avoid conflicting states of the distributed instances of the virtual environment.

Another fundamental aspect of a toolkit that implements virtual environments is the flexible setup for a varying range of display devices. A trivial requirement for the toolkit is to support different resolutions and geometries of the projection surfaces as well as active and passive stereo. This includes multi-pipe environments, where a number of graphics boards is hosted by a single machine, or distributed cluster architectures where each node in a cluster drives a single graphics board but the whole cluster drives a single display. A fundamental requirement is the presence of a well performing scene graph structure that enables the rendering of complex scenes. In addition to that, the toolkit
In addition to the scene graph, the toolkit should provide basic means to create geometry for 3D widgets as well as algorithms for collision detection, deformation and topological changes of geometries. A very obvious requirement for toolkits that provide an immersive virtual environment is the automatic calculation of the user centered projection that enables stereoscopic views. Ideally, this can be bound directly to a specific tracking device sensor such that it is transparent to the user of the toolkit. Fig. 6.33 depicts the building blocks of modern virtual environments that are more or less covered by most toolkits available for implementing virtual environments today. It can be seen that the software necessary for building vivid virtual worlds is more than just a scene graph, and that even simple applications can be rather complex in their set-up and runtime behavior.

6.6.3 Cluster Rendering in Virtual Environments

As stated above, room mounted multi-screen projection displays are nowadays driven by off-the-shelf PC clusters instead of multi-pipe, shared memory machines. The topology and system layout of a PC cluster introduce fundamental differences in the software design, as the application has to respect distributed computation and independent graphics drawing. The first issue introduces the need for data sharing among the different nodes of the cluster, while the latter raises the need for a synchronization of frame drawing across the different graphics boards. Data locking deals with the question of sharing the relevant data between nodes. Usually, nodes in a PC cluster architecture do not share memory. The requirements on the type of data differ depending on whether a system distributes the scene graph or synchronizes copies of the same application. Data locked applications calculate on the same data and, if the algorithms are deterministic, compute the same results at the same granularity.

An important issue especially for VR applications is rendering. The frame drawing on the individual projection screens has to be precisely timed. This is usually achieved with specialized hardware, e.g., gen-locking or frame-locking features available on certain graphics boards. However, this hardware is not common in off-the-shelf graphics boards and is usually expensive. A software-based solution to the swap synchronization issue would strengthen the idea of using non-specialized hardware for VR rendering, thus making the technique more widespread.

In VR applications, two types of knowledge are usually distinguished: the graphical setup of the application (the scene graph) and the values of the domain models that define the state of the application. Distributing the scene graph results in a simple application setup for cluster environments. A setup like this is called a client-server setup: the server cluster nodes provide the service of drawing the scene, while the client dictates what is drawn by providing the initial scene graph and subsequent modifications to it. This technique is usually embedded as a low level infrastructure in the scene graph API that is used for rendering. Alternatives to the distribution of the scene graph are the distribution of pixel based information over high bandwidth networks, where all images are rendered in a high performance graphics environment, or the distribution of graphics primitives. A totally different approach builds on the observation that Virtual Reality applications that have the same state of their domain objects will render the same scene.
Fig. 6.33. Building blocks and aspects that are covered by a generic toolkit that implements virtual environments. Applications based on these toolkits need a large amount of infrastructure and advanced real time algorithms.
It is therefore sufficient to distribute the state of the domain objects in order to render the same scene; in a multi-screen environment, only the camera on the virtual scene has to be adapted to the layout of the projection system. This very common approach is called the master-slave, or mirrored application, paradigm, as the calculation of the domain objects for a given time step is done on a master machine and afterwards distributed to the slave machines.
One can choose between distributing the domain objects themselves and distributing the influences that can alter the state of any domain object, e.g., user input. Domain objects and their interactions are usually defined on the application level by the author of the VR application, so it seems more reasonable to distribute the entities of influence to these domain objects and to apply these influences to the domain objects on the slave nodes of the PC cluster, see Fig. 6.34. Every VR system constantly records user interaction and provides information about it to the application, which then changes the state of the domain objects. For a master-slave approach, as depicted above, it is reasonable to distribute this interaction and other non-deterministic influences across several cluster nodes, which will in turn calculate the same state of the application. This results in a very simple application model at the cost of redundant calculations and synchronization costs across different cluster nodes. We can see that the distribution of user interaction in the form of events is a key component of the master-slave approach. As a consequence, in a well designed system it is sufficient to mirror application events to a number of cluster nodes to transparently realize a master-slave approach for a clustered virtual environment. The task for the framework is to listen to the events that run over the event bus during the computational step in between the rendering steps of an application frame and to distribute this information across the network.
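As a sketch of how such event mirroring could be realized with plain Java serialization and TCP sockets, consider the following fragment. The class names and the wire format are assumptions made for illustration; a real VR framework would use its own event types and a more efficient transport.

import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;

class InteractionEvent implements Serializable {
    final long frame;        // frame index t_i the event belongs to
    final String payload;    // e.g., "wand.button0=pressed"
    InteractionEvent(long frame, String payload) { this.frame = frame; this.payload = payload; }
}

/** Master side: collect the events of one frame and mirror them to all slaves. */
class MasterEventMirror {
    private final List<ObjectOutputStream> slaves = new ArrayList<>();
    void addSlave(Socket s) throws Exception {
        slaves.add(new ObjectOutputStream(s.getOutputStream()));
    }
    void mirrorFrame(List<InteractionEvent> frameEvents) throws Exception {
        for (ObjectOutputStream out : slaves) {
            out.writeObject(new ArrayList<>(frameEvents));  // serialized events over the intranet
            out.flush();
        }
    }
}

/** Slave side: receive the mirrored events and feed them into the local event bus. */
class SlaveEventListener {
    @SuppressWarnings("unchecked")
    void listen(int port, java.util.function.Consumer<InteractionEvent> localBus) throws Exception {
        try (ServerSocket server = new ServerSocket(port);
             Socket master = server.accept();
             ObjectInputStream in = new ObjectInputStream(master.getInputStream())) {
            while (true) {
                List<InteractionEvent> events = (List<InteractionEvent>) in.readObject();
                events.forEach(localBus);   // replaying identical events yields identical state
            }
        }
    }
}

Because the slaves replay exactly the events the master observed for frame t_i, their deterministic application updates arrive at the same domain object state for frame t_i+1, which is precisely the data locking scheme of Fig. 6.34.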
6.7 Examples

The CD that accompanies this book contains some simple examples for the implementation of VR applications. It is, generally speaking, a hard task to provide real world examples of VR applications, as much of their fascination and expressiveness comes from the hardware that is used, e.g., stereo capable projection devices, haptic robots or precisely calibrated loudspeaker set-ups.
Fig. 6.34. Data locking by event sharing.
The examples provided here are intended to introduce the reader to basic work with 3D graphical applications and to provide a beginner's set-up that can be used for one's own extensions. For a number of reasons, we have chosen the Java3D API, a product of Sun Microsystems, for the implementation of the examples. Java3D needs a properly installed Java environment in order to run. At the time of writing, the current Java environment version is 1.5.0 (J2SE). Java3D is an add-on package that has to be downloaded separately, and its interface is still under development; the version used for this book is 1.3.2. The book CD contains a README.txt file with the URLs for downloading the Java environment, the Java3D API, and other utilities needed for development. You can use any environment that is suitable for the development of normal Java applications, e.g., Eclipse, JBuilder or the like. Due to licensing policies you have to download all the needed libraries yourself; they are not contained on the book CD. You can find the contents of this chapter on the book CD in the directory vr. There you will find the folders SimpleScene, Robot, SolarSystem and Tetris3D. Each folder contains a README.txt file with additional information about the execution or configuration of the applications. The Java3D API provides well structured and intuitive access to working with scene graphs. For a detailed introduction, please see the comprehensive tutorials on the Java3D website; the URLs for the tutorials and additional material are given on the book CD. The following examples all use a common syntax when depicting scene graph layouts. Figure 6.35 shows the entities that are used in all the diagrams. This syntax is close to the one used in the Java3D tutorials.
Fig. 6.35. Syntax of the scene graph figures that describe the examples in this chapter. The syntax is close to the one that can be found in the Java3D tutorials.
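As a first taste of the API before turning to the CD examples, the following minimal program builds a complete Java3D application: a SimpleUniverse provides the view side of the scene graph, and a small content branch with a single colored cube is attached to it. This sketch is not one of the CD examples; it only shows the typical SimpleUniverse pattern they all rely on.

import java.awt.BorderLayout;
import java.awt.Frame;
import javax.media.j3d.BranchGroup;
import javax.media.j3d.Canvas3D;
import javax.media.j3d.Transform3D;
import javax.media.j3d.TransformGroup;
import javax.vecmath.Vector3f;
import com.sun.j3d.utils.geometry.ColorCube;
import com.sun.j3d.utils.universe.SimpleUniverse;

public class HelloScene {
    public static void main(String[] args) {
        // a 3D canvas embedded in a plain AWT frame, as in the SimpleScene example
        Canvas3D canvas = new Canvas3D(SimpleUniverse.getPreferredConfiguration());
        Frame frame = new Frame("HelloScene");
        frame.setLayout(new BorderLayout());
        frame.add(canvas, BorderLayout.CENTER);
        frame.setSize(400, 400);
        frame.setVisible(true);

        // the view branch: Locale, ViewingPlatform and Viewer are created for us
        SimpleUniverse universe = new SimpleUniverse(canvas);
        universe.getViewingPlatform().setNominalViewingTransform();

        // the content branch: root -> transform group -> colored cube
        BranchGroup objRoot = new BranchGroup();
        Transform3D shift = new Transform3D();
        shift.setTranslation(new Vector3f(0.0f, 0.0f, -2.0f));  // move the cube away from the camera
        TransformGroup tg = new TransformGroup(shift);
        tg.addChild(new ColorCube(0.3));
        objRoot.addChild(tg);
        objRoot.compile();                                      // let Java3D optimize the branch

        universe.addBranchGraph(objRoot);
    }
}

All further examples differ only in the content branch that is attached to the universe and in the behaviors that animate it.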
6.7.1 A Simple Scene Graph Example

The folder SimpleScene contains an example that can be used as a very first step into the world of scene graphs. It is a starting point for playing around with hierarchies and with local and global transformations. In addition, it embeds a 3D canvas (an object that is used for the rendering of a 3D scene) into an AWT context, i.e., the simple 2D user interface layer of Java, which provides many types of 2D control widgets, e.g., buttons, list views and sliders. The structure of the scene that is displayed by the unmodified SimpleScene.java is outlined in Fig. 6.36. The figure depicts the part of the scene graph that is created by the application and the part that is automatically provided by the SimpleUniverse object, which is shipped with the standard Java3D API in order to easily realize a 3D window with mouse interaction.
Fig. 6.36. The scene graph that is set up by the SimpleScene example, together with the part of the scene graph that is automatically created by the SimpleUniverse object of the Java3D API.
The application first creates the root node, called objRoot, and a transform node that allows interaction
with an input device on this node. Java3D provides predefined behavior classes that can be used to apply rotation, translation and scaling operations to transform nodes. Behaviors in Java3D can be active or inactive depending on the current spatial relations of objects. In this example, we define the active region of the mouse behavior to cover the whole scene. This can be accomplished by the code fragment in the method createBehavior(), see Fig. 6.37.

public void createBehavior(TransformGroup oInfluencedNode, Group oParent) {
    // behavior classes shipped with the Java3D API
    mouseRotate = new MouseRotate();
    mouseTranslate = new MouseTranslate();
    mouseZoom = new MouseZoom();

    // you have to set the node that is transformed by the mouse
    // behavior classes
    mouseRotate.setTransformGroup(oInfluencedNode);
    mouseTranslate.setTransformGroup(oInfluencedNode);
    mouseZoom.setTransformGroup(oInfluencedNode);

    // these values determine the "smoothness" of the interaction
    mouseTranslate.setFactor(0.02);
    mouseZoom.setFactor(0.02);

    // determine the region of influence:
    // setting the radius of the bounding sphere to Float.MAX_VALUE
    // states that these behaviors are active everywhere
    BoundingSphere boundingSphere =
        new BoundingSphere(new Point3d(0.0, 0.0, 0.0), Float.MAX_VALUE);
    mouseRotate.setSchedulingBounds(boundingSphere);
    mouseTranslate.setSchedulingBounds(boundingSphere);
    mouseZoom.setSchedulingBounds(boundingSphere);

    // behaviors are part of the scene graph and live
    // as children of the transform node
    oParent.addChild(mouseRotate);
    oParent.addChild(mouseTranslate);
    oParent.addChild(mouseZoom);
}
Fig. 6.37. Example code for mouse behavior. This code sets the influence region of the mouse interaction to the complete scene.
The application's scene graph is constructed by the method createSceneGraph(). This method can be adapted to create an arbitrary scene graph, but take care to initialize the members t1Group and t2Group, which indicate the target of the mouse transformation that is selected by clicking on the radio buttons on the left hand side of the application panel.

6.7.2 More Complex Examples: SolarSystem and Robot

The SimpleScene application is a very basic step towards understanding the power of a scene graph. The book CD provides two more elaborate example applications, which are described in the following sections.
A classic example of spatial relations and the usage of scene graphs is the visualization of an animated solar system containing the earth, the moon and the sun. The earth revolves around the sun, and the moon orbits the earth. Additionally, the sun and the earth rotate around their own axes. The folder SolarSystem on the book CD contains the necessary code showing how to construct these spatial relations. Figure 6.38 depicts the scene graph of the application. In addition to the branch group and transform group nodes, the figure contains the auxiliary objects that are needed for the animation, e.g., the rotator objects that modify the transformation of the transform group they are attached to over time; a simplified code sketch of this hierarchy follows below.
Fig. 6.38. The scene graph that is outlined by the SolarSystem example. It contains the branch, transform and content nodes as well as special objects needed for the animation of the system, e.g., rotator objects.
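The essential part of this hierarchy can be written down in a few lines of Java3D. The following fragment is a simplified sketch of the idea, not the code on the CD: it uses RotationInterpolator objects as rotators and separates the transform groups that are animated from those that only hold a fixed translation.

import javax.media.j3d.Alpha;
import javax.media.j3d.BoundingSphere;
import javax.media.j3d.BranchGroup;
import javax.media.j3d.RotationInterpolator;
import javax.media.j3d.Transform3D;
import javax.media.j3d.TransformGroup;
import javax.vecmath.Point3d;
import javax.vecmath.Vector3f;
import com.sun.j3d.utils.geometry.Sphere;

public class SolarSystemSketch {

    /** Attaches a "rotator" that spins the given transform group forever. */
    static void addRotator(TransformGroup target, BranchGroup parent, long periodMillis) {
        target.setCapability(TransformGroup.ALLOW_TRANSFORM_WRITE);
        RotationInterpolator rot = new RotationInterpolator(new Alpha(-1, periodMillis), target);
        rot.setSchedulingBounds(new BoundingSphere(new Point3d(), 100.0));
        parent.addChild(rot);
    }

    public static BranchGroup createSceneGraph() {
        BranchGroup root = new BranchGroup();

        TransformGroup sunSpin = new TransformGroup();     // sun rotating about its own axis
        sunSpin.addChild(new Sphere(0.3f));
        root.addChild(sunSpin);
        addRotator(sunSpin, root, 8000);

        TransformGroup earthOrbit = new TransformGroup();  // earth revolving around the sun
        root.addChild(earthOrbit);
        addRotator(earthOrbit, root, 12000);

        Transform3D earthShift = new Transform3D();
        earthShift.setTranslation(new Vector3f(0.8f, 0.0f, 0.0f));
        TransformGroup earthPos = new TransformGroup(earthShift);
        earthOrbit.addChild(earthPos);
        earthPos.addChild(new Sphere(0.1f));

        TransformGroup moonOrbit = new TransformGroup();   // moon revolving around the earth
        earthPos.addChild(moonOrbit);
        addRotator(moonOrbit, root, 4000);

        Transform3D moonShift = new Transform3D();
        moonShift.setTranslation(new Vector3f(0.2f, 0.0f, 0.0f));
        TransformGroup moonPos = new TransformGroup(moonShift);
        moonOrbit.addChild(moonPos);
        moonPos.addChild(new Sphere(0.04f));

        root.compile();
        return root;
    }
}

Because moonOrbit lives below earthPos, the moon automatically follows every motion of the earth around the sun; this is exactly the kind of spatial relation that the scene graph expresses for free.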
In the folder Robot you will find the file start.bat, which starts the Robot application on your Microsoft Windows system. You can take the command line of
the batch file as an example of how to start the Robot application on a different operating system that has Java and Java3D installed. The Robot example creates an animated and textured robot which can be turned and zoomed with the mouse, using the left and middle mouse button, respectively. In addition to the more complex scene graph, it features light nodes. Lighting is an important aspect, and in most scene graph APIs special light nodes provide different types of lighting effects. These lights are themselves part of the scene graph. The Robot application illustrates the use of spatial hierarchies rather clearly: every part that can move independently is part of a subtree of a transform group. Figure 6.39 depicts the scene graph needed to animate the robot. It omits the auxiliary objects necessary for the complete technical realization of the animation; see the source code on the book CD for more details.

6.7.3 An Interactive Environment: Tetris3D

While SimpleScene, SolarSystem and Robot deal with transformations and spatial hierarchies, they are not very interactive beyond the fact that the complete scene can be rotated, translated, or zoomed. Typical VR applications gain much of their fascination from allowing the user to interact with objects in the scene in real time. Java3D uses behavior objects for the implementation of interaction with scene objects. An intuitive way of learning this interaction concept is a game that benefits from depth perception. On the book CD, a complete Tetris3D game can be found in the Tetris3D directory contained in the vr folder. It shows how to realize time-based behavior with the TetrisStepBehavior class, which implements the time dependent falling of a puzzle piece. A way to react to keyboard input is outlined in the TetrisKeyBahavior class. Puzzle elements are defined in the Figure class and are basically assembled from cube primitives into the well known Tetris figures. In addition, this example provides the configuration file j3d1x2-stereo-vr for a workbench set-up with two active stereo projectors, an electromagnetic tracking system (Flock of Birds, Ascension Technology) and a wand-like interaction metaphor. It enables head tracking and VCP for the user. The default configuration file for the game is in the same directory on the book CD and is called j3d1x1-window. In order to determine which configuration file is picked at start-up, the Tetris3D::Tetris3D() method has to be changed and the class has to be recompiled. See the Tetris3D.java file on the book CD for comprehensive details.
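The time-based part of the game can be illustrated with a small behavior class in the spirit of TetrisStepBehavior (the actual CD class differs in detail): a Java3D Behavior that wakes up at a fixed interval and moves the transform group of the current puzzle piece one step downwards.

import java.util.Enumeration;
import javax.media.j3d.Behavior;
import javax.media.j3d.Transform3D;
import javax.media.j3d.TransformGroup;
import javax.media.j3d.WakeupOnElapsedTime;
import javax.vecmath.Vector3d;

public class FallStepBehavior extends Behavior {
    private final TransformGroup piece;      // transform group of the falling piece
    private final WakeupOnElapsedTime tick;  // fires once per step interval
    private final Vector3d position = new Vector3d(0.0, 4.0, 0.0);
    private final Transform3D t3d = new Transform3D();

    public FallStepBehavior(TransformGroup piece, long stepMillis) {
        this.piece = piece;
        this.tick = new WakeupOnElapsedTime(stepMillis);
    }

    @Override
    public void initialize() {
        wakeupOn(tick);                      // schedule the first wake-up
    }

    @Override
    public void processStimulus(Enumeration criteria) {
        position.y -= 0.5;                   // one falling step
        t3d.setTranslation(position);
        piece.setTransform(t3d);             // here the game would also test for collisions
        wakeupOn(tick);                      // re-arm for the next step
    }
}

For the behavior to work, the transform group needs the ALLOW_TRANSFORM_WRITE capability and the behavior itself needs scheduling bounds, just like the mouse behaviors in Fig. 6.37; the real game additionally checks for collisions with the pit and with other pieces before applying a step.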
6.8 Summary

In recent years, VR has profited considerably from dramatically increasing computing power and especially from the growing computer graphics market.
Fig. 6.39. The scene graph that is outlined by the Robot example.
Virtual scenarios can now be modeled and visualized at interactive frame rates and at a fidelity that could not nearly be achieved a few years ago. As a consequence, VR technology promises to be a powerful and henceforth affordable man-machine interface not only for the automotive industry and the military, but also for smaller companies and medical applications. However, in practice most applications only address the visual sense and are restricted to navigation through the virtual scenario. More complex interaction functionality, which is absolutely necessary for applications like assembly simulation, is still in its infancy. The main reason for this situation is that such interaction tasks rely heavily on the realistic behavior of objects and on the interplay of multiple senses. Therefore, this chapter concentrated on the physically-based modeling of virtual object behavior as well as on the
multimodal aspects of VR with a focus on acoustics and haptics, since for assembly tasks these two modalities promise to be the most important complements to the visual sense. Due to limitations in hardware technology and modeling, for which solutions are not in sight in the near future, we introduced additional support mechanisms as one way to facilitate the completion of VR-based assembly tasks. The chapter closed with an introduction to the fundamental concept of scene graphs, which are used for the modeling of spatial hierarchies and play an important role in the vitalization of virtual worlds. Scene graphs are often the most important part of VR toolkits, which additionally provide general functionality for special VR hardware like trackers and display systems. PC cluster based rendering lowers the total cost of a VR installation and will ease the utilization of VR techniques in the future. On the book CD, videos are available which demonstrate how visual and haptic interfaces, physically-based modeling, and additional support mechanisms have been integrated into a comprehensive interaction tool, allowing for an intuitive completion of fundamental assembly tasks in virtual environments.
6.9 Acknowledgements

We want to thank Tobias Lentz and Professor Michael Vorländer, our cooperation partners in a joint research project on acoustics in virtual environments funded by the German Research Foundation. Without their expertise it would have been impossible to write Sect. 6.2. Furthermore, we thank our former colleague and friend Roland Steffan. The discourse on support mechanisms in Sect. 6.5 is based on studies in his Ph.D. thesis. We kindly thank Marc Schirski for the Robot example that is contained on the book CD.
References

1. Assenmacher, I., Kuhlen, T., Lentz, T., and Vorländer, M. Integrating Real-time Binaural Acoustics into VR Applications. In Virtual Environments 2004, Eurographics ACM SIGGRAPH Symposium Proceedings, pages 129–136. June 2004.
2. Berkhout, A., Vogel, P., and de Vries, D. Use of Wave Field Synthesis for Natural Reinforced Sound. In Proceedings of the Audio Engineering Society Convention 92, volume Preprint 3299. 1992.
3. Blauert, J. Spatial Hearing, Revised Edition. MIT Press, 1997.
4. Bobick, N. Rotating Objects Using Quaternions. Game Developer, 2(26):21–31, 1998.
5. Bowman, D. A., Kruijff, E., LaViola, J. J., and Poupyrev, I. 3D User Interfaces - Theory and Practice. Addison Wesley, 2004.
6. Bryson, S. Virtual Reality in Scientific Visualization. Communications of the ACM, 39(5):62–71, 1996.
7. Cruz-Neira, C., Sandin, D. J., DeFanti, T. A., and K, R. Virtual Reality in Scientific Visualization. Communications of the ACM, 39(5):62–71, 1996.
8. Delaney, B. Forget the Funny Glasses. IEEE Computer Graphics & Applications, 25(3):14–19, 1994.
9. Farin, G., Hamann, B., and Hagen, H., editors. Virtual-Reality-Based Interactive Exploration of Multiresolution Data, pages 205–224. Springer, 2003.
10. Foley, J., van Dam, A., Feiner, S., and Hughes, J. Computer Graphics: Principles and Practice. Addison Wesley, 1996.
11. Kresse, W., Reiners, D., and Knöpfle, C. Color Consistency for Digital Multi-Projector Stereo Display Systems: The HEyeWall and The Digital CAVE. In Proceedings of the Immersive Projection Technologies Workshop, pages 271–279. 2003.
12. Krüger, W. and Fröhlich, B. The Responsive Workbench. IEEE Computer Graphics & Applications, 14(3):12–15, 1994.
13. Lentz, T. and Renner, C. A Four-Channel Dynamic Cross-Talk Cancellation System. In Proceedings of the CFA/DAGA. 2004.
14. Lentz, T. and Schmitz, O. Realisation of an Adaptive Cross-talk Cancellation System for a Moving Listener. In Proceedings of the 21st Audio Engineering Society Conference. 2002.
15. Møller, H. Reproduction of Artificial Head Recordings through Loudspeakers. Journal of the Audio Engineering Society, 37(1/2):30–33, 1989.
16. Press, W. H., Flannery, B. P., Teukolsky, S. A., and Vetterling, W. T. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, second edition, 1992.
17. Ruspini, D. C., Kolarov, K., and Khatib, O. The Haptic Display of Complex Graphical Environments. In Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques. 1997.
18. Steffan, R. and Kuhlen, T. MAESTRO - A Tool for Interactive Assembly Simulation in Virtual Environments. In Proceedings of the Immersive Projection Technologies Workshop, pages 141–152. 2001.
19. Sutherland, I. The Ultimate Display. In Proceedings of the IFIP Congress, pages 505–508. 1965.
20. Theile, G. Potential Wavefield Synthesis Applications in the Multichannel Stereophonic World. In Proceedings of the 24th AES International Conference on Multichannel Audio, published on CD-ROM. 2003.
21. Zilles, C. B. and Salisbury, J. K. A Constraint-based God-object Method for Haptic Display. In Proceedings of the International Conference on Intelligent Robots and Systems. 1995.
Chapter 7
Interactive and Cooperative Robot Assistants
Rüdiger Dillmann, Raoul D. Zöllner

A proven approach for learning knowledge about actions and sensory-motor abilities is to acquire prototypes of human actions by observing these actions with sensors and transferring the acquired abilities to the robot, the so-called "learning by demonstration". Learning by demonstration requires human motion capture, observation of interactions and object state transitions, and observation of spatial and physical relations between objects. In this way, it is possible to acquire so-called "skills", situative knowledge as well as task knowledge, and the robot can be introduced to new and unknown tasks. New terms, new objects and situations, even new types of motion can be learned with the help of a human tutor or be corrected interactively via multimodal channels. In this context, the term "multimodality" refers to communication channels which are intuitive for humans, such as language, gesture and haptics (i.e. physical human-robot contact). These communication channels are to be used for commanding and instructing the robot system. The research field of programming by demonstration has evolved as a response to the need for generating flexible programs for service robots. It is largely driven by attempts to model human behavior and to map it onto virtual androids or humanoid robots. Programming by demonstration comprises a broad set of observation techniques processing large sets of data from high speed camera systems, laser scanners, data gloves and even exoskeleton devices. Some programming by demonstration systems operate with precise a priori models, others use statistical approaches to approximate human behavior. In any case, observation is done to identify motion over space and time, interaction with the environment and its effects, and useful regularities or structures and their interpretation in a given context. With this goal in mind, systems have been developed which combine active sensing, computational learning techniques and multimodal dialogs to enrich the robot's semantic system level, along with memorization techniques and mapping strategies to make use of the learned knowledge for controlling a real robot.

In the remainder of this chapter, we will introduce several different aspects of interactive and cooperative robot assistants. The first section presents an overview of currently developed mobile and humanoid robots with their abilities to interact and cooperate with humans. In the second section, basic principles of human-robot interaction are introduced, whereas the third section discusses how robot assistants can learn interactively from a human in a programming by demonstration process. More details of this process and the relevant models and representations are presented in sections four and five, respectively. Section six is concerned with how to extract subgoals from human demonstrations, section seven with the questions of task mapping and exception handling. Section eight gives a short overview of the interesting
research areas of telepresence and telerobotics. Finally, section nine shows some examples of human-robot interaction and cooperation.
7.1 Mobile and Humanoid Robots

Often, robots are described according to their functionality, e.g. underwater robot, service robot etc. In contrast to this, the term "humanoid robot" does not refer to the functionality of a robot in particular, but rather to its outer design and form. Compared to other, more specialized robot types, humanoid robots resemble humans not only in their form, but also in the fact that they can cope with a variety of different tasks. Their human-like form also makes them very suitable for working in environments which are designed for humans, like households. Another important factor is that a more humanoid form and behavior is more appealing especially to non-expert users in everyday situations, i.e. a humanoid robot will probably be more accepted as a robot assistant, e.g. in a household. On the other hand, users expect much more of these robots in terms of communication, behavior and cognitive skills. It is important to note, though, that there is not one established meaning of the term "humanoid robot" in research – as shown in this section, there currently exists a wide variety of so-called humanoid robots with very different shapes, sizes, sensors and actuators. What is common among them is the goal of creating more and more human-like robots. With this goal in mind, humanoid robotics is a fast developing area of research world-wide, and currently represents one of the main challenges for many robotics researchers.

This trend is particularly evident in Japan. Many humanoid robotic platforms have been developed there in the last few years, most of them with an approach very much focussed on mechanical, or more generally on hardware problems, in the attempt to replicate as closely as possible the appearance and the motions of human beings. Impressive results have been achieved by the Waseda University of Tokyo since 1973 with the WABOT system and its later version WASUBOT, which was exhibited playing piano with the NHK Orchestra at a public concert in 1985. The WASUBOT system could read sheet music with a vision system (Fig. 7.1) and could use its feet and five-finger hands for playing the piano. The evolution of this research line at Waseda University has led to further humanoid robots, such as: Wabian (1997), able to walk on two legs, to dance and to carry objects; Hadaly-2 (1997), focused on human-robot interaction through voice and gesture communication; and Wendy (1999) ( [29]). With a great financial effort over more than ten years, Honda was able to transfer the results of such research projects into a family of humanoid prototypes. Honda's main focus has been on developing a stable and robust mechatronic structure for their humanoid robots (cf. Fig. 7.2). The result is the Honda Humanoid Robot, whose current version is called ASIMO. It presents very advanced walking abilities and fully functional arms and hands, with very limited capabilities of hand-eye coordination for grasping and manipulation ( [31]). Even if the robots presented in Fig. 7.2 seem
Fig. 7.1. Waseda humanoid robots: a) Wabot-2, the pioneer piano player, b) Hadaly-2, c) walking Wabian and d) Wendy.
to be about the same size, the actual dimensions were reduced considerably during this evolution: the height went down from the P3's 160 cm to ASIMO's 120 cm, and the weight was reduced from the P3's 130 kg to ASIMO's 43 kg.
Fig. 7.2. The Honda humanoid robots: a) P2, b) P3, and c) the current version, ASIMO.
As Honda's research interest concentrated mostly on mechatronic aspects, the combination of industrial and academic research by Honda and the Japanese National Institute of Advanced Industrial Science and Technology (AIST) led to first applications like the one shown in Fig. 7.3 a), where HRP-1S is operating a backhoe. Honda's engagement in this research area motivated several other Japanese companies to start their own research projects in this field. In 2002, Kawada Industries introduced the humanoid robot HRP-2 ( [39]). With a height of 154 cm and a weight of 58 kg, HRP-2 has about the dimensions of a human. It is the first humanoid robot which is able to lie down and stand up without any help (as shown in Fig. 7.3). By reusing components, Toyota managed to set up a family of humanoid robots with different abilities in a relatively short time. In 2004, this family of robot partners consisted of four models, two of which are shown in Fig. 7.4. The left one is a walking android which is intended to assist the elderly. It is 1.20 m tall and weighs
Fig. 7.3. a) HRP-1S operating a backhoe, b) The humanoid robot HRP-2 is able to lie down and stand up without any help.
35 kg. The right robot, 1 m tall and weighing 35 kg, is supposed to assist humans in manufacturing processes ( [63]).
Fig. 7.4. Two of Toyota’s humanoid robot partners: a) a walking robot for assisting elderly people, and b) a robot for manufacturing purposes.
The Japanese research activities which led to these humanoid robot systems concentrate especially on mechanical, kinematic, dynamic, electronic, and control problems, with minor concern for the robots' behavior, their application perspectives, and their acceptability. Most progress in humanoid robotics in Japan has been accomplished without focusing on specific applications. In Japan, the long-term perspective of humanoid robots concerns all those roles in society in which humans can benefit from being replaced by robots. This includes some factory jobs, emergency actions in hazardous environments, and other service tasks.
Amongst them, in a society whose average age is increasing fast and steadily, personal assistance to humans is felt to be one of the most critical tasks. In the past few years, Korea has developed into a second center for humanoid robotics research on the Asian continent. Having carried out several projects on walking machines and partially humanoid systems in the past, the Korea Advanced Institute of Science and Technology (KAIST) announced a new humanoid system called HUBO in 2005 (Fig. 7.5). HUBO has 41 DOF, stands 1.25 m tall and weighs about 55 kg. It can walk, talk and understand speech. As a second humanoid project, the robot NBH-1 (Smart-Bot, Fig. 7.5 b)) is being developed in Seoul. The research group claims their humanoid to be the first network-based humanoid, which will be able to think and learn like a human. NBH-1 stands 1.5 m tall, weighs about 67 kg, and is also able to walk, talk and understand speech.
Fig. 7.5. Korean humanoid robot projects: a) HUBO, b) Smart-Bot.
In the USA, research on humanoid robotics received its major impulse through studies related to Artificial Intelligence, mainly through the approach proposed by Rodney Brooks, who identified a physical human-like structure as a prerequisite for achieving human-like intelligence in machines ( [12]). Brooks' group at the MIT AI Lab is developing human-like upper bodies which are able to learn how to interact with the environment and with humans. Their approach is much more focused on robot behavior, which is built up through experience in the world. In this framework, research on humanoids does not focus on any specific application. Nevertheless, it is accompanied by studies on human-robot interaction and sociability, which aim at favoring the introduction of humanoid robots into human society. Probably the most famous example of MIT's research is COG, shown in Fig. 7.6. Also at the MIT AI Lab, the Kismet robot has been developed as a platform for investigating human-robot interaction (Fig. 7.6). Kismet is a pet-like head with
vision, audition, speech and eye and neck motion capability. It can therefore perceive external stimuli, track faces and objects, and express its own feelings accordingly ( [11]).
Fig. 7.6. a) The MIT’s COG humanoid robot and b) Kismet.
Europe is entering the field of humanoid robotics more cautiously, but can rely on an approach that, based on its particular cultural background, allows the integration of considerations of very different nature through multidisciplinary knowledge from engineering, biology and the humanities. Generally speaking, in Europe research on robotic personal assistants has received greater attention, even without necessarily implying anthropomorphic solutions. On the other hand, in European humanoid robotics research the application as personal assistant has always been stated much more clearly and explicitly. Often, humanoid solutions are answers to the problem of developing personal robots able to operate in human environments and to interact with human beings. Personal assistance or, more generally, helpful services are the European key to introducing robots into society, and indeed research and industrial activities on robotic assistants and tools (not necessarily humanoids) have received more support than research on basic humanoid robotics. While robotic solutions for rehabilitation and personal care are now at a more advanced stage with respect to their market opportunities, humanoid projects are currently being carried out by several European universities. Some joint European projects, like the Brite-Euram Syneragh and the IST-FET Paloma, are implementing biologically-inspired sensory-motor coordination models on humanoid robotic systems for manipulation. In Italy, at the University of Genova, the Baby-Bot robot is being developed for studying the evolution of sensory-motor coordination as it happens in human babies (cf. Fig. 7.7). Starting development in 2001, a cooperation between the Russian company New Era and the St. Petersburg Polytechnic University led to the announcement of two humanoid systems called ARNE (male) and ARNEA (female) in 2003. Both robots have 28 DOF, stand about 1.23 m tall and weigh 61 kg (Fig. 7.7), i.e. their dimensions resemble those of
ASIMO. Like ASIMO, they can walk on their own, avoid obstacles, distinguish and remember objects and colors, recognize 40 separate commands, and speak with a synthesized voice. The androids can run on batteries for 1 hour, compared to 1.5 hours for ASIMO. Located in Sofia, Bulgaria, the company Kibertron Inc. is running a full-scale humanoid project called Kibertron. Their humanoid is 1.75 m tall, weighs 90 kg and has 82 DOF (Fig. 7.7). Each hand has 20 DOF and each arm another 8 DOF ( [45]).
Fig. 7.7. a) Baby-Bot, b) ARNE, and c) Kibertron.
In contrast to the majority of electro-motor driven robot systems, the German company Festo is following a different approach. Their robot Tron X (cf. Fig. 7.8 a)) is driven by servo-pneumatic muscles and uses over 200 controllers to emulate human-like muscle movement ( [23]). ARMAR, developed at Karlsruhe University, Germany, is an autonomous mobile humanoid robot for supporting people in their daily life as a personal or assistance robot (Fig. 7.8). Currently, two anthropomorphic arms have been constructed and mounted on a mobile base with a flexible torso, and studies on manipulation based on human arm movements are being carried out ( [7, 8]). The robot system is able to detect and track the user by vision and acoustics, and to talk and understand speech. Simple objects can be manipulated. Thus, European humanoid robot systems in general are being developed in view of their integration into our society, and they are very likely to be employed in assistance activities, meeting the need for robotic assistants already strongly pursued in Europe.
7.2 Interaction with Robot Assistants

Robot assistants are expected to be deployed in a vast field of applications in everyday environments. To meet consumers' requirements, they should be helpful, simple and easy to instruct. Household environments in particular are tailored to an
Fig. 7.8. Humanoid robot projects in Germany: a) Festo’s Tron X and b) ARMAR by the University of Karlsruhe.
individual's needs and taste, so each household uses a different set of facilities, furniture and tools. Moreover, the wide variety of tasks the robot will encounter cannot be foreseen in advance. Ready-made programs will not be able to cope with such environments, so flexibility will surely be the most important requirement for the robot's design. It must be able to navigate through a changing environment, to adapt its recognition abilities to a particular scene and to manipulate a wide range of objects. Furthermore, it is extremely important for users to be able to easily adapt the system to their needs, i.e. to teach the system what to do and how to do it. The basis of intelligent robots acting in close cooperation with humans and being fully accepted by them is the ability of the robot systems to understand and to communicate. This process of interaction, which should be intuitive for humans, can be structured according to the role of the robot into:
• Passive interaction or Observation. The pure observation of humans, together with the understanding of the observed actions, is an important capability of robot assistants; it enables them to learn new skills in an unsupervised manner and to decide by prediction when and how to become active.
• Active interaction or Multimodal Dialog. Performing a dialog using speech, gesture and physical contact is a mechanism for controlling robots in an intuitive way. The control process can either take place by activity selection and parameterization or by selecting knowledge to be transferred to the robot system.
Regardless of the type and role of interaction with robot assistants, one necessary capability is to understand and to interpret the given situation. In order to achieve this, robots must be able to recognize and to interpret human actions and spatio-temporal situations. Consequently, the cognitive capabilities of robot systems in terms of understanding human activities are crucial. In cognitive psychology, human activity is characterized by three features [3]:

Direction: Human activity is purposeful and directed towards a specific goal situation.
Decomposition: The goal to be reached is decomposed into subgoals.
Operator selection: There are known operators that may be applied in order to reach a subgoal. The operator concept designates an action that directly realizes such a subgoal. The solution of the overall problem can be represented as a sequence of such operators.

It is important to note that cognition is mainly taken to be directed as well. Furthermore, humans tend to perceive activity as a clearly separated sequence of elementary actions. For transferring this concept to robots, the set of elementary actions must be identified and, furthermore, a classification taxonomy for their interpretation must be set up. Concerning the observation and recognition process, the most useful information for interpreting a sequence of actions is to be found in the transitions from one elementary action to another [52].

7.2.1 Classification of Human-Robot Interaction Activities

The set of supported elementary actions should be derived from human interaction mechanisms. Depending on the goal of the operator's demonstration, a categorization into three classes seems to be appropriate [20]:
• Performative actions: One human shows a task solution to somebody else. The interaction partner observes the activity as a teaching demonstration. The elementary actions change the environment. Such demonstrations are adapted to the particular situation in order to be captured in an appropriate way by the observer [36].
• Commenting actions: Humans refer to objects and processes by their names; they label and qualify them. Primarily, this type of action serves to reduce complexity and to aid the interpretation process during and after the execution of performative actions.
• Commanding actions: Giving orders falls into the last category. These could, e.g., be commands to move, stop, or hand something over, or even complex sequences of single commands that directly address robot activity.
The identified categories may be further differentiated by the modality of their application (see Fig. 7.9). For teaching an action sequence to a robot system, the most important type of action is the performative one. But performative actions are usually accompanied by comments containing additional information about the performed task, so for knowledge acquisition it is very important to include this type of action in a robot assistant system. Intuitive handling of robot assistants through instruction is based on commanding actions, which trigger all function modes of the system.

7.2.2 Performative Actions

Performed actions are categorized along the action type (i.e. "change position" or "move object") and, closely related to this first categorization, also along the involved actuators or subsystems. Under this global viewpoint, user and robot are
Fig. 7.9. Hierarchy and overview of possible demonstration actions. Most of them are taken from everyday tasks (see lower right images)
part of the environment. Manipulation, navigation and the utterance of verbal performative sentences are different forms of performative actions.

Manipulation: Manipulation is certainly one of the main functions of a robot assistant: tasks such as fetch-and-carry and the handling of tools or devices are vital parts of most applications in household- or office-like environments. Grasps and movements are relevant for the interpretation and representation of actions of the type "manipulation".

Grasps: Established schemes can be drawn upon to classify grasps that involve one hand. Here, an underlying distinction is made between grasps during which the finger configuration does not change ("static grasps") and grasps that require such configuration changes ("dynamic grasps"). While for static grasps exhaustive taxonomies exist based on finger configurations and the geometrical structure of the carried object [14], dynamic grasps may be categorized by the movements of the manipulated objects around the local hand coordinate system [71]. Grasps performed with two hands additionally have to take synchronicity and parallelism into account, beyond the requirements of single-handed grasp recognition.

Movement: Here, the movement of extremities and the movement of objects have to be distinguished. The first may be partitioned further into movements that require a specific goal pose and movements where position changes underlie certain conditions (e.g. force/torque, visibility or collision). The transfer of objects, on the other hand, can be carried out with or without contact. It is very useful to check whether or not the object in the hand has tool quality, since this eases reasoning about the goal of the operator (e.g., tool type screwdriver → operator turns a screw upwards or downwards). Today's programming systems support very limited movement patterns only. Transportation movements
have been investigated in block worlds [33, 37, 41, 54, 64] or in simple assembly scenarios [9] and dedicated observation environments [10, 27, 49, 61]. In contrast to grasp analysis, the issue of movement analysis is being addressed quite differently by different researchers: the representation of movement types varies strongly with respect to the action representation as well as with respect to the goal situation.

Navigation: In contrast to object manipulation, navigation means the movement of the interaction partner himself. This includes position changes towards a certain destination in order to transport objects as well as movement strategies that may serve for exploration [13, 55, 56].

Verbal performative utterance: In language theory, there are actions that are realized by speaking out so-called performative sentences (such as "You're welcome"). Performative utterances are currently not relevant in the context of robot programming; we treat them as a special case and will not discuss them further.

7.2.3 Commanding and Commenting Actions

Commanding actions are used for instructing a robot either to perform a certain action or to initiate adequate behaviors. Typically, these kinds of actions imply a robot activity in terms of manipulation, locomotion or communication. In contrast to commands, commenting actions usually result in a passive attitude of the robot system, since they do not require any activity. Nevertheless, an intelligently behaving robot assistant might decide to initiate a dialog or interaction after perceiving a commenting action. Looking at robot assistants in human environments, the most intuitive ways of transferring commands or actions are the channels which are also used for human-human communication: speech, gestures or physical guidance. Speech and gesture are usually used for communication, and through these interaction channels a symbolic high level communication is possible. But for explaining manipulation tasks, which are often performed mechanically by humans, physical guidance seems more appropriate as a communication channel. As can be seen from the complexity of grasp performance, navigation, and interaction, the observation of human actions requires extensive and dedicated sensors. Diverse information is vital for the analysis of an applied operator: a grasp type may have various rotation axes, a certain flow of force/torque exerted on the grasped object, special grasp points where the object is touched, etc.
7.3 Learning and Teaching Robot Assistants

Besides interaction, learning is one of the most important properties of robot assistants, which are supposed to act in close cooperation with humans. Here, the learning process must enable robots to cope with everyday situations and natural (household or office) environments in a way that is intuitive for humans. This means that through
the learning process, on the one hand, new skills have to be acquired and, on the other hand, the knowledge of the system has to be adapted to new contexts and situations. Robot assistants coping with these demands must be equipped with innate skills and should learn life-long from their users. Presumably, these users are no robot experts and require a system that adapts itself to their individual needs. To meet these requirements, a new paradigm for teaching robots is defined to solve the problems of skill and task transfer from the human user to the robot, as a special way of knowledge transfer between man and machine. The only approach that satisfies the latter condition is programming systems that automatically acquire the information relevant for task execution by observation and multimodal interaction. Obviously, systems providing such functionality require:

1. powerful sensor systems to gather as much information as possible by observing human behavior or processing explicit instructions like commands or comments,
2. a methodology to transform the observed information for a specific task into a robot-independent and flexible knowledge structure, and
3. actuator systems using this knowledge structure to generate actions that solve the acquired task in a specific target environment.

7.3.1 Classification of Learning through Demonstration

A classification of methods for teaching robot assistants in an intuitive way can be made according to the type of knowledge which is supposed to be transferred between human and robot. Concerning performative actions, especially manipulation and navigation tasks are addressed. Here, the knowledge needed for characterizing the actions can be categorized by the symbolic and semantic information it includes. The resulting classification, shown in Fig. 7.10, contains three classes of learning or teaching methods, namely: skill learning, task learning and interactive learning.

7.3.2 Skill Learning

For the term "skill", different definitions exist in the literature. In this chapter, it will be used as follows: A skill denotes an action (i.e. manipulation or navigation) which involves a close sensor-actuator coupling. In terms of representation, it denotes the smallest symbolic entity which is used for describing a task. Examples of skills are grasps or moves of manipulated objects under constraints like "move along a table surface" etc. The classical "peg in hole" problem can be modeled as a skill, for example using a force controller which minimizes a cost function, or as a task consisting of a sequence of skills like "find the hole", "set perpendicular position" and "insert". Due to the close relation between sensors and actuators, the process of skill learning is mainly done by direct programming using the robot system and, if necessary, some special control devices. Skill learning means finding the transfer function R which satisfies the equation U(t) = R(X(t), Z(t)). The model of the skill transfer process is visualized in Fig. 7.11.
Fig. 7.10. Classes of skill transfer through interaction.
Fig. 7.11. Process of skill transfer.
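Read as code, the transfer-function view of a skill is simply a mapping that is evaluated in every control cycle. The following sketch assumes that X(t) denotes the current sensor values, Z(t) the internal state of the system and U(t) the resulting actuator command (an assumption about the symbols, which are introduced only via Fig. 7.11); the interface names are illustrative.

interface Skill {
    /** U(t) = R(X(t), Z(t)) */
    double[] control(double[] sensorValues, double[] internalState);
}

/** Example instance: a guarded move that pushes along z until a contact force is felt. */
class GuardedMoveSkill implements Skill {
    private final double targetForce;

    GuardedMoveSkill(double targetForce) { this.targetForce = targetForce; }

    @Override
    public double[] control(double[] x, double[] z) {
        double measuredForce = x[0];                  // assumed layout: x[0] = force along z
        double error = targetForce - measuredForce;
        double velocity = 0.05 * error;               // simple proportional law as a placeholder
        return new double[] { 0.0, 0.0, velocity };   // U(t): Cartesian velocity command
    }
}

A learned skill would replace the hand-written proportional law with, e.g., a neural network or a hidden Markov model trained on demonstrated sensor-actuator traces.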
For skill acquisition, statistical learning methods like neural networks or hidden Markov models are often used in order to generalize a set of skill examples in terms of their sensory input. Given a model of a skill, for example obtained through demonstration, reinforcement learning techniques can be applied to refine the model. In this case, a considerable amount of background knowledge, in the form of an evaluation function which guides the optimization process, has to be included in the system. However, teaching skills to robot assistants is time-intensive and requires a lot of knowledge about the system. Therefore, for robot assistants whose users are no experts, it seems more feasible that the majority of the needed skills are pre-learned: the system would have a set of innate skills and would only have to learn a few new skills during its "lifetime".

7.3.3 Task Learning

From the semantic point of view, a task denotes a more complex action type than a skill. From the representation and modeling side, a task is usually seen as a sequence of skills, and this should serve as the definition for the following sections. In terms of knowledge representation, a task denotes a more abstract view of actions, where the control mechanisms for actuators and sensors are enclosed in modules represented by symbols.
Learning of tasks differs from skill learning since, apart from the control strategies of sensors and actuators, the goals and subgoals of actions have to be considered. As in this case the action depends on the situation as well as on the related action domain, the learning methods used rely on a vast knowledge base. Furthermore, analytical learning seems to be more appropriate for coping with the symbolic environment descriptions used to represent states and effects of skills within a task. For modeling a task T as a sequence of skills (or primitives) Si, the context, including environmental information or constraints Ei and internal states of the robot system Zi, is introduced in the form of pre- and post-conditions of the skills. Formally, a task can be described as
T = {(Pre-Cond(Ei, Zi)) Si (Effect(Ei, Zi))},   Effect = Post-Cond \ Pre-Cond,   (7.1)
where Effect() denotes the effect of a skill on the environment and on the internal robot states. Consequently, the goal and subgoals of a task are represented by the effect of the task or of its subsequences. Notable in the presented task model is the fact that tasks are state-dependent, but not time-dependent like skills, and therefore enable a higher degree of generalization in terms of learning. Although the sensory-motor skills in this representation are encapsulated in symbols, learning and planning task execution for robot assistants becomes very complex, due to the fact that the state descriptions of environments (including humans) and the internal states are very complex, and decision making is not always well-defined.
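A minimal sketch of this state-based task representation, with pre-conditions and effects over a symbolic world state as in Eq. (7.1), could look as follows. All class names are assumptions for illustration; a real system would couple each SkillStep to the sensor-actuator controller of the corresponding skill.

import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

/** Symbolic world state: a set of facts about the environment E and the robot state Z. */
class WorldState {
    final Set<String> facts = new HashSet<>();
}

/** One skill S_i inside a task, guarded by Pre-Cond(E_i, Z_i) and annotated with its effect. */
class SkillStep {
    final String name;
    final Predicate<WorldState> preCondition;  // Pre-Cond(E_i, Z_i)
    final Set<String> effect;                  // Effect = Post-Cond \ Pre-Cond

    SkillStep(String name, Predicate<WorldState> preCondition, Set<String> effect) {
        this.name = name;
        this.preCondition = preCondition;
        this.effect = effect;
    }
}

/** A task T as a sequence of skill steps; subgoals are the accumulated effects. */
class Task {
    final List<SkillStep> steps;

    Task(List<SkillStep> steps) { this.steps = steps; }

    WorldState execute(WorldState state) {
        for (SkillStep s : steps) {
            if (!s.preCondition.test(state))
                throw new IllegalStateException("pre-condition of " + s.name + " not fulfilled");
            state.facts.addAll(s.effect);      // apply Effect(E_i, Z_i) to the symbolic state
        }
        return state;
    }
}

A fetch-and-carry task, for instance, could contain a step grasp(cup) with pre-condition hand-empty and effect holding(cup); since only states are compared, the representation stays independent of the timing of the original demonstration.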
7.3.4 Interactive Learning

The process of interactive learning operates at the highest abstraction level because the information is exchanged through speech and gestures. Through these interaction channels, various situation-dependent symbols are used to refer to objects in the environment or to select actions to be performed. Information like control strategies on the sensory-motor level can no longer be transferred, but rather global parameters, which are used for triggering and adapting existing skills and tasks. The difference between task learning and interactive learning consists, on the one hand, in the complexity of the actions that can be learned and, on the other hand, in the use of dialog with explicit system demands during the learning process. Besides communication issues, one of the main problems in developing robot assistants with such capabilities is the understanding and handling of "common sense" knowledge. In this context, the problem of establishing a shared view of situations and communicating it between robot and human is still a huge research area. For learning, however, a state driven representation of actions, which can either be skills, tasks or behaviors, is required, as well as a vast decision making module which is able to disambiguate the gathered information by initiating adequate dialogs. This process relies on a complex knowledge base including skill and task models, environment descriptions, and human-robot interaction mechanisms. Learning in this
context denotes finding and memorizing sequences of actions or important parameters for action execution or interaction. Associating semantic information with objects or actions in order to expand spatial and action concepts is a further goal of interactive learning.

7.3.5 Programming Robots via Observation: An Overview

Different programming systems and approaches to teaching robots based on human demonstrations have been proposed in past years. Many of them address only special problems or a special subset of objects. An overview and classification of the approaches can be found in [16, 58]. Learning action sequences or operation plans is an abstract problem, which presupposes a modeling of cognitive skills. When learning complex action sequences, basic manipulation skills or control techniques become less relevant. In fact, the aim is to generate an abstract description of the demonstration, reflecting the user's intention and modeling the problem solution as optimally as possible. Robot-independence is an important issue because it would allow the exchange of robot programs between robots with different kinematics. Likewise, a given problem has to be suitably generalized, for example by distinguishing parameters specific to the particular demonstration from parameters specific to the problem concept. Both generalizations demand a certain insight into the environment and the user's performance. The basis for mapping a demonstration to a robot system is the task representation and the task analysis. Often, the analysis of a demonstration takes place by observing the changes in the scene. These changes can be described using relational expressions or contact relations [40, 47, 53]. Issues in learning to map action sequences to dissimilar agents have been investigated in [2]. Here, the agent learns by imitation an explicit correspondence between its own possible actions and the actions performed by a demonstrator agent. To learn correct correspondences, several demonstrations of the same problem are generally necessary. For generalizing a single demonstration, mainly explanation-based methods are used [24, 51]. They allow for an adequate generalization from only one example (one-shot learning). Approaches based on one-shot learning techniques are the only ones feasible for end users, since giving many similar examples is an annoying task. A physical demonstration in the real world can be too time-consuming and may not be mandatory. Some researchers rely on virtual or iconic demonstrations [5]. The advantage of demonstrations in a virtual world is that manipulations can be executed there that would not be feasible in the real world because of physical constraints. Furthermore, inaccuracies of sensor-based approaches do not have to be taken into account, and object poses or trajectories [41, 62, 64] and/or object contexts [26, 30, 53, 59] can be generalized in an appropriate way. A physical demonstration, however, is more convenient for the human operator. Iconic programming starts at an even more abstract level. Here, a user may fall back on a pool of existing or acquired basic skills. These can be retrieved through specific cues (like icons) and embedded
The result of such a demonstration of operator sequences can then be abstracted, generalized and summarized as new knowledge for the system. An example can be found in the skill-oriented programming system SKORP [5], which helps to set up macro operators interactively from actuator or cognitive elementary operators. Besides the derivation of action sequences from a user demonstration, direct cooperation with users has been investigated. Here, user and robot reside in a common work cell. The user may direct the robot on the basis of a common vocabulary of speech or gestures [60, 69, 70], but this approach allows for rudimentary teaching procedures only. A more general approach to programming by demonstration, together with the processing phases that have to be realized when setting up such a system, is outlined in [19]. A discussion of sensor employment and sensor fusion for action observation can be found in [21, 57, 71]; there, basic concepts for the classification of dynamic grasps are presented as well. To summarize, the most crucial task concerning the programming process is the recognition of the human's actions and of their effects in the environment. Assuming that background knowledge about tasks and actions is limited, valuable information must be extracted by closely observing the user during demonstrations. For real-world problems, simplified demonstration environments will not lead to robot instructions that are applicable in general. Therefore, we propose to rely on powerful sensor systems that make it possible to keep track of human actions in unstructured environments as well. Unfortunately, the on-board sensor systems of today's autonomous robots will not provide enough information for obtaining the relevant data in the general case. So it is necessary to integrate additional sensor systems into the environment to enlarge the robot's recognition capabilities, as well as to undertake a fundamental inspection of the actions that can be recognized in everyday environments.
7.4 Interactive Task Learning from Human Demonstration

The approach of programming by demonstration (PbD) emphasizes the interpretation of what has been done by the human demonstrator. Only a correct interpretation enables a system to reuse formerly observed action sequences in a different environment. Actions that might be performed during the demonstration can be classified as explained above, cf. figure 7.9. The main distinction here is drawn by their objectives: when observing a human demonstrator, performative, commenting and commanding actions are the most relevant. They are performed in order to show something by doing it oneself (performative actions), in order to annotate or comment on something (commenting actions), and in order to make the observer act (commanding actions). These actions may be further subclassified as explained before. Some of these actions may be interpreted immediately (e.g. the spoken command "stop!"). But mostly, the interpretation depends on environment states, robot and user positions and so on. This makes it possible to interpret actions that serve man-machine interaction.
It is even harder, though, to analyze manipulation sequences. Here, the interpretation has to take into account previously performed actions as well. Thus, the process of learning complex problem-solving knowledge consists of several different phases.

7.4.1 Classification of Task Learning

Systems programmed by user demonstrations are not restricted to a specific area of robotics; such systems have also been used for the construction of graphical user interfaces and work-flow management systems [15]. In any case the system learns from examples provided by a human user. Within the field of robotics, and especially for manipulation tasks, PbD systems can be classified according to several classification features (Fig. 7.12), in particular with regard to task complexity, class of demonstration, internal representation and mapping strategy. The concrete system strongly depends on the complexity of the task that is being learned; a small sketch capturing these classification features as a data structure is given after the following list.
Fig. 7.12. Classification features for Programming by Demonstration (PbD) systems.
• The first classification feature is the abstraction level. Here, the learning goal can be divided into learning of low-level (elementary) skills or high-level skills (complex task knowledge). Elementary skills represent human reflexes or actions applied unconsciously. For them, a direct mapping between the actuator's actions and the corresponding sensor input is learned. In most cases neural networks are trained for a certain sensor/actuator combination like grasping objects or peg-in-hole tasks [6, 38, 46]. Obviously, those systems must be trained explicitly and cannot easily be reused for a different sensor/actuator combination. Another drawback is the need for a high number of demonstration examples, since otherwise the system might only learn a specific sensor/actuator configuration. Acquiring complex task knowledge from user demonstrations, on the other hand, requires a large amount of background knowledge and sometimes communication with the user. Most applications deal with assembly tasks, see for example [35, 43, 59].
• Another important classification feature is the methodology of giving examples to the system. Within robotic applications, active, passive and implicit examples can be provided for the learning process. Following this terminology, active examples denote those demonstrations where the user performs the task himself, while the system uses sensors like data gloves, cameras and haptic devices for tracking the environment and/or the user. Obviously, powerful sensor systems are required to gather as much data as available [25, 32, 42, 44, 65, 66]. Normally, finding the relevant actions and goals is a demanding challenge for those systems. Passive examples can be used if the robot can be controlled by an external device, for example a space-mouse, a master-slave system or a graphical interface. During the demonstration the robot monitors its internal or external sensors and tries to map those readings directly to a well-known action or goal [38]. Obviously, this method requires less intelligence, since perception is restricted to well-known sensor sources and actions for the robot system are directly available. However, the main disadvantage is that the demonstration example is restricted to the target system. If implicit examples are given, the user directly specifies system goals to be achieved by selecting a set of iconic commands. Although this methodology is the easiest one to interpret, since the mapping to well-known symbols is implicitly given, it both restricts the user to a certain set of actions and provides the most uncomfortable way of giving examples.
• A third aspect is the system's internal representation of the given examples. Today's systems consider effects in the environment, trajectories, operations and object positions. While observing effects in the environment often requires highly developed cognitive components, observing trajectories of the user's hand and fingers can be considered comparably simple. However, recording only trajectories is in most cases not enough for real-world applications, since all information is restricted to one environment. Tracking positions of objects, either statically at certain phases of the demonstration or dynamically by extracting kinematic models, abstracts from the demonstration devices, but implies an execution environment similar to the demonstration environment. If the user demonstration is transformed into a sequence of pre-defined actions, this is called an operation-based representation. If the target system supports the same set of actions, execution becomes very easy. However, this approach only works in very structured environments where a small number of actions is enough to solve most of the forthcoming tasks. From our point of view a hybrid representation, where knowledge about the user demonstration is provided not only on the level of the representation class but also on different abstraction levels, is the most promising approach for obtaining the best results.
• Finally, the representation of the user demonstration must somehow be mapped onto a target system. Therefore, the systems can be classified with regard to the mapping strategy. If the user demonstration is represented by effects in the environment or goal states given by object positions, planning the robot movements is straightforward. However, this requires a very intelligent planning system, and most of the data acquired in the demonstration, like trajectories or grasps, is lost. Therefore, for learning new tasks this approach has only limited applicability. Instead of planning actions based on the environment state and the current goal, the observed trajectories can be mapped directly to robot trajectories, using a fixed or learned transformation model [42, 48]. The problem here is that the execution environment must be similar to the demonstration environment. Thus, this mapping method is restricted to applications where simple teach-in tasks must be performed.
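To make the preceding taxonomy concrete, the following sketch captures the four classification features as a simple data structure. All type and field names are hypothetical and only illustrate how a given PbD system could be profiled; they are not part of any system described in this chapter.

```python
from dataclasses import dataclass
from enum import Enum, auto

class AbstractionLevel(Enum):
    ELEMENTARY_SKILL = auto()   # direct sensor/actuator mappings, e.g. trained neural networks
    COMPLEX_TASK = auto()       # high-level task knowledge requiring background knowledge

class DemonstrationClass(Enum):
    ACTIVE = auto()     # user performs the task, system observes with sensors
    PASSIVE = auto()    # user steers the robot, e.g. via space-mouse or master-slave device
    IMPLICIT = auto()   # user selects iconic commands specifying goals directly

class Representation(Enum):
    EFFECTS = auto()           # changes in the environment
    TRAJECTORIES = auto()      # hand/finger trajectories
    OBJECT_POSITIONS = auto()  # static or kinematic object poses
    OPERATIONS = auto()        # sequence of pre-defined actions
    HYBRID = auto()            # several abstraction levels combined

class MappingStrategy(Enum):
    PLANNING = auto()           # plan robot motions from goal states
    DIRECT_TRAJECTORY = auto()  # map observed trajectories onto robot trajectories

@dataclass
class PbDSystemProfile:
    abstraction: AbstractionLevel
    demonstration: DemonstrationClass
    representation: Representation
    mapping: MappingStrategy

# Illustrative profile of a training-center style system (an assumption, not a formal claim).
example = PbDSystemProfile(AbstractionLevel.COMPLEX_TASK,
                           DemonstrationClass.ACTIVE,
                           Representation.HYBRID,
                           MappingStrategy.DIRECT_TRAJECTORY)
```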
7.4.2 The PbD Process

In this section, the process of programming a robot by demonstration is presented in more detail. To this end, the steps of the PbD process are discussed in the order in which they occur in such a programming cycle. As summarized in figure 7.13, the whole programming process starts with a user demonstration of a specific task which is observed by a sensor system. The following phases are the basic components for successfully making use of data from a human demonstration:
1. Sensor systems are used for observing the user's movements and actions. Important changes like object positions and constraints in the environment can also be detected. The sensor efficiency might be improved by allowing the user to comment on his actions.
2. During the next phase, relevant operations or environment states are extracted from the sensor data. This process is called segmentation. Segmentation can be performed online during the observation process or offline based on recorded data. Here, the system's performance can be improved significantly by asking the user whether decisions made by the system are right or wrong. This is also important for reducing sensor noise.
3. Within the interpretation phase the segmented user demonstration is mapped onto a sequence of symbols. These symbols contain information about the action (e.g. type of grasp or trajectory) as well as important data from the sensor readings, such as forces, contact points, etc.
4. Abstraction from the given demonstration is vital for representing the task solution as generally as possible. Generalization of the obtained operators includes further advice by the user. Spontaneous and non-goal-oriented motions may be identified and filtered out in this phase. It is important to store the task knowledge in a form that makes it reusable even if execution conditions vary slightly from the demonstration conditions.
5. The next necessary phase is the mapping of the internal knowledge representation to the target system. Here, the task solution knowledge generated in the previous phase serves as input. Additional background knowledge about the kinematic structure of the target system is required. Within this phase, as much of the information available from the user demonstration as possible should be used, in order to allow robot programming for new tasks.
6. In the simulation phase the generated robot program is tested for its applicability in the execution environment. It is also desirable to let the user confirm correctness in this phase to avoid dangerous situations in the execution phase.
7. During execution in the real world, success and failure can be used to modify the current mapping strategy (e.g. by selecting higher grasping forces when a picked workpiece slips out of the robot's gripper).
Fig. 7.13. The process of Programming by Demonstration (PbD).
The overall process with its identified phases is suitable for manipulation tasks in general. However, in practice the working components are restricted to a certain domain to reduce environment and action space.
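The seven phases can be read as a fixed processing pipeline. The following sketch shows one possible skeleton of such a cycle in Python; the class and method names are placeholders for the modules discussed above, not an existing implementation.

```python
from abc import ABC, abstractmethod

class PbDPipeline(ABC):
    """Skeleton of the seven-phase PbD cycle; a concrete system supplies the phases."""

    @abstractmethod
    def observe(self, sensor_stream): ...        # phase 1: sensor observation
    @abstractmethod
    def segment(self, observation): ...          # phase 2: segmentation (user may confirm)
    @abstractmethod
    def interpret(self, segments): ...           # phase 3: map segments to symbols
    @abstractmethod
    def generalize(self, symbols): ...           # phase 4: abstraction / generalization
    @abstractmethod
    def map_to_robot(self, task_knowledge): ...  # phase 5: target-system mapping
    @abstractmethod
    def simulate(self, program) -> bool: ...     # phase 6: validation in simulation
    @abstractmethod
    def execute(self, program): ...              # phase 7: execution, returns feedback

    def run(self, sensor_stream):
        observation = self.observe(sensor_stream)
        segments = self.segment(observation)
        symbols = self.interpret(segments)
        task_knowledge = self.generalize(symbols)
        program = self.map_to_robot(task_knowledge)
        if not self.simulate(program):
            raise RuntimeError("generated program failed validation in simulation")
        return self.execute(program)
```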
7.4.3 System Structure
Fig. 7.14. The overall system: training center and robot.
In this section, the structure of a concrete PbD system developed at the University of Karlsruhe [17] is presented as an example of the components a programming-by-demonstration system requires. Since visual sensors cannot easily cope with occlusions and therefore cannot track dexterous manipulation demonstrations, a fixed robot training center has been set up in this project that integrates various sensors for observing user actions. The system is depicted schematically in figure 7.14. It uses data gloves, magnetic field tracking sensors supplied by Polhemus, and active stereo camera heads. Here, manipulation tasks can be taught which the robot is then able to execute in a different environment. To meet the requirements of each phase of the PbD process identified in section 7.4.2, the software system consists of the components shown in figure 7.15. The system has four basic processing modules which are connected to a set of databases and a graphical user interface. The observation and segmentation module is responsible for the analysis and pre-segmentation of the information channels, connecting the system to the sensor data. Its output is a vector describing states in the environment and user actions. This information is stored in a database including the world model. The interpretation module operates on the gained observation vectors and associates sequences of observation vectors with a set of predefined symbols. These parameterizable symbols represent the elementary action set. During this interpretation phase the symbols are chunked into hierarchical macro operators after replacing specific task-dependent parameters by variables. The result is stored in a database as generalized execution knowledge. This knowledge is used by the execution module, which uses specific kinematic robot data for processing. It calculates optimized movements for the target system, taking into account the actual world model. Before the generated program is sent to the target system, its validity is tested through simulation. In case of unforeseen errors, movements of the robot have to be corrected and optimized.
All four components communicate with the user via a graphical user interface. Additional information can be retrieved from the user, and hypotheses can be accepted or rejected via this interface.
Fig. 7.15. System structure.
7.4.4 Sensors for Learning through Demonstration

The question of which sensors are useful for learning from human demonstration is closely related to the goal of the learning process and the background knowledge of the system. When trying to imitate or learn movements from humans, the exact trajectory of the articulated limbs is needed, and consequently motion capturing systems like the camera-based VICON system [68] can be used. Based on marker positions, these sensors extract the 6D poses of the limbs, which are further used for calculating the joint angles of the human demonstrator, or their change. Another way of measuring human body configurations are exoskeletons, like the one offered by MetaMotion [50]. However, imitation of movements alone is only possible with humanoid or anthropomorphic robots having a kinematic structure similar to that of humans. When observing manipulation tasks, the movements of the hands and their pose or configuration (joint angles) are certainly essential for learning skillful operations. Due to occlusion constraints, camera-based systems are not adequate for tracking finger movements. Therefore data gloves like the one offered by Immersion [34] can be used for obtaining hand configurations with acceptable accuracy. The hand position can be gathered by camera systems and/or magnetic trackers. For identifying the goal of manipulation tasks, the analysis of changes in the observed scene is crucial. For this purpose object tracking is needed, and consequently sensors for object recognition and tracking are used. Nowadays active (stereo) cameras are fast and accurate enough and can be used in combination with real-time tracking algorithms to observe complex realistic scenes like household or office environments.
(Figure labels: turn- and tiltable camera heads for color and depth image processing, movable platform for overview perspective, microphone and loudspeaker for speech interaction, data gloves with attached tactile sensors and magnetic field trackers, processing units, magnetic field emitter.)
Fig. 7.16. Training center for observation and learning of manipulation tasks at the University of Karlsruhe.
Figure 7.16 shows the demonstration area ("Training Center") for learning and acquisition of task knowledge built at the University of Karlsruhe (cf. section 7.4.3). During the demonstration process, the user handles objects in the training center. It is equipped with the following sensors: two data gloves with magnetic field based tracking sensors and force sensors mounted on the finger tips and the palm, as well as two active stereo camera heads. Object recognition is done with fast view-based computer vision approaches as described in [18]. From the data gloves, the system extracts and smoothes finger joint movements and hand positions in 3D space. To reduce noise in the trajectory information, the user's hand is additionally observed by the camera system. Both measurements are fused using confidence factors, see [21]. This information is stored with discrete time stamps in the world model database.
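The confidence-based fusion of the tracker and camera estimates of the hand position can be illustrated with a minimal sketch. A simple convex combination is assumed here; the actual weighting scheme used in [21] may differ.

```python
import numpy as np

def fuse_hand_position(p_tracker, c_tracker, p_camera, c_camera):
    """Confidence-weighted fusion of two 3D hand-position estimates.

    p_* are 3-vectors, c_* are scalar confidence factors in [0, 1].
    """
    p_tracker, p_camera = np.asarray(p_tracker, float), np.asarray(p_camera, float)
    w = c_tracker + c_camera
    if w == 0.0:                      # no reliable measurement available
        return None
    return (c_tracker * p_tracker + c_camera * p_camera) / w

# Example: camera partially occluded -> low confidence, the tracker dominates
print(fuse_hand_position([0.42, 0.10, 0.95], 0.9, [0.45, 0.12, 0.93], 0.3))
```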
7.5 Models and Representation of Manipulation Tasks

One of the basic capabilities of robot assistants is the manipulation of the environment. Under the assumption that robots will share their working or living space with humans, learning manipulation tasks from scratch would not be adequate for several reasons: first, in many cases it is not possible for a non-expert user to specify what and how to learn, and secondly, teaching skills is a very time-consuming and possibly annoying task. Therefore a model-based approach, including a general task representation in which a lot of common (or innate) background knowledge is incorporated, is more appropriate. Modeling manipulation tasks requires definitions of general structures used for representing the goal of a task as well as a sequence of actions. Therefore the next section presents some definitions on which the following task models rely.

7.5.1 Definitions

In robotics, the term "manipulation" is usually understood as a change of the position of an object through an action of the robot. In this chapter a more general definition of manipulation is used, which is not oriented towards the goal of the manipulation but rather relates to how a manipulation is performed. The following definition of "manipulation" will be used: a manipulation (task) is a task, or sequence of actions, which contains a grasp action. A grasp action denotes picking up an object, but also touching or pushing it (like pushing a button). The classes of manipulation tasks can be divided into more specific subclasses in order to incorporate them into different task models. Considering the complexity of tasks, the following two subclasses are particularly interesting:
• Fine manipulation: This class contains all manipulations which are executed using the fingers to achieve a movement of objects. From the point of view of robotics, both the observation of these movements and their reproduction pose a big challenge.
• Dual hand manipulation: All manipulations making use of two arms to achieve the task goal are included in this class. For modelling this kind of manipulation, the coordination between the two hands has to be considered in addition.
7.5.2 Task Classes of Manipulations

Considering the household and office domain under the hypothesis that the user causes all changes in the environment through manipulation actions (closed world assumption), a goal-related classification of manipulation tasks can be performed. Here, manipulation actions are viewed in a general sense, including dual hand and fine manipulations. Furthermore, the goal of a manipulation task is closely related to the functional view of actions and consequently, besides Cartesian criteria, incorporates a semantic aspect, which is very important for interaction and communication with humans.
According to the functional role a manipulation action aims at, the following classes of manipulation tasks can be distinguished (Figure 7.17):

Transport operations are one of the most frequently executed action classes for robots in the role of servants or assistants. Tasks like Pick & Place or Fetch & Carry, denoting the transport of objects, are part of almost all manipulations. The change of the Cartesian position of the manipulated object serves as the formal distinguishing criterion of this kind of task. Consequently, for modeling transport actions, the trajectory of the manipulated object has to be considered and modeled as well. In terms of teaching transport actions to robots, the acquisition and interpretation of the performed trajectory is therefore crucial.

Device handling. Another class of manipulation tasks deals with changing the internal state of objects, like opening a drawer, pushing a button etc. Actions of this class are typically applied when using devices, but many other objects also have internal states that can be changed (e.g. a bottle can be open or closed, filled or empty etc.). Every task changing an internal state of an object belongs to this class. In terms of modeling device handling tasks, the transition actions leading to an internal state change have to be modeled. Additionally, the object models need to integrate an adequate internal state description. Teaching this kind of task requires observation routines able to detect internal state changes by continuously tracking the relevant parameters.

Tool handling. The most distinctive feature of actions belonging to the class of tool handling is the interaction modality between two objects, typically a tool and some workpiece. Here the grasped object interacts with the environment or, in the case of dual arm manipulations, with another grasped object. Like the class of device handling tasks, this action type does not only include manipulations using a tool, but rather all actions containing an interaction between objects. Examples of such tasks are pouring a glass of water, screwing etc. The interaction modality is related to the functional role of the objects used, or to the correlation between the roles of all objects involved in the manipulation, respectively. The model of tool handling actions thus should contain a model of the interaction. According to the different modalities of interaction, considering contact, movements etc., a diversity of handling methods has to be modeled. In terms of teaching robots, the observation and interpretation of the actions can be done using parameters corresponding to the functional role and the movements of the involved objects.

These three classes are obviously not disjoint, since, for example, a transport action is almost always part of the other two classes. However, identifying actions as members of these classes eases the teaching and interaction process with robot assistants, since the inherent semantics behind these classes correlates with human descriptions of manipulation tasks.
(Figure panels: 1. Transport Actions (relative position and trajectory), 2. Device Handling (internal state, opening angle), 3. Tool Handling (object interaction).)
Fig. 7.17. Examples for the manipulation classes distinguished in a PbD system.
7.5.3 Hierarchical Representation of Tasks

Representing manipulation tasks as pure action sequences is neither flexible nor scalable. Therefore a hierarchical representation is introduced in order to generalize an action sequence and to chunk elementary skills into more complex subtasks. Looking only at manipulation tasks, the assumption is that each manipulation consists of a grasp and a release action, as defined before. To cope with the manipulation classes specified above, pushing or touching an object is also interpreted as a grasp. Grasping an object is represented as a "pick" action and consists of three sub-actions: an approach, a grasp type and a dis-approach (Fig. 7.18). Each of these sub-actions consists of a variable sequence of elementary skills represented by elementary operators (EOs) of certain types (i.e. for the approach and dis-approach the EOs are of type "move", and the grasp is of type "grasp").
A "place" action is treated analogously.
Fig. 7.18. Hierarchical model of a pick action.
Between a pick and a place operation, depending on the manipulation class, several basic manipulation operations can occur (Fig. 7.19). A demonstration of the task "pouring a glass of water" consists, e.g., of the basic operations "pick the bottle", "transport the bottle", "pour in", "transport the bottle" and "place the bottle". A sequence of basic manipulation operations starting with a pick and ending with a place is abstracted to a manipulation segment.
Fig. 7.19. Representation of a manipulation segment.
The level of manipulation segments constitutes a new abstraction level of closed subtasks of a manipulation. In this context "closed" means that at the end of a manipulation segment both hands are free and the environmental state is stable. Furthermore, the synchronization of EOs for the left and right hand is included in the manipulation segments. Pre- and postconditions describing the state of the environment at the beginning and at the end of a manipulation segment are sufficient for their instantiation. The conditions are propagated from the EO level to the manipulation segment level and are computed from the environmental changes during the manipulation. In parallel to the propagation of the conditions, a generalization in terms of positions, object types and features is performed.
A complex demonstration of a manipulation task is represented by a macro operator, which is a sequence of manipulation segments. The pre- and postconditions of the manipulation segments are propagated to a context and an effect of the macro operator. At execution time, a positive evaluation of the context of a macro enables its execution, and an error-free execution leads to the desired effect in the environment.
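The hierarchy of elementary operators, basic operations, manipulation segments and macro operators can be sketched as nested data structures. The class and field names below are illustrative assumptions, not the original system's interfaces.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ElementaryOperator:            # EO, e.g. of type "move" or "grasp"
    eo_type: str                     # "move" | "grasp" | "release" | ...
    params: Dict[str, object] = field(default_factory=dict)

@dataclass
class BasicOperation:                # e.g. "pick": approach + grasp + dis-approach
    name: str
    eos: List[ElementaryOperator]

@dataclass
class ManipulationSegment:           # closed subtask: starts with a pick, ends with a place
    operations: List[BasicOperation]
    preconditions: Dict[str, object]     # environment state at the beginning
    postconditions: Dict[str, object]    # environment state at the end

@dataclass
class MacroOperator:                 # complete demonstration
    segments: List[ManipulationSegment]

    @property
    def context(self):               # propagated preconditions of the first segment
        return self.segments[0].preconditions if self.segments else {}

    @property
    def effect(self):                # propagated postconditions of the last segment
        return self.segments[-1].postconditions if self.segments else {}

# "pick the bottle" as a basic operation composed of three sub-actions
pick_bottle = BasicOperation("pick bottle", [
    ElementaryOperator("move", {"phase": "approach"}),
    ElementaryOperator("grasp", {"type": "cylindrical power grasp"}),
    ElementaryOperator("move", {"phase": "dis-approach"}),
])
```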
7.6 Subgoal Extraction from Human Demonstration

When teaching robots through demonstration, independently of the sensors used and the abstraction level of the demonstration (e.g. interactive learning, task learning and mostly also skill learning), a search for the goals and subgoals of the demonstration has to be performed. Therefore, a recorded or observed demonstration has to be segmented and analyzed in order to understand the user's intention and to extract all information necessary for enabling the robot to reproduce the learned action. A feasible approach to subgoal extraction proceeds in several steps in which the gathered sensor data is successively abstracted. As a practical example of a PbD system with subgoal extraction, this section presents a two-step approach based on the manipulation models and representation introduced above and the sensor system mounted in the training center (see Fig. 7.16). The example system outlined here is presented in more detail in [71]. In the first step the sensor data is preprocessed in order to extract reliable measurements and key points, which are then used in a second step to segment the demonstration. The following sections describe these steps.
Fig. 7.20. Sensor preprocessing and fusion for segmentation.
7.6.1 Signal Processing

Figure 7.20 shows only the preprocessing steps performed for the segmentation. A more elaborate preprocessing and fusion system would, e.g., be necessary for detecting gesture actions, as shown in [21]. The input signals jt, ht, ft are gathered from the data glove, the magnetic tracker and the tactile force sensors. All signals are filtered, i.e. smoothed and high-pass filtered, in order to eliminate outliers. In the next step, the joint angles are normalized and passed via a rule-based switch to a static (SGC) and a dynamic (DGC) grasp classifier. The tracker values, representing the absolute position of the hand, are differentiated in order to obtain the velocity of the hand movement. As mentioned in [71], the tactile sensors have a 10-20% hysteresis, which is eliminated by the function H(x). The normalized force values, together with the velocity of the hand, are passed to the module R, which consists of a rule set that determines whether an object is potentially grasped and triggers the SGC and DGC. The outputs of these classifiers are grasp types according to the Cutkosky hierarchy (static grasps) or dynamic grasps according to the hierarchy presented in section 7.6.3.2.

7.6.2 Segmentation Step

The segmentation of a recorded demonstration is performed in two steps (a minimal sketch of the contact detection follows the force-profile classes below):
1. Trajectory segmentation: This step segments the trajectory of the hand during the manipulation task. The segmentation is done by detecting grasp actions. Therefore the time of contact between hand and object has to be determined. This is done by analyzing the force values with a threshold-based algorithm. To improve the reliability of the system, the results are fused with those of a second algorithm based on the analysis of the trajectories of finger poses, velocities and accelerations with regard to their minima. Figure 7.21 shows the trajectories of force values and finger joint velocity values of three Pick & Place actions.
2. Grasp segmentation: For detecting fine manipulation, the actions performed while an object is grasped have to be segmented and analyzed. The upper part of figure 7.21 shows that the shape of the force graph features a relatively constant plateau. Since no external forces are applied to the object, this effect is plausible. But if the grasped object collides with the environment, the force profile changes. The result are high peaks, i.e. both amplitude and frequency oscillate, as shown in the lower part of figure 7.21. Looking at the force values during a grasp, three different profiles can be distinguished:
• Static Grasp: Here the gathered force values are nearly constant. The force profile shows characteristic plateaus, whose height indicates the weight of the grasped object.
• External Forces: The force graph of this class shows high peaks. Because of the hysteresis of the sensors, no quantitative prediction about the applied forces can be made. A proper analysis of external forces applied to a grasped object will be the subject of further work.
Fig. 7.21. Analyzing segments of a demonstration: force values and finger joint velocity.
• Dynamic Grasps: During a dynamic grasp both amplitude and frequency oscillate moderately, as a result of finger movements performed by the user.
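The threshold-based contact detection mentioned above, fused with a check for low finger velocity, can be sketched as follows; the thresholds and the fusion rule are illustrative assumptions.

```python
import numpy as np

def detect_grasp_phases(forces, finger_velocities,
                        force_threshold=0.15, velocity_threshold=0.05):
    """Return a list of (start, end) index pairs during which an object is grasped.

    forces:            1D array of summed fingertip force values per time step
    finger_velocities: 1D array of summed absolute finger-joint velocities
    """
    forces = np.asarray(forces, float)
    finger_velocities = np.asarray(finger_velocities, float)

    # contact hypothesis: force above threshold AND fingers nearly at rest
    in_contact = (forces > force_threshold) & (finger_velocities < velocity_threshold)

    phases, start = [], None
    for t, contact in enumerate(in_contact):
        if contact and start is None:
            start = t                      # grasp begins
        elif not contact and start is not None:
            phases.append((start, t))      # grasp ends (release detected)
            start = None
    if start is not None:
        phases.append((start, len(in_contact)))
    return phases
```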
The result of the segmentation step is a sequence of elemental actions like moves and static and dynamic grasps.

7.6.3 Grasp Classification

For mapping detected manipulative tasks onto representations of grasps, the diversity of human object handling has to be considered. In order to preserve as much information as possible from the demonstration, two classes of grasps are distinguished. Static grasps are used for Pick & Place tasks, while fine manipulative tasks require dynamic handling of objects.

7.6.3.1 Static Grasp Classification

Static grasps are classified according to the Cutkosky hierarchy [14]. This hierarchy subdivides static grasps into 16 different types, mainly with respect to their posture (see figure 7.22). Thus, a classification approach based on the finger angles measured by the data glove can be used. All these values are fed into a hierarchy of neural networks, each consisting of a three-layer Radial Basis Function (RBF) network.
Fig. 7.22. Cutkosky grasp taxonomy.
The hierarchy resembles the Cutkosky hierarchy, passing the finger values from node to node as in a decision tree. Thus, each network has to distinguish only very few classes and can be trained on a subset of the whole set of learning examples. The classification results of the system for all static grasp types are listed in table 7.1. A training set of 1280 grasps recorded from 10 subjects and a test set of 640 examples were used for testing. The average classification rate is 83.28%, where most confusions happen between grasps of similar posture (e.g. types 10 and 12 in
the Cutkosky hierarchy). In the application, classification reliability is further enhanced by heuristically considering the hand movement speed (which is very low when grasping an object) and the distance to a possibly grasped object in the world model (which has to be very small for grasp recognition).

Table 7.1. Classification rates gained on test data
Grasp type   1    2    3    4    5    6    7    8
Rate       0.95 0.90 0.90 0.86 0.98 0.75 0.61 0.90

Grasp type   9   10   11   12   13   14   15   16
Rate       0.88 0.95 0.85 0.53 0.35 0.88 1.00 0.45
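The idea of arranging many small classifiers along the Cutkosky taxonomy can be sketched as a decision tree whose nodes each separate only a few subclasses. The node structure below uses placeholder classifiers instead of the original RBF networks, and the leaf indices and split rule are purely illustrative.

```python
class GraspNode:
    """Node in a Cutkosky-style classifier hierarchy.

    Each node holds a small classifier that selects one of its children (or a
    leaf grasp type), so every classifier only has to separate a few classes.
    """
    def __init__(self, classifier=None, children=None, grasp_type=None):
        self.classifier = classifier      # callable: finger_angles -> child index
        self.children = children or []    # list of GraspNode
        self.grasp_type = grasp_type      # set for leaves (1..16 in the taxonomy)

    def classify(self, finger_angles):
        if self.grasp_type is not None:   # leaf reached
            return self.grasp_type
        branch = self.classifier(finger_angles)
        return self.children[branch].classify(finger_angles)

# Illustrative two-level hierarchy: power vs. precision, then leaves.
power_leaf = GraspNode(grasp_type=1)        # stands in for one power grasp type
precision_leaf = GraspNode(grasp_type=9)    # stands in for one precision grasp type
root = GraspNode(
    classifier=lambda angles: 0 if sum(angles) > 10.0 else 1,  # toy split on total flexion
    children=[power_leaf, precision_leaf],
)
print(root.classify([1.2] * 20))            # -> 1 for this strongly flexed hand posture
```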
7.6.3.2 Dynamic Grasp Classification

For describing various household activities like opening a twist cap or screwing a bolt into a nut, simple operations like Pick & Place are inadequate. Therefore other elemental operations like dynamic grasps need to be detected and represented in order to program such tasks within the PbD paradigm. By dynamic grasps we denote operations like screwing, inserting etc., which all have in common that the finger joints change during the grasping phase (i.e. the force sensors provide non-zero values). For classifying dynamic grasps we choose the movement of the grasped object as the distinguishing criterion. This allows for an intuitive description of the user's intention when performing fine manipulative tasks. For describing grasped object movements, these are transformed into hand coordinates. Figure 7.23 shows the axes of the hand, according to the taxonomy of Elliot & Co. [22]. Rotation and translation are defined with regard to the principal axes. Some restrictions of the human hand, like rotation about the y-axis, are considered. Figure 7.24 shows the distinguished dynamic grasps. The classification is done separately for rotation and translation of the grasped object. Furthermore, the number of fingers involved in the grasp (i.e. from 2 to 5) is considered. There are several precision tasks which are performed only with thumb and index finger, like opening a bottle, which is classified as a rotation about the x-axis. Other grasps require higher forces and all fingers are involved in the manipulation, as e.g. Full Roll for screwing actions. The presented classification contains most of the common fine manipulations, if we assume that three- and four-finger manipulations are included in the five-finger classes. For example, a Rock Full dynamic grasp can be performed with three, four or five fingers. Whereas the classification of static grasps in this system is done by neural networks (cf. section 7.6.3.1), Support Vector Machines (SVMs) are used to classify dynamic grasps. Support vector machines are a general class of statistical learning architectures which are becoming more and more important in a variety of applications because of their profound theoretical foundation as well as their excellent empirical performance.
Fig. 7.23. Principal axes of the human hand.
Originally developed for pattern recognition, SVMs justify their application by a large number of positive qualities, like fast learning, accurate classification and, at the same time, high generalization performance. The basic training principle behind the SVM is to find the optimal class-separating hyperplane such that the expected classification error for previously unseen examples is minimized. In the remainder of this section, we shortly present the results of the dynamic grasp classification using Support Vector Machines. For an introduction to the theory of SVMs we recommend e.g. [67]; for more details about the dynamic grasp classification system see [71]. For training the SVM, twenty-six classes corresponding to the elementary dynamic grasps presented in figure 7.24 were trained. Because a dynamic grasp is defined by a progression of joint values, a time-delay approach was chosen. Consequently, the input vector of the SVM classifier comprised 50 joint configurations of 20 joint values each. The training data set contained 2600 input vectors. The fact that SVMs can learn from significantly less data than neural networks ensures that this approach works very well in this case. Figure 7.24 also shows the results of the DGC. Since the figure shows only either the right or the forward direction, the displayed percentages represent the average of the two values. The maximum variance between these two directions is about 2%. Remarkable is the fact that the SVM needs only 486 support vectors (SVs) for generalizing over the 2600 vectors, i.e. 18.7% of the data set. A smaller number of SVs not only improves the generalization behavior but also the runtime of the resulting algorithm during the application. The joint data is normalized, but there certainly exists some user-specific variance. Therefore, the DGC was tested with data from a second user. A sample of ten instances of eight elemental dynamic grasps (i.e. 80 grasps) was evaluated. The maximum variance was less than or equal to 3%.
Fig. 7.24. Hierarchy of dynamic grasps containing the results of the DGC.
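How the time-delay input vectors (50 joint configurations of 20 joint values, i.e. 1000 features) could be fed to an off-the-shelf SVM is sketched below. scikit-learn and synthetic data are used purely for illustration; the kernel and parameters are assumptions, not the reported setup.

```python
import numpy as np
from sklearn.svm import SVC

WINDOW, N_JOINTS = 50, 20          # 50 time steps of 20 joint angles -> 1000-dim input

def window_to_vector(joint_sequence):
    """Flatten a (WINDOW, N_JOINTS) joint-angle window into one feature vector."""
    seq = np.asarray(joint_sequence, float)
    assert seq.shape == (WINDOW, N_JOINTS)
    return seq.reshape(-1)

# Synthetic stand-in for the recorded training data (the real system used 2600
# vectors covering 26 elementary dynamic grasp classes); labels are random here,
# so the resulting classifier is only an interface illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(260, WINDOW * N_JOINTS))
y = rng.integers(0, 26, size=260)

clf = SVC(kernel="rbf", C=10.0)    # kernel and C are assumptions
clf.fit(X, y)
print("support vectors used:", clf.n_support_.sum())
```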
7.7 Task Mapping and Execution

The execution of meaningful actions in human-centered environments places high demands on the robot's architecture, which has to incorporate interaction and learning modalities. The execution itself needs to be reliable and legible in order to be accepted by humans. Two main questions have to be tackled in the context of task execution with robot assistants: how to organize a robot system to cope with the demands on interaction and acceptance, and how to integrate new knowledge into the robot?

7.7.1 Event-Driven Architecture

In order to cope with high interaction demands, an event-driven architecture like the one outlined in figure 7.25 can be used. Here, the focus of the robot's control mechanism lies in enabling the system to react to interaction demands on the one hand and to act autonomously on the other. All multimodal input, especially from gesture and object recognition together with speech input, can be merged into so-called "events", which form the key concept for representing this input.
An event is, on a rather abstract level of representation, the description of something that happened. The input values from the data glove are, e.g., transformed into "gesture events" or "grasp events". Thus, an event is a conceptual description of input or of inner states of the robot on a level which the user can easily understand and cope with; in a way, the robot assistant speaks "the same language" as a human. As events can be of very different types (e.g. gesture events, speech events), they have to be processed with regard to the specific information they carry. Figure 7.25 shows the overall control structure transforming events in the environment into actions performed by robot devices. Action devices are called "agents" and represent semantic groups of action-performing devices (Head, Arms, Platform, Speech, ...). Each agent is capable of generating events. All events are stored in the "Eventplug". A set of event transformers called "Automatons", each of which can transform a certain event or group of events into an "action", is responsible for selecting actions. Actions are then carried out by the corresponding part of the system, e.g. the platform, the speech output system etc. A very simple example would be a speech event generated from the user input "Hello Albert", which the responsible event transformer would transform into the robot action speech output ("Hello!"). In principle, the event transformers are capable of fusing two or more events that belong together and generating the appropriate actions. This would, e.g., be a pointing gesture with accompanying speech input like "take this". The appropriate robot action would be to take the object at which the user is pointing. The order of event transformers is variable, and transformers can be added or deleted as suitable, thus representing a form of focus: the transformer highest in order tries to transform incoming events first, then (if the first one does not succeed) the transformer second in order, etc. The priority module is responsible for encapsulating the mechanism of reordering event transformers (automatons).
Fig. 7.25. Event manager architecture.
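The event/automaton mechanism can be sketched as a prioritized chain of transformers that try to turn incoming events into actions. All names below are illustrative and do not reproduce the actual interfaces of the system.

```python
from collections import deque

class Event:
    def __init__(self, kind, payload=None):
        self.kind, self.payload = kind, payload

class Automaton:
    """Event transformer: consumes events it understands and emits actions."""
    def __init__(self, handles, make_action):
        self.handles = handles            # set of event kinds this automaton accepts
        self.make_action = make_action    # callable: Event -> action description

    def try_transform(self, event):
        return self.make_action(event) if event.kind in self.handles else None

class EventPlug:
    """Stores events and dispatches them to automatons in priority order."""
    def __init__(self, automatons):
        self.automatons = list(automatons)   # list order encodes the current focus
        self.events = deque()

    def post(self, event):
        self.events.append(event)

    def dispatch(self):
        actions = []
        while self.events:
            event = self.events.popleft()
            for automaton in self.automatons:        # highest priority first
                action = automaton.try_transform(event)
                if action is not None:
                    actions.append(action)
                    break                            # event handled, stop searching
        return actions

# "Hello Albert" speech event -> speech output action
greeter = Automaton({"speech.greeting"}, lambda e: ("speech_output", "Hello!"))
plug = EventPlug([greeter])
plug.post(Event("speech.greeting", "Hello Albert"))
print(plug.dispatch())
```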
As mentioned above, event transformers are also capable of fusing two or more events. A typical problem is a verbal instruction by the user combined with some gesture, e.g. "Take this cup." with a corresponding pointing gesture. In such cases, it is the job of the appropriate event transformer to identify the two events which belong together and to initiate the actions which are appropriate for this situation. In each case, one or more actions can be started, as appropriate for a certain event. In addition, it is easily possible to generate questions concerning the input: an event transformer designed specifically for queries to the user can handle events which are ambiguous or underspecified, start an action which asks an appropriate question to the user, and then use the additional information to transform the event. For controlling hardware resources, a scheduling system that locks and unlocks execution of the action performers is applied. All possible actions (learned or innate) which can be executed by the robot system are coded or wrapped as automatons. These control the execution and generate events which can start new automatons or generate "actions" for the subsystems. The mechanism of starting new automatons, which carry out subtasks, leads to a hierarchical execution of actions. Furthermore, interaction can be initiated from each automaton, which implies that every subtask can handle its own interaction needs. To ensure reliability of the system, the "Security and Exception Handler" is introduced into the system's architecture. The main tasks of this module are handling exceptions from the agents, reacting to cancel commands by the user, and cleaning up all automatons internally after an exception. In line with the distributed control through different automatons, every automaton reacts to special events for "cleaning up" its activities and ensures a stable state of all used and locked agents. The presented architecture concept not only provides a working way of controlling robot assistants, but also represents a general way of coping with new knowledge and behaviors: these can easily be integrated as new automatons in the system. The integration process itself can be done via a learning automaton, which creates new automatons. The process of generating execution knowledge from general task knowledge is the subject of the next section.

7.7.2 Mapping Tasks onto Robot Manipulation Systems

Task knowledge acquired through demonstration or interaction represents an abstract description for problem solving, stored in some kind of macro operator which is not directly usable by the target system. Grasps represented in order to describe human hand configurations and trajectories are not optimal for robot kinematics, since they are tailored to human kinematics. Besides, sensor control information for the robot, e.g. force control parameters, cannot be extracted directly from a demonstration. Considering manipulation operations, a method for automatically mapping grasps to robot grippers is needed. Furthermore, trajectories stored in the macro have to be optimized for the execution environment and the target system.
As an example of grasp mapping, consider the 16 different power and precision grasps of Cutkosky's grasp hierarchy [14]. As input, the symbolic operator description and the positions of the user's fingers and palm during the grasp are used. For mapping the grasps, an optimal group of coupled fingers has to be calculated; these are fingers having the same grasp direction. The finger groups are associated with the chosen gripper type. This helps to orient the finger positions for an adequate grasp pose. Since the symbolic representation of performed grasps includes information about the enclosed fingers, only those fingers are used for finding the correct pose. Let us assume that the fingers used for the grasp are numbered as H1, H2, ..., Hn with n ≤ 5.¹ For coupling fingers with the same grasping direction it is necessary to calculate the forces affecting the object. In the case of precision grasps, the finger tips are projected onto a grasping plane E defined by the finger tip positions during the grasp (figure 7.26). Since the plane might be overdetermined if more than three fingers are used, it is determined using the least squares method.
Fig. 7.26. Calculation of the plane E and the gravity point G defined by the finger tips' positions.
Now the force vectors can be calculated using the geometric formation of the fingers given by the grasp type and the force values measured by the force sensors on the finger tips. Prismatic and sphere grasps are distinguished. For prismatic grasps all finger forces are assumed to act against the thumb. This leads to a simple calculation of the forces F1, ..., Fn, n ≤ 5, according to figure 7.27. For circular grasps, the forces are assumed to act towards the finger tips' center of gravity G, as shown in Fig. 7.28. To evaluate the coupling of fingers, the degree of force coupling Dc(i, j) is defined as

    D_c(i, j) = \frac{F_i \cdot F_j}{|F_i|\,|F_j|}    (7.2)
¹ It should be clear that the number of enclosed fingers can be less than 5.
Fig. 7.27. Calculation of forces for a prismatic 5-finger grasp.
Fig. 7.28. Calculation of forces for a sphere precision grasp.
This means |Dc(i, j)| ≤ 1, and in particular Dc(i, j) = 0 for Fi ⊥ Fj. So the degree of force coupling is high for forces acting in the same direction and low for forces acting in orthogonal directions. The degree of force coupling is used for finding two or three optimal groups of fingers using Arbib's method [4]. When the optimal groups of fingers are found, the robot fingers are assigned to these finger groups. This process is rather easy for a three-finger hand, since the three groups of fingers can be assigned directly to the robot fingers. In case there are only two finger groups, two arbitrary robot fingers are selected to be treated as one finger. The grasping position depends on the grasp type as well as on the grasp pose. Since no knowledge about the object is present, the grasp position must be obtained directly from the performed grasp. Two strategies are feasible:
• Precision Grasps: The center of gravity G which has been used for calculating the grasp pose serves as reference point for grasping. The robot gripper is positioned relative to this point when performing the grasp pose.
• Power Grasps: The reference point for power grasping is the user's palm relative to the object. Thus, the robot's palm is positioned with a fixed transformation with respect to this point.
Now, the correct grasp pose and position are determined and the system can perform force controlled grasping by closing the fingers.
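The degree of force coupling from Eq. (7.2) and the grouping of fingers with similar force directions can be sketched as follows. The greedy grouping below is a simplified stand-in for Arbib's method [4], which is not reproduced here.

```python
import numpy as np

def force_coupling(f_i, f_j):
    """Degree of force coupling D_c(i, j) = (F_i . F_j) / (|F_i| |F_j|), Eq. (7.2)."""
    f_i, f_j = np.asarray(f_i, float), np.asarray(f_j, float)
    return float(np.dot(f_i, f_j) / (np.linalg.norm(f_i) * np.linalg.norm(f_j)))

def group_fingers(forces, coupling_threshold=0.8):
    """Greedily group fingers whose pairwise force coupling exceeds the threshold."""
    groups = []
    for i, f in enumerate(forces):
        for group in groups:
            if all(force_coupling(f, forces[j]) > coupling_threshold for j in group):
                group.append(i)
                break
        else:
            groups.append([i])
    return groups

# Thumb opposing four fingers in a prismatic grasp (illustrative force vectors)
forces = [(0, 1, 0), (0.1, -1, 0), (0, -1, 0.1), (-0.1, -1, 0), (0, -0.9, 0)]
print(group_fingers(forces))     # -> [[0], [1, 2, 3, 4]]
```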
Figure 7.29 shows four different grasp types mapped from human grasp operations to a robot equipped with a three-finger hand (Barrett) and a 7-DoF arm. It can be seen that the robot's hand position depends strongly on the grasp type.
Fig. 7.29. Examples of different grasp types: 1. 2-finger precision grasp, 2. 3-finger tripod grasp, 3. circular power grasp, 4. prismatic power grasp.
For mapping simple movements of the human demonstration to the robot, a set of logical rules can be defined that selects sensor constraints depending on the execution context, for example selecting a force threshold parallel to the movement when approaching an object, or selecting zero-force control while grasping an object. Context information thus serves for selecting intelligent sensor control. The context information should be stored in order to be available for processing. Thus, a set of heuristic rules can be used for adjusting the system's behavior depending on the current context. As an overview, the following rules turned out to be useful in specific contexts (a minimal sketch follows the list):
• Approach: The main approach direction vector is determined. Force control is set to 0 orthogonal to the approach vector. Along the grasp direction a maximum force threshold is selected.
• Grasp: Force control is set to 0 in all directions.
• Retract: The main approach direction vector is determined. Force control is set to 0 orthogonal to the approach vector.
• Transfer: Segments of basic movements are chunked into complex robot moves depending on direction and speed.
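A minimal sketch of how such context-dependent rules could be encoded as a lookup from execution context to force-control settings is given below; all values and parameter names are placeholders.

```python
import numpy as np

def control_settings(context, approach_direction=None, max_force=5.0):
    """Return illustrative force-control parameters for a given execution context."""
    if context in ("approach", "retract"):
        d = np.asarray(approach_direction, float)
        d = d / np.linalg.norm(d)                 # main approach direction vector
        return {
            "zero_force_orthogonal_to": d,        # force control set to 0 orthogonal to d
            "force_threshold_along": max_force if context == "approach" else None,
        }
    if context == "grasp":
        return {"zero_force_in_all_directions": True}
    if context == "transfer":
        return {"chunk_basic_moves": True}        # merge segments into complex robot moves
    raise ValueError(f"unknown context: {context}")

print(control_settings("approach", approach_direction=(0, 0, -1)))
```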
Summarizing, the mapping module generates a set of default parameters for the target robot system itself as well as for movements and force control. These parameters are directly used for controlling the robot.

7.7.3 Human Comments and Advice

Although it is desirable to generate hypotheses about the user's wishes and intention based on only one demonstration, this will not work in the general case. Therefore, the system needs to offer the possibility of accepting and rejecting the interpretations of the human demonstration generated by the system. Since the generation of a well-working robot program is the ultimate goal of the programming process, the human user can interact with the system in two ways:
1. Evaluation of hypotheses concerning the human demonstration, e.g. recognized grasps, grasped objects, important effects in the environment.
2. Evaluation and correction of the robot program that will be generated from the demonstration.
Interaction with the system to evaluate learned tasks can be done through speech when only very abstract information has to be exchanged, like specifying the order of actions or some parameters. Otherwise, especially if Cartesian points or trajectories have to be corrected, the interaction must be done in an adequate way, using for example a graphical interface. In addition, a simulation system showing the identified objects of the demonstration and the actuators (the human hand with its observed trajectory and the robot manipulators) in a 3D environment is helpful. During the evaluation phase of the user demonstration, all hypotheses about grasps, including types and grasped objects, can be displayed in a replayed scenario. With graphical interfaces the user is prompted for acceptance or rejection of actions (see figure 7.30). After the modified user demonstration has been mapped onto a robot program, the correctness of the program must be validated in a simulation phase. The modification of the environment and the gripper trajectories are displayed within the 3D environment, giving the user the opportunity to interactively modify trajectories or grasp points of the robot if desired (see figure 7.31).
7.8 Examples of Interactive Programming

The concept of teaching robot assistants interactively by speech and gestures as well as through a demonstration in a training center was evaluated in various systems, e.g. [17]. Due to the cognitive skills of the programmed robots and their event management concept, intuitive and fluent interaction with them is possible. The following example shows how a previously generated complex robot program for laying a table is triggered. As shown in figure 7.32, the user shows the manipulable objects to the robot, which instantiates an executable program from its database with the given environmental information.
Fig. 7.30. Example of a dialog mask for accepting and rejecting hypotheses generated by a PbD system.
Fig. 7.31. Visualization of modifiable robot trajectories.
It is also possible to advise the robot to grasp selected objects and lay them down at desired positions. In this experiment, off-the-shelf cups and plates in different forms and colors were used. The examples were recorded in the training center described in section 7.4.4 [21]. Here, the mentioned sensors are integrated into an environment allowing free movements of the user without restrictions. In a single demonstration, crockery is placed for one person (see left column of figure 7.33). The observed actions can then be replayed in simulation (cf. second column). Here, sensor errors can be corrected interactively. The system displays its hypotheses on recognized actions and their segments in time.
Fig. 7.32. Interaction with a robot assistant to teach how to lay a table. The instantiation of manipulable objects is aided through pointing gestures.
After interpreting and abstracting this information, the learned task can be mapped to a specific robot. The third column of Fig. 7.33 shows the simulation of a robot assistant performing the task. In the fourth column, the real robot performs the task in a household environment.
7.9 Telepresence and Telerobotics

Although autonomous robots are the explicit goal of many research projects, there are nonetheless tasks which are currently very hard for an autonomous robot to perform, and which robots may never be able to perform completely on their own. Examples are tasks where the processing of sensor data would be very complex or where it is vital to react flexibly to unexpected incidents. On the other hand, some of these tasks cannot be performed by humans either, since they are too dangerous (e.g. in contaminated areas), too expensive (e.g. some underwater missions) or too time-consuming (e.g. space missions). In all of these cases, it would be advantageous to have a robot system which performs the given tasks, combined with the foresight and knowledge of a human expert.

7.9.1 The Concept of Telepresence and Telerobotic Systems

In this section, the basic ideas of telepresence and telerobotics are introduced, and some example applications are presented. "Telepresence" denotes a system by which all relevant environment information is presented to a human user as if he or she were present on site. In "telerobotic" applications, a robot is located remotely from the human operator. Due to latencies in the signal transfer from and to the robot, instantaneous intervention by the user is not possible. Fig. 7.34 shows the range of potential systems with regard to both system complexity and the necessary abstraction level of programming.
Fig. 7.33. Experiment "Laying a table". 1st column: real demonstration, 2nd column: analysis and interpretation of the observed task, 3rd column: simulated execution, 4th column: execution with a real robot in a household environment.
Fig. 7.34. The characteristic of telerobotic systems in terms of system complexity and level of programming abstraction.
As can be seen from the figure, both the interaction with the user and the use of sensor information increase strongly with increasing system complexity and programming abstraction. At one end of the spectrum are industrial robot systems with low complexity and explicit programming. The scale extends towards implicit programming (as in Programming by Demonstration systems) and via telerobots towards completely autonomous systems. In terms of telepresence and telerobotics, this means a shift from telemanipulation via teleoperation towards autonomy. "Telemanipulation" refers to systems where the robot is controlled by the operator directly, i.e. without the use of sensor systems or planning components. In "teleoperation", the planning of robot movements is initiated by the operator and relies on environment model data which is permanently updated from sensor data processing. Telerobotic systems are thus semi-autonomous robot systems which integrate the human operator into their control loop because of his perceptive and reactive capabilities, which are indispensable for some tasks. Fig. 7.35 shows the components of such a system in more detail. The robot is monitored by sensor systems like cameras. The sensor data is presented to the operator, who decides about the next movements of the robot and communicates them to the system via appropriate input devices. The planning and control system integrates the instructions of the operator and the results of the sensor data processing, plans the movements to be performed by the robot and controls the robot remotely. The main problem of these systems is the data transfer between the robot and the control system on the user's side. Latencies in the signal transfer due to technical reasons make it impossible for the user or for the planning and control system, respectively, to intervene immediately in the case of emergencies.
Fig. 7.35. Structure of a telerobotic system.
Besides the latencies in the signal transfer, the kind of information gathered from the teleoperated robot system is often limited by the sensors available on board, and the sensing or visualization of important sensor values like force feedback or temperature is restricted as well.

7.9.2 Components of a Telerobotic System

Depending on the application, a variety of input devices can be used to convey the user's commands to the telerobotic system. Graphical user interfaces are very common, as are joysticks, possibly with force feedback, which transfer the user's movements to the system. Other examples are six-dimensional input devices, so-called Space Masters, or, even more elaborate, data gloves or exoskeletons which provide the exact finger joint angles of the human hand.

The output of information, e.g., sensor information from the robot, can be presented to the user in different ways, e.g., in textual form or, more conveniently, via graphical user interfaces. Potential output modes include other forms of visual display, like stereoscopic (shutter glasses) or head-mounted displays, which are often used in virtual or augmented reality. Force feedback is realized by data gloves (ForceGlove) or by exoskeletons which communicate the forces experienced by the robot system to the user.

To overcome the latencies caused by the transmission channel, the robot system needs a local reaction and planning module. This component ensures autonomous behavior for a short time. Local models concerning actions and perception constraints are needed for analyzing, planning, and reacting in a safe way.
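The role of such a local reaction and planning module can be illustrated with a small sketch: a watchdog that bridges transmission latencies by falling back to a safe behavior when operator commands stop arriving in time. The class name, the tolerated latency, and the robot interface (stop_and_hold, execute) are assumptions made for this example, not part of any particular system.

import time

class LocalReactionModule:
    """Minimal sketch: keeps the robot safe while operator commands are delayed."""

    def __init__(self, max_silence_s=0.5):
        self.max_silence_s = max_silence_s        # tolerated command latency
        self.last_command_time = time.monotonic()
        self.current_command = None

    def on_command(self, command):
        # Called whenever a (possibly delayed) operator command arrives.
        self.current_command = command
        self.last_command_time = time.monotonic()

    def step(self, robot):
        # Called by the robot's control loop at a fixed rate.
        silence = time.monotonic() - self.last_command_time
        if silence > self.max_silence_s or self.current_command is None:
            robot.stop_and_hold()                 # safe fallback during link loss
        else:
            robot.execute(self.current_command)   # follow the operator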
These local models need to be synchronized with the teleoperator's planning and control system in order to determine the internal state of the robot. The teleoperator is aided by a planning and control system which simulates the robot. The simulation is updated with sensor values and internal robot states. Based on the simulation, a prediction module is needed for estimating the actual robot state and, if possible, the sensor values. Based on this, the planning module generates a new plan, which is then sent to the remote system.

Besides the automatic planning module, the interaction modality between the teleoperator and the remote system represents a crucial component whenever the human operator controls the robot directly. Here, apart from adequate input and output modalities (depending on the sensors and devices used), the selection of information and its presentation form must be adapted to the actual situation and control strategies. Error and exception handling of a teleoperated system must ensure, on the one hand, that the system does not destroy itself or collide with the environment and, on the other hand, that it maintains the communication channel. Therefore security strategies like stopping the robot's actions in dangerous situations and waiting for manual control are often implemented.

7.9.3 Telemanipulation for Robot Assistants in Human-Centered Environments

Considering robot assistants which work in human-centered environments and are supposed to act autonomously, telemanipulation serves different goals:

Exception and error handling. Robot assistants usually have adequate exception and error handling procedures which enable them to fulfill their normal work. But since these robots are supposed to work in dynamic and often unstructured environments, situations can occur which require the intervention of a human operator. With a teleoperation device, the system can be moved manually into a safe position or state and, if necessary, be reconfigured by a remote service center.

Expert programming device. Adaptation is definitely one of the most important features of a robot assistant, but due to strict constraints in terms of reliability, the adaptation to new environments and tasks where new skills are needed cannot be done automatically by the robot. For extending the capabilities of a robot assistant, a teleoperation interface can serve as a direct programming interface which enables a remote expert to program new skills with respect to the special context the system is engaged in.

In both cases the teleoperation interface is used only in exceptional situations, when an expert is needed to ensure the continued autonomous operation of the system.
7.9.4 Exemplary Teleoperation Applications

7.9.4.1 Using Teleoperated Robots as Remote Sensor Systems

Fig. 7.36a shows the six-legged robot LAURON, which can be guided by a remote teleoperator in order to find a way through the rubble of collapsed buildings. The robot can be equipped with several sensors depending on the situation at hand (e.g., video cameras, infrared cameras, laser scanners, etc.). The images gathered from the on-board sensors are passed to the teleoperator, who guides the robot to places of interest.
Fig. 7.36. (a) Six-legged walking machine LAURON; (b) head-mounted display with the head pose sensor; (c) flow of image and head pose data (for details see [1]).
The undistorted images are presented to the teleoperator via a head-mounted display (HMD) which is fitted with a gyroscopic and inertial sensor. This sensor yields a direction-of-gaze vector which controls both the center of the virtual planar camera and the robot's pan-tilt unit (cf. Fig. 7.36). The direction of gaze is also used to determine which part of the panoramic image needs to be transferred over the network in the future (namely, a part covering a 180° angle centered around the current direction of gaze). Using this technique together with low-impact JPEG compression, the image data can be transferred even over low-capacity wireless links at about 20 fps. To control the movements of LAURON, the graphical point-and-click control interface has been extended with a commercial two-stick analogue joypad. Joypads, as used in the gaming industry, have the great advantage that they are designed to be operated without looking at them (relief buttons, hand ergonomics). This eliminates the need for the operator to look at the keyboard or the mouse. The joypad controls the three Cartesian walking directions (forward, sideways, and turn) and the height of the body (coupled to the step height) and is equipped with a dead-man circuit. The immersive interface makes the following information (apart from the surrounding camera view) available to the user: artificial horizons of both the user (using the HMD inertial sensor) and the robot (using the robot's own inertial sensor), HMD direction of gaze relative to the robot's forward direction, movement of the robot (translation vector, turn angle, and body height), foot-ground contact and display, and network speed. Additionally, a dot in the user's view marks the current
intersection of the laser distance sensor with the scene and presents the distance obtained from the sensor at that point, cf. Fig. 7.37.
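The selection of the transmitted image region can be sketched as follows. The fragment below computes which slice of the panoramic image covers a 180° window centered on the current direction of gaze; the panorama resolution and the function name are assumptions for illustration only.

def crop_window(panorama_width_px, gaze_yaw_deg, window_deg=180.0):
    """Return (start, end) pixel columns of the panorama slice that covers
    window_deg degrees centered on the current direction of gaze."""
    deg_per_px = 360.0 / panorama_width_px
    center_px = (gaze_yaw_deg % 360.0) / deg_per_px
    half_px = (window_deg / 2.0) / deg_per_px
    start = int((center_px - half_px) % panorama_width_px)
    end = int((center_px + half_px) % panorama_width_px)
    return start, end  # if start > end, the slice wraps around the panorama seam

# Example: a 4096-pixel panorama, operator looking 30 degrees to the right
print(crop_window(4096, 30.0))

Only this window needs to be compressed and transmitted, which is what keeps the required bandwidth low enough for wireless links.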
Fig. 7.37. Immersive interface (user view) and the remotely working teleoperator.
7.9.4.2 Controlling and Instructing Robot Assistants via Teleoperation

In this section, an example of a teleoperation system is presented [28]. The system consists of the augmented reality component displayed in Fig. 7.38. For visualization, a head-mounted display is used. The head is tracked using two cameras and black-and-white artificial landmarks. As input devices, a data glove and a so-called magic wand are used, which are both also tracked with such landmarks. This system allows robots with either a mobile platform or an anthropomorphic shape to be teleoperated, as shown in Fig. 7.39. For the mobile platform, e.g., topological nodes of the map can be displayed in the head-mounted display and then manipulated by the human operator with the aid of the input devices. Examples of these operations can be seen in Fig. 7.40. In the case of robots with manipulative abilities, manipulations and trajectories can be visualized, corrected, or adapted as needed (cf. Fig. 7.40). With the aid of this system, different applications can be realized:
• Visualization of robot intention. Such a system is very well suited to visualize geometrical properties, such as planned trajectories for a robot arm or grip points for a dexterous hand. Therefore, the system can be used to refine existing interaction methods by providing the user with richer data about the robot's intention.
• Integration with a rich environmental model. Objects can be selected for manipulation by simply pointing at them, together with immediate feedback to the user about whether the correct object has been selected, etc.
Fig. 7.38. Input setup with data gloves and magic wand as input devices.
Fig. 7.39. (left) Mobile platform ODETE, with emotional expressions that humans can read; (right) anthropomorphic robot ALBERT loading a dishwasher.
• Manipulating the environmental model. Since robots are excellent at refining existing geometrical object models from sensor data, but very bad at creating new object models from unclassified data themselves, the human user is integrated into the robot's sensory loop. The user receives a view of the robot's sensor data and is thus able to use interactive devices to perform cutting and selection actions on the sensor data, helping the robot to create new object models from the data.
Fig. 7.40. Views in the head-mounted display: (top) topological map; (bottom) trajectory of the manipulation "loading a dishwasher".
References

1. Albiez, J., Giesler, B., Lellmann, J., Zöllner, J., and Dillmann, R. Virtual Immersion for Tele-Controlling a Hexapod Robot. In Proceedings of the 8th International Conference on Climbing and Walking Robots (CLAWAR). September 2005.
2. Alissandrakis, A., Nehaniv, C., and Dautenhahn, K. Imitation with ALICE: Learning to Imitate Corresponding Actions Across Dissimilar Embodiments. IEEE Transactions on Systems, Man, and Cybernetics, 32(4):482–496, 2002.
3. Anderson, J. Kognitive Psychologie, 2. Auflage. Spektrum der Wissenschaft Verlagsgesellschaft mbH, Heidelberg, 1989.
4. Arbib, M., Iberall, T., Lyons, D., and Linscheid, R. Hand Function and Neocortex, chapter Coordinated control programs for movement of the hand, pages 111–129. A. Goodwin and T. Darian-Smith (eds.), Springer-Verlag, 1985.
5. Archibald, C. and Petriu, E. Computational Paradigm for Creating and Executing Sensor-based Robot Skills. 24th International Symposium on Industrial Robots, pages 401–406, 1993.
6. Asada, H. and Liu, S. Transfer of Human Skills to Neural Net Robot Controllers. In IEEE International Conference on Robotics and Automation (ICRA), pages 2442–2448. 1991.
7. Asfour, T., Berns, K., and Dillmann, R. The Humanoid Robot ARMAR: Design and Control. In IEEE-RAS International Conference on Humanoid Robots (Humanoids 2000), volume 1, pages 897–904. MIT, Boston, USA, Sep. 7-8 2000.
8. Asfour, T. and Dillmann, R. Human-like Motion of a Humanoid Robot Arm Based on a Closed-Form Solution of the Inverse Kinematics Problem. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2003), volume 1, pages 897–904. Las Vegas, USA, Oct. 27-31 2003.
9. Bauckhage, C., Kummert, F., and Sagerer, G. Modeling and Recognition of Assembled Objects. In IEEE 24th Annual Conference of the Industrial Electronics Society, volume 4, pages 2051–2056. Aachen, September 1998.
10. Bergener, T., Bruckhoff, C., Dahm, P., Janßen, H., Joublin, F., Menzner, R., Steinhage, A., and von Seelen, W. Complex Behavior by Means of Dynamical Systems for an Anthropomorphic Robot. Neural Networks, 12(7-8):1087–1099, 1999.
11. Breazeal, C. Sociable Machines: Expressive Social Exchange Between Humans and Robots. Master's thesis, Department of Electrical Engineering and Computer Science, MIT, 2000.
12. Brooks, R. The Cog Project. Journal of the Robotics Society of Japan Advanced Robotics, 15(7):968–970, 1997.
13. Bundesministerium für Bildung und Forschung. Leitprojekt Intelligente anthropomorphe Assistenzsysteme, 2000. http://www.morpha.de.
14. Cutkosky, M. R. On Grasp Choice, Grasp Models, and the Design of Hands for Manufacturing Tasks. IEEE Transactions on Robotics and Automation, 5(3):269–279, 1989.
15. Cypher, A. Watch what I do - Programming by Demonstration. MIT Press, Cambridge, 1993.
16. Dillmann, R., Rogalla, O., Ehrenmann, M., Zöllner, R., and Bordegoni, M. Learning Robot Behaviour and Skills based on Human Demonstration and Advice: the Machine Learning Paradigm. In 9th International Symposium of Robotics Research (ISRR 1999), pages 229–238. Snowbird, Utah, USA, October 9-12, 1999.
17. Dillmann, R., Zöllner, R., Ehrenmann, M., and Rogalla, O. Interactive Natural Programming of Robots: Introductory Overview. In Proceedings of the DREH 2002. Toulouse, France, October 2002.
18. Ehrenmann, M., Ambela, D., Steinhaus, P., and Dillmann, R. A Comparison of Four Fast Vision-Based Object Recognition Methods for Programming by Demonstration Applications. In Proceedings of the 2000 International Conference on Robotics and Automation (ICRA), volume 1, pages 1862–1867. San Francisco, California, USA, April 24-28, 2000.
19. Ehrenmann, M., Rogalla, O., Zöllner, R., and Dillmann, R. Teaching Service Robots complex Tasks: Programming by Demonstration for Workshop and Household Environments. In Proceedings of the 2001 International Conference on Field and Service Robots (FSR), volume 1, pages 397–402. Helsinki, Finland, June 11-13, 2001.
20. Ehrenmann, M., Rogalla, O., Zöllner, R., and Dillmann, R. Analyse der Instrumentarien zur Belehrung und Kommandierung von Robotern. In Human Centered Robotic Systems (HCRS), pages 25–34. Karlsruhe, November 2002.
21. Ehrenmann, M., Zöllner, R., Knoop, S., and Dillmann, R. Sensor Fusion Approaches for Observation of User Actions in Programming by Demonstration. In Proceedings of the 2001 International Conference on Multi Sensor Fusion and Integration for Intelligent Systems (MFI), volume 1, pages 227–232. Baden-Baden, August 19-22, 2001.
22. Elliot, J. and Connolly, K. A Classification of Hand Movements. In Developmental Medicine and Child Neurology, volume 26, pages 283–296. 1984.
23. Festo. Tron X web page, 2000. http://www.festo.com.
24. Friedrich, H. Interaktive Programmierung von Manipulationssequenzen. Ph.D. thesis, Universität Karlsruhe, 1998.
25. Friedrich, H., Holle, J., and Dillmann, R. Interactive Generation of Flexible Robot Programs. In IEEE International Conference on Robotics and Automation. Leuven, Belgium, 1998.
26. Friedrich, H., Münch, S., Dillmann, R., Bocionek, S., and Sassin, M. Robot Programming by Demonstration: Supporting the Induction by Human Interaction. Machine Learning, pages 163–189, May/June 1996.
27. Fritsch, J., Lomker, F., Wienecke, M., and Sagerer, G. Detecting Assembly Actions by Scene Observation. In International Conference on Image Processing 2000, volume 1, pages 212–215. Vancouver, BC, Canada, September 2000.
28. Giesler, B., Salb, T., Steinhaus, P., and Dillmann, R. Using Augmented Reality to Interact with an Autonomous Mobile Platform. In IEEE International Conference on Robotics and Automation (ICRA 2004), volume 1, pages 1009–1014. New Orleans, USA, Oct. 27-31, 2004.
29. Hashimoto. Humanoid Robots in Waseda University - Hadaly-2 and Wabian. In IEEE-RAS International Conference on Humanoid Robots (Humanoids 2000), volume 1, pages 897–904. Cambridge, MA, USA, Sep. 7-8 2000.
30. Heise, R. Programming Robots by Example. Technical report, Department of Computer Science, The University of Calgary, 1992.
31. Honda Motor Co., Ltd. Honda Debuts New Humanoid Robot "ASIMO". New Technology allows Robot to walk like a Human. Press Release of November 20th, 2000, 2000. http://world.honda.com/news/2000/c001120.html.
32. Ikeuchi, K. and Suehiro, T. Towards an Assembly Plan from Observation. In International Conference on Robotics and Automation, pages 2171–2177. 1992.
33. Ikeuchi, K. and Suehiro, T. Towards an Assembly Plan from Observation, Part I: Task Recognition with Polyhedral Objects. IEEE Transactions on Robotics and Automation, 10(3):368–385, 1994.
34. Immersion. Cyberglove Specifications, 2004. http://www.immersion.com.
35. Inaba, M. and Inoue, H. Robotics Research 5, chapter Vision Based Robot Programming, pages 129–136. MIT Press, Cambridge, Massachusetts, USA, 1990.
36. Jiar, Y., Wheeler, M., and Ikeuchi, K. Hand Action Perception and Robot Instruction. Technical report, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, 1996.
37. Jiar, Y., Wheeler, M., and Ikeuchi, K. Hand Action Perception and Robot Instruction. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, volume 3, pages 1586–1593. 1996.
38. Kaiser, M. Interaktive Akquisition elementarer Roboterfähigkeiten. Ph.D. thesis, Universität Karlsruhe (TH), Fakultät für Informatik, 1996. Erschienen im Infix-Verlag, St. Augustin (DISKI 153).
39. Kanehiro, F., Kaneko, K., Fujiwara, K., Harada, K., Kajita, S., Yokoi, K., Hirukawa, H., Akachi, K., and Isozumi, T. KANTRA – Human-Machine Interaction for Intelligent Robots using Natural Language. In IEEE International Workshop on Robot and Human Communication, volume 4, pages 106–110. 1994.
40. Kang, S. and Ikeuchi, K. Temporal Segmentation of Tasks from Human Hand Motion. Technical Report CMU-CS-93-150, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, April 1993.
41. Kang, S. and Ikeuchi, K. Toward Automatic Robot Instruction from Perception: Mapping Human Grasps to Manipulator Grasps. Robotics and Automation, 13(1):81–95, February 1997.
42. Kang, S. and Ikeuchi, K. Toward Automatic Robot Instruction from Perception: Mapping Human Grasps to Manipulator Grasps. Robotics and Automation, 13(1):81–95, February 1997.
43. Kang, S. B. Robot Instruction by Human Demonstration. Ph.D. thesis, Carnegie Mellon University, Pittsburgh, PA, 1994.
44. Kang, S. B. and Ikeuchi, K. Determination of Motion Breakpoints in a Task Sequence from Human Hand Motion. In IEEE International Conference on Robotics and Automation (ICRA'94), volume 1, pages 551–556. San Diego, CA, USA, 1994.
45. Kibertron. Kibertron Web Page, 2000. http://www.kibertron.com.
46. Koeppe, R., Breidenbach, A., and Hirzinger, G. Skill Representation and Acquisition of Compliant Motions Using a Teach Device. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS'96), volume 1, pages 897–904. Pittsburgh, PA, USA, Aug. 5-9 1996.
47. Kuniyoshi, Y., Inaba, M., and Inoue, H. Learning by Watching: Extracting Reusable Task Knowledge from Visual Observation of Human Performance. IEEE Transactions on Robotics and Automation, 10(6):799–822, 1994.
48. Lee, C. and Xu, Y. Online, Interactive Learning of Gestures for Human/Robot Interfaces. In Proceedings of the IEEE International Conference on Robotics and Automation, volume 4, pages 2982–2987. Minneapolis, Minnesota, April 1996.
49. Menzner, R. and Steinhage, A. Control of an Autonomous Robot by Keyword Speech. Technical report, Ruhr-Universität Bochum, April 2001. Pages 49-51.
50. MetaMotion. Gypsy Specification. Meta Motion, 268 Bush St. 1, San Francisco, California 94104, USA, 2001. http://www.MetaMotion.com/motion-capture/magnetic-motioncapture-2.htm.
51. Mitchell, T. Explanation-based Generalization - a unifying View. Machine Learning, 1:47–80, 1986.
52. Newtson, D. The Objective Basis of Behaviour Units. Journal of Personality and Social Psychology, 35(12):847–862, 1977.
53. Onda, H., Hirukawa, H., Tomita, F., Suehiro, T., and Takase, K. Assembly Motion Teaching System using Position/Force Simulator—Generating Control Program. In 10th IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 389–396. Grenoble, France, September 7-11, 1997.
54. Paul, G. and Ikeuchi, K. Modelling planar Assembly Tasks: Representation and Recognition. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Pittsburgh, Pennsylvania, volume 1, pages 17–22. August 5-9, 1995.
55. Pomerleau, D. A. Efficient Training of Artificial Neural Networks for Autonomous Navigation. Neural Computation, 3(1):88–97, 1991.
56. Pomerleau, D. A. Neural Network Based Vision for Precise Control of a Walking Robot. Machine Learning, 15:125–135, 1994.
57. Rogalla, O., Ehrenmann, M., and Dillmann, R. A Sensor Fusion Approach for PbD. In Proc. of the IEEE/RSJ Conference Intelligent Robots and Systems, IROS'98, volume 2, pages 1040–1045. 1998.
58. Schaal, S. Is Imitation Learning the Route to Humanoid Robots? In Trends in Cognitive Sciences, volume 3, pages 323–242. 1999.
59. Segre, A. Machine Learning of Assembly Plans. Kluwer Academic Publishers, 1989.
60. Steinhage, A. and Bergener, T. Learning by Doing: A Dynamic Architecture for Generating Adaptive Behavioral Sequences. In Proceedings of the Second International ICSC Symposium on Neural Computation (NC), pages 813–820. 2000.
61. Steinhage, A. and v. Seelen, W. Dynamische Systeme zur Verhaltensgenerierung eines anthropomorphen Roboters. Technical report, Institut für Neuroinformatik, Ruhr-Universität Bochum, Bochum, Germany, 2000.
62. Takahashi, T. Time Normalization and Analysis Method in Robot Programming from Human Demonstration Data. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Minneapolis, USA, volume 1, pages 37–42. April 1996.
63. Toyota. Toyota Humanoid Robot Partners Web Page, 2000. http://www.toyota.co.jp/en/special/robot/index.html.
64. Tung, C. and Kak, A. Automatic Learning of Assembly Tasks using a Dataglove System. In Proceedings of the International Conference on Intelligent Robots and Systems (IROS), pages 1–8. 1995.
65. Tung, C. P. and Kak, A. C. Automatic Learning of Assembly Tasks Using a DataGlove System. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS'95), volume 1, pages 1–8. Pittsburgh, PA, USA, Aug. 5-9 1995.
66. Ude, A. Rekonstruktion von Trajektorien aus Stereobildfolgen für die Programmierung von Roboterbahnen. Ph.D. thesis, IPR, 1995.
67. Vapnik, V. Statistical Learning Theory. John Wiley & Sons, Inc., 1998.
68. Vicon. Vicon Web Page, 2000. http://www.vicon.com/products/viconmx.html.
69. Voyles, R. and Khosla, P. Gesture-Based Programming: A Preliminary Demonstration. In Proceedings of the IEEE International Conference on Robotics and Automation, Detroit, Michigan, pages 708–713. May 1999.
70. Zhang, J., von Collani, Y., and Knoll, A. Interactive Assembly by a Two-Arm-Robot Agent. Robotics and Autonomous Systems, 1(29):91–100, 1999.
71. Zöllner, R., Rogalla, O., Zöllner, J., and Dillmann, R. Dynamic Grasp Recognition within the Framework of Programming by Demonstration. In The 10th IEEE International Workshop on Robot and Human Interactive Communication (Roman), pages 418–423. September 18-21, 2001.
Chapter 8
Assisted Man-Machine Interaction

Karl-Friedrich Kraiss

When comparing man-machine dialogs with human-to-human conversation, it is striking how comparatively hassle-free the latter works. Even breakdowns in conversation due to incomplete grammar or missing information are often repaired intuitively. There are several reasons for this robustness of discourse, like the mutually available common sense and special knowledge about the subject under discussion [33]. Also, the conduct of a conversation follows agreed conventions, allows mixed initiative, and provides feedback of mutual understanding. Focused questions can be asked and answers given in both directions. On this basis an assessment of plausibility and correctness can be made, which enables understanding in spite of incomplete information (Fig. 8.1).
Fig. 8.1. Interpersonal conversation (left) vs. conventional man-machine interaction (right).
Unfortunately, man-machine interaction still lacks these features almost entirely. Only some of these properties have been incorporated rudimentarily into advanced interfaces [8, 17, 41]. In conventional man-machine interaction, knowledge resides with the user alone. Neither user characteristics nor the circumstances of operation are made explicit.

In this chapter the concept of assisted interaction is put forward. It is based on the consideration that, if systems were made more knowledgeable, human deficiencies could be avoided or compensated in several respects, thereby increasing individual abilities and overall performance. If, e.g., the traffic situation and destination were known, a driver could be assisted in accelerating, decelerating, and steering his car appropriately. Also, if the whereabouts and intentions of a personal digital assistant user were available, information likely to be needed in a particular context could be provided just in time. Some examples of assistance that can be provided are:
• support of individual skills, e.g. under adverse environmental conditions
• compensation of individual deficiencies, e.g. of the elderly
• correction and prevention of human errors
• automatic override of manual action in case of emergency
• prediction of action consequences
• reduction of user workload
• provision of user tutoring
Assistance augments the information provided to the user and the inputs made by the user in one way or another. In consequence, an assisted system is expected to appear simpler to the user than it actually is and to be easier to learn. Handling is thereby made more efficient or even more pleasant. Also, system safety may profit from assistance, since users can duly be made aware of errors and their consequences. Especially action slips may be detected and suppressed even before they are committed.
8.1 The Concept of User Assistance

The generation of assistance relies on the system architecture presented in (Fig. 8.2). As may be seen, the conventional man-machine system is augmented by three function blocks labeled context of use identification, user assistance generation, and data bases.
Fig. 8.2. Assisted man-machine interaction.
Context of Use Identification

Correct identification of the context of use is a prerequisite for assistance to be useful. It describes the circumstances under which a system operates and includes the state of the user, the system state, and the situation. Depending on the problem at hand, some or all of these attributes may be relevant.

User state is defined by various indicators like user identity, whereabouts, workload, skills, preferences, and intentions. Identification requires, e.g., user prompting
for passwords or fingerprints. Whereabouts may be derived from telecommunication activities or from ambient intelligence methods. Workload assessment is based either on user prompting or on physiological data like heartbeat or skin resistance. The latter requires tedious installation of biosensors on the body, which in general is not acceptable. Skills, preferences, and intentions are usually ascertained by interrogation, a method which is not accepted in everyday operations. Hence only non-intrusive methods can be used, which rely on user observation alone. To this end, user actions or utterances are recorded and processed.

System state characterization is application dependent. In dialog systems, the state of the user interface and of single applications and functions are relevant for an assessment. In contrast, dynamic systems are described by state variables, state variable limitations, and resources. In most systems all these data are readily accessible for automatic logging.

Parameters relevant for situation assessment are system dependent and differ for mobile and stationary appliances. In general, the acquisition of environmental conditions poses no serious problem, as sensors are abundantly available. As long as automatic situation assessment is identical to what a user perceives with his natural senses, assistive functions are based on the same inputs as human reasoning and are therefore coincident with user expectations. Problems may arise, however, if discrepancies occur between both, as may be the case if what technical sensors see differs from what is perceived by human senses. Such deviations may lead to a deficient identification of assistance requirements. Consider, e.g., a car distance control system based on radar that logs, with a linear beam, just the distance to the preceding car. The driver, in contrast, sees a whole queue of cars ahead on a curved road. In this case the speed suggested to the driver may be incompatible with his own experience.

Data Bases

In general, several application specific data bases are needed for user support generation. For dialog systems these data bases may contain menu hierarchy, nomenclature, and functionality descriptions. For dynamic systems a system data base must be provided that contains information related to performance limits, available resources, functionality schematics for subsystems, maintenance data, and diagnostic aids. For vehicle systems an environmental data base will contain navigational, geographical, and topological information, infrastructure data, as well as weather conditions. For air traffic the navigational data base may, e.g., contain airport approach maps, approach procedures, or radio beacon frequencies. In surface traffic, e.g., digital road maps and traffic signaling would be needed.

User Assistance Generation

Based on the context of use and on data base content, there are five options for supporting the user in task execution:
Providing access to information. In this mode the user is granted access to the data bases on request; however, no active consultation is provided. An example would be the call-up of wiring diagrams during trouble shooting. In addition, information may be enhanced to match the characteristics of human senses. Consider, e.g., the use of infrared lighting to improve night vision.

Providing consultation. In this mode the user is automatically provided with information relevant in the prevailing context of use. During diagnostic activities, e.g., available options are presented together with the possible consequences of choosing a particular one. Consultation is derived by the inference engine of an expert system, which makes use of facts and rules accumulated in data bases.

Providing commands. Commands are the proper means of assistance if time is scarce, a situation that often occurs during manual control. The user is then triggered by a visual, acoustic, or haptic command to act immediately. The command can be followed or disregarded but, due to time restrictions, not disputed.

Data entry management. This type of assistance refers to the correction, prediction, and/or completion of keyboard and mouse entries into dialog systems, in order to reduce the number of required input actions.

Providing intervention. Manual inputs into dynamic systems sometimes happen to be inappropriate in amplitude, duration, or timing. Here automatic intervention is applied to appropriately shape, filter, limit, or modify manual input signals. A well known example of this approach is the antilock braking system common in cars. Here the driver's brake pedal pressure is substituted by synthesized intermittent braking pulses, scaled to prevent the wheels from locking. While antilock system operation is audible, in most systems the user is not even aware of intervening systems being active.

Takeover by automation. If a task requires manual action beyond human capabilities, the human operator may be taken out of the loop altogether and authority may be transferred to permanent automation, as exemplified by stabilizers in airplanes. In case task requirements are heavy but still within human capabilities, temporary automation may be considered to reduce workload or enhance comfort, as is the case with autopilots. Suitable strategies for task delegation to automation must then be considered, which are explained in a later section.

In the following, the concept of assistance outlined above will be elaborated in more detail for manual control and dialog tasks.
8.2 Assistance in Manual Control

In this section various aspects of assistance in manual control tasks are discussed. First the needs for assistance are identified, based on an analysis of human motor
limitations and deficiencies. Then parameters describing the context of use as relevant in manual control are described. Finally, measures to support manual control tasks are discussed.

8.2.1 Needs for Assistance in Manual Control

Manual control relies on perception, mental models of system dynamics, and manual dexterity. While human perception outperforms computer vision in many respects, e.g. in Gestalt recognition and visual scene analysis, there are also some serious shortcomings. It is, e.g., difficult for humans to keep attention high over extended periods of time. In case of fatigue there is a tendency towards selective perception, i.e. the environment is partially neglected (tunnel vision). Night vision capabilities are rather poor; also, environmental conditions like fog or rain restrict long range vision. Visual acuity deteriorates with age. Accommodation and adaptation processes result in degraded resolution, even in young eyes. Support needs arise also from cognitive deficiencies in establishing and maintaining mental models of system dynamics.

Human motor skills are on the one hand very elaborate, considering the 27 degrees of freedom of a hand. Unfortunately, hand skills are limited by muscle tremor as far as movement precision is concerned and by fatigue as far as force exertion is concerned [7, 42]. Holding a position steadily over an extended period of time is beyond human abilities. In high precision tasks, as e.g. during medical teleoperation, this cannot be tolerated. Vibrations also pose a problem, since vibrations present in a system, originating e.g. from wind gusts or moving platforms, tend to cross talk into manual inputs and disturb control accuracy.

For the discussion of manual control limitations a simple compensatory control loop is depicted in (Fig. 8.3a), where r(t) is the input reference variable, e(t) the error variable, u(t) the control variable, and y(t) the output variable. The human operator is characterized by the transfer function GH(s), and the system dynamics by GS(s).
Fig. 8.3. Manual compensatory control loop (a) and equivalent crossover model (b).
The control loop can be kept stable manually as long as the open-loop transfer function G0(s) meets the crossover model of (8.1), where ωc is the crossover frequency and τ the delay time accumulated in the system. Model parameters vary with input bandwidth, subjects, and subject practice. The corresponding crossover model loop is depicted in (Fig. 8.3b).

G0(s) = GH(s) · GS(s) = (ωc / s) · e^(−sτ)   (8.1)
A quasilinear formulation of the human transfer function is

GH(s) = (1 + T1·s) / ((1 + T2·s)(1 + Tn·s)) · e^(−sτ)   (8.2)
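As a numerical illustration of the crossover model (8.1), the following sketch evaluates the open-loop frequency response at the crossover frequency and reports the phase margin left by the delay τ. The parameter values are arbitrary assumptions chosen for the example.

import cmath, math

def crossover_open_loop(omega, omega_c=3.0, tau=0.2):
    """Open-loop frequency response G0(jw) = (omega_c / jw) * exp(-jw*tau)."""
    jw = 1j * omega
    return (omega_c / jw) * cmath.exp(-jw * tau)

# At the crossover frequency |G0| = 1; the remaining phase margin is
# 180 degrees plus the (negative) open-loop phase at that frequency.
omega_c, tau = 3.0, 0.2
g = crossover_open_loop(omega_c, omega_c, tau)
phase_margin_deg = 180.0 + math.degrees(cmath.phase(g))
print(f"|G0(j*omega_c)| = {abs(g):.2f}, phase margin = {phase_margin_deg:.1f} deg")

With these values the magnitude at ωc is one and the delay leaves a phase margin of roughly 55°, which illustrates how increasing τ erodes the stability reserve of the manually closed loop.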
Stabilization requires adaptation of the lead/lag and gain terms in GH(s). In a second order system the operator must, e.g., produce a lead time T1 in equation (8.2). With reference to the crossover model it follows that stabilization of third order system dynamics is beyond human abilities (except after long training and the buildup of mental models of the system dynamics).

8.2.2 Context of Use in Manual Control

Context of use for manual control tasks must take into account operator state, system state, and environmental state. The selection of suitable data describing these states is largely application dependent and cannot be discussed in general. Therefore the methods of state identification addressed here will focus on car driving. A very general and incomplete scheme characterizing the context of car driving has to take into account the driver, the vehicle state, and the traffic situation. For each of these, typical input data are given in (Fig. 8.4), which will subsequently be discussed in more detail.
Fig. 8.4. Factors relevant in the context of car driving.
Identification of Driver State

Biosignals appear to be the first choice for driver state identification. In particular, aviation medicine has stimulated the development of manageable biosensors to collect data via electrocardiograms (ECG), electromyograms (EMG), electrooculograms (EOG), and electrodermal activity sensors (EDA). All of these have proved to be useful, but have the disadvantage of intrusiveness. Identification of the driver's state in everyday operations must in no case interfere with driver comfort. Therefore non-intrusive methods of biosignal acquisition have been tested, e.g. deriving skin resistance and heartbeat from the hands holding the steering wheel.
With breakthroughs in camera and computer technology, computer vision has become robust enough for in-car applications. In future premium cars a camera will be mounted behind the rearview mirror. From the video pictures, head pose, line of sight, and eye blinks are identified, which indicate where the driver is looking and where he focuses his attention. Eye blinks are an indicator of operator fatigue (Fig. 8.5). In a state of fatigue frequent slow blinks are observed, in contrast to a few fast blinks in a relaxed state [21].
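A crude way to exploit this observation is to derive a fatigue indicator from the rate of slow blinks in a fixed observation window, as sketched below. The thresholds are invented for illustration and are not validated values.

def fatigue_indicator(blink_durations_s, window_s=60.0,
                      slow_blink_s=0.4, max_slow_blinks_per_min=6):
    """Return True if the blink pattern in the observation window suggests fatigue."""
    slow_blinks = [d for d in blink_durations_s if d >= slow_blink_s]
    slow_rate = len(slow_blinks) * 60.0 / window_s
    return slow_rate >= max_slow_blinks_per_min

# Example: many long blinks within one minute -> flagged as fatigued
print(fatigue_indicator([0.5, 0.45, 0.6, 0.5, 0.55, 0.5, 0.48]))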
Fig. 8.5. Eye blink as an indicator of fatigue (left: drowsy; right: awake).
Driver control behavior is easily accessible from the control inputs to the steering wheel, brakes, and accelerator pedal. The recorded signals may be used for driver modeling. Individual and interindividual behavior may, however, vary significantly over time. For a model to be useful, it must therefore be able to adapt to changing behavior in real time. Consider, e.g., the development of a curve speed warning system for drivers on a winding road. Since the lateral acceleration preferred during curve driving is different for relaxed and dynamic drivers, the warning threshold must be individualized (Fig. 8.6). A warning must only be given if the expected lateral acceleration in the curve ahead exceeds the individual preference range. A fixed threshold set too high would be useless for the relaxed driver, while a threshold set too low would be annoying to the dynamic driver [32].
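The individualization of such a warning threshold can be sketched as follows: the preferred lateral acceleration is estimated from the driver's recent curve history, and a warning is issued only if the acceleration expected in the curve ahead, approximated by v²/r, exceeds this personal range. The percentile and margin used here are assumptions for illustration.

def personal_threshold(observed_lat_acc, percentile=0.9):
    """Estimate the driver's preferred upper bound on lateral acceleration [m/s^2]."""
    ranked = sorted(observed_lat_acc)
    return ranked[int(percentile * (len(ranked) - 1))]

def curve_warning(speed_mps, curve_radius_m, observed_lat_acc, margin=1.1):
    expected = speed_mps ** 2 / curve_radius_m   # lateral acceleration in the curve ahead
    return expected > margin * personal_threshold(observed_lat_acc)

# Relaxed driver history (low accelerations) vs. a fast approach to a tight curve
history = [1.2, 1.5, 1.8, 1.4, 2.0, 1.6, 1.7]
print(curve_warning(speed_mps=25.0, curve_radius_m=120.0, observed_lat_acc=history))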
Fig. 8.6. Interpersonal differences in lateral acceleration preferences during curve driving [32].
With respect to user models, parametric and black-box approaches must be distinguished. If a priori knowledge about behavior is available, a model structure can be developed where only some parameters remain to be identified; otherwise, no a priori assumptions can be made [26]. An example of parametric modeling is the method
for the direct measurement of crossover model parameters proposed by [19]. Here a gradient search approach is used to identify ωc and τ in the frequency domain, based on partial derivatives of G0, as shown in (Fig. 8.7). In contrast to parametric models, black box models use regression to match the model output to observed behavior (Fig. 8.8). The applied regression method must be able to cope with inconsistent and uncertain data and follow behavioral changes in real time.
Fig. 8.7. Identification of crossover model parameters ωc and τ [19].
One option for parametric driver modeling is the functional link network, which approximates a function y(x) by nonlinear basis functions Φj(x) [34]:

y(x) = Σ_{j=1}^{b} wj · Φj(x)   (8.3)
Kraiss [26] used this approach to model driver following and passing behavior in a two-way traffic situation. The weights wj are learned by supervised training, and the weighted basis functions are summed to yield the desired model output. Since global basis functions are used, the resulting model represents mean behavior averaged over subjects and time.
Fig. 8.8. Black-box model identification.
For individual driver support, mean behavior is not sufficient, since rare events are also of interest and must not be concealed by frequently occurring events. For
the identification of such details, semi-parametric, local regression methods are better suited than global regression [15]. In a radial basis function network, a function y(x) is accordingly approximated by weighted Gaussian functions Φj:

y(x) = Σ_{j=1}^{b} wj · Φj(‖x − xj‖)   (8.4)
where ‖x − xj‖ denotes a norm, usually the Euclidean distance, of a data point x from the center xj of Φj. Garrel [13] used a modified version of basis function regression as proposed by [39] for driver modeling. The task was to learn individual driving behavior when following a leading car. Inputs were the own speed as well as distance and speed differences. Data were collected in a driving simulator during 20-minute training intervals. As documented by (Fig. 8.9), the model output emulates the driver's behavior almost exactly.
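A minimal sketch of such a radial basis function regression, following (8.4), is given below. It assumes NumPy, uses the training samples themselves as centers, and picks an arbitrary kernel width; it illustrates the principle rather than reproducing the model of [13].

import numpy as np

def rbf_features(X, centers, width=1.0):
    """Gaussian basis functions Phi_j(||x - x_j||) for each input row of X."""
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return np.exp(-(dists ** 2) / (2.0 * width ** 2))

# Inputs: own speed, distance to leader, speed difference; target: acceleration command
X = np.array([[20.0, 30.0, -2.0], [20.0, 50.0, 0.0], [15.0, 10.0, -5.0], [25.0, 60.0, 3.0]])
y = np.array([-0.5, 0.0, -2.0, 0.5])

centers = X.copy()                            # one basis function per training sample
Phi = rbf_features(X, centers, width=10.0)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # fit the weights w_j

x_new = np.array([[18.0, 20.0, -3.0]])
print(rbf_features(x_new, centers, width=10.0) @ w)

Because the Gaussians are local, new observations mainly adjust the weights of nearby centers, which is what allows such a model to track individual, changing behavior.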
Fig. 8.9. Identification of follow-up behavior by driver observation.
Identification of Vehicle State

Numerous car dynamics parameters can potentially contribute to driver assistance generation. Among these are state variables for translation, speed, and acceleration. In addition there are parameters indicating limits of operation like, e.g., car acceleration and braking. The friction coefficient between wheel and street is important to prevent locking and skidding of the wheels. Finally, car yaw angle acceleration has to be mentioned as an input for automatic stabilization. All these data can be accessed on board without difficulty.

Identification of a Traffic Situation

Automatic acquisition and assessment of a traffic situation is based on the current position and the destination of one's own car in relation to other traffic participants.
Also traffic rules to be followed and optional paths in case of traffic jams must be known. Further factors of influence include the time of day, week, month, or year as well as weather conditions and the available infrastructure.
Fig. 8.10. Detection ranges of distance sensors common in vehicles [22].
Distance sensors used for reconnaissance include far, mid, and short range radar, ultrasonic, and infrared sensors (Fig. 8.10). Recently, video cameras have also become powerful enough for outdoor application. Especially CMOS high dynamic range cameras can cope with varying lighting conditions. Advanced real-time vision systems try to emulate human vision by active gaze control and simultaneous processing of the foveal and peripheral fields of view. A summary of related machine vision developments for road vehicles in the last decade may be found in [11]. Visual scene analysis also draws to a great extent upon digital map information for a priori hypotheses disambiguation [37]. In digital maps, infrastructural data like the position of gas stations, restaurants, and workshops are associated with geographical information. In combination with traffic telematics, information can be provided beyond what the driver perceives with his eyes, e.g. looking around a corner. Telematics also enables communication between cars and drivers (a feature to be used e.g. for advance warnings), as well as between driver and infrastructure, e.g. via electronic beacons (Fig. 8.11).

Sensor data fusion is needed to take best advantage of all sensors. In (Fig. 8.12) a car is located on the street by a combination of digital map and differential global positioning system (DGPS) outputs. In parallel, a camera tracks the road side-strip and mid-line to exactly identify the car position relative to the lane margins. Following augmentation by radar, traffic signs and obstacles are extracted from the pictures, as well as other leading, trailing, passing, or cutting-in cars. Reference to the digital map also yields the curve radius and reference speed of the road ahead. Finally, braking distance and impending skidding are derived from yaw angle sensing, wheel slip detection, wheel air pressure, and road condition.
Fig. 8.11. Traffic telematics enables contact between drivers and between driver and infrastructure.
Fig. 8.12. Traffic situation assessment by sensor data fusion.
8.2.3 Support of Manual Control Tasks

A system architecture for assistance generation in dynamic systems, which takes the context of manual control and digital map data bases as inputs, is proposed in (Fig. 8.13). The manual control assistance generator provides complementary support with respect to information management, control input management, and the substitution of manual inputs by automation takeover (Fig. 8.13).

Information Management

Methods to improve information display in a dynamic system may be based on environmental cue perceptibility augmentation, on making available predictive
Fig. 8.13. Assistance in manual control.
information, or on the provision of commands and alerts. In this section the different measures are presented and discussed.

Environmental cue perceptibility augmentation. Compensation of sensory deficiencies by technical means applies in difficult environmental conditions. An example from ground traffic is active night vision, where the headlights are made to emit near-infrared radiation reaching farther than the car's own low beam. Thus a normally invisible pedestrian (Fig. 8.14, left) is illuminated and can be recorded by an IR-sensitive video camera. The infrared picture is then superimposed on the outside view with a head-up display (Fig. 8.14, right). Hereby no change of accommodation is needed when looking from the panel to the outside and vice versa.
Fig. 8.14. Left: Sensory augmentation by near infrared lighting and head-up display of pictures recorded by an IR sensitive video camera [22].
Provision of predictive information. Systems with high inertia or with long reaction times caused by inertia or transmission delays are notoriously hard to control manually. An appropriate method to compensate such latencies is prediction. A predictor presents to the operator not only the actual state but also a calculated future state, which is characterized by a transfer function GP (Fig. 8.15). A simple realization of the predictor transfer function GP by a second order Taylor expansion with prediction time τ yields:
Fig. 8.15. Control loop with prediction.
GP(s) = yP/y = 1 + τ·s + (τ²/2)·s² + remnant   (8.5)
Now assume the dynamics GS of the controlled process to be of third order (with ζ = degree of damping and ωn = natural frequency):

GS(s) = y/u = 1 / ((1 + (2ζ/ωn)·s + (1/ωn²)·s²) · s)   (8.6)
With τ = √2/ωn and ζ = 1/√2 the product GS(s) · GP(s) simplifies – with respect to the predicted state variable – to an easily controllable integrator (8.7):

GS(s) · GP(s) = yP/u = (1 + τ·s + (τ²/2)·s²) · ((1 + (2ζ/ωn)·s + (1/ωn²)·s²) · s)⁻¹ = 1/s   (8.7)
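In discrete time, the Taylor predictor of (8.5) can be sketched by estimating the derivatives from the last few samples, as shown below (the remnant is neglected). Sampling interval and prediction time are assumed values for illustration.

def taylor_predict(y_hist, dt, tau):
    """Predict y(t + tau) from the last three samples using (8.5):
    y_p = y + tau*y' + (tau**2 / 2)*y''."""
    y2, y1, y0 = y_hist[-3], y_hist[-2], y_hist[-1]
    dy = (y0 - y1) / dt                  # first derivative estimate
    ddy = (y0 - 2.0 * y1 + y2) / dt**2   # second derivative estimate
    return y0 + tau * dy + 0.5 * tau**2 * ddy

# Example: depth samples of a diving boat, predicted 10 s ahead
depths = [20.0, 21.0, 22.2]              # metres, sampled every second
print(taylor_predict(depths, dt=1.0, tau=10.0))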
As an example, a submarine depth control predictor display is shown in Fig. 8.16. The boat is at 22 m actual depth and tilts downward. The predictor is depicted as a quadratic spline line. It shows that the boat will level off after about 28 s at a depth of about 60 m, given that the control input remains as is. Dashed lines in (Fig. 8.16) indicate the limits of maneuverability if the depth rudders are set to the extreme upper and lower deflections, respectively. The helmsman can observe the effect of inputs well ahead of time and take corrective actions in time. The predictor display thus visualizes system dynamics which otherwise would have to be trained into a mental model [25].

Fig. 8.16. Submarine depth predictor display.

Predictors can be found in many traffic systems, in robotics, and in medicine. Successful compensation of several seconds of time delay by prediction has, e.g., been demonstrated for telerobotics in space [16]. Reintsema et al. [35] summarize recent developments in high fidelity telepresence in space and surgery robotics, which also make use of predictors.

Provision of alerts and commands. Since reaction time is critical during manual control, assistance often takes the form of visual, acoustic, or haptic commands and alerts. Making best use of the human senses requires addressing the sensory modality that matches the task requirements best. Reaction to haptic stimulation is the fastest among human modalities (Fig. 8.17). Also, haptic stimulation is perceived in parallel to seeing and hearing and thus establishes an independent sensory input channel. These characteristics have led to an extensive use of haptic alerts. In airplanes, control stick shakers become active if the airspeed is getting critically slow. More recently, a tactile situation awareness system for helicopter pilots has been presented, where the pilot's waistcoat is equipped with
120 pressure sensors which provide haptic 3D warning information. In cars the driver receives haptic torque or vibration commands via the steering wheel or accelerator pedal as part of an adaptive distance and heading control system. Also redundant coding of a signal by more than one modality is a proven means to guarantee perception. An alarm coded simultaneously via display and sound is less likely to be missed than a visual signal alone [5].
Fig. 8.17. Reaction times following haptic and acoustic stimulation [30].
The buckle-up sign when starting a car is an example of a nonverbal acoustic alert. Speech commands are common in car navigation systems. In commercial airplanes, synthesized altitude callouts are provided during landing approaches. In case of impending flight into terrain, pull-up calls alert the pilot to take action. In military aircraft, visual commands are displayed on a head-up display to point to an impending threat. However, visual commands address a sensory modality which often is
already overloaded by other observation tasks. Therefore other modalities are often preferable.

Control Input Management
In this section two measures of managing control inputs in dynamic systems are illustrated. These refer to control input modification and to the adaptive filtering of platform vibrations.

Control input modifications. Modifications of the manual control input signal may consist in its shaping, limiting, or amplification. Nonlinear signal shaping is, e.g., applied in active steering systems in premium cars, where it contributes to increased comfort of use. During parking at low speed, small turning angles at the steering wheel result in large wheel rotation angles, while during fast highway driving, wheel rotations corresponding to the same steering input are comparatively small. The amplitude of the required steering actions is thus made adaptive to the traffic situation.

Similar to steering gain, nonlinear force execution gain can be a useful method of control input management. Applying the right level of force to a controlled element can be a difficult task. It is, e.g., a common observation that people refrain from exerting the full power at their disposal on the brake pedal. In fact, even in emergency situations, only 25% of the physically possible pressure is applied. In consequence, a car comes to a stop much later than brakes and tires would permit. In order to compensate this effect, a braking assistant is offered by car manufacturers, which intervenes by multiplying the applied braking force by a factor of about four. Hence already moderate braking pressure results in emergency braking (Fig. 8.18). This braking augmentation is, however, only triggered when the brake pedal is activated very fast, as is the case in emergency situations.
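The triggering logic described above can be sketched as follows. The amplification factor of four follows the text, while the pedal-rate threshold, normalization, and function name are illustrative assumptions.

def brake_assist(pedal_force, pedal_force_rate, trigger_rate=3.0, gain=4.0):
    """Amplify the driver's brake pedal force when the pedal is applied very fast.

    pedal_force      -- force currently exerted by the driver (normalized 0..1)
    pedal_force_rate -- rate of force build-up (1/s); high values indicate panic braking
    """
    if pedal_force_rate >= trigger_rate:
        return min(1.0, gain * pedal_force)   # emergency: boost towards full braking
    return pedal_force                        # normal braking: pass through unchanged

print(brake_assist(0.25, pedal_force_rate=5.0))  # panic stop -> 1.0 (full braking)
print(brake_assist(0.25, pedal_force_rate=0.5))  # gentle braking -> 0.25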
Fig. 8.18. The concept of braking assistance.
A special case of signal shaping is control input limiting. The structure of an airplane is, e.g., dimensioned for a range of permitted flight maneuvers. In order to prevent a pilot from executing illegal trajectories, like flying too narrow turns at excessive speeds, steering inputs are confined with respect to the exerted accelerations (g-force limiters). Another example refers to traction control systems in cars, where inputs to the accelerator pedal are limited to prevent the wheels from slipping.

Input shaping by smoothing plays an important role in human-robot interaction and teleoperation. Consider, e.g., a surgeon using a robot actuator over an extended period of time. Under the resulting strain his muscles will unavoidably show tremor that couples into the machine. This tremor can be smoothed out to guarantee steady movements of the robot tool.

Adaptive filtering of moving platform vibration cross talk into control input. In vehicle operations, control inputs may be affected by moving platform vibrations, which must be disposed of by adaptive filtering. This method makes use of the fact that the noise n0 contained in the control signal s is correlated with a reference noise n1 acquired at a different location in the vehicle [14, 29, 40]. The functioning of adaptive filtering is illustrated by (Fig. 8.19), where s, n0, n1, and y are assumed to be stationary and have zero means:
Fig. 8.19. Adaptive filtering of moving platform vibrations.
δ = s + n0 − y   (8.8)
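One way to realize the adaptation, sketched below under the assumption of a simple LMS update and NumPy, is to let a short FIR filter estimate n0 from the reference noise n1 and subtract this estimate according to (8.8); the derivation in (8.9) below shows why minimizing E[δ²] leaves the wanted signal s untouched. Filter length and step size are arbitrary choices.

import numpy as np

def lms_cancel(primary, reference, n_taps=16, mu=0.01):
    """Adaptive noise canceling: primary = s + n0, reference = n1 (correlated with n0).
    Returns the error signal delta = s + n0 - y, which converges towards s."""
    w = np.zeros(n_taps)
    out = np.zeros(len(primary))
    for k in range(n_taps, len(primary)):
        x = reference[k - n_taps:k][::-1]   # most recent reference samples
        y = w @ x                           # filter output, estimate of n0
        delta = primary[k] - y              # error signal according to (8.8)
        w += 2.0 * mu * delta * x           # LMS update minimizing E[delta^2]
        out[k] = delta
    return out

# Example: sinusoidal control input buried in vibration noise picked up twice
t = np.arange(0, 10, 0.01)
s = np.sin(2 * np.pi * 0.3 * t)                      # wanted manual input
n1 = np.random.default_rng(0).normal(size=len(t))    # reference vibration signal
n0 = 0.8 * np.roll(n1, 2)                            # correlated noise in the input channel
cleaned = lms_cancel(s + n0, n1)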
Squaring (8.8) and taking into account that s is correlated with neither n0 nor y yields the expectation

E[δ²] = E[s²] + E[(n0 − y)²] + 2E[s(n0 − y)] = E[s²] + E[(n0 − y)²]   (8.9)

Adapting the filter to minimize E[δ²] will minimize E[(n0 − y)²] but not affect the signal power E[s²]. The filter output y is therefore the best least squares estimate of the noise n0 contained in the manual control input signal. Subtracting y according to (8.8) results in the desired noise canceling.

Automation

If the required speed or accuracy of manual inputs needed in a dynamic system is beyond human capabilities, manual control is not feasible and must be substituted
by permanent automation. Examples of permanent automation are stabilizers in aircraft, which level out disturbances resulting from high frequency air turbulence.

Frequently it is assumed that automation simplifies the handling of complex systems in any case. This is, however, not true, since the removal of subtasks from a working routine truncates the familiar working procedures. Users find it difficult to cope with the task segments remaining for manual operation. Therefore it is widely accepted that a user centered procedure must be followed in man-machine task allocation. In order to be accepted, automated functions must, e.g., be transparent in the sense that their performance corresponds to user expectations. Any automode operation puts an operator into an out-of-the-loop situation. Care has to be taken that the training level of an operator does not deteriorate due to the lack of active action.

Besides permanent automation there is a need for optional on/off automation of tasks during phases of extensive operator workload or for comfort. Autopilots are, e.g., switched on regularly by pilots while cruising at high altitudes. Also, moving or grasping objects with robotic actuators is an application where on/off automation makes sense. The manual control of the various degrees of freedom involved in manipulators demands time and special skills on the part of the operator [6]. Instead, a manipulator can be taught to perform selected skills just once. Subsequently, trained skills may be called up as needed and executed autonomously [23, 31].

If automation is only temporary, decisions about function allocation and handover procedures between man and machine have to be made, which both influence system safety and user acceptance [27]. The handover procedures for on/off automatic functions may follow different strategies:
Management by delegation (autonomous operation if switched on) Management by consent (autonomous operation following acknowledgement by the operator) Management by exception (autonomous operation, optional check by the operator)
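The distinction between the three strategies reduces to how the operator is involved before and after an automated action. The following fragment is a purely illustrative summary of that mapping; the type and field names are hypothetical and not taken from any cited system.

enum HandoverStrategy { Delegation, Consent, Exception };

struct OperatorInvolvement {
  bool requiresAcknowledgement;  // operator must confirm before execution
  bool offersOptionalCheck;      // operator may review the result, but need not act
};

// Hedged sketch: map each handover strategy to its operator involvement.
OperatorInvolvement policyFor(HandoverStrategy s) {
  switch (s) {
    case Delegation: return {false, false};  // runs autonomously once switched on
    case Consent:    return {true,  false};  // waits for operator acknowledgement
    case Exception:  return {false, true};   // runs autonomously, optional check
  }
  return {true, false};                      // conservative default
}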
The choice among these handover strategies is determined by the workload of the operator and the possible consequences of a malfunction. Dangerous actions should never be delegated to autonomous operation, since unexpected software and logic errors may occur at any time. The transition from manual to automated operation also raises serious liability problems. If, for example, emergency braking is triggered automatically and turns out in hindsight to have been erroneous, insurance companies tend to deny indemnity. Intervention is a special form of temporary takeover by automation. The electronic stability program (ESP) in cars, for example, becomes active only if a tendency towards instability is observed. In such cases it makes use of combined single-wheel braking and active steering to stabilize the car's yaw angle. In general the driver does not even become aware that he has been assisted in stabilizing his car.
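As a rough illustration of such intervention logic, and not a description of any production system, the following sketch triggers a corrective action only when the measured yaw rate deviates too far from the value expected from the driver's steering input; the reference model, wheelbase, understeer gradient, and threshold are all assumed values chosen for the example.

#include <cmath>

// Illustrative steady-state reference: yaw rate expected from steering angle
// and speed (single-track model; parameter values are assumptions).
double expectedYawRate(double steeringAngleRad, double speedMps) {
  const double wheelbase = 2.7;             // m (assumed)
  const double understeerGradient = 0.002;  // s^2/m (assumed)
  return (speedMps * steeringAngleRad) /
         (wheelbase + understeerGradient * speedMps * speedMps);
}

// Intervene only when instability is detected; otherwise the driver
// remains unaware of the assistance, as described above.
bool shouldIntervene(double steeringAngleRad, double speedMps,
                     double measuredYawRateRadPerS) {
  const double threshold = 0.1;  // rad/s deviation tolerated (assumed)
  const double deviation =
      measuredYawRateRadPerS - expectedYawRate(steeringAngleRad, speedMps);
  return std::fabs(deviation) > threshold;
}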
8.3 Assistance in Man-Machine Dialogs

In this section the need for assistance in dialogs is substantiated. Then methods for identifying the context of a dialog are treated. Finally, options for dialog support are presented and discussed.

8.3.1 Needs for Assistance in Dialogs

Executing a dialog means performing a sequence of discrete events at the user interface of an appliance in order to achieve a set goal. To succeed in this task, inputs have to be made to formulate legal commands, or menu options have to be selected and triggered in the right order. If an appliance provides many functions and if display space is scarce, the menu tends to be hierarchically structured in many levels (Fig. 8.20).
[Diagram: only the top menu layer is visible; below the selected top-level menu option i, hidden sublayer menu options unfold over a first, second, and third sublayer until the desired third-sublevel menu option j is reached.]

Fig. 8.20. Example menu hierarchy.
Executing a dialog requires two kinds of knowledge. On the one hand, the user needs application knowledge, which refers to knowing whether a function is available, which service it provides, and how it is accessed. On the other hand, the user must be familiar with the menu hierarchy, the interface layout, and the labeling of functions; this know-how is subsumed as interaction knowledge. Unfortunately only the top-layer menu items are visible to the user, while sublayers are hidden until activated. Since a desired function may be buried in the depths of a multilayer dialog hierarchy, menu selection requires a mental model of the dialog tree structure that tells where a menu item is located in the menu tree and by which menu path it can be reached. An additional difficulty arises from the fact that the labeling of menu items can be misleading, since many synonyms exist. Occasionally even identical labels occur on different menu branches and for different functions. Finally, the geometric layout of the interface must be known in order to localize the correct control buttons. All of this must be kept ready in memory. Unfortunately human working memory is limited to only 7 ± 2 chunks. Mental models are also restricted in accuracy and completeness and are subject to forgetting, which forces users to consult manuals or ask other people.
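The mental model just described essentially stores, for each function, its path through the menu tree. The sketch below shows how an assistance system could compute such a path from an explicit tree representation; the data structure is a generic illustration and not taken from any particular appliance.

#include <string>
#include <vector>

// Generic menu tree node (illustrative).
struct MenuItem {
  std::string label;
  std::vector<MenuItem> children;
};

// Depth-first search for the path from the menu root to the item with the
// given label. Returns true and fills 'path' if the item was found.
bool findMenuPath(const MenuItem& node, const std::string& target,
                  std::vector<std::string>& path) {
  path.push_back(node.label);
  if (node.label == target) {
    return true;
  }
  for (const MenuItem& child : node.children) {
    if (findMenuPath(child, target, path)) {
      return true;
    }
  }
  path.pop_back();  // target not below this node: backtrack
  return false;
}

Since, as noted above, identical labels may occur on different branches, such a search returns only the first match; a real assistance system would have to disambiguate using additional context.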
Depending on the available level of competence in the application and interaction domains, users are classified as novices, trained specialists, application experts, and interaction experts (Fig. 8.21). Assistance in dialogs is geared towards reducing the described knowledge requirements. In particular for products designed for everybody's use, like mobile phones or electronic appliances, interaction knowledge requirements must be kept as low as possible to ensure positive out-of-the-box experiences on the part of the buyers. As will be shown, identification of the context of dialog is the key to achieving this goal.
[Diagram: users plotted by application knowledge (low to high) versus interaction knowledge (low to high): least informed user, novice, trained specialist, interaction expert, and application expert.]
Fig. 8.21. Application and interaction knowledge required in man-machine dialogs.
8.3.2 Context of Dialogs

The context of a dialog is determined by the state of the user, the state of the dialog system itself, and the situation in which the dialog takes place, as depicted in (Fig. 8.22). The figure lists the parameters available for identifying the context of dialog, which will be discussed in more detail in the following sections.
[Block diagram: user identity and inputs feed user state identification (skill level, preferences, plans/intentions); system functions, the running application, the dialog hierarchy, and user inputs feed dialog system state identification (dialog state and history); user whereabouts, movements, and activities feed dialog situation assessment; together these three blocks establish the context of dialog.]
Fig. 8.22. Context of dialog.
User State Identification

Data sources for user state identification are the user's identity and inputs. From these his skill level, preferences, and intentions are inferred. Skill level and preferences are associated with the user's identity. Entering a password is a simple method of making oneself known to a system. Alternatively, biometric access like fingerprints is an option. Both approaches require action by the user. Nonintrusive methods rely on observed user actions. Preferences may, for example, be derived from the frequency with which functions are used. The timing of keystrokes is an indicator of skill level. This method exploits the fact that people deliberately choose interaction intervals between 2 and 5 seconds when not forced to keep pace. Longer intervals are indicative of deliberation, while shorter intervals point to a trial-and-error approach [24].

Knowing what a user plans to do is a crucial asset in providing support. At the same time it is difficult to find out the pursued plans, since prompting the user is out of the question. The only applicable method then is keyhole intent recognition, which is based on observation of user inputs [9]. Techniques for plan identification from logged data include Bayesian Belief Networks (BBNs), Dempster-Shafer theory, and fuzzy logic [1, 2, 10, 18, 20]. Of these, BBNs turn out to be the most suitable (Fig. 8.23):
[Diagram: the observed actions (action 1 to action n) are fed into plan recognition, which predicts the future intended actions (action n+1 to action m).]
Fig. 8.23. The concept of plan recognition from observed user actions [17].
As long as plans are complete and flawless they can be inferred easily from action sequences. Unfortunately, users also act imperfectly by typing flawed inputs or by applying wrong operations or operations irrelevant to the set goal. Classification also has to consider individual preferences among the various alternative strategies for achieving identical goals. A plan recognizer architecture that can cope with these problems was proposed by [17] and is depicted in (Fig. 8.24). The sequence of observations O in (Fig. 8.24) describes the logged user actions. Feature extraction performs a filtering of O in order to get rid of flawed inputs, resulting in a feature vector M. The plan library P contains all intentions I a user may possibly have while interacting with an application.
[Diagram: the sequence of observations O passes through feature extraction, yielding the feature vector M; the library of plans P and the context of use are used to train plan execution models PM; plan-based classification combines the feature vector M with the plan execution models and outputs the most likely plan.]
Fig. 8.24. Plan based interpretation of user actions (adopted from [17]).
The library content is therefore often identical with the list of functions a system offers. The a priori probability of the plans can be made dependent on the context of use. The plan execution models PM are implemented as three-layer Bayesian Belief Networks, where nodes represent random variables and links represent dependencies (Fig. 8.25). User inputs are fed as features M into the net and combined into syntactically and semantically meaningful operators R, which link to plan nodes and serve to support or reject plan hypotheses. The structure of, and the conditional probabilities in, a plan execution model PM must cover all possible methods applicable to pursuing a particular plan. The conditional probabilities of a BBN plan execution model are derived from training data, which must be collected during preceding empirical tests.
Fig. 8.25. General plan model structure.
For plan-based classification, feature vector components are fed into the PM feature nodes. In case of a feature match the corresponding plan model state variable is set to true, while the states of all other feature nodes remain false. Matching refers to the feature as well as to the feature sequence (Fig. 8.26). For each plan hypothesis the probability of the observed actions being part of that plan is then calculated, and the most likely plan is adopted. This classification procedure is iterated after every new user input.
Fig. 8.26. Mapping feature vector components to plan model feature nodes (adopted from [17]).
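As a simplified stand-in for the three-layer plan execution models described above, the following sketch scores each plan hypothesis with a naive Bayes combination of the observed features and re-ranks the hypotheses after every input; the priors and likelihood tables are assumed to come from prior empirical training, as in the text, and the class names are illustrative.

#include <map>
#include <string>
#include <vector>

// One plan hypothesis with its prior and per-feature likelihoods
// (both assumed to be estimated from training data beforehand).
struct PlanModel {
  std::string name;
  double prior;                                     // a priori probability of the plan
  std::map<std::string, double> featureLikelihood;  // P(feature observed | plan)
};

// Naive Bayes scoring: P(plan | features) is proportional to the prior times
// the product of the likelihoods of the observed features. Re-run after
// every new user input, as described in the text.
std::string mostLikelyPlan(const std::vector<PlanModel>& library,
                           const std::vector<std::string>& observedFeatures) {
  std::string best;
  double bestScore = -1.0;
  for (const PlanModel& plan : library) {
    double score = plan.prior;
    for (const std::string& f : observedFeatures) {
      auto it = plan.featureLikelihood.find(f);
      // small default likelihood for features not foreseen by this plan
      score *= (it != plan.featureLikelihood.end()) ? it->second : 1e-3;
    }
    if (score > bestScore) {
      bestScore = score;
      best = plan.name;
    }
  }
  return best;
}

Unlike the full BBN plan execution models, this simplification ignores the ordering of the features; it merely illustrates how competing plan hypotheses can be re-weighted as evidence accumulates.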
Dialog System State Identification

Identification of the dialog system state and history poses no problem if data related to the GUI state, the user interaction, and the running application can be obtained from the operating system or the application software, respectively. Otherwise the expense needed to collect these data may be prohibitive and prevent an assistance system from being installed.

Dialog Situation Assessment

Essential attributes of a dialog situation are the user's whereabouts and concurrent user activities. In the age of global positioning, ambient intelligence, and ubiquitous computing, ample technical means exist to locate and track people's whereabouts. People may also be equipped with various sensors (e.g. micro gyros) and wearable computing facilities, which allow identification of their motor activities. Sensors may alternatively be integrated into handheld appliances, as implemented in the mobile phone described by [36]. This appliance is instrumented with sensors for temperature, light, touch, acceleration, and tilt angle. Evaluation of such sensor data enables recognition of user micro-activities such as the phone being "held in hand" or "located in a pocket", from which situations such as the user "walking", "boarding a plane", "flying", or being "in a meeting" can be inferred.

8.3.3 Support of Dialog Tasks

According to ISO/DIS 9241-110 (2004), which supersedes ISO 9241 Part 10, the major design principles for dialogs concern functionality and usability. As far as system functionality is concerned, a dialog must be tailored to the requirements of the task at hand. Neither fewer nor more functions than needed should be provided; in this way the complexity perceived by the user is kept as low as possible. The second design goal, usability, is substantiated by the following attributes:
• Self-descriptiveness
• Conformity with expectations
• Error tolerance
• Transparency
• Controllability
• Suitability for individualization
Self-descriptiveness requires that the purpose of individual system functions or menu options is readily understood by the user or can be explained by the system on request. Conformity with expectations requires that system performance complies with user anticipations originating from previous experience or user training. Error tolerance suggests that a set plan can be achieved in spite of input errors or with only minor corrections. Transparency refers to GUI design and message formatting, which must be matched to human perceptual, behavioral, and cognitive processes. Controllability of a dialog allows the user to adjust the speed and sequence of inputs as individually preferred. Suitability for individualization, finally, describes the adaptability of the user interface to individual preferences as well as to language and culture. Dialog assistance is expected to support the stated dialog properties. A system architecture consistent with this goal is proposed in (Fig. 8.27). It takes the dialog context, dictionaries, and a plan library as inputs. The generated assistance is concerned with various aspects of information management and data entry management, as detailed in the following sections.
[Diagram: a dialog assistance generator takes the context of dialog, dictionaries, and a plan library as inputs; it supports information management (codes & formats, tutoring, functionality shaping, prompting, timing & prioritizing) on the information display towards the user, and data entry management (predictive text, do-what-I-mean, auto-completion, predictive scanning) for the control inputs towards the dialog system.]
Fig. 8.27. Assistance in dialogs.
Information Management

Information management is concerned with the question of which data should be presented to the user, when, and how. Different aspects of information management are codes, formats, tutoring, functionality shaping, prompting, timing, and prioritizing.
Display codes and formats

When information is to be presented to a user, the right decisions have to be made with respect to display coding and format in order to ensure self-descriptiveness and transparency. Various options for information coding exist, such as text, graphics, animation, video, sound, speech, force, and vibration. The selection of a suitable code and of the best sensory modality depends on the context of use. There is extensive literature in human factors engineering and many guidelines on these topics which may be consulted; see e.g. [38].

Tutoring

Tutoring means context-dependent provision of information for teaching purposes. Following successful intent recognition, step-by-step advice can be given for successful task completion. Hints to shortcuts and to seldom-used functions may also be provided. Descriptions and explanations of functions may be offered, with their granularity adapted to the user's skill level. Individualization of the GUI is achieved by adapting menu structure and labeling to user preferences. Augmented reality makes it possible to overlay tutoring instructions over the real world on a head-mounted display [4, 12].

Functionality shaping

In order to achieve suitability for the task, many applications offer the user various functional levels to choose from, graded from basic to expert functions. This is reflected in the menu tree by suppressed or locked menu items, which are generally rendered grey but keep their place in the menu hierarchy so as to ensure constancy in menu appearance.

Prompting, timing and prioritizing

Information may be prompted by the user or show up unsolicited, depending on the context of use. The timing of information, as well as its duration and frequency, can also be made context-dependent. Prioritizing messages and alerts according to their urgency is a means to fight information overload, e.g. in process control situations. As an example of advanced information management consider the electronic centralized aircraft monitoring (ECAM) system installed in Airbus commercial airplanes, which is designed to assist pilots in resource and error management. The relevant information is presented on a warning display and a system display in the center of the cockpit panel. The top left warning display in (Fig. 8.28) illustrates an unsolicited warning as a consequence of a hydraulic system breakdown. The top right system display shows, also unsolicited, the flow diagram of the three redundant onboard hydraulic systems, which are relevant for error diagnosis. Here the pilot can see at a glance that the blue system has lost pressure. Next the pilot may call up the following two pages, as depicted in the lower part of (Fig. 8.28), for consultation. Here the warning display lists the control surfaces affected by the hydraulic failure, e.g. SLATS SYS 1 FAULT, while the system display shows where those surfaces are located on the wings. The lower part of the warning display informs the pilot about the consequences of the failure. LDG DISTANCE: MULTIPLY BY 1.2, for example, indicates that the landing distance will be stretched by 20% due to the defective control surfaces.
Fig. 8.28. Electronic centralized aircraft monitoring (ECAM) system.
The ECAM system employs a sophisticated concept for managing the warnings presented to the pilot. As soon as an error appears, various phases of error management are distinguished, i.e. warning, error identification, and error isolation. Three types of alarms are distinguished and treated differently. Single alarms have no consequences for other components. Primary alarms indicate major malfunctions, which cause a larger number of secondary alarms to follow. In normal operation the cockpit is "dark and silent". Alarms are graded according to their urgency. An emergency alarm, for example, requires immediate action. In an exceptional situation some latency in the reaction can be tolerated. Warnings call upon the pilot for enhanced supervision, while simple messages require no action at all.

Data Entry Management

Data entry often poses a problem, especially during mobile use of small appliances. People with handicaps also need augmentative and alternative communication (AAC). The goal of data entry management is to provide suitable input media, to accelerate text input by minimizing the number of inputs needed, and to cope with input errors. Multimodal interaction has recently received increasing attention, since input by speech, gesture, mimics, and haptics can substitute for keyboards and mice altogether (see the related chapters in this book). For spelling-based systems in English, the number of symbols is at least twenty-seven (space included), and more when the system offers options to select from a number of lexical predictions. A great variety of special input hardware has been devised to enable selection of these symbols on small hand-held devices.
The simplest solution is to assign multiple letters to each key. The user is then required to press the appropriate key a number of times for a particular letter to be shown, which makes entering text a slow and cumbersome process. For wearable computing, keyboards have also been developed that can be operated with one hand: the thumb operates a track point and function keys on the back side, while the other four fingers activate keys on the front side. Key combinations (chords) permit input of all characters. Complementary to such hardware, prediction, completion, and disambiguation algorithms are used to curtail the number of inputs needed. Prediction is based on statistics of "n-grams", i.e. groups of n letters as they occur in sequence in words [28]. Disambiguation of characters is based on letter-by-letter or word-level comparison [3]. In either case reference is made to built-in dictionaries. Some applications are:

Predictive text

For keyboards where multiple letters are assigned to each key, predictive text reduces the number of key presses necessary to enter text. By restricting the available words to those in a dictionary, each key needs to be pressed only once; the word "hello", for example, can be typed with 5 key presses instead of 13.

Auto completion

Auto-completion or auto-fill is a feature of text-entry fields that automatically completes typed entries with the best guess of what the user may intend to enter, such as pathnames, URLs, or long words, thus reducing the amount of typing necessary to enter long strings of text. It uses techniques similar to predictive text.

Do what I mean

In do-what-I-mean functions an input string is compared with all dictionary entries. An automatic correction is made if the two strings differ by one character, if one character is inserted or deleted, or if two characters are transposed. Sometimes even adjacent subwords are transposed (existsFile == fileExists).

Predictive scanning

For menu item selection, single-switch scanning can be used with a simple linear scan, thus requiring only one switch activation per key selection. Predictive scanning alters the scan such that keys are skipped when they do not correspond to any letter that occurs in the database at the current point in the input key sequence. This is a familiar feature for destination input in navigation systems.
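As a minimal illustration of the dictionary-based disambiguation just described, the following sketch maps each word of a dictionary to its digit sequence on a standard phone keypad and then looks up the candidate words for a typed key sequence; the keypad layout is the usual assignment of letters to the keys 2–9, and the dictionary contents are placeholders.

#include <map>
#include <string>
#include <vector>

// Digit assigned to a lower-case letter on a standard phone keypad (keys 2-9).
char keyForLetter(char c) {
  static const std::string keys = "22233344455566677778889999";  // a..z
  return keys[c - 'a'];
}

// Build a lookup table: key sequence -> candidate words from the dictionary.
std::map<std::string, std::vector<std::string> >
buildIndex(const std::vector<std::string>& dictionary) {
  std::map<std::string, std::vector<std::string> > index;
  for (const std::string& word : dictionary) {
    std::string sequence;
    for (char c : word) {
      sequence += keyForLetter(c);
    }
    index[sequence].push_back(word);
  }
  return index;
}

With such an index, typing the sequence "43556" retrieves "hello" with five key presses; when several dictionary words share a sequence, the candidates would be offered for selection, which corresponds to the lexical prediction step mentioned above.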
8.4 Summary

Technically there are virtually no limits to the complexity of systems and appliances. The only limits appear to be user performance and acceptance; both threaten to encumber innovation and to prevent success in the marketplace. In this chapter the concept of assistance has been put forward as a remedy. This approach is based on the hypothesis that man-machine system usability can be improved by providing user assistance adapted to the actual context of use. First the general concept of user assistance was described, which involves identification of the context of use as the key issue. Subsequently this concept was elaborated for the
two main tasks facing the user in man-machine interaction, i.e. manual control and dialogs. Needs for assistance in manual control were identified. Then attributes of context of use for manual control tasks were listed and specified for a car driving situation. Also methods for context identification were discussed, which include parametric and black-box approaches to driver modeling. Finally three methods of providing support to a human controller were identified. These are information management, control input management, and automation. The second part of the chapter was devoted to assistance in man-machine dialogs. First support needs in dialogs were identified. This was followed by a treatment of the variables describing a dialog context, i.e. user state, dialog system state, and dialog situation. Several methods for context identification were discussed with emphasis on user intent recognition. Finally two methods of providing assistance in dialog tasks were proposed. These are information management and data entry management. In summary it was demonstrated how assistance based on context of use can be a huge asset to the design of man-machine systems. However, the potential of this concept remains yet to be fully exploited. Progress relies mainly on improved methods for context of use identification.
References

1. Akyol, S., Libuda, L., and Kraiss, K.-F. Multimodale Benutzung adaptiver Kfz-Bordsysteme. In Jürgensohn, T. and Timpe, K.-P., editors, Kraftfahrzeugführung, pages 137–154. Springer, 2001.
2. Albrecht, D. W., Zukerman, I., and Nicholson, A. Bayesian Models for Keyhole Plan Recognition in an Adventure Game. User Modeling and User-Adapted Interaction, 8(1–2):5–47, 1998.
3. Arnott, A. L. and Javed, M. Y. Probabilistic Character Disambiguation for Reduced Keyboards Using Small Text Samples. Augmentative and Alternative Communication, 8(3):215–223, 1992.
4. Azuma, R. T. A Survey of Augmented Reality. Presence: Teleoperators and Virtual Environments, 6(4):355–385, 1997.
5. Belz, S. M. A Simulator-Based Investigation of Visual, Auditory, and Mixed-Modality Display of Vehicle Dynamic State Information to Commercial Motor Vehicle Operators. Master's thesis, Virginia Polytechnic Institute, Blacksburg, Virginia, USA, 1997.
6. Bien, Z. Z. and Stefanov, D. Advances in Rehabilitation Robotics. Springer, 2004.
7. Boff, K. R., Kaufmann, L., and Thomas, J. P. Handbook of Perception and Human Performance. Wiley, 1986.
8. Breazeal, C. Recognition of Affective Communicative Intent in Robot-Directed Speech. Autonomous Robots, 12(1):83–104, 2002.
9. Carberry, S. Techniques for Plan Recognition. User Modeling and User-Adapted Interaction, 11(1–2):31–48, 2001.
10. Charniak, E. and Goldmann, R. B. A Bayesian Model of Plan Recognition. Artificial Intelligence, 64(1):53–79, 1992.
11. Dickmanns, E. D. The Development of Machine Vision for Road Vehicles in the Last Decade. In Proceedings of the IEEE Intelligent Vehicle Symposium, volume 1, pages 268–281. Versailles, June 17–21, 2002.
12. Friedrich, W. ARVIKA Augmented Reality für Entwicklung, Produktion und Service. Publicis Corporate Publishing, Erlangen, 2004.
13. Garrel, U., Otto, H.-J., and Onken, R. Adaptive Modeling of the Skill- and Rule-Based Driver Behavior. In The Driver in the 21st Century, VDI-Berichte 1613, pages 239–261. Berlin, 3.–4. Mai 2001.
14. Haykin, S. Adaptive Filter Theory. Prentice-Hall, Englewood Cliffs, NJ, 2nd edition, 1991.
15. Haykin, S. Neural Networks: A Comprehensive Foundation. Prentice Hall, Upper Saddle River, NJ, 2nd edition, 1999.
16. Hirzinger, G., Brunner, B., Dietrich, J., and Heindl, J. Sensor-Based Space Robotics ROTEX and its Telerobotic Features. IEEE Transactions on Robotics and Automation (Special Issue on Space Robotics), 9(5):649–663, October 1993.
17. Hofmann, M. Intentionsbasierte maschinelle Interpretation von Benutzeraktionen. Dissertation, Technische Universität München, 2003.
18. Hofmann, M. and Lang, M. User Appropriate Plan Recognition for Adaptive Interfaces. In Smith, M. J., editor, Usability Evaluation and Interface Design: Cognitive Engineering, Intelligent Agents and Virtual Reality. Proceedings of the 9th International Conference on Human-Computer Interaction, volume 1, pages 1130–1134. New Orleans, August 5–10, 2001.
19. Jackson, C. A Method for the Direct Measurement of Crossover Model Parameters. IEEE MMS, 10(1):27–33, March 1969.
20. Jameson, A. Numerical Uncertainty Management in User and Student Modeling: An Overview of Systems and Issues. User Modeling and User-Adapted Interaction: The Journal of Personalization Research, 5(4):193–251, 1995.
21. Ji, Q. and Yang, X. Real-Time Eye, Gaze, and Face Pose Tracking for Monitoring Driver Vigilance. Real-Time Imaging, 8(5):357–377, 2002.
22. Knoll, P. The Night Sensitive Vehicle. In VDI-Report 1768, pages 247–256. VDI, 2003.
23. Kragic, D. and Christensen, H. F. Robust Visual Servoing. The International Journal of Robotics Research, 22(10–11):923 ff., 2003.
24. Kraiss, K.-F. Ergonomie, chapter Mensch-Maschine-Dialog, pages 446–458. Carl Hanser, München, 3rd edition, 1985.
25. Kraiss, K.-F. Fahrzeug- und Prozessführung: Kognitives Verhalten des Menschen und Entscheidungshilfen. Springer Verlag, Berlin, 1985.
26. Kraiss, K.-F. Implementation of User-Adaptive Assistants with Neural Operator Models. Control Engineering Practice, 3(2):249–256, 1995.
27. Kraiss, K.-F. and Hamacher, N. Concepts of User Centered Automation. Aerospace Science and Technology, 5(8):505–510, 2001.
28. Kushler, C. AAC Using a Reduced Keyboard. In Proceedings of the CSUN California State University Conference, Technology and Persons with Disabilities Conference. Los Angeles, March 1998.
29. Larimore, M. G., Johnson, C. R., and Treichler, J. R. Theory and Design of Adaptive Filters. Prentice Hall, 1st edition, 2001.
30. Mann, M. and Popken, M. Auslegung einer fahreroptimierten Mensch-Maschine-Schnittstelle am Beispiel eines Querführungsassistenten. In GZVB, editor, 5. Braunschweiger Symposium "Automatisierungs- und Assistenzsysteme für Transportmittel", pages 82–108. 17.–18. Februar 2004.
31. Matsikis, A., Zoumpoulidis, T., Broicher, F., and Kraiss, K.-F. Learning Object-Specific Vision-Based Manipulation in Virtual Environments. In IEEE Proceedings of the 11th International Workshop on Robot and Human Interactive Communication (ROMAN 2002), pages 204–210. Berlin, September 25–27, 2002.
32. Neumerkel, D., Rammelt, P., Reichardt, D., Stolzmann, W., and Vogler, A. Fahrermodelle - Ein Schlüssel für unfallfreies Fahren? Künstliche Intelligenz, 3:34–36, 2002.
33. Nickerson, R. S. On Conversational Interaction with Computers. In Treu, S., editor, Proceedings of the ACM/SIGGRAPH Workshop on User-Oriented Design of Interactive Graphics Systems, pages 101–113. Pittsburgh, PA, October 14–15, 1976.
34. Pao, Y. Adaptive Pattern Recognition and Neural Networks. Addison-Wesley, 1989.
35. Reintsema, D., Preusche, C., Ortmaier, T., and Hirzinger, G. Towards High Fidelity Telepresence in Space and Surgery Robotics. Presence: Teleoperators and Virtual Environments, 13(1):77–98, 2004.
36. Schmidt, A. and Gellersen, H. W. Nutzung von Kontext in ubiquitären Informationssystemen. it+ti - Informationstechnik und technische Informatik, Sonderheft: Ubiquitous Computing - der allgegenwärtige Computer, 43(2):83–90, 2001.
37. Schraut, M. Umgebungserfassung auf Basis lernender digitaler Karten zur vorausschauenden Konditionierung von Fahrerassistenzsystemen. Ph.D. thesis, Technische Universität München, 2000.
38. Shneiderman, B. and Plaisant, C. Designing the User Interface: Strategies for Effective Human-Computer Interaction. Addison Wesley, 4th edition, 2004.
39. Specht, D. F. A General Regression Neural Network. IEEE Transactions on Neural Networks, 2(6):568–576, 1991.
40. Velger, M., Grunwald, A., and Merhav, S. J. Adaptive Filtering of Biodynamic Stick Feedthrough in Manipulation Tasks on Board Moving Platforms. AIAA Journal of Guidance, Control, and Dynamics, 11(2):153–158, 1988.
41. Wahlster, W., Reitiger, N., and Blocher, A. SMARTKOM: Multimodal Communication with a Life-Like Character. In Proceedings of the 7th European Conference on Speech Communication and Technology, volume 3, pages 1547–1550. Aalborg, Denmark, September 3–7, 2001.
42. Wickens, C. D. Engineering Psychology and Human Performance. Merrill, Columbus, 1984.
Appendix A
LTI-Lib — a C++ Open Source Computer Vision Library

Peter Dörfler and José Pablo Alvarado Moya

The LTI-Lib is an open source software library that contains a large collection of algorithms from the field of computer vision. Its roots lie in the need for more cooperation between different research groups at the Chair of Technical Computer Science at RWTH Aachen University, Germany. The name of the library stems from the German name of the Chair: Lehrstuhl für Technische Informatik. This chapter gives a short overview of the LTI-Lib that provides the reader with sufficient background knowledge for understanding the examples throughout this book that use this library. It can also serve as a starting point for further exploration and use of the library. The reader is expected to have some experience in programming with C++. The main idea of the LTI-Lib is to facilitate research in different areas of computer vision. This leads to the following design goals:

• Cross platform: It should at least be possible to use the library on two main platforms: Linux and Microsoft Windows©.
• Modularity: To enhance code reusability and to facilitate exchanging small parts in a chain of operations the library should be as modular as possible.
• Standardized interface: All classes should have the same or very similar interfaces to increase interoperability and flatten the user's learning curve.
• Fast code: Although not the main concern, the implementations should be efficient enough to allow the use of the library in prototypes working in real time.
Based on these goals the following design decisions were made:

• Use an object oriented programming language to enforce the standardized interface and facilitate modularity.
• For the same reasons and also for better code reuse the data structures should be separated from the functionality unless the data structure itself allows only one way of manipulation.
• To gain fast implementations C++ was chosen as the programming language. With the GNU Compiler Collection (gcc) and Microsoft Visual C++, compilers are available for both targeted platforms.
The LTI-Lib is licensed under the GNU Lesser General Public License (LGPL)1 to encourage collaboration with other researchers and allow its use in commercial products. This has already led to many enhancements of the existing code base and also to some contributions from outside the Chair of Technical Computer Science. From the number of downloads the user base can be estimated at a few thousand researchers. This library differs from other open source software projects in its design: it follows an object oriented paradigm that makes it possible to hide the configuration of the algorithms, making it easier for inexperienced users to start working with them. At the same time it allows more experienced researchers to take complete control of the software modules as well. The LTI-Lib also exploits the advantages of generic programming concepts for the implementation of container and functional classes, seeking a compromise between "pure" generic approaches and a limited size of the final code. This chapter only gives a short introduction to the fundamental concepts of the LTI-Lib. We focus on the functionality needed for working with images and image processing methods. For more detailed information please refer to one of the following resources:

• project webpage: http://ltilib.sourceforge.net
• online manual: http://ltilib.sourceforge.net/doc/html/index.shtml
• wiki: http://ltilib.pdoerfler.com/wiki
• mailing list: [email protected]

The LTI-Lib is the result of the collaboration of many people. As the current administrators of the project we would like to thank Suat Akyol, Ulrich Canzler, Claudia Gönner, Lars Libuda, Jochen Wickel, and Jörg Zieren for the parts they play(ed) in the design and maintenance of the library in addition to their coding. We would also like to thank the numerous people who contributed their time and the resulting code to make the LTI-Lib such a large collection of algorithms: Daniel Beier, Axel Berner, Florian Bley, Thorsten Dick, Thomas Erger, Helmuth Euler, Holger Fillbrandt, Dorothee Finck, Birgit Gehrke, Peter Gerber, Ingo Grothues, Xin Gu, Michael Haehnel, Arnd Hannemann, Christian Harte, Bastian Ibach, Torsten Kaemper, Thomas Krueger, Frederik Lange, Henning Luepschen, Peter Mathes, Alexandros Matsikis, Ralf Miunske, Bernd Mussmann, Jens Paustenbach, Norman Pfeil, Vlad Popovici, Gustavo Quiros, Markus Radermacher, Jens Rietzschel, Daniel Ruijters, Thomas Rusert, Volker Schmirgel, Stefan Syberichs, Guy Wafo Moudhe, Ruediger Weiler, Benjamin Winkler, Xinghan Yu, and Marius Wolf.
A.1 Installation and Requirements

The LTI-Lib does not in general require additional software to be usable. However, it becomes more useful if some complementary libraries are utilized.

1 http://www.gnu.org/licenses/lgpl.html
These are included in most distributions of Linux and are easily installed. For Windows systems the most important libraries are included in the installer found on the CD. The script language perl should be available on your computer for maintenance and to build some commonly used scripts. It is included on the CD for Windows, and is usually installed on Linux systems. The Gimp Tool Kit (GTK)2 should be installed, as it is needed for almost all visualization tools. To read and write images using the Portable Network Graphics (PNG) and Joint Photographic Experts Group (JPEG) image formats you either need the relevant files from the ltilib-extras package, which are adaptations of the Colosseum Builders C++ Image Library, or you have to provide access to the freely available libraries libjpeg and libpng. Additionally, the data compression library zlib is useful for some I/O functions3. The LTI-Lib provides interfaces to some functions of the well-known LAPACK (Linear Algebra PACKage)4 for optimized algorithms. On Linux systems you need to install lapack, blas, and the Fortran-to-C translation tool f2c (including the corresponding development packages, which in distributions like SuSE or Debian are denoted with an additional "-dev" or "-devel" postfix). The installation of LAPACK on Windows systems is a bit more complicated; refer to http://ltilib.pdoerfler.com/wiki/Lapack for instructions.

2 http://www.gtk.org
3 http://www.ijg.org, http://www.libpng.org, and http://www.zlib.net, respectively
4 http://netlib.org/lapack/

The installation of the LTI-Lib itself is quite straightforward on both systems. On Windows execute the installer program (ltilib/setup-ltilib-1.9.15-activeperl.exe). This starts a guided graphical setup program where you can choose among some of the packages mentioned above. If unsure, we recommend installing all available files. Installation on Linux follows the standard procedure for configuration and installation of source code packages established by GNU as a de facto standard:

cd $HOME
cp /media/cdrom/ltilib/*.gz .
tar -xzvf 051124_ltilib_1.9.15.tar.gz
cd ltilib
tar -xzvf ../051124_ltilib-extras_1.9.15.tar.gz
cd linux
make -f Makefile.cvs
./configure
make

Here it has been assumed that you want the sources in a subdirectory called ltilib within your home directory, and that the CD has been mounted into /media/cdrom. The next lines unpack the tarballs found on the CD into the directory ltilib. In the file name, the first number of six digits specifies the release date of the library, where the first two digits relate to the year, the next two indicate the month, and the last two the day. You can check for newer versions on the library homepage. The sources in the ltilib-extras package should be unpacked within the
directory ltilib created by the first tarball. The line make -f Makefile.cvs is optional and creates all configuration scripts; for it to work the auto-configure packages have to be installed. You can, however, simply use the provided configure script as indicated above. Additional information can be obtained from the file ltilib/linux/README. You can of course customize your library via additional options of the configure script (call ./configure --help for a list of all available options). To install the libraries you will need write privileges on the --prefix directory (which defaults to /usr/local). Switch to a user with the required privileges (for instance with su root) and execute make install. If you have doxygen and graphviz installed on your machine and you want to have local access to the API HTML documentation, you can generate it with make doxydoc. The documentation for release 1.9.15, which comes with this book, is already present on the CD.
A.2 Overview

Following the design goals, data structures and algorithms are mostly separated in the LTI-Lib. The former can be divided into small types such as points and pixels and more complex types ranging from (algebraic) vectors and matrices to trees and graphs. Data structures are discussed in section A.3. Algorithms and other functionality are encapsulated in separate classes that can have parameters and a state and that manipulate the data given to them in a standardized way. These are the subject of section A.4. Before dealing with those themes, we cover some general design concepts used in the library.

A.2.1 Duplication and Replication

Most classes used for data structures and functional objects support several mechanisms for duplication and replication. The copy constructor is often used when initializing a new object with the contents of another one. The copy() method and the operator= are equivalent and permit acquiring all the data contained in another instance; as a matter of style, the former is preferred within the library, since it makes it immediately visible that the object belongs to an LTI-Lib class rather than being an elemental data type (like integers, floating point values, etc.). As LTI-Lib instances might represent or contain images or other relatively large data structures, it is important to keep in mind that a replication task may take its time, and it is therefore useful to indicate explicitly when these expensive copies are made. In class hierarchies it is often required to replicate instances of classes to which you only hold a pointer of a common parent class. This is the case, for instance, when implementing the factory design pattern. For these situations the LTI-Lib provides a virtual clone() method.
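The fragment below illustrates the idea behind a virtual clone() method with a made-up two-class hierarchy; it mirrors the pattern described above but is not taken from the LTI-Lib sources, and the class names are hypothetical.

// Illustrative hierarchy (hypothetical classes, not LTI-Lib code).
class shape {
public:
  virtual ~shape() {}
  virtual shape* clone() const = 0;   // polymorphic duplication
};

class circle : public shape {
public:
  circle(double r) : radius_(r) {}
  virtual circle* clone() const { return new circle(*this); }
private:
  double radius_;
};

// A deep copy can be made even if only a base-class pointer or reference is
// available, which is exactly what a factory needs.
shape* duplicate(const shape& original) {
  return original.clone();
}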
A.2.2 Serialization

The data of almost all classes of the LTI-Lib can be written to and read from a stream. For this purpose an ioHandler is used that formats and parses the stream in a special way. At this time, two such ioHandlers exist. The first one, lispStreamHandler, writes the data of the class in a Lisp-like fashion, i.e. each attribute is stored as a list enclosed in parentheses. This format is human readable and can be edited manually. However, it is not useful for large data sets due to its extensive size and expensive parsing. For these applications the second ioHandler, binaryStreamHandler, is better suited. All elementary types of C++ and some more complex types like the STL5 std::string, std::vector, std::map and std::list can be read from or written to the LTI-Lib streams using a set of global functions called read() and write(). Global functions also allow reading and writing almost all LTI-Lib types from and to the stream handlers, although, as a matter of style, if a class defines its own read() and write() methods, they should be preferred over the global ones to indicate that the given object belongs to the LTI-Lib. The example in Fig. A.1 shows the principles of copying and serialization on an LTI-Lib container called vector<double> and an STL list container of strings.

A.2.3 Encapsulation

All classes of the LTI-Lib reside in the namespace lti to avoid ambiguity where other libraries might use the same class names. This allows, for instance, using the class name vector even though the Standard Template Library already uses this name. For the sake of readability the examples in this chapter omit the explicit scoping of the namespace (i.e., the prefix lti:: before class names is omitted). Within the library, types regarding just one specific class are always defined within that class. If you want to use those types you have to indicate the class scope too (for example lti::filter::parameters::borderType). Even if this is cumbersome to type, it is definitely easier to maintain, as the place of declaration and the context are directly visible.

A.2.4 Examples

If you want to try out the examples in this chapter, the easiest way is to copy them directly into the definition of the operator() of the class tester found in the file ltilib/tester/ltiTester.cpp. The necessary header files to be included are listed at the beginning of each example. Standard header files (e.g. cmath, cstdlib, etc.) are omitted for brevity.
5 The Standard Template Library (STL) is included with most compilers. It contains common container classes and algorithms.
#include "ltiVector.h" #include "ltiLispStreamHandler.h" #include <list> vector<double> v1(10,2.); vector<double> v2(v1); vector<double> v3;
// create a double vector // v2 is equal to v1
v3=v2; v3.copy(v2);
// v3 is equal to v2 // same as above
std::list<std::string> myList; // a standard list myList.push_back(‘‘Hello’’); // with two elements myList.push_back(‘‘World!’’); // create a lispStreamHandler std::ofstream os("test.dat"); lispStreamHandler lsh; lsh.use(os);
and write v1 to a file // the STL output file stream // the LTI-Lib stream handler // tell the LTI-Lib handler // which stream to use
v1.write(lsh); write(lsh,myList);
// // // //
os.close();
write the lti::vector write the std::list through a global function close the output file
// read the file into v4 vector<double> v4; std::list<std::string> l2; std::ifstream is("test.dat"); // lsh.use(is); // // v4.read(lsh); // read(lsh,l2); // is.close(); //
STL input file stream tell the handler to read from the ‘‘is’’ stream load the vector load the list close the input file
Fig. A.1. Serialization and copying of LTI-L IB classes.
A.3 Data Structures

The LTI-Lib uses a reduced set of fundamental types. Due to the lack of fixed-size types in C/C++, the LTI-Lib defines a set of aliases at configuration time which are mapped to standard types. If you want to ensure 32-bit types you should therefore use int32 for signed integers or uint32 for the unsigned version. For 8-bit numbers you can use byte and ubyte for the signed and unsigned versions, respectively.
When the size of the numeric representation is not of importance, the default C/C++ types are used. In the LTI-Lib a policy of avoiding unsigned integer types is followed, unless the whole data representation range is required (for example, ubyte when values from 0 to 255 are to be used). This has been extended to the values used to represent the sizes of container classes, which are usually of type int, as opposed to the unsigned int used by other class libraries like the STL. Most math and classification modules employ double-precision floating point (double) types to allow better conditioning of the mathematical computations involved. Single-precision numbers (float) are used where the memory requirements are critical, for instance in grey-valued images that require a more flexible numerical representation than an 8-bit integer. More involved data structures in the LTI-Lib (for instance, matrix and vector containers) use generic programming concepts and are therefore implemented as C++ template types. However, these are usually explicitly instantiated to reduce the size of the final library code. Thus, you can only choose between the explicitly instantiated types. Note that many modules are restricted to those types that are useful for the algorithm they implement, i.e. linear algebra functors work only on float and double data structures. This is a design decision in the LTI-Lib architecture, which might seem rigid in some cases, but it simplifies the maintenance and optimization of code for the most frequently used cases. At this point we want to stress the difference between the LTI-Lib and other "pure" generic-programming-oriented libraries. For the "generic" types, like genericVector or genericMatrix, you can also create explicit instantiations for your own types, for which you will need to include the header files with the template implementations, usually denoted by a _template postfix in the file name (for example, ltiGenericVector_template.h). The next sections introduce the most important object types for image processing. Many other data structures exist in the library; please refer to the online documentation for help.

A.3.1 Points and Pixels

The template classes tpoint and tpoint3D define two- and three-dimensional points with attributes x, y, and, if applicable, z, all of the template type T. For T=int the type aliases point and point3D are available. In particular, point is often used for the coordinates of a pixel in an image. Pixels are defined similarly: trgbPixel has the three attributes red, green, and blue, which should be accessed via member functions like getRed(). However, the pixel class used for color images is rgbPixel, which is a 32-bit data structure. It holds the three color values and an extra alpha value, all of which are ubyte. There are several reasons for a four-byte-long pixel representation of a typically 24-bit-long value. First of all, in 32-bit or 64-bit processor architectures this is necessary to ensure the memory alignment of the pixels with the processor's word length, which helps to improve the performance of memory access. Second, it helps to improve the efficiency when reading images from many 32-bit based frame
grabbers. Third, some image formats store the 24-bit pixel information together with an alpha channel, resulting in 32-bit long pixels. All of these basic types support the fundamental arithmetic functions. The example in Fig. A.2 shows the idea.

#include "ltiPoint.h"
#include "ltiRGBPixel.h"

// create an rgbPixel, set its
rgbPixel pix;
pix.set(127,0,255);
pix.divide(2);
std::cout