DYNAMIC FACES: INSIGHTS FROM EXPERIMENTS AND COMPUTATION
EDITED BY
CRISTÓBAL CURIO, HEINRICH H. BÜLTHOFF, AND MARTIN A. GIESE FOREWORD BY
TOMASO POGGIO
Dynamic Faces
DYNAMIC FACES Insights from Experiments and Computation
edited by Cristóbal Curio, Heinrich H. Bülthoff, and Martin A. Giese
Foreword by Tomaso Poggio
The MIT Press Cambridge, Massachusetts London, England
© 2011 Massachusetts Institute of Technology

All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical means (including photocopying, recording, or information storage and retrieval) without permission in writing from the publisher. For information about special quantity discounts, please email [email protected]

This book was set in Times New Roman and Syntax on 3B2 by Asco Typesetters, Hong Kong. Printed and bound in the United States of America.

Library of Congress Cataloging-in-Publication Data

Curio, Cristóbal, 1972–
Dynamic faces : insights from experiments and computation / edited by Cristóbal Curio, Heinrich H. Bülthoff, and Martin A. Giese ; foreword by Tomaso Poggio.
p. cm.
Includes bibliographical references and index.
ISBN 978-0-262-01453-3 (hardcover : alk. paper)
1. Human face recognition (Computer science) I. Bülthoff, Heinrich H. II. Giese, Martin A. III. Title.
TA1650.C87 2011
006.3'7—dc22
2010003512

10 9 8 7 6 5 4 3 2 1
Contents
Foreword by Tomaso Poggio vii
Introduction ix

I PSYCHOPHYSICS 1

1 Is Dynamic Face Perception Primary? 3
Alan Johnston

2 Memory for Moving Faces: The Interplay of Two Recognition Systems 15
Alice O'Toole and Dana Roark

3 Investigating the Dynamic Characteristics Important for Face Recognition 31
Natalie Butcher and Karen Lander

4 Recognition of Dynamic Facial Action Probed by Visual Adaptation 47
Cristóbal Curio, Martin A. Giese, Martin Breidt, Mario Kleiner, and Heinrich H. Bülthoff

5 Facial Motion and Facial Form 67
Barbara Knappmeyer

6 Dynamic Facial Speech: What, How, and Who? 77
Harold Hill

II PHYSIOLOGY 95

7 Dynamic Facial Signaling: A Dialog between Brains 97
David A. Leopold

8 Engaging Neocortical Networks with Dynamic Faces 105
Stephen V. Shepherd and Asif A. Ghazanfar

9 Multimodal Studies Using Dynamic Faces 123
Aina Puce and Charles E. Schroeder

10 Perception of Dynamic Facial Expressions and Gaze 141
Patrik Vuilleumier and Ruthger Righart

11 Moving and Being Moved: The Importance of Dynamic Information in Clinical Populations 161
B. de Gelder and J. Van den Stock

III COMPUTATION 175

12 Analyzing Dynamic Faces: Key Computational Challenges 177
Pawan Sinha

13 Elements for a Neural Theory of the Processing of Dynamic Faces 187
Thomas Serre and Martin A. Giese

14 Insights on Spontaneous Facial Expressions from Automatic Expression Measurement 211
Marian Bartlett, Gwen Littlewort, Esra Vural, Jake Whitehill, Tingfan Wu, Kang Lee, and Javier Movellan

15 Real-Time Dissociation of Facial Appearance and Dynamics during Natural Conversation 239
Steven M. Boker and Jeffrey F. Cohn

16 Markerless Tracking of Dynamic 3D Scans of Faces 255
Christian Walder, Martin Breidt, Heinrich H. Bülthoff, Bernhard Schölkopf, and Cristóbal Curio

Contributors 277
Index 281
Foreword
An area of research that has been greatly promoted for a long time is the interface between artificial and biological vision. However, it is only in the past few years that this area is finally looking promising. Even more interesting, it is neuroscience that seems to be providing new ideas and approaches to computer vision—perhaps a first sign that the explosion of research and discoveries about the brain may actually lead the development of several areas of computer science and in particular artificial intelligence. Thus the science of intelligence will eventually play a key role in the engineering of intelligence. Vision—which is the general topic of this book—may well be at the forefront of this new development.

The specific topic of this collection of papers is faces, particularly when the time dimension is considered. Even just two decades ago it would have been surprising to state that faces are possibly the key to understanding visual recognition and that recognition of faces should be a key problem in vision research. In the meantime, the number of papers on face perception and recognition that have appeared in the neuroscience of vision as well as in computer vision is enormous and growing. At the same time there is a clear trend, driven by advances in communication and computer technology, to consider recognition of videos and not just of images. The trend is most obvious in computer vision but is also growing in visual physiology and psychophysics. For all these reasons, this state-of-the-art collection will contribute to leading a very active field of scientific and engineering research in an interesting and natural direction.

The book is edited by an interdisciplinary team who assembled a set of experts in the psychophysical, physiological, and computational aspects of the perception and recognition of dynamic faces. It provides an overview of the field and a snapshot of some of the most interesting recent advances.

Tomaso Poggio
Introduction
The recognition of faces is a visual function of central importance for social interaction and communication. Impairments of face perception, such as prosopagnosia (face blindness), can create serious social problems since the recognition of facial identity and emotional and communicative facial expressions is crucial for social interaction in human and nonhuman primates. Correspondingly, the recognition of faces and facial expressions has been a fundamental topic in neuroscience for well over a century (Darwin, 1872). The scientific interest in the processing of faces has vastly increased over the past decade. However, a major part of the existing studies has focused on the processing of static pictures of faces. This is reflected by the fact that over the past ten years more than 8,000 studies on the perception of faces have been listed in the PubMed library of the U.S. National Library of Medicine and the National Institutes of Health, but only 300 listed studies treat the recognition of dynamic faces or the perception of faces from movies.

The neural mechanisms involved in processing pictures of faces have been the topic of intense debates in psychology, neurophysiology, and functional imaging. Relevant topics have been whether faces form a "special" class of stimuli that are processed by specifically dedicated neural structures, and whether the process exploits computational principles that differ substantially from those underlying the recognition of other shapes. Most of these debates are still ongoing and no final conclusions have been drawn on many of these topics.

Even less is known about the mechanisms underlying the processing of dynamic faces, and systematic research on this topic has just begun. Are there special mechanisms dedicated to processing dynamic as opposed to static aspects of faces? Are these mechanisms anatomically strictly separate? To what extent does the processing of dynamic faces exploit the same principles as the processing of static pictures of faces? Finally, what are the general computational principles of the processing of complex spatiotemporal patterns, such as facial movements, in the brain?

The enormous biological relevance of faces makes their analysis and modeling also an important problem for technology. The recognition of static and dynamic faces
has become a quite mature topic in computer vision (Li & Jain, 2005), and since the proposal of the first face recognition systems in the 1960s, a large number of technical solutions for this problem have been proposed. Modern computer vision distinguishes different problems related to the processing of faces, such as identity and expression recognition, face detection in images, the tracking of faces, or the recovery of the three-dimensional geometry of a face in video sequences. The recognition of faces and facial expressions has important applications, such as surveillance, biometrics, and human–computer interfaces.

Not only the analysis but also the synthesis of pictures and movies of faces have become important technical problems in computer graphics. Starting from very simple systems in the 1970s (Brennan, 1985) that generated line drawings of faces, the development has progressed substantially, and today's systems are capable of simulating photorealistic pictures of faces. Modern systems of this type exploit data on the texture and three-dimensional structure of faces, which are obtained by special hardware systems, such as laser scanners. Highly realistic simulations have been obtained for static pictures of faces (e.g., Blanz & Vetter, 1999) and recently also for movies of faces (e.g., Blanz, Basso, Poggio, & Vetter, 2003). The simulation of realistic dynamic facial expressions by computer graphics has been fundamental for the realization of many recent movies, such as The Final Fantasy, The Polar Express, Beowulf, or the most recent production of Benjamin Button. Other application domains for the simulation of faces and facial expressions encompass facial surgery and forensic applications. Finally, a challenging problem with high potential future relevance, which scientists have started to address only recently, is the simulation of facial movements for humanoid robots (Kobayashi & Hara, 1995).

Thus, recognition and modeling of dynamic faces are interesting and largely unexplored topics in neuroscience that have substantial importance for technical applications. Our goal here is to provide an overview of recent developments in the field of dynamic faces within an interdisciplinary framework. This book is an outgrowth of a workshop we organized in March 2008 at the COSYNE conference in Snowbird, Utah. The chapters are written by experts in different fields, including neuroscience, psychology, neurology, computational theory, and computer science. We have tried to cover a broad range of relevant topics, including the psychophysics of dynamic face perception, results from electrophysiology and imaging, clinical deficits in patients with impairments of dynamic face processing, and computational models that provide interesting insights about the mechanisms of processing dynamic faces in the brain.

This book addresses researchers in the biological sciences and neuroscience as well as in computer science. In neuroscience, we hope that an overview of the state of the art of knowledge about how dynamic faces are processed might be suitable as a basis for designing new experiments in psychology, psychophysics, neurophysiology, social and communication sciences, imaging, and clinical neuroscience.
At the same time, increasing our knowledge about the mechanisms that underlie the processing of dynamic faces by the nervous system seems highly relevant for computer science. First, such knowledge seems suitable for improving technical systems for the recognition and animation of dynamic faces by taking into account the constraints and critical properties that are central for processing such faces in biological systems. Second, the principles the brain uses to recognize dynamic faces might inspire novel solutions for technical systems that can recognize and process dynamic faces—similar to the inspiration that the principles of biological vision provided for a variety of other technical solutions in computer vision. Finally, the experimental methods for quantitative characterization of the perception of dynamic faces that are discussed here provide a basis for the validation of technical systems that analyze or synthesize dynamic facial expressions. This makes biological methods interesting for the development and optimization of technical systems in computer science.

The book is divided into three major parts that cover different interdisciplinary aspects of the recognition and modeling of dynamic faces: psychophysics, physiology, and computational approaches. Each part starts with an overview by a recognized expert in the field. The subsequent chapters discuss a spectrum of relevant approaches in more detail.

In part I, Alan Johnston introduces the general topic of the psychophysics of the human perception of dynamic faces (chapter 1). He discusses methodological and technical issues that are important for experimentalists and modelers. In chapter 2, Alice O'Toole and Dana Roark present novel insights on the role of dynamic information in face recognition. They discuss these findings in the context of the supplemental information and representation enhancement hypotheses, and the chapter also serves as a good introduction to these two theories. In chapter 3, Natalie Butcher and Karen Lander present in detail a series of studies, including their own new experimental data, revealing the dynamic characteristics that are important for face recognition.

In chapter 4, Cristóbal Curio and his colleagues present work that exploited 3D computer graphics methods to generate close-to-reality facial expressions in a study of high-level aftereffects in the perception of dynamic faces. The generation of the dynamic facial expressions was based on an algorithm that provides low-dimensional parameterizations of facial movements by approximating them through the superposition of facial action units. The study shows in particular that dynamic faces produce high-level aftereffects similar to those shown earlier for static pictures of faces. In chapter 5, Barbara Knappmeyer investigates in particular the interaction between facial motion and facial form. A cluster-based animation approach allowed the exchange of characteristic motion signatures between different identities. This study shows that the perception of facial identity is modulated by the perception of individual-specific motions.
In chapter 6, Harold Hill presents a detailed overview of studies that focus on facial speech perception. His thorough discussion provides insights on spatiotemporal aspects of the interplay of auditory and facial speech signals that are supported by novel data.

Part II is devoted to physiology and is introduced by David Leopold (chapter 7). Besides providing a brief overview of this section, he lays out his view on outstanding neural challenges. Given that facial expressions play a central role in social communication, he suggests that the neurophysiological basis of the perception and production of facial expressions, including vocalization and gaze, should be studied by taking into account interactive contexts. In chapter 8, Stephen Shepherd and Asif Ghazanfar cover neurophysiological aspects of gaze, attention, and vocal signals during the perception of expressions. They review behavioral and electrophysiological evidence that the perception of facial dynamics and vocalization is linked. In chapter 9, Aina Puce and Charles Schroeder review human electrophysiological event-related potential (ERP) experiments related to facial movement. Using a novel methodological approach, they provide evidence that socially relevant signals, such as a gaze toward or away from an observer, modulate the amplitude of ERP responses. In chapter 10, Patrik Vuilleumier and Ruthger Righart provide a more detailed review of factors that influence the ERP signal (N170) during the perception of dynamic faces. They discuss in detail evidence from their own and others' work for the coupling of the perception and production of dynamic faces. In chapter 11, Beatrice de Gelder and J. Van den Stock review clinical observations relating to the dynamic information in faces. They complement the discussion of the processing of static and dynamic faces in normal subjects with insights from studies on how movement affects the perception of faces in patients with various cognitive deficits that range from developmental prosopagnosia to brain lesions and autism spectrum disorder.

The computational aspects of dynamic faces are introduced in part III in the overview chapter by Pawan Sinha (chapter 12). He formulates a number of computational challenges that are associated with the processing of dynamic faces. The computational chapters of this section cover rather diverse topics that range from neural modeling (chapter 13) and automatic behavior classification with applications (chapter 14) to a real-time interactive avatar system for closed-loop behavior research (chapter 15) and state-of-the-art 3D computer graphics that provide novel 3D stimuli for challenging experiments on dynamic faces (chapter 16). In chapter 13, Thomas Serre and Martin Giese present elements of neural theories for object, face, and action recognition that might be central for the development of physiologically inspired models for the recognition of dynamic faces.
In chapter 14, Marian Bartlett and her colleagues give an overview of their work on automatic expression measurement. They present a computer vision-based system for recognizing facial expressions that uses detectors of facial action units. They review the usefulness of their tool in various interactive applications that require automatic analysis and validate their system by comparison with human performance.

Chapter 15, by Steven Boker and Jeffrey Cohn, presents a computational approach that permits the real-time dissociation of facial appearance and dynamics during natural conversation. Their analysis and animation system is one of the first approaches in behavior research that allows the study of facial expressions with realistic-looking avatars that can reproduce and manipulate participants' facial actions and vocal sounds during interactive conversation.

Part III concludes with a novel computer graphics approach suitable for constructing 3D facial animations (chapter 16). Christian Walder and his colleagues present a kernel-based approach for dense three-dimensional tracking of facial movements, providing essential data that are required for realistic and controllable face animation and for dynamic analyses of face space.

We wish to thank the many colleagues without whom the successful completion of this book would not have been possible. We thank Andreas Bartels, Martin Breidt, Isabelle Bülthoff, Christoph D. Dahl, and Johannes Schultz for reviewing and providing comments on individual chapters of the draft. We thank Stefanie Reinhard for her support during editing of the book. We thank the COSYNE 2008 workshop chairs Fritz Sommer and Jascha Sohl-Dickstein for their support of our workshop on dynamic faces. This work was supported by the European Union project BACS FP6-IST-027140 and the Deutsche Forschungsgemeinschaft (DFG) Perceptual Graphics project PAK 38. Finally, we want to express our gratitude to Bob Prior, Katherine Almeida, and Susan Buckley from MIT Press for their support and guidance during the completion of this book.

References

Blanz, V., Basso, C., Poggio, T., & Vetter, T. (2003). Reanimating faces in images and video. Computer Graphics Forum, 22(3), 641–650.
Blanz, V., & Vetter, T. (1999). A morphable model for the synthesis of 3D faces. In Proceedings of the 26th annual conference on computer graphics and interactive techniques, SIGGRAPH (pp. 187–194). New York: ACM Press/Addison-Wesley.
Brennan, S. E. (1985). Caricature generator: The dynamic exaggeration of faces by computer. Leonardo, 18(3), 170–178.
Darwin, C. (1872). The expression of the emotions in man and animals. London: J. Murray.
Kobayashi, H., & Hara, F. (1995). A basic study on dynamic control of facial expressions for face robot. In Ro-Man '95 Tokyo: 4th IEEE international workshop on robot and human communication, proceedings (pp. 275–280).
Li, S. Z., & Jain, A. K. (2005). Handbook of face recognition. New York: Springer.
I
PSYCHOPHYSICS
1
Is Dynamic Face Perception Primary?
Alan Johnston
Movement: A Foundation for Facial Representation
The vast majority of work on perception of the human face has studied the frozen facial image, a frame sampled from the dynamic procession of one transient expression to another. To some degree this has been a necessary simplification, both technically and theoretically, to allow progress in understanding how we see faces (Bruce, 1988; Bruce & Young, 2000) and in discovering the neural structures on which this process is based (Leopold, Bondar, & Giese, 2006; Tsao & Livingstone, 2008). In recent years it has become easier to generate, manipulate, control, and display moving faces and so our progress in understanding dynamic facial perception (A. J. O'Toole, Roark, & Abdi, 2002) is gathering speed. However, the past emphasis on static faces may have obscured a key role for movement in forming the mechanisms of encoding not only moving faces but facial structure in general. In redress, here we explore the central role of change in the representation of faces.

Research on facial movement has almost always used natural movement as the stimulus. In some cases the motion is generated by trained actors; in other cases the recorded behavior of naive participants is used. One of the main problems, not often considered, is that the strength of the signal that subjects are trying to encode—for example, to discriminate gender or emotional expression—may vary radically among actors. If the signal is not naturally in the stimulus, subjects will not be able to access it. Inducing natural movement signals within a controlled environment is a significant technical problem. For experiments involving social signals, it makes sense to record facial motion when the actor is seriously engaged in communicating with a confederate. In work on emotional expression, it would be better to use real rather than simulated emotions; however, there are clear ethical problems in provoking real emotional distress. As work on facial motion progresses, it is going to be necessary to have avatars whose facial behavior can be specified and precisely controlled. Fortunately, progress is being made in this direction (Blanz & Vetter, 1999; Cowe, 2003; Curio, Giese, Breidt, Kleiner, & Bülthoff, this volume; Knappmeyer,
this volume; Knappmeyer, Thornton, & Bülthoff, 2003) and work in computer vision on facial animation will inevitably generate new methods and techniques that can be exploited by experimental psychologists.

Types of Motion
The chapters in this section of the book address the perception of facial motion and refer to studies that manipulate the motion signal and the form of the face in various ways. It is useful to distinguish at the outset the types of motion present in an image sequence containing dynamic facial movement. These types form a list but also a hierarchy of representation from instantaneous local information to temporally extended object-based representations. These distinctions will be helpful in considering how motion influences perception of a face.

Local Motion
Local motion refers to measurements of image motion. This can be represented as a vector field, which describes the local image velocities point by point and frame by frame. These measurements are tied to locations in the image and this level of representation is agnostic to, and requires no knowledge of, the content of images. For sensing three-dimensional motion, the local motion vectors will also have a depth component.
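As a concrete illustration, a dense local motion field of this kind can be estimated from a pair of video frames with a standard optical flow algorithm. The short sketch below uses OpenCV's Farnebäck method in Python; the frame file names are placeholders and the parameter values are generic defaults, not settings drawn from any study discussed in this chapter.

```python
import cv2
import numpy as np

# Hypothetical pair of consecutive frames from a face video (grayscale).
prev_gray = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)
next_gray = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)

# Dense optical flow: one (vx, vy) vector per pixel, i.e., a vector field that
# describes local image velocities with no knowledge of image content.
# Positional arguments: prev, next, flow, pyr_scale, levels, winsize,
# iterations, poly_n, poly_sigma, flags.
flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                    0.5, 3, 15, 3, 5, 1.2, 0)

vx, vy = flow[..., 0], flow[..., 1]
speed = np.sqrt(vx ** 2 + vy ** 2)        # per-pixel motion magnitude
print(flow.shape, float(speed.mean()))    # (H, W, 2), mean speed in pixels/frame
```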
Object Motion

Object motion refers to motion as a property of an object, such as shape or size. This is a representation of motion tied to objects and is most associated with tracking motion, which delivers an ordered list of points describing a time series of the object's position in the image. We can also extend this category to three-dimensional space, allowing a description of the pose of the object in addition to its location.
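A minimal sketch of this representation, assuming OpenCV's bundled frontal-face detector and a hypothetical video file, records the detected face's center and scale frame by frame, yielding exactly such an ordered time series (a pose estimate could be appended in the same way).

```python
import cv2

# Hypothetical input clip; the detector is OpenCV's stock Haar cascade.
cap = cv2.VideoCapture("face_clip.mp4")
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

trajectory = []   # (frame_index, x_center, y_center, box_width) per frame
frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) > 0:
        x, y, w, h = faces[0]            # keep the first detection only
        trajectory.append((frame_idx, x + w / 2.0, y + h / 2.0, w))
    frame_idx += 1
cap.release()

# `trajectory` is the time series of object positions described in the text.
```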
Object-Based Motion

The face, like the body, can change its configuration over time. These changes are constrained by the structure of the object (Bruce, 1988), and a good model of the object would be one that has a form and a set of parameters defining the possible variations of the structure that adequately describe the changes seen in the object (Johnston, 1996). Object-based motion could be represented by the change in the parameters of an object's structural description over time. However, we may also wish to refer to changes in parameters, the scope of which encompasses global changes in the face. Principal Components Analysis (PCA), which delivers orthogonal components ordered in terms of the maximum variance accounted for by the
components, provides one means of representing these global changes in faces (Turk & Pentland, 1991).

Facial Gestures

Facial expressions and characteristic facial gestures (Knight & Johnston, 1997) or dynamic identity signatures (O'Toole & Roark, this volume; O'Toole et al., 2002) refer to systematic dynamic patterns of object-based motion, which have a manner that conveys information, such as "I want to hear what you have to say"; "I am telling you a joke"; "I am admonishing you"; or "I am shy." These can also give clues to identity and can have a specific meaning, as in raising an eyebrow, winking, or blowing a kiss. These gestures can be represented as trajectories in an object-based motion space.

The natural representations are quite different for each of these types. The first is a dense vector field, the second a time series of locations, and the third a time series of model parameter values. With these different kinds of motion in mind, we can go on to consider the problems in demonstrating any advantages of movement. Often, as in cases like the interpretation of visual speech (Hill, this volume) or judging the quality of facial mimicry, the signals to be extracted are almost exclusively dynamic. However, in other cases, such as the recognition of familiar faces, the task can often be accomplished easily from static frames. This raises the question of how far motion is also used for recovering identity from dynamic face stimuli.

Revealing the Motion Advantage
We can recognize familiar faces from still images. Indeed, the still images can be degraded by two-toning or distorted by eccentric caricaturing, and the individual will still be recognized. Hence it is difficult to demonstrate that adding motion provides a useful additional clue to the recognition of identity (Butcher & Lander, this volume).

The general strategy in separating motion and form information is to degrade the form information, leaving the motion information intact. The use of point-light displays (Bassili, 1979; Bruce & Valentine, 1988; Hill, Jinno, & Johnston, 2003) allows the recovery of object-based motion, but the sparse sampling involved inevitably degrades the local motion field as well as the spatial form of the face. We can ask to what degree this technique removes critical motion information as well as spatial form information. There is some current debate over whether the perception of object motion from point-light displays of whole-body motion is based on the perception of changes in form over time (Beintema & Lappe, 2002), which we have classed as object-based motion, or the pattern of midlevel opponent motion signals (Casile & Giese, 2005). Recent evidence favors the idea that the critical features in walking figures are
"signature-like," such as local motion or object motion signals, indicating the reversal in the movements of the feet at the end of the stride, rather than the updating of form templates (Thurman & Grossman, 2008). It is also possible that similar diagnostic motion signals support the recovery of information from faces. Hill, Jinno, and Johnston (2003) found that a few well-placed points were as good as solid-body facial animations in suggesting the gender of an actor, and that performance was better in this case than when more dots were randomly placed on the face, as in the experiments by Bassili (1979). Thus, facial motion carried by point-light displays can be recognized reasonably well so long as the critical motion information remains salient in the subsequent display. However, it is not clear whether object motion cues from the dots are used directly or whether well-placed points are sufficient to activate analysis of object-based motion.

Another way to degrade the spatial information is to use photographic negatives (Butcher & Lander, this volume; Knight & Johnston, 1997). Familiar faces in negatives can be recognized better when they are moving than when they are still (Butcher & Lander, this volume; Knight & Johnston, 1997; Lander, Christie, & Bruce, 1999). One of the advantages of this technique is that the motion field is unchanged. Two-dimensional form features, such as the position of contours, edges, and facial features, are also unaffected, indicating that we are unlikely to utilize a representation based on edges in the face-processing pipeline. The main process that is disrupted by this manipulation is the recovery of the surface shape from shading, shadows, and other illumination cues. The gray-level pigmentation of the face is also altered. The pupils of the eye, which are normally dark, are made light, and hair color is reversed, but changes in pigmentation are not as disrupting as changes in the consequences of illumination. Reversing the chromatic signal, e.g., changing the reddish parts of the face to green, has a relatively small effect on recognition (Kemp, Pike, White, & Musselman, 1996). Nevertheless, the advantage that moving photographic negatives have over negative stills in the identification of familiar faces shows that the motion field or any subsequent processing of it is useful for recognition.

However, photographic negatives still have spatial information. The shape of the face and the positions of contours and features are unchanged by negation. To remove these cues, Hill and Johnston (2001) tracked features on faces, mapped those features onto an average 3D model avatar, and then animated the avatar using those tracked points—a process known as performance-driven animation. Remarkably, subjects were able to judge the gender of the actor and classify individual facial movements as coming from one individual rather than another more often than would be expected by chance. However, this technique used a mesh-based animation system (Famous Technologies) to generate the motion of the avatar outside the tracked points and therefore the motion field of the resulting avatar was inevitably smoother and coarser grained than real movement. Also, the spatial characteristics
of the local motion field were altered to match the spatial configuration of the avatar, which may degrade the motion information to some degree.

The need to degrade the form of a face to see the motion advantage raises the issue of whether we use facial motion when the spatial information is clearly visible. The fact that motion can bias identity judgments based primarily on spatial information (Knappmeyer, this volume; Knappmeyer et al., 2003) would argue that both types of information are integrated in coming to a decision about identity.

Segmenting the Visual Stream and Temporal Constancy
There are some significant problems to be addressed in understanding how human observers encode dynamic events. The first is temporal segmentation. How do we break up the continuous stream of action into identifiable chunks? What principles can be used to say that a set of frames is part of one facial gesture and not another? The second problem is what might be called temporal constancy: How do we recognize two facial gestures as differently speeded versions of each other?

Hill and coworkers (Hill, this volume; H. C. Hill, Troje, & Johnston, 2005) encountered these problems in a study of spatial and temporal facial caricaturing. They found that range exaggeration (spatial) was effective in increasing ratings of emotional content, whereas domain-specific (timing difference) exaggeration had minor effects, which were not necessarily consistent with the depicted emotion. To some degree this is to be expected since a wide grin is usually considered to portray greater happiness than a thin smile. What was more surprising was that temporal caricaturing was not very effective in enhancing emotional expression.

Hill et al. (2005) used extrema of chin movements during speech as time interval markers. While the extrema of chin movements are reasonable temporal features, the choice of this feature is essentially pragmatic. Temporal segmentation and part decomposition of objects present similar problems. In object recognition, the classical view is that the object is expected to segment into separate components at surface concavities (Biederman, 1987; Marr & Nishihara, 1978). Hill et al. (2005) use a similar signal—chin position extrema—to segment the visual speech signal. However, there are many instances of objects that do not segment at concavities and others that come apart at smooth joins between the components. We cannot tell whether the handset can be removed from the body of a classic telephone just by looking at it. We therefore have to consider how to describe an object on the basis of experience.

If regions of an object consistently move together rigidly, then it can reasonably be considered a single part of the object. To the extent that an object's regions move in an uncorrelated way, or independently, they can be considered to be separate entities. If we accept the fact that objects are segmented into parts that vary
independently, then no a priori rules can apply and segmentation must depend upon experience (Johnston, 1992). The facility to seamlessly progress from a specific representation, supporting discrimination, to a more global description, supporting abstraction and generalization, was a key insight of Marr and Nishihara's (1978) approach to object recognition. This idea can readily be incorporated in an experience-based segmentation scheme (Johnston, 1996). Body parts like the arm can appear as a single part of an object in some contexts and a multipart object in others.

We can apply the same argument to the segmentation of facial behavior. Some expressions involve correlated activity that always occurs together. Preparation for a sneeze is a good example of this. When smiling, a narrowing must inevitably follow the widening of the mouth into a smile. This temporal correlation means we do not readily segment the first and second phase of a smile. Frowns do not regularly follow smiles and therefore these actions can reliably be segmented. Principal Components Analysis (Calder, Burton, Miller, Young, & Akamatsu, 2001; Turk & Pentland, 1991) has often been used to identify the major sources of variation in the form of the face. It can also be used to group together into components the characteristic aspects of the face that change together, as in facial expressions (Cowe, 2003).

Expressions can be executed quickly or slowly. However, an expression is often recognizable at different rates. In perception of dynamic faces, this capacity to recognize facial gestures at different rates requires as much consideration as how we recognize faces at different spatial scales or different three-dimensional poses. Added to this is the issue of whether the mechanical properties of the face place limits on temporal scaling; e.g., whether a speeded-up expression is uniformly faster or whether the change in speed varies through the facial gesture. Although we seem remarkably good at recognizing facial gestures regardless of the speed of implementation, often the rate matters—a slow, angry expression might be best described as menacing.

Motion can also aid viewpoint constancy. Although perception of static faces tends to show a dependence on viewpoint (Hill, Schyns, & Akamatsu, 1997; O'Toole, Edelman, & Bülthoff, 1998), and facial adaptation studies show a combination of viewpoint dependence and viewpoint invariance (Benton et al., 2007), recognition of facial motion in a match-to-sample task was relatively invariant of viewpoint (Watson, Johnston, Hill, & Troje, 2005). The benefit of motion may be that nonrigid facial expressions are object-based changes that may be coded in an object-centered manner (Marr & Nishihara, 1978).
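The experience-based segmentation idea discussed above, grouping regions whose motions are consistently correlated and splitting regions whose motions are independent, can be sketched directly. The toy example below is only an illustration under stated assumptions: synthetic random-walk trajectories stand in for real tracked facial points, and the 0.5 distance threshold is arbitrary. It clusters tracked points by the correlation of their velocity signals.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical input: trajectories of N tracked facial points over T frames,
# shaped (T, N, 2).  Random walks stand in for real tracking output here.
rng = np.random.default_rng(0)
T, N = 100, 20
trajectories = rng.standard_normal((T, N, 2)).cumsum(axis=0)

# One velocity signal per point: a row of length 2 * (T - 1).
velocities = np.diff(trajectories, axis=0)                # (T-1, N, 2)
signals = velocities.transpose(1, 0, 2).reshape(N, -1)    # (N, 2*(T-1))

# Points whose velocities are strongly correlated are merged into one "part";
# points that move independently fall into separate clusters.
corr = np.corrcoef(signals)
dist = 1.0 - corr                                         # correlation distance
condensed = dist[np.triu_indices(N, k=1)]                 # condensed form for linkage
labels = fcluster(linkage(condensed, method="average"),
                  t=0.5, criterion="distance")
print(labels)                                             # cluster index per tracked point
```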
Prototypes, Object Forms, and Transformations

Facial adaptation can alter the appearance of an average face in a direction opposite to an adaptor (Curio et al., this volume; Leopold, O'Toole, Vetter, & Blanz, 2001; Watson & Clifford, 2003; Webster & MacLin, 1999), implying a representation relative
to a norm or prototype (Rhodes, Brennan, & Carey, 1987; Valentine, 1991). Single neurons also appear to increase their firing rate as a function of deviation from a norm (Leopold et al., 2006). But how do norm-based representations arise? Face adaptation effects can be observed by age eight (Nishimura, Maurer, Jeffery, Pellicano, & Rhodes, 2008), and average faces of a set of four are recognized between one and three months (de Haan, Johnson, Maurer, & Perrett, 2001). Although prototype coding appears early, natural sources of variation in the face are likely to form the basis of the dimensions of face space, and the sensitivity of the representations to adaptation suggests they are malleable throughout life.

Figure 1.1  Marker points are first identified for different faces; then each face is warped to a mean face. The warp is applied to each element of the motion sequence for each face. The warp fields and the warped face texture for each image frame form the vectors that are subject to the Principal Components Analysis.
Figure 1.2  Effect of varying each of the first thirty principal components by ±k standard deviations from the mean (left and right columns). The values of k are specified in the column headings.
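The pipeline summarized in figures 1.1 and 1.2 can be sketched in a few lines. In the toy example below, random arrays stand in for the per-frame warp fields and warped textures that real marker-based registration would supply; each frame is vectorized, the frames are stacked, and PCA is computed with a singular value decomposition so that individual components can be visualized at plus or minus k standard deviations from the mean, as in figure 1.2.

```python
import numpy as np

# Hypothetical inputs: for every frame of every sequence, a dense warp field to
# the mean face (H x W x 2) and the face texture warped into the mean shape
# (H x W).  Random data stand in for real registered face sequences.
rng = np.random.default_rng(0)
n_frames, H, W = 200, 64, 64
warp_fields = rng.standard_normal((n_frames, H, W, 2))
warped_textures = rng.standard_normal((n_frames, H, W))

# Each frame becomes one long vector of warp components plus texture values.
X = np.hstack([warp_fields.reshape(n_frames, -1),
               warped_textures.reshape(n_frames, -1)])
X_mean = X.mean(axis=0)
Xc = X - X_mean

# PCA via SVD: rows of Vt are the principal components, ordered by the
# variance they account for.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
stddev = S / np.sqrt(n_frames - 1)

# A face k standard deviations along component j (cf. figure 1.2).
j, k = 0, 2.0
frame_vector = X_mean + k * stddev[j] * Vt[j]
warp_part = frame_vector[:H * W * 2].reshape(H, W, 2)
texture_part = frame_vector[H * W * 2:].reshape(H, W)
```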
The fundamental problem for a norm-based representation is how to determine what examples to include in generating the average. Do we need to cluster the examples first to, say, distinguish human and animal faces? Do we need to distinguish different people from different expressions on the same face? When PCA is carried out over multiple expressions of multiple people (figures 1.1 and 1.2), the first few components, which carry most of the variance, code identity, and the next group codes for expression (Calder et al., 2001; Calder & Young, 2005; Cowe, 2003). Thus, the proposal of a separation of identity and expression coding (Bruce & Young, 1986) may just be a reflection of face image statistics (Calder & Young, 2005). Two expressions of the same face (under fixed illumination) are more similar than two faces with the same expression. Instances of the same face will tend to cluster in a PCA space, but this doesn't help us determine how to set up a PCA-like face space or whether we should represent faces in multiple spaces. PCA relates faces by similarity, not by geometric transformation. However, different expressions of the same person are clearly related by a geometric transformation and as such they can be thought of as forming an equivalence class. The ease of being able to transform one example into another could provide a means of segmenting all faces into separate spaces.

For an equivalence class defined by a geometric transformation, the average has no special status—to describe an ellipsoid we say we have a form, an ellipsoid, and then specify the parameters. However, the choice of parameters is somewhat open. We can therefore combine form-based and prototype-based coding by suggesting that the representation of a face involves specifying a face form, Barack Obama's face for example, with a norm based on our experience of the face, and a parameter space based on the natural deviations from the norm seen for that object. It is difficult to see how one can form a useful description of a face until it is clear how the face is going to vary. There would be no point in encoding the length of the nose in a face description if every individual's nose was the same length. Variation is therefore key to the representation of dynamic and static form. Motion allows a subdivision of the space into equivalence classes. In this sense facial motion would appear to be fundamental in building a representation for faces since it provides the context within which individual faces may be described.

To summarize, facial motion provides information about facial gestures and speech that are intrinsically dynamic, but it also contributes to face recognition. There are important problems to be addressed, such as determining the principles by which the stream of facial expressions is segmented and how we recognize dynamic information irrespective of rate. Finally, we explored the role of natural variation and particularly the geometric transformations inherent in facial motion in arriving at internal representations for faces.
Acknowledgment

I would like to thank the Engineering and Physical Sciences Research Council for their support.

References

Bassili, J. N. (1979). Emotion recognition: The role of facial movement and the relative importance of upper and lower areas of the face. J Pers Soc Psychol, 37(11), 2049–2058.
Beintema, J. A., & Lappe, M. (2002). Perception of biological motion without local image motion. Proc Natl Acad Sci USA, 99(8), 5661–5663.
Benton, C. P., Etchells, P. J., Porter, G., Clark, A. P., Penton-Voak, I. S., & Nikolov, S. G. (2007). Turning the other cheek: The viewpoint dependence of facial expression after-effects. Proc Biol Sci, 274(1622), 2131–2137.
Biederman, I. (1987). Recognition-by-components: A theory of human image understanding. Psych Rev, 94, 115–147.
Blanz, V., & Vetter, T. (1999). A morphable model for the synthesis of 3D faces. In Proceedings of the 26th annual conference on computer graphics and interactive techniques, SIGGRAPH '99, 8–13 August 1999, Los Angeles (pp. 187–194).
Bruce, V. (1988). Recognising faces. Hove, UK: Lawrence Erlbaum.
Bruce, V., & Valentine, T. (1988). When a nod's as good as a wink: The role of dynamic information in face recognition. In M. M. Gruneberg, P. E. Morris, and R. N. Sykes (eds.), Practical aspects of memory: Current research and issues (Vol. 1). Chichester, UK: Wiley.
Bruce, V., & Young, A. (1986). Understanding face recognition. Br J Psychol, 77(Pt 3), 305–327.
Bruce, V., & Young, A. W. (2000). In the eye of the beholder: The science of face perception. Oxford, UK: Oxford University Press.
Calder, A. J., Burton, A. M., Miller, P., Young, A. W., & Akamatsu, S. (2001). A principal component analysis of facial expressions. Vision Res, 41(9), 1179–1208.
Calder, A. J., & Young, A. W. (2005). Understanding the recognition of facial identity and facial expression. Nat Rev Neurosci, 6(8), 641–651.
Casile, A., & Giese, M. A. (2005). Critical features for the recognition of biological motion. J Vis, 5, 348–360.
Cowe, G. A. (2003). Example-based computer-generated facial mimicry. Unpublished PhD thesis, University College London, London.
de Haan, M., Johnson, M. H., Maurer, D., & Perrett, D. I. (2001). Recognition of individual faces and average face prototypes by 1- and 3-month-old infants. Cognit Devel, 16(2), 659–678.
Hill, H., Jinno, Y., & Johnston, A. (2003). Comparing solid-body with point-light animations. Perception, 32(5), 561–566.
Hill, H., & Johnston, A. (2001). Categorizing sex and identity from the biological motion of faces. Curr Biol, 11(11), 880–885.
Hill, H., Schyns, P. G., & Akamatsu, S. (1997). Information and viewpoint dependence in face recognition. Cognition, 62(2), 201–222.
Hill, H. C., Troje, N. F., & Johnston, A. (2005). Range- and domain-specific exaggeration of facial speech. J Vis, 5(10), 793–807.
Johnston, A. (1992). Object constancy in face processing: Intermediate representations and object forms. Irish J Psych, 13, 425–438.
Johnston, A. (1996). Surfaces, objects and faces. In D. W. Green (ed.), Cognitive science: An introduction. Oxford, UK: Blackwell.
Kemp, R., Pike, G., White, P., & Musselman, A. (1996). Perception and recognition of normal and negative faces: The role of shape from shading and pigmentation cues. Perception, 25(1), 37–52.
Knappmeyer, B., Thornton, I. M., & Bülthoff, H. H. (2003). The use of facial motion and facial form during the processing of identity. Vision Res, 43(18), 1921–1936.
Knight, B., & Johnston, A. (1997). The role of movement in face recognition. Vis Cognit, 4(3), 265–273.
Lander, K., Christie, F., & Bruce, V. (1999). The role of movement in the recognition of famous faces. Mem Cognit, 27(6), 974–985.
Leopold, D. A., Bondar, I. V., & Giese, M. A. (2006). Norm-based face encoding by single neurons in the monkey inferotemporal cortex. Nature, 442(7102), 572–575.
Leopold, D. A., O'Toole, A. J., Vetter, T., & Blanz, V. (2001). Prototype-referenced shape encoding revealed by high-level aftereffects. Nat Neurosci, 4(1), 89–94.
Marr, D., & Nishihara, H. K. (1978). Representation and recognition of the spatial organisation of three-dimensional shape. Proc R Soc Lond B, 200, 269–294.
Nishimura, M., Maurer, D., Jeffery, L., Pellicano, E., & Rhodes, G. (2008). Fitting the child's mind to the world: Adaptive norm-based coding of facial identity in 8-year-olds. Dev Sci, 11(4), 620–627.
O'Toole, A. J., Edelman, S., & Bülthoff, H. H. (1998). Stimulus-specific effects in face recognition over changes in viewpoint. Vision Res, 38(15–16), 2351–2363.
O'Toole, A. J., Roark, D. A., & Abdi, H. (2002). Recognizing moving faces: A psychological and neural synthesis. Trends Cognit Sci, 6(6), 261–266.
Rhodes, G., Brennan, S., & Carey, S. (1987). Identification and ratings of caricatures: Implications for mental representations of faces. Cogn Psychol, 19(4), 473–497.
Thurman, S. M., & Grossman, E. D. (2008). Temporal "bubbles" reveal key features for point-light biological motion perception. J Vis, 8(3):28, 1–11.
Tsao, D. Y., & Livingstone, M. S. (2008). Mechanisms of face perception. Annu Rev Neurosci, 31, 411–437.
Turk, M., & Pentland, A. (1991). Eigenfaces for recognition. J Cognit Neurosci, 3, 71–86.
Valentine, T. (1991). A unified account of the effects of distinctiveness, inversion, and race in face recognition. Q J Exp Psychol A, 43(2), 161–204.
Watson, T. L., & Clifford, C. W. (2003). Pulling faces: An investigation of the face-distortion aftereffect. Perception, 32(9), 1109–1116.
Watson, T. L., Johnston, A., Hill, H. C. H., & Troje, N. F. (2005). Motion as a cue for viewpoint invariance. Vis Cognit, 12(7), 1291–1308.
Webster, M. A., & MacLin, O. H. (1999). Figural aftereffects in the perception of faces. Psychon Bull Rev, 6(4), 647–653.
2
Memory for Moving Faces: The Interplay of Two Recognition Systems
Alice O’Toole and Dana Roark
The human face is a captivating stimulus, even when it is stationary. In motion, however, the face comes to life and offers us a myriad of information about the intent and personality of its owner. Through facial movements such as expressions we can gauge a person's current state of mind. By perceiving the movements of the mouth as a friend speaks, a conversation becomes more intelligible in a noisy environment. Through the rigid movement of the head and the direction of eye gaze, we can follow another person's focus of attention in a crowded room. The amount and diversity of social information that can be conveyed by a face ensures its place as a central focal object in any scene.

Beyond the rich communication signals that we perceive in facial expressions, head orientation, eye gaze, and facial speech motions, it is also pertinent to ask whether the movements of a face help us to remember a person. The answer to this question can potentially advance our understanding of how the complex tasks we perform with faces, including those having to do with social interaction and memory, coexist in a neural processing network. It can also shed light on the computational processes we use to extract recognition cues from the steady stream of facial movements that carry social meanings.

Several years ago we proposed a psychological and neural framework for understanding the contribution of motion to face recognition (O'Toole, Roark, & Abdi, 2002; Roark, Barrett, Spence, Abdi, & O'Toole, 2003). At the time, there were surprisingly few studies on the role of motion in face recognition and so the data we used to support this framework were sparse and in some cases tentative. In recent years, however, interest in the perception and neural processing of dynamic human faces has expanded, bringing with it advances in our understanding of the role of motion in face recognition.

In this chapter we revisit and revise the psychological and neural framework we proposed previously (O'Toole et al., 2002). We begin with an overview of the original model and a sketch of how we arrived at the main hypotheses. Readers who are familiar with this model and its neural underpinnings can skip to the next section of this chapter. In that section we update this past perspective with new findings that
address the current status of psychological and neural hypotheses concerning the role of motion in face recognition. Finally, we discuss open questions that continue to challenge our understanding of how the dynamic information in faces affects the accuracy and robustness of their recognition.

A Psychological and Neural Framework for Recognizing Moving Faces: Past Perspectives
Before 2000, face recognition researchers rarely considered the role of motion in recognition. It is fair to say that with little or no data on the topic, many of us simply assumed that motion would benefit face recognition in a whole variety of ways. After all, a moving face is more engaging; it conveys contextual and emotional information that can be associated with identity; and it reveals the structure of the face more accurately than a static image. It is difficult to recall now, in the "You-tube" era, that a few short years ago the primary reason psychologists avoided the use of dynamic faces in perception and memory studies was the limited availability of digital video on the computers used in most psychology labs. This problem, combined with the lack of controlled video databases of faces, hindered research efforts on the perception and recognition of dynamic faces. As computing power increased and tools for manipulating digital video became standard issue on most computers, this state of affairs changed quickly. Research and interest in the topic burgeoned, with new data to consider and integrate appearing at an impressive rate.

The problem of stimulus availability was addressed partially in 2000 with a database designed to test computational face recognition algorithms competing in the Face Recognition Vendor Test 2002 (Phillips, Grother, Micheals, Blackburn, Tabassi, & Bone, 2003). This database consisted of a large collection of static and dynamic images of people taken over multiple sessions (O'Toole, Harms, Snow, Hurst, Pappas, & Abdi, 2005). In our first experiments using stimuli from this database, we tested whether face recognition accuracy could be improved when people learned a face from a high-resolution dynamic video rather than from a series of static images that approximated the amount of face information available from the video. Two prior studies returned a split decision on this question, with one showing an advantage for motion (Pike, Kemp, Towell, & Phillips, 1997) and a second showing no difference (Christie & Bruce, 1998). In our lab, we conducted several experiments with faces turning rigidly, expressing, and speaking, and found no advantage for motion information either at learning or at test. In several replications, motion neither benefited nor hindered recognition. It was as if the captivating motions of the face were completely irrelevant for recognition.

These studies led us to set aside the preconceived bias that motion, as an ecologically valid and visually compelling stimulus, must necessarily make a useful contribution to face recognition. Although this made little sense psychologically, it seemed to
fit coherently into the consensus that was beginning to emerge about the way moving versus static faces and bodies were processed neurally.

Evidence from Neuroscience
A framework for understanding the neural organization of processing faces was suggested by Haxby, Hoffman, and Gobbini (2000) based on functional neuroimaging and neurophysiological studies. They proposed a distributed neural model in which the invariant information in faces, useful for identification, was processed separately from the changeable, motion-based information in faces that is useful for social communication. The invariant information includes the features and configural structure of a face, whereas the changeable information refers to facial expression, facial speech, eye gaze, and head movements. Haxby et al. (2000) proposed the lateral fusiform gyrus for processing the invariant information and the posterior superior temporal sulcus (pSTS) for processing changeable information in faces. The pSTS is also implicated in processing body motion.

The distributed network model pointed to the possibility that the limited effect of motion on face recognition might be a consequence of the neural organization of the brain areas responsible for processing faces. The distributed idea supports the main tenets of the role of the dorsal and ventral processing streams in vision with, at least, a partial dissociation of the low-resolution motion information from the higher-resolution color and form information. From the Haxby et al. model, it seemed possible, even likely, that the brain areas responsible for processing face identity in the inferior temporal cortex might not have much (direct) access to the areas of the brain that process facial motions for social interaction.

A second insight from the distributed network model was an appreciation of the fact that most facial motions function primarily as social communication signals (Allison, Puce, & McCarthy, 2000). As such, they are likely to be processed preeminently for this purpose. Although facial motions might also contain some unique or identity-specific information about individual faces that can support recognition, this information seems secondary to the more important social communication information conveyed by motion.

Fitting the Psychological Evidence into the Neural Framework
A distributed neural network offered a framework for organizing the (albeit) limited data on recognition of people and faces from dynamic video. O'Toole et al. (2002) proposed two nonmutually exclusive ways in which motion might benefit recognition. The supplemental information hypothesis posits a representation of the characteristic facial motions or gestures of individual faces ("dynamic identity signatures") in addition to the invariant structure of faces. We assumed that when both static and dynamic identity information is present, people rely on the static information
because it provides a more reliable marker of facial identity. The representation enhancement hypothesis posits that motion contributes to recognition by facilitating the perception of the three-dimensional structure of the face via standard visual structure-from-motion processes. Implicit in this hypothesis is the assumption that the benefit of motion transcends the benefit of seeing the multiple views and images embedded in a dynamic motion sequence.

At the time, there were two lines of evidence for the supplemental information hypothesis. The first came from clever experiments that pitted the shape of a face (from a three-dimensional laser scan of a head model), which could be manipulated with morphing, against characteristic facial motions projected onto heads that varied in shape (Hill & Johnston, 2001; Knappmeyer, Thornton, & Bülthoff, 2001; Knappmeyer, Thornton, & Bülthoff, 2003). These studies provided a prerequisite demonstration that the facial motion in dynamic identity signatures can bias an identification decision. The second line of evidence came from studies showing that recognition of famous faces was better with dynamic than with static presentations. This was demonstrated most compellingly when image quality was degraded (Knight & Johnston, 1997; Lander, Bruce, & Hill, 2001; Lander & Bruce, 2000). O'Toole et al. (2002) concluded that the role of motion in face identification depends on both image quality and face familiarity.

For the representation enhancement hypothesis, the empirical support was less compelling. The idea that facial motion could be perceptually useful for forming better representations of faces is consistent with a role for structure-from-motion processes in learning new faces. It seemed reasonable to hypothesize that motion could contribute, at least potentially, to the quality of the three-dimensional information perceptually available in faces, even if evidence for this was not entirely unambiguous.

To summarize, combining human face recognition data with the distributed network model, we proposed that processing the visual information from faces for recognition involves the interplay of two systems (O'Toole et al., 2002). The first system is equivalent to the one proposed by Haxby et al. (2000) in the ventral temporal cortex. It includes the lateral fusiform gyrus and associated structures (e.g., occipital face area, OFA) and processes the invariant information in faces. The second system processes facial movements and is the part of the distributed network useful for representing the changeable aspects of faces in the pSTS. O'Toole et al. (2002) amended the distributed model to specify the inclusion of both social communication signals and person-specific dynamic identity signatures in facial movements. We suggested that two caveats apply to the effective use of this secondary system. First, the face must be familiar (i.e., characteristic motions of the individual must be known) and second, the viewing conditions must be poor (i.e., otherwise the more reliable pictorial code will dominate recognition and the motion system will not be needed).
The familiarity caveat is relevant for understanding the well-established di¤erences in processing capabilities for familiar and unfamiliar faces (Hancock, Bruce, & Burton, 2000). When we know a person well, a brief glance from a distance even under poor illumination, is often all that is required for recognition. For unfamiliar faces, changes in viewpoint, illumination, and resolution between learning and test all produce reliable decreases in recognition performance (see O’Toole, Jiang, Roark, & Abdi, 2006, for a review). We suggested that this secondary system might underlie the highly robust recognition performance that humans show in suboptimal viewing conditions for the faces they know best. A second, more tentative amendment we made to the distributed model was the addition of structure-from-motion analyses that could proceed through the dorsal stream to the middle temporal (MT) and then back to the inferior temporal (IT) cortex as ‘‘motionless form.’’ We proposed possible neural mechanisms for this process and will update these presently. In the next section we provide an updated account of the evidence for the supplemental information and the representation enhancement hypotheses. We also look at some studies that suggest a role for motion in recognition but that do not fit easily within the framework we outlined originally (O’Toole et al., 2002; Roark et al., 2003). ‘‘Backup Identity System’’ and Supplemental Motion Information: An Update
Three lines of evidence now support the supplemental information hypothesis. The first adds to previous psychological studies of face recognition and further supports the beneficial effects of dynamic identity signatures for recognition. The second provides new evidence from studies indicating a benefit of motion-based identity codes in the efficiency of visually based "speech-reading" tasks. The third line of evidence comes from studies of prosopagnosics' perceptions of moving faces.

Psychological Studies of Dynamic Identity Signatures
Lander and Chuang (2005) found a supportive role for motion when recognizing people in challenging viewing conditions. They replicated the results of earlier studies and expanded their inquiry to examine the types of motions needed to show the benefit, evaluating both rigid and nonrigid motions. Lander and Chuang found a recognition advantage for nonrigid motions (expressions and speech), but not for rigid motions (head nodding and shaking). Moreover, they found a motion advantage only when the facial movements were "distinctive." They conclude that some familiar faces have characteristic motions that can help in identification by incorporating supplemental motion-based information about the face. In a follow-up study, Lander, Chuang, and Wickham (2006) demonstrated human sensitivity to the "naturalness" of the dynamic identity signatures. Their results
showed that recognition of personally familiar faces was significantly better when the faces were shown smiling naturally than when they were shown smiling in an artificial way. Artificial "smile videos" were created by morphing from a neutral expression image to a smiling face image. Speeding up the motion of the natural smile impaired identification but did not impair recognition from the morphed artificial smile. Lander et al. conclude that characteristic movements of familiar faces are stored in memory. The study offers further support for a reasonably precise spatiotemporal code of characteristic face motions.

Evidence from Facial Speech-Reading
Several recent studies demonstrate that the supplemental information provided by speaker-specific facial speech movements can improve the accuracy of visually based ‘‘speech-reading.’’ Kaufmann and Schweinberger (2005), for example, implemented a speeded classification task in which participants were asked to distinguish between two vowel articulations across variations in speaker identity. Changes in facial identity slowed the participants’ ability to classify speech sounds from dynamic stimuli but did not a¤ect classification performance from single static or multiple-static stimuli. Thus, individual di¤erences in dynamic speech patterns can modulate facial speech processing. Kaufmann and Schweinberger (2005) conclude that the systems for processing facial identity and speech-reading are likely to overlap. In a related study, Lander, Hill, Kamachi, and Vatikiotis-Bateson (2007) found that speaker-specific mannerisms enhanced speech-reading ability. Participants matched silent video clips of faces to audio voice recordings using unfamiliar faces and voices as stimuli. In one experiment, the prosody of speech segments was varied in clips that were otherwise identical in content (e.g., participants heard the statement ‘‘I’m going to the library in the city’’ or the question ‘‘I’m going to the library in the city?’’). Participants were less accurate at matching the face and the voice when the prosody of the audio clip did not match the video clip, or vice versa. Of note, Lander and her colleagues also showed that participants were most successful matching faces to voices when speech cues came in the form of naturalistic, conversational speech. Even relatively minor variations in speaker mannerisms (e.g., unnatural enunciation, hurried speech) inhibited the participants’ ability to correctly match faces with voices. Familiarity with a speaker also seems to play a role in speech-reading ability. Lander and Davies (2008) found that as participants’ experience with a speaker increases through exposure to video clips of the speakers reciting letters and telling stories, so does speech-reading accuracy. The mediating role of familiarity in the use of motion information from a face is consistent with the proposals we made previously (O’Toole et al., 2002) for a secondary face identity system in the pSTS. It is also consistent with the suggestion that the face identity code in this dorsal backup
system is more robust than the representation in the ventral stream. Lander and Davies (2008) conclude that familiarity with a person's idiosyncratic speaking gestures can be coupled with speech-specific articulation movements to facilitate speech-reading.

Evidence for the Supplemental Motion Backup System from Prosopagnosia
The possible existence of a recognition backup system that processes dynamic identity signatures in the pSTS makes an intriguing prediction about face recognition skills in prosopagnosics. Specifically, it suggests that face recognition could be partially spared in prosopagnosics when a face is presented in motion. The rationale behind this prediction is based on the anatomical separation of the ventral temporal face areas and the pSTS. Damage to the part of the system that processes invariant features of faces would not necessarily a¤ect the areas in the pSTS that process dynamic identity signatures. Two studies have addressed this question with prosopagnosics of di¤erent kinds. In the first study, Lander, Humphreys, and Bruce (2004) tested a stroke patient who su¤ered a relatively broad pattern of bilateral lesion damage throughout ventral-occipital regions, including the lingual and fusiform gyri. ‘‘HJA,’’ who is ‘‘profoundly prosopagnosic’’ (Lander et al., 2004), su¤ers also from a range of other nonface-specific neuropsychological deficits (see Humphreys and Riddoch, 1987 for a review) that include object agnosias, reading di‰culties, and achromatopsia. Despite these widespread visual perception di‰culties, HJA is able to perform a number of visual tasks involving face and body motion. For example, he is able to categorize lip movements accurately (Campbell, 1992). He also reports relying on voice and gait information for recognizing people (Lander et al., 2004). For present purposes, Lander et al. (2004) tested HJA on several tasks of face recognition with moving faces. HJA was significantly better at matching the identity of moving faces than matching the identity of static faces. This pattern of results was opposite to that found for control subjects. However, HJA was not able to use face motion to explicitly recognize faces and was no better at learning names for moving faces than for control faces. Thus, although the study suggests that HJA is able to make use of motion information in ways not easy for control subjects, it does not o¤er strong evidence for a secondary identity backup system. Given the extensive nature of the lesion damage in HJA, however, the result is not inconsistent with the hypothesis of the backup system. The prediction that motion-based face recognition could be spared in prosopagnosics was examined further by Steede, Tree, and Hole (2007). They tested a developmental prosopagnosic (‘‘CS’’) who has a purer face recognition deficit than HJA. CS has no di‰culties with visual and object processing, but has profound recognition di‰culties for both familiar and unfamiliar faces. Steede et al. tested CS with
dynamic faces and found that he was able to discriminate between dynamic identities. He was also able to learn the names assigned to individuals based only on their idiosyncratic facial movements. This learning reached performance levels comparable to those of control subjects. These results support the posited dissociation between the mechanisms involved in recognizing faces from static versus dynamic information. A cautionary note against concluding this too firmly from these results is that CS is a developmental (congenital) rather than an acquired prosopagnosic. Thus it is possible that some aspects of his face recognition system have been organized developmentally to compensate for his difficulties with static face recognition. More work of this sort is needed to test patients with relatively pure versions of acquired prosopagnosia. In summary, these three lines of evidence combined offer solid support for the supplemental information hypothesis.

Representation Enhancement: An Update
The clearest way to demonstrate a role for the representation enhancement hypothesis is to show that faces learned when they are in motion can be recognized more accurately than faces learned from a static image or set of static images that equate the ‘‘face’’ information (e.g., from extra views). If motion promotes a more accurate representation of the three-dimensional structure of a face, then learning a face from a moving stimulus should benefit later recognition. This advantage assumes that face representations incorporate information about the three-dimensional face structure that is ultimately useful for recognition. The benefit of motion in this case should be clear when testing with either a static or a moving image of the face—i.e., the benefit is a consequence of a better, more flexible face representation. At first glance, it seems reasonable to assume that the face representation we are talking about is in the ventral stream. In other words, this representation encodes the invariant featurebased aspects of faces rather than the idiosyncratic dynamic identity signatures. Thus it seems likely that it would be part of the system that represents static facial structure. We will qualify and question this assumption shortly. For present purposes, to date there is still quite limited evidence to support the beneficial use of structure-from-motion analyses for face recognition. This lack of support is undoubtedly related to findings from the behavioral and neural literatures that suggest view-based rather than object-centered representations of faces, especially for unfamiliar faces. In particular, several functional neuroimaging studies have examined this question using the functional magnetic resonance adaptation (fMR-A) paradigm (cf. Grill-Spector, Kushnir, Hendler, Edelman, Itzchak, & Malach, 1999). fMR-A makes use of the ubiquitous finding that brain response decreases with repeated presentations of the ‘‘same’’ stimulus. The fusiform face area (FFA;
Kanwisher, McDermott, & Chun, 1997) and other face-selective regions in the ventral temporal cortex show adaptation for face identity, but a release from adaptation when the viewpoint of a face is altered (e.g., Andrews & Ewbank, 2004; Pourtois, Schwartz, Seghier, Lazeyras, & Vuilleumier, 2005). This suggests a view-based neural representation of unfamiliar faces in the ventral temporal cortex. [Although see Jiang, Blanz, and O’Toole (2009) for evidence of three-dimensional information contributing to codes for familiar faces.] In the psychological literature, Lander and Bruce (2003) further investigated the role of motion in learning new faces. They show first that learning a face from either rigid (head nodding or shaking) or nonrigid (talking, expressions) motion produced better recognition than learning a face from only a single static image. However, the learning advantage they found for rigid motion could be accounted for by the different angles of view that the subjects experienced during the video. For nonrigid motions, the advantage could not be explained by the multiple sequences experienced in the video. Lander and Bruce suggest that this advantage may be due to the increased social attention elicited by nonrigid facial movements. This is because nonrigid facial motions (talking and expressing) may be more socially engaging than rigid facial ones (nodding and shaking). Nevertheless, the study opens up the possibility that structure-from-motion may benefit face learning, at least for some nonrigid facial motions. Before firmly concluding this, however, additional controls over the potential di¤erences in the attention appeal of rigid and nonrigid motions are needed to eliminate this factor as an explanation of the results. Bonner, Burton, and Bruce (2003) examined the role of motion and familiarization in learning new faces. Previous work by Ellis, Shepherd, and Davies (1979) showed that internal features tend to dominate when matching familiar faces, whereas external features are relied upon more for unfamiliar faces. Bonner et al. examined the time course of the shift from external to internal features over the course of several days. Also, based on the hypothesis that motion is relatively more important for recognizing familiar faces, they looked at the di¤erences in face learning as a function of whether the faces were learned from static images or a video. The videos they used featured slow rigid rotations of the head, whereas the static presentations showed extracted still images that covered a range of the poses seen in the video. They found improvement over the course of 3 days in matching the internal features of the faces, up to the level achieved with the external features for the initial match period. Thus the internal feature matches continued to improve with familiarity but the external matches remained constant. Notably, they found no role for motion in promoting learning of the faces. This is consistent with a minimal contribution of motion for perceptual enhancement. Before leaving this review of the representation enhancement hypothesis for learning new faces, it is worth noting that the results of studies with adults may not
generalize to learning faces in infancy. Otsuka, Konishi, Kanazawa, Yamaguchi, Abdi, and O'Toole (2009) compared 3–4-month-olds' recognition of previously unfamiliar faces learned from moving or static displays. The videos used in the study portrayed natural dynamic facial expressions. Infants viewing the moving condition recognized faces more efficiently than infants viewing the static condition, requiring shorter familiarization times even when different images of a face were used in the familiarization and test phases. Furthermore, the presentation of multiple static images of a face could not account for the motion benefit. Facial motion, therefore, promotes young infants' ability to learn previously unfamiliar faces. In combination with the literature on an adult's processing of moving faces, the results of Otsuka et al. suggest a distinction between developmental and postdevelopmental learning in structure-from-motion contributions to building representations for new faces.

Does Motion Contribute to Ventral Face Representations?
From a broad-brush point of view, the bulk of the literature on visual neuroscience points to anatomically and functionally distinct pathways for processing highresolution color or form information and for processing motion-based form. In our previous review (O’Toole et al., 2002), we discussed some speculative neural support for the possibility that motion information could contribute to face representations in the inferotemporal cortex. These neural links are obviously necessary if structurefrom-motion processes are to enhance the quality of face representations in the traditional face-selective areas of the ventral temporal (VT) cortex. In O’Toole et al. (2002), we suggested the following data in support of motion-based contributions to the ventral cortex face representation. First, we noted that neurons in the primate IT, which are sensitive to particular forms, respond invariantly to form even when it is specified by pure motion-induced contrasts (Sary, Vogels, & Orban, 1993). Second, lesion studies have indicated that form discrimination mechanisms in the IT can make use of input from the motion-processing system (Britten, Newsome, & Saunders, 1992). Third, both the neurophysiological (Sary et al., 1993) and lesion studies (Britten et al., 1992) suggest known connections from the MT to the IT via V4 (Maunsell and Van Essen, 1983; Ungerleider & Desimone, 1986) as a plausible basis for their findings. We also noted in O’Toole et al. (2002) that psychological demonstrations of the usefulness of structure-from-motion for face recognition have been tentative and so strictly speaking there is no psychologically compelling reason to establish a mechanism for the process. At present, the neural possibilities for contact between dorsal and ventral representations remain well established, but there are not enough results, at present, to require an immediate exploration of these links. However, there have been interesting developments in understanding the more general problem of recognizing people in
motion, particularly from point-light walker displays (Grossman & Blake, 2003; Giese & Poggio, 2003). These studies suggest a role for both ventral and dorsal pathways in the task. Giese and Poggio (2003) caution, however, that there are still open questions and unresolved paradoxes in the data currently available. For present purposes, we have wondered recently if one problem in making sense of the data concerns the assumption we made originally that structure-from-motion must somehow feed back information to the ventral face representations (O’Toole et al., 2002). As noted, this assumption was based on the rationale that structure is an invariant property of faces. Based on the more recent findings in the perception of moving bodies, we tentatively suggest that some aspects of facial structure might also become part of the pSTS representation of identity. This representation of face structure would be at least partially independent of the specific facial motions used to establish it, but would nonetheless need a moving face to activate it. In other words, we hypothesize that the dorsal stream pSTS identity representation might include, not only idiosyncratic facial gestures, but also a rough representation of the facial shape independent of these idiosyncratic motions. Evidence for this hypothesis can be found in two studies that suggest that the beneficial contribution from the motion system for learning new faces may be limited to tasks that include dynamic information both at learning and at test (Lander & Davies, 2007; Roark, O’Toole, & Abdi, 2006). First, from work in our lab, Roark et al. (2006) familiarized participants with previously unknown people using surveillancelike, whole-body images (gait videos) and then tested recognition using either close-up videos of the faces talking and expressing or a single, static image. The results showed that recognition of the people learned from the gait videos was more accurate when the test images were dynamic than when they were static. Similarly, when we reversed the stimuli so that participants learned faces either from the closeup still images or the close-up video faces and were tested using the gait videos, we found that recognition from the gait videos was more accurate when participants had learned the faces from the dynamic videos. Taken together, this pattern of results indicates that it is easier to obtain a motion advantage in a recognition task with unfamiliar faces when a moving image is present both at learning and at test. Furthermore, it is indicative of a system in which ‘‘motion-motion’’ matches across the learning and test trials are more useful for memory than either ‘‘static-motion’’ or ‘‘motion-static’’ mismatches across the learning and test trials. Lander and Davies (2007) found a strikingly similar result. In their study, participants learned faces from either a moving sequence of the person speaking and smiling or a static freeze frame selected from the video sequence. At test, participants viewed either moving or static images. This was a di¤erent moving sequence or static image than that presented during the learning phase. Lander and Davies found that there was an advantage for recognizing a face in motion only when participants had
learned the faces from dynamic images. This result adds further support to the idea that motion is most helpful when participants have access to it during both learning and test times. It should be noted that the results of both Lander and Davies (2007) and Roark et al. (2006) indicate that it is not a prerequisite that identical motions be present at learning and at test in order to obtain the motion advantage; both studies included di¤erent motions across the learning and test trials. Rather, it seems su‰cient merely to activate the motion system across the learn-test transfer to see gains in recognition accuracy. Returning to the representation enhancement hypothesis, the motionmotion benefit may reflect the e‰ciency of having to access only a single channel (i.e., the dorsal motion system) when bridging between two moving images. Conversely, when dynamic information is present only at learning but not at test (or only at test but not at learning), cross-access between the motion and static information streams is required for successful recognition. This motion-motion advantage must be put into perspective, however, with work on the e¤ect of moving ‘‘primes’’ on face perception. Thornton and Kourtzi (2002) implemented a face-matching task with unfamiliar faces in which participants briefly viewed either moving or static images of faces and then had to identify whether the prime image matched the identity of a static face presented immediately afterward. The participants’ responses were faster following the dynamic primes than following the static primes. Pilz, Thornton, and Bu¨ltho¤ (2006) observed a similar advantage for moving primes, with the additional finding that the benefit of moving primes extends across prime-target viewpoint changes. Pilz et al. also reported that moving primes led to faster identity matching in a delayed visual search task. Neither of these studies, however, included dynamic stimuli during the match trials, so it is di‰cult to tie these results directly to those of Roark et al. (2006) and Lander and Davies (2007), where motion was most useful when it was available from both the learning and the test stimuli. Interpretation of the motion-motion match hypothesis in the context of priming studies is a topic that is clearly ripe for additional work. In conclusion, there is a more general need for studies that can clarify the extent to which motion can act as carrier or conduit for dorsal representations of face identity that have both moving and stationary components. Summary
The faces we encounter in daily life are nearly always in motion. These motions convey social information and also can carry information about the identity of a person in the form of dynamic identity signatures. There is solid evidence that these motions comprise part of the human neural code for identifying faces and that they can be used for recognition, especially when the viewing conditions are suboptimal and
when the people to be recognized are familiar. Intriguing questions remain about the sparing of these systems in classic cases of acquired prosopagnosia. Moreover, an understanding of dorsal face and body representations, established through experience with dynamic stimuli, might be important for computational models of face recognition aimed at robust performance across viewpoint changes (cf. Giese & Poggio, 2003, and this volume). There is little evidence for the representation enhancement hypothesis having a major role in face recognition. Again, this raises basic questions about the extent to which ventral and dorsal face representations are independent. There is still a great deal of work to be done in bridging the gap between the well-studied face representations in the ventral stream and the less well-understood face representations in the dorsal stream.

Acknowledgments
Thanks are due to the Technical Support Working Group/United States Department of Defense for funding A. O'Toole during the preparation of this chapter.

References

Allison, T., Puce, A., & McCarthy, G. (2000). Social perception from visual cues: Role of the STS region. Trends in Cognitive Sciences, 4, 267–278.
Andrews, T. J., & Ewbank, M. P. (2004). Distinct representations for facial identity and changeable aspects of faces in the human temporal lobe. NeuroImage, 23, 905–913.
Bonner, L., Burton, A. M., & Bruce, V. (2003). Getting to know you: How we learn new faces. Visual Cognition, 10(5), 527–536.
Britten, K. H., Newsome, W. T., & Saunders, R. C. (1992). Effects of inferotemporal lesions on form-from-motion discrimination. Experimental Brain Research, 88, 292–302.
Campbell, R. (1992). The neuropsychology of lip reading. Philosophical Transactions of the Royal Society of London, 335B, 39–44.
Christie, F., & Bruce, V. (1998). The role of dynamic information in the recognition of unfamiliar faces. Memory and Cognition, 26, 780–790.
Ellis, H., Shepherd, J. W., & Davies, G. M. (1979). Identification of familiar and unfamiliar faces from internal and external features: Some implications for theories of face recognition. Perception, 8, 431–439.
Giese, M., & Poggio, T. (2003). Neural mechanisms for the recognition of biological motion. Nature Reviews Neuroscience, 4, 179–191.
Grill-Spector, K., Kushnir, T., Hendler, T., Edelman, S., Itzchak, Y., & Malach, R. (1999). Differential processing of objects under various viewing conditions in human lateral occipital complex. Neuron, 24, 187–203.
Grossman, E. D., & Blake, R. (2003). Brain areas active during visual perception of biological motion. Neuron, 35, 1167–1175.
Hancock, P. J. B., Bruce, V., & Burton, A. M. (2000). Recognition of unfamiliar faces. Trends in Cognitive Sciences, 4, 330–337.
Haxby, J. V., Hoffman, E., & Gobbini, M. I. (2000). The distributed human neural system for face perception. Trends in Cognitive Sciences, 4, 223–233.
Hill, H., & Johnston, A. (2001). Categorizing sex and identity from the biological motion of faces. Current Biology, 11, 880–885.
Humphreys, G., & Riddoch, M. J. (1987). To see but not to see: A case study of visual agnosia. Hillsdale, NJ: Lawrence Erlbaum.
Jiang, F., Blanz, V., & O'Toole, A. J. (2009). Three-dimensional information in face representation revealed by identity aftereffects. Psychological Science, 20(3), 318–325.
Kanwisher, N., McDermott, J., & Chun, M. M. (1997). The fusiform face area: A module in human extrastriate cortex specialized for face perception. Journal of Neuroscience, 17, 4302–4311.
Kaufmann, J. M., & Schweinberger, S. R. (2005). Speaker variations influence speechreading speed for dynamic faces. Perception, 34, 595–610.
Knappmeyer, B., Thornton, I., & Bülthoff, H. H. (2001). Facial motion can determine identity. Journal of Vision, 3, 337.
Knappmeyer, B., Thornton, I., & Bülthoff, H. H. (2003). The use of facial motion and facial form during the processing of identity. Vision Research, 43(18), 1921–1936.
Knight, B., & Johnston, A. (1997). The role of movement in face recognition. Visual Cognition, 4, 265–273.
Lander, K., & Bruce, V. (2000). Recognizing famous faces: Exploring the benefits of facial motion. Ecological Psychology, 12, 259–272.
Lander, K., & Bruce, V. (2003). The role of motion in learning new faces. Visual Cognition, 10(8), 897–912.
Lander, K., Bruce, V., & Hill, H. (2001). Evaluating the effectiveness of pixelation and blurring on masking the identity of familiar faces. Applied Cognitive Psychology, 15, 101–116.
Lander, K., Christie, F., & Bruce, V. (1999). The role of movement in the recognition of famous faces. Memory and Cognition, 27, 974–985.
Lander, K., & Chuang, L. (2005). Why are moving faces easier to recognize? Visual Cognition, 12(3), 429–442.
Lander, K., Chuang, L., & Wickham, L. (2006). Recognizing face identity from natural and morphed smiles. Quarterly Journal of Experimental Psychology, 59(5), 801–808.
Lander, K., & Davies, R. (2007). Exploring the role of characteristic motion when learning new faces. Quarterly Journal of Experimental Psychology, 60(4), 519–526.
Lander, K., & Davies, R. (2008). Does face familiarity influence speech readability? Quarterly Journal of Experimental Psychology, 61(7), 961–967.
Lander, K., Hill, H., Kamachi, M., & Vatikiotis-Bateson, E. (2007). It's not what you say but how you say it: Matching faces and voices. Journal of Experimental Psychology: Human Perception and Performance, 33(4), 905–914.
Lander, K., Humphreys, G. W., & Bruce, V. (2004). Exploring the role of motion in prosopagnosia: Recognizing, learning and matching faces. Neurocase, 10, 462–470.
Maunsell, J. H. R., & Van Essen, D. C. (1983). The connections of the middle temporal visual area (MT) and their relationship to a cortical hierarchy in the macaque monkey. Journal of Neuroscience, 3, 2563–2586.
O'Toole, A. J., Harms, J., Snow, S., Hurst, D. R., Pappas, M. R., & Abdi, H. (2005). A video database of moving faces and people. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(5), 812–816.
O'Toole, A. J., Jiang, F., Roark, D., & Abdi, H. (2006). Predicting human performance for face recognition. In R. Chellappa and W. Zhao (eds.), Face processing: Advanced models and methods. San Diego: Academic Press, pp. 293–320.
O'Toole, A. J., Roark, D., & Abdi, H. (2002). Recognition of moving faces: A psychological and neural framework. Trends in Cognitive Sciences, 6, 261–266.
Otsuka, Y., Konishi, Y., Kanazawa, S., Yamaguchi, M., Abdi, H., & O'Toole, A. J. (2009). The recognition of moving and static faces by young infants. Child Development, 80(4), 1259–1271.
Phillips, P. J., Grother, P., Micheals, R., Blackburn, D., Tabassi, E., & Bone, J. M. (2003). Face Recognition Vendor Test 2002 evaluation report. National Institute of Standards and Technology Interagency Report 6965. http://www.frvt.org.
Pike, G. E., Kemp, R. I., Towell, N. A., & Phillips, K. C. (1997). Recognizing moving faces: The relative contribution of motion and perspective view information. Visual Cognition, 4, 409–437.
Pilz, K. S., Thornton, I. M., & Bülthoff, H. H. (2006). A search advantage for faces learned in motion. Experimental Brain Research, 171, 436–447.
Pourtois, G., Schwartz, S., Seghier, M. L., Lazeyras, F., & Vuilleumier, P. (2005). View-independent coding of face identity in frontal and temporal cortices is modulated by familiarity: An event-related fMRI study. NeuroImage, 24, 1214–1224.
Roark, D., Barrett, S. E., Spence, M. J., Abdi, H., & O'Toole, A. J. (2003). Psychological and neural perspectives on the role of facial motion in face recognition. Behavioral and Cognitive Neuroscience Reviews, 2(1), 15–46.
Roark, D., Barrett, S. E., O'Toole, A. J., & Abdi, H. (2006). Learning the moves: The effect of familiarity and facial motion on person recognition across large changes in viewing format. Perception, 35, 761–773.
Sary, G., Vogels, R., & Orban, G. A. (1993). Cue-invariant shape selectivity of macaque inferior temporal neurons. Science, 260, 995–997.
Steede, L. L., Tree, J. T., & Hole, G. J. (2007). I can't recognize your face but I can recognize its movement. Cognitive Neuropsychology, 24(4), 451–466.
Thornton, I. M., & Kourtzi, Z. (2002). A matching advantage for dynamic faces. Perception, 31, 113–132.
Ungerleider, L. G., & Desimone, R. (1986). Cortical connections of visual area MT in the macaque. Journal of Comparative Neurology, 248, 190–222.
3
Investigating the Dynamic Characteristics Important for Face Recognition
Natalie Butcher and Karen Lander
It has long been established that as humans we are highly skilled at recognizing the faces of people we are familiar with (Braje, Kersten, Tarr, & Troje, 1998; Hill, Schyns, & Akamatsu, 1997; O’Toole, Roark, & Abdi, 1998). We are also adept in identifying particular characteristics from unfamiliar faces, such as their age, sex, race, and emotional state, with incredible accuracy (Lewis & Brookes-Gunn, 1979; McGraw, Durm, & Durnam, 1989; Mondloch et al., 1999; Montepare & Zebrowitz, 1998; Nelson, 1987; Zebrowitz, 1997). Accordingly, psychologists have long been interested in knowing when perception of a face is optimal, and in particular have been keen to understand the cognitive processes that occur when people learn previously unfamiliar faces and when they recognize known familiar faces. One area of research that has received recent focus has been the ‘‘motion advantage’’ in facial recognition (Knight & Johnston, 1997). Studies have demonstrated that the rigid and nonrigid movements shown by a face can facilitate its recognition, leading to more accurate (Pike, Kemp, Towell, & Phillips, 1997; Christie & Bruce, 1998; Lander, Christie, & Bruce, 1999) and faster (Pilz, Thornton, & Bu¨ltho¤, 2005) recognition when the results are compared with recognition from a single static image or multiple static images (Pike et al., 1997; Lander et al., 1999). By rigid motion, we refer to movement of the whole head; for example, when nodding or shaking it. In contrast, nonrigid motion refers to movement of the facial features themselves, as shown when talking or expressing thoughts. It is important that a possible dissociation has been revealed between the ability to recognize a face from the motion it produces and the ability to recognize it from a static image. In this work, Steede and colleagues (2007) reported the case study of a developmental prosopagnosic patient, CS. Despite CS being impaired at recognizing static faces, he was able to e¤ectively discriminate between di¤erent dynamic identities and demonstrated the ability to learn the name of individuals on the basis of their facial movement information (at levels equivalent to both matched and undergraduate controls). This case study indicates a possible dissociation between the cognitive mechanisms involved in the processes of recognizing a static face and those
involved in recognizing a dynamic face. However, previous attempts to determine whether an individual who is impaired in static face recognition can use facial motion as a signal to identity have revealed conflicting findings (Lander, Humphreys, & Bruce, 2004; see also chapter 2). Lander et al. (2004) tested a prosopagnosic patient, HJA on his ability to use motion as a cue to facial identity. In experiment 1, HJA attempted to recognize the identity of dynamic and static famous faces. HJA was found to be severely impaired in his ability to recognise identity and was not significantly better at recognizing moving faces than static ones. In order to test HJA’s ability to learn face–name pairings, a second experiment was conducted using an implicit face recognition task. In this experiment HJA was asked to try and learn true and untrue names for famous faces, which were shown in either a moving clip or a static image. HJA found this a di‰cult task and was no better with moving faces or true face-name pairings. Some prosopagnosic patients have previously found it easier to learn true face-name pairings more accurately and e‰ciently than untrue ones (covert recognition; de Haan, Young, & Newcombe, 1987). From this case study Lander et al. (2004) concluded that HJA was not able to overtly or covertly use motion as a cue to facial identity. Despite this, a third experiment demonstrated that HJA was able to decide whether two sequentially presented dynamic unfamiliar faces had the same or di¤ering identities. His good performance on this matching task demonstrates that HJA retains good enough motionprocessing abilities to enable him to match dynamic facial signatures, yet insu‰cient abilities to allow him to store, recognize, or learn facial identity on the basis of facial movements. Thus any possible cognitive dissociation between processing facial identity from a moving face and from a static one requires further investigation and discussion (see Lander et al., 2004 for elaboration and implications for models of face recognition). In this chapter we are interested in considering what it is about facial motion that aids recognition and what form this facilitating information takes. More specifically, we consider what parameters of facial motion are important in mediating the e¤ect of this robust motion advantage. Before discussing some recent work of our own, we outline the background and history of this research area. Origin of Motion Advantage Research and Theory
The role of motion in the recognition of faces was first indicated in research conducted by Bassili (1978) and Bruce and Valentine (1988) using point-light displays (Johansson, 1973) and familiar face stimuli. Bassili (1979) found that movement of the face increased the likelihood of recognition of basic facial expressions. Based on these findings, Bassili argued that during facial recognition the viewer can access a large body of surplus information, such as motion information, which can be used
to assist the recognition process when viewing conditions are nonoptimal. Bruce and Valentine (1988) extended the work of Bassili (1979), finding a significant advantage in performance when participants made identity judgments from moving-dot displays compared with still photographs. More recently, research has been conducted to directly investigate the motion advantage in facial recognition. In 1997 Knight and Johnston presented participants with photographically negated and non-negated famous faces as stimuli. The process of negation makes black areas white, light gray areas dark gray, and so forth, without removing any spatial information from the image. Overall, an advantage was found for recognizing moving negated faces in that moving faces were significantly better recognized than static faces. However, Knight and Johnston (1997) found that when faces were not negated, motion was not useful for recognition. They suggested, though, that this result does not mean that motion information was not useful in aiding recognition of a familiar face, but instead proposed two explanations for their results: first, that motion reinstates depth cues that are lost through negation, and second, that viewing a moving face provides information about its three-dimensional structure, allowing recognition of the characteristic facial features belonging to that individual. This second proposition of Knight and Johnston (1997) embodies one of the dominant hypotheses in the literature regarding the role of motion in facial recognition. The two dominant theories are the representation enhancement hypothesis and the supplemental information hypothesis (O'Toole, Roark, & Abdi, 2002). The first hypothesis (O'Toole et al., 2002) suggests that facial motion aids recognition by facilitating the perception of the three-dimensional structure of the face. It posits that the quality of the structural information available from a human face is enhanced by its motion, and this benefit surpasses the benefit provided by merely seeing the face from many static viewpoints (Pike et al., 1997; Christie & Bruce, 1998; Lander et al., 1999). Since this hypothesis is not dependent on any previous experience with an individual face, it seems plausible to predict that it is important in understanding how motion aids recognition of previously unfamiliar faces. The second hypothesis (the supplemental information hypothesis) (O'Toole et al., 2002) assumes that we retain the characteristic motions of an individual's face as part of our stored facial representation for that individual. For a particular individual's characteristic facial motions to be learned to an extent where they become intrinsic to that person's facial representation, experience with the face is needed. Experience or contact is required because characteristic motion information, or "characteristic motion signatures," is learned over time, allowing a memory of what facial motions a person typically exhibits to be stored as part of their facial representation. Therefore, when motion information for an individual has been integrated into the representation of their face, this information can be retrieved and used to aid recognition of that face. We would then expect, based on the supplemental information hypothesis,
that the more familiar with a face and its motion the observer is, the greater the motion advantage. This predicted result has recently been demonstrated by Butcher and Lander (in preparation) and will be discussed in detail later in this chapter along with further support for the supplemental information hypothesis. Learning and experience with an individual’s facial motion is key to this hypothesis, and as such, the supplemental information hypothesis is crucial to understanding the motion advantage for familiar faces and how e¤ective face learning occurs. It is imperative to consider that these two theoretical explanations of the motion advantage in facial perception are not mutually exclusive and their importance in understanding the motion e¤ect may be mediated by the particular viewing conditions, the characteristics of the face to be recognized, and the nature of the task. Indeed, research using unfamiliar faces has often demonstrated less robust and sometimes conflicting findings. Pike et al. (1997) describe four experiments investigating the e¤ects of rigid (rotational) motion on learning unfamiliar faces. The experiments utilized a learning and a test procedure. In the learning phase, the participants were informed that they were about to view a series of faces on a video. The video contained a series of six single static faces, six faces as a series of five static images, and six dynamic faces. After viewing the tape, the participant then moved into the test phase in which they were shown a series of faces and were asked to decide whether each face had been seen during the video they watched. The participants were also asked to indicate how confident they were about their recognition judgments. The results demonstrated higher hit rates for participants who viewed faces in motion than for those who viewed faces in either multiple static images or a single static image. Pike et al. (1997) attributed this e¤ect to the notion that when a face is seen in motion, a description can be derived that is not essentially ‘‘object-centered’’ (Marr, 1982) or dependent on a particular image or view of that face. Therefore motion cues may provide some of the information necessary to construct a useful description of a face, resulting in familiar faces being easily recognized regardless of viewpoint or image format. Later work by Lander and Bruce (2003) provided some evidence that nonrigid motion, as well as rigid motion, may be used to aid recognition of unfamiliar faces during learning. Higher performance rates were produced with both rigid (head movement) and nonrigid motion than with static conditions when moving faces were presented in a learning phase. However, the results also demonstrated that the advantage of nonrigid motion may not be due to the dynamic information provided by the motion. Indeed, viewing nonrigid motion in reverse also produced significantly better recognition than viewing a single static image. The ‘‘motion advantage’’ with reversed motion was comparable to that found when viewing the same nonrigid motion forward. Lander and Bruce (2003) proposed that the exact dynamic information of a moving face may be important only for the recognition of familiar faces,
when an individual’s characteristic facial movements have been learned and stored as part of the facial representation. Instead, they suggested that the advantage for rigid motion when learning faces may be attributed to increased perspective information, while the advantage provided by nonrigid motion may be due to increased attention to moving images. Research investigating the importance of motion when recognizing already familiar faces (Knight & Johnston, 1997; Lander et al., 1999; Lander, Bruce, & Hill, 2001) has obtained more consistent and robust results, with widespread support for the supplemental information hypothesis (O’Toole et al. 2002). When considering the supplemental information hypothesis, it is first vital to note that the recognition advantage for motion is not simply due to the increased number of static images shown when the face is in motion. In a series of experiments Lander et al. (1999) presented participants with a task in which they had to recognize individuals from a single static image, a moving image, or a series of static images using the same frames as those used in the moving-image condition. In their third experiment, two multiple static conditions were used. The first multiple-static condition maintained the correct sequence information, reflecting the order in which the static images were taken from the moving image. In the second multiple-static condition, the order in which the static images were displayed was jumbled so that sequence information was not maintained. The participants demonstrated lower recognition rates in both multiplestatic conditions than they did in the moving condition and there was no significant di¤erence between the two multiple-static conditions (correct sequence order and jumbled sequence order). It seems that what contributes to the beneficial e¤ect of motion is actually seeing the dynamic information that motion provides because in this experiment (Lander et al., 1999) there was still a beneficial e¤ect for recognition of famous faces from a moving sequence compared with that from a sequence of images. Using two synthetic heads with each animated by the movement of a di¤erent volunteer, Knappmeyer, Thornton, and Bu¨ltho¤ (2003) provided support for the idea that facial motion becomes intrinsic to the representation of an individual’s face by demonstrating the use of dynamic information in facilitating facial recognition. The participants viewed and thus became familiar with either head A with motion from volunteer A, or head B with motion from volunteer B. In the test phase, an animated head constructed from the morph of the two synthetic heads (A and B) was produced. The participants were asked to identify whose head was shown. It was found that the participants’ identity judgments were biased by the motion they had originally learned from head A or B, providing unmistakable support for the supplemental information hypothesis (O’Toole et al., 2002). As predicted by this hypothesis (O’Toole et al., 2002), this experiment demonstrates that representations of an individual’s characteristic facial motions are learned and are inherent to a representation
of that individual's face. These results, among others obtained from studies on the recognition of familiar faces, support the supplemental information hypothesis and pose the following question: What is it about seeing the precise motion information that a face produces that aids facial recognition?

Parameters Mediating the Motion Advantage
As previously mentioned, it is clear that the motion advantage in facial recognition is mediated by situational factors and characteristics of the face to be recognized. The viewing conditions under which facial motion is observed have been shown to a¤ect the extent to which motion aids recognition. Research indicates that disruptions to the natural movement of the face can influence the size of the motion advantage in recognition. Lander et al. (1999) and Lander and Bruce (2000) found lower recognition rates of famous faces when the motion was slowed down, speeded up, reversed, or rhythmically disrupted. Thus, seeing the precise dynamic characteristics of a face in motion provides the greatest advantage for facial recognition. A further demonstration of this point comes from research using both natural and artificially created (morphed) smiling stimuli (Lander, Chuang, & Wickham, 2006). In order to create an artificially moving sequence, Lander et al. (2006) used a morphing technique to create intermediate face images between the first and last frames of a natural smile. When shown in sequence, these images were used to create an artificially moving smile that lasted the same amount of time and had the same start and end point as the natural smile for that individual. They found that familiar faces were recognized significantly better when shown naturally smiling than with a static neutral face, a static smiling face, or a morphed smiling sequence. This further demonstrates the necessity for motion to be natural in order to facilitate the motion advantage. This finding is consistent with the predictions of the supplemental information hypothesis since access to characteristic motion patterns is presumably disrupted when facial motion is not consistent with the manner in which it is stored, i.e., at its natural tempo and rhythm. Repetition priming studies (Lander & Bruce, 2004) have found evidence that is consistent with the viewpoint that dynamic (motion) information is inherent to the representation of a particular individual’s face. Repetition priming is the facilitation demonstrated at test when the item to be recognized (here a face) has been encountered at some point prior to the test. It has been used to probe the nature of the representations underlying recognition, as for instance with regard to faces (e.g., Ellis, Flude, Young, & Burton, 1996; Ellis, Burton, Young, & Flude, 1997). It is proposed that when priming is sensitive to some change in the form of the faces between study and test phases, this parameter may be intrinsic to the representations that mediate face recognition. Such priming e¤ects have been demonstrated for words (Jackson
& Morton, 1984) and objects (Bruce, Carson, Burton, & Ellis, 2000; Warren & Morton, 1982) as well as for familiar faces (Bruce & Valentine, 1985). In the prime phase of Lander and Bruce’s (2004) experiments, participants were presented with a series of famous faces and asked to name or provide some semantic information about the person shown. Half of the faces were presented in static pose and half in motion. In the test phase, participants were presented with a series of faces and asked to make familiarity judgments about them, indicating whether each face was familiar or unfamiliar. Lander and Bruce (2004) found that even when the same static image was shown in the prime and the test phases, a moving image primed more e¤ectively than a static image (experiment 1). This finding was extended (experiment 2) to reveal that a moving image remains the most e¤ective prime, compared with a static image prime, when moving images are used in the test phase. Significantly, providing support for the notion of ‘‘characteristic motion signatures’’ inherent to a person’s face representation, they also found that the largest priming advantage was found with naturally moving faces rather than with those shown in slow motion (experiment 3). However, they also observed that viewing the same moving facial sequence at prime as at test produced more priming than using di¤ering moving images (experiment 4). As well as investigating the e¤ects of natural versus morphed motion, Lander et al. (2006) found a main e¤ect of familiarity, revealing that the nature of the face to be recognized can also mediate the e¤ect of motion on recognition. They suggested that the more familiar a person’s face is, the more we may be able to utilize its movement as a cue to identity. Indeed, characteristic motion patterns may become a more accessible cue to identity as a face becomes more familiar. However, recent research by Lander and Davies (2007) investigating this issue suggests that learning a face is rapid, and as such the beneficial e¤ect of motion is not highly dependent on the amount of time the face is seen. In this experiment they presented participants with a series of faces in which each was proceeded by a name, and participants were asked to try to learn the names for the faces. When the participants felt they had learned the names correctly, they continued onto the recognition phase in which they were presented with the same faces (using the same presentation method as in the learning phase), and asked to name the individual. The learning phase was repeated and participants were asked to try and learn the names of the faces again if any of the names were incorrectly recalled, after which they took the recognition test again. This procedure was replicated until a participant correctly named all twelve faces shown. In the test phase, participants were presented with forty-eight degraded faces; twentyfour as single static images and twenty-four moving. In the moving condition, the faces were each presented for 5 seconds. The participants were informed that some would be learned faces and some would be ‘‘new’’ faces, for which they had not learned names. They were asked to name the face or respond ‘‘new’’ and to provide
a response for every trial. Supporting the idea of rapidly learned characteristic motion patterns, the results showed that there was an advantage to recognizing a face in motion (at test) only when the face had been learned as a moving image. Conversely, when the face was learned as a static image, there was no advantage for recognizing moving faces. These findings support the notion that as a face is learned, information about its characteristic motion is encoded with its identity. Indeed, it seems that the participants were able to extract and encode dynamic information even when viewing very short moving clips of 5 seconds. Furthermore, the beneficial e¤ect of motion was shown to remain stable despite prolonged viewing and learning of the face’s identity in experiment 2. In this experiment, participants were assigned to one of four experimental groups. Group 1 viewed episode 1 of a TV drama before the test phase; group 2 viewed episodes 1 and 2; group 3 episodes 1 to 3; and group 4 episodes 1 to 4. Each episode was 30 minutes in length. In the test phase, participants viewed moving and static degraded images of the characters and were asked to try and identify them by character name or other unambiguous semantic information. The results revealed that although recognition of characters from the TV drama improved as the number of episodes viewed increased, the relative importance of motion information did not increase with a viewer’s experience with the face (O’Toole et al., 2002). The size of the beneficial e¤ect remained relatively stable across time, demonstrating how rapidly motion information, through familiarization with the face to be recognized, can be integrated into a representation of the face and used for recognition. Another characteristic of the face to be recognized that is important in understanding how the motion advantage is mediated is distinctiveness. Facial recognition research has demonstrated a clear benefit for faces that are thought to be spatially distinctive, as findings indicate that distinctive faces are more easily recognized than those that are rated as being ‘‘typical’’ (Bartlett, Hurry, & Thorley, 1984; Light, Kayra-Stuart, & Hollander, 1979; Valentine & Bruce, 1986; Valentine & Ferrara, 1991; Vokey & Read, 1992). It has also been established that a larger motion recognition advantage is obtained from distinctive motion than from typical motion (Lander & Chuang, 2005). Lander and Chuang (2005) found that the more distinctive or characteristic a person’s motion was rated, the more useful it was as a cue to recognition. This finding can be considered within Valentine’s (1991) multidimensional face space model of facial recognition, which is often used to provide an explanation for the spatial distinctiveness e¤ect. Valentine’s (1991) model posits that faces similar to a prototype or ‘‘typical’’ face are clustered closer together in face space, making them harder to di¤erentiate from each other and resulting in distinctive faces that are positioned away from this cluster being easily recognized. Owing to the homogeneous nature of faces as a whole, many faces are perceived as similar to the prototype, so their representations in face space
are clustered close to the prototype, making distinction among these faces more difficult. A similar theoretical explanation could be applied to moving faces in which faces in the center of the space move in a typical manner. Consequently, faces that exhibit distinctive facial motions could be located away from the center of the space, making them easier to recognize than faces displaying typical motion. Distinctiveness in facial motion may refer to (1) a motion characteristic or typical of a particular individual, (2) an odd motion for a particular individual to produce, or (3) a generally odd or unusual motion. Further work is needed to differentiate among these explanations of motion distinctiveness and to investigate the consequences of each in terms of the stored cognitive representation. It is at this point that we turn to describing our recent ongoing work (Butcher & Lander, in preparation), which has been conducted to expand on current knowledge of what parameters of the face to be recognized influence the motion advantage and how these parameters may be interlinked. Here we describe three experiments designed to investigate whether information about how distinctively a face moves becomes intrinsic to the representation of the face as part of the stored characteristic motion patterns. We also explore how much motion is exhibited by a face and whether this could provide information to aid discrimination among individual faces. In this work, the familiarity and distinctiveness of a face were also related to the size of the motion recognition advantage. Specifically, in experiment 1 participants were asked to recognize (from black-and-white, negated motion) famous faces and rate the same faces (without negation) for familiarity, how much motion was perceived, and how distinctive the facial motion was. The faces were always rated after the recognition test. The findings revealed that faces rated as distinctive movers were recognized significantly better than those rated as more typical movers (p < 0.01), supporting the earlier results of Lander and Chuang (2005). In addition, faces rated as moving a lot were also recognized significantly better than those rated as showing little motion (p < 0.05). Significant correlations were revealed between the correct recognition of identity, familiarity ratings, ratings of how much motion was shown, and distinctiveness of motion ratings. See table 3.1 for correlation results. A significant correlation was also revealed between participants' ratings of familiarity and ratings of how much motion was seen. Faces rated as being highly familiar to viewers were also rated as exhibiting high levels of motion. A significant positive correlation between face familiarity and distinctiveness of motion indicated that faces rated high in familiarity were also more likely to be rated as exhibiting distinctive motion. The final correlation across items demonstrated that faces rated as exhibiting high levels of facial motion were also rated as being significantly more distinctive movers. The results of this experiment allow us to look in detail at the relationships among the familiarity of a face, the amount it moves, and the type of motion displayed
Table 3.1
Correlations between moving recognition rates and the three rated factors: familiarity, how much motion is perceived, and distinctiveness of motion (N = 59)

                             Recognition   Familiarity   How much motion   Distinctiveness of motion
Recognition                  —
Familiarity                  0.64*         —
How much motion              0.40*         0.54*         —
Distinctiveness of motion    0.54*         0.69*         0.82*             —

* Correlation is significant at the 0.01 level (1-tailed).
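For readers who want to run this kind of item-level analysis themselves, the short sketch below shows one way to compute such one-tailed correlations in Python; it is only an illustration with hypothetical input files, not the authors' analysis code.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical item-level data: one value per famous-face clip (N = 59).
recognition = np.loadtxt("recognition_rates.txt")            # proportion correct per clip
distinct_motion = np.loadtxt("motion_distinctiveness.txt")   # mean rating per clip

r, p_two_tailed = pearsonr(recognition, distinct_motion)
# One-tailed p-value for the predicted positive relationship:
p_one_tailed = p_two_tailed / 2 if r > 0 else 1 - p_two_tailed / 2
print(f"r = {r:.2f}, one-tailed p = {p_one_tailed:.3f}")
```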
(distinctiveness) for individual face clips. Using the same set of stimuli as experiment 1, our second experiment was a recognition test. In this task, participants were asked to provide the name of the person shown or some other specific semantic information about the person. Similarly to experiment 1, the names of characters played or programs people had acted in (e.g., "Chandler Bing" or "Friends" for Matthew Perry) were deemed correct recognitions, as were unambiguous descriptions of the person. General information such as "comedian," "politician," or "actor," without further information about the person, was not sufficient to be regarded as correct recognition. Specifically, in this experiment participants were asked to recognize famous faces from both static and moving clips (only moving clips were used in experiment 1). This allowed us to directly investigate the motion recognition advantage (compared with static recognition) for individual face clips. Overall, a significant motion advantage was found; there was a higher number of correct identity judgments when participants were asked to recognize moving faces compared with the static ones. More detailed analysis of our results revealed that the size of the recognition advantage was mediated by each of the rated factors, so that faces that had been rated as highly distinctive movers displayed a greater motion advantage than those rated as being less distinctive movers (p < 0.05). Similarly, familiar faces gained more from being seen in motion than did less familiar faces (p < 0.05), and faces perceived to move a lot produced a larger motion advantage than those perceived as low in motion (p < 0.05). Significant positive correlations were found between the motion recognition advantage, the amount of motion displayed, distinctiveness of motion displayed, and face familiarity. See table 3.2 for correlation results. Building on the findings of experiments 1 and 2, experiment 3 investigated an important issue arising from these results. Our previous experiments demonstrated that familiar faces, those displaying high levels of motion, and those in which the motion is perceived to be distinctive are better recognized and gain more from being seen in motion (compared with recognition rates from static presentation) than the faces of people who are less familiar, move less, and move in a more "typical" manner.
Table 3.2
Correlations between the amount of advantage gained from viewing a face in motion compared with static images (difference) and the three rated factors: familiarity, how much motion is perceived, and distinctiveness of motion (N = 58)

                             Recognition advantage   Familiarity   How much motion   Distinctiveness of motion
Recognition advantage        —
Familiarity                  0.22*                   —
How much motion              0.23*                   0.49**        —
Distinctiveness of motion    0.24*                   0.71**        0.80**            —

* Correlation is significant at the 0.05 level (1-tailed). ** Correlation is significant at the 0.01 level (1-tailed).
Experiment 3 explored whether ratings of the amount and distinctiveness of facial motion were specifically linked to the clip shown or were related to famous face identity (more consistent across clips). In investigating whether participants' ratings about each face remained relatively consistent across clips of the same face, we extend our understanding of what parameters are stored as part of an individual's face representation. In experiment 3, twenty-two of the original famous face stimuli were used (set A), along with a 2-second clip (in a different context) of each of those twenty-two famous faces (set B). The experimental procedure consisted of the same tasks conducted in experiment 1 (recognition and rating). Both tasks were repeated, once with set B stimuli and again for set A. The recognition task was conducted first. The rating task was also carried out in the same manner as in experiment 1, with the additional item of "distinctiveness of the face" rated as well. The results found for sets A and B can be compared with one another as well as with previous findings (experiment 1). Recognition rates were significantly positively correlated within participants and between participants, both when the stimulus clips were the same (experiment 1 and experiment 3 set A data) and when they differed (experiment 1 and experiment 3 set B data). In addition, famous faces that were rated as being highly familiar in experiment 1 were also rated as being highly familiar by the participants in experiment 3 when they viewed both the original stimuli (set A) and new stimuli (set B). Similar positive correlations across testing sessions were found between ratings of motion distinctiveness and how much motion the faces produced in the clip. See table 3.3 for separate correlational results for sets A and B used in experiment 3. Displaying the data in this manner allows us to compare effects across these different sets of stimuli. First, in terms of consistent findings for both sets A and B: Correct recognition rates were significantly positively correlated with how familiar the face was and how distinctive it was perceived to be. There were also significant positive correlations between familiarity and face distinctiveness, and between how
Table 3.3
Correlations between moving recognition rates for stimulus sets A and B and the four rated factors: familiarity, how much motion is perceived, distinctiveness of motion, and distinctiveness of face (N = 22)

Set A
                             Recognition   Familiarity   How much motion   Distinct. of motion   Distinct. of face
Recognition                  —
Familiarity                  0.88**        —
How much motion              0.33          0.32          —
Distinctiveness of motion    0.47*         0.48*         0.90**            —
Distinctiveness of face      0.71**        0.76**        0.44*             0.63**                —

Set B
                             Recognition   Familiarity   How much motion   Distinct. of motion   Distinct. of face
Recognition                  —
Familiarity                  0.69**        —
How much motion              0.10          0.03          —
Distinctiveness of motion    0.40          0.39          0.89**            —
Distinctiveness of face      0.64**        0.83**        0.26              0.61**                —

* Correlation is significant at the 0.05 level (1-tailed). ** Correlation is significant at the 0.01 level (1-tailed).
distinctive the observed motion was and how much motion was perceived. It is interesting that there was a significant positive correlation between face distinctiveness and motion distinctiveness. Previous work (Lander & Chuang, 2005) found that rated motion distinctiveness had no impact on recognition of static faces. This reassures us to a certain degree that when participants are rating faces for motion distinctiveness, they are not simply confusing this with face distinctiveness. However, further work is needed to explore the relationship between face distinctiveness and motion distinctiveness. Second, in terms of inconsistent findings for sets A and B: For set A, as found in experiment 1, the distinctiveness of the observed motion was significantly correlated with face familiarity and recognition. Thus, distinctive movers were rated as being more familiar and were recognized better than more typical movers. This effect was not found with set B stimuli, where there was no significant correlation between the distinctiveness of the observed motion and face familiarity or recognition. It may be that, for example, the clips used in set B did not move as much as those used in set A. Ongoing work is estimating the amount of motion observed in each set of clips (sets A and B) using optic flow or face tracking techniques (see DeCarlo & Metaxas, 2000). The final inconsistency worth speculating on is that with set A stimuli, there was a significant positive correlation between the amount of rated motion and the rated distinctiveness of the observed face. This correlation was not found with set B
stimuli. It is unclear, at this relatively early stage of investigation, how such a relationship can be explained, because we would expect the amount a face moves to have no relation to motion distinctiveness. Again, additional work is needed to investigate exactly how participants are rating moving clips for motion distinctiveness. To summarize, our results show that ratings of motion distinctiveness and amount of motion exhibited may be related to identification rates for famous faces (set A). Ratings of these factors, as well as those of familiarity and distinctiveness of the face, remained relatively consistent across rating sessions both within (experiment 3) and between subjects (experiment 1 versus experiment 3 ratings). Despite the early stage of this work, we can speculate from these results that ratings of facial characteristics (facial distinctiveness, distinctiveness of motion, and amount of motion exhibited) are not completely clip dependent, but in fact seem more related to that person's identity. Indeed, if the effects of motion were solely related to the particular clip of the individual selected, then our results should show little consistency across different clips of the same individual. This is not the case. However, it is important to note that if a person moves in a particularly distinctive or exaggerated manner, then these characteristics should be demonstrated in most moving clips of that individual. Thus, the relationship between clip-dependent and person-dependent characteristics of motion is not clear-cut, but rather is somewhat complex and interdependent. Future advances in computer animation should allow us to directly investigate the relative importance of motion and static-based information in face recognition and learning by translating and mapping the dynamic parameters of one face onto the identity of another. Similarly, we should be able to manipulate the amount of motion, the distinctiveness (by exaggeration) of motion, and the distinctiveness of the face for each face clip in order to determine more precisely the dynamic characteristics important for face recognition and the nature of the stored representations (see Freyd, 1987). Perhaps in the future we may think of recognition of static faces as essentially a snapshot within an inherently dynamic process.

Summary
The motion advantage for facial recognition has long been understood to be a robust phenomenon when recognition performance from a moving stimulus is compared with performance from static stimuli. However, understanding why this effect occurs is still a matter of debate among psychologists and computer scientists alike. From the results of the research discussed, we suggest, in accordance with the supplemental information hypothesis, that the recognition benefit of motion can be explained by characteristic information, specifically characteristic motion patterns. The availability and learning of the typical motion a face produces builds a more distinct representation that can be used to differentiate a particular face from other faces
more accurately than from static cues alone. This discussion of the research to date also demonstrates the complexity of the motion advantage, indicating that although robust, the motion advantage in facial recognition is mediated by various characteristics of the face to be recognized and relies on seeing the precise dynamic information that the face exhibits. Importantly, new research has been effective in starting to probe the nature of characteristic motion patterns and in providing insights into the type of information that is integrated into the stored facial representations that facilitate recognition.

References

Bartlett, J. C., Hurry, S., & Thorley, W. (1984). Typicality and familiarity of faces. Memory and Cognition, 12, 219–228.
Bassili, J. (1978). Facial motion in the perception of faces and emotional expression. Journal of Experimental Psychology, 4, 373–379.
Bassili, J. N. (1979). Emotion recognition: The role of facial movement and the relative importance of upper and lower areas of the face. Journal of Personality and Social Psychology, 37, 2049–2058.
Braje, W. L., Kersten, D., Tarr, M. J., & Troje, N. F. (1998). Illumination effects in face recognition. Psychobiology, 26, 271–380.
Bruce, V., Carson, D., Burton, A. M., & Ellis, A. W. (2000). Perceptual priming is not a necessary consequence of semantic classification of pictures. Quarterly Journal of Experimental Psychology, 53A, 289–323.
Bruce, V., & Valentine, T. (1985). Identity priming in face recognition. British Journal of Psychology, 76, 373–383.
Bruce, V., & Valentine, T. (1988). When a nod's as good as a wink: The role of dynamic information in face recognition. In M. M. Gruneberg, P. E. Morris, and R. N. Sykes (eds.), Practical aspects of memory: Current research and issues (Vol. 1). Chichester, UK: Wiley.
Christie, F., & Bruce, V. (1998). The role of dynamic information in the recognition of unfamiliar faces. Memory and Cognition, 26, 780–790.
DeCarlo, D., & Metaxas, D. (2000). Optical flow constraints on deformable models with applications to face tracking. International Journal of Computer Vision, 38, 99–127.
De Haan, E. H. F., Young, A., & Newcombe, F. (1987). Faces interfere with name classification in a prosopagnosic patient. Cortex, 23, 309–316.
Ellis, A. W., Burton, A. M., Young, A. W., & Flude, B. M. (1997). Repetition priming between parts and wholes: Tests of a computational model of familiar face recognition. British Journal of Psychology, 88, 579–608.
Ellis, A. W., Flude, B. M., Young, A. W., & Burton, A. M. (1996). Two loci of repetition priming in the recognition of familiar faces. Journal of Experimental Psychology: Learning Memory and Cognition, 22, 295–208.
Freyd, J. J. (1987). Dynamic mental representations. Psychological Review, 94, 427–438.
Hill, H., Schyns, P. G., & Akamatsu, S. (1997). Information and viewpoint dependence in face recognition. Cognition, 62, 201–222.
Jackson, A., & Morton, J. (1984). Facilitation of auditory word recognition. Memory and Cognition, 12, 568–574.
Johansson, G. (1973). Visual perception of biological motion and a model for its analysis. Perception and Psychophysics, 14, 201–211.
Knappmeyer, B., Thornton, I. M., & Bülthoff, H. H. (2003). Facial motion can bias the perception of facial identity. Vision Research, 43, 1921–1936.
Knight, B., & Johnston, A. (1997). The role of movement in face recognition. Visual Cognition, 4(3), 265–273.
Lander, K., & Bruce, V. (2000). Recognizing famous faces: Exploring the benefits of facial motion. Ecological Psychology, 12, 259–272.
Lander, K., & Bruce, V. (2003). The role of motion in learning new faces. Visual Cognition, 10, 897–912.
Lander, K., & Bruce, V. (2004). Repetition priming from moving faces. Memory and Cognition, 32, 640–647.
Lander, K., Bruce, V., & Hill, H. (2001). Evaluating the effectiveness of pixelation and blurring on masking the identity of familiar faces. Journal of Applied Cognitive Psychology, 15, 116.
Lander, K., Christie, F., & Bruce, V. (1999). The role of movement in the recognition of famous faces. Memory and Cognition, 27, 974–985.
Lander, K., & Chuang, L. (2005). Why are moving faces easier to recognize? Visual Cognition, 12, 429–442.
Lander, K., Chuang, L., & Wickham, L. (2006). Recognizing face identity from natural and morphed smiles. Quarterly Journal of Experimental Psychology, 59, 801–808.
Lander, K., & Davies, R. (2007). Exploring the role of characteristic motion when learning new faces. Quarterly Journal of Experimental Psychology, 6, 519–526.
Lander, K., Humphreys, G. W., & Bruce, V. (2004). Exploring the role of motion in prosopagnosia: Recognizing, learning and matching faces. Neurocase, 10, 462–470.
Lewis, M., & Brookes-Gunn, J. (1979). Social cognition and the acquisition of self. New York: Plenum.
Light, L. L., Kayra-Stuart, F., & Hollander, S. (1979). Recognition memory for typical and unusual faces. Journal of Experimental Psychology: Human Learning and Memory, 5, 212–228.
Marr, D. (1982). Vision. San Francisco: W. H. Freeman.
McGraw, K. O., Durm, M. W., & Durnam, M. R. (1989). The relative salience of sex, race, age, and glasses in children's social perception. Journal of Genetic Psychology, 150, 251–267.
Mondloch, C. J., Lewis, T. L., Budreau, D. R., Maurer, D., Dannemiller, J. L., Stephens, B. R., & Kleiner-Gathercoal, K. A. (1999). Face perception during early infancy. Psychological Science, 10, 419–422.
Montepare, J. M., & Zebrowitz, L. A. (1998). Person perception comes of age: The salience and significance of age in social judgments. In M. P. Zanna (ed.), Advances in experimental social psychology (Vol. 30). San Diego, CA: Academic Press, pp. 93–161.
Nelson, C. A. (1987). The recognition of facial expressions in the first two years of life: Mechanisms of development. Child Development, 58, 889–909.
O'Toole, A. J., Roark, D. A., & Abdi, H. (2002). Recognizing moving faces: A psychological and neural synthesis. Trends in Cognitive Sciences, 6(6), 261–266.
O'Toole, A. J., Deffenbacher, K. A., Valentin, D., McKee, K., Huff, D., & Abdi, H. (1998). The perception of face gender: The role of stimulus structure in recognition and classification. Memory and Cognition, 26, 146–160.
Pike, G., Kemp, R., Towell, N., & Phillips, L. (1997). Recognizing moving faces: The relative contribution of motion and perspective view information. Visual Cognition, 4, 409–438.
Pilz, K., Thornton, I. M., & Bülthoff, H. H. (2005). A search advantage for faces learned in motion. Experimental Brain Research, 171, 436–447.
Steede, L. L., Tree, J. J., & Hole, G. J. (2007). I can't recognize your face but I can recognize its movement. Cognitive Neuropsychology, 24, 451–466.
Valentine, T. (1991). A unified account of the effects of distinctiveness, inversion, and race in face recognition. Quarterly Journal of Experimental Psychology Section A: Human Experimental Psychology, 43, 161–204.
Valentine, T., & Bruce, V. (1986). The effects of distinctiveness in recognizing and classifying faces. Perception, 15, 525–535.
Valentine, T., & Ferrara, A. (1991). Typicality in categorization, recognition and identification: Evidence from face recognition. British Journal of Psychology, 82, 87–102.
Vokey, J. R., & Read, J. D. (1992). Familiarity, memorability, and the effect of typicality on the recognition of faces. Memory and Cognition, 20, 291–302.
Warren, C., & Morton, J. (1982). The effects of priming on picture recognition. British Journal of Psychology, 73, 117–129.
Zebrowitz, L. A. (1997). Reading faces: Window to the soul? Boulder, CO: Westview Press.
4
Recognition of Dynamic Facial Action Probed by Visual Adaptation
Cristóbal Curio, Martin A. Giese, Martin Breidt, Mario Kleiner, and Heinrich H. Bülthoff
What are the neural mechanisms responsible for the recognition of dynamic facial expressions? In this chapter we describe a psychophysical experiment that investigates aspects of recognizing dynamic facial expressions in humans. Our latest development of a controllable 3D facial computer avatar (Curio et al., 2006) allows us to probe the human visual system with highly realistic stimuli and a high degree of parametric control. The motivation for the experiments reported here is that research on the neural encoding of faces, so far, has predominantly focused on the recognition of static faces. In recent years a growing interest in understanding facial expressions in emotional and social interactions has emerged in neuroscience. However, in spite of the fact that facial expressions are dynamic and time-dependent, only a few studies have investigated the neural mechanisms of the processing of temporally changing dynamic faces. Specifically, almost no physiologically plausible neural theory of the recognition of dynamically changing faces has yet been developed. At the same time, present research in different areas has started to investigate the statistical basis for the recognition of dynamic faces. To give an example, new computer graphics approaches that are based on enhanced statistical methods are able to encode a large variety of facial expressions (Vlasic, Brand, Pfister, & Popovic, 2005). These methods try to provide intuitive interfaces for users to semantically control the content of facial expressions (e.g., style, gender, identity). In perception research, it has proven useful to encode faces in a multidimensional face space (Rhodes, Brennan, & Carey, 1987; Valentine, 1991). Several recent studies have provided evidence on the neural representation of static faces in terms of continuous perceptual spaces (Jiang et al., 2006; Leopold, Bondar, & Giese, 2006; Leopold, O'Toole, Vetter, & Blanz, 2001; Löffler, Yourganov, Wilkinson, & Wilson, 2005; Webster & MacLin, 1999). A deeper understanding of the mechanisms for processing dynamic faces in biological systems might help to develop and improve technical systems for the analysis and synthesis of facial movements. Furthermore, experiments on the human perception of moving faces that exploit highly controllable stimuli may provide new insights into the neural encoding of dynamic facial
expressions in the brain. Modern animation techniques have opened a new window to designing controlled studies to investigate the role of dynamic information in moving faces (Hill, Troje, & Johnston, 2005; Knappmeyer, Thornton, & Bülthoff, 2003). See also related chapters in this part of the book. The study we present here focuses on a visual adaptation experiment based on a novel 3D animation technique that provided us with a continuous facial expression space.

High-Level Aftereffects
High-level aftereffects for static face stimuli have recently been a central focus of research, potentially providing important insights into the neural structure of the cortical representation of faces. Exploiting advanced techniques for the modeling and parameterization of face movements, we report on a study that investigates high-level aftereffects in the recognition of dynamic faces. We address specifically the question of how such aftereffects relate to the aftereffect for static face stimuli (Leopold et al., 2001; Webster & MacLin, 1999). Aftereffects have been described for a variety of low-level visual functions, such as motion perception (Mather, Verstraten, & Anstis, 1998; Watanabe, 1998) and orientation perception (Clifford et al., 2007). While these classical aftereffects mainly address lower-level visual functions, more recently, high-level aftereffects have been observed for the recognition of faces (Leopold et al., 2001; Webster & MacLin, 1999). For example, Leopold et al. (2001) used a face space generated by 3D morphing (Blanz & Vetter, 1999) that permitted the creation of "anti-faces." Anti-faces are positioned in a face space on the opposite side of the average face with respect to the "original face" identity. Extended presentation of such static "anti-faces" to human observers temporarily biases the perception of neutral test faces toward the perception of specific identities. This high-level aftereffect has been interpreted as evidence that the mean face may play a particular role in the perceptual space that encodes static faces. Recent experiments have reported high-level aftereffects also for the perception of complex motion presented as point-light walkers (Jordan, Fallah, & Stoner, 2006; Roether, Omlor, Christensen, & Giese, 2009; Troje, Sadr, Geyer, & Nakayama, 2006). These aftereffects were reflected by shifts in category boundaries between male and female walks or between emotions. Viewing a gender-specific gait pattern, e.g., a female walk, for an extended period biased the judgments for subsequent intermediate gaits toward the opposite gender. It seems unlikely that this adaptation effect for dynamic patterns can be explained by adaptation to local features of the stimuli. Instead, it most likely results from adaptation of representations of the global motion of the figures. This suggests that high-level aftereffects can also be induced in representations of highly complex dynamic patterns.
High-level Aftereffects in the Recognition of Facial Expressions?
Motivated by the observation of these high-level aftereffects both for static pictures of faces and for dynamic body stimuli, we tried to investigate whether such aftereffects also exist for dynamic facial expressions. In addition, we tried to study how such aftereffects are related to those for static faces and to temporal order. Using a novel 3D animation framework that approximates facial expressions by linear combinations of facial action units (AUs) recorded from actors, we created dynamic "anti-expressions" in analogy to static "anti-faces" (Leopold et al., 2001) and tested their influence on expression-specific recognition performance after adaptation. We found a new aftereffect for dynamic facial expressions that cannot be explained on the basis of low-level motion adaptation. The remainder of this chapter is structured as follows. First we provide some details on the system for the generation of parameterized dynamic facial expressions. Then we report on the experimental adaptation study that was based on this novel dynamic face space. In the explanation of our results, we show how our computer graphics-generated expressions can be used to computationally rule out low-level motion aftereffects. We conclude this chapter with a discussion of implications for neural models of the processing of dynamic faces.

Construction of a Continuous Facial Expression Space
For our study we created dynamic face stimuli using a three-dimensional morphable face model that is based on 3D scans of human heads (Curio et al., 2006; Curio, Giese, Breidt, Kleiner, & Bülthoff, 2008). Our method for modeling dynamic expressions is based on the Facial Action Coding System (FACS) proposed by Ekman and Friesen (1978). Action units play an important role in psychology and have been discussed as fundamental units for the encoding of facial actions, potentially even reflecting control units on a muscular level. As such, action units provide a useful basis with semantic meaning and are a standard method for encoding facial expressions. The decoding of expressions from videos or photographs based on FACS is usually an elaborate process. To use FACS in an automated fashion for animation, we adopted the approach used by Choe and Ko (2001). Basically, they propose transferring motion from facial motion capture data to a computer graphics model using an optimization technique. The approach consists of fitting a linear combination of keyframes of static facial expression components to 3D motion capture data and applying the results to hand-crafted shapes of a computer graphics model for rendering. The process is done on a frame-by-frame basis. For our purpose, we applied this approach to our real-world dataset of 3D action units, resulting in realistic 3D facial
expressions. Curio and his colleagues have shown how complex facial movements can be decomposed automatically and then be synthesized again convincingly by this low-dimensional system based on FACS (Curio et al., 2006).

Parameterizing a Generative Facial Expression Space
Within our work, FACS is represented by mathematically parameterized recordings of action units executed by real actors. We recorded FACS both as static 3D facial surface scans (see figure 4.1), along with high-resolution photographs, and, in different sessions, as sparse 3D motion capture data acquired with a commercial VICON system, providing dynamic recordings of facial actions. We now describe how we made use of the two recording modalities by combining them to produce high-fidelity facial animation.

Morphable Model for the Analysis of Facial Expressions

To obtain a low-dimensional parameterization of dynamic facial movements, the goal was to approximate natural facial expressions by a linear combination of facial action units. The input to the facial animation system is facial motion capture data recorded with a VICON 612 motion capture system, using K = 69 reflective markers at a temporal resolution of 120 Hz (see figure 4.2, left). Recorded facial action unit sequences M_AU,i were segmented manually into a neutral rest state at t = T_S, a maximum peak expression at t = T_P, and the final neutral state at time t = T_E. The data of interest can be characterized by an action unit sequence as a collection of 3D point matrices
$$M_{AU,i} = [\, p_{t=T_S,i}, \;\ldots,\; p_{t,i}, \;\ldots,\; p_{t=T_P,i}, \;\ldots,\; p_{t=T_E,i} \,].$$

This data stream is characterized by the K reconstructed 3D facial marker coordinates concatenated in the vectors p_t ∈ R^{K×3} for each time t. We randomly chose one action unit time course as a neutral reference frame M_AU,0 = p_REF = p_{T_S}. Overall we recorded N = 17 action units from actors. Having determined the keyframes M_AU,i(t = T_P), i ∈ {1, ..., N}, of each action unit sequence corresponding to the static peak expressions, we aligned them to the neutral reference keyframe M_AU,0 and in this way eliminated the rigid head movement. For this rigid alignment we chose a few 3D markers that did not move. The vector residuals M_AU,i = M_AU aligned,i − M_AU,0 define the components of our action unit model for analysis, signified by

$$\mathcal{M}_{AU} = \{\, M_{AU,i} \in \mathbb{R}^{K\times 3} \mid i \in \{1,\ldots,N\} \,\},$$

where each M_AU,i can be viewed as a sparse 3D displacement vector field in which the vectors originate at the K 3D motion capture marker positions of a subject's neutral face (figure 4.2, top).
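As a concrete illustration of this preprocessing step, the sketch below builds such a displacement basis with numpy, using a standard Kabsch rigid alignment on a subset of markers assumed to be stable; this is our own minimal reconstruction under those assumptions, not the authors' pipeline, and all array and function names are hypothetical.

```python
import numpy as np

def rigid_align(points, reference, stable_idx):
    """Rigidly align a (K x 3) marker frame to a reference frame using the
    Kabsch algorithm on markers assumed not to move (rotation + translation)."""
    P, Q = points[stable_idx], reference[stable_idx]
    Pc, Qc = P - P.mean(0), Q - Q.mean(0)
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)                      # 3 x 3 cross-covariance
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])  # guard against reflections
    R = U @ D @ Vt
    t = Q.mean(0) - P.mean(0) @ R
    return points @ R + t

def build_au_basis(peak_keyframes, neutral, stable_idx):
    """AU displacement basis M_AU: aligned peak keyframe minus the neutral
    reference frame, one (K x 3) residual per recorded action unit."""
    return np.stack([rigid_align(M_peak, neutral, stable_idx) - neutral
                     for M_peak in peak_keyframes])          # shape (N, K, 3)
```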
Figure 4.1
Samples from our 3D action unit head database recorded with actors performing actions according to the Facial Action Coding System suggested by Ekman and Friesen (1978). Our database consists of 3D head scans and high-resolution texture images (not rendered here). In different sessions we dynamically acquired the same set of action unit movements with a sparse marker-based motion capture system (VICON).
Figure 4.2
Performance-driven animation pipeline with data captured from the real world. An action unit basis was obtained from motion-captured sequences and textured 3D scans. Frame by frame, expressions can be retargeted from motion capture to the action unit morph shapes. In particular, in the top row a selection of action units from the motion-capture morph basis M_AU is shown. It is mainly used to automatically decompose the facial movements into their action unit activations. The basis consists of the peak activations of the recorded time course for the dynamic action unit. The bottom row shows the corresponding action unit morph basis S_AU obtained from 3D scans. A facial expression is decoded by projecting any new facial motion capture data (left) for each time point t onto the basis M_AU by constrained optimization. The resulting weight vector w(t) can then be directly applied to the graphics model for rendering based on morphing. The rigid head movement is handled separately from the nonrigid intrinsic facial movements.
Morphable Model for the Synthesis of Facial Expressions

Our method for the synthesis of realistic 3D faces is based on high-resolution 3D scans, morphing between shapes (S_i) and textures (T_i) that are in dense correspondence (Blanz & Vetter, 1999). Blanz and Vetter used convex linear combinations of shapes, $\sum_{i}^{m} a_i S_i$, and textures, $\sum_{i}^{m} b_i T_i$, for the photorealistic simulation of static faces with identities S_ID. They modeled shape by dense triangle surface meshes represented as L 3D vertex points. Such an approach has even been used to analyze and synthesize facial expressions in videos (Blanz, Basso, Poggio, & Vetter, 2003). For our rendering purposes, we adapted this morphing scheme to linear combinations of scans of facial action units. This action unit basis corresponds to the action unit basis for recorded motion capture, with the difference that it is obtained directly from static snapshots of 3D scanned action units that were posed by actors (see the processed data in figure 4.1 and the bottom row of figure 4.2). These surface scans were manually set into correspondence with the 3D software headus CySlice, establishing correspondence between the
recorded scan of the actors' neutral expression and a reference 3D head mesh, resulting in 3D morph shapes. Let

$$\mathcal{S}_{AU} = \{\, S_{AU,i} \in \mathbb{R}^{L\times 3} \mid i \in \{1,\ldots,N\} \,\}$$

be our shape basis (figure 4.2, bottom), where S_AU,i = S_AU correspondence,i − S_AU,neutral are meshes that are in correspondence and aligned with a neutral face S_AU,neutral. Note that with the dimensionality L ≫ K, the number of mesh vertices of the rendering model greatly exceeds the number K of motion capture markers used. On top of the expression shapes we acquired high-resolution texture images. The vertices of the mesh are mapped into this 2D photograph by texture coordinates, through which the image can be projected onto the 3D head models of action units.

Animation of Facial Expression by Motion Transfer
Our animation method has two components. The first one is shown in the upper half of figure 4.2. We approximate face movements recorded with motion capture by a linear superposition of static keyframes of 3D facial action units, based on the motion capture basis M_AU introduced earlier. For each instant in time, the face can be characterized by a weight vector w(t) that specifies how much the individual action units contribute to the approximation of the present keyframe. The second step of the algorithm transfers w(t) to a dynamic morphable face model that creates photorealistic animations. A more detailed description of the different components of the animation algorithm follows.

Action Unit Model Fitting for Motion Capture Data

A low-dimensional approximation of any dynamic facial expression is obtained by first aligning a motion-captured frame with a reference frame M_AU,0, resulting in M_e(t). We can then express the resulting 3D point configuration as a linear superposition of the action units' displacement vectors M_AU,i (figure 4.2, left) according to
$$M_e(t) = M_{AU,0} + \sum_{i=1}^{N} w_{e,i}(t)\, M_{AU,i}.$$
This can be stated as a standard least-squares optimization problem. In addition we want to constrain the weights to be positive, w_i ≥ 0, and thus use quadratic programming, similarly to Choe and Ko (2001). This is possible because, by definition, only facial action units should be activated, since they correspond to muscular activations. Thus, projecting configurations of facial expressions into this action unit space should result only in positive weights w_i.
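One minimal way to reproduce this constrained projection is non-negative least squares, which solves exactly this positivity-constrained problem for a single frame; the chapter itself uses quadratic programming following Choe and Ko (2001), so the sketch below should be read as an equivalent stand-in rather than the original implementation, with hypothetical variable names.

```python
import numpy as np
from scipy.optimize import nnls

def fit_au_weights(frame, neutral, au_basis):
    """Project one rigidly aligned mocap frame (K x 3) onto the AU displacement
    basis (N x K x 3) with non-negative weights, i.e. solve
    min_w || (frame - neutral) - sum_i w_i * M_AU_i ||^2  subject to  w >= 0."""
    A = au_basis.reshape(au_basis.shape[0], -1).T   # (3K x N) design matrix
    b = (frame - neutral).ravel()                   # (3K,) displacement to be explained
    w, _ = nnls(A, b)                               # non-negative least squares
    return w

# Applying this frame by frame yields the weight time courses w_e(t):
# weights = np.array([fit_au_weights(f, neutral, au_basis) for f in mocap_frames])
```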
Figure 4.3
Estimated time courses of morph weights w_e(t) for the expressions "disgusted" (left) and "happy" (right) for each action unit. The sequences were sampled at 120 Hz. In our experiments we eliminated eye blink by setting action unit 12 to zero during animation.
The novelty of our work is that we approximate expressions with real-world recorded motion capture action units and with 3D scans of action units that we have enriched with photographed textures for realistic animation. Examples of time courses for two different expressions are shown in figure 4.3. We can project the facial movements of an actor frame by frame onto the space of action units, resulting in the morph weights w_e(t).

Performance-Driven Animation with a 3D Action Unit Head Model

Now we take advantage of having action unit models available for both motion capture and 3D scans. With these estimated action unit morph
Figure 4.4
Snapshots taken from performed and animated facial action sequences. A nonrigid facial expression is projected onto a motion-capture morph basis of facial action units. The morph weights are applied to a basis of corresponding 3D action unit head scans by morphing. We acquired for both modalities the same N = 17 basis components (see also figure 4.2), independently, in different recording sessions and from different actors.
weights w_e(t) we are able to animate heads with different facial identities (S_ID,0) by morphing between the elements of the shape actuation basis (figure 4.2, right) frame by frame, resulting in an animated facial expression sequence:

$$S_e(t) = S_{ID,0} + \sum_{i=1}^{N} w_{e,i}(t)\, S_{AU,i}.$$
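Assuming the scan-based shape basis S_AU and the identity mesh are available as arrays in dense correspondence, the per-frame synthesis step reduces to a single weighted sum over the basis, as in this hypothetical sketch.

```python
import numpy as np

def synthesize_frame(w_t, identity_shape, shape_au_basis):
    """One frame of the animated expression:
    S_e(t) = S_ID,0 + sum_i w_e,i(t) * S_AU,i."""
    # w_t: (N,) morph weights estimated from motion capture for this frame
    # identity_shape: (L, 3) neutral head mesh of the chosen identity S_ID,0
    # shape_au_basis: (N, L, 3) action unit displacement shapes in correspondence
    return identity_shape + np.tensordot(w_t, shape_au_basis, axes=1)

# Reduced expressions or anti-expressions (see below) are obtained by passing g * w_t.
```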
Thus, this animation pipeline defines a low-dimensional space for dynamic facial expressions. The rigid head motion that was eliminated in the motion capture estimation process can now be applied again to the computer graphics model by a simple 3D transformation of S_e(t). Snapshots of the analyzed motion-capture setup and of the animated 3D head based on their respective action unit systems are shown in figure 4.4.

Manipulation of Facial Expressions

In particular, in the study we report here, our expression space allows the generation of reduced expressions and anti-expressions by rescaling the estimated morph weight vector:
$$w_e'(t) = g\, w_e(t), \qquad g \in \mathbb{R}.$$
For example, expressions with reduced strength correspond in this context to morphing gains g with 0 ≤ g < 1, while a global negative morphing gain, g < 0, is defined by us as a dynamic anti-expression, in analogy to the anti-faces introduced by Leopold et al. (2001). In addition, while keeping the dynamic movement signature from one performance actor, we can also exchange identities by simply morphing between
different identities S_ID,0. For this purpose we make use of the Max Planck Institute's MPI 3D head database, consisting of about 200 three-dimensional face scans without expression. These heads are in dense mesh correspondence with each other and with our processed 3D action unit head scans S_AU,i. In addition, we could morph the individual head texture T_i, but for our studies we used only two different shape identities (Id A and Id B) with one common texture T. Rigid head movements can contribute essential information to the recognition of dynamic facial expressions. Since we wanted to study specifically the influence of nonrigid intrinsic facial movements, we eliminated the expression-specific rigid head movements from our rendered stimuli. However, since we wanted to minimize the influence of possible low-level adaptation effects, we added a neutral rigid head movement to our animated intrinsic movements. It was given by one simple sinusoidal motion trajectory in 3D space.

Adaptation Study of Dynamic Expression
Based on the results of the adaptation study with static faces (Leopold et al., 2001), we might predict that adaptation with anti-expressions should increase the sensitivity of perception for compatible expressions with reduced strength. Conversely, one might expect that adaptation with incompatible anti-expressions might decrease the sensitivity for faces with reduced expression strength. As adapting stimuli, we used our constructed anti-expressions. In two main experiments we investigated the influence of dynamic versus static adaptation stimuli and the influence of temporal order on the adaptation effect. To compare the results with static adaptation effects, we derived static "anti-expressions." These stimuli were given by one keyframe that corresponded to the peak anti-expression. These static adaptors were presented for the same duration as the dynamic adaptation stimuli. Expression lengths were time-aligned by subsampling, guaranteeing the same time intervals T_P between neutral and peak expressions. In addition, we tested the transfer of adaptation to faces with different identities. The animated facial expressions were rendered in real time at a 50-Hz monitor refresh rate with OpenGL under MATLAB Psychophysics Toolbox-3 (Brainard, 1997; Pelli, 1997), which has been modified and extended for this purpose by M. Kleiner. Subjects viewed the stimuli at a 90-cm distance, and the stimuli subtended a vertical visual angle of 12 degrees.

Perceptual Calibration
First, we conducted a prestudy to perceptually calibrate each subject’s recognition rates within our expression’s morph space. The morphing parameter g introduced earlier allows us to monotonically vary the intensity of nonrigid facial movements
and thus an observer's ability to recognize the dynamic expressions. Since recognition rates might vary across subjects and expressions, we needed to calibrate this space separately for each subject and expression by measuring a psychometric function that describes the relationship between the morphing gain g and the recognition probability of that expression. For this purpose, stimuli with morphing gains g ∈ {0.1, 0.15, 0.2, 0.25, 0.5} were presented in random order, and the subjects had to respond in a three-alternative forced-choice task whether they perceived one of the two expressions or a neutral expression (g = 0). The presentation frequency of a neutral expression was balanced against the "disgusted" and "happy" expressions. The measured recognition probabilities P(g) were fitted by the normalized cumulative Gaussian probability function

$$P = \phi(z) = 0.5\left[1 + \operatorname{erf}\!\left(\frac{z}{\sqrt{2}}\right)\right] \qquad \text{with} \qquad z = \frac{g - m}{s}.$$

Expressions with reduced strength were defined by the morphing weights that, according to this fitted function, corresponded to a recognition probability of P_C = 0.2.
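Such a calibration can be reproduced by fitting the cumulative Gaussian to the measured recognition probabilities and inverting the fit at the criterion P_C = 0.2; the sketch below illustrates one way to do this, using made-up data points rather than the study's measurements.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.special import erf, erfinv

def cumulative_gaussian(g, m, s):
    """P(g) = 0.5 * (1 + erf(z / sqrt(2))) with z = (g - m) / s."""
    return 0.5 * (1.0 + erf((g - m) / (s * np.sqrt(2.0))))

# Hypothetical calibration data: morphing gains and measured recognition probabilities.
gains = np.array([0.10, 0.15, 0.20, 0.25, 0.50])
p_recognized = np.array([0.05, 0.15, 0.40, 0.65, 0.95])

(m_hat, s_hat), _ = curve_fit(cumulative_gaussian, gains, p_recognized, p0=[0.2, 0.1])

# Invert the fitted function to find the gain that yields P_C = 0.2 for this subject.
P_C = 0.2
g_reduced = m_hat + s_hat * np.sqrt(2.0) * erfinv(2.0 * P_C - 1.0)
```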
Dynamic versus Static Adaptors and the Influence of Identity

In a first experiment we tested the influence of dynamic versus static adaptors (dynamic anti-expressions versus extreme keyframes as anti-expressions) on the perception of the reduced-strength expressions obtained in the perceptual calibration described earlier. In addition, adaptors with the same or a different identity from the test face were used. For all experiments we fixed the identity (Id A) of the test expressions. For all adapting expressions, we eliminated rigid head movements (head rotation and translation) and superimposed a sinusoidal 3D head motion trajectory for a normalized time length of the expressions employed. We denote from here on a neutral expression as consisting of the neutral face translating with this sinusoidal head trajectory in space. All other adaptation expressions with intrinsic movements were generated in a similar way by simply superimposing this 3D sinusoidal trajectory onto the respective intrinsic head movements, again with rotational head movement components eliminated. We used the dynamic neutral expression as an adaptor for one identity in order to obtain a baseline recognition rate for the test expressions after adaptation. Adaptation stimuli were presented for 10 seconds, corresponding to five repetitions of the individual anti-expressions, followed by a short interstimulus interval (ISI) (400 ms). Subsequently, the test stimulus cycle was presented, and subjects had to decide in a two-alternative forced-choice task whether they had perceived a "happy" or a "disgusted" expression (figure 4.5a and b). The results of this experiment are shown in figure 4.6 on the left. All tested adaptation conditions show an increase in recognition of the matching expression compared with the baseline (adaptation with a neutral face). This increase was significant
Figure 4.5
Experimental protocol: (a) Timing for exposure to adaptation stimuli, blank interstimulus interval (ISI), and reduced test expression. (b) Subjects adapt to an anti-expression (global morph gain g = −1), which is rendered either as a dynamic versus a static anti-expression (experiment E1) or as a dynamically played forward versus reverse (experiment E2) expression sequence. In a two-alternative forced-choice task, subjects had to judge after adaptation which expression they saw. To obtain a recognition baseline for test expressions, the subjects also adapted to neutral expressions. One would expect recognition performance for the baseline test expression to increase for matching adapt-test expression pairs (conditions along horizontal arrows) and to decrease for nonmatching adapt-test stimulus pairs (conditions along diagonal arrows).
for the conditions with matching identity of adaptation and test face (t > 3.4, p < 0.002). For dynamic and static adapting faces with different identities, this increase failed to reach significance. For nonmatching expressions, the recognition rate was significantly reduced for all adaptation conditions (t < −2.33, p < 0.05). A three-way dependent-measures analysis of variance (ANOVA) shows a significant influence of the test expression ("happy" versus "disgusted") (F(1,91) = 48.3, p < 0.001) and of the factor matching versus nonmatching expression (adapt-test) (F(1,91) = 94, p < 0.001). The third factor, adaptation stimulus (dynamic versus static face with the same or a different identity), did not have a significant influence, and all interactions were nonsignificant. This indicates that the observed aftereffect shows a clear selectivity for the tested expression. A more detailed analysis of the responses for the matching expressions reveals a significant influence of only the test expression [F(1,45) = 26, p < 0.001] and a marginally significant influence of the adaptation stimulus [F(1,45) = 2.8, p < 0.062], where a post hoc comparison shows that adaptation with a dynamic anti-expression and the same identity induced significantly higher increases of the recognition rates than adaptation with static or dynamic anti-expressions with different identities (p < 0.04). This indicates that
Figure 4.6
Recognition rates for reduced test expressions (with standard deviations, averaged over test expressions) relative to baseline recognition performance for experiment 1 (N = 8, p < 0.05) and experiment 2 (N = 9, p < 0.05) for the different adaptation conditions. The baseline for recognition of the test expression was obtained throughout each session with neutral faces as the adaptor. The postadaptation recognition relative to the baseline is plotted for matching and nonmatching adapt-test stimulus conditions.
adaptation effects for matching conditions were particularly strong for matching facial identity.

Role of Forward versus Reverse Temporal Order
In a second experiment we wanted to test how the temporal order of the adapting stimulus affects the adaptation effects. Specifically, if the observed adaptation process were based purely on the adaptation of individual keyframes, no influence of the temporal order on the adaptation effects would be expected. The design of this experiment was identical to experiment 1. However, in half of the trials, instead of static adaptation stimuli, dynamic anti-faces played in reverse temporal order were presented (figure 4.6, right). Overall we tested nine subjects. The results of this second experiment are shown on the right side of figure 4.6. Again, for all tested adaptors we observed significant adaptation effects, with a significant increase in the recognition of the matching expressions compared with the baseline and a significant decrease for nonmatching expression conditions (|t_17| > 2.5, p < 0.02). A more detailed analysis using a three-way dependent-measures ANOVA shows a significant influence of matching versus nonmatching adapting expression [F(1,98) = 21.3, p < 0.001] and of the test expression ("happy" or "disgusted") [F(1,98) = 118, p < 0.001]. There was no significant difference between the different types of adapting stimuli (forward versus reverse temporal order, and matching versus nonmatching identity); i.e., there was no influence of the factor type of adaptor. In addition, all interactions among the three factors were nonsignificant. These results again show expression-specific adaptation. A comparison of the different conditions, and specifically between adaptors with forward and reverse sequential order, did not reveal any significant differences. This implies that the observed adaptation seems equally efficient for sequences with forward and reverse temporal orders.
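A three-way dependent-measures ANOVA of this kind can be run, for example, with statsmodels' AnovaRM; the snippet below is a generic illustration of that analysis structure with hypothetical column and file names, not the authors' original analysis script (which the chapter does not specify).

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Hypothetical long-format table: one recognition score per subject and condition.
df = pd.read_csv("experiment2_results.csv")
# Expected columns: subject, recognition, match ("match"/"nonmatch"),
# expression ("happy"/"disgusted"), adaptor ("forward"/"reverse"/...).

anova = AnovaRM(data=df, depvar="recognition", subject="subject",
                within=["match", "expression", "adaptor"],
                aggregate_func="mean").fit()
print(anova)
```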
Can Low-Level Motion Adaptation Account for the Aftereffect?

Dynamic facial expressions create dense optic flow patterns that produce low-level motion aftereffects. For experiment 1 we can rule out low-level aftereffects, since the adaptors were static. Therefore, if the sum of local curvature cues has not caused the effect, we expect higher-level mechanisms to be responsible for our results. However, one might still raise the concern that the effects observed in experiment 2 might mainly reflect adaptation of low-level motion-processing mechanisms. Our novel technique of generating stimuli using computer graphics allowed us to control for this possible confound by computing the optic flow patterns generated by normal test expressions and the anti-expressions played in forward and reverse order. The size of possible low-level aftereffects is determined by the similarity of the local motion patterns generated by the adaptation and the test stimuli. In the presence of such
aftereffects, highly correlated local motion patterns (with motion in the same direction) should lead to reduced recognition of the test pattern as a result of adaptation of local motion detectors activated by the adaptor. However, antiparallel motion patterns would predict an increased recognition of the test pattern, since in this case the adapting motion pattern would induce a kind of "waterfall illusion." To obtain a measure for the size of possible low-level motion aftereffects, we computed the correlations of the optic flow patterns generated by adaptation and test stimuli, aligning for the positions of adaptation and test faces. From the 3D face model, all corresponding vertex positions in the image plane, X = {x_i ∈ R² | x_1, ..., x_L}, can be computed. From subsequent frames, for each point in time, the corresponding 2D optical flow vectors v_t = {v_1, ..., v_L} can be determined. As a coarse estimate of the overlap of the local motion information, we sampled these optic flow fields with a regular reference grid G_g (23 × 11), where the valid motion vector for each grid point was determined by nearest-neighbor interpolation (see figure 4.7). Each grid point x_g ∈ G_g was assigned the flow vector v(x_g) = v_i with i = argmin_i |x_g − x_i| inside the face regions of two consecutive snapshots of the animated expression. Outside the face regions, the flow vectors were set to zero. The correlation measure between the flow vector fields was defined as the sum over all scalar products of flow vectors. The summation was carried out separately over products with positive and negative signs, defining a measure R_Motion,↑↑ for the amount of "parallel" and a measure R_Motion,↓↑ for the amount of "anti-parallel" optic flow of the anti- and test stimuli. Signifying the corresponding flow vector fields by v_A and v_B, at each time t and grid point x_g the scalar product is given by

$$C(x_g, t) = \langle v_A(x_g, t),\, v_B(x_g, t) \rangle.$$

The two correlation measures are obtained by integrals over space and the normalized time for the time-aligned rendering duration T:

$$R_{Motion,\uparrow\uparrow} = \frac{1}{Z} \int_{G_g} \int_{t=0}^{T} \big\lfloor C(x_g, t) \big\rfloor_{+} \, dt \, dx_g$$

and

$$R_{Motion,\downarrow\uparrow} = \frac{1}{Z} \int_{G_g} \int_{t=0}^{T} \big\lfloor -C(x_g, t) \big\rfloor_{+} \, dt \, dx_g,$$

where Z denotes a normalization constant and ⌊C⌋₊ the linear threshold function applied to the correlation measure C. With the linear threshold function, we can distribute the positive and negative contributions of C(x_g, t) to the two integrals, denoting the correlation of the parallel flow fields and the anti-parallel flow fields between the adaptor and the test stimulus sequence.
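In discrete form, with the flow fields sampled on the grid, these two measures reduce to sums of rectified scalar products; the following numpy sketch is our own reconstruction under that reading, with the normalization constant Z (left unspecified in the text) chosen here, purely for illustration, as the total rectified correlation.

```python
import numpy as np

def flow_correlations(flow_adapt, flow_test):
    """R_Motion_parallel and R_Motion_antiparallel for two sampled flow fields.
    flow_adapt, flow_test: arrays of shape (T, G, 2) -- 2D flow vectors of the
    adaptor and the test stimulus on the same grid of G points over T frames."""
    c = np.sum(flow_adapt * flow_test, axis=-1)   # scalar products C(x_g, t)
    r_parallel = np.maximum(c, 0.0).sum()         # sum of floor(C)_+
    r_antiparallel = np.maximum(-c, 0.0).sum()    # sum of floor(-C)_+
    z = r_parallel + r_antiparallel               # one possible choice of Z
    z = z if z > 0 else 1.0
    return r_parallel / z, r_antiparallel / z
```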
Figure 4.7
Comparison of the optical flow patterns between the anti-expression and the two test expressions in experiment 2. (a) Optic flow fields between the neutral and the current frames for adaptor or test as anti-happy or happy and (b), respectively, as anti-happy or disgusted. The representative flow field of the anti-expression is black and that of the test expression is light gray. Correlation measures of parallel (↑↑) and anti-parallel (↓↑) optical flow between all dynamic adaptation and testing conditions can be derived as explained in the text. The vertices of the anti-expressions in the graphics model are rendered for Id A in the background.
Matching anti-expressions played in forward temporal order induce a strong anti-parallel optic flow field. Adaptation with the same expressions played in reverse order induces a strong parallel optic flow field between adaptor and test stimulus. If the observed aftereffects were mainly based on low-level motion adaptation, this would predict that there should be opposite adaptation effects induced by the anti-face played in forward and backward temporal order. This clearly contradicts the experimental results, which indicate no strong influence of the temporal order on the observed aftereffects for matching expressions. This result rules out low-level motion aftereffects as a major source of the aftereffects observed in dynamic facial expressions in experiment 2. For nonmatching adaptation and test expressions, the optic
For nonmatching adaptation and test expressions, the optic flow analysis shows weaker correlations. However, it again shows dominant anti-parallel optic flow components for anti-expressions played in forward temporal order and dominant parallel flow components for expressions played in reverse order. Parallel optic flow components between adaptation and test stimuli predict a reduced recognition of the test stimulus. This effect would match the nonsignificant tendency in experiment 2 for recognition of nonmatching expressions to be somewhat reduced after adaptation stimuli played in reverse rather than in forward temporal order.

Discussion and Open Questions
We reported the first study on high-level aftereffects in the recognition of dynamic facial expressions. The study was based on a novel computer graphics method for the synthesis of parametrically controlled, close-to-reality facial expressions. The core of this method was the parameterization of dynamic expressions by facial action units, defining an abstract low-dimensional morphing space of dynamic facial expressions. We found consistent aftereffects for adaptation with dynamic anti-expressions, resulting in an increased tendency to perceive matching test expressions and a decreased tendency to perceive nonmatching test expressions. This result is compatible with similar observations for adaptation effects with static faces and points to the existence of continuous perceptual spaces of dynamic facial expressions. In addition, these results show that the aftereffects observed are highly expression-specific. A comparison of adapting stimuli played in forward and reverse temporal order did not reveal significant differences between the observed aftereffects. This indicates that the aftereffects involved might not rely on representations that are strongly selective for temporal sequential order. A more detailed analysis also provides some hints of higher adaptation efficiency for some expressions when adaptation and test stimuli represent the same facial identity. Future experiments will have to clarify the exact nature of dynamic integration in such representations of facial expressions and their relationship to such high-level aftereffects. A detailed analysis of the optic flow patterns generated by the dynamic face stimuli ruled out that the aftereffect obtained is simply induced by standard low-level motion aftereffects. For neural models of the perception of dynamic faces, our study suggests the existence of continuous perceptual spaces for dynamic faces that might be implemented in ways similar to perceptual spaces for static face stimuli (Burton, Bruce, & Hancock, 1999; Giese & Leopold, 2005; Leopold et al., 2006). However, so far we have not found strong evidence for a central relevance of sequence selectivity for such dynamic stimuli, as opposed to the perception of body motion (Giese & Poggio, 2003). Experiments that minimize form cues in individual frames while providing consistent local motion information might help to clarify the relative influence of form and motion in the perception of dynamic facial expressions. Novel 3D computer graphics models of
moving faces, such as those described by Walder et al. (this volume), might provide the foundation for such experiments.

Acknowledgments
This work was supported by the European Union project BACS FP6-IST-027140 and the Deutsche Forschungsgemeinschaft (DFG) Perceptual Graphics project, PAK 38.

References

Blanz, V., Basso, C., Poggio, T., & Vetter, T. (2003). Reanimating faces in images and video. Computer Graphics Forum, 22(3), 641–650.
Blanz, V., & Vetter, T. (1999). A morphable model for the synthesis of 3D faces. In Proceedings of the 26th annual conference on computer graphics and interactive techniques, SIGGRAPH (pp. 187–194). New York: ACM Press/Addison-Wesley.
Brainard, D. H. (1997). The psychophysics toolbox. Spatial Vision, 10(4), 433–436.
Burton, A. M., Bruce, V., & Hancock, P. J. B. (1999). From pixels to people: A model of familiar face recognition. Cognitive Science, 23, 1–31.
Choe, B., & Ko, H.-S. (2001). Analysis and synthesis of facial expressions with hand-generated muscle actuation basis. In IEEE computer animation conference (pp. 12–19). New York: ACM Press.
Clifford, W. G., Webster, M. A., Stanley, G. B., Stocker, A. A., Kohn, A., Sharpee, T. O., & Schwartz, O. (2007). Visual adaptation: Neural, psychological and computational aspects. Vision Research, 47(25), 3125–3131.
Curio, C., Breidt, M., Kleiner, M., Vuong, Q. C., Giese, M. A., & Bülthoff, H. H. (2006). Semantic 3D motion retargeting for facial animation. In Proceedings of the 3rd symposium on applied perception in graphics and visualization (pp. 77–84). Boston: ACM Press.
Curio, C., Giese, M. A., Breidt, M., Kleiner, M., & Bülthoff, H. H. (2008). Probing dynamic human facial action recognition from the other side of the mean. In Proceedings of the 5th symposium on applied perception in graphics and visualization (pp. 59–66). New York: ACM Press.
Ekman, P., & Friesen, W. V. (1978). Facial action coding system: A technique for the measurement of facial movement. Palo Alto, CA: Consulting Psychologists Press.
Giese, M. A., & Leopold, D. A. (2005). Physiologically inspired neural model for the encoding of face spaces. Neurocomputing, 65–66, 93–101.
Giese, M. A., & Poggio, T. (2003). Neural mechanisms for the recognition of biological movements. Nature Reviews Neuroscience, 4(3), 179–192.
Hill, H. C. H., Troje, N. F., & Johnston, A. (2005). Range- and domain-specific exaggeration of facial speech. Journal of Vision, 5(10), 793–807.
Jiang, X., Rosen, E., Zeffiro, T., VanMeter, J., Blanz, V., & Riesenhuber, M. (2006). Evaluation of a shape-based model of human face discrimination using fMRI and behavioral techniques. Neuron, 50(1), 159–172.
Jordan, H., Fallah, M., & Stoner, G. R. (2006). Adaptation of gender derived from biological motion. Nature Neuroscience, 9(6), 738–739.
Knappmeyer, B., Thornton, I. M., & Bülthoff, H. H. (2003). The use of facial motion and facial form during the processing of identity. Vision Research, 43(18), 1921–1936.
Leopold, D. A., Bondar, I. V., & Giese, M. A. (2006). Norm-based face encoding by single neurons in the monkey inferotemporal cortex. Nature, 442(7102), 572–575.
Leopold, D. A., O'Toole, A. J., Vetter, T., & Blanz, V. (2001). Prototype-referenced shape encoding revealed by high-level aftereffects. Nature Neuroscience, 4(1), 89–94.
Loffler, G., Yourganov, G., Wilkinson, F., & Wilson, H. R. (2005). fMRI evidence for the neural representation of faces. Nature Neuroscience, 8(10), 1386–1390.
Mather, G., Verstraten, F., & Anstis, S. M. (1998). The motion aftereffect: A modern perspective. Cambridge, MA: MIT Press.
Pelli, D. G. (1997). The VideoToolbox software for visual psychophysics: Transforming numbers into movies. Spatial Vision, 10(4), 437–442.
Rhodes, G., Brennan, S., & Carey, S. (1987). Identification and ratings of caricatures: Implications for mental representations of faces. Cognitive Psychology, 19(4), 473–497.
Roether, C. L., Omlor, L., Christensen, A., & Giese, M. A. (2009). Critical features for the perception of emotion from gait. Journal of Vision, 9(6), 1–32.
Troje, N. F., Sadr, J., Geyer, H., & Nakayama, K. (2006). Adaptation aftereffects in the perception of gender from biological motion. Journal of Vision, 6(8), 850–857.
Valentine, T. (1991). A unified account of the effects of distinctiveness, inversion, and race in face recognition. Quarterly Journal of Experimental Psychology A, 43(2), 161–204.
Vlasic, D., Brand, M., Pfister, H., & Popovic, J. (2005). Face transfer with multilinear models. ACM SIGGRAPH (pp. 426–433). New York: ACM Transactions on Graphics.
Watanabe, T. (1998). High-level motion processing: Computational, neurobiological, and psychophysical perspectives. Cambridge, MA: MIT Press.
Webster, M. A., & MacLin, O. H. (1999). Figural aftereffects in the perception of faces. Psychonomic Bulletin & Review, 6(4), 647–653.
5
Facial Motion and Facial Form
Barbara Knappmeyer
Traditionally, research on the dynamic aspects of face recognition has focused on removing facial form from the display to study the effects of facial motion in isolation. For example, some studies have used point-light displays (Bruce & Valentine, 1988; Bassili, 1978; Berry, 1990; Rosenblum et al., 2002); thresholded (Lander & Bruce, 2000; Lander, Christie, & Bruce, 1999), pixelated, or blurred displays (Lander, Bruce, & Hill, 2001); and photographic negatives (Knight & Johnston, 1997; Lander et al., 1999) to isolate facial motion from facial form. Another study used an animated average head to explore the effect of facial motion in the absence of individual facial form (Hill & Johnston, 2001). These studies demonstrate that facial motion carries information about identity as well as gender and age, and that this information can be used by the human face recognition system in the absence of other cues. Although this approach is well motivated and informative about the possible impact of facial motion on various aspects of face processing, it is rather unrealistic with respect to everyday situations, in which the human face-processing system usually has simultaneous access to both types of information: facial motion and facial form. Is facial motion used as a cue in such situations? How does the human face-processing system integrate those two types of information? The main purpose of this chapter is to review studies that inform us about the relative contribution of facial motion and facial form. In particular, I will review a study that was carried out at the Max Planck Institute for Biological Cybernetics in collaboration with Ian Thornton and Heinrich Bülthoff and published in Vision Research (Knappmeyer, Thornton, & Bülthoff, 2003) and in my dissertation (Knappmeyer, 2004). In this study we directly explored how the human face recognition system combines facial motion and facial form by systematically and independently varying those two cues using motion capture and computer animation techniques.

Degrading or Removing Facial Form
Studies in which facial form cues have been consistently degraded or removed have shown advantageous effects of facial motion. For example, Bruce and Valentine
(1988) used point-light displays to completely remove any information about facial form and asked observers to judge the identity, gender, and type of facial expression. It is important to add that the observers were familiar with the faces from which the point-light displays were created. This study revealed that observers performed above chance in all three tasks when the displays were presented dynamically. In addition, they were more accurate when the display was presented dynamically than when it was presented statically (although the difference did not reach significance in the case of gender discriminations). This early study suggests that facial motion, in the absence of facial form, carries information about gender, facial expressions of emotion, and identity. It confirmed an earlier result by Bassili (1978), who also showed that facial expressions are recognized more accurately from a point-light display when they are presented dynamically rather than statically. Another, more recent study using point-light stimuli yielded similar results. Rosenblum et al. (2002) found an advantageous effect of dynamic presentation in a matching paradigm for facial identity. Another example of isolating facial motion information from facial form is a study by Hill and Johnston (2001) in which an animated average head was used to show that facial motion carries information about gender and, to some degree, identity. This display did contain generic facial form information (it was an average face), but this form was uninformative with regard to the task (it didn't contain any gender or identity information). Compared with the point-light displays used in the earlier studies, this is a much more natural display in the sense that it actually looks like a moving face and not just a bunch of moving dots. However, it still is a stimulus in which relevant facial form information is absent. These studies highlight the beneficial role of facial motion in the complete absence of relevant form cues. However, studies using high-quality image displays with full facial form information have often failed to show beneficial effects from facial motion (e.g., Bruce et al., 1999; Christie & Bruce, 1998; Lander et al., 1999; Knight & Johnston, 1997).

Use of Facial Motion in the Absence of Other Cues
A number of studies have degraded facial form rather than completely removing it. For example, Knight and Johnston (1997) and Lander et al. (1999) degraded facial form information by using photographic negatives. Moving faces were significantly better recognized than static faces. This was true even when the amount of static information was equal. The same result was found using yet another way to degrade image quality: thresholding. In this type of manipulation, faces appear as black-and-white patterns of shadow and light. Lander et al. (1999) and Lander and Bruce (2000) again showed that familiar faces were recognized more accurately in a moving than in a static display, even when the amount of static information was equal.
Lander et al. (2001) used yet another technique to degrade facial form information. They pixelated and blurred the displays and found similar results. These studies suggest that facial motion is used as a cue not only in the complete absence of facial form but also when some relevant form information is available. However, from these studies it is difficult to assess how much form information is still available in the displays and how ambiguous this form information needs to be to allow motion information to become relevant. Also, it is difficult to conclude from these studies whether seeing faces in motion has a beneficial effect in general (in the sense that it helps to build a more accurate three-dimensional representation or provides more information about facial form) or whether facial motion in itself provides a specific clue to facial identity (in the sense of an idiosyncratic signature) (O'Toole, Roark, & Abdi, 2002). In our experiments, reviewed in the next section, we set out to create stimuli that allowed us to directly control the relevant amount of facial form information present in moving face displays. In addition, we created an experimental paradigm in which the only possible explanation for a beneficial effect of facial motion would be the supplemental information hypothesis, in which facial motion is viewed as providing additional information, for example in the form of an idiosyncratic motion signature (O'Toole et al., 2002). Before I go into detail in reviewing those experiments, I would like to point out that there are other factors, not reviewed here, that seem to be important when it comes to predicting whether facial motion plays a beneficial role in a certain face-processing task. Most prominent among those factors are familiarity (unfamiliar, familiar, or famous faces), the kind of face-processing task (e.g., identity, gender, or emotion recognition), and the kind of facial motion (rigid versus nonrigid motion).

Direct Tradeoff between Facial Motion and Facial Form
In the experiments reviewed here in more detail, we used facial animation techniques to vary the relevance of facial form cues rather than the image quality of the display, in order to create situations in which form and motion were either working in concert or in conflict during the processing of identity (figure 5.1). To do this, we created animated faces from 3D Cyberware laser scans of real human heads taken from the Max Planck Institute face database. Facial form cues were systematically manipulated by applying a 3D morphing technique developed by Blanz and Vetter (1999). We captured facial motion patterns from real human actors and applied them to the 3D laser scans by using a commercially available animation system for faces (3Dfamous Pty. Ltd.). The facial motion patterns consisted of an 8- to 10-second-long sequence of facial actions (e.g., smiling, chewing) performed by different nonprofessional human actors.
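As a highly simplified illustration of the spatial morphing used to blend two facial forms (not the Blanz and Vetter morphable model itself), the sketch below assumes that the two laser scans have already been brought into dense vertex correspondence, so that a morph level between 0 and 1 amounts to a per-vertex linear blend; the function name and variables are hypothetical.

```python
import numpy as np

def morph_faces(vertices_a, vertices_b, morph_level):
    """Linear blend of two face meshes that are in dense vertex correspondence.
    vertices_a, vertices_b: (N, 3) arrays of corresponding 3D vertex positions;
    morph_level: 1.0 yields face A, 0.0 yields face B (the convention used for
    the psychometric functions discussed below)."""
    w = float(np.clip(morph_level, 0.0, 1.0))
    return w * vertices_a + (1.0 - w) * vertices_b

# Hypothetical usage: a seven-step morph sequence between two registered scans.
# morph_sequence = [morph_faces(scan_a, scan_b, m) for m in np.linspace(0.0, 1.0, 7)]
```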
Figure 5.1 Animation technique. The faces were animated using motion patterns captured from real human actors. They differed either in their form (different laser scans) or in the motion pattern (different actors) that drove the animation, but never in the way in which the motion was applied to the faces (clustering). This animation technique allowed us to dissociate and independently vary facial motion and form (Knappmeyer et al., 2003; Knappmeyer, 2004).
The animation procedure is described in more detail in Knappmeyer et al. (2003). It is important to note here that the resulting animated faces differed either in their form (different laser scans) or in the motion pattern that drove the animation (different actors), but never in the way in which the motion was applied to the faces. Therefore the animation procedure itself could not have introduced any biases when the same motion pattern was transferred to two different facial forms. The task consisted of a training phase and a testing phase. During training, observers were first familiarized with two animated faces, each performing the same sequence of facial actions but with the slight idiosyncratic differences in facial movements natural to the different human actors (figure 5.2). To do this, observers were asked to fill out a questionnaire assessing personality traits (e.g., "Who looks overall happier to you?") while watching the two animated faces (one after the other) in a looped display along with a corresponding name label.
Figure 5.2 During a training phase, observers were familiarized with two moving faces (e.g., labeled "Stefan" and "Lester"), one always animated with motion A and the other always animated with motion B. The motion patterns consisted of the same sequence of facial expressions performed by different human actors. At test, each face of a morph sequence between "Stefan's" and "Lester's" facial form was combined with each of the two motion patterns. For example, "Stefan's" face was animated with "Lester's" motion and "Lester's" face was animated with "Stefan's" motion. Observers had to decide whether these moving target faces looked more like "Stefan" or more like "Lester" (Knappmeyer et al., 2003; Knappmeyer, 2004).
This training procedure took about 30 minutes, and afterward observers were able to label the faces accurately (100%). The procedure was intended to familiarize observers with the animated faces without biasing them to pay attention to any particular aspect of the display. During the testing phase, observers were shown spatial morphs representing a gradual transition between the forms of the learned faces and were asked to identify these morphs. The target faces were presented multiple times, half of them animated with the motion pattern from one learned face and half with the motion pattern from the other learned face. If characteristic facial motion influences the processing of identity in this task, we would expect more "face A" responses when the face is animated with face A's motion than when the same face is animated with face B's motion.
Figure 5.3 Mean distribution (collapsed across observers and face pairs) of "Stefan" responses as a function of morph level. Error bars indicate the standard error of the mean. The psychometric functions reveal a biasing effect of facial motion for most morph levels. That is, when faces move with "Stefan's" motion, observers are more likely to respond "Stefan" than when exactly the same faces move with "Lester's" motion, suggesting that observers based their identity judgments not solely on cues to individual facial form, but also on cues to individual facial motion (Knappmeyer et al., 2003; Knappmeyer, 2004).
Figure 5.3 shows the mean proportion of "A" responses for each morph and each motion pattern, collapsed across twenty-six observers and two different animated face pairs. The psychometric functions clearly show that the observers were sensitive to the facial form information contained in the animated faces. That is, the "face A" responses were close to 100% when the animated face looked exactly like face A (100% morph) and close to 0% when the animated face looked exactly like face B (0% morph). This is important to note, since it might not be obvious to the untrained reader that the faces depicted in the bottom row of figure 5.2 are actually dissimilar enough to be easily distinguished by trained observers. More interestingly, the shift between the two psychometric functions suggests that observers were biased in their identity judgments by the idiosyncratic motion patterns of the animated faces. Across almost the whole range of the morph sequence, observers were more likely to respond "A" when the morphed face was moving with face A's motion than when the same morphed face was moving with face B's motion. We assessed the magnitude of the shift between the two psychometric functions by computing the points of subjective equality (PSE) for each of the two facial motion patterns separately and then quantifying the difference. The PSE denotes the morph level that is perceived as most ambiguous; that is, the morph level at which observers' responses are at chance (50%).
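Purely as an illustration of this analysis step (not the authors' code), the PSE for each motion condition could be estimated by fitting a logistic psychometric function to the proportion of "A" responses at each morph level and reading off its 50% point; the sketch below assumes such response proportions are available and uses a standard least-squares fit.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(x, pse, slope):
    """Psychometric function: probability of an 'A' response at morph level x (percent)."""
    return 1.0 / (1.0 + np.exp(-slope * (x - pse)))

def estimate_pse(morph_levels, prop_a):
    """Fit the logistic function and return the point of subjective equality (50% point).
    morph_levels: morph levels in percent form information of face A;
    prop_a: observed proportion of 'A' responses at each level."""
    params, _ = curve_fit(logistic, morph_levels, prop_a, p0=[50.0, 0.1])
    return params[0]

# Hypothetical usage: the motion-induced bias is the difference between the two PSEs,
# e.g., bias = estimate_pse(levels, props_motion_b) - estimate_pse(levels, props_motion_a)
```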
An ANOVA revealed a main effect of facial motion [F(1, 22) = 10.3, p = 0.004], showing that the animated face to which observers responded "A" equally often contained less form information from face A (38.8%, S.E. 2.3%) when it was moving with face A's motion than when it was moving with face B's motion (53.9%, S.E. 3.3%). In other words, the biasing effect of idiosyncratic facial motion was equivalent to a difference of 15.1% (t = 3.1, df = 25, p = 0.002) form information at the point of subjective equality. It is interesting that this bias was present not only when the facial form was ambiguous, but also when relevant form cues were available (across almost the whole morph sequence). At the P25 and P75 levels, i.e., the animated morphs that elicit 25% or 75% "face A" responses, the biasing effect of facial motion was equivalent to a form difference of 14.2% (at the P25 level, t = 2.4, df = 25, p = 0.012) and 16% (at the P75 level, t = 3.7, df = 25, p = 0.001). Thus, rather than relying exclusively on facial form or on facial motion information, observers seem to integrate both sources of information. We were able to confirm this result in a number of variations of this task and these stimuli. The most interesting variation was a "family resemblance task," in which we morphed (50% morphs) the previously learned moving faces with a number of new faces. Observers were instructed that they would see faces of people who were related to one of the previously learned faces, and they were asked to categorize these faces with regard to their "family relationship." Again, the faces were animated with facial motion either from previously learned face "A" or from previously learned face "B." The question was whether facial motion would bias observers' decisions with regard to "family membership." Since in the previous experiments observers were asked to make fine-grained distinctions among highly similar faces, we were concerned that they might have adopted strategies quite different from the way they would usually process facial identity. In this task, however, the faces looked quite different (even though they did share a common facial geometry within a "family"). So we were hoping that the observers would adopt the same strategies they would use in everyday life when processing facial identity. The results from this experiment again showed that observers used a combination of facial form and facial motion when making their decision about facial resemblance. When form and motion cues were inconsistent, the observers performed much closer to chance (60.95% correct; chance level 50%). When form and motion were consistent, they performed well above chance (76.25% correct) (Knappmeyer et al., 2003). The series of experiments reviewed here showed that nonrigid facial motion biased observers' decisions about facial identity in the presence of cues to facial form. We were able to quantify the direct tradeoff between facial motion and facial form during the processing of facial identity. The fact that response differences were found for
identical faces that differed only in the way they moved provides clear support for the supplemental information hypothesis (O'Toole et al., 2002), which suggests that facial motion might provide additional information in the form of an idiosyncratic motion signature.1 These findings have important implications for models of face perception. Traditionally, such models (Bruce & Young, 1986; Haxby, Hoffman, & Gobbini, 2000) have stressed the separation of invariant aspects of faces (form) and changeable aspects (motion) into independent processing systems. Typically, decisions about facial identity have been assigned to the system that processes invariant aspects of faces (facial form). However, the results reviewed here support a growing body of literature (behavioral: Bernstein & Cooper, 1997; Lorenceau & Alais, 2001; Stone & Harper, 1999; Wallis & Bülthoff, 2001; computational: Giese & Poggio, 2003; neural: Bradley, Chang, & Andersen, 1998; Decety & Grezes, 1999; Grossman & Blake, 2002; Haxby et al., 2000; Kourtzi, Bülthoff, Erb, & Grodd, 2002; Oram & Perrett, 1994) that stresses the integration of both types of information, form and motion, in the processing of many classes of objects.

Note

1. The experiments reviewed here are not informative with regard to the alternative hypothesis, the representation enhancement hypothesis (O'Toole et al., 2002), since there was no direct comparison between static and dynamic displays of moving faces.
References

Bassili, J. (1978). Facial motion in the perception of faces and emotional expression. Journal of Experimental Psychology: Human Perception and Performance, 4, 373–379.
Bernstein, L. J., & Cooper, L. A. (1997). Direction of motion influences perceptual identification of ambiguous figures. Journal of Experimental Psychology: Human Perception and Performance, 23(3), 721–737.
Berry, D. S. (1990). What can a moving face tell us? Journal of Personality and Social Psychology, 58(6), 1004–1014.
Blanz, V., & Vetter, T. (1999). A morphable model for the synthesis of 3D faces. In Proceedings of the 26th international conference on computer graphics and interactive techniques, SIGGRAPH (pp. 187–194), August 8–13, 1999, Los Angeles: ACM Press.
Bradley, D. C., Chang, G. C., & Andersen, R. A. (1998). Encoding of three-dimensional structure-from-motion by primate area MT neurons. Nature, 392, 714–717.
Bruce, V., Henderson, Z., Greenwood, K., Hancock, P. J. B., Burton, A. M., & Miller, P. (1999). Verification of face identities from images captured on video. Journal of Experimental Psychology: Applied, 5(4), 339–360.
Bruce, V., & Young, A. (1986). Understanding face recognition. British Journal of Psychology, 77, 305–327.
Bruce, V., & Valentine, T. (1988). When a nod's as good as a wink: The role of dynamic information in facial recognition. In M. M. Gruneberg, P. E. Morris, & R. N. Sykes (Eds.), Practical aspects of memory: Current research and issues (Vol. 1, pp. 169–174). Chichester, UK: Wiley.
Christie, F., & Bruce, V. (1998). The role of dynamic information in the recognition of unfamiliar faces. Memory and Cognition, 26, 780–790.
Decety, J., & Grezes, J. (1999). Neural mechanisms subserving the perception of human actions. Trends in Cognitive Sciences, 3(5), 172–178.
Giese, M. A., & Poggio, T. (2003). Neural mechanisms for the recognition of biological movements. Nature Reviews Neuroscience, 4, 179–191.
Grossman, E. D., & Blake, R. (2002). Brain areas active during visual perception of biological motion. Neuron, 35(6), 1167–1175.
Haxby, J. V., Hoffman, E. A., & Gobbini, M. I. (2000). The distributed human neural system for face perception. Trends in Cognitive Sciences, 4(6), 223–233.
Hill, H., & Johnston, A. (2001). Categorizing sex and identity from the biological motion of faces. Current Biology, 11, 880–885.
Knappmeyer, B., Thornton, I. M., & Bülthoff, H. H. (2003). The use of facial motion and facial form during the processing of identity. Vision Research, 43, 1921–1936.
Knappmeyer, B. (2004). Faces in motion. Berlin: Logos Verlag.
Knight, B., & Johnston, A. (1997). The role of movement in face recognition. Visual Cognition, 4, 265–273.
Kourtzi, Z., Bülthoff, H. H., Erb, M., & Grodd, W. (2002). Object-selective responses in the human motion area MT/MST. Nature Neuroscience, 5(1), 17–18.
Lander, K., Christie, F., & Bruce, V. (1999). The role of movement in the recognition of famous faces. Memory and Cognition, 27, 974–985.
Lander, K., & Bruce, V. (2000). Recognizing famous faces: Exploring the benefits of facial motion. Ecological Psychology, 12, 259–272.
Lander, K., Bruce, V., & Hill, H. (2001). Evaluating the effectiveness of pixelation and blurring on masking the identity of familiar faces. Applied Cognitive Psychology, 15(1), 101–116.
Lorenceau, J., & Alais, D. (2001). Form constraints in motion binding. Nature Neuroscience, 4(7), 745–751.
O'Toole, A. J., Roark, D. A., & Abdi, H. (2002). Recognizing moving faces: Psychological and neural synthesis. Trends in Cognitive Sciences, 6(6), 261–266.
Oram, M. W., & Perrett, D. I. (1994). Responses of anterior superior temporal polysensory (STPa) neurons to "biological motion" stimuli. Journal of Cognitive Neuroscience, 6, 99–116.
Rosenblum, L. D., Yakel, D. A., Baseer, N., Panchal, A., Nordase, B. C., & Niehus, R. P. (2002). Visual speech information for face recognition. Perception and Psychophysics, 64(2), 220–229.
Stone, J., & Harper, N. (1999). Object recognition: View-specificity and motion-specificity. Vision Research, 39, 4032–4044.
Wallis, G., & Bülthoff, H. H. (2001). Effects of temporal association on recognition memory. Proceedings of the National Academy of Sciences of the United States of America, 98(8), 4800–4804.
6
Dynamic Facial Speech: What, How, and Who?
Harold Hill
A great deal of dynamic facial movement is associated with speech. Much of this movement is a direct consequence of the mechanics of speech production, reflecting the need to continuously shape the configuration of the vocal tract to produce audible speech. This chapter looks at the evidence that this movement tells us not only about what is being said, but also about how it is being said and by whom. A key theme is what information is provided by dynamic movement over and above that available from a photograph. Evidence from studies of lip-reading1 is reviewed, followed by work on how these movements are modulated by differences in the manner of speech and on what automatic and natural exaggeration of these differences tells us about encoding. Supramodal cues to identity conveyed by both the voice and the moving face are then considered. The aim is to explore the extent to which dynamic facial speech allows us to answer the questions who, what, and how, and to consider the implications of the evidence for theories of dynamic face processing.

Key Questions about the Perception of Dynamic Facial Speech
We can tell a lot about a person from a photograph of their face, even though a photograph is an artificial and inherently limited stimulus, especially with respect to dynamic properties. The first question is: what, if anything, does seeing a dynamic face add to seeing a photograph? At this level, "dynamic" is simply being used to indicate that the stimulus is moving or time varying, as opposed to static. In this sense, auditory speech is inherently dynamic, and so audiovisual speech would seem a particularly promising area in which to look for dynamic advantages. If movement is important, we should also ask how it is encoded. Is it, like film or video, simply encoded as an ordered and regularly spaced series of static frames? Or are the processes involved more similar to the analysis of optic flow, with movement encoded as vector fields indicating directions and magnitudes for any change from one frame to the next? Both such encodings, or a combination (Giese & Poggio,
2003), would be viewer-centered: a function of the observer's viewpoint and other viewing conditions as much as of the facial speech itself. In contrast, an object-centered level of encoding might have advantages in terms of efficiency, through encoding only properties of the facial speech itself. It is the relative motions of the articulators that shape speech, especially since their absolute motions are largely determined by whole head and body motions (Munhall & Vatikiotis-Bateson, 1998). Of particular relevance to this question are the rigid movements of the whole head that often accompany natural speech. These include nods, shakes, and tilts of the whole head, as well as various translations relative to the viewer. All of these rigid movements would greatly affect any viewer-centered representation of the nonrigid facial movements, including those of the lips, that are most closely associated with facial speech. In auditory speech recognition, a failure to find auditory invariants corresponding to the hypothesized phonetic units of speech led to the motor theory of speech perception (Liberman & Mattingly, 1985). This theory argues that speech is encoded in terms of the articulatory "gestures" underlying speech production rather than auditory features. One appeal of this theory in the current context is that the same articulatory gestures could also be used to encode facial speech, naturally providing a supramodal representation for audiovisual integration. Gestures also provide a natural link between perception and production. At this level, the "dynamic" of dynamic facial speech includes the forces and masses that are the underlying causes of visible movement, rather than being limited to the kinematics of movement. From the point of view of the perceptual processes involved, such a theory would need to specify how we recover dynamic properties from the kinematic properties of the movements available at the retina, which, like the recovery of depth from a two-dimensional image, is an apparently underspecified problem. There is evidence from the perception of biological motion that human observers are able to do this (Runeson & Frykholm, 1983). Another critical question with regard to encoding movement concerns the temporal scale involved. Many successful methods of automatic speech recognition treat even auditory speech as piecewise static and use short time frames (approximately 20 ms). Each frame is characterized by a set of parameters that are fixed for that frame and that are subsequently fed into a hidden Markov model. However, within the auditory domain, there is strong evidence that piecewise static spectral correlates associated with the traditional distinctive features of phonemes are not the sole basis of human perception (Rosenblum & Saldaña, 1998). Transitions between consonants and vowels vary greatly according to which particular combinations are involved. Although the variation this introduces might be expected to be a problem for identifying segments on the basis of temporally localized information, experiments with cross-splicing and vowel nucleus deletion show, on the contrary, that the pattern of transitions
is a useful, and sometimes sufficient, source of information. In addition, subsequent vowels can influence the perception of preceding consonants, and this influence also operates cross-modally as well as auditorily (Green & Gerdeman, 1995). The structure and redundancy of language mean that recognition is determined at the level of words or phrases as well as segments. Word frequency, number of near neighbors, and number of distinctive segments can all contribute. This is also true of facial speech, where, for example, polysyllabic words tend to be more distinctive and easier to speech-read than monosyllabic words (Iverson, Bernstein, & Auer, 1998). While ba, ma, and pa can be confused visibly, in the context of a multisyllabic word like balloon the ambiguity is readily resolved lexically, because malloon and palloon are not words. Thus there may be important advantages to encoding both seen and heard speech at longer time scales that cover syllables, whole words, sentences, and even entire utterances. In summary, dynamic facial speech, like vision in general, will inevitably be initially encoded in an appearance-based, viewer-centered way. However, there is evidence that more abstract object-centered and/or muscle-based levels of encoding may also be involved.

What? Meaning from Movement
We all speech-read in the general sense that we are sensitive to the relationships between facial movement and the sound of the voice. For example, we know immediately if a film has been dubbed into another language (even when we know neither of the languages), or if the audio is out of synchrony with the video. The degree to which we find these audiovisual mismatches irritating and impossible to ignore suggests that the cross-modal processing involved is automatic and obligatory [even preverbal, 10–16-month-old infants find a talking face that is out of sync by 400 ms distressing (Dodd, 1979)]. Speech-reading in hearing individuals is not the same as the silent speech-reading forced upon the profoundly deaf (or someone trying to work out what Marco Materazzi said to Zinedine Zidane at the 2006 Soccer World Cup final). The two abilities may be related and start from the same stimulus, the dynamic face, but visual speech-reading in normal-hearing individuals is closely linked to auditory processing. Indeed, seeing a silent video of a talking face activates the auditory cortex (Calvert et al., 1999), but this activity is significantly reduced in congenitally deaf people (MacSweeney et al., 2001). For hearing individuals, being able to see the speaker's face helps us to compensate for noisy audio, an advantage that can be equivalent to a 15-dB reduction in noise (Sumby & Pollack, 1954). Even when the audio is clear, seeing the speaker helps if the material is conceptually difficult (a passage from Immanuel Kant was tested!) or spoken in a foreign accent (Reisburg, 1987). Perhaps
the most dramatic illustration of speech-reading in normal-hearing individuals is the so-called McGurk effect (McGurk & MacDonald, 1976). When certain pairs of audible and visible syllables are mismatched, for example audio "ba" with visual "ga," what we hear depends on whether our eyes are open ("da") or closed ("ba"). This section explores the visual information that can affect what we hear. Even a static photograph of a speaking face can provide information about what is being said. Photographs of the apical positions associated with vowels and some consonants can be matched to corresponding audio sounds (Campbell, 1986). The sound of a vowel is determined by the shape of the vocal tract, as this in turn determines the resonant frequencies, or formants. Critical vocal tract parameters include overall length and the size and shape of the final aperture formed by the lips. These can be seen in photographs. The area and position of maximum constriction formed by the tongue may also be visible when the mouth is open, and is often correlated with clearly visible lip shape (Summerfield, 1987). A McGurk-like mismatching of audio sounds and visual vowels results in the perception of a vowel with intermediate vocal tract parameters, suggesting these parameters as another potential supramodal representation derivable from both visual and auditory speech (Summerfield & McGrath, 1984). The effect of visual information on even the perception of vowels is particularly compelling, given that vowels tend to be clearly audible, as they are voiced and relatively temporally stable. Consonants, in contrast, involve a transitory stopping or constriction of the airflow and, acoustically, are vulnerable to reverberation and noise. The location of the constriction, the so-called "place of articulation," is often visible and can also be captured in a photograph, at least when it occurs toward the front of the vocal tract. Clearly visible examples include bilabials (p, b, m), labiodentals (f, v), and linguodentals (th). The "ba" and "ga" often used to illustrate the McGurk effect are clearly distinguishable in photographs. More posterior alveolar or palatal constrictions can be visible if the mouth is relatively open. The place of articulation is often difficult to determine from audio, and vision may be particularly important in providing complementary information for consonants. The visual confusability of consonants is approximately inversely related to their auditory confusability (Summerfield, 1987). Campbell proposes a strong separation between complementary and correlated visual information, associating the former with static and the latter with dynamic properties of audiovisual stimuli, with processing carried out by different routes (Campbell, 2008). Static photographs have been reported to generate McGurk effects equivalent to those of dynamic stimuli for consonant-vowel (CV) syllables, and were even reported to support better silent speech-reading of the stimuli (Irwin, Whalen, & Fowler, 2006; but see Rosenblum, Johnson, & Saldaña, 1996). Thus for speech-reading, as for so many face-processing tasks,
static photographs appear to capture much of the critical visual information, such as vocal tract parameters, including the place of articulation of consonants. Although demonstrably useful, the high-frequency spatial information provided by photographs does not seem to be necessary for speech-reading. Studies of the perceiver's eye movements show speech-reading benefits even when people are looking directly at the mouth less than half of the time (Vatikiotis-Bateson, Eigsti, Yano, & Munhall, 1998). This suggests that the high spatial sensitivity of the fovea is not necessary for speech-reading. This is confirmed by studies using spatially filtered stimuli (Munhall, Kroos, Jozan, & Vatikiotis-Bateson, 2004). Instead, the temporally sensitive but spatially coarse periphery appears sufficient. Further evidence for the importance of motion over spatial form comes from single-case studies. HJA, a prosopagnosic patient who is unable to process static images of faces, shows typical speech-reading advantages from video (Campbell, Zihl, Massaro, Munhall, & Cohen, 1997). Conversely, the akinetopsic patient LM, who has severe problems with motion processing, cannot speech-read natural speech but can process static faces, including being able to identify speech patterns from photographs (Campbell et al., 1997). There is also suggestive evidence that speech-reading ability is correlated with motion sensitivity (Mohammed et al., 2005). Studies involving the manipulation of video frame rates show decreases in ability at or below 12.5 fps (Vitkovich & Barber, 1994). Adding dynamic noise also reduces our ability to speech-read (Campbell, Harvey, Troscianko, Massaro, & Cohen, 1996). Finally, the most direct evidence for the usefulness, and indeed sufficiency, of motion information comes from studies using point-light stimuli (Rosenblum et al., 1996). These displays are designed to emphasize dynamic information and limit form-based cues. They are sufficient to support McGurk effects and speech-in-noise advantages when presented as dynamic stimuli (Rosenblum & Saldaña, 1998). What is it about speech that is captured and conveyed by visual movement? One simple temporal cue would be the time of onset of various events. The onset of speech is characterized by both mouth and head movements (Munhall, Jones, Callan, Kuratate, & Vatikiotis-Bateson, 2004), and this can highlight the onset of the corresponding auditory signal that might otherwise be lost in noise. Another temporal cue that is available multimodally is duration. In many languages, including Japanese, Finnish, and Maori, though not typically in English, differences in vowel duration serve to distinguish between phonemes and can change meaning. In an experiment recording facial motion during the production of long or short minimal pairs (pairs of words differing in only one phoneme) in Japanese, differences in facial movement that provide visible cues to duration are clearly apparent. For example, figure 6.1 shows trajectories associated with koshou (breakdown) and koushou (negotiations). As can be seen, the intervals between corresponding features of the plotted trajectories of jaw and lip movement contain visual information about duration.
Figure 6.1 An example of (a) unfocused (left) and focused (right) audio waveforms and (b) spectrograms and motion trajectories for a number of components of dynamic facial speech. The Japanese phrase spoken is the short member of a minimal pair, koshou, defined in terms of a long or short linguistic distinction and contained in a carrier phrase. For details, please see the text.
The segments indicated were defined on the basis of the audio speech, but the corresponding changes in the trajectories would be visible. The interpretation of duration is highly dependent on overall speech rate, another temporal property that is provided by facial as well as audible speech. When auditory and visible speech rates are deliberately mismatched, the rate seen can influence heard segmental distinctions associated with voice onset time and manipulated on a continuum between /bi/ and /pi/ (Green & Miller, 1985). As well as being unable to speech-read, LM cannot report differences in the rate of observed speech (Campbell et al., 1997). Velocities of facial movements are also available from facial speech; they reflect aerodynamics and are diagnostic of phonetic differences, as in, for example, the difference between p and b (Munhall & Vatikiotis-Bateson, 1998). Rapid movement is also associated with, and helps to signal, changes from constricted consonants to open vowels. The overall amount of lip movement can also distinguish between different diphthongs (Jackson, 1988). Dynamic facial speech potentially provides information about transitions. The most dramatic evidence for the importance of formant transitions, as opposed to more complex, temporally localized spectral features, is sine wave speech (Remez, Rubin, Pisoni, & Carrell, 1981). This is "speech" synthesized using three sine waves that follow the amplitudes and frequencies of the first three formants of the original audio (a minimal synthesis sketch is given below). The resultant temporally distributed changes in formant frequencies can be sufficient for understanding spoken speech despite the almost complete absence of traditional transitory acoustic features. These auditory parameters have been shown to be correlated with three-dimensional face movement (Yehia et al., 1998). Perceptual experiments also show advantages, in terms of syllables correctly reported, for presenting sine wave speech in combination with corresponding video over presenting audio or video alone (Remez, Fellowes, Pisoni, Goh, & Rubin, 1998). For single tones, the combination of video and F2 was found to be particularly effective. F2 is the formant most highly correlated with facial movement (Grant & Seitz, 2000) owing to its association with lip spreading or rounding (Jackson, 1988). This highlights the importance of information common to both audio and video in audiovisual speech processing, in contrast to the traditional emphasis on complementary cues. Both F2 and facial speech are informative about place of articulation.
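The sketch below is a minimal, illustrative resynthesis of sine wave speech, assuming that the center-frequency and amplitude tracks of the first three formants have already been extracted (say, at 100 frames per second); it is not the procedure of Remez et al., and all names and parameter values are assumptions.

```python
import numpy as np

def sine_wave_speech(formant_freqs, formant_amps, frame_rate=100, sample_rate=16000):
    """Replace each of the first three formants by a single time-varying sinusoid.
    formant_freqs, formant_amps: (n_frames, 3) formant center frequencies (Hz) and amplitudes."""
    n_frames = formant_freqs.shape[0]
    n_samples = int(n_frames * sample_rate / frame_rate)
    t_frames = np.arange(n_frames) / frame_rate
    t_audio = np.arange(n_samples) / sample_rate
    out = np.zeros(n_samples)
    for k in range(3):
        freq = np.interp(t_audio, t_frames, formant_freqs[:, k])   # sample-rate frequency track
        amp = np.interp(t_audio, t_frames, formant_amps[:, k])     # sample-rate amplitude track
        phase = 2.0 * np.pi * np.cumsum(freq) / sample_rate        # integrate frequency to phase
        out += amp * np.sin(phase)
    return out / np.max(np.abs(out))                               # normalize to [-1, 1]
```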
If complementary audio and visual information were critical, F0 might have been expected to receive the most benefit from the addition of visual information, because F0 conveys voicing, which is determined primarily by the vibration of the vocal cords rather than by the shape of the vocal tract and is therefore not readily visible in dynamic facial speech. Taken together, these findings suggest audiovisual integration at the level of shared spectrotemporal patterns suited to encoding patterns of transitions. Studies of the perception of speech in noise show that even rigid head movements can increase the number of syllables recovered under noisy conditions (Munhall et al., 2004). These rigid movements do not provide information about the shape of the vocal tract aperture or the place of articulation. Indeed, for a viewer-centered encoding and for many automatic systems, head movements would be expected to interfere with the encoding of the critical nonrigid facial movements. However, head movements are correlated with the fundamental frequency, F0, and with absolute intensity. Both of these are important in conveying prosody, and this may be the mechanism by which they facilitate recognition of syllables. Prosody could help at the level of initial lexical access to words, and at the level of sentences by providing cues to syntactic structure (Cutler, Dahan, & van Donselaar, 1997). Prosody is the central theme of the next section, the how of facial speech. In summary, speech can to an extent be treated as piecewise static in both the auditory and the visual domains. Within this framework, photographs can clearly capture critical visual information. However, this information will always be limited to the level of individual segments, and the effects of coarticulation mean that even segmental cues will be temporally spread out. Such temporally distributed information can never be captured by a single photograph, but is associated with patterns of movement. Thus the durations, rates, velocities, amounts, and spatiotemporal patterns of movements that make up dynamic facial speech all help us to know what is being said.

How? It's Not What You Say, but How You Say It
Prosody allows us to convey many different messages even when speaking the same words and is critical to the message communicated. The contrast between good and bad acting amply illustrates how important these differences can be. Acoustically, prosody is conveyed by the differences in pitch, intensity, and duration that determine overall patterns of melody and rhythm. It is suprasegmental; that is, its effects are spread out over the utterance and have meaning only in relation to each other. At the level of individual words, characteristic prosodic patterns, particularly syllable stress, are often associated with accent but can also change the meaning or part of speech, as with "convict" the noun as opposed to "convict" the verb. These differences are often visible. Prosody also conveys syntax, emphasis, and emotion, as well as helping to regulate turn-taking and reflecting social relationships, all of which are fundamental to communication. Many of these functions are associated with acoustic patterning of the pitch and intensity of the voice. Although the vibrations of the
vocal cords that determine F0 are not visible on a face, they are correlated with head movements (Munhall et al., 2004). Intensity is strongly associated with face motion (Yehia et al., 1998) and, to a lesser extent, with head movement. Visible movements and expressions that are not directly related to the audio, for example the raising of an eyebrow or widening of the eyes, also provide an additional channel by which facial speech can modulate the spoken message. Work on visual prosody has looked at contrastive focus, the use of prosodic cues to emphasize one part of a sentence over another. There are known acoustic indicators of focus, but perceptual experiments show that it is also possible to speech-read emphasis from video alone (Dohen, Loevenbruck, Cathiard, & Schwartz, 2004). Production studies have involved recording different people's movements while they produce examples of contrastive focus. These show a number of visual correlates, including increases in lip and jaw opening, cheek movements, and duration for the focal syllable, coupled with corresponding decreases for the remaining parts of the sentence (Dohen, Loevenbruck, & Hill, 2005). Individual differences are also found, for example in head movements and anticipatory strategies, which may have an important role in answering the question who? Emotion produces large effects in people's speech. We looked at how this was conveyed by face and head movements using silent animations (Hill, Troje, & Johnston, 2005). In particular, we were interested in whether differences in the timing or in the spatial extent of movement would be more important. We found that exaggerating differences in the spatial extent of movement relative to the grand average across emotions reliably increased the perceived intensity of the emotion. This is again consistent with the importance of apical positions in the perception of facial motion, although moving greater or lesser distances in the same time will also change velocities and accelerations. In natural speech, peak velocity and peak amplitude tend to be linearly related (Vatikiotis-Bateson & Kelso, 1993). In our studies, directly changing timing while leaving spatial trajectories unchanged was less effective in exaggerating emotion. This may have been because the changes in timing used were restricted to changes in the durations of segments relative to the average. Still convinced that timing is important, we decided to use contrastive focus to look at how people naturally exaggerate differences in duration for emphasis. As noted earlier, vowel length is important for distinguishing between phonemes in Japanese, among other languages. We made use of a set of "minimal pairs," that is, pairs of words that differ in only one phoneme, in this case based on a contrast in duration. The speakers whose movements were being recorded read a simple carrier sentence, "Kare kara X to kikimashita" ("I heard X from him"), where X was one of the members of a minimal pair. An experimenter then responded "Kare kara Y to kikimashita?" ("You heard Y from him?"), where Y was the other member of the pair. The speaker being recorded then responded, "Ie, kare kara X to kikimashita!"
("No, I heard X from him!"). We were primarily interested in how the second instance of X would differ from the first, given that X differs from Y in the duration of one of its segments. Using a two-alternative forced-choice task, we found that observers could distinguish which version of X had been emphasized with 95% accuracy from audio alone and with 70% accuracy from video alone (where chance was 50%). The effect on facial movement for one speaker is shown in figure 6.1. In this case the first syllable of the keyword, "ko," is the short version of the minimal pair, which consisted of koshou (breakdown) and koushou (negotiations). The "ko" corresponds to the segment between the first and second vertical lines, the positions of which were defined from the audio. When focused, as in the right half of the figure, the absolute duration of this part does not change, but its proportion of the whole word (contained in the section between the first and third vertical lines) is clearly reduced; i.e., relative duration is exaggerated. This was borne out by an analysis of all the minimal pairs for the clearest speaker (see figure 6.2a). It is clear that focus increased all durations, including that of the carrier phrase, which is consistent with our generally speaking more slowly when emphasizing something. Indeed, focus on average increased the durations of the short keywords in each minimal pair, although not as much as it increased the durations of the long keywords. Overall, the relative duration of the critical syllable was exaggerated. These effects of emphasis on duration are consistent with similar effects of speaking rate, and with the importance of relative rather than absolute duration as the cue for phonemic vowel length (Hirata, 2004). The relevance to dynamic facial speech is that relative duration can be recovered from visual as well as auditory cues. There were also effects on the range of vertical jaw displacements (see figure 6.2b). Jaw movement corresponds to the first principal component of variation for facial movement in Japanese (Kuratate, Yehia, & Vatikiotis-Bateson, 1998). Again, relative encoding is critical, and focus reduces the amplitude of the first part of the carrier phrase, an example of the hypoarticulation often used for contrastive focus (Dohen et al., 2004, 2005). There was also an increase in the amplitude of movement, hyperarticulation, for the long version of the focused keyword. Short and long syllables did not differ in amplitude without focus, but emphasizing duration naturally leads to an increase in the amplitude of the movement. With focus, there were also changes in the amplitude of other movements, including additional discontinuities in trajectories, facial postures held even after audio production had ceased, and accompanying head nods, all of which can contribute to conveying visual prosody. Thus there are many ways in which dynamic facial and head movements can signal changes in how the same words are spoken, including differences in the extents and durations of the movements associated with the production of sounds. Visible differences in the spatial extent of movements play a role in the perception of both the how and the what of audiovisual speech.
Figure 6.2 Visible correlates of phonetic distinctions. (a) This shows the mean durations for three segments of twenty-two sentences of the kind shown in figure 6.1. Average values are shown for short and long versions of each minimal pair when spoken either with or without contrastive focus intended to emphasize the difference in duration between the pairs. Note how the focus increases overall duration and also the relative difference in duration between the long and short keywords. (b) Maximum range of jaw movements corresponding to the same segments and sentences as in (a). Note in particular the increased jaw movement associated with emphasizing the long keyword in each pair and hypoarticulation of the preceding part of the carrier phrase. Error bars show standard errors of means.
and the what of audiovisual speech. However, dynamic differences in timing, particularly differences in the duration and rate of speech, can also be perceived from visual motion as well as from audio sounds. Thus dynamic facial speech can convey the spatiotemporal patterning of speech, the temporal patterning that is primarily associated with the opening and closing of the vocal tract. Even rigid head movements not directly associated with the shaping of the vocal tract reflect these patterns of intensity as well as pitch. It is this patterning that carries prosody, the visible as well as auditory how of speech, which in turn affects both what and who.

Who? Dynamic Facial Speech and Identity
Other chapters in this volume present evidence that movement of a face can be useful for recognizing people. Much of this evidence is drawn from examples of facial speech. In this section the focus is on cues to identity that are shared by face and voice. Facial speech is different for different individuals. Some people hardly move their lips and rarely show their teeth, while others (especially U.S. television news presenters) speak with extreme ranges of motions. These differences affect how well someone can be speech-read and relate to the number of phonemes that can be distinguished visually (visemes) for that person (Jackson, 1988). This variety is also mirrored by voices, which can range from dull monotones to varying widely in speed and pitch over utterances. We wanted to know if these differences could provide a cross-modal cue to identity, and whether people could predict which face went with which voice and vice versa (Kamachi, Hill, Lander, & Vatikiotis-Bateson, 2003; Lander, Hill, Kamachi, & Vatikiotis-Bateson, 2007). The answer was that people perform at chance in matching voices to static photographs, suggesting that fixed physical characteristics, such as the length of the vocal tract, that relate to properties of a voice are not sufficient for the task. The performance of participants was above chance when the face was seen moving naturally. Performance dropped off if the movement was played backward (all the voices were played forward and matching was always sequential in time). Playing movement backward is a strong control for demonstrating an effect of movement over and above the associated increase in the amount of static information available with videos compared with photographs. The result shows that neither static apical positions, nor direction-independent temporal properties such as speech rate, or speed, or amount of movement, are sufficient for this task. Instead, direction-dependent, dynamic patterns of spatiotemporal movement support matching. Performance was generalized across conditions where the faces and the voice spoke different sentences, showing that identical segmental information was not necessary. The audio could even be presented as sine wave speech, which again is consistent
with the importance of spatiotemporal patterns over segmental cues. Previous work has also shown that the rigid head movements that convey the spatiotemporal patterning of prosody (Munhall et al., 2004) are more useful than segment-related face movements in conveying identity (Hill & Johnston, 2001). Performance at matching a face to a voice was disrupted less by changes in what was being said than changes in how the person was speaking the words. In this case, we used the same sentences spoken as statements or questions and with casual, normal, or clear styles of speech. Artificial uniform changes in overall speaking rate did not affect identity matching, again ruling out many simple temporal cues (Lander et al., 2007). In summary, dynamic cues associated with different manners of speaking convey supramodal clues to identity. These clues appear relatively independent of what is being said. It remains to consider how theories of dynamic facial speech encoding can provide a unified account for performance on these different tasks.

Conclusion: Encoding Dynamic Facial Speech
We have seen that dynamic facial speech provides a wealth of information about what is being said, how it is being said, and who is saying it. In this section we will consider the information, encoding, and brain mechanisms that allow dynamic facial speech to provide answers to these questions. Even a static photograph can capture a considerable proportion of facial speech. Photographs of apical position in particular can be matched to sounds and convey cues to emotion that can be made stronger by exaggerating relative spatial positions. Functional magnetic resonance imaging studies show that static images activate brain areas associated with biological motion, the superior temporal sulcus, the premotor cortex, and Broca's area, suggesting that they activate circuits involved in the perception and production of speech (Calvert & Campbell, 2003). This is consistent with static facial configurations playing a role in the representation of dynamic facial speech, much as animators use static keyframes of particular mouth shapes when generating lip-synchronized sequences (Thomas & Johnston, 1981). One issue is how these keyframes could be extracted from a naturally moving sequence, given they will not occur at regular times. Optic flow fields may play a crucial role in this process, with changes in the direction of patterns signaling extreme positions. A test of whether these key frames do have a special role would be to compare facial speech perception from sequences composed of selected apical positions with sequences containing an equal number of different frames sampled at random or regular intervals. An issue for any encoding based on static images or optic flow is the role of head movements. Rigid movements of the whole head might be expected to disrupt the recovery of flow patterns associated with the relative facial movements of the oral cavity most closely linked to speech sounds, and have to be factored out. Perhaps
surprisingly, for human perception such head movements, rather than being a problem, actually appear to facilitate the perception of facial speech (Munhall et al., 2004). In addition, the perception of facial speech is relatively invariant with viewpoint (Jordan & Thomas, 2001), and cues to identity from nonrigid facial movement generalize well between views (Watson, Johnston, Hill, & Troje, 2005). This evidence suggests view-independent primitives or face-centered representations. Temporally based cues provide a number of possible candidates for view-independent primitives and are considered in more detail later. Spatiotemporal patterns of motion have a view-dependent spatial component that would not be invariant but could conceivably be encoded in a head-centered coordinate system. Rigid movements are not simply noise and may be encoded independently because they are also a valuable source of information in their own right. There is suggestive neuropsychological evidence that this may be the case (Steede, Tree, & Hole, 2007). As noted, temporal cues are inherently view independent. They are also available multimodally, thus providing a potential medium for cross-modal integration. The temporal relationship between auditory and facial speech is not exact. In terms of production, preshaping means that movement can anticipate sound and, at the other extreme, lips continue to move together after sound has ceased (Munhall & Vatikiotis-Bateson, 1998). Perceptually, audiovisual effects persist with auditory lags of as much as 250 ms (Campbell, 2008). Possible temporal cues include event onsets, durations, the timing and speed of transitions, and speaking rates. Evidence has been presented that all of these are involved in the perception of dynamic facial speech. Global temporal patterns of timing and rhythm associated with prosody and carried by head and face movements also play an important part in the perception of dynamic facial speech. Movement-based point-light sequences, lacking static high spatial-frequency cues but including spatial as well as temporal information, convey facial speech (Rosenblum & Saldaña, 1998), emotion (Bassili, 1978; Pollick, Hill, Calder, & Patterson, 2003; Rosenblum, 2007), and identity (Rosenblum, 2007). Eye movement studies suggest that this low-spatial, high-temporal frequency information plays an important role in everyday facial speech perception when we tend to be looking at the eyes rather than the mouth (Vatikiotis-Bateson et al., 1998). Patterns of cognitive deficits also suggest that preservation of motion processing is more important to facial speech perception than preservation of static face perception (Campbell et al., 1997). Silent moving speech, unlike still images, activates the auditory cortex (Calvert & Campbell, 2003) and only moving silent speech captures dynamic cues to identity shared with voice (Kamachi et al., 2003). How is the kinematic information encoded? As noted, view-independent patterns of performance suggest it may be represented in a head-centered coordinate system.
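A head-centered encoding presupposes that rigid head motion can be separated from the nonrigid articulation layered on top of it. As a rough illustration only (not the analysis used in the studies cited above), the sketch below removes rigid motion from hypothetical 3D motion-capture data by aligning a set of markers assumed to ride rigidly on the skull to a reference frame; the residual trajectories are then expressed in a head-centered coordinate system, while the discarded rigid transforms retain the head-movement cue in its own right.

```python
import numpy as np

def rigid_align(ref, frame):
    """Least-squares rigid (rotation + translation) alignment of one frame of
    3D markers onto a reference frame (Kabsch algorithm, row-vector form)."""
    ref_c = ref - ref.mean(axis=0)
    frm_c = frame - frame.mean(axis=0)
    U, _, Vt = np.linalg.svd(frm_c.T @ ref_c)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])  # guard against reflections
    R = U @ D @ Vt
    t = ref.mean(axis=0) - frame.mean(axis=0) @ R
    return R, t

def to_head_centered(markers, rigid_idx):
    """markers: array of shape (n_frames, n_markers, 3) of motion-capture data.
    rigid_idx: indices of markers assumed to move only with the skull
    (e.g., forehead and nose-bridge markers; an assumption of this sketch).
    Returns the trajectories with rigid head motion removed (head-centered),
    plus the per-frame rigid transforms, which carry the head-movement cue."""
    ref = markers[0, rigid_idx]
    head_centered = np.empty_like(markers)
    rigid = []
    for i, frame in enumerate(markers):
        R, t = rigid_align(ref, frame[rigid_idx])
        head_centered[i] = frame @ R + t   # same rigid transform applied to every marker
        rigid.append((R, t))
    return head_centered, rigid
```

Comparing what survives in the two streams, the residual articulation and the recovered rigid motion, would be one way to operationalize the claim that nonrigid and rigid cues carry separable information.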
Articulatory gestures are an appealing candidate for the representation of movement (Liberman & Mattingly, 1985; Summerfield, 1987). An example of a proposed articulatory gesture might be a bilabial lip closure, which would typically involve motion of both lips and the jaw as well as more remote regions such as the cheeks. Such gestures are the underlying cause of the correlation between sound and movements. They are face-centered, spread out in time, and directly related to linguistic distinctions but allow for the effects of coarticulation and show individual and prosodic variability. They would also naturally complete the perception-production loop, with mirror neurons providing a plausible physiological mechanism for such a model (Skipper, Nusbaum, & Small, 2005, 2006). Gestures are determined by what is being said, but how do they relate to who and how? One might expect that differences associated with speaker and manner would be noise for recovering what is being said and vice versa. Instead, speaker-specific characteristics appear to be retained and beneficially affect the recovery of linguistic information (Yakel, Rosenblum, & Fortier, 2000). We appear to "tune in" to facial as well as auditory speech. Evidence from face-voice matching suggests that cues to identity are also closely tied to manner, i.e., how something is said (Lander et al., 2007). The words to be said (what) determine the articulator movements required, which in turn determine both facial movement and the sound of the speech. Recordings of facial movement show broadly equivalent patterns of movements across speaker (who) and manner (how) for equivalent utterances. Individual differences and changes in manner affect the timing and patterns of displacement associated with the underlying trajectory. It is these differences that allow us to answer who and how, but how are the differences encoded? When analyzing motion-capture data, the whole trajectory is available and equivalent features can be identified and temporally aligned for comparison. This is clearly not the case for interpreting speech viewed in real time. One possibility is to find underlying dynamic parameters, including masses, stiffnesses, and motion energy, that are characteristic of different individuals or manners and that affect trajectories globally. This would truly require the perception of dynamic facial speech. In conclusion, while both audio and visual speech can be treated as piecewise static to an extent, both strictly temporal properties and global spatiotemporal patterns of movement are important in helping us to answer the questions of who, what, and how from dynamic facial speech.

Acknowledgments
The original data referred to in this chapter are from experiments carried out at Advanced Telecommunication Research in Japan and done with the support of
National Institute of Information and Communications Technology. None of the work would have been possible without the help of many people, including Kato Hiroaki for the sets of minimal pairs and linguistic advice, Marion Dohen for advice on visible contrastive focus, Julien Deseigne for doing much of the actual work, Miyuki Kamachi for her face and voice, and Takahashi Kuratate and Erik Vatikiotis-Bateson for inspiration and pioneering the motion-capture setup. Ruth Campbell and Alan Johnston kindly commented on an earlier draft, but all mistakes are entirely my own.

Note

1. "Speech-reading" is a more accurate term for the processes involved in recovering information about speech from the visible appearance of the face because this involves not only the lips but also the tongue, jaw, teeth, cheeks, chin, eyebrows, eyes, forehead, and the head and quite possibly the body as a whole.
References

Bassili, J. N. (1978). Facial motion in the perception of faces and of emotional expressions. Journal of Experimental Psychology: Human Perception and Performance, 4(3), 373–379.
Calvert, G. A., Bullmore, E. T., Brammer, M. J., Campbell, R., Iversen, S. D., & David, A. S. (1999). Activation of auditory cortex during silent speech reading. Science, 276, 593–596.
Calvert, G. A., & Campbell, R. (2003). Reading speech from still and moving faces: The neural substrates of visible speech. Journal of Cognitive Neuroscience, 15(1), 57–70.
Campbell, R. (1986). The lateralization of lip-read sounds: A first look. Brain and Cognition, 5, 1–21.
Campbell, R. (2008). The processing of audio-visual speech: Empirical and neural bases. Philosophical Transactions of the Royal Society, 363, 1001–1010.
Campbell, R., Harvey, M., Troscianko, T., Massaro, D., & Cohen, M. M. (1996). Form and movement in speechreading: Effects of static and dynamic noise masks on speechreading faces and point-light displays. Paper presented at WIGLS, Delaware.
Campbell, R., Zihl, J., Massaro, D., Munhall, K., & Cohen, M. M. (1997). Speechreading in the akinetopsic patient, L.M. Brain, 120(Pt. 10), 1793–1803.
Cutler, A., Dahan, D., & van Donselaar, W. (1997). Prosody in the comprehension of spoken language: A literature review. Language and Speech, 40, 141–201.
Dodd, B. (1979). Lip reading in infants: Attention to speech presented in- and out-of-synchrony. Cognitive Psychology, 11, 478–484.
Dohen, M., Loevenbruck, H., Cathiard, M.-A., & Schwartz, J.-L. (2004). Visual perception of contrastive focus in reiterant French speech. Speech Communication, 44, 155–172.
Dohen, M., Loevenbruck, H., & Hill, H. (2005). A multi-measurement approach to the identification of the audiovisual facial correlates of contrastive focus in French. Paper presented at the audio-visual speech processing conference, Vancouver Island, B.C.
Giese, M., & Poggio, T. (2003). Neural mechanisms for the recognition of biological movements and action. Nature Reviews Neuroscience, 4, 179–192.
Grant, K. W., & Seitz, P. F. (2000). The use of visible speech cues for improving auditory detection of spoken sentences. Journal of the Acoustical Society of America, 108, 1197–1208.
Green, K. P., & Gerdeman, A. (1995). Cross-modal discrepancies in coarticulation and the integration of speech information: The McGurk effect with mismatched vowels. Journal of Experimental Psychology, Human Perception and Performance, 21(6), 1409–1426.
Green, K. P., & Miller, J. L. (1985). On the role of visual rate information in phonetic perception. Perception and Psychophysics, 55, 249–260.
Hill, H., & Johnston, A. (2001). Categorizing sex and identity from the biological motion of faces. Current Biology, 11(11), 880–885.
Hill, H., Troje, N. F., & Johnston, A. (2005). Range- and domain-specific exaggeration of facial speech. Journal of Vision, 5(10), 793–807.
Hirata, Y. (2004). Effects of speaking rate on the vowel length distinction in Japanese. Journal of Phonetics, 32(4), 565–589.
Irwin, J., Whalen, D. H., & Fowler, C. (2006). A sex difference in visual influence on heard speech. Perception and Psychophysics, 68(4), 582–592.
Iverson, P., Bernstein, L., & Auer, E. (1998). Modelling the interaction of phonemic intelligibility and lexical structure in audiovisual word recognition. Speech Communication, 26(1–2), 45–63.
Jackson, P. L. (1988). The theoretical minimal unit for visual speech perception: Visemes and coarticulation. Volta Review, 90(5), 99–115.
Jordan, T. R., & Thomas, S. M. (2001). Effects of horizontal viewing angle on visual and audiovisual speech recognition. Journal of Experimental Psychology, Human Perception and Performance, 27(6), 1386–1403.
Kamachi, M., Hill, H., Lander, K., & Vatikiotis-Bateson, E. (2003). Putting the face to the voice: Matching identity across modality. Current Biology, 13, 1709–1714.
Kuratate, T., Yehia, H. C., & Vatikiotis-Bateson, E. (1998). Kinematics-based synthesis of realistic talking faces. Proceedings of the International Conference on Audio-Visual Speech Processing, Terrigal-Sydney, Australia.
Lander, K., Hill, H., Kamachi, M., & Vatikiotis-Bateson, E. (2007). It's not what you say but the way that you say it: Matching faces and voices. Journal of Experimental Psychology, Human Perception and Performance, 33(4), 905–914.
Liberman, A. M., & Mattingly, I. G. (1985). The motor theory of speech perception revised. Cognition, 21(1), 1–36.
MacSweeney, M., Campbell, R., Calvert, A., McGuire, P., David, A. S., Suckling, J., et al. (2001). Dispersed activation in the left temporal cortex for speech-reading in congenitally deaf people. Proceedings of the Royal Society of London, Series B, 268, 451–457.
McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264, 746–748.
Mohammed, T., Campbell, R., MacSweeney, M., Milne, E., Hansen, P., & Coleman, M. (2005). Speechreading skill and visual movement sensitivity are related in deaf speechreaders. Perception, 34, 205–216.
Munhall, K., Jones, J. A., Callan, D. E., Kuratate, T., & Vatikiotis-Bateson, E. (2004). Visual prosody and speech intelligibility: Head movement improves auditory speech perception. Psychological Science, 15(2), 133–137.
Munhall, K., & Vatikiotis-Bateson, E. (1998). The moving face during speech communication. In R. Campbell, B. Dodd, and D. Burnham (eds.), Hearing by eye II: Advances in the psychology of speechreading and audio-visual speech. Hove, UK: Psychology Press, pp. 123–139.
Munhall, K., Kroos, C., Jozan, G., & Vatikiotis-Bateson, E. (2004). Spatial frequency requirements for audiovisual speech perception. Perception and Psychophysics, 66(4), 574–583.
Pollick, F. E., Hill, H., Calder, A. J., & Patterson, H. (2003). Recognizing expressions from spatially and temporally modified movements. Perception, 32, 813–826.
Reisberg, D. (1987). Easy to hear but hard to understand: A lip-reading advantage with intact auditory stimuli. In B. Dodd and R. Campbell (eds.), Hearing by eye: The psychology of lip-reading.
London: LEA, pp. 97–113.
Remez, R. E., Fellowes, J. M., Pisoni, D. B., Goh, W. D., & Rubin, P. E. (1998). Multimodal perceptual organization of speech: Evidence from tone analogs of spoken utterances. Speech Communication, 26, 65–73.
Remez, R. E., Rubin, P. E., Pisoni, D. B., & Carrell, T. D. (1981). Speech perception without traditional speech cues. Science, 212, 947–950.
Rosenblum, L. D. (2007). Look who's talking: Recognizing friends from visible articulation. Perception, 36, 157–159.
Rosenblum, L. D., Johnson, J. A., & Saldaña, H. M. (1996). Visual kinematic information for embellishing speech in noise. Journal of Speech and Hearing Research, 39(6), 1159–1170.
Rosenblum, L. D., & Saldaña, H. M. (1998). Time-varying information for visual speech perception. In R. Campbell, B. Dodd, and D. Burnham (eds.), Hearing by eye II. Hove, UK: Psychology Press, pp. 61–81.
Runeson, S., & Frykholm, G. (1983). Kinematic specification of dynamics as an informational basis for perception and action perception. Journal of Experimental Psychology, General, 112, 585–612.
Skipper, J. I., Nusbaum, H. C., & Small, S. L. (2005). Listening to talking faces: Motor cortical activation during speech perception. NeuroImage, 25, 76–89.
Skipper, J. I., Nusbaum, H. C., & Small, S. L. (2006). Lending a helping hand to hearing: Another motor theory of speech perception. In M. A. Arbib (ed.), Action to language via the mirror neuron system. New York: Cambridge University Press, pp. 250–285.
Steede, L. L., Tree, J. J., & Hole, G. J. (2007). I can't recognize your face but I can recognize its movement. Cognitive Neuropsychology, 24(4), 451–466.
Sumby, W. H., & Pollack, I. (1954). Visual contribution to speech information in noise. Journal of the Acoustical Society of America, 26, 212–215.
Summerfield, Q. (1987). Some preliminaries to a comprehensive account of audio-visual speech processing. In B. Dodd and R. Campbell (eds.), Hearing by eye: The psychology of lipreading. Hove, UK: Lawrence Erlbaum, pp. 3–51.
Summerfield, Q., & McGrath, M. (1984). Detection and resolution of audio-visual incompatibility in the perception of vowels. Quarterly Journal of Experimental Psychology, A(36A), 51–74.
Thomas, F., & Johnston, O. (1981). Disney animation: The illusion of life (1st ed.). New York: Abbeville Press.
Vatikiotis-Bateson, E., Eigsti, I.-M., Yano, S., & Munhall, K. (1998). Eye movement of perceivers during audio-visual speech perception. Perception and Psychophysics, 60(6), 926–940.
Vatikiotis-Bateson, E., & Kelso, J. A. S. (1993). Rhythm type and articulatory dynamics in English, French, and Japanese. Journal of Phonetics, 21, 231–265.
Vitkovich, M., & Barber, P. (1994). Effects of video frame rate on subjects' ability to shadow one of two competing verbal passages. Journal of Speech and Hearing Research, 37, 1204–1210.
Watson, T. L., Johnston, A., Hill, H., & Troje, N. F. (2005). Motion as a cue for viewpoint invariance. Visual Cognition, 12(7), 1291–1308.
Yakel, D. A., Rosenblum, L. D., & Fortier, M. A. (2000). Effects of talker variability on speechreading. Perception and Psychophysics, 62(7), 1405–1412.
Yehia, H. C., Rubin, P. E., & Vatikiotis-Bateson, E. (1998). Quantitative association of vocal-tract and facial behavior. Speech Communication, 26, 23–44.
II
PHYSIOLOGY
7
Dynamic Facial Signaling: A Dialog between Brains
David A. Leopold
Humans live in complex and interactive societies, with much of our social perception focused on one another's faces. Faces serve as a basis for individual recognition, with evidence suggesting that identity is decoded through specialized neural circuits in the temporal cortex. Faces are also configurable, consisting of independently movable features, which allows them to convey a range of social signals unrelated to identity. In fact, most of the time spent looking at faces is not concerned with learning who an individual is, but rather with what is on their mind: What are they communicating? What is their emotional state? What is capturing their interest? Often this information can be extracted from a single glance. However, unlike identity, these facial signals can change from moment to moment and, in the context of a social exchange, warrant constant attention. Recent experiments using behavioral, imaging, and electrophysiological techniques have revealed neurons that are sensitive to a variety of configurable facial details. These studies demonstrate that our brains are hard-wired to efficiently extract meaningful content from the faces we observe. Thus dynamism is a fundamental characteristic of any social exchange. This principle is obvious in the case of verbal communication, where language by its very nature comprises a temporal stream of information. Humans and our primate cousins rely extensively on visual communication, which also exhibits an inherent temporal structure that cannot be captured, for example, in a static image. Thus understanding how the brain encodes and interprets dynamic visual signals is of great importance for social neuroscience. A second characteristic of many social exchanges is their dyadic nature, in which there is a mutual and bidirectional exchange of signals. Dynamism, combined with dyadic exchange, frequently gives rise to a temporal rhythm governing the social exchange. To communicate anything, one must have a repertoire of dynamic signals, along with another receiving entity, usually a conspecific, capable of interpreting the signals. Static faces would not be of much use socially. If all people wore plastic masks, individual recognition might be preserved (if the masks were different), but visual communication would be greatly diminished and restricted to movements of the
body. The primate face has evolved to be a configurable structure whose elaborated mimetic musculature provides a broad repertoire of facial postures. The postures themselves are somewhat arbitrary: puckering of the lips or raising of the eyebrows. But among conspecifics, where there exists a tacit, evolved agreement among individuals about the meaning of such signals, the choice of facial expression can be of the highest importance. As part of a greater social dialog, even a subtle change in a facial expression can raise passions and sometimes elicit strong reactions. An encounter between two individuals requires that they coordinate their facial behavior. More accurately, it is their brains that coordinate their facial behavior. In a manner of speaking, faces are mere instrument panels that send and receive signals. The brains are the puppet masters in the background, gathering visual information through the eyes and tugging on particular facial muscles to send signals to be interpreted by other brains (see figure 7.1). Brain A directs the eyes of monkey A to the face of monkey B, which is controlled by brain B. Then, based on an assessment of
Figure 7.1 The perception–action cycle in dyadic interaction, and the inherently dynamic nature of face perception. Social exchanges are most accurately conceived as exchanges between two or more brains. In primates, faces serve as an interface for a dynamic visual dialog that evolves over time. Here, the faces are mere control panels for the visual interaction between monkey A and monkey B, serving to send and receive visual signals. Brain A perceives face B and then issues a particular facial behavior. This behavior is then perceived by brain B, and so on.
monkey B’s face, brain A elicits a facial behavior, which is then interpreted by brain B, and so on. In this example of a dyadic exchange, the detection of a growing smile might be met with the immediate lowering of the brow, which might lead to a brief widening of the eyes, prompting a downward look with the eyes and carefully timed exhalation. Through their faces, the two brains interact, sometimes collaborating and sometimes competing. Actions and reactions depend on a complex set of innate and learned rules governing the social dialog. We are constantly participating in this sort of social exchange, although most of its details escape our attention—we simply don’t need to know about them. It is in this context that the study of dynamic face perception fits naturally. Physiology of Dynamic Face Perception
This section of the book explores neural mechanisms underlying dynamic face perception in human and nonhuman primates. Social neuroscience is a burgeoning field that has focused much of its energy on human subjects, with functional magnetic resonance imaging (fMRI) taking center stage. Perhaps because fMRI is based on a sluggish blood-based signal, requiring several seconds to evolve, the study of dynamic aspects of social interaction is still in its infancy. Nonetheless, the following chapters demonstrate that neuropsychological and electrophysiological experiments have made important inroads into this challenging topic. In chapter 8, Shepherd and Ghazanfar set the stage for the rest of the section by eloquently making the case for studying facial behavior. Faces consist of moveable elements whose variation about a neutral position during an emotional expression, vocalization, or eye movement carries more information than any static image. The brain’s apparent sensitivity to such facial dynamism is eminently important in social exchange. Shepherd and Ghazanfar review a broad literature, covering the topics of gaze, attention, emotion, and vocal communication. They argue convincingly that dynamic visual and vocalization behaviors are intimately related for both monkeys and humans. This link is underscored by their own electrophysiological results in monkeys, which show enhanced neural responses, as well as increased correlation between auditory and visual brain areas, when monkeys’ vocalizations are appropriately combined with corresponding movies of their changing facial expressions. In chapter 9, Puce and Schroeder review human electrophysiological experiments related to facial movement, focusing on event-related potentials (ERPs), such as the negative deflection around 170 ms (N170). Dynamic faces elicit stronger N170 responses than do static ones. They point out that the measured responses to moving faces cannot be experimentally treated in the same way as those to flashed static faces, since the temporal structure of the stimulus itself contributes to the dynamics of the response. Given this caveat, they find that the strength of the neural response
depends on the social and emotional context of a viewed face, including whether gaze moves toward or away from the subject. Importantly, they conclude that the N170 is sensitive to the social significance of particular facial movements, rather than simply reflecting the precise spatiotemporal visual stimulus. In chapter 10, Vuilleumier and Righart review behavioral evidence that the perception of dynamic faces is enhanced compared with that of "frozen" (static) ones. They also provide a somewhat different perspective on the ERP studies of dynamic faces, including the N170, whose sensitivity to changes in facial expression requires that the identity of the face remain unchanged. They also review evidence that very brief dynamic expressions, sometimes termed "microexpressions" (Ekman, 2003), can modulate the perceived intensity of subsequent expressions. These movements may do more than simply enhance the appearance of an emotional state, and Vuilleumier and Righart review evidence that the muscles on the observer's face are measurably stimulated when subjects view dynamic facial expressions (Dimberg, 1982). Thus, seeing an emotional expression engages one's own expression-producing machinery. This finding, which is related to the phenomenon of emotional contagion, is a fascinating aspect of the facial dialog that illustrates the direct relationship between perception and action in facial processing. In chapter 11, the final chapter of this section, de Gelder and van den Stock discuss clinical observations relating to the dynamic information expressed by faces. After reviewing the processing of static and dynamic faces in normal subjects, they analyze how facial movement affects perception in patients with various cognitive deficits, ranging from developmental prosopagnosia to brain lesions to autism spectrum disorder. Human neuropsychology case studies provide evidence that facial movement is analyzed independently of static facial structure. However, the authors caution against oversimplification, predicting that a strict modular framework for interpreting the results of face dynamism is unlikely to be fruitful. A common theme in these chapters is the apparent role of the superior temporal sulcus (STS) in the decipherment of dynamic facial and body gestures. The homology between the STS areas of monkeys and humans is still unclear. Nonetheless, this structure seems to be important for processing dynamic social stimuli in both species. Early observations by Perrett, Rolls, and colleagues in the macaque monkey identified this sulcus, most probably its upper bank, as being important for processing the direction of gaze (Perrett et al., 1985b), biological motion (Perrett et al., 1985a), and emotional expression (Hasselmo, Rolls, & Baylis, 1989). Subsequent neuroimaging work showed surprisingly similar responses in the human STS, despite the uncertainty regarding its anatomical correspondence to the monkey STS, with the STS in humans displaying a similar sensitivity to moveable features of the faces (Engell and Haxby, 2007). In humans, the N170 potential is thought to originate from face-specialized cortical areas, including the STS.
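To make the N170 measurements discussed here concrete, the sketch below shows one conventional way such an event-related potential could be computed: EEG epochs are cut around stimulus onsets, baseline-corrected, averaged, and the N170 read out as the most negative deflection in a post-stimulus window. The sampling rate and the 140–200 ms window are illustrative assumptions, not the parameters of the studies cited in these chapters.

```python
import numpy as np

def erp_n170(eeg, onsets, fs=500.0, pre=0.2, post=0.5, win=(0.14, 0.20)):
    """eeg: array (n_channels, n_samples) of continuous EEG.
    onsets: sample indices of stimulus onsets (e.g., dynamic-face movie starts).
    Returns the average epoch plus N170 amplitude and latency per channel,
    defined here as the most negative point in the 140-200 ms window."""
    n_pre, n_post = int(pre * fs), int(post * fs)
    epochs = []
    for ev in onsets:
        if ev - n_pre < 0 or ev + n_post > eeg.shape[1]:
            continue                                    # skip epochs running off the record
        seg = eeg[:, ev - n_pre: ev + n_post]
        seg = seg - seg[:, :n_pre].mean(axis=1, keepdims=True)  # baseline correction
        epochs.append(seg)
    erp = np.mean(epochs, axis=0)                       # (n_channels, n_times)
    times = (np.arange(erp.shape[1]) - n_pre) / fs
    mask = (times >= win[0]) & (times <= win[1])
    window = erp[:, mask]
    idx = window.argmin(axis=1)
    amplitude = window[np.arange(window.shape[0]), idx]
    latency = times[mask][idx]
    return erp, amplitude, latency
```

Contrasting the resulting amplitudes for dynamic-face versus static-face epochs would be the kind of comparison summarized above.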
The dorsal STS region of the monkey brain, in contrast to the more ventral face-processing areas, is multimodal, responding to both visual and auditory social information. The integration of these different modalities is most obvious during the processing of vocalizations, where dynamic changes in facial posture accompany the issuance of a specific call. Monkeys are behaviorally sensitive to the correspondence between these two components of conspecific vocalizations (Ghazanfar and Logothetis, 2003). Both Shepherd and Ghazanfar and Puce and Schroeder present electrophysiological evidence that visual and auditory cues during vocal behavior are integrated in the temporal lobe. Shepherd and Ghazanfar show an enhanced coupling between the STS and the auditory cortex when a movie of a vocalizing animal is played together with the correct soundtrack. Puce and Schroeder show that the current sinks in the upper bank of the STS are similarly enhanced compared with either the visual or auditory cue alone. These results illustrate the neural integration of visual and nonvisual signals related to the interpretation of faces, in this case in the context of natural vocalizations.

Challenges for Neurophysiology
Understanding the neural processing of dynamic social stimuli poses difficulties for both experimental design and data analysis. First, there exists a tension between having experiments adhere to a natural social or ethological context and ensuring that they are under tight experimental control. In some ways, our understanding of facial expressions is supported by a rich neurophysiological literature that demonstrates many examples of neural specialization for socially relevant facial information. On the other hand, single-unit studies of face processing have been carried out almost exclusively using conventional methods, where eye gaze is restricted and images of faces are briefly flashed onto a blank screen. While this conventional testing paradigm holds many advantages, it fails to capture the hallmarks of facial interaction: its dynamism and its dyadic nature. Shifting experimental paradigms to a more natural viewing regime may allow richer social behavior. At the same time, it introduces a host of other problems for interpreting neural signals. For example, removing all gaze restrictions inevitably leads to a large trial-to-trial variation in firing owing to the monkey's unpredictable sequence of eye movements. Similarly, testing the neural responses to faces in a natural, social setting, say in a crowd, adds another layer of complexity in identifying the determinants of neural firing (Sheinberg and Logothetis, 2001). Developing methods to interpret neural responses to natural, animate stimuli in the context of the observer's own exploratory behavior will be an important step forward for social neuroscience. Another challenge for neurophysiology is understanding how dynamic sensory stimuli are encoded in the time domain, given that neurons exhibit their own response
dynamics even to static stimuli. Cells in the inferotemporal cortex of monkeys, for example, show strong temporal patterning in response to briefly flashed static stimuli, and these responses carry stimulus-specific information (Richmond, Optican, Podell, & Spitzer, 1987; Sugase, Yamane, Ueno, & Kawano, 1999). How would such neurons respond to continually changing patterns, such as a dynamic facial expression? In this domain visual neurophysiologists may need to follow the lead of their auditory counterparts, who are accustomed to evaluating neural responses to inherently dynamic stimuli.
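One concrete technique the auditory tradition offers for exactly this problem is reverse correlation: characterizing a neuron by the average stimulus history that precedes its spikes, rather than by its response to flashed frames. The sketch below is a generic spike-triggered average over a single time-varying facial feature (mouth aperture is used purely as an example); the feature, sampling rate, and window length are assumptions for illustration, not an analysis reported in this chapter.

```python
import numpy as np

def spike_triggered_average(feature, spike_times, fs, window=0.3):
    """feature: 1-D array, a time-varying facial feature (e.g., mouth aperture)
    sampled at fs Hz alongside the neural recording.
    spike_times: spike times in seconds.
    Returns the average feature trace over the `window` seconds before a spike."""
    n = int(window * fs)
    sta = np.zeros(n)
    count = 0
    for t in spike_times:
        i = int(round(t * fs))
        if n <= i <= feature.size:       # keep only spikes with a full preceding window
            sta += feature[i - n:i]
            count += 1
    return sta / max(count, 1)
```

The same logic extends to richer, multidimensional stimulus descriptions, at the cost of requiring more data.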
Meeting of the Minds

During a dyadic interaction, when two brains are involved, analysis of neurophysiological data will become even more of a challenge than it already is, as the rhythm of the dyadic exchange will figure prominently in the neural responses. It will therefore be important to measure and understand elements of the facial dialog. Facial behaviors in one individual elicit a corresponding behavior in another. For example, it has been shown that humans and monkeys reflexively adjust their attention and gaze based on the observed gaze direction of another individual (Deaner & Platt, 2003). In an analogous manner, emotional expression can be contagious, and it would be of great value to capture such contagion in controlled animal experiments. The tight, interindividual coupling between perception and behavior underscores the notion that pairs of interacting brains can be conceived as a single dyadic entity, and it is possible that studying them from this perspective will provide insights into neural activity patterns that would otherwise be out of reach. Developing a framework for studying the neural mechanisms of perception and action during social exchanges is a great challenge. It is likely that future methods will capture the cerebral dialog directly by measuring neurons simultaneously in pairs of interacting subjects (see also Boker and Cohn in this volume). Of course, neurons in interacting brains will be correlated to some degree simply because the associated behaviors are correlated. But might there be something beyond that? One possibility is that a well-defined and coordinated perception-action cycle emerges between two individuals, where each brain takes turns producing a behavior and then sampling the counterpart's behavior. Another possibility is that rhythms in the two brains' neurons, or perhaps their local field potentials, become synchronized, or "locked," together in time. Recent conceptual advances have demonstrated the merits of parallel acquisition of brain activity (Montague et al., 2002), although without direct sensory communication between subjects. Even in the absence of face-to-face communication, and using the sluggish fMRI signal, this approach reveals the interplay between brains through engagement of circuits thought to be involved in social judgment (Tomlin et al., 2006).
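The kind of "locking" of rhythms across two brains raised above is commonly quantified with measures such as the phase-locking value between band-limited signals. A minimal sketch follows, assuming simultaneously recorded and already band-pass-filtered local field potentials from the two subjects (the filtering step and any statistical testing are omitted).

```python
import numpy as np

def analytic(x):
    """Analytic signal via the Hilbert transform, computed with the FFT so
    that only numpy is needed."""
    n = x.size
    X = np.fft.fft(x)
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(X * h)

def phase_locking_value(lfp_a, lfp_b):
    """lfp_a, lfp_b: band-pass-filtered field potentials recorded simultaneously
    from two interacting subjects.  Returns a value near 1 if their phase
    difference is nearly constant over time, and near 0 if it is random."""
    dphi = np.angle(analytic(lfp_a)) - np.angle(analytic(lfp_b))
    return np.abs(np.mean(np.exp(1j * dphi)))
```

Comparison against trial-shuffled pairings would be needed to separate genuine inter-brain coupling from coupling driven simply by correlated behavior, which is the caveat noted above.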
Finally, we reflect on language, which must be considered the primary form of information exchange in humans. The linguistic transformation of ideas and information into acoustic signals that can be transmitted from one person to another, decoded and understood, is arguably the most prominent uniquely human adaptation. One might therefore expect that humans would have come to depend less on facial communication than monkeys and apes do. Yet there is no evidence to suggest that language has displaced visual forms of social communication, which are abundant and preserved in all human cultures (for reviews, see Ekman & Oster, 1979 and Argyle, 1988). This may be in part because spoken language draws yet more attention to our mouth and face. The expression on the face of a speaker often conveys more accurate social information than the words themselves. The term "person" derives from the Latin persona, meaning mask (which is itself derived from personare, meaning to sound or speak through). This etymology suggests that humans have long considered the face to represent an image of their true identity. Yet in reality our face is only a mask: a configurable, dynamic, and highly evolved one. When we gaze at another person, we do not care about the visual details: the asymmetry of their mouth, the elevation of their eyebrow, or the velocity of their saccade. These metrics are unimportant at a cognitive level, and we leave them to the most specialized portions of our brain to analyze. Instead we aim to understand the meaning of their facial gestures and the underlying emotional states so that we can understand and predict their behavior, or, as phrased by Shepherd and Ghazanfar, to "see past the mask." To achieve this, we act not as detached observers, carefully reading and analyzing each expression, but as preprogrammed participants, allowing our brain to engage our own face in a visual dialog with that individual. Through such a dialog, and following a highly sophisticated, if somewhat arbitrary, set of social rules, we gain insight into another person's state of mind, and allow them to gain some insight into our own.

Conclusions
Social neuroscience is challenging the existing framework for understanding sensory processing in the brain. It is an open question whether the principles underlying the processing of socially relevant stimuli, such as dynamic faces, can be understood with traditional concepts and paradigms, or whether studying the interaction between multiple individuals will lie at the heart of understanding the relevant brain circuitry. Neurophysiologists have long viewed activity in the brain objectively, documenting the responses to external sensory stimuli, motor actions, and cognitive states. Social neuroscience is pushing us to extend our behavioral, physiological, and conceptual tools to more than one brain at a time, in order to study the coupling between
behaviors, sensations, and patterns of neural activity that mark the social exchanges at the heart of human evolutionary optimization. References Adolphs, R. (2002). Neural systems for recognizing emotion. Curr Opin Neurobiol, 12, 169–177. Argyle, M. (1988). Bodily communication. London: Taylor and Francis. Deaner, R. O., & Platt, M. L. (2003). Reflexive social attention in monkeys and humans. Curr Biol, 13, 1609–1613. Dimberg, U. (1982). Facial reactions to facial expressions. Psychophysiology, 19, 643–647. Ekman, P. (2003). Emotions revealed. New York: Holt and Company. Ekman, P., & Oster, H. (1979). Facial expressions of emotion. Annu Rev Psychol, 30, 527–554. Engell, A. D., & Haxby, J. V. (2007). Facial expression and gaze-direction in human superior temporal sulcus. Neuropsychologia, 45, 3234–3241. Ghazanfar, A. A., & Logothetis, N. K. (2003). Neuroperception: Facial expressions linked to monkey calls. Nature, 423, 937–938. Hasselmo, M. E., Rolls, E. T., & Baylis, G. C. (1989). The role of expression and identity in the faceselective responses of neurons in the temporal visual cortex of the monkey. Behav Brain Res, 32, 203–218. Montague, P. R., Berns, G. S., Cohen, J. D., McClure, S. M., Pagnoni, G., Dhamala, M., Wiest, M. C., Karpov, I., King, R. D., Apple, N., & Fisher, R. E. (2002). Hyperscanning: Simultaneous fMRI during linked social interactions. Neuroimage, 16, 1159–1164. Perrett, D. I., Smith, P. A., Mistlin, A. J., Chitty, A. J., Head, A. S., Potter, D. D., Broennimann, R., Milner, A. D., & Jeeves, M. A. (1985a). Visual analysis of body movements by neurones in the temporal cortex of the macaque monkey: A preliminary report. Behav Brain Res, 16, 153–170. Perrett, D. I., Smith, P. A., Potter, D. D., Mistlin, A. J., Head, A. S., Milner, A. D., & Jeeves, M. A. (1985b). Visual cells in the temporal cortex sensitive to face view and gaze direction. Proc R Soc Lond B Biol Sci, 223, 293–317. Richmond, B. J., Optican, L. M., Podell, M., & Spitzer, H. (1987). Temporal encoding of two-dimensional patterns by single units in primate inferior temporal cortex. I. Response characteristics. J Neurophysiol, 57, 132–146. Sheinberg, D. L., & Logothetis, N. K. (2001). Noticing familiar objects in real-world scenes: The role of temporal cortical neurons in natural vision. J Neurosci, 21, 1340–1350. Sugase, Y., Yamane, S., Ueno, S., & Kawano, K. (1999). Global and fine information coded by single neurons in the temporal visual cortex. Nature, 400, 869–873. Tomlin, D., Kayali, M. A., King-Casas, B., Anen, C., Camerer, C. F., Quartz, S. R., & Montague, P. R. (2006). Agent-specific responses in the cingulate cortex during economic exchanges. Science, 312, 1047– 1050.
8
Engaging Neocortical Networks with Dynamic Faces
Stephen V. Shepherd and Asif A. Ghazanfar
Dynamism is the rule that governs our pursuit of goals and patterns the environment we perceive. To understand perception of faces, we must do more than imagine a series of discrete, static stimuli and responses. Brains exist to coordinate behavior, not merely to perceive; they have evolved to guide interaction with their environment and not merely to produce a representation of it. Brains expect our environment to change, both in response to our actions and of its own accord, and they extrapolate both the predictable motion of physical objects and the more capricious, goal-directed motion of other animate beings. Finally, our brains have adapted to a world that tests us continuously in real time, not discretely, in separable questions and answers. After all, primates are animate; our faces translate and rotate in space relative to our bodies and to our larger environment. Furthermore, onto the static, bone-structure-defined configuration of any individual's face are layered dynamic processes, including (from longer to shorter time scales) aging processes; changes in health, hormonal status, hunger, hydration, and exertion; blinks and expression shifts occurring irregularly every several seconds; irregular but sustained mouth and postural changes associated with vocalizations; microexpression shifts several times per second; shifts in sensory orientation several times per second (most obviously by the eyes, but also the ears in some primates); and irregular but rhythmic patterns associated with mastication and human speech. Each of these steadily changing features alters the perception of the observer and contributes to both basic perception (e.g., biological motion) and cognitive attributions (e.g., mental states, social relationships). Earlier study of these dynamic features has focused on cortical responses to static gaze direction, static expression, and pictures or brief videos depicting biological motion. Although some of the consequences of observing facial dynamics are known, including a tendency to physically mimic both observed actions and attentional or emotional states, the details and mechanisms of these processes remain largely obscure.
This dynamism presents incredible technical difficulties, but recent advances make it possible to generate and analyze increasingly complex stimuli and, in parallel, to record from large-scale brain networks and analyze their relationships with their external environment. Thus we believe that despite the myriad challenges, we are on the verge of crossing another Rubicon, beyond which social processes will not be conceived merely as stimulus behaviors and perceptual responses. It seems likely that the most consequential features of natural social behavior, including vocal communication such as speech, cannot be successfully evoked using static, unimodal, noninteracting, spatially abstracted stimuli. Nonetheless, the transition to an interactive paradigm of social interaction will necessarily be bridged by using data collected with static faces. As such, this chapter will review some of what we know about face processing in dyadic interactions with humans and nonhuman primates. When possible, we will reveal how dynamic faces change what we know about face processing and its underlying neural substrates.

Facial Motion and Vocal Communication
Primates spend much of their time looking at the faces of other individuals and in particular at their eyes. If the eye-movement strategies of monkeys viewing vocalizing faces are made primarily to glean social information, then why fixate on the eyes? Many previous experiments have shown that monkeys prefer to look at the eyes when viewing neutral or expressive faces (Keating & Keating, 1982; Nahm, Perret, Amaral, & Albright, 1997; Guo, Robertson, Mahmoodi, Tadmor, & Young, 2003) and the attention directed at the eyes often seems to be used to assess the intention of a conspecific or other competitor (Ghazanfar & Santos, 2004). Humans likewise tend to focus on the eye region (Yarbus, 1967; Birmingham, Bischof, & Kingstone, 2007; Fletcher-Watson, Findlay, Leekam, & Benson, 2008). Thus both humans and monkeys may focus on the eyes when observing a conspecific to glean information about that individual’s intentions. (This is reviewed in detail in the following section.) During vocal communication, patterns of eye movements of the observer can be driven by both the demands of the task as well as the dynamics of the observed face. Recent studies of humans examined observers’ eye movements during passive viewing of movies (Klin, Jones, Schultz, Volkmar, & Cohen, 2002) and under different listening conditions, such as varying levels of background noise (VatikiotisBateson, Eigsti, Yano, & Munhall, 1998), competing voices (Rudmann, McCarley, & Kramer, 2003), or silence (i.e., speech-reading with no audio track) (Lansing & McConkie, 1999, 2003). When typical human subjects are given no task or instruction regarding what acoustic cues to attend to, they will consistently look at the eye region more than the mouth when viewing videos of human speakers (Klin et al.,
2002). However, when subjects are required to perform a specific task, then eye-movement patterns are task dependent (see Land & Hayhoe, 2001). For example, when they are required to attend to speech-specific aspects of a communication signal (e.g., phonetic details in high background noise, word identification, or segmental cues), humans will make significantly more fixations on the mouth region than on the eye region (Vatikiotis-Bateson et al., 1998; Lansing & McConkie, 2003). In contrast, when subjects are asked to focus on prosodic cues or to make social judgments based on what they see or hear, they direct their gaze more often toward the eyes than the mouth (Lansing & McConkie, 1999; Buchan, Pare, & Munhall, 2004, 2007). The sensorimotor mechanisms that analyze and integrate facial and vocal expressions are most likely an early innovation that is not specific to perception of human speech (Ghazanfar & Santos, 2004). The eye-movement patterns of rhesus monkeys viewing dynamic vocalizing faces share many of the same features as human eye-movement patterns (Ghazanfar, Nielsen, & Logothetis, 2006). Monkeys viewing video sequences of other monkeys vocalizing under different listening conditions (silent, matched, or mismatched) spent most of their time inspecting the eye region relative to the mouth under all conditions (figure 8.1a). When they did fixate on the mouth, it was highly correlated with the onset of mouth movements (figure 8.1b). These data show that the pattern of eye fixations is driven at least in part by the dynamics of the face and are strikingly similar to what we know about human eye-movement patterns during speech-reading. In both species, a greater number of fixations fall in the eye region than in the mouth region when subjects are required simply to view conspecifics (Klin et al., 2002), to attend to vocal emotion cues, or to make social judgments (Buchan et al., 2007). Even during visual speech alone (no auditory component), when subjects are asked to attend to prosodic cues, they will look at the eyes more than the mouth (Lansing & McConkie, 1999). Furthermore, like human observers (Lansing & McConkie, 2003), monkeys look at the eyes before they look at the mouth and their fixations on the mouth are tightly correlated with mouth movement (Ghazanfar et al., 2006). For instance, Lansing and McConkie (2003) reported that regardless of whether it was visual or audiovisual speech, subjects asked to identify words increased their fixations on the mouth region with the onset of facial motion. The same was true for rhesus monkeys; they fixate on the mouth upon the onset of movement in that region (Ghazanfar et al., 2006). The dynamics of the face can give us clues as to why both monkeys and humans focus primarily on the eye region even during audiovisual vocal communication. One possibility is that gaze deployments may be optimized to extract socially relevant cues that are relatively higher in spatial frequency near the eyes and temporal frequency near the mouth, matching the relative spatiotemporal precision of the foveal and peripheral retina. In addition, the angular size of faces may be too small at
Figure 8.1 Eye movements of monkey observers viewing vocalizing conspecifics. (a) The average fixation on the eye region versus the mouth region across three subjects while viewing a 30-second video of a vocalizing conspecific. The audio track had no influence on the proportion of fixations falling onto the mouth or the eye region. Error bars represent the S.E.M. (b) We also find that when monkeys do saccade to the mouth region, it is tightly correlated with the onset of mouth movements (r = 0.997, p < 0.00001).
conversational distances for there to be a large cost to speech-reading in fixating on the eyes rather than the mouth. As proposed by Vatikiotis-Bateson et al. (1998), it is possible that perceivers acquire vocalization-related information that is distributed broadly on the vocalizer's face. Facial motion during speech is in part a direct consequence of the vocal tract movements necessary to shape the acoustics of speech; indeed, a large portion of the variance observed in vocal tract motion can be estimated from facial motion (Yehia, Kuratate, & Vatikiotis-Bateson, 2002). Humans, therefore, can identify vocal sounds when the mouth appears outside the fovea or is masked, presumably by using these larger-scale facial motion cues (Preminger, Lin, & Levitt, 1998). Head movement can also be an informative cue, linked to both the fundamental frequency (F0) and the voice amplitude of the speech signal (Yehia et al., 2002; Munhall, Jones, Callan, Kuratate, & Vatikiotis-Bateson, 2004). When head movements are eliminated or distorted in speech displays, speech perception is degraded (Munhall et al., 2004). Finally, it is possible that saccades to the mouth are epiphenomenal and merely a reflexive response to detection of motion in the visual periphery (Vatikiotis-Bateson et al., 1998). As in humans, different rhesus monkey vocalizations are produced with unique facial expressions and the motion of articulators influences the acoustics of the signal (Hauser, Evans, & Marler, 1993; Hauser and Ybarra, 1994). Such articulatory postures could influence facial motion beyond the mouth region. For example, grimaces produced during scream vocalizations cause the skin folds around the eyes to increase in number (Hauser, 1993). Thus, for many of the same reasons suggested for human perceivers, rhesus monkeys may simply not need to look directly at a mouth to monitor visual aspects of vocalization.

Gaze and Attention
Faces are salient because they are readily identifiable indicators of animacy and because they strongly distinguish species and individuals. More than this, faces typically lead the body in movement; they contain the ingestive apparatus by which competitors consume and predators prey; and they combine the major sensory organs that both define an animal’s attentional orientation and connote its current intent. The orientation of our most important sensory organ—our eyes—is particularly revealing, providing a crucial context for facial expressions and revealing both our attentional state and our likely future intentions. The information encoded in the direction of a gaze appears to consist of two related signals: first, and urgently, whether the observed individual gazes toward the observer; second, whether (and which) object or area has captured the attention of the observed individual.
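As a purely illustrative toy model (not drawn from the studies reviewed here), the two signals just described can be separated with simple geometry once the observed individual's eye position and gaze direction have been estimated: a direct-gaze test against the observer's own location and, failing that, a search for the candidate object closest to the gaze line. All names and thresholds below are hypothetical.

```python
import numpy as np

def read_gaze(eye_pos, gaze_dir, observer_pos, objects, cone_deg=10.0):
    """Toy separation of the two gaze signals described in the text.
    eye_pos: 3-D position of the observed individual's eyes.
    gaze_dir: their gaze direction (any length; normalized below).
    observer_pos: the observer's own position.
    objects: dict mapping object names to 3-D positions.
    Returns ('direct', None) if the gaze falls within a cone around the
    observer, otherwise ('averted', <object nearest the gaze line>)."""
    eye_pos = np.asarray(eye_pos, float)
    gaze_dir = np.asarray(gaze_dir, float)
    gaze_dir = gaze_dir / np.linalg.norm(gaze_dir)

    def angle_to(point):
        v = np.asarray(point, float) - eye_pos
        v = v / np.linalg.norm(v)
        return np.degrees(np.arccos(np.clip(gaze_dir @ v, -1.0, 1.0)))

    if angle_to(observer_pos) < cone_deg:
        return "direct", None
    return "averted", min(objects, key=lambda name: angle_to(objects[name]))
```

Real gaze decoding from video is, of course, far harder; the point is only that "being watched" and "watching what" are distinct read-outs of the same directional signal.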
It is quite likely that the first manner of sensitivity to gaze direction—sensitivity to being watched—occurs early in ontogeny and is phylogenetically widespread. Human infants prefer full-face to nonface configurations within 72 hours (Macchi Cassia, Simion, & Umilta, 2001), and prefer direct to averted gaze shortly thereafter (within 2–5 days; Farroni, Csibra, Simion, & Johnson, 2002), by a gestational age of 10 months (Batki, Baron-Cohen, Wheelwright, Connellan, & Ahluwalia, 2000). This sensitivity is widespread, being shared with diverse vertebrates, reported variously in fish (Coss, 1979), primitive primates (Coss, 1978), marine mammals (Xitco, Gory, & Kuczaj, 2004), lizards (Burger, Gochfeld, & Murray, 1992), snakes (Burghardt & Greene, 1988), and birds (Ristau, 1991). The second manner of sensitivity—the use of gaze as a deictic (pointing) cue—has a more uncertain origin both in the brain and in the course of evolution. It is known, for example, that gaze perception interacts bidirectionally with emotional and social perception, both in the perception of facial expressions (Adams & Kleck, 2005; Ganel, Goshen-Gottstein, & Goodale, 2005) and in environmental interactions (Bayliss & Tipper, 2006; Bayliss, Frischen, Fenske, & Tipper, 2007). Furthermore, perceiving another's direction of gaze does not merely inform us but actively steers our own attention. Fletcher-Watson and colleagues found that while the first saccades to a social scene were to the eyes (Fletcher-Watson et al., 2008), subsequent saccades tended to follow gaze. Increasingly, evidence suggests that these responses are not fully under conscious control. Friesen and Kingstone (1998) discovered, and others quickly replicated (Driver et al., 1999; Langton & Bruce, 1999), that humans reflexively follow the gaze of others. We respond faster and more accurately when detecting, localizing, or discriminating targets that appear in the direction viewed by another individual. These responses occur to static and dynamic gaze cues, to head and eye orientation, to real or cartoon faces (reviewed by Frischen, Bayliss, & Tipper, 2007), and persist even when gaze cues oppose the demands of a task four times out of five. However important such a response may be to humans' sophisticated social cognition (Baron-Cohen, 1994), it appears that it is not unique. Apes (Brauer, Call, & Tomasello, 2005), monkeys (Emery, Lorincz, Perrett, Oram, & Baker, 1997; Tomasello, Call, & Hare, 1998), and perhaps even lemurs (Shepherd & Platt, 2008) follow gaze. Intriguingly, macaque and human gaze-following show similar dynamics (figure 8.2) (Deaner & Platt, 2003), suggesting that our behaviors may share common, evolutionarily ancient mechanisms. In monkeys, as in humans, these fast and stereotyped responses are nonetheless nuanced and context dependent [e.g., influenced in humans by emotion (Mathews, Fox, Yiend, & Calder, 2003; Putman, Hermans, & van Honk, 2006), gender and familiarity (Deaner, Shepherd, & Platt, 2007); and in monkeys by dominance status (Shepherd, Deaner, & Platt, 2006)]. The mechanisms that mediate and modulate gaze-following may not, however, be unique
Figure 8.2 Gaze-following tendencies are shared by humans and macaques, and their similar time course suggests they may derive from similar mechanisms. (a) Savings in reaction time when responding to congruent versus incongruent stimuli. At 200 ms after cue onset, both macaque and human subjects responded faster to targets that appeared in the direction of their gaze. (b) Bias in fixation position as a function of observed gaze direction. From 200 ms onward, a systematic drift in eye position suggested that evoked shifts of covert attention had biased the subject's microsaccades in the same direction toward which they'd seen another primate look. * p = 0.05; ** p = 0.001.
to perceived attention: instead, they may represent just one aspect of a more general tendency, alternately termed "mimicry," "mirroring," or "embodied perception."

Activating Motor Responses through Dynamic Faces
Faces do not reflect the attention and intention of other animals by accident. In primates, faces have become a primary vehicle for the active communication of visuosocial signals, a trend that has reached its apogee in the naked faces of Homo sapiens. Anthropoid primates, in whom visual signaling has supplanted olfactory cues, devote large swaths of the cortex to the production of facial expressions (Allman, 1999). Researchers have identified at least five categories of human facial expression that
appear to be intelligible across cultures (Ekman, 1993). Although these emotions can be readily identified in static images, the canonical photographs sometimes appear caricatured to observers, seeming unnatural when they are divorced from a dynamic context. Indeed, some facial expressions—such as embarrassment—cannot be fully captured except in dynamic sequence: dropping gaze, smiling (often suppressed), then turning away or touching the face (Keltner, 1995). As with gaze, the responses evoked by dynamic social stimuli are not merely perceptual. William James (James, 1890) articulated the ideomotor theory of action, in which every mental representation evokes the represented behavior—evoked traces that are not mere concomitants of perception of emotion but necessary intermediaries in perceptual experience. Since that time, ample support has accumulated for the idea that emotional facial expressions induce mimicry and are thus in some sense contagious (e.g., Hess & Blairy, 2001). This mimicry is automatic and does not require conscious awareness (Dimberg, Thunberg, & Elmehed, 2000). Furthermore, interference with motor mimicry can disrupt recognition of facial expression (Oberman, Winkielman, & Ramachandran, 2007). Again, however, this seemingly reflexive process nonetheless presents subtle context dependence. Mimicry may not be tied to any specific physical effector, suggesting that we mirror emotions, rather than purely physical states (Magnee, Stekelenburg, Kemner, & de Gelder, 2007), and facial expressions may induce complementarity rather than mimicry when social dominance is at stake (Tiedens & Fragale, 2003). Crucially, this mimicry does not just aid in perception; it may actively regulate social interaction. Mimicry and affiliation mutually reinforce one another, with mimicry promoting affiliation (Chartrand & Bargh, 1999; Wiltermuth & Heath, 2009, although note van Baaren, Holland, Kawakami, & van Knippenberg, 2004) and affiliation likewise enhancing mimicry (Lakin & Chartrand, 2003; Likowski, Muehlberger, Seibt, Pauli, & Weyers, 2008). While presentation of static faces can sometimes evoke these responses, dynamic faces are much more effective. For example, muscle contraction is much more evident when viewing videotaped than static expressions in both the corrugator supercilii (contracted in anger) and the zygomatic major (contracted in joy) (Sato, Fujimura, & Suzuki, 2008). The attentional and emotional contagion produced by observing a face may be part of a more general trend toward mirroring as a perceptual process. Mimicry appears to occur at multiple levels of abstraction, reflecting action goals, overall motor strategies, and specific effectors and movements. Furthermore, mimicry is not only triggered by gaze and the perception of expression but also plays a role in the perception of speech: mechanical manipulation of listeners' faces biases their auditory perception toward congruently mouthed words (Ito, Tiede, & Ostry, 2009). Although the data may not require that perceived actions and emotions be embodied by the observer before they can be comprehended [individuals with facial paralysis can
nonetheless recognize emotion (Calder, Keane, Cole, Campbell, & Young, 2000)], they do suggest that physical or simulated embodiment contributes to normal face perception, empathy, communication, and social affiliation.

Neural Networks Activated by Dynamic versus Static Faces
It is likely that social pressure is a major factor in the evolution of larger brains (Barton and Dunbar, 1997; Reader and Laland, 2002). Consistent with this idea, we note that many—perhaps the majority—of cortical areas can be activated by faces under some conditions and may thus be considered part of an extended face perception network. For example, in addition to core face perception areas (reviewed by Tsao and Livingstone, 2008), cortical areas involved in attention (Haxby, Hoffman, & Gobbini, 2000; Calder et al., 2007) and somatosensation (Adolphs, Damasio, Tranel, Cooper, & Damasio, 2000) are active during observation of faces. Moreover, a range of other areas involved in emotional, mnemonic, and goal-directed processes are commonly engaged by socially significant faces (e.g., Vuilleumier, Armony, Driver, & Dolan, 2001; Ishai, Schmidt, & Boesiger, 2005). Most of what we know about neural processing of faces comes from research on passive perception or simple categorization of static, cropped faces. Dynamic faces appear to be much more effective than static ones at activating neural tissues throughout the extended face perception network (Fox, Iaria, & Barton, 2008), and we speculate that socially relevant, interactive faces would be more effective still. Although current models suggest that facial transients are processed through a separate stream from permanent features such as identity (Haxby et al., 2000; but see also Calder & Young, 2005), dynamism can be expected to enrich the information flowing through both streams. Dynamic stimuli make explicit which facial features are transitory and which are permanent; furthermore, patterns of movement are often idiosyncratic and provide information about underlying structure. Conversely, because features such as bone structure and musculature constrain facial movements, and because recognition of a person likewise constrains the posterior probabilities of underlying mental states, identity-related computations are likely to modulate the processing of more dynamic facial features. For further discussion of identity processing and facial dynamics, see chapter 4. The dynamic visual features described here—including orofacial movement, gaze shifts, and emotional expressions—are thought to be analyzed primarily by cortical areas located near the superior temporal sulcus (Allison, Puce, & McCarthy, 2000). The detailed pathways by which the brain mediates social perceptions and behavioral responses are incompletely understood. Although faces do not remain long in one position, they do dynamically orbit a resting state. When muscles are relaxed, all other things being equal, the mouth gently closes, the eyes align with the head and the head
with the body, and the facial expression becomes placidly neutral. Static photographs of any other pose can thus be considered to imply a temporary shift from this hypothetical default. Perhaps for this reason, relatively few data explicitly contrast cortical responses to dynamic and static stimuli.

Gaze Direction
Regions of the human STS (Puce, Allison, Bentin, Gore, & McCarthy, 1998; Wicker, Michel, Henaff, & Decety, 1998) and amygdala (Kawashima et al., 1999) have both been implicated in visual processing of observed gaze. In particular, the posterior STS is reciprocally interconnected with posterior parietal attention areas, and both regions are activated when subjects specifically attend to gaze direction (Hoffman & Haxby, 2000). In monkeys, neurons in the middle anterior upper bank of the STS represent gaze direction independently of whether it arises through head or eye posture (Perrett, Hietanen, Oram, & Benson, 1992), and while neurons in the caudal STS respond symmetrically to gaze averted to either the right or the left, anterior neurons respond differentially to different gaze directions (De Souza, Eifuku, Tamura, Nishijo, & Ono, 2005). Although human imaging studies have most consistently reported activations in the caudal STS, recent studies have suggested that only neuronal activity in the inferior parietal lobule and the anterior STS represents gaze deixis, that is, the specific direction toward which gaze is directed (Calder et al., 2007). Nonetheless, a fast subcortical pathway that routes visual information to the amygdala may also play a role in social attention shifts (Adolphs et al., 2005), potentially including those cued by observed gaze (Akiyama et al., 2007).

Emotion and Mimicry
Emotional expressions are likewise believed to be processed initially in the STS but have been reported to involve a broad array of areas, including the amygdala; parietal somatosensory regions; insula; and frontal motor, reward, and executive function centers. These activations in the extended face perception system often vary from task to task and emotion to emotion (reviewed by Adolphs, 2002; Vuilleumier & Pourtois, 2007). The significance of an emotional signal and the neural activity it evokes are no doubt profoundly affected by dynamic social and nonsocial environmental cues. For example, it is crucial whether your ally became angry at your enemy, or at you. To understand brain responses to dynamic faces, we must understand how contextual and motivational effects interact with these perceptions. In short, we must extend our paradigm to reflect the interactive and contextually nuanced nature of social relationships. The role of mimicry in face perception is echoed by findings that somatosensory and motor cortices are involved in processing dynamic faces. In particular, parietal somatosensory and frontoparietal motor areas have been found to contain individual
neurons that respond both when performing an action and when observing, either visually or auditorily, the same action being performed by others (reviewed by Rizzolatti & Craighero, 2004). Furthermore, the same neural tissues that generate motor activity are activated during passive perception. For example, transcranial magnetic stimulation (TMS) pulses evoke activity at a lower threshold when subjects view congruent motor activations (Fadiga, Fogassi, Pavesi, & Rizzolatti, 1995; Strafella & Paus, 2000); and the observation of dynamic facial emotions activates motor tissues governing congruent facial displays (Sato, Kochiyama, Yoshikawa, Naito, & Matsumura, 2004). Such mirror activations may mediate observed mimicry effects. In the lateral intraparietal areas that govern attention, for example, observation of averted gaze in another evokes neuronal activity that tracks the dynamics of gaze-following (Shepherd, Klein, Deaner, & Platt, 2009). These data suggest that independent of whether the observed stimuli are physically embodied, the tissues that produce and detect an individual's own behavioral state are likewise involved in perceiving the behavioral state of another.

Vocal Communication
Beyond the STS and inferior regions of the temporal lobe, dynamic faces influence how voices are processed in the auditory cortex (Ghazanfar, Maier, Hoffman, & Logothetis, 2005; Ghazanfar, Chandrasekaran, & Logothetis, 2008). The vast majority of neural responses in this region show integrative (enhanced or suppressed) responses when dynamic faces are presented with voices compared with unimodal presentations (figure 8.3a and b). Furthermore, these dynamic face and voice responses were specific: replacing the dynamic faces with dynamic disks that mimicked the aperture and displacement of the mouth did not lead to integration (Ghazanfar et al., 2005, 2008). This parallels findings from human psychophysical experiments in which such artificial dynamic disk stimuli can lead to enhanced speech detection but not to the same degree as a dynamic face (Bernstein, Auer, & Takayanagi, 2004; Schwartz, Berthommier, & Savariaux, 2004). Although there are multiple possible sources of visual input to the auditory cortex (Ghazanfar & Schroeder, 2006), the STS is likely to be the major region through which facial images influence the auditory cortex. This is, first, because there are reciprocal connections between the STS and the lateral belt and other parts of the auditory cortex (described earlier; see also Barnes & Pandya, 1992; Seltzer & Pandya, 1994); second, because neurons in the STS are sensitive to both faces and biological motion (Harries & Perrett, 1991; Oram & Perrett, 1994); and finally, because the STS is known to be multisensory (Bruce, Desimone, & Gross, 1981; Schroeder & Foxe, 2002). One mechanism for establishing whether the auditory cortex and the STS interact at the functional level is to measure their temporal correlations as a function of stimulus condition. Concurrent recordings of oscillations and single
Figure 8.3 Dynamic faces engage the auditory cortex and its interactions with the superior temporal sulcus. (a) Example of a coo call and a grunt call from rhesus monkeys. (Top) Frames at five intervals from the start of the video (the onset of mouth movement) until the end of mouth movement. X-axes = time in milliseconds. (Bottom) Time waveform of the vocalization where the blue lines indicate the temporally corresponding video frames. (b) Examples of multisensory integration in auditory cortex neurons. Peristimulus time histograms and rasters in response to a grunt vocalization (left and right) and coo vocalization (middle) according to the face+voice (F+V), voice alone (V), and face alone (F) conditions. The x-axes show time aligned to onset of the face (solid line). Dashed lines indicate the onset and offset of the voice signal. Y-axes = firing rate of the neuron in spikes/second. (c) Time-frequency plots (cross-spectrograms) illustrate the modulation of functional interactions (as a function of stimulus condition) between the auditory cortex and the STS for a population of cortical sites. X-axes = time in milliseconds relative to the onset of the auditory signal (solid black line). Y-axes = frequency of the oscillations in hertz. The color bar indicates the amplitude of these signals normalized by the baseline mean.
neurons from the auditory cortex and the dorsal bank of the STS reveal that gamma-band correlations significantly increased in strength during presentation of bimodal face and voice videos compared with unimodal conditions (Ghazanfar et al., 2008; figure 8.3c). Because the phase relationships between oscillations were significantly less variable (tighter) when dynamic faces were paired with voices, these correlations are not merely due to an increase in response strength but also reflect a tighter temporal coordination between the auditory cortex and the STS (Varela et al., 2001). This relationship is elaborated further in the following chapter by Puce and Schroeder.
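The logic of this kind of functional-coupling analysis can be illustrated with a short sketch. The code below is a generic illustration, not the pipeline used in the studies cited above: it assumes trial-aligned local field potential epochs from one auditory cortex site and one STS site (arrays of shape n_trials x n_samples), band-pass filters them in an assumed gamma range, and computes an across-trial phase-locking value whose difference between conditions (e.g., face+voice versus voice alone) could then be assessed with a permutation test. A simple enhancement index for spike rates is included for comparison; the sampling rate, band edges, and all variable names are assumptions.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

FS = 1000.0  # sampling rate in Hz (assumed)

def bandpass(x, low, high, fs=FS, order=4):
    """Zero-phase band-pass filter along the last (time) axis."""
    b, a = butter(order, [low / (fs / 2), high / (fs / 2)], btype="band")
    return filtfilt(b, a, x, axis=-1)

def phase_locking(lfp_a, lfp_b, low=40.0, high=80.0):
    """Across-trial phase-locking value (PLV) between two recording sites.

    lfp_a, lfp_b: arrays of shape (n_trials, n_samples), trial-aligned.
    Returns a PLV time course (n_samples,); values near 1 mean the band-limited
    phase difference between the sites is consistent across trials.
    """
    phase_a = np.angle(hilbert(bandpass(lfp_a, low, high), axis=-1))
    phase_b = np.angle(hilbert(bandpass(lfp_b, low, high), axis=-1))
    return np.abs(np.mean(np.exp(1j * (phase_a - phase_b)), axis=0))

def enhancement_index(fv_rate, f_rate, v_rate):
    """Percent change of the face+voice response relative to the stronger
    unimodal response; positive = enhancement, negative = suppression."""
    best_unimodal = max(f_rate, v_rate)
    return 100.0 * (fv_rate - best_unimodal) / best_unimodal

# Example with synthetic data; in practice the bimodal PLV would be compared
# against the unimodal conditions, e.g., with a trial-shuffling permutation test.
rng = np.random.default_rng(0)
auditory_lfp = rng.standard_normal((50, 2000))  # 50 trials x 2 s at 1 kHz
sts_lfp = rng.standard_normal((50, 2000))
plv = phase_locking(auditory_lfp, sts_lfp)
print(plv.shape, float(plv.mean()))
print(enhancement_index(fv_rate=24.0, f_rate=8.0, v_rate=15.0))
```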
To See Past the Mask

For humans, the most important part of face perception is the insight it gives into the underlying person. The significance of human faces in our environment is that they belong to people—friends, enemies, coworkers, competitors, prospects, spouses, and relatives—about whom we maintain richly detailed histories. These individuals form the cast among whom we give our aspirations play. Diverse neural tissues help guide our interaction with these characters, and thus mnemonic, emotional, and executive areas are included in the extended network activated by observation of dynamic faces. Although we did not review these extended processes in detail here, we note that dynamic faces most likely potentiate these processes, just as they potentiate the lower-level perceptions we have dwelled upon. Medial temporal areas recalling diverse personal associations (Quiroga, Reddy, Kreiman, Koch, & Fried, 2005) are most likely more strongly activated as dynamic videos make explicit the temporal invariants within a face, as well as characteristic or otherwise memorable expression trajectories. Likewise, the circuits involved in perspective taking, mentalistic attribution, empathy, and goal processing can be cued by dynamic faces, and the evoked processing seems to resemble that which we outlined earlier. Again, tissues that are specialized for self-processing (e.g., of perceptual sets and working memory in the right temporoparietal junction and inferior frontal gyrus, and of goal sets and cognitive and emotional load in the medial prefrontal cortex) seem to be recruited to aid in the comprehension of others (Saxe, 2006; Frith & Frith, 2007). Thus, dysfunction in these systems can have dire consequences for our ability to relate to others (see de Gelder and Van den Stock in this volume). Both the cortical and the conceptual breadth of activity evoked by dynamic faces speak to their primacy in our lives; our humanity hinges on our relationships with other people. In nature, faces do not appear and disappear as still images, unannounced, unmoving, and unaccompanied. Nor did our brains evolve merely to perceive passive scenes or to respond to the interrogations of curious researchers. The stage across
which we struggle to survive and multiply is not composed of static scenes, and the players are not mere passive props: we are interactors.

References

Adams, R. B., & Kleck, R. E. (2005). Effects of direct and averted gaze on the perception of facially communicated emotions. Emotion, 5, 3–11. Adolphs, R. (2002). Recognizing emotion from facial expressions: Psychological and neurological mechanisms. Behavioral and Cognitive Neuroscience Reviews, 1, 21–61. Adolphs, R., Damasio, H., Tranel, D., Cooper, G., & Damasio, A. R. (2000). A role for somatosensory cortices in the visual recognition of emotion as revealed by three-dimensional lesion mapping. Journal of Neuroscience, 20, 2683–2690. Adolphs, R., Gosselin, F., Buchanan, T. W., Tranel, D., Schyns, P., & Damasio, A. R. (2005). A mechanism for impaired fear recognition after amygdala damage. Nature, 433, 68–72. Akiyama, T., Kato, M., Muramatsu, T., Umeda, S., Saito, F., & Kashima, H. (2007). Unilateral amygdala lesions hamper attentional orienting triggered by gaze direction. Cerebral Cortex, 17, 2593–2600. Allison, T., Puce, A., & McCarthy, G. (2000). Social perception from visual cues: Role of the STS region. Trends in Cognitive Science, 4, 267–278. Allman, J. M. (1999). Evolving brains. New York: W.H. Freeman. Barnes, C. L., & Pandya, D. N. (1992). Efferent cortical connections of multimodal cortex of the superior temporal sulcus in the rhesus monkey. Journal of Comparative Neurology, 318, 222–244. Baron-Cohen, S. (1994). How to build a baby that can read minds: Cognitive mechanisms in mindreading. Current Psychology of Cognition, 13, 513–552. Barton, R. A., & Dunbar, R. L. M. (1997). Evolution of the social brain. In A. Whiten and R. Byrne (eds.), Machiavellian intelligence. Cambridge, UK: Cambridge University Press. Batki, A., Baron-Cohen, S., Wheelwright, S., Connellan, J., & Ahluwalia, J. (2000). Is there an innate gaze module? Evidence from human neonates. Infant Behavior and Development, 23, 223–229. Bayliss, A. P., Frischen, A., Fenske, M. J., & Tipper, S. P. (2007). Affective evaluations of objects are influenced by observed gaze direction and emotional expression. Cognition, 104, 644–653. Bayliss, A. P., & Tipper, S. P. (2006). Predictive gaze cues and personality judgments: Should eye trust you? Psychological Science, 17, 514–520. Bernstein, L. E., Auer, E. T., & Takayanagi, S. (2004). Auditory speech detection in noise enhanced by lipreading. Speech Communication, 44, 5–18. Birmingham, E., Bischof, W. F., & Kingstone, A. (2008). Social attention and real-world scenes: The roles of action, competition and social content. Quarterly Journal of Experimental Psychology, 61, 986–998. Brauer, J., Call, J., & Tomasello, M. (2005). All great ape species follow gaze to distant locations and around barriers. Journal of Comparative Psychology, 119, 145–154. Bruce, C., Desimone, R., & Gross, C. G. (1981). Visual properties of neurons in a polysensory area in superior temporal sulcus of the macaque. Journal of Neurophysiology, 46, 369–384. Buchan, J. N., Pare, M., & Munhall, K. G. (2004). The influence of task on gaze during audiovisual speech perception. Journal of the Acoustical Society of America, 115, 2607. Buchan, J. N., Pare, M., & Munhall, K. G. (2007). Spatial statistics of gaze fixations during dynamic face processing. Social Neuroscience, 2, 1–13. Burger, J., Gochfeld, M., & Murray, B. (1992). Risk discrimination of eye contact and directness of approach in black iguanas (Ctenosaura similis).
Journal of Comparative Psychology, 106, 97–101. Burghardt, G., & Greene, H. (1988). Predator simulation and duration of death feigning in neonate hognose snakes. Animal Behavior, 36, 1842–1844. Calder, A. J., Beaver, J. D., Winston, J. S., Dolan, R. J., Jenkins, R., Eger, E., & Henson, R. N. (2007). Separate coding of different gaze directions in the superior temporal sulcus and inferior parietal lobule. Current Biology, 17, 20–25.
Calder, A. J., Keane, J., Cole, J., Campbell, R., & Young, A. W. (2000). Facial expression recognition by people with Mobius syndrome. Cognitive Neuropsychology, 17, 73–87. Calder, A. J., & Young, A. W. (2005). Understanding the recognition of facial identity and facial expression. Nature Reviews Neuroscience, 6, 641–651. Chartrand, T. L., & Bargh, J. A. (1999). The chameleon effect: The perception-behavior link and social interaction. Journal of Personality and Social Psychology, 76, 893–910. Coss, R. (1978). Perceptual determinants of gaze aversion by the lesser mouse lemur (Microcebus murinus). The role of two facing eyes. Behavior, 64, 248–267. Coss, R. (1979). Delayed plasticity of an instinct: Recognition and avoidance of 2 facing eyes by the jewel fish. Developmental Psychobiology, 12, 335–345. De Souza, W. C., Eifuku, S., Tamura, R., Nishijo, H., & Ono, T. (2005). Differential characteristics of face neuron responses within the anterior superior temporal sulcus of macaques. Journal of Neurophysiology, 94, 1252–1266. Deaner, R. O., & Platt, M. L. (2003). Reflexive social attention in monkeys and humans. Current Biology, 13, 1609–1613. Deaner, R. O., Shepherd, S. V., & Platt, M. L. (2007). Familiarity accentuates gaze cuing in women but not men. Biology Letters, 3, 64–67. Dimberg, U., Thunberg, M., & Elmehed, K. (2000). Unconscious facial reactions to emotional facial expressions. Psychological Science, 11, 86–89. Driver, J., Davis, G., Ricciardelli, P., Kidd, P., Maxwell, E., & Baron-Cohen, S. (1999). Gaze perception triggers reflexive visuospatial orienting. Visual Cognition, 6, 509–540. Ekman, P. (1993). Facial expression and emotion. American Psychologist, 48, 384–392. Emery, N. J., Lorincz, E. N., Perrett, D. I., Oram, M. W., & Baker, C. I. (1997). Gaze following and joint attention in rhesus monkeys (Macaca mulatta). Journal of Comparative Psychology, 111, 286–293. Fadiga, L., Fogassi, L., Pavesi, G., & Rizzolatti, G. (1995). Motor facilitation during action observation: A magnetic stimulation study. Journal of Neurophysiology, 73, 2608–2611. Farroni, T., Csibra, G., Simion, F., & Johnson, M. H. (2002). Eye contact detection in humans from birth. Proceedings of the National Academy of Sciences of the United States of America, 99, 9602–9605. Fletcher-Watson, S., Findlay, J. M., Leekam, S. R., & Benson, V. (2008). Rapid detection of person information in a naturalistic scene. Perception, 37, 571–583. Fox, C. J., Iaria, G., & Barton, J. J. (2008). Defining the face processing network: Optimization of the functional localizer in fMRI. Human Brain Mapping, 30, 1637–1651. Friesen, C. K., & Kingstone, A. (1998). The eyes have it! Reflexive orienting is triggered by nonpredictive gaze. Psychonomic Bulletin and Review, 5, 490–495. Frischen, A., Bayliss, A. P., & Tipper, S. P. (2007). Gaze cueing of attention: Visual attention, social cognition, and individual differences. Psychological Bulletin, 133, 694–724. Frith, C. D., & Frith, U. (2007). Social cognition in humans. Current Biology, 17, R724–732. Ganel, T., Goshen-Gottstein, Y., & Goodale, M. A. (2005). Interactions between the processing of gaze direction and facial expression. Vision Research, 45, 1191–1200. Ghazanfar, A. A., Chandrasekaran, C., & Logothetis, N. K. (2008). Interactions between the superior temporal sulcus and auditory cortex mediate dynamic face/voice integration in rhesus monkeys. Journal of Neuroscience, 28, 4457–4469. Ghazanfar, A. A., Nielsen, K., & Logothetis, N. K. (2006).
Eye movements of monkeys viewing vocalizing conspecifics. Cognition, 101, 515–529. Ghazanfar, A. A., Maier, J. X., Hoffman, K. L., & Logothetis, N. K. (2005). Multisensory integration of dynamic faces and voices in rhesus monkey auditory cortex. Journal of Neuroscience, 25, 5004–5012. Ghazanfar, A. A., & Santos, L. R. (2004). Primate brains in the wild: The sensory bases for social interactions. Nature Reviews Neuroscience, 5, 603–616. Ghazanfar, A. A., & Schroeder, C. E. (2006). Is neocortex essentially multisensory? Trends in Cognitive Sciences, 10, 278–285.
Guo, K., Robertson, R. G., Mahmoodi, S., Tadmor, Y., & Young, M. P. (2003). How do monkeys view faces? A study of eye movements. Experimental Brain Research, 150, 363–374. Harries, M. H., & Perrett, D. I. (1991). Visual processing of faces in temporal cortex. Physiological evidence for a modular organization and possible anatomical correlates. Journal of Cognitive Neuroscience, 3, 9–24. Hauser, M. D. (1993). Right-hemisphere dominance for the production of facial expression in monkeys. Science, 261, 475–477. Hauser, M. D., Evans, C. S., & Marler, P. (1993). The role of articulation in the production of rhesus monkey, Macaca mulatta, vocalizations. Animal Behavior, 45, 423–433. Hauser, M. D., & Ybarra, M. S. (1994). The role of lip configuration in monkey vocalizations—Experiments using xylocaine as a nerve block. Brain and Language, 46, 232–244. Haxby, J. V., Hoffman, E. A., & Gobbini, M. I. (2000). The distributed human neural system for face perception. Trends in Cognitive Science, 4, 223–233. Hess, U., & Blairy, S. (2001). Facial mimicry and emotional contagion to dynamic emotional facial expressions and their influence on decoding accuracy. International Journal of Psychophysiology, 40, 129–141. Hoffman, E. A., & Haxby, J. V. (2000). Distinct representations of eye gaze and identity in the distributed human neural system for face perception. Nature Neuroscience, 3, 80–84. Ishai, A., Schmidt, C. F., & Boesiger, P. (2005). Face perception is mediated by a distributed cortical network. Brain Research Bulletin, 67, 87–93. Ito, T., Tiede, M., & Ostry, D. J. (2009). Somatosensory function in speech perception. Proceedings of the National Academy of Sciences U.S.A., 106, 1245–1248. James, W. (1890). The principles of psychology. New York: Henry Holt. Kawashima, R., Sugiura, M., Kato, T., Nakamura, A., Hatano, K., Ito, K., Fukuda, H., Kojima, S., & Nakamura, K. (1999). The human amygdala plays an important role in gaze monitoring: A PET study. Brain, 122, 779–783. Keating, C. F., & Keating, E. G. (1982). Visual scan patterns of rhesus monkeys viewing faces. Perception, 11, 211–219. Keltner, D. (1995). Signs of appeasement—Evidence for the distinct displays of embarrassment, amusement, and shame. Journal of Personality and Social Psychology, 68, 441–454. Klin, A., Jones, W., Schultz, R., Volkmar, F., & Cohen, D. (2002). Visual fixation patterns during viewing of naturalistic social situations as predictors of social competence in individuals with autism. Archives of General Psychiatry, 59, 809–816. Lakin, J. L., & Chartrand, T. L. (2003). Using nonconscious behavioral mimicry to create affiliation and rapport. Psychological Science, 14, 334–339. Land, M. F., & Hayhoe, M. (2001). In what ways do eye movements contribute to everyday activities? Vision Research, 41, 3559–3565. Langton, S. R. H., & Bruce, V. (1999). Reflexive visual orienting in response to the social attention of others. Visual Cognition, 6, 541–567. Lansing, C. R., & McConkie, G. W. (1999). Attention to facial regions in segmental and prosodic visual speech perception tasks. Journal of Speech Language and Hearing Research, 42, 526–539. Lansing, C. R., & McConkie, G. W. (2003). Word identification and eye fixation locations in visual and visual-plus-auditory presentations of spoken sentences. Perception and Psychophysics, 65, 536–552. Likowski, K. U., Muehlberger, A., Seibt, B., Pauli, P., & Weyers, P. (2008). Modulation of facial mimicry by attitudes. Journal of Experimental Social Psychology, 44, 1065–1072.
Macchi Cassia, V., Simion, F., & Umilta, C. (2001). Face preference at birth: The role of an orienting mechanism. Developmental Science, 4, 101–108. Magnee, M. J., Stekelenburg, J. J., Kemner, C., & de Gelder, B. (2007). Similar facial electromyographic responses to faces, voices, and body expressions. Neuroreport, 18, 369–372. Mathews, A., Fox, E., Yiend, J., & Calder, A. (2003). The face of fear: Effects of eye gaze and emotion on visual attention. Visual Cognition, 10, 823–835.
Munhall, K. G., Jones, J. A., Callan, D. E., Kuratate, T., & Vatikiotis-Bateson, E. (2004). Visual prosody and speech intelligibility—Head movement improves auditory speech perception. Psychological Science, 15, 133–137. Nahm, F. K. D., Perret, A., Amaral, D. G., & Albright, T. D. (1997). How do monkeys look at faces? Journal of Cognitive Neuroscience, 9, 611–623. Oberman, L. M., Winkielman, P., & Ramachandran, V. S. (2007). Face to face: Blocking facial mimicry can selectively impair recognition of emotional expressions. Social Neuroscience, 2, 167–178. Oram, M. W., & Perrett, D. I. (1994). Responses of anterior superior temporal polysensory (Stpa) neurons to biological motion stimuli. Journal of Cognitive Neuroscience, 6, 99–116. Perrett, D. I., Hietanen, J. K., Oram, M. W., & Benson, P. J. (1992). Organization and functions of cells responsive to faces in the temporal cortex. Philosophical Transactions of the Royal Society of London— Series B: Biological Sciences, 335, 23–30. Preminger, J. E., Lin, H.-B., Payen, M., & Levitt, H. (1998). Selective visual masking in speechreading. Journal of Speech, Language, and Hearing Research, 41, 564–575. Puce, A., Allison, T., Bentin, S., Gore, J. C., & McCarthy, G. (1998). Temporal cortex activation in humans viewing eye and mouth movements. Journal of Neuroscience, 18, 2188–2199. Putman, P., Hermans, E., & van Honk, J. (2006). Anxiety meets fear in perception of dynamic expressive gaze. Emotion, 6, 94–102. Quiroga, R. Q., Reddy, L., Kreiman, G., Koch, C., & Fried, I. (2005). Invariant visual representation by single neurons in the human brain. Nature, 435, 1102–1107. Reader, S. M., & Laland, K. N. (2002). Social intelligence, innovation, and enhanced brain size in primates. Proceedings of the National Academy of Sciences of the United States of America, 99, 4436–4441. Ristau, C. (1991). Attention, purposes, and deception in birds. In A. Whiten (ed.), Natural theories of mind. Oxford, UK: Oxford University Press, pp. 209–222. Rizzolatti, G., & Craighero, L. (2004). The mirror-neuron system. Annual Review of Neuroscience, 27, 169–192. Rudmann, D. S., McCarley, J. S., & Kramer, A. F. (2003). Bimodal displays improve speech comprehension in environments with multiple speakers. Human Factors, 45, 329–336. Sato, W., Fujimura, T., & Suzuki, N. (2008). Enhanced facial EMG activity in response to dynamic facial expressions. International Journal of Psychophysiology, 70, 70–74. Sato, W., Kochiyama, T., Yoshikawa, S., Naito, E., & Matsumura, M. (2004). Enhanced neural activity in response to dynamic facial expressions of emotion: An fMRI study. Brain Research Cognitive Brain Research, 20, 81–91. Saxe, R. (2006). Uniquely human social cognition. Current Opinion in Neurobiology, 16, 235–239. Schroeder, C. E., & Foxe, J. J. (2002). The timing and laminar profile of converging inputs to multisensory areas of the macaque neocortex. Cognitive Brain Research, 14, 187–198. Schwartz, J.-L., Berthommier, F., & Savariaux, C. (2004). Seeing to hear better: Evidence for early audiovisual interactions in speech identification. Cognition, 93, B69–B78. Seltzer, B., & Pandya, D. N. (1994). Parietal, temporal, and occipital projections to cortex of the superior temporal sulcus in the rhesus monkey: A retrograde tracer study. Journal of Comparative Neurology, 343, 445–463. Shepherd, S. V., Deaner, R. O., & Platt, M. L. (2006). Social status gates social attention in monkeys. Current Biology, 16, R119–120. Shepherd, S. V., Klein, J. T., Deaner, R. O., & Platt, M. L. (2009). 
Mirroring of attention by neurons in macaque parietal cortex. Proceedings of the National Academy of Sciences U.S.A., 106, 9489–9494. Shepherd, S. V., & Platt, M. L. (2008). Spontaneous social orienting and gaze following in ringtailed lemurs (Lemur catta). Animal Cognition, 11, 13–20. Strafella, A. P., & Paus, T. (2000). Modulation of cortical excitability during action observation: A transcranial magnetic stimulation study. Neuroreport, 11, 2289–2292.
Tiedens, L. Z., & Fragale, A. R. (2003). Power moves: Complementarity in dominant and submissive nonverbal behavior. Journal of Personality and Social Psychology, 84, 558–568. Tomasello, M., Call, J., & Hare, B. (1998). Five primate species follow the visual gaze of conspecifics. Animal Behavior, 55, 1063–1069. Tsao, D. Y., & Livingstone, M. S. (2008). Mechanisms of face perception. Annual Review of Neuroscience, 31, 411–437. van Baaren, R. B., Holland, R. W., Kawakami, K., & van Knippenberg, A. (2004). Mimicry and prosocial behavior. Psychological Science, 15, 71–74. Varela, F., Lachaux, J. P., Rodriguez, E., & Martinerie, J. (2001). The brainweb: Phase synchronization and large-scale integration. Nature Reviews Neuroscience, 2, 229–239. Vatikiotis-Bateson, E., Eigsti, I. M., Yano, S., & Munhall, K. G. (1998). Eye movement of perceivers during audiovisual speech perception. Perception and Psychophysics, 60, 926–940. Vuilleumier, P., Armony, J. L., Driver, J., & Dolan, R. J. (2001). Effects of attention and emotion on face processing in the human brain: An event-related fMRI study. Neuron, 30, 829–841. Vuilleumier, P., & Pourtois, G. (2007). Distributed and interactive brain mechanisms during emotion face perception: Evidence from functional neuroimaging. Neuropsychologia, 45, 174–194. Wicker, B., Michel, F., Henaff, M. A., & Decety, J. (1998). Brain regions involved in the perception of gaze: A PET study. Neuroimage, 8, 221–227. Wiltermuth, S. S., & Heath, C. (2009). Synchrony and cooperation. Psychological Science, 20, 1–5. Xitco, M. J., Jr., Gory, J. D., & Kuczaj, S. A., 2nd (2004). Dolphin pointing is linked to the attentional behavior of a receiver. Animal Cognition, 7, 231–238. Yarbus, A. L. (1967). Eye movements and vision. New York: Plenum. Yehia, H. C., Kuratate, T., & Vatikiotis-Bateson, E. (2002). Linking facial animation, head motion and speech acoustics. Journal of Phonetics, 30, 555–568.
9
Multimodal Studies Using Dynamic Faces
Aina Puce and Charles E. Schroeder
It is often said that the dynamic face is an important source of cross-modal social information that helps to drive everyday interactions among individuals. The actions of the dynamic face transcend cultures. For instance, many generations of audiences around the world have appreciated mimes and silent movie actors, and individuals such as Buster Keaton, Charlie Chaplin, and W. C. Fields still enjoy popularity even today. An important consideration here is that gestures and expressions of the face and body form a powerful and salient stimulus in the absence of other accompanying sensory input. Hence we begin this chapter by examining neuroimaging and neurophysiological findings (multimodal imaging data) obtained in human studies where the subjects viewed purely visual dynamic displays involving faces. The fMRI studies clearly show that the human superior temporal sulcus is active during these displays, especially in the right hemisphere. In the ERP data, a prominent component, the N170, is elicited by facial motion and differentiates between mouth opening and closing movements, and between eyes looking away from and toward the observer. The N170 response to a dynamic face has a scalp distribution and face-specific properties similar to those for a static face. These dynamic studies are discussed briefly in the context of the extensive literature based on neural responses elicited by the presentation of static face stimuli. When sound is presented with the dynamic facial stimulus (cross-modal stimulation), fMRI activation and early ERPs (auditory N140 and visual N170) can show specialization for the human face and voice. This type of selectivity is similar to that seen in nonhuman primates in multisensory stimulus paradigms. Explicit and implicit social or emotional context in a dynamic facial display significantly alters the nature of the neural responses elicited. Specifically, later ERPs modulate their amplitude and latency as a function of this added context. In contrast, functional MRI data indicate that implicit emotional context in a dynamic facial display triggers selective activation of the amygdala. Clearly, the task the subject performs when viewing a dynamic facial display greatly affects what network of cortical structures will be activated. It is important that the STS responds to most manipulations involving dynamic faces, suggesting that this brain region is a crucial node for decoding dynamic facial information.
To eliminate confusion related to terminology, we use the suggestions proposed by Calvert in 2001. We use "multimodal" to describe data acquired with different neuroimaging modalities and "cross-modal" to indicate that an experimental task has stimuli presented in more than one sensory modality. Neurons with responses to more than one sensory modality are referred to as "multisensory" (Calvert, 2001).

Studies of Neural Activity Elicited by Dynamic Facial Displays
Dynamic stimuli pose a challenge for neurophysiological studies because in theory a dynamic stimulus should elicit a continuous and dynamic neural response that could be more difficult to detect using traditional analysis techniques. Given this consideration, apparent motion stimuli were used in the first ERP study investigating neurophysiological responses to dynamic facial stimuli (Puce, Smith, & Allison, 2000). The instantaneous visual transition that occurs in an apparent motion stimulus, such as the facial motion task described earlier, mimics the instantaneous transitions used in other more traditional visual event-related potential studies with alternating checkerboard and grating stimuli, or static images using faces (Regan, 1972; Halliday, McDonald, & Mushin, 1973; Chiappa, 1983; Bentin, Allison, Puce, Perez, & McCarthy, 1996). The advantage of using an apparent facial motion task is that facial motion ERPs can be interpreted in the context of the wider visual neurophysiological literature. The first study to describe ERPs elicited by static facial stimuli reported a prominent negativity over the bilateral temporal scalp that peaked at around 170 ms after the onset of the stimulus, and that had a marked right hemisphere bias (Bentin et al., 1996). It is interesting that when an N170 response to a facial display was compared with that elicited by apparent facial motion from a face that was already present on the visual display, the N170 response to the dynamic face was delayed by about 30–40 ms and was smaller relative to the N170 elicited by presentation of a static face (Puce et al., 2000). This is not surprising because the N170 response to facial motion is elicited by a fairly small change in the visual display, in terms of overall image area and luminance and contrast. Indeed, the earliest visual evoked potential studies documented the proportional relationship between amplitudes of the visually evoked potential component and stimulus size, luminance, and contrast (Wicke, Donchin, & Lindsley, 1964; Sokol and Riggs, 1971; Campbell & Kulikowski, 1972; Regan, 1972). Given that there are latency and amplitude differences in N170 across the static and dynamic dimensions, it could be argued that the elicited neural response is not the same entity. The basic morphology and scalp distribution of the two types of N170 are identical, and modeling studies suggest that the underlying generators are in the same cortical regions, notably the superior temporal sulcus and gyrus and on the ventral temporal surface in the fusiform gyrus (Halgren, Raij, Marinkovic, Jousmaki, & Hari, 2000;
Watanabe, Kakigi, & Puce, 2003; Itier & Taylor, 2004; Conty, N'Diaye, Tijus, & George, 2007; Deffke et al., 2007). These modeling studies have used electroencephalography (EEG) as well as magnetoencephalography (MEG). The equivalent MEG response for the N170 is known as the M170 (Liu, Belliveau, & Dale, 1998; Deffke et al., 2007) or 2M (Watanabe et al., 2003). A substantial literature has developed around the N170 that is elicited by the presentation of a static face (for an excellent review of this literature, see Rossion & Jacques, 2008). The N170 response to static faces has been interpreted as reflecting neural activity in a face detector (Bentin et al., 1996), or a facial feature analyzer as part of the process of face detection (Bentin, Golland, Flevaris, Robertson, & Moscovitch, 2006). The N170 elicited by static faces is remarkably invariant and does not appear to be affected by the identity of the individual (Anaki, Zion-Golumbic, & Bentin, 2007), nor is it a function of viewpoint, provided both eyes are present in the image (Caharel, d'Arripe, Ramon, Jacques, & Rossion, 2009). However, N170 can habituate to the presentation of successive faces relative to faces interspersed with other material. Further, the spatial extent and right-sided bias of N170 depend on how categorical stimulus material is presented (Maurer, Rossion, & McCandliss, 2008). Taken together, the literature describing the N170 response to static and dynamic faces suggests that this specialized neural activity is probably generated as part of the face detection and analysis process. The neural activity elicited by facial motion is probably generated in response to the onset of the motion stimulus. In the case of dynamic faces, N170 latency might be used to probe processing in more naturalistic visual displays (e.g., Rousselet, Mace, & Fabre-Thorpe, 2004a, 2004b) where a single-trial analysis approach might be adopted (Rousselet, Husk, Bennett, & Sekuler, 2007). Intracranial ERP studies using complex visual displays have been able to detect N170-like intracranial activity (Privman et al., 2007) in single-trial data, which is similar to previously described responses to static facial displays where data have been analyzed with conventional averaging techniques (Allison, Puce, Spencer, & McCarthy, 1999; McCarthy, Puce, Belger, & Allison, 1999; Puce, Allison, & McCarthy, 1999). More interesting would be a study of neural activity elicited by the faces of real individuals, which have been shown to produce robust N170s with shorter latencies than those elicited by two-dimensional computer images of faces (Ponkanen et al., 2008). The earliest neuroimaging study investigated brain responses to dynamic faces using simple facial movements with no affect or clear communicative message while subjects fixated on a point between the two eyes on the face (Puce, Allison, Bentin, Gore, & McCarthy, 1998). Both the right and left superior temporal sulci were selectively activated when viewing either lateral eye gaze movements or mouth opening and closing movements. This is a consistently replicated finding in a number of
subsequent functional magnetic resonance imaging studies (Hooker et al., 2003; Puce et al., 2003; Thompson, Abbott, Wheaton, Syngeniotis, & Puce, 2004; Wheaton, Thompson, Syngeniotis, Abbott, & Puce, 2004; Thompson, Hardee, Panayiotou, Crewther, & Puce, 2007). The original activation task (Puce et al., 1998) deliberately employed an apparent motion stimulus so that it would be appropriate for neurophysiological studies (Puce et al., 2000). It is important to note that the STS is activated consistently by a facial motion stimulus, irrespective of whether it is an apparent motion or a real motion, or whether it is a real human face, a computer-generated avatar, or a line-drawn cartoon face (Puce et al., 1998; Puce & Perrett, 2003; Puce et al., 2003; Thompson et al., 2004; Wheaton et al., 2004; Thompson et al., 2007).

The Robustness of the Basic Neural Response to Facial Motion
From a neurophysiological point of view, the N170 response to a dynamic face does not change its nature when parts of the face (which remain static) are removed from the visual display. For example, robust N170s to averted gaze were elicited irrespective of whether the eyes appeared in isolation or within the context of the whole face (Puce et al., 2000), suggesting that the change in the facial feature being moved drives the response. It is important that the N170 response to facial motion is significantly larger than the response elicited by motion per se (Puce et al., 2003). It is also unlikely that spatial attention is responsible for these effects (but see Crist, Wu, Karp, & Woldorff, 2007), given that N170 changes with different types of eye gaze in isolated eye stimuli while it does not differentiate between checks alternating in a checkerboard pattern in similar spatial locations (Puce et al., 2000). The fMRI activation in the bilateral STS caused by dynamic isolated eye stimuli is as robust as that observed with an entire face (Hardee, Thompson, & Puce, 2008). There are no differences in the N170 or fMRI activation when eye gaze shifts are compared with mouth opening and closing movements. The neural activity appears to occur explicitly with the start of the movement of the face part per se. This idea is not inconsistent with reports of N170s elicited by isolated face parts (e.g., Bentin et al., 1996; Eimer, 1998; Taylor, Itier, Allison, & Edmonds, 2001). It should also be noted that N170-like ERPs can also be elicited by hand and body motions (Wheaton, Pipingas, Silberstein, & Puce, 2001) and static hands and bodies (Stekelenburg & de Gelder, 2004; Kovacs et al., 2006; Thierry et al., 2006).
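Condition differences of this kind are conventionally quantified from trial-averaged waveforms. The sketch below is a generic example of such a measurement rather than the procedure used in any of the studies discussed here: it assumes baseline-corrected single-trial epochs (in microvolts) from one posterior temporal channel and reports the amplitude and latency of the most negative deflection within an assumed 140–210 ms search window. The sampling rate, epoch timing, and condition labels are likewise assumptions.

```python
import numpy as np

FS = 500.0           # sampling rate in Hz (assumed)
EPOCH_START = -0.2   # epoch onset relative to the stimulus, in seconds (assumed)

def n170_peak(epochs, window=(0.14, 0.21), fs=FS, t0=EPOCH_START):
    """Measure the N170 from single-trial epochs of one channel.

    epochs: array of shape (n_trials, n_samples), baseline-corrected.
    Returns (amplitude, latency_s) of the most negative point of the
    trial-averaged ERP within the search window.
    """
    erp = epochs.mean(axis=0)                  # trial-average ERP
    times = t0 + np.arange(erp.size) / fs      # time axis in seconds
    mask = (times >= window[0]) & (times <= window[1])
    idx = np.argmin(erp[mask])                 # most negative deflection
    return erp[mask][idx], times[mask][idx]

# Hypothetical comparison of two viewing conditions (e.g., averted versus
# direct gaze) recorded from the same right temporo-occipital channel.
rng = np.random.default_rng(1)
averted = rng.normal(0.0, 5.0, size=(120, 400))   # 120 trials x 0.8 s epochs
direct = rng.normal(0.0, 5.0, size=(120, 400))
for label, data in (("averted", averted), ("direct", direct)):
    amp, lat = n170_peak(data)
    print(f"{label}: N170 = {amp:.1f} uV at {lat * 1000:.0f} ms")
```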
What Alters the Basic Neural Response to Facial Motion?

The N170 activated by facial motion is a robust response that occurs with multiple manipulations involving a dynamic face. However, there are a number of circumstances where N170 amplitude is significantly different across viewing conditions. For example, larger and earlier N170s are elicited by mouth opening than by mouth closing movements. Similarly, larger and earlier N170s occur when eyes are making
extreme gaze aversions than when eyes are looking directly at the observer (Puce et al., 2000; see figure 9.1a). These effects were elicited by a continuously present dynamic display with randomly occurring eye and mouth movements. The eye movements involved extreme gaze positions (completely averted or gazing directly at the observer). A more recent study that also used a dynamic facial display reported significant differences in N170 amplitude and latency during changes in lateral eye gaze (Conty et al., 2007). The experimental design differed from that of Puce et al. (2000) in that eye changes were shown in a single trial where the initial stimulus depicted a face with eyes averted 15 degrees from the observer (figure 9.1b). After a prolonged baseline period, the gaze changed again to either look directly at the observer or to look to a more extreme lateral position (30 degrees from the observer and similar to that in Puce et al., 2000). The observer indicated by pressing a button whether the eyes in the stimulus face changed their gaze to look at or away from the observer. The N170 response to the change to direct gaze was significantly larger than that to the change to the more extreme lateral gaze (Conty et al., 2007), which is seemingly at odds with Puce et al. (2000). However, the data from the two studies might not be as inconsistent as first thought when one considers the nature of the gaze change (see schematic in figure 9.1). In the study by Conty and colleagues, the baseline gaze was averted, so that the subsequent transitions in gaze were averted-averted and averted-direct. One could regard the former gaze change as minimal: the gaze essentially remained averted during the trial, the stimulus face was still not looking at the observer, and in that sense there was no change in social attention. However, in the averted-direct condition, the stimulus face now looked at the observer and the observer became the focus of social attention. It is not surprising, given this experimental context, that the neural response is larger to the averted-direct transition than to the averted-averted transition. The study by Conty et al. (2007) did not test a direct-averted transition (as in Puce et al., 2000), and Puce et al. (2000) did not test an averted-to-averted gaze transition (as in Conty et al., 2007; see figure 9.1c), since both studies had other experimental aims and test conditions. Both studies clearly show that the neural effects of someone else's gaze falling on the observer are important; N170 latency was later with direct gaze than with averted gaze in both studies, despite quite different task requirements in the two studies. In addition, and importantly, Conty et al. also documented a longer-lasting N170 response to the direct-gaze condition. A future study that includes the various types of gaze transition not tested in the two studies (i.e., figure 9.1c) will allow more generalized conclusions about the effects of direct and averted gaze on neural responses. Gaze changes signal a change in social attention. Similarly, mouth opening movements (in contrast to closing movements) can be an important social signal indicating that there might be incoming speech to listen to. Although the N170 modulates its
Figure 9.1 A comparison of stimulus conditions in two ERP studies of lateral gaze perception. (a) Puce et al. (2000) used transitions of extreme gaze from direct gaze (transition 1) and back to extreme gaze aversion (transition 2) in a paradigm where the face was displayed continuously. The schematic at right depicts ERPs elicited by the eye-gaze transition number. (b) Conty et al. (2007) used single trials in which a face showing an intermediate position of gaze aversion averted its gaze even further (top, transition 3), or looked at the observer (bottom, transition 4). The schematic at right depicts ERPs elicited by the eye-gaze transition number. (c) List of stimulus transitions tested (left) and not tested (right) in both studies.
amplitude and latency in response to these different structural facial changes, the social significance or meaning of the stimulus is more likely to be reflected in long-latency ERP components, as discussed later in this chapter.

Species-Specific Effects and Cross-modal Processing
Human primates possess an extensive set of communicative facial gestures and associated vocalizations that convey verbal or nonverbal messages (see also Ghazanfar, this volume). Commonly observed nonsocial facial movements and associated sounds (e.g., coughs, sneezes, yawns, sighs) elicit early "unisensory" neural responses such as the auditory N140 and the visual N170 (Brefczynski-Lewis, Lowitszch, Parsons, Lemieux, & Puce, 2009). Underadditivity occurs in these ERPs for these cross-modal stimuli, suggesting that congruous information from another sensory modality might facilitate the processing of these commonly encountered everyday stimuli. Additional observed effects consisted of early superadditivity in the 60-ms latency range and complex effects in later ERPs. These stimuli were also presented in an fMRI study. The fMRI activations also showed underadditive and superadditive effects in brain regions that were consistent with potential neural sources for the observed ERPs (Brefczynski-Lewis et al., 2009). Studies performing source modeling on simultaneously acquired EEG and fMRI data will be needed to further investigate these initial observations. It is well known that relevant social information is read more effectively from congruous cross-modal sources (Campbell, 2008). An equally interesting question is whether, as human primates, we have neural responses specialized to our own species. We investigated this question in a study where congruous and incongruous audiovisual stimuli were presented to subjects (Puce, Epling, Thompson, & Carrick, 2007). In this experiment, congruity and incongruity were defined in terms of whether the heard sound was plausibly generated by the video stimulus presented. The subjects made explicit judgments of congruity or incongruity on each experimental trial. Visual stimuli consisted of a human face, a monkey face, and a house image. Using apparent motion, the mouth of each face and the front door of the house opened and then shut with enough time between stimulus transitions to allow ERPs to be recorded for each change. Visual stimuli were paired with auditory stimuli that consisted of a human burp, a monkey screech, and a creaking door. Hence there were nine possible conditions, with three being congruous (human face–human burp, monkey face–monkey screeching, house–door creaking) and the other six being incongruous (e.g., house–screeching, monkey–door creaking). It is interesting that a vertex-peaking auditory N140 was modulated by the audiovisual stimulus pairing: for congruous pairings, the auditory N140 was largest for the human and monkey faces relative to the house. Therefore the N140 appeared to show some form of specialization for either animate or primate-like stimuli. However,
an apparent species-specific response was seen when incongruous and congruous audiovisual pairings were compared as a function of the visual stimulus being viewed. The N140 amplitude was significantly larger for the human face–human vocalization pairing than for the human face paired with any other sound. This suggests that visual context plays an important role in how early auditory information is processed and was a somewhat surprising finding given that N140 has been regarded as an early component that originates in the auditory cortex (Giard et al., 1994; Godey, Schwartz, de Graaf, Chauvel, & Liegeois-Chauvel, 2001; Eggermont and Ponton, 2002). A (visual) N170 was also elicited in this study, but unlike previous studies, it did not change its behavior as a function of the congruity manipulation. As described earlier, N170 alters its amplitude as a function of visual stimulus condition and tends to be largest for faces relative to other objects. In the cross-modal experiment, the amplitude gradient in N170 was not observed, suggesting that this visual component can change its response properties in the presence of input from another sense modality. Latencies of field potentials to auditory stimuli are earlier than those to visual stimulation (Schroeder and Foxe, 2002). It is conceivable therefore that the demands on the visual system might be reduced in an audiovisual stimulation situation. Information regarding the stimulus is already being processed via an alternative sensory pathway that might then be available in at least a partially processed form to other brain regions, including those traditionally regarded as ‘‘unisensory’’ (Ghazanfar and Schroeder, 2006; Ghazanfar, Chandrasekaran, & Logothetis, 2008). When a subject views a dynamic human or monkey face, there is robust fMRI activation in the human STS (figure 9.2). The human face activated the STS bilaterally but the monkey face produced right STS activation only. This has been described
Figure 9.2 Single-subject STS activation elicited by viewing a human and a monkey face. The panel at left shows bilateral activation in the STS in a single axial slice upon viewing a dynamic human face with a moving mouth (enclosed in ellipses; comparison between mouth opening and mouth closure). The middle panel shows unilateral, right-sided STS activation (enclosed in ellipse) by a dynamic monkey face with a moving mouth. The panel at right shows the same slices with no activation in these regions by a control stimulus that consisted of a moving checkerboard display with the squares moving in the same spatial region as the mouth stimuli in the other stimulus conditions. The activations shown reflect t-test data (p < 0.0001, uncorrected) and were acquired at 3 tesla in a block design.
previously with human subjects for human vocalizations (Fecteau, Armony, Joanette, & Belin, 2004). The species-specific N140 data described earlier (Puce et al., 2007) were not subjected to a source modeling procedure, but it is likely that the response recorded at the human vertex was a summation of activity from both hemispheres. It is interesting to speculate that this is why the N140 response to the human face context was larger in Puce et al. (2007). Perhaps it was generated bilaterally rather than in one hemisphere, as might have been the N140 response to the monkey face (see also the visual fMRI activation in figure 9.2). Previous work has shown that right-sided STS activation occurs with nonverbal human vocalizations (Belin et al., 2002), whereas the bilateral temporal cortex in the human brain can be activated by monkey and other animal vocalizations (Lewis, Brefczynski, Phinney, Janik, & DeYoe, 2005; Altmann, Doehrmann, & Kaiser, 2007; Belin et al., 2008). Neurophysiological and neuroimaging studies in nonhuman primates (macaques) confirm the existence of a superior temporal region that is sensitive to the voices of conspecifics (Ghazanfar, Maier, Hoffman, & Logothetis, 2005; Gil-da-Costa et al., 2006; Petkov et al., 2008). Activation in these brain regions can be modulated by the identity of the individual (Petkov et al., 2008). The STS neurons show distinctive multisensory interactions when audiovisual stimulation consists of primate faces and vocalizations (Ghazanfar et al., 2008). It is interesting that the nature of the auditory and visual neural responses in the STS to face and voice stimulation can be quite different. Activation by faces is typically sustained and the energy of the signal is concentrated in the gamma band. In contrast, responses to voices are more transient and tend to have frequency content in the theta band (Chandrasekaran & Ghazanfar, 2009), which is somewhat surprising given the rapid and fleeting nature of primate vocalizations. Communicative signals from one's own species are extensively processed, not only within the cortex of the STS, but also within the ventrolateral prefrontal cortex (Romanski, Guclu, Hauser, Knoedl, & Diltz, 2004; Romanski & Averbeck, 2009). Neural responses to cross-modal stimulation in the STS by our group are illustrated in figure 9.3. The stimulus display appears in figure 9.3a. Figure 9.3b displays neural activity recorded from an electrode implanted in the upper bank of the superior temporal sulcus (in area STP) with a 100-µm distance between contacts. Current source density (CSD) profiles from all twenty-four electrode contacts and multiunit activity (MUA) from an electrode contact in layer 4 of the cortex, averaged over fifty stimulus trials, are shown over pre- and poststimulus time intervals for auditory (left panel, figure 9.3b), visual (middle panel), and combined audiovisual (right panel) stimulation. The onset of the visual (face-on) and auditory (vocalize) stimuli and the facial movement (move) are depicted by dashed vertical lines. Usually a patchy distribution of auditory and visually dominant sites is found in the upper bank of
Figure 9.3 Neural activity elicited by an audiovisual stimulus consisting of a monkey face and associated vocalization. (a) Stimulus display. An apparent visual motion stimulus consisting of a macaque face opening its mouth (third panel from left) was paired with a screeching sound whose onset was synchronized to occur two frames after facial movement started. The monkey subject fixated on a central crosshair and, once fixation was maintained (leftmost panel), the visual stimulus was presented. The monkey was required to saccade to a spot appearing in the bottom right corner of the display upon the offset of the face and accompanying vocalization (right panel). (b) Electrophysiological data recorded from a macaque subject (see text for details).
the STS (Schroeder, Mehta, & Givre, 1998; Schroeder & Foxe, 2002), but in most cases there is evidence of multisensory interaction. In the case shown in figure 9.3b, visual input dominates and there is little evidence of a response to isolated auditory stimuli (left panel). There is a robust response to onset of the face stimulus, followed by subtle modulation of ongoing activity to apparent motion, and finally an "off" response at face offset. Adding sound to the facial gesture causes increased modulation of the neural signal, which is most evident in the current sink below the MUA tracing, and the offset response (figure 9.3b, right) is increased significantly. The vocalization, which alone barely produces a local response, appears to increase the local response to the face gesture. These data illustrate how complex the response profiles in the STS can be to fairly elementary (apparent) motion stimuli.
Social and Affective Context
Earlier in this chapter N170 amplitude and latency changes were described in studies that were devoid of social or a¤ective context. Here we discuss the e¤ects of explicit and implicit manipulations of social or a¤ective context on neural activity and fMRI activation. Social context was manipulated in a multiface display and ERPs on viewing eye gaze changes were recorded (Carrick, Thompson, Epling, & Puce, 2007). A central flanker face averted its gaze from the observer to look at one of the two flankers during each trial. The flankers always looked away from the observer during the entire trial (figure 9.4a). The gaze change in the central face set up three possible eye-gaze conditions. In one condition, all three faces would look laterally to either the extreme left or right (see figure 9.4a), appearing to the subject as if they were looking at a common focus of attention somewhere o¤ the screen (group attention condition). In a second condition (mutual gaze), the central face could look at one of the flankers on the screen. In the third condition, the central face could look up and seemingly ‘‘ignore’’ both flankers (ignore). The subjects were required to press one of three buttons on each trial to signal if the central face ‘‘shared’’ a common interaction with two (group attention), one (mutual gaze), or neither face (ignore) while their ERP data were recorded (figure 9.4b). The only stimulus change in the trial was the gaze in the central face, which was always averted from the observer. We had determined in an earlier study that N170 was una¤ected by whether the stimulus face looked left or right (Puce et al., 2000). In this social context experiment, the N170 amplitude or latency did not di¤er across the three conditions (where gaze was always averted). Later ERP activity (300 ms after the eye-gaze change) showed di¤erences as a function of social context. Centrally distributed P350 latency was earliest in the group attention and mutualgaze conditions. A P500 was seen maximally over the midline parietal scalp, and significantly larger P500s occurred with the ignore condition. We believe that these later
Figure 9.4 Social context and eye-gaze experiment. (a) Examples of experimental conditions. Each experimental trial involved the presentation of a stimulus pair that consisted of a baseline in which the central face looked directly at the observer while two flanker faces looked away from the observer. The baseline stimulus was replaced by an image in which the flanker faces remained unchanged but the central face changed its gaze direction in one of three possible ways, as shown by the conditions named Group, Mutual, and Ignore. See text for more details. (b) Experimental timeline. Each trial consisted of stimulus pairs containing a baseline (presented for 1.5 sec) and a subsequent image (presented for 3.5 sec) that could display three possible viewing conditions. The presentation of each of the three conditions was randomized during the experiment. Subjects responded by pressing a button during the presentation of the second stimulus of the pair.
ERPs (latencies greater than 300 ms) index social or a¤ective context, and deal with the interpretation of stimulus meaning. In contrast, earlier ERPs, such as N170, probably deal with other aspects of processing the face that are more related to the physical or structural characteristics of the stimulus. Future studies using distributedsource modeling methods with fMRI guidance may be able to shed some light on the generators for this neural activity. The study described here dealt with an explicit manipulation of social context where the subject evaluated the social significance of the stimulus. What happens when the context manipulation is implicit? We performed an fMRI study in which subjects viewed a visual display and at the end of each trial indicated which one of two stimuli (open circle or square) was shown on the first frame of the trial. During the trial, a set of task-irrelevant frames were displayed. These frames consisted of isolated eyes with direct gaze that transitioned from a neutral state to an intermediate state, and then back to the original neutral condition. The intermediate state could be a lateral gaze, displayed fear, displayed happiness, or an upward movement (the control stimulus). The subjects were instructed to ignore the intervening stimuli and focus on being as fast and as accurate as possible with the circle or square judgment. Activation with the eye stimuli was robust and consistent across subjects (Hardee et al., 2008). Activation in the bilateral STS, right amygdala, right fusiform, right intraparietal sulcus, and bilateral orbitofrontal cortex did not di¤er among the stimulus conditions. Somewhat surprisingly, the left amygdala showed activation only with the fearful eye condition. The left fusiform gyrus and left intraparietal sulcus also showed di¤erentiation between the fearful and the other eye conditions. However, activation under all eye conditions was observed, with significantly larger activation occurring in the fear condition. Subjects were debriefed after the MRI scanning session and did not spontaneously report noticing emotions being displayed by the eyes. We believe that the left hemisphere (amygdala, fusiform gyrus, and intraparietal sulcus) could deal with more detail in the visual stimulus, whereas the right hemisphere may deal with the coarser aspects of the stimulus. This might be useful in the detection of danger—a contingency that requires a faster response (Hardee et al., 2008). Indeed, the amygdala is responsive when human subjects view dilated pupils in others, even when the subject is unaware of these pupillary changes (Demos, Kelley, Ryan, Davis, & Whalen, 2008). When split-field presentations of fearful faces are shown, the right amygdala (and right fusiform cortex) is preferentially activated even in explicit judgments of facial a¤ect (Noesselt, Driver, Heinze, & Dolan, 2005). In both studies, the data suggest enhanced coupling between the amygdala and fusiform cortex (Noesselt et al., 2005; Hardee et al., 2008). Although there is a large literature dealing with facial activations in the fusiform cortex, amygdala, and STS, not many studies have explicitly studied the functional links among these structures. The subregions of the amygdala have a rich network
of functional connections that are only now being delineated in the human brain (Roy et al., 2009). Methods that allow functional and structural connectivity to be visualized might shed some light on this issue. Combined studies of fMRI and EEG or MEG might be able to unravel the spatiotemporal dynamics of processing within these structures; it cannot be done well using only one assessment modality, given the limitations of the various methods.
Future Studies Using Dynamic Faces
Recently the trend toward more naturalistic stimulus displays is bearing fruit. Use of the preferred stimulus category in a naturalistic scene has been shown to activate brain regions similar to those elicited when the stimulus category is viewed in isolation (Golland et al., 2007; Privman et al., 2007). However, in real life, the visual scene changes rapidly as we move through the world and interact with other individuals. Studies involving faces have traditionally used static displays, but it is clear that dynamic faces can yield consistent and reliable fMRI activation and ERP or MEG data. ERP and MEG studies show just how fleeting, but complex, neural activity can be. Unraveling the neural sources for this activity is challenging and has previously only been possible by using invasive means in nonhuman primates (e.g., figure 9.3b). Assessment techniques that combine neural measures with blood flow data have the potential to noninvasively visualize the spatiotemporal dynamics of the human brain (Liu, Belliveau, & Dale, 1998) and continue to be refined (Ioannides, 2007; Hagler et al., 2008). In the next decade, studies involving dynamic faces will probably evolve to use some of these more complex combined assessment methods and may focus on the data for smaller numbers of individual subjects (Kriegeskorte, Formisano, Sorger, & Goebel, 2007). Although this approach is somewhat controversial currently, there is a lot to be said for detailed studies of individual subjects. Subjects change their performance during a session, and recent studies indicate that the extent of the interactions between cognition and a¤ect may be extremely important (Pessoa, 2008). Single-trial measures of neural activity and blood flow (Scheeringa et al., 2009) might well help unravel some of the variability inherent in averaged data. However, the signal-to-noise ratio in scalp ERP studies of evoked to spontaneous activity is generally low (Puce et al., 1994b, a). Despite this consideration, single-trial approaches may help integrate the literature on averaged data recorded in response to realistic, natural stimulus displays and the large body of literature that exists for more controlled, albeit somewhat artificial situations (see also Maurer et al., 2008). In addition, improvements in source modeling techniques for putative ERP generators, such as the use of realistic head models and fMRI constraints (Altmann et al., 2007), will potentially provide useful information regarding multisensory aspects of processing dynamic faces in the future.
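To make the signal-to-noise point above concrete, the sketch below simulates how trial averaging pulls an evoked response out of ongoing activity. It is only an illustration: the sampling rate, component latency, amplitude, and noise level are assumed values, not figures from any study cited in this chapter.

```python
# Minimal simulation (illustrative assumptions, not data from the chapter):
# averaging n trials shrinks independent noise roughly by sqrt(n), so the
# evoked peak stands out more clearly relative to the prestimulus baseline.
import numpy as np

rng = np.random.default_rng(0)
fs = 500                                   # assumed sampling rate in Hz
t = np.arange(-0.2, 0.6, 1 / fs)           # time axis, 0 = stimulus onset

# Assumed evoked component: a negative deflection peaking near 170 ms.
evoked = -5.0 * np.exp(-((t - 0.17) ** 2) / (2 * 0.02 ** 2))   # microvolts

def snr(n_trials):
    """Ratio of the averaged peak amplitude (~170 ms) to prestimulus noise."""
    trials = evoked + rng.normal(0.0, 15.0, size=(n_trials, t.size))  # noisy EEG
    avg = trials.mean(axis=0)
    peak = np.abs(avg[np.argmin(np.abs(t - 0.17))])   # amplitude at ~170 ms
    noise_sd = avg[t < 0].std()                       # prestimulus interval
    return peak / noise_sd

for n in (1, 16, 64, 256):
    print(f"{n:4d} trials -> SNR ~ {snr(n):.1f}")     # grows roughly with sqrt(n)
```

Because the residual noise in the average falls roughly with the square root of the number of trials, single-trial approaches of the kind discussed above must either accept a much lower signal-to-noise ratio or rely on additional denoising.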
Acknowledgment
We thank Dr. Peter Lakatos for permitting us to use the data depicted in figure 9.4. Puce and Schroeder are supported by National Institute of Neurological Disorders and Stroke grant NS049436. References Allison, T., Puce, A., Spencer, D. D., & McCarthy, G. (1999). Electrophysiological studies of human face perception. I: Potentials generated in occipitotemporal cortex by face and non-face stimuli. Cereb Cortex, 9, 415–430. Altmann, C. F., Doehrmann, O., & Kaiser, J. (2007). Selectivity for animal vocalizations in the human auditory cortex. Cereb Cortex, 17, 2601–2608. Anaki, D., Zion-Golumbic, E., & Bentin, S. (2007). Electrophysiological neural mechanisms for detection, configural analysis and recognition of faces. Neuroimage, 37, 1407–1416. Belin, P., Zatorre, R. J., and Ahad, P. (2002). Human temporal-lobe response to vocal sounds. Cognitive Brain Research, 13, 17–26. Belin, P., Fecteau, S., Charest, I., Nicastro, N., Hauser, M. D., & Armony, J. L. (2008). Human cerebral response to animal a¤ective vocalizations. Proc Biol Sci, 275, 473–481. Bentin, S., Allison, T., Puce, A., Perez, A., & McCarthy, G. (1996). Electrophysiological studies of face perception in humans. J Cogn Neurosci, 8, 551–565. Bentin, S., Golland, Y., Flevaris, A., Robertson, L. C., & Moscovitch, M. (2006). Processing the trees and the forest during initial stages of face perception: Electrophysiological evidence. J Cogn Neurosci, 18, 1406–1421. Brefczynski-Lewis, J., Lowitszch, S., Parsons, M., Lemieux, S., & Puce, A. (2009). Audiovisual non-verbal dynamic faces elicit converging fMRI and ERP responses. Brain Topography, 21, 193–206. Caharel, S., d’Arripe, O., Ramon, M., Jacques, C., & Rossion, B. (2009). Early adaptation to repeated unfamiliar faces across viewpoint changes in the right hemisphere: Evidence from the N170 ERP component. Neuropsychologia, 47, 639–643. Calvert, G. A. (2001). Crossmodal processing in the human brain: Insights from functional neuroimaging studies. Cereb Cortex, 11, 1110–1123. Campbell, F. W., & Kulikowski, J. J. (1972). The visual evoked potential as a function of contrast of a grating pattern. J Physiol, 222, 345–356. Campbell, R. (2008). The processing of audio-visual speech: Empirical and neural bases. Philos Trans R Soc Lond B Biol Sci, 363, 1001–1010. Carrick, O. K., Thompson, J. C., Epling, J. A., & Puce, A. (2007). It’s all in the eyes: Neural responses to socially significant gaze shifts. Neuroreport, 18, 763–766. Chandrasekaran, C., & Ghazanfar, A. A. (2009). Di¤erent neural frequency bands integrate faces and voices di¤erently in the superior temporal sulcus. J Neurophysiol, 101, 773–788. Chiappa, K. H. (1983). Evoked potentials in clinical medicine. New York: Raven Press. Conty, L., N’Diaye, K., Tijus, C., & George, N. (2007). When eye creates the contact! ERP evidence for early dissociation between direct and averted gaze motion processing. Neuropsychologia, 45, 3024–3037. Crist, R. E., Wu, C. T., Karp, C., & Woldor¤, M. G. (2007). Face processing is gated by visual spatial attention. Front Hum Neurosci, 1, 10. De¤ke, I., Sander, T., Heidenreich, J., Sommer, W., Curio, G., Trahms, L., & Lueschow, A. (2007). MEG/EEG sources of the 170-ms response to faces are co-localized in the fusiform gyrus. Neuroimage, 35, 1495–1501. Demos, K. E., Kelley, W. M., Ryan, S. L., Davis, F. C., & Whalen, P. J. (2008). Human amygdala sensitivity to the pupil size of others. Cereb Cortex, 18, 2729–2734.
Eggermont, J. J., & Ponton, C. W. (2002). The neurophysiology of auditory perception: From single units to evoked potentials. Audiol Neurootol, 7, 71–99. Eimer, M. (1998). Does the face-specific N170 component reflect the activity of a specialized eye processor? Neuroreport, 9, 2945–2948. Fecteau, S., Armony, J. L., Joanette, Y., & Belin, P. (2004). Is voice processing species-specific in human auditory cortex? An fMRI study. Neuroimage, 23, 840–848. Ghazanfar, A. A., Chandrasekaran, C., & Logothetis, N. K. (2008). Interactions between the superior temporal sulcus and auditory cortex mediate dynamic face/voice integration in rhesus monkeys. J Neurosci, 28, 4457–4469. Ghazanfar, A. A., Maier, J. X., Ho¤man, K. L., & Logothetis, N. K. (2005). Multisensory integration of dynamic faces and voices in rhesus monkey auditory cortex. J Neurosci, 25, 5004–5012. Ghazanfar, A. A., & Schroeder, C. E. (2006). Is neocortex essentially multisensory? Trends Cogn Sci, 10, 278–285. Giard, M. H., Perrin, F., Echallier, J. F., Thevenet, M., Froment, J. C., & Pernier, J. (1994). Dissociation of temporal and frontal components in the human auditory N1 wave: A scalp current density and dipole model analysis. Electroencephalogr Clin Neurophysiol, 92, 238–252. Gil-da-Costa, R., Martin, A., Lopes, M. A., Munoz, M., Fritz, J. B., & Braun, A. R. (2006). Speciesspecific calls activate homologs of Broca’s and Wernicke’s areas in the macaque. Nat Neurosci, 9, 1064– 1070. Godey, B., Schwartz, D., de Graaf, J. B., Chauvel, P., & Liegeois-Chauvel, C. (2001). Neuromagnetic source localization of auditory evoked fields and intracerebral evoked potentials: A comparison of data in the same patients. Clin Neurophysiol, 112, 1850–1859. Golland, Y., Bentin, S., Gelbard, H., Benjamini, Y., Heller, R., Nir, Y., Hasson, U., & Malach, R. (2007). Extrinsic and intrinsic systems in the posterior cortex of the human brain revealed during natural sensory stimulation. Cereb Cortex, 17, 766–777. Hagler, D. J., Jr., Halgren, E., Martinez, A., Huang, M., Hillyard, S. A., & Dale, A. M. (2008). Source estimates for MEG/EEG visual evoked responses constrained by multiple, retinotopically-mapped stimulus locations. Human Brain Mapping, 30, 1290–1309. Halgren, E., Raij, T., Marinkovic, K., Jousmaki, V., & Hari, R. (2000). Cognitive response profile of the human fusiform face area as determined by MEG. Cereb Cortex, 10, 69–81. Halliday, A. M., McDonald, W. I., & Mushin, J. (1973). Delayed pattern-evoked responses in optic neuritis in relation to visual acuity. Trans Ophthalmol Soc UK, 93, 315–324. Hardee, J. E., Thompson, J. C., & Puce, A. (2008). The left amygdala knows fear: Laterality in the amygdala response to fearful eyes. Soc Cogn A¤ect Neurosci, 3, 47–54. Hooker, C. I., Paller, K. A., Gitelman, D. R., Parrish, T. B., Mesulam, M. M., & Reber, P. J. (2003). Brain networks for analyzing eye gaze. Brain Res Cogn Brain Res, 17, 406–418. Ioannides, A. A. (2007). Dynamic functional connectivity. Curr Opin Neurobiol, 17, 161–170. Itier, R. J., & Taylor, M. J. (2004). Source analysis of the N170 to faces and objects. Neuroreport, 15, 1261–1265. Kovacs, G., Zimmer, M., Banko, E., Harza, I., Antal, A., & Vidnyanszky, Z. (2006). Electrophysiological correlates of visual adaptation to faces and body parts in humans. Cereb Cortex, 16, 742–753. Kriegeskorte, N., Formisano, E., Sorger, B., & Goebel, R. (2007). Individual faces elicit distinct response patterns in human anterior temporal cortex. Proc Natl Acad Sci USA, 104, 20600–20605. Lewis, J. 
W., Brefczynski, J. A., Phinney, R. E., Janik, J. J., & DeYoe, E. A. (2005). Distinct cortical pathways for processing tool versus animal sounds. J Neurosci, 25, 5148–5158. Liu, A. K., Belliveau, J. W., & Dale, A. M. (1998). Spatiotemporal imaging of human brain activity using functional MRI constrained magnetoencephalography data: Monte Carlo simulations. Proc Natl Acad Sci USA, 95, 8945–8950. Maurer, U., Rossion, B., & McCandliss, B. D. (2008). Category specificity in early perception: Face and word N170 responses di¤er in both lateralization and habituation properties. Front Hum Neurosci, 2, 18.
McCarthy, G., Puce, A., Belger, A., & Allison, T. (1999). Electrophysiological studies of human face perception. II: Response properties of face-specific potentials generated in occipitotemporal cortex. Cereb Cortex, 9, 431–444. Noesselt, T., Driver, J., Heinze, H. J., & Dolan, R. (2005). Asymmetrical activation in the human brain during processing of fearful faces. Curr Biol, 15, 424–429. Pessoa, L. (2008). On the relationship between emotion and cognition. Nat Rev Neurosci, 9, 148–158. Petkov, C. I., Kayser, C., Steudel, T., Whittingstall, K., Augath, M., & Logothetis, N. K. (2008). A voice region in the monkey brain. Nat Neurosci, 11, 367–374. Ponkanen, L. M., Hietanen, J. K., Peltola, M. J., Kauppinen, P. K., Haapalainen, A., & Leppanen, J. M. (2008). Facing a real person: An event-related potential study. Neuroreport, 19, 497–501. Privman, E., Nir, Y., Kramer, U., Kipervasser, S., Andelman, F., Neufeld, M. Y., Mukamel, R., Yeshurun, Y., Fried, I., & Malach, R. (2007). Enhanced category tuning revealed by intracranial electroencephalograms in high-order human visual areas. J Neurosci, 27, 6234–6242. Puce, A., Allison, T., Bentin, S., Gore, J. C., & McCarthy, G. (1998). Temporal cortex activation in humans viewing eye and mouth movements. J Neurosci, 18, 2188–2199. Puce, A., Allison, T., & McCarthy, G. (1999). Electrophysiological studies of human face perception. III: E¤ects of top-down processing on face-specific potentials. Cereb Cortex, 9, 445–458. Puce, A., Berkovic, S. F., Cadusch, P. J., & Bladin, P. F. (1994a). P3 latency jitter assessed using 2 techniques. I. Simulated data and surface recordings in normal subjects. Electroencephalogr Clin Neurophysiol, 92, 352–364. Puce, A., Berkovic, S. F., Cadusch, P. J., & Bladin, P. F. (1994b). P3 latency jitter assessed using 2 techniques. II. Surface and sphenoidal recordings in subjects with focal epilepsy. Electroencephalogr Clin Neurophysiol, 92, 555–567. Puce, A., Epling, J. A., Thompson, J. C., & Carrick, O. K. (2007). Neural responses elicited to face motion and vocalization pairings. Neuropsychologia, 45, 93–106. Puce, A., & Perrett, D. (2003). Electrophysiology and brain imaging of biological motion. Philos Trans R Soc Lond B Biol Sci, 358, 435–445. Puce, A., Smith, A., & Allison, T. (2000). ERPs evoked by viewing facial movements. Cog Neuropsychol, 17, 221–239. Puce, A., Syngeniotis, A., Thompson, J. C., Abbott, D. F., Wheaton, K. J., & Castiello, U. (2003). The human temporal lobe integrates facial form and motion: Evidence from fMRI and ERP studies. Neuroimage, 19, 861–869. Regan, D. (1972). Evoked potentials in psychology, sensory physiology and clinical medicine. London: Chapman and Hall. Romanski, L. M., & Averbeck, B. B. (2009). The primate cortical auditory system and neural representation of conspecific vocalizations. Annual Review of Neuroscience, 32, 315–346. Romanski, L. M., Guclu, B., Hauser, M. D., Knoedl, D. J., & Diltz, M. (2004). Multisensory processing of faces and vocalizations in the primate ventrolateral prefrontal cortex. In Abstract Viewer/Itinerary Planner. Washington, DC: Society for Neuroscience. Rossion, B., & Jacques, C. (2008). Does physical interstimulus variance account for early electrophysiological face-sensitive responses in the human brain? Ten lessons on the N170. Neuroimage, 39, 1959–1979. Rousselet, G. A., Mace, M. J., & Fabre-Thorpe, M. (2004a). Spatiotemporal analyses of the N170 for human faces, animal faces and objects in natural scenes. Neuroreport, 15, 2607–2611. Rousselet, G. 
A., Mace, M. J., & Fabre-Thorpe, M. (2004b). Animal and human faces in natural scenes: How specific to human faces is the N170 ERP component? J Vis, 4, 13–21. Rousselet, G. A., Husk, J. S., Bennett, P. J., & Sekuler, A. B. (2007). Single-trial EEG dynamics of object and face visual processing. Neuroimage, 36, 843–862. Roy, A. K., Shehzad, Z., Margulies, D. S., Kelly, A. M., Uddin, L. Q., Gotimer, K., Biswal, B. B., Castellanos, F. X., & Milham, M. P. (2009). Functional connectivity of the human amygdala using resting state fMRI. NeuroImage, 45, 614–626.
Scheeringa, R., Petersson, K. M., Oostenveld, R., Norris, D. G., Hagoort, P., & Bastiaansen, M. C. (2009). Trial-by-trial coupling between EEG and BOLD identifies networks related to alpha and theta EEG power increases during working memory maintenance. Neuroimage, 44, 1224–1238. Schroeder, C. E., & Foxe, J. J. (2002). The timing and laminar profile of converging inputs to multisensory areas of the macaque neocortex. Brain Res Cogn Brain Res, 14, 187–198. Schroeder, C. E., Mehta, A. D., & Givre, S. J. (1998). A spatiotemporal profile of visual system activation revealed by current source density analysis in the awake macaque. Cereb Cortex, 8, 575–592. Sokol, S., & Riggs, L. A. (1971). Electrical and psychophysical responses of the human visual system to periodic variation of luminance. Invest Ophthalmol, 10, 171–180. Stekelenburg, J. J., & de Gelder, B. (2004). The neural correlates of perceiving human bodies: An ERP study on the body-inversion e¤ect. Neuroreport, 15, 777–780. Taylor, M. J., Itier, R. J., Allison, T., & Edmonds, G. E. (2001). Direction of gaze e¤ects on early face processing: Eyes-only versus full faces. Brain Res Cogn Brain Res, 10, 333–340. Thierry, G., Pegna, A. J., Dodds, C., Roberts, M., Basan, S., & Downing, P. (2006). An event-related potential component sensitive to images of the human body. Neuroimage, 32, 871–879. Thompson, J. C., Abbott, D. F., Wheaton, K. J., Syngeniotis, A., & Puce, A. (2004). Digit representation is more than just hand waving. Brain Res Cogn Brain Res, 21, 412–417. Thompson, J. C., Hardee, J. E., Panayiotou, A., Crewther, D., & Puce, A. (2007). Common and distinct brain activation to viewing dynamic sequences of face and hand movements. Neuroimage, 37, 966–973. Watanabe, S., Kakigi, R., & Puce, A. (2003). Di¤erences in temporal and spatial processing between upright and inverted faces: A magneto- and electro-encephalographic study. Neuroscience, 116, 879–895. Wheaton, K. J., Pipingas, A., Silberstein, R. B., & Puce, A. (2001). Human neural responses elicited to observing the actions of others. Vis Neurosci, 18, 401–406. Wheaton, K. J., Thompson, J. C., Syngeniotis, A., Abbott, D. F., & Puce, A. (2004). Viewing the motion of human body parts activates di¤erent regions of premotor, temporal, and parietal cortex. Neuroimage, 22, 277–288. Wicke, J. D., Donchin, E., & Lindsley, D. B. (1964). Visual evoked potentials as a function of flash luminance and duration. Science, 146, 83–85.
10
Perception of Dynamic Facial Expressions and Gaze
Patrik Vuilleumier and Ruthger Righart
Facial expressions have a quick onset and a brief duration, being produced and perceived with relative ease without much intentional planning in most everyday situations (Darwin, 1872/1998; Rinn, 1984; Ekman, 1992). This rapid and relatively automatic activity is important because it serves to communicate efficiently with others in social situations and to allow rapid adaptation in potentially threatening situations (Ekman, 1992; Darwin, 1872/1998). To allow such swift processing, it is possible that a specific pathway in the brain might be dedicated to facial expressions, separate from processing facial identity. Accordingly, separate pathways for face identity and expressions have been proposed by the most influential psychological and neural models of face recognition (Bruce & Young, 1986; Haxby, Hoffman, & Gobbini, 2000; Gobbini & Haxby, 2007). In these models, invariant and variant (i.e., changeable) aspects of faces are represented differently. Variant aspects may include expressions (emotional or otherwise), as well as gaze shifts, speech-related mouth movements, or even head-orientation changes. Early neuropsychological observations have revealed double dissociations in impairments for recognizing facial expressions, gaze, and identity, and numerous subsequent neuroimaging studies have likewise shown (at least partly) dissociable functional networks activated by expressions, gaze, and identity, providing convincing support for these models (see review in Vuilleumier, 2007). However, it should be noted that both neuropsychological and neuroimaging studies have mainly used static, "frozen" faces. To better understand the processing of facial expressions or gaze direction in an ecological manner, the use of dynamic images in which the temporal unfolding of the expression is visible seems necessary. What are the differences in information content between static and dynamic expressions? Static expressions are still images that imply a movement or an expression that has occurred or is occurring, whereas dynamic expressions are consecutive frames or clips in which there is an ongoing movement showing the characteristic visual features of the face changing over time. Thus, because of their implied meaning, static images may rely to some extent on imagery or more abstract knowledge in
the observer, whereas dynamic expressions may involve more purely sensory-driven processing of moment-to-moment information (even though sensory predictions may be formed over time based on available motion cues; see Grave de Peralta et al., 2008) . Besides expressions created by movements of the facial musculature, other important dynamic changes may concern the shift of gaze direction toward a significant event in the environment, or toward the observer. Rigid head movements, lip movements (reviewed by O’Toole, Roark, & Abdi, 2002), and change of pupil size (reported by Demos et al., 2008) are also dynamic characteristics that can a¤ect perception of emotion and social processing so as to guide inferences of mental state, moods, and intentions; however, the latter subjects are beyond the scope of this review. Furthermore, all these changes may co-occur in time and thus interact with each other, but except for the integration between gaze and expression (e.g., Adams et al., 2003; Sander et al., 2007), these complex e¤ects are still poorly known and will not be covered here. For certain facial expressions of emotion, it is known that subtle changes in the temporal unfolding of the expression over time are crucial for an appropriate recognition of the displayed emotion or for the interpretation of expressions that follow each other. That subtle changes in movement may a¤ect perception is known also in lower levels of cortical processing (Viviani & Aymoz, 2001). For example, it has been suggested that the emotion of surprise or astonishment may typically be followed very shortly in time by an expression of fear or happiness (Darwin, 1872/ 1998). This expression of surprise is usually brief, almost unnoticeable to the eye, but may have a significant impact by enhancing attention and strengthening the perceived intensity of the subsequent expression. This example emphasizes that the full nature of perceiving emotion cannot be understood by single static images and that perception of faces can be fully understood only by studying the dynamic properties of facial expressions and the gaze shifts that occur. Are dynamic expressions better recognized than static expressions? There is strong evidence that this is the case (Wehrle, Kaiser, Schmidt, & Scherer, 2000; Ambadar, Schooler, & Cohn, 2005), although more behavioral data are certainly needed to clarify which features in the dynamics bring about the improvement. Such e¤ects have been found only when subjects were asked to discriminate among several alternative emotions (Wehrle et al., 2000; Ambadar et al., 2005), but not when they had to match facial expressions to a given target (Thornton & Kourtzi, 2002). Here we review recent studies in cognitive neuroscience that have used dynamic images of facial expressions and compare their results with those previously found with static expressions. In addition, we discuss the methodological challenges that researchers may face when using dynamic stimuli. These studies show that using dynamic images of facial expressions and gaze has increased the ecological validity
of results and resolved several ambiguous findings that had previously been obtained with static images.
Production of Facial Expressions
Emotions are communicated by facial expressions that may involve a specific musculature pattern, evoked by some innate motor programs (Ekman, 1992). The production of facial expressions has been studied by using electromyographic (EMG) measurements (Tassinary, Cacioppo, & Vanman, 2007). Studies that investigated the production of facial expressions have measured the response of both the upper and lower musculature of the face. It is known that these muscles are innervated by the facial nucleus in the pons (in the brainstem). The motor nucleus in the pons is controlled by activations from the motor strip in the anterior lip of the central sulcus (the primary motor cortex; see Rinn, 1984; Morecraft et al., 2001, 2004). Emotional inputs might influence these motor pathways indirectly through connections from the amygdala and limbic areas (e.g., the orbitofrontal or cingulate cortex) to the prefrontal cortex or directly to other nuclei in the brainstem (Sah et al., 2003), thus a¤ording a rapid production of facial expressions. EMG measures have also demonstrated distinctive responses of the observer’s facial musculature during the visual perception of facial expressions (Dimberg, Thunberg, & Elmehed, 2000), and in fact even during visual perception of emotional body gestures (Magnee et al., 2006) or auditory perception of emotional voice prosody (Hietanen, Surakka, & Linnankoski, 1998). These facial muscular responses are rapid and usually not observable with the naked eye, yet can be readily detected by EMG (Cacioppo, Petty, Losch, & Kim, 1986; Schwartz, Brown, & Ahern, 1980) even when the subjects are not aware of the eliciting stimulus or do not report any subjective emotion (Dimberg et al., 2000). On the other hand, it was found that activity of the zygomaticus major (ZM), recorded over the cheek, and of the corrugator supercilii (CS), over the brow, can di¤erentiate the pleasantness and intensity of emotional reactions (Cacioppo et al., 1986). More generally, such responses are also evoked by the perception of pleasant and unpleasant scene images (Lang, Greenwald, Bradley, & Hamm, 1993), and by mental imagery for emotional stimuli (Fridlund, Schwartz, & Fowler, 1984; Schwartz et al., 1980). In addition, a direct relation between the perception of specific facial expressions and the response of the specific facial musculature has been shown. Several studies found that the ZM activity was increased by happy faces whereas the CS activity was increased with angry faces (e.g., Dimberg et al., 2000). This pattern of muscle activation was also found when subjects were instructed to inhibit any voluntary facial movements, which suggests that the facial reactions are not under conscious control (Dimberg, Thunberg, & Grunedal, 2002).
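As a rough illustration of how such facial EMG reactions are typically quantified, the sketch below rectifies a single-trial signal, subtracts a prestimulus baseline, and averages a poststimulus window separately for each muscle and condition. The function, window limits, and array names are our assumptions for the example, not the analysis pipeline of any particular study cited above.

```python
# Schematic sketch (assumptions ours): baseline-corrected, rectified EMG
# amplitude for one trial, to be averaged per muscle (ZM, CS) and condition.
import numpy as np

def emg_response(signal, t, baseline=(-0.5, 0.0), window=(0.5, 1.0)):
    """Mean rectified, baseline-corrected EMG amplitude in a poststimulus window.

    signal : 1-D EMG trace for one trial (arbitrary units)
    t      : time axis in seconds, with 0 = stimulus onset
    """
    rect = np.abs(signal)                                   # full-wave rectification
    base = rect[(t >= baseline[0]) & (t < baseline[1])].mean()
    post = rect[(t >= window[0]) & (t < window[1])].mean()
    return post - base

# Hypothetical usage: average per-trial responses for each muscle and condition,
# e.g. zygomaticus to happy faces versus corrugator to angry faces.
# zm_happy = np.mean([emg_response(trial, t) for trial in zm_trials_happy])
# cs_angry = np.mean([emg_response(trial, t) for trial in cs_trials_angry])
```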
Although these responses were evoked by the presentation of static facial expressions, it remains to be further investigated whether the presentation of dynamic facial expressions may result in a different response pattern, and in particular produce a different temporal unfolding. Prototypical facial expressions could, however, be recorded by video while subjects viewed dynamic movie clips of happy and angry faces (Sato & Yoshikawa, 2007). However, other results indicated that dynamic, compared with static, expressions led to increased reactions of ZM for happy avatar faces, whereas angry avatar faces did not evoke significant CS activation, even though they were rated as being more intense (Weyers, Muhlberger, Hefele, & Pauli, 2006). Conversely, another study reported CS activity in response to angry faces, but did not report data on the ZM response to happy faces, owing to technical problems (Hess & Blairy, 2001). Both CS and ZM activity could be found in a recent study that used photographic faces from the Ekman series presented in short sequences of consecutive frames (Achaibou, Pourtois, Schwartz, & Vuilleumier, 2008) and a videoclip with happy and angry expressions (Sato, Fujimura, & Suzuki, 2008). It was found that not just the ZM muscles responded to dynamic happy faces, but the CS muscles also responded significantly to dynamic angry faces. This effect was still observed after a large number of trials, which suggests a high degree of automaticity (and no complete habituation) of the observer's facial reactions to dynamic facial expressions. In addition, this study also found that the onset of the EMG response to angry faces was faster than for happy faces, a finding that was not observed in previous studies with static stimuli (Achaibou et al., 2008). This distinct time course may relate to a faster activation of the neural systems involved in processing threat-related stimuli (LeDoux, 1996; Öhman & Mineka, 2001), although it may also relate to some differences in the musculature, as there may be some advantage for the innervation of the upper facial muscles (Morecraft et al., 2001; Morecraft, Stilwell-Morecraft, & Rossing, 2004), which receive bilateral afferent inputs, unlike the lower muscles, which receive contralateral inputs only. Although much is now known about how facial expressions are produced in the observer when prototypical facial expressions are seen, it remains to be determined whether different speeds or sequences of muscular movements lead to quantitative or possibly qualitative changes in facial responses of the observer. Clearly, a different temporal unfolding might cause the brain to create a different perceptual anticipation of the ongoing movements and thus alter visual and emotional encoding of facial expressions (Grave de Peralta et al., 2008), but the exact neural pathways involved in such dynamic perceptual processes are still largely unknown. In addition, even less is known about how eye gaze is modified when different facial expressions are perceived and unfold over time. Differences in initial expressive cues as well as expectation might change visual scanning and produce different biases in face perception. Furthermore, dynamic shifts in the gaze direction of a perceived
face might then elicit corresponding gaze shifts in the observer (Ricciardelli, Bricolo, Aglioti, & Chelazzi, 2002; Ricciardelli, Betta, Pruner, & Turatto, 2009), and such an effect might be further modulated by the perceived significance of a gaze, depending on the nature of the concomitant expression. These questions underscore not only the need but also the potential benefit of considering more systematically the dynamic aspects of expression production in order to gain a richer knowledge of face perception.
Perception of Dynamic Facial Expressions
To accurately perceive a facial expression and its temporal dynamics, the neural system encoding this information should be able to operate with fast responses. In cognitive neuroscience, the most convenient methods for tracking the time course with which the brain responds to visual inputs conveying facial and expressive information on a millisecond basis are electroencephalography and magnetoencephalography. A large body of literature has been obtained by EEG and MEG studies of face perception during the past two decades, although most of these studies used static rather than dynamic face stimuli. This electrophysiological work has consistently revealed early neural responses to faces, indexed by well-characterized waveforms arising around 120 and 170 ms postonset, known as the P1 and N170 components, respectively (reviewed by Vuilleumier & Pourtois, 2007). Although the P1 component has often been related to spatial attention (Hillyard & Anllo-Vento, 1998), more recent studies have also shown that the P1 response is larger for faces than nonface objects, implying some differences in the early visual processes engaged by face encoding (Itier and Taylor, 2002). However, the P1 amplitude is modified not only for faces compared with objects, but has also been shown to be sensitive to emotional facial expressions, a modulation that can perhaps be related to enhanced attention and fast detection of facial expressions (Batty & Taylor, 2003; Pourtois, Dan, Grandjean, Sander, & Vuilleumier, 2005). In addition, a MEG study by Halgren, Raij, Marinkovic, Jousmäki, and Hari (2000) found a midline occipital source at ~110 ms postonset that distinguished happy and sad faces from neutral faces [for other results showing early EEG and MEG responses to faces, see Morel, Ponz, Mercier, Vuilleumier, & George, 2009]. An increase in P1 amplitude was also found to reflect attentional orienting toward a certain position in space when a fearful face was presented, as opposed to neutral faces presented at the same location (Pourtois et al., 2004). These electrophysiological results are consistent with behavioral findings that emotional faces are usually detected faster than neutral faces in visual search paradigms (reviewed by Frischen, Eastwood, & Smilek, 2008), indicating that emotional expression information can boost early visual processing of faces at the P1 latencies (see also Vuilleumier & Pourtois, 2007).
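For readers unfamiliar with how such component measures are obtained, the sketch below shows one common way of reading P1 and N170 amplitude and latency off an averaged waveform, by taking the extreme value within a latency window. The window boundaries and the function itself are illustrative assumptions, not the procedure used in the studies reviewed here.

```python
# Minimal sketch (assumptions ours): extract P1 (positive peak, ~80-140 ms)
# and N170 (negative peak, ~130-200 ms) from an averaged occipitotemporal ERP.
import numpy as np

def peak_in_window(erp, t, t_min, t_max, polarity):
    """Return (latency_s, amplitude) of the extreme of the given polarity (+1/-1)."""
    mask = (t >= t_min) & (t <= t_max)
    seg = erp[mask] * polarity          # flip sign so the target peak is a maximum
    i = np.argmax(seg)
    return t[mask][i], erp[mask][i]

# Hypothetical usage on an averaged waveform `erp` with time axis `t` (seconds):
# p1_lat, p1_amp = peak_in_window(erp, t, 0.08, 0.14, polarity=+1)
# n170_lat, n170_amp = peak_in_window(erp, t, 0.13, 0.20, polarity=-1)
```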
On the other hand, the N170 has also been related to an early stage of face processing (Bentin, Allison, Puce, Perez, & McCarthy, 1996) and may reflect a precategorical stage of structural encoding (Eimer, 2000). This stage has usually been related to the perceptual encoding of physiognomic-invariant properties of faces, allowing identification to take place at a later stage. According to the model of Bruce and Young (1986), structural encoding should depend on view-centered descriptions and be expression-independent. In line with this idea, several studies have indeed found that the N170 is not modulated by facial expressions (Eimer & Holmes, 2002; Holmes, Vuilleumier, & Eimer, 2003; Krolak-Salmon, Fischer, Vighetto, & Mauguiere, 2001). However, other studies have found that the N170 is di¤erentially modulated by the emotional expression of faces (Batty & Taylor, 2003; Stekelenburg & de Gelder, 2004; Caharel, Courtay, Bernard, Lalonde, & Rebai, 2005; Righart & de Gelder, 2006; Williams, Palmer, Liddell, Song, & Gordon, 2006; Blau, Maurer, Tottenham, & McCandliss, 2007; Krombholtz, Schaefer, & Boucsein, 2007; Morel et al., 2009). These e¤ects of facial expressions on the N170 appear robust against habituation e¤ects (Blau et al., 2007; but see Morel et al., 2009). However, unequivocal results are not available yet, and the exact visual computations reflected by the P1 and N170 during the processing of faces and facial expressions and gaze direction are still not known. Moreover, at the anatomical level, both identity and expression are known to modulate face-sensitive regions in the fusiform gyrus (e.g., Vuilleumier, Armony, Driver, & Dolan, 2003; Vuilleumier, 2007). These e¤ects might occur at di¤erent latencies and involve an integration of bottom-up and feedback signals within the same area (Sugase, Yamane, Ueno, & Kawano, 1999), suggesting a dynamic time course more complex than the serial processing scheme proposed by traditional models (Bruce & Young, 1986; Haxby et al., 2000; Gobbini & Haxby, 2007). Because static facial expressions and static gaze directions were used in these previous studies and may convey ambiguous or suboptimal information, a presentation of dynamic stimuli in future studies should help disambiguate these findings. Behaviorally, discrimination performance for emotional facial expressions benefits from dynamic features in the images (Wehrle et al., 2000; Ambadar et al., 2005). However, performance does not simply depend on the greater amount of information that is available in dynamic images (containing multiple frames) than in static images (containing a single frame). Surprisingly, similar benefits in performance are obtained when only the first and last image of a clip are shown than when the complete dynamic expression is shown [i.e., all successive images showing how the full expression unfolds; see Ambadar et al. (2005)]. However, there is also evidence that individuals have an explicit mental representation of how the visual unfolding of a dynamic expression proceeds because subjects are able to sort a series of static frames to make a logical clip (Edwards, 1998). Future studies are needed to determine whether this mental representation is necessary for improved recognition of emotion
in dynamic displays. Another unresolved question is whether (and how) dynamic information in faces might contribute to expression recognition under degraded viewing conditions. It has already been found that dynamic information contributes to recognition of identity when viewing conditions are degraded (O'Toole et al., 2002), and this might therefore be true for facial expressions as well. In keeping with these behavioral effects on perception, dynamic facial expressions may also affect early visual stages indexed by ERPs. A recent study used a simple dynamic paradigm in which two faces were presented successively. Smiling faces could be preceded by either a neutral face of the same person, a smiling face of a different person, or a neutral face of a different person. The second face elicited a larger N170 amplitude when the expression was changed relative to changes of identity or changes of both identity and expression (Miyoshi, Katayama, & Morotomi, 2004; see also Jacques & Rossion, 2006 for similar effects of identity change). This result demonstrates that the N170 for otherwise identical faces can be modified when the preceding face has a different expression, but only if it is the same person. This in turn suggests that visual information is integrated across successive visual inputs at the level of the N170 processing stage. A large body of positron emission tomography (PET) and fMRI studies has now delineated an extended network of brain regions selectively recruited during face perception (Gobbini & Haxby, 2007; Tsao, Moeller, & Freiwald, 2008); each of these regions is associated with distinct stages of processing, although their exact role also remains to be fully elucidated. Nevertheless, in accord with the classic psychological model of Bruce and Young (1986), a "core network" has been identified, which consists of extrastriate visual regions in the inferior occipital gyrus related to structural encoding, together with regions in the lateral fusiform cortex that are presumably critical for view-invariant recognition of face identity (Gauthier et al., 2000; Pourtois, Schwartz, Seghier, Lazeyras, & Vuilleumier, 2005), as well as one or several regions in the superior temporal sulcus that seem specifically responsive to changing features in faces such as gaze, expression, and lip movements (Puce et al., 2003; Winston, Henson, Fine-Goulden, & Dolan, 2004; Calder et al., 2007). As for ERPs, the number of fMRI studies that used dynamic faces (LaBar, Crupain, Voyvodic, & McCarthy, 2003; Kilts, Egan, Gideon, Ely, & Hoffman, 2003; Wicker et al., 2003; Sato, Kochiyama, Yoshikawa, Naito, & Matsumura, 2004; Van der Gaag, Minderaa, & Keysers, 2007) is far more limited in comparison with the numerous studies that have used static faces (reviewed by Zald, 2003). In general, fMRI responses were found to be increased in all face-sensitive regions for dynamic compared with static faces, including both the STS and fusiform cortex. Likewise, amygdala activation by emotional expressions has been reported to be enhanced for dynamic faces in movie clips (Van der Gaag et al., 2007; Sato et al., 2004). In one of these fMRI studies (Van der Gaag et al., 2007), the researchers
used three task conditions, including an "observation task" in which subjects performed a memory task as to whether they had seen the video clip before; a "discrimination task" in which they indicated whether the emotions in subsequently viewed videos were the same or different; and an "imitation task" in which subjects were instructed to imitate the movements and generate the corresponding emotion. The amygdala responded more to facial expressions than to the control condition consisting of abstract pattern motions, but there was no difference among neutral, happy, disgusted, and fearful expressions, and the effects were present across all task conditions. It is known that overt or covert imitation of facial expressions seen in faces might contribute to the profile of brain activation seen in fMRI (Wild, Erb, Eyb, Bartels, & Grodd, 2003) and ERP (Achaibou et al., 2008). Since structural connectivity has been found between the amygdala and the STS in animals (Amaral & Price, 1984), future studies also need to clarify how the processing of dynamic face information in the amygdala and STS might interact during the recognition of both emotion expressions and gaze direction, as suggested by neuroanatomical models of face recognition (Haxby et al., 2000; Gobbini & Haxby, 2007; Calder & Nummenmaa, 2007) as well as by cognitive models of affective appraisals (Sander et al., 2007).
Perception of Dynamic Shifts in Eye Gaze Direction
Accurate perception of someone’s gaze direction, particularly when seeing someone shifting their gaze away or looking toward the observer, is obviously an important cue to social attention (Kleinke, 1986; Langton, Watt, & Bruce, 2000). Behavioral studies have shown that the perception of gaze direction may trigger reflexive visuospatial orienting (Driver et al., 1999; Vuilleumier, 2002). Human observers are also slower to make gender judgments for faces with direct gaze than averted gaze, and are better on incidental recognition memory for faces in which the gaze was averted (Mason, Hood, & Macrae, 2004). These e¤ects may be particularly pronounced for faces of the opposite gender (Vuilleumier, George, Lister, Armony, & Driver, 2005). These results suggest that gaze direction can interact with the processing of other facial attributes, such as gender and identity. Furthermore, gaze direction may also exert a strong modulation on the processing of facial expressions of emotion (Adams et al., 2003; Adams & Kleck, 2003, 2005; Sander et al., 2007; N’Diaye, Sander, & Vuilleumier, 2009), as we will discuss later. ERP studies using faces with static gaze have found that N170 amplitudes were larger when gaze was averted than when it was directed to the observer (Watanabe, Miki, & Kakigi, 2002; Itier, Alain, Kovacevic, & McIntosh, 2007; but see Taylor, Itier, Allison, & Edmonds, 2001), and that latencies were shorter for averted eyes than for direct gaze (Taylor, George, & Ducorps, 2001). However, other studies
using dynamic gaze stimuli found that when the direction of the gaze is shifted toward the viewer, the N170 has a larger amplitude and a longer latency (Conty, N'Diaye, Tijus, & George, 2007), which is inconsistent with an earlier report that showed larger N170 amplitudes for gaze directed away from the viewer than for gaze directed toward the viewer (Puce, Smith, & Allison, 2000), and inconsistent with studies that used static gaze. Larger activity in response to averted gaze than to directed gaze arising around 170 ms postonset may relate to activity in the superior temporal sulcus rather than the fusiform cortex (Sato, Kochiyama, Uono, & Yoshikawa, 2008). This is consistent with a few studies that also found sources in the STS for the N170 response to static faces (Itier & Taylor, 2004; Pourtois, Dan, Grandjean, Sander, & Vuilleumier, 2005). Moreover, these STS responses are also consistent with proposals that this region is particularly involved in processing changeable features in faces (Haxby, Hoffman, & Gobbini, 2000; Gobbini & Haxby, 2007). Indeed, activation of the STS has also been reported for static facial expressions, which imply movement (Allison, Puce, & McCarthy, 2000), and for other types of facial movements (Puce & Perrett, 2003). However, different regions in the STS have been reported in different studies to be sensitive to gaze, expression, or an interaction between these two factors (e.g., Winston, Henson, Fine-Goulden, & Dolan, 2004; Calder et al., 2007), but the exact reason for these differences is not clear. One possible explanation for at least some inconsistencies might relate to differences in sensory cues or task factors that imply different degrees of dynamic information in the stimuli, even when the stimuli are static images. For example, the social or emotional interpretation of different gaze positions or gaze shifts might be ambiguous without knowing the context in which this occurs and thus vary among individual observers or studies, which is consistent with a contribution of STS activation by inferences of the intention of others and theory of mind (Blakemore, Winston, & Frith, 2004; Pelphrey, Singerman, Allison, & McCarthy, 2003). Future studies using well-controlled dynamic displays might also help to clarify the relationship between pure motion-related or spatial direction-related analysis of gaze and more abstract intention-related processing.
Interactions between Eye Gaze and Facial Expressions
Although facial expressions and gaze direction are often instantly recognizable, they may be quite ambiguous to observers in some contexts and less so in others (e.g., see Righart & de Gelder, 2008). Moreover, these two cues may sometimes influence each other, as illustrated by the work of Adams and colleagues (2003; Adams & Kleck, 2003, 2005), Sander and colleagues (2007), and N'Diaye and colleagues (2009). These authors showed that an interaction with
gaze direction may disambiguate the meaning of facial expressions, so that the same pattern of muscular movement in faces may yield different perceptions of emotion as a function of concomitant eye gaze. For instance, Adams et al. hypothesized that a direct gaze should facilitate the processing of approach-oriented emotions (e.g., anger and happiness), whereas an averted gaze should facilitate the processing of avoidance-oriented emotions (e.g., fear and sadness). Consistent with their hypothesis, they found that response times were faster for recognizing angry and happy expressions in faces with a directed gaze than an averted gaze, but were faster for fearful and sad faces with an averted than a directed gaze (Adams & Kleck, 2003). Their fMRI data were consistent with an interaction pattern as well, showing smaller amygdala responses for combinations in which the expression was better recognized (anger direct, fear averted), which was attributed to a resolution of ambiguity in these conditions because the source of threat was defined by the stimulus; responses were larger when the source of the threat was unclear and recognition more difficult (anger averted, fear direct) (Adams et al., 2003). This behavioral pattern of results was replicated in a psychophysical study using dynamic stimuli and more response alternatives (Sander et al., 2007), as well as in a recent imaging study using a set of computer-animated faces (N'Diaye et al., 2009). However, in the latter case, amygdala activation was found to be enhanced rather than decreased for stimuli in which an expression was perceived as more visible and more intense (anger direct, fear averted), which is consistent with another fMRI study that manipulated expression and head orientation instead of gaze direction (Sato, Kochiyama, Yoshikawa, Naito, & Matsumura, 2004). It is important to note that both behavioral and imaging findings with computer-animated faces (N'Diaye et al., 2009) indicated that the influence of gaze on expression recognition was strongest when the intensity of the facial expression was mild (with a computerized face displaying only half of the maximal expression range), whereas there was no significant influence when an expression was maximal and unambiguous. Thus, while the exact effects of gaze cues on the processing of emotional facial expressions in static pictures are still somewhat controversial (see Bindemann, Burton, & Langton, 2008), studies using more ecological dynamic stimuli (Sander et al., 2007; N'Diaye et al., 2009) provide more consistent data and emphasize that disambiguation effects due to gaze might operate close to threshold, when the intensity of an expression is insufficient. It should also be noted that there is no evidence that the results of Adams et al. (2003), which showed faster response times with a direct gaze for anger and happiness expressions but faster response times with an averted gaze for fear and sadness, were directly related to approach- and avoidance-oriented behavior, respectively. In fact, behavioral tendencies of approach and avoidance are difficult to assess (Ekman, 1992). In one study, however,
subjects reported the desire to get away from a situation more frequently when dynamic anger expressions were used (Hess & Blairy, 2001). Finally, a direct gaze in an angry face may evoke anger in some persons but fear in others. This issue also awaits further research and will probably require taking into account individual differences in personality and anxiety more systematically. In any case, an important conclusion drawn from data showing interactions of gaze and expression is that information from gaze and facial expression is combined at some stage of perceptual processing (Adams & Kleck, 2003). Contrary to ERP studies that found modulations of the P1 and N170 by both expression and gaze when studied separately, no such effects on these components could be found for specific patterns of interaction between facial expression (anger, happiness) and gaze direction. However, a significant interaction between gaze and expression has been observed in the later time course, around 270–450 ms postonset (Klucharev & Sams, 2004). It should be noted, however, that in the latter study the subjects were required to report repetition of gaze direction, which may have biased their attention toward gaze direction instead of the whole facial expression. The neural substrates underlying the integration of gaze and expression cues during recognition of emotion are still unclear, but recent imaging results indicate that behavioral modulations of emotion recognition (better for anger with direct gaze and for fear with averted gaze) were paralleled by increased responses in the amygdala, as well as in a paracingulate area in the medial prefrontal cortex previously associated with mentalizing (N'Diaye et al., 2009). By contrast, there was no interaction pattern in the STS. A direct role for the amygdala in such integration is further suggested by neuropsychological data showing that patients with amygdala damage exhibit a loss of the behavioral effects of gaze on expression recognition (Cristinzio, N'Diaye, Seeck, Vuilleumier, & Sander, 2010). Future studies combining advanced functional connectivity analysis and white-matter tractography are needed to clarify the dynamic flow of information from visual regions to the amygdala, STS, and medial prefrontal areas.
Relations between Perception and Production of Facial Expressions
For efficient processing and communication of emotion, especially in the case of threat-related signals, it is essential that perception be rapidly translated into the production of motor actions. In support of rapid access of emotion signals to motor systems, it is well established that seeing emotional faces may elicit a spontaneous, fast, and specific pattern of mimicry in the face of observers (see earlier discussion; Dimberg et al., 2000). Moreover, a direct relation between the production and perception of facial expressions has recently been shown in a study that combined simultaneous measurements of EMG and EEG in healthy volunteers who watched short
sequences of dynamic facial expressions (Achaibou et al., 2008). The P1 amplitude in response to happy faces was larger when mimicry activity in the ZM was stronger, and conversely, was larger for angry faces when mimicry activity in the CS was larger. The higher P1 amplitude may be related to increased attention to the facial information, which may then lead to stronger motor imitation. Greater activity of the ZM in response to happy faces and greater activity of the CS in response to angry faces were also associated with smaller N170 amplitudes over the right hemisphere for both expressions (i.e., when imitation was more intense, the N170 amplitude was attenuated). These findings suggest that high-order visual categorization processes were less strongly recruited by faces (leading to smaller N170 amplitudes) when mimicry was enhanced, and that mimicry presumably helps in recognizing the emotion displayed in faces (Achaibou et al., 2008). Accordingly, it is known that the responses of the facial musculature contribute to the recognition of facial expressions (Oberman, Winkielman, & Ramachandran, 2007). Conversely, interference with early sensorimotor processing due to focal brain damage (Adolphs, Damasio, Tranel, Cooper, & Damasio, 2000) or brief pulses of transcranial magnetic stimulation (TMS) (Pourtois et al., 2004) can produce a significant reduction in the recognition of emotional expressions. The exact neural pathways mediating the rapid transfer of visual face information to facial motor control are not known. An important structure with a key role in the rapid processing of emotions and related behavioral responses is the amygdala (LeDoux, 1996). It may also be part of the pathway for the production of facial expressions, but it is not known whether an intact amygdala is necessary to produce quick and automatic mimicry of facial expressions. Whereas the amygdala has projections to several brainstem nuclei that might influence autonomic responses and perhaps facial motor activity (Sah et al., 2003), other projections from limbic regions, the medial prefrontal cortex, and the hypothalamus are known to provide direct innervation of the facial motor nucleus that bypasses voluntary conscious control (Morecraft et al., 2004). Neuropsychological studies in patients with different lesion sites are needed to elucidate these questions in a more direct way.
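To make the logic of this trial-by-trial approach concrete, the following short Python sketch shows one way a relation between mimicry strength and ERP amplitude could be quantified. It is only an illustration: the simulated arrays, sampling rate, channel choice, and time windows are assumptions made for the example, not the actual analysis pipeline of Achaibou et al. (2008).

import numpy as np

rng = np.random.default_rng(0)
fs = 512                                  # assumed sampling rate (Hz)
n_trials = 120
n_samples = int(1.4 * fs)                 # epochs from -0.2 to +1.2 s (assumed)
t0 = int(0.2 * fs)                        # sample index of stimulus onset

# Stand-in data; in a real analysis these would be artifact-free epochs
# time-locked to the onset of each happy-face clip.
eeg_occ = rng.standard_normal((n_trials, n_samples))   # one occipital EEG channel
emg_zm = rng.standard_normal((n_trials, n_samples))    # zygomaticus major EMG

def window_mean(x, start_ms, end_ms):
    # Mean of each trial within a post-onset window given in milliseconds.
    a = t0 + int(start_ms * fs / 1000)
    b = t0 + int(end_ms * fs / 1000)
    return x[:, a:b].mean(axis=1)

# Trial-wise P1 amplitude (80-130 ms post-onset) and ZM mimicry strength
# (rectified EMG at 500-1000 ms minus the prestimulus baseline).
p1 = window_mean(eeg_occ, 80, 130)
zm = window_mean(np.abs(emg_zm), 500, 1000) - np.abs(emg_zm[:, :t0]).mean(axis=1)

# Compare P1 on trials with strong versus weak mimicry (median split),
# paralleling the high- versus low-imitation contrast described above.
strong = zm >= np.median(zm)
print("mean P1, strong mimicry:", p1[strong].mean())
print("mean P1, weak mimicry:  ", p1[~strong].mean())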
Methodological Issues
One of the problems when studying dynamic expressions with neuroscience methods is the need to control movement features in the stimulus that are unrelated to the perception of emotion, and to distinguish them from the exact movement parameters that are inherent to the perceived emotion. This is a challenging task and may well explain why researchers have traditionally preferred to use static faces in their experiments. For example, if fearful expressions have a quicker response onset and more rapid unfolding to their apex, would the observed effect be attributed to the
movement dynamics only, or to the interaction between movement cues and the perception of emotion? Several methods have been used to investigate the influence of dynamic properties on the perception of emotion in a controlled manner. One method is to present two static images in very quick succession and measure the response to the second stimulus as a function of the previous stimulus (Miyoshi et al., 2004; Ambadar et al., 2005; Jacques & Rossion, 2006; Conty et al., 2007), so that responses to identical second images can be compared as a function of the image that preceded them. However, the ecological validity may be limited because this procedure basically induces a condition of apparent motion without a full unfolding of different action units over the whole face. Other ways of presenting dynamic facial expressions include using multiple frames presented in rapid succession (Ambadar, Schooler, & Cohn, 2005; Achaibou et al., 2008), computer-generated virtual faces (Wehrle et al., 2000; Dyck et al., 2008; N'Diaye et al., 2009; Cristinzio et al., 2010), or real video animations (Hess & Blairy, 2001). However, although the video method has the highest ecological validity, it is not easy to use for a number of other reasons. In particular, the use of dynamic natural stimuli may introduce thorny problems for the measurement of behavioral and neural responses. First, behavioral response-time experiments that compare different dynamic conditions may be complicated because it is not clear which time point in the sequence should be used as the reference for measuring response latencies (i.e., at what moment does the progressive unfolding of the expression start conveying emotional significance and/or equal intensity across conditions?). A related problem exists for fMRI studies because identifying the time point at which the neural response to the interaction of interest between facial motion cues and emotion recognition starts is not straightforward. Even worse, the neural responses classically identified in ERPs (such as the P1 and N170) are known to be triggered by the sudden onset of a visual stimulus, but are not easily detected in a continuous dynamic flow of images or video clips (e.g., Tsuchiya, Kawasaki, Oya, Howard, & Adolphs, 2008). These timing constraints might require other methods of analyzing the data, for example, by using difference potentials (see Jacques & Rossion, 2006; Conty et al., 2007), frequency analysis (Tsuchiya et al., 2008), or sophisticated statistical methods based on Bayesian probabilistic models (Grave de Peralta et al., 2008). Nevertheless, as we have seen, there are several advantages to using dynamic faces. A major advantage is clearly that the images are more related to "real-life" situations and thus have increased ecological validity. We never see "frozen" images around us, but a continuous flow of stimuli. Whereas presenting still images may unintentionally allow subjects to achieve good behavioral performance using a "picture-to-picture" matching strategy based on pictorial codes (Bruce & Young, 1986), this is not possible with moving images. Another more technical advantage is that electrophysiological
recordings (i.e., EMG, EEG, and MEG) during continuous movies are less confounded by the sudden blinks (with brisk EMG responses over the CS) that the abrupt onsets of single static images typically, but undesirably, evoke.
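As an illustration of the difference-potential approach mentioned above, the short sketch below subtracts the average response to a repeated (unchanged) image from the average response to a changed image, so that activity time-locked to the change itself is isolated even though there is no sharp stimulus onset. The data are simulated, and the sampling rate, time window, and condition labels are assumptions for the example rather than the procedure of any particular study.

import numpy as np

rng = np.random.default_rng(1)
fs = 512                                   # assumed sampling rate (Hz)
n_trials = 100
n_samples = int(0.8 * fs)                  # epochs time-locked to the second image
times = np.arange(n_samples) / fs * 1000   # time axis in ms from the change

# Stand-in single-channel epochs for two conditions:
# 'change' = the second image differs from the first (e.g., averted to direct gaze);
# 'repeat' = the second image is identical to the first (control for ongoing activity).
change = rng.standard_normal((n_trials, n_samples))
repeat = rng.standard_normal((n_trials, n_samples))

# Difference potential: subtracting the repeat control removes activity
# that is not time-locked to the change itself.
diff_wave = change.mean(axis=0) - repeat.mean(axis=0)

# Latency of the most negative deflection in an N170-like window (130-200 ms).
win = (times >= 130) & (times <= 200)
peak_ms = times[win][np.argmin(diff_wave[win])]
print(f"most negative difference potential at {peak_ms:.0f} ms after the change")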
Conclusions
The past decades have provided a wealth of data concerning the psychophysical and neuroanatomical bases of face perception, including features that are intrinsically dynamic in nature, such as expression or gaze, but most of this knowledge has been obtained using relatively impoverished experimental contexts with single pictures of static faces. Many crucial questions are still open and might be usefully addressed by more systematically considering the temporal dynamics of face perception in natural conditions, not only from the point of view of the stimulus (e.g., in relation to facial or eye movements plus their interaction), but also from the point of view of the observer (e.g., in relation to visual scanning, expression mimicry, or gaze-following, as well as anticipation or predictive coding). Although this dynamic approach may create technical constraints, recent advances in our knowledge of the brain's architecture and in methodological tools make the time ripe to embark on more ecological investigations with more realistic stimuli. Future developments in computer animation and virtual reality will certainly foster these approaches and lead to new research perspectives in cognitive neuroscience. More generally, we believe that using dynamic stimuli will also be critical to obtaining crucial new insights into the dynamic nature of perception and the underlying brain functions.

References

Achaibou, A., Pourtois, G., Schwartz, S., & Vuilleumier, P. (2008). Simultaneous recording of EEG and facial muscle reactions during spontaneous emotional mimicry. Neuropsychologia, 46, 1104–1113.
Adams, R. B., Gordon, H. L., Baird, A. A., Ambady, N., & Kleck, R. E. (2003). Effects of gaze on amygdala sensitivity to anger and fear faces. Science, 300, 1536.
Adams, R. B., & Kleck, R. E. (2003). Perceived gaze direction and the processing of facial displays of emotion. Psychological Science, 14, 644–647.
Adams, R. B., & Kleck, R. E. (2005). Effects of direct and averted gaze on the perception of facially communicated emotion. Emotion, 5, 3–11.
Adolphs, R., Damasio, H., Tranel, D., Cooper, G., & Damasio, A. R. (2000). A role for somatosensory cortices in the visual recognition of emotion as revealed by three-dimensional lesion mapping. Journal of Neuroscience, 20, 2683–2690.
Allison, T., Puce, A., & McCarthy, G. (2000). Social perception from visual cues: Role of the STS region. Trends in Cognitive Sciences, 4, 267–278.
Amaral, D. G., & Price, J. L. (1984). Amygdalo-cortical projections in the monkey (Macaca fascicularis). Journal of Comparative Neurology, 230, 465–496.
Ambadar, Z., Schooler, J. W., & Cohn, J. F. (2005). Deciphering the enigmatic face: The importance of facial dynamics in interpreting subtle facial expressions. Psychological Science, 16, 403–410.
Batty, M., & Taylor, M. J. (2003). Early processing of the six basic facial emotional expressions. Cognitive Brain Research, 17, 613–620.
Bentin, S., Allison, T., Puce, A., Perez, A., & McCarthy, G. (1996). Electrophysiological studies of face perception in humans. Journal of Cognitive Neuroscience, 8, 551–565.
Bindemann, M., Burton, A. M., & Langton, S. R. H. (2008). How do eye-gaze and facial expression interact? Visual Cognition, 16(6), 708–733.
Blakemore, S.-J., Winston, J., & Frith, U. (2004). Social cognitive neuroscience: Where are we heading? Trends in Cognitive Sciences, 8, 216–222.
Blau, V. C., Maurer, U., Tottenham, N., & McCandliss, B. (2007). The face-specific N170 component is modulated by emotional facial expression. Behavioral and Brain Function, 3, 7.
Bruce, V., & Young, A. (1986). Understanding face recognition. British Journal of Psychology, 77, 305–327.
Cacioppo, J. T., Petty, R. E., Losch, M. E., & Kim, H. S. (1986). Electromyographic activity over facial muscle regions can differentiate the valence and intensity of affective reactions. Journal of Personality and Social Psychology, 50, 260–268.
Caharel, S., Courtay, N., Bernard, C., Lalonde, R., & Rebai, M. (2005). Familiarity and emotional expression influence an early stage of processing: An electrophysiological study. Brain and Cognition, 59, 96–100.
Calder, A. J., Beaver, J. D., Winston, J. S., Dolan, R. J., Jenkins, R., Eger, E., & Henson, R. N. A. (2007). Separate coding of different gaze directions in the superior temporal sulcus and inferior parietal lobule. Current Biology, 17, 20–25.
Calder, A. J., & Nummenmaa, L. (2007). Face cells: Separate processing of expression and gaze in the amygdala. Current Biology, 17, R371–372.
Conty, L., N'Diaye, K., Tijus, C., & George, N. (2007). When eye creates the contact! ERP evidence for early dissociation between direct and averted gaze motion processing. Neuropsychologia, 45, 3024–3037.
Cristinzio, C., N'Diaye, K., Seeck, M., Vuilleumier, P., & Sander, D. (2010). Integration of gaze direction and facial expression in patients with unilateral amygdala damage. Brain, 133, 248–261.
Darwin, C. (1872/1998). The expression of the emotions in man and animals. New York: Oxford University Press.
Demos, K. E., Kelley, W. M., Ryan, S. L., Davis, F. C., & Whalen, P. J. (2008). Human amygdala sensitivity to pupil size of others. Cerebral Cortex, 18, 2729–2734.
Dimberg, U., Thunberg, M., & Elmehed, K. (2000). Unconscious facial reactions to emotional facial expressions. Psychological Science, 11, 86–89.
Dimberg, U., Thunberg, M., & Grunedal, S. (2002). Facial reactions to emotional stimuli: Automatically controlled emotional responses. Cognition and Emotion, 16, 449–471.
Driver, J., Davis, G., Ricciardelli, P., Kidd, P., Maxwell, E., & Baron-Cohen, S. (1999). Gaze perception triggers reflexive visuospatial orienting. Visual Cognition, 6, 509–540.
Dyck, M., Winbeck, M., Leiberg, S., Chen, Y., Gur, R. C., & Mathiak, K. (2008). Recognition profile of emotions in natural and virtual faces. PLOS One, e3628.
Edwards, K. (1998). The face of time. Temporal cues in facial expressions of emotion. Psychological Science, 9, 270–276.
Eimer, M. (2000). The face-specific N170 component reflects late stages in the structural encoding of faces. Neuroreport, 11, 2319–2324.
Eimer, M., & Holmes, A. (2002). An ERP study on the time-course of emotional face processing. Neuroreport, 13, 427–431.
Ekman, P. (1992). An argument for basic emotions. Cognition and Emotion, 6, 169–200.
Fridlund, A. J., Schwartz, G. E., & Fowler, S. C. (1984). Pattern recognition of self-reported emotional state from multiple-site facial EMG activity during affective imagery. Psychophysiology, 21, 622–637.
Frischen, A., Eastwood, J. D., & Smilek, D. (2008). Visual search for faces with emotional expressions. Psychological Bulletin, 134, 662–676.
Gauthier, I., Tarr, M. J., Moylan, J., Skudlarski, P., Gore, J. C., & Anderson, A. W. (2000). The fusiform "face area" is part of a network that processes faces at the individual level. Journal of Cognitive Neuroscience, 12, 495–504.
Gobbini, M. I., & Haxby, J. V. (2007). Neural systems for recognition of familiar faces. Neuropsychologia, 45, 32–41.
Grave de Peralta, R., Achabou, A., Bessière, P., Vuilleumier, P., & Gonzalez, S. (2008). Bayesian models of mentalizing. Brain Topography, 20(4), 278–283.
Halgren, E., Raij, T., Marinkovic, K., Jousmäki, V., & Hari, R. (2000). Cognitive response profile of the human fusiform face area as determined by MEG. Cerebral Cortex, 10, 69–81.
Haxby, J. V., Hoffman, E. A., & Gobbini, M. I. (2000). The distributed human neural system for face perception. Trends in Cognitive Sciences, 4, 223–232.
Hess, U., & Blairy, S. (2001). Facial mimicry and emotion contagion to dynamic emotional facial expressions and their influence on decoding accuracy. International Journal of Psychophysiology, 40, 129–141.
Hietanen, J. K., Surakka, V., & Linnankoski, I. (1998). Facial electromyographic responses to vocal affect expressions. Psychophysiology, 35, 530–536.
Hillyard, S. A., & Anllo-Vento, L. (1998). Event-related potentials in the study of visual selective attention. Proceedings of the National Academy of Sciences of the United States of America, 95, 781–787.
Holmes, A., Vuilleumier, P., & Eimer, M. (2003). The processing of emotional facial expressions is gated by spatial attention: Evidence from event-related brain potentials. Cognitive Brain Research, 16, 174–184.
Itier, R. J., & Taylor, M. J. (2002). Inversion and contrast polarity reversal affect both encoding and recognition processes of unfamiliar faces: A repetition study using ERPs. Neuroimage, 15, 353–372.
Itier, R. J., & Taylor, M. J. (2004). Source analysis of the N170 to faces and objects. Neuroreport, 15, 1261–1265.
Itier, R. J., Alain, C., Kovacevic, N., & McIntosh, A. R. (2007). Explicit versus implicit gaze processing assessed by ERPs. Brain Research, 1177, 79–89.
Jacques, C., & Rossion, B. (2006). The speed of individual face categorization. Psychological Science, 17, 485–492.
Kilts, C. D., Egan, G., Gideon, D. A., Ely, T. D., & Hoffman, J. M. (2003). Dissociable neural pathways are involved in the recognition of emotion in static and dynamic facial expressions. Neuroimage, 18, 156–168.
Kleinke, C. L. (1986). Gaze and eye contact: A research review. Psychological Bulletin, 100, 78–100.
Klucharev, V., & Sams, M. (2004). Interaction of gaze direction and facial expression processing: ERP study. Neuroreport, 15, 621–625.
Krolak-Salmon, P., Fischer, C., Vighetto, A., & Mauguiere, F. (2001). Processing of facial emotional expression: Spatio-temporal data as assessed by scalp event-related potentials. European Journal of Neuroscience, 13, 987–994.
Krombholz, A., Schaefer, F., & Boucsein, W. (2007). Modification of N170 by different emotional expression of schematic faces. Biological Psychology, 76, 156–162.
LaBar, K. S., Crupain, M. J., Voyvodic, J. T., & McCarthy, G. (2003). Dynamic perception of facial affect and identity in the human brain. Cerebral Cortex, 13, 1023–1033.
Lang, P. J., Greenwald, M. K., Bradley, M. M., & Hamm, A. O. (1993). Looking at pictures: Affective, facial, visceral, and behavioral reactions. Psychophysiology, 30, 261–273.
Langton, S. R., Watt, R. J., & Bruce, V. (2000). Do the eyes have it? Cues to the direction of social attention. Trends in Cognitive Sciences, 4, 50–59.
LeDoux, J. E. (1996). The emotional brain. New York: Simon & Schuster.
Magnée, M. J. C. M., Stekelenburg, J. J., Kemner, C., & de Gelder, B. (2006). Similar facial electromyographic responses to faces, voices, and body expressions. Neuroreport, 18, 369–372.
Mason, M. F., Hood, B. M., & Macrae, C. N. (2004). Look into my eyes: Gaze direction and person memory. Memory, 12, 637–643.
Miyoshi, M., Katayama, J., & Morotomi, T. (2004). Face-specific N170 component is modulated by facial expressional change. Neuroreport, 15, 911–914.
Morecraft, R. J., Louie, J. L., Herrick, J. L., & Stilwell-Morecraft, K. S. (2001). Cortical innervation of the facial nucleus in the non-human primate. A new interpretation of the effect of stroke and related subtotal brain trauma on the muscles of facial expression. Brain, 124, 176–208.
Morecraft, R. J., Stilwell-Morecraft, K. S., & Rossing, W. R. (2004). The motor cortex and facial expression: New insights from neuroscience. Neurologist, 10, 235–249.
Morel, S., Ponz, A., Mercier, M., Vuilleumier, P., & George, N. (2009). EEG-MEG evidence for early differential repetition effects for fearful, happy and neutral faces. Brain Research, 1254, 84–98.
N'Diaye, K., Sander, D., & Vuilleumier, P. (2009). Self-relevance processing in the human amygdala: Gaze direction, facial expression, and emotion intensity. Emotion, 9, 798–806.
Oberman, L. M., Winkielman, P., & Ramachandran, V. S. (2007). Face to face: Blocking facial mimicry can selectively impair recognition of emotional expressions. Social Neuroscience, 2, 167–178.
Öhman, A., & Mineka, S. (2001). Fears, phobias, and preparedness: Towards an evolved module of fear and fear learning. Psychological Review, 108, 483–522.
O'Toole, A. J., Roark, D. A., & Abdi, H. (2002). Recognizing moving faces: A psychological and neural synthesis. Trends in Cognitive Sciences, 6, 261–266.
Pelphrey, K. A., Singerman, J. D., Allison, T., & McCarthy, G. (2003). Brain activation evoked by perception of gaze shifts: The influence of context. Neuropsychologia, 41, 156–170.
Pourtois, G., Dan, E. S., Grandjean, D., Sander, D., & Vuilleumier, P. (2005). Enhanced extrastriate visual response to bandpass spatial frequency-filtered fearful faces: Time course and topographic evoked-potentials mapping. Human Brain Mapping, 26, 65–79.
Pourtois, G., Grandjean, D., Sander, D., & Vuilleumier, P. (2004). Electrophysiological correlates of rapid spatial orienting towards fearful faces. Cerebral Cortex, 14, 619–633.
Pourtois, G., Sander, D., Andres, M., Grandjean, D., Reveret, L., Olivier, E., & Vuilleumier, P. (2004). Dissociable roles of the human somatosensory and superior temporal cortices for processing social face signals. European Journal of Neuroscience, 1–9.
Pourtois, G., Schwartz, S., Seghier, M. L., Lazeyras, F., & Vuilleumier, P. (2005). Portraits or people? Distinct representations of face identity in the human visual cortex. Journal of Cognitive Neuroscience, 17, 1043–1057.
Puce, A., & Perrett, D. (2003). Electrophysiology and brain imaging of biological motion. Philosophical Transactions of the Royal Society: Biological Sciences, 358, 435–445.
Puce, A., Smith, A., & Allison, T. (2000). ERPs evoked by viewing facial movements. Cognitive Neuropsychology, 17, 221–239.
Puce, A., Syngeniotis, A., Thompson, J. C., Abbott, D. F., Wheaton, K. J., & Castiello, U. (2003). The human temporal lobe integrates facial form and motion: Evidence from fMRI and ERP studies. Neuroimage, 19, 861–869.
Ricciardelli, P., Bricolo, E., Aglioti, S. M., & Chelazzi, L. (2002). My eyes want to look where your eyes are looking: Exploring the tendency to imitate another individual's gaze. Neuroreport, 13, 2259–2263.
Ricciardelli, P., Betta, E., Pruner, S., & Turatto, M. (2009). Is there a direct link between gaze perception and joint attention behaviours? Effect of gaze contrast polarity on oculomotor behaviour. Experimental Brain Research, 194, 347–357.
Righart, R., & de Gelder, B. (2006). Context influences early perceptual analysis of faces: An electrophysiological study. Cerebral Cortex, 16, 1249–1257.
Righart, R., & de Gelder, B. (2008). Recognition of facial expressions is influenced by emotional scene gist. Cognitive, Affective, and Behavioral Neuroscience, 8, 264–272.
Rinn, W. E. (1984). The neuropsychology of facial expression: A review of the neurological and psychological mechanisms for producing facial expressions. Psychological Bulletin, 95, 52–77.
Sander, D., Grandjean, D., Kaiser, S., Wehrle, T., & Scherer, K. R. (2007). Interaction effects of perceived gaze direction and dynamic facial expression: Evidence for appraisal theories of emotion. European Journal of Cognitive Psychology, 19, 470–480.
Sah, P., Faber, E. S. L., Lopez de Armentia, M., & Power, J. (2003). The amygdaloid complex: Anatomy and physiology. Physiological Review, 83, 803–834.
Sato, W., Fujimara, T., & Suzuki, N. (2008). Enhanced facial EMG activity in response to dynamic facial expressions. International Journal of Psychophysiology, 70, 70–74.
Sato, W., Kochiyama, T., Yoshikawa, S., Naito, E., & Matsumura, M. (2004). Enhanced neural activity in response to dynamic facial expressions of emotion: An fMRI study. Cognitive Brain Research, 20, 81–91.
Sato, W., Kochiyama, T., Uono, S., & Yoshikawa, S. (2008). Time course of superior temporal sulcus activity in response to eye gaze: A combined fMRI and MEG study. Social Cognitive and Affective Neuroscience, 3, 224–232.
Sato, W., & Yoshikawa, S. (2007). Spontaneous facial mimicry in response to dynamic facial expressions. Cognition, 104, 1–18.
Schwartz, G., Brown, S., & Ahern, G. (1980). Facial muscle patterning and subjective experience during affective imagery: Sex differences. Psychophysiology, 17, 75–82.
Stekelenburg, J. J., & de Gelder, B. (2004). The neural correlates of perceiving human bodies. Neuroreport, 15, 777–780.
Sugase, Y., Yamane, S., Ueno, S., & Kawano, K. (1999). Global and fine information coded by single neurons in the temporal visual cortex. Nature, 400, 869–873.
Tassinary, L. G., Cacioppo, J. T., & Vanman, E. J. (2007). The skeletomotor system: Surface electromyography. In J. T. Cacioppo, L. G. Tassinary, & G. G. Berntson (eds.), Handbook of psychophysiology, 3rd ed. New York: Cambridge University Press, pp. 267–302.
Taylor, M. J., Itier, R. J., Allison, T., & Edmonds, G. E. (2001). Direction of gaze effects on early face processing: Eyes-only versus full faces. Cognitive Brain Research, 10, 333–340.
Taylor, M. J., George, N., & Ducorps, A. (2001). Magnetoencephalographic evidence of early processing of gaze in humans. Neuroscience Letters, 316, 173–177.
Thornton, I. M., & Kourtzi, Z. (2002). A matching advantage for dynamic faces. Perception, 31, 113–132.
Tsao, D. Y., Moeller, S., & Freiwald, W. (2008). Comparing face patch systems in macaques and humans. Proceedings of the National Academy of Sciences of the United States of America, 105, 19514–19519.
Tsuchiya, N., Kawasaki, H., Oya, H., Howard, M. A., & Adolphs, R. (2008). Decoding face information in time, frequency and space from direct intracranial recordings of the human brain. PLOS One, e3892.
van der Gaag, C., Minderaa, R. B., & Keysers, C. (2007). The BOLD signal in the amygdala does not differentiate between dynamic facial expressions. SCAN, 2, 93–103.
Viviani, P., & Aymoz, C. (2001). Colour, form and movement are not perceived simultaneously. Vision Research, 41, 2909–2918.
Vuilleumier, P. (2002). Perceived gaze direction in faces and spatial attention: A study in patients with parietal damage and unilateral neglect. Neuropsychologia, 40, 1013–1026.
Vuilleumier, P. (2007). Neural representations of faces in human visual cortex: The roles of attention, emotion, and viewpoint. In N. Osaka, I. Rentschler, & I. Biederman (eds.), Object recognition, attention and action. Tokyo: Springer Verlag, pp. 119–138.
Vuilleumier, P., Armony, J. L., Driver, J., & Dolan, R. J. (2003). Distinct spatial frequency sensitivities for processing faces and emotional expressions. Nature Neuroscience, 6, 624–631.
Vuilleumier, P., George, N., Lister, V., Armony, J., & Driver, J. (2005). Effects of perceived mutual gaze and gender on face processing and recognition memory. Visual Cognition, 12, 85–101.
Vuilleumier, P., & Pourtois, G. (2007). Distributed and interactive brain mechanisms during emotion face perception: Evidence from functional neuroimaging. Neuropsychologia, 45, 174–194.
Watanabe, S., Miki, K., & Kakigi, R. (2002). Gaze direction affects face perception in humans. Neuroscience Letters, 325, 163–166.
Wehrle, T., Kaiser, S., Schmidt, S., & Scherer, K. (2000). Studying the dynamics of emotional expressions using synthesized facial muscle movements. Journal of Personality and Social Psychology, 78, 105–119.
Weyers, P., Muhlberger, A., Hefele, C., & Pauli, P. (2006). Electromyographic responses to static and dynamic avatar emotional facial expressions. Psychophysiology, 43, 450–453.
Wicker, B., Keysers, C., Plailly, J., Royet, J.-P., Gallese, V., & Rizzolatti, G. (2003). Both of us disgusted in My insula: The common neural basis of seeing and feeling disgust. Neuron, 40, 655–664.
Wild, B., Erb, M., Eyb, M., Bartels, M., & Grodd, W. (2003). Why are smiles contagious? An fMRI study of the interaction between perception of facial affect and facial movements. Psychiatry Research: Neuroimaging, 123, 17–36.
Williams, L. M., Palmer, D., Liddell, B. J., Song, L., & Gordon, E. (2006). The "when" and "where" of perceiving signals of threat versus non-threat. Neuroimage, 31, 458–467.
Winston, J. S., Henson, R. N. A., Fine-Goulden, M. R., & Dolan, R. J. (2004). fMRI-adaptation reveals dissociable neural representations of identity and expression in face perception. Journal of Neurophysiology, 92, 1830–1839.
Zald, D. H. (2003). The human amygdala and the emotional evaluation of sensory stimuli. Brain Research Reviews, 41, 88–123.
11
Moving and Being Moved: The Importance of Dynamic Information in Clinical Populations
B. de Gelder and J. Van den Stock
In encountering a person, what we are most easily conscious of is that their face gives us access to the person's identity. At the same time, the face provides many other kinds of information, such as gender, age, emotional expression, attractiveness, trustworthiness, and the like. It is likely that some kinds of information are relatively better conveyed by moving than by static faces. Some of these typical face attributes, for example, identity or affect, are also conveyed by stimuli other than faces, for example, whole bodies. And as is the case with faces, they may be conveyed by a still image as well as by its dynamic counterpart. Thus the relative importance of dynamic information is not an issue restricted to face recognition but is encountered just as well in investigations of object recognition in general. It has been debated whether faces are "special" as a set of visual stimuli. Likewise, one might ask whether the processing of dynamic information is special in the context of faces. Very few focused comparisons are available to answer this question, because such comparisons are challenging to make. A proper comparison of face perception and recognition abilities with other object perception and recognition abilities requires comparable task settings for the two stimulus classes (e.g., Damasio, Tranel, et al., 1990; Farah, 1990). The available comparisons have almost all used still images, and this makes it all the more difficult to assess the relative importance of dynamic information for the perception of faces.
Face Perception: Some Antecedents
The high salience of faces in everyday life is taken for granted and is reflected in the number of studies devoted to face recognition. Research targeting face recognition got a significant boost from the discovery of face-specific deficits after brain damage reported by Bodamer (1947). Investigations into the functional properties of face processing began with the first neuropsychological studies of Yin (1969), who reported a strong inversion effect for faces, and it has been growing exponentially
since the beginning of brain-imaging studies of face recognition. More and more clinical cases have also been reported over the past decade, with specific impairments in face recognition abilities. An overview of findings from functional magnetic resonance imaging studies in these clinical cases can be found in Van den Stock et al. (2008) and an overview of EEG studies in Righart and de Gelder (2007). The combined findings from behavioral, clinical, and neuroimaging studies are integrated in theoretical models of face perception, of which the model of Bruce and Young (1986) has been one of the most influential. Since then, a few other models of face perception have been proposed (e.g., Haxby, Hoffman, & Gobbini, 2000). They have increased our understanding mainly by integrating new findings about face recognition deficits, their neurofunctional basis, category specificity, the relative separation of subsystems such as those for identity and expression, the genetic basis of face recognition, the importance of movement information, and the contribution of real-world and contextual elements. The central notion in contemporary models is that different aspects of face perception, such as identity, expression, and direction of gaze, are processed in a brain network in which the different areas show relative functional specialization. The neurofunctional basis of processing facial identity in neurologically intact individuals is reasonably well understood. Sergent and Signoret (1992) first described the middle lateral fusiform gyrus (FG) as responsive to faces. Kanwisher, McDermott, and Chun (1997) later dubbed this region the fusiform face area (FFA). The occipital face area (OFA) is another important face-sensitive area located in the inferior occipital gyrus (Puce et al., 1996). Although these areas have been related to identity processing, the main area that comes into play when the face carries an emotional expression is the amygdala (AMG). The AMG plays a critical role in mediating emotional responses and actions (see Zald, 2003, for a review). Several studies support the notion that activity in the FFA increases as a result of feedback from the AMG (e.g., Breiter et al., 1996), and anatomical connections between the amygdala and the visual cortex have been established in primates (Amaral & Price, 1984). Faces expressing emotions also modulate OFA activity (Rotshtein et al., 2001). On the other hand, AMG-driven threat-related modulations also involve earlier visual areas such as V1 and other distant regions involved in social, cognitive, or somatic responses (e.g., the superior temporal sulcus [STS], cingulate, or parietal areas) (Catani et al., 2003). The rapid activity and/or the involvement of posterior visual areas in normal persons have been related to coarse processing of salient stimuli in subcortical structures. Support for subcortical processing of salient stimuli, of which facial expressions are an example, is also provided by residual face perception in patients with striate cortex lesions (Morris et al., 2001).
The Brain Basis of Face Perception in Neurologically Intact Individuals: Perceiving Movement
It needs no stating that in daily life the faces we perceive and interact with are almost continuously in motion, and our perceptual systems therefore have more experience with dynamic than with static faces. The movements generated by the complex musculature of the face or body make a substantial contribution to nonverbal communication. Moreover, there are several characteristics of a person that are almost exclusively revealed by the dynamic properties displayed in the face or body: looking at a photograph of Marlon Brando playing Don Corleone in The Godfather results in an experience quite different from that of watching his live performance in the scene in which he addresses the heads of the families. This difference is illustrative of the clear added value that lies in the temporal unfolding of dynamic facial expressions. Before developing this point, though, it is worth mentioning that using still images may have unique advantages in probing the neurofunctional basis of face processing in normal individuals as well as in neurological patients. Static patterns get the mind moving, as the brain processes the incoming still image by actively mapping it onto a representation that incorporates the movement and the temporal dynamics normally associated with this visual stimulus in the external world. Well-known studies by Shepard and Zare (1983) and by Freyd (1983), for example, have shown convincingly that still images can be fruitfully used to probe movement perception in the brain. Using still images of whole-body expressions, we observed activation in brain areas that are normally sensitive to movement, such as the STS, in human observers (de Gelder et al., 2004) and in macaques (de Gelder & Partan, 2009). Although the importance of dynamic expressions and their interpretation by conspecifics has long been recognized in the animal literature (Dawkins & Krebs, 1978), it is quite surprising that there are still only a few neuroimaging investigations with neurologically intact participants that used dynamic expressions. The dynamic information in facial expressions represents a specific kind of biological motion (Johansson, 1973). Therefore it is reasonable to expect that perceiving facial and bodily movements will activate areas known to be involved in movement perception, such as the hMT/V5 complex, and in the perception of socially relevant movement, such as the STS (Bonda et al., 1996). Furthermore, socially relevant and emotionally laden movement is likely to involve the AMG. A few studies throw light on these issues, but many open questions remain. For example, it is not known whether the neurofunctional basis of biological movement in faces and bodies is a special case of the more general ability for processing biological as contrasted with nonbiological movement. Alternatively, facial movement patterns that are specifically at the service of facial expressions may be a sui generis specialization of the brain that only minimally overlaps with the neurofunctional
mechanisms sustaining the perception of biological movement in general. The former possibility evokes the notion of a specialized speech module exclusively at the service of the analysis of visual speech. Liberman and colleagues (see Liberman, 1996) developed the argument for such an analysis model for phonetic gestures in the seventies and eighties. A review of the pro and con arguments is provided in the volume dedicated to Al Liberman. More recently, this approach to speech has been viewed as an example of action perception by researchers in the field of mirror neuron-based perception of action. However, once relatively complex stimuli are considered, it remains unclear what the relation is between movement and action perception and execution (Pichon et al., 2009). Furthermore, the motor theory of speech perception was motivated by the ambition to start from, but reach beyond, the available linguistic description of phonetic features and to define the set of motor primitives that may be the basis of speech perception. Neither for the more general case of biological movement nor for the specific case of human facial movements do we have descriptive theories available at present. Studies of mirror neuron activation have so far been restricted to individual single actions that do not yet allow insight into action primitives. Possibly the analysis of facial motor patterns (FACS, the facial action coding system; Ekman & Friesen, 1978) and bodily emotional motor patterns (BACS, the body action coding system; de Gelder & van Boxtel, 2009 [internal report]) that implement emotional expressions may provide input for a future theory of emotional movement primitives. With these caveats in mind, let us turn to the available research. In a positron emission tomography study by Kilts, Egan, Gideon, Ely, and Hoffman (2003), participants were presented with angry, happy, and neutral facial expressions and nonface objects that were either static or dynamic. The perception of dynamic neutral faces, compared with dynamic nonface objects, triggered activity in the AMG, STS, and FG, but none of these areas were active when dynamic neutral faces were compared with static neutral faces. However, dynamic angry faces elicited more activity in these areas than static angry faces. This highlights the importance of the emotional information conveyed by facial expressions in a comparison of dynamic and static faces. The increased recruitment of the AMG, STS, and FG by dynamic facial expressions might be specific for expressions with a negative valence, since there was no difference in these areas between dynamic and static displays of happy faces. Similar findings are reported with fMRI data; dynamic facial expressions (neutral, fear, and anger) yielded more activity than static emotional faces in the AMG and FG (LaBar, Crupain, Voyvodic, & McCarthy, 2003). An overview of currently available functional imaging studies with dynamic facial expressions is given in table 11.1.
Table 11.1 Functional imaging studies with dynamic facial expressions (activations tabulated for the AMG, FG, STS, and OFA)

Kilts (2003). PET; emotion intensity rating; contrasts: FneuD > nonFD, FneuD > FneuS, FangD > FangS, FhapD > FhapS.
LaBar (2003). fMRI; category classification; contrasts: FneuD > FneuS, FemoD > FemoS, FangD > FangS, FfeaD > FfeaS.
Puce (2003). fMRI; stimuli: FneuD, FneuD(l), nonFD; passive viewing; contrast: [FneuD + FneuD(l)] > nonFD.
Decety (2003). PET; stimuli: FneuD(1), FhapD(1), FsadD(1); mood rating; contrasts: FhapD > FneuD, FsadD > FneuD.
Wicker (2003). fMRI; stimuli: FneuD, FdisD, FhapD; passive viewing; contrasts: FdisD > FneuD, FhapD > FneuD.
Sato (2004). fMRI; stimuli: FfeaD, FhapD, FfeaS, FangS, nonFD; passive viewing; contrasts: FfeaD > FfeaS, FfeaD > nonFD, FhapD > FhapS, FhapD > nonFD.
Wheaton (2004). fMRI; stimuli: FneuD, FneuS; passive viewing; contrast: FneuD > FneuS.
Grosbras (2006). fMRI; stimuli: FneuD, FangD, nonFD; passive viewing; contrasts: FneuD > nonFD, FangD > nonFD.
van der Gaag (2007a). fMRI; stimuli: FneuD, FdisD, FfeaD, FhapD; passive viewing; contrast: FallD > R.
van der Gaag (2007b)(2). fMRI; stimuli: FneuD, FdisD, FfeaD, FhapD, nonFD; passive viewing and discrimination; contrast: FallD > nonFD (both tasks).
Thompson (2007). fMRI; stimuli: FneuD(s), nonFD; speed change detection; contrast: FneuD > nonFD.
Kret (2009). fMRI; stimuli: FangD, FfeaD, FneuD, BangD, BfeaD, BneuD; oddball detection; contrasts: FallD > BallD, FemoD > FneuD.

Notes: D, dynamic; S, static; Fang, angry face; Fdis, disgusted face; Ffea, fearful face; Fneu, neutral face; Fhap, happy face; Femo, emotional face; Fall, all faces; nonF, non-face; R, rest; (s), synthetic; (l), line drawing; Bang, angry body; Bfea, fearful body; Bneu, neutral body; Ball, all bodies.
1. The conditions reported involve the motoric expression of the stimuli, not the semantic content of the stories told by the actor.
2. No modulation of AMG activity by emotional content of faces.
The general findings show that comparisons between dynamic faces and dynamic nonface stimuli typically activate brain areas already known to be involved in the perception of static faces. Taken at face value, this result suggests that the difference in brain basis between seeing still and dynamic faces is quantitative rather than qualitative. However, a more focused comparison between dynamic and static faces shows a less clear picture, and the contrast becomes stronger when emotional expressions are part of the comparison. In a recent study we investigated the neural correlates of perceiving dynamic face images using a design built on a close comparison of face videos with body videos. To arrive at a better view of dynamic neutral versus emotional (fearful and angry) facial expressions, we used both categories and compared each with its counterpart (Kret, Grezes, Pichon, & de Gelder, submitted). The face versus body comparison showed activation in the AMG and hippocampus. Dynamic emotional faces yielded more activity in the FG and STS than dynamic neutral faces. We found no emotional modulation of the AMG by dynamic emotional compared with neutral faces, a result that is consistent with a study that focused on amygdala activation (van der Gaag, Minderaa, et al., 2007a).
Neurophysiological Studies in Monkeys
Single-cell recordings in monkeys have shown that cells in the inferior temporal cortex and the STS are responsive to different aspects of face perception (e.g., Bruce, Desimone, & Gross, 1986), including emotional expression (Hasselmo, Rolls, & Baylis, 1989). However, the use of dynamic face stimuli in neurophysiological monkey studies is rare. There is evidence of neurons that are sensitive to specific movements of the head (Hasselmo, Rolls, & Baylis, 1989) and to dynamic whole-body expressions (Oram
& Perrett, 1996). One neurophysiological study reported neurons in the monkey STS that are sensitive to facial dynamics such as closing of the eyes (Hasselmo, Rolls, & Baylis, 1989). Other cells have been described that are sensitive to threatening open mouths (Perrett & Mistlin, 1990). Similarly, in humans the STS was found to be active in response to social information when dynamic images were used (see table 11.1). Neurons in the amygdala have also been reported to be responsive to social information in monkeys (e.g., Brothers, Ring, & Kling, 1990).
Visual Object Agnosia and Face Agnosia or Prosopagnosia
Prosopagnosia was first reported by Bodamer (1947). The deficit involves recognition of personal identity but not of facial expression. This dissociation has been the cornerstone of the models of face processing in the neuropsychological literature of the past two decades and is at the basis of the face recognition model of Bruce and Young (1986). The typical complaint of a prosopagnosic is the inability to recognize a person by their face. This symptom is far more pronounced than the phenomenon everyone sometimes experiences when they have trouble remembering from where or how they know a certain face. Prosopagnosics can even have difficulties recognizing the persons they are very close to, such as their immediate family members.
Neural Correlates of Face Deficits
The focus on finding the neural correlate of the physically defined face category raised the expectation that patients with face recognition deficits would show lesions or anomalous activation in the normal face recognition areas. This has not always turned out to be the case, as shown by some recent patient studies using brain imaging (e.g., Steeves, Culham, Duchaine, Pratesi, Valyear, et al., 2006). When we turn to developmental prosopagnosia (DP), the situation is not clearer. Investigations into the neurofunctional correlates of DP with fMRI have yielded inconsistent results. Several studies reported increased activity for perceiving faces, compared with nonface stimuli, in the well-known face areas FFA and OFA (e.g., Hasson, Avidan, Dunhill, Berton et al., 2007), whereas the first fMRI study including a DP case by Hadjikhani and de Gelder (2002) and a more recent study (Bentin, Degutis, D'Esposito, & Robertson, 2007) found no face-specific activation in these areas. These findings suggest that intact functioning of the FFA and inferior occipital gyrus is necessary but not sufficient for successful face recognition. An important issue concerns the emotional information contained in the perceived faces. Recently, we observed reduced activation levels in the FFA of three
developmental prosopagnosics compared with control subjects when they looked at neutral faces. However, there was no difference between the two groups in the activation level of the FFA when the faces they viewed expressed either a happy or a fearful emotion (Van den Stock, van de Riet, Righart, & de Gelder, 2008). In the same study, we investigated the neural correlates of perceiving neutral and emotional whole-body expressions, and the results showed that in prosopagnosics the perception of bodies is associated with increased activation in face areas and the perception of faces elicits activity in body areas. Whole-body expressions are well suited as a control stimulus condition for faces since they are comparable to faces on a number of variables, for instance, the ability to express emotional information, gender, age, or familiarity.
Still versus Dynamic Face Images in Patient Studies
Impairments in recognizing emotion or identity in facial expressions have been reported in a variety of syndromes, such as Huntington's disease (Sprengelmeyer et al., 1996), Wilson's disease (Wang, Hoosain, Yang, Meng, & Wang, 2003), Urbach-Wiethe disease (Adolphs, Tranel, Damasio, & Damasio, 1994), Parkinson's disease (Sprengelmeyer, Young, et al., 2003), autism spectrum disorder (for a review, see Sasson, 2006), obsessive-compulsive disorder (Sprengelmeyer, Young, et al., 1997), schizophrenia (see Mandal, Pandey, & Prasad, 1998, for a review), Alzheimer's disease (Hargrave, Maddock, & Stone, 2002), semantic dementia (Bozeat, Lambon Ralph, Patterson, Garrard, & Hodges, 2000), attention deficit hyperactivity disorder (Singh, Ellis, Winton, Singh, & Leung, 1998), amyotrophic lateral sclerosis (Zimmerman, Eslinger, et al., 2007), and frontotemporal dementia (Lavenu, Pasquier, Lebert, Petit, & van der Linden, 1999). However, the bulk of these studies are based on the use of static stimuli, and recognition of static facial expressions demands a greater effort from the brain than recognition of dynamic expressions, since the brain has to compensate for the missing information about temporal dynamics. It is therefore not surprising that several studies with patients found superior recognition of dynamic facial expressions compared with static expressions (Tomlinson et al., 2006). One study with a prosopagnosic reported impaired identity recognition for static face pictures, but not for dynamic faces (Steede, Tree, et al., 2007), a pattern that was not compatible with a similar previous study (Lander, Humphreys, & Bruce, 2004). As far as the recognition of facial speech expressions is concerned, we tested a patient with prosopagnosia using still images of facial expressions as well as dynamic videos (de Gelder & Vroomen, 1998). Her performance with still facial expressions was poor but improved significantly when short videos were shown instead. The same pattern was observed in another prosopagnosic patient using point-light displays of emotional face expressions (Humphreys, Donnelly, & Riddoch, 1993).
Being Moved by Still Images
It is often assumed that dynamic stimuli are easier to decode than still images, and the most frequent argument is that dynamic images are more natural or more ecological and thereby more representative of the visual input the brain has evolved for. As we already pointed out, comparisons are complicated by the simple fact that dynamic stimuli contain much more information than still images. On the other hand, there are arguments about the specificity of movement perception that speak against a comparison based simply on the higher information content of dynamic images. One of these is the fact that the neuropsychological literature contains well-documented cases of movement perception disorders. One of the best known is Zihl's patient with bilateral lesions to V5. This patient had a severe movement perception deficit but had no difficulty in recognizing people by their faces and was not prosopagnosic. She was also able to read speech from static face images but could not perceive speech from dynamic images (Campbell, Zihl, Massaro, Munhall, & Cohen, 1997). The reverse pattern was observed in a patient with lesions in V4 (Humphreys, Donnelly, & Riddoch, 1993). A convergent argument, to which we have already alluded several times, in favor of nuancing the distinction between still and dynamic images is that studies using still images have reported activation in motor and premotor areas. This clearly means that the brain does not need to be shown movement in order to perceive it. As a matter of fact, using still images may provide a tool for assessing the brain's perceptual abilities beyond the information that is strictly physically present.
Face Perception in Hemianope Patients
Of particular interest for understanding the neurofunctional basis of perceiving facial movement are patients with damage to the primary visual cortex. Previous studies of such rare cases have illustrated the extent of residual movement vision that does not depend on an intact V1. It is interesting that movement perception with versus without awareness is correlated with different stimulus properties (for a review, see Weiskrantz, 2009). In our first investigation of the residual vision of hemianope patients, we used both still images and short video clips of faces, and we found that only the video clips triggered reliable recognition of facial expressions in the blind field. This suggested that the presence of movement may be a necessary condition for affective blindsight (de Gelder, Vroomen, Pourtois, & Weiskrantz, 1999). However, in subsequent studies we used EEG and later fMRI measurements and found clear evidence that still images were also processed (e.g., Rossion, de Gelder, Pourtois, Guérit, &
We later turned to a more sensitive behavioral paradigm than direct guessing: the redundant target effect, which exploits the advantage derived from summation across the two hemifields, here the sighted and the blind one. Using this paradigm, we showed that still face images are also reliably recognized (e.g., de Gelder, Pourtois, & Weiskrantz, 2002). In a recent study we reported that the presence of still facial and bodily images triggers muscular movements that can be measured by electromyography. These facial movements reflect the specific emotion expressed in the unseen stimulus, independently of whether it is a face or a body, and have shorter latencies when triggered by an unseen than by a seen stimulus. However, at no time are the subjects aware of the unseen stimulus or of their motor reaction to it (Tamietto et al., submitted).

Conclusion
The human perceptual system is eminently tuned to information provided by movement in the environment. The corollary of this is that when it is dealing with still images, the brain will automatically represent the dynamic information that is not, strictly speaking, present in the stimulus. Perceptual deficits, either congenital or as a consequence of brain damage in the normally developed brain, challenge our current understanding of the neurofunctional basis of movement perception. On the one hand there is little doubt that moving images provide more and richer information that, other things being equal, may make it easier for brains and perceptual systems weakened by disease to access information. On the other hand, there is also little doubt that in the developed brain a certain division of labor underpins fluent perceptual abilities. To approach this neuronal division of labor as exclusively a matter of specialized face, movement, or emotion modules may hamper our understanding of these perceptual abilities and the active role of the perceptual system in dealing with stimuli.

Acknowledgments
Preparation of this chapter was partly funded by European Union grant FP6-2005-NEST-Path Imp 043403-COBOL and supported by the Netherlands Science Foundation.

References

Adolphs, R., Tranel, D., Damasio, H., & Damasio, A. R. (1994). Impaired recognition of emotion in facial expressions following bilateral damage to the human amygdala. Nature, 372(6507), 669–672. Amaral, D. G., & Price, J. L. (1984). Amygdalo-cortical projections in the monkey (Macaca fascicularis). J Comp Neurol, 230(4), 465–496.
Bentin, S., Degutis, J. M., D'Esposito, M., & Robertson, L. C. (2007). Too many trees to see the forest: Performance, event-related potential, and functional magnetic resonance imaging manifestations of integrative congenital prosopagnosia. J Cogn Neurosci, 19(1), 132–146. Bodamer, J. (1947). Die Prosop-Agnosie. Archiv für Psychiatrie und Nervenkrankheiten, 179, 6–53. Bonda, E., Petrides, M., Ostry, D., & Evans, A. (1996). Specific involvement of human parietal systems and the amygdala in the perception of biological motion. J Neurosci, 16(11), 3737–3744. Bozeat, S., Lambon Ralph, M. A., Patterson, K., Garrard, P., & Hodges, J. R. (2000). Non-verbal semantic impairment in semantic dementia. Neuropsychologia, 38(9), 1207–1215. Breiter, H. C., Etcoff, N. L., Whalen, P. J., Kennedy, W. A., Rauch, S. L., Buckner, R. L., et al. (1996). Response and habituation of the human amygdala during visual processing of facial expression. Neuron, 17(5), 875–887. Brothers, L., Ring, B., & Kling, A. (1990). Response of neurons in the macaque amygdala to complex social stimuli. Behav Brain Res, 41(3), 199–213. Bruce, V., & Young, A. (1986). Understanding face recognition. Br J Psychol, 77(Pt 3), 305–327. Bruce, C. J., Desimone, R., & Gross, C. G. (1986). Both striate cortex and superior colliculus contribute to visual properties of neurons in superior temporal polysensory area of macaque monkey. J Neurophysiol, 55(5), 1057–1075. Campbell, R., Zihl, J., Massaro, D., Munhall, K., & Cohen, M. M. (1997). Speechreading in the akinetopsic patient, L.M. Brain, 120(Pt 10), 1793–1803. Catani, M., Jones, D. K., Donato, R., & Ffytche, D. H. (2003). Occipito-temporal connections in the human brain. Brain, 126(Pt 9), 2093–2107. Damasio, A. R., Tranel, D., & Damasio, H. (1990). Face agnosia and the neural substrates of memory. Annual Review of Neuroscience, 13, 89–109. Dawkins, R., & Krebs, J. R. (1978). Animals' signals: Information or manipulation. In J. R. Krebs and N. B. Davies (eds.), Behavioural ecology: An evolutionary approach. Oxford, UK: Blackwell, pp. 282–309. de Gelder, B., & Partan, S. (2009). The neural basis of perceiving emotional bodily expressions in monkeys. Neuroreport, 20(7), 642–646. de Gelder, B., & Vroomen, J. (1998). Impairment of speech-reading in prosopagnosia. Speech Comm, 26(1–2), 89–96. de Gelder, B., Pourtois, G., & Weiskrantz, L. (2002). Fear recognition in the voice is modulated by unconsciously recognized facial expressions but not by unconsciously recognized affective pictures. Proc Natl Acad Sci USA, 99(6), 4121–4126. de Gelder, B., Snyder, J., Greve, D., Gerard, G., & Hadjikhani, N. (2004). Fear fosters flight: A mechanism for fear contagion when perceiving emotion expressed by a whole body. Proc Natl Acad Sci USA, 101(47), 16701–16706. de Gelder, B., Van den Stock, J., Meeren, H. K. M., Sinke, C. B. A., Kret, M. E., & Tamietto, M. (2010). Standing up for the body. Recent progress in uncovering the networks involved in the perception of bodies and bodily expressions. Neuroscience & Biobehavioral Reviews, 34, 513–527. de Gelder, B., Vroomen, J., Pourtois, G., & Weiskrantz, L. (1999). Non-conscious recognition of affect in the absence of striate cortex. Neuroreport, 10(18), 3759–3763. Decety, J., & Chaminade, T. (2003). Neural correlates of feeling sympathy. Neuropsychologia, 41(2), 127–138. Ekman, P., & Friesen, W. V. (1978). Facial action coding system: A technique for the measurement of facial movement. Palo Alto, CA: Consulting Psychologists Press.
Farah, M. (1990). Visual agnosia: Disorders of visual recognition and what they tell us about normal vision. Cambridge, MA: MIT Press. Freyd, J. J. (1983). The mental representation of movement when static stimuli are viewed. Percept Psychophys, 33(6), 575–581. Grèzes, J., & de Gelder, B. (2009). Social perception: Understanding other people's intentions and emotions through their actions. In T. Striano and V. Reid (eds.), Social cognition: Development, neuroscience and autism. Oxford, UK: Blackwell.
Grosbras, M. H., & Paus, T. (2006). Brain networks involved in viewing angry hands or faces. Cereb Cortex, 16(8), 1087–1096. Hadjikhani, N., & de Gelder, B. (2002). Neural basis of prosopagnosia: An fMRI study. Hum Brain Mapp, 16(3), 176–182. Hargrave, R., Maddock, R. J., & Stone, V. (2002). Impaired recognition of facial expressions of emotion in Alzheimer's disease. J Neuropsychiatry Clin Neurosci, 14(1), 64–71. Hasselmo, M. E., Rolls, E. T., & Baylis, G. C. (1989). The role of expression and identity in the face-selective responses of neurons in the temporal visual cortex of the monkey. Behav Brain Res, 32(3), 203–218. Hasson, U., Avidan, G., Deouell, L. Y., Bentin, S., & Malach, R. (2003). Face-selective activation in a congenital prosopagnosic subject. Journal of Cognitive Neuroscience, 15(3), 419–431. Haxby, J. V., Hoffman, E. A., & Gobbini, M. I. (2000). The distributed human neural system for face perception. Trends in Cognitive Sciences, 4(6), 223–233. Humphreys, G. W., Donnelly, N., & Riddoch, M. J. (1993). Expression is computed separately from facial identity, and it is computed separately for moving and static faces: Neuropsychological evidence. Neuropsychologia, 31(2), 173–181. Johansson, G. (1973). Visual perception of biological motion and a model for its analysis. Percept Psychophys, 14, 201–211. Kanwisher, N., McDermott, J., & Chun, M. M. (1997). The fusiform face area: A module in human extrastriate cortex specialized for face perception. J Neurosci, 17(11), 4302–4311. Kilts, C. D., Egan, G., Gideon, D. A., Ely, T. D., & Hoffman, J. M. (2003). Dissociable neural pathways are involved in the recognition of emotion in static and dynamic facial expressions. Neuroimage, 18(1), 156–168. Kret, M. E., Grèzes, J., Pichon, S., & de Gelder, B. (submitted). Gender specific brain activations in perceiving threat from dynamic faces and bodies. LaBar, K. S., Crupain, M. J., Voyvodic, J. T., & McCarthy, G. (2003). Dynamic perception of facial affect and identity in the human brain. Cereb Cortex, 13(10), 1023–1033. Lander, K., Humphreys, G., & Bruce, V. (2004). Exploring the role of motion in prosopagnosia: Recognizing, learning and matching faces. Neurocase, 10(6), 462–470. Lavenu, I., Pasquier, F., Lebert, F., Petit, H., & van der Linden, M. (1999). Perception of emotion in frontotemporal dementia and Alzheimer disease. Alzheimer Dis Assoc Disord, 13(2), 96–101. Liberman, A. (1996). Speech: A special code. Cambridge, MA: MIT Press. Mandal, M. K., Pandey, R., & Prasad, A. B. (1998). Facial expressions of emotions and schizophrenia: A review. Schizophr Bull, 24(3), 399–412. Morris, J. S., de Gelder, B., Weiskrantz, L., & Dolan, R. J. (2001). Differential extrageniculostriate and amygdala responses to presentation of emotional faces in a cortically blind field. Brain, 124(Pt 6), 1241–1252. Oram, M. W., & Perrett, D. I. (1996). Integration of form and motion in the anterior superior temporal polysensory area (STPa) of the macaque monkey. J Neurophysiol, 76(1), 109–129. Perrett, D. I., & Mistlin, A. J. (1990). Perception of facial attributes. In W. C. Stebbins and M. A. Berkeley (eds.), Comparative perception, Vol. II: Complex signals. New York: John Wiley, pp. 187–215. Pichon, S., de Gelder, B., & Grèzes, J. (2009). Two different faces of threat. Comparing the neural systems for recognizing fear and anger in dynamic body expressions. Neuroimage, 47(4), 1873–1883.
Puce, A., Allison, T., Asgari, M., Gore, J. C., & McCarthy, G. (1996). Differential sensitivity of human visual cortex to faces, letterstrings, and textures: A functional MRI study. J Neurosci, 16, 5205–5215. Puce, A., Syngeniotis, A., Thompson, J. C., Abbott, D. F., Darby, D., & Donnan, G. (2003). The human temporal lobe integrates facial form and motion: Evidence from fMRI and ERP studies. Neuroimage, 19(3), 861–869. Rossion, B., de Gelder, B., Pourtois, G., Guérit, J. M., & Weiskrantz, L. (2000). Early extrastriate activity without primary visual cortex in humans. Neuroscience Letters, 279(1), 25–28.
Righart, R., & de Gelder, B. (2007). Impaired face and body perception in developmental prosopagnosia. Proc Natl Acad Sci USA, 104(43), 17234–17238. Rotshtein, P., Malach, R., Hadar, U., Graif, M., & Hendler, T. (2001). Feeling or features: Different sensitivity to emotion in high-order visual cortex and amygdala. Neuron, 32(4), 747–757. Sasson, N. J. (2006). The development of face processing in autism. J Autism Dev Disord, 36(3), 381–394. Sato, W., Kochiyama, T., Yoshikawa, S., Naito, E., & Matsumura, M. (2004). Enhanced neural activity in response to dynamic facial expressions of emotion: An fMRI study. Brain Res Cogn Brain Res, 20(1), 81–91. Sergent, J., & Signoret, J. L. (1992). Functional and anatomical decomposition of face processing: Evidence from prosopagnosia and PET study of normal subjects. Philos Trans R Soc Lond B Biol Sci, 335(1273), 55–61; discussion 61–62. Shepard, R. N., & Zare, S. L. (1983). Path-guided apparent motion. Science, 220(4597), 632–634. Singh, S. D., Ellis, C. R., Winton, A. S., Singh, N. N., & Leung, J. P. (1998). Recognition of facial expressions of emotion by children with attention-deficit hyperactivity disorder. Behav Modif, 22(2), 128–142. Sprengelmeyer, R., Young, A. W., Calder, A. J., Karnat, A., Lange, A., et al. (1996). Loss of disgust. Perception of faces and emotions in Huntington's disease. Brain, 119(Pt 5), 1647–1665. Sprengelmeyer, R., Young, A. W., Pundt, J., Sprengelmeyer, A., Calder, A. J., et al. (1997). Disgust implicated in obsessive-compulsive disorder. Proc R Soc Lond B Biol Sci, 264(1389), 1767–1773. Sprengelmeyer, R., Young, A. W., Mahn, K., Schroeder, U., Woitalla, D., Büttner, T., Kuhn, W., & Przuntek, H. (2003). Facial expression recognition in people with medicated and unmedicated Parkinson's disease. Neuropsychologia, 41(8), 1047–1057. Steede, L. L., Tree, J. J., & Hole, G. J. (2007). I can't recognize your face but I can recognize its movement. Cogn Neuropsychol, 24(4), 451–466. Steeves, J. K., Culham, J. C., Duchaine, B. C., Pratesi, C. C., et al. (2006). The fusiform face area is not sufficient for face recognition: Evidence from a patient with dense prosopagnosia and no occipital face area. Neuropsychologia, 44(4), 594–609. Thompson, J. C., Hardee, J. E., Panayiotou, A., Crewther, D., & Puce, A. (2007). Common and distinct brain activation to viewing dynamic sequences of face and hand movements. Neuroimage, 37(3), 966–973. Tomlinson, E. K., Jones, C. A., Johnston, R. A., Meaden, A., & Wink, B. (2006). Facial emotion recognition from moving and static point-light images in schizophrenia. Schizophr Res, 85(1–3), 96–105. Van den Stock, J., van de Riet, W. A., Righart, R., & de Gelder, B. (2008). Neural correlates of perceiving emotional faces and bodies in developmental prosopagnosia: An event-related fMRI study. PLoS ONE, 3(9), e3195. van der Gaag, C., Minderaa, R. B., & Keysers, C. (2007a). The BOLD signal in the amygdala does not differentiate between dynamic facial expressions. Soc Cogn Affect Neurosci, 2(2), 93–103. van der Gaag, C., Minderaa, R. B., & Keysers, C. (2007b). Facial expressions: What the mirror neuron system can and cannot tell us. Soc Neurosci, 2(3–4), 179–222. Wang, K., Hoosain, R., Yang, R. M., Meng, Y., & Wang, C. Q. (2003). Impairment of recognition of disgust in Chinese with Huntington's or Wilson's disease. Neuropsychologia, 41(5), 527–537. Weiskrantz, L. (2009). Blindsight: A case study spanning 35 years and new developments. Oxford, UK: Oxford University Press. Wheaton, K. J., Thompson, J. C., Syngeniotis, A., Abbott, D. F., & Puce, A. (2004). Viewing the motion of human body parts activates different regions of premotor, temporal, and parietal cortex. Neuroimage, 22(1), 277–288.
Wicker, B., Keysers, C., Plailly, J., Royet, J. P., Gallese, V., et al. (2003). Both of us disgusted in My insula: The common neural basis of seeing and feeling disgust. Neuron, 40(3), 655–664. Yin, R. K. (1969). Looking at upside-down faces. J Exp Psychol, 81, 141–145. Zald, D. H. (2003). The human amygdala and the emotional evaluation of sensory stimuli. Brain Res Brain Res Rev, 41(1), 88–123. Zimmerman, E. K., Eslinger, P. J., Simmons, Z., & Barrett, A. M. (2007). Emotional perception deficits in amyotrophic lateral sclerosis. Cogn Behav Neurol, 20(2), 79–82.
III
COMPUTATION
12
Analyzing Dynamic Faces: Key Computational Challenges
Pawan Sinha
Research on face perception has focused largely on static imagery. Featural details and their mutual configuration are believed to be the primary attributes serving tasks such as identity, age, and expression judgments. The extraction of these attributes is best accomplished with high-quality static images. In this setting, motion is counterproductive in that it complicates the extraction and analysis of details and spatial configuration. However, a counterpoint to this idea has recently begun to emerge. Results from human psychophysics have demonstrated that motion information can contribute to face perception, especially in situations where static information, on its own, is inadequate or ambiguous. This body of work has served as an impetus for a computational investigation of dynamic face analysis. The chapters included in this volume are excellent examples of the kinds of issues and approaches researchers are exploring in this area. Along these lines, my intent here is to highlight some of the key computational challenges that an analysis of dynamic faces entails. This is by no means an exhaustive list, but rather a set of issues that researchers will most likely have to tackle in the near term.

When (and What) Does Motion Contribute to Face Perception?
Although we said that motion contributes to face perception, it has to be acknowledged that we do not yet have a good characterization of the tasks that benefit significantly from the inclusion of dynamic information. As we all have experienced firsthand, several face perception tasks can be accomplished quite well even with purely static images. What exactly is motion good for? Does it make a qualitative difference in the performance of certain face tasks or is it only a "bit player"? Experimental literature on humans has so far not provided a clear answer to this question. One has to work hard to design stimuli where the contribution of motion is clearly evident. Given the equivocal picture from the experimental front, it falls upon computational investigations to help identify task areas that are likely to benefit in significant ways from the availability of dynamic information. For instance, if it can be shown that under some simple choices of features and classifiers, dynamic attributes
of expressions are much more separable than their static manifestations, then one may justifiably predict that human performance on the task of classifying expressions will be significantly facilitated by motion information. More generally, the idea would be to build a taxonomy of face perception tasks based on a computational analysis of how much task-related information is carried by static and dynamic facial signals. This endeavor would lead to an exciting interplay between computational researchers and psychophysicists, with the former actively suggesting promising experimental avenues to the latter. Besides characterizing the tasks to which motion might contribute, it is also important to investigate precisely what kind of information motion is adding to the computation. Are facial dynamics useful primarily for estimating three-dimensional structure, or for performing a kind of superresolution to effectively gain more detailed information beyond that available in any single frame, or simply for figure–ground segregation? These are, of course, not the only possibilities. The larger point is that we need to understand how motion might come into play during a face perception task. Computational simulations can play a valuable role in this investigation by providing indications of how feasible it is to extract different attributes (3D shape, high-resolution images, figure–ground relations, etc.) from real-world video data.
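As a hedged illustration of the kind of analysis suggested above, the sketch below compares how separable expression classes are when each sequence is reduced to a single static frame versus when frame-to-frame displacements are appended as dynamic features. Everything here is an assumption made for illustration only: the synthetic landmark sequences, the feature choices, and the plain nearest-centroid classifier are stand-ins, not a method proposed in this chapter.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, n_train, n_test, n_frames, n_marks = 4, 40, 40, 12, 10

def synth_sequences(n_per_class):
    """Toy landmark trajectories: classes share noisy static poses but differ
    mainly in how strongly the landmarks move over time."""
    X, y = [], []
    for c in range(n_classes):
        static = rng.normal(0, 1.0, n_marks)                   # class-specific layout
        velocity = np.sin(np.linspace(0, np.pi, n_frames))[:, None] * (c + 1) * 0.3
        for _ in range(n_per_class):
            seq = (static + velocity * rng.normal(1, 0.2, n_marks)
                   + rng.normal(0, 0.8, (n_frames, n_marks)))  # heavy static noise
            X.append(seq)
            y.append(c)
    return np.array(X), np.array(y)

def features(seqs, dynamic):
    mid = seqs[:, n_frames // 2, :]                      # one "static" frame
    if not dynamic:
        return mid
    disp = np.diff(seqs, axis=1).reshape(len(seqs), -1)  # frame-to-frame motion
    return np.hstack([mid, disp])

def nearest_centroid(train_X, train_y, test_X):
    centroids = np.array([train_X[train_y == c].mean(0) for c in range(n_classes)])
    d = ((test_X[:, None, :] - centroids[None]) ** 2).sum(-1)
    return d.argmin(1)

train_X, train_y = synth_sequences(n_train)
test_X, test_y = synth_sequences(n_test)
for dynamic in (False, True):
    pred = nearest_centroid(features(train_X, dynamic), train_y,
                            features(test_X, dynamic))
    print("dynamic" if dynamic else "static ", "accuracy:", (pred == test_y).mean())
```

With data of this kind, the dynamic variant typically separates the classes more cleanly; that is the sort of computational evidence that could nominate a task as likely to benefit from motion.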
How Can We Capture Facial Dynamics?

The front end for a static face analyzer is fairly straightforward: a camera that can take a short exposure snapshot and a program that can detect fiducial points such as the centers of the pupils and corners of the mouth. With dynamic imagery, the analogous task becomes much more challenging. Human assistance in annotation, which is a feasible option with static images, is no longer realistic with dynamic sequences of any significant length. Researchers and practitioners in the applied field of facial motion capture have turned to the use of crutches like grids of painted dots or reflective stickers. Although this simplifies the problem to an extent, it is not a full solution for at least two reasons. First, it is not always possible to attach such markers to faces (imagine trying to perform motion capture with archival footage). Second, even when feasible, this approach provides only a sparse sampling of motion information across a face. Much of the nuance of facial movements is lost. This is evident in the unnatural dynamics of animated faces in the current crop of movies. What we need are computational techniques for obtaining dense motion information from unmarked faces. Walder and colleagues in this volume describe an important step forward in this direction. Their algorithm takes as its input an unorganized collection of 4D points (x, y, z, and t) and a mesh template. Its output is an implicit surface model that incorporates dense motion information and closely tracks the deformations of the original face. The results they present are striking in their fidelity. This work sets
the stage for addressing the next set of challenges in tracking dynamic faces. An obvious one is the need to be able to work with 2D rather than 3D spatial information. For human observers, a 2D video sequence typically suffices to convey rich information about facial dynamics. A computational technique ought to be able to do the same. This is important not merely for mimicking human ability but also from the practical standpoint of being able to work with conventional video capture systems, the vast majority of which are 2D. Perhaps a combination of past 3D estimation techniques developed in the context of static face analysis (Blanz & Vetter, 1999) and the kind of approach described by Walder et al. can accomplish the goal of motion capture from 2D sequences. Blanz and Vetter's technique allows the generation of 3D models from single 2D images based on previously seen 2D–3D mappings. Once such a 3D structure is estimated, it can be used to initialize the stimulus-to-template alignment in Walder's approach. It remains to be seen whether an initial 3D estimation step will suffice for tracking motion over extended sequences, or if the 3D estimation will need to be repeated at frequent intervals for intermediate frames. A second important avenue along which to push the motion-tracking effort is working with poor-quality video. Besides yielding obvious pragmatic benefits, an investigation of how to handle videos with low spatial and temporal resolution is likely to be useful in modeling human use of dynamic information. Although static facial information appears to be sufficient for many tasks when the images are of high resolution, the significance of dynamic information becomes apparent when spatial information is degraded. Johansson's classic displays, and their more recent derivatives, are a testament to this point. Computationally, however, the derivation of dense motion fields from low-resolution videos presents significant challenges. It is hard to establish spatially precise correspondences in such sequences, and hence the accuracy and density of the recovered motion fields are limited. Human observers, however, are quite adept at this task. What kind of computational strategy can prove robust to degradation of spatiotemporal information? One possibility is the use of high-resolution internal models that can be globally fit to the degraded inputs to establish more precise local correspondences. This general idea of using internal models that are richer than the inputs is similar in spirit to what we described earlier for working with 2D rather than 3D data. Perhaps this approach will prove to be a broadly applicable strategy for handling many different kinds of information loss in observed facial sequences.
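As a minimal sketch of markerless dense motion estimation from an ordinary 2D clip, the code below computes a dense Farnebäck optical-flow field between consecutive frames and summarizes per-frame motion energy. It is not the Walder et al. method; it assumes OpenCV is available and that face_clip.mp4 is a hypothetical input video.

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("face_clip.mp4")          # hypothetical input clip
ok, prev = cap.read()
if not ok:
    raise SystemExit("could not read face_clip.mp4")
prev = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

motion_energy = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Dense flow: one (dx, dy) vector per pixel, with no markers or annotation.
    # Positional arguments: pyr_scale, levels, winsize, iterations, poly_n,
    # poly_sigma, flags (illustrative values).
    flow = cv2.calcOpticalFlowFarneback(prev, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    speed = np.linalg.norm(flow, axis=2)         # per-pixel motion magnitude
    motion_energy.append(speed.mean())
    prev = gray

cap.release()
print("mean motion energy per frame:", np.round(motion_energy, 3))
```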
How Can Dynamic Facial Information Be Represented?

Having tracked a dynamic face using the kinds of approaches outlined here, the next computational challenge is to represent this information efficiently. The different appearances of a face that are revealed over the course of tracking taken together
constitute its temporally extended appearance model, or TEAM for short, as illustrated in figure 12.1. The resulting spatiotemporal volume of facial appearances constitutes a rich dataset for constructing a robust face model, but how exactly should we encode it? This apparently simple question has yet to be satisfactorily answered either experimentally or computationally. Representation of dynamic faces is the fundamental challenge that the chapters by Boker and Cohn and by Serre and Giese in this volume expound on. Here we outline the key conceptual issues related to this topic.

Figure 12.1 The result of facial tracking is a highly redundant "stack" of images that we refer to as a TEAM, for temporally extended appearance model. How TEAMs should be encoded for various face perception tasks is an important open problem.

The simplest thing we could do with a spatiotemporal volume is store it in its entirety for future reference. Individual images or new spatiotemporal inputs could be compared with the stored volume using any set of features we wish. The storage requirements of this strategy are obviously prohibitive, though. If we are to remember every spatiotemporal face volume we encounter, we will be overwhelmed with data very quickly. Even discarding the temporal contingencies between frames and maintaining only newly encountered static images does not do much to mitigate this expensive encoding strategy. To learn anything useful from dynamic data, the visual system must represent spatiotemporal volumes efficiently and with enough information for recognition to proceed. There is a great deal of redundancy in the data depicted in figure 12.1, and finding encodings that reduce it can help guide the search for an efficient representation. However, there are two issues we must be mindful of as we consider possible methods of reducing redundancy in this setting. First, does a particular representation support robust recognition? Second, is the proposed representation consistent with human psychophysical performance? To consider the first issue, there are existing methods for recovering a "sparse" encoding of natural image data (van Hateren & Ruderman, 1998; Olshausen, 1996, 2003). Implementing these methods on local image patches or spatiotemporal subvolumes tends to produce basis functions resembling edge-finding filters that translate over time. These provide a useful vocabulary for describing an image or an image sequence using a small set of active units, but these features are often not ideal for recognition tasks (Balas & Sinha, 2006).
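The following sketch illustrates the sparse-coding idea cited above on toy data: it learns a sparse dictionary over small space-time patches containing drifting edges. The patch size, the dictionary size, and the synthetic drifting-edge movies are assumptions chosen for illustration; with natural image sequences, the learned atoms tend to resemble oriented, translating edge detectors.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.default_rng(0)

def drifting_edge_clip(size=12, frames=5):
    """Tiny synthetic movie: an oriented edge translating over time, plus noise."""
    theta = rng.uniform(0, np.pi)
    speed = rng.uniform(-1.5, 1.5)
    ys, xs = np.mgrid[0:size, 0:size]
    clip = []
    for t in range(frames):
        offset = rng.uniform(-3, 3) + speed * t
        edge = (np.cos(theta) * xs + np.sin(theta) * ys - size / 2 > offset)
        clip.append(edge.astype(float))
    return np.stack(clip) + rng.normal(0, 0.1, (frames, size, size))

# Matrix of flattened space-time patches (rows = samples), mean-removed.
patches = np.array([drifting_edge_clip().ravel() for _ in range(2000)])
patches -= patches.mean(axis=1, keepdims=True)

# Learn an overcomplete dictionary of spatiotemporal basis functions and
# compute the sparse codes of the patches under that dictionary.
dico = MiniBatchDictionaryLearning(n_components=60, alpha=1.0, random_state=0)
codes = dico.fit(patches).transform(patches)

print("dictionary atoms:", dico.components_.shape)       # (60, frames*size*size)
print("mean fraction of active units per patch:", (np.abs(codes) > 1e-6).mean())
```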
Analyzing Dynamic Faces
181
In terms of our second issue, building representations that are consistent with human performance, there are many computations we could carry out on our volume which are "too strong" in some sense to be relatable to human performance. For example, we could use our image sequence to reconstruct a 3D volumetric model of the observed face using structure-from-motion algorithms (Ullman, 1979). The smoothness of appearance change across the images in a temporally extended appearance model (TEAM) could reduce the usual difficulty of solving the correspondence problem, and we can easily obtain far more views of an object than are strictly necessary to solve for 3D form. However, faces do not respect a cornerstone of structure-from-motion computations, namely, rigidity. The nonrigid deformations that a face typically goes through make it difficult to estimate its 3D structure from a TEAM stack. Furthermore, it seems unlikely that human observers actually recognize faces based on view-invariant volumetric models of shape (Ullman, 1996). View-based models currently seem more commensurate with the psychophysical data (Tarr & Pinker, 1989; Bülthoff & Edelman, 1992). However, to revisit a point raised earlier, there are also good psychophysical reasons against storing all the views of an object within some spatiotemporal volume. Specifically, observers trained to recognize novel dynamic objects do not behave as though they have stored all the views they were trained with. For example, they find novel dynamic objects harder to recognize if the stimulus presented at test is the time-reversed version of the training sequence (Stone, 1998, 1999; Vuong & Tarr, 2004). An ideal observer who maintains a representation of each image should not be impaired by this manipulation, suggesting that human observers do not simply store copies of all views of an object encountered during training. Instead, the order of appearances is maintained and becomes a critical part of the representation. Learning purely local features in space and time is useful within particular domains (Ullman & Bart, 2004) but potentially difficult to "scale up" to natural settings. Also, maintaining fully volumetric face models or large libraries of static face views is both inefficient and inconsistent with human data. The challenge we are faced with then is to develop a compact and expressive representation of a dynamic face that is consistent with human performance. A face model based on temporal association might offer an attractive alternative to existing proposals. Instead of storing an intractably large number of ordered static views, it should be possible to store only a few prototypical images and use dynamic input to learn a valid generalization function around each prototype. Redundancy within a spatiotemporal volume is thus reduced at the level of global appearance (however we choose to represent it) and the ultimate encoding of the face is view-based, with a learned "tuning width" in the appearance space around each prototype view.
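A hedged sketch of the temporal-association proposal above, in which everything (the synthetic appearance trajectory, the choice of two prototype views, and the acceptance criterion) is an illustrative assumption: a few prototype views are stored, a per-prototype tuning width is estimated from the appearance variation observed in a short temporal window around each prototype, and a new view is accepted if it falls within that learned tolerance.

```python
import numpy as np

rng = np.random.default_rng(6)

# Synthetic TEAM: appearance vectors drifting smoothly around two prototype views.
protos_true = rng.normal(0, 1, (2, 40))
team = np.vstack([p + np.cumsum(rng.normal(0, 0.05, (50, 40)), axis=0)
                  for p in protos_true])          # (100, 40), temporally ordered

# Pick two prototype views (here simply the middle frame of each temporal half).
prototypes = team[[25, 75]]

# Tuning width = typical per-dimension deviation of temporally nearby frames
# from their prototype, i.e., how much the appearance changes in a short interval.
widths = np.array([
    (np.linalg.norm(team[i - 10:i + 10] - team[i], axis=1) / np.sqrt(40)).mean()
    for i in (25, 75)
])

def accepts(view, prototypes, widths, criterion=3.0):
    """A view is 'recognized' if it lies within a learned tolerance of a prototype."""
    d = np.linalg.norm(view - prototypes, axis=1) / np.sqrt(view.size)
    return bool((d < criterion * widths).any())

probe_near = team[30] + rng.normal(0, 0.05, 40)   # small, plausible deformation
probe_far = rng.normal(0, 1, 40)                  # unrelated appearance
print("near-prototype probe accepted:", accepts(probe_near, prototypes, widths))
print("unrelated probe accepted:     ", accepts(probe_far, prototypes, widths))
```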
There are multiple aspects of this model that have yet to be thoroughly explored psychophysically. For example, how are prototypical views of a face selected within a volume? There is very little work on how such views (or keyframes; Wallraven & Bülthoff, 2001) might be determined computationally or the extent to which they are psychologically real. Likewise, we do not yet have a detailed picture of how generalization around an image evolves following dynamic exposure. We have recently suggested that distributed representations of an object's appearance follow from dynamic experience with a novel object, but as yet we have not investigated the long-term consequences of dynamic training. These two issues constitute key parameters in what is essentially a statistical model of the appearance of dynamic objects. Finding keyframes is analogous to identifying multiple modes in the data, whereas understanding patterns of generalization around those keyframes is analogous to identifying the variance of data around some mode. In this framework, motion is not a new feature for recognition, but rather a principled way to establish a sort of "mixture model" for facial appearance. The advantage of this strategy is that it makes explicit the fact that while observers probably have access to global appearance data, temporal data are only available locally. Thus we do not try to build a face representation that covers the whole viewing sphere of possible appearances (Murase & Nayar, 1995). Instead, we limit ourselves to learning what changes a face is likely to undergo in a short time interval. This basic proposal leads to many interesting questions for psychophysical research and makes easy contact with several physiological studies of object representation in high-level cortical areas. To summarize, the question of representing dynamic faces has not yet been adequately explored experimentally or computationally. In the absence of human data to guide computational strategies, current proposals are necessarily speculative. One idea that seems perceptually plausible and computationally attractive is to encode a TEAM via keyframes and some specification of the transformation linking these keyframes. Keyframes can be computed by a cluster analysis. They would correspond to the frames that minimize the sum of distances between themselves and all other frames of the TEAM, under the constraint of minimizing the number of keyframes. Of course, the error metric will keep decreasing monotonically as more and more keyframes are selected. However, as is the case with Principal Components Analysis, the decrease in error obtained by adding a new keyframe diminishes as the number of keyframes increases. The knee of the corresponding scree plot would indicate the number of keyframes to be included. As for encoding the transformation linking these keyframes, a manifold in a low-dimensional space, for instance, one corresponding to the principal components of object appearances seen, might be adequate. The computational choices here await experimental validation.
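A minimal sketch of the keyframe idea just described, under stated assumptions: frames of a synthetic TEAM are added greedily so that each new keyframe maximally reduces the summed distance of all frames to their nearest keyframe, and a crude knee heuristic on the resulting error curve suggests how many keyframes to keep. The greedy medoid selection and the 10% threshold are illustrative choices, not a validated procedure.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic TEAM: 90 "frames", each a flattened appearance vector drifting
# between three distinct expressions (so roughly three keyframes should suffice).
anchors = rng.normal(0, 1, (3, 50))
frames = []
for i in range(3):
    for t in np.linspace(0, 1, 30, endpoint=False):
        frames.append((1 - t) * anchors[i] + t * anchors[(i + 1) % 3]
                      + rng.normal(0, 0.05, 50))
frames = np.array(frames)

dist = np.linalg.norm(frames[:, None] - frames[None], axis=2)   # pairwise distances

keyframes, errors = [], []
for _ in range(10):                        # grow the keyframe set greedily
    best_frame, best_err = None, np.inf
    for cand in range(len(frames)):
        if cand in keyframes:
            continue
        err = dist[:, keyframes + [cand]].min(axis=1).sum()
        if err < best_err:
            best_frame, best_err = cand, err
    keyframes.append(best_frame)
    errors.append(best_err)

# Crude knee detection: stop where the marginal reduction in error collapses.
gains = -np.diff(errors)
knee = int(np.argmax(gains[1:] < 0.1 * gains[0])) + 1
print("error curve:", np.round(errors, 1))
print("suggested number of keyframes:", knee + 1, "-> frames", keyframes[:knee + 1])
```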
What Aspects of Motion Information Are Important for Specific Face Perception Tasks?
If we can convince ourselves that motion information does indeed play a significant role in face perception, a more specific question becomes pertinent: Precisely which aspects of the overall motion signal contribute to a given face perception task? As a rough analogy, consider the case of analyzing static faces. We know that photometric information plays a role in several facial judgments, such as those pertaining to identity and aesthetics. However, this is too broad a statement to be interesting or useful. It needs to be made more precise: Which aspects of the photometric signal are really crucial for, say, identification? Computational and experimental results point to the ordinal brightness relationships around the eyes as being of key significance (Viola & Jones, 2001; Gilad, Meng, & Sinha, 2009). A similar kind of investigation is needed in the dynamic setting. The computational challenge here is to consider many possible subsets of the full dynamic signal to determine which ones are the most useful for the performance of a given task. For the case of identification, for instance, is the movement of the mouth more discriminative across individuals than the movement of the eyes? Bartlett et al. in this volume present an excellent instance of this kind of effort applied to the perception of expressions. Their work demonstrates how such an approach can reveal hitherto unknown facial attributes as indicators of subtle differences in mental states, such as engagement and drowsiness. It would indeed be interesting to examine whether the kinds of feature saliency hierarchies that have been determined for static faces (Fraser, Craig, & Parker, 1990) also carry over to the dynamic setting, or whether the two sets are entirely distinct. Besides carving up information spatially, it will also be important to consider subsets of the dynamic signal in the spatiotemporal frequency domain. Just as analysis of a static face appears to depend most strongly on a constrained band of spatial frequencies (Costen, Parker, & Craw, 1996), so might analysis of a dynamic face be driven largely by a small subset of the full spatiotemporal spectrum. It may be the case, for instance, that seemingly small flutters of the eyelids might be more informative for some face perception tasks than large-scale eye blinks. A related but conceptually distinct issue is that of the duration of motion information necessary for performing different face tasks to some specified level of accuracy. Here, computational simulations can prove to be useful in studying how a system's performance declines as the length of the motion sequence it is shown is made shorter and shorter. Not only would this provide benchmarks for psychophysical tests of human performance (and for ideal-observer analyses), it would also have the important side effect of suggesting hypotheses for the first question we mentioned earlier: Which aspects of motion are important for a given task? Some motion signals might be
ruled out as significant contributors just based on the fact that the viewing time might be too short for that kind of motion to occur.
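To illustrate the duration analysis suggested above, the sketch below measures how a trivial nearest-centroid classifier degrades as the observed clip is truncated to fewer and fewer frames. The generative model (classes differing only in the temporal frequency of their motion) and the classifier are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n_classes, n_frames, n_marks, n_per_class = 3, 24, 8, 60

def make_set():
    X, y = [], []
    for c in range(n_classes):
        # Classes differ only in the temporal frequency of their facial motion.
        phase = np.linspace(0, (c + 1) * np.pi, n_frames)[:, None]
        for _ in range(n_per_class):
            X.append(np.sin(phase) * rng.normal(1, 0.3, n_marks)
                     + rng.normal(0, 0.5, (n_frames, n_marks)))
            y.append(c)
    return np.array(X), np.array(y)

def classify(train_X, train_y, test_X):
    cent = np.array([train_X[train_y == c].mean(0) for c in range(n_classes)])
    d = ((test_X[:, None] - cent[None]) ** 2).sum(axis=(2, 3))
    return d.argmin(1)

train_X, train_y = make_set()
test_X, test_y = make_set()
for k in (2, 4, 8, 16, 24):                      # number of frames shown
    pred = classify(train_X[:, :k], train_y, test_X[:, :k])
    print(f"{k:2d} frames -> accuracy {(pred == test_y).mean():.2f}")
```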
Conclusion

The analysis of dynamic faces is an exciting frontier in face perception research. The terrain is wide open and several of the most basic questions remain to be answered, both from an experimental perspective and a computational one. We have attempted here to list a few of these questions. We hope that the coming months and years will see a shift in the field's focus from purely static imagery to more realistic dynamic sequences. The chapters in this volume represent exciting initial steps in this direction.

References

Balas, B., & Sinha, P. (2006). Receptive field structures for recognition. Neural Computation, 18, 497–520. Blanz, V., & Vetter, T. (1999). A morphable model for the synthesis of 3D faces. In Proceedings of SIGGRAPH (ACM Special Interest Group on Computer Graphics and Interactive Techniques) (pp. 187–194). New York: ACM Press/Addison-Wesley. Bülthoff, H. H., & Edelman, S. (1992). Psychophysical support for a 2-dimensional view interpolation theory of object recognition. Proceedings of the National Academy of Sciences of the United States of America, 89(1), 60–64. Costen, N. P., Parker, D. M., & Craw, I. (1996). Effects of high-pass and low-pass spatial filtering on face identification. Perception & Psychophysics, 58, 602–612. Fraser, I. H., Craig, G. L., & Parker, D. M. (1990). Reaction time measures of feature saliency in schematic faces. Perception, 19, 661–673. Gilad, S., Meng, M., & Sinha, P. (2009). Role of ordinal contrast relationships in face encoding. Proceedings of the National Academy of Sciences of the United States of America, 106(13), 5353–5358. Murase, H., & Nayar, S. K. (1995). Visual learning and recognition of 3-D objects from appearance. International Journal of Computer Vision, 14, 5–24. Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609. Olshausen, B. A. (2003). Principles of image representation in visual cortex. In L. M. Chalupa and J. S. Werner (eds.), The visual neurosciences. Cambridge, MA: MIT Press, pp. 1603–1615. Stone, J. V. (1998). Object recognition using spatiotemporal signatures. Vision Research, 38, 947–951. Stone, J. V. (1999). Object recognition: View-specificity and motion-specificity. Vision Research, 39, 4032–4044. Tarr, M. J., & Pinker, S. (1989). Mental rotation and orientation-dependence in shape recognition. Cognitive Psychology, 21(2), 233–282. Ullman, S. (1979). The interpretation of structure from motion. Proceedings of the Royal Society of London, Series B, 203, 405–426. Ullman, S. (1996). High-level vision. Cambridge, MA: MIT Press. Ullman, S., & Bart, E. (2004). Recognition invariance obtained by extended and invariant features. Neural Networks, 17, 833–848.
van Hateren, J. H., & Ruderman, D. L. (1998). Independent component analysis of natural image sequences yields spatio-temporal filters similar to simple cells in primary visual cortex. Proceedings of the Royal Society of London, Series B, 265, 2315–2320. Viola, P., & Jones, M. (2001). Rapid object detection using a boosted cascade of simple features. Paper presented at the IEEE Conference on Computer Vision and Pattern Recognition, Kauai, Hawaii, pp. 511–518. Vuong, Q. C., & Tarr, M. J. (2004). Rotation direction affects object recognition. Vision Research, 44, 1717–1730. Wallraven, C., & Bülthoff, H. H. (2001). Automatic acquisition of exemplar-based representations for recognition from image sequences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Workshop on Models vs. Exemplars. IEEE CS Press.
13
Elements for a Neural Theory of the Processing of Dynamic Faces
Thomas Serre and Martin A. Giese
Face recognition has been a central topic in computer vision for at least two decades and progress in recent years has been significant. Automated face recognition systems are now widespread in applications ranging from surveillance to personal computers. In contrast, only a handful of neurobiologically plausible computational models have been proposed to try to explain the processing of faces in the primate cortex (e.g., Giese & Leopold, 2005; Jiang et al., 2006), and no such model has been applied specifically in the context of dynamic faces. There is a need for an integrated computational theory of dynamic face processing that could consolidate and summarize evidence obtained with different experimental methods, from single-cell physiology to fMRI, MEG, and ultimately behavior and psychophysics. At the same time, physiologically plausible models capable of processing real video sequences constitute a plausibility proof for the computational feasibility of different hypothetical neural mechanisms. In this review chapter we will first discuss computer vision models for the processing of dynamic faces; these do not necessarily try to reproduce biological data but may suggest relevant computational principles. We then provide an overview of computational neuroscience models for the processing of static faces and dynamic body stimuli. We further highlight specific elements from our own work that are likely to be relevant for the processing of dynamic face stimuli. The last section discusses open problems and critical experiments from the viewpoint of neural computational approaches to the processing of dynamic faces.

Computational Models for the Processing of Faces and Bodies
The following section reviews work in computer vision as well as neural and psychological models for the recognition of static faces. In addition, we discuss neurobiological models for the processing of dynamic bodies. It seems likely that some of the computational principles proposed in the context of these models might also be relevant for the processing of dynamic faces.
Computer Vision Models for the Processing of Dynamic Faces
While a full review of the large body of literature on computer vision systems for the recognition and detection of static faces would far exceed the scope of this chapter (see Jain & Li, 2005; Kriegman, Yang, & Ahuja, 2002; Zhao, Chellappa, Rosenfeld, & Phillips, 2003 for relevant books and reviews), in the following pages we discuss a number of approaches for the recognition of facial expressions and dynamic facial stimuli. While these systems do not try to mimic the processing of information in the visual cortex, they do provide real-world evidence that critical information can be extracted from dynamic faces that is unavailable through the analysis of static frames. Several computer vision systems have been developed for the recognition of facial expressions based on the extraction of temporal cues from video sequences. First introduced by Suwa and colleagues (Suwa, Sugie, & Fujimora, 1978) in the 1970s and later popularized by Mase (Mase, 1991) in the 1990s, systems for the recognition of facial expressions have progressed tremendously in the past decades (see Fasel & Luettin, 2003; Pantic & Rothkrant, 2000; Tian, Kanade, & Cohn, 2005 for reviews). Earlier approaches have typically relied on the computation of local optical flow from facial features (Black & Yacoob, 1995; Essa, Darrell, & Pentland, 1994; Mase, 1991; Otsuka & Ohya, 1996; Rosenblum, Yacoob, & Davis, 1994) and/or hidden Markov models (HMMs) to capture the underlying dynamics (Cohen, Sebe, Garg, Chen, & Huang, 2003; Otsuka & Ohya, 1996). More recent work (Pantic & Patras, 2006) has relied on the dynamics of individual facial points estimated using modern tracking algorithms. Another line of work (Yang, Liu, & Metaxas, 2009; Zhao et al., 2003) involves the extraction of image features shown to work well for the analysis of static faces (Ahonen, Hadid, & Pietikaine, 2006; Viola & Jones, 2001) across multiple frames. Several behavioral studies (see for example chapters 2 and 4) have suggested that people might be able to extract idiosyncratic spatiotemporal signatures of a person’s identity based on body and facial motion. An early approach that exploits the characteristic temporal signature of faces based on partially recurrent neural networks trained over sequences of facial images was first introduced by Gong, Psarrou, Katsoulis, and Palavouzis (1994). Initial experiments conducted by Luettin and colleagues suggested that spatiotemporal models (HMMs) trained on sequences of lip motion during speech could be useful for speaker recognition (Luettin, Thacker, & Beet, 1996). However, beyond this early experiment, the use of spatiotemporal cues for the identification of people in computer vision has remained relatively unexplored (Gong, McKenna, & Psarrou, 2000). Overall, the success of computer vision systems for identifying people (i.e., face recognition) in video sequences, as opposed to static faces, has been more moderate (e.g., Edwards, Taylor, & Cootes, 1998; Lee, Ho, Yang, & Kriegman, 2003; Tistarelli, Bicego, & Grosso, 2009; Yamaguchi, Fukui, & Maeda, 1998). In fact, the idea of
exploiting video sequences for the recognition of faces was almost completely abandoned after it was concluded, from the face recognition vendor test (Phillips et al., 2002), which offers an independent assessment of the performance of some of the leading academic and commercial face recognition systems, that the improvement from using video sequences over still images for face identification applications was minimal (Phillips et al., 2003). Clearly, more work needs to be done. In general, one of the ways in which approaches for the recognition of faces could benefit from the use of video sequences is via tracking of the face. Tracking the pose of a face can be used to restrict the search for matches between an image template and a face model across multiple views around expected values, thus reducing the chances of false maxima. In most approaches, however, tracking and recognition remain separate processes and the recognition phase usually relies on still images. Relatively few systems have been described that can exploit the temporal continuity and constancy of video sequences. For instance, Li and colleagues (Li, Gong, & Liddell, 2001) have described a face recognition system in which the parameters of a 3D point distribution model are estimated using a Kalman filter, effectively tracking parameters and enforcing smoothness over time. More recently, two approaches have been described that systematically investigate the role of temporal information in video-based face recognition applications. Zhou et al. investigated the recognition of human faces in video sequences using a gallery of both still and video images within a probabilistic framework (Zhou, Krueger, & Chellappa, 2003). In their approach, a time series state-space model is used to extract a spatiotemporal signature and simultaneously characterize the kinematics and identity of a probe video. Recognition then proceeds via marginalization over the motion vector to yield a robust estimate of the posterior distribution of the identity variable using importance-sampling algorithms. Finally, recent work by Zhang and Martinez (2006) convincingly shows that the use of video sequences over still images may help alleviate some of the main problems associated with face recognition (i.e., occlusions, expression and pose changes, as well as errors of localization).
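The snippet below is a greatly simplified, hedged illustration of the smoothness-over-time idea mentioned above: a constant-velocity Kalman filter applied to one noisily detected landmark coordinate. It is not the Li et al. system; the synthetic trajectory and all noise parameters are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Noisy per-frame detections of a single landmark coordinate (e.g., mouth corner x).
true = 50 + 10 * np.sin(np.linspace(0, 2 * np.pi, 80))
detections = true + rng.normal(0, 2.0, true.shape)

# Constant-velocity state [position, velocity]; only position is observed.
F = np.array([[1.0, 1.0], [0.0, 1.0]])     # state transition
H = np.array([[1.0, 0.0]])                 # measurement model
Q = np.diag([1e-3, 1e-2])                  # process noise (assumed)
R = np.array([[4.0]])                      # measurement noise (assumed)

x = np.array([detections[0], 0.0])
P = np.eye(2)
smoothed = []
for z in detections:
    # Predict the next state from the motion model.
    x = F @ x
    P = F @ P @ F.T + Q
    # Update with the new detection.
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (np.array([z]) - H @ x)
    P = (np.eye(2) - K @ H) @ P
    smoothed.append(x[0])

print("raw detection jitter :", np.abs(detections - true).mean().round(2))
print("filtered track error :", np.abs(np.array(smoothed) - true).mean().round(2))
```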
Biological Models for the Perception of Static Faces

Initial biologically inspired computational models for the processing of human faces have focused on the direct implementation of psychological theories. Classically, these theories have assumed abstract cognitive representations such as "face spaces," interpreting faces as points in abstract representation spaces. It has been typically assumed that such points are randomly distributed, for example, with a normal distribution (Valentine, 1991). A central discussion in this context has been whether faces are represented as points in a fixed representation space, which is independent of the class of represented faces (example-based coding), or if faces are encoded in relationship to a norm or average face, which represents the average features of a
large representative set of faces (norm-based or norm-referenced encoding) (Leopold, O'Toole, Vetter, & Blanz, 2001; Rhodes, Brennan, & Carey, 1987; Rhodes & Jeffery, 2006). Several recent experimental studies have tried to differentiate between these two types of encoding (Loffler, Yourganov, Wilkinson, & Wilson, 2005; Rhodes & Jeffery, 2006; Tsao & Freiwald, 2006). Contrary to norm-based representations, which are characterized by a symmetric organization of the face space around a norm face, example-based representations do not assume such a special role for the average face. Refinements of such face space models have been proposed that take into account varying example densities in the underlying pattern spaces. For example, it has been proposed that such density variations could be modeled by assuming a Voronoi tessellation of the high-dimensional space that forms the basis of perceptual judgments (Lewis & Johnston, 1999). Other models have exploited connectionist architectures in order to account for the recognition and naming of faces (Burton & Bruce, 1993). More recent models work on real pixel images, thus deriving the feature statistics directly from real-world data. A popular approach has been the application of Principal Component Analysis, inspired by the eigenface approach in computer vision (Sirovich & Kirby, 1997; Turk & Pentland, 1991). It has been shown that neural network classifiers based on such eigenfeatures are superior to the direct classification of pixel images (Abdi, Valentin, Edelman, & O'Toole, 1995). At the same time, psychological studies have tried to identify which eigencomponents are relevant for the representation of individual face components (such as gender or race) (e.g., O'Toole, Deffenbacher, Valentin, & Abdi, 1994). More advanced models have applied shape normalization prior to the computation of the eigencomponents (Hancock, Burton, & Bruce, 1996). Such approaches closely resemble methods in computer vision that "vectorize" classes of pictures by establishing correspondences between them automatically and separating shape from texture (e.g., Blanz & Vetter, 1999; Lanitis, Taylor, & Cootes, 1997). Eigenfaces have been combined with neural network architectures, including multilayer perceptrons and radial basis function (RBF) networks (e.g., Valentin, Abdi, & Edelman, 1997a; Valentin, Abdi, Edelman, & O'Toole, 1997b). Another class of models has been based on the computation of features from Gabor filter responses, including a first filtering stage that is similar to the early processing in the visual cortex (Burton, Bruce, & Hancock, 1999; Dailey & Cottrell, 1999; Dailey, Cottrell, Padgett, & Adolphs, 2002). Only recently have models been developed that take into account detailed principles derived from the visual cortex. One example is the work by Jiang and colleagues (Jiang et al., 2006) who have applied a physiologically inspired hierarchical model (Riesenhuber & Poggio, 1999) for the position and scale-invariant recognition of shapes to faces, in order to test whether face processing requires the introduction of additional principles compared with the processing of general shapes. This model
also reproduces a variety of electrophysiological results on the tuning of neurons in areas V4 and IT (e.g., Riesenhuber & Poggio, 1999) and results in quantitative predictions that are in good agreement with behavioral and fMRI data (Jiang et al., 2006; Riesenhuber, Jarudi, Gilad, & Sinha, 2004). Our own work discussed in this chapter is based on closely related model architectures.
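As a concrete, hedged rendering of the eigenface-style encoding discussed earlier in this subsection, the sketch below derives eigencomponents from a set of flattened images and projects a probe image onto them. The synthetic gallery, the 32 x 32 image size, and the choice of 20 components are arbitrary assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

# Stand-in gallery: 100 flattened 32x32 "face" images built from a few latent
# factors, so that a low-dimensional eigenspace captures most of the variance.
basis = rng.normal(0, 1, (5, 32 * 32))
gallery = rng.normal(0, 1, (100, 5)) @ basis + rng.normal(0, 0.2, (100, 32 * 32))

mean_face = gallery.mean(axis=0)
centered = gallery - mean_face

# Eigenfaces = principal components of the centered gallery.
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
eigenfaces = Vt[:20]                                  # top 20 components
explained = (s[:20] ** 2).sum() / (s ** 2).sum()

# Encode a probe image as a short vector of eigenface coefficients.
probe = rng.normal(0, 1, 5) @ basis + rng.normal(0, 0.2, 32 * 32)
coeffs = eigenfaces @ (probe - mean_face)
reconstruction = mean_face + coeffs @ eigenfaces

print("variance explained by 20 eigenfaces:", round(float(explained), 3))
print("reconstruction error:", round(float(np.linalg.norm(probe - reconstruction)), 2))
```

The coefficient vector, rather than the raw pixels, would then serve as the input to a classifier, which is the sense in which eigenfeature-based classifiers outperform direct pixel classification.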
Biological Models for the Perception of Body Movement

Here we briefly review theoretical models for the recognition of body movements, under the assumption that they might contribute important mechanisms that also apply to the processing of dynamic faces. This idea seems consistent with the fact that face- and body-selective regions are often located in close proximity in the visual cortex. In monkeys, neurons selective for faces have been found in the superior temporal sulcus and the temporal cortex (e.g., Desimone, Albright, Gross, & Bruce, 1984; Pinsk et al., 2009; Pinsk, DeSimone, Moore, Gross, & Kastner, 2005; Tsao, Freiwald, Knutsen, Mandeville, & Tootell, 2003). (See chapters 8, 9, and 11 for further details.) The same regions contain neurons that are selective for body shapes and movements (Barraclough, Xiao, Oram, & Perrett, 2006; Bruce, Desimone, & Gross, 1986; Oram & Perrett, 1996; Puce & Perrett, 2003; Vangeneugden, Pollick, & Vogels, 2008). Similarly, areas selective for the recognition of faces, bodies, and their movements have been localized in the STS and the temporal cortex of humans, partially in close spatial proximity (Grossman & Blake, 2002; Kanwisher, McDermott, & Chun, 1997; Peelen & Downing, 2007; Pinsk et al., 2009, 2005). To our knowledge, no physiologically plausible models have been developed that account for the properties of neurons that are selective for dynamic face stimuli. In contrast, several exist that try to account for neural mechanisms involved in the processing of dynamic body stimuli (Escobar, Masson, Vieville, & Kornprobst, 2009; Giese & Poggio, 2003; Jhuang, Serre, Wolf, & Poggio, 2007; Lange & Lappe, 2006; Schindler, Van Gool, & de Gelder, 2008). These models are based on hierarchical neural architectures, including detectors that extract form or motion features from image sequences. Position and scale invariance has been accounted for by pooling neural responses along the hierarchy. It has been shown that such models reproduce several properties of neurons that are selective for body movements as well as behavioral and brain imaging data (Giese & Poggio, 2003). Recent work demonstrates the high computational performance of biologically inspired architectures for the recognition of body movement, which lies in the range of the best nonbiological algorithms in computer vision (Escobar et al., 2009; Jhuang et al., 2007). Architectures of this type will be proposed in the following discussion as a basic framework for the development of a neural model for the processing of dynamic faces. A central question in the context of such models has been how form and motion processing contribute to the recognition of body motion. Consistent with experimental
evidence (Casile & Giese, 2005; Thurman & Grossman, 2008; Vangeneugden et al., 2008), some models have proposed an integration of form and motion information, potentially in the STS (Giese & Poggio, 2003; Peuskens, Vanrie, Verfaillie, & Orban, 2005). Conversely, some studies have tried to establish that at least the perception of point-light biological motion is exclusively based on form processing (Lange & Lappe, 2006). Since facial and body motion generate quite different optic flow patterns (e.g., with respect to their smoothness and the occurrence of occlusions), it is not obvious whether the relative influences of form and motion are similar for the processing of dynamic faces and bodies. The study of the relative influences of form and motion in the processing of face stimuli is thus an interesting problem, which relates also to the question of how different aspects of faces, such as static versus changeable aspects (identity versus facial expression), are processed by different cortical subsystems (Bruce & Young, 1986; Haxby, Hoffman, & Gobbini, 2000). Another neural system for the processing of body movement has been found in the parietal and premotor cortex of macaque monkeys (e.g., Fogassi et al., 2005; Gallese, Fadiga, Fogassi, & Rizzolatti, 1996; Rizzolatti, Fogassi, & Gallese, 2001). A particularity of these areas is that they contain mirror neurons that respond not only during visual stimulation but also during the execution of motor actions. An equivalent of the mirror neuron system has also been described in humans (Binkofski & Buccino, 2006; Decety & Grèzes, 1999). This observation has stimulated an extensive discussion in cognitive and computational neuroscience as well as robotics and even philosophy. A central question is how the recognition of actions, and especially imitable actions (Wilson & Knoblich, 2005), might benefit from the use of motor representations. An important hypothesis in this context is that the visual recognition of actions might be accomplished by an internal simulation of the underlying motor behavior (Prinz, 1997; Rizzolatti et al., 2001). A number of computational models in robotics and neuroscience have tried to implement this principle (e.g., Erlhagen, Mukovskiy, & Bicho, 2006; Miall, 2003; Oztop, Kawato, & Arbib, 2006; Wolpert, Doya, & Kawato, 2003). It has been proposed that a similar process, the internal simulation of somatovisceral states and potentially even motor commands, might also be involved in the recognition of emotional facial expressions (e.g., van der Gaag, Minderaa, & Keysers, 2007). A close interaction between perceptual and motor representations of facial movements is also suggested by the phenomenon of facial mimicry, that is, the elicitation of electromyographic muscle responses by the observation of emotional pictures of faces (see chapter 11). From a theoretical point of view, these observations raise a question about the exact nature of this internal simulation: Does it, for example, reflect the spatial and temporal structure of facial actions, or is it more abstract, for example, in terms of emotional states?
Basic Neural Architecture
In this section we present a basic neural architecture that is consistent with many experimental results on the recognition of shapes and motion patterns, and even of static pictures of faces. The underlying model formalizes common knowledge about crucial properties of neurons at different levels of the ventral and dorsal visual pathways (Giese & Poggio, 2003; Jiang et al., 2006; Riesenhuber & Poggio, 1999). In addition, architectures of this type have been tested successfully with real-world form and motion stimuli (Jhuang et al., 2007; Serre, Wolf, Bileschi, Riesenhuber, & Poggio, 2007). This makes them interesting as a basis for the development of biological models for the processing of dynamic faces.
Feedforward Hierarchical Models of the Ventral Stream of the Visual Cortex
The processing of shape information in the cortex is thought to be mediated by the ventral visual pathway, running from V1 (Hubel & Wiesel, 1968) through extrastriate visual areas V2 and V4 to the IT (Perrett & Oram, 1993; Tanaka, 1996), and then to the prefrontal cortex (PFC), which is involved in linking perception to memory and action. Over the past decade, a number of physiological studies in nonhuman primates have established several basic facts about the cortical mechanisms of object and face recognition (see also chapters 7 and 8). The accumulated evidence points to several key features of the ventral pathway. Along the hierarchy from V1 to the IT, there is an increase in invariance with respect to changes in position and scale and, in parallel, an increase in receptive field size and in the complexity of the optimal stimuli for the neurons (Logothetis & Sheinberg, 1996; Perrett & Oram, 1993). One of the first feedforward models for object recognition, Fukushima's Neocognitron (Fukushima, 1980), constructed invariant object representations using a hierarchy of stages by progressively integrating convergent inputs from lower levels. Modern feedforward hierarchical models fall into different categories: neurobiological models (e.g., Mel, 1997; Riesenhuber & Poggio, 1999; Serre, Kreiman et al., 2007; Ullman, Vidal-Naquet, & Sali, 2002; Wallis & Rolls, 1997), conceptual proposals (e.g., Hubel & Wiesel, 1968; Perrett & Oram, 1993), and computer vision systems (e.g., Fukushima, 1980; LeCun, Bottou, Bengio, & Haffner, 1998). These models are simple and direct extensions of the Hubel and Wiesel simple-to-complex cell hierarchy. One specific implementation of this class of models (Serre et al., 2005; Serre, Kreiman et al., 2007) is the following: The model takes as input a gray-value image that is first analyzed by a multidimensional array of simple (S1) units which, like cortical simple cells, respond best to oriented bars and edges. The next level, C1, corresponds to striate complex cells (Hubel & Wiesel, 1968). Each of the complex C1 units receives the outputs of a group of simple S1 units with the same preferred orientation (and two opposite phases) but at slightly different positions and sizes (or peak frequencies).
The result of the pooling over positions and sizes is that C1 units become insensitive to the location and scale of the stimulus within their receptive fields, which is a hallmark of cortical complex cells. The parameters of the S1 and C1 units were adjusted to match the tuning properties of V1 parafoveal simple and complex cells (receptive field sizes, peak spatial frequency, as well as frequency and orientation bandwidth) as closely as possible. Feedforward theories of visual processing, and this model in particular, are based on extending these two classes of simple and complex cells to extrastriate areas. By alternating between S layers of simple units and C layers of complex units, the model achieves a difficult tradeoff between selectivity and invariance. Along the hierarchy, at each S stage, simple units become tuned to features of increasing complexity (e.g., from single oriented bars to combinations of oriented bars forming corners and features of intermediate complexity) by combining afferents of C units with different selectivities (e.g., units tuned to edges at different orientations). For instance, at the S2 level (respectively, S3), units pool the activities of retinotopically organized afferent C1 units (respectively, C2 units) with different orientations (different feature tuning), thus increasing the complexity of the representation from single bars to combinations of oriented bars forming contours or boundary conformations. Conversely, at each C stage, complex units become increasingly tolerant to 2D transformations (position and scale) by combining afferents (S units) with the same selectivity (e.g., a vertical bar) but slightly different positions and scales. This class of models seems to be qualitatively and quantitatively consistent with (and in some cases actually predicts) several properties of subpopulations of cells in V1, V4, the IT, and the PFC, as well as fMRI and psychophysical data. For instance, the described model predicts the MAX computation performed by a subclass of complex cells in the primary visual cortex (Lampl, Ferster, Poggio, & Riesenhuber, 2004) and in area V4 (Gawne & Martin, 2002). It also shows good agreement (Serre et al., 2005) with other data in V4 on the tuning for two-bar stimuli and for boundary conformations (Pasupathy & Connor, 2001; Reynolds, Chelazzi, & Desimone, 1999). The IT-like units of the model exhibit selectivity and invariance that are very similar to those of IT neurons (Hung, Kreiman, Poggio, & DiCarlo, 2005) for the same set of stimuli, and the model helped explain the tradeoff between invariance and selectivity observed in the IT in the presence of clutter (Zoccolan, Kouh, Poggio, & DiCarlo, 2007). Also, the model accurately matches the psychophysical performance of human observers for rapid animal versus nonanimal recognition (Serre, Oliva, & Poggio, 2007), a task that is not likely to be strongly influenced by cortical backprojections. This implies that such models might provide a good approximation of the first few hundred milliseconds of visual shape processing, before eye movements and shifts of attention become activated.
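To make the alternation of template matching and invariance pooling concrete, the following is a minimal Python sketch of a single S1-to-C1 stage of such a simple-to-complex hierarchy. It is only an illustration under stated assumptions: the filter size, wavelengths, and pooling ranges are placeholder values, not the V1-fitted parameters described above, and the sketch is not the implementation used in the models cited here.

```python
# Minimal sketch of one S1 -> C1 stage of an HMAX-style hierarchy (illustrative only).
import numpy as np
from scipy.signal import convolve2d
from scipy.ndimage import maximum_filter

def gabor_kernel(size, wavelength, orientation, sigma, gamma=0.5):
    """Zero-mean Gabor filter, the classic model of a V1 simple-cell receptive field."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(orientation) + y * np.sin(orientation)
    yr = -x * np.sin(orientation) + y * np.cos(orientation)
    g = np.exp(-(xr**2 + gamma**2 * yr**2) / (2 * sigma**2)) * np.cos(2 * np.pi * xr / wavelength)
    return g - g.mean()

def s1_layer(image, orientations, wavelengths):
    """S1 units: rectified responses of oriented Gabor filters at several scales."""
    return {(th, lam): np.abs(convolve2d(image, gabor_kernel(11, lam, th, sigma=0.4 * lam), mode="same"))
            for th in orientations for lam in wavelengths}

def c1_layer(s1, pool_size=8):
    """C1 units: maximum pooling over nearby positions and adjacent scales, which yields
    the position and scale tolerance attributed to cortical complex cells."""
    c1 = {}
    scales = sorted({lam for (_, lam) in s1})
    for th in {th for (th, _) in s1}:
        for lam_a, lam_b in zip(scales[:-1], scales[1:]):          # pool adjacent scale bands
            pooled = np.maximum(s1[(th, lam_a)], s1[(th, lam_b)])  # max over scale
            c1[(th, lam_a)] = maximum_filter(pooled, size=pool_size)[::pool_size // 2, ::pool_size // 2]
    return c1

image = np.random.rand(64, 64)  # stand-in for a gray-value face image
c1 = c1_layer(s1_layer(image, orientations=[0, np.pi / 4, np.pi / 2, 3 * np.pi / 4], wavelengths=[4, 8, 16]))
```

Stacking further S and C stages of the same form, with the S units tuned to combinations of C outputs learned from image statistics, yields the kind of hierarchy discussed in this section.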
Are Faces a Special Type of Object?
The question of how faces are represented in the cortex has been at the center of an intense debate (see Gauthier & Logothetis, 2000, and Tsao & Livingstone, 2008, for recent reviews). Faces are of high ecological significance, and it is therefore not surprising that a great deal of neural tissue seems to be selective for faces, both in humans (Kanwisher et al., 1997) and in monkeys (Moeller, Freiwald, & Tsao, 2008; Tsao & Livingstone, 2008). On the one hand, electrophysiological studies (Baylis, Rolls, & Leonard, 1985; Perrett, Rolls, & Caan, 1982; Rolls & Tovee, 1995; Young & Yamane, 1992) have suggested that faces, like other objects, are represented by the activity of a sparse population of neurons in the inferotemporal cortex. On the other hand, a theme that has pervaded the literature is that faces might be special. For instance, the so-called face inversion effect [i.e., the fact that inversion of faces affects performance to a much greater extent than inversion of other objects (Carey & Diamond, 1986; Yin, 1969)] suggested that face processing may rely on computational mechanisms such as configurational processing; this would seem incompatible with the shape-based models described earlier, which are based on a loose collection of image features and do not explicitly try to model the geometry of objects. A model (Riesenhuber & Poggio, 1999) that is closely related to the one described earlier was shown to account for both behavioral (Riesenhuber et al., 2004) and imaging data (Jiang et al., 2006) on the processing of still faces in the visual cortex. The model predicts that face discrimination is based on a sparse representation of units selective for face shapes, without the need to postulate additional, "face-specific" mechanisms. In particular, the model was used to derive and test predictions that quantitatively link model FFA face neuron tuning, neural adaptation measured in an fMRI rapid adaptation paradigm, and face discrimination performance. One of the successful predictions of this model is that discrimination performance should become asymptotic as faces become dissimilar enough to activate different neuronal populations. These results are in good agreement with imaging studies that failed to find evidence for configurational mechanisms in the FFA (Yovel & Kanwisher, 2004).
Feedforward Hierarchical Models of the Dorsal Stream
The processing of motion information is typically thought of as being mainly accomplished by the dorsal stream of the visual cortex. Whereas the computational mechanisms of motion integration in lower motion-selective areas (see Born & Bradley, 2005; Smith & Snowden, 1994 for reviews) have been extensively studied, relatively little is known about the processing of information in higher areas of the dorsal stream. It has been proposed that organizational and computational principles may be similar to those observed in the ventral stream (i.e., a gradual increase in the
complexity of the preferred stimulus and invariance properties along the hierarchy) (Essen & Gallant, 1994; Saito, 1993). Building on these principles, Giese and Poggio have proposed a model for motion recognition that consists of a ventral and a dorsal stream (Giese & Poggio, 2003). Their simulations demonstrated that biological motion and actions can be recognized, in principle, by either stream alone: via the detection of temporal sequences of shapes in the ventral stream of the model, or by recognizing specific complex optic flow patterns that are characteristic of action patterns in the dorsal stream. The architecture of the ventral stream follows closely the architecture of the models described earlier (Riesenhuber & Poggio, 1999), with the addition of a special recurrent neural mechanism at the highest level that makes the neural units' responses selective for sequential temporal order. The dorsal stream applies the same principles to neural detectors for motion patterns with different levels of complexity, such as local and opponent motion in the original model, or complex spatially structured optic flow patterns that are learned from training examples (Jhuang et al., 2007; Sigala, Serre, Poggio, & Giese, 2005). The model reproduced a variety of experiments (including psychophysical, electrophysiological, and imaging results) on the recognition of biological motion from point-light and full-body stimuli. Subsequent work showed that the dorsal stream is particularly suited for generalization between full-body and point-light stimuli, and produced reasonable recognition results even for stimuli with degraded local motion information, which had previously been interpreted as evidence that perception of biological motion is exclusively based on form features (Casile & Giese, 2005). This line of work has recently been extended by the inclusion of simple learning mechanisms for middle temporal-like units (Jhuang et al., 2007; Sigala et al., 2005), making it possible to adapt the neural detectors in intermediate stages of the model to the statistics of natural video sequences. The validation of this model showed that the resulting architecture was competitive with state-of-the-art computer vision systems for the recognition of human actions. This makes such models interesting for the recognition of other classes of dynamic stimuli, such as dynamic faces.
Extensions of the Basic Architecture for the Processing of Dynamic Faces
The core assumption in this chapter is that the recognition of dynamic facial expressions might exploit computational mechanisms similar to those used to process static objects or body movements. This does not necessarily imply that the underlying neural structures are shared, even though such sharing seems likely with respect to lower and midlevel visual processing. In the following discussion we present a number of extensions of the framework discussed in the preceding section that seem necessary in order to develop models for the processing of dynamic faces. In particular, we speculate that the processing of dynamic face stimuli involves a complex interaction
between motion cues from the dorsal stream and shape cues from the ventral stream (as discussed in the previous sections of this chapter).
Skeleton Model for the Processing of Dynamic Faces
Following the principles that have been successful in explaining the recognition of static objects and faces as well as dynamic body movements, figure 13.1 provides a sketch of how motion and shape cues may be integrated within a model for the processing of dynamic faces. The model processes dynamic facial expressions through two hierarchical pathways that extract complex form and motion features. The highest levels of these two streams are defined by complex pattern detectors that have been trained with typical examples of dynamic face patterns. Within this framework it is possible to make the form and also the motion pathway selective for temporal sequences by the introduction of asymmetric lateral connections between high-level units tuned to different keyframes of a face sequence (as described in Giese & Poggio, 2003). Whether such sequence selectivity is critical in the recognition of dynamic faces is still an open question. In the proposed model, the information from the form and motion pathways is integrated at the highest hierarchy level within model units that are selective for dynamic facial expressions. Again, it remains an open question for experimental research to demonstrate the existence of face-selective neurons that are selectively tuned for dynamic aspects (see also chapter 8). We have conducted preliminary experiments with a part of the proposed architecture using videos of facial expressions as stimuli. Testing the model of the dorsal stream described earlier (Jhuang et al., 2007) on a standard computer vision database (Dollar, Rabaud, Cottrell, & Belongie, 2005) that contains six facial expressions (anger, disgust, fear, joy, sadness, and surprise), we found that a small population of about 500 MT/MST-like motion-sensitive model units was sufficient for a reliable classification of these facial expressions (model performance: 93.0% versus 83.5% for the system by Dollar et al., 2005). These MT/MST-like units combine afferent inputs from "V1" model units that are tuned to different directions of motion. After a brief learning stage using dynamic face sequences, these units become selective for space-time facial features such as the motion of a mouth during a smile or the raising of an eyebrow during surprise. It seems likely that shape cues from the ventral stream would also play a key role, if not even a dominant role, in the processing of dynamic faces (see chapter 4). However, the exact integration of motion and form cues can only be determined by future, more detailed experimental evidence.
Extension for Norm-Referenced Encoding
As discussed in the second section of this chapter, several experiments on the processing of static pictures of faces have suggested the relevance of norm-referenced encoding for face processing, and potentially even for the representation of other classes of objects (Kayaert, Biederman, & Vogels, 2005; Leopold, Bondar, & Giese, 2006; Loffler et al., 2005; Rhodes & Jeffery, 2006; Tsao & Freiwald, 2006).
Figure 13.1 Neural model for the processing of dynamic face stimuli. Form and motion features are extracted in two separate pathways. The addition of asymmetric recurrent connections at the top levels makes the units selective for temporal order. The highest level consists of neurons that fuse form and motion information.
All models discussed so far are example-based. They assume neural units whose tuning depends on the position of a stimulus in feature space, independently of the overall stimulus statistics. An example is units with Gaussian tuning (Riesenhuber & Poggio, 1999; Serre et al., 2005) whose centers are defined by individual feature vectors that correspond to training patterns. Conversely, for norm-referenced encoding, the tuning of such units would depend not only on the actual stimulus but also on a norm stimulus (the average face), resulting in a special symmetry of the tuning about this norm stimulus. The model architecture proposed before can be easily extended to implement "norm-referenced encoding," and it seems that such an extension might be helpful in accounting for the tuning properties of face-selective neurons in the macaque IT area (Giese & Leopold, 2005). Figure 13.2 shows the results of an electrophysiological experiment (Leopold et al., 2006) that tested the tuning of face-selective neurons in area IT using face stimuli that had been generated by morphing among three-dimensional scans of humans (Blanz & Vetter, 1999). Specifically, morphs between four example faces (F1, F2, F3, and F4) and an average face computed from fifty faces were presented to the animals (panel a). Single-cell and population responses showed a clear tendency of the activity of individual neural units to vary monotonically with the distance of the test stimulus from the average face in the face space (panel b). We then tried to reproduce this result with an example-based model that was basically a simplified version of the model for the ventral stream discussed in detail in the previous section. The model consisted of a hierarchy of "simple" and "complex" units to extract oriented contours, and midlevel feature detectors that were optimized by PCA based on the available training patterns. Units at the highest level were Gaussian radial basis functions whose centers were defined by the feature vectors of training faces from the database, consistent with the example-based models discussed earlier. (See Giese & Leopold, 2005, for further details.) Although this model achieved robust face recognition with a realistic degree of selectivity and matched basic statistical parameters of the measured neural responses, it failed to reproduce the monotonic dependence of the tuning curves on the distance of the stimuli from the average face that was observed in the experiment (figure 13.2c). This deviation from the data was quite robust against changes in the parameters of the model, and even against structural variations such as the number of spatial scales. This points toward a fundamental difference with respect to the relevant neural encoding principles. To test this hypothesis, we implemented a second version of the model that included a special neural mechanism that approximates norm-referenced encoding and that replaced the units with Gaussian tuning in the exemplar-based model.
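As a concrete point of reference for what the norm-referenced mechanism replaces, the following minimal Python sketch shows the example-based read-out just described: face-selective units with Gaussian radial basis function tuning whose centers are the feature vectors of individual training faces. The dimensions and bandwidth are toy values for illustration, not those of the simulated model.

```python
# Example-based read-out: Gaussian (radial basis function) units centered on training faces.
import numpy as np

def example_based_responses(r, centers, sigma=1.0):
    """r: feature vector of the test face (from the lower layers of the hierarchy).
    centers: (n_units, n_features) array of stored training-face feature vectors."""
    squared_distances = ((centers - r) ** 2).sum(axis=1)
    return np.exp(-squared_distances / (2.0 * sigma ** 2))

centers = np.random.randn(20, 50)   # 20 stored faces in a 50-dimensional feature space
responses = example_based_responses(np.random.randn(50), centers)
```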
Figure 13.2 Responses of face-selective neurons in area IT and simulation results from two model variants implementing example-based and norm-referenced encoding. (a) Stimuli generated by morphing between the average face and four example faces (F1, F2, F3, and F4). Pictures outside the sphere indicate facial caricatures that exaggerate features of the individual example faces. The identity level specifies the location of face stimuli along the line between the average face and the individual example faces (0 corresponding to the average face and 1 to the original example face). (b) Responses of eighty-seven face-selective neurons (normalized average spike rates within an interval 200–300 ms after stimulus onset) in area IT of a macaque monkey. Different lines indicate the population averages computed separately for different identity levels and for the example face that elicited, respectively, the strongest, second-strongest, and so on down to the lowest response. Asterisks indicate significant monotonic trends (p < 0.05). (c) Normalized responses of face-selective neurons for the model implementing example-based encoding, plotted in the same way as the responses of the real neurons in panel b. (d) Normalized responses of the face-selective neurons for the model implementing norm-referenced encoding (conventions as in panel b).
The key idea for the implementation of norm-referenced encoding is to obtain an estimate of the feature vector u(t) that corresponds to the norm stimulus (the average face) by averaging over the stimulus history. Once this estimate has been computed, neural detectors that are selective for the difference between the actual stimulus input from the previous layer, r(t), and this estimate can be easily constructed. The underlying mechanism is schematically illustrated in figure 13.3a. An estimate û(t) of the feature vector that corresponds to the norm stimulus is computed by "integrator neurons" (light gray in figure 13.3a), which form a (very slow-moving) average of the input signal r(t) from the previous layer over many stimulus presentations. Simulations showed that for random presentation of stimuli from a fixed set of faces, this temporal average provides a sufficiently accurate estimate of the feature vector that corresponds to the real average face. The (vectorial) difference z(t) = r(t) − û(t) between this estimate and the actual stimulus input is computed by a second class of neurons (indicated in white). The last level of the proposed circuit is given by face-selective neurons (or small networks of neurons) whose input signal is given by the difference vector z(t) (indicated by dark gray in figure 13.3). The tuning functions of these neurons were given by the function

y_k = g_k(z) ≅ |z| (z · n_k / |z| + 1)^ν,    (13.1)
where the first term defines a linear dependence of the output on the length of the distance vector, and the second term can be interpreted as a direction tuning function in the high-dimensional feature space (the unit vector n_k determining the preferred direction, and the positive parameter ν controlling the width of the direction tuning). While at first glance this function does not look biologically plausible, it turns out that for ν = 1 (a value leading to a good approximation of the physiological data) it can be approximated well by the function

y_k = g_k(z) ≅ z · n_k + |z| = [z]_+ · (n_k + 1) + [−z]_+ · (1 − n_k).    (13.2)
In this formula, [·]_+ denotes the linear threshold function [x]_+ = max(x, 0), and function (13.2) can be implemented with a simple, physiologically plausible two-layer neural network with linear threshold units, as sketched in figure 13.3b. A quantitative comparison between simulation and real experimental data, using the same stimuli and the same type of analysis for the real and the modeled neural data, shows very good agreement, as illustrated in figure 13.2d. Specifically, the model reproduces even the number of significant positive and negative trends that were observed in the real data. This result shows that the proposed architecture can be extended to include norm-referenced encoding without much additional effort and without making biologically implausible assumptions.
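The following is a minimal Python sketch of this circuit under the assumptions just stated: integrator neurons form a slow running average of the inputs as the estimate of the norm feature vector, and the face-selective units apply the tuning function of equations (13.1) and (13.2) to the difference vector. The feature dimensions, leak rate, and class structure are illustrative choices, not the simulated model.

```python
# Sketch of the norm-referenced encoding circuit (cf. equations 13.1-13.2 and figure 13.3).
import numpy as np

class NormReferencedLayer:
    def __init__(self, preferred_directions, nu=1.0, leak=0.01):
        self.N = np.asarray(preferred_directions, dtype=float)   # rows are the unit vectors n_k
        self.N /= np.linalg.norm(self.N, axis=1, keepdims=True)
        self.nu = nu        # width parameter of the direction tuning in eq. (13.1)
        self.leak = leak    # how slowly the "integrator neurons" update their average
        self.u_hat = None   # running estimate of the norm (average-face) feature vector

    def respond(self, r):
        r = np.asarray(r, dtype=float)
        # integrator neurons: very slow running average over many stimulus presentations
        self.u_hat = r.copy() if self.u_hat is None else (1 - self.leak) * self.u_hat + self.leak * r
        z = r - self.u_hat                     # deviation from the estimated norm stimulus
        norm_z = np.linalg.norm(z) + 1e-12
        # eq. (13.1): linear in |z|, with direction tuning around each preferred direction n_k;
        # for nu = 1 this reduces to z.n_k + |z|, which the two-layer linear-threshold network
        # of eq. (13.2) and figure 13.3b realizes with rectified units [.]_+
        return norm_z * (self.N @ z / norm_z + 1.0) ** self.nu

layer = NormReferencedLayer(np.random.randn(4, 50))   # four tuning directions in a 50-d feature space
responses = [layer.respond(np.random.randn(50)) for _ in range(200)]
```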
Figure 13.3 Neural circuits implementing norm-referenced encoding. (a) Basic circuit deriving an estimate of the feature vector that corresponds to the norm stimulus by averaging, and computing the difference between the actual input and this estimate. The difference vector provides the input to the face-selective neurons. (b) Implementation of the tuning function of the face-selective neurons by a two-layer linear threshold network. The unit vectors n_k define the tuning of the units in the high-dimensional input space (for details, see text).
In addition, the proposed circuit can be reinterpreted in a statistical framework as a form of predictive encoding in which the face units represent the deviations from the most likely expected stimulus, which is the average face if no further a priori information is given. Predictive coding has been discussed extensively as an important principle for visual processing, and especially for object and action recognition (e.g., Friston, 2008; Rao & Ballard, 1997).
Other Missing Components
The model components and principles discussed in this chapter are far from complete, and it seems likely that an architecture that captures all fundamental aspects of the neural processing of dynamic facial expressions will require a variety of additional elements. A few such elements are listed in the following discussion.
Cortical Feedback The proposed model has a primarily feedforward architecture. It has been shown that in object recognition, such models capture important properties of immediate recognition (in the first 200 ms after stimulus presentation) (Serre, Oliva et al., 2007). For longer stimulus presentations, which are typical for complex dynamic patterns, top-down effects need to be taken into account. This requires the inclusion of top-down connections and attentional effects in the model, which may be particularly important for fine discrimination tasks such as face identification. Overall, the proposed model has been extensively tested for the classification of objects (including faces). Yet vision is much more than categorization because it involves interpreting an image (for faces, this may take the form of inferring the age, gender, and ethnicity of a person, or physical attributes such as attractiveness or social status). It is likely that the feedforward architectures described in this chapter will be insufficient to match the level of performance of human observers on some of these tasks and that cortical feedback and inference mechanisms (Lee & Mumford, 2003) may play a key role.
Attentional Mechanisms Hierarchical architectures that are similar to the proposed model have been extended with circuits for attentional modulation (e.g., Deco & Rolls, 2004; Itti & Koch, 2001). Inclusion of attention in models for dynamic face recognition seems to be crucial since faces have been shown to capture attention (e.g., Bindemann, Burton, Hooge, Jenkins, & de Haan, 2005) and the recognition of facial expressions interacts in a complex way with attention (e.g., Pourtois & Vuilleumier, 2006). Interaction with Motor Representations and Top-Down Influences of Internal Emotional States These are other potential missing elements. The model described here focuses
exclusively on purely visual aspects of facial expressions. The influence of motor
representations could be modeled by a time-synchronized modulation by predictions of sensory states from dynamically evolving predictive internal motor models (e.g., Wolpert et al., 2003). Alternatively, the sensitivity for visual features consistent with specific motor patterns or emotional states might be increased in a less specific manner, similar to attentional modulation without a detailed matching of the temporal structure. Detailed future experiments might help to decide among computational alternatives. Discussion
In this chapter we have described computational mechanisms that, in our view, could be important for the processing of dynamic faces in biological systems. Since at present no physiologically plausible model for the processing of such stimuli exists, we have reviewed work from different disciplines: computer vision models for the recognition of dynamic faces (see also chapters 12 and 14), and biologically inspired models for the processing of static faces and full-body movements. In addition, we have presented a physiologically plausible core architecture that has been shown previously to account for many experimental results on the recognition of static objects and faces and of dynamic bodies. Moreover, our work demonstrates that this architecture reaches performance levels for object and motion recognition that are competitive with state-of-the-art computer vision systems. We suggest that this basic architecture may constitute a starting point for the development of quantitative, physiologically inspired models for the recognition of dynamic faces. As for other work in theoretical neuroscience, the development of successful models for the recognition of dynamic faces will depend critically on the availability of conclusive and constraining experimental data. Although the body of experimental data in this area is continuously growing (as shown by the chapters in the first two parts of this book), the available data are far from sufficient to decide even the most important questions about the computational mechanisms underlying the processing of dynamic faces. Questions that might be clarified by such experiments include:
How much overlap is there between cortical areas involved in the processing of static versus dynamic faces?
How do form and motion cues contribute to the recognition of dynamic faces?
Is temporal order-selectivity crucial for the processing of facial expressions, and which neural mechanisms implement such sequence selectivity?
Are neurons tuned to dynamic sequences of face images such as heads rotating in 3D also involved in the problem of (pose) invariant recognition?
Is there a direct coupling between perceptual and motor representations of facial movements, and what are the neural circuits that implement this coupling?
How do other modalities, such as auditory or haptic cues, modulate the visual processing of dynamic faces?
The clarification of such questions will likely require the integration of di¤erent experimental methods, including psychophysics, functional imaging, lesion studies, and most importantly, single-cell physiology. An important function for computational models like the ones discussed in this chapter is to quantitatively link the results obtained with various experimental methods and to test the computational feasibility of explanations in the context of real-world stimuli with realistic levels of complexity. Only computational mechanisms that comply with the available data, and which are appropriate for reaching su‰cient performance levels with real-world stimuli seem promising as candidates for an explanation of the biological mechanisms that underlie the processing of dynamic faces. References Abdi, H., Valentin, D., Edelman, B., & O’Toole, A. J. (1995). More about the di¤erence between men and women: Evidence from linear neural networks and the principal-component approach. Perception, 24(5), 539–562. Ahonen, T., Hadid, A., & Pietikaine, M. (2006). Face description with local binary patterns: Application to face recognition. IEEE Trans Pattern Anal Machine Intell, 28(12), 2037–2041. Barraclough, N. E., Xiao, D., Oram, M. W., & Perrett, D. I. (2006). The sensitivity of primate STS neurons to walking sequences and to the degree of articulation in static images. Prog Brain Res, 154, 135–148. Baylis, G. C., Rolls, E. T., & Leonard, C. M. (1985). Selectivity between faces in the responses of a population of neurons in the cortex in the superior temporal sulcus of the monkey. Brain Res, 342(1), 91–102. Bindemann, M., Burton, A. M., Hooge, I. T., Jenkins, R., & de Haan, E. H. (2005). Faces retain attention. Psychonom Bull Rev, 12, 1048–1053. Binkofski, F., & Buccino, G. (2006). The role of ventral premotor cortex in action execution and action understanding. J Physiol Paris, 99(4–6), 396–405. Black, M. J., & Yacoob, Y. (1995). Tracking and recognizing rigid and non-rigid facial motions using local parametric models of image motion. In Proceedings of the fifth international conference on the computer vision (ICCV ’95) (pp. 374–381). Washington DC: IEEE Computer Society. Blanz, V., & Vetter, T. (1999). A morphable model for synthesis of 3D faces. In Computer graphics proc. SIGGRAPH (pp. 187–194). Los Angeles. Born, R. T., & Bradley, D. C. (2005). Structure and function of visual area MT. Annu Rev Neurosci, 28, 157–189. Bruce, C. J., Desimone, R., & Gross, C. G. (1986). Both striate cortex and superior colliculus contribute to visual properties of neurons in superior temporal polysensory area of macaque monkey. J Neurophysiol, 55, 1057–1075. Bruce, V., & Young, A. (1986). Understanding face recognition. Br J Psychol, 77 (Pt. 3), 305–327. Burton, A. M., & Bruce, V. (1993). Naming faces and naming names: Exploring an interactive activation model of person recognition. Memory, 1, 457–480. Burton, A. M., Bruce, V., & Hancock, P. J. B. (1999). From pixels to people: A model of familiar face recognition. Cognit Sci, 23(1), 1–31. Carey, S., & Diamond, R. (1986). Why faces are and are not special: An e¤ect of expertise. J Exp Psychol Gen, 115, 107–117.
Casile, A., & Giese, M. A. (2005). Critical features for the recognition of biological motion. J Vis, 5, 348–360. Cohen, I., Sebe, N., Garg, A., Chen, L. S., & Huang, T. (2003). Facial expression recognition from video sequences: Temporal and static modeling. Comp Vis Image Understand, 91(1–2), 160–187. Dailey, M. N., & Cottrell, G. W. (1999). Organization of face and object recognition in modular neural networks. Neural Networks, 12(7–8), 1053–1074. Dailey, M. N., Cottrell, G. W., Padgett, C., & Adolphs, R. (2002). EMPATH: A neural network that categorizes facial expressions. J Cognit Neurosci, 14(8), 1158–1173. Decety, J., & Grezes, J. (1999). Neural mechanisms subserving the perception of human actions. Trends Cogn Sci, 3(5), 172–178. Deco, G., & Rolls, E. T. (2004). A neurodynamical cortical model of visual attention and invariant object recognition. Vision Res, 44(6), 621–642. Desimone, R., Albright, T., Gross, C., & Bruce, C. (1984). Stimulus-selective properties of inferior temporal neurons in the macaque. J Neurosci, 4(8), 2051–2062. Dollar, P., Rabaud, V., Cottrell, G., & Belongie, S. (2005). Behavior recognition via sparse spatiotemporal features. Paper presented at the workshop on visual surveillance and performance evaluation of tracking and surveillance (pp. 65–72). October 16, Beijing China. Edwards, G., Taylor, C., & Cootes, T. (1998). Interpreting face images using active appearance models. Paper presented at the 3rd IEEE international conference on automatic face and gesture recognition (pp. 300–305). Apr. 14–16 Nara, Japan. Washington DC: IEEE Computer Society. Erlhagen, W., Mukovskiy, A., & Bicho, E. (2006). A dynamic model for action understanding and goaldirected imitation. Brain Res, 1083(1), 174–188. Escobar, M. J., Masson, G. S., Vieville, T., & Kornprobst, P. (2009). Action recognition using a bioinspired feedforward spiking network. Int J Comp Vis, 82(3), 284–301. Essa, I., Darrell, T., & Pentland, A. (1994). A vision system for observing and extracting facial action parameters. In Proceedings of the conference on computer vision and pattern recognition (CVPR ’94) (pp. 76–83). 21 Jun–23 Jun 1994, Seattle WA. Essen, D. C. V., & Gallant, J. L. (1994). Neural mechanisms of form and motion processing in the primate visual system. Neuron, 13(1), 1–10. Fasel, B., & Luettin, J. (2003). Automatic facial expression analysis: A survey. Pattern Recog, 36(1), 259–275. Fogassi, L., Ferrari, P. F., Gesierich, B., Rozzi, S., Chersi, F., & Rizzolatti, G. (2005). Parietal lobe: From action organization to intention understanding. Science, 308(5722), 662–667. Friston, K. (2008). Hierarchical models in the brain. PLoS Comput Biol, 4(11), e1000211. Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition una¤ected by shift in position. Biol Cyb, 36, 193–202. Gallese, V., Fadiga, L., Fogassi, L., & Rizzolatti, G. (1996). Action recognition in the premotor cortex. Brain, 119 (Pt 2), 593–609. Gauthier, I., & Logothetis, N. (2000). Is face recognition not so unique after all? Cognit Neuropsychol, 17(1–3), 125–142. Gawne, T. J., & Martin, J. M. (2002). Responses of primate visual cortical V4 neurons to simultaneously presented stimuli. J Neurophysiol, 88(3), 1128–1135. Giese, M. A., & Leopold, D. A. (2005). Physiologically inspired neural model for the encoding of face spaces. Neurocomputing, 65–66, 93–101. Giese, M. A., & Poggio, T. (2003). Neural mechanisms for the recognition of biological movements. 
Nat Rev Neurosci, 4(3), 179–192. Gong, S., Psarrou, A., Katsoulis, I., & Palavouzis, P. (1994). Tracking and recognition of face sequences. Paper presented at the Proceedings of the European workshop on combined real and synthetic image processing for broadcast and video production. Hamburg, Germany 1994, 23–24. Nov. Gong, S. M., McKenna, S. J., & Psarrou, A. (2000). Dynamic vision: From images to face recognition. London: Imperial College Press.
Grossman, E. D., & Blake, R. (2002). Brain areas active during visual perception of biological motion. Neuron, 1167–1175. Hancock, P. J. B., Burton, M. A., & Bruce, V. (1996). Face processing: Human perception and principal components analysis. Memory Cognit, 24, 26–40. Haxby, J., Ho¤man, E., & Gobbini, M. (2000). The distributed human neural system for face perception. Trends Cognit Sci, 4(6), 223–233. Hubel, D. H., & Wiesel, T. N. (1968). Receptive fields and functional architecture of monkey striate cortex. J Physiol, 195(1), 215–243. Hung, C. P., Kreiman, G., Poggio, T., & DiCarlo, J. J. (2005). Fast read-out of object identity from macaque inferior temporal cortex. Science, 310, 863–866. Itti, L., & Koch, C. (2001). Computational modelling of visual attention. Nat Rev Neurosci, 2(3), 194–203. Jain, A. K., & Li, S. Z. (2005). Handbook of face recognition. New York: Springer-Verlag. Jhuang, H., Serre, T., Wolf, L., & Poggio, T. (2007). A biologically inspired system for action recognition. In Proceedings of the eleventh IEEE international conference on computer vision (ICCV) (pp. 1–8). Washington DC: IEEE Computer Society. Jiang, X., Rosen, E., Ze‰ro, T., Vanmeter, J., Blanz, V., & Riesenhuber, M. (2006). Evaluation of a shape-based model of human face discrimination using fMRI and behavioral techniques. Neuron, 50(1), 159–172. Kanwisher, N., McDermott, J., & Chun, M. M. (1997). The fusiform face area: A module in human extrastriate cortex specialized for face perception. J Neurosci, 17(11), 4302–4311. Kayaert, G., Biederman, I., & Vogels, R. (2005). Representation of regular and irregular shapes in macaque inferotemporal cortex. Cereb Cortex, 15(9), 1308–1321. Kriegman, D., Yang, M. H., & Ahuja, N. (2002). Detecting faces in images: A survey. IEEE Trans Pattern Anal Machine Intell, 24, 34–58. Lampl, I., Ferster, D., Poggio, T., & Riesenhuber, M. (2004). Intracellular measurements of spatial integration and the MAX operation in complex cells of the cat primary visual cortex. J Neurophysiol, 92(5), 2704–2713. Lange, J., & Lappe, M. (2006). A model of biological motion perception from configural form cues. J Neurosci, 26(11), 2894–2906. Lanitis, A., Taylor, C., & Cootes, T. (1997). Automatic interpretation and coding of face images using flexible models. IEEE Trans Pattern Anal Machine Intell, 19, 743–756. LeCun, Y., Bottou, L., Bengio, Y., & Ha¤ner, P. (1998). Gradient-based learning applied to document recognition. Proc. IEEE, 86, 2278–2324. Lee, K. C., Ho, J., Yang, M. H., & Kriegman, D. (2003). Video-based face recognition using probabilistic appearance manifolds. Paper presented at the Proceeding International Conference Computer Vision and Pattern Recognition (CVPR ’03) (Vol. 1, pp. 313–320). Madison WI, Jun 18–20. Lee, T. S., & Mumford, D. (2003). Hierarchical Bayesian inference in the visual cortex. Journal of the Optical Society of America, A 20(7), 1434–1448. Leopold, D. A., Bondar, I. V., & Giese, M. A. (2006). Norm-based face encoding by single neurons in the monkey inferotemporal cortex. Nature, 442(7102), 572–575. Leopold, D. A., O’Toole, A. J., Vetter, T., & Blanz, V. (2001). Prototype-referenced shape encoding revealed by high-level aftere¤ects. Nat Neurosci, 4(1), 89–94. Lewis, M. B., & Johnston, R. A. (1999). A unified account of the e¤ects of caricaturing faces. Vis Cognit, 6, 1–41. Li, Y., Shaogang Gong, S., & Heather Liddell, H. (2001). Modelling faces dynamically across views and over time. 
Paper presented at the 8th IEEE International Conference on Computer Vision (pp. 554–559). Jul 7–14, 2001, Vancouver, BC, Canada. Loffler, G., Yourganov, G., Wilkinson, F., & Wilson, H. R. (2005). fMRI evidence for the neural representation of faces. Nat Neurosci, 8(10), 1386–1391. Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annu Rev Neurosci, 19, 577–621.
Luettin, J., Thacker, N. A., & Beet, S. W. (1996). Visual speech recognition using active shape models and Hidden Markov Models. In Proceedings of the IEEE international conference on acoustics, speech, and signal processing (ICASSP 96) (Vol. 2, pp. 817–820). 07–10 May 1996, Atlanta GA. Mase, K. (1991). Recognition of facial expression from optical flow. IEICE Trans, 74(10), 3474–3483. Mel, B. W. (1997). SEEMORE: Combining color, shape and texture histogramming in a neurally inspired approach to visual object recognition. Neural Comp, 9, 777–804. Miall, R. C. (2003). Connecting mirror neurons and forward models. Neuroreport, 14(17), 2135–2137. Moeller, S., Freiwald, W. A., & Tsao, D. Y. (2008). Patches with links: A unified system for processing faces in the macaque temporal lobe. Science, 320(5881), 1355–1359. O’Toole, A. J., De¤enbacher, K. A., Valentin, D., & Abdi, H. (1994). Structural aspects of face recognition and the other-race e¤ect. Mem Cognit, 22(2), 208–224. Oram, M. W., & Perrett, D. I. (1996). Integration of form and motion in the anterior superior temporal polysensory area (STPa) of the macaque monkey. J Neurophysiol, 76(1), 109–129. Otsuka, T., & Ohya, J. (1996). Recognition of facial expressions using HMM with continuous output probabilities. Paper presented at the 5th IEEE Workshop on Robot and Human Communication (pp. 323–328). Tsukuba, Japan, 11–14 Nov 96. Oztop, E., Kawato, M., & Arbib, M. (2006). Mirror neurons and imitation: A computationally guided review. Neural Netw, 19(3), 254–271. Pantic, M., & Patras, I. (2006). Dynamics of facial expression: Recognition of facial actions and their temporal segments from face profile image sequences. IEEE Trans Systems, Man, and Cybernetics, Part B, 36(2), 433–449. Pantic, M., & Rothkrant, L. J. M. (2000). Automatic analysis of facial expressions: The state of the art. IEEE Trans Pattern Anal Machine Intell, 22, 1424–1445. Pasupathy, A., & Connor, C. E. (2001). Shape representation in area V4: Position-specific tuning for boundary conformation. J Neurophysiol, 86(5), 2505–2519. Peelen, M. V., & Downing, P. E. (2007). The neural basis of visual body perception. Nat Rev Neurosci, 8(8), 636–648. Perrett, D., & Oram, M. (1993). Neurophysiology of shape processing. Image Vision Comput, 11, 317–333. Perrett, D. I., Rolls, E. T., & Caan, W. (1982). Visual neurones responsive to faces in the monkey temporal cortex. Exp Brain Res, 47, 329–342. Peuskens, H., Vanrie, J., Verfaillie, K., & Orban, G. A. (2005). Specificity of regions processing biological motion. Eur J Neurosci, 21(10), 2864–2875. Phillips, P. J., Grother, P., Micheals, R., Blackburn, D. M., Tabassi, E., & Bone, M. (2003). Face recognition vendor test 2002. Paper presented at the IEEE International Workshop on Analysis and Modeling of Faces and Gestures (p. 44). 17 Oct 2003 Nice France. Washington DC: IEEE Comp. Society. Pinsk, M. A., Arcaro, M., Weiner, K. S., Kalkus, J. F., Inati, S. J., Gross, C. G., et al. (2009). Neural representations of faces and body parts in macaque and human cortex: A comparative fMRI study. J Neurophysiol, 101(5), 2581–2600. Pinsk, M. A., DeSimone, K., Moore, T., Gross, C. G., & Kastner, S. (2005). Representations of faces and body parts in macaque temporal cortex: A functional MRI study. Proc Natl Acad Sci USA, 102(19), 6996– 7001. Pourtois, G., & Vuilleumier, P. (2006). Dynamics of emotional e¤ects on spatial attention in the human visual cortex. Prog Brain Res, 156, 67–91. Prinz, W. (1997). Perception and action planning. 
Eur J Cogn Psychol, 9, 129–154. Puce, A., & Perrett, D. (2003). Electrophysiology and brain imaging of biological motion. Phil Trans R Soc Lond B Biol Sci, 358(1431), 435–445. Rao, R. P., & Ballard, D. H. (1997). Dynamic model of visual recognition predicts neural response properties in the visual cortex, Neural Comp, 9, 721–763. Reynolds, J. H., Chelazzi, L., & Desimone, R. (1999). Competitive mechanisms subserve attention in macaque areas V2 and V4. J Neurosci, 19(5), 1736–1753.
Rhodes, G., Brennan, S., & Carey, S. (1987). Identification and ratings of caricatures: Implications for mental representations of faces. Cogn Psychol, 19(4), 473–497. Rhodes, G., & Je¤ery, L. (2006). Adaptive norm-based coding of facial identity. Vision Res, 46, 2977– 2987. Riesenhuber, M., Jarudi, I., Gilad, S., & Sinha, P. (2004). Face processing in humans is compatible with a simple shape-based model of vision. Proc Biol Sci, 271 Suppl 6, S448–450. Riesenhuber, M., & Poggio, T. (1999). Hierarchical models of object recognition in cortex. Nature Neurosci, 2(11), 1019–1025. Rizzolatti, G., Fogassi, L., & Gallese, V. (2001). Neurophysiological mechanisms underlying the understanding and imitation of action. Nat Rev Neurosci, 2(9), 661–670. Rolls, E. T., & Tovee, M. J. (1995). Sparseness of the neuronal representation of stimuli in the primate temporal visual cortex. J Neurophysiol, 73(2), 713–726. Rosenblum, M., Yacoob, Y., & Davis, L. (1994). Human emotion recognition from motion using a radial basis function network architecture. In Proceedings of the 1994 IEEE workshop on motion of non-rigid and articulated objects (pp. 43–49). Austin TX, Nov 11–12. Saito, H. (1993). Hierarchical neural analysis of optical flow in the macaque visual pathway. In T. Ono, L. R. Squire, M. E. Raichle, D. I. Perrett, and M. Fukuda (eds.), Brain mechanisms of perception and memory. Oxford, UK: Oxford University Press, pp. 121–140. Schindler, K., Van Gool, L., & de Gelder, B. (2008). Recognizing emotions expressed by body pose: A biologically inspired neural model. Neural Netw, 21(9), 1238–1246. Serre, T., Kouh, M., Cadieu, C., Knoblich, U., Kreiman, G., & Poggio, T. (2005). A theory of object recognition: Computations and circuits in the feedforward path of the ventral stream in primate visual cortex (AI Memo 2005-036 / CBCL Memo 259). MIT, Cambridge, MA. Serre, T., Kreiman, G., Kouh, M., Cadieu, C., Knoblich, U., & Poggio, T. (2007). A quantitative theory of immediate visual recognition. Prog Brain Res, 165, 33–56. Serre, T., Oliva, A., & Poggio, T. (2007). A feedforward architecture accounts for rapid categorization. Proc Natl Acad Sci USA, 104(15), 6424–6429. Serre, T., Wolf, L., Bileschi, S., Riesenhuber, M., & Poggio, T. (2007). Object recognition with cortex-like mechanisms. IEEE Trans Pattern Analy Machine Intell, 29(3), 411–426. Sigala, R., Serre, T., Poggio, T., & Giese, M. A. (2005). Learning features of intermediate complexity for the recognition of biological motion. Artificial Neural Networks: Formal Models and Their Applications— ICANN 2005, 15th International Conference, Warsaw, Poland Sep 11–15 2005, Warsaw, Poland (pp. 241– 246). Sirovich, L., & Kirby, M. (1997). A low-dimensional procedure for identifying human faces, J Opt Soc Am A, 4, 519–524. Smith, A. T., & Snowden, R. J. (1994). Visual detection of motion. London: Academic Press. Suwa, M., Sugie, N., & Fujimora, K. (1978). A preliminary note on pattern recognition of human emotional expression. In Proceedings of the 4th international joint conference on pattern recognition (pp. 408– 410). Nov 7–10, 1978, New York NY. Tanaka, K. (1996). Inferotemporal cortex and object vision, Annu Rev Neurosci, 19, 109–139. Thurman, S. M., & Grossman, E. D. (2008). Temporal ‘‘bubbles’’ reveal key features for point-light biological motion perception. J Vis, 8(3), 1–11. Tian, Y. L., Kanade, T., & Cohn, J. F. (2005). Facial expression analysis. In S. Z. Li & A. K. Jain (eds.), Handbook of face recognition. New York: Springer-Verlag. 
Tistarelli, M., Bicego, M., & Grosso, E. (2009). Dynamic face recognition: From human to machine vision. Image Vis Comp, 27(3), 222–232. Tsao, D. Y., & Freiwald, W. A. (2006). What’s so special about the average face? Trends Cognit Sci, 10, 391–393. Tsao, D. Y., Freiwald, W. A., Knutsen, T. A., Mandeville, J. B., & Tootell, R. B. (2003). Faces and objects in macaque cerebral cortex. Nat Neurosci, 6(9), 989–995.
Tsao, D. Y., & Livingstone, M. S. (2008). Mechanisms of face perception. Annu Rev Neurosci, 31, 411– 437. Turk, M., & Pentland, A. (1991). Eigenfaces for recognition, J Cognit Neurosci, 3, 71–86. Ullman, S., Vidal-Naquet, M., & Sali, E. (2002). Visual features of intermediate complexity and their use in classification. Nat Neurosci, 5(7), 682–687. Valentin, D., Abdi, H., & Edelman, B. (1997a). What represents a face? A computational approach for the integration of physiological and psychological data. Perception, 26(10), 1271–1288. Valentin, D., Abdi, H., Edelman, B., & O’Toole, A. J. (1997b). Principal component and neural network analyses of face images: What can be generalized in gender classification? J Math Psychol, 41(4), 398–413. Valentine, T. (1991). A unified account of the e¤ects of distinctiveness, inversion and race in face recognition. Quart J Exp Psychol, 43A, 161–204. van der Gaag, C., Minderaa, R. B., & Keysers, C. (2007). Facial expressions: What the mirror neuron system can and cannot tell us. Soc Neurosci, 2(3–4), 179–222. Vangeneugden, J., Pollick, F., & Vogels, R. (2008). Functional di¤erentiation of macaque visual temporal cortical neurons using a parametric action space. Cereb Cortex, 9(3), 593–611. Viola, P., & Jones, M. (2001). Robust real-time face detection. In Proceedings of the 8th international conference on computer vision (Vol. 20, No. 11, pp. 1254–1259). Jul 7–14 Vancouver BC, Canada. Washington DC: IEEE Camp. Society. Wallis, G., & Rolls, E. T. (1997). A model of invariant recognition in the visual system. Prog Neurobiol, 51, 167–194. Wilson, M., & Knoblich, G. (2005). The case for motor involvement in perceiving conspecifics. Psychol Bull, 131(3), 460–473. Wolpert, D. M., Doya, K., & Kawato, M. (2003). A unifying computational framework for motor control and social interaction. Phil Trans R Soc Lond B Biol Sci, 358(1431), 593–602. Yamaguchi, O., Fukui, K., & Maeda, K. (1998). Face recognition using temporal image sequence. In Proceedings of the 3rd international conference on automatic face and gesture recognition (pp. 318–323). Nara Japan, Apr 14–16. Washington DC: IEEE Comp. Soc. Yang, P., Liu, Q., & Metaxas, D. N. (2009). Boosting encoded dynamic features for facial expression recognition. Patt Recogn Lett, 30(2), 132–139. Yin, R. K. (1969). Looking at upside-down faces. J Exp Psychol, 81, 141–145. Young, M. P., & Yamane, S. (1992). Sparse population coding of faces in the inferior temporal cortex. Science, 256, 1327–1331. Yovel, G., & Kanwisher, N. (2004). Face perception: Domain specific, not process specific. Neuron, 44(5), 889–898. Zhang, Y., & Martinez, A. M. (2006). A weighted probabilistic approach to face recognition from multiple images and video sequences. Image Vis Comput, 24(6), 626–638. Zhao, W., Chellappa, R., Rosenfeld, A., & Phillips, P. (2003). Face recognition: A literature survey. ACM Comp Surveys, 35(4), 399–458. Zhou, S., Krueger, V., & Chellappa, R. (2003). Probabilistic recognition of human faces from video. Comp Vis Image Understand, 91, 214–245. Zoccolan, D., Kouh, M., Poggio, T., & DiCarlo, J. J. (2007). Trade-o¤ between object selectivity and tolerance in monkey inferotemporal cortex. J Neurosci, 27(45), 12292–12307.
14
Insights on Spontaneous Facial Expressions from Automatic Expression Measurement
Marian Bartlett, Gwen Littlewort, Esra Vural, Jake Whitehill, Tingfan Wu, Kang Lee, and Javier Movellan
The computer vision field has advanced to the point that we can now begin to apply automatic systems for recognizing facial expressions to important research questions in behavioral science. One of the major outstanding challenges has been to achieve robust performance with spontaneous expressions. Systems that performed well on highly controlled datasets often performed poorly when tested on a different dataset with different image conditions, and especially had trouble generalizing to spontaneous expressions, which tend to have much greater noise from numerous causes. A major milestone of recent years is that systems for automatic recognition of facial expressions are now able to measure spontaneous expressions with some success. This chapter describes one such system, the Computer Expression Recognition Toolbox (CERT). CERT will soon be available to the research community (see http://mplab.ucsd.edu). This system was employed in some of the earliest experiments in which spontaneous behavior was analyzed with automated expression recognition to extract previously unknown information about facial expression (Bartlett et al., 2008). These experiments are summarized in this chapter. The experiments measured facial behavior associated with faked versus genuine expressions of pain, with driver drowsiness, and with the perceived difficulty of a video lecture. The analysis revealed facial behaviors in these conditions that were previously unknown, including the coupling of movements such as eye openness with brow raise during driver drowsiness. Automated classifiers were able to differentiate real from fake pain significantly better than naive human subjects, and to detect driver drowsiness with above 98% accuracy. Another experiment showed that facial expression was able to predict the perceived difficulty of a video lecture and the preferred presentation speed (Whitehill, Bartlett, & Movellan, 2008). CERT is also being employed in a project to give children with autism feedback on their production of facial expressions. A prototype game, called SmileMaze, requires the player to produce a smile to enable a character to pass through doors and obtain rewards (Cockburn et al., 2008).
These are among the first generation of research studies to apply fully automated measurement of facial expression to research questions in behavioral science. Tools for automatic measurement of expressions will bring about paradigmatic shifts in a number of fields by making facial expression more accessible as a behavioral measure. Moreover, automated analysis of facial expressions will enable investigations into facial expression dynamics that were previously intractable using human coding because of the time required to code intensity changes.
The Facial Action Coding System
The Facial Action Coding System (FACS; Ekman & Friesen, 1978) is arguably the most widely used method for coding facial expressions in the behavioral sciences. The system describes facial expressions in terms of forty-six component movements, which roughly correspond to the movements of individual facial muscles. An example is shown in figure 14.1. FACS provides an objective and comprehensive way to separate expressions into elementary components, analogous to the decomposition of speech into phonemes. Because it is comprehensive, FACS has proven useful for discovering facial movements that are indicative of cognitive and affective states.
Figure 14.1 Sample facial actions from the Facial Action Coding System incorporated in the Computer Expression Recognition Toolbox (CERT). CERT includes 30 action units total.
See Ekman and Rosenberg (2005) for a review of facial expression studies using FACS. The primary limitation to the widespread use of FACS is the time required to code the expressions. FACS was developed for coding by hand, using human experts. It takes over 100 hours of training to become proficient in FACS, and it takes approximately 2 hours for human experts to code each minute of video. The authors have been developing methods for fully automating the Facial Action Coding System (e.g., Donato, Bartlett, Hager, Ekman, & Sejnowski, 1999; Bartlett et al., 2006). In this chapter we apply a computer vision system trained to automatically detect facial action units to data-mine facial behavior and to extract facial signals associated with particular states, including (1) real versus fake pain, (2) driver fatigue, and (3) easy versus difficult video lectures.
Spontaneous Expressions
The machine learning system presented here was trained on spontaneous facial expressions. The importance of using spontaneous behavior for developing and testing computer vision systems becomes apparent when we examine the neurological substrate for producing facial expressions. There are two distinct neural pathways that mediate facial expressions, each originating in a different area of the brain. Volitional facial movements originate in the cortical motor strip, whereas spontaneous facial expressions originate in the subcortical areas of the brain (see Rinn, 1984, for a review). These two pathways have different patterns of innervation on the face, with the cortical system tending to give stronger innervation to certain muscles primarily in the lower face, while the subcortical system tends to more strongly innervate certain muscles primarily in the upper face (e.g., Morecraft, Louie, Herrick, & Stilwell-Morecraft, 2001). The facial expressions mediated by these two pathways differ both in which facial muscles are moved and in their dynamics (Ekman, 2001; Ekman & Rosenberg, 2005). Subcortically initiated facial expressions (the spontaneous group) are characterized by synchronized, smooth, symmetrical, consistent, and reflexlike facial muscle movements, whereas cortically initiated facial expressions (posed expressions) are subject to volitional real-time control and tend to be less smooth, with more variable dynamics (Rinn, 1984; Frank, Ekman, & Friesen, 1993; Schmidt, Cohn, & Tian, 2003; Cohn & Schmidt, 2004). Given the two different neural pathways for producing facial expressions, it is reasonable to expect to find differences between genuine and posed expressions of states such as pain or drowsiness. Moreover, it is crucial that a computer vision model for detecting states such as genuine pain or driver drowsiness be based on machine learning from expression samples collected while the subject is actually experiencing the state in question.
The Computer Expression Recognition Toolbox
The Computer Expression Recognition Toolbox, developed at the University of California, San Diego, is a fully automated system that analyzes facial expressions in real time. CERT is based on 15 years of experience in automated recognition of facial expression (e.g., Bartlett et al., 1996, 1999, 2006; Donato et al., 1999; Littlewort, Bartlett, Fasel, Susskind, & Movellan, 2006). This line of work originated from a collaboration between Sejnowski and Ekman, in response to a National Science Foundation planning workshop on understanding automated facial expression (Ekman, Levenson, & Friesen, 1983). The present system automatically detects frontal faces in the video stream and codes each frame with respect to forty continuous dimensions, including basic expressions of anger, disgust, fear, joy, sadness, surprise, and contempt; a continuous measure of head pose (yaw, pitch, and roll); and thirty facial action units from the Facial Action Coding System (Ekman & Friesen, 1978). See figure 14.2.
System Overview
The technical approach to CERT is an appearance-based discriminative approach. Such approaches have proven highly robust and fast for detecting and tracking faces (e.g., Viola & Jones, 2004). Appearance-based discriminative approaches do not suffer from initialization and drift, which present challenges for state-of-the-art tracking algorithms, and they take advantage of the rich appearance-based information in facial expression images. This class of approaches achieves a high level of robustness through the use of very large datasets for machine learning. It is important that the training set be similar to the proposed applications in terms of noise. A detailed analysis of machine learning methods for robust detection of one facial expression, smiles, is provided in Whitehill, Littlewort, Fasel, Bartlett, & Movellan (2009). The design of CERT is as follows. Face detection and detection of internal facial features are first performed on each frame using boosting techniques in a generative framework (Fasel, Fortenberry, & Movellan, 2005), extending work by Viola and Jones (2004). The automatically located faces then undergo a 2D alignment by computing a fast least-squares fit between the detected feature positions and a six-feature face model. The least-squares alignment allows rotation, scale, and shear. The aligned face image is then passed through a bank of Gabor filters of eight orientations and nine spatial frequencies (two to thirty-two pixels per cycle at half-octave steps). Output magnitudes are then normalized and passed to facial action classifiers. Facial action detectors were developed by training separate support vector machines to detect the presence or absence of each facial action.
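To make the processing stages concrete, the sketch below shows a generic appearance-based pipeline of this kind, implemented with OpenCV and scikit-learn. The Gabor bank follows the eight-orientation, nine-frequency description above, but the kernel size, bandwidth choice, and use of a linear SVM are illustrative assumptions rather than the CERT implementation, and face detection and alignment are assumed to have been done already.

```python
# Sketch of an appearance-based facial action detection pipeline:
# aligned face -> Gabor filter bank -> normalized magnitudes -> per-AU SVM.
# Parameters are illustrative assumptions, not the CERT implementation.
import cv2
import numpy as np
from sklearn.svm import SVC

def gabor_bank(n_orient=8,
               wavelengths=(2, 2.83, 4, 5.66, 8, 11.31, 16, 22.63, 32)):
    """Bank of Gabor kernels: 8 orientations x 9 wavelengths (pixels per cycle)."""
    kernels = []
    for theta in np.arange(n_orient) * np.pi / n_orient:
        for lam in wavelengths:
            kernels.append(cv2.getGaborKernel(ksize=(31, 31), sigma=0.56 * lam,
                                              theta=theta, lambd=lam,
                                              gamma=1.0, psi=0))
    return kernels

def gabor_magnitudes(aligned_face, kernels):
    """Filter an aligned grayscale face and return normalized response magnitudes."""
    face = aligned_face.astype(np.float32)
    feats = [np.abs(cv2.filter2D(face, cv2.CV_32F, k)).ravel() for k in kernels]
    v = np.concatenate(feats)
    return v / (np.linalg.norm(v) + 1e-8)

def train_au_detectors(aligned_faces, au_labels):
    """Train one linear SVM per action unit.

    aligned_faces: list of aligned grayscale face images (2D arrays).
    au_labels: dict mapping AU name -> binary array (present/absent per image)."""
    kernels = gabor_bank()
    X = np.array([gabor_magnitudes(f, kernels) for f in aligned_faces])
    return {au: SVC(kernel="linear").fit(X, y) for au, y in au_labels.items()}
```

In CERT the frame-by-frame output is the SVM margin; the analogous continuous score in scikit-learn comes from decision_function rather than a hard predict.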
Figure 14.2 (a) Overview of CERT system design. (b) Example of CERT running on live video. Each subplot has time on the horizontal axis; the vertical axis indicates the intensity of a particular facial movement.
The training set consisted of over 10,000 images that were coded for facial actions from the Facial Action Coding System, including over 5,000 examples from spontaneous expressions. In previous work we conducted empirical investigations of machine learning methods applied to the related problem of classifying expressions of basic emotions. We compared image features (e.g., Donato et al., 1999), classifiers such as AdaBoost, support vector machines, and linear discriminant analysis, as well as feature selection techniques (Littlewort et al., 2006). The best results were obtained by selecting a subset of Gabor filters using AdaBoost and then training support vector machines on the outputs of the filters selected by AdaBoost. An overview of the system is shown in figure 14.2a.
Benchmark Performance
In this chapter, performance for expression detection is assessed using a measure from signal detection theory: the area under the receiver operating characteristic (ROC) curve (A′). The ROC curve is obtained by plotting true positives against false positives as the decision threshold for deciding an expression is present shifts from high (0 detections and 0 false positives) to low (100% detections and 100% false positives). A′ ranges from 0.5 (chance) to 1 (perfect discrimination). We employ A′ instead of percent correct because percent correct can change for the same system depending on the proportion of targets and nontargets in a given test set. A′ can be interpreted as percent correct on a two-alternative forced-choice task in which two images are presented on each trial and the system must decide which of the two is the target. Performance on a benchmark dataset (Cohn-Kanade) is state of the art for recognition of both basic emotions and facial actions. Performance for expressions of basic emotion was 0.98 area under the ROC for detection (1 versus all) across seven expressions of basic emotion, and 93% correct for a seven-alternative forced choice. This is the highest performance reported to our knowledge on this benchmark dataset. Performance for recognizing facial actions from the Facial Action Coding System was 0.93 mean area under the ROC for posed facial actions. (This was the mean detection performance across twenty facial action detectors.) Recognition of spontaneous facial actions was tested on the RU–FACS dataset (Bartlett et al., 2006). This dataset is an interview setting containing speech. Performance was tested for thirty-three subjects with 4 minutes of continuous video each. The mean area under the ROC for detection of eleven facial actions was 0.84. System outputs consist of the margin of the support vector machine (the distance to the separating hyperplane between the two classes) for each frame of video. System outputs are significantly correlated with the intensity of the facial action, as measured by FACS expert intensity codes (Bartlett et al., 2006), and also as measured by naive observers who turned a dial while watching continuous video (Whitehill et al., 2009).
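As a concrete illustration of the A′ statistic used throughout this chapter, the following sketch computes the area under the ROC curve from frame-level detector scores with scikit-learn; the variable names and the toy data are assumptions for illustration only.

```python
# A' = area under the ROC curve, computed from continuous detector outputs.
import numpy as np
from sklearn.metrics import roc_auc_score

def a_prime(scores, labels):
    """scores: per-frame detector outputs (e.g., SVM margins).
    labels: 1 if the target expression is present in the frame, else 0.
    The result equals the probability that a randomly chosen target frame
    scores higher than a randomly chosen nontarget frame, i.e., percent
    correct in a two-alternative forced choice."""
    return roc_auc_score(labels, scores)

# Illustrative check: random scores give A' near 0.5 (chance).
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)
print(a_prime(rng.normal(size=1000), labels))
```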
Figure 14.3 Datamining human behavior. CERT is applied to face videos containing spontaneous expressions of the states in question. Machine learning is applied to the outputs of CERT to train a classifier to automatically discriminate state A from state B.
Thus the frame-by-frame intensities provide information on the dynamics of facial expression at temporal resolutions that were previously impractical with manual coding. There is also preliminary evidence from Jim Tanaka's laboratory of concurrent validity with EMG. CERT outputs significantly correlated with EMG measures of zygomatic and corrugator activity despite the visibility of the electrodes in the video processed by CERT.
A Second-Layer Classifier to Detect Internal States
The overall CERT system gives a frame-by-frame output with N channels, consisting of N facial action units. This system can be applied to datamine human behavior. By applying CERT to a face video while subjects experience spontaneous expressions of a given state, we can learn new things about the facial behaviors associated with that state. Also, by passing the N-channel output to a machine learning system, we can directly train detectors for the specific state in question. See figure 14.3.
Real versus Faked Expressions of Pain
The ability to distinguish real pain from faked pain (malingering) is an important issue in medicine (Fishbain, Cutler, Rosomoff, & Rosomoff, 2006). Naive human subjects perform near chance in differentiating real from fake pain when observing facial
expressions (e.g., Hadjistavropoulos, Craig, Hadjistavropoulos, & Poole, 1996). In the absence of direct training in facial expressions, clinicians are also poor at assessing pain using the face (e.g., Prkachin, Schultz, Berkowitz, Hughes, & Hunt, 2002; Prkachin, Solomon, & Ross, 2007; Grossman, Shielder, Swedeen, & Mucenski, 1991). However, a number of studies using the Facial Action Coding System (Ekman & Friesen, 1978) have shown that information exists in the face for differentiating real from posed pain (e.g., Hill and Craig, 2002; Craig, Hyde, & Patrick, 1991; Prkachin, 1992). In fact, if subjects receive corrective feedback, their performance improves substantially (Hill & Craig, 2004). Thus it appears that a signal is present but that most people don't know what to look for. This study explored the application of a system for automatically detecting facial actions to this problem. The goal of this work was to (1) assess whether the automated measurements with CERT were consistent with expression measurements obtained by human experts, and (2) develop a classifier to automatically differentiate real from faked pain in a subject-independent manner from the automated measurements. In this study, participants were videotaped under three experimental conditions: baseline, posed pain, and real pain. We employed a machine learning approach in a two-stage system. In the first stage, the video was passed through a system for detecting facial actions from the Facial Action Coding System (Bartlett et al., 2006). These data were then passed to a second machine learning stage in which a classifier was trained to detect the difference between expressions of real pain and fake pain. Naive human subjects were tested on the same videos to compare their ability to differentiate faked from real pain. The ultimate goal of this work is not the detection of malingering per se, but rather to demonstrate the ability of the automated system to detect facial behavior that the untrained eye might fail to interpret, and to differentiate types of neural control of the face. It holds out the prospect of illuminating basic questions pertaining to the behavioral fingerprint of neural control systems and thus opens many future lines of inquiry.
Video Data Collection
Video data were collected from twenty-six human subjects during real pain, faked pain, and baseline conditions. The subjects were university students, six men and twenty women. The pain condition consisted of cold pressor pain induced by immersing the arm in ice water at 5 °C. For the baseline and faked pain conditions, the water was 20 °C. The subjects were instructed to immerse their forearm into the water up to the elbow and hold it there for 60 seconds in each of the three conditions. For the faked-pain condition, the subjects were asked to manipulate their facial expressions so that an "expert would be convinced they were in actual
Figure 14.4 Sample facial behavior and facial action codes from the (a) faked and (b) real pain conditions.
pain." In the baseline condition, subjects were instructed to display their expressions naturally. Participants' facial expressions were recorded using a digital video camera during each condition. Examples are shown in figure 14.4. For the twenty-six subjects analyzed here, the order of the conditions was baseline, faked pain, and then real pain. Another twenty-two subjects received the counterbalanced order: baseline, real pain, then faked pain. Because repeating facial movements that were experienced minutes before differs from faking facial expressions without an immediate pain experience, the two conditions were analyzed separately. The following analysis focuses on the condition in which the subjects faked first. After the videos were collected, a set of 170 naive observers were shown the videos and asked to guess whether each video contained faked or real pain. The subjects were undergraduates with no explicit training in measuring facial expression. They were primarily introductory psychology students at the University of California, San Diego. The mean accuracy of naive human subjects in discriminating fake from real pain in these videos was at chance, 49.1% (standard deviation 13.7%).
Characterizing the Difference between Real and Faked Expressions of Pain
The computer expression recognition toolbox was applied to the three 1-minute videos of each subject. The following set of twenty facial actions was detected for
Table 14.1
Z-score differences of the three pain conditions, averaged across subjects

A. Real Pain vs. Baseline
Action unit    25    12    9     26    10    6
Z-score        1.4   1.4   1.3   1.2   0.9   0.9

B. Faked Pain vs. Baseline
Action unit    4     DB    1     25    12    6     26    10    FB    9     20    7
Z-score        2.7   2.1   1.7   1.5   1.4   1.4   1.3   1.3   1.2   1.1   1.0   0.9

C. Real Pain vs. Faked Pain
Action unit           4     DB    1
Z-score difference    1.8   1.7   1.0

Note: FB, fear brow (1+2+4). DB, distress brow (1, 1+4).
each frame [1 2 4 5 6 7 9 10 12 14 15 17 18 20 23 24 25 26 1+4 1+2+4]. (A version of CERT was used that detected only these 20 AUs.) This produced a twenty-channel output stream consisting of one real value for each learned AU, for each frame of the video. We first assessed which AU outputs contained information about genuine pain expressions and faked pain expressions, and which showed differences between genuine and faked pain. The results were compared with studies that employed expert human coding.
Real Pain versus Baseline
We first examined which facial action detectors were elevated in real pain compared with the baseline condition. Z-scores for each subject and each AU detector were computed as Z = (x − m)/s, where m and s are the mean and standard deviation of the output over frames 100–1,100 of the baseline condition (warm water, no faked expressions). The mean difference in Z-score between the baseline and pain conditions was computed across the twenty-six subjects. Table 14.1 shows the action detectors with the largest difference in Z-scores. We observed that the actions with the largest Z-scores for genuine pain were mouth opening and jaw drop (25 and 26), lip corner puller by the zygomatic muscle (12), nose wrinkle (9), and to a lesser extent, lip raise (10) and cheek raise (6). These facial actions have been previously associated with cold pressor pain (e.g., Prkachin, 1992; Craig & Patrick, 1985).
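A minimal sketch of this Z-score analysis is given below, assuming the CERT outputs for one subject are stored as frame-by-AU arrays; the array shapes and variable names are assumptions, not the authors' code.

```python
# Per-AU Z-scores relative to the baseline condition, as described above.
import numpy as np

def au_zscores(condition_outputs, baseline_outputs):
    """Mean Z-score per AU for one subject and one condition.

    condition_outputs, baseline_outputs: arrays of shape (n_frames, n_aus)
    holding frame-by-frame CERT outputs. Baseline statistics are taken from
    frames 100-1,100 of the baseline (warm water) condition."""
    base = baseline_outputs[100:1100]
    m = base.mean(axis=0)
    s = base.std(axis=0)
    z = (condition_outputs - m) / s      # Z-score every frame, per AU
    return z.mean(axis=0)                # one value per AU

def mean_z_across_subjects(pain_outputs, baseline_outputs):
    """Average the per-AU Z-scores over subjects (lists of per-subject arrays).
    Baseline frames score near 0 by construction, so this is effectively the
    Z-score difference between the pain and baseline conditions."""
    return np.mean([au_zscores(p, b)
                    for p, b in zip(pain_outputs, baseline_outputs)], axis=0)
```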
Faked Pain versus Baseline
The Z-score analysis was next repeated for faked versus baseline expressions. We observed that in faked pain there was relatively more facial activity than in real pain. The facial action outputs with the highest Z-scores for faked pain relative to baseline were brow lower (4), distress brow (1 or 1+4), inner brow raise (1), mouth open and jaw drop (25 and 26), cheek raise (6), lip raise (10), fear brow (1+2+4), nose wrinkle (9), mouth stretch (20), and lower lid raise (7).
Real versus Faked Pain
Differences between real and faked pain were examined by computing the difference of the two Z-scores. Differences were observed primarily in the outputs of action unit 4 (brow lower), as well as distress brow (1 or 1+4) and inner brow raise (1 in any combination). There was considerable variation among subjects in the difference between their faked and real pain expressions. However, the most consistent finding is that nine of the twenty-six subjects showed significantly more brow-lowering activity (AU 4) during the faked pain condition, whereas none of the subjects showed significantly more AU 4 activity during the real pain condition. Also, seven subjects showed more cheek raise (AU 6), and six subjects showed more inner brow raise (AU 1) and the fear brow combination (1+2+4). The next most common differences were to show more 12, 15, 17, and distress brow (1 alone or 1+4) during faked pain. Paired t-tests were conducted for each AU to assess whether it was a reliable indicator of genuine versus faked pain in a within-subjects design. Of the twenty actions tested, the difference was statistically significant for three actions: it was highly significant for AU 4 (p < 0.001), and marginally significant for AU 7 and distress brow (p < 0.05).
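The within-subjects test described above can be sketched as follows, assuming one summary Z-score per subject and AU for each condition; this uses scipy's paired t-test and is an illustration, not the original analysis script.

```python
# Paired t-tests per AU on subject-level scores (real vs. faked pain).
from scipy import stats

def paired_au_tests(z_real, z_fake, au_names):
    """z_real, z_fake: arrays of shape (n_subjects, n_aus), one summary score
    per subject and AU for the real-pain and faked-pain conditions."""
    results = {}
    for j, au in enumerate(au_names):
        t, p = stats.ttest_rel(z_real[:, j], z_fake[:, j])
        results[au] = (t, p)
    return results  # e.g., inspect which AUs reach p < 0.05
```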
Comparison with Human Expert Coding
The findings from the automated system were first compared with previous studies that used manual FACS coding by human experts. Overall, the outputs of the automated system showed patterns similar to previous studies of real and faked pain.
Real Pain
In previous studies using manual FACS coding by human experts, at least twelve facial actions showed significant relationships with pain across multiple studies and pain modalities. Of these, the ones specifically associated with cold pressor pain were 4, 6, 7, 9, 10, 12, 25, and 26 (Craig & Patrick, 1985; Prkachin, 1992). Agreement of the automated system with the human coding studies was computed as follows. First, a superset of the AUs tested in the two cold pressor pain studies was created. AUs that were significantly elevated in either study were assigned a 1 and otherwise a 0. This vector was then correlated against the findings for the twenty AUs measured by the automated system: AUs with the highest Z-scores, shown in table 14.1, A, were assigned a 1, and the others a 0. Only AUs that were measured both by the automated system and by at least one of the human coding studies were included in the correlation analysis. Agreement computed in this manner was 90% for AUs associated with real pain.
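One way to read the agreement computation is as a simple percent match between two binary AU indicators, restricted to the AUs measured by both sources. The sketch below shows that reading with made-up AU sets; it is an interpretation of the description above, not the exact procedure.

```python
# Percent agreement between automated and human-coded AU sets (toy example).
def percent_agreement(automated_aus, human_aus, aus_measured_by_both):
    """Percent of AUs on which the two sources agree about elevation."""
    matches = [(au in automated_aus) == (au in human_aus)
               for au in aus_measured_by_both]
    return 100.0 * sum(matches) / len(matches)

# Made-up example sets, for illustration only:
print(percent_agreement({6, 9, 10, 12, 25, 26},
                        {4, 6, 7, 9, 10, 12, 25, 26},
                        [4, 6, 7, 9, 10, 12, 15, 20, 25, 26]))
```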
Faked Pain
A study of faked pain in adults showed elevation of the following AUs: 4, 6, 7, 10, 12, 25 (Craig et al., 1991). A study of faked pain in children aged 8–12 (LaRochette, Chambers, & Craig, 2006) observed significant elevation in the
following AUs for fake pain relative to baseline: 1 4 6 7 10 12 20 23 25 26. These findings again match the AUs with the highest Z-scores in the automated system output of the present study, as shown in table 14.1, B. (The two human coding studies did not measure AU 9 or the brow combinations.) Agreement of the computer vision findings with these two studies was 85%.
Real versus Faked Pain
Exaggerated activity of the brow lower (AU 4) during faked pain is consistent with previous studies in which the real pain condition was exacerbated lower back pain (Craig et al., 1991; Hill & Craig, 2002). Only one other study looked at real versus faked pain in which the real pain condition was cold pressor pain. This was a study with children aged 8–12 (LaRochette et al., 2006). When faked pain expressions were compared with real cold pressor pain in children, LaRochette et al. found significant differences in AUs 1 4 7 10. Again, the findings of the present study using the automated system are similar, as the AU channels with the highest Z-scores were 1, 4, and 1+4 (table 14.1, C), and the t-tests were significant for 4, 1+4, and 7. Agreement of the automated system with the human coding findings was 90%.
Human FACS Coding of Video in This Study
In order to further assess the validity of the automated system findings, we obtained FACS codes for a portion of the video data employed in this study. The codes were obtained by an expert coder certified in the Facial Action Coding System. For each subject, the last 500 frames of the fake pain and real pain conditions were FACS coded (about 15 seconds each). It took 60 man-hours over the course of more than 3 months to collect the human codes, since human coders can code only up to 2 hours per day before accuracy suffers and coder burnout sets in. The sum of the frames containing each action unit was computed for each subject and condition, as well as a weighted sum in which frames were weighted by the intensity of the action on a 1–5 scale. To investigate whether any action units successfully differentiated real from faked pain, paired t-tests were computed on each action unit. (Tests on the specific brow-region combinations 1+2+4 and 1, 1+4 have not yet been conducted.) The one action unit that significantly differentiated the two conditions was AU 4, brow lower (p < 0.01), for both the sum and weighted-sum measures. This finding is consistent with the analysis of the automated system, which also found action unit 4 most discriminative.
Automatic Discrimination of Real from Fake Pain
We next turned to the problem of automatically discriminating genuine from faked pain expressions using the facial action output stream. This section describes the second machine learning stage in which a classifier was trained to discriminate genuine
from faked pain in the output of the twenty facial action detectors. The task was to perform subject-independent classification. If the task were simply to detect the presence of a red-flag set of facial actions, then differentiating fake from real pain expressions would be relatively simple. However, it should be noted that subjects display actions such as AU 4, for example, in both real and fake pain, and the distinction is in the magnitude and duration of AU 4. Also, there is intersubject variation in expressions of both real and fake pain; there may be combinatorial differences in the sets of actions displayed during real and fake pain; and the subjects may cluster. We therefore applied machine learning to the task of discriminating real from faked pain expressions. A classifier was trained to discriminate genuine pain from faked pain based on the CERT output. The input to this classifier consisted of the twenty facial action detector outputs from the full minute of video in each condition. Before training the classifier, we developed a representation that summarized aspects of the dynamics of the facial behavior. The frame-by-frame AU detection data were integrated into temporal "events." We applied temporal filters at eight different fundamental frequencies to the AU output stream using difference-of-Gaussian convolution kernels. The half-width of the positive Gaussian (σ) ranged from 8 to 90.5 frames at half-octave intervals (0.25 to 1.5 seconds), and the size of the negative Gaussian was fixed at 2σ. Zero-crossings of the filter outputs were used to segment the output into positive and negative regions. The integral of each region was then computed (see figure 14.5). The distributions of the positive events were then characterized using histograms. These histograms of "events" comprised the input representation used to train a nonlinear SVM with Gaussian kernels. The input therefore consisted of twenty histograms, one for each AU. The system was trained using cross-validation on the twenty-six subject videos. In the cross-validation approach, the system was trained and tested twenty-six times, each time using data from twenty-five subjects for parameter estimation and reserving a different subject for testing. This provided an estimate of subject-independent detection performance. The percentage of correct two-alternative forced choices of fake versus real pain on new subjects was 88%. This was significantly higher than the performance of naive human subjects, who obtained a mean accuracy of 49% correct for discriminating faked from real pain on the same set of videos. Performance using the integrated event representation was also considerably higher than an alternative representation that did not take temporal dynamics into account (see Littlewort, Bartlett, & Lee, 2009, for details). The integrated event representation thus contains useful dynamic information that allows more accurate behavioral analysis, suggesting that this decision task depends not only on which subset of AUs is present at which intensity, but also on the duration and number of AU events.
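A sketch of this integrated-event representation and the subject-independent evaluation is given below. The difference-of-Gaussian scales follow the description above, but the histogram binning, SVM settings, and data layout are assumptions rather than the published implementation.

```python
# Sketch: DoG temporal filtering -> zero-crossing segmentation -> event
# integrals -> per-AU histograms -> Gaussian-kernel SVM, evaluated with
# leave-one-subject-out cross-validation. Settings are illustrative.
import numpy as np
from scipy.ndimage import gaussian_filter1d
from sklearn.svm import SVC

SIGMAS = 8.0 * 2.0 ** (np.arange(8) / 2.0)   # 8 to ~90.5 frames, half-octave steps

def dog_filter(signal, sigma):
    """Difference of Gaussians: positive kernel sigma, negative kernel 2*sigma."""
    return gaussian_filter1d(signal, sigma) - gaussian_filter1d(signal, 2 * sigma)

def positive_event_areas(filtered):
    """Integral of the filtered signal over each positive region between zero crossings."""
    boundaries = np.where(np.diff(np.sign(filtered)) != 0)[0] + 1
    return [seg.sum() for seg in np.split(filtered, boundaries) if seg.sum() > 0]

def event_histogram(au_signal, bins):
    """Histogram of positive event sizes, pooled over all temporal scales."""
    areas = []
    for sigma in SIGMAS:
        areas.extend(positive_event_areas(dog_filter(au_signal, sigma)))
    return np.histogram(areas, bins=bins)[0]

def video_features(au_matrix, bins=np.linspace(0.0, 50.0, 11)):
    """Concatenate one event histogram per AU channel for a whole video.
    au_matrix: array of shape (n_frames, n_aus); bin edges are an arbitrary choice."""
    return np.concatenate([event_histogram(au_matrix[:, j], bins)
                           for j in range(au_matrix.shape[1])])

def leave_one_subject_out_accuracy(X, y, groups):
    """Hold out all samples (e.g., the real and faked videos) of one subject,
    train on the rest, and count correct predictions on the held-out samples."""
    X, y, groups = np.asarray(X), np.asarray(y), np.asarray(groups)
    correct = 0
    for g in np.unique(groups):
        test = groups == g
        clf = SVC(kernel="rbf").fit(X[~test], y[~test])
        correct += int(np.sum(clf.predict(X[test]) == y[test]))
    return correct / len(y)
```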
Figure 14.5 Example of integrated event representation for one subject and one frequency band for AU 4. Dotted line: Raw CERT output for AU 4. Dashed line: DOG filtered signal for one frequency band. Dotted/dashed line: area under each curve.
Discussion of Pain Study
The field of automatic analysis of facial expression has advanced to the point that we can begin to use it to address research questions in behavioral science. Here we describe a pioneering effort to apply fully automated coding of facial action to the problem of differentiating fake from real expressions of pain. Whereas naive human subjects were only at 49% accuracy for distinguishing fake from real pain, the automated system obtained 0.88 area under the ROC, which is equivalent to 88% correct on a two-alternative forced choice. Moreover, the pattern of results in terms of which facial actions may be involved in real pain, fake pain, and differentiating real from fake pain is similar to previous findings in the psychology literature using manual FACS coding. Here we applied machine learning on a twenty-channel output stream of facial action detectors. The machine learning was applied to samples of spontaneous expressions during the state in question, which here was fake versus real pain. The same approach can be applied to learn about other states given a set of
spontaneous expression samples. The following section develops another example in which this approach is applied to the detection of driver drowsiness from facial expression.
Automatic Detection of Driver Fatigue
The U.S. National Highway Traffic Safety Administration (NHTSA) estimates that in the United States alone approximately 100,000 crashes each year are caused primarily by driver drowsiness or fatigue (DOT, 2001). In fact, the NHTSA has concluded that drowsy driving is just as dangerous as drunk driving. Thus, methods to automatically detect drowsiness may help save many lives. There are a number of techniques for analyzing driver drowsiness. One set places sensors on standard vehicle components, e.g., the steering wheel or gas pedal, and analyzes the signals sent by these sensors to detect drowsiness (Takei & Furukawa, 2005). It is important for such techniques to be adapted to the driver, since Abut and his colleagues note that there are noticeable differences among drivers in the way they use the gas pedal (Igarashi, Takeda, Itakura, & Abut, 2005). A second set of techniques focuses on measurement of physiological signals such as heart rate, pulse rate, and electroencephalography (EEG) (e.g., Cobb, 1983). Researchers have reported that as the alertness level decreases, EEG power in the alpha and theta bands increases (Hong & Chung, 2005), thereby providing an indicator of drowsiness. However, this method has practical drawbacks, since it requires a person to wear an EEG cap while driving. A third set of solutions focuses on computer vision systems that can detect and recognize the facial motion and appearance changes occurring during drowsiness (Gu & Ji, 2004; Gu, Zhang, & Ji, 2005; Zhang & Zhang, 2006). The advantage of computer vision techniques is that they are noninvasive and thus more amenable to use by the general public. Most previous computer vision approaches to fatigue detection make prior assumptions about the relevant behavior, focusing on blink rate, eye closure, and yawning. Here we employ machine learning methods to datamine actual human behavior during drowsiness episodes. The objective of this study was to discover which facial configurations are predictors of fatigue. In this study, facial motion was analyzed automatically from video using the Computer Expression Recognition Toolbox to code facial actions from the Facial Action Coding System.
Driving Task
The subjects played a driving video game on a Windows machine using a steering wheel (the Thrustmaster Ferrari Racing Wheel) and an open-source multiplatform video game (The Open Racing Car Simulator, TORCS). See figure 14.6. They were
Figure 14.6 Driving simulation task. Top: Screen shot of driving game. Center: Steering wheel. Bottom: Image of a subject in a drowsy state (left), and falling asleep (right).
instructed to keep the car as close to the center of the road as possible as they drove a simulation of a winding road. At random times, a wind effect was applied that dragged the car to the right or left, forcing the subject to correct the position of the car. This type of manipulation had been found in the past to increase fatigue (Orden, Jung, & Makeig, 2000). Driving speed was held constant. Four subjects performed the driving task over a 3-hour period beginning at midnight. During this time the subjects fell asleep multiple times, thus crashing their vehicles. Episodes in which the car left the road (crashes) were recorded. A video of the subject's face was recorded with a digital video camera for the entire 3-hour session. In addition to measuring facial expressions with CERT, head movement was measured using an accelerometer placed on a headband, and steering wheel movement data were also collected. The accelerometer had 3 degrees of freedom, consisting of three one-dimensional accelerometers mounted at right angles, measuring accelerations in the range of −5 g to +5 g, where g represents the earth's gravitational force.
Facial Actions Associated with Driver Fatigue
The data from the subjects were partitioned into drowsy (nonalert) and alert states as follows. The 1 minute preceding a sleep episode or a crash was identified as a nonalert state. There was a mean of twenty-four nonalert episodes per subject, with each subject contributing between nine and thirty-five nonalert samples. Fourteen alert segments for each subject were collected from the first 20 minutes of the driving task. The output of the facial action detector consisted of a continuous value for each facial action and each video frame, which was the distance to the separating hyperplane, i.e., the margin. Histograms for two of the action units in alert and nonalert states are shown in figure 14.7. The area under the ROC (A′) was computed for the outputs of each facial action detector to see to what degree the alert and nonalert output distributions were separated. In order to understand how each action unit is associated with drowsiness across different subjects, multinomial logistic ridge regression (MLR) was trained on each facial action individually. Examination of the A′ for each action unit reveals the degree to which each facial movement was able to predict drowsiness in this study. The A′ values for the drowsy and alert states are shown in table 14.2. The five facial actions that were the most predictive of drowsiness by increasing in drowsy states were 45 (blink/eye closure), 2 (outer brow raise), 15 (lip corner depressor or frown), 17 (chin raise), and 9 (nose wrinkle). The five actions that were the most predictive of drowsiness by decreasing in drowsy states were 12 (smile), 7 (lid tighten), 39 (nostril compress), 4 (brow lower), and 26 (jaw drop). The high predictive ability of the blink/eye closure measure was expected. However, the predictability of the outer brow raise (AU 2) was previously unknown. We observed during this study that many subjects raised their eyebrows in an attempt to keep their eyes open, and the strong association of the AU 2 detector is consistent with that observation. Also of note is that action 26, jaw drop, which occurs during yawning, actually occurred less often in the critical 60 seconds prior to a crash. This is consistent with the prediction that yawning does not tend to occur in the final moments before falling asleep.
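This per-action analysis can be sketched as a ranking of AU channels by how well each separates drowsy from alert frames, with A′ folded around 0.5 to indicate whether the action increases or decreases when drowsy; the input layout is an assumption.

```python
# Rank AU channels by drowsy-vs-alert separability (area under the ROC).
from sklearn.metrics import roc_auc_score

def rank_aus_by_drowsiness(outputs, drowsy, au_names):
    """outputs: (n_frames, n_aus) CERT margins pooled over episodes.
    drowsy: per-frame labels, 1 = drowsy (nonalert), 0 = alert."""
    ranking = []
    for j, au in enumerate(au_names):
        a = roc_auc_score(drowsy, outputs[:, j])
        # a < 0.5 means the action decreases when drowsy; fold and record direction.
        direction = "more when drowsy" if a >= 0.5 else "less when drowsy"
        ranking.append((au, max(a, 1.0 - a), direction))
    return sorted(ranking, key=lambda r: r[1], reverse=True)
```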
Predicting Driver Fatigue
The ability to predict drowsiness in novel subjects from the facial action code was then tested by running MLR on the full set of facial action outputs. Prediction performance was tested by using a leave-one-out cross-validation procedure in which one subject's data were withheld from the MLR training and retained for testing, and the test was repeated for each subject. The data for each subject by facial action were first normalized to zero-mean and unit standard deviation. The MLR output
Figure 14.7 Example histograms for (a) blink and (b) action unit 2 in alert and nonalert states for one subject. A′ is the area under the ROC. The x-axis is the CERT output.
Table 14.2
MLR model for predicting drowsiness across subjects, showing performance for each facial action

More when critically drowsy                 Less when critically drowsy
AU   Name                    A′             AU   Name                 A′
45   Blink/eye closure       0.94           12   Smile                0.87
2    Outer brow raise        0.81           7    Lid tighten          0.86
15   Lip corner depressor    0.80           39   Nostril compress     0.79
17   Chin raiser             0.79           4    Brow lower           0.79
9    Nose wrinkle            0.78           26   Jaw drop             0.77
30   Jaw sideways            0.76           6    Cheek raise          0.73
20   Lip stretch             0.74           38   Nostril dilate       0.72
11   Nasolabial furrow       0.71           14   Dimpler              0.71
1    Inner brow raise        0.68           8    Lips toward          0.67
23   Lip tighten             0.67           27   Mouth stretch        0.66
10   Upper lip raise         0.67           5    Upper lid raise      0.65
18   Lip pucker              0.66           16   Upper lip depress    0.64
24   Lip presser             0.64           22   Lip funneler         0.64
32   Bite                    0.63           19   Tongue show          0.61
Table 14.3
Drowsiness detection performance for novel subjects, using an MLR classifier with different feature combinations

Feature                                A′
AU 45, AU 2, AU 19, AU 26, AU 15       0.9792
All AU features                        0.8954

Note: The weighted features are summed over 12 seconds before computing A′.
for each AU feature was summed over a temporal window of 12 seconds (360 frames) before computing A′. MLR trained on all AU features obtained an A′ of 0.90 for predicting drowsiness in novel subjects. Because prediction accuracy may be enhanced by feature selection, in which only the AUs with the most information for discriminating drowsiness are included in the regression, a second MLR was trained by contingent feature selection, starting with the most discriminative feature (AU 45) and then iteratively adding the next most discriminative feature given the features already selected. These features are shown in table 14.3. The best performance of 0.98 was obtained with five features: 45, 2, 19 (tongue show), 26 (jaw drop), and 15. This five-feature model outperformed the MLR trained on all features.
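A sketch of this kind of greedy forward selection is shown below, with ordinary logistic regression standing in for the MLR classifier and leave-one-subject-out A′ as the selection criterion; the scoring details are assumptions rather than the exact procedure used here.

```python
# Greedy (sequential forward) feature selection scored by cross-validated A'.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneGroupOut, cross_val_predict

def forward_select(X, y, groups, n_features=5):
    """X: (n_samples, n_aus) windowed AU features; y: drowsy/alert labels;
    groups: subject id per sample (for leave-one-subject-out scoring)."""
    cv = LeaveOneGroupOut()

    def auc_for(cols):
        probs = cross_val_predict(LogisticRegression(max_iter=1000),
                                  X[:, cols], y, groups=groups, cv=cv,
                                  method="predict_proba")[:, 1]
        return roc_auc_score(y, probs)

    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_features:
        best = max(remaining, key=lambda j: auc_for(selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected
```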
Figure 14.8 Performance for drowsiness detection in novel subjects over temporal window sizes.
Effect of Temporal Window Length
The performances shown in table 14.3 employed a temporal window of 12 seconds, meaning that the MLR output was summed over 12 seconds (360 frames) for making the classification decision (drowsy or not drowsy). We next examined the effect of the size of the temporal window on performance. Here, the MLR output was summed over windows of N seconds, where N ranged from 0.5 to 60 seconds. The five-feature model was again employed for this analysis. Figure 14.8 shows the area under the ROC for drowsiness detection in novel subjects over a range of temporal window sizes. Performance saturates at about 0.99 as the window size exceeds 30 seconds. In other words, given a 30-second video segment, the system can discriminate sleepy versus nonsleepy segments with 0.99 accuracy across subjects.
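The window-length analysis can be sketched as follows: sum the frame-level classifier output over non-overlapping windows of N seconds and recompute A′ at each window length. The frame rate, window grid, and window-labeling rule below are assumptions.

```python
# A' as a function of the temporal integration window.
import numpy as np
from sklearn.metrics import roc_auc_score

def windowed_a_prime(frame_scores, frame_labels, seconds, fps=30):
    """Sum frame-level classifier output over non-overlapping windows of the
    given length and compute A' on the window-level scores. A window is
    labeled drowsy if most of its frames are."""
    w = int(seconds * fps)
    n = len(frame_scores) // w
    scores = np.asarray(frame_scores[:n * w]).reshape(n, w).sum(axis=1)
    labels = np.asarray(frame_labels[:n * w]).reshape(n, w).mean(axis=1) > 0.5
    return roc_auc_score(labels.astype(int), scores)

# Example sweep (frame_scores and frame_labels assumed available):
# for sec in (0.5, 1, 2, 5, 10, 30, 60):
#     print(sec, windowed_a_prime(frame_scores, frame_labels, sec))
```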
Coupling of Behaviors
This study also revealed information about coupling of behaviors during drowsiness. Certain movements became coupled in the drowsy state that were not coupled in alert states.
Steering and Head Motion
Observation of the subjects during drowsy and nondrowsy states indicated that head motion differed substantially between alert driving and the period when the driver was about to fall asleep. Surprisingly, head motion increased as the driver became drowsy, with large roll motion coupled with the steering motion. Just before falling asleep, the head would become still.
Figure 14.9 Head motion (black) and steering position (gray) for 60 seconds in an alert state (left) and 60 seconds prior to a crash (right). Head motion is the output of the roll dimension of the accelerometer.
We also investigated the coupling of the head and arm motions. Correlations between head motion as measured by the roll dimension of the accelerometer output and the steering wheel motion are shown in figure 14.9. For this subject (subject 2), the correlation between head motion and steering increased from 0.33 in the alert state to 0.71 in the nonalert state. For subject 1, the correlation between head motion and steering similarly increased from 0.24 in the alert state to 0.43 in the nonalert state. The other two subjects showed a smaller coupling effect. Future work includes combining the head motion measures and steering correlations with the facial movement measures in the predictive model for detecting driver drowsiness.
Eye Openness and Eyebrow Raise
We observed that for some of the subjects, coupling between raised eyebrows and eye openness increased in the drowsy state. In other words, subjects tried to open their eyes using their eyebrows in an attempt to keep awake. See figure 14.10.
Conclusions of Driver Fatigue Study
This section presented a system for automatic detection of driver drowsiness from videos. Previous approaches focused on assumptions about behaviors that might be predictive of drowsiness. Here, a system for automatically measuring facial
Figure 14.10 Eye openness (gray) and eyebrow raises (AU 2) (black) for 10 seconds in an alert state (left) and 10 seconds prior to a crash (right).
expressions was employed to datamine spontaneous behavior during real drowsiness episodes. This is the first work to our knowledge to reveal significant associations between facial expression and fatigue beyond eyeblinks. The project also revealed a potential association between head roll and driver drowsiness, and the coupling of head roll with steering motion during drowsiness. Of note is that a behavior that is often assumed to be predictive of drowsiness, yawning, was in fact a negative predictor of the 60-second window prior to a crash. It appears that in the moments before falling asleep, drivers yawn less often, not more often. This highlights the importance of using examples of fatigue and drowsiness conditions in which subjects actually fall asleep.
Automated Feedback for Intelligent Tutoring Systems
Whitehill et al. (2008) investigated the utility of integrating automatic facial expression recognition into an automated teaching system. There has been a growing thrust to develop tutoring systems and agents that respond to the students' emotional and cognitive state and interact with them in a social manner (e.g., Kapoor, Burleson, & Picard, 2007; D'Mello, Picard, & Graesser, 2007). Whitehill's work used facial expression
to estimate the preferred viewing speed of a video and the level of difficulty, as perceived by the individual student, of a lecture at each moment of time. This study took first steps toward developing methods for closed-loop teaching policies for automated tutoring systems. These are systems that have access to real-time estimates of the cognitive and emotional states of the students and select actions accordingly. The goal of this study was to assess whether automated facial expression measurements using CERT, collected in real time, could predict factors such as the perceived difficulty of a video lecture or the preferred viewing speed. In this study, eight subjects separately watched a video lecture composed of several short clips on mathematics, physics, psychology, and other topics. The playback speed of the video was controlled by the student using a keyboard: the student could speed up, slow down, or rewind the video by pressing a different key. The subjects were instructed to watch the video as quickly as possible (so as to be efficient with their time) while still retaining accurate knowledge of the video's content, since they would be quizzed afterward. While watching the lecture, the student's facial expressions were measured in real time by the CERT system (Bartlett et al., 2006). CERT is fully automatic and requires no manual initialization or calibration, which enables real-time applications. The version of CERT used in the study contained detectors for twelve different facial action units as well as a smile detector trained to recognize social smiles. Each detector output a real-valued estimate of the expression's intensity at each moment in time. After watching the video and taking the quiz, each subject then watched the lecture video again at a fixed speed of 1.0 (real time). During this second viewing, the subjects specified how easy or difficult they found the lecture to be at each moment in time using the keyboard. In total, the experiment resulted in three data series for each student: (1) expression, consisting of thirteen facial expression channels (twelve different facial actions, plus smile); (2) speed, consisting of the video speed at each moment in time as controlled by the user during the first viewing of the lecture; and (3) difficulty, as reported by the student him or herself during the second viewing of the video. For each subject, the three data series were time aligned and smoothed. Then a regression analysis was performed using the expression data (both the thirteen intensities themselves and their first temporal derivatives) as independent variables to predict both difficulty and speed using standard linear regression. An example of such predictions is shown in figure 14.11c for one subject. The results of the correlation study suggested that facial expression, as estimated by a fully automatic facial expression recognition system, is significantly predictive of both preferred viewing speed and perceived difficulty.
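A sketch of this regression analysis is given below, using the thirteen intensity channels plus their first temporal derivatives as predictors. It fits and correlates in-sample for brevity, whereas the study evaluated on a held-out validation portion, and the smoothing and alignment steps are omitted; array shapes are assumptions.

```python
# Predict self-reported difficulty (or preferred speed) from expression
# intensities and their first temporal derivatives with linear regression.
import numpy as np
from sklearn.linear_model import LinearRegression

def build_features(expression):
    """expression: (n_frames, 13) array of intensity channels
    (twelve facial actions plus the smile detector)."""
    deriv = np.gradient(expression, axis=0)      # first temporal derivatives
    return np.hstack([expression, deriv])        # 26 predictors per frame

def fit_and_correlate(expression, target):
    """Fit the target time series from the expression features and return
    the correlation between fitted and reported values."""
    X = build_features(expression)
    pred = LinearRegression().fit(X, target).predict(X)
    return np.corrcoef(pred, target)[0, 1]
```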
Figure 14.11 (a) Sample video lecture. (b) Automated facial expression recognition is performed on a subject's face as she watches the lecture. (c) Self-reported difficulty values (dashed) and the reconstructed difficulty values (solid) computed using linear regression over the facial expression outputs for one subject (12 action units and smile detector). (Reprinted with permission from Whitehill, Bartlett, & Movellan, 2008. © 2008 IEEE.)
Across the eight subjects, and evaluated on a separate validation set taken from the three data series that were not used for training, the average correlations with difficulty and speed were 0.42 and 0.29, respectively. The specific facial expressions that were correlated varied highly from subject to subject. No individual facial expression was consistently correlated across all subjects, but the most consistently correlated expression (taking parity into account) was AU 45 (blink). The blink action was negatively correlated with perceived difficulty, meaning that subjects blinked less during the more difficult sections of the video. This is consistent with previous work associating decreases in blink rate with increases in cognitive load (Holland & Tarlow, 1972; Tada, 1986). Overall, this pilot study provided proof of principle that fully automated recognition of facial expression at the present state of the art can be used to provide real-time feedback in automated tutoring systems. The validation correlations were small but statistically significant (Wilcoxon signed-rank test, p < 0.05), showing that a signal exists in the face and that the automated system can detect it in real time. Since conducting the pilot study, we have added twenty-five additional facial expression dimensions and made a number of improvements to CERT. Moreover, the correlation analysis was quite simple. Future work includes exploration of more powerful dynamic models that may give stronger prediction accuracy.
Discussion
The computer vision field has advanced to the point that we are now able to begin to apply automatic facial expression recognition systems to important research questions in behavioral science. This chapter explored three such applications in which the automated measurement system revealed information about facial expression that was previously unknown. Although the accuracy of individual facial action detectors is still below that of human experts, automated systems can be applied to large quantities of video data. Statistical pattern recognition methods applied to this large quantity of data can reveal emergent behavioral patterns that previously would have required hundreds of coding hours by human experts and would be unattainable by the nonexpert. Moreover, automated analysis of facial expressions will enable investigations into facial expression dynamics that were previously infeasible by human coding because of the time required to code intensity changes. Acknowledgments
Support for this work was provided in part by National Science Foundation (NSF) grants CNS-0454233, SBE-0542013, and NSF ADVANCE award 0340851, and by a grant from the Turkish State Planning Organization. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. Portions of the research in this paper use the Man-Machine Interaction (MMI) Facial
Expression Database collected by M. Pantic and M. F. Valstar. This chapter is based on the following three papers:
1. G. Littlewort, M. Bartlett, and K. Lee (2009). Automatic coding of facial expressions displayed during posed and genuine pain. Image and Vision Computing, 27(12), 1797–1803. © 2009 Elsevier.
2. E. Vural, M. Cetin, A. Ercil, G. Littlewort, M. Bartlett, and J. Movellan (2007). Drowsy driver detection through facial movement analysis. Paper presented at International Conference on Computer Vision Workshop on Human–Computer Interaction. © 2007 IEEE.
3. J. Whitehill, M. Bartlett, and J. Movellan (2008). Automated teacher feedback using facial expression recognition. Paper presented at Workshop on Human Communicative Behavior Analysis, IEEE Conference on Computer Vision and Pattern Recognition. © 2008 IEEE.
References
Bartlett, M., Littlewort, G., Vural, E., Lee, K., Cetin, M., Ercil, A., and Movellan, M. (2008). Data mining spontaneous facial behavior with automatic expression coding. Lecture Notes in Computer Science, 5042, 1–21.
Bartlett, M. S., Littlewort, G. C., Frank, M. G., Lainscsek, C., Fasel, I., and Movellan, J. R. (2006). Automatic recognition of facial actions in spontaneous expressions. Journal of Multimedia, 1(6), 22–35.
Bartlett, M. S., Hager, J. C., Ekman, P., and Sejnowski, T. J. (1999). Measuring facial expressions by computer image analysis. Psychophysiology, 36, 253–263.
Bartlett, M. S., Viola, P. A., Sejnowski, T. J., Golomb, B. A., Larsen, J., Hager, J. C., and Ekman, P. (1996). Classifying facial action. In D. Touretzky, M. Mozer, and M. Hasselmo (Eds.), Advances in neural information processing systems 8. Cambridge, MA: MIT Press, pp. 823–829.
Cockburn, J., Bartlett, M., Tanaka, J., Movellan, J., Pierce, M., and Schultz, R. (2008). SmileMaze: A tutoring system in real-time facial expression perception and production for children with autism spectrum disorder. Paper presented at the international conference on automatic face and gesture recognition, Workshop on facial and bodily expressions for control and adaptation of games, Amsterdam, 978–986.
Cobb, W. A. (Ed.) (1983). Recommendations for the practice of clinical neurophysiology. Amsterdam: Elsevier.
Cohn, J. F., & Schmidt, K. L. (2004). The timing of facial motion in posed and spontaneous smiles. Journal of Wavelets, Multi-resolution and Information Processing, 2(2), 121–132.
Craig, K. D., Hyde, S., & Patrick, C. J. (1991). Genuine, suppressed, and faked facial behaviour during exacerbation of chronic low back pain. Pain, 46, 161–172.
Craig, K. D., & Patrick, C. J. (1985). Facial expression during induced pain. Journal of Personality and Social Psychology, 48(4), 1080–1091.
D'Mello, S., Picard, R., & Graesser, A. (2007). Towards an affect-sensitive autotutor. IEEE Intelligent Systems, Special issue on Intelligent Educational Systems, 22(4), 53–61.
DOT (2001). Saving lives through advanced vehicle safety technology. U.S. Department of Transportation. http://www.its.dot.gov/ivi/docs/AR2001.pdf.
Donato, G., Bartlett, M. S., Hager, J. C., Ekman, P., & Sejnowski, T. J. (1999). Classifying facial actions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(10), 974–989.
Ekman, P. (2001). Telling lies: Clues to deceit in the marketplace, politics, and marriage. New York: W.W. Norton.
Ekman, P., & Friesen, W. (1978). Facial action coding system: A technique for the measurement of facial movement. Palo Alto, CA: Consulting Psychologists Press.
Ekman, P., Levenson, R. W., and Friesen, W. V. (1983). Autonomic nervous system activity distinguishes between emotions. Science, 221, 1208–1210.
Ekman, P., & Rosenberg, E. L. (Eds.) (2005). What the face reveals: Basic and applied studies of spontaneous expression using the FACS. Oxford, UK: Oxford University Press.
Fasel, I., Fortenberry, B., & Movellan, J. R. (2005). A generative framework for real-time object detection and classification. Computer Vision and Image Understanding, 98, 182–210.
Fishbain, D. A., Cutler, R., Rosomoff, H. L., & Rosomoff, R. S. (2006). Accuracy of deception judgments. Personality and Social Psychology Review, 10(3), 214–234.
Frank, M. G., Ekman, P., & Friesen, W. V. (1993). Behavioral markers and recognizability of the smile of enjoyment. Journal of Personality and Social Psychology, 64(1), 83–93.
Grossman, S., Shielder, V., Swedeen, K., & Mucenski, J. (1991). Correlation of patient and caregiver ratings of cancer pain. Journal of Pain and Symptom Management, 6(2), 53–57.
Gu, H., & Ji, Q. (2004). An automated face reader for fatigue detection. In Sixth IEEE International Conference on Automatic Face and Gesture Recognition, 111–116.
Gu, H., Zhang, Y., & Ji, Q. (2005). Task-oriented facial behavior recognition with selective sensing. Computer Vision Image Understanding, 100(3), 385–415.
Hadjistavropoulos, H. D., Craig, K. D., Hadjistavropoulos, T., & Poole, G. D. (1996). Subjective judgments of deception in pain expression: Accuracy and errors. Pain, 65(2–3), 251–258.
Hill, M. L., & Craig, K. D. (2002). Detecting deception in pain expressions: The structure of genuine and deceptive facial displays. Pain, 98(1–2), 135–144.
Hill, M. L., & Craig, K. D. (2004). Detecting deception in facial expressions of pain: Accuracy and training. Clinical Journal of Pain, 20, 415–422.
Holland, M. K., & Tarlow, G. (1972). Blinking and mental load. Psychological Reports, 31(1), 119–127.
Hong, J. E., & Chung, K. (2005). Electroencephalographic study of drowsiness in simulated driving with sleep deprivation. International Journal of Industrial Ergonomics, 35(4), 307–320.
Igarashi, K., Takeda, K., Itakura, F., & Abut, H. (2005). DSP for in-vehicle and mobile systems. New York: Springer.
Kapoor, A., Burleson, W., & Picard, R. (2007). Automatic prediction of frustration. International Journal of Human–Computer Studies, 65(8), 724–736.
Larochette, A. C., Chambers, C. T., & Craig, K. D. (2006). Genuine, suppressed and faked facial expressions of pain in children. Pain, 126(1–3), 64–71.
Littlewort, G., Bartlett, M., & Lee, K. (2009). Automatic coding of facial expressions displayed during posed and genuine pain. Journal of Image and Vision Computing, 27(12), 1797–1803.
Littlewort, G., Bartlett, M. S., Fasel, I., Susskind, J., & Movellan, J. (2006). Dynamics of facial expression extracted automatically from video. Journal of Image and Vision Computing, 24(6), 615–625.
Morecraft, R. J., Louie, J. L., Herrick, J. L., & Stilwell-Morecraft, K. S. (2001). Cortical innervation of the facial nucleus in the non-human primate: A new interpretation of the effects of stroke and related subtotal brain trauma on the muscles of facial expression. Brain, 124(Pt 1), 176–208.
Orden, K. F. V., Jung, T. P., & Makeig, S. (2000). Combined eye activity measures accurately estimate changes in sustained visual task performance. Biological Psychology, 52(3), 221–240.
Prkachin, K. M. (1992). The consistency of facial expressions of pain: A comparison across modalities. Pain, 51(3), 297–306.
Prkachin, K. M., Schultz, I., Berkowitz, J., Hughes, E., & Hunt, D. (2002). Assessing pain behaviour of low-back pain patients in real time: Concurrent validity and examiner sensitivity. Behaviour Research and Therapy, 40(5), 595–607.
Prkachin, K. M., Solomon, P. A., & Ross, A. J. (2007). The underestimation of pain among health-care providers. Canadian Journal of Nursing Research, 39, 88–106.
Rinn, W. E. (1984). The neuropsychology of facial expression: A review of the neurological and psychological mechanisms for producing facial expression. Psychology Bulletin, 95, 52–77.
Schmidt, K. L., Cohn, J. F., & Tian, Y. (2003). Signal characteristics of spontaneous facial expressions: Automatic movement in solitary and social smiles. Biological Psychology, 65(1), 49–66.
Tada, H. (1986). Eyeblink rates as a function of the interest value of video stimuli. Tohoku Psychologica Folica, 45, 107–113.
Takei, Y., & Furukawa, Y. (2005). Estimate of driver's fatigue through steering motion. In Proceedings of the 2005 IEEE international conference on systems, man and cybernetics (Vol. 2, pp. 1765–1770).
Viola, P., & Jones, M. (2004). Robust real-time face detection. Journal of Computer Vision, 57(2), 137–154.
Whitehill, J., Littlewort, G., Fasel, I., Bartlett, M., & Movellan, J. (2009). Towards practical smile detection. Transactions on Pattern Analysis and Machine Intelligence, 31(11), 2106–2111.
Whitehill, J., Bartlett, M., and Movellan, J. (2008). Automatic facial expression recognition for intelligent tutoring systems. Paper presented at Workshop on CVPR for human communicative behavior analysis, IEEE conference on computer vision and pattern recognition.
Zhang, Z., & Zhang, J. (2006). Driver fatigue detection based intelligent vehicle control. In Proceedings of the 18th international conference on pattern recognition, Washington, DC, IEEE Computer Society (pp. 1262–1265).
15
Real-Time Dissociation of Facial Appearance and Dynamics during Natural Conversation
Steven M. Boker and Jeffrey F. Cohn
As we converse, facial expressions, head movements, and vocal prosody are produced that are important sources of information in communication. The semantic content of conversation is accompanied by vocal prosody, nonword vocalizations, head movements, gestures, postural adjustments, eye movements, smiles, eyebrow movements, and other facial muscle changes. Coordination between speakers' and listeners' head movements, facial expressions, and vocal prosody has been widely reported (Bernieri & Rosenthal, 1991; Bernieri, Davis, Rosenthal, & Knee, 1994; Cappella, 1981; Chartrand, Maddux, & Lakin, 2005; Condon, 1976; Lafrance, 1985). Conversational coordination can be defined as occurring when an action generated by one individual is predictive of a symmetric action by another (Rotondo & Boker, 2002; Griffin & Gonzalez, 2003). This coordination is a form of spatiotemporal symmetry between individuals (Boker & Rotondo, 2002) that has behaviorally useful outcomes (Chartrand et al., 2005). Movements of the head, facial expressions, and vocal prosody have been reported to be important in judgments of identity (Fox, Gross, Cohn, & Reilly, 2007; Hill & Johnston, 2001; Munhall & Buchan, 2004), rapport (Grahe & Bernieri, 2006; Bernieri et al., 1994), attractiveness (Morrison, Gralewski, Campbell, & Penton-Voak, 2007), gender (Morrison et al., 2007; Hill & Johnston, 2001; Berry, 1991), personality (Levesque & Kenny, 1993), and affect (Ekman, 1993; Hill, Troje, & Johnston, 2003). In a dyadic conversation, each conversant's perception of the other person produces an ever-evolving behavioral context that in turn influences her or his ensuing actions, thus creating a nonstationary feedback system as conversants form patterns of movements, expressions, and vocal inflections, sometimes with high symmetry between the conversants and sometimes with little or no similarity (Ashenfelter, Boker, Waddell, & Vitanov, 2009). This skill is so automatic that little thought is given to it unless it begins to break down. As symmetry is formed between two conversants, the ability to predict the actions of one based on the actions of the other increases and the perception of empathy increases (Baaren, Holland, Steenaert, & Knippenberg, 2003). Symmetry in movements implies redundancy, which can be defined as negative Shannon information
Figure 15.1 A conceptual model for adaptive feedback between two individuals engaged in conversation. A mirror system tracks the movements and vocalizations of the interlocutor, but the output of the mirror system is frequently suppressed. When symmetric action is called for, the mirror system is preprimed to produce symmetry by enabling its otherwise suppressed output.
(Redlich, 1993; Shannon & Weaver, 1949). By this logic, when symmetry between two conversants is high, they are sharing an embodied state and thus may feel greater empathy toward one another. When interpersonal symmetry is broken, that is, when there is a change from similar expressions and movements to dissimilar ones, individuals are less likely to be able to predict each other's movements, owing to lowered redundancy and thus increased Shannon information. In this way, changes in symmetry can be interpreted as changes in information flow between the conversants. An information flow view of the dynamics of dyadic conversation is consistent with a model in which contributions from audition, vision, and proprioception are combined in a low-level mirror system (Rizzolatti & Craighero, 2004) that uses the continuous stream of auditory and visual input as sources of information available for grammatic, semantic, and affective perception. A conceptual diagram of this model is shown in figure 15.1. According to this model, motor activity can be generated from the mirror system, but this activity is, in general, suppressed. Release of suppression of motor mimicry is used intermittently to elicit engagement and express empathy with interlocutors. This model suggests that the adaptive dynamics of head movements and facial expressions observed during conversation are composed of low-level perception–action contributions, exhibited as periods of high symmetry with an interlocutor, as well as top-down cognitive contributions, exhibited in the regulation of these periods of symmetry.
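As a purely illustrative sketch (not the analysis used in the studies cited above), one simple way to quantify such movement symmetry is a windowed, lag-adjusted cross-correlation between the two conversants' head speeds: windows with a high peak correlation correspond to periods of high redundancy, and windows with a low peak correlation to broken symmetry. All function and variable names below are hypothetical.

import numpy as np

def windowed_symmetry(head_a, head_b, win=81, max_lag=20):
    """Crude symmetry index: peak cross-correlation of head speeds within
    sliding windows. head_a, head_b are (T, 3) head-position series sampled
    at the same rate; returns one score per window."""
    speed_a = np.linalg.norm(np.diff(head_a, axis=0), axis=1)
    speed_b = np.linalg.norm(np.diff(head_b, axis=0), axis=1)
    scores = []
    for start in range(0, len(speed_a) - win - 2 * max_lag + 1, win):
        a = speed_a[start + max_lag:start + max_lag + win]
        a = (a - a.mean()) / (a.std() + 1e-9)
        best = -np.inf
        for lag in range(-max_lag, max_lag + 1):
            b = speed_b[start + max_lag + lag:start + max_lag + lag + win]
            b = (b - b.mean()) / (b.std() + 1e-9)
            best = max(best, float(np.mean(a * b)))
        scores.append(best)
    return np.array(scores)

# Synthetic example: conversant B loosely echoes A's movements with a delay.
rng = np.random.default_rng(0)
a = np.cumsum(rng.normal(size=(1000, 3)), axis=0)
b = np.roll(a, 10, axis=0) + rng.normal(scale=0.5, size=(1000, 3))
print(windowed_symmetry(a, b))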
Adaptation to Context: Dynamics and Appearance

The context of a conversation can influence its course. The person you think you are speaking with can influence what you say and how you say it. The appearance of an
interlocutor is composed of his or her facial and body structure as well as how he or she moves and speaks. Separate pathways for the perception of structural appearance and biological motion have been proposed (Giese & Poggio, 2003). Evidence for this view comes from neurological (Steede, Tree, & Hole, 2007a, 2007b) and judgment (Knappmeyer, Thornton, & Bülthoff, 2003; Hill & Johnston, 2001) studies. Munhall and Buchan (2004) postulate that motion contributes to the identification of faces by providing better structure cues from a moving face as well as dynamic facial signatures. Berry (1991) reports that children could recognize gender from point-light faces when the recorded motion was from a conversation, but not when it was from a recitative reading. This suggests that contextual dynamics cues are likely to be stronger in normal interactions than when they are generated from a scripted and acted sequence. If we accept that the dynamics of facial expressions, head movements, and gestures during natural conversation are generated as part of a system that uses mechanisms of adaptive feedback and varies the informational content of its output, we conclude that these dynamics are likely to exhibit highly complex time dependence. When building statistical models of such a system, we may expect to encounter data with high numbers of degrees of freedom and complex nonlinear interactions. The successful study of such systems requires measurements with high precision in both time and space. Large numbers of data samples are required in order to have sufficient power to distinguish models with nonlinear time dependence from those with nonstationary linear components. Finally, in order to test models, precise experimental perturbations are needed to distinguish causal from correlational structures.

Measuring and Manipulating Dynamics and Appearance
Given these needs for studying coordinative movements in natural conversation, we sought a nonintrusive method for automatically tracking facial expressions and head movements. In addition, in order to create perturbations, we sought a method for covertly introducing known perturbations into natural conversation. We wished to be able to make adjustments to both appearance and dynamics without a naive conversant knowing that these perturbations were present. The adaptive regulation of cognition and expressive affect has long been studied using labor-intensive methods, such as hand coding of video tape or film (e.g., Cohn, Ambadar, & Ekman, 2007; Cappella, 1996; Condon & Ogston, 1966). Experiments in the regulation of expressive affect have primarily used fixed or recorded stimuli because it is otherwise difficult to control the context to which the participant is adapting. However, people react much differently to recorded stimuli than they do when they are engaged in a conversation. Perceptions and implicit biases triggered by the appearance and dynamics of the interlocutor have the potential to change the
Figure 15.2 Layout of the videoconference booth and motion-tracking system. The oval magnetic field penetrates the magnetically transparent sound isolation wall so that participants sit approximately 3 m apart in the same motion-tracking field.
way that a participant self-regulates during an interaction. We sought a way to control the context of a natural conversation by allowing the random assignment of appearance variables such as age, sex, race, and attractiveness.

Videoconference Paradigm
In order to control context and to acquire high-quality full-face video and acoustically isolated audio of conversation participants, we selected a videoconference paradigm. In our current laboratory setup, each participant sits on a stool in a small video booth as shown in figure 15.2, facing a backprojection screen approximately 2 m away. A small (2 cm × 10 cm) "lipstick" video camera is mounted in front of the backprojection screen at a position that corresponds to the forehead of the life-sized image of the interlocutor's face projected on the screen. Each participant wears headphones, and a small microphone is mounted overhead, out of the participants' field of view. The walls of the booths are moveable "gobos" built of sound-diffusing compressed fiberglass panels covered with white fabric. The participants are lit from
the front, and the gobos and a white fabric booth ceiling serve as reflective surfaces so that there are few shadows on the face, facilitating automatic video tracking. The field of view in the booth is basically featureless white fabric other than the image of the conversant. Acoustically, the sound-diffusion panels prevent coherent early reflections, and thus the booth does not sound as small as it is. This effect and the open sides help prevent feelings of claustrophobia that might otherwise occur for some participants in a small (2.5 m × 2 m) enclosed space. Although the participants are in separate rooms, and thus acoustically isolated from one another, they are actually in close proximity to one another. This allows the use of a single magnetic field covering the two booths to enable motion tracking (Ascension Technologies MotionStar) to synchronously record the participants' head movements (6 degrees of freedom, DOF, sampled at 81.6 Hz). Each participant wears a headband or hat to which a tracker is attached. Naive participants are informed that we are "measuring magnetic fields during conversation," and in our experiments over the past 10 years, all (N > 200) but one accepted this cover story. One participant in one of our recent experiments immediately indicated that he knew that the headband was a motion capture sensor, and so his data were not used. By withholding the fact that we are motion tracking, we wish to prevent participants from feeling self-conscious about their movements during their conversations. Video and audio can be transmitted between the booths with a minimal delay; there is a one-video-frame delay (33 ms) at the projector while it builds a frame buffer to project. Audio between booths is delayed so as to match the arrival of the video. In some of our experiments, we have used frame delays of between three and five frames (99 ms to 165 ms) but as yet have found no effects on movements within this range of delay. As delays become longer than 200 ms, conversational patterns can change. Delays of over 500 ms can cause breakdowns in conversational flow because individuals begin talking at the same time, having difficulty in negotiating smooth speaker–listener turn-taking.

Real-Time Facial Avatars
The videoconference paradigm allows us to track head movements, but it also allows us to track facial movements using active appearance models (AAMs) (Cootes, Edwards, & Taylor, 2001; Cootes, Wheeler, Walker, & Taylor, 2002) by digitizing a video stream and applying tracking software developed by our colleagues at Carnegie Mellon University (Matthews, Ishikawa, & Baker, 2004; Matthews & Baker, 2004). From the tracking data, we can redisplay a computer-generated avatar face for each video frame (Theobald, Matthews, Wilkinson, Cohn, & Boker, 2007). The tracking and redisplay take less time than a single video frame, so the whole process takes 33 ms for the frame digitizing and 33 ms for the tracking and redisplay of the frame buffer. Thus we can track a face and from that data synthesize a video avatar
within 66 ms. We have been using an off-the-shelf PCIe AJA Video Kona Card for video digitizing and redisplay in a standard 3.0-GHz Mac Pro that performs the tracking and synthesis. The capability to track facial movements and redisplay them brings with it the possibility of creating perturbations to conversation in a covert manner and randomly assigning them within a conversation. In order to do this, we first needed to find out whether the synthesized avatars were accepted as being video by naive participants. Then we needed to develop a method for applying perturbations of appearance and dynamics to the resynthesized avatar faces.

Facial Avatars Using Active Appearance Models
AAMs are generative, parametric models and consist of both shape and appearance components. The shape component of the AAM is a triangulated mesh that moves like a face, undergoing both rigid motion (head pose variation) and nonrigid motion (expression) in response to changes in the parameters. The shape components are identified by first hand-labeling the sixty-eight vertices of the triangular mesh in between thirty and fifty frames of video, a process that takes a trained research assistant about 2 to 3 hours. The video frames are chosen so as to cover the range of facial motion normally exhibited by the target individual's expressions. Once these video frames are hand labeled, the remainder of the process is automatic. Thus, once a model is constructed, we can continue to use it to track an individual over multiple occasions and experiments. A principal components analysis is performed on these labeled video frames to extract between eight and twelve shape components that can be thought of as axes of facial movement. These components are independent and additive, so that the estimated shape of a face mesh in a single frame of video is expressed as the weighted sum of the retained components:

s = s_0 + \sum_{i=1}^{m} p_i s_i,   (15.1)
where s is the estimated shape, s_0 is the mean shape, the s_i are the retained shape components, and the p_i are the shape parameters. Figure 15.3a plots the first three of these components for one target individual. The arrows in each wire-frame face in figure 15.3a demonstrate how one unit of change in each of the first three principal components creates simultaneous change across multiple vertices of the triangular mesh. The appearance component of the AAM is an image of the face, which itself can vary under the control of the parameters. As the parameters are varied, the appearance changes to model effects such as the appearance of furrows and wrinkles. Again, we use principal components analysis on the same labeled video frames and
Figure 15.3 Active appearance models (AAMs) have both shape and appearance components. (a) The first three shape modes. (b) The mean appearance (left) and first two appearance modes. (c) Three example faces generated with the AAM in (a) and (b).
retain a few (eight to twelve) appearance components. Thus, the estimated appearance A(x) becomes a linear combination of the mean appearance A_0(x) plus a weighted sum of appearance images A_i(x):

A(x) = A_0(x) + \sum_{i=1}^{l} \lambda_i A_i(x), \quad \forall x \in s_0,   (15.2)
where the coefficients \lambda_i are the appearance parameters. Figure 15.3b shows the mean appearance and the first two appearance components for a target individual's face. Combining the shape and appearance models, we can create a wide variety of natural-looking head poses and facial expressions. Fitting an AAM is a difficult nonlinear optimization problem. Matthews and Baker (2004) recently proposed and demonstrated an AAM fitting algorithm that is more robust and faster than previous algorithms, tracking faces at over 200 frames per second.
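To make equations 15.1 and 15.2 concrete, the following sketch shows how a single avatar frame would be synthesized from shape and appearance parameters. The array names and component counts are hypothetical, and the final warp of the texture from the mean shape onto the synthesized shape is omitted; this is not the authors' code, only a minimal illustration of the linear model.

import numpy as np

# Hypothetical model sizes: 68 mesh vertices, 10 shape and 10 appearance
# components, and a 100 x 100 RGB appearance image defined over the mean shape.
n_vertices, n_shape, n_app, h, w = 68, 10, 10, 100, 100

s0 = np.zeros((n_vertices, 2))                        # mean shape s_0
S = 0.01 * np.random.randn(n_shape, n_vertices, 2)    # shape components s_i
A0 = np.full((h, w, 3), 0.5)                          # mean appearance A_0(x)
A = 0.01 * np.random.randn(n_app, h, w, 3)            # appearance images A_i(x)

def synthesize(p, lam):
    """Equation 15.1: s = s_0 + sum_i p_i s_i (the mesh for one frame).
    Equation 15.2: A(x) = A_0(x) + sum_i lambda_i A_i(x) (the texture),
    which would then be warped from the mean shape onto s."""
    shape = s0 + np.tensordot(p, S, axes=1)            # (n_vertices, 2)
    appearance = A0 + np.tensordot(lam, A, axes=1)     # (h, w, 3)
    return shape, appearance

shape, appearance = synthesize(np.random.randn(n_shape), np.random.randn(n_app))
print(shape.shape, appearance.shape)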
Figure 15.4 Video frames and mean shape and appearance models for two research assistants.
Our project uses this algorithm to fit a prebuilt AAM model to each frame of video as it becomes available at the digitizing frame buffer. Figures 15.4a and 15.4b display video frames of two research assistants as captured by the video camera during an experiment. Below each individual's picture, figures 15.4c and 15.4d present the respective computer-generated facial avatars for the mean shape and appearance of the two research assistants. The required number of degrees of freedom is surprisingly low (less than 25 DOF) for each of the generated avatars, although the avatars' appearance is very similar to the individuals' actual appearances. Note that the mean shape and appearance are somewhat smoother than any particular video frame and do not correspond to a completely relaxed expression. Figure 15.5 displays a still from the video feeds, facial tracking, and avatar from one conversation during an experiment. Figure 15.5a shows the video that was captured from the research assistant. In figure 15.5b, her face is tracked by the 68-vertex triangular mesh. From these tracking data and a previously constructed model, a video frame with a matching avatar is displayed in figure 15.5c. The naive participant, shown in figure 15.5d, sees only the avatar image from figure 15.5c, while the research assistant sees the full video of the naive participant as seen in figure 15.5d. Note that there appears to be a high degree of symmetry exhibited by the faces in figure 15.5. After the conversation session is over, models are built and tracked for
Figure 15.5 One frame from conversation during a videoconference experiment. (a) The research assistant whose face was tracked sat in one booth. (b) The tracking mesh is automatically fit to the research assistant’s face. (c) The synthesized avatar is displayed to the naive participant within 99 ms of the light captured by the camera in the research assistant’s booth. (d) The naive participant’s image is seen by the research assistant.
each naive participant. In this way, we have been capturing measurements that will be used to construct and test specific models for the dynamic ebb and flow of symmetry formation and breaking during conversation.

Swapping Appearance by Using Avatar Models
Once we were able to construct and display an avatar in real time, we began to work on methods for introducing manipulations of the appearance and dynamics of the avatar. Changing the appearance of the avatar is akin to putting a flexible mask on a person. That is to say, the avatar's expressions are driven by the captured motions of the person whose face is being motion tracked, but the avatar model that is shown making these expressions is one that was generated from a different individual. To accomplish this, we first constructed a set of short video clips that was representative of each of our six research assistants. We then videotaped each assistant while she or he mimicked the facial expressions of each of the other assistants as shown on the video clips.
Figure 15.6 Six avatars generated from the facial expression captured from person (a).
Then we used the captured video from each research assistant to build a model that covered approximately the same space of expressions. Finally, we simply substituted one person's mean shape and appearance for another person's during the synthesis portion of the process. As an example of how this works, the research assistant in figure 15.6a is being tracked and his expressions are mapped onto the other five research assistants in figures 15.6b through 15.6f. Note that the expressions in figure 15.6 are not exactly the same. One might think of this mapping as taking how person (a)'s expression differs from his mean and applying that difference to person (b)'s face, showing how person (b) would appear if he had produced an expression with the same difference from his mean. This tends to create natural-appearing expressions, since the shapes themselves are not mapped, but simply an expression that represents a similar point in expression space. By sampling all individuals mimicking the same movements, we were able to make the axes of the spaces relatively similar. This is an important point because the principal component axes generated from the distribution of naturally occurring movements of one individual may be substantially different from the axes generated from another person, who may have a very different distribution of characteristic movements.
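A minimal sketch of this deviation-from-the-mean mapping, using the same conventions as the AAM sketch above (illustrative only; in the actual system the swap happens inside the real-time synthesis pipeline): person (a)'s parameters, estimated against person (a)'s model, are re-synthesized with person (b)'s mean shape, mean appearance, and components.

import numpy as np

def swap_identity(p_a, lam_a, model_b):
    """Drive person (b)'s avatar with person (a)'s expression parameters.
    p_a, lam_a: shape and appearance parameters fit with person (a)'s model.
    model_b: dict holding person (b)'s mean shape 's0', shape components 'S',
    mean appearance 'A0', and appearance images 'A' (built from mimicked clips
    so that the two parameter spaces are roughly aligned)."""
    shape = model_b["s0"] + np.tensordot(p_a, model_b["S"], axes=1)
    texture = model_b["A0"] + np.tensordot(lam_a, model_b["A"], axes=1)
    return shape, texture

Because only the deviations from each person's own mean are transferred, the result represents a comparable point in person (b)'s expression space rather than a literal copy of person (a)'s vertex positions.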
Figure 15.7 One frame from a conversation in which the appearance and voice of the research assistant was changed to appear to be male. (a) The research assistant whose face was tracked sat in one booth. (b) The tracking mesh is automatically fit to the research assistant’s face. (c) A synthesized avatar with mean appearance taken from a male research assistant is displayed to the naive participant within 99 ms of the light captured by the camera in the research assistant’s booth. (d) The naive participant’s image is seen by the research assistant.
Methods for rotation and scaling of axes between avatar models (Theobald, Matthews, Cohn, & Boker, 2007) may improve expression mapping and reduce the need for mimic-based video sequences on which to build models. We used this mean shape and appearance substitution method to produce displays that map appearance from one sex to another (Boker et al., in press). For instance, in figure 15.7, the research assistant's face was mapped to a male avatar. Each video frame is mapped, so the dynamics of the movements produced by the female research assistant are reproduced by the male avatar's expressions. In addition, the female research assistant's voice was processed using a TC-Helicon VoicePro vocal formant processor to change the fundamental frequency and formants to approximate a male voice. In this experiment, the naive participant in figure 15.7d was informed that she would have six different conversations. She actually
talked to two research assistants, one male and one female, but she thought she had spoken with six different individuals, three male and three female. The research assistants were blind to whether they appeared as a male or a female in any particular conversation. In our avatar videoconference experiments, we have used more than 100 naive participants, and only two of them doubted our cover story that the faces they saw were live video that had been "cut out" so that they only saw the face of the person they were talking to. One of those was the person who also guessed that we were putting motion-capture sensors on him. Unfortunately, as knowledge of this technology becomes more widely disseminated, we will not be able to rely on participants trusting that the face they see in a videoconference in fact belongs to the person with whom they are speaking.

Future Directions
Now that it is practical to precisely and noninvasively measure and control nonrigid facial movements produced in natural conversation, we expect that there will be a surge of experiments that test hypotheses about the coupled dynamics of interpersonal coordination. We expect that a mapping will be developed between a semantic space of adjectives describing emotion and a low degree-of-freedom avatar model of the human face. This mapping will allow the automatic tracking of affective facial displays in a way that may revolutionize human–computer interactions. We are also interested in perturbing the dynamics of expressions. In affective disorders such as depression, individuals display facial behavior during conversation that differs from the coordination between people's expressions shown in normal conversation. Depressed individuals also report feelings of being distant from others. By better understanding the way that these patterns of affective display develop and persist, we may be able to devise better interventions that allow these individuals to recover from depressive episodes more quickly and effectively. Another area amenable to study using real-time avatars is stereotyping and bias. Since we can convincingly change a person's apparent sex, we expect that further work will allow us to randomly assign variables such as race and age during natural conversation. Studying stereotyping using this paradigm is particularly interesting since the research assistant whose characteristics are being modified can be kept blind to the modification. It is not as if an assistant is asked to act a part. The only way the assistant can know how he or she appears to the conversational partner is by how the conversational partner treats the assistant. By counterbalancing so that the conversational partner has more than one conversation with the same assistant in each appearance condition, we can isolate effects during conversation to the randomly assigned appearance variable.
Applications for this technology in human–computer interaction are not difficult to envision. For instance, a National Aeronautics and Space Administration–funded pilot project has been proposed to track teachers' faces and map them onto celestial objects so that, in classrooms equipped with distance learning, children can "talk to Jupiter." Transmitting avatar displays requires extremely low bandwidth, so these displays may find application in cell phones and other videoconferencing applications (Brick, Spies, Theobald, Matthews, & Boker, 2009). Computer-based tutoring systems may be able to use webcams to track whether a learner is displaying confusion or frustration. Autonomous avatars may be able to display expressions that are perceived as showing empathy by tracking viewers' faces and displaying an appropriate amount of interpersonal symmetry, thereby reducing the feeling that autonomous faces are cold and mechanical. Appropriate responses to affect detected in human facial expressions may allow human–robot interactions to be less threatening and more fulfilling for humans.

Conclusions
We have described an overview of our team's work in developing and testing real-time facial avatars driven by motion capture from video. The avatar technology has enabled videoconference experiments that randomly assign appearance variables and examine how people coordinate their motions and expressions in natural conversation. After 24 to 30 minutes of videoconference conversation, 98% of naive participants did not doubt the cover story that we were "cutting out video to just show the face." We find this surprising, since each video frame was constructed from approximately twenty-five floating point values applied to a model. Contrast that with the fact that a real video frame contains over 300,000 pixels, each of which is represented by a 24-bit number. We can think of three reasons why this illusion is so convincing. The first possible reason is that when we produce facial expressions, we largely coordinate our muscles in correlated patterns, so that the total number of degrees of freedom we express is relatively small—on the order of 3 degrees of freedom for head pose and 7 to 12 degrees of freedom for facial expression. The second possible reason is that there may be some limiting of perceived degrees of freedom as we view a facial expression. Thus, our perceptual system may be mapping facial expressions onto a lower dimensional space than actually exists in the world, so when the number of degrees of freedom in the display is reduced, we do not notice. Such a perceptual effect might also explain why it is so easy to see a face in an arbitrary pattern with only marginal similarity to a face—the so-called face on Mars or face on the tortilla effect.
A third possible reason is that in real-time conversation, a participant is expecting to interact with a real person and is engaged in that interaction. Thus the dynamics of the symmetry formation and symmetry breaking are appropriate and convince the participant that since the interaction is real, the video image must be real. Contrast that situation with a judgment paradigm in which the participant may adopt a more critical attitude and is not dynamically engaged with the person on the display. Thus the nature of the context and task may lead to greater or lesser credibility of the avatar display. We expect that real-time facial avatars will be in common, everyday use within 10 years or less. We expect facial avatar technology to be influential in teaching, in human–computer interactions, and in the diagnosis and treatment of affective disorders. In the meantime, these constructs provide powerful tools for examining human interpersonal communication.

Acknowledgments
The authors gratefully acknowledge the contributions of the many investigators and research assistants who worked on this project: Zara Ambadar, Kathy Ashenfelter, Timothy Brick, Tamara Buretz, Enoch Chow, Eric Covey, Pascal Deboeck, Katie Jackson, Hannah Kim, Jen Koltiska, Nancy Liu, Michael Mangini, Iain Matthews, Sean McGowan, Ryan Mounaime, Sagar Navare, Andrew Quilpa, Jeffrey Spies, Barry-John Theobald, Stacey Tiberio, Michael Villano, Chris Wagner, and Meng Zhao. Funding for this work was provided in part by National Science Foundation grant BCS-0527485. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

References

Ashenfelter, K. T., Boker, S. M., Waddell, J. R., & Vitanov, N. (2009). Spatiotemporal symmetry and multifractal structure of head movements during dyadic conversation. Journal of Experimental Psychology: Human Perception and Performance, 34(3), 1072–1091.
Baaren, R. B. van, Holland, R. W., Steenaert, B., & Knippenberg, A. van (2003). Mimicry for money: Behavioral consequences of imitation. Journal of Experimental Social Psychology, 39(4), 393–398.
Bernieri, F. J., Davis, J. M., Rosenthal, R., & Knee, C. R. (1994). Interactional synchrony and rapport: Measuring synchrony in displays devoid of sound and facial affect. Personality and Social Psychology Bulletin, 20(3), 303–311.
Bernieri, F. J., & Rosenthal, R. (1991). Interpersonal coordination: Behavior matching and interactional synchrony. In R. S. Feldman and B. Rimé (eds.), Fundamentals of nonverbal behavior. Cambridge, UK: Cambridge University Press, pp. 401–431.
Berry, D. S. (1991). Child and adult sensitivity to gender information in patterns of facial motion. Ecological Psychology, 3(4), 349–366.
Boker, S. M., Cohn, J. F., Theobald, B.-J., Matthews, I., Mangini, M., Spies, J. R., et al. (in press). Something in the way we move: Motion dynamics, not perceived sex, influence head movements in conversation. Journal of Experimental Psychology: Human Perception and Performance.
Boker, S. M., & Rotondo, J. L. (2002). Symmetry building and symmetry breaking in synchronized movement. In M. Stamenov and V. Gallese (eds.), Mirror neurons and the evolution of brain and language. Amsterdam: John Benjamins, pp. 163–171.
Brick, T. R., Spies, J. R., Theobald, B., Matthews, I., & Boker, S. M. (2009). High-presence, low-bandwidth, apparent 3D video-conferencing with a single camera. In Proceedings of the 2009 international workshop on image analysis for multimedia interactive services (WIAMIS). Piscataway, NJ: IEEE.
Cappella, J. N. (1981). Mutual influence in expressive behavior: Adult–adult and infant–adult dyadic interaction. Psychological Bulletin, 89(1), 101–132.
Cappella, J. N. (1996). Dynamic coordination of vocal and kinesic behavior in dyadic interaction: Methods, problems, and interpersonal outcomes. In J. H. Watt and C. A. VanLear (eds.), Methodology in social research. Thousand Oaks, CA: Sage, pp. 353–386.
Chartrand, T. L., Maddux, W. W., & Lakin, J. L. (2005). Beyond the perception–behavior link: The ubiquitous utility and motivational moderators of nonconscious mimicry. In R. Hassin, J. Uleman, and J. A. Bargh (eds.), The new unconscious. New York: Oxford University Press, pp. 334–361.
Cohn, J. F., Ambadar, Z., & Ekman, P. (2007). Observer-based measurement of facial expression with the Facial Action Coding System. In J. A. Coan and J. J. B. Allen (eds.), The handbook of emotion elicitation and assessment. New York: Oxford University Press, pp. 203–221.
Condon, W. S. (1976). An analysis of behavioral organization. Sign Language Studies, 13, 285–318.
Condon, W. S., & Ogston, W. D. (1966). Sound film analysis of normal and pathological behavior patterns. Journal of Nervous and Mental Disease, 143(4), 338–347.
Cootes, T. F., Edwards, G., & Taylor, C. J. (2001). Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6), 681–685.
Cootes, T. F., Wheeler, G. V., Walker, K. N., & Taylor, C. J. (2002). View-based active appearance models. Image and Vision Computing, 20(9–10), 657–664.
Ekman, P. (1993). Facial expression and emotion. American Psychologist, 48, 384–392.
Fox, N. A., Gross, R., Cohn, J. F., & Reilly, R. B. (2007). Robust biometric person identification using automatic classifier fusion of speech, mouth, and face experts. IEEE Transactions on Multimedia, 9(4), 701–714.
Giese, M. A., & Poggio, T. (2003). Neural mechanisms for the recognition of biological movements. Nature Reviews Neuroscience, 4, 179–192.
Grahe, J. E., & Bernieri, F. J. (2006). The importance of nonverbal cues in judging rapport. Journal of Nonverbal Behavior, 23(4), 253–269.
Griffin, D., & Gonzalez, R. (2003). Models of dyadic social interaction. Philosophical Transactions of the Royal Society of London, B, 358(1431), 573–581.
Hill, H. C. H., & Johnston, A. (2001). Categorizing sex and identity from the biological motion of faces. Current Biology, 11(3), 880–885.
Hill, H. C. H., Troje, N. F., & Johnston, A. (2003). Range- and domain-specific exaggeration of facial speech. Journal of Vision, 5, 793–807.
Knappmeyer, B., Thornton, I. M., & Bülthoff, H. H. (2003). The use of facial motion and facial form during the processing of identity. Vision Research, 43(18), 1921–1936.
Lafrance, M. (1985). Postural mirroring and intergroup relations. Personality and Social Psychology Bulletin, 11(2), 207–217.
Levesque, M. J., & Kenny, D. A. (1993). Accuracy of behavioral predictions at zero acquaintance: A social relations model. Journal of Personality and Social Psychology, 65(6), 1178–1187.
Matthews, I., & Baker, S. (2004). Active appearance models revisited. International Journal of Computer Vision, 60(2), 135–164.
Matthews, I., Ishikawa, T., & Baker, S. (2004). The template update problem. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26, 810–815.
Morrison, E. R., Gralewski, L., Campbell, N., & Penton-Voak, I. S. (2007). Facial movement varies by sex and is related to attractiveness. Evolution and Human Behavior, 28, 186–192.
Munhall, K. G., & Buchan, J. N. (2004). Something in the way she moves. Trends in Cognitive Sciences, 8(2), 51–53.
Redlich, N. A. (1993). Redundancy reduction as a strategy for unsupervised learning. Neural Computation, 5, 289–304.
Rizzolatti, G., & Craighero, L. (2004). The mirror-neuron system. Annual Reviews of Neuroscience, 27, 169–192.
Rotondo, J. L., & Boker, S. M. (2002). Behavioral synchronization in human conversational interaction. In M. Stamenov and V. Gallese (eds.), Mirror neurons and the evolution of brain and language. Amsterdam: John Benjamins, pp. 151–162.
Shannon, C. E., & Weaver, W. (1949). The mathematical theory of communication. Urbana, IL: University of Illinois Press.
Steede, L. L., Tree, J. J., & Hole, G. J. (2007a). Dissociating mechanisms involved in accessing identity by dynamic and static cues. Object Perception, Attention, and Memory (OPCAM) 2006 Conference Report, Visual Cognition, 15(1), 116–123.
Steede, L. L., Tree, J. J., & Hole, G. J. (2007b). I can't recognize your face but I can recognize its movement. Cognitive Neuropsychology, 24(4), 451–466.
Theobald, B., Matthews, I., Cohn, J. F., & Boker, S. (2007). Real-time expression cloning using appearance models. In Proceedings of the 9th international conference on multimodal interfaces. New York: ACM Press, pp. 134–139.
Theobald, B., Matthews, I., Wilkinson, N., Cohn, J. F., & Boker, S. (2007). Animating faces using appearance models. In Proceedings of the 2007 Workshop on Vision, Video and Graphics, 14 September 2007, Warwick University, in association with the British Machine Vision Conference 2007.
16
Markerless Tracking of Dynamic 3D Scans of Faces
Christian Walder, Martin Breidt, Heinrich H. Bülthoff, Bernhard Schölkopf, and Cristóbal Curio
Human perception of facial appearance and motion is a very sensitive and complex, yet mostly unconscious process. In order to be able to systematically investigate this phenomenon, highly controllable stimuli are a useful tool. This chapter describes a new approach to capturing and processing real-world data on human performance in order to build graphics models of faces, an endeavor where psychology intersects with other research areas such as computer graphics and machine learning, as well as the artistic fields of movie or game production. Creating animated 3D models of deformable objects such as faces is an important and difficult task in computer graphics because the human perception of face appearance and motion is particularly sensitive. By drawing on a lifetime of experience with real faces, people are able to detect even the slightest peculiarities in an artificially animated face model, potentially eliciting unintended emotional or other communicative responses. This has made the animator's job rather difficult and time-consuming when done manually, even for talented artists, and has led to the use of data-driven animation techniques, which aim to capture and re-use the dynamic performance of a live subject. Data-driven face animation has recently enjoyed considerable success in the movie industry, in which marker-based methods are the most widely used. Although steady progress has been made in marker-based tracking, there are certain limitations involved in placing physical markers on a subject's face. Summarizing the face by a sparse set of point locations may lose some information and necessitates involved retargeting of geometric motion to map the marker motion onto that of a model suitable for animation. Markers also partially occlude the subject's face, which itself contains dynamic information such as that caused by changes in blood flow and expression wrinkles. On a practical level, time and effort are required to correctly place the markers, especially when short recordings of a large number of actors are needed—a scenario likely to arise in the computer game industry for example, but also common in psychology research.
Figure 16.1 Results of an automatically tracked sequence. The deforming mesh is rendered with a wire frame to show the correspondence between time steps. Since no markers are used, we also obtain an animated texture capturing subtle dynamic elements like expression wrinkles.
Tracking the changes of a deforming surface over time without markers is more difficult (see figure 16.1). To date, many markerless tracking methods have made extensive use of optical flow calculations between adjacent time steps of the sequence. Since local flow calculations are noisy and inconsistent, it is necessary to introduce spatial coherency constraints. Although significant progress has been made in this direction, for example, by Zhang, Snavely, Curless, and Seitz (2004), the sequential use of between-frame flow vectors can lead to continual accumulation of errors, which may eventually necessitate labor-intensive manual corrections as reported by Borshukov and Lewis (2003). It is also noteworthy that facial cosmetics designed to remove skin blemishes strike directly at the key assumptions of optical flow-based methods. For face-tracking purposes, there is significant redundancy between the geometry and color information. Our goal is to exploit this multitude of information sources to obtain high-quality tracking results in spite of possible ambiguities in any of the individual sources. In contrast to classical motion capture, we aim to capture the surface densely rather than at a sparse set of locations. We present a novel surface-tracking algorithm that addresses these issues. The input to the algorithm consists of an unorganized set of four-dimensional (3D plus time) surface points, along with a corresponding set of surface normals and surface colors. From these data, we construct a 4D implicit surface model, as well as a regressed function that models the color at any given point in space and time. Since we require only an unorganized point cloud, the algorithm is not restricted to scanners that produce a sequence of 3D frames and can handle samples at arbitrary points in time and space as produced by a laser scanner, for example. The algorithm also requires a mesh as a template for tracking, which we register with the first frame of the scanned sequence using an interactive tool. The output of our algorithm is a high-quality tracking result in the form of a sequence of deformations that move the template mesh in correspondence with the scanned subject, along with an animated
texture. A central design feature of the algorithm is that, in contrast to sequential frame-by-frame approaches, it solves for the optimal motion with respect to the entire sequence while incorporating geometry and color information on an equal footing. As a result, the tracking is robust against slippage due to the accumulation of errors.

Background on Facial Animation
Since the human face is such an important communication surface, facial animation has been of interest from early on in the animation and later the computer graphics community. In the following paragraphs we give an overview of related work, focusing on automated, algorithmic approaches rather than artistic results such as traditional cel animation. Generally, facial animation can be produced either by a purely 2D image-based approach or by using some kind of 3D representation of the facial geometry. By using the full appearance information from images or video, 2D approaches naturally produce realistic-looking results, but the possibilities for manipulation are limited. Conversely, 3D approaches allow a great amount of control over every aspect of the facial appearance but often have difficulty in achieving the same level of realism as image-based methods. For 3D facial animation, two main ingredients are required: a deformable face model that can be animated, and animation data that describe when and where the face changes. Deformable 3D face models can be roughly split into three basic techniques:
Example-based 3D morphing, also known as blend shape animation.
Physical simulation of tissue interaction (e.g., muscle simulations, finite-element systems); see, for example, the work of Sifakis, Neverov, and Fedkiw (2005).
General nonrigid deformation of distinct surface regions (e.g. radial basis function approaches, cluster or bone animation) (Guenter, Grimm, Wood, Malvar, & Pighin, 1998).
Animation data can be generated by interpolating manually created keyframes, by procedural animation systems (e.g., as produced by text-to-speech systems), or by capturing the motion of a live actor using a specialized recording system. Once the animation data are available, different techniques exist for transferring the motion onto the facial model. If the model is not identical to the actor's face, this process is called retargeting, and the capturing process is often called facial performance capture. A fourth category of facial animation techniques exists that combines the model and the animation into one dataset by densely recording shape, appearance, and timing aspects of a facial performance. By capturing both 3D and 2D data from moving
faces, the work described in this chapter uses a combination of both and thus allows for high realism while still being able to control individual details of the synthetic face. An example of general nonrigid deformations based on motion capture data is used in this book by Knappmeyer, whereas Boker and Cohn follow an example-based 2D approach, and Curio et al. describe a motion-retargeting approach for 3D space. A wide variety of modalities have been used to capture facial motion and transfer it onto animatable face models. Sifakis et al. (2005) used marker-based optical motion capture to drive a muscle simulation system; Blanz, Basso, Poggio, and Vetter (2003) used a statistical 3D face model to analyze facial motion from monocular video; Guenter et al. (1998) used dense marker motion data to directly deform an otherwise static 3D face geometry. Recently, several approaches for dense 3D capture of facial motion were presented: Kalberer and Gool (2002) used structured light projection and painted markers to build a face model, then employed independent component analysis to compute the basis elements of facial motion. For the feature film The Matrix, Borshukov and Lewis (2003) captured realistic faces in motion using a calibrated rig of high-resolution cameras and a 3D fusion of optical flow from each camera, but heavily relied on skilled manual intervention to correct for the accumulation of errors. The Mova multiple-camera system (http://www.mova.com) offers a capture service with a multicamera setup, establishing correspondence across cameras and time using speckle patterns caused by phosphorescent makeup. Zhang et al. (2004) employ color optical flow as constraints for tracking. Wand et al. (2007) present a framework for reconstructing moving surfaces from sequences based solely on unordered scanned 3D point clouds. They propose an iterative model assembly that produces the most appropriate mesh topology with correspondence through time.

Many algorithms in computer graphics make use of the concept of implicit surfaces, which are one way of defining a surface mathematically. The surface is represented implicitly by way of an embedding function that is zero on the surface of the object and has a different sign (positive or negative) on the interior and exterior of the object. The study of implicit surfaces in computer graphics began with Blinn (1982) and Nishimura et al. (1985). Variational methods have since been employed (Turk & O'Brien, 1999), and although the exact computation of the variational solution is computationally prohibitive, effective approximations have since been introduced (Carr et al., 2001) and proven useful in representing dynamic 3D scanner data by constructing 4D implicits (Walder, Schölkopf, & Chapelle, 2006). At the expense of useful regularization properties, the partition-of-unity methods exemplified by Ohtake, Belyaev, Alexa, Turk, and Seidel (2003) allow even greater efficiency. The method we present in this chapter is a new type of partition-of-unity implicit based on nearest-neighbor calculations. As a partition-of-unity method, it represents
the surface using many different local approximations that are then stitched together. The stitching is done by weighting each local approximation by one of a set of basis functions which together have the property that they sum to one, or in other words, they partition unity. The new approach extends trivially to arbitrary dimensionality, is simple to implement efficiently, and convincingly models moving surfaces such as faces. Thus we provide a fully automatic markerless tracking system that avoids the manual placement of individual mesh deformation constraints by a straightforward combination of texture and geometry through time, thereby providing both meshes in correspondence and complete texture images. Our approach is able to robustly track moving surfaces that are undergoing general motion without requiring 2D local flow calculations or any specific shape knowledge.

Hardware, Data Acquisition, and Computation
Here we provide only a short description of the scanning system we employed for this work since it is not our main focus. The dynamic 3D scanner we use is a commercial prototype developed by ABW GmbH (http://www.abw-3d.de) that uses a modified coded light approach with phase unwrapping (Wolf, 2003). The hardware (see figure 16.2) consists of a minirot H1 projector synchronized with two Photonfocus MV-D752-160 gray-scale cameras that run with 640 by 480 pixels at 200 frames per second, and one Basler A601fc color camera, running at 40 frames per second and a resolution of 656 by 490 pixels. Our results would likely benefit from higher resolutions, but it is in fact a testament to the efficacy of the algorithm that the system succeeds even with the current hardware setup. The distance from the projector to the recorded subject is approximately 1 m, and the baseline between the two gray-scale cameras is approximately 65 cm, with the color camera and projector in the middle. Before recording, the system is calibrated in order to determine the relative transformations between the cameras and projector. During recording, the subject wears a black ski mask to cover irrelevant parts of the head and neck. This is not necessary for our surface tracking algorithm but is
Figure 16.2 Setup of the dynamic 3D scanner. Two high-speed gray-scale cameras (a, d) compute forty depth images per second from coded light produced by the synchronized projector (c). Two strobes (far left and right) are triggered by the color camera (b), capturing color images with controlled lighting at a rate of forty frames per second.
Figure 16.3 A single unprocessed mesh produced by the 3D scanning system, with and without the color image (shown in the center) projected onto it.
useful in practice because it reduces the volume of scan data produced and thereby lightens the load on our computational pipeline. To minimize occlusion of the coded light, the subject faces the projector directly and avoids strong rigid head motion during the performances, which range between 5 and 10 seconds. After recording, relative phase information is computed from the stripe patterns cast by the projector. The relative phase is then unwrapped into absolute phase, allowing 3D coordinates to be computed for each valid pixel in each camera image, producing two overlapping triangle meshes per time step, which are then exported for processing by the tracking algorithm. A typical frame results in around 40k points with texture coordinates that index into the color image (see figure 16.3).

Problem Setting and Notation
Input
The data produced by our scanner consist of a sequence of 3D meshes with texture images, sampled at a constant rate. As a first step we transform each mesh into a set of points and normals, where the points are the mesh vertices and the corresponding normals are computed by a weighted average of the adjacent face normals using the method described by Desbrun, Meyer, Schröder, and Barr (2002). Furthermore, we append to each 3D point the time at which it was sampled, yielding a 4D spatiotemporal point cloud.
Figure 16.4 Result of the interactive, nonrigid alignment of a common template mesh to scans of three different people.
To simplify the subsequent notation, we also append to each 3D surface normal a fourth temporal component of zero value. To represent the color information, we assign to each surface point a 3D color vector representing the red green blue (RGB) color, which we obtain by projecting the mesh produced by the scanner into the texture image. Hence we summarize the data from the scanner as the following set of m (point, normal, color) triplets:

\{(x_i, n_i, c_i)\}_{1 \le i \le m}.   (16.1)

We write M_1 for the template mesh registered to the first frame, G for its edge list, V_i for the matrix of its n vertex locations at the i-th of the s frames, v_{i,j} for the j-th vertex of V_i, and \tilde{v}_{i,j} for the spatiotemporal version of v_{i,j} obtained by appending the respective time of the i-th frame. That is, \tilde{v}_{i,j} = (v_{i,j}^\top, \Delta i)^\top, where \Delta is the interval between frames.
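A sketch of how per-frame scans could be assembled into the spatiotemporal samples of equation 16.1 (the data layout and names are hypothetical; in the actual pipeline the normals come from the mesh and the colors from projection into the texture image):

import numpy as np

def build_spacetime_cloud(frames, delta):
    """frames: list of scans, each a dict with 'points' (k, 3), 'normals' (k, 3),
    and 'colors' (k, 3) arrays. delta: time interval between frames.
    Returns x (m, 4), n (m, 4), c (m, 3) as in equation 16.1."""
    xs, ns, cs = [], [], []
    for i, frame in enumerate(frames):
        k = len(frame["points"])
        t = np.full((k, 1), i * delta)
        xs.append(np.hstack([frame["points"], t]))                  # append sampling time
        ns.append(np.hstack([frame["normals"], np.zeros((k, 1))]))  # temporal component = 0
        cs.append(frame["colors"])
    return np.vstack(xs), np.vstack(ns), np.vstack(cs)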
The Algorithm

We take the widespread approach of minimizing an energy functional, E_obj., which in our case is defined in terms of the entire sequence of vertex locations, V_1, V_2, ..., V_s. Rather than using the (point, normal, color) triplets of equation 16.1 directly, we instead use summarized versions of the geometry and color, as represented by the implicit surface embedding function f_imp. and the color function f_col., respectively. The construction of these functions is explained in detail in the appendix at the end of this chapter. For now, it is sufficient to know that they can be set up and evaluated rather efficiently and have the following properties:

1. f_imp.: R^4 → R takes as input a spatiotemporal location [say, x = (x, y, z, t)^⊤] and returns an estimate of the signed distance to the scanned surface. The signed distance to a surface S evaluated at x has absolute value |dist(S, x)| and a sign that differs on different sides of S. At any fixed t, the 4D implicit surface can be thought of as a 3D implicit surface in (x, y, z) (see figure 16.5, left).

2. f_col.: R^4 → R^3 takes a similar input, but returns a 3-vector representing an estimate of the RGB color value at any given point. Evaluated away from the surface, the function returns an estimate of the color of the surface nearest to the evaluation point (see figure 16.5, right).

3. Both functions are differentiable almost everywhere, vary smoothly through both space and time, and (under some mild assumptions on the density of the samples) can be set up and evaluated efficiently; see the appendix at the end of this chapter.

Modeling the geometry and color in this way has the practical advantage that as we construct f_imp. and f_col., we may separately adjust parameters that pertain to the noise level in the raw data, and then visually verify the result. Having done so, we may solve the tracking problem under the assumption that f_imp. and f_col. contain little noise, while summarizing the relevant information in the raw data. The energy we minimize depends on the vertex locations through time and the connectivity (edge list) of the template mesh, the implicit surface model, and the color model, i.e., V_1, ..., V_s, G, f_imp., and f_col.
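The actual construction of f_imp. and f_col. is the nearest-neighbor partition-of-unity model described in the appendix; as a rough, simplified stand-in that only illustrates the interface, one can estimate the signed distance from the point-to-plane distance to the nearest oriented sample and the color as an inverse-distance-weighted average of the nearest samples.

import numpy as np
from scipy.spatial import cKDTree

class NaiveSurfaceModel:
    """Simplified stand-in for f_imp. and f_col. based on nearest neighbors in
    (x, y, z, t). It only mimics the interface; the chapter's model is a
    smoother partition-of-unity implicit."""
    def __init__(self, x, n, c, k=8):
        # x: (m, 4) spatiotemporal points, n: (m, 4) normals, c: (m, 3) colors.
        self.tree, self.x, self.n, self.c, self.k = cKDTree(x), x, n, c, k

    def f_imp(self, q):
        # Signed point-to-plane distance to the nearest oriented sample.
        _, i = self.tree.query(q, k=1)
        return float(np.dot(q - self.x[i], self.n[i]))

    def f_col(self, q):
        # Inverse-distance-weighted average of the k nearest surface colors.
        d, idx = self.tree.query(q, k=self.k)
        w = 1.0 / (d + 1e-9)
        return (w[:, None] * self.c[idx]).sum(axis=0) / w.sum()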
Figure 16.5 The nearest-neighbor implicit surface (left, intensity plot of f_imp., darker is more positive) and color (right, gray-level plot of f_col.) models. Although f_col. regresses on RGB values, for this plot we have mapped to gray scale for printing. Here we fix time and one space dimension, plotting over the two remaining space dimensions. The data used are that of a human face; we depict here a vertical slice that cuts the nose, revealing the profile contour with the nose pointing to the right. For reference, a line along the zero level set of the implicit appears in both images.
With a slight abuse of notation, the functional can be written

E_{obj.} \equiv \sum_{l \in \mathrm{terms}} \alpha_l E_l,   (16.2)
where the \alpha_l are parameters that we fix as described in the section on parameter selection, and the E_l are the individual terms of the energy function, which we now introduce. Note that it is possible to interpret the minimizer of the above energy functional as the maximum a posteriori estimate of a posterior likelihood in which the individual terms \alpha_l E_l are interpreted as negative log probabilities, but we do not elaborate on this point.

Distance to the Surface
The first term is straightforward; in order to keep the mesh close to the surface, we approximate the integral over the template mesh of the squared distance to the scanned surface. As an approximation to this squared
distance we take the squared value of the implicit surface embedding function f_imp.. We approximate the integral by taking an area-weighted sum over the vertices. The quantity we minimize is given by

E_{imp.} \equiv \sum_{i} \sum_{j} a_j \, f_{imp.}(\tilde{v}_{i,j})^2.   (16.3)
Here, as throughout, a_j refers to the Voronoi area (Desbrun et al., 2002) of the j-th vertex of M_1, the template mesh at its starting position.

Color
We assume that each vertex should remain on a region of stable color, and accordingly we minimize the sum over the vertices of the sample variance of the color components observed at the sampling times of the dynamic 3D scanner. We discuss the validity of this assumption in our presentation of the results. The sample variance of a vector of observations y = (y_1, y_2, ..., y_s)^\top is

V(y) \equiv \sum_{i=1}^{s} \left( y_i - \sum_{i'=1}^{s} y_{i'} / s \right)^2 \Big/ s.

To ensure a scaling that is compatible with that of E_imp., we neglect the term 1/s in the above expression. Summing these variances over the RGB channels, and taking the same approximate integral as before, we obtain the following quantity to be minimized:

E_{col.} \equiv \sum_{i,j} a_j \left\| f_{col.}(\tilde{v}_{i,j}) - \sum_{i'} f_{col.}(\tilde{v}_{i',j}) / s \right\|^2.   (16.4)
Acceleration
To guarantee smooth motion and temporal coherence, we also minimize a similar approximation to the surface integral of the squared acceleration of the mesh. For a physical analogy, this is similar to minimizing a discretization in time and space of the integral of the squared accelerating forces acting on the mesh, assuming that it is perfectly flexible and has constant mass per area. The corresponding term is given by

E_{acc.} \equiv \sum_{j} a_j \sum_{i=2}^{s-1} \| v_{i-1,j} - 2 v_{i,j} + v_{i+1,j} \|^2.   (16.5)
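The three data terms introduced so far transcribe directly into code. The sketch below assumes an (s, n, 3) array V of vertex trajectories, a length-s vector of frame times, the length-n vector a of Voronoi areas of the template mesh, and callables f_imp and f_col such as the simplified stand-ins sketched earlier; it is an illustration, not the chapter's implementation.

import numpy as np

def energy_terms(V, times, a, f_imp, f_col):
    """Equations 16.3-16.5 for vertex trajectories V of shape (s, n, 3)."""
    s, n, _ = V.shape
    # Spatiotemporal vertices v~_{i,j} = (v_{i,j}, t_i).
    t = np.broadcast_to(np.asarray(times, dtype=float)[:, None, None], (s, n, 1))
    Vt = np.concatenate([V, t], axis=2)

    # Equation 16.3: area-weighted squared distance to the implicit surface.
    imp = np.array([[f_imp(Vt[i, j]) for j in range(n)] for i in range(s)])
    E_imp = float((a * imp ** 2).sum())

    # Equation 16.4: per-vertex color variance over frames (without the 1/s).
    col = np.array([[f_col(Vt[i, j]) for j in range(n)] for i in range(s)])
    E_col = float((a[None, :, None] * (col - col.mean(axis=0)) ** 2).sum())

    # Equation 16.5: area-weighted squared second differences (acceleration).
    acc = V[:-2] - 2.0 * V[1:-1] + V[2:]
    E_acc = float((a[None, :, None] * acc ** 2).sum())

    return E_imp, E_col, E_acc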
Mesh Regularization
In addition to the previous terms, it is also necessary to regularize deformations of the template mesh in order to prevent unwanted distortions during
the tracking phase. Typically such regularization is done by minimizing measures of the amount of bending and stretching of the mesh. In our case, however, since we are constraining the mesh to lie on the surface defined by f_imp., which itself bends only as much as the scanned surface, we need only control the stretching of the template mesh. We now provide a brief motivation for our regularizer. It is possible to use variational measures of mesh deformations, but we found these energies inappropriate for the following reason. In our experiments with them, it was difficult to choose the correct amount by which to penalize the terms; we invariably encountered one of two undesirable scenarios: (1) the penalization was insufficient to prevent undesirable stretching of the mesh in regions of low deformation, or (2) the penalization was too great to allow the correct deformation in regions of high deformation. It is more effective to penalize an adaptive measure of stretch, which measures the amount of local distortion of the mesh while retaining invariance to the absolute amount of stretch. To this end, we compute the ratio of the areas of adjacent triangles and penalize the deviation of this ratio from that of the initial template mesh M_1. The precise expression is

E_{reg.} \equiv \sum_{i=2}^{s} \sum_{e \in G} a(e) \left( \frac{\mathrm{area}[\mathrm{face}_1(e_i)]}{\mathrm{area}[\mathrm{face}_2(e_i)]} - \frac{\mathrm{area}[\mathrm{face}_1(e_1)]}{\mathrm{area}[\mathrm{face}_2(e_1)]} \right)^2.   (16.6)
Here, face_1(e) and face_2(e) are the two triangles containing edge e, area(·) is the area of the triangle, and a(e) = area[face_1(e_1)] + area[face_2(e_1)]. Note that the ordering of face_1 and face_2 affects the above term. In practice we restore invariance with respect to this ordering by augmenting the above energy with an identical term with reversed order.
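A sketch of one direction of equation 16.6 (hypothetical data layout: a face list and, for every edge, the indices of its two adjacent triangles; the reversed-order term would be accumulated analogously with the two faces swapped):

import numpy as np

def tri_area(V, face):
    """Area of the triangle face = (i0, i1, i2) over vertices V of shape (n, 3)."""
    e1, e2 = V[face[1]] - V[face[0]], V[face[2]] - V[face[0]]
    return 0.5 * np.linalg.norm(np.cross(e1, e2))

def e_reg(V_seq, faces, edge_faces):
    """V_seq: (s, n, 3) vertex positions with frame 0 the template mesh M_1;
    faces: list of vertex triples; edge_faces: (f1, f2) face indices per edge."""
    total = 0.0
    for f1, f2 in edge_faces:
        a1_ref = tri_area(V_seq[0], faces[f1])
        a2_ref = tri_area(V_seq[0], faces[f2])
        weight = a1_ref + a2_ref                  # a(e) in equation 16.6
        ref_ratio = a1_ref / a2_ref
        for i in range(1, len(V_seq)):            # frames 2, ..., s
            ratio = tri_area(V_seq[i], faces[f1]) / tri_area(V_seq[i], faces[f2])
            total += weight * (ratio - ref_ratio) ** 2
    return total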
Implementation

Below we provide details on the parameterization and optimization of our tracking algorithm.

Deformation-Based Reparameterization

So far we have cast the surface tracking problem as an optimization with respect to the $3(s-1)n$ variables corresponding to the $n$ 3D vertex locations of frames $2, 3, \ldots, s$. This has the following shortcomings:

1. It necessitates further regularization terms to prevent folding and clustering of the mesh, for example.
2. The number of variables is rather large.
3. Compounding the previous shortcoming, convergence will be slow, as this direct parameterization is guaranteed to be ill-conditioned. This is because, for example,
the regularization term $E_{\text{reg.}}$ of equation 16.6 acts in a sparse manner between individual vertices. Hence, loosely speaking, gradients in the objective function that are due to local information (for example, the color term $E_{\text{col.}}$ of equation 16.4) will be propagated by the regularization term in a slow, domino-like manner from one vertex to the next only after each subsequent step in the optimization.

Figure 16.6 Deformation using control vertices. In this example the template mesh (left) is deformed via three deformation control vertices (black dots) with deformation displacement constraints (black arrows), leading to the deformed mesh (right). In this example 117 control vertices were used (white dots).

A simple way of overcoming these shortcomings is to optimize with respect to a lower-dimensional parameterization of plausible meshes. To do this, we manually select a set of control vertices that are displaced in order to deform the template mesh (see figure 16.6). Note that the precise placement of these control vertices is not critical provided they afford sufficiently many degrees of freedom. To this end, we take advantage of some ideas from interactive mesh deformation (Botsch & Kobbelt, 2004). This leads to a linear parameterization of the vertex locations $V_2, V_3, \ldots, V_s$, namely

$$\bar{V}_i = V_1 + P_i B, \tag{16.7}$$

where $P_i \in \mathbb{R}^{3 \times p}$ represents the free parameters and $B \in \mathbb{R}^{p \times n}$ represents the basis vectors derived from the deformation scheme (Botsch & Sorkine, 2008). We have written $\bar{V}_i$ instead of $V_i$ because we apply another parameterized transformation, namely, the rigid-body transformation. This is necessary since the surfaces we wish to track are not only deformed versions of the template but also undergo rigid-body motion. Hence our vertex parameterization takes the form
$$V_i = R(\theta_i)\,\bar{V}_i + r_i = R(\theta_i)\,(V_1 + P_i B) + r_i, \tag{16.8}$$

where $r_i \in \mathbb{R}^3$ allows an arbitrary translation, $\theta_i = (\alpha_i, \beta_i, \gamma_i)^\top$ is a vector of angles, and $R(\theta)$ is the product of elementary rotations about the three coordinate axes,

$$R(\theta) =
\begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos\alpha & -\sin\alpha \\ 0 & \sin\alpha & \cos\alpha \end{pmatrix}
\begin{pmatrix} \cos\beta & 0 & \sin\beta \\ 0 & 1 & 0 \\ -\sin\beta & 0 & \cos\beta \end{pmatrix}
\begin{pmatrix} \cos\gamma & -\sin\gamma & 0 \\ \sin\gamma & \cos\gamma & 0 \\ 0 & 0 & 1 \end{pmatrix}.$$
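Putting equations 16.7 and 16.8 together, one possible implementation of the reduced vertex parameterization looks as follows. Unlike the chapter, which stores vertices as the columns of a $3 \times n$ matrix, this sketch uses the transposed row convention (so the control displacements are passed as a $p \times 3$ array), and the Euler-angle order and sign conventions shown are our assumption.

```python
import numpy as np

def rotation_matrix(alpha, beta, gamma):
    """Product of elementary rotations about the x-, y-, and z-axes."""
    ca, sa = np.cos(alpha), np.sin(alpha)
    cb, sb = np.cos(beta), np.sin(beta)
    cg, sg = np.cos(gamma), np.sin(gamma)
    rx = np.array([[1, 0, 0], [0, ca, -sa], [0, sa, ca]])
    ry = np.array([[cb, 0, sb], [0, 1, 0], [-sb, 0, cb]])
    rz = np.array([[cg, -sg, 0], [sg, cg, 0], [0, 0, 1]])
    return rx @ ry @ rz

def frame_vertices(V1, P_i, B, theta_i, r_i):
    """V_i = R(theta_i) (V_1 + P_i B) + r_i, with vertices stored as rows.

    V1 : (n, 3) template vertices, P_i : (p, 3) control-vertex displacements,
    B  : (p, n) deformation basis (as in the chapter), r_i : (3,) translation
    """
    deformed = V1 + B.T @ P_i          # (n, 3): V_1 + P_i B in row convention
    R = rotation_matrix(*theta_i)
    return deformed @ R.T + r_i        # rotate each vertex, then translate
```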
Remarks on the Reparameterization

The way in which we have proposed to reparameterize the mesh does not amount to tracking only the control vertices. Rather, the objective function contains terms from all vertices, and the positions of the control vertices are optimized to minimize this global error. Alternatively, one could optimize all vertex positions in an unconstrained manner. The main drawback of doing so, however, is not the greatly increased computation time but the fact that allowing each vertex to move freely necessitates numerous additional regularization terms in order to prevent undesirable mesh behaviors such as triangle flipping. While such regularization terms may succeed in solving this problem, the reparameterization described here is a more elegant solution: we found the problem of choosing the various additional regularization parameters to be more difficult in practice than the problem of choosing a set of control vertices sufficient to capture the motion of interest. Hence the computational advantages of our scheme are a fortunate side effect of the regularizer induced by the reparameterization.
Optimizer

We use the popular L-BFGS-B optimization algorithm of Byrd, Lu, Nocedal, and Zhu (1995), a quasi-Newton method that requires as input (1) a function that returns the value and gradient of the objective function at an arbitrary point and (2) a starting point. We set the number of optimization line searches to twenty-five for all of our experiments. In our case, the optimization of the $V_i$ is done with respect to the parameters $\{(P_i, \theta_i, r_i)\}$ described earlier. Hence the function passed to the optimizer first uses equation 16.8 to compute the $V_i$ from the parameters, then computes the objective function of equation 16.2 and its gradient with respect to the $V_i$, and finally uses these gradients to compute the gradients with respect to the parameters by applying the chain rule.
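A sketch of such an optimizer call using SciPy's interface to the same L-BFGS-B algorithm. Here `pack`, `unpack`, and `objective_and_grad` are hypothetical helpers standing in for the parameter handling and the chain-rule gradient computation described above, and the `maxiter` option is used as a rough stand-in for the chapter's limit of twenty-five line searches.

```python
import numpy as np
from scipy.optimize import minimize

def optimize_window(pack, unpack, objective_and_grad, x0, max_iterations=25):
    """Run L-BFGS-B on the reduced parameters of one temporal window.

    pack/unpack convert between the flat parameter vector used by the
    optimizer and the per-frame parameters (P_i, theta_i, r_i);
    objective_and_grad(params) must return the objective value and its
    gradient with respect to those parameters (chain rule already applied).
    """
    def fun(x):
        value, grad_params = objective_and_grad(unpack(x))
        # The gradient has the same structure as the parameters,
        # so the same pack() helper flattens it for the optimizer.
        return value, pack(grad_params)

    result = minimize(fun, x0, jac=True, method="L-BFGS-B",
                      options={"maxiter": max_iterations})
    return unpack(result.x), result.fun
```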
Incremental Optimization

It turns out that even in this lower-dimensional space of parameters, optimizing the entire sequence at once in this manner is computationally infeasible. First, the number of variables is still rather large: $3(s-1)(p+2)$, corresponding to the parameters $\{(P_i, \theta_i, r_i)\}_{i=2,\ldots,s}$. Second, the objective function is rather expensive to compute, as we discuss in the next paragraph. However, optimizing the entire sequence would be problematic even if it were computationally feasible, owing to the difficulty of finding a good starting point for the optimization. Since the objective function is nonconvex, it is essential to be able to find a starting point that is near a good local minimum, but it is unclear how to initialize all frames $2, 3, \ldots, s$ given only the first frame and the raw scanner data.

Fortunately, both the computational issue and that of the starting point are easily dealt with by incrementally optimizing within a moving temporal window. In particular, we first optimize frame 2, then frames 2–3, frames 2–4, frames 3–5, frames 4–6, and so on. With the exception of the first two steps, we always optimize a window of three frames, with all previous frames held fixed. It is then reasonable to initialize the parameters of each newly included frame with those of the previous frame at the end of the previous optimization step. Note that although we optimize on a temporal window with the other frames fixed, we include in the objective function all frames from the first to the current, eventually encompassing the entire sequence. Hence, loosely speaking, the color variance term $E_{\text{col.}}$ of equation 16.4 forces each vertex inside the optimization window to stay within regions whose color is similar to that "seen" by the given vertex at previous time steps. One could also treat the final output of the incremental optimization as a starting point for optimizing the entire sequence with all parameters unfixed, but we found that this leads to little change in practice. This is not surprising: given the moving window of three frames, the optimizer essentially has three chances to get each frame right, with a forward and backward lookahead of up to two frames.

Parameter Selection

Choosing parameter values is straightforward since the terms in equation 16.2 are, loosely speaking, close to orthogonal. For example, tracking color and staying near the implicit surface are goals that typically compete very little; either can be satisfied without compromising the other. Hence the results are insensitive to the ratio of the corresponding weights, namely $\alpha_{\text{imp.}}/\alpha_{\text{col.}}$. Furthermore, the parameters relating to the nearest-neighbor-based implicit surface and color models, and to the deformation-based reparameterization, can both be verified independently of the tracking step and were fixed for all experiments (see the appendix for details). In order to determine suitable settings for $\alpha_{\text{imp.}}$, $\alpha_{\text{col.}}$, $\alpha_{\text{acc.}}$, and $\alpha_{\text{reg.}}$ of equation 16.2, we employed the following strategy. First, we removed a degree of freedom by fixing, without loss of generality, $\alpha_{\text{imp.}} = 1$. Next we assumed that the implicit surface was sufficiently reliable and treated the distance-to-surface term almost like the hard constraint $E_{\text{imp.}} = 0$ by setting the next parameter, $\alpha_{\text{col.}}$, to $1/100$. We then took a sample dataset, ran the system over a 2D grid of values of $\alpha_{\text{acc.}}$ and $\alpha_{\text{reg.}}$, inspected the results visually, and fixed these two remaining parameters accordingly for all subsequent experiments.
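Returning to the incremental optimization described above, the moving-window schedule can be written down compactly; the sketch below reproduces the window pattern given in the text (frame numbering is 1-based, as in the chapter, and the helper name is ours).

```python
def window_schedule(num_frames, width=3):
    """Yield (free_frames, last_fixed_frame) pairs for the incremental optimization.

    Frame 1 is the fitted template and never moves; frames before the window
    stay fixed but still contribute terms to the objective function.
    """
    for last in range(2, num_frames + 1):
        first_free = max(2, last - width + 1)
        yield list(range(first_free, last + 1)), first_free - 1

# Example for a 7-frame sequence: the free windows are
# [2], [2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6], [5, 6, 7].
```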
Results

Tracking results are best visualized with animation; hence the majority of our results are presented in the accompanying video (available at http://www.kyb.mpg.de/~dynfaces). Here we discuss the practical performance of the system and show still images from the animated sequences produced by the surface-tracking algorithm (see figure 16.7).
Figure 16.7 Challenging examples visualized by projecting the tracked mesh into the color camera image.
Performance
We now provide timings of the tracking algorithm, which ran on a 64-bit, 2.4-GHz AMD Opteron 850 processor with 4 GB of random-access memory (RAM), using a mixture of Matlab and C++ code. We focus on timings for face data and report only averages, since the timings vary little with respect to identity and performance. The recording length is currently limited to 400 frames by the amount of RAM on the scanner computer, which is limited owing to its 32-bit architecture: 64-bit drivers are not available for our frame grabbers, and the data rate is too great to store directly to hard disk. Note that this limitation is not due to our tracking algorithm, which has constant memory and linear time requirements in the length of the sequence. After recording a sequence, the scanning software computes the geometry with texture coordinates in the form depicted in the top panel of figure 16.8. Before starting the tracking algorithm, a fraction of a second per frame is required to compute (point, normal, color) triplets from the output of the scanner. The computation time for $B$ of equation 16.7 was negligible for the face template mesh we used, because it consists of only 2,100 vertices. Almost all the computation time of the tracking algorithm was spent evaluating the objective function and its gradient during the optimization phase, and of this, about 80% was spent doing nearest-neighbor searches into the scanner data using the algorithm of Merkwirth, Parlitz, and Lauterborn (2000) in order to evaluate the implicit surface and color models. Including the 1–2 seconds required to build the data structure of the nearest-neighbor search algorithm for each temporal window, the optimization phase of the tracking algorithm required about 20 seconds per frame. Note that the algorithm is trivially parallelizable and that only a small fraction of the recorded data needs to be stored in RAM at any given time. Note also that the computation times seem to scale roughly linearly with template mesh density.

Markerless Tracking
Figure 16.8 shows various stills from the recording of a single subject, depicting the data input to the tracking system as well as various visualizations of the tracking results. The tracking result is convincing and exhibits very little accumulation of error, as can be seen from the consistent alignment of the template mesh with the neutral expression in the first and last frames. Since no markers were used, the original color camera images can be projected onto the deformed template mesh, yielding photorealistic expression wrinkles. Some more challenging examples are shown in figure 16.7. The expressions performed by the male subject in the top row involve complex deformations around the mouth area that the algorithm captures convincingly. To test the reliance on color, we also applied face paint to the female subject shown in the video of our results at http://www.kyb.mpg.de/~dynfaces. The deterioration in performance is graceful in spite of both the high specularity of the paint and the sparseness of the color information. To demonstrate that the system is not specific to faces, we also include a final example showing a colored cloth being tracked in the same manner as all of the other examples, only with a different template mesh topology. The cloth tracking exhibits only minor inaccuracies around the border of the mesh, where there is less information to resolve the problems caused by plain-colored and strongly shadowed regions. A further example included in the accompanying video shows a uniformly colored, deforming, and rotating piece of foam being tracked reliably using shape cues alone.

Figure 16.8 Snapshots from a tracking sequence. The top panel shows the input to the tracking system: texture images from the color camera (top) and the geometry data (bottom). The bottom panel visualizes the output of the system: the tracked mesh with a checkerboard pattern to show the correspondence (top), the tracked mesh with animated texture taken directly from the color camera (middle), and, for reference, the tracked mesh as a wire frame projected into the original color camera image (bottom). We show the first and final frames of the sequence at the far left and right, with three intermediate frames between them.

Discussion and Future Work
By design, our algorithm does not use optical flow calculations as the basis for surface tracking. Rather, we combine shape and color information on a coarser scale, under the assumption that the color does not change excessively on any part of the surface. This assumption did not cause major problems in the case of expression wrinkles, because such wrinkles tend to appear and disappear on a part of the face with little relative motion with respect to the skin. Hence, in terms of the color penalty in the objective function, wrinkles do not induce a strong force in any specific direction. Although there are other lighting effects that are more systematic, such as specularities and self-shadowing, we believe these do not represent a serious practical concern, for the following reasons. First, we found that in practice the changes caused by shadows and highlights were largely accounted for by the redundancy in color and shape over time. Second, it would be easy to reduce the severity of these lighting effects using light polarizers, more strobes, and lighting normalization based on a model of the fixed-scene lighting. With the recent interest in markerless surface-capturing methods, we hope that in the future the performance of new approaches such as those presented in this chapter can be systematically compared with others.

The tracking system we have presented is automated; however, it is straightforward to modify the energy functional we minimize in order to allow the user to edit the result by adding vertex constraints, for example. It would also be interesting to develop a system that can improve the mesh regularization terms in a face-specific manner by learning from previous tracking results. Another interesting direction is intelligent occlusion handling, which could overcome some of the limitations of structured light methods and also allow the tracking of more complex self-occluding objects.

Acknowledgments
This work was supported by the European Union project BACS FP6-IST-027140 and the Deutsche Forschungsgemeinschaft (DFG) Perceptual Graphics project PAK 38.

Appendix: KNN Implicit Surface and Color Models
In this appendix we motivate and define our nearest-neighbor-based implicit surface and color models. Our approach falls into the category of partition-of-unity methods, in which locally approximating functions are mixed together to form a global one. Let $\Omega$ be our domain of interest and assume that we have a set of non-negative (and typically compactly supported) functions $\{\varphi_i\}$ that satisfy

$$\sum_i \varphi_i(x) = 1, \qquad \forall x \in \Omega. \tag{16.9}$$

Now let $\{f_i\}$ be a set of locally approximating functions, one for each $\operatorname{supp}(\varphi_i)$. The partition-of-unity approximating function on $\Omega$ is $f(x) = \sum_i \varphi_i(x)\, f_i(x)$. The $\varphi_i$ are typically defined implicitly by way of a set of compactly supported auxiliary functions $\{w_i\}$. Provided the $w_i$ are non-negative and satisfy $\operatorname{supp}(w_i) = \operatorname{supp}(\varphi_i)$, the following choice is guaranteed to satisfy equation 16.9:

$$\varphi_i = \frac{w_i}{\sum_j w_j}.$$

At present we take the extreme approach of associating a local approximating function $f_i$ with each data point from the set $x_1, x_2, \ldots, x_m \in \mathbb{R}^4$ produced by our scanner. In particular, for the implicit surface embedding function $f_{\text{imp.}}: \mathbb{R}^4 \to \mathbb{R}$, we associate with $x_i$ the linear locally approximating function $f_i(x) = (x - x_i)^\top n_i$, where $n_i$ is the surface normal at $x_i$. For the color model $f_{\text{col.}}: \mathbb{R}^4 \to \mathbb{R}^3$, the local approximating functions are simply the constant vector-valued functions $f_i(x) = c_i$, where $c_i \in \mathbb{R}^3$ represents the RGB color at $x_i$. Note that the description here constitutes a slight abuse of notation, owing to our having redefined $f_i$ twice.
Figure 16.9 An $\mathbb{R}^1$ example of our nearest-neighbor-based mixing functions $\{\varphi_i\}$, with $k = 5$. The horizontal axis represents the one-dimensional real line on which the $\{x_i\}$ are shown as crosses. The correspondingly colored curves represent the values of the mixing functions $\{\varphi_i\}$.
To define the $\varphi_i$, we first assume without loss of generality that $d_1 \le d_2 \le \cdots \le d_k \le d_i$, $\forall i > k$, where $x$ is our evaluation point and $d_i = \|x - x_i\|$. In practice we obtain such an ordering by way of a $k$ nearest-neighbor search using the TSTOOL software library (Merkwirth et al., 2000). By now letting $r_i \equiv d_i / d_k$ and choosing $w_i = (1 - r_i)_+$, it is easy to see that the corresponding $\varphi_i$ of equation 16.9 are continuous, differentiable almost everywhere, and that we only need to examine the $k$ nearest neighbors of $x$ in order to compute them (see figure 16.9). Note that the nearest-neighbor search costs are easily amortized between the evaluation of $f_{\text{imp.}}$ and $f_{\text{col.}}$. Larger values of $k$ average over more local estimates and hence lead to smoother functions; for our experiments we fixed $k = 50$. Note also that the nearest-neighbor search requires Euclidean distances in 4D, so we must decide, say, what spatial distance is equivalent to the temporal distance between frames. If the spatial distance is too small, each frame will be treated separately, whereas if it is too large, the frames will be smeared together temporally. The heuristic we used was to adjust the time scale so that, on average, approximately half of the $k$ nearest neighbors of each data point come from the same time (that is, the same 3D frame from the scanner) as that data point, and the other half come from the surrounding frames. In this way we obtain functions that vary smoothly through space and time. It is easy to verify the effect of this choice visually by rendering the implicit surface and color models, as demonstrated in the accompanying video.

This method is particularly efficient when we optimize on a moving window, as discussed in the implementation section. Provided the data are of roughly constant spatial density near the surface, as is the case with our dynamic 3D scanner, one may easily bound the temporal interval between any given point in the optimization window and its $k$-th nearest neighbor. Hence it is possible to perform the nearest-neighbor searches on a temporal slice of the full dataset. In this case, for a constant temporal window size, the implicit surface and color models enjoy setup and evaluation costs of $O[q \log(q)]$ and $O[k \log(q)]$, respectively, where $q$ is the number of vertices in a single 3D scan from the scanner. These costs are those of building and traversing the data structure used by the nearest-neighbor searcher (Merkwirth et al., 2000).
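To make the appendix concrete, here is a compact partition-of-unity sketch built on a k-d tree rather than the TSTOOL library used by the authors. The class name, the data layout, the use of only the spatial components of $(x - x_i)$ in the local linear approximants, and the assumption that the time coordinate has already been rescaled according to the heuristic above are all ours.

```python
import numpy as np
from scipy.spatial import cKDTree

class KnnSurfaceColorModel:
    """Partition-of-unity implicit surface and color models over 4D scan points.

    points  : (m, 4) space-time samples (x, y, z, rescaled t)
    normals : (m, 3) surface normals at the samples
    colors  : (m, 3) RGB values at the samples
    """
    def __init__(self, points, normals, colors, k=50):
        self.tree = cKDTree(points)
        self.points, self.normals, self.colors, self.k = points, normals, colors, k

    def _weights(self, x):
        # x : (q, 4) batch of query points; distances come back sorted ascending.
        d, idx = self.tree.query(x, k=self.k)
        w = np.maximum(0.0, 1.0 - d / d[:, -1:])      # w_i = (1 - d_i / d_k)_+
        return idx, w / w.sum(axis=1, keepdims=True)  # normalize to a partition of unity

    def f_imp(self, x):
        """Implicit surface estimate: mixture of local linear approximants."""
        idx, phi = self._weights(x)
        diff = x[:, None, :3] - self.points[idx][:, :, :3]          # spatial offsets only
        local = np.einsum('qkd,qkd->qk', diff, self.normals[idx])   # (x - x_i) . n_i
        return np.sum(phi * local, axis=1)

    def f_col(self, x):
        """Color estimate: mixture of constant local approximants c_i."""
        idx, phi = self._weights(x)
        return np.einsum('qk,qkc->qc', phi, self.colors[idx])
```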
References

Blanz, V., Basso, C., Poggio, T., & Vetter, T. (2003). Reanimating faces in images and video. Comput Graph Forum, 22, 641–650.
Blinn, J. F. (1982). A generalization of algebraic surface drawing. SIGGRAPH Comput Graph, 16, 273.
Borshukov, G., Piponi, D., Larsen, O., Lewis, J. P., & Tempelaar-Lietz, C. (2003). Universal capture: Image-based facial animation for "The Matrix Reloaded." In SIGGRAPH 2003 Sketches. New York: ACM Press.
Botsch, M., & Kobbelt, L. (2004). An intuitive framework for real-time freeform modeling. In SIGGRAPH '04: ACM SIGGRAPH 2004 Papers (pp. 630–634). New York: ACM Press.
Botsch, M., & Sorkine, O. (2008). On linear variational surface deformation methods. IEEE Trans Vis Comput Graph, 14, 213–230.
Byrd, R. H., Lu, P., Nocedal, J., & Zhu, C. (1995). A limited memory algorithm for bound constrained optimization. SIAM J Sci Comput, 16, 1190–1208.
Carr, J. C., Beatson, R. K., Cherrie, J. B., Mitchell, T. J., Fright, W. R., McCallum, B. C., & Evans, T. R. (2001). Reconstruction and representation of 3D objects with radial basis functions. In ACM SIGGRAPH 2001 (pp. 67–76). New York: ACM Press.
Desbrun, M., Meyer, M., Schröder, P., & Barr, A. H. (2002). Discrete differential-geometry operators for triangulated 2-manifolds. Vis Math, 2, 35–57.
Guenter, B., Grimm, C., Wood, D., Malvar, H., & Pighin, F. (1998). Making faces. In SIGGRAPH '98: Proceedings of the 25th annual conference on computer graphics and interactive techniques (pp. 55–66). New York: ACM Press.
Kalberer, G. A., & Gool, L. V. (2002). Realistic face animation for speech. J Visual Comput Animat, 13, 97–106.
Kazhdan, M., Bolitho, M., & Hoppe, H. (2006). Poisson surface reconstruction. In SGP '06: Proceedings of the fourth Eurographics symposium on geometry processing (pp. 61–70). Aire-la-Ville, Switzerland: Eurographics Association.
Merkwirth, C., Parlitz, U., & Lauterborn, W. (2000). Fast nearest-neighbor searching for nonlinear signal processing. Phys Rev E, 62, 2089–2097.
Nishimura, H., Hirai, M., Kawai, T., Kawata, T., Shirkaw, I., & Omura, K. (1985). Object modeling by distribution function and a method of image generation. Trans Inst Electron Commun Eng Japan, 68, 718–725.
Ohtake, Y., Belyaev, A., Alexa, M., Turk, G., & Seidel, H.-P. (2003). Multi-level partition of unity implicits. ACM Trans Graph, 22, 463–470.
Sifakis, E., Neverov, I., & Fedkiw, R. (2005). Automatic determination of facial muscle activations from sparse motion capture marker data. ACM Trans Graph, 24, 417–425.
Turk, G., & O'Brien, J. F. (1999). Shape transformation using variational implicit functions. In Proceedings of ACM SIGGRAPH 1999 (pp. 335–342). New York: ACM Press.
Walder, C., Schölkopf, B., & Chapelle, O. (2006). Implicit surface modelling with a globally regularised basis of compact support. Proc Eurographics, 25, 635–644.
Wand, M., Jenke, P., Huang, Q., Bokeloh, M., Guibas, L., & Schilling, A. (2007). Reconstruction of deforming geometry from time-varying point clouds. In SGP '07: Proceedings of the fifth Eurographics symposium on geometry processing (pp. 49–58). Aire-la-Ville, Switzerland: Eurographics Association.
Wolf, K. (2003). 3D measurement of dynamic objects with phase-shifting techniques. In T. Ertl (ed.), Proceedings of the vision, modeling, and visualization conference 2003 (pp. 537–544). Aka GmbH.
Zhang, L., Snavely, N., Curless, B., & Seitz, S. M. (2004). Spacetime faces: High-resolution capture for modeling and animation. In ACM SIGGRAPH 2004 (pp. 548–558). New York: ACM Press.
Contributors
Institute for Neural Computation, University of California, San Diego, San Diego, California
Marian Bartlett
Department of Psychology, University of Virginia, Charlottesville,
Steven M. Boker
Virginia Department of Human Perception, Cognition and Action, Max Planck Institute for Biological Cybernetics, Tu¨bingen, Germany
Martin Breidt
Heinrich H. Bu¨lthoff Department of Human Perception, Cognition and Action, Max Planck Institute for Biological Cybernetics, Tu¨bingen, Germany Natalie Butcher
School of Psychological Sciences, University of Manchester, Man-
chester, UK Department of Psychology, University of Pittsburgh, Pittsburgh,
Jeffrey F. Cohn
Pennsylvania Department of Human Perception, Cognition and Action, Max Planck Institute for Biological Cybernetics, Tu¨bingen, Germany
Cristo´bal Curio
Cognitive and A¤ective Neurosciences Laboratory, Tilburg University, Tilburg, the Netherlands, and Martinos Center for Biomedical Imaging, Massachusetts General Hospital / Harvard Medical School, Charlestown, Massachusetts
Beatrice de Gelder
Asif A. Ghazanfar Departments of Psychology and Ecology and Evolutionary Biology, Princeton University, Princeton, New Jersey
Section for Computational Sensomotorics, Hertie Institute for Clinical Brain Sciences, and Center for Integrative Neuroscience, University Clinic, Tu¨bingen, Germany
Martin A. Giese
Harold Hill School of Psychology, University of Wollongong, Wollongong, New South Wales, Australia Alan Johnston
UK
Department of Psychology, University College London, London,
Mario Kleiner Department of Human Perception, Cognition and Action, Max Planck Institute for Biological Cybernetics, Tu¨bingen, Germany
Computational Neuroimaging Laboratory, New York University, New York, New York Barbara Knappmeyer Karen Lander
School of Psychological Sciences, University of Manchester, Manches-
ter, UK Kang Lee Human Development and Applied Psychology, University of Toronto, Toronto, Canada David A. Leopold Laboratory of Neuropsychology, National Institute of Mental Health, National Institutes of Health, Bethesda, Maryland Gwen Littlewort Institute for Neural Computation, University of California, San Diego, San Diego, California
Institute for Neural Computation, University of California, San Diego, San Diego, California Javier Movellan
Alice O’Toole School of Behavioral and Brain Science, The University of Texas at Dallas, Richardson, Texas Aina Puce Department of Psychological and Brain Sciences, Indiana University, Bloomington, Indiana Ruthger Righart Neurology and Imaging of Cognition Laboratory, and Swiss Center for A¤ective Sciences, University of Geneva, Geneva, Switzerland
School of Behavioral and Brain Science, The University of Texas at Dallas, Richardson, Texas Dana Roark
Charles E. Schroeder
Nathan Kline Institute for Psychiatric Research, Orangeburg,
New York Bernhard Scho¨lkopf
Max Planck Institute for Biological Cybernetics, Tübingen,
Germany McGovern Institute for Brain Research, Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, Massachusetts
Thomas Serre
Stephen V. Shepherd
Neuroscience Institute, Princeton University, Princeton, New
Jersey Pawan Sinha Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, Massachusetts J. Van den Stock Cognitive and A¤ective Neurosciences Laboratory, Tilburg University, Tilburg, the Netherlands, and Department of Neuroscience, KU Leuven, Leuven, Belgium
Patrik Vuilleumier Laboratory for Behavioral Neurology and Imaging of Cognition, Department of Neuroscience, University Medical Center, and Department of Neurology, University Hospital, Geneva, Switzerland
Institute for Neural Computation, University of California, San Diego, San Diego, California, and Engineering and Natural Science, Sabanci University, Istanbul, Turkey
Esra Vural
Department of Informatics and Mathematical Modeling, Technical University of Denmark, Lyngby, Denmark
Christian Walder
Institute for Neural Computation, University of California, San Diego, San Diego, California
Jake Whitehill
Institute for Neural Computation, University of California, San Diego, San Diego, California Tingfan Wu
Index
Active appearance model (AAM), 243¤ Adaptation to context, 141, 241 Adaptation dynamic, 8, 47¤ neural, 195 static, 9, 23 Alzheimer’s disease, 168 Amygdala, 110, 114, 123, 135, 143, 147¤, 150¤, 162, 166¤, 170 Anti-expression, 49¤ Anti-faces, 48¤ Attention, 15, 23, 35, 71, 97, 99, 102¤, 109¤, 126¤, 133, 142, 145, 148, 151¤, 168, 194, 203¤ Auditory cortex, 79, 90, 101, 115¤, 130 Auditory speech recognition, 78 Autism spectrum disorder, 100, 168 Automated feedback, 211, 218, 232¤, 239¤ Automatic analysis of facial expression, 50¤, 164, 212, 224, 235 Average faces, 9 Biological motion, 27, 64¤, 78, 89, 100, 105, 115, 163, 192, 196, 241 Bruce, V., 3¤, 9, 16, 18–19, 21, 23, 31¤, 63, 67¤, 74, 110, 115, 141, 146¤, 153, 162, 167¤, 190¤ C layers, 194 Caricaturing, 5, 7 Causal structure, 241 Changeable aspects, 18, 27, 74, 141, 192 Closed-loop teaching, 233¤ Coarticulation, 84, 91 Coded light, 259¤ Complex cells, 193¤ Computer Expression Recognition Toolbox (CERT), 211, 214 Computer graphics, 47¤, 255¤ Computer vision, 4, 187¤, 211¤ Configural structure, 17 Configurational processing, 195 Connectionist architectures, 190 Continuous perceptual space, 47, 63
Controllable stimuli, 255 Correlational structure, 241 Correspondence(s), 52–53, 56, 101, 179, 181, 190, 256, 258¤, 261¤, 271 Corrugator supercilii, 112, 143 Cross-modal processing, 79, 88, 90, 123, 124¤ Datasets Cohn-Kanade, 216 RU-FACS, 216 Deaf, 79 Distinctiveness, 38¤ Distributed model, 17¤ Dorsal stream, 19, 25, 27, 195¤ Driver fatigue, 213¤ Drowsiness, 183, 211¤ Dynamic features, 105, 146 Eigenface, 190 Ekman, P., 49, 51, 100, 103, 112, 141, 143¤, 150, 164, 212¤, 218, 239, 241 Electroencephalography (EEG), 125, 129, 136¤, 145, 151, 162, 169, 225 Electromyography (EMG), 143¤, 151, 170, 217 Emotional expressions, 113¤, 152, 164, 166 Event-related potentials (ERPs), 5, 99¤, 123¤, 128¤, 133, 136¤, 148, 151 Eye gaze, 15, 17, 101, 125¤, 144¤ Eye movement, 81, 101, 106, 108, 127, 194, 239 Face Recognition Vendor Test, 16, 189 Facial Action Coding System (FACS), 49, 51, 164, 212¤, 216, 218, 222, 225 Facial adaption, 8 Face space model, 5, 38, 190 Face space, 9, 38, 47¤, 190, 199 active appearance model, 243¤ avatar, 3¤, 47¤, 67, 69, 144, 150, 243¤ Facial expression production, 91, 111, 143¤, 255 Facial motion, 44¤, 49¤, 52, 67¤, 81, 85, 106¤, 109, 123–126, 153, 178, 188, 225, 244, 258 Facial performance capture, 257
282
FACS. See Facial Action Coding System Faked expressions, 211¤ fMRI, 99, 102, 126, 129¤, 133, 135¤, 147¤, 150, 153, 164¤, 169, 197, 191, 194¤ Friesen, W. V., 49, 51, 110, 164, 212¤, 218 Fukushima, K., 193 Functional resonance magnetic imaging, 126 Fusiform cortex, 135, 147, 149 Fusiform face area, 22, 162, 167¤, 195 Fusiform gyrus, 17¤, 124, 135, 146, 162 Facial gestures, 5, 7¤, 11, 25, 103, 129 Facial speech, 15, 17, 20, 77–79, 81¤, 168 Famous faces, 18, 32¤, 35¤, 39¤, 43, 69 Familiarity, 19–21, 23, 28¤, 37, 39–44, 69, 110, 168 Game production, 255 Gamma band, 117, 131 Gaze direction, 102, 110, 114, 141¤ Gaze following, 110¤, 115 Gaze shifts, 113, 126, 141¤, 145, 149 Gaze, 15, 17, 98¤, 107, 109¤, 125¤, 133¤, 140¤, 153¤, 162 Gazed direction, 114 Gender, 3, 6, 47¤, 76, 68¤, 110, 148, 161, 168, 190, 203, 239, 241 Graphics model, 49, 55, 63, 255 Hidden Markov models (HMMs), 78, 188 High-level aftere¤ects, 48¤, 63 Human-computer interaction, 250 Huntington’s disease, 168 Ideomotor theory of action, 112 Idiosyncrasy, 21¤, 25, 69¤, 72¤, 188 Image quality, 18, 68¤ Imagination, 141, 143 Implicit surfaces, 258¤ Independent component analysis, 258 Infancy, 24, 45 Inferior temporal cortex (IT), 17, 19, 24, 191, 193¤, 199¤ Influence of identity, 57, 68 Insula, 114 Interstimulus interval (ISI), 57 Inversion e¤ect, 161
Index
Local motion, 4¤, 60¤, 63, 194 Low-dimensional, 50, 53, 55, 63, 182 Low-level motion adaptation and aftere¤ects, 49, 60¤ M170, 125 Magnetoencephalography (MEG), 125, 136¤, 145 Manipulating appearance, 241¤ Manipulating dynamics, 241¤ McGurk e¤ect, 80¤ Medial temporal areas, 117 Mesh deformation, 256¤, 259, 265¤ Mesh regularization, 258, 264¤ Microexpression(s), 100, 105 Mimicry, 5, 111¤, 114¤, 151¤, 192, 240 Mirror system, 240 Model fitting, 53 Morphing, 18, 20, 36, 48, 52, 55¤, 63, 69, 199¤, 257 Motion advantage, 5, 7, 19, 25¤, 31¤, 36, 38¤, 43¤ Motion capturing, 49¤, 67, 178–179, 243, 251, 256, 258 Motion distinctiveness, 39¤ Motion retargeting, 257 Motion transfer, 53 Motor representation, 192, 203¤ Motor theory of speech perception, 78, 164 Movie production, 255 Multidimensional face space, 47 Multimodality, 81, 90, 100, 123¤, 127, 129, 131, 133, 135 Multisensory interaction, 131, 133 N170, 99¤, 123¤, 129¤, 133, 135, 145¤, 151¤ Negation, 6, 33, 39 Neocognitron, 193 Nonrigid instrinsic facial movements, 56 Norm, 9, 11¤, 189¤, 197, 199¤ Norm-based encoding, 190, 197, 199¤ Norm-based representation, 9, 190 Norm-referenced encoding, 190, 197, 199¤ Object motion, 4¤ Object-based motion, 4¤ Object-centered representation, 8, 22, 34, 78, 79 OpenGL, 56 Optical flow, 61¤, 188, 256, 258, 272
James, W., 112 Kalman filter, 189 Keyframes, 49¤, 89, 182, 197, 257 Latency, 123¤, 127, 129, 133 Lateral connections, 197 Learning, 16, 18¤, 21¤, 28¤, 34¤, 37¤, 43¤, 97, 181¤, 196¤, 213¤, 216¤, 222¤, 251, 273 Least-square optimization problem, 53 Liberman A., 78, 91, 164
P140, 129 Parametric control, 47 Parietal areas, 114, 162 Parkinson’s disease, 168 PCA, 4, 9, 11, 199 Perception production loop, 91 Perceptrons, 190 Person, 11, 15, 19, 25¤, 33, 37¤, 40, 43, 77, 88¤, 102¤, 113, 117¤, 147, 161¤, 167, 187¤, 203, 225, 239¤, 247¤, 250, 252
Index
Perturbation, 241 Photographic negatives, 6, 67¤ Photorealistic simulation, 52 Place of articulation, 80 Point-light displays, 5¤, 32, 67¤, 168 Point-light stimuli, 5, 6, 25, 32, 48, 67¤, 81, 90, 168, 192, 196, 241 Point-light walkers, 25, 48 Points of subjective equality (PSE), 72¤ Positron emission tomography (PET), 147, 164, 165 Prefrontal cortex, 117, 131, 143, 151¤, 193 Premotor areas, 169 Priming, 26, 36¤, 44¤ Principal component analysis (PCA), 4, 8¤, 182, 244 Prosody, 20, 84¤, 88¤, 143, 239 Prosopagnosia, 21¤, 100, 167¤, 170¤ Prototype, 8¤, 11¤, 38¤, 181, 211, 259 Psychometric function, 57, 72 Psychophysics Toolbox-3, 56 Pupil size, 142 Quadratic programming, 53 Radial basis function(s) (RBF), 190, 199, 257 Real-time, 56, 91, 105, 214, 233, 235, 243, 247 Relative motion, 78 Representation enhancement hypothesis, 18, 22¤, 26¤, 33, 74 Rigid head movements, 56 S layers, 194 Schizophrenia, 168 Self-processing, 117 Semantic control, 47 Sensory-driven processing, 142 Sequential temporal order, 196 Simple cells, 193¤ Sine-wave speech, 83, 88 Social attention, 23, 114, 127, 148 Social communication, 17¤, 103 Social context, 133¤ Spatiotemporal symmetry, 239 Speech, 77 Spontaneous expressions, 211, 213¤ Structure-from-motion, 18¤, 22¤, 181 Superior temporal sulcus (STS), 17, 89, 100, 113, 116, 123¤, 126, 130¤, 131, 133, 135, 147, 149, 151, 163, 191 Supplement information hypothesis, 17¤, 22, 33¤, 43, 69, 74 Surface tracking, 255¤, 269 Symmetric action, 239 Synthesis, 47¤, 244, 248 Teaching, 233¤ Temporal order, 49, 63, 181, 196, 204
283
Temporal segmentation, 7 3D scanning, 49, 52, 54, 258¤ Tracking, 4, 42, 145, 179¤, 188¤, 214, 241¤, 249– 255, 265, 267¤, 271¤ Transcranical magnetic stimulation (TMS), 115, 152 Tutoring system, 232 Types of motion, 4 V1, 162¤, 193¤, 197, 261¤, 266¤ V4, 24, 163, 169, 191, 193¤ V5, 163, 169 Ventral (visual) pathway, 193 Ventral stream, 21–22, 27, 193, 195¤, 199 Vernoi tessellation, 190 VICON, 50 Video, 16¤, 20, 23, 25, 34, 77, 79, 81, 83, 85¤, 107¤, 116, 129, 144, 148, 153, 169, 178¤, 187¤, 196, 213¤, 222¤, 225¤, 230, 233¤, 241¤, 246¤, 257¤, 269, 272, 274¤ Videoconference, 242, 250¤ View-based representation, 22¤, 181 Viewer-centered representation, 78 View-inpedendency, 90 Viewpoint constancy, 8 Vocal tract, 80¤, 109 Vocalization, 99, 101, 105, 129, 131, 239¤ Voluntary facial movements, 143, 152, 213 Young, A. W., 3, 8¤, 24, 32, 36, 74, 106, 113, 141, 146¤, 153, 162, 166¤, 192, 195 Zihl, J., 81, 169 Zygomatic major, 112