Cognitive Science 35 (2011) 1–33 Copyright © 2010 Cognitive Science Society, Inc. All rights reserved. ISSN: 0364-0213 print / 1551-6709 online DOI: 10.1111/j.1551-6709.2010.01142.x
The AHA! Experience: Creativity Through Emergent Binding in Neural Networks

Paul Thagard, Terrence C. Stewart
Department of Psychology, University of Waterloo

Received 18 September 2009; received in revised form 25 March 2010; accepted 27 June 2010
Abstract

Many kinds of creativity result from combination of mental representations. This paper provides a computational account of how creative thinking can arise from combining neural patterns into ones that are potentially novel and useful. We defend the hypothesis that such combinations arise from mechanisms that bind together neural activity by a process of convolution, a mathematical operation that interweaves structures. We describe computer simulations that show the feasibility of using convolution to produce emergent patterns of neural activity that can support cognitive and emotional processes underlying human creativity.

Keywords: Binding; Conceptual combination; Convolution; Creativity; Emotion; Neural engineering framework; Neural networks; Neurocomputation; Representation
1. Creative cognition

Creativity is evident in many human activities that generate new and useful ideas, including scientific discovery, technological invention, social innovation, and artistic imagination. We still lack an understanding of the cognitive mechanisms that enable people to be creative, especially of the neural mechanisms that support creativity in the brain. How do people's brains come up with new ideas, theories, technologies, organizations, and esthetic accomplishments? What neural processes underlie the wonderful AHA! experiences that creative people sometimes enjoy? We propose that human creativity requires the combination of previously unconnected mental representations constituted by patterns of neural activity. Creative thinking is then a matter of combining neural patterns into ones that are both novel and useful.

(Correspondence should be sent to Paul Thagard, Department of Psychology, University of Waterloo, Ontario N2L 3G1, Canada. E-mail: [email protected])

We advocate
the hypothesis that such combinations arise from mechanisms that bind together neural patterns by a process of convolution rather than synchronization, which is the currently favored way of understanding binding in neural networks. We describe computer simulations that show the feasibility of using convolution to produce emergent patterns of neural activity of the sort that can support human creativity.

One of the advantages of thinking of creativity in terms of neural representations is that they are not limited to the sort of verbal and mathematical representations that have been used in most computational, psychological, and philosophical models of scientific discovery. In addition to words and other linguistic structures, the creative mind can employ a full range of sensory modalities derived from sight, hearing, touch, smell, taste, and motor control. Creative thought also has vital emotional components, including the reaction of pleasure that accompanies novel combinations in the treasured AHA! experience. The generation of new representations involves binding together previously unconnected representations in ways that also generate new emotional bindings.

Before getting into neurocomputational details, we illustrate the claim that creative thinking consists of novel combination of representations with examples from science, technology, social innovation, and art. We then show how multimodal representations can be combined by binding in neural populations using a process of convolution in which neural activity is ‘‘twisted together'' rather than synchronized. Emotional reactions to novel combinations can also involve convolution of patterns of neural activity. After comparing our neural theory of creativity with related work in cognitive science, we place it in the context of a broader account of multilevel mechanisms—including molecular, psychological, and social ones—that together contribute to human creativity.

We propose a theory of creativity encapsulated in the following theses:

1. Creativity results from novel combinations of representations.
2. In humans, mental representations are patterns of neural activity.
3. Neural representations are multimodal, encompassing information that can be visual, auditory, tactile, olfactory, gustatory, kinesthetic, and emotional, as well as verbal.
4. Neural representations are combined by convolution, a kind of twisting together of existing representations.
5. The causes of creative activity reside not just in psychological and neural mechanisms but also in social and molecular mechanisms.

Thesis 1, that creativity results from combination of representations, has been proposed by many writers, including Koestler (1967) and Boden (2004). Creative thinkers such as Einstein, Coleridge, and Poincaré have described their insights as resulting from combinatory play (Mednick, 1962). For the purposes of this paper, the thesis need only be the modest claim that much creativity results from novel combination of representations. The stronger claim that all creativity requires novel combination of representations is defended by analysis of 200 great scientific discoveries and technological inventions (P. Thagard, unpublished data). The nontriviality of the claim that creativity results from combination of representations is shown by proponents of behaviorism and radical embodiment who contend that there are no mental representations.
Thesis 2, that human mental representations are patterns of neural activity, is defended at length elsewhere (Thagard, 2010). The major thrust of this paper is to develop and defend thesis 4 by providing an account of how patterns of neural activity can be combined to constitute new ones. Our explanation of creativity in terms of neural combinations could be taken as an additional piece of evidence that thesis 2 is true, but we begin with some examples that offer anecdotal evidence for theses 1 and 3. Thesis 5, concerning the social and molecular causes of creativity, will be discussed only briefly.
2. Creativity from combination of representations

Fully defending thesis 1, that creativity results from combination of representations, would take a comprehensive survey of hundreds or thousands of acknowledged instances of creative activity in many domains. We can only provide a couple of supporting examples from each of the four primary areas of human creativity: scientific discovery, technological invention, social innovation, and artistic imagination. These examples show that at least some important instances of creativity depend on the combination of representations.

Many scientific discoveries can be understood as instances of conceptual combination, in which new theoretical concepts arise by putting together old ones (Thagard, 1988). Two famous examples are the wave theory of sound, which required development of the novel concept of a sound wave, and Darwin's theory of evolution, which required development of the novel concept of natural selection. The concepts of sound and wave are part of everyday thinking concerning phenomena such as voice and water waves. The ancient Greek Chrysippus put them together to create the novel representation of a sound wave that could explain many properties of sound such as propagation and echoing. Similarly, Darwin combined familiar ideas about selection done by breeders with the natural process of struggle for survival among animals to generate the mechanism of natural selection that could explain how species evolve.

One of the cognitive mechanisms of discovery is analogy, which requires putting together the representation of a target problem with the representation of a source (base) problem that furnishes a solution. Hence, the many examples of scientific discoveries arising from analogy support the claim that creativity arises from combination of representations. See Holyoak and Thagard (1995) for a long list of analogy's greatest successes in the field of scientific discovery.

Cognitive theories of conceptual combination have largely been restricted to verbal representations (e.g., Costello & Keane, 2000; Smith & Osherson, 1984), but conceptual combination can also involve perception (Wu & Barsalou, 2009; see also Barsalou, Simmons, Barbey, & Wilson, 2003). Obviously, the human concept of sound is not entirely verbal, possessing auditory exemplars such as music, thunder, and animal noises. Similarly, the concept of wave is not purely verbal but involves in part visual representations of typical waves such as those in large bodies of water or even in smaller ones such as bathtubs. Hence, a theory of conceptual combination, in general and in specific application to scientific discovery, needs to attend to nonverbal modalities.
Technological invention also has many examples of creativity arising from combination of representations. In our home town of Waterloo, Ontario, the major economic development of the past decade has been the dramatic rise of the company Research in Motion (RIM), maker of the extremely successful BlackBerry wireless device. The idea for this device originated in the 1990s as the result of the combination of two familiar technological concepts: electronic mail and wireless communication. According to Sweeny (2009), RIM did not originate this combination, which came from a Swedish company, Ericsson, where an executive combined the concepts of wireless and email into the concept of wireless email. Whereas the concept of sound wave was formed to explain observed phenomena, the concept of wireless email was generated to provide a new target for technological development and financial success (see Saunders & Thagard, 2005, for an account of creativity in computer science). Thus, creative conceptual combination can produce representations of goals as well as of theoretical entities. RIM's development of the BlackBerry depended on many subsequent creative combinations such as two-way paging, thumb-based typing, and an integrated single mailbox.

Another case of technological development by conceptual combination is the invention of the stethoscope, which came about by an analogical discovery in 1816 by a French physician, Théophile Laennec (Thagard, 1999, ch. 9). Unable to place his ear directly on the chest of a modest young woman with heart problems, Laennec happened to see some children listening to a pin scratching through a piece of wood and came up with the idea of a hearing tube that he could place on the patient's chest. The original concepts here are multimodal, involving sound (hearing heartbeats) and vision (rolled tube). Putting these multimodal representations together enabled Laennec to create the concept we now call the stethoscope. It would be easy to document dozens of other examples of technological invention by representation combination.

Social innovations have been less investigated by historians and cognitive scientists than developments in science and technology (Mumford, 2002), but they also result from representation combination. Two of the most important social innovations in human history are public education and universal health care. Both of these innovations required establishing new goals using existing concepts. Education and health care were private enterprises before social innovators projected the advantages for human welfare if the state took them on as a responsibility. Both innovations required novel combinations of existing concepts concerning government activity plus private concerns, generating the combined concepts of public education and universal health care. Many other social innovations, from universities, to public sanitation, to Facebook, can also be seen as resulting from the creative establishment of goals through combinations of previously existing representations. The causes of social innovation are of course social as well as psychological, as we will make clear later when we propose a multilevel system view of creativity.

There are many kinds of artistic creativity, in domains as varied as literature, music, painting, sculpture, and dance.
Individual creative works such as Beethoven's Ninth Symphony, Tolstoy's War and Peace, and Manet's Le déjeuner sur l'herbe are clearly the result of the cognitive efforts of composers, authors, and artists to combine many kinds of representations: verbal, auditory, visual, and so on. Beethoven's Ninth, for example,
combines auditory originality with verbal novelty in the famous last movement known as the Ode to Joy, which also generates and integrates emotional representations. Hence, illustrious cases of artistic imagination support the claim that creativity emanates in part from cognitive operations of representation combination.

Even kinesthetic creativity, the generation of novel forms of movement, can be understood as combination of representations as long as the latter are understood very broadly to include neural encodings of motor sequences (e.g., Wolpert & Ghahramani, 2000). Historically, novel motor sequences include the slam dunk in basketball, the over-the-shoulder catch in baseball, the Statue of Liberty play in football, the bicycle kick in soccer, and the pas de deux in ballet. All of these can be described verbally and may have been generated using verbal concepts, but it is just as likely that they were conceived and executed using motor representations that can naturally be encoded in patterns of neural activity.

We are primarily interested in creativity as a mental process, but we cannot neglect the fact that it also often involves interaction with the world. Manet's innovative painting arose in part from his physical interaction with the brush, paint, and canvas. Similarly, invention of important technologies such as the stethoscope and the wheel can involve physical interactions with the world, as when Laennec rolled up a piece of paper to produce a hearing tube. External representations such as diagrams and equations on paper can also be useful in creative activities, as long as they interface with the internal mental representations that enable people to interact with the world.

We have provided examples to show that combination of representations is a crucial part of creativity in the domains of scientific discovery, technological invention, social innovation, and artistic imagination. Often these domains intersect, for example, when the discovery of electromagnetism made possible the invention of the radio, and when the invention of the microscope made possible the discovery of the cell. Social innovations can involve both scientific discoveries and technological invention, as when public health is fostered by the germ theory of disease and the invention of antibiotics. Artistic imagination can be aided by technological advances, for example, the invention of new musical instruments. We hope that our examples make plausible the hypothesis that creativity often requires the combination of mental representations operating with multiple modalities. We now describe neural mechanisms for such combinations.
3. Neural combination and binding

Combination of representations has usually been modeled with symbolic techniques common in the field of artificial intelligence. For example, concepts can be modeled by schema-like data structures called frames (Minsky, 1975), and combination of concepts can be performed by amalgamating frames (Thagard, 1988). Other computational models of conceptual combination have aimed at modeling the results of psycholinguistic experiments rather than creativity, but they also take concepts to be symbolic, verbal structures (Costello & Keane, 2000). Rule combination has been modeled with simple rules consisting of strings of bits and genetic algorithms (Holland, Holyoak, Nisbett, & Thagard, 1986), and genetic
algorithms have also been used to produce new combinations of expressions written in the programming language LISP (Koza, 1992). Lenat and Brown (1984) produced discovery programs that generated new LISP-defined concepts out of chunks of LISP code. Rule-based systems such as ACT and SOAR can also employ learning mechanisms in which rules are chunked or compiled together to form new rules (Anderson, 1993; Laird, Rosenbloom, & Newell, 1986).

These symbolic models of combination are powerful, but they lack the generality to handle the full range of representational combinations that include sensory and emotional information. Hence, we propose viewing representation combination at the neural level, as all kinds of mental representations—concepts, rules, sensory encodings, and emotions—are produced in the brain by the activity of neurons. Evidence for this claim comes from the vast range of psychological phenomena such as perception and memory that are increasingly being explained by cognitive neuroscience (see e.g., Chandrasekharan, 2009; Smith & Kosslyn, 2007; Thagard, 2010). The basic idea that neural representations are constituted by patterns of activity in populations of neurons dates back at least to Donald Hebb (1949, 1980), and it is implicit in many more recent and detailed neurocomputational accounts (e.g., Churchland, 1989; Churchland & Sejnowski, 1992; Dayan & Abbott, 2001; Eliasmith & Anderson, 2003; Rumelhart & McClelland, 1986). If this basic idea is right, then combination of representations should be a neural process involving generation of new patterns of activity from old ones.

Hebb is largely remembered today for the eponymous idea that synaptic connections between simultaneously active neurons are strengthened, which has its roots in the learning theories of 18th-century empiricist philosophers such as David Hartley. But Hebb's most seminal contribution was the doctrine that all thinking results from the activity of cell assemblies, which are groups of neurons organized by their synaptic connections and capable of generating complex behaviors. Much later, Hebb (1980) sketched an account of creativity as a normal feature of cognitive activity resulting from the firing of neurons in cell assemblies. Hebb described problem solving as involving many ‘‘cell-assembly groups which fire and subside, fire and subside, fire and subside, till the crucial combination occurs'' (Hebb, 1980, p. 119). Combination produces a new scientific idea that sets off a new sequence of ideas and constitutes a different way of seeing the problem situation by reorienting the whole pattern of cortical activity. The new combination of ideas that results from connection of cell assemblies forms a functional system that excites the arousal system, producing the Eureka! emotional effect. Thus, Hebb sketched a neural explanation of the phenomenon of insight that has been much discussed by psychologists interested in problem solving (e.g., Bowden & Jung-Beeman, 2003; Sternberg & Davidson, 1995).

Hebb's conception of creative insight arising from cell-assembly activity is suggestive, but rather vague, and it raises as many questions as it answers. How are cell assemblies related to each other, and how is the information they carry combined? For example, if there is a group of cell assemblies (a neural population) that encodes the concept of sound, and another that encodes the concept of wave, how does the combined activity of the overall neural population encode the novel conceptual combination of sound wave?
We view the problem of creative combination of representations as an instance of the ubiquitous binding problem that pervades cognitive neuroscience (e.g., Roskies, 1999; Treisman, 1996). This problem was first recognized in studies of perception, where it is problematic how the brain manages to integrate various features of an object into a unified representation. For example, when people see a stop sign, they see the color red, the octagonal shape, and the white letters as all part of the same image, which requires the brain to bind together what otherwise might be several disparate representations. The binding problem is also integral to explaining the nature of consciousness, which has a kind of unity that may seem mysterious from the perspective of the variegated activity of billions of neurons processing many different kinds of information.

The most prominent suggestions of how to deal with the binding problem have concerned synchronization of neural activity, which has been proposed as a way to deal with the kind of cognitive coordination that occurs in consciousness (Crick, 1994; Engel, Fries, König, Brecht, & Singer, 1999; Grandjean, Sander, & Scherer, 2008; Werning & Maye, 2007). At a more local level, neural synchrony has been proposed as a way of integrating crucial syntactic information needed for the representation of relations, for example, to mark the difference between Romeo loves Juliet and Juliet loves Romeo (Hummel & Holyoak, 1997; Shastri & Ajjanagadde, 1993). For the first sentence, the neural populations for Romeo and SUBJECT would be active for a short period of time, then the neural populations for loves and VERB, and finally for Juliet and OBJECT. For the second sentence, the timing would be changed so that Romeo and OBJECT were synchronized, as well as Juliet and SUBJECT.

Unfortunately, there has been little success at finding neural mechanisms that could cause this synchronization behavior. Simple approaches, such as having a separate group of neurons that could specifically stimulate pairs of concepts, do not scale up, as they must have neurons for every single possible conceptual combination and thus require more neurons than exist in the human brain. Other models restrict themselves to particular modalities, such as binding color and location (Johnson, Spencer, & Schöner, 2008). Still others are too brittle, assuming that neurons have no randomness and never die (for a detailed discussion, see C. Eliasmith & T. C. Stewart, unpublished data; T. C. Stewart & C. Eliasmith, unpublished data).

Our aim in this paper is to develop an alternative account based on convolution rather than synchronization. To do this, we present a neurally realistic model of a mechanism whereby arbitrary concepts can be combined, resulting in a new representation. We will not propose a general theory of binding, but rather defend a more narrow account of how the information encoded in patterns of neural activity gets combined into new representations that may turn out to be creative. We draw heavily on Eliasmith's (2004, 2005a, unpublished data) work on neurobiologically plausible simulations of high-level inference.
4. Binding by convolution

Tony Plate (2003) developed a powerful way of thinking about binding in neural networks, using vector-based representations similar to but more computationally efficient
than the tensor-product representations proposed by Smolensky (1990). Our presentation of Plate's idea, which we call binding by convolution, will be largely metaphorical in this section, but technical details are provided later. Eliasmith and Thagard (2001) include a relatively gentle introduction to Plate's method.

To get the metaphors rolling, consider the process of braiding hair. Thousands of long strands of hair can be braided together by twisting them systematically into one or more braids. Similar twisting can be used to produce ropes and cables. Another word for twisting and coiling up things in this way is convolve, and things are convolved if they are all twisted up together. Another word for ‘‘convolve'' is ‘‘convolute,'' but we rarely use this term because in recent decades the term ‘‘convoluted'' has come to mean ‘‘excessively complicated.''

In mathematics, a convolution is an integral that expresses the amount of overlap of one function f as it is shifted over another function g, expressing the blending of one function with another (http://mathworld.wolfram.com/Convolution.html). This notion gives a mathematically precise counterpart to the physical process of braiding, as we can think of mathematical convolution as blending two signals together (each represented by a function) in a way roughly analogous to how braiding blends two strands of hair together.

Plate developed a technique he called holographic reduced representations that applies an analog of convolution to vectors of real numbers. It is natural to think of patterns of neural activity using vectors: If a neural population contains n neurons, then its activity can be represented by a sequence that contains n numbers, each of which stands for the firing rate of a neuron. For example, if the maximum firing rate of a neuron is 200 times per second, the rate of a neuron firing 100 times per second could be represented by the number .5. Then the vector (.5, .4, .3, .2, .1) corresponds to the firing rates of this neuron and four additional ones with slower firing rates.

So here is the basic idea: If we abstractly represent the pattern of activity of two neural populations by vectors A and B, then we can represent their combination by the mathematical convolution of A and B, which is another vector corresponding to a third pattern of neural activity. For the moment, we ignore what this amounts to physiologically—see the Simulation section below. The resulting vector has emergent properties, that is, properties not possessed by (or simple aggregates of) either of the two vectors out of which it is combined. The convolved vector combines the information included in each of the originating vectors in a nonlinear fashion that enables only approximate reconstruction of them. Hence, the convolution of vectors produces an emergent binding, one which is not simply the sum of the parts bound together (on emergence, see Bunge, 2003; Wimsatt, 2007).

Talking about convolution of vectors still does not enable us to grasp the convolution of patterns of neural activity. To explain that, we will need to describe our simulation model of how combination can occur in a biologically realistic, computational neural network. Before getting into technical details, however, we need to describe how the AHA! experience can be understood as convolution of a novel combined representation with patterns of brain activity for emotion.
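For readers who want the operation stated precisely, the continuous convolution of two functions and the circular (discrete) convolution that Plate's holographic reduced representations apply to n-dimensional vectors can be written as follows (these are the standard definitions, not notation from this paper):

\[ (f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau \]

\[ (A \circledast B)_j = \sum_{k=0}^{n-1} A_k\, B_{(j-k) \bmod n}, \qquad j = 0, \ldots, n-1 \]

A useful property of the circular form is that the bound vector A \circledast B has the same dimension n as its inputs, so bindings can themselves enter into further bindings.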
5. Emotion and creativity

Cognitive science must not only explain how representations get combined into creative new ones, it should also explain how such combinations can be intensely emotional. Many quotes from eminent scientists attest to the emotional component of scientific discovery, including such reactions as delight, amazement, pleasure, glory, passion, and joy (Thagard, 2006a, ch. 10). We expect that breakthroughs in technological invention, social innovation, and artistic imagination are just as exciting. For example, Richard Feynman (1999, p. 12) wrote: ‘‘The prize is the pleasure of finding a thing out, the kick of the discovery, the observation that other people use it [my work]—those are the real things, the others are unreal to me.''

What are the neuropsychological causes of the ‘‘kick of the discovery''? To answer that question, we need a neurocomputational theory of emotion that can be integrated with the account of representation generation provided earlier in this paper. Thagard and Aubie (2008) have hypothesized how emotional experience can arise from a complex neural process that integrates cognitive appraisal of a situation with perception of internal physiological states. Fig. 1 shows the structure of the EMOCON model, with the emotional feeling
[Fig. 1 diagram: interconnected components labeled external stimulus, external sensors, DLPFC, OFPFC, VMPFC, anterior cingulate, dopamine system, thalamus, amygdala, insula, internal sensors, and bodily states.]
Fig. 1. The EMOCON model of Thagard and Aubie (2008), which contains details and partial computational modeling. DLPFC, dorsolateral prefrontal cortex; OFPFC, orbitofrontal prefrontal cortex; VMPFC, ventromedial prefrontal cortex. The dotted line is intended to indicate that emotional consciousness emerges from activity in the whole system.
resulting from the interaction of multiple brain areas generating emotional consciousness, requiring both appraisal of the relevance of a situation to an agent's goals (largely realized by the prefrontal cortex and midbrain dopamine system) and internal perception of physiological changes (largely realized by the amygdala and insula). For defense of the neural, psychological, and philosophical plausibility of this account of emotional experience, see also Thagard (2010).

If the EMOCON model is correct, then emotional experiences such as the ecstasy of discovery are patterns of neural activity, just like other mental representations such as concepts and rules. Now it is easy to see how the AHA! or Eureka! experience can arise. When two representations are combined by convolution into a new one, the brain automatically performs an evaluation of the relevance of the new representation to its goals. Ordinarily, such combinations are of little significance, as in the ephemeral conceptual combinations that take place in all language processing. There need be no emotional reaction to mundane combinations such as ‘‘brown cow'' and ‘‘tall basketball player.'' But some combinations are surprising, such as ‘‘cow basketball,'' and may elicit further processing to try to make sense of them (Kunda, Miller, & Claire, 1990). In extraordinary situations, the novel combination may be not only surprising but actually exciting, if it has strong relevance to accomplishing the longstanding goals of the thinker. For example, Darwin was thrilled when he realized that the novel combination natural selection could explain facts about species that had long puzzled him, and the combination wireless email excited the inventors of the BlackBerry when they realized its great commercial potential.

Fig. 2 shows how representation combination can be intensely emotional, when patterns of neural activity corresponding to concepts become convolved with patterns of activity that constitute the emotional evaluation of the new combination. A new combination such as sound wave is exciting because it is highly relevant to accomplishing the discoverer's goals. But emotions are not just a purely cognitive process of appraisal, which could be performed dispassionately, as in a calculation of expected utility as performed by economists. AHA! is a very different experience from ‘‘Given the probabilities and expected payoffs, the expected value of option X is high.''
[Fig. 2 diagram: concept 1 and concept 2 are convolved into a combined concept; appraisal and physiology are convolved into an emotional reaction; these two results are in turn convolved to yield the AHA! response.]
Fig. 2. How the AHA! experience arises by multiple convolutions of representation combination and emotion. Fig. 9 will present a neural version.
Physiology is a key part of emotional experience—racing heartbeats, sweaty palms, etc., as pointed out by many theorists (e.g., Damasio, 1994; James, 1884; Prinz, 2004). But physiology cannot be the only part of the story, as there are many reasons to see cognitive appraisal as crucial, too (Thagard, 2010). The EMOCON model shows how reactions can combine both cognitive appraisal and physiological perception. Hence, we propose that the AHA! experience requires a triple convolution, binding: (a) two representations into an original one; (b) cognitive appraisal and physiological perception into a combined assessment of significance; and (c) the combined representation and the integrated cognitive/physiological emotional response into a unified representation (pattern of neural activity) of the creative representation and its emotional value.

Because the brain's operations are highly parallel, it would be a mistake to think of what we have just described as a serial process of first combining the representations, then integrating the appraisal and physiological perception, and finally convoluting the new representation and the emotional reaction. Rather, all these processes take place concurrently, with a constant flow of activation among the regions of the brain crucial for different kinds of cognition, perception, and emotion. The AHA! experience seems mysterious because we have no conscious access to any of these processes or their integration. But Fig. 2 shows a possible mechanism for how the wonderful AHA! experience can emerge from neural activity.
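In vector notation, this triple binding can be sketched schematically as follows, where C_1 and C_2 are the patterns for the two combined representations, \circledast is circular convolution, and the labels are ours for illustration:

\[ \text{AHA} \approx \underbrace{(C_1 \circledast C_2)}_{(a)} \circledast \underbrace{(\text{appraisal} \circledast \text{physiology})}_{(b)} \]

with the outer convolution corresponding to binding (c). As stressed above, the three bindings are computed concurrently rather than in sequence.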
6. Simulations

Our discussion of convolution so far has been largely metaphorical. We will now describe neurocomputational simulations that use the methods of Eliasmith (2004, 2005a) to show: (a) how patterns of neural activity can represent vectors; (b) how convolution can bind together two representations to form a new one (and subsequently unbind them); and (c) how convolution can bind together a new combined representation with patterns of brain activity that correspond to emotional experience arising from a combination of cognitive appraisal and physiological perception.

6.1. Simulation 1: Neurocomputational model of visual patterns

The first requirement for a mechanistic neural explanation of conceptual combination is specifying how a pattern can be represented by a population of neurons. Earlier, we gave a simple example where a vector of five elements corresponded to the firing rates of five neurons; for example, the value 0.5 could be indicated by an average firing rate of 100 Hz (i.e., 100 times per second). Linguistic representations can be translated into vectors using Plate's method of holographic reduced representation, and there are also natural translations of visual, auditory, and olfactory information into the mathematical form of vectors. Hence, a general multimodal theory of combined representation needs to consider only how vectors can be neurally represented and combined.

We can depict the simple approach visually as in Fig. 3. Here, we are using 25 neurons (the circles) to represent a 25-dimensional vector (squares). The shading of each neuron corresponds to its firing rate (with white being fast and black being slow). Depending on the firing rate, these same neurons can represent different vectors, and three different possibilities are shown.

Fig. 3. Simple neural network encoding using one neuron per pixel value. The three numerical vectors are shown at the top, shaded by value to aid interpretation. At the bottom, each circle is an individual neuron and the lighter shading indicates a faster firing rate. Encodings for three different 25-dimensional vectors are shown.

While this approach for representation is simple, it is not biologically realistic. In brains, if we map a single value onto a single neuron, then the randomness of that neuron will limit how accurately that value can be represented. Neurons are highly stochastic devices, and can even die, so we cannot use a method of representation that does not have some form of redundancy. To deal with this problem, we need to use many more neurons to represent the same vector. In Fig. 4, we represent the same three patterns as in Fig. 3, but using 100 times as many neurons.

Fig. 4. Simple neural network encoding using 100 neurons per pixel value. The three vectors are shown at the top (see Fig. 3 for more details). In the large bottom squares, each circle is an individual neuron and the lighter shading indicates a faster firing rate. Encodings for three different 25-dimensional vectors are shown.

While this approach allows for increased accuracy and robustness to neuron death, it does not correspond to what is found in real brains. Instead of having a large number of neurons, each of which behaves similarly to its neighbors (as do each of the groups of 100 neurons in Fig. 4), there is strong neurological evidence that vectors are represented by each neuron having a different overall pattern in response to which it fires most quickly. For example, Georgopoulos, Schwartz, and Kettner (1986) demonstrated that neurons in the motor cortex of monkeys encode reaching direction by each one having its own preferred direction vector. That is, for each neuron, there is a particular vector for which it fires most quickly, and the firing rate decreases as the vector represented becomes more dissimilar to that direction vector. As another example, neurons in the visual cortex respond to patterns over the visual field, and each one responds most strongly to a slightly different pattern, with neurons that are near each other responding to similar patterns (e.g., Blasdel & Salama, 1986).

To adopt this approach, instead of having 100 neurons responding to each single value in the vector, we take each neuron and randomly choose a particular pattern for which it will fire the fastest. For other vectors, it will fire less quickly, based on the similarity with its preferred vector. This approach corresponds to that seen in many areas of visual and motor cortex, and it has been shown to allow for improved computational power over the previous approach. Fig. 5 shows a group of 2,500 neurons representing three 25-dimensional vectors in this manner.

Fig. 5. Distributed encoding using preferred direction vectors of a 25-dimensional vector with 2,500 neurons. Each circle is an individual neuron and the lighter shading indicates a faster firing rate. Encodings for three different 25-dimensional vectors are shown.

It should be noted that, while we can no longer visually see the relationship between the firing pattern and the original vector (as we could in Fig. 4), the vector is still being represented by this neural firing, as we can take this firing pattern and derive the original pattern, if we know the preferred direction vectors for each of these neurons (see Appendix for more details).

To further improve the biological realism of our model, we can construct our model using spiking neurons. That is, instead of just calculating a firing rate, we can model the flow of current into and out of each neuron, and when the voltage reaches its firing potential (around −45 mV), it fires. Once a neuron fires, it returns to its resting potential (−70 mV) and excites all the neurons to which it is connected, thus changing the current flows in those neurons. With these neurons, we set the amount of current flowing into them (based on the similarity between the input vector and the preferred direction vector), instead of directly setting the firing rate. The simplest model of this process is the leaky integrate-and-fire (LIF) neuron. We use it here, although all of the techniques discussed in this paper can be applied to more complex neural models.

If we run the model for a very long time and average the resulting firing rate, we get the same picture as in Fig. 5. However, at any given time within the simulation, the voltage levels of the various neurons will be changing. Fig. 6 shows a snapshot of the resulting neural behavior, where black indicates the neuron resting and white indicates that enough voltage has built up for the neuron to fire. Any neuron that is white in this figure will soon fire, return to black, and then slowly build up voltage over simulated time. This rate is governed by the neuron's membrane time constant (set to 20 ms in accordance with McCormick, Connors, Lighthall, & Prince, 1985) and the refractory period (2 ms).
Fig. 6. A snapshot in time showing the voltage levels of 2,500 neurons using distributed encoding of preferred direction vectors of a 25-dimensional vector. Each circle is an individual neuron and the lighter shading indicates a higher membrane voltage. White circles indicate a neuron that is in the process of firing. Encodings for three different 25-dimensional vectors are shown.
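To make the encoding scheme concrete, here is a minimal NumPy sketch (ours, not the authors' simulation code) of an LIF population representing a 25-dimensional vector through randomly chosen preferred direction vectors. The membrane time constant, refractory period, and voltage values come from the text above; the gain and bias ranges are arbitrary illustrative assumptions, and the NEF's principled tuning of these parameters is not modeled.

import numpy as np

rng = np.random.default_rng(0)

D, N = 25, 2500              # vector dimensions and number of neurons, as in Fig. 5
dt = 0.001                   # 1 ms time step
tau_m, t_ref = 0.020, 0.002  # membrane time constant (20 ms), refractory period (2 ms)
v_rest, v_thresh = -0.070, -0.045  # resting (-70 mV) and firing (-45 mV) potentials

# Each neuron gets a random unit-length preferred direction vector, plus a gain
# and bias; the gain and bias ranges here are arbitrary illustrative choices.
encoders = rng.normal(size=(N, D))
encoders /= np.linalg.norm(encoders, axis=1, keepdims=True)
gain = rng.uniform(0.5, 3.0, size=N)
bias = rng.uniform(0.0, 1.0, size=N)

def firing_rates(x, T=0.5):
    """Simulate the LIF population representing vector x for T seconds;
    return each neuron's average firing rate."""
    J = gain * (encoders @ x) + bias   # input current grows with similarity to encoder
    v = np.full(N, v_rest)             # membrane voltages
    refractory = np.zeros(N)           # time left in each neuron's refractory period
    spike_counts = np.zeros(N)
    for _ in range(int(T / dt)):
        # Leaky integration toward the current-driven steady state.
        dv = (dt / tau_m) * (J * (v_thresh - v_rest) - (v - v_rest))
        v = np.where(refractory > 0, v_rest, v + dv)
        refractory = np.maximum(refractory - dt, 0.0)
        fired = v >= v_thresh          # neurons reaching -45 mV spike...
        spike_counts += fired
        v[fired] = v_rest              # ...and reset to -70 mV
        refractory[fired] = t_ref
    return spike_counts / T

x = rng.normal(size=D)
x /= np.linalg.norm(x)
rates = firing_rates(x)
# Neurons whose preferred direction is most similar to x fire fastest.
print(rates.max(), rates.mean())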
6.2. Simulation 2: Convolution of patterns

Now that we have specified how neurons can represent vectors, we can organize neurons to perform the convolution operation (Eliasmith, 2004). We need a neural model where we have two groups of neurons representing the input patterns (using the representation scheme above), and these neurons must be connected to a third group of neurons which will be driven to represent the convolution of the two original patterns. The Neural Engineering Framework (NEF) of Eliasmith and Anderson (2003) provides a methodology for converting a function such as convolution into a neural model by deriving the synaptic connection weights that will implement that function. Once these weights are found, we use the representation method discussed above to encode the two original vectors in neural groups A and B. When these neurons fire, the synaptic connections cause electric current to flow into any neurons to which they are connected. This flow in turn causes firing in neural group C, which will represent the convolution of the patterns in A and B. Details on the derivation of these synaptic connections can be found in the Appendix.

Fig. 7 shows how convolution combines the neural representation of two perceptual inputs, on the left, into a neural representation of their convolution, on the right. It is clear from the visual interpretation of the neural representation on the right that the convolution of the two input patterns is not simply the sum of those patterns and, therefore, amounts to an emergent binding of them.

Fig. 7. Convolution occurring in simulated neurons. The two input grids on the left are represented by the neural network firing patterns beside them. On the right is the neural network firing pattern that represents the convolution of the two inputs. The grid on the far right is the visual interpretation of the result of this convolution. Arrows indicate synaptic connections via intervening neural populations.

Importantly, one set of neural connection weights is sufficient for performing the convolution of any two input vectors. That is, the synaptic connections do not need to be changed if we need to convolve two new patterns; all that has to change is the firing patterns of the neurons in groups A and B. We do not rely on a slow learning process of changing synaptic weights: Convolution is a fast response to changes in perceptual inputs. If the synaptic connections in our NEF model correspond to fast glutamate receptors, convolution can
occur within 5 ms. However, we make no claims as to how the neurons come to have the particular connection weights that allow them to perform this convolution. These weights could be genetically specified or they could be learned over time.

After two representations have been combined, it is still possible to extract the original information from the combined representation. The process of convolution can be reversed, using neural connections almost identical to those needed for performing the convolution in the first place, as shown in Fig. 8. There is a loss of information in that the extracted information is only an approximation of the original. However, by increasing the number of vector values and the number of neurons per value, we can make this approximation as accurate as desired (Eliasmith & Anderson, 2003). Importantly, this is not a selective process: All of the original patterns are preserved. Given the concept ‘‘sound wave,'' we can always break it back down into ‘‘sound'' and ‘‘wave.'' Any specialization of the new concept must involve the development of new associations with the new pattern. The process for this is outside the scope of this paper, but since the new pattern is highly dissimilar from the original patterns, these new associations can be formed without disrupting conceptual associations with the original patterns.

Fig. 8. Deconvolution occurring in simulated neurons. Inputs are (top left) the output from Fig. 7 and (bottom left) the other input from Fig. 7. The result (far right) is an approximation of the first of the original inputs (top left in Fig. 7).

The key point here is that the process of convolution generates a new pattern given any two previous patterns, and that this process is reversible. In Fig. 7, the pattern on the right bears no similarity to either of the patterns on the left. In mathematical terms, the degree of similarity can be measured using the dot product, which tends toward zero the larger the vectors (Plate, 2003). This means that the new pattern can be used for cognitive processing without being mistaken for existing patterns. However, this new pattern can be broken back down into an approximation of the original patterns, if needed. This allows us to make the claim that the representational content is preserved via convolution, at least to a certain degree. The pattern on the right of Fig. 8 bears a close resemblance to the pattern on the top left of Fig. 7, and this accuracy increases with the number of dimensions in the vector and the number of neurons used. Stewart, Choo, and Eliasmith (2010b) have created a large-scale (373,000 neuron) model of the basal ganglia, thalamus, and cortex using this approach, and they have shown that it is capable of taking the combined representation of a sentence (such as ‘‘Romeo loves Juliet'') and accurately answering questions about that sentence. This result demonstrates that all of the representational content can be preserved up to some degree of accuracy. Furthermore, they have also shown that these representations can be manipulated using the equivalent of IF-THEN production rules that indicate which concepts should be combined and which ones should be taken apart. When these rules are implemented using realistic spiking neurons, they are found to require approximately 40 ms for simple rules (not involving conceptual combination) and approximately 65 ms for complex rules that combine concepts or extract them (Stewart, Choo, & Eliasmith, 2010a). This finding accords with the standard empirical finding that humans require 50 ms to perform a single cognitive action. However, this research has not addressed how different modalities can be combined.

6.3. Simulation 3: Multimodal convolution

Simulation 2 demonstrates the use of biologically realistic artificial neurons to combine two arbitrary vector-based representations, showing the feasibility of convolution as a method of conceptual combination. We used a visual example, but many other modalities can be captured by vectors. Hence, we can create multimodal representations by convolving representations from distinct modalities. For example, we might combine a particular visual stimulus with a particular auditory stimulus, along with a symbolic label, and even an emotional valence.

Earlier we proposed that the AHA! experience required a convolution of at least four components: two representations that get combined into a new one, and two aspects of emotional processing—cognitive appraisal and physiological perception. Fig. 9 shows our NEF simulation of how this might work. Instead of just two representations being convolved, there are a total of four that could stand for various aspects of the cognitive and emotional content of a situation. Each convolution is implemented as before, and they are added by having all convolutions project to the same set of output neurons. We do not need to assume that all of the vectors being combined are of the same length. For example, representing physiological perception may only require a few dimensions, whereas encoding symbolic content requires hundreds of dimensions. These vectors can still be combined by projecting the low-dimensional vector into the higher dimensional space. This means that we can combine any representations of any size and any modality into a single novel representation.

In our simulations, we have shown that vectors can be encoded using neural firing patterns. These vectors can represent information in any modality, from a visual stimulus to an internal state to a symbolic term. Furthermore, these representations can be combined via a neural implementation of the convolution operation that produces a new vector that encodes an approximation of all the original vectors. We have thus shown the computational feasibility of using convolution to combine representations of concepts and emotional reactions to them, generating the AHA! experience.
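At the vector level, the binding and unbinding behavior that Simulation 2 implements with spiking neurons can be checked in a few lines of NumPy. The sketch below uses the fast-Fourier-transform identity for circular convolution and Plate's (2003) approximate inverse; it is a check of the mathematics, not the spiking NEF model, and the 512-dimensional ''sound'' and ''wave'' vectors are random stand-ins.

import numpy as np

rng = np.random.default_rng(1)

def bind(a, b):
    """Circular convolution: binds two vectors into one of the same dimension."""
    return np.fft.irfft(np.fft.rfft(a) * np.fft.rfft(b), n=len(a))

def approx_inverse(a):
    """Plate's involution: keep the first element, reverse the rest."""
    return np.concatenate(([a[0]], a[:0:-1]))

def similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

D = 512
sound = rng.normal(0, 1 / np.sqrt(D), D)   # random vectors stand in for concepts
wave = rng.normal(0, 1 / np.sqrt(D), D)

sound_wave = bind(sound, wave)             # the emergent combined pattern
# The binding resembles neither constituent (similarities near 0)...
print(similarity(sound_wave, sound), similarity(sound_wave, wave))
# ...but unbinding with the approximate inverse of one input recovers an
# approximation of the other, improving with the dimension D.
print(similarity(bind(sound_wave, approx_inverse(wave)), sound))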
Fig. 9. Combining four representations into a single result. The arrows suggest the main flow of information, but they do not preclude the presence of many reentry (feedback) loops.
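Continuing the sketch above, the combination that Fig. 9 depicts can be mirrored at the vector level by summing pairwise bindings into one output, with a low-dimensional physiology vector projected into the common space. Zero-padding is used here as one simple projection; the names, dimensions, and that choice are our illustrative assumptions, not details of the authors' model.

import numpy as np

rng = np.random.default_rng(2)
D = 512

def bind(a, b):
    # Circular convolution, as in the previous sketch.
    return np.fft.irfft(np.fft.rfft(a) * np.fft.rfft(b), n=len(a))

def project(v, dim):
    """Embed a low-dimensional vector in a higher-dimensional space by zero-padding."""
    out = np.zeros(dim)
    out[: len(v)] = v
    return out

def vec(d=D):
    return rng.normal(0, 1 / np.sqrt(d), d)

concept1, concept2, appraisal = vec(), vec(), vec()
physiology = project(vec(8), D)   # e.g., a few dimensions for bodily state

# All convolutions project to the same output population, so their
# contributions are simply added into a single combined pattern.
combined = bind(concept1, concept2) + bind(appraisal, physiology)
print(combined.shape)             # one 512-dimensional vector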
Our examples of convolution using visual patterns are obviously much simpler than important historical cases such as the double helix and the BlackBerry. But complex representations are made up of simpler ones, and convolution provides a plausible mechanism for building the kinds of rich, multimodal structures needed for creative human cognition.
7. What convolutions are creative?

Obviously, however, not all convolutions are creative. If our neural account is correct, even mundane conceptual combinations such as red shirt are produced by convolution, so what makes some convolutions creative? According to Boden (2004), what characterizes creative acts is that they are novel, surprising, and valuable. Most combinations of representations meet none of these conditions. However, convolutions that generate the AHA! emotional response are far more likely to satisfy them, as can be shown by consideration of the relevant affective mechanisms.

In our current simulations, the appraisal and physiological components of emotion are provided as inputs, but they could naturally be generated by the neural processes described in the EMOCON model of Thagard and Aubie (2008). Following Sander, Grandjean, and Scherer (2005), novelty could be assessed by inputs concerning suddenness, familiarity, predictability, intrinsic pleasantness, and goal-need relevance. The latter two factors, especially goal-need relevance, are also directly relevant to assessing whether a new conceptual combination (new convolution) is potentially valuable. A mundane combination such as red shirt generates little emotion because it usually makes little contribution to potential goal
satisfaction. In contrast, combinations like wireless email and sound wave can be quickly appraised as highly relevant to satisfying commercial or scientific goals. Hence, such convolutions get stronger emotional associations because the appraisal process has marked them as novel and valuable, in line with Boden's criteria.

Surprise is more complicated. Thagard (2000) proposed a localist neural network model of surprise in which an assessment is made of the extent to which nodes change their activation. Surprise concerning an element results from rapid shifts in activation of the node representing the element from positive to negative or vice versa. A similar process could be built into a more neurologically plausible distributed model by having neural populations that respond to rapid changes in the patterns of activation in other neural populations. Mathematically, this process is equivalent to taking the derivative of the activation value, and realistic neural models of this form have been developed (Tripp & Eliasmith, 2010).

Like other emotional reactions, surprise comes in various degrees, from mild appreciation of something new to wild astonishment. Thagard and Aubie (2008) propose that the intensity of emotional experience derives from the firing rates of neurons in the relevant populations. For example, high degrees of pleasure arise from rapid rates of firing in neurons in the dopamine system. Analogously, a very strong AHA! reaction would result from rapid firing in the neural population that provides the convolution of the conceptual combination with the emotional reaction that itself combines both cognitive appraisal and physiological perception. There is no need to postulate some kind of AHA! detection module in the brain, as the emotion-generating convolutions may occur on the fly in various brain areas with connectivity to neural populations for both representing and evaluating. The representational format for all these processes is the same—patterns of neural firing—so exchange and integration of many different kinds of information is achieved.

In these ways, the neural system itself can emotionally mark some convolutions as representing results that are potentially more novel, surprising, and valuable and hence more likely to qualify as creative according to Boden's (2004) criteria. Of course, there are no guarantees, as people sometimes get excited about ideas that turn out to be derivative or useless. But it is a crucial part of the emotional system that it serves to identify some conceptual combinations as exciting and hence worth storing in long-term memory and serving in future problem solving. The AHA! experience is not just a side effect of creative thinking, but rather a central aspect of identifying those convolutions that are potentially creative.

A philosophical worry about combination and convolution is how newly generated patterns of neural activation can retain or incorporate the content (meaning) of the original concepts. How does the convolved representation for wireless email contain the content for both wireless and email? Answering this question presupposes solutions to highly controversial philosophical problems about how psychological and neural representations gain their contents. We cannot defend a full answer here, but merely provide a sketch based on accounts of neurosemantics defended elsewhere (Eliasmith, 2005b; Parisien & Thagard, 2008; Thagard, 2010, unpublished data).
We avoid the term ‘‘content’’ because it misleadingly suggests that the meaning of a representation is some kind of thing rather than a multifaceted relational process.
Representations acquire meaning in two ways, through processes that relate them to the world and through processes that relate them to other representations. A representational neural population is most clearly meaningful when its firing activity is causally correlated with events in the world, that is, when there is a statistical correlation that results from causal interactions. These interactions can operate in both directions, from perception of objects in the world causing neural firing, and from neural firing causing changes in perceptions of objects. However, the firing activity of a representational neural population is also affected by the firing activity of other neural populations, as is most clear in neural populations for abstract concepts such as justice and infinity. Hence, the crucial question about the meaningfulness of conceptual combinations by convolution is: How do the relational processes that establish the meaning of two original concepts contribute to the relational processes that establish the meaning of the combined concept produced by convolution in a neural network?

Take a simple example such as red shirt. The neural representation (firing pattern) for red gets its meaning from causal correlations with red things in the world and from causal correlations with other neural representations such as ones for color and blood. Similarly, the neural representation for shirt gets its meaning from causal correlations with shirts and with other neural representations such as ones for clothing and sleeves. If the combination red shirt was triggered by perception of a red shirt, then the new convolving neural population will have a causal correlation with the stimulus, just as the neural populations for red and shirt will. Hence, there will be some overlap in the firing patterns for red, shirt, and red shirt. Similarly, if the conceptual combination is triggered by an utterance of the words ‘‘red shirt'' rather than anything perceptual, there will still be some overlap in the firing patterns of red shirt with red and shirt thanks to interconnections with the other related concepts such as color and clothing. Hence, it is reasonable to conclude that the neural population representing the convolution red shirt retains much of the meaning of both red and shirt.
8. Limitations

We have presented a detailed, neurocomputational account of psychological mechanisms that may contribute to creativity and the AHA! experience. Our hypothesis that creativity arises from neural processes of convolution explains how multimodal concepts in the brain can be combined, how original ideas can be generated, and how production of new ideas can be an emotional experience. But we acknowledge that our account has many limitations with respect to describing the neural and other kinds of processes that are involved in creativity and innovation. We see the models presented here as only a small part of a full theory, so we sketch below some of the missing neural, psychological, and social ingredients.
First, we have no direct evidence that convolution is the specific neurocomputational mechanism for conceptual combination. However, it is currently the only proposed mechanism that is general purpose, scalable, and works with realistic neurons. Convolution has the appropriate mathematical properties for combining any sort of neural information and is
consistent with biological limitations. It scales up to human-sized vocabularies (Plate, 2003) without requiring an unfeasibly large number of additional neurons to coordinate firing activity in different neural populations (Eliasmith & Stewart, forthcoming). The mechanism works with noisy spiking neurons, and it can make use of detailed models of individual neurons. That said, we need additional neuroscientific evidence, and the development of alternative computational models for comparison, before it would be reasonable to claim that convolution is the mechanism of idea generation. Theorizing about neural mechanisms for high-level cognition is still in its infancy. We have assumed, in accord with many current views in theoretical neuroscience, that patterns of neural firing are the brain’s vehicles for mental representations, but future research may necessitate a more complex account that incorporates, for example, chemical activity in glial cells that interact with neurons.
Second, our current model of convolution combines whole concepts without selectively picking out aspects of them. In contrast, the symbolic computational model of conceptual combination developed by Thagard (1988) allows new concepts, construed as collections of slots and values, to select various slots and values from the original concepts. We conjecture that convolution can be made more selective using mechanisms like those now performing neurobiologically realistic rule-based reasoning (Stewart, Choo, & Eliasmith, 2010a, 2010b), but selective convolution is a subject for future research.
More generally, our account is intended as only part of a general, multilevel account of creativity. By no means are we proposing a purely neural, ruthlessly reductionist account of creativity. We recognize the importance of understanding thinking in terms of multilevel mechanisms, ranging from the molecular to the psychological to the social, as shown in Fig. 10 (Thagard, 2009, 2010; see also Bechtel, 2008; Craver, 2007).

Fig. 10. Causal interactions between four levels of analysis (social, psychological, neural, molecular) relevant to understanding creativity.

Creativity has many social aspects, requiring interaction among people with overlapping ideas and interests. For example, scientific research today is largely collaborative, and many discoveries occur because of fruitful interactions among researchers (Thagard, 1999, ch. 11, 2006b). Neurocomputational models ignore the social causes of creative breakthroughs, which are often crucial in explaining how different representations come to be combined in a single brain. For example, Darwin’s ideas about evolution and breeding (artificial selection) were the social
result of many interactions he had with other scientists and farmers, and various engineers were involved in the development of the wireless technologies that evolved into the BlackBerry. A full theory of creativity and innovation will have to flesh out the upward and downward arrows in Fig. 10 in a way that produces a more complete account of creativity.
Fig. 10 is incompatible with other, more common views of causality in multilevel systems. Reductionist views assume that causality runs only upward, from the molecular to the neural to the psychological to the social. At the opposite extreme, antireductionist views assume that the complexity of systems and the nature of emergent properties (ones that hold of higher-level objects and not of their constituents) mean that higher levels must be described as largely independent of lower ones, so that social explanations (e.g., in sociology and economics) have a large degree of autonomy from psychological ones, and psychological explanations have a large degree of autonomy from neural and molecular ones.
From our perspective, reductionist, individualistic views are inadequate because they ignore ways in which objects and events at higher levels have causal effects on objects and events at lower levels. Consider two scientists (perhaps Crick and Watson, who developed many of their ideas about DNA jointly) managing in conversation to make a creative breakthrough. This conversation is clearly a social interaction, with two people communicating in ways that may be both verbal and nonverbal. Such interactions can have profound psychological, neural, and molecular effects. When the scientists bring together two concepts or other representations that they have not previously connected, their beliefs may change dramatically, for example, when they realize they have found a hypothesis that can solve the problem on which they have been jointly working. This psychological change is also a neural change, altering their patterns of neural activity. If the scientists are excited or even a little pleased by their new discovery, they will also undergo molecular changes such as increases in dopamine levels in the nucleus accumbens and other reward-related brain areas. Because social interactions can cause important psychological, neural, and molecular changes, the reductionist view that only considers how lower-level systems causally affect higher-level ones is implausible. Craver and Bechtel (2007) argue against the value of considering causal relations between levels, but P. Thagard and J. V. Wood (unpublished data) dispute their arguments.
On the other hand, the holistic, antireductionist view that insists on the autonomy of higher levels from lower ones is also implausible. Molecular changes even as simple as ingestion of caffeine or alcohol can have large effects on psychological and social processes; compare the remark that a mathematician is a device for turning coffee into theorems. If our account of creative representation combination as neural binding is on the right track, then part of the psychological explanation of the creativity of individuals, and hence part of the social explanation of the productivity of groups, will require attention to neural processes. Genetic processes may also be relevant, as seen in the recent finding that combinations of genes involved in dopamine transmission have some correlation with artistic capabilities (Kevin Dunbar, personal communication).
One useful way to characterize multilevel relations is provided by Bunge (2003), who critiques both holism and individualism. He defines a system as a quadruple:

system = ⟨Composition, Environment, Structure, Mechanism⟩
where: Composition = collection of parts; Environment = items that act on the parts; Structure = relations among parts, especially bonds between them; Mechanism = processes that make the system behave as it does.
Our discussion in this paper has been primarily at the neural level: The composition is a collection of neurons; the environment consists of the physiological inputs, such as external and internal perception, that cause changes in the firing of neurons; the structure consists of the excitatory and inhibitory synaptic connections among neurons; and the mechanism is the whole set of neurochemical processes involved in neural firing. A complete theory of creativity would require specifying social, psychological, and molecular systems as well, not just on their own but in relation to neural processes of creativity. We would need to specify the composition of social, psychological, neural, and molecular systems in a way that exhibits their part-whole relations. Much more problematically, we would need to describe the relations among the processes that operate at different levels. This project of multilevel interaction goes far beyond the scope of the current paper, but we mention it here to forestall the objection that neural convolution is only one aspect of creativity; we acknowledge that our neural account is only part of a full scientific explanation of creativity. P. Thagard and J. V. Wood (unpublished data) advocate the method of multilevel interactive mechanisms as a general approach for cognitive science.
Various writers on the social processes of innovation have remarked on the importance of interpersonal contact for the transmission of tacit knowledge, which can be very difficult to put into words (e.g., Asheim & Gertler, 2005). Our neural account provides a nonmysterious way of understanding tacit knowledge, as the neural representations our models employ are compatible with procedural, emotional, and perceptual representations that may be nonverbal. Transferring such information may require thinkers to work together in the same physical environment so that bodily interactions via manipulation of objects, diagrams, gestures, and facial expressions can provide sources of communication that may be as rich as verbal conversation. A full account of creativity and innovation that includes the social dimension should explain how nonverbal communication can contribute to the joint production of tacit knowledge, which we agree is an important part of scientific thinking (Sahdra & Thagard, 2003). See S. Helie and R. Sun (unpublished data) for a connectionist account of insight through conversion of knowledge from implicit to explicit.
Even at the psychological level, our account of creativity is incomplete. We have described how two representations can be combined into new ones, but we have not attempted to say what triggers such combinations. Specifying triggering conditions for representation combination would require full theories of language processing and problem solving. We have not investigated how conceptual combination can occur as part of language processing (e.g., Medin & Shoben, 1988; Wisniewski, 1997). Conceptual combination is clearly not sufficient for creativity, as people make many boring combinations as part of everyday communication. Moreover, in addition to conceptual combination, there are
other kinds of mental operations important for creativity, as we review in the next section on comparisons with other researchers’ accounts of creativity.
For problem solving, Thagard (1988) presented a computational model of how conceptual combination could be triggered when two concepts become attributed to the same object during the attempt to solve explanation problems, and a similar account could apply to the kinds of creativity we have been concerned with here. When scientists, inventors, social activists, or artists are engaged in challenging activity, they naturally have mental representations such as concepts active simultaneously in working memory. Combining such representations occasionally produces creative results. But the neurocomputational models described in this paper deal only with the process of combination, not with the triggering conditions for combination or with the post-combination employment of newly created representations. A full account of creativity as representation combination would need to include both the upstream processes that trigger combination and the downstream processes that make use of newly created representations, as shown in Fig. 11. The downstream processes include assessing the ongoing usefulness of the newly created representations for whatever purpose inspired them. For example, the creation of a new theoretical concept such as sound wave becomes important if it enters the scientific vocabulary and is subsequently employed in ongoing explanations. Thus, a full theory of creativity would have to include a description of problem solving and language processing as both inputs to and outputs from the neural process of representation generation, which is all we have tried to model in this paper. Implementing Fig. 11 would require integrated neurocomputational models of problem solving and language processing that remain to be developed.

Fig. 11. Problem solving and language processing as contributing to and affected by representation combination.
9. Comparisons with related work

Our neural models of representation combination and emotional reaction are broadly compatible with the ideas of many researchers on creativity whose work has addressed different issues. Boden (2004) usefully distinguishes three forms of creativity: combinatorial, exploratory, and transformational. We have approached only the combinatorial form, in which new concepts are generated, but have not dealt with the broader exploration and transformation of conceptual spaces. We have been concerned with what Boden calls
psychological creativity, which involves the generation of representations new to particular individuals, and not historical creativity, which involves the generation of completely new representations not previously produced by any individual.
Our approach in this paper has also been narrower than Nersessian’s (2008) discussion of mental models and their use in creating scientific concepts. She defines a mental model as a ‘‘structural, behavioral, or functional analog representation of a real-world or imaginary situation, event or process’’ (p. 93). We agree with her contention that many kinds of internal and external representations are used during scientific reasoning, including ones tightly coupled to embodied perceptions and actions. As yet, no full neurocomputational theory of mental models has been developed, so we are unable to situate our account of representation combination within the rich mental-models approach to reasoning. But our account is certainly compatible with Nersessian’s (2008, p. 138) claim that conceptual changes arise from interactive processes of constructing, manipulating, evaluating, and revising models. P. Thagard (unpublished data) sketches how mental models can be understood in terms of neural processes.
We agree with Hofstadter (1995) that many computational models of analogy have relied too heavily on fixed verbal representations that prevent the kind of fluidity found in many kinds of creative thinking. In the models described in this paper, we use neural representations of a more biologically realistic sort, representing concepts by the activity of thousands of spiking neurons, not by single nodes as in localist connectionist simulations, nor by a few dozen nonspiking neurons as in parallel distributed processing (PDP) simulations. We acknowledge, however, that we have not produced a new model of analogical processing. The model of Eliasmith and Thagard (2001) was a first step toward a vector-based system of analogical mapping, but no full neural implementation of that system has yet been produced.
The scope of our project is also narrower than much artificial intelligence research on scientific discovery that develops more general accounts of problem solving (e.g., Bridewell, Langley, Todorovski, & Dzeroski, 2008; Langley, Simon, Bradshaw, & Zytkow, 1987). Our neural simulations are incapable of generating new scientific laws or representations of processes that are crucial for a general account of scientific discovery. Another example of representation generation is the ‘‘crossover’’ operation that operates as part of genetic algorithms in the generation of new rules in John Holland’s classifier systems (Holland et al., 1986). Our models could be seen as performing a kind of crossover mating between neural patterns, but they are not embedded in a general system capable of making complex inferences.
Our account of creativity as based on representation combination is similar to the idea of blending (conceptual integration) developed by Fauconnier and Turner (2002), which is modeled computationally by Pereira (2007). Our account differs in providing a neural mechanism for combining multimodal representations, including emotional reactions.
Brain imaging is beginning to yield interesting results about the neural correlates of creativity (e.g., Bowden & Jung-Beeman, 2003; Kounios & Beeman, 2009; Subramaniam, Kounios, Parrish, & Jung-Beeman, 2009).
Unfortunately, our neural models of creativity are not yet organized into specific brain areas, so we cannot explain particular findings concerning these neural correlates.
10. Conclusion

Despite the limitations described in the last two sections, we think our account of representation generation is novel and interesting, perhaps even creative, in several respects. First, we have shown how conceptual combination can occur in biologically realistic populations of thousands of spiking neurons. Second, we employed a mechanism of binding—convolution—that differs in important ways from the more commonly advocated synchrony mechanism. Third, and perhaps most important, we have used convolution to show how the creative generation of new representations can generate emotional reactions such as the much-desired AHA! experience.
Convolution also provides an alternative to synchronization as a potential naturalistic solution to the classic philosophical problem of explaining the apparent unity of consciousness. In Fig. 8, we portrayed a simulation that integrates the activity of seven neural populations, showing how there can be a unified experience of the combination of two concepts and an emotional reaction to them. We do not mean to suggest that there is a specific locus of consciousness in the brain, as there are many different convergence zones (also known as association areas) where information from multiple neural populations can come together. The dorsolateral prefrontal cortex seems to be an important convergence zone for working memory and hence for consciousness, as in the EMOCON model of Thagard and Aubie (2008).
Neural processes of the sort we have described are capable of encoding the full range of representations that contribute to human thinking, including ones that are verbal, visual, auditory, olfactory, tactile, kinesthetic, procedural, and emotional. This range should enable application of the mechanism of representation combination to all realms of human creativity, including scientific discovery, technological invention, social innovation, and esthetic imagination. But much research remains to be done to build a full, detailed account of the creative brain.
Acknowledgments

Our research has been supported by the Natural Sciences and Engineering Research Council of Canada and SHARCNET. Thanks to Chris Eliasmith for valuable advice on simulations and for helpful comments on a previous draft. We have also benefitted from suggestions by Nancy Nersessian, William Bechtel, Robert Hadley, and an anonymous referee.
References

Anderson, J. R. (1993). Rules of the mind. Hillsdale, NJ: Erlbaum.
Asheim, B. T., & Gertler, M. S. (2005). The geography of innovation: Regional innovation systems. In J. Fagerberg, D. C. Mowery, & R. R. Nelson (Eds.), The Oxford handbook of innovation (pp. 291–317). Oxford, England: Oxford University Press.
Barsalou, L. W., Simmons, W. K., Barbey, A. K., & Wilson, C. D. (2003). Grounding conceptual knowledge in modality-specific systems. Trends in Cognitive Sciences, 7, 84–91.
Bechtel, W. (2008). Mental mechanisms: Philosophical perspectives on cognitive neuroscience. New York: Routledge.
Blasdel, G. G., & Salama, G. (1986). Voltage sensitive dyes reveal a modular organization in monkey striate cortex. Nature, 321, 579–585.
Boden, M. (2004). The creative mind: Myths and mechanisms (2nd ed.). London: Routledge.
Bowden, E. M., & Jung-Beeman, M. (2003). Aha! Insight experience correlates with solution activation in the right hemisphere. Psychonomic Bulletin and Review, 10, 730–737.
Bridewell, W., Langley, P., Todorovski, L., & Dzeroski, S. (2008). Inductive process modeling. Machine Learning, 71, 1–32.
Bunge, M. (2003). Emergence and convergence: Qualitative novelty and the unity of knowledge. Toronto: University of Toronto Press.
Chandrasekharan, S. (2009). Building to discover: A common coding model. Cognitive Science, 33, 1059–1086.
Churchland, P. M. (1989). A neurocomputational perspective. Cambridge, MA: MIT Press.
Churchland, P. S., & Sejnowski, T. (1992). The computational brain. Cambridge, MA: MIT Press.
Costello, F. J., & Keane, M. T. (2000). Efficient creativity: Constraint-guided conceptual combination. Cognitive Science, 24, 299–349.
Craver, C. F. (2007). Explaining the brain. Oxford, England: Oxford University Press.
Craver, C. F., & Bechtel, W. (2007). Top-down causation without top-down causes. Biology and Philosophy, 22, 547–563.
Crick, F. (1994). The astonishing hypothesis: The scientific search for the soul. London: Simon and Schuster.
Damasio, A. R. (1994). Descartes’ error. New York: G. P. Putnam’s Sons.
Dayan, P., & Abbott, L. F. (2001). Theoretical neuroscience: Computational and mathematical modeling of neural systems. Cambridge, MA: MIT Press.
Eliasmith, C. (2004). Learning context sensitive logical inference in a neurobiological simulation. In S. Levy & R. Gayler (Eds.), Compositional connectionism in cognitive science (pp. 17–20). Menlo Park, CA: AAAI Press.
Eliasmith, C. (2005a). Cognition with neurons: A large-scale, biologically realistic model of the Wason task. In B. Bara, L. Barsalou, & M. Bucciarelli (Eds.), Proceedings of the XXVII Annual Conference of the Cognitive Science Society (pp. 624–629). Mahwah, NJ: Lawrence Erlbaum Associates.
Eliasmith, C. (2005b). Neurosemantics and categories. In H. Cohen & C. Lefebvre (Eds.), Handbook of categorization in cognitive science (pp. 1035–1054). Amsterdam: Elsevier.
Eliasmith, C. (forthcoming). How to build a brain. Oxford, England: Oxford University Press.
Eliasmith, C., & Anderson, C. H. (2003). Neural engineering: Computation, representation and dynamics in neurobiological systems. Cambridge, MA: MIT Press.
Eliasmith, C., & Thagard, P. (2001). Integrating structure and meaning: A distributed model of analogical mapping. Cognitive Science, 25, 245–286.
Engel, A. K., Fries, P., König, P., Brecht, M., & Singer, W. (1999). Temporal binding, binocular rivalry, and consciousness. Consciousness and Cognition, 8, 128–151.
Fauconnier, G., & Turner, M. (2002). The way we think. New York: Basic Books.
Feynman, R. (1999). The pleasure of finding things out. Cambridge, MA: Perseus Books.
Georgopoulos, A. P., Schwartz, A. B., & Kettner, R. E. (1986). Neuronal population coding of movement direction. Science, 233(4771), 1416–1419.
Grandjean, D., Sander, D., & Scherer, K. R. (2008). Conscious emotional experience emerges as a function of multilevel, appraisal-driven response synchronization. Consciousness and Cognition, 17, 484–495.
Hebb, D. O. (1949). The organization of behavior. New York: Wiley.
Hebb, D. O. (1980). Essay on mind. Hillsdale, NJ: Lawrence Erlbaum.
Hofstadter, D. (1995). Fluid concepts and creative analogies: Computer models of the fundamental mechanisms of thought. New York: Basic Books.
Holland, J. H., Holyoak, K. J., Nisbett, R. E., & Thagard, P. R. (1986). Induction: Processes of inference, learning, and discovery. Cambridge, MA: MIT Press/Bradford Books.
Holyoak, K. J., & Thagard, P. (1995). Mental leaps: Analogy in creative thought. Cambridge, MA: MIT Press/Bradford Books.
Hopfield, J. (1999). Odor space and olfactory processing: Collective algorithms and neural implementation. Proceedings of the National Academy of Sciences of the United States of America, 96, 12506–12511.
Hummel, J. E., & Holyoak, K. J. (1997). Distributed representations of structure: A theory of analogical access and mapping. Psychological Review, 104, 427–466.
Hummel, J. E., & Holyoak, K. J. (2003). A symbolic-connectionist theory of relational inference and generalization. Psychological Review, 110, 220–264.
James, W. (1884). What is an emotion? Mind, 9, 188–205.
Johnson, J. S., Spencer, J. P., & Schöner, G. (2008). Moving to higher ground: The dynamic field theory and the dynamics of visual cognition. New Ideas in Psychology, 26, 227–251.
Koestler, A. (1967). The act of creation. New York: Dell.
Kounios, J., & Beeman, M. (2009). The Aha! moment: The cognitive neuroscience of insight. Current Directions in Psychological Science, 18, 210–216.
Koza, J. R. (1992). Genetic programming. Cambridge, MA: MIT Press.
Kunda, Z., Miller, D., & Claire, T. (1990). Combining social concepts: The role of causal reasoning. Cognitive Science, 14, 551–577.
Laird, J., Rosenbloom, P., & Newell, A. (1986). Chunking in Soar: The anatomy of a general learning mechanism. Machine Learning, 1, 11–46.
Langley, P., Simon, H., Bradshaw, G., & Zytkow, J. (1987). Scientific discovery. Cambridge, MA: MIT Press/Bradford Books.
Lenat, D., & Brown, J. S. (1984). Why AM and Eurisko appear to work. Artificial Intelligence, 23, 269–294.
McCormick, D. A., Connors, B. W., Lighthall, J. W., & Prince, D. A. (1985). Comparative electrophysiology of pyramidal and sparsely spiny stellate neurons of the neocortex. Journal of Neurophysiology, 54, 782–806.
Medin, D., & Shoben, E. (1988). Context and structure in conceptual combination. Cognitive Psychology, 20, 158–190.
Mednick, S. A. (1962). The associative basis of the creative process. Psychological Review, 69, 220–232.
Minsky, M. (1975). A framework for representing knowledge. In P. H. Winston (Ed.), The psychology of computer vision (pp. 211–277). New York: McGraw-Hill.
Mumford, M. D. (2002). Social innovation: Ten cases from Benjamin Franklin. Creativity Research Journal, 14, 253–266.
Nersessian, N. (2008). Creating scientific concepts. Cambridge, MA: MIT Press.
Parisien, C., & Thagard, P. (2008). Robosemantics: How Stanley the Volkswagen represents the world. Minds and Machines, 18, 169–178.
Pereira, F. C. (2007). Creativity and artificial intelligence. Berlin: Mouton de Gruyter.
Plate, T. (2003). Holographic reduced representations. Stanford: CSLI.
Prinz, J. (2004). Gut reactions: A perceptual theory of emotion. Oxford, England: Oxford University Press.
Roskies, A. L. (1999). The binding problem. Neuron, 24, 7–9.
Rumelhart, D. E., & McClelland, J. L. (Eds.). (1986). Parallel distributed processing: Explorations in the microstructure of cognition. Cambridge, MA: MIT Press/Bradford Books.
Sahdra, B., & Thagard, P. (2003). Procedural knowledge in molecular biology. Philosophical Psychology, 16, 477–498.
Salinas, E., & Abbott, L. F. (1994). Vector reconstruction from firing rates. Journal of Computational Neuroscience, 1, 89–107.
Sander, D., Grandjean, D., & Scherer, K. R. (2005). A systems approach to appraisal mechanisms in emotion. Neural Networks, 18, 317–352.
Saunders, D., & Thagard, P. (2005). Creativity in computer science. In J. C. Kaufman & J. Baer (Eds.), Creativity across domains: Faces of the muse (pp. 153–167). Mahwah, NJ: Lawrence Erlbaum Associates.
Shastri, L., & Ajjanagadde, V. (1993). From simple associations to systematic reasoning: A connectionist representation of rules, variables, and dynamic bindings. Behavioral and Brain Sciences, 16, 417–494.
Smith, E. E., & Kosslyn, S. M. (2007). Cognitive psychology: Mind and brain. Upper Saddle River, NJ: Pearson Prentice Hall.
Smith, E., & Osherson, D. (1984). Conceptual combination with prototype concepts. Cognitive Science, 8, 337–361.
Smolensky, P. (1990). Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artificial Intelligence, 46, 159–217.
Sternberg, R. J., & Davidson, J. E. (Eds.). (1995). The nature of insight. Cambridge, MA: MIT Press.
Stewart, T. C., Choo, X., & Eliasmith, C. (2010a). Dynamic behaviour of a spiking model of action selection in the basal ganglia. In D. Salvucci & G. Gunzelmann (Eds.), Proceedings of the 10th International Conference on Cognitive Modeling (pp. 235–240). Philadelphia, PA: Drexel University.
Stewart, T. C., Choo, X., & Eliasmith, C. (2010b). Symbolic reasoning in spiking neurons: A model of the cortex/basal ganglia/thalamus loop. In S. Ohlsson & R. Catrambone (Eds.), Proceedings of the 32nd Annual Conference of the Cognitive Science Society (pp. 1100–1105). Portland, OR: Cognitive Science Society.
Stewart, T. C., & Eliasmith, C. (2009). Compositionality and biologically plausible models. In W. Hinzen, E. Machery, & M. Werning (Eds.), Oxford handbook of compositionality. Oxford, England: Oxford University Press.
Subramaniam, K., Kounios, J., Parrish, T. B., & Jung-Beeman, M. (2009). A brain mechanism for facilitation of insight by positive affect. Journal of Cognitive Neuroscience, 21, 415–432.
Sweeny, A. (2009). BlackBerry planet. Mississauga, ON: Wiley.
Thagard, P. (1988). Computational philosophy of science. Cambridge, MA: MIT Press.
Thagard, P. (1999). How scientists explain disease. Princeton, NJ: Princeton University Press.
Thagard, P. (2000). Coherence in thought and action. Cambridge, MA: MIT Press.
Thagard, P. (2006a). Hot thought: Mechanisms and applications of emotional cognition. Cambridge, MA: MIT Press.
Thagard, P. (2006b). How to collaborate: Procedural knowledge in the cooperative development of science. Southern Journal of Philosophy, 44, 177–196.
Thagard, P. (2009). Why cognitive science needs philosophy and vice versa. Topics in Cognitive Science, 1, 237–254.
Thagard, P. (2010). The brain and the meaning of life. Princeton, NJ: Princeton University Press.
Thagard, P., & Aubie, B. (2008). Emotional consciousness: A neural model of how cognitive appraisal and somatic perception interact to produce qualitative experience. Consciousness and Cognition, 17, 811–834.
Treisman, A. (1996). The binding problem. Current Opinion in Neurobiology, 6, 171–178.
Tripp, B. P., & Eliasmith, C. (2010). Population models of temporal differentiation. Neural Computation, 22, 621–659.
Werning, M., & Maye, A. (2007). The cortical implementation of complex attribute and substance concepts: Synchrony, frames, and hierarchical binding. Chaos and Complexity Letters, 2, 435–452.
Wimsatt, W. C. (2007). Re-engineering philosophy for limited beings. Cambridge, MA: Harvard University Press.
Wisniewski, E. J. (1997). Conceptual combination: Possibilities and esthetics. In T. B. Ward, S. M. Smith, & J. Vaid (Eds.), Conceptual structures and processes: Emergence, discovery, and change (pp. 51–81). Washington, DC: American Psychological Association.
Wolpert, D. M., & Ghahramani, Z. (2000). Computational principles of movement neuroscience. Nature Neuroscience, 3, 1212–1217.
Wu, L. L., & Barsalou, L. W. (2009). Perceptual simulation in conceptual combination: Evidence from property generation. Acta Psychologica, 132, 173–189.
Appendix

Representation using vectors across modalities

Using convolutions to combine neural representations makes the fundamental assumption that anything we wish to represent in the brain can be treated as a vector (i.e., a set of numbers of a fixed size). It is clear how this approach can be mapped to visual representations; at the simplest level, the brightness of each pixel in an image can be mapped onto one value in the vector. This approach is used to create the figures in this paper. For color images, three more values (representing the three primary colors) may be used per pixel. A more realistic approach can draw on extensive research on the primate visual cortex, which shows many layers of neurons, each of which transforms the vector in the previous layer into a new vector, where, for example, individual values may indicate the presence of edges at particular locations.
Other modalities can also be treated in this manner. For audition, the cochlea contains sensory cells responsive to different sound frequencies (from 20 to 20,000 Hz). The activity of these cells can be seen as a vector of values, each value corresponding to a different frequency. For olfaction, the mammalian nose contains approximately 2,000 different types of receptor cells, each one sensitive to a different range of chemicals. This range can be treated as a large vector with 2,000 values, one value for each type of receptor cell (e.g., Hopfield, 1999). Verbal representations such as sentences, frames (sets of attribute-value pairs), and analogies can be converted into vectors using the holographic reduced representation method of Plate (2003; see also Eliasmith & Thagard, 2001).

Representing vectors using neurons

The simulations in this paper use the general approach of Eliasmith and Anderson (2003) to encoding a vector into a population of neurons. Every neuron has a preferred direction vector \tilde{\phi} such that the current entering the neuron is proportional to the similarity (dot product) between \tilde{\phi} and the vector x being represented. If \alpha is the sensitivity of the neuron and J^{bias} is a fixed background current, then the total current flowing into cell i at any given point is

J_i = \alpha_i \tilde{\phi}_i \cdot x + J_i^{bias}    (1)
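As a concrete illustration of Eq. 1, the encoding current can be computed for an entire population at once. This is a minimal sketch under assumed parameters (random preferred direction vectors, gains, and bias currents; all names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n_neurons, D = 100, 8                       # population size, vector dimension

# Preferred direction vectors (phi tilde), one unit vector per neuron.
phi_tilde = rng.standard_normal((n_neurons, D))
phi_tilde /= np.linalg.norm(phi_tilde, axis=1, keepdims=True)

alpha = rng.uniform(0.5, 2.0, n_neurons)    # per-neuron sensitivities
J_bias = rng.uniform(0.0, 1.0, n_neurons)   # fixed background currents

x = rng.standard_normal(D)                  # the vector being represented
J = alpha * (phi_tilde @ x) + J_bias        # Eq. 1, for all neurons at once
```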
In our simulations, we model each neuron with the standard leaky integrate-and-fire (LIF) model. This model adjusts the voltage of a neuron based on its input current and its membrane time constant \tau_m, as per Eq. 2.

\frac{dV}{dt} = \frac{-V(t) + R I(t)}{\tau_m}    (2)
When the voltage reaches its firing potential, the neuron fires, and the voltage returns to its resting potential for a set period of time (the neural refractory period). This produces a series of spikes at times t_{in} for each neuron i. If the LIF model’s effects are written as G[\cdot] and the neural noise of variance \sigma^2 is \eta(\sigma), then the encoding of any given x as the temporal spike pattern across the neural group is given as

\sum_n \delta(t - t_{in}) = G_i\left[\alpha_i \tilde{\phi}_i \cdot x(t) + J_i^{bias} + \eta_i(\sigma)\right]    (3)
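Eqs. 2 and 3 can be simulated directly. The sketch below (illustrative constants; the noise term is omitted for brevity) drives a single LIF neuron with a current trace and collects its spike times:

```python
import numpy as np

def lif_spikes(J, dt=0.001, tau_m=0.02, tau_ref=0.002, V_th=1.0, R=1.0):
    """Simulate one leaky integrate-and-fire neuron (Eq. 2) on a current
    trace J (one value per time step); return the spike times (Eq. 3)."""
    V, refractory, spikes = 0.0, 0.0, []
    for step, I in enumerate(J):
        if refractory > 0.0:              # held at rest after each spike
            refractory -= dt
            continue
        V += dt * (-V + R * I) / tau_m    # Euler step of Eq. 2
        if V >= V_th:                     # threshold crossing: emit a spike
            spikes.append(step * dt)
            V = 0.0
            refractory = tau_ref
    return spikes

# A constant suprathreshold current yields regular firing.
print(len(lif_spikes(np.full(1000, 1.5))))  # roughly 40 spikes in 1 s
```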
This formula allows us to determine the spiking pattern across a group of neurons that corresponds to a particular input x. We can also perform the opposite operation: using the pattern of spikes to recover the original value of x. We write this as \hat{x} to indicate that it is an estimate, and this is used above to determine the output vectors from our simulations (the right-most image in Fig. 7 and the central image in Fig. 8). The first step in calculating \hat{x} is to determine the linearly optimal decoding vectors \phi for each neuron as per Eq. 4, where a_i is the firing rate for neuron i. This method has been shown to uniquely combine accuracy and neurobiological plausibility (e.g., Salinas & Abbott, 1994).

\phi = \Gamma^{-1} \Upsilon, \qquad \Gamma_{ij} = \int a_i a_j \, dx, \qquad \Upsilon_j = \int a_j x \, dx    (4)
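In practice the integrals of Eq. 4 are approximated by sums over sampled values of x, which turns the problem into regularized least squares. A sketch under stated assumptions (the rate curve is the standard rectified LIF response; the regularization term stands in for neural noise; names are ours):

```python
import numpy as np

def lif_rates(xs, phi_tilde, alpha, J_bias, tau_m=0.02, tau_ref=0.002):
    """Steady-state LIF firing rates a_i(x) at sample points xs (S x D)."""
    J = alpha * (xs @ phi_tilde.T) + J_bias
    rates = np.zeros_like(J)
    active = J > 1.0                        # only suprathreshold currents fire
    rates[active] = 1.0 / (tau_ref - tau_m * np.log(1.0 - 1.0 / J[active]))
    return rates                            # shape (S, n_neurons)

def decoding_vectors(xs, A, reg=0.1):
    """Eq. 4 with Gamma = A^T A and Upsilon = A^T xs, plus regularization."""
    Gamma = A.T @ A + reg * len(xs) * np.eye(A.shape[1])
    Upsilon = A.T @ xs
    return np.linalg.solve(Gamma, Upsilon)  # one decoding vector per neuron

# With A = lif_rates(xs, ...), the estimate is x_hat = A @ decoding_vectors(xs, A).
```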
Now we can determine \hat{x} by weighting the activity of each neuron by the corresponding \phi value. To improve the neural realism of the model, we can do this weighting in a manner that respects the causal influence one neuron can have on another. That is, instead of just considering the spike timing, we note that when one neuron spikes, it affects the next neuron by causing a postsynaptic current to flow into it. These currents are well studied, and different neurotransmitters have different characteristic shapes. If we denote the current caused by a single spike at time t = 0 as h(t) for a given neurotransmitter, then we can use these currents to derive our estimate of x as follows:

\hat{x}(t) = \sum_{in} \delta(t - t_{in}) * h_i(t) \, \phi_i = \sum_{in} h(t - t_{in}) \, \phi_i    (5)
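A sketch of the estimate in Eq. 5, assuming an exponential postsynaptic current h(t) = e^{-t/\tau_{syn}}/\tau_{syn} (the characteristic shape varies with the neurotransmitter; variable names are ours):

```python
import numpy as np

def decode_from_spikes(spike_trains, dec, dt=0.001, tau_syn=0.005):
    """Eq. 5: filter each neuron's spike train with h(t), then weight the
    resulting postsynaptic currents by that neuron's decoding vector.
    spike_trains: (timesteps, n_neurons) array of 0/1; dec: (n_neurons, D)."""
    t = np.arange(0.0, 10 * tau_syn, dt)
    h = np.exp(-t / tau_syn) / tau_syn      # PSC kernel for a spike at t = 0
    psc = np.apply_along_axis(
        lambda s: np.convolve(s, h)[: len(s)], 0, spike_trains)
    return psc @ dec                        # x_hat(t), one row per time step
```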
Deriving synaptic connection weights
Given the decoding vectors \phi derived above, we can derive the optimal synaptic connection weights for connecting two groups of neurons such that the value represented by the second group will be some given function of the value represented by the first group, providing the basis for defining the convolution operation given in this paper (Eliasmith, 2004). We start with a linear transformation. That is, if the first group of neurons represents x and the second group represents y, we want y = Mx, where M is an arbitrary matrix. Both x and y are vectors of arbitrary size. To achieve this, Eq. 1 dictates that the current entering the second group of neurons should be as follows, where we use the index j for the neurons of the second group.

J_j = \alpha_j \tilde{\phi}_j \cdot M x + J_j^{bias}    (6)
If we substitute \hat{x} for x using Eq. 5, we can express the current coming into the second group of neurons as a function of the current leaving the first group.

J_j = \alpha_j \tilde{\phi}_j \cdot M \sum_{in} h(t - t_{in}) \, \phi_i + J_j^{bias}    (7)

Rearranging this equation leads to an expression in which the current leaving each neuron in the first group is weighted by a fixed value and summed to produce the current in the second group of neurons. These weights are the desired synaptic connection weights.

J_j = \sum_{in} \alpha_j \tilde{\phi}_j M \phi_i \, h(t - t_{in}) + J_j^{bias}    (8)
This approach allows us to derive optimal connection weights \omega_{ij} for any linear operation, as per Eq. 9. It should be noted that this process will also work without modification for adding together inputs from multiple neural groups.

\omega_{ij} = \alpha_j \tilde{\phi}_j M \phi_i    (9)
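Eq. 9 reduces to a few matrix products. A sketch (shapes noted in comments; all variable names are ours):

```python
import numpy as np

def linear_connection_weights(alpha_j, phi_tilde_j, M, dec_i):
    """Eq. 9: omega_ij = alpha_j * phi_tilde_j M phi_i.
    phi_tilde_j: (n2, Dy) encoders of the second group; M: (Dy, Dx);
    dec_i: (n1, Dx) decoding vectors of the first group."""
    return (alpha_j[:, None] * (phi_tilde_j @ M)) @ dec_i.T  # (n2, n1)

# For y = x (the identity transform), pass M = np.eye(Dx); to add inputs
# from several source populations, derive weights from each source in turn.
```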
To determine the connection weights for nonlinear operations, we return to the derivation of the decoding vectors in Eq. 4. We modify this to derive a new set of decoding vectors that approximate an arbitrary function of x, rather than x itself. Eq. 4 can be seen as a special case of Eq. 10 where f(x) = x.

\phi^{f(x)} = \Gamma^{-1} \Upsilon^{f(x)}, \qquad \Gamma_{ij} = \int a_i a_j \, dx, \qquad \Upsilon_i^{f(x)} = \int a_i f(x) \, dx    (10)
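Computationally, Eq. 10 changes only the least-squares target from x to f(x); \Gamma is unchanged. A sketch reusing the rate matrix A from the decoding example above:

```python
import numpy as np

def function_decoding_vectors(xs, A, f, reg=0.1):
    """Eq. 10: same Gamma as Eq. 4, but Upsilon is built from f(x)."""
    Gamma = A.T @ A + reg * len(xs) * np.eye(A.shape[1])
    Upsilon = A.T @ f(xs)                   # Upsilon_i^{f(x)} = sum_x a_i f(x)
    return np.linalg.solve(Gamma, Upsilon)

# Example nonlinearity used below for v = [F(x), F(y)] of dimension 2 * D:
# f = lambda v: v[:, :D] * v[:, D:]   (pairwise products of coefficients)
```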
Given these tools, we can derive the synaptic connections needed for performing the convolution operation. The definition of circular convolution is given in Eq. 11. Importantly, we also see that it can be rewritten as a multiplication in the Fourier domain, as per the convolution theorem. We use this form to reduce the number of nonlinear operations required, thus increasing the accuracy of the neural calculation.

z = x \circledast y, \qquad z_i = \sum_{j=1}^{N} x_j \, y_{(i-j) \bmod N}, \qquad z = F^{-1}(F(x) \cdot F(y))    (11)
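The two forms in Eq. 11 are numerically identical, which is easy to confirm (this stand-alone check uses real-valued vectors, so the imaginary part of the inverse transform vanishes up to rounding):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 512
x, y = rng.standard_normal(N), rng.standard_normal(N)

# Direct form: z_i = sum_j x_j * y_{(i-j) mod N}
z_direct = np.array([sum(x[j] * y[(i - j) % N] for j in range(N))
                     for i in range(N)])

# Fourier form: elementwise product of the transforms, then invert.
z_fourier = np.fft.ifft(np.fft.fft(x) * np.fft.fft(y)).real

print(np.allclose(z_direct, z_fourier))   # True
```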
Our first step is to define a new group of neurons (indexed by k) to represent v, which will contain the Fourier transforms of the vector x represented by one group of neurons (indexed by i) and the vector y represented by another group of neurons (indexed by j). That is, we want v = [F(x), F(y)]. As the Fourier transform is a linear operation, we can use Eq. 9 to derive these weights, where F_D is defined to be the Discrete Fourier Transform matrix of the same dimensionality D as x and y.

\omega_{ik} = \alpha_k \tilde{\phi}_k M \phi_i, \qquad M = [F_D; 0]
\omega_{jk} = \alpha_k \tilde{\phi}_k M \phi_j, \qquad M = [0; F_D]    (12)
We next need to multiply the individual Fourier transform coefficients together. Since this is a nonlinear operation, it is done using Eq. 10, where the function f(v) is defined by taking the product of the corresponding elements in vector v. That is, element 1 in the result will be the product of elements 1 and 1 + D, element 2 will be the product of elements 2 and 2 + D, and so on. We combine this new set of decoding vectors \phi^{f(v)} with the matrix corresponding to the inverse Fourier transform to produce the synaptic connection weights to the final set of neurons (indexed by m) representing z = x \circledast y.

\omega_{km} = \alpha_m \tilde{\phi}_m F_D^{-1} \phi_k^{f(v)}    (13)

These synaptic connection weights are used in all the simulations in this paper.
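Finally, at the level of the represented vectors (abstracting away the spiking populations that carry them), the circuit defined by Eqs. 12 and 13 computes a holographic-reduced-representation binding in the sense of Plate (2003). The sketch below also shows the companion property that makes such bindings useful: convolving the combination with the approximate inverse (involution) of one component roughly recovers the other, illustrating how red shirt can retain content from both red and shirt. The vectors here are random stand-ins, not the paper's trained representations.

```python
import numpy as np

def cconv(x, y):
    """Circular convolution via the Fourier form of Eq. 11."""
    return np.fft.ifft(np.fft.fft(x) * np.fft.fft(y)).real

def involution(x):
    """Approximate inverse under circular convolution (Plate, 2003)."""
    return np.concatenate([x[:1], x[:0:-1]])   # [x_0, x_{N-1}, ..., x_1]

rng = np.random.default_rng(2)
D = 512
red = rng.standard_normal(D) / np.sqrt(D)      # roughly unit-norm vectors
shirt = rng.standard_normal(D) / np.sqrt(D)

red_shirt = cconv(red, shirt)                  # the bound combination
recovered = cconv(red_shirt, involution(red))  # ~ shirt, plus noise

print(np.corrcoef(recovered, shirt)[0, 1])     # well above 0 for large D
```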
Cognitive Science 35 (2011) 34–78 Copyright © 2010 Cognitive Science Society, Inc. All rights reserved. ISSN: 0364-0213 print / 1551-6709 online DOI: 10.1111/j.1551-6709.2010.01134.x
Simplifying Reading: Applying the Simplicity Principle to Reading

Janet I. Vousden,a Michelle R. Ellefson,b Jonathan Solity,c Nick Chaterd

a Department of Psychology, Coventry University
b Psychology and Neuroscience in Education, Faculty of Education, University of Cambridge
c Educational Psychology Group, Department of Psychology, University College London
d Behavioural Science Group, Warwick Business School, University of Warwick
Received 3 March 2009; received in revised form 26 May 2010; accepted 28 May 2010
Abstract

Debates concerning the types of representations that aid reading acquisition have often been influenced by the relationship between measures of early phonological awareness (the ability to process speech sounds) and later reading ability. Here, a complementary approach is explored, analyzing how the functional utility of different representational units, such as whole words, bodies (letters representing the vowel and final consonants of a syllable), and graphemes (letters representing a phoneme), may change as the number of words that can be read gradually increases. Utility is measured by applying a Simplicity Principle to the problem of mapping from print to sound; that is, assuming that the ‘‘best’’ representational units for reading are those that allow the mapping from print to sounds to be encoded as efficiently as possible. Results indicate that when only a small number of words can be read, whole-word representations are most useful, whereas when many words can be read, graphemic representations have the highest utility.

Keywords: Psychology; Representational units of reading; Mathematical modeling
Correspondence should be sent to Janet I. Vousden, Department of Psychology, Coventry University, Priory Street, Coventry CV1 5FB, UK. E-mail: [email protected]

1. Introduction

The ability to translate a printed word into its spoken form is a fundamental skill beginning readers must master. There are a number of different ways that this may be achieved. For example, it might be possible to learn to associate the shape of a written word with its phonological (spoken) form. Another strategy might be to learn to associate smaller ‘‘chunks’’ of text, for example, on a letter-by-letter basis, with their phonological form and
to build words by decoding each chunk in turn. Thus, different representational units may be used to read. Identifying which representational units are used during reading acquisition in English has generated considerable debate and has been frequently influenced by examining the types of phonological units that pre- and beginning readers are able to manipulate (e.g., Hulme et al., 2002). However, it is also clear that the choice of representational unit will be influenced by the transparency between the print and sound of a language. Furthermore, the transparency between print and sound within a language may change over time—as the number of words that can be read increases—and the preferred type of representational unit may also vary accordingly.
The aim of the current article is to examine which representational units should be most useful for reading acquisition by focusing attention on the structure of the print to sound mapping in English. A key objective is to consider how the type of representational unit best suited to facilitate reading acquisition may change as the reading vocabulary increases. In pursuing these questions, we aim to develop a theoretically motivated account of why particular types and specific instances of representational units might be preferred over others, and to consider the implications of our findings for reading instruction. We examine these objectives using the Simplicity Principle, which operates on the basis of choosing simpler explanations of data over complex ones—and here favors representational units that allow the mapping from print to sound to be specified as simply as possible. This approach emphasizes maximizing the utility of various representational units by trading off the complexity of the solution against the ability to account for the data. A number of mechanisms may exist that can implement the solution identified, although the approach is not directly dependent upon specific underlying representations or processes at an implementational level.

1.1. Simplifying reading: Applying the Simplicity Principle to reading

The spelling-to-sound mapping of English can be distinguished from other languages by the fact that the same spelling may be pronounced in different ways; for example, the letters ‘‘ea’’ are pronounced differently depending on the context in which they occur (e.g., beach, real, head, great, etc.). This contrasts with many other European languages in which pronunciation is determined unambiguously by spelling (e.g., Serbo-Croat, Greek), and character-based languages such as Chinese, where the written language provides an even less transparent guide to pronunciation. Thus, some English spelling patterns are consistent1 (e.g., ck—/k/) and a simple, small unit (grapheme–phoneme) rule suffices, but others (e.g., ea) require a more sophisticated unit rule for accurate decoding. This inconsistency clearly increases the difficulty of learning to read compared with learning a language using a more consistent orthography (Landerl, 2000; Wimmer & Goswami, 1994). Most often, the vowel together with the following consonant(s) (i.e., a word body, such as -eap) provides a better cue to pronunciation (Treiman, Mullennix, Bijeljac-Babic, & Richmond-Welty, 1995). However, even word bodies can be inconsistent (Vousden, 2008). Alternatively, frequently encountered words can be learned ‘‘by sight,’’ without any decoding, and inconsistency is largely eliminated.
A word can therefore be decoded by reference to units of different sizes: whole words (beach), bodies (b-each), or graphemes (b-ea-ch). In general terms, the smaller
the unit, the less there is to remember, but this comes at the expense of greater inconsistency. The question faced by the cognitive system is then: What is the optimal way to represent such an orthographic to phonological mapping system? The answer to this question is important for teaching beginning readers: If the optimal representation can be found, then it should be of value when considering instructional materials.
The optimal representational units for reading may differ across languages and orthographic systems (Ziegler & Goswami, 2005). More regular orthographies, for example, can more reliably be read using small linguistic units; highly irregular orthographies appear to require encoding via larger ‘‘chunks.’’ But note, too, that even within a language and an orthography, the optimal representation may shift over time. We should expect that, just as the cognitive system is able to adapt to different orthographies, so too is it able to adapt to changes in the number or types of words encountered within the same orthography.
Early readers are exposed to subsets of the written words of the language as they learn; we define these subsets here as a reading vocabulary. Note that this is different from a receptive or productive vocabulary, which may include many words that are known but that cannot yet be read. In general, we should not expect that the best representations for capturing a small reading vocabulary will be the same as the best representations for capturing a much larger reading vocabulary. This expectation raises the interesting possibility that, as readers progress, their optimal representations should change—the increase in reading vocabulary providing a potential driving force for developmental change, independent of any exogenous developmental changes in the cognitive machinery of learning (e.g., changes in aspects of short-term verbal memory). The possibility that the reader’s optimal representation of the print-sound mapping changes as linguistic exposure gradually increases offers an interesting explanation of representation change driving reading development, and one that we shall explore further below. For a small reading vocabulary, for example, it may be that simply memorizing the pronunciation of each word requires fewer cognitive resources than learning many more seemingly arbitrary sublexical decoding strategies to read a small number of words.
The structure of this article is as follows. We begin in Representational Units and Learning to Read by reviewing the evidence that different representational units are involved during reading acquisition. Next, in Spelling-to-Sound Consistency and Learning to Read, we review cross-linguistic evidence that shows how inconsistency in the print to sound mappings complicates the choice of representational units for English. In The Simplicity Principle, we present a nontechnical description of the Simplicity Principle and describe how it can be applied to reading to trade off maximizing reading outcome against minimizing complexity. The findings from the simplicity-based analyses are presented next. Analysis 1 explores which general type of representational unit should best facilitate reading acquisition, while Analysis 2 examines which specific representational units should be most useful.
Analysis 3 considers the effect of an increasing vocabulary on the choice of representational unit, and Analysis 4 explores the extent to which choosing among inconsistent units can be aided by considering context-sensitive units. Thus, this is not a direct study of human reading, but an evaluation of some hypotheses concerning the representations of orthographic to phonological translation that might be learned during reading acquisition.
Finally, in the Discussion, we discuss the results of the analyses and the implications for instruction.

1.2. Representational units and learning to read

The status of different unit sizes in terms of reading outcome has received much attention in recent decades. There is a general consensus that phonological awareness (the ability to deal explicitly with sound units at a subsyllabic level), together with knowledge of how spelling patterns represent speech sounds, is central to reading outcomes (Goswami & Bryant, 1990; Gough, Ehri, & Treiman, 1992; Wagner & Torgesen, 1987; Wyse & Goswami, 2008). Tasks that measure either phoneme awareness, for example, how well children can say words after deleting the last sound (CAT -> CA), or rime (VC) awareness, for example, whether children can judge whether two words rhyme (do HAT and CAT rhyme?), are strongly correlated with reading ability. However, there has been less agreement on whether early measures of phoneme awareness (Hulme et al., 2002; Hulme, Muter, & Snowling, 1998; Muter, Hulme, Snowling, & Taylor, 1997; Nation & Hulme, 1997) or rime awareness (Bradley & Bryant, 1978; Bryant, 1998; Bryant, Maclean, Bradley, & Crossland, 1990; Maclean, Bryant, & Bradley, 1987) best predict later reading ability. There is more agreement, however, on the development of phonological awareness; children become aware of large phonological units (words, syllables, rimes) before small units such as phonemes (Carroll, Snowling, Hulme, & Stevenson, 2003; Goswami & Bryant, 1990). Furthermore, theories of reading development have assumed that development occurs in stages (Frith, 1985; Marsh, Friedman, Welch, & Desberg, 1981), from a large unit, logographic stage, through a small unit alphabetic (or grapheme–phoneme) stage, to a final stage where more advanced context-dependent associations are available.
However, recent evidence suggests that prereaders are not restricted to a purely logographic strategy and are able to apply sublexical knowledge very early in development. For example, prereaders with some knowledge of letter names learned phonetically motivated spelling-sound pairs such as AP—ape more easily than arbitrary (large-unit or logographic) pairs such as ID—oat (Bowman & Treiman, 2008), and letters with phonologically similar letter names and sounds, for example, /bi/ and /b/ for the letter B, are learned before letters without that phonological similarity, for example, /wai/ and /j/ for the letter Y (Ellefson, Treiman, & Kessler, 2009). Furthermore, Treiman and colleagues have shown that context-dependent associations are also available early in development. For example, in a task where young children are taught pronunciations of novel graphemes in pseudowords, they are able to make use of appropriate context-sensitive associations (onset–vowel, or head, and vowel–coda, or rime) in a subsequent transfer task (Bernstein & Treiman, 2004; Treiman, Kessler, Zevin, Bick, & Davis, 2006). These results suggest that children are able to take advantage of the statistical structure of text, as and when it arises, regardless of the type of unit per se.
However, although there is a wealth of evidence suggesting that language acquisition is facilitated by exploiting multiple cues in the input (e.g., Christiansen, Allen, & Seidenberg, 1998; Monaghan, Chater, & Christiansen, 2005), the exploitation of such findings in terms of instruction has not been
apparent. This observation is particularly salient given that environmental factors (e.g., exposure to books, growing up in a stimulating environment) are likely to contribute to literacy problems (Bishop, 2001). Our assumption, consistent with rational analysis theory (Anderson, 1990; Anderson & Schooler, 1991; Oaksford & Chater, 1998), is that these statistical properties (at multiple levels) should play an important role in learning to read because they guide our adaptation to the vocabulary to be acquired. We note that both the specific lexical knowledge, in the form of exactly which words children experience when learning to read, and the number of words they have learned (i.e., vocabulary size) will act together to form the lexical vocabulary to which the cognitive system adapts. This perspective may lead to shifts of optimal strategy as both factors change during learning. In other words, the optimal strategy for a small specific vocabulary may not be the same as for a much larger general vocabulary.

1.3. Spelling-to-sound consistency and learning to read

The problem of inconsistency for English spelling is most obvious at the grapheme level. For example, a recent listing of grapheme–phoneme probabilities lists some vowel graphemes as having nine pronunciations (Gontijo, Gontijo, & Shillcock, 2003). Estimates of the orthographic depth of English (e.g., the average number of pronunciations each grapheme has) range from 2.1 (Berndt, Reggia, & Mitchum, 1987) to 2.4 (Gontijo et al., 2003) for polysyllabic text, and 1.7 (Vousden, 2008) for monosyllabic text. This can be contrasted with languages such as Serbo-Croat, which has an orthographic depth of 1. Thus, even monosyllabic text, the items to which beginning readers of English are more likely to be exposed, is still more inconsistent than other languages. This inconsistency is problematic for beginning readers for obvious reasons: How does a beginning reader choose between alternative pronunciations of a given grapheme?
It is the inconsistency of English that sets it apart from other orthographies—when compared with other major European languages, it is judged to be the least consistent (Borgwaldt, Hellwig, & De Groot, 2005; Seymour, Aro, & Erskine, 2003). This inconsistency is thought to be the cause of difficulty in learning to read, when compared with more consistent orthographies (e.g., Henderson, 1982; Landerl, 2000; Wimmer & Goswami, 1994; Ziegler & Goswami, 2005).
Comparisons of reading performance between learners of consistent and inconsistent languages have been well documented (for a review, see Ziegler & Goswami, 2005). Typically, the acquisition of spelling-to-sound knowledge is measured using nonword reading performance, because decoding nonwords must be done by the application of spelling-to-sound rules. Early comparisons of monolingual studies reveal that while children who learn consistent orthographies perform at between 80% and 89% on nonword reading tasks, where percent correct is determined according to whether the regular or most dominant pronunciation of the graphemes in the nonword is produced (Cossu, Gugliotta, & Marshall, 1995; Porpodas, Pantelis, & Hantziou, 1990; Sprenger-Charolles, Siegel, & Bonnet, 1998), English children lag behind at 45% (Frith, Wimmer, & Landerl, 1998). Cross-linguistic studies, where tighter control over items and subjects can be exerted, show similar results, with English children performing at between 12% and 51% correct, compared with children who learn more
J. I. Vousden et al. ⁄ Cognitive Science 35 (2011)
39
consistent non-English orthographies performing typically above 90% correct (Ellis & Hooper, 2001; Frith et al., 1998; Goswami, Gombert, & de Barrera, 1998; Seymour et al., 2003). The Psycholinguistic Grain Size (PGS) account (Ziegler & Goswami, 2005) offers an explanation of these cross-linguistic differences. The core assumption behind PGS is that the word reading process adapts to the consistency of the orthography with which it is faced: Consistent orthographies can rely on grapheme–phoneme correspondences (GPCs), whereas more inconsistent orthographies require the formation of larger sublexical units to resolve those inconsistencies, in addition to GPCs (see Ziegler & Goswami, 2005 for a review of the evidence supporting these claims). In addition, learning an inconsistent language is constrained by the granularity problem—there are many more mappings to learn for larger psycholinguistic units (e.g., rimes). Therefore, according to PGS, it seems to make sense to highlight the consistencies within the language across grain sizes and optimize the orthographic-phonological learning task accordingly to maximize consistency, minimize the effects of granularity, and hence maximize overall performance. While there is plenty of evidence that cross-linguistic differences in consistency account for differences in early reading performance (e.g., Seymour et al., 2003), there is also some data that suggest even a basic attempt to maximize consistency within an orthography improves performance for inconsistent orthographies (Shapiro & Solity, 2008). Evidence that highlighting the most consistent aspects of the orthography improves performance comes from recent modeling of the cross-linguistic differential in reading performance (Hutzler, Ziegler, Perry, Wimmer, & Zorzi, 2004), and behavioral data that compares teaching methods for children learning to read English (Landerl, 2000; Shapiro & Solity, 2008; Solity & Shapiro, 2008). In a recent attempt to model the cross-linguistic differences in reading performance between German and English children, Hutzler et al. (2004) found that a two-layer associative network (Zorzi, Houghton, & Butterworth, 1998) was only able to simulate the cross-linguistic reading data once the model had been pretrained with a set of highly consistent grapheme–phoneme correspondences. Comparison of the model data before and after pretraining, for both English and German words, showed a marked improvement in performance, suggesting that learning even an inconsistent orthography such as English can be improved by highlighting the most consistent correspondences at the grapheme level. Behavioral data appear to support this theoretical suggestion. In two studies designed to tease apart the effects of instruction and orthographic consistency on reading performance, Landerl (2000) compared the reading performance of German and English children who had received different instruction. Phonics instruction involves conveying an understanding of how spelling patterns (graphemes) relate to speech sounds (phonemes), and how this knowledge can be applied to ‘‘sound out’’ words. In Landerl’s study, English children either received a structured phonics approach (grapheme–phoneme correspondences are introduced in a predetermined and systematic manner, e.g., Lloyd, 1992), or a mix of phonics and whole-word instruction. German children received phonics instruction, the dominant approach in Germany. 
English children receiving only phonics instruction outperformed English children who received the mixed approach, and they were almost as accurate as the German children when reading nonwords. More recent work backs up these findings. Children from the United Kingdom who were taught phonic skills using a restricted set of frequent, highly consistent grapheme–phoneme correspondences not only improved their reading performance significantly faster than children who were taught a wider range of grapheme–phoneme correspondences (including less consistent ones), but significantly fewer of them were classified as having reading difficulties (Shapiro & Solity, 2008).

In English, most of the orthographic inconsistency stems from the multiple pronunciations of vowel graphemes. Extensive analyses have shown that the problems of grapheme inconsistency can be alleviated by considering a larger unit: the body (Kessler & Treiman, 2001; Peereman & Content, 1998; Treiman et al., 1995)—that is, the consonants following the orthographic vowel generally predict its pronunciation better than the preceding consonants do (Treiman & Zukowski, 1988). Generally, body units are more consistent than graphemes. In an analysis of English monosyllabic text, Vousden (2008) found that 39% of graphemes are inconsistent (i.e., they can be pronounced in more than one way—for example, ea can be pronounced as in head, beach, or great—whereas the remaining graphemes are only ever pronounced one way: t is always pronounced as in tap), but only 16% of onset units (e.g., wh- can be pronounced as in who or white) and 18% of body units (e.g., -arm can be pronounced as in warm or farm) are inconsistent. Because of this greater consistency, onset and body units together predict pronunciation better than graphemes do (Vousden, 2008). However, the apparent benefit afforded by bodies is compromised by the granularity problem (Ziegler & Goswami, 2005): The number of orthographic units to learn increases with the size of the orthographic unit, so there are many more body units than graphemes. Vousden (2008) has shown that if the most frequent mappings at each grain size are chosen, then graphemes are better at predicting pronunciation when a smaller number of mappings are known; however, if a large number of mappings are known, then onsets and bodies are better at predicting pronunciation.

In sum, it would appear that reducing inconsistency might improve reading performance for beginning readers. As highlighted above, one way to accomplish this for English may be to concentrate on larger ''chunks'' (Ziegler & Goswami, 2005). However, learning larger ''chunks'' comes at the expense of an increased load on the cognitive system. The question this raises is how an optimal reading system trades off the load of representing more mappings against the increasing predictability of pronunciation, and how the most consistent units across multiple levels can be combined within the same system. Our aim is to explore, using the Simplicity Principle, to what extent different sized representational units are useful in English, an inconsistent language.
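Both of the descriptive statistics used above, orthographic depth and the percentage of inconsistent units, are simple counts over a table of unit-to-pronunciation mappings. The following toy sketch illustrates the calculations; the mapping list and phoneme labels are ours for illustration, not entries from the databases analyzed in this paper.

```python
from collections import defaultdict

def pronunciation_table(pairs):
    """Map each orthographic unit to its set of attested pronunciations."""
    table = defaultdict(set)
    for unit, phonology in pairs:
        table[unit].add(phonology)
    return table

def orthographic_depth(table):
    """Average number of pronunciations per orthographic unit."""
    return sum(len(v) for v in table.values()) / len(table)

def percent_inconsistent(table):
    """Percentage of units that have more than one pronunciation."""
    return 100 * sum(len(v) > 1 for v in table.values()) / len(table)

# Illustrative grapheme-to-phoneme pairs (phonemes in a keyboard notation).
graphemes = pronunciation_table([
    ("t", "t"), ("ea", "e"), ("ea", "i:"), ("ea", "eI"),
    ("wh", "w"), ("wh", "h"), ("sh", "S"),
])
print(f"depth = {orthographic_depth(graphemes):.2f}, "
      f"inconsistent = {percent_inconsistent(graphemes):.0f}%")
```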
1.4. The Simplicity Principle

At a fairly general level, learning to read can be interpreted as a search for patterns in the orthographic-to-phonological translation of the language, which is broadly consistent with both rational analysis and PGS. According to Ehri (1992, 1998), these patterns act as the basis of the links between the orthographic and phonological representations of words, which in turn form the basis of later automatic sight reading—where skilled readers can access pronunciations automatically, without resorting to the slower decoding processes characteristic of beginning readers. As such, the orthography to be searched (the data) may be consistent with any number of possible patterns, and the problem is how to choose between alternative patterns. The Simplicity Principle is based on the assumption, akin to Ockham's razor, that simpler explanations of data should be preferred to complex explanations (Chater, 1997, 1999; Chater & Vitányi, 2003). The Simplicity Principle can be expressed mathematically using the theory of Kolmogorov complexity (Kolmogorov, 1965), which defines the complexity of an individual object as the length of the shortest program in a universal programming language (any conventional programming language) that re-creates the object (Li & Vitányi, 1997), and which therefore provides an objective measure of complexity. Importantly, the length of the program is independent of the particular universal programming language chosen, up to an additive constant. Of course, any candidate explanation must be able to re-create the data; according to simplicity, the preferred explanation is the one that re-creates the data and can itself be described most succinctly. Thus, the cognitive system should strive to compress the data maximally, such that the data can be reconstructed with maximum accuracy from the compressed form. In this sense, the Simplicity Principle provides an objective basis for choosing among many compatible patterns. The Simplicity Principle appears to be consistent with empirical data across a range of cognitive domains, including perception (Chater, 1996, 2005), categorization (Feldman, 2000; Pothos & Chater, 2002), similarity judgments (Hahn, Chater, & Richardson, 2003), and various phenomena within the linguistic domain (Brent & Cartwright, 1996; Brighton & Kirby, 2001; Chater, 2004; Dowman, 2000; Ellison, 1992; Goldsmith, 2002, 2007; Goldsmith & Riggle, 2010; Onnis, Roberts, & Chater, 2002; Perfors, Tenenbaum, & Regier, 2006; Roberts, Onnis, & Chater, 2005).

A practical methodology in statistics and information theory that embodies the principles of simplicity and Kolmogorov complexity is the Minimum Description Length (MDL) principle (Rissanen, 1989; see also Wallace & Freeman, 1987). MDL is based on the notion that the more regularity there is in the data, the more the data can be compressed, and hence the more succinctly they can be described. It requires only that the length of code needed to specify an object (the description length) be computed, not the actual code itself. The description length can be calculated using information theory (Shannon, 1948) whenever a probability can be associated with an event or pattern: More probable events are associated with shorter code lengths. Data that contain no regularities or patterns cannot be compressed and are best described by (reproducing) the data itself. However, where patterns exist and some events are highly probable, the description length of an object will be reduced. MDL therefore provides a measure of simplicity by specifying the length of the binary code necessary to describe an object. This binary code must represent two components. First, it must provide a measure of the hypothesis (a set of spelling-to-sound mappings) that describes the route from print to sound. Second, it must provide a measure of how well the data can be re-created, given the hypothesis under consideration.
Put more formally:

L = L(H) + L(D|H)

where L is the total description length, L(H) is the description length of the current hypothesis, and L(D|H) is the description length of the data under the current hypothesis.
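To make the two terms concrete: Under a Shannon code, an event of probability p can be encoded in -log2(p) bits, so both terms can be computed once probabilities are assigned. The following minimal sketch uses made-up probabilities; it illustrates the two-part scheme, not the paper's actual calculation (which is detailed in the Appendix).

```python
import math

def code_length(p: float) -> float:
    """Shannon code length in bits for an event of probability p."""
    return -math.log2(p)

# Made-up probabilities for the mappings that make up a hypothesis H ...
mapping_probs = [0.5, 0.25, 0.25]
# ... and for each data item (word pronunciation) under that hypothesis.
data_probs = [0.9, 0.8, 0.6, 0.95]

L_H = sum(code_length(p) for p in mapping_probs)        # L(H)
L_D_given_H = sum(code_length(p) for p in data_probs)   # L(D|H)
print(f"L = L(H) + L(D|H) = {L_H:.2f} + {L_D_given_H:.2f} "
      f"= {L_H + L_D_given_H:.2f} bits")
```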
In the current context, it is useful to think of (a) a set of spelling-to-sound mappings as the hypothesis (H), and (b) how well those spelling-to-sound mappings describe the target pronunciations as the data given the hypothesis (D, given H). From a simplicity perspective, the optimal reading system should trade off the number of spelling-to-sound mappings represented (the complexity of the hypothesis, L(H)) against the ability to pronounce existing and novel words as concisely as possible (the goodness-of-fit to the data, L(D|H)). The description length of each hypothesis will vary, as will the description length of the data given each different hypothesis. The total description length of a hypothesis (L) is therefore a measure of its simplicity. Full algorithmic details of the description-length calculations, for both the hypotheses and the data given each hypothesis, are provided in the Appendix; a brief summary of the mechanistic calculation of word pronunciation is given below.

The underlying mechanism that produces pronunciations from mappings in the simplicity analyses that follow is a probabilistic rule-based mechanism that could be instantiated in a number of ways—for example, implicitly as a two-layer associative network, consistent with models that are able to represent the statistical regularities between orthography and phonology (Perry, Ziegler, & Zorzi, 2007; Zorzi et al., 1998). However, because we are interested in comparing different sizes of units and different mappings within unit sizes (in connectionist models, the effects of different sized units arise within a single architecture and so are difficult to isolate), it has been implemented here as a simple, explicit, probability-based production system. Multiple rules for orthographic units are represented and applied probabilistically; the production system is therefore more similar to two-layer associative networks (Perry et al., 2007; Zorzi et al., 1998) than to the symbolic sublexical rule system of the DRC (Coltheart, Curtis, Atkins, & Haller, 1993). A simple, probabilistic rule-based mechanism for generating pronunciations is appealing in the context of this study because it is straightforward to see how such a mechanism could apply to instruction (i.e., applying spelling–sound knowledge through blending to form words). The rules are applied probabilistically, based on the directionless type frequency with which individual orthographic and phonological representations are associated. Such frequencies (termed sonograph frequencies; see the Appendix for further discussion of frequency, and Data S1 and Data S2 for the actual grapheme–phoneme frequencies used throughout the following analyses) play a significant role in predicting young children's reading accuracy (Spencer, 2010). The probabilistic rule system implemented here is able to produce the target (correct) pronunciation for all words, although the target pronunciation may sometimes have a lower probability than an alternative pronunciation. The relative probability with which the target pronunciation is output by the rule system provides the basis for calculating the description length of the data under the current hypothesis, as described above. Thus, the description length of the data will be shorter when the target pronunciation is the most probable output than when the target pronunciation has a lower probability than an alternative (incorrect) pronunciation.
When the target pronunciation is not the most probable output, additional description is required to specify which output is the target pronunciation. Thus, all words can be pronounced correctly, but the description length of the target pronunciations will vary according to how many words can be assigned the shortest code length (because their target pronunciations are most probable) and how many require additional description, over and above that specified by the hypothesis. The less likely the current hypothesis finds the target pronunciation to be, the longer the code needed to describe it under that hypothesis. We do not assume that skilled reading is based on the same explicit mechanism, only that a simple mechanism assuming a direct association between orthography and phonology underlies the links between the orthographic and phonological representations of words on which later automatization builds. We return to the issue of mechanisms and models in the Discussion section.

In the next section, we describe a series of analyses and discuss how well they match relevant human data. The first two analyses show how the utility of different types of representational units varies considerably within large reading vocabularies, but is qualitatively similar across children's and adults' reading vocabularies. To anticipate the results, the data are most concisely described by grapheme-sized units. The analyses also show how, within a type of representational unit, some specific units (graphemes) are more useful than others. The third analysis shows how the utility of different types of representational units varies with the size of the reading vocabulary. This reveals a shift in preference from large units to small units as the vocabulary increases. The last analysis reveals a pattern in which large representational units have high utility for large vocabularies, when carefully selected within the context of preferred small units.
2. Analysis 1: Which type of representational unit should best facilitate reading acquisition?

It has previously been shown that rime (VC) units are more consistent and provide a more accurate guide to pronunciation than both head (CV) and individual grapheme units (Treiman et al., 1995), although head units can in some circumstances influence pronunciation as well as rimes (Bernstein & Treiman, 2004; Treiman, Kessler, & Bick, 2003; Treiman et al., 2006). Overall, pronunciation is better predicted by the application of onset- and rime-sized correspondences than by the application of grapheme–phoneme correspondences (Vousden, 2008), but many more onset–rime units are required to achieve accurate pronunciation (Solity & Vousden, 2009; Vousden, 2008), thereby increasing the complexity of the solution to be adopted by the cognitive system—an issue that has received little consideration. For example, Solity and Vousden (2009) showed that decoding the amount of monosyllabic adult text that can be accurately decoded by applying 64 grapheme–phoneme correspondences would require the application of 63 onset and 471 rime correspondences. Following findings of a developmental progression from large- to small-unit phonological awareness, and studies suggesting that training children on onsets and rimes could benefit reading acquisition (Goswami, 1986; Wise, Olson, & Treiman, 1990), a large literature emerged comparing methods of instruction based on different unit sizes, from whole words (e.g., I. S. Brown & Felton, 1990) and rimes (e.g., Greaney, Tunmer, & Chapman, 1997) to graphemes (e.g., Stuart, 1999).
Our aim in this analysis was to ask which type of representational unit should best facilitate reading acquisition. Four different types of spelling-to-sound mapping (whole-word, head–coda, onset–rime, and grapheme) were considered separately as hypotheses about the data (the spelling-to-sound translation of words from each database), and the simplicity of each hypothesis was calculated. This facilitated an exploration of the impact of different cognitive variables (embodied by each hypothesis) on potential reading. The separate analyses for each unit size allow comparison with the existing reading instruction literature, and they also serve as a baseline for later analyses in which hypotheses based on different unit sizes are optimized and combined.

2.1. Method

2.1.1. Databases
To examine the text that an adult reader is exposed to, we used the CELEX database (Baayen, Piepenbrock, & Gulikers, 1995), which uses frequency counts from the COBUILD/Birmingham corpus (Sinclair, 1987). This 17.9 million-word-token corpus consists mainly of written text from 284 mostly British English sources. For children's text, a word frequency list based on children's early reading vocabulary was used (Stuart, Dixon, Masterson, & Gray, 2003). This database was based on the transcription of 685 children's books used by 5- to 7-year-olds (i.e., in the first 3 years of formal instruction) in English schools. Both databases were restricted to monosyllabic words and were reduced by removing proper names, abbreviations, interjections, and nonwords. This procedure resulted in a total of 3,066 word types for the children's database and 7,286 word types for the adults' database.

2.1.2. Procedure
The simplicity of the whole-word, head–coda, onset–body, and grapheme hypotheses was calculated by following the calculations described in detail in the Appendix, resulting in a total description length measure (L) for each hypothesis. This measure was the sum of two lengths: the description length of the hypothesis, L(H), which measured the complexity of the hypothesis, and the description length of the data, L(D|H), which measured the goodness-of-fit to the data. Each hypothesis had a separate measure of simplicity (L = L(H) + L(D|H)) for each database. In order to explore how well mappings derived from children's text generalize to describe the spelling-to-sound translation of adult text, another measure of simplicity was calculated. This measure consisted of the description length of hypotheses generated from the children's database, L(HChild), plus the description length of the data from the adults' database given those hypotheses, L(DAdult|HChild). This resulted in a total description length for each hypothesis that measured generalization (L = L(HChild) + L(DAdult|HChild)).

2.2. Results and discussion

The number of mappings required to describe the spelling-to-sound translation for each database is listed in Table 1.

Table 1
Total number of mappings for each hypothesis, for both databases

Mapping size          Stuart et al. (2003)    CELEX
Whole words           3,066                   7,286
Heads and codas       1,285                   1,944
Onsets and rimes      1,141                   2,070
Graphemes             237                     311

First, hypotheses that incorporate large segment sizes—the head–coda and onset–body segmentations—require many more mappings to describe the same amount of data. Both the head–coda and onset–body hypotheses require between six and seven times more mappings than the grapheme-sized hypothesis. Second, the sets of large-segment mappings required to describe children's reading vocabulary appear less adequate for describing adults' reading vocabulary than the grapheme-sized mappings. This can be seen in the greater increase in the number of mappings required to describe the CELEX versus the Stuart et al. (2003) database for head–coda and onset–body mappings, compared to the more modest increase for grapheme-based mappings. Thus, overall, head–coda and onset–body mappings appear more complex and less generalizable than grapheme-sized mappings.

A comparison of the different sized mappings in terms of total description length (L) can be seen in Fig. 1A and B for the children's and adults' databases, respectively. The pattern is the same for both databases: Grapheme-sized mappings have the shortest total description length, and head–coda and onset–body mappings have total description lengths that are shorter than the data itself (as shown by the total description length for whole words) but longer than that of grapheme-based mappings. The description length of the hypotheses (L(H)) is a function of both the complexity of the mappings (i.e., larger segment mappings have longer description lengths) and the total number of mappings that make up the hypothesis. Thus, because head–coda and onset–body mappings are more complex and more numerous (Table 1) than grapheme-sized mappings, their description lengths are longer. However, inspection of Fig. 1A and B reveals a different picture for the description lengths of the data given the hypothesis (L(D|H)). Again, the pattern is similar for both databases: Pronunciation is more concisely described by onset–body mappings than by either head–coda or grapheme-based mappings. This reflects the greater certainty with which the target pronunciation occurs when the mapping size is based on onset–body units, as shown in Fig. 1C and D for children's and adults' text, respectively. Fig. 1C and D shows the proportion of target pronunciations that are output as most probable (i.e., accurate pronunciations) under each hypothesis; this is a measure of how accurately a hypothesis can account for the data and is the basis of the description length of those data. The measure illustrates the extent to which a hypothesis is able to minimize its description of the data—the shortest description lengths of data (L(D|H)) are obtained for hypotheses under which a large proportion of the data is accurately reproduced—and is not intended as a comparison with human performance data.
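The within-database and generalization measures of Section 2.1.2 differ only in which database supplies the hypothesis and which supplies the data. The schematic sketch below makes that structure explicit; the two length functions are crude stand-ins for the Appendix calculations, and the mapping and word sets are invented.

```python
def L_H(hypothesis):
    # Stand-in for the Appendix calculation: charge a flat cost per mapping.
    return 8.0 * len(hypothesis)

def L_D_given_H(data, hypothesis):
    # Stand-in: words covered by the hypothesis are cheap to re-create,
    # uncovered words incur a large correction cost.
    return sum(2.0 if unit in hypothesis else 10.0 for unit in data)

def total_length(hypothesis, data):
    """Two-part MDL score: L = L(H) + L(D|H)."""
    return L_H(hypothesis) + L_D_given_H(data, hypothesis)

H_child = {"t", "a", "ea"}                 # mappings derived from children's text
D_child = ["t", "a", "ea", "a"]            # children's data (schematic)
D_adult = ["t", "a", "ea", "ou", "igh"]    # adults' data (schematic)

print(total_length(H_child, D_child))      # within-database simplicity
print(total_length(H_child, D_adult))      # generalization: L(HChild) + L(DAdult|HChild)
```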
Fig. 1. Description lengths (A and B) and performance on target pronunciations (C and D) for each hypothesis, for the children's (A and C) and adults' (B and D) databases.
Fig. 2A shows the description lengths for the adult database given the head–coda, onset–body, and grapheme hypotheses derived from the children's database. The description length of each hypothesis (L(HChild)) is the same as that plotted in Fig. 1A; however, the description length of the adult database (L(DAdult|HChild)) has increased—most notably for the head–coda and onset–body hypotheses. Fig. 2B shows the size of this increase: 2.3 and 5.2 times more code, respectively, was needed to describe the adult database using the head–coda and onset–body hypotheses based on the children's database than when using hypotheses based on the adult database itself (L(DAdult|HAdult)). In contrast, the increase in description length for the adult database was a modest 1.18 times for the grapheme hypothesis. This differential increase occurs because the graphemes derived from the children's database generalize better to the adult database than do the head–coda and onset–body mappings. Fig. 2C compares the proportion of words in the adult database for which the target pronunciation was the most probable, for hypotheses derived from the children's and the adults' databases. The results reflect those in Fig. 2B. For the onset–body hypothesis, fewer target pronunciations were output as the most probable when the hypothesis was derived from the children's database than when it was derived from the adults' database. This was not the case for the grapheme hypothesis: Similar numbers of target pronunciations were output as the most probable regardless of whether the hypothesis was derived from the children's or the adults' database. This suggests that grapheme-sized mappings derived from children's text generalize well to adults' text.

Fig. 2. Generalization measures. (A) Description lengths based on data from the adults' database, and hypotheses derived from the children's database. (B) Description length of data from the adults' database, comparing hypotheses from the children's and adults' database. (C) Performance on target pronunciations from the adults' database, comparing hypotheses from the children's and adults' database.

These results are consistent with earlier findings showing rime units to be better predictors of pronunciation than both head units and graphemes (Treiman et al., 1995; Vousden, 2008): Pronunciation is described most concisely by the onset–body hypothesis for both databases (Fig. 1A and B). However, the brevity with which onset–body mappings describe the data does not outweigh the length of code needed to describe the mappings themselves: The total description length is greater than that required by grapheme-sized mappings. The finding that sublexical-sized mappings offer a simpler account of the data than whole-word-sized mappings can be compared to findings from empirical studies in which young children are trained to read English words using whole-word, onset–body, head–coda, or grapheme-based reading strategies.
In a short, 6-week training study, Haskell, Foorman, and Swank (1992) compared first graders who were trained to read using whole words, onsets and bodies, or graphemes. Those in the whole-word group performed significantly worse than those in the other two groups, although there were no differences between the sublexical groups. In another short study, Levy and Lysynchuk (1997) trained beginning readers and poor grade-two readers to read words using one of the three strategies above, or head–coda sized mappings. Word reading, retention, and generalization were poorest in the whole-word group, but again, there were no differences between the sublexical groups. Similarly, Levy, Bourassa, and Horn (1999) trained poor second-grade readers to read words using whole-word, onset–body, or grapheme strategies. Overall, they found that the sublexical strategies were superior to whole-word methods, and that retention was worst for the onset–body strategy. Tests of generalization revealed that performance was best for graphemes and worst for whole words. These results give a representative picture: Sublexical reading strategies provide a more favorable approach to reading than whole-word strategies, as borne out by more recent meta-analyses and reviews (e.g., Rayner, Foorman, Perfetti, Pesetsky, & Seidenberg, 2001). Ehri, Nunes, Stahl, and Willows (2001) compared the effect of systematic phonics instruction with unsystematic or no phonics instruction as part of the U.S. National Reading Panel meta-analysis on reading (National Institute of Child Health and Human Development, 2000). Collating the results from 60 treatment–control comparisons, Ehri et al. (2001) compared small-unit (graphemes, 39 comparisons), large-unit (including onset–body units, 11 comparisons), and miscellaneous phonics programs (10 comparisons) with control conditions that did not include any systematic phonics (e.g., whole-language and whole-word approaches). All types of phonics instruction produced mean effect sizes that were significantly greater than zero, meaning that they were more effective than reading programs that did not contain systematic phonics instruction. The largest effect size was observed for the small-unit programs (d = .45), followed by the large-unit programs (d = .34), and the smallest for the miscellaneous programs (d = .27), although these effect sizes did not differ significantly from one another. These findings are consistent with the simplicity analysis presented above, in which sublexical approaches to decoding are preferred over whole-word representations, and grapheme-sized units are preferred over onset–body sized units. A similar finding was observed in a more recent meta-analysis by Torgerson, Brooks, and Hall (2006), in which a small (d = .27) but significant effect size was found for systematic phonics over no systematic phonics instruction (e.g., whole-language and whole-word approaches). Evidence of small-unit effects can also be observed in skilled adult reading: Pelli and Tillman (2007) showed that letter-by-letter decoding accounted for 62% of adult reading rate, with whole-word recognition (16%) and sentence context (22%) accounting for the rest. Other studies have found generalization from body units to be problematic (Muter, Snowling, & Taylor, 1994; Savage, 1997; Savage & Stuart, 1998), or inferior to generalization from grapheme units (Bruck & Treiman, 1992; Levy et al., 1999). This is the pattern of results predicted by the Simplicity Principle (Fig. 2).
Analysis 1 focused on how well various hypotheses accounted for the spelling-to-sound translations of large reading vocabularies. The results suggest that one of the main determinants of simplicity is the number of mappings required by each hypothesis: Hypotheses that require a large number of mappings to describe the data also have longer description lengths (Table 1), even though they are able to describe the data more concisely. This raises the question of whether each hypothesis can be simplified by reducing the number of mappings it employs. Closer inspection of the mappings employed by each hypothesis shows that many mappings occurred only once (19.2% of grapheme–phoneme mappings and 45.5% of onset–rime mappings) and therefore probably contribute little to the overall statistical structure of the hypotheses. In Analysis 2, therefore, our aim was to explore whether some mappings were more useful than others, and whether the total description length of the onset–body and grapheme hypotheses could be reduced by omitting some of the least useful mappings each employs.
3. Analysis 2: Are all mappings equally useful?

The goal of Analysis 2 was to compare the simplest forms of the onset–body and grapheme hypotheses, and to compare the utility of individual mappings. Here, each hypothesis must be expressed in its simplest form; that is, in the form that requires the shortest total description length. Given that both hypotheses contain a sizeable proportion of mappings occurring at very low frequency, the description lengths of the hypotheses themselves could be shortened by omitting mappings that have little impact on the description of the data given the hypothesis. In order to find the simplest forms, a procedure must be followed to determine which mappings, when removed from a hypothesis, would result in an overall shortening of the total description length, and to identify the relative utility of the remaining mappings.

3.1. Method

The onset–body and grapheme hypotheses generated from the children's database in Analysis 1 were used as initial hypotheses. The number of mappings for each hypothesis was therefore the same as given in Table 1. For each hypothesis, an individual mapping was removed from the complete set, and the total description length (based on all mappings bar one) was recalculated. The mapping was then replaced, and the procedure was repeated for each mapping in the complete set. This resulted in a set of total description lengths reflecting the removal of each mapping. Thus, the impact of removing any particular mapping could be ascertained by comparing the total description length for the hypothesis without that mapping to the total description length of the hypothesis based on the complete set of mappings. If the total description length was shorter after the removal of a mapping, this indicated that the hypothesis might be simplified by omitting that mapping; if it was longer, that mapping formed part of a simpler hypothesis and should be retained. Starting with the mapping whose removal produced the shortest total description length, each mapping was removed (without replacement) from the complete set of mappings one at a time, with the total description length recalculated after each removal. If the resulting total description length was shorter, the procedure was repeated with the next mapping in the list. If the resulting total description length was larger, the mapping was replaced and the procedure was repeated with alternative mappings to find the largest reduction in total description length. The procedure stopped when the total description length could no longer be reduced.
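The procedure just described is a greedy backward-elimination search over mappings, scored by total description length. A minimal sketch of that loop follows; total_length is any scoring function of the kind sketched earlier, and the toy scorer is invented for illustration.

```python
def prune(mappings, total_length):
    """Greedily remove mappings while removal shortens the total description
    length; stop when no single removal helps (the procedure of Section 3.1)."""
    current = set(mappings)
    best_len = total_length(current)
    while True:
        # Try removing each remaining mapping; keep the single best removal.
        best_mapping, best_trial = None, best_len
        for m in current:
            trial = total_length(current - {m})
            if trial < best_trial:
                best_mapping, best_trial = m, trial
        if best_mapping is None:
            return current, best_len
        current.remove(best_mapping)
        best_len = best_trial

# Toy usage: the scorer charges for each retained mapping and for the
# (frequency-weighted) words left uncovered when a mapping is dropped.
freq = {"a": 10, "b": 5, "c": 1, "d": 1}
toy_length = lambda ms: 8 * len(ms) + 3 * sum(f for m, f in freq.items() if m not in ms)
print(prune(freq, toy_length))  # keeps only the high-frequency mappings
```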
3.2. Results and discussion

The shortest total description lengths for each simplified hypothesis are presented in Fig. 3A. First, the pattern of results is qualitatively the same as in Analysis 1: The shortest description length was obtained for the grapheme hypothesis. Both hypotheses were simplified considerably by removing a substantial number of mappings (compare Fig. 1A). The shortest total description lengths were obtained by reducing the number of onset–body mappings from 1,141 to 349, and the number of grapheme mappings from 237 to 118. Even after the hypotheses were simplified as much as possible, the total description length of the grapheme hypothesis was still considerably shorter than that of the onset–body hypothesis; it also contained approximately a third as many mappings. Moreover, even though the simplified grapheme hypothesis contained fewer mappings, the proportion of words for which the target pronunciation was the most probable was greater than under the simplified onset–body hypothesis (73.0% vs. 60.0%, respectively). For both hypotheses this proportion was lower than under the corresponding complete hypothesis, but the reduction was much greater for the onset–body hypothesis, as depicted in Fig. 3B.

Fig. 3. Description lengths (A) of optimized hypotheses and performance on target pronunciations (B) of complete and optimized hypotheses.

The difference between the two hypotheses is in part a reflection of the relatively large proportion of onset–body mappings that occur only once (45.5% vs. 19.2% for graphemes). These mappings contribute little to the regularity of the hypothesis (while the individual mappings may be regular, because they occur only once they do not describe a pattern within the data as a whole), and so the brevity with which they describe the data does not outweigh the length of code needed to describe them. However, because there are so many of them, a correspondingly large proportion of words depend on them for correct pronunciation. The number of inconsistent orthographic units (i.e., those that map to more than one phonemic representation) represented by the simplified hypotheses was reduced from 15.2% to 0.6% for the onset–body hypothesis, and from 37.2% to 25.8% for the grapheme hypothesis.

The results yielded an ordered list of mappings for each hypothesis (see Appendix S1 for the ordered list of grapheme-sized mappings), with those ranked high on the list representing the most useful mappings. These mappings clearly account for more of the regularity in the data than those nearer the end of the list. The highest-ranked mappings in the grapheme hypothesis are mainly single-letter graphemes, with increasingly longer graphemes occupying lower ranks (median rank for single-letter graphemes = 20, two-letter graphemes = 63.5, three-letter graphemes = 71, four-letter graphemes = 95).

These findings are potentially relevant for reading instruction because they make explicit the potential benefits of learning different representational units, both at the general type-of-unit level and, more specifically, at the individual-unit level. However, relatively few studies have directly compared the impact of teaching grapheme- versus onset–body sized correspondences. According to the results above, the preference is clearly for grapheme-sized correspondences, so knowledge of grapheme-sized correspondences should be advantageous to beginning readers. A small but nonsignificant advantage was found for small units in a recent meta-analysis (Torgerson et al., 2006); however, only three studies were included in the analysis, and the authors concluded that there was insufficient evidence on which to draw any firm conclusions. Several intervention studies not included in the Torgerson et al. (2006) meta-analysis have directly compared the impact of teaching grapheme- versus onset–body sized correspondences. Christensen and Bowey (2005) compared children taught over a 14-week period with instruction based on graphemes or bodies, or in a control condition. Children were assessed on reading accuracy and speed for words taught during the program and for transfer words. The grapheme group was consistently better on all assessments, significantly so for the transfer words. In addition, the grapheme group's reading age at posttest was 9 months ahead of that of the body group. Similar results were found by Savage, Abrami, Hipps, and Deault (2009), who compared children taught over a 12-week period using either grapheme- or body-based interventions, or a control condition. Both interventions produced significant improvements over the control condition on a range of literacy measures, but the grapheme intervention produced larger effect sizes for key skills such as word blending and reading accuracy, at both immediate and delayed posttest, whereas the body intervention produced a more general effect.

Several studies have explored the effects of orthographic complexity on grapheme acquisition. These data can be compared to the list of grapheme correspondences in Appendix S1; the more useful correspondences (those higher up the list) should be easier to learn.
In a study comparing poor and normal readers, Manis (1981, as cited in Morrison, 1984) found that both groups read simple word-initial consonants in nonwords with greater accuracy than medial short vowels, which in turn were read more accurately than medial long vowels. The correlation between the accuracy with which normal readers read each of the 15 graphemes and the rank order of those graphemes in Appendix S1 was high, at .7. Laxon, Gallagher, and Masterson (2002) found that 6- to 7-year-olds read simple, short vowels with greater accuracy than complex (digraph) vowels in both familiar words and nonwords. Consistent with this, the mean rank in Appendix S1 of the simple, short vowels from that study (18) was better than that of the complex vowels (49). Thus, the order in which different types of graphemes are acquired appears to be well aligned with their utility in decoding text, with more useful graphemes acquired before less useful ones.

Analyses 1 and 2 focused on how well various hypotheses accounted for the spelling-to-sound translations of large reading vocabularies. In Analysis 3, we explored whether similar results would be obtained for vocabularies of increasing size, drawn from the children's database.
4. Analysis 3: Does optimal unit size change with reading development?

The aim of this analysis was to compare the optimal hypotheses obtained from Analysis 2 (onset–body and grapheme) with the whole-word hypothesis, for different sized reading vocabularies. The results above suggest that grapheme-sized mappings yield the simplest hypothesis about the data when the data encompass a sizeable vocabulary. However, a reading vocabulary is acquired incrementally, and the grapheme hypothesis may not be the simplest for smaller data sets.

4.1. Method

The children's database was split into incrementally larger vocabulary sets, formed by taking the 10, 50, 100, 500, 1,000, and 3,000 most frequent words; these sets represent an approximate progression of the number of words a child can read. Each set therefore included all words in the next-smallest set plus a number of less frequent words from the database. The simplicity of the whole-word, onset–body, and grapheme hypotheses for each vocabulary size was calculated by following the calculations described in the Appendix, but restricting the mapping sets for the onset–body and grapheme hypotheses to those from the optimal hypotheses resulting from Analysis 2, rather than using the complete sets (as in Analysis 1).
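Constructing the nested vocabulary sets is a simple operation over a frequency-ordered word list, as the sketch below illustrates; the word counts are placeholders, not entries from the Stuart et al. (2003) database.

```python
def vocabulary_sets(word_freqs, sizes=(10, 50, 100, 500, 1000, 3000)):
    """Nested vocabularies of the n most frequent words: each set contains
    every word of the next-smallest set plus less frequent ones."""
    ranked = sorted(word_freqs, key=word_freqs.get, reverse=True)
    return {n: ranked[:n] for n in sizes if n <= len(ranked)}

# Placeholder frequency counts standing in for the children's database.
counts = {"the": 5000, "and": 4200, "cat": 300, "dog": 290, "ship": 40}
print(vocabulary_sets(counts, sizes=(2, 4)))
# {2: ['the', 'and'], 4: ['the', 'and', 'cat', 'dog']}
```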
4.2. Results and discussion

The total description length for each hypothesis is plotted in Fig. 4 as a function of vocabulary size. For small vocabularies (up to 50 words), the simplest hypothesis about the data is represented by whole-word mappings. However, once the vocabulary size exceeds around 50 words, grapheme-sized mappings provide the simplest hypothesis about the data.

Fig. 4. Total description lengths of hypotheses as a function of increasing vocabulary size.

Fig. 4 indicates that the regularity evident in all but the smallest vocabularies is best captured by grapheme-sized mappings. The literature clearly suggests (e.g., Carroll et al., 2003) that children are aware of larger units before they are aware of smaller units and, furthermore, that awareness of phonemes develops (at least in part) alongside instruction. Thus, children come to the task of learning to read with an established large-unit phonological lexicon, which may explain why learning a small vocabulary by sight is quick and easy: The appropriate phonological representations to which orthographic representations must be associated are already present. In contrast, not only do children have to develop their knowledge of graphemes and phonemes, but for small vocabularies grapheme-sized mappings are less likely to be repeated (most grapheme-sized mappings in the smaller vocabularies occur only once); therefore, the number of grapheme-sized mappings needed to describe the vocabulary will be greater than the number of whole-word mappings. Hence, the simplest hypothesis for small vocabularies is the whole-word hypothesis. Our analyses also explain why this initially useful strategy, evident in some models of reading development (e.g., Ehri, 1992), becomes less efficient as the reading vocabulary develops—as demonstrated by a number of meta-analyses comparing ''whole-word'' approaches to instruction with systematic small-unit approaches—because grapheme-sized representations offer a much more efficient representation of the data for larger vocabularies. We return to this issue in the General Discussion.

In a recent study, Powell, Plaut, and Funnell (2006) investigated children's ability to read words and nonwords (a measure of the ability to apply grapheme–phoneme knowledge) at the beginning of formal instruction in the United Kingdom, and again 6 months later. They found that the children read significantly more words (where errors were largely lexical errors) than nonwords at Time 1, but read a similar number of each at Time 2 (although lexical errors still dominated). Thus, their initial ability, when few words were known, was marked more by a whole-word approach, whereas later ability appeared to be characterized by a more equal input from both whole-word and grapheme–phoneme knowledge. Likewise, in an early training study, Vellutino and Scanlon (1986) found that children trained by whole-word methods read more pretrained words than children trained with small-unit phonic methods on early trials, yet the pattern reversed on later trials. The simplicity analysis above predicts exactly these patterns of acquisition. It explains why, when children are taught a small set of sight words in a particular session, the whole-word approach works best but fails to integrate with the rest of their knowledge, leading to poorer overall performance in the longer term (Vellutino & Scanlon, 1986).

However, English contains many words that are irregular (some of the letters they contain do not take their most frequent pronunciation), and it is possible that hypothesis preference is determined by whether words are regular, rather than by vocabulary size. To explore this further, the children's database was split according to whether each word was regular. Different sized vocabularies were then formed as before, containing either all regular or all irregular words. The total description lengths for each vocabulary size were recalculated by following the calculations described in the Appendix. For a meaningful comparison between the regular and irregular vocabularies, description lengths were calculated using the complete sets of (nonoptimized) mappings, as in Analysis 1, and not the optimized set of mappings derived in Analysis 2 (although a qualitatively similar result was obtained using the optimized grapheme hypothesis from Analysis 2, even for small vocabularies). If regularity, rather than vocabulary size, is the critical factor, then the grapheme hypothesis should have shorter description lengths than the whole-word hypothesis even for the smallest regular vocabularies.

Fig. 5B and C shows the total description lengths for each hypothesis as a function of vocabulary size, for regular and irregular words, respectively; Fig. 5A shows the corresponding description lengths for all words (regular and irregular), for comparison. The advantage for whole-word mappings still exists for small vocabularies, but it is moderated by regularity: The advantage for grapheme-sized mappings is apparent for small regular vocabularies (Fig. 5B) but only for relatively larger irregular vocabularies (Fig. 5A and C). As well as regularity moderating the point at which the grapheme hypothesis becomes preferable, the number and complexity of the graphemes in an orthography are likely to have an effect. Thus, it is possible that for other, more consistent languages, where the orthography contains fewer and less complex graphemes, the advantage for whole-word learning disappears even for small vocabularies. For German, for example, small-unit phonics teaching typically dominates from the beginning of reading instruction (e.g., Landerl, 2000), with considerable success (Wimmer, 1993). Fig. 5B and C also indicates that the total description length for irregular words is much larger than for regular words, suggesting that regular words should be learned more easily than irregular words. This appears to be the case both across and within languages. Thus, the reading (decoding) ability of British children lags consistently behind that of children who learn more regular orthographies (Seymour et al., 2003; Ziegler & Goswami, 2005), and children who learn less regular orthographies, for example Chinese, spend considerably longer learning an equivalently sized vocabulary (Rayner et al., 2001).
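Classifying words as regular or irregular in this sense requires only each grapheme's most frequent pronunciation. A toy sketch of the split, with invented mappings and a keyboard-style phoneme notation:

```python
def most_frequent_pronunciations(mapping_freqs):
    """For each grapheme, select the phoneme it maps to most often."""
    best = {}
    for (grapheme, phoneme), freq in mapping_freqs.items():
        if grapheme not in best or freq > mapping_freqs[(grapheme, best[grapheme])]:
            best[grapheme] = phoneme
    return best

def is_regular(parsed_word, target_phonemes, best):
    """A word is regular if the most frequent pronunciation of each of its
    graphemes reproduces the target pronunciation."""
    return [best[g] for g in parsed_word] == list(target_phonemes)

# Invented sonograph-style type frequencies: (grapheme, phoneme) -> count.
mapping_freqs = {("t", "t"): 100, ("a", "{"): 80, ("a", "Q"): 20,
                 ("p", "p"): 90, ("w", "w"): 60}
best = most_frequent_pronunciations(mapping_freqs)
print(is_regular(["t", "a", "p"], ["t", "{", "p"], best))   # True: a as in tap
print(is_regular(["w", "a", "t"], ["w", "Q", "t"], best))   # False: a as in want
```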
Fig. 5. Total description lengths for hypotheses as a function of vocabulary size, for all words (A), regular words (B), and irregular words (C).

Hanley, Masterson, Spencer, and Evans (2004) followed up the Welsh and English readers of a previous study and found that, at age 10, the English group still read irregular words less well than regular words and nonwords, which they read with a proficiency similar to that of the Welsh readers. These data are consistent with the idea that the simplicity of an orthography affects the ease with which it can be learned.

Analyses 1–3 compared hypotheses composed of one type of mapping with those composed of another. The results consistently show that the grapheme hypothesis yields the shortest total description length. However, the optimal grapheme hypothesis, as identified in Analysis 2, also contained some graphemes with multiple pronunciations. Our aim in Analysis 4 was to consider whether reducing the inconsistency in the grapheme hypothesis by providing some larger unit mappings would further reduce the description length.
5. Analysis 4: Can the addition of large units reduce small unit inconsistency?

The results from Analysis 2 indicated that, for the alternative pronunciations represented in the optimal grapheme hypothesis, the benefit in describing the data outweighed the cost of describing them. However, it is possible that the benefit gained from those alternative pronunciations could be increased further by providing a representation of when they can most usefully be applied. For example, the grapheme ''a'' may be pronounced in many different ways (e.g., ball, fast, want), but the correct pronunciation can often be determined by considering the surrounding letter(s): ''a'' is nearly always pronounced /ɑː/ when it is followed by the letter ''s'' (e.g., fast, task), and often pronounced /ɒ/ when it is preceded by the letter ''w'' (e.g., was, want). Likewise, the grapheme ''c'' is nearly always pronounced /s/ when it is followed by the grapheme ''e'' (e.g., cell, cent) or appears in the coda position (e.g., face, rice, fierce). In Analysis 4, our goal was to optimize the grapheme hypothesis further by constructing an alternative hypothesis that contained additional mappings of other unit sizes, relevant to the inconsistent grapheme mappings. The main motivation for this analysis was to explore whether providing some contextual information for inconsistent graphemes would reduce the overall description length. In addition, we wished to consider whether taking account of the syllabic position of graphemes when describing mappings would have an impact on total description length, prompted by the finding that the distribution of consonants across syllable positions is not uniform (Kessler & Treiman, 1997).

5.1. Method

Analysis 4 was restricted to the children's database. The simplest grapheme hypothesis, as identified by Analysis 2 and containing 118 grapheme–phoneme mappings, was used as the starting hypothesis to which larger unit mappings were added. This straightforward approach was taken to make the contribution of large units transparent, rather than pursuing a less transparent analysis of the order in which larger and smaller units, added alongside each other, might have most impact.

5.1.1. Position-specific mappings
First, inconsistent consonant graphemes (e.g., c, th) that could be pronounced in more than one way in both onset and coda positions were identified from the simplest grapheme hypothesis. The mappings containing them were listed with two frequencies, reflecting how often each mapping occurred separately in the onset and the coda of a syllable. For example, the mappings containing the grapheme th were listed as ''th–/θ/ 34, 32'' and ''th–/ð/ 15, 12'' instead of ''th–/θ/ 66'' and ''th–/ð/ 27.'' The total description length of the simplest grapheme hypothesis was then recalculated separately with each inconsistent consonant grapheme given position-specific mapping frequencies. The impact of position-specific mappings was assessed by comparing the total description length of the simplest grapheme hypothesis with and without position-specific mapping frequencies for each inconsistent consonant grapheme. Those that resulted in a shorter description length were incorporated into the simplest grapheme hypothesis before the large-unit mappings were tested, below.

5.1.2. Large unit mappings
A potential pool of large-unit mappings was created by identifying any large-unit mapping (onset, rime, head, and coda mappings from the hypotheses described in Analysis 1) that contained an alternative pronunciation for an inconsistent grapheme. For example, the rime mapping ead–/ed/ was identified as containing an alternative pronunciation for the inconsistent grapheme ea (usually pronounced to rhyme with beach), and war–/wɔː/ as containing an alternative pronunciation for the inconsistent grapheme ar (usually pronounced to rhyme with hard). The total description length was calculated for the simplest grapheme hypothesis plus one of the large-unit mappings, for each large-unit mapping in turn. The impact of adding any particular mapping was determined by comparing the total description length for the simplest grapheme hypothesis with and without the mapping in question. If the total description length was shorter after the addition of a mapping, this indicated that the hypothesis might be simplified by including that mapping; otherwise it should be omitted. Starting with the mapping whose addition produced the shortest total description length, each mapping was added to the simplest grapheme hypothesis one at a time, with the total description length recalculated after each addition. If the resulting total description length was shorter, the procedure was repeated with the next mapping in the list. If the resulting total description length was larger, or only a little shorter …

The code needed to describe the word on (/ɒn/) will then be:
log2(1/p(o)) + log2(1/p(n)) + log2(1/p(space)) + log2(1/p(rank = 1)) + log2(1/p(eol))
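With hypothetical symbol probabilities, that sum is direct to compute; the probabilities below are invented for illustration, not taken from the paper's encoding.

```python
import math

def bits(p: float) -> float:
    """Code length in bits for a symbol of probability p: log2(1/p)."""
    return math.log2(1 / p)

# Invented probabilities for each symbol in the encoded line for "on".
p = {"o": 0.06, "n": 0.07, "space": 0.15, "rank_1": 0.8, "eol": 0.15}

total = sum(bits(p[s]) for s in ("o", "n", "space", "rank_1", "eol"))
print(f"{total:.2f} bits to encode 'on' whose target pronunciation is rank 1")
```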
Hence, the more often the hypothesis provides the target pronunciation as rank 1 for each word, the shorter the code length will be to describe the data. If the hypothesis is unable to offer the target pronunciation for a word at all (this result occurs for later simulations in which mappings are removed from hypotheses to reduce the total description length), then the target output is described by taking the most probable pronunciation, given the available mappings, and calculating a correction cost. The correction cost, derived by calculating the number of bits necessary to change the most probable pronunciation into the target one, is based on Levenshtein distance (Levenshtein, 1966). In brief, the less closely the most probable pronunciation matches the target one, the higher the correction cost, as each change requires log2(1 ⁄ p(any one phoneme)) bits. Producing pronunciations To produce all possible pronunciations of a given word, each word is first parsed into constituent parts, according to the unit size under consideration. Head–coda and onset–rime unit sizes yield only one parse. For example, given the word bread, a head–coda parse gives brea.d and an onset–rime parse yields br.ead, because there is only one way of splitting the word into either head–coda or onset–rime. However, for grapheme units, there are many ways to parse a word, not all of which will correspond to available grapheme–phoneme mappings. For example, bread could theoretically be parsed many ways: b.read, brea.d, br.ead, bre.ad, br.ea.d, br.e.ad, b.re.ad, br.e.a.d, b.re.a.d, b.r.ea.d, b.r.e.ad, b.r.e.a.d, but only the parses b.r.e.a.d and b.r.ea.d correspond to available grapheme–phoneme mappings. For each parse, the relevant mappings are applied combinatorially and each pronunciation assigned an overall probability according to the product of the individual mapping probabilities. Individual mapping probabilities are based on the (type) frequency with which orthographic representations are associated with phonological representations. This frequency measures a directionless association between orthography and phonology (the sonograph, see Spencer, 2009), and contrasts with directional measures such as conditional grapheme– phoneme probabilities, or token-based measures that weight grapheme–phoneme counts by word frequency. Sonograph frequency has been shown to be a better predictor of reading accuracy in young British readers than either directional or token-based measures (Spencer, 2010). For the MDL analyses presented here, probabilities based on type rather than token frequency are appropriate because they provide a more accurate guide to pronunciation of the context-independent orthographic representations that are used to redescribe the data. Unless the context in which any given representation occurs is encoded, then the use of token frequencies could distort the likelihood with which orthographic representations are commonly pronounced across multiple contexts, which would be contrary to the goal of a child learning to read. In order to reduce the description length of encoding the frequencies (see above), mapping frequencies are represented by a number corresponding to the fre-
frequency range to which they belong (obtained by taking the log of the actual frequency). The only information available concerning mapping frequency is then obtained by inverting the log transform applied to the number that the mapping was assigned when the frequencies were originally encoded. To continue using base 10 as an example, if a mapping had been assigned the number 3, then the frequency used in calculating pronunciations would be taken to be 10³ = 1,000. A small amount of Gaussian noise is added to each pronunciation probability to avoid different pronunciations occurring with equal probability (this happens occasionally because the encoded mapping frequencies have restricted resolution due to taking logs), and average code lengths for each hypothesis were based on at least 100 simulations. Pronunciations are filtered so that those that violate the phonotactics of English (i.e., those that either contain disallowed sequences of phonemes, such as ⁄ gn ⁄ , or contain phonemes in disallowed positions, such as starting a word with ⁄ ŋ ⁄ ) are not considered. The list of pronunciations is arranged in descending order of probability, and the target pronunciation is assigned a rank. Continuing the example using bread, the head–coda hypothesis uses the parse brea.d and the mappings brea— ⁄ bre ⁄ , brea— ⁄ bri+ ⁄ , brea— ⁄ brew ⁄ , and d— ⁄ d ⁄ to produce three outputs (in descending order of probability): ⁄ bred ⁄ , ⁄ bri+d ⁄ , and ⁄ brewd ⁄ . The onset–rime model applies the mappings br— ⁄ br ⁄ , ead— ⁄ ed ⁄ , and ead— ⁄ i+d ⁄ to produce two outputs from the parse br.ead: ⁄ bred ⁄ and ⁄ bri+d ⁄ . The GPC model produces many outputs from the parse b.r.e.a.d (for example, ⁄ breæd ⁄ , ⁄ brea+d ⁄ , ⁄ breZd ⁄ , …, ⁄ bri+æd ⁄ , ⁄ bri+a+d ⁄ , ⁄ bri+Zd ⁄ , …), all of which form phonotactically improbable monosyllables; and produces eight outputs from the parse b.r.ea.d: ⁄ bri+d ⁄ , ⁄ bred ⁄ , ⁄ brewd ⁄ , ⁄ bri+t ⁄ , ⁄ bret ⁄ , ⁄ brewt ⁄ , ⁄ brwbd ⁄ , and ⁄ brwbt ⁄ , one of which ( ⁄ brwbt ⁄ ) is phonotactically improbable. The target pronunciation is thus described by ranks of 1, 1, and 2, respectively, for each hypothesis.
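To make the procedure concrete, the following Python sketch enumerates the grapheme parses of bread and ranks the resulting pronunciations by the product of their mapping probabilities. The mapping table and its type frequencies are illustrative stand-ins (the actual frequencies are given in Data S1, and the phoneme symbols are ASCII stand-ins in the spirit of Data S2); the phonotactic filter, the log-banded encoding of frequencies, and the Gaussian noise described above are omitted for brevity:

```python
from itertools import product

# Hypothetical grapheme-phoneme mappings with made-up type frequencies.
MAPPINGS = {
    "b": [("b", 910)], "r": [("r", 950)], "d": [("d", 880)],
    "e": [("e", 300), ("i:", 120)],
    "a": [("ae", 400), ("A:", 90)],
    "ea": [("i:", 66), ("e", 27)],
}

def parses(word):
    """All ways to split `word` into units (here, 1- or 2-letter graphemes)
    that have at least one available mapping."""
    if not word:
        return [[]]
    out = []
    for size in (1, 2):
        head, tail = word[:size], word[size:]
        if head in MAPPINGS:
            out += [[head] + rest for rest in parses(tail)]
    return out

def pronunciations(word):
    """Every pronunciation, with probability = product of mapping probabilities."""
    results = []
    for parse in parses(word):
        per_unit = []
        for unit in parse:
            total = sum(f for _, f in MAPPINGS[unit])
            per_unit.append([(ph, f / total) for ph, f in MAPPINGS[unit]])
        for combo in product(*per_unit):
            p = 1.0
            for _, q in combo:
                p *= q
            results.append(("".join(ph for ph, _ in combo), p))
    return sorted(results, key=lambda x: -x[1])

# "bread" yields exactly the two parses b.r.e.a.d and b.r.ea.d noted above;
# the rank of the target pronunciation is then read off the ordered list.
for phones, p in pronunciations("bread")[:4]:
    print(phones, round(p, 3))
```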
Supporting Information
Additional Supporting Information may be found in the online version of this article on Wiley Online Library:
Appendix S1. Ranked list of grapheme–phoneme mappings, in descending order of contribution to the brevity of the total description length.
Data S1. Grapheme–phoneme frequencies in monosyllabic British English.
Data S2. Conversion of keyboard phonemes to IPA.
Please note: Wiley-Blackwell is not responsible for the content or functionality of any supporting materials supplied by the authors. Any queries (other than missing material) should be directed to the corresponding author for the article.
Cognitive Science 35 (2011) 79–118 Copyright © 2010 Cognitive Science Society, Inc. All rights reserved. ISSN: 0364-0213 print / 1551-6709 online DOI: 10.1111/j.1551-6709.2010.01149.x
Holographic String Encoding Thomas Hannagan, Emmanuel Dupoux, Anne Christophe Laboratoire de Sciences Cognitives et Psycholinguistique, EHESS/CNRS/DEC-ENS Received 28 June 2010; received in revised form 8 July 2010; accepted 1 October 2010
Abstract In this article, we apply a special case of holographic representations to letter position coding. We translate different well-known schemes into this format, which uses distributed representations and supports constituent structure. We show that, in addition to these brain-like characteristics, performance on a standard benchmark of behavioral effects improves in the holographic format relative to the standard localist one. This is notably due to emergent properties of holographic codes, such as transposition and edge effects, for which we give formal demonstrations. Finally, we outline the limits of the approach as well as its possible future extensions. Keywords: Letter position coding; Distributed representations; Holographic memories
1. Introduction
Research on letter position encoding has undergone a remarkable expansion in recent years. At the origin of this expansion are numerous findings concerning the flexibility of letter string representations, as well as the growing realization of their importance in the computational study of lexical access. Nevertheless, how the brain encodes visual words is still a subject of lively debate. Obviously, a stable code must be used if we are to retrieve words successfully, and order must be encoded because we can distinguish between anagrams like "relating" and "triangle." But researchers disagree on exactly how this string encoding step1 is achieved. There is currently a great variety of proposals in the literature, but it is possible to classify them according to the scheme and code they use. Indeed, current proposals fall into two classes of schemes, relative and absolute, and each of them can be implemented using two kinds of codes, localist or distributed. Correspondence should be sent to Thomas Hannagan, Laboratoire de Sciences Cognitives et Psycholinguistique, EHESS/CNRS/DEC-ENS Ecole Normale Supérieure, 29 rue d'Ulm, 75005 Paris, France. E-mail:
[email protected] 80
We will see that all schemes but one have been implemented using localist codes. In this article, we explore a special kind of distributed coding, based on holographic representations. We show that well-known schemes behave differently depending on the code they use, holographic ones generally performing better on a benchmark of behavioral effects. Finally, we exploit the compositional power of holographic representations to study a new scheme in the spirit of the local combination detector model (LCD; Dehaene, Cohen, Sigman, & Vinckier, 2005), a recent and biologically constrained proposal for string encoding.

1.1. Absolute versus relative schemes
A string encoding scheme is an abstract specification of the basic entities and relationships by which order will be achieved in the string representation.

1.1.1. Absolute position schemes
Absolute schemes hold that for any given entity in a string, its position is marked by coordinates in a common reference frame. The basic entity in all current absolute schemes is the letter, and the most commonly used origin for the reference frame is the first letter. Examples of this approach include slot coding (Coltheart, Rastle, Perry, Langdon, & Ziegler, 2001; McClelland & Rumelhart, 1981), double-anchoring (Jacobs, Rey, Ziegler, & Grainger, 1998), the overlap model (Gomez, Ratcliff, & Perea, 2008), and Spatial Coding (Davis, in press). A scheme should be flexible if it is to bear any resemblance to human performance, and absolute schemes can be more or less flexible depending on the reference frame. The least flexible, but still the most widespread, is slot coding, in which position coordinates are perfectly determined and originate at the beginning of the word. This amounts to feeding letters into a positional array, implying that a shift of as little as a single letter will displace all elements and minimize similarity (the so-called alignment problem; Davis, 1999). Building on slot coding, some variants like double-anchoring introduce more flexibility by using two origins (first and last letters) instead of one (Fischer-Baum, McCloskey, & Rapp, 2010; Jacobs et al., 1998). Many absolute schemes also choose to feature uncertainty in position assignment. This approach is best illustrated in the overlap model (Gomez et al., 2008), where the position of a letter in the frame is not marked by a constant coordinate, but rather by a probability density function centered on this coordinate. Finally, the most flexible absolute scheme is Spatial Coding (Davis, in press), where letters are assigned monotonic position values as they progress in the string. The flexibility comes partly from exploiting some uncertainty in position assignment, but mostly from the similarity measure between two strings, which assesses how consistently matching letters are shifted across the two strings.

1.1.2. Relative position (RP) schemes
The second approach conceives of letter order in relative terms. Here, information concerning the position of single letters is utterly irrelevant, and there is no common coordinate frame. Rather, what matters is the relative order in which two or more (possibly distant) letters appear in the string. That is, a string code is best thought of as a set of activated n-gram entities. Among the proponents of this approach figure Wickelcoding (Seidenberg
& McClelland, 1989; Wickelgren, 1969), Blirnet (Mozer, 1991), and open-bigram schemes (Grainger & Van Heuven, 2003; Whitney & Berndt, 1999). Relative schemes vary as to the size and contiguity properties an entity should possess. Wickelcoding and Blirnet both use letter trigrams, but whereas the former imposes complete contiguity upon their constituent letters, the latter allows for a one-letter gap. Open-bigram schemes restrict the entity to the bigram, but enlarge the authorized distance between letters: Up to two intervening letters are allowed in constrained open-bigrams (COB; Grainger & Van Heuven, 2003) and Seriol (Whitney & Berndt, 1999), while no restriction is imposed in unconstrained open-bigrams (UOB). More recently, an overlapping open-bigram scheme was proposed in which contiguity is reinstated upon entities, but position uncertainty takes place at lower levels; consequently, activation of noncontiguous bigrams is still achieved, albeit in a different way (Grainger, Granier, Farioli, Van Assche, & Van Heuven, 2006). Note that we have used the terms absolute and relative somewhat differently from their use in the literature. The term "absolute" in the context of letter position coding has been applied to any scheme using a fixed coordinate frame, with a single origin and no uncertainty. The term "relative" was then usually defined by opposition, and hence includes schemes with more than one origin. Accordingly, double-anchoring schemes were first dubbed relative on the grounds that position is encoded with respect to two anchors, and Spatial Coding has been referred to as relative because match calculations are based on the relative alignment between letter positions in the two strings, rather than on the number of letters that match at any absolute position. But we feel this terminology could profit from a more traditional use of the absolute/relative dichotomy. In the classical meaning, any scheme that encodes order through a coordinate system, using the distance of entities to one or more common origin(s), should be referred to as absolute, whereas any scheme that purports to encode order directly through the relationships between entities, bypassing any origin, should be referred to as relative. This terminology is orthogonal to flexibility; indeed, relative schemes can be quite rigid (e.g., Wickelcoding) and absolute schemes can be very flexible (e.g., Spatial Coding).

1.1.3. A recent hierarchical proposal: Local combination detector
Although it has not yet been implemented, a string encoding proposal privileging biological plausibility and constrained by brain imaging data has recently appeared (Dehaene et al., 2004, 2005; Vinckier et al., 2007). LCD holds that hierarchical feature extraction from the stimulus results in invariance for size and font in the left fusiform gyrus, and more precisely in the posterior part of the so-called visual word form area (Cohen & Dehaene, 2004; Cohen et al., 2000). From there, string encoding is best described as involving three levels of units with increasing receptive fields, culminating in the anterior part of the region. The model distinguishes between letter, bigram, and quadrigram detectors, which would be segregated into distinct cortical stripes—although all types of units could also be present at each level in different proportions (Vinckier et al., 2007). Based on evidence from behavioral and brain studies (Dehaene et al., 2004; Peressotti & Grainger, 1999), the model assumes an absolute scheme at the
letter level, but the highest level in the hierarchy uses open-quadrigrams, and thus a scheme built from the LCD model would ultimately be a relative one.

1.2. Localist and distributed codes
Once a scheme has been decided upon, its actual implementation requires defining a code. Codes are primarily concerned with units, and with how units relate to the entities specified in the scheme. One should distinguish between two different kinds of relationships between units and entities: one-to-one relationships, which produce localist codes,2 and many-to-many relationships, which produce distributed codes (Plaut & McClelland, 2010; Rumelhart & McClelland, 1986).

1.2.1. Localist codes
A localist code describes a one-to-one mapping between entities and units. For example, consider encoding the word "CABS": Localist Wickelcoding could use a binary vector where each component stands for a single letter triplet. The code would then assign ones only to the four units corresponding to CA, CAB, ABS, and BS, and zeros to all others. This code is localist in the strongest sense: A single bit would map to exactly one element, and hence flipping a bit from 1 to 0 would entirely and exclusively destroy the information about the corresponding triplet. The localist approach has a long history in cognitive science. As each unit represents one entity, localist codes have appeared more compatible with so-called classical cognitive frameworks in which rules operate on symbols (Chomsky, 2002; Fodor, 1983; Pinker, 1999; Turing, 1950). Localist coding has also been used in many successful computational models from various areas of cognition, and especially in visual word recognition (Coltheart et al., 2001; Grainger & Jacobs, 1996; McClelland & Rumelhart, 1981; Norris, 2006). At the neural level, the approach accords with the so-called grandmother cell hypothesis, which states that single cells can be dedicated to a single arbitrarily complex object irrespective of its size, orientation, or color (Gross, 2002). As noted by Gross, the original grandmother cell hypothesis advocated a weaker form of localist representations, in which many neurons (the number 18,000 has been advanced) would be tuned to the same entity, and indeed some localist proponents have embraced this many-to-one redundant coding (Bowers, 2009).

1.2.2. Distributed codes
Alternatively, one can use a code in which many units participate in representing any given entity, and many entities use each unit. That is, no unit can be singled out as uniquely representing one stimulus—or any class, or any given feature of it—and instead information about any given entity is spread over the activity of many units. At one extreme, this many-to-many mapping produces a dense distributed code, each entity activating on average half of the units. At the other extreme, we find sparsely distributed codes, in which an entity uses only a few units and each unit codes for only a few entities. For many years, the only distributed string code in the literature implemented a Wickelcoding scheme, and had been proposed in
the context of the Parallel Distributed Processing (PDP) model (Seidenberg & McClelland, 1989). It consisted of 400 binary units, each responding to 1,000 randomly chosen letter triplets. This so-called coarse coding approach is distributed because a single unit, considered alone, cannot tell us which one of its 1,000 triplets activated it, yet taken together the 400 units define a virtually unique correspondence.3 Recently, however, Dandurand, Grainger, and Dufau (2010) trained a backpropagation network on a location-invariant word recognition task, and it has been shown that this network implements a densely distributed version of the overlap scheme (Hannagan, Dandurand, & Grainger, in press). Distributed codes are slightly more recent than localist ones but also have a strong pedigree. Their study was precipitated by the discovery of the backpropagation learning algorithm (Rumelhart, Hinton, & Williams, 1986), which usually produces distributed codes whose sparseness depends on the network structure and the task at hand (Plaut & McClelland, 2010). These codes are also intimately tied to high-stakes theoretical issues in cognitive science that contrast a vision of human cognition as emerging from associations between (often distributed) representations with one of systems of rules and symbols (McClelland, Patterson, Pinker, & Ullman, 2002). However, this distinction was blurred in the 1990s by the introduction of such techniques as Tensor Product representations (Smolensky, 1990), and later by Holographic representations, which instantiate what has been described as an Integrated Connectionist Symbolic framework (Smolensky, Legendre, & Miyata, 1992).

1.3. The localist versus distributed debate
As this article illustrates, the choice of a representational format has profound impacts at the computational level: Schemes do not make the same predictions depending on the code they use. In consequence, one should assess which of the distributed or localist approaches, if any, is currently favored by the evidence. Nowadays, the commitment to distributed representations is widespread among both neuroscientists and computational modelers, in part because of the success of the PDP framework (Rumelhart & McClelland, 1986), and also because such representations have been reported in a vast number of brain regions and in a variety of forms—see Abbott and Sejnowski (1999) and Felleman and Van Essen (1991) for good surveys. As regards string encoding, it is also worth noting here that the LCD proposal, which may currently have the strongest claims to biological plausibility, postulates distributed representations. Recently, however, Bowers (2009) argued for a localist vision in which units are redundant (i.e., many dedicated localist units code for any given entity), activation spreads between them (i.e., each dedicated unit can possibly also show some low level of activity for related stimuli), and units are limited to coding for individual words, faces, and objects (Bowers, 2009). Bowers points to human neurophysiological recordings showing rare cells with highly selective responses to a given face, and negligible or flat responses to all others. It has been pointed out that reports of highly tuned cells can hardly count as evidence for localist coding, since one can neither test exhaustively all other cells for any given stimulus, nor test the presumed localist cell exhaustively for all possible stimuli (Plaut & McClelland, 2010). Nevertheless, Bayesian estimates of sparseness can and have been derived for
medial temporal lobe cells (Waydo, Kraskov, Quiroga, Fried, & Koch, 2006), assuming that on the order of 10⁵ items are stored in this area—where coding is generally believed to be sparser than elsewhere in the brain—and these estimates turned out to be unequivocally in favor of sparse distributed coding. Crucially, the following claim from Bowers is incorrect:

    Waydo et al.'s (2006) calculations are just as consistent with the conclusion that there are roughly 50–150 redundant Jennifer Aniston cells in the hippocampus. (p. 245)

Waydo et al.'s (2006) study was agnostic with respect to redundant coding and explicitly concludes that, rather than 50–150, on the order of several million medial temporal (MT) neurons should respond to any given complex stimulus (Jennifer Aniston included), and each cell is likely to respond to a hundred different stimuli. In fact, such "multiplex" cells, to use Plaut and McClelland's (2010) proposed terminology, have been directly observed by Waydo et al. (2006) (personal communication). As for computational efficiency, distributed codes have well-known qualities: They are robust to noise, have a large representational power, and support graded similarities. In distributed codes, randomly removing n units does not lead to the total deletion of n representations, but rather to the gradual loss of information in all representations. This would seem especially useful considering the different sources of noise—environmental or neural—in spite of which the brain has to operate (Knoblauch & Palm, 2005). Redundant localist coding would indeed provide a similar advantage, were it not already disproved by the above-mentioned estimates. Distributed coding also provides a significant gain in representational power. For instance, n binary units can encode at most n local representations (without any redundancy), but this number is a lower bound when using distributed representations—and can reach 2ⁿ with dense coding. Because firing a neuron takes energy, there is likely to be a trade-off between representational efficiency and metabolic cost: Brain regions supporting highly regular associations would be more likely to settle on dense representations, whereas those supporting highly irregular associations would settle on sparser ones, although not to the localist extreme. Moreover, while distributed representations are liable to result in "catastrophic interference" of new memories on old ones, it is known that the independently motivated introduction of constraints, such as interleaved learning, or assumptions, such as pseudo-rehearsal, can circumvent the issue (Ellis & Lambon Ralph, 2000; McClelland, McNaughton, & O'Reilly, 1995). Finally, in localist coding, the question arises as to which entities or classes of entities a unit should stand for, especially considering that class distinctions can be entirely context dependent (Plaut & McClelland, 2010). This is intimately tied to the problem of defining similarities between representations, a problem which is solved in distributed representations, where graded similarities can be defined as the overlap between patterns—for instance, by computing the Euclidean distance between activation patterns.
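The contrast can be made concrete with a toy example. In the Python snippet below (with made-up activation patterns), one-hot localist codes place every pair of distinct entities at exactly the same distance, whereas distributed patterns support the graded similarities just described:

```python
import numpy as np

# Localist: one unit per entity, so any two distinct codes are equally far apart.
e1, e2, e3 = np.eye(3)
print(np.linalg.norm(e1 - e2), np.linalg.norm(e1 - e3))  # same distance twice

# Distributed: entities sharing structure get overlapping activation patterns,
# so similarity is graded (patterns below are invented for illustration).
cab = np.array([0.9, 0.8, 0.1, 0.2])
cat = np.array([0.9, 0.7, 0.2, 0.3])
dog = np.array([0.1, 0.2, 0.9, 0.8])
print(np.linalg.norm(cab - cat), np.linalg.norm(cab - dog))  # small vs. large
```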
In light of the available evidence, we think that the localist approach is not currently supported by physiology, that it runs into the fundamental issue of how to determine what a unit codes for, and that it severely limits representational power, especially if one acknowledges that redundant coding is required to overcome noise issues—which is not compatible with
current estimates. On the other hand, distributed representations have been abundantly found in the brain, and they allow a much larger, graded, and more robust spectrum of representations to be coded by any given population of units.

1.4. Constituent structure and the dispersion problem
As we have seen, there are behavioral and biological reasons to explore the possibility that letter combinations are involved in the string encoding process, as proposed for instance in LCD, open-bigrams, or Wickelcoding. In these schemes, by definition, a given letter combination is active only if its constituents are present in the string in the correct order. There is, however, a consequence of this way of representing letter order: the possible loss of information pertaining to constituent structure. Under a localist open-bigram implementation, for instance, bigram 12 is no more similar to bigrams 1d or 21, with which it shares common constituents,4 than to bigram dd, with which it shares none—see also Davis and Bowers (2006) for the same remark on localist Wickelcoding. This is what mathematicians refer to as a topologically discrete space: The (normalized) distance between two localist representations is always one (or zero iff they are identical). As n-gram units increase in size, this insensitivity to constituent structure can become problematic for the string encoding approach, implying for instance that a localist open-quadrigram code would assign no similarity whatsoever between any two distinct four-letter strings, in flat contradiction with experimental results (Humphreys, Evett, & Quinlan, 1990). While this issue does not arise in absolute codes, because similarities are computed at the level of letters, it is potentially serious for relative schemes. The issue seems to stem in part from the use of localist representations, but it is not obvious how using distributed representations would actually improve the situation. Indeed, as described earlier, one example of a distributed string code can be found in the original PDP model, where it was applied to the Wickelcoding scheme. The code has good robustness to noise and provides graded similarity measures for strings of more than three letters, but it is of little use for the problem described above. This is because, as triplets are randomly assigned to units, there is still no relationship between units coding for triplets with similar letter constituents. This phenomenon is related to the dispersion problem observed in PDP (i.e., the fact that orthographic-to-phonological regularities learned in some context do not carry over to different contexts), and it has been identified as a major obstacle to generalization in the PDP model (Plaut, McClelland, Seidenberg, & Patterson, 1996). What seems to be required is a way to represent letter-combination entities that respects similarities between combinations sharing constituents, and we now describe how this can be achieved using holographic representations.

1.5. Binary spatter code
Binary spatter codes (BSC; Kanerva, 1997, 1998) are a particular case of holographic reduced representations (Plate, 1995) and provide a simple way to implement a distributed and compositional code.
Formally, BSCs are randomly distributed vectors of large dimension (typically on the order of 10³ to 10⁴), where each element is bipolar, independently and identically distributed according to a Bernoulli distribution. Vectors can be combined using two operators, X-or and Majority rule, often referred to, respectively, as the Binding and Chunking operators.5 The composition of BSC vectors using these operators results in a new vector of the same dimension and distribution. Operators are defined element-wise (see Appendix A for formal definitions): X-oring two bipolar variables gives 1 if they are different and −1 otherwise, while the majority rule applied to k bipolar variables gives 1 if they sum to above zero and −1 otherwise. Both operators preserve the format (size and distribution) of vectors, so that the resulting vectors can themselves be further combined, thus allowing arbitrarily complex combinatorial structures to be built in a constant-sized code. As an example, consider how to implement the sequence of events [A, B, C] with an absolute scheme and vectors of dimension 10. We start by generating three event vectors at random:

A := [1, 0, 1, 1, 1, 0, 0, 1, 1, 0]
B := [1, 0, 0, 0, 0, 1, 0, 1, 0, 1]
C := [0, 1, 0, 1, 0, 1, 1, 1, 0, 1]

Likewise, we also generate at random the three temporal vectors required by an absolute position scheme:

T1 := [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
T2 := [0, 1, 0, 1, 0, 0, 0, 0, 0, 1]
T3 := [1, 1, 0, 1, 1, 0, 1, 1, 1, 1]

We express the idea that an event occurred at a given time by X-oring these vectors together, giving the three bindings:

A⊗T1 = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]
B⊗T2 = [1, 1, 0, 1, 0, 1, 0, 1, 0, 0]
C⊗T3 = [1, 0, 0, 0, 1, 1, 0, 0, 1, 0]

These vectors are again densely and randomly distributed (p(0) = p(1) = 1/2). Calculating the expected similarities6 between one binding and its constituents, we see that it must be zero: X-or destroys similarity. Finally, we signify that these three events constitute one sequence by applying the majority rule:

S = Maj(A⊗T1, B⊗T2, C⊗T3)
  = A⊗T1 ⊕ B⊗T2 ⊕ C⊗T3 = [1, 0, 0, 0, 1, 1, 0, 1, 0, 0]
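This derivation can be checked mechanically. The following Python sketch reproduces the example, keeping the 0/1 convention used above (with an odd number of chunked vectors, ties cannot occur; the general BSC definition breaks ties at random):

```python
import numpy as np

# Event and temporal vectors from the example above (0/1 convention).
A  = np.array([1, 0, 1, 1, 1, 0, 0, 1, 1, 0])
B  = np.array([1, 0, 0, 0, 0, 1, 0, 1, 0, 1])
C  = np.array([0, 1, 0, 1, 0, 1, 1, 1, 0, 1])
T1 = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])
T2 = np.array([0, 1, 0, 1, 0, 0, 0, 0, 0, 1])
T3 = np.array([1, 1, 0, 1, 1, 0, 1, 1, 1, 1])

bind = np.bitwise_xor  # binding: 1 where the two vectors differ

def maj(*vs):
    """Element-wise majority rule; with an odd number of vectors no ties occur."""
    return (np.sum(vs, axis=0) * 2 > len(vs)).astype(int)

S = maj(bind(A, T1), bind(B, T2), bind(C, T3))
print(S)  # [1 0 0 0 1 1 0 1 0 0], as derived in the text
```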
The sequence vector S that results from the Maj also has the same distribution as its argument vectors. Calculating its expected similarity with its arguments, we see that contrary to the X-or it is non-zero: here, 1/2.7 The fact that Maj allows for part/whole similarities also implies that two structures sharing some constituents will be more similar than chance. Note that BSCs can be made arbitrarily sparse (Kanerva, 1995), although in this article we use the dense version for simplicity. It has also been shown that randomly connected sigma–pi neurons can achieve holographic reduced representations (Plate, 2000). In addition, holographic techniques such as BSCs have recently been applied to a number of language-related problems, including grammatical parsing (Whitney, 2004), learning and systematicity (Neumann, 2002), lexical acquisition (Jones & Mewhort, 2007), language acquisition (Levy & Kirby, 2006), and phonology (Harris, 2002).

1.6. Comparing codes: Method, objections, and replies
The masked priming paradigm (Forster & Davis, 1984) is most commonly used to assess the relative merits of candidate schemes. Simply stated, in this paradigm a mask is presented for 500 ms, followed by a prime for a very brief duration (e.g., 50 ms), and then directly by the target string for 500 ms. Under these conditions, subjects are generally unaware of the prime—it is said to be subliminal. However, when target strings are preceded by subliminal primes that are close in visual form, subjects are significantly faster and more accurate in their performance. In most studies the primes of interest are nonwords, and there is no manipulation of frequency (F) or neighborhood (N). By varying the orthographic similarity with the target (e.g., hoss-TOSS vs. nard-TOSS), one can observe different amounts of facilitation that can then be compared to similarity scores derived from candidate schemes. The way similarity scores are calculated depends upon the scheme. Most of the time, the similarity between prime and target strings is obtained simply by dividing the number of shared units by the number of units in the target. This is the similarity defined, for instance, in Grainger and Van Heuven's localist open-bigrams and, when absolute positions are taken into account, in slot coding. More sophisticated similarities can introduce weights for different units (Seriol model), signal-to-weight differences (Spatial Coding), or sums of integrals over Gaussian products (overlap model). Some researchers have recently raised objections to this way of assessing codes. Lupker and Davis (2009) challenge the notion that masked priming effects can constrain codes, since codes abstract away from N and F, which despite all precautions will play a role in priming; they propose a new paradigm deemed less sensitive to these factors. Although this "sandwich priming" appears promising, a dismissal of the standard approach would certainly be premature. All researchers agree that the similarity/masked priming comparison should be used with caution—see, for example, Gomez et al. (2008)—which would exclude, for instance, conditions that explicitly manipulate primes for N or F. But the standard view is that, in general, masked priming facilitation should correlate highly with the magnitude of similarities, which in all models are thought to capture bottom-up activation
from sublexical units (see, for instance, Davis & Bowers, 2006; Gomez et al., 2008; Grainger et al., 2006; Van Assche & Grainger, 2006; Whitney, 2008). This view is also supported by unpublished investigations showing that similarities in the Spatial Coding scheme correlate at 0.63 with simulated priming in the Spatial Coding model, and at 0.82 when conditions that manipulate F and N are excluded (Hannagan, 2010). Although such correlations are bound to differ across models, because models make different assumptions about N and F mechanisms, the differences put models' behaviors into perspective, allowing one to distinguish bottom-up contributions from other contributions in priming simulations. Finally, in some circumstances, similarities alone can be sufficient to rule codes out. Such a situation occurred, for instance, in the case of standard localist slot coding, where a string code suffers equal disruptions from double substitutions and from transpositions. No lexical level from any model could save this localist code from being inconsistent with human data, because these show strong priming in the latter case but weak or no priming in the former (Perea & Lupker, 2003a). In summary, assessing codes on masked priming constraints has been standard practice in the field, can rule out codes at face value, arguably provides accurate expectations for priming simulations carried out on models, and in any case provides a useful way to understand their inner workings. Consequently, and following others, in this article we will use similarity scores to predict the priming effects that would be expected from the codes alone in any lexical access model, and we will assess codes exclusively against masked priming data.

1.7. Four families of constraints from masked priming
We now describe several properties we believe word codes should display. These properties are by no means exhaustive but were chosen because they have been consistently reported in the literature.

1.7.1. Stability
First and foremost, the code should be stable if words are ever to be accessed reliably. A sufficient condition of stability for string S is that the similarity between S and any other string should be strictly less than one, while being one between S and itself. However, this cannot be a necessary condition, as in a noisy environment the system must be prepared to accept some variation even in the identity condition. Thus, we propose that a necessary and sufficient stability condition for string S is that, in the absence of noise, no transformation of S can produce a code closer to S than a given similarity criterion (we use 0.95 throughout this article). We will test for the stability of five-letter strings by computing the codes for minimal transformations, typical instances being substitution, transposition, insertion, repetition, and deletion (see Table 1 for examples).

1.7.2. Edge effects
Another desirable, although controversial, property concerns the importance of outer letters relative to inner ones.
Table 1
Description of the constraints used in this article, and their respective conditions

Constraint Family           Condition   Prime       Target      Criterion
Stability                   (1)         12345       12345       > 0.95
                            (2)         1245        12345       < (1)
                            (3)         123345      12345       < (1)
                            (4)         123d45      12345       < (1)
                            (5)         12dd5       12345       < (1)
                            (6)         1d345       12345       < (1)
                            (7)         12d456      123456      < (1)
                            (8)         12d4d6      123456      < (7)
Edge effect                 (9)         d2345       12345       < (10)
                            (10)        12d45       12345       < (1)
                            (11)        1234d       12345       < (10)
Transposed letter effect    (12)        12435       12345       > (5)
                            (13)        21436587    12345678    = Min
                            (14)        125436      123456      < (7) and > (8)
                            (15)        13d45       12345       < (6)
Relative position effect    (16)        12345       1234567     > Min
                            (17)        34567       1234567     > Min
                            (18)        13457       1234567     > Min
                            (19)        123256      1232456     > Min
                            (20)        123456      1232456     = (19)

Note. Min: minimum similarity value among all conditions.
In a letter identification paradigm, subjects are significantly better at reporting the first and last letters than any other (Stevens & Grainger, 2003). In masked priming lexical decision studies, primes that differ from the target by one or two letters produce weaker (or no) facilitation when outer letters are manipulated than when inner letters are—see Jordan, Thomas, Patching, and Scott-Brown (2003) for a good summary of studies on the subject. The facilitation produced is often equal for initial and final letter substitutions (Grainger & Jacobs, 1993), but edge effects have sometimes proved elusive (Grainger et al., 2006; Guerrera & Forster, 2008), or confined to the initial letter (Whitney, 2001). In this article, we will say that a code shows edge effects if single substitutions occurring at outer positions are more destructive in terms of similarity than those occurring at inner ones (i.e., sim(d2345, 12345) < sim(12d45, 12345) and sim(1234d, 12345) < sim(12d45, 12345)).7

1.7.3. Transposition priming
Transposed letter (TL) priming refers to the robust finding that transposing two contiguous letters is less destructive than replacing them altogether (e.g., "jugde" primes "JUDGE" significantly more than "jupte" does; Perea & Lupker, 2003a). Recently, TL priming has been further investigated in a number of directions, and the influence of the contiguity, status, length, and number of transposed letters is being assessed (Lupker, Perea, & Davis, 2008; Perea & Acha, 2008; Perea, Duñabeitia, & Carreiras, 2008). In this article, we choose to focus on four results: the basic single contiguous TL constraint for five-letter strings
(sim(12435, 12345) > sim(12dd5, 12345); hereafter, local TL); a global TL constraint from a recent study (Guerrera & Forster, 2008), where transposing all contiguous letter couples in an eight-letter string produced no facilitation (sim(21436587, 12345678) = Min);8 a distant TL constraint, where a single noncontiguous transposition in six-letter words was found to give more facilitation than the corresponding double substitution, but less than a single substitution (sim(12d4d6, 123456) < sim(125436, 123456) < sim(12d456, 123456)) (Perea & Lupker, 2004); and finally, what we will call the compound TL constraint, stating that a transposition coupled with a substitution (13d45; neighbors-once-removed, hereafter N1R) produces less priming than a single median substitution (Davis & Bowers, 2006). The global TL constraint currently defies predictions from all schemes but Seriol9 by producing no priming at all,10 whereas the compound and distant TL constraints seem to challenge all schemes but Spatial Coding.

1.7.4. RP priming
Relative position priming covers the finding that strings obtained by disturbing the absolute position of the target letters, while sparing their relative positions, still make efficient primes (e.g., 12345, 34567, and 13457 all prime the word 1234567; Grainger et al., 2006). This goes against expectations based on absolute letter coding schemes like slot coding, as well as schemes that incorporate some local context, like Wickelcoding. Although RP priming has recently been extended to superset primes (Van Assche & Grainger, 2006), in this article we will focus on subset conditions, considering conditions both with and without repeated letters. A code for a seven-letter word will be said to satisfy the distinct RP priming constraint if it exhibits strictly positive similarities with the strings described above (12345, 34567, and 13457). The repeated RP constraint covers the finding that, given a target with one repeated letter, the same amount of priming is produced whether a deletion involves the repeated letter or not (sim(123456, 1232456) = sim(123256, 1232456)). This result is at odds with predictions coming from the open-bigram scheme (Schoonbaert & Grainger, 2004). Table 1 recapitulates these constraints.

1.8. General procedure
In the next four sections, we will assess a number of schemes on this set of criteria. The same procedure is used throughout all simulations: Each scheme is implemented using holographic coding on the one hand, and localist coding on the other, for comparison purposes. Similarities in the localist case are computed in the standard way described earlier. The holographic procedure is carried out as follows. For each trial, we randomly generate orthogonal holographic vectors of dimension 1,000. From these, we build the various codes for all conditions, compute similarity scores based on the hamming distance between vectors, and average them across 100 trials. The use of high-dimensional vectors implies very little variance for these similarity scores, typically 0.01 for dimension 1,000. Consequently, we will only report mean similarities, and we will consider that two conditions yield equal facilitation if their difference in similarities lies within 0.01.
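As an illustration, here is a minimal Python version of this procedure. The article derives similarities from the hamming distance without spelling out the normalization; the sketch assumes sim(x, y) = 1 − 2·dH(x, y)/n, which scores identical vectors at 1 and unrelated random vectors near 0, consistent with the values reported in Tables 2–4:

```python
import numpy as np

DIM, TRIALS = 1000, 100
rng = np.random.default_rng(0)

def rand_vec():
    """A dense binary vector; independent draws are quasi-orthogonal."""
    return rng.integers(0, 2, DIM)

def sim(x, y):
    """Assumed normalization: 1 - 2 * (hamming distance) / dimension."""
    return 1.0 - 2.0 * np.mean(x != y)

v = rand_vec()
print(sim(v, v))  # identical codes: exactly 1.0
print(np.mean([sim(rand_vec(), rand_vec()) for _ in range(TRIALS)]))  # ~0.0
```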
Fig. 1. Holographic slot coding: arborescence of the code. Each node represents a vector of constant size, while ⊗ and = connections stand, respectively, for the X-or and Maj operators. The ith element of the global code is 1 if and only if a majority of the ith elements of the letter/position bindings are 1.
2. Slot coding
We begin by translating the slot coding scheme into the language of holographic representations. This is done by generating position and letter vectors. Each letter vector is bound to a position vector representing its position within the string, using the X-or operator. These bindings are then chunked together using the majority rule, resulting in the holographic structure:

TABLE = T⊗p1 ⊕ A⊗p2 ⊕ B⊗p3 ⊕ L⊗p4 ⊕ E⊗p5,
where ⊗ is the binding operator X-or, ⊕ is the majority rule operator, and the pi are independent holographic vectors. Fig. 1 gives the arborescence of the holographic slot coding scheme.

2.1. Results
Table 2 presents similarity results for both the localist and the holographic versions of the slot coding scheme. As is well known, when string size is unchanged, the least destructive transformation in localist slot coding is single substitution (conditions 6, 7, 9, 10, 11), and it produces a code distant enough from the original to satisfy the stability constraint. Moreover, a single substitution has the same impact wherever it occurs in the string, thus precluding edge effects (conditions 9–11). Generally speaking, the code is quite stringent, sanctioning even the slightest transformation with very low similarity scores (conditions 2–5). A single transposition in slot coding is necessarily as destructive as a double substitution, thus preventing local or distant transposition effects (resp. conditions 12 vs. 5, and 14 vs. 8). However, localist slot coding correctly captures the absence of similarity resulting from global transposition (condition 13), and rightly makes N1R more destructive than single substitution neighbors (condition 15 vs. 6). Subset conditions produce very different similarities depending on where deletions are performed: The earlier a deletion occurs in the string, the more destructive it is. Similarities decrease steadily for deletion positions 6, 5, 4, and 2, hitting a floor at position 1, where no similarity to the original string remains (resp. conditions 16, 19, 20, 18, and 17), and as a consequence both the distinct and the repeated RP constraints are violated.
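To make the construction concrete, here is a minimal Python sketch of holographic slot coding, under the same assumptions as the sketch in Section 1.8 (binary vectors, similarity scaled from the hamming distance). If those assumptions hold, the transposition and double-substitution primes should score near the Table 2 values of 0.62 and 0.43:

```python
import numpy as np

DIM, TRIALS = 1000, 100
rng = np.random.default_rng(1)
sym = lambda: rng.integers(0, 2, DIM)      # a random binary vector ("symbol")

def maj(vs):
    # Element-wise majority rule; with an odd number of vectors, ties cannot occur.
    return (np.sum(vs, axis=0) * 2 > len(vs)).astype(int)

def slot_code(string, letters, positions):
    # Bind each letter to its position by X-or, then chunk with the majority rule.
    return maj([letters[c] ^ positions[i] for i, c in enumerate(string)])

def sim(x, y):
    return 1.0 - 2.0 * np.mean(x != y)

tl, ds = [], []
for _ in range(TRIALS):
    letters = {c: sym() for c in "12345d"}
    positions = [sym() for _ in range(5)]
    target = slot_code("12345", letters, positions)
    tl.append(sim(slot_code("12435", letters, positions), target))  # transposition
    ds.append(sim(slot_code("12dd5", letters, positions), target))  # double substitution
print(round(np.mean(tl), 2), round(np.mean(ds), 2))  # expected near 0.62 and 0.43
```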
Table 2
Slot coding scheme: similarities derived in the localist and holographic case

Condition   Prime       Target      Localist slot   Holographic slot
Stability (1–8)
(1)         12345       12345       1.00            1.00
(2)         1245        12345       0.40            0.28
(3)         123345      12345       0.60            0.36
(4)         123d45      12345       0.60            0.36
(5)         12dd5       12345       0.60            0.43
(6)         1d345       12345       0.80            0.76
(7)         12d456      123456      0.83            0.83
(8)         12d4d6      123456      0.67            0.67
Edge Effects (9–11)
(9)         d2345       12345       0.80            0.62
(10)        12d45       12345       0.80            0.62
(11)        1234d       12345       0.80            0.62
Transposition (12–15)
(12)        12435       12345       0.60            0.62
(13)        21436587    12345678    0.00            0.00
(14)        125436      123456      0.67            0.84
(15)        13d45       12345       0.60            0.67
Relative Position (16–20)
(16)        12345       1234567     0.71            0.68
(17)        34567       1234567     0.00            0.00
(18)        13457       1234567     0.14            0.12
(19)        123267      1232567     0.57            0.70
(20)        123567      1232567     0.43            0.65

Notes. Average similarity scores calculated for different prime–target conditions. For each condition, the main constraint it is intended to test is indicated. Note that technically the stability constraint extends to all conditions.
As for holographic slot coding, Table 2 first shows that it is stable—the identity condition returns 1 (condition 1), while all other transformations of five-letter strings remain well below the instability threshold. A remarkable feature of holographic slot coding is that it produces transposition effects: Transposing two letters (condition 12, sim 0.62) is less destructive than replacing them altogether (condition 5, sim 0.43). This is surprising because, as explained earlier, binding two vectors using X-or yields a new vector that bears no similarity to either of them (i.e., sim(A, A⊗B) = sim(B, A⊗B) = 0). Thus, a given letter vector T, X-ored with either the first or the second position vector, will give two unrelated vectors (i.e., sim(T⊗p1, T⊗p2) = 0). Consequently, at first sight, one might not expect any TL effects to occur. However, in Appendix B, we demonstrate that this TL effect is an inevitable consequence of the combinatorics involved in BSC's operators when order is encoded using positional cues. Indeed, in terms of hamming distance to the original string, we show that double substitutions must be more destructive than transpositions by a factor of 3/2. The reason why one observes a TL effect in holographic slot coding, despite letter and position vectors being orthogonal, can be intuited in dimension one, where each representation is a simple bipolar variable. Consider two letters A and B, respectively at positions p1 and p2 in a string of arbitrary length. All other things being equal, transposing letters A and B will only change the code for the string if A·p1 + B·p2 differs from A·p2 + B·p1. It is easy to compute the probability of this happening, and to see that it is not equal to 1/2 (chance), but to 1/4. Thus, transposing two letters is less destructive than replacing them altogether. Furthermore, this TL effect can be shown empirically to decrease with the number of letters involved in a transposition, and to vanish when all letter couples are transposed (condition 13), in good agreement with Guerrera and Forster (2008). Hence, this code displays the adequate flexibility to solve a difficult conundrum: accounting for local transposition effects without predicting any facilitation for global transpositions. To this day, and to our knowledge, holographic slot coding stands alone in this precise respect. The code also satisfies the compound TL constraint (condition 15 lower than condition 6), as well as part of the distant TL constraint: Although it mistakenly equates the impacts of distant transposition and single substitution (condition 14 vs. 7), holographic slot coding nevertheless rightly makes both of them less disruptive than distant double substitutions (condition 8).

Be it localist or holographic, however, the absolute slot coding scheme has two limitations: It cannot reproduce RP effects or edge effects, the final letter being no more important than inner ones. Although it is possible to introduce a weight gradient in the Maj operator (see Appendix A), this weighted holographic slot coding version cannot fix all the weaknesses of the slot coding scheme. In particular, it would suffer from the same alignment problem as the localist version, because all absolute positions are disturbed by an initial letter deletion.
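The one-dimensional intuition above can be checked by brute force: The snippet below enumerates all 16 bipolar assignments of A, B, p1, and p2 and confirms that a transposition changes the chunked sum in only 1/4 of the cases, rather than the chance level of 1/2.

```python
from itertools import product

# Transposing A and B changes the code iff A*p1 + B*p2 != A*p2 + B*p1,
# i.e., iff (A - B) * (p1 - p2) != 0.
changed = sum(a * p1 + b * p2 != a * p2 + b * p1
              for a, b, p1, p2 in product((-1, 1), repeat=4))
print(changed / 16)  # 0.25
```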
The only way in which relative position priming could be restored to some extent would be to use double anchors, a move that might be supported by a recent study (Fischer-Baum et al., 2010), or alternatively to introduce a small correlation between position vectors.11 Arguably, it might even be possible to combine both moves in
a holographic overlapping and double-anchored scheme, which could presumably inherit the qualities of each.12 Either way, we see that to account for edge effects and relative position priming, a positional code needs to introduce several additional hypotheses. Thus, for the sake of parsimony, we should first make sure that all simple alternatives have been ruled out.

2.1.1. Summary
In summary, localist slot coding can satisfy three constraints: stability, compound TL, and global TL, but it fails on edge effects, local and distant TL, and the RP constraints. Holographic slot coding satisfies the same constraints plus local TL, which obtains by virtue of the binding and chunking operators. This shows that, contrary to common belief, local TL is not sufficient to rule out the slot coding scheme; instead, the real obstacle appears to lie in the RP constraints.
3. Open-bigrams
One simple alternative to slot coding is the open-bigram scheme, which we now proceed to study, first in its constrained version and then in the unconstrained one. Again, in both cases, we will compare the localist and holographic implementations.

3.1. Constrained open-bigrams
The COB code can be translated into holographic representations as follows:

TABLE = TA ⊕ TB ⊕ TL ⊕ AB ⊕ AL ⊕ AE ⊕ BL ⊕ BE ⊕ LE,
where the code for bigram LiLj is:

LiLj = Li⊗l ⊕ Lj⊗r

and l, r are two orthogonal vectors coding, respectively, for the left and right positions within a bigram. The arborescence of the holographic COB code is depicted in Fig. 2.

Fig. 2. Holographic COB (constrained open-bigrams): arborescence of the code.

3.1.1. Results
Table 3 shows that, whether localist or holographic, the COB scheme is stable. It is also more flexible than slot coding, as is apparent from the fact that similarity scores have generally increased. Despite this general increase, similarity rankings remain more or less unchanged with respect to slot coding. The main difference is that single substitutions in five-letter strings (conditions 6, 9, 10, 11) are now more disruptive than repetitions (condition 3), additions (condition 4), or transpositions (condition 12). However, the most destructive transformation for five-letter strings is still double substitution (condition 5), while the least destructive is still local transposition (condition 12). Switching to a bigram scheme does not improve the number of constraints satisfied but changes their variety, allowing one to account for some aspects of TL and RP priming at the same time. Indeed, deleting the initial letters from a string (resp. transposing two letters) now results in high similarities because the relative position of letters is maintained (resp. not completely lost), and thus several bigrams remain from the original. However, this comes at the cost of the global, distant, and compound TL constraints (conditions 13–15), for which both codes now make wrong predictions. Global TL produces high similarities because many letters are left in the same relative order and thus generate the same bigrams; distant TL is less disruptive than single substitution because it affects four bigrams rather than five; and compound TL is violated because the N1R condition is exactly as disruptive as single substitution. In this last case, this is because an N1R string (13d45) maintains the same relative order to the target as a single substitution neighbor (1d345), producing exactly the same set of bigrams except for one: bigram "3d" in the first case and "d3" in the second. On both accounts, this leads to identical similarities with the original string (12345).

There are also differences between the two implementations of COB. A first difference is that the holographic, but not the localist, code satisfies the repeated RP constraint (i.e., conditions 19 and 20 should produce equal similarities). In the localist case, each bigram counts only once in the representation, and thus redundant bigrams play no role in similarity calculations. The undue difference between the repeated condition (19) and the nonrepeated condition (20) then occurs because the latter shares ten bigrams with the target, whereas the former shares only nine.13 In the holographic case, on the other hand, redundant bigrams all contribute through the majority rule to the final representation.
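For concreteness, the localist COB similarities in Table 3 below can be reproduced in a few lines of Python. Here, bigrams implements open-bigram extraction with a configurable gap (2 for COB, arbitrarily large for UOB), and localist_sim is the shared-units-over-target-units measure described in Section 1.6:

```python
def bigrams(s, max_gap):
    """Ordered letter pairs with at most `max_gap` intervening letters."""
    return {s[i] + s[j]
            for i in range(len(s))
            for j in range(i + 1, min(len(s), i + max_gap + 2))}

def localist_sim(prime, target, max_gap):
    """Shared bigram units divided by the number of units in the target."""
    p, t = bigrams(prime, max_gap), bigrams(target, max_gap)
    return len(p & t) / len(t)

print(round(localist_sim("12435", "12345", 2), 2))  # 0.89: condition 12, Table 3
print(round(localist_sim("12dd5", "12345", 2), 2))  # 0.22: condition 5, Table 3
print(localist_sim("123345", "12345", 99))          # 1.0: localist UOB instability
```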
Table 3
Constrained open-bigrams (COB): similarities derived for holographic and localist codes across selected conditions

Condition   Prime       Target      Localist COB    Holographic COB
Stability (1–8)
(1)         12345       12345       1.00            1.00
(2)         1245        12345       0.56            0.59
(3)         123345      12345       0.78            0.70
(4)         123d45      12345       0.78            0.67
(5)         12dd5       12345       0.22            0.42
(6)         1d345       12345       0.55            0.78
(7)         12d456      123456      0.58            0.78
(8)         12d4d6      123456      0.33            0.79
Edge Effects (9–11)
(9)         d2345       12345       0.67            0.65
(10)        12d45       12345       0.55            0.64
(11)        1234d       12345       0.67            0.65
Transposition (12–15)
(12)        12435       12345       0.89            0.76
(13)        21436587    12345678    0.67            0.72
(14)        125436      123456      0.66            0.84
(15)        13d45       12345       0.56            0.78
Relative Position (16–20)
(16)        12345       1234567     0.60            0.59
(17)        34567       1234567     0.60            0.59
(18)        13457       1234567     0.47            0.62
(19)        123267      1232567     0.82            0.83
(20)        123567      1232567     0.91            0.82

Note. Entries are average similarity scores for each prime–target condition.
Hence, in conditions (19) and (20), the strings are really made of 12 bigrams each, and as it turns out, they share exactly 10 of them with the target, resulting in equal similarities. Another difference is that localist COB displays reverse edge effects, whereas there are little or no such effects in the holographic code (conditions 9–11). It is only natural that the former displays reverse edge effects. Indeed, Fig. 2 shows that each edge letter is involved in only three bigrams, against four bigrams for inner letters. Thus, outer substitutions disturb fewer bigram units than inner substitutions, and they result in codes that are more similar to the original. This implies that reverse edge effects will be produced by just about every COB scheme, whatever the length of the word or the authorized letter gap. To avoid this reverse effect, one must resort to unconstrained bigrams or edge units. The same logic should apply to the holographic code, and we should thus ask why it yields very small reverse edge effects, if any. The key difference is that, as already noticed, the localist version assigns a similarity of zero between any two distinct bigram units. On the contrary, the holographic version provides graded similarities between distinct bigrams, for the same reasons previously exposed for slot coding. Thus, replacing bigram "TA" by, for example, bigram "XA" does not completely destroy similarities in a holographic scheme, and this might provide one cue as to why reverse edge effects are counterbalanced. We will shortly give a more thorough explanation of edge effects in the context of unconstrained open-bigrams.

3.1.2. Summary
Contrary to slot coding, the COB scheme can accommodate TL constraints as well as RP constraints. Localist COB satisfies three constraints: stability, local TL, and distinct RP. Again, the holographic version satisfies precisely the same constraints plus another, repeated RP, which obtains essentially because repeated bigrams contribute more to the holographic word code. This solves the difficult conundrum for bigram schemes introduced in Schoonbaert and Grainger (2004), showing that repeated RP does not challenge the COB scheme itself, but only its localist implementation.

3.2. Unconstrained open-bigrams
The unconstrained open-bigram code can be translated into holographic representations as follows:

TABLE = TA ⊕ TB ⊕ TL ⊕ TE ⊕ AB ⊕ AL ⊕ AE ⊕ BL ⊕ BE ⊕ LE
Its arborescence, exactly identical to COB except for additional bigrams (bigram TE in our case), is given in Fig. 3.
Fig. 3. Holographic UOB (unconstrained open-bigrams): arborescence of the code.
3.2.1. Results
The UOB scheme behaves like the constrained one in a number of ways. For the same reasons as above, both the localist and the holographic UOB versions satisfy the local TL and distinct RP constraints, but they fail to reproduce distant, global, and compound transposition effects. Nevertheless, Table 4 indicates that the differences due to the code are most salient for the UOB scheme: In fact, according to our benchmark presented in Table 6, localist and holographic UOB rank, respectively, last and first in terms of explanatory power. Localist UOB violates the stability constraint: Insertions or repetitions produce codes equivalent to the original, as defined by the similarity measure. This problem does not appear in the holographic implementation, where the least disruptive transformation is still single transposition (condition 12), and the similarity remains well below our stability criterion (0.82 < 0.95). This is because in holographic codes, the similarity measure is not target oriented but proceeds from the symmetric hamming distance. A second difference is that holographic UOB satisfies the repeated RP constraint. As in the case of COB, localist UOB is hindered by the exclusion of redundant bigrams from similarity calculations. Another difference concerns edge effects. As expected from the previous section, the reverse edge effects observed with localist COB are absent with UOB. Indeed, single substitutions, be they inner or outer, now impact exactly the same number of bigrams: L−1, where L is the string length. However, the holographic implementation now displays strong edge effects. It is possible to gain some insight into why edge effects appear in the UOB code by returning to the arborescence in Fig. 3. Let us place ourselves in the one-dimensional case and look at the four bigrams in which the inner letter B occurs. Across these bigrams, letter B appears twice on the left and twice on the right. Thus, if the left and right position variables have different values, the "vote" of letter B in the majority rule simply cancels itself out.
Table 4
Unconstrained open-bigrams (UOB): similarities derived for holographic and localist codes across selected conditions

Condition   Prime      Target     Localist UOB   Holographic UOB
Stability (1–8)
(1)         12345      12345      1.00           1.00
(2)         1245       12345      0.60           0.62
(3)         123345     12345      1.00           0.79
(4)         123d45     12345      1.00           0.73
(5)         12dd5      12345      0.30           0.45
(6)         1d345      12345      0.60           0.79
(7)         12d456     123456     0.66           0.85
(8)         12d4d6     123456     0.40           0.75
Edge Effects (9–11)
(9)         d2345      12345      0.60           0.49
(10)        12d45      12345      0.60           0.62
(11)        1234d      12345      0.60           0.53
Transposition (12–15)
(12)        12435      12345      0.90           0.82
(13)        21436587   12345678   0.86           0.79
(14)        125436     123456     0.80           0.92
(15)        13d45      12345      0.60           0.79
Relative Position (16–20)
(16)        12345      1234567    0.48           0.52
(17)        34567      1234567    0.48           0.51
(18)        13457      1234567    0.48           0.61
(19)        123267     1232567    0.71           0.89
(20)        123567     1232567    0.88           0.89

Note. Boldface type indicates similarities.
Thus, the "vote" of an outer letter naturally carries much more weight than the vote of an inner letter. As a consequence, inner substitutions are less disruptive than outer ones. Note that this "voting" combinatorics induced by the operators is also present in holographic COB, although there it does not result in edge effects, because it must fight the previously discussed reverse tendency resulting from the constraint on bigrams.

3.2.2. Edge effects in holographic UOB
From this, we can draw three predictions. First, the same mechanism should hold regardless of word length, so that edge effects should not be restricted to this example. Second, edge effects should not arise only from the contrast with this particular inner position. In fact, the similarity pattern created by a single substitution should not be constant across inner positions, but should increase as one approaches the edges, because the number of cases in which the substituted letter's vote cancels out diminishes as one gets closer to the edges. Third, and for the same reason, the similarity obtained with inner left substitutions (resp. inner right substitutions) should be proportional to the binomial number C_r^l (resp. C_l^r), where l and r, respectively, stand for the number of appearances of the substituted letter in left and right positions. To test these claims, we computed similarities for single substitutions across all positions in strings of length 5 and 7, and compared them with the corresponding human data for five-letter words reported in Grainger and Jacobs (1993). Fig. 4 shows, first, that the simulations can match the human data obtained for five-letter strings. Although Grainger and Jacobs (1993) found no significant difference between outer conditions or between inner ones, and the medial condition was not carried out, there is a visible tendency toward an inverse U-shaped form, which is also picked up by the holographic UOB similarities.
Fig. 4. Single-substitution priming across position for five-letter words. Bars: experimental results from Grainger and Jacobs (1993) (central condition unavailable). Curve: similarities from holographic UOB.
What we can say, then, is that in this code, despite the lack of any information on absolute position or any weight gradient, the correct priming pattern appears by virtue of the prominence of letters in left or right positions across bigrams. This saliency pattern also impacts distinct RP priming. Having equal letter weights, the localist UOB code naturally shows constant similarities for initial, medial, or final deletions. On the contrary, in the holographic implementation, we find that initial or final deletions should be more disruptive than medial ones. This is not in agreement with behavioral data from Grainger et al. (2006), which show a tendency for less facilitation with medial primes.

3.2.3. Summary
Differences between implementations are most salient within the UOB scheme. While localist UOB fails on all constraints but local TL and distinct RP, holographic UOB additionally satisfies stability, edge effects, and repeated RP. Stability and repeated RP are recovered because all bigrams contribute to determining the final holographic code. Edge effects emerge naturally from the chunking operator, which tends to cancel out contributions from central letters in a holographic code. This simple explanation could easily have been falsified by behavioral data, which on the contrary support the holographic UOB account (Grainger & Jacobs, 1993).

3.3. LCD
We finally proceed to implement a holographic code in the spirit of the LCD proposal (Dehaene et al., 2005). The arborescence of the code is illustrated in Fig. 5.
Fig. 5. Arborescence of holographic LCD (local combination detector). A three-level hierarchy progressively achieves recognition of larger letter combinations. Simple X-or operations (only shown for the first and last bigrams) perform the transition between absolute position (letter level) and relative position (bigram level) coding.
The letter and bigram levels of the hierarchy are, respectively, identical to the holographic slot and COB codes. However, in the previous sections we studied them in isolation, whereas the two schemes are now connected levels in the LCD architecture. Apart from the difference in the authorized gap between constituent letters (one for the original LCD against two here), our implementation departs from LCD in several important ways. LCD assumes that some position specificity remains at the bigram level. This has the consequence that duplicated banks of bigram units must exist at this level, each receptive to a particular part of the string. On the contrary, our code follows the initial COB approach, which assumes only a single bank of units, relevant bigrams being activated regardless of the absolute positions of letters in the string. Also, in LCD the letter-to-bigram transition is thought to be achieved by banks of bigram units with overlapping receptive fields, whereas in the proposed code, this transition between position specificity and invariance is achieved using simple X-or operations taking place between the letter and bigram levels. This transition can be illustrated by building the bigram unit $L_iL_j := L_i \otimes l \oplus L_j \otimes r$ from letters $L_i$ at position $p_i$ and $L_j$ at position $p_j$, $i < j$. This requires two arbitrary left and right vectors $l$ and $r$ and the following operations:

\[
\begin{aligned}
L_iL_j &= (L_i \otimes p_i) \otimes (l \otimes p_i) \oplus (L_j \otimes p_j) \otimes (r \otimes p_j) \\
&= (L_i \otimes l) \otimes (p_i \otimes p_i) \oplus (L_j \otimes r) \otimes (p_j \otimes p_j) && \text{because X-or is associative and commutative} \\
&= L_i \otimes l \oplus L_j \otimes r && \text{because X-or is its own inverse.}
\end{aligned}
\]
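Continuing the sketch above, the cancellation of position vectors can be verified directly; the identity is exact (not merely approximate) because X-or is a componentwise involution:

```python
# Letter-to-bigram transition: binding each term once more with its own
# position vector cancels that vector out, leaving only the role bindings.
Li, Lj, l, r, pi, pj = (rand_vec() for _ in range(6))

with_positions = maj(xor(xor(Li, pi), xor(l, pi)),
                     xor(xor(Lj, pj), xor(r, pj)))
without_positions = maj(xor(Li, l), xor(Lj, r))
assert np.array_equal(with_positions, without_positions)
```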
At the quadrigram level, we use units allowing for a one-letter gap. Again, this differs from LCD, which makes no such assumption but holds that only frequent quadrigrams are activated. As this code does not incorporate any notion of frequency, another selection mechanism must be assumed, and open-quadrigrams of gap 1 seem to achieve the correct compromise between number and diversity of units. Alternatively, this level can also be seen as an extension of Mozer's (1987) BLIRNET to four-letter units. Eventually, the LCD code is built in the following way:

TABLE = TABL ⊕ TABE ⊕ TALE ⊕ TBLE ⊕ ABLE

where each quadrigram unit is activated by consistent units from the bigram level, for instance:

TABE = TA ⊕ TB ⊕ AB ⊕ AE ⊕ BE

and where, for example, the code for bigram TA is:

TA = T ⊗ l ⊕ A ⊗ r

At any level, a unit's code is obtained in the usual way, by chunking the codes of consistent subunits from the previous level, as sketched below.
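A sketch of the resulting hierarchy, reusing the helpers from the earlier sketches; the gap parameters match the text, but the function names and decomposition are our own:

```python
def open_ngrams(word, n, gap):
    """Index tuples of length n whose successive letters skip at most `gap`."""
    return [ix for ix in combinations(range(len(word)), n)
            if all(b - a - 1 <= gap for a, b in zip(ix, ix[1:]))]

def lcd(word):
    """Holographic LCD sketch: open-quadrigrams of gap 1, each chunked from
    the bigrams (word-level gap <= 2) that activate it."""
    quads = []
    for ix in open_ngrams(word, 4, 1):
        bgs = [bigram(word[i], word[j]) for i, j in combinations(ix, 2)
               if j - i - 1 <= 2]  # e.g., TE is excluded from quadrigram TABE
        quads.append(maj(*bgs))
    # For TABLE this chunks TABL, TABE, TALE, TBLE, and ABLE.
    return maj(*quads)
```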
Note the absence of positional vectors in the final code, resulting from the hypothesized unbinding that takes place between the letter and bigram levels. In effect, all absolute position information is lost at this point and replaced by relative information, although Fig. 5 features an absolute letter level and unbinding operations to highlight the relationship to LCD. Also, bigram TE, although consistent, does not participate in quadrigram TABE: this bigram is not activated in the first place, because the constraint on the authorized letter gap would be violated.

3.3.1. Results
Table 5 reports similarity scores obtained using holographic LCD, as well as its localist open-quadrigram counterpart. Whatever the code, this version of LCD is stable and can account for distinct RP, although in both cases outer conditions show higher similarities than inner ones. This contrasts with what we observed for holographic UOB, where edge effects induced the opposite tendency for relative priming conditions. This is because in holographic LCD, large units and a reduced gap conspire to produce a code that is more sensitive to contiguity. There is a lack of flexibility in the localist quadrigram code that is particularly visible for double substitutions, where the code produces no similarity at all but where the holographic version assigns generous similarities. As a consequence, both codes fail to reproduce distant TL: While the holographic code mistakenly predicts more facilitation for a distant transposition than for a single substitution, the localist code predicts no priming whatsoever with double substitution or distant transposition primes. Neither the localist nor the holographic version of LCD satisfies the repeated RP constraint, assigning different similarities in conditions (19) and (20). Moreover, neither version can account for compound TL. This can easily be seen in the localist case, because N1R and single substitution primes have exactly one quadrigram in common with the target. In the holographic case, things are more complicated because quadrigrams can have graded similarities, but the fact that the quadrigrams that differ in the two conditions are only one transposition away (and thus have the same distance to the corresponding quadrigram in the target) appears sufficient to make the codes themselves equally distant to the target. As for the differences between implementations, it is worth noting that the rigidity of the localist code does not weigh entirely against it, as it appears to be in just the right proportion to capture both local and global TL effects, whereas the holographic code can only reproduce local TL. It might come as a surprise that two codes that differ in every respect, holographic slot coding and localist open-quadrigrams, are the only ones able to accommodate local and global TL effects at the same time. But intuitively we can see that they actually achieve similar compromises in terms of flexibility. Slot coding is a priori a strictly compartmented scheme, but implementing it holographically has the effect of suppressing positional frontiers to some extent. Open-quadrigrams inherit the rigidity of localist representations and use large units, but this rigidity is relaxed by using a relative position scheme that tolerates noncontiguity. However, a problem for localist LCD is that, as noted earlier, it predicts no similarity between four-letter strings, which is falsified experimentally (Humphreys et al., 1990).
Table 5
Unconstrained local combination detector (LCD): similarities derived for holographic and localist codes across selected conditions

Condition   Prime      Target     Localist LCD   Holographic LCD
Stability (1–8)
(1)         12345      12345      1.00           1.00
(2)         1245       12345      0.20           0.53
(3)         123345     12345      0.40           0.45
(4)         123d45     12345      0.40           0.51
(5)         12dd5      12345      0.00           0.37
(6)         1d345      12345      0.20           0.78
(7)         12d456     123456     0.22           0.80
(8)         12d4d6     123456     0.00           0.70
Edge Effects (9–11)
(9)         d2345      12345      0.20           0.56
(10)        12d45      12345      0.20           0.63
(11)        1234d      12345      0.20           0.56
Transposition (12–15)
(12)        12435      12345      0.40           0.79
(13)        21436587   12345678   0.00           0.62
(14)        125436     123456     0.00           0.87
(15)        13d45      12345      0.20           0.78
Relative Position (16–20)
(16)        12345      1234567    0.38           0.58
(17)        34567      1234567    0.38           0.58
(18)        13457      1234567    0.15           0.49
(19)        123267     1232567    0.23           0.82
(20)        123567     1232567    0.31           0.78

Note. Boldface type indicates similarities.
Considering Fig. 5, we see that a localist open-quadrigram code cannot show edge effects in the case of five-letter strings, where each letter participates in exactly four quadrigrams, but it will produce reverse edge effects in general, since inner letters would participate in more quadrigrams than outer ones. On the contrary, the holographic implementation does result in edge effects. The reason why edge effects are produced in holographic LCD can again be understood by considering the links to the quadrigram level. The arborescence in Fig. 5 shows that for each quadrigram unit, its set of five activating bigrams is balanced. That is, across these bigrams, letters can appear at most twice at the same position, thus preventing systematic majorities from occurring. However, edge quadrigrams are the exception to the rule. Embodying a sequence of contiguous letters, these units are activated not by five, but by six bigrams. This disturbs the balance, and the initial (resp. final) letter is more represented in quadrigram 1234 (resp. 2345) than any other letter at any position in the remaining bigrams. As a result, transformations involving edge letters are more destructive than others. It should be noted that edge effects in holographic LCD are also dependent on word length and seem to reverse for longer strings. Nevertheless, this code provides a clear demonstration that edge effects in holographic codes are not limited to unconstrained open-bigrams, but can also appear with constrained units of higher grain, even though in this case they appear not to be constant.

3.3.2. Summary
Our study of an LCD-like scheme suggests that, just like open-bigrams, it can reproduce both TL and RP effects. Localist LCD is stable, satisfies distinct RP, and even accommodates local and global TL at the same time. Despite these interesting properties, the code appears to be falsified by priming results on four-letter words, a problem which is avoided in the holographic implementation. Holographic LCD has generally the same profile, but trades global TL for edge effects, illustrating that holographic edge effects can occur for relative schemes other than open-bigrams.
4. Discussion
Table 6 recapitulates how the various codes behave with respect to our benchmark of constraints. It is clear from Table 6 that not all constraints are equally easy to satisfy. Distant TL in particular resists all the schemes considered above, whatever the implementation. This is because most of the codes wrongly imply that distant TL is less disruptive than single substitution. A little less elusive is the compound TL constraint, which remains beyond the reach of all codes but localist or holographic slot coding. A scheme that has also been known to satisfy compound TL is Spatial Coding (Davis & Bowers, 2006), and we will shortly consider it in more detail. Much more reachable are the global TL and edge constraints, as they are each satisfied by at least two different codes. The easiest constraints are stability and local TL, met by seven codes out of eight. There is also a general trend in Table 6 for TL constraints to be more often satisfied by holographic slot coding, and RP constraints by holographic relative position codes (COB, UOB, and LCD).
Table 6
Summary of the constraints satisfied by the codes

                   Stability  Edge Effects  Local TL  Distant TL  Compound TL  Global TL  Distinct RP  Repeated RP
Localist
  Slot coding         ✓           –            –          –           ✓            ✓           –            –
  COB                 ✓           –            ✓          –           –            –           ✓            –
  UOB                 –           –            ✓          –           –            –           ✓            –
  LCD                 ✓           –            ✓          –           –            ✓           ✓            –
Holographic
  Slot coding         ✓           –            ✓          –           ✓            ✓           –            –
  COB                 ✓           –            ✓          –           –            –           ✓            ✓
  UOB                 ✓           ✓            ✓          –           –            –           ✓            ✓
  LCD                 ✓           ✓            ✓          –           –            –           ✓            –
Other schemes
  Spatial Coding      –           ✓            ✓          ✓           ✓            ✓           ✓            ✓
  Seriol              ✓           –            ✓          ✓           ✓            –           ✓            –

Note. COB, constrained open-bigrams; UOB, unconstrained open-bigrams; LCD, local combination detector; RP, relative position; TL, transposed letter.
A second conclusion we can draw from Table 6 is that representational format has a profound impact on the number and nature of the constraints satisfied by a scheme: Holographic representations are globally better at explaining the available data. Indeed, with the notable exception of localist LCD on the global transposition effect, holographic codes always account for at least the same effects as localist ones. In addition, this implementation often satisfies more constraints: It recovers TL effects in the slot coding scheme, produces edge effects when used with open-ngrams, and stabilizes UOB while making it able to account for repeated RP. Edge effects arise in codes using ngram units as a consequence of the holographic chunking operator, the majority rule. Importantly, we are not claiming that edge effects must exclusively be explained in this way. Rather, we report how a contribution to edge effects can arise from holographic codes. Several other factors may be at work to produce edge effects, including crowding effects and top-down facilitation (Tydgat & Grainger, 2009). As to the mechanisms by which TL effects appear in the slot coding scheme, we have described them informally earlier, and they are laid out formally in Appendix B. It is important to note that this mechanism is also active for other schemes, for instance, by making units AB and BA more similar than chance in open-bigrams. Generally speaking, both properties take root in the same basic quality of holographic coding, which is that compound units are similar to one another. This makes for more flexible codes, introduces rich majority combinatorics, and breaks positional frontiers. Repeated RP effects are produced in holographic COB and UOB, again as a result of the holographic Maj operator. While localist versions are hindered by their exclusion of
redundant bigrams in similarity scores, the Maj operator in effect assigns different weights to bigrams in proportion to their redundancy. Taking all bigrams into account thus appears sufficient to satisfy the repeated RP constraint. We also introduced a code in the spirit of the LCD model, although admittedly with considerable liberty. The localist version appears to be too rigid to account for single substitution priming results in words of various lengths (Humphreys et al., 1990; Perea & Lupker, 2004). Being distributed and hierarchical, the holographic version is much more in line with the LCD proposal, and it satisfies the same constraints except for global TL, which it "exchanges" for edge effects. Despite these achievements, Table 6 shows that the most efficient holographic code with respect to our benchmark is holographic UOB, which satisfies one more constraint without invoking a quadrigram level. As acknowledged by Vinckier et al., the quadrigram level in LCD remains hypothetical, and in fact even the evidence from Vinckier et al. (2007) is debatable because of some discrepancies in stimulus legality across conditions.14 Second, as we have seen, edge effects in UOB are much more robust than in LCD. For now, holographic UOB seems to be the code richest in phenomena and most parsimonious in hypotheses.

4.1. Comparison with other schemes
In this article, we have been essentially concerned with the differences arising when a given scheme is implemented with distributed or localist representations. However, in this comparison process, we have left aside some well-known schemes, either because they were not easily translated into the holographic format (e.g., Spatial Coding, whose assumptions on how to compute similarities are at odds with Hamming distance), or because they were closely related to schemes that had already been considered (Seriol is related to COB, and Overlap to slot coding). Although direct comparisons with previous codes would be hindered by differences in degrees of freedom (Spatial Coding and Seriol have more degrees of freedom because they both use activation gradients and edge marker parameters), it is still of interest to assess how the latest versions of these schemes behave on our benchmark. The bottom panel in Table 6 shows the performances of Spatial Coding and Seriol (see Appendix C for detailed values). It is clear that Spatial Coding has the upper hand, satisfying all our constraints but stability (see Appendix C, repeated letter condition 3). However, edge effects occur in Spatial Coding by virtue of an "end letter marker" parameter that increases the importance of edge letters in similarity calculations. More impressive is the coverage of transposition constraints: Spatial Coding manages to satisfy single, distant, global, and compound TL constraints at the same time. This is particularly impressive considering the failure of previous codes to account for distant TL (most of the time because they wrongly deem single substitution to be more disruptive), and it provides a good occasion to gain some insight into Spatial Coding's inner workings. Recall that in Spatial Coding a letter contributes to similarity scores only if it is shared by both strings, and the contribution is then based on the misalignment in absolute letter position across strings, that is, the difference between gradient values. Hence, single substitution only results in one missing contribution but maximal
contributions from all other letters, whereas in distant TL there is no missing contribution but two letters undercontribute since they are misaligned. In general, any gradient such that twice a TL contribution remains inferior to a full contribution will satisfy this part of the TL constraint. This holds for the gradient used in Davis (in press), but scaling it by half would break the TL constraint, giving similarities for distant TL and single substitution of 0.9 and 0.88, respectively. Similar sensitivities arise for other conditions, prompting one to ask how the activation gradient is learned in Spatial Coding. The question of learning is touched upon in Davis (in press), where it is assumed that in the full visual word recognition model, the gradient would be in phase with connection weight values learned by self-organizing mechanisms. What Spatial Coding seems to be telling us is that whereas an absolute scheme can indeed account for all TL constraints, these are by no means inherent to the scheme itself but specific to the gradient, which in turn will depend on the learning environment. Thus, at the very least, the Spatial Coding account raises doubts as to the universality of TL constraints. Although Seriol performs less well than Spatial Coding, it is on a par with the best code previously studied (holographic UOB). Seriol is a bigram-based scheme that also uses an activation gradient, but one in which values do not serve as position markers and are only determined by the contiguity of letter constituents (zero-gap = 1, one-gap = 0.7, two-gap = 0.5). Seriol also introduces edge bigrams, and thus the absence of edge effects might come as a surprise. This is because a substitution at the edge leaves two one-gap bigrams and one two-gap bigram unchanged (contributing 2 × 0.7 × 0.7 + 1 × 0.5 × 0.5 = 1.48 to the final similarity), whereas a substitution in medial position leaves one one-gap bigram and two two-gap bigrams unchanged (contributing 2 × 0.5 × 0.5 + 1 × 0.7 × 0.7 = 0.98 to the final similarity). This holds with or without edge bigrams, as their contribution is factored out across conditions. Hence, to obtain edge effects, the gradient should assign higher values to two-gap than to one-gap bigrams, a move that is not easily interpreted. However, like Spatial Coding and unlike all other codes, Seriol manages to account for distant TL. In the case of Seriol, though, this arises because of similar trade-offs taking place between one-gap and two-gap contributions, which further involve contributions from cross-terms because TL turns one-gap bigrams into two-gap bigrams. Apart from gradient values, this kind of account is sensitive to the authorized letter gap between bigrams: For instance, allowing for three-gap bigrams with 0.5 activation values would break the distant TL constraint in Seriol (giving similarities for distant TL and single substitution of 0.7 and 0.68, respectively). In conclusion, Spatial Coding and Seriol perform very well on our benchmark of constraints, although the former has the upper hand. Interestingly, in both cases, the introduction of a gradient proves sufficient to account for the TL constraint that resisted all previous codes. Nevertheless, using gradients also implies dealing with more degrees of freedom, and we have argued that to avoid the arbitrary fitting that can ensue, the question of how this gradient is learned must be addressed.
4.2. Below and above holographic string encoding
In all the codes we have studied, letter vectors were assumed to be orthogonal. However, it is clear that some letters (like "C" and "O") share more visual features than others.
Introducing these visual features proves straightforward in a holographic implementation, because one only needs to generate vectors for as many visual features and 2D positions as required. Letters can then be created by X-oring visual features with the adequate 2D-position vectors and chunking these together, as sketched below.
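For instance, in the following sketch (with a deliberately crude, hypothetical feature inventory, reusing the helpers from the earlier sketches), the letters O and C come out similar rather than orthogonal:

```python
# Hypothetical visual features and 2D positions; only the construction
# principle matters here, not this particular inventory.
features = {f: rand_vec() for f in ("curve", "vertical", "horizontal")}
positions2d = {p: rand_vec() for p in ("top", "bottom", "left", "right")}

def letter_from_features(spec):
    """Bind each feature to its 2D position, then chunk the pairs."""
    return maj(*[xor(features[f], positions2d[p]) for f, p in spec])

O = letter_from_features([("curve", "top"), ("curve", "bottom"),
                          ("curve", "left"), ("curve", "right")])
C = letter_from_features([("curve", "top"), ("curve", "bottom"),
                          ("curve", "left")])
print(sim(O, C))  # > 0: letters sharing features now share structure
```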
Using these codes in a connectionist network would be the next step in specifying a lexical access model. As distributed string encoding schemes, holographic codes can directly serve as input for PDP models (Plaut et al., 1996; Seidenberg & McClelland, 1989), or for unsupervised networks such as self-organizing maps or Hopfield networks. Such an extension has been studied in Hannagan (2009), showing that the joint use of holographic representations and attractor networks is not only possible but also fruitful. Indeed, we found that holographic string codes were sufficiently distant from one another to allow for efficient retrieval and storage under noisy conditions, and they supported an interpretation of frequency effects as emerging from noise during learning. It might also be of interest to apply holographic coding to sequences of symbols or digits, considering the evidence that their encoding recruits at least partially overlapping mechanisms (Carreiras, Duñabeitia, & Perea, 2007; Tydgat & Grainger, 2009). Indeed, there is nothing specific to letter strings in the holographic codes we have presented, and our similarity measures should also hold for strings of digits and symbols. Thus, according to the holographic panel in Table 6, the fact that edge effects are still observed for digits but not for symbols (Tydgat & Grainger, 2009) would suggest that these do not use the same scheme. We speculate that the same holographic coding system might resort to an absolute scheme such as slot coding for symbols, because symbol combinations do not appear regularly in the visual environment, but might use a relative scheme such as UOB in the case of letters and digits, because they appear more frequently in short combinations.

4.3. Limits of holographic string encoding
One limitation of this holographic coding approach is the ill-defined Maj operator: Because the representations we use are binary, ties can arise for even numbers of arguments, and they need to be broken arbitrarily. This parity issue is inherent to bipolar holographic codes (it is the price we pay for simplicity), and it should be noted that it disappears in the continuous case with real-valued vectors. Other systems using bipolar codes, such as Hopfield networks, are afflicted with the same problem, but being dynamic they can resort to breaking the tie in favor of the value at t − 1. The same solution would work for dynamic systems based on binary holographic coding. Finally, in the static case the parity issue can be circumvented in situations where different arguments of the Maj ought to be given different strengths, by generalizing the Maj operator to a weighted Maj (see Appendix A). Although we have argued that holographic codes have more biological plausibility because they are distributed and support constituent structure, we have not discussed how the brain would actually compute them. Hence, this study cannot substitute for a computational model of string encoding, which needs to be specified in future work, possibly using randomly connected sigma-pi neurons as suggested elsewhere (Plate, 2000). Nevertheless,
Hannagan et al. (in press) recently showed that holographic slot coding with correlated location vectors captures very sharply the computations performed by a location-invariant feed-forward network trained by backpropagation. This correspondence holds between holographic letter/location bindings and input-to-hidden weight vectors, and it stems from broken patterns of symmetries in network connection weights. That holographic codes can emerge from a well-known learning algorithm and with a very simple network structure certainly brings us closer to understanding how they could be implemented neurally. One might also ask how these results would carry over to sparser holographic codes. Indeed, in this article we have used a dense code for simplicity (p(1) = p(−1) = 1/2), but as noted earlier, the code can be made as sparse as required (Kanerva, 1995). As in the process the X-or operator is not affected and the Maj operator only lowers its threshold, we expect that the simulations presented here will not vary drastically with the density parameter. In particular, the emerging TL and edge effects would be anticipated to diminish, but not to disappear, as they are driven by the nature of both holographic operators and not by their thresholds. Finally, a limitation of the string encoding approach in general, and thus of the holographic codes we studied here, is the silent treatment of phonology. Indeed, the extent to which phonological representations interact with visual ones during string encoding is not considered at all (Ziegler, 1995). We would like to emphasize, however, that phonological information can easily be introduced into a static holographic code, either indirectly as a unit selection mechanism, or directly by creating phonological vectors and introducing them adequately into the codes.
5. Conclusion
In this article, we have proposed a string encoding approach that uses distributed representations and takes into account constituent structure. We used the BSC, a special case of holographic reduced representations designed to capture and manipulate combinatorial structures. We have assessed these codes against their localist counterparts using several different schemes and a number of criteria coming from masked priming experiments. Our first conclusion is that schemes do not behave in the same way depending on whether they are implemented using localist or holographic codes. This implies that a given scheme (e.g., slot coding or open-bigrams) cannot be ruled out simply because one implementation of it fails to satisfy a constraint (resp. single TL and repeated RP), as this only falsifies one possible implementation (localist). Our results show that to falsify a string encoding scheme properly, its distributed implementations should also be considered. Our second conclusion is that, with one exception, using holographic representations never jeopardizes the qualities observed in the localist case; on the contrary, holographic coding can show a number of desirable emergent properties. Particularly surprising is the appearance of edge effects in UOB and LCD, and of transposition effects in slot coding. We have explained these phenomena in detail, formally in the case of transposition, and empirically in the case of edge effects.
Among holographic codes, we found that UOB could account for a large number of effects (stability, edge effects, relative position priming with or without repetitions, local transposition) while at the same time being more parsimonious than LCD in its hypotheses. More work is required to understand the properties and limits of these codes, but they are laid out in a mathematical framework, which encourages formal analysis. Future research could investigate the performance of holographic codes that use gradients, correlated position vectors, phonological information, or frequency as a unit selection mechanism. Word codes can also be made more sophisticated by building on arbitrarily detailed visual feature codes, and the holographic approach as a whole can be extended to digit or symbol processing.
Notes
1. In this article, we use string encoding and letter position coding as interchangeable terms. Although the latter is the standard one, our preference goes to the former because it is less misleading: As we will see, schemes do not always encode order at the letter level.
2. Although we will later discuss redundant localist coding, which defines a special many-to-one relationship.
3. The probability that two different strings actually activate exactly the same set is very low with 400 units, and it decreases exponentially with this number.
4. We use the standard letter position notation, which replaces letters by their absolute position, and where the letter "d" in the prime means a different letter is used relative to the target.
5. The chunking operator is not always constant across binary spatter codes. Here, we use a deterministic version of the original majority rule.
6. In this article, we use a similarity measure based on the normalized Hamming distance h between two vectors A and B, defined as sim(A,B) = 1 − 2h(A,B).
7. This similarity decreases as the number of argument vectors grows.
8. This condition should produce the least similarity among all conditions tested.
9. Current specification, yet unpublished.
10. This result has been disputed by Lupker and Davis (2009) on the grounds that global TL priming can be obtained using an alternative "sandwich" priming paradigm.
11. When each vector correlates with the previous one, the resulting code becomes the holographic counterpart of the overlap model.
12. However, it has been argued that the saliency gradient obtained in Gomez et al. (2008) is incompatible with such a modification.
13. There was a mistake in Schoonbaert and Grainger (2004), where 10 bigrams were believed to be shared in this condition.
14. We thank an anonymous reviewer for pointing this out to us.
Acknowledgments The first author thanks Kenneth I. Forster and Paul Smolensky for useful comments on earlier versions of this work; Carol Whitney, Colin Davis, and one anonymous reviewer for constructive reviews; Jonathan Grainger for helpful suggestions and remarks; and Simon Fischer-Baum for stimulating discussions.
References

Abbott, L., & Sejnowski, T. (1999). Neural codes and distributed representations. Cambridge, MA: MIT Press.
Bowers, J. S. (2009). On the biological plausibility of grandmother cells: Implications for neural network theories in psychology and neuroscience. Psychological Review, 116(1), 220–251.
Carreiras, M., Duñabeitia, J. A., & Perea, M. (2007). Reading words, number and symbol. Trends in Cognitive Sciences, 11(11), 454–455.
Chomsky, N. (2002). Syntactic structures (2nd ed.). New York: Walter de Gruyter.
Cohen, L., & Dehaene, S. (2004). Specialization within the ventral stream: The case for the visual word form area. NeuroImage, 22, 466–476.
Cohen, L., Dehaene, S., Naccache, L., Lehéricy, S., Dehaene-Lambertz, G., Hénaff, M., & Michel, F. (2000). The visual word-form area: Spatial and temporal characterization of an initial stage of reading in normal subjects and posterior split-brain patients. Brain, 123, 291–307.
Coltheart, M., Rastle, K., Perry, C., Langdon, R., & Ziegler, J. (2001). DRC: A dual route cascaded model of visual word recognition and reading aloud. Psychological Review, 108, 204–256.
Dandurand, F., Grainger, J., & Dufau, S. (2010). Learning location invariant orthographic representations for printed words. Connection Science, 22(1), 25–42.
Davis, C. J. (1999). The self-organising lexical acquisition and recognition (SOLAR) model of visual word recognition. Doctoral dissertation, University of New South Wales, Sydney, New South Wales, Australia. Dissertation Abstracts International, 62, 594.
Davis, C. J. (in press). The spatial coding model of visual word identification. Psychological Review, 117, 713–758.
Davis, C. J., & Bowers, J. S. (2006). Contrasting five theories of letter position coding. Journal of Experimental Psychology: Human Perception and Performance, 32(2), 535–557.
Dehaene, S., Cohen, L., Sigman, M., & Vinckier, F. (2005). The neural code for written words: A proposal. Trends in Cognitive Sciences, 9, 335–341.
Dehaene, S., Jobert, A., Naccache, L., Ciuciu, P., Poline, J. B., LeBihan, D., et al. (2004). Letter binding and invariant recognition of masked words: Behavioral and neuroimaging evidence. Psychological Science, 15(5), 307–313.
Ellis, A. W., & Lambon Ralph, M. A. (2000). Age of acquisition effects in adult lexical processing reflect loss of plasticity in maturing systems: Insights from connectionist networks. Journal of Experimental Psychology: Learning, Memory & Cognition, 26, 1103–1123.
Felleman, D., & Van Essen, D. (1991). Distributed hierarchical processing in the primate cerebral cortex. Cerebral Cortex, 1(1), 1–47.
Fischer-Baum, S., McCloskey, M., & Rapp, B. (2010). Representation of letter position in spelling: Evidence from acquired dysgraphia. Cognition, 115(3), 466–490.
Fodor, J. A. (1983). The modularity of mind. Cambridge, MA: MIT Press.
Forster, K. I., & Davis, C. (1984). Repetition priming and frequency attenuation in lexical access. Journal of Experimental Psychology: Learning, Memory, and Cognition, 10, 680–698.
Gomez, P., Ratcliff, R., & Perea, M. (2008). The overlap model: A model of letter position coding. Psychological Review, 115(3), 577–600.
Grainger, J., Granier, J., Farioli, F., Van Assche, E., & Van Heuven, W. (2006). Letter position information and printed word perception: The relative-position priming constraint. Journal of Experimental Psychology: Human Perception and Performance, 32, 865–884.
Grainger, J., & Jacobs, A. M. (1993). Masked partial-word priming in visual word recognition: Effects of positional letter frequency. Journal of Experimental Psychology: Human Perception and Performance, 19(5), 951–964.
Grainger, J., & Jacobs, A. M. (1996). Orthographic processing in visual word recognition: A multiple read-out model. Psychological Review, 103, 518–565.
Grainger, J., & Van Heuven, W. J. B. (2003). Modeling letter position coding in printed word perception. In P. Bonin (Ed.), Mental lexicon: "Some words to talk about words" (pp. 1–23). New York: Nova Science.
Gross, C. G. (2002). Genealogy of the "grandmother cell." Neuroscientist, 8, 512–518.
Guerrera, C., & Forster, K. I. (2008). Masked form priming with extreme transposition. Language and Cognitive Processes, 23 (special issue: Cracking the Orthographic Code), 117–142.
Hannagan, T. (2009). Visual word recognition: Holographic representations and attractor networks. Unpublished doctoral dissertation, Université Paris 6, Paris.
Hannagan, T. (2010). Match calculations and priming simulations: The case of spatial coding. Available at: http://www.lscp.net/persons/hannagan/Material/MatchVsSim.pdf.
Hannagan, T., Dandurand, F., & Grainger, J. (in press). Broken symmetries in a location invariant word recognition network. Neural Computation.
Harris, H. (2002). Holographic reduced representations for oscillator recall: A model of phonological production. In W. D. Gray & C. D. Schunn (Eds.), Proceedings of the 24th Annual Meeting of the Cognitive Science Society (pp. 423–428). Mahwah, NJ: Lawrence Erlbaum Associates.
Humphreys, G. W., Evett, L. J., & Quinlan, P. T. (1990). Orthographic processing in visual word identification. Cognitive Psychology, 22, 517–560.
Jacobs, A. M., Rey, A., Ziegler, J. C., & Grainger, J. (1998). MROM-p: An interactive activation, multiple read-out model of orthographic and phonological processes in visual word recognition. In J. Grainger & A. M. Jacobs (Eds.), Localist connectionist approaches to human cognition (pp. 147–188). Mahwah, NJ: Erlbaum.
Jones, M. N., & Mewhort, D. J. K. (2007). Representing word meaning and order information in a composite holographic lexicon. Psychological Review, 114, 1–37.
Jordan, T., Thomas, S., Patching, G., & Scott-Brown, K. (2003). Assessing the importance of letter pairs in initial, exterior and interior position in reading. Journal of Experimental Psychology: Learning, Memory, and Cognition, 29, 883–893.
Kanerva, P. (1995). A family of binary spatter codes. In F. Fogelman-Soulié & P. Gallinari (Eds.), ICANN '95: Proceedings of the International Conference on Artificial Neural Networks (Vol. 1, pp. 517–522). Paris: EC2 & Cie.
Kanerva, P. (1997). Fully distributed representation. In Proceedings of the 1997 Real World Computing Symposium (pp. 358–365). Tokyo, Japan.
Kanerva, P. (1998). Pattern completion with distributed representation. In IEEE International Conference on Neural Networks (IJCNN'98) (Vol. II, pp. 1416–1421).
Knoblauch, A., & Palm, G. (2005). What is signal and what is noise in the brain? Biosystems, 79(1–3), 83–90.
Levy, S., & Kirby, S. (2006). Evolving distributed representations for language with self-organizing maps. In P. Vogt et al. (Eds.), Symbol Grounding and Beyond: Proceedings of the Third International Workshop on the Emergence and Evolution of Linguistic Communication (pp. 57–71). Berlin: Springer.
Lupker, S. J., & Davis, C. J. (2009). Sandwich priming: A method for overcoming the limitations of masked priming by reducing lexical competitor effects. Journal of Experimental Psychology: Learning, Memory, and Cognition, 35, 618–639.
Lupker, S. J., Perea, M., & Davis, C. J. (2008). Transposed letter priming effects: Consonants, vowels and letter frequency. Language and Cognitive Processes, 23, 93–116.
McClelland, J., McNaughton, B. L., & O'Reilly, R. C. (1995). Why there are complementary learning systems in the hippocampus and neocortex: Insights from the successes and failures of connectionist models of learning and memory. Psychological Review, 102, 419–457.
McClelland, J., Patterson, K., Pinker, S., & Ullman, M. (2002). The past tense debate: Papers and replies by S. Pinker & M. Ullman and by J. McClelland & K. Patterson. Trends in Cognitive Sciences, 6, 456–474.
McClelland, J., & Rumelhart, D. E. (1981). An interactive activation model of context effects in letter perception: Part 1. An account of basic findings. Psychological Review, 88, 375–407.
Mozer, M. C. (1987). Early parallel processing in reading: A connectionist approach. In M. Coltheart (Ed.), Attention and performance XII: The psychology of reading (pp. 83–104). Hove, UK: Erlbaum.
Mozer, M. C. (1991). The perception of multiple objects: A connectionist approach. Cambridge, MA: MIT Press.
Neumann, J. (2002). Learning the systematic transformation of holographic reduced representations. Cognitive Systems Research, 3(2), 227–235.
Norris, D. (2006). The Bayesian reader: Explaining word recognition as an optimal Bayesian decision process. Psychological Review, 113(2), 327–357.
Perea, M., & Acha, J. (2008). The effects of length and transposed-letter similarity in lexical decision: Evidence with beginning, intermediate, and adult readers. British Journal of Psychology, 99, 245–264.
Perea, M., Duñabeitia, J., & Carreiras, M. (2008). Transposed-letter priming effects for close vs. distant transpositions. Experimental Psychology, 55, 397–406.
Peressotti, F., & Grainger, J. (1999). The role of letter identity and letter position in orthographic priming. Perception and Psychophysics, 61, 691–706.
Perea, M., & Lupker, S. J. (2003). Does jugde activate court? Transposed-letter similarity effects in masked associative priming. Memory and Cognition, 31, 829–841.
Perea, M., & Lupker, S. J. (2004). Can caniso activate casino? Transposed-letter similarity effects with nonadjacent letter positions. Journal of Memory and Language, 51, 231–246.
Pinker, S. (1999). Words and rules: The ingredients of language. New York: Basic Books.
Plate, T. A. (1995). Holographic reduced representations. IEEE Transactions on Neural Networks, 6(3), 623–641.
Plate, T. A. (2000). Randomly connected sigma-pi neurons can form associative memories. Network: Computation in Neural Systems, 11(4), 321–332.
Plaut, D. C., & McClelland, J. L. (2010). Locating object knowledge in the brain: A critique of Bowers' (2009) attempt to revive the grandmother cell hypothesis. Psychological Review, 117, 284–290.
Plaut, D. C., McClelland, J. L., Seidenberg, M. S., & Patterson, K. (1996). Understanding normal and impaired word reading: Computational principles in quasi-regular domains. Psychological Review, 103, 56–115.
Rumelhart, D. E., Hinton, G., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing (Vol. 1, pp. 318–362). Cambridge, MA: MIT Press.
Rumelhart, D. E., McClelland, J. L., & the PDP Research Group (Eds.). (1986). Parallel distributed processing: Explorations in the microstructure of cognition, Vol. 1: Foundations. Cambridge, MA: MIT Press.
Schoonbaert, S., & Grainger, J. (2004). Letter position coding in printed word perception: Effects of repeated and transposed letters. Language and Cognitive Processes, 19(3), 333–367.
Seidenberg, M. S., & McClelland, J. L. (1989). A distributed, developmental model of word recognition and naming. Psychological Review, 96, 523–568.
Smolensky, P. (1990). Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artificial Intelligence, 46, 159–216.
Smolensky, P., Legendre, G., & Miyata, Y. (1992). Principles for an integrated connectionist/symbolic theory of higher cognition (Tech. Rep. No. CU-CS-600-92). Boulder: Department of Computer Science and Institute of Cognitive Science, University of Colorado.
Stevens, M., & Grainger, J. (2003). Letter visibility and the viewing position effect in visual word recognition. Perception and Psychophysics, 65, 133–151.
Turing, A. (1950). Computing machinery and intelligence. Mind, 59, 433–460.
Tydgat, I., & Grainger, J. (2009). Serial position effects in the identification of letters, digits, and symbols. Journal of Experimental Psychology: Human Perception and Performance, 35(2), 480–498.
Van Assche, E., & Grainger, J. (2006). A study of relative-position priming with superset primes. Journal of Experimental Psychology: Learning, Memory and Cognition, 32(2), 399–415.
Vinckier, F., Dehaene, S., Jobert, A., Dubus, J., Sigman, M., & Cohen, L. (2007). Hierarchical coding of letter strings in the ventral stream: Dissecting the inner organization of the visual word-form system. Neuron, 55, 143–156.
Waydo, S., Kraskov, A., Quiroga, R. Q., Fried, I., & Koch, C. (2006). Sparse representation in the human medial temporal lobe. Journal of Neuroscience, 26, 10232–10234.
Whitney, C. (2001). How the brain encodes the order of letters in a printed word: The SERIOL model and selective literature review. Psychonomic Bulletin and Review, 8, 221–243.
Whitney, C. (2004). Investigations into the neural basis of structured representations. Unpublished doctoral dissertation, University of Maryland.
Whitney, C. (2008). Comparison of the SERIOL and SOLAR theories of letter position encoding. Brain and Language, 107(2), 170–178.
Whitney, C., & Berndt, R. S. (1999). A new model of letter string encoding: Simulating right neglect dyslexia. In J. A. Reggia, E. Ruppin, & D. Glanzman (Eds.), Progress in brain research (Vol. 121, pp. 143–163). Amsterdam: Elsevier.
Wickelgren, W. A. (1969). Auditory or articulatory coding in verbal short-term memory. Psychological Review, 76, 232–235.
Ziegler, J. C. (1995). Phonological information provides early sources of constraint in the processing of letter strings. Journal of Memory and Language, 34(5), 567–593.
Appendix A: Operators: Formal definition

X-or
Let A₁, A₂ be bipolar variables. The binding operator X-or is a dyadic operator from {−1, 1}² to {−1, 1}, defined as:

\[
\text{X-or}(A_1, A_2) := A_1 \otimes A_2 =
\begin{cases}
-1, & \text{if } A_1 \cdot A_2 = 1 \\
\phantom{-}1, & \text{if } A_1 \cdot A_2 = -1
\end{cases}
\]

Majority rule
Here we use a deterministic version of the original Maj operator introduced in Kanerva (1997). Let (A_p)_{p=1,…,k} be k bipolar vectors of dimension n. For every component i, the chunking operator Maj is defined as:

\[
\mathrm{Maj}(A_1^i, \ldots, A_k^i) := \bigoplus_{p=1}^{k} A_p^i =
\begin{cases}
-1, & \text{if } \sum_{p=1}^{k} A_p^i < 0 \\
\phantom{-}1, & \text{if } \sum_{p=1}^{k} A_p^i > 0 \\
\phantom{-}T, & \text{if } \sum_{p=1}^{k} A_p^i = 0
\end{cases}
\qquad \text{where } T = A_1^{(i+1) \bmod n} \otimes A_k^{(i+1) \bmod n}.
\]

The weighted Maj version used in weighted holographic slot coding is given as follows. Let (A_p)_{p=1,…,k} be k bipolar variables and (a_p)_{p=1,…,k} ∈ ℝ^k their weights:

\[
\mathrm{WMaj}((A_1, \ldots, A_k), (a_1, \ldots, a_k)) :=
\begin{cases}
-1, & \text{if } \sum_{p=1}^{k} a_p A_p < 0 \\
\phantom{-}1, & \text{if } \sum_{p=1}^{k} a_p A_p > 0 \\
\phantom{-}T, & \text{if } \sum_{p=1}^{k} a_p A_p = 0
\end{cases}
\]

where T is an independent random bipolar variable following a Bernoulli distribution of parameter 1/2.
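A direct transcription of the weighted rule in Python, reusing the conventions of the earlier sketches (here the tie is broken by an independent fair-coin draw, as in the definition):

```python
def wmaj(vecs, weights):
    """Weighted Maj: sign of the weighted componentwise sum."""
    s = np.tensordot(np.asarray(weights, dtype=float),
                     np.asarray(vecs), axes=1)
    tie = rng.choice([-1, 1], size=s.shape)  # Bernoulli(1/2) tie-break
    return np.where(s > 0, 1, np.where(s < 0, -1, tie))
```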
Appendix B: Transposition effects in BSCs

In this appendix, we show why transposition effects arise in absolute coding schemes using BSCs. We restrict our proof to structures of k odd arguments, similar mechanisms being at work in the even case.

For k ∈ ℕ, k odd, let

\[
S = A_1 \otimes X_1 \oplus A_2 \otimes X_2 \oplus \cdots \oplus A_k \otimes X_k
\qquad \text{and} \qquad
T = A_2 \otimes X_1 \oplus A_1 \otimes X_2 \oplus \cdots \oplus A_k \otimes X_k
\]

be two BSC vectors related by transposition of A₁ with A₂. We first compute the expected Hamming distance H between S and T. A given component of S and T can only differ when the two transposed pair terms differ, which requires A₁ = −A₂ and X₁ = −X₂ (one pair sum is then +2 and the other −2), and when the remaining sum is small enough for the pair to decide the majority:

\[
\begin{aligned}
E[H(S,T)] &= 2 \cdot P[A_1 = -A_2] \cdot P[X_1 = -X_2] \cdot P\!\left[\textstyle\sum_{i=3}^{k} A_i X_i = 1\right] \\
&= \frac{1}{2} \cdot P\!\left[\textstyle\sum_{i=1}^{k-2} T_i = 1\right], \quad \text{where the } T_i = A_i X_i \text{ are i.i.d.} \\
&= \frac{1}{2} \cdot P\!\left[\textstyle\sum_{i=1}^{k-2} \frac{T_i+1}{2} = \frac{k-1}{2}\right], \quad \text{where } \frac{T_i+1}{2} \text{ follows a Bernoulli}\!\left(p = \frac{1}{2}\right) \\
&= C_{k-2}^{\frac{k-1}{2}} \cdot \frac{1}{2^{k-1}}
\end{aligned}
\]

Let now U = D₁ ⊗ X₁ ⊕ D₂ ⊗ X₂ ⊕ ⋯ ⊕ A_k ⊗ X_k, in which A₁ and A₂ are replaced by two new random vectors. From Kanerva's formula for the distance between two BSCs sharing all but two of their arguments (Kanerva, 1995), we get:

\[
E[H(S,U)] = \frac{3}{2} \cdot C_{k-2}^{\frac{k-1}{2}} \cdot \frac{1}{2^{k-1}} = \frac{3}{2} \cdot E[H(S,T)]
\]

Finally, to obtain the transposition effect we need only subtract both distances, remembering the relationship with similarities:

\[
E[\mathrm{Sim}(S,T)] - E[\mathrm{Sim}(S,U)] = \left(1 - 2E[H(S,T)]\right) - \left(1 - 2E[H(S,U)]\right) = E[H(S,T)] = C_{k-2}^{\frac{k-1}{2}} \cdot \frac{1}{2^{k-1}}
\]

This shows that using the original BSCs, the transposition effect is maximal for strings of size 3 (at 0.25), and then decreases convexly with size.
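As a sanity check (ours, not part of the original appendix), the k = 3 case of the formula can be verified by simulation; the sign convention of the binding operator is immaterial for distances, so plain products are used for brevity:

```python
# Monte Carlo estimate of E[H(S,T)] for k = 3: expect C(1,1)/2^2 = 0.25.
k, trials = 3, 200_000
A = rng.choice([-1, 1], size=(k, trials))
X = rng.choice([-1, 1], size=(k, trials))
S = np.sign((A * X).sum(axis=0))   # Maj of the k bound components
A2 = A.copy()
A2[[0, 1]] = A[[1, 0]]             # transpose A1 and A2
T = np.sign((A2 * X).sum(axis=0))
print(np.mean(S != T))             # ~0.25
```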
Appendix C: Similarity measures for Seriol and Spatial Coding

The following table presents the similarity values obtained for Spatial Coding and Seriol, as used in Table 6.

Table C1
Similarities for Spatial Coding (with end letter markers) and Seriol coding (with edge bigrams)

                 Stability (1–8)                                                      Edge Effects (9–11)
Condition        (1)     (2)    (3)     (4)     (5)    (6)    (7)     (8)     (9)    (10)   (11)
Prime            12345   1245   123345  123d45  12dd5  1d345  12d456  12d4d6  d2345  12d45  1234d
Target           12345   12345  12345   12345   12345  12345  123456  123456  12345  12345  12345
Spatial Coding   1.00    0.79   1.00    0.92    0.71   0.86   0.88    0.75    0.71   0.86   0.71
Seriol           1.00    0.68   0.92    0.86    0.41   0.66   0.67    0.44    0.66   0.63   0.66

                 Transposition (12–15)                     Relative Position (16–20)
Condition        (12)    (13)      (14)    (15)   (16)     (17)     (18)     (19)     (20)
Prime            12435   21436587  125436  13d45  12345    34567    13457    123267   123567
Target           12345   12345678  123456  12345  1234567  1234567  1234567  1232567  1232567
Spatial Coding   0.89    0.48      0.78    0.81   0.67     0.67     0.69     0.82     0.81
Seriol           0.85    0.45      0.64    0.63   0.61     0.61     0.58     0.77     0.83

Note. Boldface type indicates similarities.
Cognitive Science 35 (2011) 119–155
Copyright © 2010 Cognitive Science Society, Inc. All rights reserved.
ISSN: 0364-0213 print / 1551-6709 online
DOI: 10.1111/j.1551-6709.2010.01160.x
Learning Diphone-Based Segmentation

Robert Daland,a Janet B. Pierrehumbertb

a Department of Linguistics, UCLA
b Department of Linguistics, Northwestern University
Received 16 August 2009; received in revised form 24 February 2010; accepted 28 April 2010
Abstract This paper reconsiders the diphone-based word segmentation model of Cairns, Shillcock, Chater, and Levy (1997) and Hockema (2006), previously thought to be unlearnable. A statistically principled learning model is developed using Bayes’ theorem and reasonable assumptions about infants’ implicit knowledge. The ability to recover phrase-medial word boundaries is tested using phonetic corpora derived from spontaneous interactions with children and adults. The (unsupervised and semisupervised) learning models are shown to exhibit several crucial properties. First, only a small amount of language exposure is required to achieve the model’s ceiling performance, equivalent to between 1 day and 1 month of caregiver input. Second, the models are robust to variation, both in the free parameter and the input representation. Finally, both the learning and baseline models exhibit undersegmentation, argued to have significant ramifications for speech processing as a whole. Keywords: Language acquisition; Word segmentation; Bayesian; Unsupervised learning; Computational model
1. Introduction Word learning is fundamental in language development. Aside from communicating lexical meaning in individual utterances, words play a role in acquiring generalizations at multiple levels of linguistic structure, for example, phonology1 and syntax.2 Therefore, it is crucially important to understand the factors and processes that shape word learning. In order to learn a word, the listener must first parse the wordform out as a coherent whole from the context in which it was uttered—word segmentation. Word segmentation is a challenging phenomenon to explain, as word boundaries are not reliably marked in everyday speech with invariant acoustic cues, such as audible pauses (Lehiste, 1960). Correspondence should be sent to Robert Daland, Department of Linguistics, 3125 Campbell Hall, UCLA, Los Angeles, CA 90095-1543. E-mail:
[email protected]
Therefore, listeners must exploit some kind of language-specific knowledge to determine word boundaries. In adults, one obvious source for word segmentation is recognition of neighboring words: The end of one word signals the onset of the next, and vice versa. Indeed, a number of computational models such as TRACE (McClelland & Elman, 1986) and Shortlist B (Norris & McQueen, 2008) have explained word segmentation as an epiphenomenon of word recognition in closed-vocabulary tasks, such as an adult might face in a familiar listening environment. Word segmentation in adults is facilitated by recognition of specific words and other ‘‘top-down’’ (syntactic ⁄ semantic and pragmatic ⁄ world) knowledge (Mattys, White, & Melhorn, 2005), which may even override lexical ⁄ phonological information (e.g., Levy, 2008). Thus, ‘‘top-down’’ knowledge plays a vital role in adult word segmentation. However, the acquisition facts suggest that word recognition cannot be the only—or even the most important—mechanism for infant word segmentation. This is evident from the fact that infants do not command very much top-down knowledge that might support word segmentation. For example, infants between the ages of 6 and 12 months are reported to know an average of 40–80 word types (Dale & Fenson, 1996), a tiny fraction of the words they encounter. During this same developmental period infants exhibit robust word segmentation, apparently on the basis of low-level cues such as phonotactics and stress (Aslin, Saffran, & Newport, 1998; Jusczyk, Hohne, & Bauman, 1999; Jusczyk, Houston, & Newsome, 1999; Mattys & Jusczyk, 2001; Saffran, Aslin, & Newport, 1996). While word recognition clearly plays some role in infant word segmentation (Bortfeld, Morgan, Golinkoff, & Rathbun, 2005), word segmentation organizes and supports infant word recognition and learning (Davis, 2004), rather than being only an epiphenomenon of word recognition. These facts call for a phonotactic account of word segmentation acquisition. Phonotactics refers to tacit knowledge of possible ⁄ likely sound sequences, including words, syllables, and stress (Albright, 2009; Chomsky & Halle, 1965; Dupoux, Kakehi, Hirose, Pallier, & Mehler, 1999; Hayes & Wilson, 2008; Jusczyk, Luce, & Charles-Luce, 1994). Although phonotactics can refer to a broad array of sound structures, the present paper will focus on segmental sequences and their distribution within and across words. More specifically, this paper explores Diphone-Based Segmentation (DiBS) as previously studied in Cairns, Shillcock, Chater, and Levy (1997) and Hockema (2006). The underlying idea of DiBS is that many diphones are good cues to the presence or absence of a word boundary. For example, the sequence [pd] occurs in no or almost no English words, so it is a strong cue to the presence of a word boundary between [p] and [d]. Similarly, the sequence [ba] occurs very frequently within English words, but only rarely across word boundaries, so it is a strong cue to the absence of a word boundary. This idea can be formalized as calculating, for every diphone [xy] that occurs in the language, the probability p(# | xy) that a word boundary # falls between [x] and [y]. The DiBS models in previous studies were supervised models, meaning that model parameters were estimated from phonetic transcriptions of speech in which the presence ⁄ absence of word boundaries was marked. Since this is precisely what infants are trying
to discover, supervised models are not appropriate as models of human acquisition, which is unsupervised. Thus, despite the promising segmentation performance of these models, they have attracted little follow-up research, apparently because the model parameters were regarded as unlearnable. The computational literature shows that when model parameters cannot be directly inferred, they can often be indirectly inferred using Bayes' theorem with reasonable prior assumptions (Manning & Schütze, 1999). The Bayesian approach is especially appropriate for the study of language acquisition because it forces a principled distinction between learner assumptions and the data that the child learns from. Accordingly, a learning DiBS model that uses Bayes' theorem to estimate parameters is developed here. The approach builds on the acquisition literature documenting children's use of phonotactics for word segmentation; specifically, DiBS formalizes the finding that children exploit diphones to segment words from unfamiliar sequences (Mattys & Jusczyk, 2001). In estimating parameters, the learning model exploits the fact that phrase boundaries contain distributional information useful for identifying word boundaries (Aslin, Woodward, LaMendola, & Bever, 1996). The paper is structured as follows. In the background section, we begin with terminology. Next we describe previous computational approaches to word segmentation; then we consider evidence of phonotactic segmentation in infants. Finally, we argue for phonotactic segmentation as a prelexical process in a two-stage (prelexical ⁄ lexical) theory of speech processing. In the DiBS section, we begin with our cognitive assumptions and next describe the core learning model; two specific instantiations are introduced: Phrasal-DiBS bootstraps model parameters from the distribution of speech sounds at phrase edges; Lexical-DiBS estimates them from the infant's lexicon. Phrasal-DiBS is an unsupervised algorithm, and Lexical-DiBS can be characterized as semi-supervised. The remainder of the paper is devoted to testing the models and discussion of their performance. Simulation 1 uses a phonetic corpus derived from child-directed speech to assess the learning models' ability to recover phrase-medial word boundaries, using the supervised model of Cairns et al. (1997) as a baseline. Simulation 2 assesses the models' robustness to variation in the parameter p(#), the learner's estimate of the global probability of a phrase-internal word boundary. Finally, Simulation 3 assesses robustness to pronunciation variation using a corpus of spontaneous adult speech that represents the phonetic outcome of conversational reduction processes.
2. Background

Word segmentation has been the focus of intensive cross-disciplinary research in recent years, with important contributions from infant experiments (e.g., Saffran et al., 1996; Mattys & Jusczyk, 2001), corpus studies of caregiver speech (e.g., van de Weijer, 1998), computational models (Christiansen, Allen, & Seidenberg, 1998; Fleck, 2008; Goldwater, 2006; Swingley, 2005), and combinations of these methods (Aslin et al., 1996; Brent & Siskind, 2001). Rapid further progress depends on integrating the insights from these
multiple strands of research. In this section, we begin by defining terminology. Next we describe several classes of computational models implementing a variety of theoretical approaches to the acquisition of word segmentation. Then we review evidence of phonotactic segmentation in infants and argue that existing approaches fail to accommodate it. Finally we argue that phonotactic segmentation is a prelexical process.

2.1. Terminology

2.1.1. Units of speech perception
The phonetic categories that infants perceive will be referred to as phones. ''Phone'' is used in preference to ''phoneme'' or ''allophone'' because these terms imply to some readers that infants learn the full system of contextual variation and lexical contrast relating cognitive units (phonemes) with their phonetic realization (allophones). For example, alveolar taps, aspirated stops, voiceless unaspirated stops, and unreleased stops are all allophones of the same phoneme ⁄t⁄. It is nontrivial to learn these relations (Peperkamp, Le Calvez, Nadal, & Dupoux, 2006), and there is no unambiguous evidence that prelexical infants have done so (Pierrehumbert, 2002); hence, the more neutral term ''phone.''

2.1.2. Undersegmentation and oversegmentation errors
An undersegmentation error occurs when there is an underlying word boundary in the input, but the model fails to identify it. An oversegmentation error occurs when there is not an underlying word boundary, but the model identifies one. These terms are used because they refer directly to the perceptual outcome for the infant: In the former case the infant will perceive an unanalyzed whole that underlyingly consists of multiple words; in the latter case the infant will improperly split up a single word into subparts.

2.2. Computational models of segmentation acquisition

Computational models can be regarded as specific instantiations of broader theories, making more specific and sometimes more easily testable predictions than the theories they embody. A variety of modeling frameworks have been proposed for the acquisition of word segmentation, including phonotactic models, connectionist models, and models that treat word segmentation and word learning as a joint-optimization problem. These models differ not only in their internal structure and in what information they bring to bear, but also in the task they are solving; some are designed to acquire a lexicon as well as segment speech.

2.2.1. Diphone and higher n-phone models
Cairns et al. (1997) used the London-Lund corpus of spoken conversation to test a diphone model. For each diphone [xy] in the language, they collected the frequency f#(xy) with which [xy] spans a word boundary, and the frequency f¬#(xy) with which [xy] occurs word-internally. Then the probability of a word boundary between [x] and [y] is p(# | xy) = f#(xy) ⁄ (f#(xy) + f¬#(xy)). By positing a boundary whenever this probability exceeded
an optimal threshold, the model found 75% of the true word boundaries in the corpus, with only 5% of nonboundaries misidentified as word boundaries. The high level of performance was later explained by Hockema's (2006) finding that English diphones contain a great deal of positional information, because most occur within a word, or across word boundaries, but not both. Cairns et al. (1997) did not regard the diphone model as a suitable model for acquisition, because calculating the model parameters depends on knowing the relative frequency with which word boundaries span different diphones. Observing this information would require knowing when a word boundary has occurred and when it has not, which is precisely the segmentation problem infants are trying to solve. Swingley (2005) followed up with a word-learning model using a related statistic that is observable to infants. Although this model achieved promising results on word learning, it also made a number of ad hoc assumptions that may not be cognitively plausible (for discussion see Goldwater, 2006). Other studies have revisited the assumption that diphone models are unlearnable. The key insight is that phrase boundaries are always word boundaries (Aslin et al., 1996). Thus, while infants may not observe which diphones span a word boundary phrase-medially, they can observe which phones are likely to occur phrase-initially and -finally. Xanthos (2004) exploited this idea by defining ''utterance-boundary typicality'' as the ratio of the expected probability of a diphone across phrase boundaries to the observed probability within a phrase. This method crucially assumes independence of phonological units across word boundaries. Going a step further, Fleck (2008) used Bayes' theorem to derive word-boundary probabilities with the further, counterintuitive, assumption of phonological independence within words. Statistical dependencies in this model are represented using all n-phones, n ≤ 5, that occur more than five times in the corpus, so the model is more powerful than a diphone model and requires correspondingly stronger assumptions about infants' cognitive abilities. Fleck's model also includes a lexical process that repairs morphologically driven segmentation errors, for example, boundaries between stems and suffixes. To anticipate briefly, the present study includes elements from several of these studies. It shares the core diphone model from Cairns et al. (1997) and Hockema (2006). From Aslin et al. (1996) and Xanthos (2004) it draws the idea of using utterance boundary distributions to estimate word boundary distributions, although it goes beyond these works in offering a principled probabilistic formulation. And in common with Fleck (2008), this work uses Bayes' theorem to bootstrap model parameters, although it is a leaner model, because it uses only diphones and does not also attempt to learn words.
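To make the supervised diphone statistic concrete, the following minimal Python sketch tabulates f#(xy) and f¬#(xy) from boundary-marked input and converts them to p(# | xy). This is our reconstruction of the general recipe, not Cairns et al.'s code; the function name and the toy corpus (orthography standing in for phones) are illustrative only.

```python
from collections import Counter

def train_baseline_dibs(utterances):
    """Tabulate boundary-spanning and word-internal diphone counts from
    boundary-marked input and return p(# | xy) = f#(xy) / (f#(xy) + f_not#(xy)).

    `utterances` is a list of utterances, each a list of words,
    each word a string of phone symbols.
    """
    f_boundary = Counter()   # f#(xy): [xy] spans a word boundary
    f_internal = Counter()   # f_not#(xy): [xy] occurs word-internally
    for words in utterances:
        for word in words:
            for x, y in zip(word, word[1:]):
                f_internal[x + y] += 1
        for w1, w2 in zip(words, words[1:]):
            f_boundary[w1[-1] + w2[0]] += 1
    return {xy: f_boundary[xy] / (f_boundary[xy] + f_internal[xy])
            for xy in set(f_boundary) | set(f_internal)}

# Toy illustration:
model = train_baseline_dibs([["top", "dog"], ["dog", "naps"]])
# model["pd"] == 1.0 (only ever spans a boundary); model["do"] == 0.0
```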
segmentation, in part because of the well-known difficulty of interpreting connection weights and hidden unit activations (Elman, 1990)—typically it is unclear how the network solved the problem.

2.2.3. Joint-optimization approaches
Some researchers have formulated word segmentation and word learning as a joint-optimization problem, in which related problems are solved together by defining a single optimal solution (e.g., Blanchard & Heinz, 2008; Brent & Cartwright, 1996; Goldwater, Griffiths, & Johnson, 2009). As shown in Goldwater (2006), extant approaches have a natural Bayesian formulation in which ''solutions'' are segmentations of the input, and the optimum is defined as the solution with maximum a posteriori probability, calculated from a prior on the segmentation-induced lexicon. To illustrate the core ideas, consider two minimally different orthographic segmentations of the sentence The dog chased the cat (Table 1). Each segmentation induces a lexicon, operationalized as a list of word types and associated frequencies. Crucially, the induced lexicons differ in the number of words and their frequencies. It is these differences which cause joint-optimization models to prefer one solution over another. For example, ''minimum description length'' prefers solution (a) to (b) because it uses fewer words to explain the observed corpus (Brent & Cartwright, 1996). The ''Chinese Restaurant Process'' prior of Goldwater (2006) would also prefer (a) to (b), because it exhibits a Zipfian frequency distribution in which a few words occur repeatedly (in this case, the) and many elements occur only rarely (Baayen, 2001).

While some joint-optimization models adopt an ideal-observer approach, in which the goal is to draw inferences about the cognitive properties of the learner from the optimal solution (e.g., Goldwater, 2006), other models claim to model human cognitive processes (Brent, 1999; Blanchard & Heinz, 2008). The current generation of such models assumes a one-to-one relationship between input segmentation and the learner's lexicon, so positing a word boundary automatically entails incrementing the frequency in the lexicon of the words on either side.

Table 1
Segmentation in joint-optimization models

Segmentation
(a)  t h e # d o g # c h a s e d # t h e # c a t
(b)  t h e d # o g # c h a s e d # t h e # c a t

Induced lexicon
(a)  the (2), dog (1), chased (1), cat (1)
(b)  thed (1), og (1), chased (1), the (1), cat (1)
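The lexicon-induction step in Table 1 is compact enough to state in code. The sketch below is ours, not a model from the paper; it simply operationalizes a segmentation (with '#' marking posited boundaries, as in Table 1) as an induced word-frequency list.

```python
from collections import Counter

def induce_lexicon(segmentation):
    """Induce a lexicon (word types with token frequencies) from one
    candidate segmentation; '#' marks posited word boundaries."""
    return Counter(segmentation.split("#"))

induce_lexicon("the#dog#chased#the#cat")   # (a): the x2, dog, chased, cat
induce_lexicon("thed#og#chased#the#cat")   # (b): five types, each x1
```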
These models bear on the crucial assumption of this paper that word segmentation is in part a prelexical process, because they instantiate the alternative hypothesis that word segmentation, word recognition, and word learning are part of the same act and are driven by word frequency distributions.

2.3. Motivation for a phonotactic approach

While the joint-optimization approach is highly illuminating, we argue that current-generation models do not solve the segmentation task in the same way that infants do. Two issues motivate a phonotactic approach: infants' use of phonotactic generalizations in segmentation and the complexity of word learning.

2.3.1. Phonotactic generalizations
A number of studies provide clear evidence that infants make use of phonotactic generalizations (not just lexical knowledge) for word segmentation. As early as 7.5 months of age, English-learning infants treat stressed syllables as word onsets, incorrectly segmenting TARis as a word from the sequence ...guiTAR is... (Jusczyk, Houston, et al., 1999)—a strategy that is highly appropriate for English owing to the fact that most English content words are stress-initial (Cutler & Carter, 1987). By 8 months of age, infants exhibit some familiarity with the segmental phonotactics of their language and use it for word segmentation (Friederici & Wessels, 1993; Jusczyk, Friederici, Wessels, Svenkerud, & Jusczyk, 1993; Jusczyk et al., 1994; Saffran et al., 1996). Mattys and colleagues demonstrated that infants exploit diphone phonotactics specifically for word segmentation. Recall that many diphones are contextually restricted, occurring either within a word (e.g., [ba]) or across word boundaries (e.g., [pd]), but not both (Hockema, 2006). Mattys, Jusczyk, Luce, and Morgan (1999) exposed infants to CVC.CVC nonwords, finding that both stress and the medial C.C cluster affected infants' preferences. Then, Mattys and Jusczyk (2001) showed that infants use this diphone cue to segment novel words from an unfamiliar, phrase-medial context. With the exception of Blanchard and Heinz (2008), current-generation joint-optimization models do not predict segmentation on the basis of phonotactic generalizations (stress, diphone occurrence). Blanchard and Heinz (2008) show that including a phonotactic model yields significantly better performance; however, even this model exhibits word learning from a single exposure, argued below to be cognitively implausible.

2.3.2. Word learning
In current-generation joint-optimization models, positing a word boundary entails incrementing the frequency of the wordforms on either side. If the wordforms are not already present in the lexicon, they are added. This amounts to the assumption that words are always learned from a single presentation. While learning a word from one exposure is clearly possible, even for adults it is not the norm; even after seven presentations adults fail to learn about 20% of novel CVC words (Storkel, Armbruster, & Hogan, 2006), and a number of additional studies suggest that segmenting a word is not sufficient to cause word learning in
infants (Brent & Siskind, 2001; Davis, 2004; Graf Estes, Evans, Alibali, & Saffran, 2007; Swingley, 2005). Moreover, word learning in infants is apparently subject to many other factors besides word segmentation. Lexical neighbors facilitate word learning in adults and 3- to 4-year-olds (Storkel et al., 2006; Storkel & Maekawa, 2005). Caregiver ⁄ infant joint attention also facilitates word learning (Tomasello, Mannle, & Kruger, 1986; Tomasello & Farrar, 1986). A comprehensive theory of word learning should include these factors, but they are apparently independent of word segmentation. Thus, while it is fair to ask how a word segmentation model can facilitate word learning, segmentation models should not bear the full explanatory burden for word learning. In short, segmentation makes word forms available to be learned, but word learning is a separate process.

2.4. Segmentation in the cognitive architecture

More precisely, we argue that phonotactic segmentation is a prelexical process (whereas word learning is necessarily a lexical process). For this claim to make sense, it is necessary to accept that there is a distinction between prelexical and lexical processing. This section reviews evidence for a two-stage (prelexical ⁄ lexical) account of speech processing (Luce & Pisoni, 1998; McClelland & Elman, 1986). The general principle underlying this distinction is that prelexical processing assigns structure to speech in some way that facilitates lexical access. The most convincing evidence for a two-stage processing account comes from dissociable effects of phonotactic probability and lexical neighborhood density across a wide range of tasks. The phonotactic probability of a wordform is estimated compositionally from the probabilities of its subparts (e.g., p([bat]) = p([b]) · p([a] | [b]) · p([t] | [a])). Lexical neighborhood density refers to the number of phonological neighbors of a word (i.e., words differing from it by one phoneme). Bailey and Hahn (2001) and Albright (2009) find unique effects of phonotactics and lexical neighborhood in explaining word acceptability judgments. Luce and Large (2001) found a facilitatory effect of phonotactic probability, but an inhibitory effect of lexical neighborhood density, on reaction time in a same-different task. While lexical neighbors affect categorization of phonetically ambiguous tokens (Ganong, 1980), experiments on perceptual adaptation (Cutler, McQueen, Butterfield, & Norris, 2008) and phonetic categorization of ambiguous stimuli (Massaro & Cohen, 1983; Moreton, 1997) show there are also nonlexical (phonotactic) effects. Thorn and Frankish (2005) found a facilitatory effect of phonotactic probability on nonword recall when neighborhood density was controlled, and a facilitatory effect of neighborhood density when phonotactic probability was controlled. Storkel et al. (2006) found a facilitatory effect of neighborhood density and an inhibitory effect of phonotactic probability on word learning in adults. These findings can be straightforwardly explained by a theory with distinct sublexical and lexical levels of representation, but they are harder to accommodate under a single-stage approach, such as joint-optimization models appear to take. That phonotactic word segmentation is attested for novel words in novel contexts (Mattys & Jusczyk, 2001) provides prima facie evidence that it must be a prelexical mechanism. By
prelexical, we mean the segmentation mechanism has no access to specific lexical forms; instead, it feeds a downstream lexical processor (Fig. 1). One implication is that segmentation is a distinct cognitive process from word learning. As a corollary, we attribute to the downstream lexical processor factors in word learning such as lexical neighborhood density and caregiver ⁄ infant joint attention, which exceed the scope of this paper.
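For concreteness, the compositional phonotactic-probability estimate described above can be written as a short function. This is an illustrative sketch of the standard calculation, not a model from the paper; the probability tables are hypothetical stand-ins for corpus estimates.

```python
def phonotactic_prob(word, p_first, p_next):
    """Compositional phonotactic probability, e.g.
    p([bat]) = p([b]) * p([a] | [b]) * p([t] | [a]).

    `p_first[x]` is the probability of phone x word-initially;
    `p_next[(x, y)]` is the probability of y given a preceding x.
    """
    prob = p_first[word[0]]
    for x, y in zip(word, word[1:]):
        prob *= p_next[(x, y)]
    return prob

p_first = {"b": .05}                           # illustrative values
p_next = {("b", "a"): .2, ("a", "t"): .3}
assert abs(phonotactic_prob("bat", p_first, p_next) - .003) < 1e-9
```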
3. DiBS

We now present a phonotactic model that learns diphone-based segmentation (DiBS), as described in Cairns et al. (1997) and Hockema (2006). We begin by reviewing our assumptions about the task and what infants bring to it. We follow other computational studies in assuming that speech input is represented as a sequence of phones and the listener's goal is to recover phrase-medial word boundaries. In DiBS, the listener recovers word boundaries based on the identity of the surrounding diphone. The core of the model: given a sequence [xy], estimate the probability p(# | xy) that a word boundary falls in the middle. In the following section, we outline our assumptions as to what is observable to infants, and the assumptions they must make to estimate model probabilities from these observables.

3.1. Assumptions

We assume the infant knows or can observe the following:
• phonetic categories;
• phonological independence across word boundaries;
• phrase-edge distributions;
• the context-free diphone distribution;
• the context-free probability of a phrase-medial word boundary;
• the lexical frequency distribution.

These assumptions are justified as follows.
Fig. 1. Cognitive architecture of speech perception.
3.1.1. Phonetic categories
Infant speech perception begins to exhibit hallmark effects of phonetic categorization by 9 months of age. Phonetic categorization in adults is evident from high sensitivity to meaningful acoustic ⁄ phonetic variation, and low sensitivity to meaningless acoustic ⁄ phonetic variation. For example, the speech sounds [l] and [r] represent distinct sound categories in English, as evidenced by minimal pairs such as leak ⁄ reek and lay ⁄ ray; the same sounds do not signal a lexical contrast in Japanese, because they represent alternate pronunciations of a single sound category. Thus, Japanese listeners exhibit poor discrimination of the [l] ⁄ [r] contrast, whereas English listeners exhibit excellent discrimination (Miyawaki et al., 1975). Adult speech perception is exquisitely tuned to the phonological system of the native language.

The effect of language exposure on phonetic categorization generally becomes apparent between 7 and 11 months of age. Prior to 6 or 7 months of age, infants exhibit similar discrimination regardless of language background (Kuhl et al., 2006; Trehub, 1976; Tsao, Liu, & Kuhl, 2006; Werker & Tees, 1984). Between 7 and 11 months, discrimination of meaningless contrasts decreases (Werker & Tees, 1984), and discrimination of meaningful contrasts improves (Kuhl et al., 2006; Tsao et al., 2006). Thus, infants appear to acquire native language phonetic categories around 9 months of age.

3.1.2. Phonological independence
Phonological independence across word boundaries means that phonological material at the end of one word exhibits no statistical dependencies with phonological material at the beginning of the next word. While this assumption is not strictly true, it is reasonable to make in the initial stages of acquisition, in the absence of contradictory evidence.

3.1.3. Phrase-edge distributions
We assume that infants know the frequency distribution of phones in phrase-initial and phrase-final position. This assumption is motivated by the fact that infants are sensitive to phrase boundaries in phonological and syntactic parsing (Christophe, Gout, Peperkamp, & Morgan, 2003; Soderstrom, Kemler-Nelson, & Jusczyk, 2005), and the generalization that they are sensitive to the relative frequency of phonotactic sequences (Jusczyk et al., 1994). Because this study is limited by the coding conventions of the corpora employed, only utterance edges are treated as exemplifying phrase boundaries. The availability of such boundaries to the infant is indisputable. Indeed, the works just cited suggest that weaker boundaries, such as utterance-medial intonation phrase boundaries, may also be available to the infant due to pausing and other suprasegmental cues. Including these boundaries would increase the success of the model by increasing the effective sample size for training, and by explicitly providing some boundaries in the test set that our model must estimate (e.g., those word boundaries that coincide with intonational phrase boundaries). Thus, using utterance edge statistics to estimate phrase edge statistics is a very conservative choice.
3.1.4. Context-free diphone distribution
We assume that infants track the context-free distribution of diphones in their input. This assumption, which is shared in some form by all existing models of phonotactic word segmentation, is motivated by evidence that infants attend to local statistical relationships in their input (Mattys & Jusczyk, 2001; Saffran et al., 1996).

3.1.5. Context-free probability of a phrase-medial word boundary
The context-free probability of a word boundary is a free parameter of the model. Because this value is determined by average word length and words per utterance, we assume infants can obtain a reasonable estimate of it. For example, average word length is lower-bounded by the cross-linguistic generalization that content words are minimally bimoraic, a prosodic requirement typically instantiated as CVC or CVCV. Even allowing for the fact that some of the function words are shorter, this implies that the overall probability of a word boundary must be less than about 1 ⁄ 3. Because the assumption that infants can estimate this parameter with adequate reliability is somewhat speculative, Simulation 2 investigates the model's robustness to variation in this parameter.

3.1.6. Lexical frequency distribution
Finally, we assume infants know the relative frequency of the word forms they have learned. This assumption is motivated by the massive body of evidence documenting frequency effects in adults (for a review see Jurafsky, 2003) and findings of frequency sensitivity in closely related levels of representation in infants (Anderson et al., 2003; Jusczyk et al., 1994; Mintz, 2003; Peterson-Hicks, 2006).

3.2. Baseline-DiBS

Diphone-Based Segmentation models necessarily exhibit imperfect segmentation. For every token of a diphone type they make the same decision (boundary ⁄ no boundary), whereas at least some diphones occur both word-internally and across word boundaries (e.g., [rn] occurs word-internally in Ernie and garner, but spans a boundary in bar none and more numbers). Since a DiBS model must make the same decision in both cases, it will either make errors on the word-internal items or on the word-spanning items. We define the baseline model as the statistically optimal one, that is, the one making the smallest possible number of errors—exactly the model described in Cairns et al. (1997) and Hockema (2006). We use this statistically optimal model as the baseline because it establishes the highest level of segmentation performance that can be achieved by any DiBS model. Thus, we refer to the baseline model's segmentation as ''ceiling,'' meaning not perfect segmentation, but the best segmentation achievable by DiBS.

3.3. Learning

The core goal of the learner is to derive an estimate of the DiBS statistics p(# | xy) from observable information. Recall that p(# | xy) represents the probability of the presence of a
word boundary in the middle of a sequence, given that the constituent phones of the sequence were [x] and [y].

3.3.1. Bayes' rule
The first step is to apply Bayes' rule, rewriting this conditional probability in terms of the reverse conditional probability:

p(# | xy) = p(xy | #) · p(#) ⁄ p(xy)    (1)

where p(xy) is the context-free probability of the diphone [xy] and p(#) is the context-free probability of a word boundary. Note that these two terms are known by assumption, so the infant now need only worry about estimating p(xy | #).

3.3.2. Phonological independence
The next step is to apply the assumption of phonological independence so as to factor p(xy | #). This represents the joint probability of a word-final [x] followed by a word-initial [y]. Under the assumption of phonological independence, the probability of a word-initial [y] does not depend on the word-final phone of the preceding word. Thus, the joint probability is simply the product of each event's probability:

p(xy | #) ≈ p(x ← #) · p(# → y)    (2)
where p(x ← #) represents the probability of observing a word-final [x], p(# → y) represents the probability of observing a word-initial [y], and ≈ indicates approximation. The problem of estimating p(xy | #) has been reduced to the problem of estimating the distribution of phones at word edges.

3.3.3. Phrasal-DiBS
The word-edge distribution itself is not observable to infants until after they have begun to solve the segmentation problem. However, infants can get a first-pass approximation by capitalizing on the fact that phrase boundaries are always word boundaries (Aslin et al., 1996), using the phrase-edge distribution as a proxy:

p(x ← #) ≈ p(x ← %)
p(# → y) ≈ p(% → y)    (3)

where p(x ← %) and p(% → y) represent the probability of observing [x] phrase-finally and [y] phrase-initially, respectively. The entire model can be written:

p_phrasal(# | xy) = p(x ← %) · p(#) · p(% → y) ⁄ p(xy)    (4)
This first-pass approach is suitable for the very earliest stages of segmentation, when the infant must bootstrap from almost nothing. Recall that utterance boundaries are used here as a conservative proxy for phrase boundaries, due to limitations imposed by the transcripts in the corpora.
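To make the Phrasal-DiBS estimator concrete, the following minimal Python sketch implements Eq. (4) directly from raw phrase strings. It is our illustration, not the authors' implementation; the function and variable names are ours, and the input is assumed to be a list of phonetically transcribed phrases with no word boundaries marked.

```python
from collections import Counter

def train_phrasal_dibs(phrases, p_boundary):
    """Estimate p(# | xy) as in Eq. (4), using phrase-edge statistics
    as a proxy for word-edge statistics. `p_boundary` is the learner's
    estimate of the phrase-medial word boundary probability p(#)."""
    n = len(phrases)
    p_final = Counter(ph[-1] for ph in phrases)     # estimates p(x <- %)
    p_initial = Counter(ph[0] for ph in phrases)    # estimates p(% -> y)
    diphones = Counter(ph[i:i + 2] for ph in phrases
                       for i in range(len(ph) - 1))
    total = sum(diphones.values())
    model = {}
    for xy, count in diphones.items():
        p_xy = count / total                        # context-free p(xy)
        # Eq. (4); the independence approximation can in principle
        # exceed 1, but hard decisions at threshold .5 are unaffected.
        model[xy] = (p_final[xy[0]] / n) * p_boundary \
                    * (p_initial[xy[1]] / n) / p_xy
    return model
```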
3.3.4. Lexical-DiBS
After infants have begun to acquire a lexicon, they have a much better source of data for estimating the distribution of phones at word edges—namely, the words they know. As discussed in the introduction, by the time infants evince phonotactic segmentation in laboratory studies (about 9 months), they are reported to know an average of 40 words, including familiar names (Dale & Fenson, 1996) and other words which presumably co-occur with them (Bortfeld et al., 2005). However these words are learned, once they are learned, they can be leveraged for phonotactic word segmentation. By estimating edge statistics from known words, infants may avoid errors caused in Phrasal-DiBS by distributional atypicalities at phrase edges.

To use these data, the infant must estimate the probability with which each phone ends ⁄ begins a word in running speech. The most accurate method is to use the token probability, that is, weighting word-initial and -final phones according to lexical frequency. (For example, the sound [ð] has a low type frequency but is highly frequent in running speech, because it occurs in a small number of highly frequent words such as the, this, and that. Infants need to estimate the token probability in order to make use of this important segmentation cue.) These probabilities can be estimated as follows:

p_K(x ← #) ≈ (Σ_{w∈K} (w == [...x]) · f(w)) ⁄ (Σ_{w∈K} f(w))
p_K(# → y) ≈ (Σ_{w∈K} (w == [y...]) · f(w)) ⁄ (Σ_{w∈K} f(w))    (5)

where K is the listener's lexicon. In these equations, the numerator represents the expected token frequency of words that end ⁄ begin with [x] ⁄ [y] and the denominator represents the total observed token frequency. The notation (w == [...x]) is an indicator variable whose value is 1 if word w ends in [x], and 0 otherwise; (w == [y...]) similarly indicates whether w begins with [y]. The full model is given below:

p_lexical(# | xy) = p_K(x ← #) · p(#) · p_K(# → y) ⁄ p(xy)    (6)
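A corresponding sketch of Eqs. (5)-(6) in Python, again ours rather than the authors': the lexicon is assumed to map wordforms (phone strings) to token frequencies, and the context-free diphone probabilities p(xy) are assumed observable from running speech.

```python
def train_lexical_dibs(lexicon, p_boundary, p_diphone):
    """Estimate p(# | xy) as in Eqs. (5)-(6). `lexicon` maps wordforms
    to token frequencies; `p_diphone` maps diphones to p(xy)."""
    total = float(sum(lexicon.values()))
    p_final, p_initial = {}, {}
    for word, freq in lexicon.items():
        # Token-weighted probabilities of word-final / word-initial phones,
        # per Eq. (5).
        p_final[word[-1]] = p_final.get(word[-1], 0.0) + freq / total
        p_initial[word[0]] = p_initial.get(word[0], 0.0) + freq / total
    # Eq. (6)
    return {xy: p_final.get(xy[0], 0.0) * p_boundary
                * p_initial.get(xy[1], 0.0) / p_xy
            for xy, p_xy in p_diphone.items()}
```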
The following section discusses Lexical-DiBS in the context of learning theory.

3.3.5. Gradient of supervision
Language acquisition is generally acknowledged to be ''unsupervised.'' In the context of word segmentation, this means that the language input does not include the hidden structure (phrase-medial word boundaries) that the model is supposed to identify at test. While the distinction between unsupervised and supervised models may seem clear, it is not always clear how to apply this distinction to ''bootstrapping'' models, in which some preexisting knowledge is leveraged to solve a different problem. Baseline-DiBS is fully supervised. Phrasal-DiBS is unsupervised because it leverages information that is obviously available in the input. Lexical-DiBS lies somewhere in between. Lexical-DiBS estimates its parameters from the infant's developing lexicon. It is not unsupervised, because it depends on information (a set of words in the lexicon) whose acquisition has not been modeled here. It is not fully supervised, because the phonological
sequences used in training do not have the word boundaries indicated. In Lexical-DiBS, a number of diphones are identified as boundary-spanning through a nontrivial inductive leap: assuming phonological independence across word boundaries, and estimating word-edge distributions from the aggregate statistical properties of the lexicon. Semi-supervised learning is of interest as a model of human acquisition because infants clearly know some words and learn more during the developmental period modeled here (Bortfeld et al., 2005; Dale & Fenson, 1996). A small early lexicon might be acquired by learning words that have occurred in isolation, through successful application of the segmentation algorithm presented here, or through some other mechanism we have not modeled, such as noticing repeatedly recurring phoneme sequences. Although a full treatment of these factors exceeds the scope of this paper, Lexical-DiBS will reveal how segmentation can improve if generalizations about the form of words in the lexicon are fed back to the word segmentation task as soon as such generalizations become possible.

3.3.6. Summary
The core DiBS statistics can be estimated from phrase-edge distributions and ⁄ or word-edge distributions in the listener's lexicon. This section has articulated a learning model for DiBS using Bayes' theorem and the assumption of phonological independence across word boundaries. Two instantiations were proposed: Phrasal-DiBS estimates model parameters from phrase-edge distributions, and Lexical-DiBS estimates them from word-edge distributions. In Simulation 1, the developmental trajectory of these learning models is assessed.
4. Simulation 1

The goal of Simulation 1 is to measure performance of the learning models against the supervised baseline model. Thus, the baseline model should replicate the main findings of Cairns et al. (1997). However, because the focus of the present study is learnability, the methodology differs. The training data are divided into units that represent one ''day'' of caregiver input. In accord with contemporary corpus-linguistic standards, the model is only tested on unseen data, never on data it has already been trained on. The training and testing data for Simulation 1 are drawn from the CHILDES database (MacWhinney, 2000) of spontaneous interactions between children and caregivers.

4.1. Input

4.1.1. Corpus
The CHILDES database consists of transcriptions of spontaneous interactions between children and caregivers. It contains many subcorpora, collected by a variety of researchers over the past several decades. Ecological validity was the primary motivation for selecting this corpus—it was important to obtain input close to what infants actually hear.
We drew samples from the entire English portion of the database, as very few of the CHILDES corpora target children under 1;5. The motivation for this was to get more data: Acquisition of word segmentation apparently takes several months, so a large data set is required to accurately model the amount and extent of language input that infants hear. Our CHILDES sample contains 1.5 million words; the Bernstein-Ratner corpus used in several other studies of word segmentation acquisition (e.g., Brent & Cartwright, 1996; Goldwater, 2006) contains about 33,000 words, representing about a day of input to a typical child. By sampling from the entire CHILDES corpus, we sacrifice some ecological validity (by including child-directed rather than only infant-directed speech), but we obtain a larger sample more representative of an infant's total language exposure.

4.1.2. Sample
For each target child in the database, a derived corpus was assembled of speech input to the child. Each derived file contained all utterances in the original file except those spoken by the child herself. A sample of ''days'' was drawn from this derived corpus, as follows. Based on van de Weijer's (1998) diary study as well as an ecological study of adult production (Mehl, Vazire, Ramirez-Esparza, Slatcher, & Pennebaker, 2007), a ''day'' of input was defined as 25,000 words. (This value is used here as a standardized unit for modeling purposes; it is intended to approximate what a typical English-learning infant hears in a day, but it is not intended as a claim that all infants hear exactly this amount of input.) Files were selected at random from the derived corpus and concatenated to obtain 60 ''days'' of input, each containing approximately 25,000 words. Properties of the corpus and the training and test sets for Simulation 1 are given in Table 2, along with comparable figures for Simulation 3.

4.1.3. Phonetic mapping
A phonetic representation was created by mapping spaces to word boundaries and mapping each orthographic word to a phonetic pronunciation using the CELEX pronouncing dictionary (Baayen, Piepenbrock, & Gulikers, 1995) with the graphemic DISC transcription system. Words not listed in the dictionary were simply omitted from the phonetic transcription; for example, ''You want Baba?'' would be transcribed as [ju wQnt], omitting the unrecognized word ''Baba.'' (About 8.75% of tokens were omitted, including untranscribed tokens ''xxx,'' nonspeech vocalizations like ''um'' and ''hm,'' nonstandardly transcribed
speech routines like ''thank+you'' and ''all+right,'' unlisted proper names like ''Ross,'' and phonetically spelled variants like ''goin'' and ''doin.'') Note that the CHILDES standard is to put each sequence of connected speech on its own line, without punctuation or phrase boundaries; thus, an individual ''phrase'' corresponds to something more like an utterance in this corpus.

Table 2
Corpus properties

Simulation  Type   Corpus     Words    Phrases  Phones     p(#)
1           Train  –          750,111  170,709  2,226,561  .2818
1           Test   –          750,125  171,232  2,224,873  .2819
3           Train  Canonical  150,030  25,914   479,741    .2735
3           Train  Reduced    149,998  25,907   442,135    .2981
3           Test   Canonical  16,058   2,490    51,555     .2765
3           Test   Reduced    16,051   2,488    47,516     .3012

4.1.4. Training and test sets
The training set consisted of the first 30 ''days'' of input. This length of time is used because the acquisition literature suggests the onset of phonotactic word segmentation occurs shortly after the acquisition of language-specific phonetic categories (cf. Werker & Tees, 1984; Tsao et al., 2006; Kuhl et al., 2006 with Jusczyk, Hohne, et al., 1999; Jusczyk, Houston, et al., 1999; Mattys & Jusczyk, 2001). Thus, a learning model based on categorical phonotactics must be trainable from input on the scale of weeks. The test set consisted of the remaining 30 ''days.''

4.2. Models

4.2.1. Phrasal-DiBS
The Phrasal-DiBS model parameters were estimated according to the phrase-edge distributions in the learner's input.

4.2.2. Lexical-DiBS
Lexical-DiBS is based on the learner's lexical knowledge rather than the raw input. In order to properly compare Lexical-DiBS with Phrasal-DiBS, it is necessary to know which words an infant will learn, given what the infant has heard. Unfortunately, no sufficiently predictive theory of word learning exists. As a crude proxy, we use a frequency threshold model: Wordforms are added to the lexicon incrementally as soon as they have occurred n times in the input. Three frequency thresholds were used: 10, 100, and 1,000. The threshold 10 is used as a lower bound, because it almost certainly overestimates an infant's lexicon size (even 14-month-olds do not learn every word they hear 10 times, e.g., Booth & Waxman, 2003). Similarly, the threshold of 1,000 is used as an upper bound, because only a few words like dada and mama are actually uttered more than 1,000 times in a typical month of infant input, and all 9-month-olds learn these high-frequency words (Dale & Fenson, 1996). The threshold of 100 is a reasonable compromise between these upper and lower bounds. (NB: Frequency in the learner's lexicon was calculated by subtracting the threshold from the true input frequency.)

4.2.3. Baseline-DiBS
The Baseline-DiBS model parameters were estimated according to the within-word and across-word diphone counts in the training corpus.

In all cases, if the model encountered a previously unseen diphone in the test set, that is, one not expected given the training data, the diphone was treated as signaling a
word boundary. In the context of our analysis, this will cause an oversegmentation error whenever the diphone was actually word-internal.

4.3. Method

Each model was exposed cumulatively to the ''days'' of the training set. After each ''day'' of training, the model was tested on the entire test set.

4.3.1. Hard decisions: Maximum likelihood decision threshold
Formally speaking, a DiBS model estimates the probability of a word boundary given the surrounding diphone, symbolized p(# | xy). Being probabilistic, DiBS does not assign hard decisions as to the presence or absence of a word boundary, but rather a probability. However, the ''correct answer'' is not probabilistic: The speaker intended a particular sequence of words, and the word boundaries are underlyingly there, or not. Thus, for evaluation purposes, the probabilistic output of DiBS models is mapped to hard decisions using the maximum likelihood decision threshold θ = 0.5: If p(# | xy) > .5, a word boundary is identified; otherwise not. (The value 0.5 is called the maximum likelihood threshold because it results in the minimum number of total errors; that is, it is the threshold with maximum likelihood of yielding a correct decision.) This process is repeated for every diphone in a phrase, as exemplified in Fig. 2. By scoring in this way, we do not intend to claim that phonotactic segmentation in humans consists of hard decisions, as probabilities are a rich source of information that is likely to be useful in lexical access; hard decisions are used here for simple comparison with other studies.

4.3.2. Segmentation measures
For plotting, the dependent measure used is errors ⁄ word, distinguishing both undersegmentation and oversegmentation errors. For example, an oversegmentation error rate of 1 ⁄ 10 means that the listener will incorrectly split up 1 word out of every 10 tokens he or she hears. These measures strike us as highly informative because they indicate the error rate relative to the perceptual object the listener is attempting to identify: the word. In contrast, hearing that boundary precision is 89% does not make it clear how many word tokens the listener will oversegment. To facilitate comparison with other published studies, we also report boundary precision and recall, token precision and recall, and lexical precision and recall.
orthographic    top | dog
phonetic        t a p d a g
p(# | xy)       [ta] .01   [ap] .02   [pd] .99   [da] .01   [ag] .01
hard decision   t a p | d a g
                p(# | [ta]) = .01 < .5: no boundary
                p(# | [ap]) = .02 < .5: no boundary
                p(# | [pd]) = .99 > .5: boundary
                p(# | [da]) = .01 < .5: no boundary
                p(# | [ag]) = .01 < .5: no boundary

Fig. 2. Segmentation in DiBS.
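The decision rule of Fig. 2 is compact enough to state in code. The sketch below is ours, not the authors'; the model dictionary reproduces the illustrative probabilities from Fig. 2, and unseen diphones default to a boundary, following Section 4.2.3.

```python
def segment(phones, model, threshold=0.5):
    """Posit a boundary wherever p(# | xy) exceeds the maximum
    likelihood threshold; unseen diphones default to a boundary."""
    out = [phones[0]]
    for x, y in zip(phones, phones[1:]):
        if model.get(x + y, 1.0) > threshold:
            out.append("#")
        out.append(y)
    return "".join(out)

# The Fig. 2 example: only p(# | [pd]) = .99 exceeds .5, so the
# phone string "tapdag" comes out as "tap#dag".
fig2_model = {"ta": .01, "ap": .02, "pd": .99, "da": .01, "ag": .01}
assert segment("tapdag", fig2_model) == "tap#dag"
```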
Boundary precision is the probability of a word boundary given that the model posited one; boundary recall is the probability that the model posits a word boundary given that one occurred. Token precision is the probability that a form is a word token given that the model segmented it (posited boundaries on both sides); token recall is the probability that the model segmented a form, given that it occurred. Lexical precision and recall are analogous, except that wordform types are counted rather than tokens. Note that because DiBS is intended as a prelexical model, its task is not to identify words per se, but to presegment the speech stream in whatever manner offers maximal support to the downstream lexical process. Since DiBS is intended to get the learner ''off the ground'' in feeding word learning, it is eminently appropriate to assess it in terms of token precision ⁄ recall. However, as repeatedly noted above, DiBS is not intended to account for word learning on its own, so it is not appropriate to compare its lexical precision ⁄ recall against more complex models that include a lexical module. Type recall is not comparable for another reason: The test set here is about 200 times larger than in the comparison studies, so there is a significantly greater number of types.

4.4. Results

Fig. 3 illustrates the undersegmentation and oversegmentation error rates as a function of language exposure. To facilitate discussion and comparison, the other measures of performance from this simulation and Simulation 3 are reported in Table 3, along with values reported in other published studies. In addition, a small sample of the output (first line of the last test file) is given below in Table 4.

4.4.1. Confidence intervals
Because the undersegmentation and oversegmentation error rates represent probabilities, confidence intervals can be determined by assuming they are Bernoulli-distributed. The half-width of the 95% confidence interval for a Bernoulli distribution is no larger than .98 ⁄ √n, where n is the sample size (Lohr, 1999). As the test set contained 750,111 words, the error rates are accurate to ±0.1%.

4.5. Discussion

4.5.1. Rapid learning
The phrasal and baseline models reach near-ceiling performance within the first ''day'' of training, as evident from the nearly flat trajectory of these models in Fig. 3. That is, while these models do exhibit modest changes in error rates, these changes are small relative to the overall error rate. Only the lexical model continues to exhibit substantial gains with increasing language exposure, and the trajectory of the lexical-10 model suggests that these gains will asymptote eventually as well. For the phrasal and baseline models, most of what can be learned from the training data is learned within
Fig. 3. Undersegmentation and oversegmentation error rates of baseline and learning models. The x-axes represent the number of ''days'' of language exposure (approximately 25,000 words ⁄ ''day''). The y-axes represent the probability per word of making an undersegmentation error (dashed ⁄ empty) or oversegmentation error (heavy ⁄ filled). Panels indicate the baseline model (top), phrasal model (middle), and lexical model (bottom). In the lexical panel, the ''upper bound'' (frequency threshold of 10) is shown with squares; the ''lower bound'' (frequency threshold of 1,000) is shown with triangles.
the first day for these models; for the lexical model, much of what can be learned is learned within a month. The rapidity with which the models learn is especially important because it demonstrates that diphone-based segmentation is learnable not only in principle, but also in practice. The amount of language exposure required is well within a reasonable timescale of what infants actually hear.
Table 3
Segmentation performance of DiBS and other models

Paper  Corpus   Tokens  BP    BR    BF    TP    TR    TF    LP    LR    LF    Notes
Ba     BR87     33k     –     –     –     67.2  68.2  67.7  –     –     –     rep ⁄ GGJ
Br     BR87     33k     80.3  84.3  82.3  67    69.4  68.2  53.6  51.3  52.4  rep ⁄ F
F      BR87     33k     94.6  73.7  82.9  –     –     70.7  –     –     36.6
G      BR87     33k     89.2  82.7  85.8  –     –     72.5  –     –     56.2  bigram, p(#) = .05, a = 20
GGJ    BR87     33k     90.3  80.8  85.2  75.2  69.6  72.3  63.5  55.2  59.1  see JG
GGJ    BR87     33k     92.4  62.2  74.3  61.9  47.6  53.8  57    57.5  57.2
JG     BR87     33k     –     –     –     –     –     88    –     –     –
V      BR87     33k     81.7  82.5  82.1  68.1  68.6  68.3  54.5  57    55.7  bigram, rep ⁄ GGJ
V      BR87     33k     80.6  84.8  82.6  67.7  70.2  68.9  52.9  51.3  52    unigram, rep ⁄ GGJ
S      Korman   42k     –     –     –     –     –     –     75    –     –
D      CHILD    750k    88.3  82.1  85.1  73.7  69.6  71.6  14.6  53.6  23.0  base
D      CHILD    750k    87.4  48.9  62.7  53.4  35.2  42.5  5.6   50.8  10.1  phrasal
D      CHILD    750k    82.8  39.0  53.1  44.8  26.5  33.3  4.5   47.3  8.2   lexical-100
F      Buck     32k     89.7  82.2  85.8  –     –     72.3  –     –     37.4
F      Buck     32k     71    64.1  67.4  –     –     44.1  –     –     28.6  reduced
G      Buck     32k     74.6  94.8  83.5  –     –     68.1  –     –     26.7  rep ⁄ F
G      Buck     32k     49.6  95    65.1  –     –     35.4  –     –     12.8  reduced, rep ⁄ F
D      Buck     150k    87.4  76.7  81.7  66.4  59.6  62.8  30.7  53.7  39.1  base
D      Buck     150k    82.5  68.6  74.9  56.5  48.4  52.2  34.7  45.8  39.5  base, reduced
D      Buck     150k    80.5  47.6  59.8  44.1  28.8  34.9  16.5  37.2  22.8  phrasal
D      Buck     150k    76.0  44.1  55.8  39.1  25.2  30.7  21.1  29.4  24.6  phrasal, reduced
F      Switch   34k     90    75.5  82.1  –     –     66.3  –     –     33.7
F      Switch   34k     91.3  80.5  85.5  –     –     72    –     –     37.4  orthographic
G      Switch   34k     73.9  93.5  82.6  –     –     65.8  –     –     27.8  rep ⁄ F
G      Switch   34k     73.1  92.4  81.6  –     –     63.6  –     –     28.4  ortho, rep ⁄ F
F      Arab     30k     88.1  68.5  77.1  –     –     56.6  –     –     40.4
G      Arab     30k     47.5  97.4  63.8  –     –     32.6  –     –     9.5   rep ⁄ F
F      Spanish  37k     89.3  48.5  62.9  –     –     38.7  –     –     16.6
G      Spanish  37k     69.2  92.8  79.3  –     –     57.9  –     –     17    rep ⁄ F
S      Weijer   25k     –     –     –     –     –     –     75    –     –

Note. Column header: B ⁄ T ⁄ L indicates boundary ⁄ token ⁄ lexical; P ⁄ R ⁄ F indicates precision ⁄ recall ⁄ F-score (e.g., BR = boundary recall). Paper key: Ba = Batchelder (2002), Br = Brent (1999), D = DiBS, F = Fleck (2008), G = Goldwater (2006), GGJ = Goldwater et al. (2009), JG = Johnson and Goldwater (2009), S = Swingley (2005), V = Venkataraman (2001); ''rep ⁄ X'' indicates the results are reported in paper X.
These results may help to explain a puzzle of the acquisition literature: why the onset of phonotactic segmentation coincides with or shortly follows the acquisition of phonetic categories. The DiBS model crucially assumes that infants possess a categorical representation of speech; however, as long as such a categorical representation is available, only a minuscule amount of language exposure is required to estimate the relevant phonotactic segmentation statistics. Thus, DiBS predicts that phonotactic segmentation should become evident shortly after infants begin to exhibit language-specific phonetic categorization—precisely what occurs. While rapid trainability is presumably not specific to DiBS, to our knowledge we are the first to draw attention to this explanation of why phonotactic segmentation emerges shortly after phonetic categorization.
Table 4
Sample output of learning models

ortho    ''If you want to eat something give you a cookie but take these''
correct  If ju wQnt tu it sVmTIN gIv ju 1 kUkI bVt t1k D5z
base     If ju wQnt tu itsVmTINgIv ju 1kUkI bVt t1k D5z
phrasal  If ju wQnttuitsVmTINgIv ju 1kUkIbVtt1k D5z
lex-10   IfjuwQnt tuitsVmTIN gIvju1kUkI bVt t1kD5z

Note. Spaces indicate true ⁄ posited word boundaries.
4.5.2. Undersegmentation
While the baseline and learning models varied considerably in the undersegmentation error rate, they consistently exhibited a low oversegmentation error rate from the beginning of training. In every case, the oversegmentation error rate was below 10%, meaning less than 1 oversegmentation error per 10 words. In fact, the learning models make fewer oversegmentation errors than the baseline model (the baseline model exhibits overall higher accuracy because its undersegmentation error rate is much lower). This overall pattern, in which some undersegmentation errors are made, but very few oversegmentation errors are made, can be characterized as an overall pattern of undersegmentation. These results show that undersegmentation is the predicted perceptual outcome for all DiBS models considered. We will return to this point in the general discussion.

In summary, Simulation 1 demonstrated three key findings. First, parameters of the Cairns et al. (1997) diphone model can be estimated with some accuracy from information that is plausibly attributable to infants. Second, only a small amount of language exposure is required to make these estimates. Phrasal-DiBS reaches its asymptote with less input than an infant might receive in a typical day; Lexical-DiBS continues to improve with increasing language exposure. Its asymptotic performance is similar to Phrasal-DiBS, indicating that a small lexicon does not supply greatly more information than was already present at utterance boundaries. However, the lex-1000 model already achieves better performance than Phrasal-DiBS within a month, an indication that exploiting the phone statistics of high-frequency words can improve segmentation performance. Finally, all models exhibit undersegmentation, characterized by an error rate of less than 1 oversegmentation error per 10 words.
model is robust if small errors in the estimate of p(#) cause at most small changes in the segmentation pattern. The ‘‘reasonable’’ range for p(#) that infants might consider is constrained by the relation between the phrase-medial word boundary probability and average word length. For example, if phrases contain an average of four words, and words contain an average of four phones, there will be three phrase-medial word boundaries per 16 phones. Average word length has natural upper and lower bounds, which correspondingly bound p(#). As discussed above, consideration of the Minimal Prosodic Word (McCarthy & Prince, 1986 ⁄ 1996) generates an upper bound on p(#) of !.33. A reasonable upper bound is the longest word that infants are observed to learn (perhaps owing to memory ⁄ coding limitations); for English, this is 6–8 phones (8: breakfast, 7: toothbrush, telephone, 6: grandma, peekaboo, stroller, cheerios, outside; Dale & Fenson, 1996), so p(#) > 1 ⁄ 8 ! .125. Thus, infants might reasonably consider the range for the context-free probability of a word boundary to be between 1 ⁄ 8 and 1 ⁄ 3. Simulation 2 assesses DiBS’ robustness to estimation errors for p(#). 5. Simulation 2 5.1. Input The stimuli consisted of the training and test sets of Simulation 1. 5.2. Models The Phrasal-DiBS and Lexical-DiBS models of Simulation 1 were used. 5.3. Method Instead of varying language exposure, the free parameter p(#) was varied in equal steps of .02 from .16 to .40, corresponding to a range of average word lengths from about 2.5 to about 6 phones. Results are reported from the final ‘‘day,’’ that is, after exposure to the entire training set. 5.4. Results The results are shown in Fig. 4, which plots under- and oversegmentation error rates as a function of p(#). In addition, a sensitivity analysis is presented in Table 5. Table 5 reports the undersegmentation and oversegmentation error rates with the correct value for p(#) (columns UE and OE), and beside these columns it reports the range of values for p(#) that will result in a 5% absolute change to the undersegmentation ⁄ oversegmentation error rate. For example, the entries in the far right column indicate that even when the learner estimates p(#) to be as low ⁄ high as .16 ⁄ .4, the absolute undersegmentation ⁄ oversegmentation rate does not decrease ⁄ increase by more than 5%.
Fig. 4. Undersegmentation and oversegmentation error rates as a function of the probability of a phrase-medial word boundary. The x-axes represent p(#) and the y-axes represent undersegmentation and oversegmentation error rates, as in Fig. 3.
5.5. Discussion

The results of Simulation 2 suggest the learning model is robust to estimation errors. The oversegmentation error rate is particularly robust: varying p(#) yields significant changes in the undersegmentation rate, but over a wide range the oversegmentation rate stays small (near or under 10%).

The phonetic corpus in Simulations 1 and 2 used a canonical, invariant pronunciation for each word. Since pronunciation variation is a general fact of speech (Johnson, 2004;
Table 5
Sensitivity analysis of Simulation 2 results

Model       UE      OE     −5% < ΔUE < +5%
phrasal     39.49   5.41   .22 < p(#)
lex-10      50.25   8.47   .20 < p(#)
lex-100     47.05   6.24   .18 < p(#)
lex-1000    37.64   4.4    .20 < p(#)
.1, but crucially, there was a reliable interaction, F(1, 36) = 5, p < .05. The factors interacted because the high-similarity 4-year-olds matched the prime dative (M = 1.32) more than they mismatched (M = 0.52), t(18) = 2.22, p < .05, d = 0.51, whereas the low-similarity condition showed no such effect (Match M = 0.58; Mismatch M = 0.79), t(18) = −0.78, p > .4, d = −0.18. We also conducted a 2 (Similarity) × 2 (Construction Match) mixed-measures ANOVA for the 5-year-olds. Here, there was neither a main effect of similarity nor a similarity by construction matching interaction, ps > .4, but there was a main effect of construction matching, F(1, 46) = 29.39, p < .0001. This pattern is obtained because both similarity groups showed the priming effect. The high- and low-similarity children matched (High M = 1.48; Low M = 1.28) more frequently than they mismatched (High M = 0.39; Low M = 0.44), t(22) = 4.08, p < .01, d = 0.85 and t(24) = 3.56, p < .01, d = 0.71, respectively.
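As a gloss on the match-versus-mismatch comparisons above, the sketch below shows how such a paired test might be computed. It is our illustration, not the authors' analysis code, and the standardized-difference formula for d is one common choice for paired data; the paper does not state which variant was used.

    # Minimal sketch (ours) of a paired match-vs-mismatch comparison.
    import numpy as np
    from scipy.stats import ttest_rel

    def paired_priming_test(match_counts, mismatch_counts):
        match = np.asarray(match_counts, dtype=float)
        mismatch = np.asarray(mismatch_counts, dtype=float)
        t, p = ttest_rel(match, mismatch)   # paired t test across children
        diff = match - mismatch
        d = diff.mean() / diff.std(ddof=1)  # one common d for paired designs
        return t, p, d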
To further show the need for shared surface similarity if 4-year-olds are to map both semantic and syntactic relations, even in the high-similarity condition, we analyze the dependence of structural priming on repeating a verb from a prime utterance (within the same block). The analogous analyses could not be done for the low-similarity condition, because verb repetition did not occur, given the dissimilarity between scenes within a block. Because children differed in the number of trials on which they repeated verbs, proportions are analyzed instead of sums. The proportions are calculated as the number of matches or mismatches divided by the total number of utterances (i.e., matches, mismatches, and nondatives) when a verb was repeated. The younger children only showed full structural priming when they repeated verbs (see Fig. 4). When repeating verbs, the 4-year-olds matched (M = 0.92) more frequently than
Fig. 3. Full structural priming: Contrasting target utterances that matched the prime dative alternate with mismatching datives, out of three trials. Means and standard errors are shown.
Fig. 4. Verb repetition and full structural priming: Contrasting the proportion of target utterances that matched the prime dative alternate with mismatching datives. Means and standard errors are shown.
mismatched (M = 0.08), t(12) = 5.5, p < .01, d = 1.53. When they did not repeat verbs, they did not match (M = 0.28) significantly more than they mismatched (M = 0.16); t(18) = 0.95, p > .3, d = 0.22. In contrast, 5-year-olds in the high-similarity condition showed priming regardless of verb repetition, matching more frequently (M = 0.81) than mismatching (M = 0.11), t(17) = 4.57, p < .01, d = 1.08, when repeating verbs and matching (M = 0.38) more frequently than mismatching (M = 0.14) when not repeating verbs, t(22) = 2.08, p < .05, d = 0.43. When the 4-year-olds in the high-similarity condition did not repeat verbs, they still showed semantic priming, that is, a higher proportion of total dative responses (M = 0.44), when compared to baseline (M = 0.15), t(30) = 2.38, p < .05, d = 0.86.
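The proportion measure defined above (matches or mismatches over all codable utterances on verb-repetition trials) is straightforward; a minimal sketch with hypothetical counts, ours rather than the authors':

    # Minimal sketch (ours): per-child priming proportions on
    # verb-repetition trials.
    def priming_proportions(n_match, n_mismatch, n_nondative):
        total = n_match + n_mismatch + n_nondative
        return n_match / total, n_mismatch / total

    match_p, mismatch_p = priming_proportions(11, 1, 0)  # toy counts, not the data
    assert round(match_p, 2) == 0.92 and round(mismatch_p, 2) == 0.08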
7. Discussion

In keeping with our predictions, 4-year-olds only showed full structural priming when there was high surface similarity between prime and target scenes and utterances, that is, when the scenes came from the same event category. The simpler semantic-relations-only priming required neither verb repetition nor high similarity. The 4-year-olds showed this semantic-role-set priming in the low-similarity condition, and in the high-similarity condition when not repeating verbs. This means that a prime in one dative alternate increased the likelihood of the use of both dative alternates equally in a subsequent scene description. This is predicted by our account, but it is unaccounted for by other models or theories (e.g., Chang et al., 2006; Pickering & Branigan, 1998; Savage et al., 2003; Shimpi et al., 2007). In addition, the finding that the 5-year-olds showed full structural priming in all conditions, independent of verb repetition, suggests that the ability to retrieve and map more sophisticated relational linguistic structures develops between ages 4 and 5.

In sum, we proposed that children store the form of utterances, and then retrieve and map their relational structure to guide formation of further utterances. Because 4-year-olds are worse at this than 5-year-olds, they require more shared surface similarity across domains to enable complex relational mappings or mappings where relational knowledge is limited (Andrews & Halford, 2002; Gentner & Rattermann, 1991; Rattermann & Gentner, 1998a,b).
8. Implications for grammatical development

Previous studies of children's structural priming have been used to argue about whether children have abstract syntactic representations, that is, whether constructions have been generalized across lexical items. The reasoning is that, if structural priming is shown without verb repetition, then the representation of constructions is considered verb-general. There are two issues to discuss here.

The first issue is that our findings appear inconsistent with some others in this literature, creating important questions for future research. Thothathiri and Snedeker (2008) show between-verb priming in 4-year-olds' comprehension of datives, while we only show it
within-verb for 4-year-olds' production. However, comprehension often appears ahead of production. In fact, Chang et al.'s (2006) model simulations show comprehension tasks maturing earlier than production tasks using a single learning mechanism. Still, the relation between how priming guides language comprehension and production remains an important issue to investigate further. In addition, Shimpi et al. (2007) show dative priming in 4-year-olds' production even though verbs used in the target utterances differed from the verbs in the primes. However, there are many methodological differences between our study and theirs. For one, their children received 10 primes before producing 10 targets, creating more of a chance for priming to build across trials than in ours. In addition, it is less clear how to analyze for verb repetition and scene similarity contingencies given how their scenes were blocked. A complete theory would have to account for both sets of results given these methodological differences. This warrants further investigation.

The second issue to address is that, regardless of these apparent inconsistencies, it is unclear whether structural priming reflects the state of a child's grammatical knowledge; that is, it is not necessarily the result of "tapping into" their preexisting syntactic representations, abstract or lexeme specific. Our results can be explained by children's use of the surface form of the utterances in the experiment to guide further sentence production. This can be accounted for by a relational reasoning exemplar model (Tomlinson & Love, 2006) that needs no permanent grammatical knowledge, but simply represents the semantic role and word order relations of the exemplar utterances. We are not denying that children have grammatical knowledge beyond this; rather, such grammatical knowledge may be unnecessary to account for the repetition of structure in turn-taking scene description tasks. However, our results are suggestive of the 4-year-olds having less syntactic knowledge than the 5-year-olds, because it is possible that the inability to map syntactic relations is due not to processing capacity limits but to a dearth of relational knowledge (see Rattermann & Gentner, 1998a,b).

In addition, it is possible that behavior in these tasks is reflective of grammatical knowledge, but that this knowledge changes throughout the course of the experiment. Analogical processes are also learning mechanisms (e.g., Kotovsky & Gentner, 1996). Gentner and Medina (1998) argue that analogical comparison is critical in rule learning. In Shimpi et al. (2007), the children hear and repeat 10 primes and then describe 10 target scenes. It is rare in natural speech that 10 straight utterances are of the same construction; it is just this sort of sequence that would allow an analogical learning mechanism to abstract the grammatical rule (e.g., Tomlinson & Love, 2006). Structural priming as learning is consistent with Chang et al. (2006). However, that model uses a different learning mechanism and only focuses on syntactic priming (as defined here). Evidence for Chang and colleagues' implicit learning account comes from long-term effects of priming, and they suggest that short-term priming effects may be achieved by a different mechanism. Pickering and Garrod (2004) propose that these short-term effects are the result of conversants aligning their grammatical representations and are in the service of making dialogue more fluent.
It is quite possible that children achieve this alignment through relational mapping. Our results are the first to show that analogical
mapping mechanisms are involved in dialogue; however, more research is needed to confirm whether these mechanisms help to achieve fluency or aid grammatical acquisition. How analogical mapping and learning mechanisms relate to developing grammatical knowledge is a crucial area for future research. Analogical processing is implicated in many domains of cognition, and so our model attempts both to make language tasks continuous with other aspects of cognition and to drive predictions. This perspective's first novel prediction, confirmed in this paper, was that semantic representations (independent of sequence) are primed in structural priming tasks. In past developmental research, syntax alone was the focus. We will continue to investigate how much of language can be explained or predicted from this perspective, building on this important first step.
Notes

1. See Goldberg (1995) for discussion of fine-grained semantic differences between dative alternates.
2. These constructions have semantics constrained to describing transfer events, making syntactic-only priming impossible.
3. Typically in the model, priming is a result of thematic role sequence competition, but the model can also account for priming of syntactic structure that has been de-correlated from thematic role sequence (Bock & Loebell, 1990). However, this is again about priming one specific syntactic alternate at the cost of the other.
4. To confirm that both the nouns and verbs were more similar to each other in the high-similarity condition than in the low-similarity condition, we obtained their Latent Semantic Analysis similarity scores (Landauer & Dumais, 1997). We compared the postverbal nouns of one prime utterance to the postverbal nouns used in the other prime utterance from the same block of trials for each similarity condition (the same few subject nouns were used across all primes). The average similarity (on a scale of −1 to 1) for the high-similarity nouns was 0.41, while the average for the low-similarity nouns was 0.06. The analogous analysis for the verbs showed the high-similarity verbs had an average score of 0.43, while the low-similarity verbs had an average score of 0.23.
5. Another way to analyze the data is to perform a 2 (prime construction) × 2 (target construction) ANOVA, in which priming is shown by a significant interaction. These analyses show the same pattern of effects as the analyses presented.
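Note 4's similarity check can be approximated with any vector-space semantic model. The sketch below is ours, assuming precomputed word vectors rather than the actual LSA space of Landauer and Dumais (1997); it averages cosine similarity (the −1 to 1 scale in the note) over cross-utterance word pairs.

    # Minimal sketch (ours): mean pairwise cosine similarity between the
    # content words of two prime utterances, given precomputed vectors.
    import numpy as np

    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    def mean_pairwise_similarity(vectors_a, vectors_b):
        sims = [cosine(u, v) for u in vectors_a for v in vectors_b]
        return sum(sims) / len(sims)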
Acknowledgments Support for this research was provided by NSF grant 0447018 to the third author. We thank Kelli Gross and April Hernandez for their able research assistance and Franklin Chang, Adele Goldberg, Art Markman, and three anonymous reviewers for valuable discussion of the work. We are grateful to the children who participated, their families, and Julie Wall of the Children’s Research Laboratory for their indispensable roles in this research.
References

Andrews, G., & Halford, G. S. (2002). A cognitive complexity metric applied to cognitive development. Cognitive Psychology, 45, 153–219.
Bencini, G. M. L., & Valian, V. (2008). Abstract sentence representation in 3-year-olds: Evidence from comprehension and production. Journal of Memory and Language, 59, 97–113.
Bock, J. K. (1986). Structural persistence in language production. Cognitive Psychology, 18, 355–387.
Bock, K., & Griffin, Z. M. (2000). The persistence of structural priming: Transient activation or implicit learning? Journal of Experimental Psychology: General, 129, 177–192.
Bock, J. K., & Loebell, H. (1990). Framing sentences. Cognition, 35, 1–39.
Chang, F., Bock, K., & Goldberg, A. E. (2003). Can thematic roles leave traces of their places? Cognition, 90, 29–49.
Chang, F., Dell, G. S., & Bock, K. (2006). Becoming syntactic. Psychological Review, 113, 234–272.
Chomsky, N. (1957). Syntactic structures. The Hague, The Netherlands: Mouton.
Conwell, E., & Demuth, K. (2007). Early syntactic productivity: Evidence from dative shift. Cognition, 103, 163–179.
Falkenhainer, B., Forbus, K. D., & Gentner, D. (1989). The structure-mapping engine: Algorithm and examples. Artificial Intelligence, 41, 1–63.
Gentner, D. (1983). Structure-mapping: A theoretical framework for analogy. Cognitive Science, 7, 155–170.
Gentner, D., & Markman, A. B. (1997). Structure mapping in analogy and similarity. American Psychologist, 52(1), 45–56.
Gentner, D., & Medina, J. (1998). Similarity and the development of rules. Cognition, 65, 263–297.
Gentner, D., & Rattermann, M. J. (1991). Language and the career of similarity. In S. A. Gelman & J. P. Byrnes (Eds.), Perspectives on thought and language: Interrelations in development (pp. 225–277). London: Cambridge University Press.
Goldberg, A. E. (1995). Constructions: A construction grammar approach to argument structure. Chicago: University of Chicago Press.
Hummel, J. E., & Holyoak, K. J. (2003). A symbolic-connectionist theory of relational inference and generalization. Psychological Review, 110, 220–264.
Huttenlocher, J., Vasilyeva, M., & Shimpi, P. (2004). Syntactic priming in young children. Journal of Memory and Language, 50, 182–195.
Jackendoff, R. (1990). Semantic structures. Cambridge, MA: MIT Press.
Kaschak, M. P., & Borreggine, K. L. (2008). Is long-term structural priming affected by patterns of experience with individual verbs? Journal of Memory and Language, 58, 862–878.
Kotovsky, L., & Gentner, D. (1996). Comparison and categorization in the development of relational similarity. Child Development, 67, 2797–2822.
Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato's problem: The Latent Semantic Analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review, 104, 211–240.
Larkey, L. B., & Love, B. C. (2003). CAB: Connectionist analogy builder. Cognitive Science, 27, 781–794.
Loewenstein, J., & Gentner, D. (2001). Spatial mapping in preschoolers: Close comparisons facilitate far mappings. Journal of Cognition and Development, 2, 189–219.
Pickering, M. J., & Branigan, H. P. (1998). The representation of verbs: Evidence from syntactic priming in language production. Journal of Memory and Language, 39, 633–651.
Pickering, M. J., & Garrod, S. (2004). Toward a mechanistic psychology of dialogue. Behavioral and Brain Sciences, 27, 169–226.
Rattermann, M. J., & Gentner, D. (1998a).
The effect of language on similarity: The use of relational labels improves young children’s performance in a mapping task. In K. Holyoak, D. Gentner & B. Kokinov (Eds.),
Advances in analogy research: Integration of theory & data from the cognitive, computational, and neural sciences (pp. 274–282). Sofia: New Bulgarian University.
Rattermann, M. J., & Gentner, D. (1998b). More evidence for a relational shift in the development of analogy: Children's performance on a causal-mapping task. Cognitive Development, 13, 453–478.
Savage, C., Lieven, E., Theakston, A., & Tomasello, M. (2003). Testing the abstractness of children's linguistic representations: Lexical and structural priming of syntactic constructions. Developmental Science, 6, 557–567.
Savage, C., Lieven, E., Theakston, A., & Tomasello, M. (2006). Structural priming as implicit learning in language acquisition: The persistence of lexical and structural priming in 4-year-olds. Language Learning and Development, 2, 27–49.
Shimpi, P., Gamez, P., Huttenlocher, J., & Vasilyeva, M. (2007). Syntactic priming in 3- and 4-year-old children: Evidence for abstract representations of transitive and dative forms. Developmental Psychology, 43, 1334–1346.
Song, H., & Fisher, C. (2004). Structural priming in three-year-old children. Paper presented at the 29th Annual Boston University Conference on Language Development, Boston, MA.
Thothathiri, M., & Snedeker, J. (2008). Syntactic priming during language comprehension in three- and four-year-old children. Journal of Memory and Language, 58, 188–213.
Tomlinson, M. T., & Love, B. C. (2006). From pigeons to humans: Grounding relational learning in concrete examples. Twenty-First National Conference on Artificial Intelligence (AAAI-2006), USA, 17, 136–141.
Cognitive Science 35 (2011) 171–183
Copyright © 2010 Cognitive Science Society, Inc. All rights reserved. ISSN: 0364-0213 print / 1551-6709 online
DOI: 10.1111/j.1551-6709.2010.01141.x
Iconic Gestures Prime Words De-Fu Yap, Wing-Chee So, Ju-Min Melvin Yap, Ying-Quan Tan, Ruo-Li Serene Teoh Department of Psychology, National University of Singapore Received 30 August 2009; received in revised form 22 June 2010; accepted 22 June 2010
Abstract

Using a cross-modal semantic priming paradigm, both experiments of the present study investigated the link between the mental representations of iconic gestures and words. Two groups of participants performed a primed lexical decision task where they had to discriminate between visually presented words and nonwords (e.g., flirp). Word targets (e.g., bird) were preceded by video clips depicting either semantically related (e.g., a pair of hands flapping) or semantically unrelated (e.g., drawing a square with both hands) gestures. The duration of the gestures was on average 3,500 ms in Experiment 1 but only 1,000 ms in Experiment 2. Significant priming effects were observed in both experiments, with faster response latencies for related gesture–word pairs than unrelated pairs. These results are consistent with the idea of interactions between the gestural and lexical representational systems, such that mere exposure to iconic gestures facilitates the recognition of semantically related words.

Keywords: Gesture; Semantic priming; Lexical decision; Lexical processing; Semantic representation
1. Introduction

Speakers from all cultural and linguistic backgrounds move their hands and arms when they talk (Feyereisen & de Lannoy, 1991). For example, a speaker opens her palm and moves her hand forward while saying I gave him a present. Such hand and arm movements are collectively referred to as gestures (McNeill, 1992). Even though there are various linguistic and structural differences between gestures and speech, previous findings generally support the idea that gestures and speech are tightly integrated temporally, semantically,
Correspondence should be sent to So Wing Chee, Department of Psychology, National University of Singapore, BLK AS4, 9 Arts Link, Singapore 117572. E-mail: [email protected]
and pragmatically (Kita, 2000; McNeill, 1992; So, Kita, & Goldin-Meadow, 2009). Temporally, Morrel-Samuels and Krauss (1992) have shown that gestures are initiated either just before or simultaneously with their lexical affiliate (i.e., the word whose retrieval the gesture facilitates), indicating that the gestural and language systems are linked during communication (see also Mayberry & Jaques, 2000). Semantically, the meaning of co-expressed gestures goes hand in hand with the meaning of the accompanying speech (e.g., Kita & Ozyurek, 2003; Ozyurek, Kita, Allen, Furman, & Brown, 2005). Pragmatically, gesture packages preverbal spatio-motoric messages into units that are suitable for speaking (Kita, 2000).

This research further explores the integrated link between gestures and language by examining the interplay between gestures and the lexical processing system. Specifically, we examine whether the presentation of a gesture (e.g., a two-hands-flapping gesture) would activate a semantically related word (e.g., "bird" or "flying"). Although there are many different types of gestures, we focus on iconic gestures, because they express meaning that is related to the semantic content of the speech they accompany (Krauss, Chen, & Chawla, 1996; McNeill, 2005) and resemble the physical properties and movement of the objects or actions being described in speech (McNeill, 2005).

We are primarily interested in whether the presentation of an iconic gesture (e.g., a pair of hands flapping) would prime a semantically related word (e.g., bird or flying), using a cross-modal semantic priming paradigm. Semantic priming refers to the facilitation in the cognitive processing of information after recent exposure to related information (Neely, 1991), and it is most often studied using primed visual word recognition paradigms (see McNamara, 2005; Neely, 1991, for excellent reviews). In these tasks, the participants are presented with a context word (e.g., cat) that is followed by a target letter string which is either semantically related (e.g., dog) or unrelated (e.g., table). In the primed speeded naming task, participants read the target letter string aloud, whereas in the primed lexical decision task, they have to decide via a button press whether the letter string forms a real word or a nonword (e.g., flirp). In general, words preceded by semantically related primes are responded to faster than words preceded by semantically unrelated primes. This effect, which is extremely robust, is known as the semantic priming effect.

In our experiments, we use a cross-modal priming paradigm where context words are replaced with gestures, with the aim of testing the hypothesis that gestures prime semantically related words. There are strong theoretical motivations for this hypothesis. For example, Krauss (1998) proposed that our memorial representations are encoded in multiple formats or levels, including gestures and their lexical affiliates. McNeill (1985) further argued that gesture and language should be viewed as a single system within a unified conceptual framework. In other words, language and gestures, despite being represented in different formats,1 can be seen as elements of a single tightly integrated process.

Interestingly, there is already some provocative evidence in the literature that gestures do prime words. To our knowledge, Bernardis, Salillas, and Caramelli (2008) report the first study demonstrating gesture–word priming effects.
In their study, participants were visually presented with gestural primes, followed by either related or unrelated lexical targets, and they were then asked to name the lexical targets. On average, each gesture clip lasted for 3,672 ms (ranging from 2,320 to 4,680 ms). Bernardis et al. then measured the time taken to name each lexical target and compared this against a ‘‘neutral’’ baseline latency, which was estimated using a
separate group of participants who simply had to name aloud the lexical targets (i.e., gestural primes were not presented). The use of a baseline allows researchers to estimate facilitation (faster latencies for the related compared with the neutral condition) and inhibition (slower latencies for the unrelated compared with the neutral condition) effects, which are supposed to map respectively onto the fast automatic and slower expectancy-based aspects of priming (see Neely, 1977; Posner & Snyder, 1975). Bernardis and colleagues reported a significant inhibition effect; that is, participants took longer to name a word when it was unrelated to the gesture, compared with the neutral baseline. Intriguingly, they did not find a facilitation effect; that is, there was no significant difference between the related and neutral conditions. On the basis of the nonreliable facilitation effect, they concluded that "the meaning of iconic gestures did not prime the same-meaning words" (p. 1125). However, when the participants were instructed to read the target words and form mental images of those words, a facilitation effect was found (Bernardis & Caramelli, 2009).2

Yet the conclusion drawn by Bernardis, Salillas, and Caramelli (2008) should be interpreted with some caution, for two main reasons. First, each gesture was displayed for more than 3,000 ms. This might allow sufficient time for the participants to actually name or label the gestures. Naming of gestures would, in turn, activate the semantically related words, which might then serve as lexical primes. Hence, the facilitation effect in the naming task could be attributed to the lexical primes instead of the gesture primes. In order to discourage participants from naming the gestures, the duration of each gesture clip should be substantially shorter than 3,000 ms. Second, Bernardis et al.'s conclusion rests exclusively on the nonsignificant facilitation effect. As discussed earlier, the magnitude of the facilitation effect reflects the difference between the related and neutral conditions, and it is therefore modulated by the specific neutral baseline selected. While the neutral baseline putatively allows researchers to disentangle the automatic and controlled influences of priming, there is evidence that the use of neutral baselines is associated with a number of problems. Most important, Jonides and Mack (1984) have convincingly demonstrated that neutral baselines can artifactually overestimate or underestimate facilitation and inhibition effects, depending on the specific neutral prime (e.g., BLANK or XXXXX) selected. Related to this, Forster (1981) pointed out that evidence for small (or null) facilitation effects is necessarily ambiguous, because facilitation effects can be underestimated by one's choice of the neutral condition. One might contend that Bernardis et al. neatly circumvented the problem of using a neutral prime by estimating the neutral condition from an independent sample that saw only the target words (without the context of the gesture primes). However, in order to validly estimate facilitation and inhibition, it is necessary to ensure that the neutral and cuing conditions are "identical with respect to all processing consequences of the cue except the specific preparatory effect elicited by the informative cue" (Jonides & Mack, 1984, p. 33). Clearly, the processing demands associated with naming isolated words versus naming words preceded by a gesture are not identical.
For instance, it is likely that there are differences in target encoding times for the neutral and cuing conditions. Specifically, primes are informative and take time to process. To the extent that related primes are not fully encoded by the time the target appears, this adds additional processing time to target processing in
related trials. Given that this additional processing is not relevant to trials in the neutral condition, this leads to an underestimation of facilitation effects. Given that finding an appropriate neutral baseline (i.e., one that is equated on all dimensions with the prime conditions) is virtually impossible, Jonides and Mack (1984) strongly recommended that the neutral condition be excluded. Importantly, they also pointed out that if a researcher is primarily interested in whether a related condition facilitates performance, it is sufficient to examine the overall priming effect (Neely, 1977), defined as the difference between the unrelated and related conditions. Interestingly, Bernardis et al. (2008) did find a significant overall priming effect. Specifically, targets preceded by a related gesture, compared with targets preceded by an unrelated gesture, were named 39 ms faster, p < .001 (p. 1118). In summary, our reexamination of Bernardis et al.'s (2008) results indicates that, contrary to their claim, their basic findings are actually consistent with the idea that iconic gestures prime semantically related words.

Our present study aimed to reexamine the cross-modal gesture–word priming effect, addressing the two major limitations of Bernardis et al.'s (2008) study discussed above. We conducted two experiments that investigated whether gestures would prime responses to semantically related words. In both experiments, we measured the overall priming effect. In Experiment 1, the average duration of each gesture clip was 3,500 ms, to replicate the overall priming effect found in Bernardis et al.'s study. In Experiment 2, the duration of each gesture clip was shortened to 1,000 ms in order to minimize the likelihood that participants strategically recode the gestural primes into verbal labels and use the latter to prime the lexical targets. Unlike Bernardis et al., we used a different word recognition paradigm: the primed lexical decision task. Doing this is critical because different word recognition tasks produce task-specific effects (see Balota & Chumbley, 1984; Yap & Balota, 2007), and it is important to establish effects across different tasks (Grainger & Jacobs, 1996). In addition, in both of our experiments we used a different set of gesture primes and targets than Bernardis et al., which helps provide converging evidence for the generality of the effect.
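The three effect definitions in play here are simple contrasts; the sketch below (ours) restates them, with toy latencies standing in for reported means.

    # Minimal sketch (ours): facilitation, inhibition, and overall priming,
    # computed from mean response latencies in milliseconds.
    def facilitation(neutral_rt, related_rt):
        return neutral_rt - related_rt    # positive: related faster than neutral

    def inhibition(neutral_rt, unrelated_rt):
        return unrelated_rt - neutral_rt  # positive: unrelated slower than neutral

    def overall_priming(unrelated_rt, related_rt):
        return unrelated_rt - related_rt  # Neely (1977): unrelated minus related

    assert overall_priming(739, 700) == 39  # toy values echoing the 39 ms effect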
2. Experiment 1

2.1. Method

2.1.1. Participants
Sixty-three undergraduates from the Research Participant Program at the National University of Singapore participated in this experiment for course credit. They were all native English speakers with normal or corrected-to-normal vision.

2.1.2. Materials
2.1.2.1. Gesture primes: To select the gestures that conveyed meaningful semantic information for this study and pair them up with semantically related and nonrelated words, a separate group of 45 English-speaking undergraduates from the National University of
Singapore were presented with 80 silent videotaped gestures, each lasting for 3–4 s on a computer screen in a speechless context. They were given 7 s3 to write a single word that best described the meaning of each gesture. As speech was not available, the participants derived meaning from gestures according to their physical forms and movements. We determined that a gesture had a common interpretation if its meaning was agreed upon by 70% of the participants (see also Goh, Suárez, Yap, & Tan, 2009). Our findings showed that 40 of the gestures had consistency rates of 70% or above. Reliability was assessed by a second coder who analyzed all the responses and calculated the consistency rate of each gesture; the interrater reliability was .946. These 40 gestures served as gesture primes in our study. See the Appendix for a list of gestures and their associated meanings given by the participants.

2.1.2.2. Lexical targets: The lexical targets consisted of words and nonwords. In terms of words, we used the gesture meanings derived from the participants and matched each meaning with the words listed in Nelson, McEvoy, and Schreiber's (1998) free association norms. We then selected the strongest associate in Nelson et al.'s norms for each gesture. For example, consider the following gesture, whereby the index and middle fingers of both hands form Vs above the head, and the fingers are flexing and unflexing. Most participants classified this gesture as rabbit. The strongest associate of rabbit is bunny, and we thus selected bunny to be the lexical target for the rabbit gestural prime. The Appendix provides the prime–target associative strength for all our stimuli (M = 0.27). The nonwords were matched to the semantically related words from the Nelson et al. (1998) norms with respect to (a) word length, (b) number of orthographic neighbors, and (c) number of syllables, according to the ARC Nonword Database (Rastle, Harrington, & Coltheart, 2002) and the English Lexicon Project4 (Balota et al., 2007). Please see Table 1 for summary statistics for the lexical targets across the three conditions (see Appendix for the list of nonwords).

Table 1
Stimulus characteristics of the words used in Experiment 1

Factor                   Minimum   Maximum   M (SD)
Length                   3         8         4.65 (1.42)
Orthographic neighbors   0         20        6.65 (5.89)
Syllables                1         3         1.23 (0.480)

Each participant was presented with 10 related and 10 unrelated prime–target pairs, and 20 nonwords that were preceded by gesture primes. The primes in the three conditions (related, unrelated, and nonword) were counterbalanced across participants such that each prime had an equal chance of appearing in each of the three conditions. The order of presentation was randomized anew for each participant.
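Target selection from the free-association norms amounts to a max-by-strength lookup. A minimal sketch (ours), with toy association strengths; the real values come from Nelson et al. (1998), and the rabbit-to-bunny pair is the example in the text.

    # Minimal sketch (ours): pick each gesture meaning's strongest associate
    # from free-association norms stored as {cue: [(associate, strength), ...]}.
    norms = {
        "rabbit": [("bunny", 0.45), ("hare", 0.10), ("carrot", 0.08)],  # toy strengths
    }

    def strongest_associate(cue):
        return max(norms[cue], key=lambda pair: pair[1])[0]

    assert strongest_associate("rabbit") == "bunny"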
2.1.3. Procedure
The experiment was run using E-Prime 2.0 (Schneider, Eschmann, & Zuccolotto, 2002). The gesture video clips and words were presented one at a time at the center of the computer screen against a white background. Words were presented in black lowercase letters. Each trial began with a fixation stimulus (+) appearing in the center of the screen for 1,000 ms, followed by a blank screen for 200 ms. After the blank screen, the gesture prime video clip appeared for 3,000–4,000 ms, followed by another blank screen for 200 ms. The blank screen was replaced by the lexical target, which remained on the screen for 3,000 ms or until the participants indicated their lexical decision on the keyboard. The participants were seated in front of a 17-inch computer monitor and instructed to make a lexical decision (i.e., "/" for a word and "z" for a nonword) on the keyboard as quickly and accurately as possible. They received 10 practice trials prior to the experiment. A blank screen, which also served as the intertrial interval, followed a correct response for 1,000 ms. For incorrect responses, an "Incorrect" display was presented for 1,000 ms above the fixation point. Both the accuracy rates and response latencies to the hundredth millisecond were recorded.

2.2. Results and discussion

Accuracy was almost at ceiling across all three conditions. However, a few of the participants had noticeably low accuracy rates in some conditions. A box-plot analysis based on mean accuracy rates collapsed across all three conditions was conducted, and five participants who had mean accuracy rates below 2 SDs of the sample mean were removed. Errors (3.1% across all three conditions) and response latencies faster than 200 ms or slower than 3,000 ms were excluded from the analyses. Response latencies more than 2.5 SDs above or below each participant's mean in each condition were also excluded from the analyses. These criteria removed a further 1.2% of the responses. The dependent variables examined were response latency and accuracy rate.

Participants responded faster and more accurately in the related condition, mean reaction time = 641 ms (SD = 143 ms) and mean accuracy = 99% (SD = 3.1%), when compared with the unrelated condition, mean reaction time = 670 ms (SD = 179 ms) and mean accuracy = 96% (SD = 6.7%). The mean response latency in the related condition was significantly faster than that in the unrelated condition (i.e., by 29 ms); the effect was significant by participants, tp(57) = 2.16, p = .035, Cohen's d = .18, and by items, ti(39) = 2.10, p = .042, Cohen's d = .26. The effect in accuracy (3%) was also significant by participants, tp(57) = 3.47, p = .001, Cohen's d = .57, and by items, ti(39) = 3.56, p = .001, Cohen's d = .84.

Overall, we found that iconic gestures facilitated the recognition of a semantically related word, such that words preceded by semantically related gestures were responded to faster than words preceded by semantically unrelated gestures. However, the duration of each gesture clip in our study was unnecessarily long (on average 3,500 ms), leading to long stimulus onset asynchronies (SOAs). Hence, one might contend that participants might have had sufficient time to "label" each gesture prime in their mind, and such an unspoken label might activate the lexical target. In other words, the iconic gesture activates its lexical referent, and the lexical referent, in turn, activates or facilitates the retrieval of the semantically related lexical target. The obvious solution is to shorten the SOAs to minimize strategic
influences. Thus, in Experiment 2, the SOA was reduced by shortening the duration of the gesture clip from 3,500 to 1,000 ms.
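The exclusion criteria used in both experiments (absolute cutoffs at 200 and 3,000 ms, then a 2.5 SD trim within each participant-by-condition cell) translate directly into code; a minimal sketch, ours rather than the authors':

    # Minimal sketch (ours): the two-stage latency trim described above.
    import numpy as np

    def trim_rts(rts, floor=200.0, ceiling=3000.0, criterion=2.5):
        # Stage 1: drop absolute outliers outside [floor, ceiling] ms.
        kept = np.asarray([r for r in rts if floor <= r <= ceiling], dtype=float)
        if kept.size < 2:
            return kept
        # Stage 2: drop RTs beyond `criterion` SDs of this cell's mean.
        m, s = kept.mean(), kept.std(ddof=1)
        return kept[np.abs(kept - m) <= criterion * s]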
3. Experiment 2

3.1. Method

3.1.1. Participants
Seventy-six undergraduates from the Research Participant Program at the National University of Singapore participated in this experiment for course credit. They were all native English speakers with normal or corrected-to-normal vision.

3.1.2. Materials
3.1.2.1. Gesture primes: We used the same gesture clips as in Experiment 1. However, we truncated the gesture clips such that each clip lasted for only 1,000 ms. According to McNeill (1992, 2005), the production of a gesture undergoes at least three phases: preparation, stroke, and retraction. Consider the gesture for bird. When producing this gesture, a speaker raises both hands in the preparation phase, flaps both hands in the stroke phase, and relaxes both hands in the retraction phase. The stroke carries the imagistic content that conveys the semantic meaning of a gesture (in this example, bird). Therefore, when shortening the duration of each gesture, we retained only the stroke phase. For each of the 80 gestures, the stroke phase lasted for 1,000 ms. In order to examine whether the semantic meanings of the gestures were preserved after cropping, we presented the 80 gestures to a separate group of 37 English-speaking undergraduates from the National University of Singapore and asked them to describe their meanings in one word. Specifically, we explored whether the 40 gestures used in Experiment 1 elicited the same high consistency rates when we shortened the duration from 3,000 to 1,000 ms. As in Experiment 1, we determined that a gesture had a common interpretation if its meaning was agreed upon by 70% of the participants. Our findings showed that 42 of the gestures had consistency rates of 70% or above. Reliability was assessed by a second coder who analyzed all the responses and calculated the consistency rate of each gesture; the interrater reliability was .932. Of the 42 gestures, all 40 gestures that had a common interpretation in Experiment 1 obtained consistency rates of 70% or above in Experiment 2. More important, they all had the same interpretation in both Experiment 1 and Experiment 2.5 These 40 gestures served as gesture primes in Experiment 2.

3.1.2.2. Lexical targets: Since the shortened gesture primes were interpreted the same way, we used the same set of lexical targets (words and nonwords) presented in Experiment 1. As before, each participant was presented with 10 related and 10 unrelated prime–target pairs, and 20 nonwords that were preceded by gesture primes. The primes in the three conditions (related, unrelated, and nonword) were counterbalanced across participants
such that each prime had an equal chance of appearing in each of the three conditions. The order of presentation was randomized anew for each participant.

3.1.3. Procedure
The procedure was the same as in Experiment 1.

3.2. Results and discussion

As in Experiment 1, accuracy was almost at ceiling across all three conditions. However, a few participants had noticeably low accuracy rates in some conditions. A box-plot analysis based on mean accuracy rates collapsed across all three conditions was conducted, and two participants who had mean accuracy rates below 2 SDs of the sample mean were removed. Errors (4.93% across all three conditions) and response latencies faster than 200 ms or slower than 3,000 ms were excluded from the analyses. Response latencies more than 2.5 SDs above or below each participant's mean in each condition were also excluded from the analyses. These criteria removed a further 1.75% of the responses.

Participants responded faster and more accurately in the related condition, mean reaction time = 725 ms (SD = 181 ms) and mean accuracy = 98% (SD = 4.9%), as compared to the unrelated condition, mean reaction time = 757 ms (SD = 187 ms) and mean accuracy = 96% (SD = 7.1%). The mean response latency in the related condition was significantly faster than that in the unrelated condition (i.e., by 32 ms); the effect was significant by participants, tp(73) = 2.42, p = .018, Cohen's d = .17, but not by items, ti(39) = 1.21, p = .30, Cohen's d = .27. The effect in accuracy (2%) was also significant by participants, tp(73) = 2.28, p = .026, Cohen's d = .32, but not by items, ti(39) = 1.74, p = .09, Cohen's d = .68. Unlike in Experiment 1, the priming effect by items was not significant for reaction times in Experiment 2.6

To check whether the nonsignificant item effect in Experiment 2 was due to violations of parametric assumptions, we also conducted nonparametric sign tests based on the number of items showing priming in Experiment 1 and Experiment 2. The nonparametric sign tests (one-tailed) yielded the same results as the parametric tests. Specifically, 28 of 40 items showed priming in Experiment 1 (p = .008), while only 25 of 40 items in Experiment 2 showed priming (p = .07).7 Thus, fewer items showed priming in Experiment 2 than in Experiment 1. These results suggest that, even though the priming effect was reliable by participants in both experiments, priming was actually stronger in Experiment 1 than in Experiment 2. This raises the possibility that the priming effect in Experiment 1 was partially mediated by some type of strategic verbal recoding. When that strategy was minimized in Experiment 2 by the shorter SOA, priming was consequently weaker.
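The one-tailed sign tests reported above correspond to binomial tail probabilities under a fair-coin null; the sketch below (ours) reproduces them with scipy, and the assertions only bracket the reported p values rather than asserting exact rounding.

    # Minimal sketch (ours): one-tailed sign tests on items showing priming.
    from scipy.stats import binomtest

    p_exp1 = binomtest(28, n=40, p=0.5, alternative="greater").pvalue
    p_exp2 = binomtest(25, n=40, p=0.5, alternative="greater").pvalue

    assert p_exp1 < 0.01          # reported as p = .008
    assert 0.05 < p_exp2 < 0.10   # reported as p = .07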
4. General discussion

In both experiments, we demonstrated that iconic gestures prime responses to semantically related word targets, using a lexical decision task. We found a semantic link between
iconic gestures and words that are semantically related to them. Our findings are inconsistent with the claims of Bernardis et al. (2008), but they are compatible with their empirical findings. Collectively, the two studies indicate that the cross-modal gesture–word priming effect is robust and generalizes to different lexical processing tasks (i.e., lexical decision and speeded naming).

Our findings indicate that gestures and words share a tight link in our mental representation. For example, a fist moving up and down near the mouth can refer to the action of brushing teeth. Likewise, two hands with V-shaped index and middle fingers above the head refer to bunny. Evidence for a gesture–word semantic link is also found in embodied cognition research. According to the theory of embodied cognition, knowledge can be represented by reenactments of sensory, motor, and introspective states (Bower, 1981; Fodor, 1975; Fodor & Pylyshyn, 1988; Potter, 1979). As suggested by McNeill (1992, p. 155), gestures are representations of thought, or so-called material carriers of thinking. Recently, empirical evidence has supported the notion that gesture is a form of reenactment that embodies mental representation, such that producing a particular gesture activates a congruent mental representation (Casasanto & Dijkstra, 2010; Dijkstra, Kaschak, & Zwaan, 2007; Lakoff & Johnson, 1999). Casasanto and Dijkstra (2010) asked a group of participants to retrieve positive and negative autobiographical memories while moving their hands upward or downward. Interestingly, the activation of positive autobiographical memories was faster when the participants moved their hands upward than when they moved their hands downward. Likewise, the activation of negative memories was faster when hands moved downward than when they moved upward. These findings suggest that positive memory is embodied with upward hand movement, while negative memory is embodied with downward hand movement.

Given the semantic link between iconic gestures and words, how might one explain the cross-modal semantic priming of words by iconic gestures? At this point, there is no clear model that provides a mechanism for cross-modal priming from gestures to words. In the spreading activation framework (Collins & Loftus, 1975), a canonical model of semantic priming, semantic memory is made up of a network of interconnected internal representations or nodes. Activation spreads from a node or concept to semantically related concepts, and prior activation of concepts facilitates their subsequent retrieval. However, gestures are not represented in the standard spreading activation network. Interestingly, Krauss (1998) has proposed that our semantic representation might contain multiple formats, such as gestures and words. We suggest that Krauss' ideas can be incorporated into an embellished spreading activation framework, such that nodes or concepts in the semantic network are represented in the form of both gestures and their lexical affiliates. There is empirical evidence showing that activation of a concept in one format (e.g., gesture) activates related concepts in another format (e.g., words) during the process of speech production (Morrel-Samuels & Krauss, 1992). In line with this research, we found that mere exposure to an iconic gesture activates semantically related words.

However, there are three limitations in our study.
First, due to various constraints, the pool of usable gesture primes was very limited. The modest number of stimuli might have lowered the statistical power of our item analyses. Second, our gestures vary in their
iconicity, which might modulate their ability to prime words. Some iconic gestures possess a more direct and transparent relationship with their meaning (Beattie & Shovelton, 2002). For example, the drive gesture (two fists alternately moving up and down) is more iconic than the saw gesture (an open palm moving repeatedly back and forth). Gestures with a higher degree of iconicity should be easier to recognize. Due to the limited number of gesture stimuli in our study, we were unable to explore the influence of iconicity on gestural priming. However, this is an interesting issue that merits further investigation. Third, even though we shortened the SOA to 1,000 ms, we still cannot entirely rule out the possibility that the participants strategically recoded the gesture primes into lexical primes, which in turn activated the lexical targets.

Although our findings demonstrated that priming was reliable in both experiments, the participants also responded faster when the duration of the gesture was on average 3,500 ms than when it was 1,000 ms. Why did participants take longer to respond when the SOA was 1,000 ms? One possibility is that, with a shorter SOA, the participants might not have fully processed the gesture before the word target was presented. Thus, additional gesture processing might have taken place, delaying their lexical decision. When the SOA was longer (e.g., 3,500 ms), the participants might have already fully processed the gestures before the lexical targets were presented. It is also worth noting that priming was weaker (but still reliable) in the short SOA condition, suggesting that the priming seen in Experiment 1 may have been partially mediated by strategic verbal recoding of the gestures.

To conclude, our study used a cross-modal semantic priming paradigm to show a tight semantic link between iconic gestures and words. Specifically, mere exposure to iconic gestures activates semantically related words.
Notes

1. Although the former is linear-segmented, the latter is global-synthetic and instantaneous imagery.
2. According to Bernardis and Caramelli (2009), forming a mental image of a word would activate visuospatial information that was also represented in the semantically related gesture, hence producing a facilitation effect.
3. We conducted tests on 10 pilot participants and found that 7 s was the optimal time for a participant to quickly process a gesture and describe it in a single word.
4. The English Lexicon Project database was used to match nonwords to words that have more than one syllable.
5. Two gestures that were considered as "meaningless" (i.e., their meaning was agreed upon by <70% of the participants … > .2).
4. Discussion

The current experiment tested whether lexically guided recalibration occurs for tone contrasts, how strongly learning generalizes to novel words, and to what extent learning is moderated by the phonetic context. As in previous experiments (Eisner & McQueen, 2005, 2006; Kraljic & Samuel, 2005, 2006; van der Linden & Vroomen, 2007; McQueen et al., 2006; Norris et al., 2003), we adopted a two-phase test procedure. Exposure to ambiguous tone contours in a biasing context led to recalibration of the tone contrast. Furthermore, this recalibration was effective for both old and new syllables in the test phase but was slightly stronger for old than for new words. The recalibration was not moderated by the phonetic context. These results show that this form of recalibration is not restricted to the acoustically "local" segmental contrasts used in previous studies (for a review, see Samuel & Kraljic, 2009) but also applies to acoustically more distributed tone contrasts, thereby extending the generality of perceptual learning in speech perception.

Our main rationale was to directly compare the strength of episodic versus abstract phonological learning. We found evidence for both. Learning generalized from old to new words, which speaks for abstract phonological learning. The learning effect was, however, slightly larger for old than for new words, suggesting additional effects of episodic learning at a lexical level. This empirically strengthens the evolving consensus that models of word recognition need to incorporate both abstract phonological mechanisms as well as listeners'
ability to store individual exemplars of words. Furthermore, our results indicate that phonological learning is much more potent than episodic learning.

How should the effects found in this experiment be modeled? Both extreme abstractionist models, such as Shortlist (Norris, 1994), and purely episodic models, such as Minerva (Goldinger, 1998), have trouble explaining the current data straightforwardly. Specifically, purely abstractionist models fail to explain the word-specific effects, and purely episodic models are challenged by the generalization of learning to new words. It is clear that a hybrid model is needed to explain the current results; furthermore, the architecture of such a hybrid model also needs to be constrained to explain the asymmetrical effects of abstract phonological versus episodic learning.

One may conceive of models in which phonological categories are functional in lexical access or are epiphenomena of lexical access. In the first case, the input activates prelexical units, generating an abstraction of the acoustic signal with which the lexicon is addressed. In the latter case, the signal, often conceptualized as a grainy spectrogram (Pierrehumbert, 2002), directly accesses the mental lexicon, which would still consist of episodic traces without abstraction (Johnson, 1997). Abstraction only occurs after lexical access, when the lexical entries in turn activate phonological categories. Phonological categories are then defined by all the words in which they occur. It is, however, difficult to see how such models can account for the current finding. If tone categories are defined by all the relevant words, the empirical basis for generalization is rather small. Participants were only exposed to 20 ambiguous tone contours. A lexical database for Modern Mandarin (http://lingua.mtsu.edu/chinese-computing/) lists 300 different tone1 syllables and 242 tone2 syllables. Therefore, in the experiment, listeners were exposed to ambiguous tokens of fewer than 10% of the words bearing the relevant tone categories (20 exposure words vs. more than 200 words in the lexicon). Given that the episodic representations of about 90% of the words within the category (or cluster of exemplars) are not influenced by the experimental exposure, we expect that a postlexical tone category should be altered only slightly, if at all. Accordingly, the word-specific effect should be stronger than the generalization effect. This prediction is not borne out by our results, because the phonological learning effect is five times stronger than the word-specific learning effect.

A model that fits well with the relative strength of the episodic and categorical effects is the production model proposed by Pierrehumbert (2002). In this model, word-specific effects are second-order effects, while categorical processing provides the backbone on which speech production operates. Without such categorical machinery, a purely episodic model cannot explain the relative strengths of abstract and episodic learning found here, nor can it explain more complex speaker normalization effects (Mitterer, 2006). Such an intervening level explains generalization of the current and other types of phonetic learning (Davis, Johnsrude, Hervais-Adelman, Taylor, & McGettigan, 2005; Maye, Aslin, & Tanenhaus, 2008).

While arguing for abstraction, the current results also underline the importance of storing episodic traces at different levels of processing.
First of all, recalibration effects can be speaker specific (Eisner & McQueen, 2005) and may not be automatically driven by the acoustic input, but rather modulated by previous or concurrent visual experience of the speaker (Kraljic, Samuel, & Brennan, 2008). These highly dynamic effects can only be explained if one assumes that episodic information, which encodes how particular speakers produce
particular categories, is retained at a prelexical level of processing. Additionally, the results also indicate that listeners retain information about how speakers produce particular words. It is worth noting that the emphasis of previous research has been to show that episodic information is encoded at all (Goldinger, 1996); the current results, however, indicate that such memory traces are in fact functional in word recognition. In other words, hearing a word in a phonetically ambiguous form not only creates an episodic memory but also influences the recognition of new tokens of this word with similar forms. Such lexical episodic storage, however, contributes less to spoken-word recognition than prelexical abstraction. The main burden of word recognition thus seems to lie on the prelexical abstraction (from the speech signal input) with which lexical access can be achieved efficiently. Because prelexical and lexical codes must be commensurable, it then follows that lexical representations should be abstract as well. Such an architecture facilitates fast adaptation to inter- and intraspeaker variation, which is necessary for effective and efficient speech communication and would otherwise be difficult to achieve.
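The proportion argument a few paragraphs back is simple arithmetic; the sketch below (ours) just makes the numbers explicit, using the tone1 and tone2 syllable counts from the lexical database cited above.

    # Minimal sketch (ours): exposure covers well under 10% of either tone
    # category (20 ambiguous exposure words vs. 300 tone1 / 242 tone2 syllables).
    exposed_words = 20
    for category_size in (300, 242):
        share = exposed_words / category_size
        assert share < 0.10  # ~90% of the category's words go unexposed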
Acknowledgments

We would like to thank Yingyi Luo for running the experiment. Support from the Netherlands Organization for Scientific Research (NWO-VIDI 016084338) and the European Research Council (ERC Starting Independent Researcher Grant 206198) is gratefully acknowledged.
References

Baayen, H. R., Davidson, D. J., & Bates, D. M. (2008). Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language, 59, 390–412.
Boersma, P. (2001). Praat, a system for doing phonetics by computer. Glot International, 5, 341–345.
Brainard, D. H. (1997). The psychophysics toolbox. Spatial Vision, 10, 433–436.
Chen, Y., & Gussenhoven, C. (2008). Emphasis and tonal implementation in Mandarin Chinese. Journal of Phonetics, 36, 724–746.
Connine, C. M., Ranbom, L. J., & Patterson, D. J. (2008). Processing variant forms in spoken word recognition: The role of variant frequency. Perception & Psychophysics, 70, 403–411.
Cutler, A., & Weber, A. (2007). Listening experience and phonetic-to-lexical mapping in L2. In J. Trouvain & W. J. Barry (Eds.), Proceedings of the 16th International Congress of Phonetic Sciences (pp. 43–48). Dudweiler, Germany: Pirrot.
Davis, M. H., Johnsrude, I. S., Hervais-Adelman, A., Taylor, K., & McGettigan, C. (2005). Lexical information drives perceptual learning of distorted speech: Evidence from the comprehension of noise-vocoded sentences. Journal of Experimental Psychology: General, 134, 222–241.
Eisner, F., & McQueen, J. M. (2005). The specificity of perceptual learning in speech processing. Perception & Psychophysics, 67, 224–238.
Eisner, F., & McQueen, J. M. (2006). Perceptual learning in speech: Stability over time. Journal of the Acoustical Society of America, 119, 1950–1953.
Francis, A. L., Ciocca, V., Wong, N. K. U., Leung, W. H. Y., & Chu, P. C. Y. (2006). Extrinsic context affects perceptual normalization of lexical tone. Journal of the Acoustical Society of America, 119, 1712–1726.
196
H. Mitterer, Y. Chen, X. Zhou ⁄ Cognitive Science 35 (2011)
Ganong, W. F. (1980). Phonetic categorization in auditory word perception. Journal of Experimental Psychology: Human Perception and Performance, 6, 110–125. Goldinger, S. D. (1996). Words and voices: Episodic traces in spoken word identification and recognition memory. Journal of Experimental Psychology: Learning, Memory, and Cognition, 22, 1166–1183. Goldinger, S. D. (1998). Echoes of echoes? An episodic theory of lexical access. Psychological Review, 105, 251–279. Goldinger, S. D. (2007). A complementary-systems approach to abstract and episodic speech perception. In J. Trouvain & W. J. Barry (Eds.), Proceedings of the 16th International Congress of Phonetic Sciences (pp. 49–54). Dudweiler, Germany: Pirrot. Jaeger, T. F. (2008). Categorical data analysis: Away from ANOVAs (transformation or not) and towards logit mixed models. Journal of Memory and Language, 59, 434–446. Johnson, K. (1997). Speech perception without speaker normalization: An exemplar model. In K. Johnson & J. Mullennix (Eds.), Talker variability in speech processing (pp. 145–165). San Diego, CA: Academic Press. Klatt, D. (1989). Review of selected models of speech perception. In W. D. Marslen- Wilson (Ed.), Lexical representation and process (pp. 169–226). Cambridge, MA: MIT Press. Kraljic, T., & Samuel, A. G. (2005). Perceptual learning for speech: Is there a return to normal? Cognitive Psychology, 51, 141–178. Kraljic, T., & Samuel, A. G. (2006). Generalization in perceptual learning for speech. Psychonomic Bulletin and Review, 13, 262–268. Kraljic, T., Samuel, A. G., & Brennan, S. (2008). First impressions and last resorts: How listeners adjust to speaker variability. Psychological Science, 19, 332–338. Kuzla, C., Ernestus, M., & Mitterer, H. (2010). Compensation for assimilatory devoicing and prosodic structure in german fricative perception. In C. Fougeron & M. D’Imperio (Eds.), Laboratory phonology 10 (pp. 731– 758). Berlin: Mouton. Ladefoged, P., & Broadbent, D. E. (1957). Information conveyed by vowels. Journal of the Acoustical Society of America, 27, 98–104. Leather, J. (1983). Speaker normalization in perception of lexical tone. Journal of Phonetics, 11, 373–382. van der Linden, S., & Vroomen, J. (2007). Recalibration of phonetic categories by lipread speech versus lexical information. Journal of Experimental Psychology: Human Perception and Performance, 33, 1483–1494. Mann, V. A. (1980). Influence of preceding liquid on stop-consonant perception. Perception & Psychophysics, 28, 407–412. Maye, J., Aslin, R. N., & Tanenhaus, M. K. (2008). The weckud wetch of the wast: Lexical adaptation to a novel accent. Cognitive Science, 32, 543–562. McQueen, J. M., Cutler, A., & Norris, D. (2006). Phonological abstraction in the mental lexicon. Cognitive Science, 30, 1113–1126. Mitterer, H. (2006). Is vowel normalization independent of lexical processing? Phonetica, 63, 209–229. Moore, C. B., & Jongman, A. (1997). Speaker normalization in the perception of Mandarin Chinese tones. Journal of the Acoustical Society of America, 102 (3), 1864–1877. Norris, D. (1994). Shortlist: A connectionist model of continuous speech recognition. Cognition, 52, 189–234. Norris, D., McQueen, J. M., & Cutler, A. (2003). Perceptual learning in speech. Cognitive Psychology, 47, 204– 238. Pierrehumbert, J. (2002). Word-specific phonetics. In C. Gussenhoven & N. Warner (Eds.), Laboratory phonology VII (pp. 101–139). Berlin: Mouton de Gruyter. Pitt, M. A. (2009). How are pronunciation variants of spoken words recognized? 
A test of generalization to newly learned words. Journal of Memory and Language, 61, 19–36. Samuel, A. G., & Kraljic, T. (2009). Perceptual learning for speech. Attention, Perception, & Psychophysics, 71, 1207–1218. Wong, P. C. M., & Diehl, R. L. (2003). Perceptual normalization for inter- and intratalker variation in Cantonese level tones. Journal of Speech, Language, and Hearing Research, 46, 413–421.
H. Mitterer, Y. Chen, X. Zhou ⁄ Cognitive Science 35 (2011)
197
Xu, Y. (1994). Production and perception of coarticulated tones. Journal of the Acoustical Society of America, 95, 2240–2253. Yu, A. C. L. (2007). Understanding near mergers: The case of morphological tone in Cantonese. Phonology, 24, 187–214.
Supporting Information
Additional Supporting Information may be found in the online version of this article on Wiley Online Library:
Appendix S1: Experimental items.
Appendix S2: Pretest for exposure items.
Cognitive Science 35 (2011) 198–209 Copyright © 2010 Cognitive Science Society, Inc. All rights reserved. ISSN: 0364-0213 print / 1551-6709 online DOI: 10.1111/j.1551-6709.2010.01146.x
Lexicons, Contexts, Events, and Images: Commentary on Elman (2009) From the Perspective of Dual Coding Theory
Allan Paivio,a Mark Sadoskib
aDepartment of Psychology, University of Western Ontario
bDepartment of Teaching, Learning, & Culture, Texas A&M University
Received 29 October 2009; received in revised form 26 July 2010; accepted 28 July 2010
Abstract
Elman (2009) proposed that the traditional role of the mental lexicon in language processing can largely be replaced by a theoretical model of schematic event knowledge founded on dynamic context-dependent variables. We evaluate Elman's approach and propose an alternative view, based on dual coding theory and on evidence that modality-specific cognitive representations contribute strongly to word meaning and language performance across diverse contexts, contexts whose effects are themselves predictable from dual coding theory.
Keywords: Lexicon; Context; Imagery; Event representation; World knowledge
1. Introduction
Elman's (2009) theory of lexical knowledge without a mental lexicon is based on analyses of (a) problems with the standard view of the mental lexicon; (b) the roles of world knowledge and linguistic knowledge in comprehension of event descriptions; (c) experimental evidence that word meanings are heavily dependent on their sentence contexts; and (d) the possibility of computational modeling of his approach. We re-analyze these issues from the perspective of dual coding theory (DCT) and find agreements and disagreements with different aspects of Elman's position. We especially find common ground in his emphasis on the importance of world knowledge in comprehension of event descriptions, but not in his ultimate schema interpretation of how that knowledge is represented. We agree too on the importance of context in determining word meaning, but we attach much more weight than he does to stable semantic properties of language units. Finally, while respecting his goal of providing a formal computational theory to explain event knowledge, we find that this goal remains unrealized and elusive. In contrast, noncomputational DCT has successfully explained and predicted relevant phenomena and can be easily extended to the specific language domains addressed by Elman. We elaborate on these comparisons after summarizing the DCT approach to the focal issues.
Correspondence should be sent to Allan Paivio, Department of Psychology, Faculty of Social Science, University of Western Ontario, London, Ontario, Canada N6A 5C2. E-mail: [email protected]
2. Mental "lexicon," context, and DCT
Elman (2009) emphasized knowledge complexity and context dependence as major problems associated with current theories of the mental lexicon. This applies particularly to the standard view that word information is stored mentally in a single, abstract (lemma) form applicable to different modalities of language (e.g., auditory, visual, motor) along with their semantic, syntactic, and pragmatic properties. The inclusion of pragmatic properties particularly implies that contextual influences on language performance arise from the language user's world knowledge as well as from the language context.
The mental "lexicon" coupled with corresponding nonlinguistic representations are the core of DCT structures and processes (e.g., Paivio, 1971, 1986, 2007; Sadoski & Paivio, 2001, 2004). However, these representations are modality-specific and embodied rather than abstract as they are in the standard theories, although abstract phenomena (e.g., abstract language) can be handled by the DCT assumptions. Initially, the units were simply called verbal and nonverbal (imaginal) internal representations that vary in sensorimotor modality (Paivio, 1971, pp. 54–56). Subsequently, the verbal and nonverbal units were named logogens and imagens, respectively (Paivio, 1978). Logogen was adapted from Morton's (1979) word recognition model, in which the logogen was interpreted as a multimodal concept that includes auditory and visual logogens as well as input and output logogens. In DCT (e.g., Paivio, 1986; Sadoski & Paivio, 2001) the concept expanded to include auditory, visual, haptic, and motor logogens, as well as separate logogen systems for the different languages of bilinguals. Logogens of all modalities are hierarchical sequential structures of increasing length, from phonemes (or letters) to syllables, conventional words, fixed phrases, idioms, sentences, and longer discourse units: anything learned and remembered as an integrated sequential language unit.
The DCT imagens are mental representations that give rise to conscious imagery and are involved as well in recognition, memory, language, and other functional domains. Like logogens, imagens come in different modalities (visual, auditory, haptic, and motor). They also are hierarchically organized but, in the case of visual imagens in particular, the hierarchy consists of spatially nested sets: pupils within eyes within faces, rooms within houses within larger scenes, and so on.
Logogen and imagen units also differ fundamentally in their meaningfulness. Modality-specific logogens have no meaning in the semantic sense that characterizes the standard views of linguistic lexical representations. They are directly "meaningful" only in that, when activated, they have some degree of familiarity or recognizability. Imagens, however,
are intrinsically meaningful in that the conscious images they activate resemble the perceived objects and scenes they represent. Further meaning for both logogens and imagens arises from their referential or associative connections to other representations of either class. Referential connections between concrete-word logogens and imagens permit objects to be named and names to activate images that represent world knowledge. Associative connections between logogens (whether concrete or abstract) and between imagens allow for within-system associative processing that defines associative meaning as measured, for example, by word association tests and analogous nonverbal procedures. All DCT interunit connections are many-to-many, and their activation is probabilistically determined by task-relevant and contextual stimuli.
The preceding statement means that contextual relations and task-relevant properties of items are described in terms of the same DCT concepts, namely, logogens, imagens, and connections between them. For example, mental images provide situational contexts for language in the absence of the referent objects and settings. Verbal associations include meaningful relationships such as synonymy, antonymy, and paraphrases as well as intralinguistic contextual relations. Ensembles of DCT units and contextual relations are involved in different tasks such as learning word pairs or understanding sentences and longer texts. The tasks might involve extraneous contexts, such as task instructions, that vary in their relations to the task-relevant ensembles so that they could enhance, interfere with, or have no effect on task performance. These implications differ from those arising from Elman's context-dependency proposal. Empirical evidence (reviewed later) bears on the alternatives.
We conclude this section by identifying parallels to DCT assumptions in other theories. (a) Regarding logogen length, linguistic theories generally accept idioms and fixed phrases as lexical units, and Langacker (1991) suggested further that the "lexicon is most usefully…described as the set of fixed expressions in a language, irrespective of size and regularity. Thereby included as lexical items are morphemes, stems, words, compounds, phrases, and even longer expressions—provided that they are learned as established units" (cited in Schönefeld, 2001, p. 182). (b) The limited meaningfulness of logogens agrees with the assumption proposed by Rumelhart (1979) and accepted by Elman (2004) that lexical words are clues to meaning rather than being semantically meaningful in themselves. (c) The role of interword associations on the verbal side of DCT has its parallel in corpus linguistic studies of word collocations in text and speech. Psychological correspondences include Kiss's (1975) associative thesaurus of English and the latent semantic analysis of large language corpora by Landauer and Dumais (1997). (d) The DCT multimodal representational systems have a partial parallel in Morton's (1979) logogen theory, including its addition of imagen-like "pictogens" to account for picture recognition. Likewise, Coltheart (2004) found neuropsychological evidence for at least two lexicons, visual and auditory, along with a nonverbal "lexicon" involved in object recognition. Similarly, Caramazza (1997) postulated modality-specific lexical forms in lexical access without any modality-neutral (lemma) level of lexical representation. (e) Finally, the DCT view of context is generally similar to that of other theories, but with specific aspects that are unique to DCT.
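The unit-and-connection assumptions described above (modality-specific logogens and imagens, many-to-many referential and associative links, probabilistic activation) can be made concrete in a toy sketch. The Python fragment below is our illustration only, not part of DCT's formal apparatus; all names and link weights are hypothetical, and treating weights as activation probabilities is a simplification.

import random
from dataclasses import dataclass, field

@dataclass
class Unit:
    """A DCT representational unit: a logogen (verbal) or an imagen (nonverbal)."""
    name: str
    kind: str      # "logogen" or "imagen"
    modality: str  # e.g., "auditory", "visual", "haptic", "motor"
    links: dict = field(default_factory=dict)  # many-to-many weighted connections

def connect(a, b, p):
    """Referential (logogen-imagen) or associative (within-system) connection;
    the weight p is read here as an activation probability."""
    a.links[b.name] = (b, p)
    b.links[a.name] = (a, p)

def spread(unit, rng):
    """Probabilistic spread of activation, standing in for DCT's assumption
    that task and context determine which connections fire."""
    return [other.name for other, p in unit.links.values() if rng.random() < p]

# Toy example: a concrete word has a referential link to an image of its
# referent and an associative link to another word.
word = Unit("butcher (word)", "logogen", "auditory")
image = Unit("butcher (image)", "imagen", "visual")
associate = Unit("knife (word)", "logogen", "auditory")
connect(word, image, 0.8)      # referential: naming and imaging
connect(word, associate, 0.5)  # associative: word-association link
print(spread(word, random.Random(0)))

On this picture, a logogen is only indirectly meaningful: its "meaning" is exhausted by which referential and associative links it can probabilistically activate.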
3. The nature of event knowledge
Elman (2009) appropriately emphasized event knowledge in relation to current language behavior. Such knowledge was crucial as well in the evolution of syntactic communication (Nowak, Plotkin, & Jansen, 2000). Syntactic properties became part of language "because they can activate useful memories of relevant events in both listener and speaker in the absence of the perceptual events themselves" (Paivio, 2007, p. 308). The main problem, then, is how the absent events are cognitively represented.
First, event discourse and nonverbal event knowledge are necessarily distinct, as Elman implied when he asked about "the nature of the events being described" (Elman, 2009, p. 20). However, he then went on to treat events operationally as verbal descriptions rather than as nonverbal representations of the described events. Thus, meanings, roles, contexts, and ambiguities of target sentences entailed language properties that affected, for example, expectations of what words are likely to occur next during sentence processing. Elman assumed that such effects are mediated by representations of events involving referent objects but provided no independent defining operations for such mediators.
Rather than concretize event knowledge, Elman sought an abstract form of representation by first adding conceptual knowledge to the distinction between world knowledge and linguistic knowledge. He questioned whether the tripartite differences "require a separate copy of a person's conceptual and world knowledge plus the linguistic facts that are specific to linguistic forms. Or is there some other way by which such (differences) can operate directly on a shared representation of conceptual and world knowledge?" (p. 22). The "other way" is via event schemas. He asserted that "The similarity between event knowledge and schema theory should be apparent" (p. 22). However, this similarity is not apparent, because what constitutes an "event" or a "schema" is not well defined. Elman suggested that schemas are epiphenomenal, emerging "as a result of (possibly higher order) co-occurrence between the various participants in the schema. The same network may instantiate multiple schemas, and different schemas may blend. Finally, schemas emerge as generalizations across multiple individual examples" (p. 24). What the "participants" are, how they network, and how this network of participants can both produce and instantiate multiple epiphenomena that both generalize and blend meaning is not apparent in any version of schema theory.
The shortcomings of schema as an explanatory concept have been identified in a number of reviews (e.g., Alba & Hasher, 1983; Sadoski, Paivio, & Goetz, 1991). These reviews concluded that schema theory fails to account for the rich and detailed memory of complex events frequently observed in research. The cognitive puzzle here is how specific objects and events are transformed into abstract representations from which the original details are somehow recovered later at a better than chance level. This fatal empirical and logical instantiation problem does not arise in DCT, because DCT uses theoretically and operationally defined modality-specific representations to predict and explain performance in cognitive tasks.
4. Empirical implications and evidence
Elman (2009, pp. 17–21) summarized collaborative research that tested predictions from his contextual approach using meaningful but unexpected combinations of roles played by agents, instruments, and patients in event descriptions. The results showed that the unexpected combinations (e.g., the butcher cut the grass with scissors) slow down on-line processing relative to expected combinations (e.g., the butcher cut the meat with a knife). The studies provided new information on such details as how quickly the effects occur as the sentence unfolds, and they converge theoretically with DCT's emphasis on the anticipatory functions of dual coding systems based on anticipatory imagery and/or verbal associations (e.g., Paivio, 2007, pp. 87–90; Sadoski & Paivio, 2001, chapter 4).
The DCT explanation of the Elman et al. results described above is as follows. In actual reading, larger text and pragmatic contexts plus the sentence stem The butcher cut the… could evoke anticipatory verbal associates (meat, knife) and also anticipatory images of a butcher cutting meat with a knife. In DCT, both verbal and nonverbal contexts constrain and direct further anticipations probabilistically (Sadoski & Paivio, 2001, p. 73 ff.). The addition of the nonverbal code to anticipatory processing offers explanatory advantages over a schema-based interpretation. For example, a mental image might synchronically include inferred (i.e., anticipated) information regarding the type and cut of meat, the identity of the butcher (e.g., shop owner, industrial plant worker), the setting of the action, and other information that sets the stage for still further anticipations. Schemata would necessarily operate at a level more general than this; there is no default reason why butchers would be either shop owners or industrial plant workers, for example. However, imagery reports of text events typically include such specific, elaborated details (e.g., Krasny & Sadoski, 2008; Sadoski et al., 1991; Sadoski, Goetz, Olivarez, Lee, & Roberts, 1990). For example, Sadoski et al. (1990) found that nearly 25% of imagery reports from reading contained information elaborated beyond the text or imported from memory in a manner consistent with the constraints of the text (e.g., allowable specifics of setting, action, appearance). Furthermore, nearly 43% of such elaborations and importations were included in reports that combined information from across text segments. That is, inferred information reported in imaginal form was applied to adjacent and ensuing text. At the sentence level, converging behavioral and neuropsychological evidence indicates that mental imagery occurs during sentence processing, especially for concrete sentences (e.g., Bergen, Lindsay, Matlock, & Narayanan, 2007; Holcomb, Kounios, Anderson, & West, 1999; Paivio & Begg, 1971).
Elman conceded that lexical explanations could work if one assumes a sufficiently information-rich lexicon associated with verbs (e.g., cut) that can combine with many kinds of agents and instruments, but that lexical explanations are less plausible in the case of effects of verb aspect, which do not depend on verb-specific information. However, given the relations (albeit complex) between verb aspect and verb tense, explanations based on DCT-defined lexical representations are plausible. First, tense-inflected verbs (e.g., eaten, ate) can be independent logogen units in DCT.
Second, verb past tense can be generated from either verb stems or action pictures (Woollams, Joanisse, & Patterson, 2009), implicating DCT logogens and imagens. Third, participants can generate distinct
nonverbal images to represent situational aspects of past, present, and future events (Werner & Kaplan, 1963, pp. 425–438), the last reflecting the real-world anticipatory function of imagery alluded to above.
We turn to DCT-related research concerning questions that arise from Elman's article, focusing on (a) the distinction between world knowledge and linguistic knowledge, (b) the functional reality of abstract conceptual representations, and (c) the lexicon-context issue.
4.1. World knowledge and linguistic knowledge
The distinction between world knowledge (the DCT nonverbal system) and linguistic knowledge (the DCT verbal system) is fundamental in DCT research. The distinction is operationally defined in terms of variables that affect the probability that verbal or nonverbal systems (or both) will be used in a given task. The most relevant classes of defining variables for present purposes are stimulus attributes (e.g., pictures versus words, concrete language versus abstract language) and experimental manipulations (e.g., instructions to use imagery or verbal strategies). Such methods have revealed separate and joint effects of verbal and nonverbal variables in numerous tasks (most recently reviewed in Paivio, 2007). Additively independent dual coding effects on memory were obtained with materials ranging from single items to pairs, phrases, sentences, paragraphs, and longer text. For example, presenting pictures along with their names increases recall additively relative to once-presented words or pictures, as does image-plus-verbal coding of words. A similarly large memory advantage consistently occurred for concrete over abstract verbal material, which is explainable in terms of differential dual coding resulting from a higher probability of imagery activation by concrete than abstract language. Additive dual coding effects have also been obtained in comprehension tasks (e.g., Mayer, 2001; Sadoski, Goetz, & Fritz, 1993; Sadoski, Goetz, & Rodriguez, 2000).
Singularly important here are effects predicted from the conceptual peg hypothesis of DCT (its history and current status are reviewed in Paivio, 2007, pp. 22–24, 60–67). The hypothesis states that item concreteness is especially potent when it is a property of the item that serves as the retrieval cue in associative memory. Predictions were confirmed in memory experiments which showed that concreteness of the retrieval cue was related to response recall much more strongly than concreteness of the response items, or concreteness of items in non-cued (free) recall. The relevance here is two-fold. First, the hypothesis agrees with Elman's general emphasis on the importance of words as cues. Second, the efficacy of concrete cues is linked to their capacity to activate images of referent objects or situations that mediate recall, thus requiring direct access to knowledge of the world. Elman's words-as-cues approach, however, does not specify mechanisms with similar predictive implications.
The conclusion is that DCT-related research reveals contributions of both nonverbal world knowledge and linguistic knowledge to language phenomena that are not revealed by Elman's (2009) reliance on event-descriptive language materials alone. World knowledge likely played a role in his experimental results, but the extent of that contribution is uncertain because the inferred real-world relations are confounded with verbal associative
relations between words that describe agents, instruments, patients, and verbs. Even the two experiments that used pictures and imagery instructions made no comparisons with analogous verbal procedures to test for differential contributions of the pictures (or imagery) and language.
Research under the rubric of embodied cognition supports the same conclusion. The prototypical studies relate language comprehension and memory to nonverbal motor processes, perception, imagery, and language. Early research summarized by Werner and Kaplan (1963, pp. 26–29) showed that, to be perceived at apparent eye level, the printed words "climbing" and "raising" had to be positioned below "lowering" and "dropping." Werner and Kaplan interpreted the effect in terms of an organismic theory according to which word meaning exerts a directional "pull" consistent with the dynamic meaning of the stimulus. Many variants of the Werner and Kaplan studies have recently been reported (e.g., Bergen et al., 2007; Glenberg & Kaschak, 2002; Zwaan, 2004). An experiment by Louwerse (2008) is especially apropos because it distinguished between effects attributable to world knowledge and to language. "Iconic" word pairs (attic-basement) were judged to be related faster than reverse iconic pairs when the words were presented in a vertical spatial arrangement but not when they were presented horizontally. The initial interpretation was that the effects resulted from nonverbal world knowledge of spatial relations. However, measures of relational frequency showed that iconic word order is more frequent than non-iconic word order and that, when word-order frequency was controlled, the iconicity effect disappeared. Such confounding by linguistic associations was not investigated in the event knowledge experiments by Elman and his collaborators. DCT explains these effects readily.
4.2. The functional reality of abstract representational concepts
Elman proposed that world knowledge and linguistic knowledge draw on abstract conceptual knowledge that he interpreted in terms of an improved (though as yet unrealized) version of schema theory. We have already discussed schema theory critically, and here we deal similarly with a broader class of abstract conceptual representations that DCT research has addressed. The relevant studies systematically varied item attributes and task characteristics designed to test effects of modality-specific processes that could not be explained by undifferentiated properties of any single, modality-neutral representational code. An early review (Paivio, 1983) turned up 60 independent findings by various researchers that were predicted or explained by DCT but not single-code theories. A more comprehensive summary (Paivio, 1986) prompted a reviewer to conclude that "The data demand something better than common coding theories have been able to provide" (Lockhart, 1987, p. 389). That conclusion has been further strengthened by recent findings from behavioral and neuroscience research (reviewed in Paivio, 2007).
4.3. Item-specific variables versus context
Many early DCT studies (e.g., see Paivio, 1971, pp. 377–384) explicitly investigated the joint effects of item-specific variables (e.g., pictures versus words) and contextual variables
(e.g., conjunctive versus meaningful relational organization of units). Robust item-specific memory effects were augmented by meaningful contexts. Subsequently, beginning in the 1980s, context became specifically relevant to DCT because some theorists suggested that language concreteness effects depend on contextual support that is generally more available for concrete than abstract items (e.g., Schwanenflugel & Shoben, 1983). The context-availability hypothesis is a specific variant of Elman's more general hypothesis that lexical knowledge is context dependent. However, neither hypothesis explains the persistent concreteness effects in a wide variety of contexts in the early studies, nor in studies that controlled for context or were designed to pit contextual variables against item concreteness.
Large concreteness effects were found in comprehension and recall of sentences and paragraphs matched for verbal contextual factors (Sadoski et al., 1993, 2000). Additively independent memory effects of item concreteness/imagery and contextual variables were obtained using: (a) noun pairs (Paivio, Walsh, & Bons, 1994); (b) adjective–noun pairs and sentences (Paivio, Khan, & Begg, 2000); and (c) concrete and abstract words presented in the context of meaningful sentences or in anomalous ones that inhibited relational processing (Richardson, 2003). No hint of an interaction occurred in any of these experiments.
Sadoski, Goetz, and Avila (1995) tested competing predictions using two sets of paragraphs about historical figures and events that were matched for number of sentences, words, syllables, sentence length, information density, cohesion, and rated comprehensibility. One set of paragraphs was rated equal in familiarity but unequal in concreteness. Here, DCT predicted that the concrete paragraphs would be recalled better than the abstract paragraphs due to the advantage provided by imagery, whereas context availability theory predicted comparable recall for the two types because they were alike in familiarity and contextual support. In the other set, the paragraphs differed in both familiarity and concreteness, with the abstract paragraph being relatively more familiar. In this set, DCT predicted that recall of the familiar abstract paragraph would approximate recall of the unfamiliar concrete paragraph (i.e., offsetting disadvantages), whereas context availability theory predicted that the abstract paragraph would be recalled better than the concrete paragraph (reflecting the advantage of greater familiarity). The results matched the predictions of DCT but not context availability theory.
Begg and Clark (1975) obtained imagery ratings for homonyms that have both a concrete and an abstract meaning, as well as for the words in sentence contexts that biased concrete or abstract interpretations (e.g., justice of the peace versus love of justice). Free recall tests showed that out-of-context word imagery ratings correlated significantly with recall of words in lists, whereas imagery ratings in contexts correlated significantly with recall of the words in sentence contexts. Thus, the experiment demonstrated both item-specific and context-dependent effects of imageability. O'Neill and Paivio (1978) showed interactive effects of concreteness and extreme variation of context on comprehension, imagery, and memory.
Ratings of comprehensibility, sensibleness, and imagery were obtained for normal concrete and abstract sentences as well as anomalous sentences created by substituting content words from one sentence to another. The substitutions produced general rating decrements on all variables, but the decrements were greater for concrete than abstract sentences. Most notably, whereas comprehensibility
and sensibleness ratings were higher for concrete than abstract normal sentences, the difference was completely reversed for anomalous sentences. Moreover, an incidental free recall task following the ratings showed that recall of content words and whole sentences was much higher for concrete than abstract materials whether sensible or anomalous, and word imageability specifically benefited recall in anomalous as well as meaningful contexts, presumably because the words evoked memorable images in either case. Thus, DCT item-specific variables benefited recall even in massively disrupted contexts.
In sum, the studies cited in this section revealed persistent effects of DCT-relevant item-specific lexical variables that were sometimes qualified by contextual variables in ways predictable from DCT. The results appear not to be explainable in terms of Elman's suggestion that behavioral effects of lexical knowledge arise mainly from the language contexts in which lexical units occur.
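As an aside on what "additively independent" dual coding effects amount to, one conventional way to formalize the idea is to treat the two codes as independent routes to recall, so that P(recall) = 1 - (1 - p_verbal)(1 - p_image). The snippet below is our illustration, not a model drawn from the DCT literature, and the probabilities in it are hypothetical.

def recall_probability(p_verbal: float, p_image: float) -> float:
    """Recall succeeds if either independently stored code supports retrieval."""
    return 1 - (1 - p_verbal) * (1 - p_image)

# Hypothetical values: a concrete word plausibly gets both codes,
# an abstract word mostly the verbal code.
print(round(recall_probability(0.40, 0.35), 2))  # concrete item -> 0.61
print(round(recall_probability(0.40, 0.05), 2))  # abstract item -> 0.43

Under this independence assumption, an item-specific boost from imagery survives whatever the verbal context contributes, which is one way to read the absence of interactions in the studies cited above.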
5. Computational modeling
Elman's (2009) stated goal is to develop a computational model that would be consistent with a contextual explanation of apparent lexical influences on sentence processing. He conceded that his Simple Recurrent Network model is too simple to serve as anything but a conceptual metaphor, and he envisages modeling event schemas using a newer connectionist architecture that is "better suited for instantiating the schema" (p. 23). Thus far this goal remains a promissory note, and in our view it is likely to remain elusive because it requires modeling an abstract conceptual entity that has not been successfully instantiated in terms of empirical correlates. The theory-building enterprise moves from observable sentence phenomena to assumed knowledge of the world to increasingly abstract descriptions and conceptual representations, ending with completely disembodied computational models. We await the development and domain-specific explanatory power of the newer, improved models.
We turn finally to DCT and computational modeling. A useful bridge to the topic is the situational model in Kintsch's theory of comprehension because, like Elman's event knowledge and DCT, the situational model is intended to represent knowledge of the world, including event sequences. Moreover, as in Elman's approach but not DCT, the situation model is represented in an abstract, propositional format related to schemas. It is therefore somewhat surprising to see Kintsch's recent concession that "Situational models may be imagery based, in which case the propositional formalism used by most models fails us" (Kintsch, 2004, p. 1284; see also Kintsch, 1998).
A computational escape from the above impasse would require direct formal modeling of nonverbal event knowledge as reflected in imagery and pictured scenes. The models to date have failed to represent the kinds of detailed event information that behavioral experiments have shown to be available perceptually and in language-evoked imagery. The impasse is the same as that involved in Elman's (2009) and Kintsch's (2004) representation of event knowledge only indirectly as event descriptions. That is, computational models of scene perception and imagery are based on natural-language verbal descriptions transformed into
abstract formal descriptions (e.g., propositions, structural descriptions) that are necessary for computer programming. This was the case with early imagery simulations, and it remains so in more recent computational models of static and dynamic imagery (e.g., Croft & Thagard, 2002) as well as in AI-inspired computational imagination (Setchi, Lagos, & Froud, 2007). Problems associated with computer simulation models motivated Kosslyn to abandon the computer metaphor and shift instead to tests of a theory of imagery based on functional properties of the brain (Kosslyn, Van Kleek, & Kirby, 1990).
A possible exception to this negative conclusion is Mel's (1986) use of virtual robotics together with connectionist architecture to model three-dimensional mental rotation, zoom, and pan. Using a flat array of processors driven by a coarsely tuned binocular feature detector, the system learned to run simulations of the visual transformations from visual-motor experience with various kinds of motion. It remains to be seen how far the approach can be extended to comprehension, memory, and other phenomena relevant to DCT or Elman's approach, and whether it can go beyond simulation of known effects to generate predictions and discoveries of new properties of imagery, perception, and language.
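The contrast at issue here, between redescribing a scene propositionally and operating on the depiction itself, can be illustrated with a toy analog transformation of the general kind Mel's system learned. The fragment below is purely illustrative and is not drawn from Mel (1986): it applies a rotation directly to raw point coordinates, with no propositional redescription of the scene, and the shape and angles are hypothetical.

import numpy as np

def rotate_points(points: np.ndarray, angle_deg: float) -> np.ndarray:
    """Rotate 2-D points about the origin: an analog operation on the
    depiction itself rather than on a description of it."""
    theta = np.radians(angle_deg)
    rotation = np.array([[np.cos(theta), -np.sin(theta)],
                         [np.sin(theta),  np.cos(theta)]])
    return points @ rotation.T

# Stepping the angle yields the intermediate states that analog accounts
# of mental rotation posit (here the corners of a unit square).
square = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, -1.0], [-1.0, 1.0]])
for angle in (15, 30, 45):
    print(angle, rotate_points(square, angle).round(2).tolist())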
6. Conclusions
We conclude that there is much value in Elman's reconceptualization of the mental lexicon and in his emphasis on contextualized event knowledge. However, we do not agree that schema theory or other models based on computational descriptions offer an adequate solution to the issues he raises. The main problem is that such theories do not include the multimodal, verbal–nonverbal distinctions necessary for capturing the richness of real-world contexts that we agree are needed to fully account for meaning. Only theories that deal directly with these distinctions would be sufficient, and we submit DCT as one viable candidate (Sadoski & Paivio, 2007).
References
Alba, J. W., & Hasher, L. (1983). Is memory schematic? Psychological Bulletin, 93, 203–231.
Begg, I., & Clark, J. M. (1975). Contextual imagery in meaning and memory. Memory and Cognition, 3, 117–122.
Bergen, B. K., Lindsay, S., Matlock, T., & Narayanan, S. (2007). Spatial and linguistic aspects of visual imagery in sentence comprehension. Cognitive Science, 31, 733–764.
Caramazza, A. (1997). How many levels of processing are there in lexical access? Cognitive Neuropsychology, 14, 177–208.
Coltheart, M. (2004). Are there lexicons? The Quarterly Journal of Experimental Psychology, 57A, 1153–1171.
Croft, D., & Thagard, P. (2002). A computational model of motion and visual analogy. In L. Magnani & N. J. Nersessian (Eds.), Model-based reasoning: Scientific discovery, technological innovation, values (pp. 259–274). New York: Kluwer/Plenum.
Elman, J. L. (2004). An alternative view of the mental lexicon. Trends in Cognitive Sciences, 8, 301–305.
Elman, J. L. (2009). On the meaning of words and dinosaur bones: Lexical knowledge without a lexicon. Cognitive Science, 33, 1–36.
Glenberg, A., & Kaschak, M. (2002). Grounding language in action. Psychonomic Bulletin & Review, 9, 558–565.
Holcomb, P. J., Kounios, J., Anderson, J. E., & West, W. C. (1999). Dual-coding, context-availability, and concreteness effects in sentence comprehension: An electrophysiological investigation. Journal of Experimental Psychology: Learning, Memory, and Cognition, 25, 721–742.
Kintsch, W. (1998). Comprehension: A paradigm for cognition. New York: Cambridge University Press.
Kintsch, W. (2004). The construction-integration model of text comprehension and its implications for instruction. In R. B. Ruddell & N. J. Unrau (Eds.), Theoretical models and processes of reading (5th ed., pp. 1270–1328). Newark, DE: International Reading Association.
Kiss, G. R. (1975). An associative thesaurus of English: Structural analysis of a large relevance network. In A. Kennedy & A. Wilkes (Eds.), Studies in long-term memory (pp. 103–121). New York: Wiley.
Kosslyn, S. M., Van Kleek, M. H., & Kirby, K. N. (1990). A neurologically plausible model of individual differences in visual mental imagery. In P. J. Hampson, D. E. Marks, & J. T. E. Richardson (Eds.), Imagery: Current developments (pp. 39–77). London: Routledge.
Krasny, K. A., & Sadoski, M. (2008). Mental imagery and affect in English/French bilingual readers: A cross-linguistic perspective. Canadian Modern Language Review, 64, 399–428.
Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review, 104, 211–240.
Langacker, R. W. (1991). Foundations of cognitive grammar (Vol. 2). Stanford, CA: Stanford University Press.
Lockhart, R. S. (1987). Code duelling. Canadian Journal of Psychology, 41, 387–389.
Louwerse, M. M. (2008). Embodied relations are encoded in language. Psychonomic Bulletin & Review, 15, 838–844.
Mayer, R. E. (2001). Multimedia learning. London: Cambridge University Press.
Mel, B. W. (1986). A connectionist learning model for 3-D mental rotation, zoom, and pan. In Proceedings of the eighth annual conference of the Cognitive Science Society (pp. 562–571). New York: Erlbaum.
Morton, J. (1979). Facilitation in word recognition: Experiments causing change in the logogen model. In P. A. Kolers, M. Wrolstead, & H. Bouma (Eds.), Processing of visible language (Vol. 1, pp. 259–268). New York: Plenum Press.
Nowak, M. A., Plotkin, J. B., & Jansen, V. A. A. (2000). The evolution of syntactic communication. Nature, 404, 495–498.
O'Neill, B. J., & Paivio, A. (1978). Semantic constraints in encoding judgments and free recall of concrete and abstract sentences. Canadian Journal of Psychology, 32, 3–18.
Paivio, A. (1971). Imagery and verbal processes. New York: Holt, Rinehart, and Winston.
Paivio, A. (1978). The relationship between verbal and perceptual codes. In E. C. Carterette & M. P. Friedman (Eds.), Handbook of perception. Vol. IX: Perceptual processing (pp. 113–131). New York: Academic Press.
Paivio, A. (1983). The empirical case for dual coding. In J. C. Yuille (Ed.), Imagery, memory and cognition: Essays in honor of Allan Paivio (pp. 307–332). Hillsdale, NJ: Lawrence Erlbaum Associates.
Paivio, A. (1986). Mental representations: A dual coding approach. New York: Oxford University Press.
Paivio, A. (2007). Mind and its evolution: A dual coding theoretical approach. New York: Psychology Press. Previously published by Lawrence Erlbaum Associates.
Paivio, A., & Begg, I. (1971). Imagery and comprehension latencies as a function of sentence concreteness and structure. Perception & Psychophysics, 10, 408–412.
Paivio, A., Khan, M., & Begg, I. M. (2000). Concreteness and relational effects on recall of adjective–noun pairs. Canadian Journal of Experimental Psychology, 54, 149–159.
Paivio, A., Walsh, M., & Bons, T. (1994). Concreteness and memory: When and why? Journal of Experimental Psychology: Learning, Memory, and Cognition, 20, 1196–1204.
Richardson, J. T. E. (2003). Dual coding versus relational processing in memory for concrete and abstract words. European Journal of Cognitive Psychology, 15, 481–501.
Rumelhart, D. E. (1979). Some problems with the notion that words have literal meanings. In A. Ortony (Ed.), Metaphor and thought (pp. 71–82). Cambridge, England: Cambridge University Press.
Sadoski, M., Goetz, E. T., & Avila, E. (1995). Concreteness effects in text recall: Dual coding or context availability? Reading Research Quarterly, 30, 278–288.
Sadoski, M., Goetz, E. T., & Fritz, J. (1993). Impact of concreteness on comprehensibility, interest, and memory for text: Implications for dual coding theory and text design. Journal of Educational Psychology, 85, 291–304.
Sadoski, M., Goetz, E. T., Olivarez, A., Lee, S., & Roberts, N. M. (1990). Imagination in story reading: The role of imagery, verbal recall, story analysis, and processing levels. Journal of Reading Behavior, 22, 55–70.
Sadoski, M., Goetz, E. T., & Rodriguez, M. (2000). Engaging texts: Effects of concreteness on comprehensibility, interest, and recall in four text types. Journal of Educational Psychology, 92, 85–95.
Sadoski, M., & Paivio, A. (2001). Imagery and text: A dual coding theory of reading and writing. Mahwah, NJ: Lawrence Erlbaum Associates.
Sadoski, M., & Paivio, A. (2004). A dual coding theoretical model of reading. In R. B. Ruddell & N. J. Unrau (Eds.), Theoretical models and processes of reading (5th ed., pp. 1329–1362). Newark, DE: International Reading Association.
Sadoski, M., & Paivio, A. (2007). Toward a unified theory of reading. Scientific Studies of Reading, 11, 337–356.
Sadoski, M., Paivio, A., & Goetz, E. T. (1991). A critique of schema theory in reading and a dual coding alternative. Reading Research Quarterly, 26, 463–484.
Schönefeld, D. (2001). Where lexicon and syntax meet. New York: Mouton de Gruyter.
Schwanenflugel, P. J., & Shoben, E. J. (1983). Differential context effects in the comprehension of abstract and concrete verbal materials. Journal of Experimental Psychology: Learning, Memory, and Cognition, 9, 82–102.
Setchi, R., Lagos, N., & Froud, D. (2007). Computational imagination: Research agenda. In M. A. Orgun & J. Thornton (Eds.), Proceedings of the 20th Australian joint conference on artificial intelligence (AI 2007, LNAI 4830, pp. 387–393). Berlin: Springer-Verlag.
Werner, H., & Kaplan, B. (1963). Symbol formation: An organismic-developmental approach to the psychology of language and the expression of thought. New York: Wiley.
Woollams, A. M., Joanisse, M., & Patterson, K. (2009). Past-tense generation from form versus meaning: Behavioural data and simulation evidence. Journal of Memory and Language, 61, 55–76.
Zwaan, R. A. (2004). The immersed experiencer: Toward an embodied theory of language comprehension. In B. H. Ross (Ed.), The psychology of learning and motivation (Vol. 44, pp. 35–62). New York: Academic Press.