CONSCIOUSNESS: A Mathematical Treatment of the Global Neuronal Workspace Model
by Rodrick Wallace, Ph.D., Epidemiology of Mental Disorders Research Department, New York Psychiatric Institute
i.e., the network is 'tight' in the sense that each node interacts with it as a whole with unit probability. Hence the P_j form a legitimate probability distribution. These are deep waters: for any probability distribution, 0 ≤ P_j ≤ 1 with \sum_j P_j = 1, the quantity

H[P] = -\sum_j P_j \log(P_j)    (2.5)
is the distribution's Shannon uncertainty, a fundamental quantity of classical information theory. Neglecting details explored below, the transfer of uncertainty represents the transmission of information: The Shannon Coding Theorem, the first important result of information theory, states that for any rate R < C, where C represents the capacity of the information channel, it is possible to find a 'coding scheme' such that a sufficiently long message can be sent with arbitrarily small error. This is surely one of the most striking conclusions of 20th Century applied mathematics.
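As a minimal numerical sketch of equation (2.5) (the distributions below are arbitrary examples, not taken from the text):

```python
import numpy as np

def shannon_uncertainty(P, base=2):
    """H[P] = -sum_j P_j log(P_j) for a probability distribution P, as in eq. (2.5)."""
    P = np.asarray(P, dtype=float)
    assert np.all(P >= 0) and np.isclose(P.sum(), 1.0), "P must be a probability distribution"
    P = P[P > 0]                                  # the convention 0 log 0 = 0
    return -np.sum(P * np.log(P)) / np.log(base)  # convert from nats to the chosen base

print(shannon_uncertainty([0.5, 0.5]))            # 1 bit: maximal for two symbols
print(shannon_uncertainty([0.9, 0.05, 0.05]))     # well below log2(3) ~ 1.585 bits
```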
1. The Shannon Coding Theorem
Messages from a source, seen as symbols x_j from some alphabet, each having probabilities P_j associated with a random variable X, are 'encoded' into the language of a 'transmission channel', a random variable Y with symbols y_k, having probabilities P_k, possibly with error. Someone receiving the symbol y_k then retranslates it (without error) into some x_k, which may or may not be the same as the x_j that was sent. More formally, the message sent along the channel is characterized by a random variable X having the distribution

P(X = x_j) = P_j,   j = 1, ..., M.

The channel through which the message is sent is characterized by a second random variable Y having the distribution

P(Y = y_k) = P_k,   k = 1, ..., L.
Let the joint probability distribution of X and Y be defined as

P(X = x_j, Y = y_k) = P(x_j, y_k) = P_{j,k}
and the conditional probability of Y given X as

P(Y = y_k | X = x_j) = P(y_k | x_j).

Then the Shannon uncertainties of X and Y independently and the joint uncertainty of X and Y together are defined respectively as
H(X) = -\sum_{j=1}^{M} P_j \log(P_j)

H(Y) = -\sum_{k=1}^{L} P_k \log(P_k)

H(X,Y) = -\sum_{j=1}^{M} \sum_{k=1}^{L} P_{j,k} \log(P_{j,k}).    (2.6)
The conditional uncertainty of Y given X is defined as
H(Y|X) = -\sum_{j=1}^{M} \sum_{k=1}^{L} P_{j,k} \log[P(y_k | x_j)].    (2.7)
For any two stochastic variates X and Y, H(Y) ≥ H(Y|X), as knowledge of X generally gives some knowledge of Y. Equality occurs only in the case of stochastic independence.
Since P(x_j, y_k) = P(x_j)P(y_k | x_j), we have H(X|Y) = H(X,Y) - H(Y). The information transmitted by translating the variable X into the channel transmission variable Y - possibly with error - and then retranslating without error the transmitted Y back into X is defined as
I(X|Y) = H(X) - H(X|Y) = H(X) + H(Y) - H(X,Y).    (2.8)
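As a small numerical illustration of these definitions (the joint distribution is an invented example, not from the text), the uncertainties and the transmitted information of equations (2.6)-(2.8) can be computed directly from a joint distribution:

```python
import numpy as np

# Joint distribution P[j, k] = P(x_j, y_k) for a toy 2x2 case (hypothetical numbers)
P = np.array([[0.4, 0.1],
              [0.1, 0.4]])

def H(dist):
    """Shannon uncertainty -sum p log p in nats, ignoring zero entries."""
    d = dist[dist > 0]
    return -np.sum(d * np.log(d))

Px, Py = P.sum(axis=1), P.sum(axis=0)   # marginal distributions of X and Y
H_X, H_Y, H_XY = H(Px), H(Py), H(P.ravel())
H_X_given_Y = H_XY - H_Y                # H(X|Y) = H(X,Y) - H(Y)
I = H_X + H_Y - H_XY                    # eq. (2.8)
print(H_X, H_Y, H_XY, H_X_given_Y, I)
```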
See, for example, Ash (1990), Khinchine (1957) or Cover and Thomas (1991) for details. The essential point is that if there is no uncertainty in X given the channel Y, then there is no loss of information through transmission. In general this will not be true, and herein lies the essence of the theory. Given a fixed vocabulary for the transmitted variable X, and a fixed vocabulary and probability distribution for the channel Y, we may vary the probability distribution of X in such a way as to maximize the information sent. The capacity of the channel is defined as
C = \max_{P(X)} I(X|Y),    (2.9)
subject to the subsidiary condition that \sum P(X) = 1. The critical trick of the Shannon Coding Theorem for sending a message with arbitrarily small error along the channel Y at any rate R < C is to encode it in longer and longer 'typical' sequences of the variable X; that is, those sequences whose distribution of symbols approximates the probability distribution P(X) above which maximizes C. If S(n) is the number of such 'typical' sequences of length n, then
log[S(n)] ≈ nH(X), where H(X) is the uncertainty of the stochastic variable defined above. Some consideration shows that S(n) is much less than the total number of possible messages of length n. Thus, as n → ∞, only a vanishingly small fraction of all possible messages is meaningful in this sense. This observation, after some considerable development, is what allows the Coding Theorem to work so well. In sum, the prescription is to encode messages in typical sequences, which are sent at very nearly the capacity of the channel. As the encoded messages become longer and longer, their maximum possible rate of transmission without error approaches channel capacity as a limit. Again, Ash (1990), Khinchine (1957) and Cover and Thomas (1991) provide details.
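The maximization in equation (2.9) can be carried out numerically. The sketch below uses the standard Blahut-Arimoto alternating-maximization algorithm, which is not discussed in the text; the binary symmetric channel and the tolerance are illustrative assumptions.

```python
import numpy as np

def blahut_arimoto(W, tol=1e-9, max_iter=1000):
    """Approximate C = max_{P(X)} I(X|Y) in nats for a channel matrix W[x, y] = P(y|x)."""
    n_x = W.shape[0]
    p = np.full(n_x, 1.0 / n_x)              # start from the uniform input distribution
    for _ in range(max_iter):
        q = p @ W                             # induced output distribution P(y)
        D = np.zeros(n_x)                     # KL divergence of each channel row from P(y)
        for x in range(n_x):
            mask = W[x] > 0
            D[x] = np.sum(W[x, mask] * np.log(W[x, mask] / q[mask]))
        lower, upper = p @ D, D.max()         # standard lower and upper bounds on capacity
        if upper - lower < tol:
            break
        p = p * np.exp(D)                     # multiplicative update of the input distribution
        p /= p.sum()
    return p @ D, p

# Binary symmetric channel with crossover probability 0.1
W = np.array([[0.9, 0.1],
              [0.1, 0.9]])
C, p_opt = blahut_arimoto(W)
print(C, p_opt)   # about 0.368 nats (0.531 bits), attained by the uniform input
```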
2. More heuristics: a 'tuning theorem'
Telephone lines, optical wave guides and the tenuous plasma through which a planetary probe transmits data to earth may all be viewed in traditional information-theoretic terms as a noisy channel around which we must structure a message so as to attain an optimal error-free transmission rate. Telephone lines, wave guides and interplanetary plasmas are, relatively speaking, fixed on the timescale of most messages, as are most sociogeographic networks. Indeed, the capacity of a channel, according to equation (2.9), is defined by varying the probability distribution of the 'message' process X so as to maximize I(X|Y). Suppose there is some message X so critical that its probability distribution must remain fixed. The trick is to fix the distribution P(x) but modify the channel - i.e. tune it - so as to maximize I(X|Y). The dual channel capacity C* can be defined as
C^* = \max_{P(Y), P(Y|X)} I(X|Y).    (2.10)
But

C^* = \max_{P(Y), P(Y|X)} I(Y|X),

since

I(X|Y) = H(X) + H(Y) - H(X,Y) = I(Y|X).

Thus, in a purely formal mathematical sense, the message transmits the channel, and there will indeed be, according to the Coding Theorem, a channel distribution P(Y) which maximizes C^*. One may do better than this, however, by modifying the channel matrix
P(Y|X). Since

P(y_j) = \sum_{i=1}^{M} P(x_i) P(y_j | x_i),

P(Y) is entirely defined by the channel matrix P(Y|X) for fixed P(X), and

C^* = \max_{P(Y), P(Y|X)} I(Y|X) = \max_{P(Y|X)} I(Y|X).
Calculating C* requires maximizing the complicated expression
I(X|Y) = H(X) + H(Y) - H(X,Y), which contains products of terms and their logarithms, subject to constraints that the sums of probabilities are 1 and each probability is itself between 0 and 1. Maximization is done by varying the channel matrix terms P(y_j | x_i) within the constraints. This is a difficult problem in nonlinear optimization. See Parker et al. (2003) for a comprehensive treatment, using traditional Lagrange multiplier methods. However, for the special case M = L, C^* may be found by inspection: if M = L, then choose
P(y_j | x_i) = δ_{j,i}, where δ_{i,j} is 1 if i = j and 0 otherwise. For this special case,

C^* = H(X),
with P(y_k) = P(x_k) for all k. Information is thus transmitted without error when the channel becomes 'typical' with respect to the fixed message distribution P(X). If M < L matters reduce to this case, but for L < M information must be lost, leading to 'Rate Distortion' arguments explored more fully below. Thus modifying the channel may be a far more efficient means of ensuring transmission of an important message than encoding that message in a 'natural' language which maximizes the rate of transmission of information on a fixed channel. We have examined the two limits in which either the distribution of P(Y) or that of P(X) is kept fixed. The first provides the usual Shannon Coding Theorem, and the second, hopefully, a tuning theorem variant. It seems likely, however, that for many important systems P(X) and P(Y) will 'interpenetrate,' to use Richard Levins' terminology. That is, P(X) and P(Y) will affect each other in characteristic ways, so that some form of mutual tuning may be the most effective strategy.
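A quick numerical check of this special case (a sketch with an invented fixed message distribution): with M = L and the identity channel matrix P(y_j | x_i) = δ_{j,i}, the transmitted information equals H(X), so the 'tuned' channel attains C^* = H(X).

```python
import numpy as np

def H(d):
    d = d[d > 0]
    return -np.sum(d * np.log(d))

def transmitted_information(p_x, channel):
    """I(X;Y) in nats for a fixed input distribution p_x and channel matrix P(y|x)."""
    joint = p_x[:, None] * channel        # P(x, y) = P(x) P(y|x)
    p_y = joint.sum(axis=0)
    return H(p_x) + H(p_y) - H(joint.ravel())

p_x = np.array([0.7, 0.2, 0.1])           # fixed 'message' distribution (hypothetical)
identity = np.eye(3)                      # the 'tuned' channel: P(y_j|x_i) = 1 if i = j, else 0
print(H(p_x), transmitted_information(p_x, identity))   # the two values coincide: C* = H(X)
```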
3. The Shannon-McMillan Theorem
Not all statements - sequences of the random variable X - are equivalent. According to the structure of the underlying language of which the message is a particular expression, some messages are more 'meaningful' than others, that is, in accord with the grammar and syntax of the language. The other principal result from information theory, the Shannon-McMillan or Asymptotic Equipartition Theorem, describes how messages themselves are to be classified. Suppose a long sequence of symbols is chosen, using the output of the random variable X above, so that an output sequence of length n, with the form

x^n = (a_0, a_1, ..., a_{n-1}),

has joint and conditional probabilities

P(X_0 = a_0, X_1 = a_1, ..., X_{n-1} = a_{n-1})

P(X_n = a_n | X_0 = a_0, ..., X_{n-1} = a_{n-1}).    (2.11)
Using these probabilities we may calculate the conditional uncertainty H(X_n | X_0, X_1, ..., X_{n-1}).
The uncertainty of the information source, H[X], is defined as
H[X] = \lim_{n \to \infty} H(X_n | X_0, X_1, ..., X_{n-1}).    (2.12)
In general the conditional uncertainty H(X_n | X_0, X_1, ..., X_{n-1}) depends on the entire past of the process; if that dependence extends back only some fixed number n of previous symbols, then the source is said to be of order n. It is easy to show that

H[X] = \lim_{n \to \infty} \frac{H(X_0, X_1, ..., X_n)}{n+1}.
In general the outputs of the X_j, j = 0, 1, ..., n are dependent. That is, the output of the communication process at step n depends on previous steps. Such serial correlation, in fact, is the very structure which enables most of what follows in this book. Here, however, the processes are all assumed stationary in time, that is, the serial correlations do not change in time, and the system is memoryless. A very broad class of such self-correlated, memoryless, information sources, the so-called ergodic sources for which the long-run relative frequency of a sequence converges stochastically to the probability assigned to it, have a particularly interesting property: it is possible, in the limit of large n, to divide all sequences of outputs of an ergodic information source into two distinct sets, S_1 and S_2, having, respectively, very high and very low probabilities of occurrence, with the source uncertainty providing the splitting criterion. In particular the Shannon-McMillan Theorem states that, for a (long) sequence having n (serially correlated) elements, the number of 'meaningful' sequences, N(n) - those belonging to set S_1 - will satisfy the relation
\log[N(n)] \approx n H[X].    (2.13)

More formally,

\lim_{n \to \infty} \frac{\log[N(n)]}{n} = H[X]
 = \lim_{n \to \infty} H(X_n | X_0, ..., X_{n-1})
 = \lim_{n \to \infty} \frac{H(X_0, X_1, ..., X_n)}{n+1}.    (2.14)
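As an illustration of this equipartition property (a sketch, not from the original text; the transition matrix is invented), the following estimates the source uncertainty of a two-state Markov source and checks that the per-symbol log-probability of a long generated sequence converges to it, which is the content of equations (2.13)-(2.14) for this simple case:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-state stationary Markov source with T[i, j] = P(next = j | current = i) (hypothetical)
T = np.array([[0.9, 0.1],
              [0.4, 0.6]])

# Stationary distribution pi solves pi = pi T (left eigenvector for eigenvalue 1)
evals, evecs = np.linalg.eig(T.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi = np.abs(pi) / np.abs(pi).sum()

# Source uncertainty (entropy rate), the limit in eq. (2.14), in nats
H = -np.sum(pi[:, None] * T * np.log(T))

# Equipartition: -(1/n) log P(x_0, ..., x_{n-1}) -> H[X] along typical ('meaningful') sequences
n, logp = 20000, 0.0
state = rng.choice(2, p=pi)
for _ in range(n):
    nxt = rng.choice(2, p=T[state])
    logp += np.log(T[state, nxt])
    state = nxt
print(H, -logp / n)   # the two numbers should nearly coincide
```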
The Shannon Coding Theorem, by means of an analogous splitting argument, shows that for any rate R < C, where C is the channel capacity, a message may be sent without error, using the probability distribution for X which maximizes I(X|Y) as the coding scheme. Using the internal structures of the information source permits limiting attention only to meaningful sequences of symbols. This restriction can greatly raise the maximum possible rate at which information can be transmitted with arbitrarily small error: if there are M possible symbols and the uncertainty of the source is H[X], then the effective capacity of the channel, C_E, using this 'source coding,' becomes (Ash, 1990)
C_E = C \, \frac{\log(M)}{H[X]}.    (2.15)
As H[X] ≤ \log(M), with equality only for stochastically independent, uniformly distributed random variables,

C_E ≥ C.    (2.16)
Note that, for a given channel capacity, transmission with arbitrarily small error requires the condition H[X] ≤ C. Source uncertainty has a very important heuristic interpretation. As Ash (1990) puts it,

...[W]e may regard a portion of text in a particular language as being produced by an information source. The probabilities P(X_n = a_n | X_0 = a_0, ..., X_{n-1} = a_{n-1}) may be estimated from the available data about the language; in this way we can estimate the uncertainty associated with the language. A large uncertainty means, by the [Shannon-McMillan Theorem], a large number of 'meaningful' sequences. Thus given two languages with uncertainties H_1 and H_2 respectively, if H_1 > H_2, then in the absence of noise it is easier to communicate in the first language; more can be said in the same amount of time. On the other hand, it will be easier to reconstruct a scrambled portion of text in the second language, since fewer of the possible sequences of length n are meaningful.
It is possible to significantly generalize this heuristic picture in such a way as to characterize the interaction between different 'languages,' something at the core of the development.
4. The Rate Distortion Theorem
The Shannon-McMillan Theorem can be expressed as the 'zero error limit' of something called the Rate Distortion Theorem (Dembo and Zeitouni, 1998; Cover and Thomas, 1991), which defines a splitting criterion that identifies high probability pairs of sequences. We follow closely the treatment of Cover and Thomas (1991). The origin of the problem is the question of representing one information source by a simpler one in such a way that the least information is lost. For example we might have a continuous variate between 0 and 100, and wish to represent it in terms of a small set of integers in a way that minimizes the inevitable distortion that process creates. Typically, for example, an analog
audio signal will be replaced by a 'digital' one. The problem is to do this in a way which least distorts the reconstructed audio waveform. Suppose the original memoryless, ergodic information source Y with output from a particular alphabet generates sequences of the form

y^n = y_1, ..., y_n.
These are 'digitized,' in some sense, producing a chain of 'digitized values'

b^n = b_1, ..., b_n,

where the b-alphabet is much more restricted than the y-alphabet. b^n is, in turn, deterministically retranslated into a reproduction of the original signal y^n. That is, each b^m is mapped onto a unique n-length ŷ-sequence in the alphabet of the information source Ŷ:

b^m → ŷ^n = ŷ_1, ..., ŷ_n.
Note, however, that many y^n sequences may be mapped onto the same retranslation sequence ŷ^n, so that information will, in general, be lost. The central problem is to explicitly minimize that loss. The retranslation process defines a new memoryless, ergodic information source, Ŷ. The next step is to define a distortion measure, d(y, ŷ), which compares the original to the retranslated path. For example, the Hamming distortion is
d(y, ŷ) = 1, y ≠ ŷ
d(y, ŷ) = 0, y = ŷ.    (2.17)
For continuous variates the squared error distortion is

d(y, ŷ) = (y - ŷ)^2.    (2.18)
Possibilities abound. The distortion between paths y^n and ŷ^n is defined as

d(y^n, ŷ^n) = \frac{1}{n} \sum_{j=1}^{n} d(y_j, ŷ_j).    (2.19)
Suppose that with each path y^n and b^n-path retranslation into the ŷ-language, denoted ŷ^n, there are associated individual, joint, and conditional probability distributions p(y^n), p(ŷ^n), p(ŷ^n | y^n).
The average distortion is defined as

D = \sum_{y^n} p(y^n) \, d(y^n, ŷ^n).    (2.20)
It is possible, using the distributions given above, to define the information transmitted from the incoming Y to the outgoing Ŷ process in the usual manner, using the Shannon source uncertainty of the strings:

I(Y, Ŷ) = H(Y) - H(Y|Ŷ) = H(Y) + H(Ŷ) - H(Y, Ŷ).

If there is no uncertainty in Y given the retranslation Ŷ, then no information is lost. In general, this will not be true. The information rate distortion function R(D) for a source Y with a distortion measure d(y, ŷ) is defined as
R(D) = \min_{p(ŷ|y): \, \sum_{(y, ŷ)} p(y) p(ŷ|y) d(y, ŷ) \le D} I(Y, Ŷ).    (2.21)
The minimization is over all conditional distributions p(ŷ|y) for which the joint distribution p(y, ŷ) = p(y)p(ŷ|y) satisfies the average distortion constraint (i.e. average distortion ≤ D). The Rate Distortion Theorem states that R(D) is the minimum necessary rate of information transmission which ensures the average distortion does not exceed D. Cover and Thomas (1991) or Dembo and Zeitouni (1998) provide details, and Parker et al. (2003) formalize a comprehensive attack. More to the point, however, is the following: pairs of sequences (y^n, ŷ^n) can be defined as distortion typical; that is, for a given average distortion D, defined in terms of a particular measure, pairs of sequences can be divided into two sets, a high probability one containing a relatively small number of (matched) pairs with d(y^n, ŷ^n) ≤ D, and a low probability one containing most pairs. As n → ∞, the smaller set approaches unit probability, and, for those pairs,
(2.22)
Thus, roughly speaking, I(Y, Ŷ) embodies the splitting criterion between high and low probability pairs of paths. For the theory of interacting information sources, then, I(Y, Ŷ) can play the role of H in the dynamic treatment that follows. The rate distortion function of eq. (2.21) can actually be calculated in many cases by using a Lagrange multiplier method - see Section 13.7 of Cover and Thomas (1991). At various points in the development we will suggest using s = d(x, x̂) as a metric in a geometry of information sources, e.g. when simple ergodicity fails, and H(x) ≠ H(x̂) for high probability paths x and x̂. See eq. (3.2).
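As a concrete instance (not derived in the text), the rate distortion function of a Bernoulli(p) source under Hamming distortion has the standard closed form R(D) = H(p) - H(D) for 0 ≤ D ≤ min(p, 1-p), and zero beyond that range (Cover and Thomas, 1991). A minimal sketch:

```python
import numpy as np

def h2(p):
    """Binary Shannon uncertainty in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def rate_distortion_bernoulli(p, D):
    """R(D) = H(p) - H(D) for 0 <= D <= min(p, 1-p), else 0 (Hamming distortion)."""
    if D >= min(p, 1 - p):
        return 0.0
    return h2(p) - h2(D)

for D in (0.0, 0.05, 0.1, 0.2):
    print(D, rate_distortion_bernoulli(0.25, D))   # R(D) falls monotonically, reaching 0 at D = 0.25
```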
5. Large Deviations
The use of information source uncertainty above as a splitting criterion between high and low probability sequences (or pairs of them) displays the fundamental characteristic of a growing body of work in applied probability often termed the 'Large Deviations Program' (LDP), which seeks to unite information theory, statistical mechanics and the theory of fluctuations under a single umbrella. It serves as a convenient starting point for further developments. We can begin to place information theory in the context of the LDP as follows (Dembo and Zeitouni, 1998, p. 2): let X_1, X_2, ..., X_n be a sequence of independent, standard Normal, real-valued random variables and let
S_n = \frac{1}{n} \sum_{j=1}^{n} X_j.    (2.23)
Since S_n is again a Normal random variable with zero mean and variance 1/n, for all δ > 0,
\lim_{n \to \infty} P(|S_n| \ge δ) = 0,    (2.24)
where P is the probability that the absolute value of S_n is greater than or equal to δ. Some manipulation, however, gives
P(|S_n| \ge δ) = 1 - \frac{1}{\sqrt{2π}} \int_{-δ\sqrt{n}}^{δ\sqrt{n}} \exp(-x^2/2) \, dx,    (2.25)
so that
\lim_{n \to \infty} \frac{1}{n} \log P(|S_n| \ge δ) = -\frac{δ^2}{2}.    (2.26)
This can be rewritten for large n as
P(|S_n| \ge δ) \approx \exp(-n δ^2 / 2).    (2.27)
That is, for large n, the probability of a large deviation in S_n follows something much like equation (2.13), i.e. meaningful paths of length n all have approximately the same probability P(n) ∝ exp(-nH[X]). Our questions about 'meaningful paths' thus appear as formally isomorphic to the central argument of the LDP, which subsumes statistical mechanics, fluctuation theory, and information theory within a single structure (Dembo and Zeitouni, 1998). A cardinal tenet of large deviation theory is that the 'rate function' δ^2/2 in equation (2.26) can, under proper circumstances, be expressed as a mathematical 'entropy' having the standard form
-\sum_k p_k \log(p_k)    (2.28)
for some set of probabilities p_k. This striking result goes under various names at various levels of approximation - Sanov's Theorem, Cramér's Theorem, the Gärtner-Ellis Theorem, the Shannon-McMillan Theorem, and so on (Dembo and Zeitouni, 1998).
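The Gaussian example above can be checked directly (a minimal sketch using the exact Gaussian tail): the rate -(1/n) log P(|S_n| ≥ δ) approaches δ^2/2, as in equation (2.26).

```python
from math import erfc, log, sqrt

delta = 0.5
for n in (10, 100, 1000, 5000):
    # Exact two-sided tail of S_n ~ Normal(0, 1/n): P(|S_n| >= delta) = erfc(delta * sqrt(n/2))
    P = erfc(delta * sqrt(n / 2.0))
    print(n, -log(P) / n, delta**2 / 2)   # the empirical rate converges to delta^2/2 = 0.125
```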
6. Fluctuations
The standard treatment of 'fluctuations' (Onsager and Machlup, 1953; Freidlin and Wentzell, 1998) in physical systems is the principal foundation for much current study of stochastic resonance and related phenomena, and also serves as a useful reference point. The macroscopic behavior of a complicated physical system in time is assumed to be described by the phenomenological Onsager relations giving large-scale fluxes as
\sum_j R_{i,j} \, dK_j/dt = ∂S/∂K_i,    (2.29)

where the R_{i,j} are appropriate constants, S is the system entropy and the K_i are the generalized coordinates which parametize the system's free energy. Entropy is defined from free energy F by a Legendre transform - more of which follows below:

S ≡ F(K) - \sum_j K_j \, ∂F/∂K_j,
where the K_j are appropriate system parameters. Neglecting volume problems (which will become quite important later), free energy can be defined from the system's partition function Z as

F(K) = \log[Z(K)].
The partition function Z, in turn, is defined from the system Hamiltonian - defining the energy states - as

Z(K) = \sum_j \exp[-K E_j],
where K is an inverse temperature or other parameter and the E_j are the energy states. Inverting the Onsager relations gives
dK_i/dt = \sum_j L_{i,j} \, ∂S/∂K_j = L_i(K_1, ..., K_m, t) ≡ L_i(K, t).    (2.30)
The terms ∂S/∂K_i are macroscopic driving 'forces' dependent on the entropy gradient. Let a white Brownian 'noise' ε(t) perturb the system, so that
dK_i/dt = \sum_j L_{i,j} \, ∂S/∂K_j + ε(t) = L_i(K, t) + ε(t),    (2.31)
where the time averages of ε are ⟨ε(t)⟩ = 0 and ⟨ε(t)ε(0)⟩ = Dδ(t). δ(t) is the Dirac delta function, and we take K as a vector in the K_i. Following Luchinsky (1997), if the probability that the system starts at some initial macroscopic parameter state K_0 at time t = 0 and gets to the state K(t) at time t is P(K, t), then a somewhat subtle development (e.g. Feller, 1971) gives the forward Fokker-Planck equation for P:
∂P(K,t)/∂t = -∇ · (L(K,t) P(K,t)) + (D/2) ∇^2 P(K,t).    (2.32)
In the limit of weak noise intensity this can be solved using the WKB, i.e. the eikonal, approximation, as follows: take
P(K,t) = z(K,t) \exp(-s(K,t)/D).    (2.33)
z(K, t) is a prefactor and s(K, t) is a classical action satisfying the Hamilton-Jacobi equation, which can be solved by integrating the Hamiltonian equations of motion. The equation reexpresses P(K, t) in the usual parametized negative exponential format. Let p ≡ ∇s. Substituting equation (2.33) into equation (2.32) and collecting terms of similar order in D gives
dK/dt = p + L,

dp/dt = -∂L/∂K \, p,

-∂s/∂t ≡ h(K, p, t) = p L(K, t) + \frac{p^2}{2},

with h(K, t) the 'Hamiltonian' for appropriate boundary conditions. Again following Luchinsky (1997), these 'Hamiltonian' equations have two different types of solution, depending on p. For p = 0,

dK/dt = L(K, t),

which describes the system in the absence of noise. We expect that with finite noise intensity the system will give rise to a distribution about this deterministic path. Solutions for which p ≠ 0 correspond to optimal paths along which the system will move with overwhelming probability. This is a formulation of fluctuation theory which has particular attraction for physicists, few of whom can resist the nearly magical appearance of a Hamiltonian. These results can, however, again be directly derived as a special case of a Large Deviation Principle based on generalized 'entropies' mathematically similar to Shannon's uncertainty from information theory, bypassing the 'Hamiltonian' formulation entirely (Dembo and Zeitouni, 1998). For languages, of course, there is no possibility of a Hamiltonian, but the generalized entropy or splitting criterion treatment still works. The trick will be to do with entropies what is most often done with Hamiltonians: here we will be concerned, not with a random Brownian distortion of simple physical systems, but with a complex 'behavioral' structure, in the largest sense, composed of quasi-independent 'actors' for which [1] the usual Onsager relations of equations (2.29) and (2.30) may be too simple, [2] the 'noise' may not be either small or random, and, most critically, [3] the meaningful/optimal paths have extremely structured serial correlation amounting to a grammar and syntax, precisely the fact which allows definition of an information source and enables the use of the very sparse equipartition of the Shannon-McMillan and Rate Distortion Theorems. The sparseness and equipartition, in fact, permit solution of the problems we will address.
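For completeness, here is a brief sketch (not in the original) of the leading-order bookkeeping behind these equations: substituting (2.33) into (2.32) and retaining only the terms of order 1/D yields the Hamilton-Jacobi equation quoted above.

\[
P = z\,e^{-s/D} \;\Rightarrow\;
\partial_t P = \Bigl(\partial_t z - \tfrac{z}{D}\,\partial_t s\Bigr)e^{-s/D}, \qquad
\nabla P = \Bigl(\nabla z - \tfrac{z}{D}\,\nabla s\Bigr)e^{-s/D}.
\]

Inserting these into (2.32), the terms of order 1/D give

\[
-\frac{z}{D}\,\partial_t s
= \frac{z}{D}\,L\cdot\nabla s + \frac{D}{2}\cdot\frac{z}{D^{2}}\,(\nabla s)^{2},
\]

so that, with p ≡ ∇s,

\[
-\,\partial_t s = p\cdot L(K,t) + \frac{p^{2}}{2} = h(K,p,t),
\]

which is solved along characteristics by the 'Hamiltonian' equations for K and p given above.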
In sum, to again paraphrase Luchinsky (1997), large fluctuations, although infrequent, are fundamental in a broad range of processes, and it was recognized by Onsager and Machlup (1953) that insight into the problem could be gained from studying the distribution of fluctuational paths along which the system moves to a given state. This distribution is a fundamental characteristic of the fluctuational dynamics, and its understanding leads toward control of fluctuations. Fluctuational motion from the vicinity of a stable state may occur along different paths. For large fluctuations, the distribution of these paths peaks sharply along an optimal, most probable, path. In the theory of large fluctuations, the pattern of optimal paths plays a role similar to that of the phase portrait in nonlinear dynamics. In this development 'meaningful' paths play the role of 'optimal' paths in the theory of large fluctuations, but without benefit of a 'Hamiltonian.'
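A minimal numerical sketch of these ideas (all parameter values are invented): an overdamped one-dimensional system relaxing toward a single stable state, perturbed by weak white noise, has a stationary distribution of the weak-noise form P(K) ∝ exp(-s(K)/D), here a Gaussian of variance D/2.

```python
import numpy as np

rng = np.random.default_rng(2)

# dK/dt = L(K) + eps(t) with L(K) = -K (one stable state at K = 0) and noise intensity D,
# integrated with the Euler-Maruyama scheme.
D, dt, steps = 0.1, 0.01, 400000
K, samples = 0.0, []
for _ in range(steps):
    K += -K * dt + np.sqrt(D * dt) * rng.standard_normal()
    samples.append(K)

samples = np.array(samples[10000:])        # discard the initial transient
# Weak-noise prediction: P(K) ~ exp(-K^2 / D), a Gaussian with variance D/2 = 0.05
print(samples.var(), D / 2)
```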
7. The fundamental homology
Section 5 above gives something of the flavor of the LDP, which tries to unify statistical mechanics, large fluctuations and information theory. This opens a methodological Pandora's Box: the LDP provides justification for a massive transfer of superstructure from statistical mechanics to information theory, including real-space renormalization for the treatment of phase transition, thermodynamics and an equation of state, generalized Onsager relations, and so on. From fluctuation theory and nonlinear dynamics come phase space, domains of attraction and related matters. Several particulars distinguish this approach. First is a draconian simplification which seeks to employ information theory concepts only as they directly relate to the basic limit theorems of the subject. That is, message uncertainty and information source uncertainty are interesting only because they obey the Coding, Source Coding, Rate Distortion, and related theorems. 'Information Theory' treatments which do not sufficiently center on these theorems are, from this view, far off the mark. Thus most discussion of 'complexity,' 'entropy maximization,' different definitions of 'entropy,' and so forth, just does not appear on the horizon. In the words of William of Occam, "Entities ought not be multiplied without necessity." The second matter is somewhat more complicated: Rojdestvenski and Cottam (2000, p. 44), following Wallace and Wallace (1998), see the linkage between information theory and statistical mechanics as a characteristic

...[homological] mapping... between... unrelated... problems that share the same mathematical basis... [whose] similarities in mathematical formalisms... become powerful tools for [solving]... traditional problems.
The possible relation of information theory to biological and social process, both of which can involve agency, appears very sharply constrained, involving:
(1) a 'linguistic' equipartition of sets of probable paths consistent with the Shannon-McMillan, Rate Distortion, or related theorems which serves as the formal connection with nonlinear mechanics and fluctuation theory, and (2) a homological correspondence between information source uncertainty and statistical mechanical free energy density, not statistical mechanical entropy. In this latter regard, the definition of the free energy density of a parametized physical system is
F[K_1, ..., K_m] = \lim_{V \to \infty} \frac{\log[Z(K_1, ..., K_m)]}{V},    (2.34)
where the K_j are parameters, V is the system volume, and Z is, again, the partition function. For an ergodic information source the equivalent relation associates the source uncertainty with the number of 'meaningful' statements N(n) of length n, in the limit

H[X] = \lim_{n \to \infty} \frac{\log[N(n)]}{n}.
References

... I. A thermodynamic analogy, Physical Review E, DOI: 10.1103/PhysRevE.64.011917.
Steyn-Ross M., D. Steyn-Ross, J. Sleigh, and D. Whiting, (2003), Theoretical predictions for spatial covariance of the electroencephalographic signal during the anesthetic-induced phase transition: Increased correlation length and emergence of spatial self-organization, Physical Review E, DOI: 10.1103/PhysRevE.68.021902.
Tauber A., (1998), Conceptual shifts in immunology: Comments on the 'two-way paradigm.' In K. Schaffner and T. Starzl (eds.), Paradigm changes in organ transplantation, Theoretical Medicine and Bioethics, 19:457-473.
Tegmark M., (2000), Importance of quantum decoherence in brain processes, Physical Review E, 61:4194-4206.
Teunis M., A. Kavelaars, E. Voest, J. Bakker, B. Ellenborek, A. Cools, and C. Heijnen, (2002), Reduced tumor growth, experimental metastasis formation, and angiogenesis in rats with a hyperreactive dopaminergic system, FASEB Journal express article 10.1096/fj.02-014fje.
Thayer J. and R. Lane, (2000), A model of neurovisceral integration in emotion regulation and dysregulation, Journal of Affective Disorders, 61:201-216.
Thayer J. and B. Friedman, (2002), Stop that! Inhibition, sensitization, and their neurovisceral concomitants, Scandinavian Journal of Psychology, 43:123-130.
Timberlake W., (1994), Behavior systems, associationism, and Pavlovian conditioning, Psychonomic Bulletin and Review, 1:405-420.
Tishby N., F. Pereira, and W. Bialek, (1999), The information bottleneck method, Proceedings of the 37th Allerton Conference on Communication, Control, and Computing.
Tononi G. and G. Edelman, (1998), Consciousness and complexity, Science, 282:1846-1851.
Torrey E. and R. Yolken, (2001), The schizophrenia-rheumatoid arthritis connection: infectious, immune, or both? Brain, Behavior, and Immunity, 15:401-410.
Toth G., C. Lent, P. Tougaw, Y. Brazhnik, W. Weng, W. Porod, R. Liu, and Y. Huang, (1996), Quantum cellular neural networks, Superlattices and Microstructures, 20:473-478.
Wallace R., M. Fullilove, and A. Flisher, (1996), AIDS, violence and behavioral coding: information theory, risk behavior, and dynamic process on core-group sociogeographic networks, Social Science and Medicine, 43:339-352.
Wallace D. and R. Wallace, (2000), Life and death in Upper Manhattan and the Bronx: toward an evolutionary perspective on catastrophic social change, Environment and Planning A, 32:1245-1266.
Wallace R., (2000), Language and coherent neural amplification in hierarchical systems: Renormalization and the dual information source of a generalized spatiotemporal stochastic resonance, International Journal of Bifurcation and Chaos, 10:493-502.
Wallace R., (2002a), Immune cognition and vaccine strategy: pathogenic challenge and ecological resilience, Open Systems and Information Dynamics, 9:51-83.
Wallace R., (2002b), Adaptation, punctuation and rate distortion: noncognitive 'learning plateaus' in evolutionary process, Acta Biotheoretica, 50:101-116.
Wallace R., (2003), Systemic Lupus erythematosus in African-American women: cognitive physiological modules, autoimmune disease, and structured psychosocial stress, Advances in Complex Systems, 6:599-629.
Wallace R., (2004), Comorbidity and anticomorbidity: autocognitive developmental disorders of structured psychosocial stress, Acta Biotheoretica, 52:71-93.
Wallace R. and D. Wallace, (2004), Structured psychosocial stress and therapeutic failure, Journal of Biological Systems, 12:335-369.
Wallace R. and R.G. Wallace, (1998), Information theory, scaling laws and the thermodynamics of evolution, Journal of Theoretical Biology, 192:545-559.
Wallace R. and R.G. Wallace, (1999), Organisms, organizations and interactions: an information theory approach to biocultural evolution, BioSystems, 51:101-119.
Wallace R. and R.G. Wallace, (2002), Immune cognition and vaccine strategy: beyond genomics, Microbes and Infection, 4:521-527.
Wallace R., R.G. Wallace and D. Wallace, (2003), Toward cultural oncology: the evolutionary information dynamics of cancer, Open Systems and Information Dynamics, 10:159-181.
Wallace R., D. Wallace and R.G. Wallace, (2004), Biological limits to reduction in rates of coronary heart disease: a punctuated equilibrium approach to immune cognition, chronic inflammation, and pathogenic social hierarchy, Journal of the National Medical Association, 96:609-619.
Weinstein A., (1996), Groupoids: unifying internal and external symmetry, Notices of the American Mathematical Society, 43:744-752.
Welchman A. and J. Harris, (2003), Is neural filling-in necessary to explain the perceptual completion of motion and depth information? Proceedings of the Royal Society of London B, Biological Sciences, 270:83-90.
Wilson K., (1971), Renormalization group and critical phenomena. I. Renormalization group and the Kadanoff scaling picture, Physical Review B, 4:3174-3183.
Wright R., M. Rodriguez, and S. Cohen, (1998), Review of psychosocial stress and asthma, Thorax, 53:1066-1074.
Zur D. and S. Ullmann, (2003), Filling-in of retinal scotomas, Vision Research, 43:971-982.
Appendix A Coarse-Graining
We use a simplistic mathematical picture of an elementary predator/prey ecosystem for illustration. Let X represent the appropriately scaled number of predators, Y the scaled number of prey, t the time, and ω a parameter defining the interaction of predator and prey. The model assumes that the system's 'keystone' ecological process is direct interaction between predator and prey, so that
dX/dt = ωY

dY/dt = -ωX.

Thus the predator population grows proportionately to the prey population, and the prey population declines proportionately to the predator population. After differentiating the first and using the second equation, we obtain the differential equation

d^2X/dt^2 + ω^2 X = 0,
having the solution
X(t) = sin(ωt),   Y(t) = cos(ωt),

with
X(t)^2 + Y(t)^2 = sin^2(ωt) + cos^2(ωt) = 1.

Thus in the two-dimensional 'phase space' defined by X(t) and Y(t), the system traces out an endless, circular trajectory in time, representing the out-of-phase sinusoidal oscillations of the predator and prey populations.
Divide the X-Y 'phase space' into two components - the simplest 'coarse graining' - calling the half-plane to the left of the vertical Y-axis A and that to the right B. This system, over units of the period 1/(2πω), traces out a stream of A's and B's having a very precise 'grammar' and 'syntax': ABABABAB... Many other such 'statements' might be conceivable, for example, AAAAA..., BBBBB..., AAABAAAB..., ABAABAAAB..., and so on, but, of the obviously infinite number of possibilities, only one is actually observed, i.e. is 'grammatical': ABABABAB.... Note that finer coarse-grainings are possible within a system, for example dividing phase space in this simple model into quadrants, producing a single 'grammatical' statement of the form ABCDABCDABCD.... The obvious, and difficult, question is which coarse-graining will capture the essential behaviors of interest without too much distracting high-frequency 'noise'. More complex dynamical system models, incorporating diffusional drift around deterministic solutions, or even very elaborate systems of complicated stochastic differential equations, having various 'domains of attraction', i.e. different sets of grammars, can be described by analogous 'symbolic dynamics' (e.g. Beck and Schlögl, 1993, Ch. 3).
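A minimal sketch of this coarse-graining (the sampling choices are illustrative, not from the text): simulate X(t) = sin(ωt), sample it twice per cycle, and emit 'A' when the state lies to the left of the Y-axis (X < 0) and 'B' otherwise.

```python
import numpy as np

omega = 2.0 * np.pi            # one full cycle per unit time (hypothetical choice)
t = np.arange(0, 5, 0.5)       # sample twice per cycle, once in each half-plane
X = np.sin(omega * t + 0.1)    # small phase offset keeps samples away from the boundary X = 0
symbols = ''.join('A' if x < 0 else 'B' for x in X)
print(symbols)                 # the single 'grammatical' statement: alternating A's and B's
```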