Information Processing and Algorithmic Complexity
Pauli Salo
Cognitive Science
Department of Computer Science
University of Jyväskylä
Computational
models of human cognition are often discussed in terms of their relatively
broad paradigms. For example, there exists much literature concerning the pros
and cons of connectionist versus classical symbolic models. On the other hand,
there is some evidence that even more abstract properties might be relevant, as
there are deficits, such as the problem of ‘combinatorial explosion,’ that
trouble both types of models, whether
connectionist or symbolic. It is argued in this paper that one such abstract,
yet relevant parameter is the ‘absolute information content’ of the model,
measured in terms of Kolmogorov complexity. This might be relevant since,
explicitly or implicitly, most computational paradigms are designed around the
idea of producing complex-looking structures from innovative simple principles,
such as recursion (classical models) or homogeneous networks (connectionism),
implying that they are extremely simple in terms of their Kolmogorov
complexity. It is argued that this strategy might not succeed. If these
conclusions are true, then the human mind is truly complex, lacking one crucial
feature on which many cognitive scientists have pinned all their hopes, namely,
that the complexity somehow ‘emerges’ from the right set of simple principles.
Keywords:
computational modelling, Kolmogorov complexity, connectionism,
computationalism, empiricism, reduction, emergence
Computational models of human
cognition are often discussed in terms of their relatively broad paradigms. For
example, there exists much literature concerning the pros and cons of
connectionist versus classical symbolic models. Although the differences
between such paradigms are already rather abstract, it is precisely by virtue
of this abstractness that one can capture in a most effective way their
differences, limitations and capacities. On the other hand, there is some evidence
that even more abstract properties might be relevant, as there are deficits
that haunt both types of models,
whether connectionist or symbolic. To mention one especially pertinent problem,
both kinds of models tend to be either (1) applicable only to a very narrow problem
domain or (2) vastly complex computationally. Human cognitive processes,
in contrast, suffer from neither in many domains. These problems are
certainly related to each other in a way that can be best described as the scaling problem: whenever a larger
problem domain is involved, computational models become too complex
computationally regardless of the more specific paradigm chosen.
The pressing question is of
course to determine how to scale the models and solve the scaling problem. How
we attempt to tackle this problem depends, in turn, on where we presuppose the
problem to be. One may assume that we just miss the correct heuristic
algorithm, the correct set of parameters, or the correct network topology. When the
correct set of explicit axioms is in place, scalable cognitive processes
emerge. For example, using the correct learning algorithm with the correct
network topology would avoid the computational complexity of learning.
Another choice is to argue that
the limitations are not in the particular algorithms, but in some more abstract
level that concerns more or less all computational models. This stance is
characteristic of Dreyfus (1974) and Fodor (2000), for instance. I will argue
in this paper that the latter stance is indeed correct regarding the scaling
problem, and that the difficulties have to do with rather abstract properties
of all computational models. More particularly, I will argue that it is the absolute information content of those
models, not the specific designs or paradigms, that is insufficient and thereby
causes the scaling problem to occur.
In section 2, I will first show
how the rejection of the behavioristic stimulus-response psychology involved
arguments suggesting that there is substantial
irregularity involved in mental processing. In section 3, I will show how
to translate and generalize these observations into a more exact language with
the help of Kolmogorov complexity. In
section 4, I consider the consequences from the perspective of “information
processing.” Namely, if the stimulus states are truly random
or irregular in the sense of Kolmogorov complexity, then there
cannot, by the definition of “irregular,” exist any simple algorithmic solution
for the mapping of those states (say, acoustic features) into the corresponding
mental states (say, phonemes). On the other hand, it is easily seen that all
computational models currently explored, whether symbolic, connectionist or
dynamic, assume exactly the contrary: the information content of the model is
very low, despite the misleading appearance. For instance, although a
typical neural network model appears to be a ‘complex’ web of nodes, this is
very misleading: its true information
content is very low, since all of the nodes are based on the same extremely
simple code describing the prototype of formal neuron. Similarly, in a typical
symbolic model, one uses recursion:
the same code over and over again with ingenious use of empty memory.
I also argue that this might
provide an explanation for the fact that such models are computationally
complex, and that the problem domains in which humans are effectively engaged
require much more ‘brute information’ than is implied by current
computational models. Finally, it seems that it is not just the psychophysical
functions which are complex, but also many higher-level processes such as
thinking: the case of chess expertise
is considered as an example.
In the
1950s, the collapse of the empiricist and behaviorist research programme paved
the way to the new mentalist inquiries. I will discuss some of the relevant
history of this process, at the risk of repeating what is generally assumed
and basically well known. This provides one of the best-documented routes to
the phenomenon of ‘irregularity’ in cognitive processes, while it is
irregularity or even randomness in its various forms and implications that we
will be concerned with throughout the paper.
American structuralists,
influenced by Bloomfield’s positivist conception of (the science of) language,
explored the possibility of capturing the notion of ‘phoneme,’ or any
‘linguistically relevant object,’ by using linguistic stimuli, phonetic
transcriptions, or even physical properties of utterances; in fact, anything that was ‘directly observable,’ or was
thought to be so at that time. That undertaking is nontrivial: there is no
trivial mapping from sounds into the ‘phonological structure of the language.’
For instance, there is no reduction of phonemes into local physical properties
of sound. This is why taxonomists were structuralists as well: a phoneme for a
language L was assumed to depend on the properties of language L as a whole.[1]
As a corollary, it was hoped that
finding and defining linguistic elements from a corpus would be automated in
terms of a mechanical ‘discovery procedure.’ A discovery procedure is a
mechanical method which, given a corpus as input, gives a ‘linguistically
relevant’ taxonomy of elements and their relations for that corpus (or language)
as output. It was reasoned, on a par with the pursuit of the discovery procedure,
that a child, when learning a language, must learn it from the stimulus
available, thus ‘discover its structure.’[2]
The discovery procedure and associated learning mechanism(s) represent an
‘empiricist learning algorithm’ so that, when demonstrated with some success,
it would show how the traditional ‘empiricist principles’ and doctrines could
actually work, at least in this case of human language(s). From the perspective
of computation, a discovery procedure is an algorithm targeted to find
regularities from the linguistic data and, based on such information, to
extrapolate the linguistics rules, principles and primitives in order to gain a
full mastery of the language.
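As a caricature, purely illustrative and not a historically proposed procedure, such an algorithm can be sketched as one that proposes as ‘units’ whatever substrings recur often enough in a corpus; the function name and thresholds below are invented for the example:

```python
from collections import Counter

def discover_units(corpus: str, max_len: int = 4, min_count: int = 3):
    """Toy 'discovery procedure': propose as linguistic 'units' those
    substrings that recur at least min_count times in the corpus.
    A caricature of taxonomic linguistics, not a serious segmenter."""
    counts = Counter()
    for n in range(2, max_len + 1):
        for i in range(len(corpus) - n + 1):
            counts[corpus[i:i + n]] += 1
    return sorted(s for s, c in counts.items() if c >= min_count)

units = discover_units("badabadabada")
```

On this toy corpus the procedure recovers the recurring chunk `bada` (and its sub-chunks), but it has no access to anything beyond surface regularities, which is exactly the limitation the text goes on to discuss.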
However, it was eventually, though
also surprisingly, admitted that this project cannot succeed. In hindsight the
reason is easy to explain: the actual physical or acoustical realization of
‘linguistic elements’ results from numerous factors, so that it becomes
impossible to define them by relying upon their ‘observable properties.’ For
instance, the actual realization of a phoneme can be determined by at least the
following independent ingredients:
1. independent phonological rules
2. rules of stress and intonation
3. rules of morphology and syntax
4. rules of articulatory mechanisms and co-articulatory effects
5. meanings (minimal pairs)
To illustrate, consider such words
as sit-divinity and divine-car.
In the latter pair, the sound [A] corresponds to the letters a and i. Assuming that the orthography is missing something, we conclude
that divine is to be analysed as
being close to divAin and as being
unrelated to divinity except for the
initial sounds, and related to cAr.
Yet these complications miss the fact that the ‘vowel shift’ is predictable
based on general rules, and is not idiosyncratic to divine-divinity (e.g. profane-profanity,
serene-serenity, explain-explanatory,
various-variety, Canada-Canadian).[3]
But assuming more abstract phonemes with independent phonological rules and
other factors (1-5 above), this state of affairs becomes predictable. Linguists
were thus able to collapse the complexity of the theory at the cost of assuming
abstract phonemes and abstract principles
that regulate their realization in actual speech sounds.
Given that the lower-level and
higher-level taxonomies are assumed as independent levels of linguistic
representations, two possibilities remain:
i. we
may stick with reductive definitions, leading us to a situation where the
theory of language use becomes more
and more complex due to the exceptions resulting from the mismatch between the
lower- and higher-level taxonomies and missed generalisations thereof; or
ii. we may gain simplicity by jettisoning
reductive definitions, relying upon abstract, more or less ‘tentative elements’
that relate to actual sound only indirectly, though by general rules.
Suppose we
adopt “Strategy i”. The theory is either poor in its explanatory power, or
leads to severe problems - and in the end also to anomalies - if applied to the
actual use of language. A more likely procedure would be to follow “Strategy
ii.” Pursuing (ii), linguists set up, first, a ‘tentative set of elements’
(i.e. phonemes) and try to capture their relations to actual sounds, and to the
more abstract elements they are constituents of, by general rules and other
postulated interactions entering into the phenomenon we are studying (1-5
earlier).[4]
But then the physical properties ‘associated’ with the mental entities like
phonemes become more and more irregular and random: the cognitive states, as they clearly mediate behaviour somehow, cannot
be reduced to any regularity in the stimulus, as Howard Lasnik put it:
But what is a word? What is
an utterance? These notions are already quite abstract [...] The big leap is
what everyone takes for granted. It’s widely assumed that the big step is going
from sentences to transformation, but this in fact isn’t a significant step.
The big step is going from “noise” [irregularity] to “word.” (Lasnik, 2000, p.
3.)
There are
obscuring random factors that blur the connection between what the subject
hears and what arrives at his or her ears. Yet these problems are entirely
general, spanning through the domain of what we may pre-theoretically speak of
as ‘the mental,’ or which falls naturally under the title of ‘mentalistic
inquiries.’ Julian Hochberg was one of those who tried to find psychophysical
laws to connect percepts with physical invariances in the environment, but he
abandoned the project because
(1) […] we don’t have a
single physical dimension of stimulation by which to define the stimulus
configuration to which a subject is responding, or by which we can decide what
it is that two different patterns, which both produce the same form percepts,
have in common [and] (2) because the subject can perceive very different form
in response to the same physical stimulus, and which form he perceives
critically determines the other answers he gives to questions about shape,
motion, size, depth, and so on. (Hochberg,
1968, pp. 312-313.)
That is,
the science of what is perceived and how the organism comes to behave in
response to such stimuli is not the
science of the physical conditions imposed upon his or her senses. Rather, as
these lower and higher taxonomies of properties cross-classify each other, we
are forced to rely upon tentative theoretical constructs, such as ‘percepts,’
and leave their exact characterization for a science-to-come to explore.
We may even say that physical,
directly observable sensory states are, because they are irregular, more or
less random vis-à-vis the ‘percepts,’
as was suggested by Zenon Pylyshyn: ‘if
we attempt to describe human behaviour in terms of physical properties of the
environment, we soon come to the conclusion that, except for tripping, falling,
sinking, bouncing off walls, and certain autonomic reflexes, human behaviour is
essentially random’ (Pylyshyn, 1984, p. 12). According to Fodor, it is because
required physical invariants are absent in the impoverished sensory organs that
“perception cannot, in general, be thought of as the categorization of physical
invariants, however abstractly such invariance may be described” (Fodor, 1975,
p. 48, footnote 15). The taxonomies based on sensory stimulations are neither
necessary nor sufficient to allow useful generalizations about behavior
and actual language use, though here we must remember that the speaker/hearer
must still be able to somehow correlate ‘random noise’ with
linguistic/mentalistic categories - the notorious problem to which I return
presently.
In this
section I want to consider the phenomenon of ‘randomness’ and its relation to
reduction more closely. My reasons for wanting to do this will become clear
later; suffice it at this point to note that the issue of randomness and
complexity is directly relevant to the issue of computation and, in a similar vein, to issues of ‘information
processing.’ Furthermore, the issue of ‘complexity’ has certainly played a role
in shaping the discussion of the very nature of social scientific laws (e.g.
Clark, 1998; McIntyre, 1993; Sambrook & Whiten, 1997; Rescher, 1988), so
any conceptual clarification on the matter could be potentially helpful.
Suppose that we have two kinds of
information readily available: pairs consisting of physical sensory properties
(Q) and the corresponding ‘perceptual classification’ of that physical
stimulus-in-a-context (P). If the above observations (§2) are anywhere near the
truth, it must be the case that these pairs do not possess significant
‘regularity.’ For example, an arbitrary collection of noises can correspond to
the perception of phoneme /m/. Borrowing some philosophical jargon, the
higher-level property p1 corresponds to a ‘disjunctive property’ q1 ∨ q2 ∨ . . . (see Fodor,
1974). Looking at the matter from a slightly different
perspective, suppose we fix p1 and collect various sensory stimuli
that appear to be paired with p1. We expect them to contain no
physical invariance(s). The high-level properties are not ‘projectible’ to the
lower-level ones, or they do not have a ‘model’ at the lower level (in the
sense of being logically constructible at the lower level).
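The ‘disjunctive’ character of such a set can be made vivid with a small sketch. If the stimuli paired with a percept share no computable invariance (modelled here, as an assumption, by random byte strings), the shortest description of the set is essentially the explicit list itself, as a general-purpose compressor confirms:

```python
import os
import zlib

# Hypothetical 'disjunctive' extension of a percept p1: stimuli q_i that
# share no computable invariance, modelled as random byte strings. The only
# available description of the set is (essentially) the set listed as such.
stimuli = [os.urandom(64) for _ in range(512)]

listed = b"".join(stimuli)
compressed = zlib.compress(listed, 9)

# No real compression is possible: each member must be stated explicitly.
assert len(compressed) > 0.95 * len(listed)
```

Had the stimuli shared an invariance (say, a common prefix or a generating rule), the compressor would have exploited it and the description would shrink; here it cannot.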
Let us consider more carefully the
notion of ‘randomness’ or ‘lack of invariances’ we are supposed to be
observing. What does it mean if some
object lacks regularity or involves ‘disjunctive properties’? There exists a
long tradition of hunting down the correct properties of this notion, omitted
here (but see Li & Vitányi, 1997). The most satisfactory notion of
randomness, and most important for the present purposes, can be formulated with
the help of Kolmogorov complexity.
Kolmogorov complexity measures the ‘absolute information content’ of a finite
object so that a ‘complex’ or ‘random’ object is one which is its own shortest description and thus cannot be
‘compressed’ based on any regularity.
Suppose that in describing
strings of 0s and 1s, or whatever alphabet we choose to use, we
use computers and their programs. Computers plus their programming languages,
henceforth called ‘specification methods,’ are understood as recursive
functions f1, . . . from
programs p into outputs x. Let | · | measure, in bits, the length of those programs. Fixing f, we may follow the above idea and
define the complexity of x, Cf(x), as follows:
Cf(x) = min{ | p | : f(p) = x }.
Thus, we
search for the minimal program that can produce x. Intuitively, if p is
much shorter than x, we say that
there is a compression of x in terms of p. It is hoped that this measure will screen out the regularities
in x and, if there is no such
compression, we may call x ‘random’
or ‘irregular’. The idea is intuitively appealing, since the existence of a
much shorter computer program for producing x
captures the rules and regularities in x
that ‘random’ entities, perhaps by definition, must lack entirely. Randomness,
in our sense, is then a property of data,
not of process (of ‘chance’). Thus, a string such as ‘01’ repeated a million times is not
‘random,’ since there exists a very short computer program that prints it,
namely one that says, in effect: ‘print 01 one million times.’
This
program is a ‘compression’ of the string, capturing its regularities. The same
is true of the decimals of p,
which are not random by the proposed test: they can be computed by using small
computer program (although p
is random by many other tests).
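Since Kolmogorov complexity itself is uncomputable, any general-purpose compressor gives only an upper bound on it, but that bound already separates patterned from random strings. The sketch below uses Python’s zlib purely as a stand-in for one fixed specification method; the function name is invented for the example:

```python
import os
import zlib

def c_approx(x: bytes) -> int:
    """Crude upper bound on the Kolmogorov complexity of x: the length in
    bytes of a zlib-compressed description (the fixed decompressor plays
    the role of the specification method f)."""
    return len(zlib.compress(x, 9))

regular = b"01" * 50_000        # highly patterned: K-simple
random_ = os.urandom(100_000)   # incompressible with overwhelming probability

assert c_approx(regular) < 1_000     # 100,000 bytes shrink to a tiny program
assert c_approx(random_) > 99_000    # barely compresses at all
```

The patterned string compresses to a small fraction of its length, mirroring the short ‘print 01 repeatedly’ program, while the random string is, to within the compressor’s power, its own shortest description.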
Returning to the formal properties
of this notion: in shaping the notion of ‘randomness’ along these lines,
it appears that there might be other specification methods, computers and
computer languages, that can compress x even
further. We do not want randomness to be relative to a particular specification
method, yet it is meaningful to ask if the above string would be more random
with respect to LISP than FORTRAN. In other words, is there a minimal element
for C(x), free of the choice of the
specification method?
It is easy to develop a measure
that takes these choices into account. Suppose that we have specification
methods f1, . . . fn available, one being a LISP interpreter, another a FORTRAN
interpreter, and so on. We may choose which programming language to prefer.
Then we can develop a new specification method, g, that makes the choice and computes string x. The input of g
consists of information concerning method f,
clearly a constant, c (say, log n), plus the program, p. We then obtain a new specification
method with the property that for any x
and f, Cg(x) ≤ Cf(x) + c.
That is, the complexities method g
outputs for any string exceed those given by any of the fs by at most a constant c, an
asymptotically vanishing price to pay for the fact that we can use several
specification methods in seeking regularities in the data. Seen from a more
intuitive point of view, we establish a ‘compromise’ between finding the
absolute minimal specification method plus the specification of that method
itself, seeking the alternative that is the least expensive with respect to
this measure. No other method in our initial stock can improve this result
infinitely often.
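The construction of g can be mimicked, under obvious simplifications, with a stock of real compressors: pick whichever compresses x best and pay a constant-sized tag for naming the choice. The helper `c_g` below is a hypothetical illustration of the construction, not part of the formal theory:

```python
import bz2
import lzma
import zlib
from math import ceil, log2

# A stock of 'specification methods' f_1 .. f_n, here real compressors.
METHODS = [zlib.compress, bz2.compress, lzma.compress]

def c_g(x: bytes) -> int:
    """Complexity under the combined method g: the best available
    compressor for x, plus a constant-size tag (ceil(log2 n) bits,
    rounded up to whole bytes) that names which method was chosen.
    The tag is the constant c of the text: it does not depend on x."""
    tag_bytes = ceil(log2(len(METHODS))) // 8 + 1
    return min(len(m(x)) for m in METHODS) + tag_bytes
```

By construction, for every x and every single method f in the stock, c_g(x) ≤ len(f(x)) + c, which is exactly the inequality Cg(x) ≤ Cf(x) + c of the text.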
We may go further and imagine a
situation in which any computational
specification technique, that is, any possible (imaginable, constructible)
computing language, is used. There are an infinite number of such techniques,
as there is an infinite number of recursive functions. But suppose that,
despite this obstacle, we choose to develop a specification method, comparable
to g above, that is capable of making
the choice between them. For this purpose we can use the universal Turing machine f0 (i.e. a function f0 such that for any n and p, f0(n, p)
= fn(p)). Thus, with the help of this new powerful specification method,
we can define a new notion of complexity,
C0(x) = min{| p |: f0(n, p) = x},
obtaining,
analogously to the previous case,
C0(x) ≤ Cn(x) + cn,
where cn can be set to 2| n | + 1, depending only upon fn.
This extra cost is due to the coding of the specification method, depending
only on the method chosen. This is the invariance
theorem of Kolmogorov complexity, its mathematical cornerstone, established
independently during the 1960s by R. Solomonoff (1964), A. N. Kolmogorov (1965), and G. Chaitin (1969). This theorem establishes that our
intuitive idea of measuring complexity in terms of shortest descriptions has
just the desirable mathematical properties we expect it to have: complexity is
invariant with respect to the specification method used, as long as the
specification method is computable.
I will not discuss further here
the interesting mathematical properties of this measure. Suffice it to say that
it is closely related, in an interesting way, to many other mathematical
notions, such as Shannon’s information theory, proof-theory, computability
theory, machine learning, probability theory, and, indeed, many others (see Li
& Vitányi, 1997). Moreover, it is also related to some fundamental physical
measures and hence subject to explanations in physics, such as entropy and
information (see Zurek, 1990). Some authors have also claimed that it is
relevant to human cognition (more below). But in any case, I postulate that,
concerning the notion of randomness, Kolmogorov complexity is “about as
fundamental and objective as any criterion can be” (Rissanen, 1990, p. 118).
Let us return to cognitive science
and the case of unsuccessful reduction and associated, abandoned empiricism
discussed earlier (§2). What is established, if these empirical arguments are
true, seems to be related to the complexity-property captured by K(olmogorov)-complexity
above, or to some relevant variant of it. It was argued that there are no
invariances in the set of stimuli correlating with, or defining, a given
mentalistic state. Assume that the notion of ‘invariance’ is restricted to
computable invariances (which is entirely reasonable); then the lack of such
invariances can be identified by the fact that there is no computer program, in
the sense given above, which can compress the set of stimuli sampled together
with the ‘mental state,’ or whatever the abstract property we target in our
explanations is. For this reason the set can be said to be necessarily
‘disjunctive’: each member must be stated as such, in the absence of any common
(defining) invariance. That is, Kolmogorov complexity seems to capture exactly
an important aspect of what happened when the stimulus-response psychology
collapsed: certain kinds of complexity or irregularity prevented that program
from succeeding.
Consider this assumption from the
perspective of ‘information processing.’ We assumed that the stimuli are
characterised by their K-complexity vis-à-vis the ‘mental state’ they are
‘correlated’ with, the notion of ‘mental state’ taken still in a
pre-theoretical sense but understood as being, for example, a perceptual
hypothesis concerning the world beyond the sensory organs. The relevant set of
stimuli being K-complex means, by definition, that a simple algorithm to make the required computations and
correlations does not exist.
It might even be that ‘information
processing’ is itself a somewhat unfortunate term here, connoting the idea,
explicitly argued by the behaviorists and Gibson, and implicitly by many
others, that the sensory channels could arrive at rational hypotheses by
literally manipulating the information present in the senses, whereas a truer
picture should take the sensory channels to be essentially sources of
disturbing noise that the system must somehow deal with in inferring what lies beyond one’s eyes or ears. These two
perspectives on the functioning of the senses might imply, though they are
metaphorical, radically different strategies for a more rigorous empirical
research.
Simplifying, this conclusion is
inescapable in the following conceptual sense. Defining the autonomy of the
mental as a lack of sensory invariance amounts to defining it as
a lack of computational invariance that could be found in the stimulus
properties; the latter is just a more formal statement of what is said
empirically in the former. But then it all comes down to whether there is an algorithmic solution for
‘perceptual classification’: if there is, we can retreat to
behaviorism; and since behaviorism has failed, no such algorithm can exist. In other words, on a certain
abstract but arguably relevant level of analysis, all computationalists are
still behaviorists: it is their simple program code that describes a putative
connection between a simple stimulus
regularity and a response.
Skinner would have called this an S-R law (Skinner, 1953, pp. 34-35).
Some authors have explicitly
denied the present thesis, namely, that the human mind is, in an important
sense, confronted with complexity, noise, or irregularity, by proposing that
one of the fundamental cognitive principles would be to search for simplicity in terms of K-complexity
(Chater, 1996, 1999). Chater argues that the world around us is ‘immensely
complex’ yet, oddly enough, ‘highly patterned,’ whereas the ability to ‘find
such patterns in the world is [...] of central importance throughout cognition’
(Chater, 1999, p. 273). Such ‘patterns are found by following a fundamental
principle: Choose the pattern that provides the simplest explanation of the available data.’ Chater then captures
the relevant notion of simplicity in terms of K-complexity, arriving at
virtually the opposite thesis proposed here. But the matter is naturally
empirical. We have seen some evidence from the properties of language and the
rejection of behaviorism (§2) that a cognitive agent could not succeed in
learning a language by following the principle(s) proposed by Chater, and that
the matter could be rather the opposite. It is in fact the standard argument
against empiricism and behaviorism to show that, were a child to choose the
simplest hypothesis, he or she would fail in a way that is against the
empirical facts (see Crain & Lillo-Martin, 1999).
On the other hand, those who have
assumed the autonomy thesis introduced in section 2, together with the
associated rejection of empiricism, have of course been referring to
‘complexity’ in capturing the empirical consequences of the issue. It was
suggested against behaviorism much earlier that ‘we may now draw the
conclusion that the causation of behavior is immensely more complex than was assumed in the generalizations of
the past,’ since a ‘number of internal and external factors act upon complex central nervous structures’
(Tinbergen, 1951, p. 74, my emphasis). Furthermore, ‘the program [of
behaviorism] seems increasingly hopeless as empirical research reveals how complex the mental structures of
organisms, and the interactions of such structures, really are’ (Fodor, 1975,
p. 52, my emphasis). In the present
case I argue that the ‘complexity’ is K-complexity, thus meaning complexity in
an information theoretical sense.[5]
Finally, consider the important
insight concerning complexity in social sciences provided by McIntyre (1993)
that ‘one cannot say that at any level
the phenomena are too complex for us to find laws, for complexity just is a function of the level of
inquiry that we are using.’ The claim is that whether a rational inquiry is
blocked by complexity depends upon the chosen level of description. I am not
denying that claim. K-complexity is
applied to a putative level of description: the behaviorist and taxonomist
descriptions on the one hand, the mentalist descriptions on the other.
Complexity is a consequence of the wrong
choice of level; how one particular choice is made, and how its formal
structure is teased out, is another matter. I also agree with McIntyre that complexity does not prevent one from
finding simplicity, given that the level of description is correct: modern mentalism in linguistics seems to support
exactly this claim.
If what I
have argued so far is true, then there are some consequences worth a
closer look. Most computational models entertained in cognitive
science, whether connectionist models, dynamical models, algorithms for pattern
recognition, or others, seem to be ‘K-simple algorithms’ that try to
induce (or ‘adapt to’ or ‘self-organize to’) invariances in their ‘sample
space.’ Consider, for example, a typical associationist model. The model
consists of a number of nodes, connected to each other via connections that
allow information to ‘flow’ in the network. This flow is controlled by specific
weights, attached to the connections, plus various properties of the nodes
themselves. I spare the reader the details, since they are widely known. A typical
network nevertheless looks as follows:
Figure 1. A typical connectionist model
Although
the entire network looks like a complex entity, it is extremely simple in terms
of its K-complexity, hiding its simplicity (and triviality) behind its complex
but misleading appearance. The information that spans the whole network
consists of two parts: the extremely simple code specifying the computations in
the formal neuron (each node), and the instructions to copy this node
everywhere and to set up the initial pseudo-random (that is, K-simple)
connections between the nodes. In fact, this seems to be the very idea, and
presumably also one part of the attraction of such models (and maybe empiricist
models in general): hooking up a complex-looking system with the lowest
possible amount of information. Yet if mentality, understood here as opposed to
stimulus-response behaviorism, consists, in part, of high information content,
or disparity between the ‘data’ and ‘perceptual hypotheses,’ these models are
indeed not to be expected to do, or explain, much. Even if they look like
complex systems, they capture only simple invariances.
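The point can be made concrete with a deliberately minimal caricature of such a model: an arbitrarily large ‘random-looking’ network is fully determined by a few lines of code plus a PRNG seed, so its K-complexity stays tiny however many nodes it has. All names below are invented for the sketch:

```python
import random

def build_network(n_nodes: int, seed: int = 42):
    """Generate an entire 'complex-looking' network from a tiny
    description: one node prototype (a thresholded weighted sum) copied
    everywhere, plus a PRNG seed fixing the pseudo-random, hence
    K-simple, initial connection weights."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_nodes)]
               for _ in range(n_nodes)]

    def node(inputs, w):
        # The formal-neuron prototype: fire iff the weighted sum is positive.
        return 1 if sum(i * wi for i, wi in zip(inputs, w)) > 0 else 0

    return weights, node

# n_nodes**2 weights, yet the complete description is this short program
# plus one integer seed: the web of numbers carries almost no information.
weights, node = build_network(100)
```

The 10,000 weights look like a rich structure, but since the same seed regenerates them exactly, the network’s true information content is only the length of this program plus the seed.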
This is not intended as a
critique of connectionism specifically. What are called ‘symbolic models’ or
‘Classical models,’ at least as regards those involved in modelling human
performance, as opposed to competence (see below), do not often fare better in
this respect, since they just embody other means to save information in order
to attain K-simplicity. Recursion is one favorite and ingenious procedure. By
using recursion with memory for bookkeeping one can do a lot by writing a
simple code which, in an admittedly ingenious manner, is able to ‘use itself’
to define further computations and properties.
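A minimal illustration of this information-saving strategy, using the dragon-curve recurrence merely as a convenient example, is a few lines of recursive code whose output grows exponentially while the code itself stays fixed:

```python
def dragon(n: int) -> str:
    """A few lines of recursive code 'using itself' to produce an
    exponentially large structure: the sequence of left/right turns of
    the dragon curve, of length 2**n - 1 after n unfoldings."""
    if n == 0:
        return ""
    prev = dragon(n - 1)
    # Standard recurrence: previous turns, a left turn, then the
    # reversed complement of the previous turns.
    return prev + "L" + "".join("L" if c == "R" else "R"
                                for c in reversed(prev))

assert len(dragon(10)) == 2 ** 10 - 1  # ~1000 turns from 8 lines of code
```

The 1,023-symbol output is K-simple for exactly the reason the text gives: the short program, calling itself with a little bookkeeping memory, is a complete description of it.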
It is also possible, in principle,
to provide the required information for a model in each paradigm without
rejecting the whole approach. For instance, one could code information to the
weights between the nodes in a connectionist network, or to the activation
values (if they are real numbers), or use a large databank in the case of a
symbolic model. But that never happens, the present point being that it should
happen.
For, interestingly, both kinds of models,
‘symbolic’ and ‘connectionist,’ are usually hopeless from a psychological point
of view, a fact that has been known for a long time, but which has not, I believe,
received enough attention in print (but see, for example, Fodor, 2000). These models are always computationally complex.
Being computationally complex means ‘computationally deep’: to check
whether some property or invariance holds for an item in the data set, one must
compute a lot, often exhaustively in the worst case. Typically the amount of
computation grows exponentially with the size of the data,
soon rendering the actual process hopeless, as Dennett put it:
As we ascend the scale of
complexity from simple thermostat, through sophisticated robot, to human being,
we discover that our efforts to design systems with the requisite behaviour
increasingly run foul of the problem of combinatorial
explosion. Increasing some parameter by, say, ten percent – ten percent more inputs or more degrees of freedom in the
behaviour to be controlled or more words to be recognized or whatever – tends to increase the internal complexity of the system
being designed by orders of magnitude. Things get out of hand very fast and, for
instance, can lead to computer programs that will swamp the largest, fastest
machines. Now somehow the brain has solved the problem of combinatorial
explosion (Dennett, 1997, p. 77).
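Dennett’s point admits a back-of-the-envelope computation (the branching factors below are chosen purely for illustration, not as empirical estimates for any real system):

```python
def tree_size(b, d):
    # Number of nodes in a complete search tree with branching factor b
    # and depth d: 1 + b + b**2 + ... + b**d.
    return sum(b ** k for k in range(d + 1))

# Raise the branching factor by ten percent (30 -> 33 alternatives per
# step) at a fixed depth of six:
small, large = tree_size(30, 6), tree_size(33, 6)
print(small, large, round(large / small, 2))  # the work nearly doubles
```

A ten percent change in one parameter multiplies the exhaustive search by a constant factor per level, hence exponentially over depth: the ‘combinatorial explosion’ of the quotation.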
It is an interesting hypothesis
that the drastic failure of
computational models in terms of their computational complexity and the scaling
problem is actually a consequence of their underlying empiricist philosophy of
having complex-looking models ‘emerge’ out of only very few bits of
information. Consider the situation these models are required to handle, as it
is seen through K-complexity: a simple model is required to do something that,
by its nature, requires a lot of information and irreducible complexity. What
are the possible consequences? Such models are predestined to fail, perhaps
precisely in that the lack of informational complexity is traded for an overflow
of computational complexity.[6]
Let us consider these arguments
from the perspective of a well-defined problem domain, that of chess. Chess is a computationally complex game (Stockmeyer
& Chandra, 1979): computing the ‘next good’ move consumes
considerable amounts of computational resources. Suppose that we take the
stimuli to consist of pairs ⟨q, p⟩ of board positions in a chess
game. We take the response to consist of a judgement of whether p would
represent a good and possible move with respect to the initial position q. Now,
let us suppose that this computational ability is studied by Martian scientists
who have no knowledge of chess. All they have at their disposal is a sample of
good moves, and hence a sample of the extension of the concept ‘good move.’
They ask whether there exists a simple invariance, formulated by relying upon
the stimulus pairs ⟨q, p⟩ᵢ, that defines the extension of the concept ‘good move.’ If
they find an invariance, it is possible to reason that it is this property to which the players are
‘responding.’ If the Martians are clever, they can indeed find an extremely
simple (in terms of K-complexity) invariance behind all the moves, namely, the
one that performs a more or less exhaustive search within the limits of the
rules of chess, looking either to win, or to find ‘good positions’ (comparable
to any chess program with at least moderate skill). The invariance in question
is, although very parsimonious, ‘computationally deep,’ requiring computational
resources in order to be ‘defined’ over the sample of good moves; or to borrow
a phrase from Luc Longpré, ‘the non-randomness is so intricately buried that
any space or time bounded process would fail to detect it’ (Longpré, 1992, p.
69).
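The ‘parsimonious but computationally deep’ invariance the Martians would find is, in essence, exhaustive minimax search. A minimal sketch follows; the game interface and the toy take-1-or-2 game are my own illustrations, not a chess engine:

```python
def minimax(state, game, maximizing=True):
    # The short, K-simple definition of a position's value: search the
    # entire game tree. Parsimonious to state, exponential to evaluate.
    if game.is_over(state):
        return game.score(state)
    values = [minimax(game.result(state, m), game, not maximizing)
              for m in game.moves(state)]
    return max(values) if maximizing else min(values)

def good_move(state, game):
    # A 'good move' is one whose subtree has the best minimax value.
    return max(game.moves(state),
               key=lambda m: minimax(game.result(state, m), game, False))

class TakeAway:
    # Toy game standing in for chess: a pile of stones, remove 1 or 2,
    # the player who removes the last stone wins.
    # State is (pile, maximizer_to_move).
    def is_over(self, s):
        return s[0] == 0
    def score(self, s):
        # The previous mover took the last stone and won.
        return -1 if s[1] else 1
    def moves(self, s):
        return [1, 2] if s[0] >= 2 else [1]
    def result(self, s, m):
        return (s[0] - m, not s[1])

print(good_move((4, True), TakeAway()))  # take 1, leaving a losing pile of 3
```

The definition of `good_move` is only a few lines, yet evaluating it visits the whole game tree: the invariance is simple to state and deep to compute, just as the text claims.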
Having found such a deep
invariance, the Martians would wonder how the human mind could respond so rapidly,
given that the property it is responding to is computationally deep. In fact,
they would find that humans entertain only a few hypotheses. Furthermore, it would soon become apparent that,
when given certain ‘untypical’ chess boards, the concept of ‘good move’
entertained by human subjects would not correspond to the one defined by the
exhaustive search methods. (As is well known,
human performance would collapse in such conditions, while the computational
method could maintain its level of play easily.)
From the perspective of Kolmogorov
complexity: when the number of hypotheses (computations) that may be
considered in the game is made realistic from a psychological point of view
(say, restricted to 50, as opposed to the hundreds of millions computed by
computers), the set of ‘good moves’ will, with near certainty, assuming only
that the nontrivial character of chess is maintained in the instance we
are actually playing, look more and more random (see Li & Vitányi, 1990, §7). In other words, since the invariance is
computationally deep, players cannot detect it if they do not compute; and
since they do not compute, they must be responding to what would appear to be more or
less ‘noise,’ that is, a good deal of randomness in terms of
K-complexity. It follows that computationally shallow chess cannot be played by
using ‘computational tricks.’ That is, there are no ingenious ‘heuristic
models’ to explain how the players
select the relevant move by considering just a few alternatives – and no
scientific interest whatsoever in finding one, contrary to the prevailing view
among those partaking in computational theorising (cf. Newell, Shaw, & Simon, 1958, Newell
& Simon, 1963, 1972). Players are responding to mere noise insofar
as their computations are shallow, at most mimicking a computationally deep
invariance on a selected array of positions, but not in others. Indeed,
empirical evidence for this claim can be found by considering the construction
of actual chess-playing programs, which, when they succeed in beating human
players, either compute an enormous number of positions, require an extensive
database of brute information, or indeed both.[7]
Summarizing, it has seemed impossible to implement ‘rational
inference,’ whether perceptual or higher-level, characteristic of human
intelligence, using computational techniques. The problem is that, in fixing
beliefs based on an agent’s epistemological commitments, rationality seems to
require a considerable amount of holism-cum-context-sensitivity: the processes
seem to be sensitive to the whole background of beliefs and desires, whereas,
as is well known, ‘[c]lassical architectures know of no reliable way to
recognize such properties short of exhaustive searches of the background of
epistemic commitments’ (Fodor, 2000, p. 38). That is, rationality is purchased
by lots of information (background knowledge) plus deep computation in
exploring it.
In his recent book, Fodor (2000)
argues that the problem of abductive inference renders the whole computational theory of the mind suspect, at least as far as
the non-modular components of human cognition are concerned. He writes that the
problem originates in the locality of computational processes versus the
context-sensitivity of the corresponding cognitive processes. The ‘local formal
properties’ of a thought – its so-called ‘logical form’ – do not suffice to
determine its causal-cum-inferential role in the system. That is the problem of
‘locality’; in fact, presumably the same problem we observed in the case
of chess, viz. that the property of being a ‘good move’ cannot be defined without
extensive computations on the formal properties of a chess board. We may
say that the property of being a good move is ‘context-sensitive’ because it
depends not on the local properties of the board position, but on the location
of this board position in the space of possible games and their outcomes. As a
consequence, we need to explore this space, the ‘context’ of a single board. As
Fodor correctly notes, this, in general, characterizes practically any routine inference in our normal
environment: what we do infer, and ought to infer, depends not on the ‘logical
form’ of the single thought, but rather on its place in the vast web of our
epistemological commitments as a whole. The notion ‘logical form of thought’ is
thus simply the wrong kind of idealisation in this case.
Empirical
evidence has made the empiricist picture of the mind less and less attractive
over the last 50 years or so; it is being replaced with a more rationalist one that
emphasises the role of innate structure instead of environmentally induced
complexity. I have argued that one important aspect of this difference might
come down to abstract Kolmogorov complexity, which, being based on an exact
mathematical formulation and generalizing immediately, applies to all
computational modelling, hopefully explaining why such models have failed
regardless of the paradigm used, and will fail in the future unless the hidden
empiricist premises are rejected.
If these speculations are not
completely off the mark, then the human mind is truly complex, lacking one
crucial feature on which many cognitive scientists have pinned all their hopes,
namely, that the complexity somehow ‘emerges’ from a right set of simple
principles. How and why physical laws kick off a process that could generate
information and complexity remains to be seen (see Chaitin, 1979, Bennett,
1988a, b, McShea, 1991, Sambrook & Whiten, 1997, Zurek, 1990, among
others). Given that the DNA in our genes does not seem to involve that
much information at all, one wonders whether there is something in nature,
in the ontogeny of a human being, that produces true information, and whether
there is something that can crystallize it for use (Griffiths & Gray, 1994,
Oyama, 2000).
Almost miraculously, there are areas of interest, most of them the modular aspects of cognition, where general and elegant principles have been found within a ‘single level of analysis’ despite the apparent complexity between levels. By my standards of scientific explanation, this is especially true in the case of knowledge (competence), but not in the use of that knowledge (performance). The abstract knowledge of the rules of chess could easily be extracted from observing play; it is the effective use of that knowledge which remains a mystery.
References
Chaitin, G. J. 1969: On the length of programs for
computing finite binary sequences. Journal of the Association for Computing
Machinery, 13, 547-569.
Chater,
N. 1996: Reconciling Simplicity and Likelihood Principles in Perceptual
Organization. Psychological Review, 103(3), 566-581.
Chater,
N. 1999: The Search for Simplicity: A Fundamental Cognitive Principle. The
Quarterly Journal of Experimental Psychology, 52A(2), 273-302.
Chomsky, N. 1981: Reply to Putnam.
In N. Block (Ed.), Readings in Philosophy of Psychology. Cambridge, MA.:
Harvard University Press.
Chomsky, N. 2000: New Horizons in
the Study of Language and Mind. Cambridge: Cambridge University Press.
Chomsky, N., & Halle, M. 1968:
The Sound Pattern of English. New York: Harper & Row.
Churchland, P. 1981: Eliminative
Materialism and Propositional Attitudes. Journal of Philosophy, 78(2), 67-90.
Clark,
A. 1998: Twisted tales: Causal complexity and cognitive scientific explanation.
Minds and Machines, 8(1), 79-99.
Cosmides, L., & Tooby, J. 1997:
The Modular Nature of Human Intelligence. In A. Scheibel & J. W. Schopf
(Eds.), The Origin and Evolution of Intelligence (pp. 71-101). Sudbury, MA:
Jones and Bartlett.
Cowie, F. 1999: What's Within? New
York: Oxford University Press.
Crain, S., & Lillo-Martin, D. 1999: An Introduction to Linguistic Theory and
Language Acquisition. London: Blackwell Publishers.
Dennett, D. C. 1997: True
Believers: The Intentional Strategy and Why It Works. In J. Haugeland (Ed.), Mind Design. Cambridge, MA.: MIT Press.
Fodor, J. A. 1974: Special sciences
(or: the disunity of science as a working hypothesis). Synthese, 28, 97-115.
Fodor, J. A. 1975: The language of
thought. New York: Crowell.
Fodor, J. A. 1987: Modules, Frames,
Fridgeons, Sleeping Dogs, and the Music of the Spheres. In Z. W. Pylyshyn
(Ed.), The Robot's Dilemma (pp. 139-149). New Jersey: Ablex Publishing.
Fodor, J. A. 1989: Why Should the
Mind be Modular? In A. George (Ed.), Reflections on Chomsky (pp. 1-22). Oxford:
Basil Blackwell.
Fodor, J. A. 2000: The mind doesn't work
that way: The scope and limits of computational psychology. Cambridge, MA.:
MIT Press.
Griffiths, P.
E., & Gray, R. D. 1994: Developmental systems and evolutionary explanation.
Journal of Philosophy, 91, 277-304.
Harris, Z. 1957: Methods in
Structural Linguistics. Chicago: Chicago University Press.
Harris, Z. S. 1951: Structural
Linguistics. Chicago: University of Chicago Press.
Harris, Z. S. 1955: From phoneme to
morpheme. Language, 31, 190-222.
Hochberg, J. 1968: In the Mind's
Eye. In R. N. Haber (Ed.), Contemporary Theory and Research in Visual
Perception. New York: Holt, Rinehart & Winston.
Kolmogorov,
A. N. 1965: Three approaches to the
quantitative definition of information. Problems of Information Transmission,
1(1), 1-7.
Li,
M., & Vitányi, P. M. B. 1990: Kolmogorov
Complexity and Its Applications. In J. v. Leeuwen (Ed.), Handbook of
Theoretical Computer Science (vol. A) (pp. 188-254).
Li,
M., & Vitányi, P. 1997: An Introduction
to Kolmogorov Complexity and Its Applications. New York: Springer-Verlag.
Longpré, L. 1992: Resource bounded
Kolmogorov Complexity and Statistical Tests. In O. Watanabe (Ed.), Kolmogorov
Complexity and Computational Complexity (pp. 66-84). Berlin: Springer-Verlag.
McIntyre,
L. C. 1993: Complexity and social scientific laws. Synthese, 97, 209-227.
McShea, D. W.
1991: Complexity and evolution: What everybody knows. Biology and Philosophy,
6, 303-324.
Newell, A., Shaw, J. C., &
Simon, H. 1958: Chess playing programs and the problem of complexity. IBM
Journal of Research and Development, 2, 320-335.
Newell, A., & Simon, H. A. 1963:
GPS: A program that simulates human thought. In E. A. Feigenbaum & J.
Feldman (Eds.), Computers and thought (pp. 279-293). New York: McGraw-Hill.
Newell, A., & Simon, H. A. 1972:
Human problem solving. Englewood Cliffs, NJ.: Prentice-Hall.
Orponen, P., Ko, K.-I., Schöning,
U., & Watanabe, O. 1994: Instance Complexity. Journal of the Association
for Computing Machinery, 41(1), 96-121.
Oyama, S.
2000: The ontogeny of information: Developmental systems and evolution. Durham,
NC: Duke University Press.
Putnam, H. 1971: The 'Innateness
Hypothesis' and Explanatory Models in Linguistics. In J. Searle (Ed.), The
Philosophy of Language. London: Oxford University Press.
Pylyshyn,
Z. W. 1984: Computation and cognition: Toward
a foundation for cognitive science. Cambridge: The MIT Press.
Remez, R. E., Rubin, P. E., Pisoni,
D. B., & Carrell, T. D. 1981: Speech perception without traditional speech
cues. Science, 212, 170-176.
Rescher,
N. 1998: Complexity: A Philosophical Overview. London: Transaction Publishers.
Rissanen,
J. 1990: Complexity of Models. In Zurek (1990), pp. 117-125.
Sambrook,
T., & Whiten, A. 1997: On the Nature of Complexity in Cognitive and
Behavioural Science. Theory & Psychology, 7(2), 191-213.
Skinner, B. F. 1953: Science and Human Behavior. New York:
Macmillan.
Solomonoff, R. J. 1964: A formal
theory of inductive inference, part 1, part 2. Information and Control, 7,
1-22, 224-254.
Stockmeyer, L. J., & Chandra, A.
K. 1979: Provably difficult combinatorial games. SIAM Journal on Computing, 8,
151-174.
Tinbergen, N. 1951: The Study of
Instinct. Toronto: Oxford University Press.
Watanabe, O. 1992: Kolmogorov complexity and computational
complexity. Berlin: Springer-Verlag.
Zurek, W. H. 1990 (Ed.): Complexity, entropy and the physics of
information. Redwood, CA.: Addison-Wesley.
[1] To operationalize this structuralist, reductive and positivist enterprise, linguists have relied upon powerful analytical and statistical notions, such as free variation and complementary distribution (Harris, 1951, 1955, 1957).
[2] The structuralists themselves were not always explicitly engaged in the study of learning, though the connection between the discovery procedure and language learning is obvious.
[3] Consider such pairs as latter / ladder. The difference between these words looks entirely mysterious to the reductionist: there is merely a slight difference in the length of the second vowel.
[4] Putting option
(i) aside, consider these facts from the perspective of the learner, who is obviously pursuing the
full mastery of language use, thus (ii). If s/he applies a ‘general learning
algorithm’ to the physical stimulus available, presupposing no linguistic
categories and no prior knowledge of such structures, s/he will find the physical
properties but fail to gain mastery of the higher-level entities and the
general rules that determine the behavior of the system. In Chomsky’s words,
‘[w]e must ask how the child acquires knowledge of this [theory and its
principles], a feat that is particularly remarkable since, as already noted,
much of the evidence that leads the linguist to posit this principle is drawn
from the study of percepts and is thus not even available to the child’
(Chomsky, 1981, p. 300). A ‘general learner’ would find language entirely
mysterious. In other words, the rejection of reductive structuralism leads to a
theory where linguistic elements, and the principles that govern their
constitution and lead to behavioural consequences, are based on irreducible
‘percepts,’ not on properties of the ‘directly observed’ stimuli. But a child
approaching the task of learning a language could not observe such ‘percepts’
by means of the stimuli available to him or her (unless by telepathy, or by an
ability to see inside other people’s minds). Notably, what the child has available
is the behavior of the entities
surrounding him or her, as filtered through an extremely narrow band of
information channels, such as eyes, ears and touch.
Consider an argument put forward recently by Fiona Cowie (1999).
She argues that the poverty-of-stimulus arguments, explained above, are ‘at
worst outright false, and at best highly dubious’ (Cowie, 1999, p. 177). Why is
that so? Not because Cowie would have found, after all, a way to define even a
single phoneme without recourse to percepts and properties that are not
available to learners, to say nothing of morphemes or of syntactic or semantic
concepts. There are no such proposals in her book, although one would have
hoped to find one. Instead, Cowie sees the matter the other way around: ‘for many
years, Chomskyans made no serious attempt to provide empirical validation of
their claims about what information is and is not available in the [primary
linguistic data] and what errors children do and do not make’ (Cowie, 1999, p.
190, my emphasis). But this sounds as if the actual collapse of the empiricist and
reductive position does not count as evidence or ‘empirical validation.’ This
cannot be the right diagnosis, whether Cowie’s or anybody else’s; the mentalist
movement was and is first and foremost an empirical undertaking.
Interestingly, Cowie favors a model where ‘children can use statistical information about the acoustical properties of language without being reducible to those acoustical properties,’ being thereby ‘perfectly well able to acquire the “abstract” syntactic concepts that they need to form such hypotheses through statistical analysis of the speech they hear around them’ (Cowie, 1999, pp. 190, 193, respectively). But this seems dubious: if the to-be-learned linguistic properties cannot be defined by means of the properties of the stimulus, then they cannot be ‘abstracted,’ in any rigorous sense of ‘rational abstraction,’ from the stimulus properties without assuming that the hypotheses were already available to the learner. Only a wild guess and some kind of pseudo-inferential steps could ensure the mastery of a concept in such an environment.
[5] Returning again to the case of the reduction of the notion of phoneme, as noted in item (i) on page 6, it is possible to try to construct a theory of language use, and hence of its higher-level properties, by mimicking them ‘on the lower-level deck,’ allowing countless exceptions and complex, context-sensitive definitions to enter as part of the theory, often in such a manner as not to guarantee any convergence on a finite theory, because the obscuring factors are ultimately independent of the true abstract principles. In fact, this is what threatened to happen with the American structuralists. The reductive theory becomes K-complex in its content and, for the empiricist who still insists that it could be ‘learned,’ also impossible to learn by any known rational method of induction. Of course, what could, from a theoretical point of view, be ‘learned’ from it is not the same entity as what is provided by the insightful, albeit abstract, theory of ‘percepts’; it can only be a complex and leaky mimicry of the underlying reality.
[6] One way to understand the relation between the lack of information and computational complexity is to observe that when the model lacks the required information, the missing information must still be transferred from some external source to the model. For example, in a fairly popular but also highly ‘empiricist’ scenario, the model enumerates hypotheses while the environment provides feedback, resulting in inefficient and computationally complex hill-climbing. That process can be made efficient only if the learner already possesses prior information about the correct hypothesis. Yet if the hypothesis is itself complex, then either the prior information must already be in the model, or it must be provided from some external source by some more direct means. In either case, the complexity does not ‘emerge from simple principles.’
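The scenario of this footnote can be sketched as follows (a toy illustration: the bit-string hypothesis space and the yes/no oracle are my own stand-ins for the ‘environment’):

```python
import itertools

def learn_by_enumeration(oracle, n_bits):
    # The 'empiricist' loop: enumerate candidate hypotheses and let the
    # environment answer yes/no. The missing information arrives one
    # feedback bit at a time, so the cost grows exponentially in the
    # information content (length) of the target hypothesis.
    queries = 0
    for h in itertools.product((0, 1), repeat=n_bits):
        queries += 1
        if oracle(h):
            return h, queries
    raise ValueError("no hypothesis matched")

target = (1, 0, 1, 1, 0, 1)  # six bits the model lacks
found, cost = learn_by_enumeration(lambda h: h == target, len(target))
print(found == target, cost)  # up to 2**6 = 64 queries for 6 bits of content
```

If the learner already held the target as prior information, the search would be unnecessary; the exponential query cost is exactly the price of the missing bits.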
[7] Here it must be noted that computational complexity and Kolmogorov complexity are different aspects of ‘complexity,’ though they are related to each other (e.g., Watanabe, 1992). Computational complexity is defined as the slope of the function from the size of a given problem instance (e.g., a chess position) to the amount of computation required to solve that instance (e.g., to detect a good move) in the worst case, and it is well-defined only for an infinite set of problem instances. That is, computational complexity tells us how much computational resource a given problem requires as a function of the size of the problem instance. If the problem consists of only a finite number of instances, then it can be solved trivially by a ‘table-lookup method’ in which each instance is directly paired with its solution. Kolmogorov complexity, in turn, characterizes the complexity of such finite objects by determining how much the lookup table can be further compressed. Importantly, besides these two measures there are other properties that can be, and have been, associated with the term ‘complexity,’ but they do not seem important to the empirically interesting type of randomness investigated in section two.
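The footnote’s contrast can be illustrated with a crude, computable stand-in for the ideal compressor (zlib here; true Kolmogorov complexity is uncomputable, so zlib only gives an upper bound):

```python
import random
import zlib

# Two finite lookup tables of 1000 one-byte 'answers': one generated by
# a simple rule, one by coin flips (seeded for reproducibility).
patterned = bytes(n % 2 for n in range(1000))
random.seed(0)
randomlike = bytes(random.getrandbits(1) for _ in range(1000))

# Both tables solve their finite 'problem' by direct lookup; Kolmogorov
# complexity asks how far the table itself can be compressed.
for name, table in (("patterned", patterned), ("random-like", randomlike)):
    print(name, len(zlib.compress(table)))
```

The rule-governed table collapses to a handful of bytes, while the coin-flip table stays near its entropy: a finite problem is always solvable by lookup, but only the K-simple table admits a short description.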