Since at least the time of Plato, humanity has been groping toward an understanding of the phenomenon of representation. We have learned that our representation of the world is not the same as the world itself, and that there are fundamental constraints on the relationship between a representation and the thing represented. As we have evolved our understanding of representation, we have also begun to appreciate the striking degree to which the laws of our universe seem to be finely tuned to permit the emergence of subsystems complex enough to form representations of the system they inhabit. Our universe seems to be delicately balanced at "the edge of chaos." It is simple enough to exhibit regularities that are learnable by the human subsystems it has evolved, but complex enough always to have surprises in store for us.
Of course, whether a system is simple or complex, learnable or not, is a property not just of the system being learned, but of the relationship between the system being learned and the system doing the learning. Usually we make a distinction between the universe "out there" and the scientist "in here" who is trying to learn about it. But we need to close that loop, and appreciate that for our universe, the learner and the subject matter are one and the same.
This paper proposes a view of the universe we live in as an information processing system constructing a self-representation. The structure and rules of behavior of our universe are complex enough to permit it to form representations of itself, constrained enough that it is capable of forming increasingly accurate representations of itself, yet sufficiently unconstrained that it exhibits irreducible uncertainty. The burgeoning application of methods from statistical physics to learning problems suggests a complementarity between the physical and information processing views of the universe. The universe has evolved a set of loosely coupled, independent conscious subsystems that in the aggregate form the universe's self-representation. Consciousness is identified with the mapping between the state of a conscious subsystem of the universe and its representation of itself and its environment.
From the vantage point of our local region of space-time, this view seems to fit the evidence remarkably well. Not only that, but the appreciation that this is the kind of universe we live in seems to be growing. The conscious subsystems of our local piece of the universe appear right now to be fascinated by the properties of learning systems, to be arriving at an increasingly sophisticated understanding of the mathematics of self-representation and learnability, and to be discovering that these fascinating new theories apply to themselves and the universe they inhabit. We are on the brink of mass understanding that we can be modeled as a self-learning system. In what follows, I describe a thought experiment involving the design of a self-learning system, discuss properties that appear common to systems that construct representations, and sketch a general framework for constructing systems that evolve self-representations.
2.0 A Thought Experiment
Suppose God wants to design a universe that learns what kind of universe it is. More specifically, God is going to create a family of universes so that, if any member of the family were selected and started running, it would eventually learn, to a high degree of accuracy, which one of the possible universes it was. (Readers who dislike the theistic metaphor can remove references to the designer, and obtain an observationally equivalent non-theistic theory.)
There are a few concepts that seem fundamental to a theory of learning complex representations. First, we have discovered that if we want to design systems capable of learning extremely complex patterns, it is useful to put some randomness into the learning system. Otherwise the system gets stuck at local optima, i.e., poor representations with no nearby ones that are any better. Global optimizers with intrinsic randomness can find optima for arbitrarily complex problems on which deterministic optimizers of equivalent complexity fail, and this ability provides justification for God to design a universe that "plays dice."
Although some randomness will keep our system from "freezing" into a sub-optimal self-representation, too much unpredictability is harmful. The system must be predictable enough that it exhibits learnable regularities and structured enough that it can form representations. Therefore, God would have to design this system with just the right amount of randomness.
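The role of randomness in escaping local optima can be illustrated with a minimal sketch. The landscape `ENERGY`, the function names, and the temperature and step count are all hypothetical choices made for this illustration; the acceptance rule is the standard Metropolis rule, used here as one example of a randomized optimizer rather than as the specific mechanism the text has in mind.

```python
import math
import random

# Hypothetical one-dimensional "badness" landscape (lower is better).
# Index 1 is a local optimum; index 6 is the global optimum.
ENERGY = [5, 3, 4, 6, 7, 2, 0, 1, 8]


def greedy_descent(start):
    """Deterministic learner: move to the better neighbour until none is better."""
    pos = start
    while True:
        neighbours = [i for i in (pos - 1, pos + 1) if 0 <= i < len(ENERGY)]
        best = min(neighbours, key=lambda i: ENERGY[i])
        if ENERGY[best] >= ENERGY[pos]:
            return pos  # no neighbour improves on the current position
        pos = best


def stochastic_search(start, steps=5000, temperature=2.0, seed=0):
    """Randomized learner: sometimes accepts a worse neighbour (Metropolis rule)."""
    rng = random.Random(seed)
    pos = best = start
    for _ in range(steps):
        cand = pos + rng.choice((-1, 1))
        if not 0 <= cand < len(ENERGY):
            continue  # proposal fell off the landscape; stay put
        delta = ENERGY[cand] - ENERGY[pos]
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            pos = cand
        if ENERGY[pos] < ENERGY[best]:
            best = pos  # remember the best position visited so far
    return best


stuck_at = greedy_descent(0)
found = stochastic_search(0)
print("greedy stops at index", stuck_at, "with energy", ENERGY[stuck_at])
print("stochastic search best index", found, "with energy", ENERGY[found])
```

The greedy learner halts at the first position with no better neighbour, while the randomized learner can climb out of that basin. With the temperature set much higher the walk would become nearly uniform and stop favouring good positions, which is the "too much unpredictability" side of the trade-off.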
Another fundamental idea is modularity. Loosely coupled local learners are more easily constructed and less brittle than top-down global learners. Along with modularity, there needs to be repeated structure, also known as self-similarity or conditional independence. Modularity and self-similarity enhance learnability because elements of the system are similar enough to one another that the learners can apply what they have learned about one element to other elements in the same class. Thus, God would design this system as a loosely coupled collection of independent learners that exchange information with each other.
Another important characteristic of successful learning systems is evolution, or biased sampling over time with feedback about how well the system is doing. Successful learning systems evolve in time by trying out models and picking the good ones. Thus, there needs to be some sort of "goodness metric," and adaptation mechanisms that select for representations that are "better" according to this metric.
Another thing we have learned about the business of constructing representations is that it is a mistake to base our evaluation of a learner's performance only on how well it performs on data it used to derive its representation. The system that does the best job of retrospectively explaining the observed data is likely to perform poorly on cases it has not yet seen. God would need to build in a bias, but not too strong a bias, toward a priori expectations. Most successful learning systems incorporate a bias toward models that are simple according to the learner's representation scheme.
To summarize, our most successful and sophisticated learning systems are constructed out of loosely coupled, similar but not identical building blocks, which are reshuffled in a biased way, evolving stochastically to better and better representations, and which incorporate a bias toward simple representations, but can form complex representations when necessary. This description sounds a great deal like the universe we live in. It sounds a great deal like science as we have come to practice it: a community of loosely coupled, similar but not identical scientists, each attempting to construct increasingly accurate representations, in which individual scientists are free to choose their own approaches and models without coercion, and in which scientists have evolved a widely shared tendency to prefer simple models. In other words, the universe we live in looks like just the kind of universe God would design, if God wanted to create a sophisticated learning system.
3.0 Probability and Representation
We use logic to create formal descriptions of the world, to reason with these descriptions, and to draw conclusions from them. Logic enables us to derive conclusions that are implied by our initial premises but may not have been obvious to us before we performed the derivation. Science strives for theories covering a wide range of phenomena, in which many conclusions can be logically derived from a few logically consistent premises, in which the conclusions drawn from the theory are in accordance with empirical observation, and in which as few phenomena as possible remain unexplained.
Traditional logic has nothing to say about statements that can be neither proven nor disproven. The self-learning universe of our thought experiment will have to start out with a great deal of uncertainty about what kind of universe it is, and will learn more about itself as it observes itself in action. To represent this uncertainty we need to extend logic to deal with truth values intermediate between strict truth and falsity. The theory of probability was developed for this purpose.
The developers of probability theory conceived of the theory as a formalization of rational human reasoning applied to problems for which conclusions could not be drawn with certainty. Probability can be used to model an agent's degree of belief about whether an event will occur or whether a proposition is true. The mathematics of probability can be used to derive beliefs about some propositions from beliefs about others. This provides a means to check the internal consistency of beliefs, to make predictions about future events, and to learn from observations. The theory of belief dynamics, in which observations are used to learn about the phenomenon generating them, is called Bayesian statistics. Bayesian statistics is a theoretically grounded and principled way to update a representation as new information about the environment is obtained. Recent work in Bayesian statistics has involved the design of learners that incorporate a bias toward simplicity, while retaining the ability to represent arbitrarily complex phenomena when justified by the observations. This can be done by using robust and flexible "intelligent prior distributions" that encode a belief that the phenomenon in question is complex but learnable. Such models can learn patterns that are strongly supported by the data, but resist jumping to unwarranted conclusions about spurious patterns.
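As a small illustration of belief dynamics, the following sketch applies Bayes' rule repeatedly to a discrete set of hypotheses. The two-coin setup, the hypothesis names, and the function `bayes_update` are assumptions introduced only for this example; exact fractions are used so the arithmetic can be checked by hand.

```python
from fractions import Fraction


def bayes_update(prior, likelihood, observation):
    """Return the posterior over hypotheses after a single observation."""
    unnormalized = {h: prior[h] * likelihood[h][observation] for h in prior}
    total = sum(unnormalized.values())
    return {h: w / total for h, w in unnormalized.items()}


# Hypothetical example: is the coin fair, or biased toward heads?
likelihood = {
    "fair": {"H": Fraction(1, 2), "T": Fraction(1, 2)},
    "biased": {"H": Fraction(3, 4), "T": Fraction(1, 4)},
}
belief = {"fair": Fraction(1, 2), "biased": Fraction(1, 2)}

# Each observation turns the current posterior into the next prior.
for obs in "HHT":
    belief = bayes_update(belief, likelihood, obs)

print(belief)  # fair: 8/17, biased: 9/17
```

After two heads and one tail the biased hypothesis is only slightly favoured, which reflects the point that a well-behaved learner does not jump to conclusions on thin evidence.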
Three important innovations have contributed to a rapid spread of Bayesian learning applied to complex problems. First is the theory of graphical models, which provide a language for representing systems with modular, loosely coupled substructures. Second is the importation of Markov Chain Monte Carlo (MCMC) algorithms from statistical physics. MCMC provides a recipe for stochastically evolving a system toward a target stationary distribution. In learning problems, an MCMC system is designed to evolve to the posterior distribution of the target parameter given the observations. The third innovation comes from the combination of graphical models with theories of knowledge representation, to form probabilistic logics for representing parsimoniously parameterized theories about a domain. Bayesian inference can be used to learn parameters of a given domain theory or to weigh empirical support for competing domain theories.
4.0 Self-Representing Systems
We now consider the application of the methods described in Section 3 to a system that learns about itself by observing itself. Imagine the God of our thought experiment designing a family of Markov chain Monte Carlo systems, in which each system in the family learns over time which member of the family it is. An MCMC system can be described in terms of a state space, a transition distribution, and an initial state. Each state in the state space represents one possible "way the universe could be" at a given instant in time. That is, the state encodes the system's beliefs about what kind of universe it lives in. The transition distribution is a rule specifying the probability of each possible next state as a function of the current state. To create a universe, then, God picks an initial state from the state space, selects a transition distribution, and then lets the system evolve from state to state according to the transition distribution.
There are certain mathematical conditions which imply that a Markov chain is ergodic; that is, it will converge in the long run to a stationary distribution. The stationary distribution for an ergodic system encodes the probability that it will be in any given state at some point far into the future, if we start it running now and don't look at it again until that future time. A good way for God to construct a self-learning system would be to specify a system that evolves to a stationary distribution in which accurate models of itself have very high probability and inaccurate models have very low probability. We now consider how God might design such a family of MCMC systems.
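Convergence to a stationary distribution can be seen in a toy chain. The three-state transition matrix `P` below is a hypothetical example chosen for illustration, and `evolve` is an illustrative helper; the sketch simply pushes two different initial distributions through the chain and shows that they end up in the same place.

```python
def evolve(dist, P, steps):
    """Push a probability distribution through the chain for `steps` transitions."""
    n = len(P)
    for _ in range(steps):
        dist = [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]
    return dist


# Hypothetical transition distribution: row i gives the probabilities
# of each next state when the current state is i.
P = [
    [0.9, 0.1, 0.0],
    [0.2, 0.6, 0.2],
    [0.0, 0.3, 0.7],
]

from_state_0 = evolve([1.0, 0.0, 0.0], P, 500)
from_state_2 = evolve([0.0, 0.0, 1.0], P, 500)
print(from_state_0)
print(from_state_2)
```

Both runs approach the distribution (6/11, 3/11, 2/11), so the long-run behaviour of this chain does not depend on the initial state.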
It is important here to recall the distinction between representation and reality. We can know only our representation of reality. Although reality itself remains inaccessible to us, we can study the relationship between representation and reality by building models of the representation process. This is exactly what we are doing in our thought experiment. The state of our system represents actual reality at a single instant in time. We assume our system does not know its actual state. Rather, there exists a self-representation function, which maps each state into the system's representation of reality at the corresponding instant of time. For our purposes, we need to consider three aspects of the self-representation function. First is the system's universe distribution, which is a probability distribution over MCMC systems. The universe distribution represents the system's beliefs about what kind of universe it is. Second is the system's state distribution. The state distribution is a probability distribution over elements of the state space, and represents the system's beliefs about what its current state is. Third is the system's prediction, which also is a probability distribution over states, and represents its beliefs about what its state will be at the next instant of time. If the system's beliefs are consistent, the prediction can be obtained from the state distribution and the universe distribution by the rules of probability theory.
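One way to write the consistency condition between the three distributions is sketched here. The notation is an assumption introduced for illustration: $U_t(m)$ is the universe distribution over candidate systems $m$, $S_t(x)$ is the state distribution over states $x$, $T_m(x' \mid x)$ is the transition distribution of system $m$, and $P_t(x')$ is the prediction. The sketch also assumes that beliefs about the system and beliefs about the current state are treated as independent.

```latex
P_t(x') \;=\; \sum_{m} \sum_{x} U_t(m)\, S_t(x)\, T_m(x' \mid x)
```

Under these assumptions the prediction is the next-state distribution averaged over both the state distribution and the universe distribution.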
It seems reasonable to postulate that God would design this family of MCMC systems and the self-representation function in such a way that, no matter which system from the family is chosen, the system will evolve to as accurate a self-representation as is possible for that system. Let us consider how this might actually be done. Let us assume that God gives the system one of the flexible, robust prior distributions described above, to represent its a priori expectations about which system it is. The system's posterior distribution for which system it is, given the observable aspects of its current state, is the most the system could possibly be expected to know about itself. To arrive at this posterior distribution, the system needs to know the conditional probability of the observable part of its current state, for each system in the family. Assuming all the systems in the family are ergodic, it seems reasonable to use the stationary distribution of the observables as the appropriate conditional distribution. Thus, God might design this system so that its universe distribution is (or evolves to a close approximation to) the posterior distribution for which system is being run, given the observable features of its current state.
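In symbols, the proposal amounts to Bayes' rule with the stationary distribution playing the role of the likelihood. The notation is assumed for illustration only: $p_0(m)$ is the prior over systems $m$ in the family, $o$ is the observable part of the current state, and $\pi_m(o)$ is the stationary probability of $o$ under system $m$.

```latex
U(m \mid o) \;=\; \frac{p_0(m)\, \pi_m(o)}{\sum_{m'} p_0(m')\, \pi_{m'}(o)}
```

As a hypothetical numerical instance, with two equally likely systems whose stationary probabilities for the observed value are $6/11$ and $1/5$, the posterior probability of the first system is $(6/11)/(6/11 + 1/5) = 30/41$.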
The universe we live in appears to have many independent samplers, each of which constructs its own representation, and which exchange information with each other. Experience has shown that this kind of architecture leads to learning systems that are both flexible and highly efficient. For such a system, the global self-representation would be some form of aggregation of the individual samplers' representations.
Any two samplers with the same stationary distribution are observationally indistinguishable. Although this may seem like a limitation, it turns out to leave an opening for God to insert free will into the universe. God chooses the stationary distribution, and the consciousnesses God has created are free to choose among sampling distributions that give rise to this stationary distribution. Consider, for example, the Metropolis-Hastings sampler, which arose in statistical physics and has been applied extensively in Bayesian statistics. A Metropolis-Hastings sampler proposes a new state according to some probability distribution. If there are multiple local samplers, each proposes local changes to its immediate neighborhood. Proposals are either accepted or rejected according to a rule satisfying a property called local reversibility. Local reversibility, together with a small amount of irreducible randomness, can be shown to imply that the system is ergodic and converges to the target stationary distribution. (We note in this connection that local reversibility and irreducible randomness appear to be fundamental properties of the universe we live in.) Thus, God could institute free will by allowing the consciousnesses of the universe to use any locally computable proposal distribution they choose (subject to an irreducible randomness requirement to ensure ergodicity), while using Metropolis-Hastings acceptance to ensure convergence to the appropriate stationary distribution.
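The claim that different proposal distributions can share one stationary distribution can be checked exactly on a small example. In the sketch below, the three-state target and both proposal matrices are hypothetical, and `mh_transition_matrix` and `is_stationary` are illustrative helper names. Rather than simulating, the code builds the full Metropolis-Hastings transition matrix for each proposal and verifies stationarity with exact fractions.

```python
from fractions import Fraction as F


def mh_transition_matrix(target, proposal):
    """Build the Metropolis-Hastings transition matrix on a finite state space."""
    n = len(target)
    P = [[F(0)] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and proposal[i][j] > 0:
                # Standard Metropolis-Hastings acceptance probability.
                ratio = (target[j] * proposal[j][i]) / (target[i] * proposal[i][j])
                P[i][j] = proposal[i][j] * min(F(1), ratio)
        P[i][i] = 1 - sum(P[i])  # rejected proposals leave the state unchanged
    return P


def is_stationary(dist, P):
    """Check that one transition of the chain leaves `dist` unchanged."""
    n = len(dist)
    return all(sum(dist[i] * P[i][j] for i in range(n)) == dist[j] for j in range(n))


# Hypothetical target stationary distribution (fixed in advance).
target = [F(6, 11), F(3, 11), F(2, 11)]

# Two hypothetical proposal distributions (freely chosen by the sampler).
uniform = [[F(1, 3)] * 3 for _ in range(3)]
lopsided = [
    [F(1, 2), F(1, 4), F(1, 4)],
    [F(1, 10), F(1, 5), F(7, 10)],
    [F(3, 5), F(1, 5), F(1, 5)],
]

P_uniform = mh_transition_matrix(target, uniform)
P_lopsided = mh_transition_matrix(target, lopsided)

print(is_stationary(target, P_uniform), is_stationary(target, P_lopsided))
print(P_uniform != P_lopsided)
```

The two chains move between states with different probabilities, yet both leave the same target distribution unchanged. Every proposal entry here is positive, which is one simple way to meet the irreducible randomness requirement in this finite setting.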
If the universe God creates follows extremely simple laws, it will freeze into a highly accurate simple model of itself. If the universe God picks is very complex, it will do the best it can at an impossible task, but will never learn to make accurate predictions about its observable features. If the system God picks is at the "edge of chaos," it will evolve to an increasingly accurate representation of the statistical regularities in its observable behavior, but will continue forever to discover new things about itself. It would appear highly probable that our universe is one of these "edge of chaos" systems.
To summarize, God begins with a state space, a self-representation function, and a prior distribution. The self-representation function maps each state to: (1) a probability distribution over MCMC systems on the state space (the universe distribution), (2) a probability distribution over states (the state distribution) and (3) another probability distribution over states (the prediction). The universe distribution is (or evolves to approximate) the posterior distribution for which system is being run given the observable part of the state. The prediction is (or evolves to approximate) the probability distribution for the next state, averaged over the state distribution and the universe distribution. According to this metaphor, our universe appears to be an "edge of chaos" system: one that evolves to an accurate self-representation, but exhibits intrinsic unpredictability and continues to learn new things about itself even in the long-term limit. The conscious subsystems of the universe have free will in that they may choose any distribution for sampling the next state, subject to the constraint that the system must converge to the target stationary distribution.
5.0 Conclusion
It appears that this model of the universe as a self-learning system fits the observable evidence remarkably well. For the first time in history, we have the mathematical tools to formulate this theory and make empirically testable predictions about learnable universes. It would appear from the evidence at this point in space-time that the universe we inhabit can be modeled as an "edge of chaos" system, and that right now our corner of the universe is learning this about itself. The regularities we have learned about our universe are our current best representation of which MCMC system our universe is.
The model of consciousness that emerges from this view of the universe is that consciousness can be identified with the mapping from state to representation. We can identify free will with the ability to choose the transition distribution for the next state from among distributions that converge to the target stationary distribution. It is interesting to note that this view of the universe suggests an answer to the riddle of the arrow of time: time flows in the direction of increasing self-knowledge. That this direction should be the same as the direction of increasing complexity, as measured by thermodynamic entropy, is not at all surprising.