The proof is not in the pudding

Why experiments can’t be used to prove their own assumptions

From Narrow To General AI
11 min read · Mar 13, 2024

Imagine a researcher in the field of psychology who is exploring the depths of human motivation and reasoning. One hypothesis he is currently testing goes as follows:

The increase in the number of sins correlates with an increase in unhappiness.

In other words, the more the subject engages in sin, the more likely they are to be unhappy. To test this hypothesis, he devises a questionnaire, consisting of two items:

  1. How many sins did you engage in over the last week?
  2. How unhappy were you during the last week?

The first participant, on seeing these questions, is somewhat turned off by the unexpected use of the word “sin”. He doesn’t feel that “sin” is the right way to frame the patterns of events happening in his mind. However, there is no option in the questionnaire to indicate that the framing of the question is misleading.

In the end, he decides that “sin” is as good a category as any other with which to designate his mental patterns. The human mind cannot be read in a “raw” format. All experiments must be based on some predefined framework; if not “sin”, then perhaps the Freudian id, ego, and superego, or perhaps one based on symbolic mental models. In order to be able to communicate his internal life to others at all he must fit it into a set of previously identified and agreed-upon concepts and labels. He’s also concerned that if he denies that sin is a useful perspective, he will be judged by the researcher as “sinful”; only a sinner would want to deny the existence of sin after all. So he proceeds with the questionnaire.

As the experiment is carried out over many subjects, a pattern begins to emerge — the correlation is about 0.6 — not perfect, but not bad. The tepid success of this first experiment leads him to perform many others, trying to find correlations between individual sins (greed, wrath, etc.) and unhappiness. Finally, something like a statistically reliable effect is discovered.
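
To make the arithmetic of the story concrete, here is a minimal sketch, in Python with entirely synthetic data, of the kind of calculation behind a claim like "the correlation is about 0.6". The variable names and numbers are invented for illustration; the point is only that the statistic measures agreement within the questionnaire's own framing.

```python
# Illustrative only: synthetic questionnaire responses, not data from any real study.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical self-reports: number of "sins" last week, and an unhappiness rating.
sins = rng.poisson(lam=4, size=200)
unhappiness = 2 + 0.8 * sins + rng.normal(scale=2, size=200)

# Pearson correlation between the two self-report columns.
r = np.corrcoef(sins, unhappiness)[0, 1]
print(f"correlation: {r:.2f}")  # roughly 0.6 for these made-up parameters
```

Whatever number comes out, it says nothing about whether "sin" or "unhappiness" were the right categories to measure in the first place.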

Impressed by the success demonstrated in the papers, other researchers begin to take an interest. Eventually, it becomes commonplace to infer the presence of sin from its observed effect (unhappiness), and a new field is established. Outsiders are encouraged to pay heed to the research, and the existence of sin becomes a well-attested feature of the science of cognition. Those who question the fundamental assumptions of the field are simply directed to the mountain of evidence that shows that “sin” can be used to predict the occurrence of unhappiness.

“The proof is in the pudding,” they say.

The above example shows why any experiment, no matter how successful, cannot be used to prove the validity of its own assumptions. In this case, the phrasing of the questions influenced the subject by presuming the underlying structure for cognition. The experiment can only demonstrate correlations within the space of its own assumptions. It cannot prove the assumptions themselves, or the validity of the ontological framework on which the experiment is based. That one is able to find correlations between self-reports of sin and unhappiness doesn’t prove that “sin” is the best model for thinking; nor “unhappiness” for that matter. The fact that any patterns in reported behaviour are observed is partly an artifact of theory itself.

All models are wrong, but some are useful — George Box

There is nothing inherently wrong with making assumptions before you begin an experiment — in fact, it is impossible to perform an experiment without them. Nor are the biases that are introduced in the results necessarily insidious. In each case the researcher hopes that they have found a more useful model and set of assumptions than previously existed — “useful” in the sense that it predicts its own concepts correctly. But there is one special case in which such experiments can lead us astray more often than usual: cognitive science research. And here is where this post becomes relevant to AI.

Logical Mental Models

Though the example of using “sin” above appears contrived, it is not too far off from how cognitive science actually treats its subject matter, the mind. Take, for example, the variety of theories regarding how humans build up mental models of the world. Beginning with Johnson-Laird’s seminal papers, many researchers have used mental model theory in countless experiments to show how the mind forms mental models of experiences, and uses them to reason predictably across conjunctions, disjunctions, abductions, inductions, etc. Experiments generally involve asking subjects tricky logical questions, such as:

How many cards below must you turn over to prove (or disprove) the following: “If a card has an “A” on one side, then it has a “2” on its other side”?
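
This is a version of the classic Wason selection task. The article's card figure is not reproduced here, so the sketch below assumes the standard four visible faces (A, K, 2 and 7); it simply enumerates which cards could, in formal logic, falsify the rule.

```python
# Sketch of the formal-logic answer to the card puzzle, assuming the classic
# Wason setup with four visible faces: A, K, 2, 7 (an assumption; the figure is not shown here).

visible_faces = ["A", "K", "2", "7"]

def must_turn(face: str) -> bool:
    """A card needs turning only if its hidden side could falsify the rule
    'if a card has an A on one side, then it has a 2 on the other'."""
    if face == "A":        # the hidden side might not be a 2, which would falsify the rule
        return True
    if face.isdigit():     # a number card matters only if it is NOT a 2 and hides an A
        return face != "2"
    return False           # other letters (e.g. K) can never falsify the rule

print([f for f in visible_faces if must_turn(f)])  # ['A', '7'] is the "logically correct" answer
```

In the original studies most subjects chose A and 2 rather than A and 7, which is the sort of "reliable mistake" discussed in the next paragraph.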

From the patterns of answers given by test subjects, studies have demonstrated that the mind can sustain a limited set of logical inferences, that it performs better in some categories, and that it makes reliable mistakes in others. However, all experiments focus their subjects on the topic they are trying to study — namely logical inference using mental models.

Reasoners use the meanings of assertions together with general knowledge to construct mental models of the possibilities compatible with the premises. Each model represents what is true in a possibility.

intelligent programs should draw conclusions that succinctly express all the information in the premises. — Johnson-Laird, Oxford Handbook

As with the example involving sin, researchers don’t ask questions outside the domain of logic, nor ask if the subjects would like to discuss different topics. Therefore any conclusion that minds generally (or ever) think in terms of mental models would be unwarranted — only that minds can appear to do so when asked questions regarding logical inference. Were the test written in Latin, the subjects would do their best to answer in Latin — but the experimenters would be misguided if they concluded that their subjects prefer to think in Latin.

Logic was introduced into the interaction, not by the subject, but by the experimenters. The subjects felt they had to go along with it since they were being tested — “tested”, in the sense of a quiz — and failure would reflect poorly on them. Humans will strive to match the expected behaviour of any experiment — it’s what we do well. Clearly (the subject might think), the answers obtained by applying formal logic are the objectively “right” ones, and any rejection of its conclusions is a failure on my part. To reject these logic problems, or to propose an alternative thread of conversation, would have made them seem ignorant and illogical, just as happened with “sin” above.

The example involving “sin” seemed to signal a sort of bias on the part of the researcher towards religious theories. Science, in contrast, is arguably intended to filter out such bias, and be as objective and neutral as possible. Though many would deny that scientists can ever truly be objective, a more interesting question is: what if they were?

Humans are not by default objective in their everyday dealings. They need to be trained to be so, to practice critical thinking, to be impartial, etc. It is a socially encouraged perspective and motivation. Objectivity, regardless of how it comes about, is just one human interpretive attitude. There are other ways of experiencing and interacting with the world: preferentially, religiously, vengefully, artistically, emotionally, politically, helpfully, tactfully, nepotistically, and so on.

To study test subjects scientifically requires the experimenter to adopt this singular attitude. And although this doesn’t mean that the experimenter’s questions and expected answers will revolve around objective thinking as well, in practice that is what happens. Irrational behaviour — for example, cutting off one’s nose to spite one’s face, something that is so common within the vengeful mindset — is framed as a failure, and rational behaviour highlighted as a success. The questions asked in mental model experiments don’t leave room for any other perspective.

Reasoning lies at the core of human intelligence. And it is central to science, society, and the solution of practical problems. — Mental models and deduction, Johnson-Laird

You can see in the quote above from Johnson-Laird how intelligence is equated with science and solving social problems. What is neglected says a lot: it ignores problems of art, philosophy, religion, personal psychology, politics, etc. The author’s focus is clear. Anyone who rejects it would have to sit in front of a scientist and tell them that logic puzzles are stupid; very few do.

The well-known expression “seek and ye shall find” has a flip-side: you will only find what you seek.

The results from these experiments merely reinforce the assumptions that the researchers came in with; namely, that the mind thinks rationally. It’s like offering a hungry child nothing but artichokes for two years, then concluding from their behaviour that they must love artichokes above other foods. The proliferation of experiments undertaken based on these assumptions appears, on the surface, to confirm their validity.

“The proof is in the pudding,” they say.

But the phrase “the proof is in the pudding” assumes you are trying to make pudding. The observer’s values matter. You can only judge a participant’s success against a predefined notion of “better” and “worse”. It is only because the questions and experiments are designed around gauging a particular type of rational behaviour that we see it in the results. All else is interpreted as either noise or failure.

Beyond reaffirming existing biases through experimentation, this has practical effects on AI research. AI takes its cues from cognitive science, and the latter searches for and finds in its subject only the rational agent, a reflection and projection of its own mode of study. Rational, probabilistic thinking is of particular value to science and industrial productivity, e.g. in factories and for logistics. AI, which is supposed to seek out a more comprehensive view of human behaviour, is burdened with its marriage to the field of cognitive science. All its benchmarks are aimed at rational deduction, objective categorization, or accurate measurement:

An agent that chooses actions to maximize its utility will be rational according to the external performance measure…

Improving the model components of a model-based agent so that they conform better with reality is almost always a good idea, regardless of the external performance standard. — AI: A Modern Approach
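
As a hedged sketch of what that textbook ideal amounts to (not code from the book), the "rational agent" can be written in a few lines: given a model of outcomes and an externally supplied utility function, it simply picks the action with the highest expected utility. All names and numbers below are invented.

```python
# Minimal sketch of the expected-utility-maximising agent ideal. Toy example only.
from typing import Callable, Dict, List

def choose_action(actions: List[str],
                  outcome_probs: Dict[str, Dict[str, float]],
                  utility: Callable[[str], float]) -> str:
    """Return the action whose expected utility is highest under the agent's model."""
    def expected_utility(action: str) -> float:
        return sum(p * utility(outcome) for outcome, p in outcome_probs[action].items())
    return max(actions, key=expected_utility)

# Invented model of the world and an invented performance measure.
outcomes = {
    "act_cautiously": {"small_gain": 0.9, "small_loss": 0.1},
    "act_boldly":     {"big_gain": 0.4, "big_loss": 0.6},
}
payoff = {"small_gain": 1.0, "small_loss": -1.0, "big_gain": 5.0, "big_loss": -4.0}

print(choose_action(list(outcomes), outcomes, payoff.get))  # -> "act_cautiously"
```

Note that "rational" here is defined entirely relative to the utility function handed to the agent, which is the point about the external performance measure.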

Claims that AI performs better than humans at logical tasks are therefore ironic, since they assume that performing logical tasks is what all humans are generally aiming for.

Chasing a narrow ideal

There was a recent study in which the author discussed the ability of Large Language Models (LLMs) to perform symbolic logic. The core argument is that both symbol manipulation and symbol grounding are not well-defined enough in humans to categorically rule out LLMs from performing either. The author goes on to give clear examples where some semblance of these abilities appears to be in play in various connectionist AI systems.

The report, however, relies on a definition of intelligence akin to the one discussed in the previous section. Below are some relevant quotes from the paper:

Human behaviour, across many domains, is explained very well by traditional symbolic models, i.e. models that represent concepts as discrete constituents which are manipulated using abstract operations such as variable binding and logic…

The fact that LLMs might obey similar principles during their training to the principles obeyed by probabilistic symbolic systems provides evidence that the processes they use under the hood may reflect those that traditional theories in cognitive science routinely employ…

One way to show that grounding is not necessary for learning meaning is to show that conceptual representations learned by an (ungrounded) LLM are isomorphic to grounded representations of those same concepts. — Symbols and grounding in large language models

To summarize, the author assumes the following: human minds employ probabilistic symbolic systems to represent what is true in the world, and LLMs may be able to match these representations. Success in LLMs is measured against an ideal model of symbolic reasoning. But is it true that humans regularly create consistent, probabilistic world models? Or is this rather an ideal that dominates scientific research, that is now being projected onto human cognition?

Evolutionary psychologists postulate that natural selection led to an innate “module” in the mind that makes Bayesian inferences from naturally occurring frequencies. — Johnson-Laird, Oxford Handbook

In-context learning — the primary mechanism via which LLMs exhibit sample efficiency — can be understood as an implicit implementation of more familiar Bayesian inference. — Symbols and grounding in large language models

Again, do humans implicitly reason using Bayesian probabilities? Or is this a niche, exceptional skill relegated to scientific research, which every human steadfastly resists in day-to-day life? Is it not rather a truism that “people believe what they want to believe”?
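
To be clear about what is being attributed to the mind here, this is a minimal sketch of a single Bayesian update, with invented numbers: a prior belief is revised in proportion to how well each hypothesis predicts the evidence. Whether anything like this runs "under the hood" in everyday human reasoning is exactly what is in question.

```python
# One Bayes update with made-up numbers: posterior is proportional to likelihood times prior.

def bayes_update(prior: float, p_evidence_given_h: float, p_evidence_given_not_h: float) -> float:
    """Posterior probability of hypothesis H after observing a single piece of evidence."""
    numerator = p_evidence_given_h * prior
    denominator = numerator + p_evidence_given_not_h * (1 - prior)
    return numerator / denominator

# Hypothetical example: H = "it rained overnight", evidence = "the grass is wet".
print(bayes_update(prior=0.2, p_evidence_given_h=0.9, p_evidence_given_not_h=0.1))  # ~0.69
```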

The principle of equiprobability: Each mental model represents an equiprobable possibility unless there are reasons to the contrary — Johnson-Laird, Oxford Handbook

Again, is this what people actually do? Or is this rather what a good, impartial scientist should do?
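
For illustration, here is what the equiprobability principle commits a reasoner to, sketched with an invented premise: every possibility compatible with the premise gets the same weight, and probability judgments fall out as a share of those possibilities.

```python
# Sketch of the equiprobability principle applied to a disjunctive premise (invented example).
# Premise: "There is a circle or a triangle (or both)."
from itertools import product

# Enumerate the mental models (possibilities) compatible with the premise.
models = [(circle, triangle) for circle, triangle in product([True, False], repeat=2)
          if circle or triangle]

# Under the principle each model is weighted equally, so the judged probability
# of "there is a circle" is just the share of models that contain one.
p_circle = sum(1 for circle, _ in models if circle) / len(models)
print(models, p_circle)  # three models, probability 2/3 under the idealised assumption
```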

The same can be said of any formal, structured thinking, like propositional logic, composition and hierarchies, logical problem-solving, and dimensionality-based similarity matching. For each of these ideal mental models, numerous exceptions can be found where humans, in their natural mode of functioning, seem to fall short of their standards. These are often referred to as “biases”, which is another way of saying they are deviations. The myriad exceptions show rather that the rules are “fictitious”, in the sense that they are fabricated, contrived.

When we test human subjects on their performance on symbolic reasoning or probabilistic decision making, we are comparing them to a set of idealised expectations. That in itself is fine — it’s like quizzing them on any other intellectual virtue. But when we compare an LLM’s ability to perform the same functions, we are not comparing it to humans, but to the free-floating ideal we have defined. That humans and AI can match these expectations simply means that both are able to accommodate the structures we have invented. We are gauging their ability to align with shared social symbols and the logic we have defined for their dynamics, not with thought itself.

We imagine that symbolic logic is a fundamental function of the mind rather than a socially mediated function.

The paper correctly argues that there is no good reason to doubt that LLMs can, or eventually could, perform symbolic logic without needing to ground them in embodied experience. Since symbol interpretation is carried out in a formal symbolic space, any formal system could replicate it, without needing grounding (as the paper points out). The LLM is ultimately extracting from the text corpora the very patterns of logical inference we intentionally put in them in the first place. We are getting out the very thing we put in.

when we inspect the internal representations of modern neural networks, do they reflect aspects of traditionally symbolic structure? Specifically, do they encode discrete constituents, organized within abstract predicate-argument structures, which combine productively? — Symbols and grounding in large language models

Here the author projects the very structures that we train humans to abide by into the core of AI itself, and by extension into the roots of human psychology. Symbolic reasoning of the kind demonstrated by Johnson-Laird’s tests is a narrow and not especially representative slice of human cognition. Yet it occupies a disproportionately large chunk of cognitive science research. By holding AI to these same standards we are merely reinforcing the hegemony of those theories. The better the AI performs, the more likely such researchers will be to point at their theories, then point at the AI, and say:

“See? The proof is in the pudding.”
