A fantasy called “Out of Distribution”

The importance of causal features for ML generalisation

11 min read · Apr 19, 2025

By Yervant Kulbashian. You can support me on Patreon here.

For Machine Learning (ML) models to be valuable in industry, they must deal effectively with situations they haven’t encountered during training. We refer to this as a model’s ability to generalise. For example, once trained on a set of images of cars, a model should be able to recognise cars in images it has not yet seen. Most failures of ML models in production come down to an inability to handle exceptional or unforeseen situations, when the real world deviates from what the model has learned so far.

As yet, no universal solution to this challenge exists¹. As we’ll see, the only successful models in production are those that sidestep the issue by hand-crafting narrow, task-specific biases into the model, its architecture, and the training data. This post discusses why that is not an accidental oversight: there are structural reasons why generalisation from static datasets is fundamentally infeasible.

When we expect a model to generalise, we are hoping it will somehow discover how to step outside its training data using only the patterns extracted from that data. For example, when looking at pictures of chairs, the model should ignore the viewing angle and the colour, and isolate the shape and structure of a single object, so that it is not confused if it later sees an upside-down chair. The idea of generalisation therefore carries an assumption with it: that there is a right and a wrong way to do it, and that the agent will be able to figure out what the right way is.

We humans tend to intuit which features are the important ones when recognising new objects. Moreover, we believe that it was our experiences that informed us about their relative importance. And so we feel it should be straightforward for a model to extract, from data alone, the correct features that constitute the essence of the subject.

There is a hidden assumption that props up this intuition: that there exists, in reality, some underlying trend, pattern, or force that generates all the samples (e.g. images) of chairs. In other words, outside the set of chairs on which we trained the model there is a larger, Platonic ideal set from which all images of chairs are supposed to be drawn. This is referred to as the underlying distribution. It is presumably the reason we grouped all images of chairs under one category. Our hope is that the pattern for generating the larger dataset can be extracted from our smaller training set, much like discovering the general mathematical formula that describes the motion of the planets around the sun. Put in simple terms, a successful ML algorithm should:

  1. Extract, from the training set, a pattern that represents all chairs
  2. Encode that into a model
  3. Use that model to recognise unseen chairs
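
The three steps can be sketched in a few lines of scikit-learn. Everything below is illustrative: the arrays are random stand-ins for real image data and a production pipeline would be far more elaborate, but the shape of the recipe is the same.

```python
# A minimal sketch of the three-step recipe above, using scikit-learn.
# X_train / y_train / X_new are hypothetical stand-ins for flattened images.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.random((200, 64 * 64))    # stand-in for 200 flattened 64x64 images
y_train = rng.integers(0, 2, size=200)  # 1 = "chair", 0 = "not a chair"

# 1. + 2. Extract a pattern from the training set and encode it in a model.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# 3. Use that model on images it has never seen.
X_new = rng.random((5, 64 * 64))
print(model.predict(X_new))  # the hope: these labels generalise correctly
```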

If this were fundamentally not possible, the entire field of Machine Learning would be a dead end. We refer to this ability as generalising “out of distribution” (OOD) — i.e. outside what we trained on. It is the goal of practically all of Machine Learning. Interestingly, we also employ the term “out of distribution” in a second way, to indicate a failure of this same process. When we notice that a sample is too different for the model to respond correctly to, we call that sample OOD. The term OOD is always at the threshold between what the model can currently do well, and what we wish it could do; between where we expect the model to fail, and where we assume it should “know better”.

The reason for this fluid definition is simple: there is no consistent way to calculate where the boundaries of an underlying distribution lie, or to know when you have crossed them (gone OOD). We only really know a sample was “too different” retroactively, when the model does poorly on it. True, you can roughly estimate the similarity between new samples and what we have trained on, but raw similarity is not enough to tell you whether an item belongs in the distribution. The threshold or measure varies by situation: in some cases tiny differences matter, as small as a pixel; in others large differences can be ignored. There is no universal criterion; strictly speaking, anything outside the original training dataset is OOD. And as for the so-called underlying distribution, the Platonic ideal from which all images of chairs are presumably pulled: that is a fabrication; its existence can never be proven, only assumed.
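
To make the retroactive nature of that judgement concrete, here is a minimal sketch of one common heuristic: score a new sample by its distance to the nearest training point in some feature space, and call it OOD beyond a cut-off. The features and numbers below are invented for illustration; the point is that the cut-off is our choice, not something the data supplies.

```python
# Illustrative OOD heuristic: distance to the nearest training sample in some
# feature space. The features here are random stand-ins; in practice they might
# come from a trained network. Crucially, the cut-off for "too different" is a
# convention we choose -- nothing in the data dictates it.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
train_features = rng.normal(size=(500, 32))      # stand-in for learned features

index = NearestNeighbors(n_neighbors=1).fit(train_features)

def ood_score(x):
    """Distance from a sample to its closest training point."""
    dist, _ = index.kneighbors(x.reshape(1, -1))
    return float(dist[0, 0])

in_dist  = rng.normal(size=32)       # drawn the same way as the training data
far_away = rng.normal(size=32) * 5   # deliberately very different

print(ood_score(in_dist), ood_score(far_away))   # the second is much larger

# Where to draw the line? Any threshold is our decision, made after the fact.
held_out = rng.normal(size=(50, 32))
threshold = np.quantile([ood_score(f) for f in held_out], 0.95)
# "95th percentile of typical distances" -- still an arbitrary convention
```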

What is “important”?

Fundamentally, all a model can do during training is derive statistical patterns from what it is given, and assume that those patterns can somehow be used to recognise every instance. During this process, it must determine for itself which features are significant and which can be ignored. Unfortunately, there is no blanket rule for doing so. Neither the frequency of features in a sample nor the associations between them is a reliable metric for making this decision.

For example, given a dataset of images of apples, inverting the colours of an image completely changes that image’s colour distribution; yet we humans still find it easy to identify the apple in such an image by its shape. On the other hand, taking an image and moving its colour pixels around maintains a colour distribution close to the original, but produces garbage. We humans intuit that the arrangement of relative colour contrasts (the shape) is what matters, rather than the actual colours themselves. Of course, the situation would be different if the goal were to distinguish real trees from fake ones; then the colours would matter very much.

Two inverted images. You can recognise the apple despite the inverted colours, but it is harder to determine if the tree on the right is a real one (it is a fake blue tree, with colours inverted)
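
The colour-histogram argument can be checked with a few lines of NumPy. The “apple” below is just a coloured square, an assumption standing in for a real photograph, but the arithmetic is the same: inversion changes the colour distribution drastically while preserving the shape, whereas shuffling preserves the distribution exactly while destroying the shape.

```python
# A small numpy sketch: inverting colours completely changes an image's colour
# histogram, while shuffling its pixels leaves the histogram untouched -- yet
# only the shuffled image is unrecognisable to a human.
import numpy as np

image = np.zeros((64, 64, 3), dtype=np.uint8)
image[16:48, 16:48] = [200, 30, 30]     # crude red "apple" on a dark background

inverted = 255 - image                  # shape preserved, colours flipped

rng = np.random.default_rng(0)
pixels = rng.permutation(image.reshape(-1, 3), axis=0)  # colours kept, shape destroyed
shuffled = pixels.reshape(image.shape)

def hist(img):
    return np.histogram(img, bins=16, range=(0, 256))[0]

print(np.abs(hist(image) - hist(inverted)).sum())   # large: distributions differ
print(np.abs(hist(image) - hist(shuffled)).sum())   # 0: distributions identical
```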

Some features will be found consistently in the data; yet whether such features are necessary or incidental cannot be decided from their frequency alone. For example, image generators often misattribute feature significance, such as when ChatGPT was unable to produce pictures of nerds without also adding glasses:

Source: Reddit

This is because most nerds in web imagery are represented as having glasses, and it is difficult to explain to the AI, which only looks at the data distribution, that glasses are not really part of the definition of nerd; they are only highly correlated due to stereotyping and visual symbolism. (One can see how concerns over bias and discrimination in AI could arise due to a similar misattribution.)

A feature may be missing from many instances in the dataset and still be the critical one. Whether a laptop is from a particular manufacturer often comes down to noticing the small clump of pixels that constitutes its logo. This is especially important if you want the AI to identify newly released laptops that are not in the training set. You and I know to look for the logo because we know the significance of branding in business. If you cannot see the logo on a new laptop, you would say that the brand is “unknown” rather than hallucinating an answer (as AIs are wont to do). You have determined that the logo is important, even if the experiences from which you extracted that information were themselves inconsistent. This is why humans don’t need large curated datasets to train on, and can reliably learn even from a small handful of messy instances.

Without any input or contribution from humans to push a model in the right direction, it can’t reliably figure out which features to focus on and which are only incidental. To expect the model to generalise out of distribution in such cases is to expect it to get lucky or to be a mind reader; there is no principled reason to think that it can do so correctly. The prevailing conviction that an algorithm can somehow grok what is important from data alone is a sort of mysticism, as though the “aura” of the relevant entities, the ones we home in on, were somehow located in the material world itself.

Guidance through data and architecture

Even in successful ML models, training must always be guided by the makeup of the training datasets. This is why it is so important to train models on large, curated, balanced datasets³, so that they can eventually extract the patterns we ourselves know are important. Experience in training robots via behavioural cloning has shown that it is critical to select only ideal, high-quality demonstrations, so as to guide the model down a narrow channel of correct behaviour. Doing so merely circumvents the real challenge of robust generalisation by using the shape of the training data to push the model’s inference towards what we ourselves know is correct. It is more akin to specialisation than to generalisation.

Much of the recent progress with large models in NLP and computer vision has relied heavily on delicate strategies for curating pre-training and post-training data — π0: A Vision-Language-Action Flow Model for General Robot Control

To counteract the resulting narrow focus, the most productive teams make the training dataset so large and broad, and the model’s parameter count so high, that it becomes hard to ever step “out of distribution”. This is a brute-force approach. Not only is such data costly to collect and curate, it also needs to be continually updated as the world of facts around it changes. For most small ML teams, producing good datasets is an ordeal. Fortunately, that effort has proven valuable enough in narrow contexts to be worth the cost.

In addition to shaping the training data, teams have also succeeded by introducing architectural biases, such as a model that places less emphasis on colour and more on shape, or vice versa. The practice of tuning hyper-parameters can likewise shift the emphasis of the model one way or another, and has been a critical step in stabilising ML products. In still other cases, developers and researchers have unconsciously introduced their own domain expertise by iterating on various model architectures and parameters until they get the result they want⁴.

Convolutional neural networks are a good example of how we have injected our intuitions through architecture. They are intended to be translation-invariant: an item will be interpreted similarly regardless of where in the image it appears. (Strictly speaking, the convolution operation is translation-equivariant, and pooling then makes the overall response approximately invariant.) But there was no a priori reason to assume that interpreting images should be translation-invariant, and there are many cases where the assumption is wrong. Translation invariance is a human intuition, gained from experience, that we have injected into visual ML.
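
The sketch below shows what that baked-in assumption amounts to in PyTorch: a convolutional layer responds to a shifted input with a correspondingly shifted output, and global pooling then produces roughly the same summary regardless of where the object sits. The toy “object” and the shift sizes are arbitrary illustrative choices.

```python
# Hedged sketch of translation invariance baked into a conv + pooling stack.
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(1, 4, kernel_size=3, padding=1)
pool = nn.AdaptiveAvgPool2d(1)

x = torch.zeros(1, 1, 32, 32)
x[..., 8:16, 8:16] = 1.0                                  # a small square "object"
x_shifted = torch.roll(x, shifts=(5, 5), dims=(-2, -1))   # same object, moved

out = pool(conv(x)).flatten()
out_shifted = pool(conv(x_shifted)).flatten()
print(torch.allclose(out, out_shifted, atol=1e-3))  # ~same response either way
```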

Other models have attempted to introduce transformational invariance more generally, including to reflections and rotations of images. The validity of such transformations depends heavily on context. For example, the fact that most people write with their right hand doesn’t really matter for depicting writing; right and left hands are equally viable. But the same is not true for “port” and “starboard”: their relationship to the orientation of a ship is critical. Again, the dataset alone can’t tell you which of these is the case. If there are outliers in the dataset, the model can’t know whether they are errors to be purged or rarities that should be folded into the rule. More context is needed, or a human to guide the model aright.
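
In practice this contextual judgement is encoded by hand, typically in the data-augmentation pipeline. The torchvision sketch below is purely illustrative: a human decides that horizontal flips are harmless for one task and forbidden for another; nothing in the dataset itself makes that call.

```python
# Two hypothetical augmentation pipelines: the human-chosen transforms are where
# the judgement about "which invariances are valid" actually lives.
from torchvision import transforms

writing_hands_augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),   # left and right hands equally valid
    transforms.RandomRotation(degrees=10),
    transforms.ToTensor(),
])

port_starboard_augment = transforms.Compose([
    # no horizontal flip: it would silently swap the very labels being predicted
    transforms.ColorJitter(brightness=0.2),
    transforms.ToTensor(),
])
```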

However, if you end up injecting your own biases into the architecture, as popular AI products do, the system breaks down when the assumptions, needs, or data distributions of the task change over time. For every correction you introduce, there will be exceptions that require their own corrections, and so on. This is why we see so many mistakes made at the “bleeding edge” of an AI.

Sutton was wrong when he proclaimed that modern ML has moved away from hand-crafted features; we have entered a new era of hand-crafting, only this time we are tweaking datasets and architectural assumptions.

The underlying distribution is causal

How, then, do we humans figure out which features are the important ones to focus on when generalising? Simply by bringing “importance” to the interaction ourselves, rather than finding it in the data. This makes sense: importance depends on what you value and on its relation to the world around you. Your needs and interactions will always be the final arbiters of what counts as correct generalisation. No amount of data, for example, could have told you that fire in the shape of a chair would not qualify as a chair, unless you also took into consideration the interactions you wish to have with it; or that having four legs wasn’t strictly necessary as long as the shape of the chair offered a stable place to rest.

Our frustration with AI failures always arises from the AI’s inability to recognise what is important to us. Yet how could it? A model trained on a dataset of images has no corporeal body, nor any desire to sit, and it cannot gain the same values as us merely by looking at images of chairs. It is always making decisions from “behind a pane of glass”, so to speak; it cannot touch the subject matter to verify whether its generalisations were correct. Without the opportunity to try out hypotheses in an embodied way, any time a model successfully generalises it is either due to luck or because its trainer injected subtle suggestions into the system’s architecture.

The so-called “underlying distribution”, the Platonic ideal we were searching for, is really derived from our embodied interactions with the subject, from what is useful to us. And, as argued elsewhere, to understand how the world can be useful to you, you must build causal models. For example, an elevator button is “important” because it causes an elevator to come; thus, when people are asked to draw an elevator, they don’t forget the button. Causation enables you to think about utility; it lets you build an understanding that focuses on those aspects of your experiences that are material to your use cases.

Our brains are not wired to do probability problems, but they are wired to do causal problems. — Pearl, The Book of Why
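
A toy structural model of the elevator example makes the contrast concrete. The probabilities below are made up; the point is that the button’s importance lives in the causal link between pressing and arrival, something you discover by intervening, not by tallying pixels.

```python
# A toy structural causal model of the elevator example. Purely illustrative;
# the 0.05 failure rate is an invented number.
import random

def simulate(press_button: bool) -> bool:
    """The elevator arrives if (and only if) the button is pressed, minus a little noise."""
    return press_button and random.random() > 0.05

# Intervening on the cause changes the outcome -- that is what makes the button
# worth noticing (and drawing), however few pixels it occupies in an image.
random.seed(0)
with_press = sum(simulate(True) for _ in range(1000)) / 1000
without    = sum(simulate(False) for _ in range(1000)) / 1000
print(with_press, without)   # roughly 0.95 vs 0.0
```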

The only distributions useful for generalisation are causal ones, and these cannot be extracted from static datasets. The expectation that we will eventually be able to generalise cleanly from training data rests on the fallacy that the correctness of this generalisation can be determined from inside the data distribution, without adding anything to it. The fantasy arises from the idea of an “underlying distribution” that claims to be the transcendental source of the data, and which, once tapped into, grants insight into the data itself. Ultimately, we are asking ML models to do the impossible: to learn, without interacting with the world, the same causal intuitions that we gain from embodied experience.

¹ Even foundation models (general-purpose models) are not actually general, in the sense that they must still be fine-tuned to be useful outside of their training data.

² We don’t realise these limitations of our intuitions, because the black-box nature of Neural Networks allows us to rely on our gut feelings without actually being forced to analyse them by digging into the details.

³ The idea of a “balanced” dataset is subjective; humans decide case by case what qualifies as “balanced”. Similarly, the assumption that datasets are independent and identically distributed (IID) is an attempt to retrofit messy reality into a statistical form that suits our models, because it matches their assumptions.

⁴ This is a consequence of the behaviourist principles at the root of ML: if you rely only on how a model behaves, you will tend to move the model towards success only in the narrow space in which you are testing it.
