The need to constantly micromanage AI
Why the responsibility for ground truth always falls back into human hands
Last week, the US government announced a 700 million dollar investment with the goal of producing quality data for a military intelligence program called Maven. The project sought to create “ground-truth” datasets that could be used to train AI to recognize friend-or-foe designations across various sources of military intelligence, including satellite and aircraft imagery. The money was earmarked for manual labelling: paying human labellers to encode what they believe the images contain, such as tank movements or other content of military import.
The significant dollar amount of this investment reinforces what we have long known: there is a dearth of data available for training Machine Learning (ML) models. Modern ML is notoriously data-hungry, and the lack of quality data has long been one of the inhibitors of rapid progress in the field. The investment in Maven reflects one of two approaches for dealing with this resource constraint — that is, generating new datasets. The other is to make the models more data-efficient, i.e. to do more with less.
What has not been proposed or attempted is for agents to define their own ground-truth datasets, that is, to label their own images, just as humans do when faced with the task. Only human researchers are permitted to collect their own data, clean it up, structure it correctly, and finally push it through some statistical computation pipeline. AI models cannot be trusted to sift through the images by themselves and compile their own ground truth.
This latter suggestion sounds both implausible and undesirable. We humans consider ourselves the “measure of all things”, that is, the source of all foundational truth. Whenever a decision must be made as to what to include in a dataset, we don’t believe AI is fit to make the call, to select which samples are valid and filter out those that aren’t, without at least some initial human guidance. The AI may infer conclusions based on that initial setup, but we humans are the only ones permitted on the ground floor: at some point the decision must always come back, directly or indirectly, to a human evaluation and judgment call. Each time doubt creeps into a model’s performance, it is left to a human to prepare the necessary corrective. This didactic process, of course, leaves the AI helpless outside the data we give it. The models become fragile, always dependent on human guidance and intervention.
Any AI that did claim the authority to define its own datasets — i.e. truth — could, by that fact alone, also plausibly claim to possess human rights in some sense; and that is a thorny and nuanced issue. So for the moment we prefer to clone our own truth onto AI, as a form of employee training. As with any piece of software, our hope is to create synthetic substitutes for our decision-making that do what we want them to. The need to create labelled datasets reflects the need to transmit our wisdom as binary packages into AI. Yet rarely do we pause and try to determine where that wisdom came from or how it was created by us.
This is no small omission. Whoever defines the datasets defines, to a large degree, the outcome; a fact that is centre-stage in debates over data bias. Despite the common wisdom, numbers can lie, and statistics has gotten a bad rap — worse than damned lies, I hear — due to the influence initial data decisions can have on the output. Which images do or don’t count as “suspicious enemy behaviour” must be made explicit in the dataset, and those early designations determine the statistical correlations that will be subsequently discovered in the data.
Why can’t an AI make these decisions itself? What is preventing an agent from being the creator of its own datasets? Humans admittedly have access to a larger universe of experiences than AI do; we live in the real world. We have conversations and refer to events of which AI have no awareness. With this in mind, one viable option may be to move away from tailored, supervised datasets toward more open Reinforcement Learning (RL) agents. RL, in principle, may invite agents to collect their own datasets through unguided exploration of an open environment (although this possibility has yet to be seriously studied).
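To make that RL option concrete, here is a minimal sketch of what “collecting its own dataset through unguided exploration” could literally look like. Everything in it (the grid world, the reward rule, the record format) is a hypothetical toy, not a claim about how Maven or any real system works; the point is only that what such an agent gathers is a raw stream of transitions, with no labels attached.

```python
import random

def explore(steps: int = 1000, size: int = 10, seed: int = 0):
    """Let an agent wander a toy grid world and log every raw transition it experiences."""
    rng = random.Random(seed)
    moves = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
    x, y = size // 2, size // 2          # start in the middle of the grid
    experience = []
    for _ in range(steps):
        action = rng.choice(list(moves))             # unguided: purely random exploration
        dx, dy = moves[action]
        nx = min(max(x + dx, 0), size - 1)
        ny = min(max(y + dy, 0), size - 1)
        reward = 1.0 if (nx, ny) == (size - 1, size - 1) else 0.0   # arbitrary goal corner
        experience.append(((x, y), action, (nx, ny), reward))       # raw, unlabelled record
        x, y = nx, ny
    return experience

raw_stream = explore()
print(len(raw_stream), "transitions collected; zero ground-truth labels assigned")
```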
The question of how an AI can determine its own ground truth reduces, in the end, to the question of how RL-type agents can create discrete, all-or-nothing information. Creating ground truth is always an act of selection, filtering, and restructuring of raw, real-world experiences: grouping together what is important, deleting what is noise. This transition from open exploration to structured, discrete “truth” is what is in question here.
Is it entirely out of the question for us to trust an AI to decide whether an image shows a friend or a foe? Surely the data is there, merely waiting to be collated in some way. If we go by the assumption that ground truth arises directly out of the data itself, then nothing but the data matters to the outcome. On this assumption a dataset is fundamentally the same thing whether it includes every experience an agent has (as is the case in RL) or some curated subset of those experiences (as in supervised or semi-supervised models); data is merely what is available to the agent via its inputs. All that is required for an agent to define its own datasets is a physical body and some data transformation function.
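Stated as code, that assumption looks something like the sketch below. The record fields, the 0.8 threshold, and the friend_or_foe rule are all invented for illustration; the point is that on this view a dataset is just whatever falls out of a fixed transformation applied to the raw inputs, and everything contentious hides inside that hand-written transform.

```python
from typing import Callable, Iterable

RawRecord = dict        # e.g. {"pixels": ..., "heat_signature": 0.92, "speed_kph": 40}
LabelledSample = tuple  # (features, label)

def build_dataset(stream: Iterable[RawRecord],
                  transform: Callable[[RawRecord], LabelledSample]) -> list[LabelledSample]:
    """The naive view: ground truth = transform(raw experience), nothing more."""
    return [transform(record) for record in stream]

def friend_or_foe(record: RawRecord) -> LabelledSample:
    # The whole judgment call lives in this rule, not in the data:
    # who decided that 0.8 is the line between friend and foe?
    label = "foe" if record.get("heat_signature", 0.0) > 0.8 else "friend"
    return (record, label)
```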
Yet is this true? It seems like there is more to a dataset than merely a refinement of our total experiences. The selection process is not trivial. It is not just a question of dumping a data-lake’s worth of information onto a model and hoping it will supervise itself and create a dataset. Were this the case we would not be spending millions of dollars doing it manually. A dataset is ultimately an opinion (e.g. “friend or foe”), one which could be made differently depending on who you ask, and when you ask them. It is not impartial. There is a strong connection here between questions of moral alignment, such as between humans and AI, and the process of creating discrete datasets from a stream of empirical inputs.
What is missing is a theoretical framework that defines the move from a raw stream of empirical data to the active selection of datasets. It is wrong to treat them as interchangeable, as though all the necessary information for determining the latter was available in the former, and all that is needed is a transformation function. They are separate, standalone entities, created by humans, and that creative process influences the outcome. The exercise of transforming one into the other could even teach us something important about the nature of human cognition.
From the above, a few of the challenges involved in this task have already revealed themselves:
- How does the AI become the arbiter of discrete knowledge/concepts?
- How can the AI judge in cases of ambiguity and contradiction, in order to establish its truth?
- How can it create novel concepts and ideas?
- How does it reason about what it has experienced?
(The links above point to posts that deal with these topics directly.)
There are two general directions to take this line of inquiry. The first, which was already mentioned, and which fits the modern data science paradigm, is to assume that since truth is objective, all that is required is to filter, transform, and collate what is given in experience into a dataset. But since defining the data samples in question is always arbitrary, this approach, though intuitively appealing, is not viable. For example, there is no objective way to decide whether an outlier is valuable or whether it should be excluded. There are cases when you may want to include “errors” in your dataset, such as when the goal of the model is to detect those errors. The concepts that define the data are determined by the utility of the task to which it is being put.
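A short sketch of the outlier point, with hypothetical records and an equally hypothetical notion of “outlier”: the same raw rows produce two different datasets depending on what the model is for, and nothing in the data itself arbitrates between the two.

```python
def is_outlier(record: dict, k: float = 3.0) -> bool:
    # Hypothetical rule: a reading far from its batch mean counts as an outlier.
    return abs(record["reading"] - record["batch_mean"]) > k * record["batch_std"]

def curate_for_typical_behaviour(records: list[dict]) -> list[dict]:
    """Modelling normal activity: outliers are treated as noise and dropped."""
    return [r for r in records if not is_outlier(r)]

def curate_for_anomaly_detection(records: list[dict]) -> list[dict]:
    """Training an anomaly detector: the outliers are the signal, so keep and label them."""
    return [dict(r, label="anomaly" if is_outlier(r) else "normal") for r in records]
```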
This leads to the other option, which is to tie the definition of datasets to what is useful to the agent in question¹. The act of creating a dataset is a cognitive task in itself; it can be performed better or worse, according to some benchmark. We should therefore ask why the agent would engage in this act to begin with. Creating a dataset must have a goal or intention behind it, such as solving a business problem or creating a public resource other teams would find useful. This is the reason the US government is committing millions of dollars to creating them. The nature of the dataset will be determined by the task to which it will be applied:
“Data labeling is the process where the human actually identifies the object, and then, in a way that is understandable by the model, informs the model. And so you have to actually label it in a very specific way” — Vice Adm. Frank Whitworth, Breaking Defense
A dataset is not just data, but structured data. For every value it contains, there is a property to which it is assigned, such as “geo-spatial location”, “terrain type”, or “type of equipment”. The choice of which properties are included, and how each is defined and measured, is where the real work of translating continuous experiences into discrete datasets occurs.
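In code, that work shows up before a single value is recorded: it lives in the schema. The sketch below borrows field names loosely from the properties mentioned above, but the categories, types, and units are all assumptions made for illustration; each one is a decision someone had to take on the dataset’s behalf.

```python
from dataclasses import dataclass
from enum import Enum

class TerrainType(Enum):
    # Someone decided that these four categories exhaust "terrain type".
    URBAN = "urban"
    DESERT = "desert"
    FOREST = "forest"
    COASTAL = "coastal"

@dataclass
class LabelledImage:
    image_id: str
    latitude: float        # why decimal degrees rather than a grid reference? a choice
    longitude: float
    terrain: TerrainType
    equipment_type: str    # free text or a controlled vocabulary? another choice
    is_hostile: bool       # the entire "friend or foe" question collapsed into a single bit
```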
At every step in this process of creating datasets — how properties are defined, how they are selected, how they are measured and stored — there are decisions being made which entrench the author’s values into the outcome. And as many have noted, each step introduces the possibility for bias. In fact, to say that bias may be introduced is almost misleading; it implies the existence of a dataset that doesn’t contain bias. In reality datasets are just artifacts that record your biases, where bias is defined as “the act of shaping the representation to suit your purposes, or the purposes of the group”.
When compiling a dataset, you don’t just invent a property out of whole cloth and expect everyone to go along with it. It must be agreed upon beforehand as part of social discourse. There is a communal act of definition at work here. We humans at some point decided to create concepts like “location” or “asset”. A dataset is a format of alignment between the teams who will use it², and the creation of any explicit knowledge, datasets included, is part of a larger act of communication.
As data scientists, we rarely think of datasets as social artifacts, even though they must be so, since they are created in a social setting for communal purposes. Rather we tend to look transparently through them as representations of truth. This cuts us off from determining how they could be created by an entity other than ourselves. By trying to remove the subject from the equation (aka by being objective) we end up removing the creator, and thus making it impossible to understand and imitate the act of its creation.
Few people have addressed this tension between reality and data, and the subjectivity entailed in every choice, as well as Bill Kent in his influential book Data and Reality:
The information in the system is part of a communication process among people. There is a flow of ideas from mind to mind.
[Data structures arise] not by any natural law, but by the arbitrary decision of some human beings, because the perception was useful to them, and corresponded to the kinds of information they were interested in maintaining in the system.
We are not modeling reality, but the way information about reality is processed, by people. — Kent, Data and Reality
For an AI to define its own ground truth through an act of exploration, that truth must first be recognized as a subjective, inter-agent construct. It cannot come out of the raw data of experience itself; it must be created through the interactions of the agent with other agents. There are no fixed, static, or objective datasets in the world of an agent; each must be a motivated choice, subject to change when those motives change.
Consequently, an agent would only be allowed to create its own datasets if there were trust between it and other agents, including humans, regarding the motivations that drive the selection. To this end it must put in the effort to align itself, as the rest of us do, to mutually agreed upon (aka reasonable) standards of correctness. Since the underlying motive for defining a dataset determines what will and will not be included, an AI must become a good citizen before it can become a good curator.
Despite the potential benefits of this approach, it is, of course, largely neglected. We sense, intuitively, that the series of steps from an unbounded, exploring agent to one that can compile and communicate its datasets will be long and costly. An AI with the mind of a toddler does not make for a good employee, and we don’t have 18 years to wait on an investment that may or may not deliver. The long journey of laying down solid roots for intelligence is simply not cost-effective, and so it is tempting to jump right to training at higher levels, leaving dataset creation to us humans.
These shortcuts come at a price: fragility, the inability of AI to adapt to unusual situations, and, more importantly, the loss of a real understanding of what human cognition is. When LLMs have nothing but our curated text (and images) to run on, they have no frame of reference by which to correct their errors, no way to double-check the “wisdom” they spout by holding it up against reality. Engagement with the real world is a set of skills from which we have shut them out. The important decisions were made before the AI ever arrived on the scene, and any self-correction it performs can only happen within the domain of the data as it was given. It is no wonder, then, that the resulting agents behave like children: hallucinating without self-restraint, making suggestions without measured forethought, and with no grounding in the greater truth that surrounds them.
¹ To be clear, what is useful is not random or unconditioned; it is tied to the reality that causes the agent to define its utility.
² Even when we have no intention of giving a dataset to anyone except the agent being trained, we expect the benefit to be something others can ultimately understand. In this sense, AI are substitutes for humans; we communicate to both through datasets.