AGI: Beyond Reward Maximisation

How reward maximisation keeps AI narrow

From Narrow To General AI
19 min read · Mar 11, 2023

I have a small garden in front of my house. Most evenings I step outside and remove weeds from between the zinnias.

At first, weeding was an obligatory part of maintaining my garden. It was a means to an end. Somewhere along the way, I began to enjoy the act of weeding itself. Instead of being a chore, it became a daily ritual and source of satisfaction.

One day, I stepped outside and found there were no weeds to remove. I should have felt relieved. Instead I resented my garden for depriving me of the joy of uprooting weeds. The pursuit now contradicted the original goal — I wanted to find weeds so I could enjoy removing them.

Paradoxical Motivations

You can likely relate to this peculiar situation in some aspect of your own life. Whether it’s exercising to improve your physique, reading to broaden your understanding, building a new skill, or growing your financial wealth, you may discover one day that the original ambitions have taken a backseat. The sub-goal takes on a life of its own until it starts to undermine its original purpose; for instance, you may work excessively, earning money far beyond what you need to live well.

In this way, your everyday experiences appear to contradict the core argument of a 2021 paper on Artificial General Intelligence (AGI), titled “Reward is enough”. Its authors suggest that the act of maximising base-level, built-in rewards within the framework of Reinforcement Learning (RL) is a sufficient foundation for the emergence of human-level intelligence. All cognitive and behavioural abilities, like language, social intelligence, riding a bike, eating, and survival can be derived by an agent from a basic set of motivating goals. In short, we have every theoretical tool we need to build AGI:

Hypothesis “Reward-is-Enough”: Intelligence, and its associated abilities, can be understood as subserving the maximisation of reward by an agent acting in its environment. — Reward is Enough

…the generic objective of reward maximisation contains within it many or possibly even all the goals of intelligence. — Reward is Enough

There’s a lot to admire about the paper. It has a grand vision: to unify language, planning, logic, perception, etc. into a single integrated RL agent. As the agent learns to cope with the complexities of its environment it will accumulate a catalog of nuanced, generally intelligent behaviours.

Somewhere along the way, however, the paper makes the unwarranted leap from “reward” to “reward maximisation”. Its subsequent emphasis on reward maximisation introduces a theoretical flaw which prevents it from being a practicable route to general intelligence. This post will argue that reward maximisation is actually what keeps narrow AI narrow.

Teleological Policies

Reward maximisation is a teleological policy. It defines an end goal or goals, then draws the entire system towards it. If you assume this to be the active principle in human learning, then reward maximisation (through regression) shapes every conscious thought and muscular movement in day-to-day life towards that goal, like a magnet pulling iron filings.

Image from Reward is Enough showing a squirrel and a robot both maximising rewards, and consequently learning actions, perceptions, imagination, etc.

This hypothesis is not necessarily controversial. Maximising utility has long been the theoretical foundation of research in many branches of AI. From the popular textbook, AI: A Modern Approach:

According to what we have called the standard model, AI is concerned mainly with rational action. An ideal intelligent agent takes the best possible action in a situation. We study the problem of building agents that are intelligent in this sense.

The modern notion of rational decision making under uncertainty involves maximizing expected utility — AI: A Modern Approach, Fourth Edition

And in practice this approach has proven useful, at least for narrowly defined tasks. But as a mechanism for achieving general intelligence, or for describing human intelligence, it runs into five critical challenges. Together, these suggest the theory needs to be updated if it is to fulfil its grand ambitions. For even granting the authors the widest latitude in interpreting what “reward maximisation” means, we encounter the first problem right out of the starting gate:

#1: What rewards, exactly?

Throughout “Reward is enough” the authors hint at a set of rewards that are supposed to drive all human learning. But they never explicitly list them, even suggesting that their specifics are not important. This omission is understandable — they weren’t certain what those rewards should be.

Why are they so hard to discover? If human behaviour and intelligence are in fact centred around maximising a core set of rewards, and maximising them is the ultimate purpose of a person’s existence, then it should be an easy task for anyone to discover what these rewards are. As the leading strings of life, they would hold your attention and command a singular focus. Not being aware of them, or being mistaken about them, would be the greatest factual error a person could make.

And yet two thousand years of philosophical, religious, and scientific scholarship have demonstrated that the question of what humans fundamentally want is not an easy one to answer. This strikes me as odd: we understand the properties of quarks and bosons, but have difficulty figuring out what humans inherently want. That should be the first thing we ever learn.

Reading between the lines of the paper, the authors tentatively propose a combination of hedonism and evolutionary adaptation as candidates for this purported “ultimate” goal. A moment’s reflection on your own life will show that neither of these is able to meet the necessary criteria.

Although hedonism — the attainment of food, water, sex, and physical comfort — surely has its attractions, anyone who has spent a day or two in such a languid situation knows how emotionally draining it can be, rapidly devolving into ennui. But why should this be, when their attainment is supposed to be the meaning of life itself? What more could anyone want?

An RL agent grounded in hedonism that achieved such a lascivious state would have little reason to so much as move a muscle. Its rewards would be at their maximum. And no doubt, experience shows that hedonistic rewards do in fact drive human learning. However, it’s wrong to conclude that maximising these signals is the self-evident and incontestable “meaning of life”. They are at best somehow involved in the process.

This becomes blindingly obvious when you consider that no one actually seeks to “maximise” their food intake — that’s just absurd. You eat when you are hungry, and you stop eating when you are full. And the same is true for all hedonistic motivations.

As for evolutionary adaptation, although natural selection pushes groups and species towards outcomes like survival and reproduction, that does not mean those outcomes drive, or even can drive learning in individuals themselves. The authors of the paper seem to be aware of this, though they draw hazy lines between innate vs. learned knowledge. In perhaps an unconscious slip-up, they propose “maximising survival time” as a reward¹, despite the fact that if a negative signal is only triggered when the agent fails to survive — i.e. it dies — it would be too late for it to learn anything. I’ll assume this is meant to apply to the agent’s genetic code, not its day-to-day learning.

The authors recognize that in practice you and I are driven by a broad array of loosely connected goals and rewards. These range from specific and nuanced ones, like completing an academic paper on time, to far-reaching ones, like avoiding pain. However, they skip over the mechanism by which base-level rewards connect to the higher-level motives that support them. Throughout the paper they maintain a vague optimism that integrating multiple motivational contexts is easier to do with a single RL goal than with several:

…implementing abilities in service of a singular goal, rather than for their own specialised goals, also answers the question of how to integrate abilities, which otherwise remains an outstanding issue. — Reward is Enough

But the fact is that as yet RL has only been successfully applied to narrow problem environments, not diverse ones.

The problem of integrating multiple problem contexts (aka tasks) is one I’d like to focus on, because teleological policies struggle to address a second issue:

#2: When should the agent learn?

A common challenge for teleological models, ones in which the whole system is retroactively adjusted based on ultimate success or failure signals, is that there is no easy way to determine when to make the adjustment, that is, when to reward or punish the system. How does an agent connect a given reward to its cause when the cause may have been experienced at any time before it? This is known as the credit assignment problem.

The conventional RL training loop
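
To make the difficulty concrete, here is a minimal sketch of that conventional loop in Python, assuming a simplified, Gym-like environment interface; the names env, agent, and their methods are illustrative, not taken from the paper. For a game like Go, the reward arrives only when the episode ends, yet every earlier move must somehow share in the credit or the blame.

```python
# Minimal sketch of the conventional episodic RL loop (illustrative names,
# not from the paper). In a game like Go the reward arrives only when the
# episode ends, so every earlier (state, action) pair must be credited or
# blamed retroactively: the credit assignment problem.

def run_episode(env, agent):
    state = env.reset()
    trajectory = []
    done = False
    while not done:
        action = agent.act(state)                    # policy picks an action
        next_state, reward, done = env.step(action)  # reward is usually 0 until the end
        trajectory.append((state, action, reward))
        state = next_state
    agent.update(trajectory)   # one global regression over the whole episode
    return sum(r for _, _, r in trajectory)
```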

To explain this, let’s take an example referenced in the paper: the game of Go. In games like Go, the key moment of learning seems straightforward — it occurs when the agent wins or loses.

As humans, however, we are not programmed at birth to want to win at Go. Nor can you and I expect to receive an objective external signal for when to perform a “regression” on winning. Rather, as you’re building up your goals and motivations from scratch, you must at some point learn to characterise certain Go situations as “successes” or “failures”. Experience will teach you that winning a game is often better than losing, saving a playing piece is better than forfeiting it, and all the relevant exceptions to these rules.

This is trickier than it sounds. That a game is even being played, that it ends, and that the ending represents some significant moment of “reward” is something the players tentatively define through agreements outside the game itself. Players who invent their own rules can define this moment differently. Some games, like World of Warcraft, never end, so you must select moments of “reward” at a finer granularity, even down to mini-events like gaining a valuable item.

Seemingly tangential motivations also play a role in how you judge success in a game. By playing ruthlessly, you may achieve a short-term victory; but if this triggers backlash from your friends, you may reconsider the value of your actions from an interpersonal standpoint, albeit not from a tactical one. A person can even choose not to believe he lost a given game. He may grade his success on another set of criteria, such as a “moral victory”, or teaching a beginner how to play, or making friends. Aesop’s fable of The Fox and The Grapes shows us how subjective success and failure can be.

Each of these cases represents a sub-goal or sub-motivation that drives its own learning within its respective domain. Each is characterised by an underlying problem, which in turn defines its own context of rewards, despite the fact that they all deal with the same stimuli or set of states: the game board. So the same situation (state) can be a success, or a failure, or both at the same time. In Q-learning terms, the Q value of a given state is multivalent.
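
One way to picture this multivalence is to key value estimates on (context, state) pairs rather than on states alone. The snippet below is purely illustrative; the contexts and numbers are my own assumptions, not anything proposed in the paper.

```python
# Illustrative only: the same board state valued differently under different
# problem contexts, by keying estimates on (context, state) pairs.
from collections import defaultdict

values = defaultdict(float)                     # (context, state) -> value estimate

board = "resigned_after_move_40"                # one and the same stimulus...
values[("win_the_game", board)] = -1.0          # ...a tactical failure,
values[("teach_a_beginner", board)] = 1.0       # a pedagogical success,
values[("keep_the_friendship", board)] = 0.5    # and a partial social success.
```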

There may be an infinite number of such learning contexts, and you create new ones all the time. Each of them is derived from a more fundamental motive — e.g. winning a game is based on proving your skills and intelligence to others, which is itself based on the desire for social approval or personal achievement, and so on.

Every problem is then paired with a set of recognized solutions or states to aim for; e.g. the sight of weeds is paired with the prospect of a clear garden, or the dread of homelessness with a full bank account. This is where “positive” rewards first make their appearance. Together with the problem they form a situation-driven model.

As in classical Pavlovian conditioning, the plans you make and physical actions you take are tailored to the requirements of each problem context. When gardening you may see a weed (problem), then figure out how to remove it (learning) so your garden is clean (solution). This is true even for built-in, physiological rewards, from which you derive the others: e.g. you are hungry (problem), you eat (learning), so you feel satiated (solution).

Only when you finally solve a problem do you automatically update your mind, or “learn”. Thus the timing of regression is straightforward. It begins when you perceive a problem, and ends when you experience one of its solutions. Between these two, the problem draws the agent’s focus single-mindedly towards a solution, iterating until it’s found. Each problem is its own mini-reward and context of learning. It need not wait for the overarching reward to be signalled before it records a success².

Contexts of learning, each with its own problem, solution and regression
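
In rough pseudo-Python, one such context might look like the sketch below. Every name here (problem_detected, solution_reached, scope, and so on) is an illustrative assumption of mine, not an interface from the paper or from any existing library.

```python
# Sketch of one situation-driven learning context. Regression begins when the
# problem is perceived and ends when one of its solutions is experienced; the
# update is confined to this context's own scope. All names are assumptions.

def handle_context(context, env, agent):
    if not context.problem_detected(env.state):
        return                                        # no problem, no learning
    episode = []
    while not context.solution_reached(env.state):
        action = agent.act(env.state, scope=context)  # focus narrows to this problem
        episode.append((env.state, action))
        env.step(action)
    agent.update(episode, scope=context)              # local "mini-reward" regression
```

Each call is its own mini-episode: the problem supplies the start signal, the solution supplies the reward, and the update never leaves the context’s scope.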

Of course, segregating contexts of learning may also isolate them from their creators to the point the child goal may contradict its parent. This is what happened in the weeding example that began this post. On the plus side, the much-bewailed credit assignment problem with which we started this section now becomes moot. Credit is easy to assign since the sub-goal is acting on its own terms, “selfishly”, with its own scope of regression³. It is not part of a global agenda of hill-climbing and optimization.

To be frank, a single, all-consuming agenda of reward maximisation depicts us humans in too flattering a light. A reward maximising agent is one that is monomaniacally hammering away to discover a new peak in its basic rewards, endlessly following an incessant drumbeat of batch learning. Although this may be a useful way of framing formal computational problems, and well-suited to a computer, it stands in stark contrast to actual human nature, which tends to…

#3: Satisfice

To satisfice is to accept the first good solution, then tread that path until there is compelling reason to learn something new.

Herbert Simon… [showed] that models based on satisficing — making decisions that are “good enough,” rather than laboriously calculating an optimal decision — gave a better description of actual human behavior — AI: A Modern Approach, Fourth Edition
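
The difference is easy to state in code. The toy sketch below contrasts the two policies; the aspiration threshold is an arbitrary choice for illustration.

```python
# Toy contrast between maximising and satisficing. The aspiration threshold
# is an arbitrary illustrative choice.

def maximise(options, value):
    return max(options, key=value)        # evaluate everything, pick the peak

def satisfice(options, value, good_enough=0.7):
    for option in options:                # accept the first acceptable option
        if value(option) >= good_enough:
            return option
    return None                           # nothing qualifies: keep the old routine
```

Note that the satisficer stops evaluating as soon as an acceptable option turns up; the cost of search is bounded by the problem, not by the size of the option space.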

Let’s face it, you and I only solve problems when we have to, or we think we have to. Few of us pretend to such lofty aims as fully maximising our benefits. Rather, we react to pressures such as fear, boredom, anxiety, stress, competition, and social pressure — and, more precisely, to their causes. John Locke put it succinctly 350 years ago:

The greatest positive good determines not the will, but present uneasiness alone. — John Locke, Essay Concerning Human Understanding, Book II, Chapter XXI, §35

Every act of exploration and learning must have a motive that drives you out of your safe, sedentary routine. Any detour off the beaten path is at best tiring, and at worst fatal. Even if you do find yourself compelled to take such a detour, the scope of your exploration will be bounded by the original stressor. You look for a job because you are unemployed — once you are employed, you stop looking for one. Only when your job becomes unaccommodating, or the old routine ceases to satisfy do you take a risk and try something new.

The same is true when playing a game like Go. You put in as much effort as is required to win, and not much more. If a more competent player comes along or you’re starved for intellectual stimulation, you may have a reason to break out of your routine. In the absence of such a spur, most of us are content to live a life of ease and habit.

Looking back, you may have noticed that the notion of satisficing and the situation-driven approach are identical. In both cases the trigger for learning is an immediately perceived or imagined problem. The effort exerted, and the risks taken are bounded to the scope and timeframe of that problem.

In contrast, a teleological model of reward maximisation assumes the problem-solving context is ubiquitous and always active. You are never free of an overriding feeling of anxiety. In such a world “workaholism” would be the universal human condition, and its absence would be a sign of mental illness.

You could argue that such a “workaholic” agent would be superhuman, i.e. us, but better. And within a narrow scope that might be true. But in the general case, reward maximisation is what keeps narrow AI narrow. A general AI must have the option of choosing what goals are appropriate to its current situation, which means it must have the option of choosing not to pursue any goal at all.

This latter characterisation of agent behaviour more closely resembles what we know about the human condition: that it is lazy and risk averse. Once you or I discover an approach that works, and for as long as it works, it remains routine⁴. There is little reason to explore and try to find a new “maximum”, much less a global one. In a dangerous and pitiless world, this is the only safe way to live, because…

#4: Exploration is too risky

When training an RL agent you have the luxury of working in a simulated environment such as XLand or Gymnasium, one in which the AI can cycle through randomised actions for an extended time, with reckless abandon. Its only goal is to increase some aggregate success metric. And the simulation can be reset if the consequences prove disastrous.

In real life you can’t afford to die a few times before learning that knives and cliff edges present an existential threat. Faced with a world that gives few second chances, exploration must be safe and above all necessary. Maximising rewards is therefore an extravagance. As priorities go, it pales in comparison to minimising risk.

Maximising rewards is an extravagance; it pales in comparison to minimising risk.

Despite the explore/exploit paradigm prevalent in RL training, no one actually randomly tries new restaurants during an ‘explore’ phase of his or her life. You wait until you’re dissatisfied with the usual one, or it is closed, then you search around for what you deem to be the next-best option. When found, you make a plan and drive to the location. If you like what you find, it may become the new norm. And if no good option is available, you fall back to the tried-and-true.
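
In other words, exploration is triggered by dissatisfaction rather than scheduled at some fixed epsilon rate. A rough sketch, with every function argument assumed purely for illustration:

```python
# Sketch of dissatisfaction-triggered exploration, as opposed to exploring at
# a fixed epsilon rate. All function arguments are assumed for illustration.

def choose_restaurant(usual, alternatives, is_open, is_satisfying, estimated_quality):
    if is_open(usual) and is_satisfying(usual):
        return usual                                      # exploit the routine by default
    candidates = [r for r in alternatives if is_open(r)]  # bounded, deliberate search
    if not candidates:
        return usual                                      # fall back to the tried-and-true
    return max(candidates, key=estimated_quality)         # the deemed next-best option
```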

These flaws in the explore/exploit paradigm have had negative ramifications on practical AI. In my day job I work in the Machine Learning department of a robotics company. The idea of allowing a half-ton robot arm or an autonomous vehicle to explore a peopled workspace by flailing randomly is generally frowned upon. It’s one of the main reasons RL has yet to make significant inroads in robotics. We tend to prefer its safer cousin, Behavioural Cloning.

In everyday life, the suggestion that humans explore by taking random actions is fundamentally nonsensical if you remember what “random” truly means — a seizure or convulsion, a random shaking of the muscles. Except perhaps in early infancy, such actions have limited value, and they are not a viable means of exploring nuanced problem spaces, such as the construction of a nuclear weapon.

Instead, in all cases where you encounter a problem you first create a goal, then test it out and see if reality will align with your intentions. Exploration is therefore akin to verifying a hypothesis or a plan. “Randomly” exploring new restaurants is in fact a contradiction in terms — if you’re only looking for open restaurants, it’s not really a random action is it?

This is not to say you won’t occasionally deeply analyse and explore some narrow, safe task — such as theoretical math or devising a new sorting algorithm — and dig through every possibility to maximise a selected outcome. But you only do this if you are feeling enterprising (a response to the problem of boredom), or are expected to be mathematically rigorous (a response to the problem of social reproach), or are perhaps envious of your peers’ relative success (a response to the problem of competition). In other words, you consciously engage in “reward maximisation” because there is a problem that demands it. Reward maximisation is actually a flavour of situation-driven learning, one that is only ever appropriate in an intellectual context.

This is why reward maximisation feels so clinical and inorganic, like a pitiless schoolmaster. It attempts to lay down a universal, steady-state formula for what is in truth a lazy, opportunistic, and short-sighted process. As such, it can only ever be a theoretical limit: a distant, aggregate view of what is, close up, a fundamentally different process. Much like Adam Smith’s “invisible hand” driving all economic interactions, it isn’t realised in operation; it’s only perceived to do so from a distance. Reward maximisation is an ideal: it’s what you wish humans would do, not what they actually do.

As for a situation-driven RL policy, it is more than just a policy of cautious or lazy exploration. It is intertwined with a concurrent process of learning thousands of bespoke problem-solution sets, then iterating in the space between them. Instead of performing regression to maximise a single scalar reward, the agent bounces between a myriad of problems and their solutions, creating and destroying new ones as needed, with no predetermined sense of where it’s going.

#5: Thoughts can be rewards in themselves

The advantages of such an approach become apparent when you apply it to cognition. Just as you may work through problems in physical space and discover their solutions, you can also imagine problems and contemplatively work through their solutions. This creates an opportunity for planning and reasoning.

For example, when faced with a potential problem such as losing a queen in chess, an agent can visualise various scenarios, and if any meets the criteria for a solution, it can resolve the problem in its mind. At that point it will perform a regression — i.e. it will learn. All this happens before it has put any action into practice. In short, physical exploration and imaginative planning would, in this paradigm, be the same basic behaviour. The former happens in physical space, the latter in a “hallucinated” one.
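
A rough sketch of this idea, reusing the assumed context interface from the earlier sketches (again, all names are illustrative, not an implementation from the paper):

```python
# Sketch of planning as exploration in a "hallucinated" space: imagined
# scenarios are rolled out against an internal model, and learning fires as
# soon as one satisfies the problem's solution criteria, before any physical
# action is taken.

def resolve_in_imagination(context, model, agent, candidate_plans):
    for candidate in candidate_plans:
        imagined_state = model.rollout(candidate)       # no physical action taken
        if context.solution_reached(imagined_state):
            agent.update([(candidate, imagined_state)], scope=context)  # learn from thought
            return candidate                            # the resolved problem is the reward
    return None                                         # problem stays open
```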

“Reward is enough” is vague on details of how rewards give rise to beliefs or cognition. The connection between them is not an easy one to establish and the authors only do so in broad terms. It seems obvious to say that “good” beliefs should lead to high cumulative rewards, but that pragmatic connection is difficult to systematise if the fundamental goal is still one of maximising external reward signals.

All this leads to the fifth problem with reward maximisation: it neglects any behaviour that does not aim at an external reward. This is a major oversight in the theory. Our experiences in life repeatedly demonstrate that thoughts can be ends in themselves; that is, thoughts by themselves can provide reward signals. Examples include

  • enjoyment of movies, fiction, music, comedy, and other entertainment
  • religious practices aimed at improving one’s inner life, e.g. “spirituality”
  • daydreaming and fantasising

All of these examples would be inexplicable if your only goal were the maximisation of external rewards. But they are easy to explain if you consider them as the result of independent problem contexts working on their own schedule of rewards, even in thought. For example, you may watch a fictional movie to live out — in thought — a fantasy you can’t realise in everyday life. It’s like making an “intention” or “plan”, albeit one you never bother trying to put into practice. The fantasy itself is enough to trigger a reward — which is why fantasising is often derided as idle or unproductive self-gratification.

Every model of AI I know of directly connects an agent’s values and preferences to maximising some external reward. Anything else is sidelined as irrational. A more cynical person might be tempted to think AI research is driven by an ulterior motive: to build a perfectly obedient and diligent worker, one that aims at nothing but maximising its productivity — and consequently its creator’s profits — without any such inconvenient aspirations as inner contentment or happiness. This is the historical legacy of rational choice theory in AI. If you pre-define AI as a rational, reward-maximising agent, then “unproductive” irrationalities, like enjoying comedy shows, get washed out as flaws or anomalies rather than what they are: part of the fundamental human condition.

Although this post has been critical of “Reward is Enough” there’s actually a lot to like in the paper, and I wouldn’t have written it if I didn’t fundamentally agree with many of its premises. For example, the authors rightly argue that language is rooted in the social efficacy of speech. This runs refreshingly counter to modern predictive models of language generation. They nudge linguistics away from distribution- or data-modelling algorithms like GPT, towards utility-based ones. The effect is to reframe the language question from “what is the agent most likely to say?” to “what is it useful for it to say?” Such a reframing would highlight social and linguistic influences on our ideas of truth, and arguably points to a solution to the symbol-grounding problem.

The authors of “Reward is Enough” don’t pretend to have uncovered all the details of implementing AGI. And there’s every possibility that Reinforcement Learning can be amended, updated, or reformed to meet the conditions for situation-driven learning. It would be misleading however to say we already have everything we need for AGI, as some of the basics, especially reward maximisation, demand to be significantly reworked. Rewards may be enough, but reward maximisation isn’t.

¹ Quote: “Furthermore, the maximisation of many other reward signals by a squirrel (such as maximising survival time, minimising pain, or maximising reproductive success)…would also yield abilities of perception, locomotion, manipulation, and so forth.” — Reward is Enough

² Such a system stands in for the episode start and end that is currently hard-coded into narrow RL.

³ To take full advantage of this approach, the agent must be able to separate actions and thoughts into domains of relevance so that, for instance, it perceives the circular shape ‘O’ as a letter (“O”) or a number (“0”) depending on the type of problem it’s facing. You could view this as a practical instance of the Theory of Affordances. Regressions or alterations to the neural network must also be restricted to only those connections that were created within its scope. If you misunderstood the sound that the letter “O” made, you should be able to correct that without inadvertently affecting “0” as a number. Isolating domains of learning thus prevents a common problem in ML training where new and unrelated skills may interfere with older ones, which is known as catastrophic forgetting.
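
As a purely illustrative sketch, such scoping could be approximated with a per-context mask over the network’s parameters, so that an update made in one context cannot touch connections belonging to another:

```python
# Illustrative sketch: restrict an update to connections created within one
# context via a per-context mask, so correcting "O"-as-a-letter cannot
# disturb "0"-as-a-digit.
import numpy as np

def scoped_update(weights, gradient, context_mask, lr=0.01):
    # context_mask is 1.0 for connections belonging to this context, else 0.0
    return weights - lr * gradient * context_mask
```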

⁴ The same applies to knowledge and beliefs, even scientific ones. We only theorise as far as we need to in order to solve our current problems. This has ramifications for what it means to discover the “truth”.
