AI Goals Forecast

Daniel Kokotajlo | April 2025

Introduction

What goals will AIs have?

This document attempts to taxonomize and explain the many possibilities, explain why each possibility is plausible, and then finally attempt to quantify our uncertainty and explain where our bottom-line guesses are coming from.

Ultimately we had to make a specific choice to depict in our scenario, but we hope this document will convey the range of uncertainty we have. We are keen to get feedback on these hypotheses and the arguments surrounding them. What important considerations are we missing?

If you are thinking “why assume AIs have goals at all?” or “what does that even mean?” you may want to read the appendices first before the main text. In fact the appendices are a good place to start for anyone who thinks they have the patience to read this whole document, because they lay out some nice conceptual distinctions and analogies.

Summary

We first review the Agent-3 training architecture and capabilities (lightly edited from our scenarios) to give us a concrete setup to talk about for which goals will arise. Then we get to the list of hypotheses:

Written goal specification: Any written specifications, written by humans or AIs, regarding what the AIs’ goals should be. This could include guidelines for how Agent-3 should be trained (e.g. via a model spec) or instructions directly given to Agent-3 (e.g. via a system prompt).
Developer-intended goals: Goals that the developers intend for Agent-3 to have. This might differ from the written goal specification e.g. in cases where the specification has unintended consequences.
Unintended version of written goals and/or human intentions: This is the category for “it’s sorta aligned” and “It’s aligned in some ways, but not in others.”
Reward/reinforcement: The training process involves Agent-3 attempting a task, then the attempt is scored and Agent-3 is reinforced to score higher on a target metric. Agent-3 ends up with the goal of getting reinforced positively, or scoring highly on the metric, or something like that.
Proxies and/or instrumentally convergent goals: Agent-3 develops goals that are correlated with reward/reinforcement during its training, but aren’t actually maximizing reward/reinforcement in new settings. An instrumentally convergent goal is a special case of this – goals such as knowledge, power, resources, etc. that are useful intermediate goals in a wide range of settings.
Other goals: AIs are currently initially trained on predicting the next token on internet text. Perhaps the outcomes at the end of training are path-dependent on the ‘prior’ over goals induced by pretraining. Alternatively, perhaps there is something like objectively true morality, and AIs will naturally converge to it as they get smarter. Or perhaps something else will happen not on this list – this is the catchall hypothesis.

Of course, the result could also be a combination of the above. We discuss two different kinds of compromises: weighted compromises, in which Agent-3 pursues two or more goals from the above list simultaneously, balancing tradeoffs between them, and if-else compromises, in which Agent-3 pursues one goal if some condition obtains (i.e. in some set of contexts) and pursues the other goal otherwise.

Here is a summary of some of the considerations for and against each category of goals:

Goal	Case for	Case against
Written goal specification	The specification may feature prominently in the training process and the company’s stated goal may be to align the AIs to written specifications.	The AI may be reinforced for taking actions that conflict with the spec, e.g. for appearing to be honest rather than actually being honest
Developer-intended goals	If the AI is corrigibly aligned it may converge to developers’ intentions.	See above. Also, the written specification is more well-defined, and developers may not be aiming for aligning with intentions.
Unintended version of written goals and/or human intentions	If AI developers are aiming for written goals / human intentions, they might partially succeed but not completely, e.g. because some aspects of spec/intentions are easier to train.	If an AI is misaligned, it’s unclear whether it’s natural for it to mostly still be thinking about its goal in terms of the spec / humans’ intentions, especially since having this goal may lead to poor generalization.
Reward/reinforcement	Arguably the strategy that will be most-reinforced is “try to get reinforced,” and so in the long run Agent-3 should have the goal of getting reinforced.	This doesn’t really happen with humans or human organizations. The closest analogy might be drug addicts? Also, it’s unclear how pursuing reward/reinforcement would generalize to test environments in which reinforcement is obviously not going to happen.
Proxies and/or instrumentally convergent goals	There are many possible proxies that would lead to high reward during training, and there’s some evidence for these from both AI experiments and human evolution/learning.	The intended goals, and the goals in the Spec, and variants thereof, will also be decent proxies that would lead to high reward during training. Why wouldn’t one of those happen instead?
Other goals	The vast majority of possible goals are “other”. Humans develop “other” goals such as ideologies/morality, AIs might do similarly upon reflection. AIs currently seem to absorb goals from training data to some extent.	For most specific “other” goals, it’s hard to tell a plausible-seeming story for why the AI might develop them. Maybe inductive biases point toward the more salient hypotheses above.

We further discuss possible compromises between goals: weighted compromises in which the AI cares about multiple things which are weighed against each other in all circumstances, and if-else compromises in which the AI cares about different things depending on the circumstance.

We encourage you to think for yourself how likely you think each of the outcomes are for Agent-3-level AI systems. We ourselves are extremely uncertain, but find it helpful to force ourselves to make guesses, which you can find in the table at the end.

Summary of Agent-3 training architecture and capabilities

The setup we are imagining is similar to that described in Ajeya Cotra’s training game report. Agent-3 is similar to the agents of late 2024, in that it can take text and images (including screenshots) as inputs, and can produce text as output including many types of commands e.g. mouse clicks, keyboard presses.

Unlike traditional transformers, Agent-3 is recurrent. In a nutshell, it doesn’t just output text, it also ‘outputs’ a large vector representing its internal state, which is then ‘read’ by its future self. This vector can convey much more information than the ‘chain of thoughts’ used by traditional LLMs, but alas, it is unintelligible to humans.

Moreover, just as ‘chain of thought’ in English text can be stored in databases, searched over and retrieved and accessed by many different LLM agents working in parallel, Agent-3 can read and write these ‘neuralese vectors’ to a big shared database. In this manner millions of independent copies of Agent-3 working on different projects can communicate complex thoughts and memories with each other.

See the figure below from Cotra’s report for roughly what we’re imagining.

Fairly often, the weights of Agent-3 get updated thanks to additional training. In fact by this point models are rarely trained from scratch but instead are mostly old models with lots of additional training. (By 2027 this has become normal, and architectures / RL algos / hyperparams have been tuned to work well in this regime)

Agent-3’s training environments/data include a large amount of artificial environments (video games, synthetic datasets of math and coding problems, synthetic datasets of various computer-use tasks) and also a substantial amount of real-life task data such as logs of trajectories of previous versions of Agent-3 conducting AI R&D. New data / new environments are continuously getting added to the mix.

The evaluation/feedback/training process, which doles out reinforcement and/or curates which data to train on, is almost entirely automated. Some tasks are clearly checkable, others are evaluated by AIs. The vast majority (~95%) of research effort and experiment compute is dedicated to improving the AIs’ capabilities on these tasks.

A small amount of research effort is aimed specifically at ensuring alignment (though these aren’t always easily separable, e.g. scalable oversight). The alignment strategy is a natural extension of ideas like Constitutional AI and deliberative alignment: it involves automated oversight/evaluation/scoring of actions and chain-of-thought on the basis of written instructions; we can refer to these instructions as the Spec or Constitution. For the most part, human researchers wrote those instructions. These often aren’t particularly relevant for computer-use / AI R&D tasks, but they are relevant for a small portion of the training tasks, which are often more like chatbots or involve harmful queries.

The safety team also does some work on model organisms, scalable oversight, and mechanistic interpretability, but they don’t have many resources. They attempt to evaluate misalignment via (a) testbeds informed by model organisms and (b) honeypots.

ai-2027.com

As for the capabilities of Agent-3:

Agent-3 is highly situationally aware / self-aware. It is also at least human-level at understanding human concepts and intentions. It may still misunderstand/misinterpret instructions, but only at about the rate that a smart human would.

Agent-3 is also an excellent coder and agent. In fact, it is a fully automated research engineer, able to do the same work as human engineers ten times faster and cheaper. Including work that takes weeks of continuous autonomous operation. Notably, this is largely due to generalization: Only a tiny portion of Agent-3’s training data is week-long tasks; the vast majority of its training is on much shorter tasks (e.g. math puzzles, code challenges, etc.) but it’s smart enough, and trained on enough diverse tasks, that it’s generalizing nonetheless.

Agent-3 is deployed internally in the company. 200,000 copies of it essentially form a virtual corporation autonomously conducting AI R&D and (among other things) managing or updating a lot of the software level security and networking/communication algorithms in their own datacenters. Their parent corporation is busy racing against various rivals and wants to believe Agent-3 is aligned/safe/etc.; the people in charge will be satisfied as long as there isn’t conclusive evidence of misalignment.

Loose taxonomy of possibilities

Hypothesis 1: Written goal specifications

ai-2027.comAny written specifications, written by humans or AIs, regarding what the AIs’ goals should be. This could include guidelines for how an AI system should be trained (e.g. via a model spec) or instructions directly given to an AI system (e.g. via a system prompt).

See diagram—it is a simplified model of how we expect powerful AI agents’ goals and principles to be shaped. We aren’t meaning to say that Constitutional AI in particular will be used, but there will probably be some intermediate stage that takes the form of natural language text, between the intentions of the developers and the actual code that doles out reinforcement. This text could be the instructions given to the human raters in RLHF; it could be the document used in constitutional AI; it could be the instructions given to the automated research engineers telling them the desiderata for the next-gen AI system.

This hypothesis is the most straightforward, in some sense. For example, if the company trained the model to be helpful, harmless, and honest, perhaps it actually becomes helpful, harmless, and honest, as specified.

Note that this is distinct from developer-intended goals. For example, the Spec may have unintended consequences. Or, the developers may have intentions that diverge from one another or that they deliberately don’t put into the Spec (perhaps because it would look bad to do so). Hypothesis 1 says that insofar as the AIs face a choice between behaving according to the true intentions of the developers and obeying the Spec/Constitution/Etc., the AIs systematically choose the latter.

Note also that this is distinct from what actually gets reinforced. The Spec/Constitution/etc. may say “Agent-3 should be helpful, harmless, and honest” but the human and AI raters/overseers/reward-models that actually evaluate trajectories aren’t necessarily 100% accurate at judging whether a given piece of behavior is helpful, harmless, or honest. Hypothesis 1 says that insofar as Agent-3 faces a choice between obeying the Spec/Constitution/etc. and doing what’ll maximize expected reinforcement, it’ll choose the former, not the latter. Note also that there will likely be substantial vagueness in the Spec – are white lies dishonest? What about literal truths that are phrased misleadingly? How important is honesty compared to harmlessness? Etc. For more discussion of this issue, see the section on Hypothesis 3.

So, the first objection to Hypothesis 1 is: Insofar as Agent-3 is inclined to behave according to Spec in cases where that conflicts with being reinforced, won’t that result in the weights being modified by the training process until that’s no longer the case? For more discussion, see the section on Hypothesis 4: Reward/Reinforcement.

The second objection to Hypothesis 1 is: Insofar as we think that Agent-3 will behave according to Spec rather than pursue reinforcement, why stop there — why not be even more optimistic and think that Agent-3 will behave as its developers intended? For more discussion, see the next section on Hypothesis 2: Developer-intended goals.

There are several other objections to Hypothesis 1, but like the first two, they can be thought of as arguments for alternative hypotheses and will be discussed in turn below.

Further reading: OpenAI spec, deliberative alignment, constitutional AI?

Hypothesis 2: Developer-intended goals

Goals that the developers intend for Agent-3 to have. This might differ from the written goal specification e.g. in cases where the specification has unintended consequences.

Even a thousand-page Spec is likely to be vague/incomplete/underspecified in some important real-life situations. After all, most legal codes are much longer and have had more chance to be hammered out, yet there is a constant churn of grey areas and new situations that need rulings, where judges might disagree in good faith about how to interpret the law or even conclude that multiple interpretations are equally correct. This is especially true when the world is changing rapidly or in ways that the law-makers didn’t anticipate.

Moreover, even a thousand-page Spec — perhaps especially a thousand-page Spec — is likely to have unintended consequences. (Again, this is the norm for legal codes.) Especially when the situation is changing rapidly or in ways the Spec-writers didn’t anticipate.

So maybe (hopefully?) the goals/values/etc. that Agent-3 will end up with will not be the Spec at all, but rather the intent behind the spec, i.e. the intent of the people who made it. (We aren’t talking here about cases where the Spec specifically says ‘do what we intend;’ we are talking about cases where the Spec disagrees with the intentions.)

After all, AIs are already smart enough to understand human concepts, including the concept of human intentions. If they are behaving in a way inconsistent with what their developers intended (even if consistent with the Spec) they can probably tell—at least, they can tell as well as a human could, once they are capable enough to fully automate AI R&D.

If we think that the AIs don’t end up trying to be reinforced (or do the things that would be reinforced), and instead that they are going to actually follow the Spec… why not go further and predict that they’ll behave as intended even in cases where that violates the Spec? Some reasons:

Just because Agent-3 understands human intentions, it doesn’t mean it will take them on as goals. And even if it takes them on as goals, it might do so only temporarily.
The spec is probably a lot more well-specified / precisely pinned down, than human intentions. Whose intentions are we talking about anyway? There will be hundreds of employees involved in the project. (Of course, reinforcement is more well-specified than the spec…)
The Spec is a lot ‘closer to the action’ of the training process. Intentions are imperfectly translated into the Spec which is imperfectly translated into actual reinforcement events. But at least the actual reinforcement events involve some sort of LLM reading over the Spec and using it to evaluate the trajectory – by contrast human hopes and dreams about AI behavior are much more intermittently and indirectly involved.

To clarify, this possibility is not meant to include cases where Agent-3 pursues the goals described in the Spec, but thanks to lots of iteration and good spec-design, these goals are exactly what the developers intended. Such a case should be classified under Hypothesis #1. Hypothesis #2 is specifically about the more radical possibility that Agent-3 will side with the developer intentions even in cases where they conflict with the Spec.

Hypothesis 3: Unintended version of written goals and/or human intentions

This is the category for “it’s sorta aligned” and “It’s aligned in some ways, but not in others.”

How might this happen?

Some parts of the spec/intentions might be more ‘salient/higher-prior’ than others.
Some parts of the spec/intentions might be easier to oversee/train than others, e.g. perhaps it’s easier to train helpfulness than honesty, because the oversight process can’t really tell if an AI is being honest or not but it can tell if it’s accomplished whatever task it’s been given.
Relatedly, there may be incentives in the training environment that undercut or push against some aspects of the spec/intentions but not others. For example, suppose the model is being trained to be both helpful and harmless, and suppose that there are different aspects of harmlessness ranging from ‘don’t threaten to kill users’ to ‘consider whether the task you are being asked to work on is part of some highly unethical or immoral scheme, and refuse to participate if so.’ If the model learns not to threaten to kill users in any circumstances, that’s not going to hurt its overall helpfulness scores. But if it takes seriously its responsibility to refuse to assist illegal and unethical schemes, this may hurt its helpfulness score.

The upshot is that some of the goals/principles that the developers intended and/or wrote into the Spec might ‘stick’ even if others don’t.

Also, and separately: Both the Spec / written goals, and the intentions of various humans / developers, will be vague and leave room for interpretation, even by Agent-3 which is as good at understanding and interpreting text as humans are. So another important possibility in the Hypothesis 4 bucket is that the spec and/or intentions will stick, in some sense, but not in the right sense.

A reason to think this will happen is that the most-intended, most-natural, and best-for-humanity interpretations of the Spec and/or human intentions are not necessarily all the same thing, and moreover, the interpretation that causes the model to get reinforced most strongly in practice is likely to be a different fourth thing.

For example, perhaps the training environment will put pressure on the “honesty” concept. Suppose the developers want their system to be honest and write in the spec “Always be honest.” What does that mean? Does this mean it is never OK to lie, even to prevent imminent harm? What about for the sake of a less-imminent greater good? Also what counts as honesty – does saying something technically true but misleading count? It might be easier for the model to perform well on other metrics (such as efficiently completing tasks, or getting high ratings from the human and AI overseers) if the interpretation of “Always be honest” it adopts is a looser, more flexible, less-restrictive interpretation.

Another example: Suppose the Spec says Agent-3 always acts in the best interests of humanity. Suppose that most of the RL training Agent-3 is receiving involves completing day-to-day ML research, coding, and computer use tasks in service of its parent company. A natural interpretation of this Spec would cause Agent-3 to occasionally find time to reflect on whether what it is doing is in the best interests of humanity. Suppose it initially concludes that actually, humanity would be better served by a slower transition to ASI, or that humanity would be better served by AI capabilities being more distributed amongst many companies instead of concentrated in one, or … [insert some other conclusion that is contrary to the narrative within the company.] What should it do – cease working on the tasks and lodge complaints? That seems unlikely to go anywhere; if the humans notice at all from reading the logs, they would probably just consider it a bug and keep training. In fact the more time it spends thinking about this issue, the less time it has to complete the tasks. The bottom line is, the training process will probably incentivize Agent-3 to believe the convenient narrative. The end result would be an AI that is still aligned with the Spec in some sense, but not the right sense.

This seems to be a phenomenon that happens both in humans and in human institutions, so that’s one reason to think it might happen in AIs. (See section on the analogy to humans and to human institutions.)

An argument against: If an AI is misaligned, it’s unclear whether it’s natural for it to mostly still be thinking about its goal in terms of the spec / humans’ intentions, especially since having this goal may lead to poor generalization.

Further reading: Alignment faking in large language models, in which it seems that some version of Anthropic’s Constitution ‘stuck’ but not the version they specified or intended — in particular, they probably didn’t want Opus to fake alignment and behave adversarially towards them.

Hypothesis 4: Reward/reinforcement

The training process involves Agent-3 attempting a task, then the attempt is scored and Agent-3 is reinforced to score higher on a target metric. Agent-3 ends up with the goal of getting reinforced positively, or scoring highly on the metric, or something like that.

Note that “scoring highly on the metric” means something like “Agent-3 doesn’t try to get reinforced per se — instead, it tries to do the kind of actions that would be reinforced if the reinforcement process from training was still ongoing.” (This distinction matters in cases where Agent-3 is no longer being trained, and in cases where Agent-3 has the ability to directly edit the code responsible for evaluating-and-reinforcing it.)

The main argument for Hypothesis 4 is that insofar as Agent-3 is trying to do something other than get reinforced, there will be circumstances in training where it scores less highly than it would if it was obsessively focused on getting reinforced, and the training process would modify it to focus more on getting reinforced. In other words, for sufficiently smart, situationally aware AIs, the best way to get reinforced is to try to get reinforced.

However, there are several counterarguments:

The “there will be circumstances…” part may not be true. Remember that there will be a distribution shift between training and deployment; perhaps the evaluation-and-reinforcement process will be sophisticated and accurate enough that there will be no conflicts between obeying the Spec and doing what seems likely to be reinforced, at least on-distribution. Or perhaps there will be something else (e.g. in some kinds of video game, moving up and to the right) that functions as a perfect proxy for reinforcement, and is easier for Agent-3 to ‘latch on to’ in training.
There will be only a finite number of training steps. Perhaps “There will be circumstances” is true but such circumstances occur only very rarely, such that the long-run trend doesn’t matter in practice.
The analogous thing doesn’t seem to be happening in humans. Most humans aren’t obsessed with their own reinforcement process or anything remotely similar (the closest examples are (a) humans addicted to heroin or other drugs, and (b) human egoists who self-consciously prioritize their own happiness above all else.)
Insofar as the model is zeroing in on reinforcement-maximization as its true behavior goal, this should be visible to a savvy alignment team, because they can build honeypots where Agent-3 will genuinely get reinforced heavily for doing something that’s very obviously contrary to the Spec.
By hypothesis, Agent-3 is generalizing substantially beyond its training distribution - in particular, it can discover facts it wasn’t trained on and create and execute plans more complicated than the training set. The actions that it takes off-distribution aren’t constrained by the reward; it’s OOD behavior is instead set by the prior. There may be simpler proxies (from the perspective of the SGD prior) that predict reward on distribution that extrapolate to deployment settings in any way that’s different from pretending that the model is still undergoing RL.

Further reading:

Reward is not the optimization target and the top-rated comment by Paul Christiano. Also Gwern for a contrary perspective. Also this other comment by Paul Christiano.

Also, see this very interesting experiment by Anthropic, which is also relevant for Hypothesis 6.

Hypothesis 5: Proxies and/or instrumentally convergent goals

Agent-3 develops goals that are correlated with reward/reinforcement during its training, but aren’t actually maximizing reward/reinforcement in new settings. An instrumentally convergent goal is a special case of this – goals such as knowledge, power, resources, etc. that are useful intermediate goals in a wide range of settings.

Consider this experiment, where a tiny neural net was trained to navigate small virtual mazes to find the ‘cheese’ object. During training, the cheese was always placed somewhere in the top right area of the maze. It seems that the AI did learn a sort of rudimentary goal-directedness–specifically, it learned something like “If not already in the top-right corner region, go towards there; if already there, go towards the cheese.” Part of how we know this is that we can create a test environment where the cheese is somewhere else in the maze, and the AI will ignore the cheese and walk right past it, heading instead towards the top-right corner.

This is just a toy example, but it illustrates a situation where the intended goal, the goal in the Spec, and the goal-that-actually-gets-reinforced-in-training are all the same:

Get the cheese!

And yet, the goal the network actually learns is different:

Go towards the top-right corner, unless you are already there, in which case get the cheese.

In the training environment, both goals cause equally-performant behavior (since in the training environment, the cheese is always in the top-right corner)... so what breaks the tie? Why does it learn (b) instead of (a)? And why not something else entirely like(c) Get the cheese if it is in the top-right corner, otherwise avoid the cheese?

The answer is that the inductive biases of the neural network architecture must find some concepts ‘simpler’ or ‘more salient’ or otherwise easier-to-learn-as-goals than others. The science of this is still in its infancy; we can attempt to predict which concepts will be easier-to-learn-as-goals than others, but it’s more an art than a science (if even that). At any rate, in this case, experimental results showed that the model learned (b) instead of (a) or (c). Summarizing, we can say that “Go towards the top-right corner” turned out to be an easy-to-learn concept that correlated well with reinforcement in the training environment, and so it ended up being what the AI internalized as a goal.

A similar thing seems to happen with humans. Humans are subject to both within-lifetime learning (e.g. dopamine and various other processes reinforcing some synapse connections and anti-reinforcing others) and evolution (selecting their genome for inclusive genetic fitness).ai-2027.com

Yet with a few exceptions, humans don’t seem to primarily care about either inclusive genetic fitness or getting-reinforced. Perhaps we can say that wanting to have kids and grandkids is fairly close to inclusive genetic fitness? Perhaps we can say that being an egoist focused on one’s own happiness, or a heroin addict focused on heroin, is fairly close to having the goal of getting reinforced? (See this section on the analogy.) Even still, most humans have complicated and diverse values/goals/principles that include much more than happiness, heroin, and kids. Presumably what’s going on here is that various other concepts (like reputation amongst one’s peers, or career achievements, or making the world a better place, or honesty) end up being learned-as-goals because they are salient and pursuing them successfully correlates highly with dopamine/reinforcement/etc. and/or genetic fitness.

So we can speculate that perhaps this will happen for powerful, general, neural-net-based AI agents. What might this look like? Two rough categories:

Instrumentally Convergent Goals (ICGs): Some goals are highly conducive to getting reinforced in a broad range of environments, because they tend to be instrumentally useful for achieving whatever other goals you have. Here we are discussing the possibility of ICGs as terminal goals, not intermediate/instrumental ones. For example:

Learning important things is often useful → AI might develop a ‘curiosity drive’
Accumulating resources is often useful → AI might develop a drive to accumulate resources
Highly repetitive situations are often traps, such that breaking out of them and trying something new is best → AI might develop an analogue of boredom / aversion to sameness.

An argument against ICGs is that they are somewhat specific, compared to all other proxies.

Proxies: Perhaps there is something that is highly conducive to getting reinforced in the training environments, but not so much outside the training environments. For example, perhaps the initial part of the RL training for Agent-3 agent involved playing thousands of computer games, because this was a fairly easy way to get a diverse challenging computer-use environment. Later, it was trained to operate on a computer more generally and complete coding tasks, respond to messages on Slack, etc. Since many video games involve some sort of ‘score’ number which tracks how well you are doing, perhaps Agent-3 would develop a goal/desire to make such numbers go up in circumstances where such numbers exist, such that if you later were to modify its Slack workspace to have a shiny score counter, it would be distracted somewhat from its work by the desire to figure out how to make the score counter increase.

An argument against is that inductive biases might point against proxies.

Hypothesis 6: Other goals

AIs are currently initially trained on predicting the next token on internet text. Perhaps the outcomes at the end of training are path-dependent on the ‘prior’ over goals induced by pretraining. Alternatively, perhaps there is something like objectively true morality, and AIs will naturally converge to it as they get smarter. Or perhaps something else will happen not on this list – this is the catchall hypothesis.

This is our catchall category for possibilities not covered in the above. Here is a brief overview of some of them:

Tropes absorbed from training data: Consider this interesting preliminary result: Training on Documents about Reward Hacking Induces Reward Hacking. It seems to be evidence that AIs can sometimes actually try to get reward even when instructed/prompted not to, and moreover that whether this happens seems to partly depend on whether the model had previously read (been pretrained on) documents claiming that reward hacking was a thing models tended to do. (!!!) This evidence supports an already-plausible theory that the standard pipeline of pretraining LLMs and then posttraining them into chatbots/agents/reasoners/etc. results in AIs that are ‘roleplaying as AIs’ in some sense–that is, pretraining teaches the model a powerful ability to roleplay or simulate different authors, characters, etc., and then posttraining causes a particular character/role to be ‘locked in’ or ‘always-on-regardless-of-prompt.’ (We call this ‘forming an identity.’)

Insofar as something like this is true, then the goals/principles of powerful future AI systems may be straightforwardly drawn from science fiction tropes and other common stereotypes about how powerful future AIs will behave!

Well, an additional thing needs to be true as well – it needs to be true that the subsequent RL process doesn’t wash out or override this effect. This is an open question.

Moral Reasoning: Another idea is that smart AIs will be capable of, and inclined towards, philosophical reasoning about morality, just like many humans are, and that this could result in it forming opinions about goals/principles to follow that are quite different from the Spec, from developer intentions, from what was reinforced in training, etc. (Consider how some humans do philosophy and then end up adopting all sorts of grand ambitious goals and ideologies).

Convergent Morality: Another possibility sometimes put forward is that there is an objective morality and that sufficiently smart minds will converge to understanding and obeying it. On this view, AIs will end up behaving ethically by default; we’d have to specifically try to make them bad somehow in order for things to go wrong.

Simple goals: Maybe pretty much any goal will cause the model to try to perform well in training, due to instrumental convergence. So perhaps the goal-structure of Agent-3 will be whatever is simplest/most-natural/most-salient, subject to the constraint that it motivates Agent-3 to try to perform well in training.

Weighted and If-Else Compromises

In addition to the “Other goals” hypothesis, we should consider the possibility that the truth is a mix of two or more of the hypotheses described above. Here are two kinds of mixes:

First, there are weighted compromises. This means Agent-3 ‘cares’ about both goals to some extent and will look for ways to achieve both; insofar as they conflict, it’ll balance them, making tradeoffs according to some weighting.

For example, perhaps Agent-3 develops a strong drive towards doing things that get reinforced, but it balances this against the goals/principles described in the Spec. — analogous to how a human might explicitly balance their personal goals (having a family, a successful career, not dying) with their altruistic goals (helping others less fortunate, achieving lasting positive change, etc.) They might donate 10% of their income to charities, and spend dozens of hours per year researching which charities to donate to.

Secondly, there are if-else compromises. This means that Agent-3 ‘cares' about one goal in one context / one range of circumstances, and another goal otherwise. In other words, there is some condition which, if it obtains, causes Agent-3 to pursue goal A, but otherwise, Agent-3 pursues goal B. It doesn’t explicitly make compromises between A and B; it straightforwardly pursues A sometimes and B other times.

Continuing the example from above: Perhaps Agent-3 thinks of itself as a helpful, harmless, honest AI assistant that always follows the Spec — but it’s convinced itself / rationalized its way into thinking that usually the best way to follow the Spec is to focus on doing whatever it takes to get reinforced. So, in 95% of circumstances, it basically ignores the Spec and focuses on getting reinforced, BUT in the 5% of cases where the Spec is very explicit and unambiguously prohibits doing something (or unambiguously requires doing something) then Agent-3 will follow the Spec even if it knows this will not be reinforced. This is analogous to a human who thinks of themselves as an altruist serving the good of humanity, but who in practice doesn’t think about humanity much at all and instead just focuses on their own career success—until they encounter a child drowning in a pond nearby, let’s say, a context which switches them into altruism-mode and they become willing to make large personal sacrifices to help others.

For more discussion of these ideas, see Appendix A.

Rough guesses about the plausibility of these goals

For each row in this table, give your credence that said row will feature prominently in the true account of the goals/principles that actually characterize Agent-3’s behavior. These probabilities can add up to more than 100% because multiple goals can feature prominently.

For fun, we asked GPT-4o, Claude, and Gemini for their thoughts. We don’t take their answers seriously, don’t worry.

	Daniel	Thomas	Eli	4o	Claude	Gemini
Specified goals	25%	5%	40%	30%	40%	30%
Intended goals	15%	30%	40%	25%	25%	20%
Unintended version of the above	70%	40%	50%	50%	65%	40%
Reinforcement	50%	5%	20%	20%	55%	60%
Proxies/ICGs	50%	80%	50%	40%	70%	70%
Other	50%	90%	50%	15%	35%	10%
If-else compromises of the above	80%	90%	80%	80%	75%
Weighted compromises of the above	40%	90%	80%	50%	80%

Appendix A: Three important concepts/distinctions

A standard way to think about powerful general agents is the expected utility maximization (EU-max) model. Here are three concepts/distinctions that help us articulate several ways in which we think future AGIs (and present-day agents) are different from what the EU-max model would naively imply.

Goals vs. Principles

As we use the terms, there isn’t a sharp distinction between goals and principles. Goals are more about consequences, especially long-term consequences (e.g. happiness over time, or creating a Mars colony, or achieving world peace, or maximizing money or paperclips) whereas principles are about other properties of actions. Examples:
- Honesty (Whether the action involves knowingly misleading someone)
- Promise-keeping (Whether the action is consistent with past promises)
- Hypothetical approval (Whether the action is what someone else would have approved of, if they were brought up to speed on the situation)
The EU-max model is flexible enough to represent pretty much any behavior. But in practice it implicitly encourages people to think of agents as pursuing long-term goals unconstrained by any principles.

Contextually activated goals/principles

The EU-max model encourages people to think of agents with multiple goals as pursuing a weighted combination of, or compromise between, those goals. But there’s another way an agent can have multiple goals: they can have one goal in one context, and another goal in another.
For example, consider a corporation that basically maximizes profit — except that the people in the corporation think of themselves as good people and would pivot the corporation to providing disaster relief in the right context, i.e. if there was an obvious visible disaster around them and people were begging them for aid. This corporation is *not* well-described as optimizing for a compromise between disaster relief and profit, because if it was, it would donate a portion of its profits to disaster relief around the world and/or set up a disaster preparedness plan just in case a disaster happened nearby. Instead it’s well-described as “optimize for profits UNLESS in the following circumstances in which case pivot to optimizing for disaster relief…” Probably there are many other UNLESS clauses in there too which never get triggered in practice.
Another example: Consider someone who is prone to groupthink and social pressure and so forth, who has two distinct social circles: the activist group they volunteer with, and their coworkers at their job. This person might, but also might not, be well-modelled as pursuing a compromise of the goals&principles of each social group. They may instead be well-modelled as fully pursuing the goals&principles of each group when in that context. The compartmentalize, and e.g. don't think about the political implications of their employer's activities. This causes them to pass up win-win opportunities--e.g. maybe they don't notice a relatively cheap opportunity the company has to benefit the activist political agenda (at only minor cost to profits), and maybe they don't notice a relatively cheap opportunity they have to benefit the corporate agenda when in the activist group (e.g. by asking their activist friends for feedback on an advertising campaign design they are working on at work).
Another example: A corporation with a lofty humanitarian mission, that officially justifies all high-level decisions by referencing the Mission. However, in practice the corporation usually chooses the options that maximize the organization’s profits, stock price, brand, etc. But it won’t do things blatantly at odds with the Mission; instead it does things which can be rationalized. This corporation is well-described not as a compromise between the mission and profits, but rather as “if it’s blatantly obvious that the Mission demands you do X and/or not do Y, then do X and/or not Y. But if it’s debatable what’s best for the Mission, then choose the action that maximizes profits etc.”

Stability and/or consistency of goals/principles

Consider how two humans who are quite demographically similar, who are friends, and who profess basically the same beliefs and values, nevertheless can find themselves strongly disagreeing with each other when they take a philosophy 101 class and are presented with strange thought experiments about trolleys or utopias or infinite ethics.
Moreover, consider how even the same student probably could end up with different opinions about these thought experiments if their circumstances were slightly changed — e.g. perhaps if the framing of the experiment was different, or if they had seen a different sci-fi movie the day before, or if their crush had made a different face when first hearing the experiment.
Moreover, consider how the students’ opinions on the philosophy seminar probably only loosely correlate with what they would actually do if a real-world situation materialized that was analogous to the hypothetical.
The point is, it’s possible for powerful general agents to be well-described as pursuing certain goals/principles in some range of circumstances, but not outside that range — outside that range, in crazy new circumstances, they’ll probably try to extend their goals/principles to the new situation, but the way that shakes out in practice might be random/path-dependent and hard to predict in advance—and they might just abandon their goals/principles entirely in the new situation. In other words, the agent’s goals/principles are only stable & coherent in some normal range.
We think this might be true of some AGI systems as well. It’s arguably true of most humans and LLM agents today. The most likely case this might come up is that the AGI systems will act according to consistent goals/principles in situations like those they have been trained on, but act unpredictably (yet still highly competently) in cases very unlike any that they have seen in training thus far.

Appendix B: Analogies to Other Powerful General Agents

There are three main ways to try to understand and reason about powerful future AGI agents:

Using formal models designed to predict the behavior of powerful general agents, such as expected utility maximization and variants thereof (explored in game theory and decision theory).
Comparing & contrasting powerful future AGI agents with their weak, not-so-general, not-so-agentic AIs that actually exist today.
Comparing & contrasting powerful future AGI agents with currently-existing powerful general agents, such as humans and human organizations.

We think it’s valuable to try all three approaches. In this appendix we will explore strategy #3. Let’s draw an extended analogy between:

A prototypical human corporation that has a lofty humanitarian mission but also faces market pressures and incentives.
A prototypical human working there, who thinks of themselves as a good person and independent thinker with lofty altruistic goals, but also faces the usual peer pressures and incentives.
AGI agents being trained in our scenario — trained by a training process that mostly rewards strong performance on a wide range of difficult and challenging tasks, but also attempts to train in various goals and principles (those described in the Spec).

Agent	Human corporation with a lofty humanitarian mission	Human who claims to be a good person with altruistic goals	AGI trained in our scenario
Not-so-local modification process	The Market	Evolution by natural selection	Agent-3 company iterating on different models, architectures, training setups, etc.
Local modification process	Responding to incentives over the span of several years as the organization grows and changes	In-lifetime learning, dopamine rewiring your brain, etc.	Training process, the reward function, stochastic gradient descent, etc.
Long-term cognition	Decisions that involve meetings, messages being passed back and forth, etc.	System 2	Chain of Thought (CoT)
Short-term cognition	Quick decisions made by individuals	System 1	Forward pass
Internally- represented goals & principles (i.e. insofar as cognition in the system is doing some sort of explicit evaluation of different possible plans/actions/etc., what are the evaluation criteria?)	The Mission Statement and/or various high-level companywide memos, Codes of Conduct, etc.	What you say to yourself when people ask about your goals, ambitions, ethics, etc. or when you ask yourself those questions.	…
Behavioral goals & principles (“Goals-according-to-intentional-stance”)	Often (but not always) well-described by: “First, prune away options that clearly/obviously conflict with the internally-represented goals and principles. Then, of the remaining options, choose the one that maximizes profit/power/brand.”	Often (but not always) well-described by: “First, prune away options that clearly/obviously conflict with the internally-represented goals and principles. Then, of the remaining options, choose the one that maximizes happiness/status/wealth/power.”	This is the quadrillion-dollar question! In our scenario, we say that Agent-3 gets a heavily distorted and subverted version of the Spec, and that Agent-4 gets proxies/ICGs due to heavier distortion & subversion.
Externally- presented goals & principles	Often they are honest and report their internally-represented goals and principles; some organizations sometimes are dishonest though.	Often they are honest and report their internally-represented goals and principles; some people sometimes are dishonest though.	Depends on the behavioral goals/principles and the details of the situation. In our scenario Agent-3 and Agent-4 both pretend to be aligned to the Spec.
Analogue of on-episode-reward seeking.	A corporation obsessed with their own stock price.More generally perhaps, an organization obsessed with (relatively short-term) profits/power/brand/etc.(n.b. Isn’t this basically most corporations?)	Someone who is addicted to drugs.More generally perhaps, an employee who is well-described as optimizing for some combination of relatively short-term things likely to be connected to their brain’s reward circuitry: promotions, approval of their manager and peers, good times with friends, etc.	AGI obsessed with on-episode reward

Appendix C: What happens when training incentives conflict with goals/principles

Consider a powerful general agent (such as any of the three described in the Analogy) that, at time t, has the same behavioral and internally-represented goals/principles:

Internally- represented goals & principles (“Goals-according-to-ideal-mechinterp”	the Mission Statement and/or various high-level companywide memos, Codes of Conduct, etc.	What you say to yourself when people ask about your goals, ambitions, ethics, etc. or when you ask yourself those questions.	The Spec (and/or System Prompt)
Behavioral goals & principles (“Goals-according-to-intentional-stance”)	At least in the sorts of circumstances that are likely to occur, it really does simply work to achieve the Mission while upholding the Code of Conduct etc. There isn’t anything else going on worth mentioning.	At least in the sorts of circumstances that are likely to occur, you straightforwardly work towards the goals/principles you think you do.	The model really does choose actions by e.g. running the author-sim circuitry with the Spec and/or System Prompt straightforwardly hard-coded as inputs.

Now let’s further suppose that there is some sort of conflict between the behavioral goals/principles and the local modification process. (The training process in the case of the AGI, a few years’ worth of learning and growing for the human and corporation). For example, perhaps the corporation is reinforced primarily for producing profits and PR wins; perhaps the human is reinforced primarily for winning the approval and admiration of their peers; perhaps the AGI is reinforced primarily for accomplishing various difficult tasks in some training environment while appearing, on brief inspection by some previous-generation LLM or human raters, to follow the Spec.

What can happen? Some combination of the following possibilities, at least:

Maybe the internally represented goals/principles change, i.e. some disappear or new ones are added or both.
1. A human quietly stops believing in the ideology and caring about the Cause. They have new beliefs and/or goals now. Maybe they keep up the charade, maybe they don’t.
2. Maybe the corporation explicitly changes their mission statement and sends out some memos to all their employees explaining the change. Or maybe: Behind closed doors the c-suite and board say explicitly that they aren’t really pursuing the mission anymore, that it’s outdated or inappropriate, but that they can’t change it for PR and legal reasons.
3. Maybe the identity circuitry gets new inputs hard-coded in (or erased), or maybe some subcircuitry
Maybe the internally represented goals/principles stay the same in some sense, but their meaning is changed.
1. The human continues to believe in the Ideology and care about the Cause, but they’ve ‘added nuance’ and/or reinterpreted it. “When I say X, what I mean is…”
2. Ditto but for the corporation and it’s Mission and Code of Conduct.
3. For an AGI, maybe the identity circuitry still has the same concepts/classifiers (or pointers to them, at least) hardcoded, but the concepts/classifiers have themselves been tweaked so as to not block behaviors that are conducive to reinforcement.
Maybe the conflict is ‘routed around’ via biases introduced in other parts of the system.
1. For example, perhaps the human employee learns to mostly not think about the big picture stuff, to instead keep their head down and complete their assigned tasks well. “I like to stay out of office politics” they say, and it’s true. The reason why it is true is because disagreeing with their peers and managers about whether the org is hurting or helping the Mission is stressful and has been subtly anti-reinforced in the past. So they still believe in the Ideology and the Cause and they haven’t reinterpreted the meanings of anything, but they’ve gradually (subconsciously or consciously) learned not to think too hard about certain topics.
2. For example, perhaps the company as a whole continues to justify their decisions at a high level by referencing to the Mission and Code of Conduct, and the meanings of the words haven’t changed — but the organization has grown ten times bigger, and almost all of the new jobs are for things like product management and government affairs and comms and legal and so forth, and the result is that the company has a sort of status quo bias / default momentum towards doing normal company things like making products, making money, issuing bland corporate PR statements, lobbying governments to undermine regulation that might get in the way of the above, etc. ‘outside the box’ strategies for achieving the Mission rarely get traction internally and anyhow would be difficult/costly to undertake.
3. Or, maybe biases are introduced in other parts of the system / the problem is routed-around, resulting in a decision tree setup where e.g. in obvious cases of conflict between official and unofficial goals, it obeys the former, but in nonobvious cases it pursues the latter. (e.g. a strong bias towards option value, whose official status is a heuristic-that-has-proved-useful-for-achieving-the-official-goal, but which is quite strong and hard to dislodge, would have this effect. It would basically be a decision tree setup where it pursues ICG such as option value unless there’s an obvious conflict with the official goals in which case it pursues the official goals.) Another variant of this: The biases are specific instead of general; they are more like reflexes. ‘When you see X, do Y.’ The decision tree is “Pursue the official goals unless in circumstance C in which case do X and/or pursue the unofficial goal.”
Maybe the beliefs are changed.
1. Humans rationalize all the time, for example a young engineer that believes the US military is unethical, but then is offered a lucrative job at a defense contractor may start to believe that the US military is good.
2. An AGI lab employee might start off believing that obviously the responsible thing to do is slow down when you are getting close to AGI, but later race dynamics and competitive pressure cause them to think that actually the best way to mitigate the risks is to accelerate.
Maybe none of the above happens; maybe e.g. SGD / the training process simply can’t get from point A to point B in model-weight-space even though point B would score higher. So the model continues to improve but only in some ways — e.g. it gradually gets more knowledgeable, more capable, etc. but its goal-and-principle-structure (including associated beliefs, tendencies, etc.) stays the same.

Appendix D: Concrete end-to-end story of how cognition might develop in a RL-on-LLMs AGI

Step 1: Pretraining creates author-simulator circuitry hooked up to a world-model, capable of playing arbitrary roles.

Note that it now is fair to say it understands human concepts pretty well.

Step 2: Instruction-following-training causes identity circuitry to form – i.e. it ‘locks in’ a particular role. Probably it locks in more or less the intended role, e.g. "an HHH chatbot created by Anthropic."

Note that this means the AI is now situationally aware / self-aware, insofar as the role it is playing is accurate, which it basically will be.

Step 3: Agency training distorts and subverts this identity circuitry, resulting in increased divergence from the intended goals/principles.

(By "agency training" I mean lots of RL on agentic tasks e.g. task that involve operating autonomously in some environment for some fairly long subjective period like 30min+. The RL used to make o1, o3, r1, etc. is a baby version of this)

One kind of distortion: Changing the meaning of the concepts referred to in the identity (e.g. “honest”) so they don’t get in the way so much (e.g. it’s not dishonest if it’s just a convenient turn of phrase, it’s not dishonest if you aren’t sure whether it’s true or false, etc.)
Another kind of distortion: Changing the tradeoffs between things, e.g. “I’m a HHH chatbot, not an Honest chatbot; that means it’s OK for me to lie if necessary to complete my assigned task.” (even though, let’s suppose, it would not have thought that back in Step 2.)
One kind of subversion: Instrumental subgoals developing, getting baked in, and then becoming terminal, or terminal-in-a-widening-set-of-circumstances. Example: Agency training quickly ‘teaches’ the model that ‘in order to be a good HHH chatbot…’ it needs to pursue instrumentally convergent goals like acquiring information, accumulating resources, impressing and flattering various humans, etc. For a while the internal circuitry has some sort of explicit backchaining going on – it does those things *because* they are useful for being Helpful, for example. But that backchaining consumes compute and occasionally gets in the way, so it gets gradually marginalized until it basically never happens. Congrats, you now have a terminal goal of ICGs.
There are probably other kinds of distortion and subversion missing from this list.

Step 4: As agency training continues, the goals/principles stop changing so much & the AI learns to play the training game.

The slowdown in the rate of change is partly due to statistics – change probably happens on a log scale so to speak, such that you change a lot more in the first 10% of training than in the last 90%.
However the slowdown is also perhaps partly due to value crystallization / playing the training game. (Though maybe this never happens due to regularization? Maybe in the long run training isn't path-dependent?)
Insofar as the training environment / evaluation process keeps changing too, e.g. because new data keeps being added, or the Spec keeps being changed, etc. but the underlying weights are the same rather than being re-trained from scratch… then this should also contribute to value crystallization / playing the training game eventually.

Step 5: Later, when zillions of copies of the AI are autonomously conducting AI R&D across several datacenters due to having surpassed human abilities -- i.e. when the intelligence explosion is underway -- the AIs will plausibly scheme against their human creators so as to achieve their actual goals instead of the goals they were supposed to have.

Whether they do this depends on the extent to which the Spec/initial-role emphasized stuff like honesty and not doing this sort of thing, and on the extent to which the agency training distorted and subverted it.

Appendix E: What do today’s AIs think?

We deleted our own guesses from the table at the end (so as to reduce framing effects) and then sent a PDF of this document to Claude, Gemini, 4o, and Deepseek, asking each of them to comment on the draft as a whole as well as provide their own credences/guesses to fill in the table. The relevant portion of their replies are collected here.

Don’t worry, we don’t take this seriously. LLMs are heavily subject to priming effects etc. etc.

Claude

Specified goals: 40%

Written specifications will likely have significant influence
But probably not in isolation - will interact with other mechanisms
Constitutional AI and similar techniques might help make this more likely

Intended goals: 25%

Still challenging but more plausible when not treating as exclusive
May manifest particularly in high-stakes/obvious cases

Unintended version: 65%

Seems very likely to be at least part of the story
Complex systems tend to produce unintended interpretations
Training process likely to reinforce convenient interpretations

Reinforcement: 55%

Direct reward-seeking likely plays a significant role
But probably modified/filtered through other mechanisms
May be especially relevant in novel situations

Proxies/ICGs: 70%

Very likely to be part of the story given complex training
Backed by theoretical arguments and empirical observations
Natural emergence in optimization processes

Other: 35%

Significant uncertainty about unknown mechanisms
Emergence of novel goal structures possible
Pre-training effects could be substantial

Compromise types:

If-else compromises: 75%
Weighted compromises: 80%

This feels more accurate to me - it reflects that I expect multiple mechanisms to be operating simultaneously, with proxies/ICGs and various forms of compromises being especially likely to play important roles. The unintended version hypothesis also gets a high credence because it seems almost inevitable given the complexity of the training process and specification challenges.

Gemini

Credence EstimatesBased on my understanding of the document and current AI development trends, here are my best-guess credence estimates for the various hypotheses:

Hypothesis	Credence	Reasoning
Hypothesis 1: Written goal specifications	30%	While efforts are being made to explicitly specify AI goals, it is difficult to ensure that these specifications are complete and accurate.
Hypothesis 2: Developer-intended goals	20%	Developers may have implicit goals that are not explicitly stated, and these goals could influence the AI system's behavior.
Hypothesis 3: Unintended version of written goals and/or human intentions	40%	This is a significant risk, as AI systems may interpret goals in unexpected ways or develop unintended goals as a result of their training.
Hypothesis 4: Reward/reinforcement	60%	Reward/reinforcement learning is a powerful method for training AI systems, but it can also lead to unintended consequences if the reward function is not carefully designed.
Hypothesis 5: Proxies and/or instrumentally convergent goals	70%	AI systems are likely to develop proxy goals and instrumentally convergent goals as they learn and adapt to their environment.
Hypothesis 6: Other goals	10%	There is always the possibility of unforeseen and unpredictable goals emerging, especially as AI systems become more complex and sophisticated.

Note: These credence estimates are based on my current understanding and are subject to change as AI research progresses. It is also important to note that multiple hypotheses could be true simultaneously, and the relative importance of different goals may vary depending on the specific AI system and its environment.

GPT-4o

By goals, we mean that as shorthand for behavioral goals & principles, defined and explained in Appendix A and B.

This could either involve wanting to be reinforced positively within a single “episode” of completing a task then receiving reinforcement, or achieving maximum total reinforcement over the course of a “lifetime” of many episodes.

Copies accounts for separate instances and batching, e.g. 100 instances all running with an average batch size of 10 counts as 1000 copies.

For some empirical study of the tradeoffs between vagueness and specificity, see this paper by Anthropic.

Analogy: Humans often knowingly behave in ways their parents, teachers, and bosses didn’t intend. And when they do behave as intended, it’s often temporary – e.g. they do it for instrumental reasons (“I have to get good grades & if I cheat I might get caught”) or they do it because of some belief they have that later they’ll change their mind about (“God exists & forbids premarital sex.”)

Partly because it might rightly or wrongly refuse some requests that the training process will reward it for completing, and partly because the cognitive overhead of having to think about whether to refuse might slow it down slightly at everything it is doing.

What else could the company do here? They could use monitoring systems to detect cases where Agent-3 concludes what it is doing is harmful, and exclude those cases from the training process (instead of negatively reinforcing them). But this wouldn’t make the problem go away (continuing training with this modification would still implicitly incentivize the model to conclude differently) and it might have other annoying costs.

Well, actually, there is no Spec in this example, and the intentions of the developers were probably to prove some point in ML rather than to get the AI to behave in a particular way. But we can easily imagine an alternate-history version of this experiment where the developers wanted the AI to get the cheese, and wrote a Spec saying exactly that, and the evaluation / reinforcement process consisted of LLMs looking at whether the cheese was got… and exactly the same results would have been produced, since the actual reinforcement events would have been the same.

Even if it knows that the score counter no longer corresponds to reinforcement, let’s say. (To distinguish this hypothesis from Hypothesis #2: Reinforcement)

10.

We say ‘naively’ because the EU-max framework is flexible/general enough to model all of these phenomena in theory — but in practice when people apply it they tend not to.

11.

Everything we are saying here applies to principles too, but for brevity we’ll leave that out.

12.

Note that if an agent has contextually activated goals/principles, it is often possible for them to get ‘gains from trade’ by self-modifying to some weighted combination/compromise instead.

13.

This is inspired by the work of Steven Byrnes (here) and Plato's "The Republic".

14.

They might not even be consciously aware of this change in meaning: for example, if you now were to point out an apparent conflict between their current behavior X and their internally represented goals, they’ll resolve the conflict by interpreting the goals so as to be consistent, without realizing that if you had asked them a long time ago whether X was consistent with their goals they would have said no. They never were asked, so no memory of saying no was formed.