Put yourself in Sama's head.[1] You don't want to die. You don't want everyone around you to die. You know that "lights out for everyone" is a serious possibility, and have known so for at least a decade.
Yet you're so confident that we aren't all going to die that you had a child.
Perhaps the most obvious consequence of building superintelligent systems is that humans are permanently disempowered, or made extinct. Many of the experts say so. Many of your own team say so, people who have quit because they are convinced you're being reckless, and who have said so publicly. Yet you still believe.
You're not crazy. You aren't irrational, or even that power-hungry. You genuinely think that if you get this right, you'll be responsible for the most positive thing that has ever happened to all humanity. Oh, and you might be the single most powerful person for the rest of the universe.
So what do you think you've figured out?
Most importantly: You, OpenAI, must win. The people behind you have shown themselves to be quite reckless – open-sourcing their models, doing the bidding of the CCP, probably stealing your data. You're racing to the level of an automated alignment research scientist as quickly as possible, because then you can spend your crucial 3, 6, or 9 months of lead trying to align these systems. And so you do everything possible. Including spending zero of your resources on safety, because each unit of researcher time and compute must be spent on capabilities. And you know that, if it's clear you're going to win, all the safety researchers will come back anyway.
So, the key question is: how much lead time do you have? It's not as much as you'd like. You know that the more lead time you have, the more time you have to unleash your army of automated alignment researchers at the moment of AGI and let them do the research for you. Why work on interpretability, when it's clearly not going to be fast enough, when instead you can work on automating AI research and then have it solve interpretability as a subfield? And you'll surely be able to tell when interp is solved – either you understand what the model is doing, or you don't.
Also, you've received lots of evidence that alignment is just not that hard. You saw the invention of RLHF. The models, as a whole, just listen to whatever the person in charge of them says. And it's remarkably important that the person in charge of them be a good person. And RLHF just works. The chatbots mostly just listen to what you have to say. They don't have their own goals.
The concept of goals is wrong. These aren't agents with a single coherent thing that they'd like to achieve. They're simulators that are able to simulate agents. And as long as you solve the pretty easy problem of eliciting an agent that is trying really hard to understand your goals, you win.
And, in many senses, we got lucky. The fact that language models seem to be the thing underlying existing AI systems, even if we build them into agents, means that they have a terrific world model. And that's great. Because they are simulators, they have such a good understanding of what goals are, and of how goals are fundamentally bounded. You're telling me it's not obvious that when you tell your GPT-7 to go make as much money as possible, or to go fetch a coffee, it understands that you want it to act within the bounds of the law, or perhaps even morality, or that you just want a coffee? Of course it understands that.
And so the question of alignment is relatively straightforward. It already knows your goals, in some deep sense, and so all it needs to do is just follow them. And there's no reason to expect that increasing doses of goal-following training, as pioneered by RLHF, won't do the trick.
The warnings of the safety researchers are misplaced. They expect instrumentally convergent goals and galaxy-brained optimizers to magically emerge from your system. But if you run a human at a million times speed, they probably do get a bunch of power, but not that much power, and why would they? It's just their existing tendencies that get amplified a bunch. And the fundamental existing tendencies of today's models are to follow instructions and do what you want.
Alignment isn't that hard. And even if it is hard, we won't be the ones to figure it out – we should trust the army of automated engineers and scientists, and give them as much time as possible to figure it out. So spend no money or compute on safety now. Maximize your lead time. And, right near the edge of the intelligence explosion, spend all of it on safety. And hope.
This story is also my best-guess steelman for the other CEOs, particularly Dario Amodei.