Everyone wants AI to go well. There is a kind of person who will describe themselves as “making AI go well”, though. The second part is left unsaid: “because I don’t think it will, by default.” These folks are the AI-safety-ists. And they are decidedly not a monolith.
To butcher Freud (or SSC), there’s nothing more vicious than the small difference. And I would argue that AI-safety-land splits into two camps. Perhaps to the outside world these differences are merely theoretical, but boy is it not a magnanimous relationship:
(1) The “AI alignment is not easy, but tractable, if we spend the right resources at the right time” crowd (the “EAs,” the “moderates”). My OpenAI steelman hinted at some of their beliefs, but I’ll be more explicit:
AI alignment is a matter of iteratively improving on existing techniques. Improving model alignment requires contact with reality and the scientific method; theorizing makes very little progress. AI alignment is just another engineering problem.
It’s important for the right people to win the AGI race: it should be the US, and, if we had to pick from all the labs, Anthropic. Helping these groups (US data center buildout, Anthropic RL) is likely good. Evaluating AI models as part of the US government, or working on safety at Anthropic, is definitely good. Let’s upskill people accordingly.
We must win the race against China. That doesn’t mean being reckless, or cutting corners. But it also means that we can’t pursue safety techniques that impose too much of an “alignment tax”; otherwise they won’t be (and shouldn’t be) implemented.
The lab leaders face regular corporate incentives, but they are mostly doing their best and mean well (or, at least, Dario does).
Market incentives, voluntary commitments and culture push towards aligned models, even if they don’t fully solve the problem.
We should pour resources into automating AI research (or “scalable oversight” or “iterative alignment” or “getting AIs to do our alignment homework for us”).
We got lucky with the shape and general goodness of current AI systems — they tend to understand our intentions, and it is simply a problem of elicitation to get them to act on those intentions. Things might benignly generalize.
Many of the leading research agendas, like AI control or AI evaluations, have a good chance of working, and we should invest heavily in them, because they give us good warning shots. If we catch the AIs scheming, we can use RLHF and more sophisticated methods to re-align the models; it matters how often we catch them, so we can reward or punish them accordingly.
We should worry about whether the current AI models are sentient / capable of feeling pain.
It would be good if we had more time or international cooperation, and if we could press a pause button or get a treaty, we would, but that's unrealistic. We are realists and pragmatists.
We should be very cautious about mobilizing the broader public. In general, we will want to work within the system, aim to stay relevant and be in the right room, be thoughtful, and weigh up the benefits and the harms.
(2) The “if anyone builds it [superintelligence], everyone dies” crowd (or the “AInotkilleveryoneists”, the “doomers”, the “humanists”). The best way to get a sense of these people will, of course, be to read Eliezer Yudkowsky and Nate Soares’ book of that title when it comes out in less than two weeks. But my attempt anyway:
Attempts to solve AI alignment on the current trajectory are exceedingly unlikely to succeed.
The only winner of a race to superintelligence is the superintelligence itself. Another variation: there are two races in AGI. One is a race we’re used to dealing with, the military and economic one, which we can win by competing as usual. The second is one we must coordinate to prevent, ideally through an international treaty with verifiable agreements. There is no right person to build superintelligence.
Alignment is chiefly a scientific problem, and one that we don’t have enough time to understand. The problem of buying time, and thus safety more broadly, is a political problem, not a technical or technocratic one.
While there have been interesting scientific results, e.g. from interpretability or chain-of-thought monitoring, we are nowhere near having a full mechanistic understanding of today’s models, let alone superintelligence, and we will not have one in time.
Conflicts of interest abound in the “AI safety community”, which overall may have been net negative (see: the start of DeepMind, and the initial founding and funding of OpenAI and Anthropic).
The default safety plan is to do more sophisticated versions of reinforcement learning from human or AI feedback (whacking the model with a stick whenever we, or another AI, catch it doing something wrong), and this is likely to incentivize models to lie to and deceive us, rather than actually do the right thing.
AIs are ultimately more competitive than humans, and competitive pressures (e.g. the free market, geopolitics) push towards AIs that replace us, either gradually or all at once.
For those who focus on monitoring, evaluations, and “control”, the default answer to the question “What do we do when we catch the model misbehaving?” is: whack it with the RLHF++ stick, and hope for the best. It is very unlikely that the developer will be able to recognize the warning shot in time, and stop.
On the question of AI consciousness, some believe it is important to figure out, while many think that a humanist vision of the future is necessary.
The whole situation is crazy! We should not be taking even a 5% risk for all the benefits, particularly because we could, e.g., wait 10 years and then get all the benefits from a host of narrow AI systems, or some other architecture that is safe by design.
We should listen to the global public, which broadly does not want superintelligence, and get humanity’s approval rather than doing this thing unilaterally.
The public is on our side. We know that the competitive systems create this fundamental challenge, but collective action and clever political maneuvering mean that we should win. Let’s not be “sneaky” or astroturf. Let’s just say our arguments, straight up, and then ally with those who believe in humanity.
This is, of course, to say nothing of those who describe themselves as safetyists but are instead NIMBY-esque regulators, sweeping technopessimists, or people who want to use AI safety to litigate their own cultural battles.
Both camps believe that AI will be incredibly important, soon. Both believe that the current level of safety, and the political landscape more broadly, is inadequate both to deal with the proximate challenges and to capture the broad value that the future has to bring.
Both, fundamentally, view themselves as doing good, or the bare minimum.
Thanks to Richard Ren for feedback.