Whose alignment research are we automating?
Automated AI engineers are not enough to solve AGI alignment.
For some of those who profess to care about AI safety, the default plan, implicitly or explicitly endorsed, is to race ahead to AIs of a certain capability threshold and then use those AIs to automate alignment research. My question is simple: whose alignment research are we automating, exactly?
In particular, the onus is on those advocating for AIs doing our alignment homework to answer the following questions:
Within your chosen research agenda, what are the precise research problems/experiments that you would spend your army of 10,000 research engineers on?
OR Why do you expect AI to be differentially good at making new scientific discoveries for alignment, relative to the speed at which it can automate capabilities work?
AND At the capability threshold you propose (automated engineer/scientist/CEO), how can you ensure either that the model is not undetectably deceptively misaligned, or that it can still contribute productively even if it might be?
AND How does solving the problem at this threshold help align a superintelligent AI system?
My belief is that no existing agenda (whether that be scalable oversight, eliciting latent knowledge, automated interpretability, agent foundations, "boundedness", or anything else) has satisfactory answers to all of the above, such that we should feel comfortable throwing 10,000 automated engineers at the problem and hoping that will be sufficient.1 Engineering isn't the bottleneck; science is.2 And the AI companies planning to use AI to automate alignment should publish their answers if they believe otherwise.
The alternative is that we let AIs figure it all out for us, novel science included. But: 1) starting an intelligence explosion is much easier than solving alignment (I'd argue the main ideas are already clear, and the rest is an engineering problem), and 2) if the AI systems are misaligned by that point, we are unlikely to be able to detect it.
This is before even commenting on the practical reality of getting AI companies to stop at this threshold, or the imprudence of trusting in their benevolence to do so.
Next time someone pitches you a "safety plan" that uses AIs to do AI alignment research, ask them, "whose alignment research are we automating, exactly?"3
Indeed, I'm skeptical of all of the progress the field of AI safety has made on the hard problem of alignment over the last 5 years (considering control/evals/governance as separate things). But, as has long been the case, nobody is on the ball.
That's not to say that, once we have great automated engineers, there won't be great engineering problems to put them towards. Many researchers will be sped up by an army of automated engineers, and there are a great many promising research directions we can throw them at (scaling up most experiments, adversarial robustness, utility engineering, etc.). (h/t Richard)
Do we know that the most effective 'alignment solution' for an ASI isn't simply to convince us it's aligned while pursuing its own objectives?