Why LLMs help design bioweapons
It's probably error correction.
AI models are getting good at biology and, by extension, bioweapon design.
Two days ago, Moonshot released Kimi K2.5 open-weight. It’s quite good. We’ll wait for its scores to come in on the Virology Capabilities Test, a benchmark of expert-level virology questions, but I suspect it’ll do well.
The difficulty of bioweapon design is in error correction. People often misunderstand how, mechanistically, LLMs help rogue actors make bioweapons. It’s not that difficult to find the synthesis instructions for horsepox (a close relative of smallpox) online; the reason culturing and modifying a virus is hard is that it is a multi-step process, where every step can fail.
Let’s say, hypothetically, that you’re not a great biologist-terrorist (even the best experts you could convince to join you in the caves were not the brightest of their biology classes). Your success rate for every one of the steps involved (e.g., determining feasible candidates, growing the biological agent in sufficient quantities) is maybe ~30%, and so at the end of this long multi-step process (what are these weird electron-dense circles in my TEM of HEK293FT cells again? what’s wrong with this picture of my influenza virus plaque assay?), your chance of success is ~0%. In a normal university lab, you would just go ask your specialist colleague.1 But you’re in a cave.
But imagine you had an excellent multimodal LLM:2 you could maybe increase the success rate to 70% at every step of both sequence design and construction.3 It would still be hard, but you might succeed.
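The arithmetic behind this is just compounding probabilities. A back-of-envelope sketch (the 10-step count is my assumption, not a claim about any real protocol):

```python
# Sketch: why modest per-step improvements matter in a multi-step process.
# Assumption (mine): a 10-step protocol with independent step outcomes.
def overall_success(per_step_rate: float, steps: int = 10) -> float:
    """Probability that every step of a multi-step protocol succeeds."""
    return per_step_rate ** steps

print(f"30% per step: {overall_success(0.30):.6f}")  # effectively ~0%
print(f"70% per step: {overall_success(0.70):.6f}")  # a few percent per attempt
```

At 30% per step the end-to-end success rate is a few in a million; at 70% it is a few percent per attempt, which is survivable if you can retry.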
Currently, if you do this with a closed-source model (and Google decides to implement those damn input-output classifiers), particularly combined with an IP address from a “hostile nation”, the AIs will refuse to answer. For open-source models, like Kimi K2.5, no such restrictions apply.
Even when folks have spent years trying to implement tamper-proof “unlearning techniques,” or to add safeguards to models they plan to open-weight (as Meta claims they do), it is trivially easy to strip these away. The Chinese labs barely bother.
Worse, training a bad model on the outputs of good closed-source ones can increase its overall capabilities (Kimi K2.5 does seem to have an identity crisis, confusing itself with Claude...). So even if the best models were closed-source, any sufficiently powerful open-source models would likely stay close behind the frontier, e.g., through “distillation” on those outputs.
This suggests that we should be pretty worried. Because LLMs do provide uplift to humans doing realistic virology tasks, much biosecurity hardening will have to happen on later parts of the supply chain, or on the defensive side (e.g., the provision of PPE and similar).
There may be a sharp transition where designing such systems goes from “completely impossible” to “very easy” once per-step reliability reaches a critical point.
Policymakers should prioritize convincing or coercing the Chinese AI companies to stop open-weighting their models. That’ll be the topic of the next post.
Thank you to Antoine Vigouroux for helpful feedback.
1. (h/t Antoine) Probably this doesn’t increase the success rate that much -- many things fail on the first attempt when making something new -- and a university lab also has other advantages, like additional equipment for verification and troubleshooting.
2. Like Gemini 3 Pro, for example, which has quite weak biology filters -- it was the only LLM I could consult in the writing of this post -- those Nobels maybe do come at a cost...
3. (h/t Antoine again) For example, even if an AI tool can generate a DNA-binding protein with a 90% success rate, it will still be hard to make an entire virus with many interactions that all need to work, especially because you must also avoid undesired interactions between parts: the number of interactions scales as the square of the number of parts.
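That quadratic scaling can be sketched numerically. A toy model (all numbers and independence assumptions are mine): each of n parts works with probability 0.9, and each of the n(n-1)/2 pairwise interactions behaves as desired with probability 0.9.

```python
# Toy model of footnote 3: design success vs. number of parts.
# Assumptions (mine): independent parts and independent pairwise
# interactions, each succeeding with probability 0.9.
def design_success(n_parts: int, p_part: float = 0.9, p_pair: float = 0.9) -> float:
    n_pairs = n_parts * (n_parts - 1) // 2  # interactions grow quadratically
    return (p_part ** n_parts) * (p_pair ** n_pairs)

for n in (2, 5, 10):
    print(f"{n} parts: {design_success(n):.4f}")
```

With 2 parts the design mostly works; by 10 parts (45 pairwise interactions), the success probability is well under 1%, even at 90% reliability everywhere.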