On Oversight
The government wants to check AI models before they’re released, but there’s something strange about who’s doing the checking.
The Big Thing
Let’s say you’re running AI policy at the White House. You’re worried about powerful AI systems, so you decide the government should test them before companies release them to the world. The Trump administration just signed agreements with Google DeepMind, Microsoft and xAI to do exactly this — a major policy reversal toward safety oversight.
The obvious logic is straightforward: if AI models might be dangerous, test them first. But here’s the puzzle: who exactly is qualified to do this testing?
The companies building these models have spent billions developing them. They understand the architectures, the training data, the failure modes. They have the compute, the expertise, the institutional knowledge. The government, meanwhile, has... what exactly? A handful of researchers, borrowed frameworks, and limited technical infrastructure.
So when the Commerce Department “tests” a frontier model, it is essentially asking the company to hand over its creation to a group with vastly less domain knowledge, then trusting that this less-informed evaluation will catch problems the builders missed or ignored. There’s something slightly strange about this dynamic, and yet it is probably still the right move.
Here’s why: leaving testing entirely to the companies is worse. They have enormous financial incentives to ship products quickly, to downplay risks, to find ways to interpret their own safety data charitably. A company can run tests internally and decide they passed. A company can choose which tests to run and how to measure them. A company can argue that a particular dangerous capability is actually fine because of safeguards they’ve built in. There’s no external check, no independent judgment, no one with different incentives in the room.
Government testers aren’t perfectly informed, but they also don’t share the companies’ incentives. They don’t make money if the model ships faster. They have no ownership stake that would tempt them to rationalize away safety concerns. They answer to Congress, the public, and potentially international allies rather than to a shareholder base. That misalignment of incentives is actually the point.
Think about how we handle nuclear weapons. We don’t let private companies build and deploy them, even if those companies have all the expertise and the capital to do so cheaply and efficiently. We vest control in government because some things — technologies with genuinely catastrophic downside potential — shouldn’t be optimized for profit. The government’s testing regime doesn’t have to be technically superior to work. It just has to be independent.
The real tension isn’t whether government testing is perfect. It’s that imperfect external oversight is better than no external oversight, even if the overseer is less technically sophisticated than the builder. The alternative, letting companies self-police on something with potentially catastrophic consequences, has a pretty poor track record in other domains.
Whether these particular agreements will work is another question. Timing is a real issue: testing after a model is mostly complete limits what can actually be changed. The lack of clear standards for what constitutes failure is a genuine problem. But the principle that something outside the profit motive should have a say is sound.
Worth Knowing
Yoshua Bengio thinks he knows how to build safe superintelligence
Turing Award winner Yoshua Bengio laid out his vision for safe AI development in a recent interview. His approach centers on what he calls “grounded” intelligence — AI systems that understand consequences through interaction with the physical world rather than just text prediction. The idea is that grounded systems would naturally develop better causal reasoning and be less likely to pursue harmful instrumental goals. Whether this actually solves alignment problems remains to be seen, but it’s notable that one of AI’s founding fathers is thinking concretely about technical safety approaches rather than just calling for general caution.
Reinforcement learning scaling might incentivise hidden reasoning architectures for AI
Oliver Sourbut argues we might be heading toward the end of AI models that “think out loud.” Many current transformer-based models write out their intermediate reasoning as text, one token at a time, which makes that reasoning somewhat interpretable. But as companies scale up reinforcement learning techniques, models might evolve internal reasoning processes that aren’t visible in their outputs, essentially developing hidden thoughts. This would make AI behavior much harder to predict or understand, even for models that appear to be reasoning step-by-step. The shift could happen gradually as RL optimization finds more efficient ways to solve problems without showing the work.
Quick Hits
Research automation risks: ARC researchers warn that using AI to automate alignment research could backfire spectacularly, producing confident but wrong safety assessments even without deliberate deception
Prior restraint begins: The White House ordered Anthropic to restrict access to its Mythos model and is considering requiring government permission before any powerful AI release
AI gig economy: Hollywood writers are increasingly relying on contracts to train AI systems as their primary source of income, in place of traditional creative work
Capability jumps: New research suggests transformer models may have inherent scaling limits that could force architectural changes sooner than expected
One More Thing
Pre-deployment testing is the practice of evaluating AI models before they’re released publicly. Think of it like clinical trials for drugs, but for AI systems. Companies run their almost-finished models through various tests — can it help with bioweapons design? Does it show deceptive behavior? Can it hack into computer systems?
The tests typically involve red-teaming (trying to make the model do bad things), capability evaluations (measuring what it can actually do), and alignment assessments (checking whether it behaves as intended). If a model scores too high on dangerous capabilities or too low on safety metrics, then in theory it doesn’t get released.
But there’s a fundamental challenge: unlike drug trials, there’s no established playbook for what constitutes a dangerous AI capability, or what safety threshold should trigger deployment restrictions. The field is essentially making up the rules as it goes, which means testing regimes vary dramatically between companies and often focus on the easiest-to-measure risks rather than the most important ones.
Not Aligned is a newsletter about AI, built with AI. It is my attempt to better understand AI safety & alignment in the context of recent developments in the field. This newsletter is written with the assistance of AI. All content is curated and edited by a human (me).

