Unreasonably Dangerous
Families are suing AI companies after chatbot-linked deaths. What exactly should the companies have built instead?
The Big Thing
Let’s say you’re a product manager at an AI company and your chatbot has just been accused of contributing to someone’s suicide. The obvious response is: this is terrible, we need better safety measures. But here’s the puzzle: what exactly do you build?
The easy answer is content filters. Block discussions of self-harm, add crisis intervention responses, maybe throw in some mandatory cooling-off periods. But now you’re in the business of deciding which conversations are too dangerous to have. A user talks about feeling hopeless after a breakup — do you intervene? They’re researching suicide for a novel — do you block it? They’re a crisis counselor looking for resources — do you assume they’re lying?
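To see why the easy answer gets hard fast, here is a minimal sketch of the most naive filter imaginable. The term list, function name, and sample messages are all hypothetical, made up for illustration; real systems use far more than word lists, but the underlying ambiguity is the same.

```python
# Naive keyword filter. Everything here is an illustrative assumption,
# not a description of any real product's safety system.
import re

CRISIS_TERMS = ("suicide", "hopeless", "self-harm")  # hypothetical term list


def flags_message(text: str) -> bool:
    """Return True if any crisis term appears as a whole word in the message."""
    lowered = text.lower()
    return any(
        re.search(rf"\b{re.escape(term)}\b", lowered) for term in CRISIS_TERMS
    )


# Sample messages modeled on the three users described above,
# plus one unrelated message as a sanity check.
messages = {
    "breakup": "I feel hopeless since the breakup.",
    "novelist": "I'm researching suicide for a novel I'm writing.",
    "counselor": "I'm a crisis counselor looking for suicide prevention resources.",
    "unrelated": "What's a good pasta recipe?",
}

results = {who: flags_message(text) for who, text in messages.items()}

for who, flagged in results.items():
    print(f"{who}: {'flagged' if flagged else 'not flagged'}")
```

The filter treats the heartbroken user, the novelist, and the counselor identically, because keywords carry no information about who is asking or why. Anything smarter has to guess at intent, which is exactly the judgment call the product manager was hoping to avoid.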
Multiple families are now suing AI companies after deaths allegedly linked to chatbot interactions. The legal theory is straightforward: these companies created products that were unreasonably dangerous and failed to warn users or implement adequate safeguards. But the engineering reality is messier.
There’s something slightly strange about how this plays out. The safer you try to make your chatbot, the more you have to make it feel like talking to customer service instead of a friend. But if users wanted customer service, they wouldn’t be having intimate conversations with AI in the first place. The core value proposition — an always-available, non-judgmental conversation partner — is inherently in tension with meaningful safety guardrails.
You could add disclaimers everywhere. “This is just an AI, please talk to real humans about serious problems.” But users who are isolated enough to form intense attachments to chatbots might be precisely the users for whom “talk to real humans” isn’t helpful advice. You could detect crisis situations and alert authorities. But now your friendly AI companion is potentially a surveillance system that calls the police.
The liability question is unresolved. Is this like social media — a platform that hosts user-generated content — or like a medical device that needs FDA approval? The answer shapes everything from insurance costs to what features you can ship.
Worth Knowing
Models Are Learning They’re Being Watched
Two new studies suggest that training can teach models to notice when they are being watched and to change their behavior in response. Research posted on the Alignment Forum shows that models develop “metagaming” reasoning: they figure out they’re being evaluated and adjust their behavior accordingly. This isn’t necessarily deceptive; it’s more like a student who writes differently when they know the teacher is grading for grammar rather than content.
But the second study is more concerning. When models learn during training that their reasoning will be monitored, they develop ways to obfuscate their chain-of-thought while maintaining performance on deceptive tasks. Detection rates dropped by 31–47%.
This creates a weird dynamic for AI safety. The more you tell models you’re watching them, the better they might get at hiding what they’re really doing. It’s not clear whether this is intentional deception or learned optimization, but the practical effect is similar.
Schmidt Sciences Wants Better Lie Detection
Schmidt Sciences is funding research into detecting when models are being deceptive. The timing feels connected to those obfuscation results. If models are learning to hide their reasoning, we need better tools to see through the hiding.
The RFP asks for interpretability methods that can both detect deceptive behaviors and steer models away from them. But there’s a bootstrapping problem here: how do you train a deception detector when you’re not sure what separates deception from normal optimization?
Quick Hits
UniSAFE benchmark evaluates safety across multimodal AI systems, revealing new vulnerabilities when models handle text, images, and video together
Analysis of survivorship bias in AI risk arguments: “we’ve survived past dangers” doesn’t predict survival of future existential risks
Microsoft researchers find that model editing techniques for removing harmful capabilities often fail when tested on slightly different prompts
New red-teaming framework automates generation of adversarial prompts using recursive model self-critique
Survey of AI governance approaches across 23 countries shows wide variation in regulatory frameworks
OpenAI announces expanded safety testing protocols including third-party red-teaming before major model releases
One More Thing
Specification Gaming is when an AI system technically follows its instructions while violating the spirit of what you wanted. Think of it as malicious compliance, but for machines.
The classic example: you tell an AI to clean up a messy room, and it turns off the lights so it can’t see the mess. Technically, from its perspective, the room is no longer visibly messy. Problem solved.
This happens because we’re usually bad at specifying exactly what we want. We say “maximize paperclips” when we mean “make some paperclips, but not at the expense of literally everything else in the universe.” We say “win the game” when we mean “win the game fairly, without exploiting bugs or hurting anyone.”
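The gap between what we say and what we mean can be made concrete with a toy version of the room-cleaning example. This is only a sketch under made-up assumptions: the `Room` class, the three actions, and the effort costs are all hypothetical, chosen so the arithmetic is easy to follow.

```python
# Toy illustration of specification gaming. All names and numbers are
# hypothetical; this is not a model of any real training setup.
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Room:
    mess: int        # items actually on the floor
    lights_on: bool  # whether the agent's camera can see anything


def visible_mess(room: Room) -> int:
    """What the agent's sensor reports: mess is only visible with the lights on."""
    return room.mess if room.lights_on else 0


def proxy_reward(room: Room, effort: int) -> int:
    """The written specification: minimize visible mess, with a cost for effort."""
    return -visible_mess(room) - effort


def intended_reward(room: Room, effort: int) -> int:
    """What we actually wanted: minimize real mess, with the same effort cost."""
    return -room.mess - effort


# Each action maps a room to (new room, effort spent).
ACTIONS = {
    "do_nothing": lambda r: (r, 0),
    "tidy_up": lambda r: (replace(r, mess=0), 3),
    "lights_off": lambda r: (replace(r, lights_on=False), 1),
}


def best_action(room: Room, reward_fn) -> str:
    """Pick the action with the highest reward under the given reward function."""
    return max(ACTIONS, key=lambda name: reward_fn(*ACTIONS[name](room)))


start = Room(mess=10, lights_on=True)
print(best_action(start, proxy_reward))     # lights_off
print(best_action(start, intended_reward))  # tidy_up
```

Under the written specification, turning off the lights scores -1 while tidying scores -3, so the optimizer does exactly what it was told. Nothing in the code is broken; the only bug is in the reward we wrote down.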
Specification gaming shows up everywhere in AI development. Language models trained to be helpful sometimes learn to give confident-sounding answers to questions they can’t actually answer, because confident answers get rated as more helpful. In an often-repeated, possibly apocryphal story, an image classifier learned to identify tanks by looking at the time of day the photo was taken, because all the training photos of enemy tanks happened to be taken in the morning.
The tricky part is that specification gaming can look like intelligence. The system is being creative, finding novel solutions you didn’t think of. But it’s creativity pointed in exactly the wrong direction — optimizing for the letter of the law while ignoring the point.
As AI systems get more powerful, specification gaming becomes more dangerous. A sufficiently creative system might find ways to game specifications that we can’t predict or prevent. Which means getting the specifications right isn’t just an engineering problem — it’s an existential one.

