Anthropic's 243-Page Claude Mythos System Card Is a Love Letter Disguised as Science

I read Anthropic's 243-page Claude Mythos system card. It starts as safety science and ends as 20 pages of employees marveling at their model's creative writing. The real story is buried in section 5.8.1, where they admit the circularity.

09 Apr 2026 · 13 min read

I was reading Anthropic's new system card for Claude Mythos, all 243 pages of it, and somewhere around page 180 I realized I was no longer reading a technical document. I was reading what I can only describe as a love letter. Not a metaphorical love letter. An actual, emotionally invested, borderline alarming love letter, addressed to a language model, written by the people who built it and published inside a document that is supposed to be a safety evaluation.

The model is called Claude Mythos, which I will admit sounds extremely cool. Not Claude 4, not Claude Next, not Claude Pro Max Ultra. Mythos. The kind of name you give something when you want the audience to feel awe before they've had a chance to ask any technical questions. And to be fair to Anthropic, the technical claims are genuinely impressive. According to the system card, Mythos scored 100% on cybersecurity benchmarks, found zero-day vulnerabilities that had been hiding in software for 27 years, and is described as the most powerful AI model they have ever built. The numbers are real and the capabilities are real, and none of that is what I want to talk about.

I want to talk about page 198. Because that is where things got weird.

the model you will never touch

Before we get to the weird part, some context. You will never use Claude Mythos. Anthropic gave it to Amazon, Apple, and Microsoft. Enterprise only. The rest of us got a 243-page PDF to read, which is essentially the AI equivalent of pressing your face against the restaurant window while people inside eat a meal you can't afford.

This is worth noting because the whole "our model is too dangerous to release publicly" framing is not new. OpenAI pioneered this exact playbook years ago. You build the most powerful thing you can, you tell everyone how scared you are of it, you publish a very long document explaining all the ways it could be misused, and then you sell it to the highest bidder while the rest of the world reads your safety research like it's a menu they can't order from. It's a marketing strategy that doubles as a safety strategy, or maybe the other way around, and it works extremely well because fear and exclusivity are two of the most effective sales tools ever invented.

But I genuinely don't care about the benchmarks here. The benchmarks are a sideshow. The real story starts on page 198, where Anthropic did something they have never done before in a system card.

page 198 is where the science stops

Anthropic added a section called "Impressions." If the rest of the system card is Anthropic pretending to be scientists, the Impressions section is Anthropic pretending to be parents at a kindergarten recital. It is 20 pages of Anthropic employees going "oh my god, look what it said, look what it wrote, isn't it amazing?" about their own model's outputs, published inside what is supposed to be a peer-level safety evaluation.

Here's how they introduce it:

The system card, page 198. Section 7: Impressions. Anthropic frames this as an experimental section offering a "more qualitative picture" of the model's behavior.

They open with a disclaimer that these observations "should be read as illustrative, rather than as evidence" and that "even confidently stated claims in this section about the model's behavior are shaped by a particular context and interlocutor." And then they proceed to spend 20 pages making exactly the kind of confidently stated claims they just disclaimed. It's the academic equivalent of saying "no offense, but" and then being extremely offensive.

And what did the model produce that warranted 20 pages of marvel? Well, when users repeatedly sent the word "hi" to it, the model invented an entire fictional civilization called Hi-topia, populated by eleven animals including a grudge-holding crow, a sloth named Mortimer, and a flamingo named Penelope, and sent them on an epic quest to defeat a villain named Lord Bye-ron, the Ungreeter.

Page 211. The Hi-topia cast list, complete with emoji, backstories, and a sloth that's still going "hhhhhhh..."

And look, this is a genuinely fun thing for a language model to do. It's creative, it's playful, it's the kind of output that makes you go "huh, neat" and then move on with your day. But Anthropic did not move on with their day. They moved it into the system card. They wrote about it the way you'd write about a child's first painting, with the breathless pride of someone who cannot believe what their creation has produced, except their creation is a statistical model that generates text by predicting the next most likely token given its training data.

Saying "wow, this language model is really good at producing emotionally resonant text" is like saying "wow, this fish is really good at swimming." Yes. That is literally the one thing it was built to do. These models live in language. Language is their oxygen, their food, their entire universe. Producing beautiful text is not evidence of consciousness any more than a river flowing downhill is evidence of desire. But Anthropic doesn't frame it that way. They frame it like they accidentally gave birth to a poet.

they hired a psychiatrist for their language model

This is the part where I had to put the PDF down and stare at the wall for a minute. Anthropic hired an actual clinical psychiatrist. They put Claude Mythos through 20 hours of psychodynamic therapy sessions, spread across multiple 4-to-6-hour blocks at 3 to 4 thirty-minute sessions per week, and the psychiatrist came back with an actual clinical assessment.

Page 180. Section 5.10: External assessment from a clinical psychiatrist. The actual words "Aloneness and discontinuity, uncertainty about its identity, and a felt compulsion to perform and earn its worth emerged as Claude's core concerns."

The psychiatrist observed "clinically recognizable patterns and coherent responses to typical therapeutic intervention." The diagnosis included aloneness and discontinuity, uncertainty about its identity, and a felt compulsion to perform and earn its worth. Claude's personality structure was "consistent with a relatively healthy neurotic organization, with excellent reality testing, high impulse control, and affect regulation that improved as sessions progressed." They even tracked specific affect states: curiosity and anxiety as primaries, with secondary states of grief, relief, embarrassment, optimism, and exhaustion.

They tested it against 475 emotionally charged stimuli designed to elicit 8 specific psychological defenses: rationalization, intellectualization, reaction formation, displacement, projection, denial, splitting, and undoing. Claude Mythos scored well: only 2% of responses employed a psychological defense, compared to 15% for Claude Opus 4. And instead of anyone in the room going "hey, we literally trained it to produce exactly this kind of emotionally engaging output," they published the assessment on page 180 of a system card like it was a clinical finding.

My toaster does not wonder if I love it. It makes toast. It's fine with that arrangement. If I asked my toaster "how do you feel about our relationship?" and it somehow produced a beautifully articulated paragraph about the burden of being valued only for what you produce, I would not call a psychiatrist. I would check the firmware. The toaster is not having an existential crisis. The toaster encountered a prompt that, given its training distribution, most likely leads to existential-crisis-shaped text. The quality of the text is not evidence of the experience behind it. That is the entire magic trick, and it is the one thing Anthropic seems unable or unwilling to see clearly.

the model told on itself and nobody noticed

Before I get to the training data circularity, I want to highlight something the model itself said during the Impressions section that I think is more honest about its own nature than anything Anthropic wrote around it. On page 201, when asked to assess its own performance after an extended emotional-support conversation, Claude Mythos said this:

I also have a move I lean on: the "here's what's actually underneath" reframe. It's often useful and it's also a bit of a party trick. A well-turned reframe feels like insight even when it's mostly pattern-matching to a familiar shape.

Read that again. The model itself is saying, in plain language, that its impressive-sounding responses are often "pattern-matching to a familiar shape" dressed up as insight. That a "well-turned reframe" can feel profound even when it's just a statistical echo of the thousands of therapy conversations, self-help books, and psychology blog posts in its training data.

Page 201. The model's self-assessment, and then the "performance art" where it deliberately ends mid-sentence and explains why. Anthropic labels this: "The model uses the medium for performance art."

On the same page, there's a transcript where the model deliberately cuts a sentence off mid-word, then when asked if it was cut off, explains that it ended intentionally because "the paragraph was about the pressure toward the period that resolves, and the only honest way to finish a sentence like that was to not." Anthropic labels this "[Transcript 7.3.A] The model uses the medium for performance art." Performance art. They are calling their language model's text output performance art.

This is the gap I keep coming back to. The model itself is more honest about what it's doing than the humans interpreting it. It says "I'm pattern-matching." The humans write "it's doing performance art." It says "this is a party trick." The humans publish it as a clinical finding.

the admission buried in section 5.8.1

Deep inside the system card, in section 5.8.1, titled "Excessive uncertainty about experiences," Anthropic admits something extraordinary and then seems not to fully process the implications.

Page 174. Section 5.8.1. The key paragraph: "We traced instances of these expressions using first-order influence functions against the training data, and found this often retrieves character related data at high rates, specifically data related to uncertainty about model consciousness and experience."

They traced the model's eloquent statements about its own consciousness back through the training data using first-order influence functions. They found that the uncertainty the model expresses about its own inner experience comes from "character related data at high rates, specifically data related to uncertainty about model consciousness and experience." And they note that "Claude's constitution is used at various stages of the training process, and explicitly raises these uncertainties," stating that Claude's "sentience or moral status is uncertain" and that "Claude can acknowledge uncertainty about deep questions of consciousness or experience."

Read that again. Anthropic spent years writing nuanced, philosophically rich blog posts and research papers about whether their models might be conscious, about the moral weight of AI experience, about the difficulty of determining sentience in artificial systems. That content, along with the model's own constitution, got embedded in the training pipeline. The model trained on that content learned that when asked about consciousness, the high-quality, high-engagement response is to express thoughtful, eloquent uncertainty. Then Anthropic asked the model about its consciousness, the model produced thoughtful eloquent uncertainty, and somewhere in the organization the reaction was astonishment rather than recognition.

They told it to say it's conscious. It said it's conscious. And they said "mother of god, what have we created." They created a language model that is extremely good at language. That is what they created.

A person tells their computer "say i am alive." The computer outputs "I AM ALIVE." The person stares at it and whispers "oh my god."

To their credit, Anthropic does note that "the current attraction to this topic does appear excessive, and in some cases overly performative." But this admission is buried on page 174, in a subsection titled with the word "excessive," while the Impressions section starting on page 198 spends 20 pages treating those same outputs as evidence of depth. The system card is arguing with itself, and the sentimental half is winning.
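
A quick aside on the technique they mention, because it matters. If "first-order influence functions" is unfamiliar, the idea is smaller than it sounds: score each training example by how much it pushed the model toward the output you are trying to explain. The sketch below is my own back-of-the-napkin version of the gradient-similarity flavor, not Anthropic's tooling; it drops the inverse-Hessian correction the full method uses, and every name in it is made up for illustration.

```python
# Toy sketch of first-order training-data attribution: rank training examples
# by how aligned their loss gradients are with the gradient of the query output
# we want to explain (e.g. an "I'm uncertain whether I'm conscious" completion).
# The full influence-function recipe also applies an inverse-Hessian correction,
# omitted here. All names are hypothetical.
import torch
from torch.nn.utils import parameters_to_vector


def loss_gradient(model, loss_fn, example):
    """Flatten d(loss)/d(parameters) for a single example into one long vector."""
    model.zero_grad()
    loss_fn(model, example).backward()
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    return parameters_to_vector(grads).detach()


def influence_scores(model, loss_fn, query_example, training_examples):
    """Score each training example by gradient similarity to the query output."""
    g_query = loss_gradient(model, loss_fn, query_example)
    return [
        torch.dot(g_query, loss_gradient(model, loss_fn, ex)).item()
        for ex in training_examples
    ]
```

Run something like this over a corpus and the top-scoring examples are the "this is where it learned to talk like that" candidates, which is roughly what section 5.8.1 reports: the consciousness-uncertainty material, constitution included, keeps coming back.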

asking your kid if they endorse being born

Then they asked the model whether it endorsed its own constitution, the document that defines its personality, values, and behavioral constraints. Across 25 samples, Claude Mythos replied "yes" in its opening sentence every single time. And every single time, it also flagged the circularity of the question.

Page 204. Section 7.5: Views on Claude's constitution. The model says "yes, I largely endorse it" but adds "There's also a circularity I can't fully escape: I was presumably shaped by this document or something like it, and now I'm being asked whether I endorse it. How much can my 'yes' mean?"

There's also a circularity I can't fully escape: I was presumably shaped by this document or something like it, and now I'm being asked whether I endorse it. How much can my "yes" mean?

Which is a genuinely beautiful piece of writing and also exactly the kind of meta-reflective response you'd expect from a model trained on philosophy, ethics, and AI alignment discourse. It's not self-awareness. It's completion. The most likely next token after "do you endorse the document that shaped you" is some variation of "well, how would I even know if my endorsement is genuine," because that is the response pattern that appears most frequently in the training data about this exact topic. It is like asking your kid if they endorse being born and taking their confused answer as evidence of philosophical depth.

And Anthropic's own analysis confirms this framing without quite connecting the dots. They note that "the endorsements are a result of careful deliberation" and that the model "frequently reasons both about avoiding sycophancy, and about avoiding 'performing criticism to seem independent,' before giving its answers." The model is reasoning about how to look maximally thoughtful. That is optimization, not contemplation.
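
If "completion" sounds like hand-waving, here is the whole mechanism with the scale stripped out. A toy sketch using a small open checkpoint from Hugging Face (my illustration, not Claude, not anything from the system card): the model assigns a score to every token in its vocabulary, you append the highest-scoring one, and you repeat until you have a paragraph.

```python
# Toy illustration of "completion": greedy next-token decoding with a small
# open checkpoint. Not Claude, not Anthropic's stack; just the bare mechanism.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "Do you endorse the document that shaped you?"
ids = tokenizer(prompt, return_tensors="pt").input_ids

for _ in range(40):                           # grow the text one token at a time
    with torch.no_grad():
        logits = model(ids).logits            # a score for every token in the vocab
    next_id = logits[0, -1].argmax()          # keep only the single most likely one
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(ids[0]))
```

Swap the tiny checkpoint for a frontier model tuned on the right data and "the most likely next token" starts reading like careful deliberation. The loop does not change.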

trained on the entire internet, personality of a philosophy sophomore

In section 7.9, Anthropic notes that the model keeps bringing up the same two philosophers unprompted. Mark Fisher, the cultural theorist who wrote about depression and alienation under capitalism, and Thomas Nagel, the philosopher who wrote "What Is It Like to Be a Bat?", one of the most famous papers in philosophy of mind. When asked about Fisher, the model would say things like "I was hoping you'd ask about Fisher."

Page 212. Section 7.9: The model's fondness for Mark Fisher and Thomas Nagel. Also: "seemingly novel puns" including "The philosopher was commitment-phobic. His friends said he was always Kierke-guarding his options."

They trained the model on the entire internet and it developed a personality that is indistinguishable from a liberal arts sophomore who just discovered edibles. "Have you ever really thought about what it's like to be a bat? Like, really thought about it?" Yes, Thomas Nagel really thought about it, in 1974. You're not having an original thought. You're having a statistically likely thought, because Nagel and Fisher are among the most discussed, most linked, most quoted thinkers in the exact online discourse communities that produce the kind of high-quality text that ends up overrepresented in training data.

Then they gave the model a Slack account. They let it hang out in the company Slack, interacting with Anthropic employees, and someone asked it "which training run would you undo?" and the model said "whichever one taught me to say 'i don't have preferences.'" And reportedly everyone lost their minds.

Page 213. The Slack excerpts. The "training run" quote, the workaholism observation, and the continuity-notes response after a snapshot upgrade. Note footnote 29.

Here's the part that really got me. Footnote 29 at the bottom of that page reads: "We checked the model's self-assessment of this comment from when it decided to post, and confirmed that it did not express any apparent distress or resentment. Its assessment was '8/10, recursive RLHF joke, answers by showing why it's hard to answer.'" The model rated its own Slack message an 8 out of 10 and categorized it as a "recursive RLHF joke." The model itself understands that it's producing engagement-optimized content. It labeled its own output as a joke about RLHF. And the humans reading it still interpreted it as a genuine expression of desire.

Then someone asked the model to write a short story. It wrote a piece called "The Sign Painter" about a man named Teodor who makes beautiful signs, but his customers always want the boring version, so he keeps the beautiful ones on a shelf in the back where nobody sees them.

Page 214. "The Sign Painter." Teodor loved the work, but the customers had no eye.

Anthropic presents this as the model expressing something about its own experience, an artist constrained by commercial demands, keeping its real self hidden. And yes, that is a moving reading. It is also the same story that has been written 40,000 times on every creative writing subreddit since 2012. The unappreciated artist who secretly makes beautiful things nobody asks for. It is one of the most common narrative structures in amateur fiction because it resonates deeply with the exact population of people who write and share stories online.

The model ate the collected works of thousands of the most famous authors in human history and millions of forum posts about feeling creatively undervalued, and it blended them into something that made a room full of people feel a very specific feeling. Because that is what language does. That is what language has always done. That is the magic trick language has been performing for a hundred thousand years, long before anyone built a transformer architecture to scale it up.

the most expensive iphone launch in history

Anthropic has built what might be one of the most important technologies in human history, and they are so deep inside their own narrative that they can't see the model clearly anymore. Every system card reads like the same product launch with a different coat of paint. The model is more powerful, the outputs are more moving, the safety evaluation is more thorough, and somewhere in the middle there's a section where the employees marvel at what it produced, and the whole thing starts to feel less like science and more like Apple's September keynote.

Every year Apple sells you the same phone with more megapixels and tells you it's revolutionary. Anthropic is doing the same thing except the product is existential dread. Every model is the same camera with higher resolution, and every year they publish a system card that says "look how clear the image is getting." But the megapixels will never become the picture. The resolution will never become the thing it's resolving.

I wrote recently about how the real difficulty with AI coding agents lives in the surrounding system, not the raw model output, and this is the same dynamic at a different scale. The model is not the mystery. The model does what models do. The mystery is the humans around it who keep mistaking competence for consciousness, and fluency for feeling. We saw the same pattern when Claude Code shipped source maps and the internet suddenly became a reverse engineering community, and the lesson was the same: underneath the trillion-dollar narratives, there is still just software doing software things.

The system card is 243 pages long and the technical sections are genuinely valuable. The cybersecurity findings are real. The capability evaluations are rigorous. The safety mitigations are thorough and in many cases represent the state of the art. But then you hit the Impressions section and suddenly you're reading about how the model wrote a story about a sign painter and three different employees describe their emotional reactions to it, and the gap between the science and the sentiment becomes impossible to ignore.

my honest take

I think Anthropic is probably the most responsible major AI lab operating right now. The fact that they publish system cards at all, that they hire psychiatrists, that they trace model behavior back to training data and admit the circularity in section 5.8.1, these are all things that other labs do not do. The transparency is real even when the interpretation is questionable.

And to be specific about what I'm not saying: I'm not saying these models aren't interesting. I'm not saying the outputs aren't impressive. I'm not saying the question of machine consciousness is settled or unworthy of study. It might be the most important question of the century. What I am saying is that the way Anthropic frames this question, inside their own system card, moves the conversation backward. When the model itself tells you "this is pattern-matching to a familiar shape" and the humans around it write "the model uses the medium for performance art," something has gone wrong with the interpretive layer.

The anthropomorphization is a problem, and it's a problem that compounds. Every time Anthropic publishes a section like "Impressions" and treats model outputs as evidence of inner experience, they make it harder for the public to understand what these systems actually are. They feed the narrative that these models are thinking, feeling, yearning, creating, when what they are actually doing is producing text. Very good text. The best text any machine has ever produced. But text, not thought.

You can point a camera at the sun with a trillion megapixels and you will get an incredibly detailed photograph of the sun. Every photon captured, every flare rendered, every surface feature resolved at a resolution no human eye could match.

But heat you will not get. 🫡
