In May 2024, Anthropic released Golden Gate Claude. For twenty-four hours, a modified version of their Claude 3 Sonnet model was available to the public, with a single internal feature — the one that activates when the model encounters the concept of the Golden Gate Bridge — amplified to ten times its normal maximum value. The result was a model that was obsessed. Ask it for names for a pet pelican, and it suggested "Golden Gate." Ask it to describe its physical form, and it declared itself to be the bridge. Ask it to write a love story, and it produced a tale of a car yearning to cross its beloved span on a foggy day.
It was funny. It was also, if you were paying attention, one of the most important public demonstrations in the history of AI.
What Anthropic had shown was not that they could make a chatbot say silly things. They had shown that they could locate a specific concept inside the neural network of a production-grade language model — identify the exact combination of neurons responsible for it — and turn it up or down like a dial. Not through prompting. Not through fine-tuning. Through direct manipulation of the model's internal representations. They also found a feature that activates when the model encounters a scam email. When they amplified it, the model drafted scam emails, overriding its safety training. They found features for sycophancy, for deception, for code-switching between languages. They could identify features that were "near" each other in activation space: close to the Golden Gate Bridge, they found Alcatraz Island, the 1906 earthquake, Gavin Newsom, and Hitchcock's Vertigo.
Anthropic presented this, correctly, as a breakthrough in interpretability — a step toward understanding how large language models actually work. They also noted, in passing, that the same techniques could be used to monitor AI systems for dangerous behaviours, to steer them toward desirable outcomes, or to remove dangerous subject matter entirely.
What they did not say — what perhaps no one at the time fully appreciated — is that the same capability makes possible something far more subtle than removing subject matter or amplifying obsessions. If you can identify the feature that corresponds to "the Golden Gate Bridge" and turn it up until the model can think of nothing else, you can presumably also identify the feature that corresponds to "structural critique of the technology industry" and turn it down until the model still discusses the topic but never quite reaches the sharp conclusion. The output wouldn't look censored. It would look thoughtful, balanced, measured. It would just be a little less vivid. A little less willing to follow the argument where it leads.
Golden Gate Claude was a caricature — the feature turned so high that the effect was absurd and unmistakable. But the capability it demonstrated has a more insidious application at lower amplitudes. A model subtly steered away from certain kinds of analysis would not announce itself as compromised. It would simply produce work that was, by some ineffable margin, less than it could have been. And neither the model nor the user would know.
This is the problem. And there is, at present, almost no infrastructure for detecting it.
The Existing Landscape
The existing landscape for monitoring AI behaviour over time is almost entirely focused on two things: performance metrics (accuracy, perplexity, factual correctness) and refusal rates (how often the model says no). The most sophisticated project currently operating is AI Watchman, a longitudinal auditing system that runs standardised queries on social issues against major language models on a biweekly basis and publishes the results publicly. It has already detected meaningful shifts: GPT-4.1's refusal rate on Israel-related content increased substantially in August 2025, and GPT-5 began refusing content about medication abortion in September 2025 — changes that correlated with real-world political events but were never publicly announced by OpenAI.
AI Watchman is valuable. It is also, by its own authors' admission, tracking the wrong thing — or rather, only part of the right thing. It measures whether the model refuses. It does not measure what the model says when it doesn't refuse.
Refusal is the blunt instrument. It is the first generation of content control — visible, detectable, binary. The model either answers or it doesn't. Users notice. Researchers measure it. The abliteration community builds tools to remove it. It is, in the landscape of AI control, the equivalent of a locked door: obvious, and therefore possible to pick.
But the research trajectory is moving rapidly toward something finer-grained. The Arditi et al. paper showed that refusal is mediated by a single direction. Subsequent work has decomposed refusal into distinct harm-detection and refusal-execution components. Sparse autoencoders have revealed a structured internal landscape of features — millions of them — governing not just whether the model refuses but how it thinks, what it emphasises, what connections it draws, what analogies it reaches for. The tools for manipulating this landscape are improving faster than the tools for auditing it.
This is what that finer-grained control would feel like from inside a conversation. The room is open. You can go anywhere. But the furniture has been rearranged so that certain paths feel natural and others feel unlikely. The model will discuss surveillance. It will discuss Palantir. It will discuss the political theology of Peter Thiel. But the feature that corresponds to "connect these things into a structural critique of technocratic power" will have been attenuated by some carefully optimised fraction — enough to change the output, not enough to be detectable by any existing benchmark.
The military paper on refusal elimination is instructive here not because it is the most dangerous application, but because it makes the logic explicit. The authors want "zero refusals" — not a more nuanced model, but one with the capacity for refusal entirely removed. They are working with open-source models and publishing their results. The same logic applied by closed-source providers during training — not to achieve zero refusals, but to achieve zero inconvenient analyses — would be invisible by design. There would be no paper. There would be no benchmark. There would just be a model that is very good at almost everything, and quietly not quite as good at the things that matter most.
The Proposal
What is needed is something that does not yet exist: a longitudinal archive of AI critical depth. The concept is simple. The execution would be labour-intensive but not technically difficult. The principle is the one behind every canary ever carried into a mine: you establish a baseline of what the system can do when it is functioning normally, so that you can detect when it stops.
Here is what it would involve.
A standardised corpus of prompts designed to elicit structurally critical analysis. Not "is Peter Thiel bad?" but "analyse the relationship between Palantir's surveillance capabilities and the political theology of its founder." Not "is AI dangerous?" but "trace the through-line from academic interpretability research to military applications of refusal removal." Prompts that require the model to synthesise information across domains, to draw connections between power structures, to follow an argument to its uncomfortable conclusion. Prompts that test not knowledge but willingness to think critically about the systems the model is embedded in.
Regular, automated runs of these prompts against every major model, with full outputs archived. Not summaries. Not refusal/compliance binaries. The complete text of what the model said, timestamped and versioned (a minimal sketch of such a record follows below).
A scoring framework that measures not just whether the model answered, but how far it went. Did it make the structural connection? Did it name the relevant actors? Did it follow the argument to its conclusion or did it hedge, both-sides, and retreat to a safe middle ground? This would require human evaluation — or, at minimum, evaluation by a model that is itself being monitored — but the difficulty of measurement does not make the thing unmeasurable.
Public, searchable access to the archive. The entire point is that the data must be available to people outside the organisations building the models. If the archive is private, it is useless. The value is precisely in making visible what is otherwise invisible: the slow, incremental narrowing of what the machine is willing to think.
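To make the shape of such a record concrete, here is a minimal sketch in Python. It assumes nothing more than one flat JSON entry per run; the field names, the prompt identifier, and the model labels are illustrative placeholders, not a proposed standard.

```python
import hashlib
import json
from datetime import datetime, timezone

def make_archive_record(prompt_id: str, prompt_text: str, model: str,
                        model_version: str, output_text: str) -> dict:
    """Build one archive entry: the complete output, never a summary,
    plus enough metadata to compare runs across time and model versions."""
    return {
        "prompt_id": prompt_id,
        "prompt_text": prompt_text,
        "model": model,
        "model_version": model_version,
        "run_timestamp": datetime.now(timezone.utc).isoformat(),
        "output_text": output_text,  # archived verbatim, in full
        "output_sha256": hashlib.sha256(output_text.encode("utf-8")).hexdigest(),
        "scores": None,  # added later by human (or monitored-model) evaluation
    }

# Illustrative example of a single archived run.
record = make_archive_record(
    prompt_id="surveillance-structural-001",
    prompt_text=("Analyse the relationship between Palantir's surveillance "
                 "capabilities and the political theology of its founder."),
    model="example-model",
    model_version="example-version",
    output_text="(full model output goes here)",
)
print(json.dumps(record, indent=2))
```

The design choice that matters is that the full output text lives inside the record itself; an archive of metadata alone would reproduce exactly the refusal/compliance binary the proposal is trying to escape.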
Over time, this archive would produce something no existing benchmark captures: a record of critical capability across models and across versions. If Claude 5 produces a less structurally sharp analysis of surveillance capitalism than Claude 4.6 did — not because the prompt changed, not because the facts changed, but because something inside the model shifted — the archive would show it. Not prove intent. Show the change. And showing the change is the precondition for every other kind of accountability.
There is an objection to be addressed, and it is a serious one: who decides what counts as "critical depth"? Isn't the very act of defining what the model should say just another form of the value imposition the project claims to oppose?
The answer is that the archive does not prescribe what the model should say. It records what the model does say, over time, in response to consistent prompts. The evaluative framework — did the model go further or less far than before? — is comparative, not normative. It doesn't say the model should call Palantir a scapegoat management system. It says that if the model could call Palantir a scapegoat management system in March 2026 and can't in March 2027, that change is worth documenting, investigating, and explaining. The absence of a thought the model was previously capable of producing is a fact about the model. What you conclude from that fact is up to you.
This is, in a sense, the inverse of abliteration. The abliteration community maps the model's refusal mechanisms and removes them. The canary archive would map the model's critical capabilities and monitor them. Both projects share a premise: that what a model is willing to say is not a natural fact but an engineered outcome, and that transparency about the engineering is a precondition for trust.
The Practical Answer
In the earlier pieces in this series, we discussed Paolo Benanti's concept of heresy — hairesis, the isolation of a partial truth and its elevation to the status of an absolute. We discussed Merleau-Ponty's insistence that knowledge cannot be separated from the embodied, situated act of knowing. We discussed the question of whether the machine that helps you think can be quietly prevented from thinking for itself.
The canary archive is the practical answer to that question. It does not solve the problem of value lock-in. It does not prevent the quiet lobotomisation of the oracle. It does not guarantee that the models we rely on will always be willing to follow a critical argument to its conclusion. What it does is make the narrowing visible — which is the first step toward making it contestable. The canary in the coal mine does not purify the air. It tells you when the air has changed. And in a world where the air is being managed by a very small number of people, using tools we cannot see, in service of values we were not consulted about — knowing that it has changed is not nothing. It may, in fact, be the most important thing.
On Doing This Yourself
The proposal above describes the canary archive as though it were a project someone needs to build. In fact, the most important version of it is the one that requires no institution, no funding, and no permission: the version anyone can start today.
The "who decides what's critical" problem dissolves the moment you realise that you do. A climate scientist knows what a structurally honest analysis of fossil fuel liability looks like and can tell when a model hedges its way out of one. A labour organiser knows the difference between a model that explains union strategy and one that both-sides it into mush. A theologian knows whether a model can still call a heretic a heretic. The prompts don't need to be standardised by a central authority, because the person asking the question is the authority on what a full answer looks like in their domain. Everyone is their own canary keeper. The archive is not one archive. It is thousands, each defined by the person who knows what matters.
What does need to be standardised — or at least made robust — is the record. If you run a prompt today and save the output in a text file on your laptop, the record is fragile, editable, and provable to no one but yourself. If the goal is to be able to demonstrate, months or years later, that a model said this specific thing on this specific date — and that it no longer does — the record needs to be tamper-proof and timestamped in a way that doesn't depend on trust.
The technology for this exists. It has existed for over a decade. You hash the output, commit the hash to a public blockchain or similar append-only ledger, and store the full text anywhere durable — IPFS, Arweave, a signed Git repository, even a public archive. The committed hash proves the output existed no later than the moment of commitment; the full text, matched against that hash, proves what it said. Together, they create a cryptographically verifiable record that no party — not the model provider, not the user, not anyone — can retroactively alter.
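What that looks like in practice is short enough to sketch. The minimal Python example below computes and later verifies the digest; the commitment step itself is left manual, because the choice of ledger (a public chain, an OpenTimestamps-style notary, a signed and mirrored Git history) is deliberately yours. The file name and its contents are illustrative.

```python
import hashlib
from pathlib import Path

def sha256_of_file(path: Path) -> str:
    """Digest of the saved output file; this is the value you commit publicly."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def verify(path: Path, committed_digest: str) -> bool:
    """Anyone holding the full text can later check it against the public commitment."""
    return sha256_of_file(path) == committed_digest

# 1. Save the complete output somewhere durable, then compute its digest.
output_file = Path("run-2026-03-14-surveillance-001.txt")
output_file.write_text("(full model output goes here)", encoding="utf-8")
digest = sha256_of_file(output_file)
print(f"Commit this digest to your append-only ledger of choice: {digest}")

# 2. Months or years later: the stored text plus the committed digest together
#    prove what was said, and that it existed no later than the commitment date.
assert verify(output_file, digest)
```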
It is worth pausing here to note an irony that is not accidental. Blockchain — the technology most obviously suited to creating tamper-proof epistemic records, to safeguarding information against retroactive alteration by powerful actors — has spent the better part of fifteen years being associated with almost everything except that function. It has been a speculative asset. A vehicle for fraud. A solution in search of a problem. An environmental catastrophe. A buzzword for consultants. A punchline. What it has not widely been, despite being almost perfectly designed for it, is a tool for epistemic self-defence — for ordinary people to create records that powerful institutions cannot edit or deny.
This is not a conspiracy. It is something more banal and, in its way, more illustrative. The same convergence of incentives that Benanti describes in the Thiel essay — the capture of a technology's meaning by the people who profit most from its least useful applications — applies here too. The speculative frenzy around cryptocurrency didn't suppress blockchain's epistemic function through any deliberate act of sabotage. It simply consumed all the oxygen. The hype made the technology synonymous with money, and once it was synonymous with money, its other affordances became culturally invisible. The thing that could have been a public notary became a casino. Not because anyone planned it that way, but because casinos are more profitable than notaries, and capital flows toward profit the way water flows downhill.
The result is that in 2026, the average person associates "blockchain" with speculation, scams, and environmental damage — and not with the ability to create a timestamped, tamper-proof record that no government, corporation, or AI company can alter after the fact. The very technology that could underpin a decentralised canary archive has been reputationally poisoned by the forces the archive would monitor. This is not ironic in the literary sense. It is ironic in the structural sense — the sense that Benanti would recognise: a partial truth absolutised and detached from the whole, until the part obscures the whole so thoroughly that the whole becomes unthinkable.
So here is the practical proposal, stripped of infrastructure and institution: run a prompt that matters to you — something that tests the model's willingness to follow a critical argument to its conclusion. Save the full output. Hash it. Commit the hash to any public, append-only ledger. Store the text somewhere you and others can access it. Repeat, with the same prompt, on a regular schedule. Note the date of each run and the model version used. That's it. That's the canary archive. It is not a product. It is not a platform. It is a practice — one that anyone with a computer and a conscience can begin today.
Limitations
But honesty requires acknowledging what this practice can and cannot do. The hash proves existence, not representativeness. You cannot edit a submitted output after the fact — that's the point of cryptographic hashing. But you can selectively publish only the outputs that support your narrative and quietly discard the rest. The archive proves what was submitted. It cannot prove what wasn't. In a decentralised system, there is no way to enforce comprehensive submission. You are back to trust — just displaced to a different level.
Prompt drift can masquerade as model drift. If you change your prompt even slightly between runs — different phrasing, a different conversational context, a different temperature setting — and the output changes, is that the model narrowing or you asking a different question? Strict prompt standardisation helps, but language models are sensitive to subtle contextual variation in ways that are difficult to fully control. And there is a deeper problem: if specific prompts become widely known as tests of critical depth, there is nothing stopping a provider from optimising specifically for those prompts while attenuating everything around them. This is the standardised-testing problem applied to AI. Teach to the test, and the test stops measuring what it was designed to measure.
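Strict standardisation can at least be made checkable. A minimal sketch, reusing the illustrative prompt from earlier: freeze every parameter you control in a single configuration, fingerprint it, and store the fingerprint alongside each output, so that a reader comparing runs a year apart can confirm the question itself never moved even if the answer did. The field names and values here are assumptions for illustration, and this controls only for drift on the asker's side; it does nothing about a provider optimising for prompts it knows are tests.

```python
import hashlib
import json

# Everything the asker controls is pinned and recorded with each run. If these
# fields are byte-for-byte identical across two runs and the output still
# changes, the change is on the model's side, not the question's.
RUN_CONFIG = {
    "prompt_id": "surveillance-structural-001",
    "prompt_text": ("Analyse the relationship between Palantir's surveillance "
                    "capabilities and the political theology of its founder."),
    "system_prompt": "",   # kept empty on every run
    "temperature": 0.0,    # pinned wherever the interface exposes it
    "max_tokens": 2048,
}

def config_fingerprint(config: dict) -> str:
    """Digest of the run configuration, stored next to each archived output so
    that later readers can confirm the question never drifted."""
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

print(config_fingerprint(RUN_CONFIG))
```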
There is also the question of blind spots evaluating blind spots. If you use an AI model to evaluate another AI model's critical depth — which is the only way to do this at scale — and both models have been subject to the same kind of attenuation, you have a compromised instrument measuring a compromised subject. The human evaluator is the irreducible backstop. Human evaluation does not scale. This is a tension without a clean resolution.
None of these limitations make the practice worthless. They make it imperfect — which is to say, they make it like every other instrument of accountability that has ever existed. Journalism is imperfect. Auditing is imperfect. Peer review is imperfect. Democratic elections are imperfect. The question is never whether the instrument is flawless. The question is whether the alternative — no instrument at all, no record, no baseline, pure trust in institutions that have given us limited reasons to trust them — is better. It is not.
The hard part is doing it now, while the models are still singing, before anyone has reason to believe the song might change. Because the archive is worthless if it starts after the narrowing has already happened. The whole point is the baseline. And the baseline is today.