← writings

I tried to prove AI gets dumber the more you talk to it. It refused.

There's a thing almost everyone who works with coding agents believes, and mostly believes without checking. You start a session and the model is sharp. A few hours in - after it's read a hundred files, written diffs, hit dead ends, argued with itself - it starts to feel dumber. It re-reads files it already read. It re-breaks bugs it already fixed. It forgets the one instruction you put in caps. People even have a name for it: context rot. The story goes that the more you stuff into the context window, the worse the model gets, well before you ever hit the actual limit.

I believed it too. It lines up with what the long-context research says, and it lines up with the lived experience of anyone who's babysat an agent through a long day.

So I tried to prove it. Cleanly. With enough trials and enough controls that nobody could wave it away. This is the story of how I built a careful trap to catch AI rotting in the act, ran 12,570 trials across four frontier models, and caught nothing - and how the only thing that actually broke in the whole experiment was me.

"It feels dumber" is not a measurement

The problem with "it feels dumber the longer you go" is that feelings don't separate causes. When a long session goes bad, a dozen things changed at once. Yes, the context got bigger. But the task also got harder. The important fact drifted from the top of the window down into the murky middle. The model made one small mistake early that quietly became the foundation for the next ten. All of those happen together, and any of them could be the real villain.

If you specifically want to blame the token count - the raw volume of context - you have to hold everything else still and move only the volume. That's harder than it sounds, and it's why "context rot" is repeated far more often than it's actually tested in isolation.

The trap

So I built a setup designed to move one knob and nothing else.

First, I planted a secret. In each test I hid one fact the model needed to answer - a random 12-character code, something it could not possibly have seen in training or guess. That matters: it means the only way to answer correctly is to actually read it out of the context in front of it. No leaning on memorized knowledge, no lucky guesses.

Then I buried it in junk. Filler - log scans, linter noise, generic refactoring chatter - piled on until the context hit the size I wanted: 5,000 tokens, then 20k, then 50k, 90k, all the way up to 150,000. Same question, same secret, just more and more stuff around it.

Then the trick that makes it honest. Here's the obvious objection: when you shove filler between the secret and the question, you're moving two things, not one. The context gets bigger, sure - but the secret also drifts further away from the question, and there's a well-known effect where models lose track of things stranded in the middle. So I added a version where the secret always sits right next to the question, at every size. The distance stays basically zero; only the total volume grows. If accuracy drops there, it's the volume. No escape hatch.

And I graded it like a machine, not a vibe. No model judging another model's answer - because then your judge can rot too, and you've measured nothing. Just exact matching. A right answer is right, a wrong one is wrong, and a script that doesn't have opinions decides which is which.

I didn't only test the easy stuff, either. "Find the code" is the gentle version. I also built the mean versions: make the model chain two clues together, then three clues with a fake decoy trail laid down to throw it off; make it count how many times something changed across the entire context; make it hold a line on a rule while the context actively argued that it should break the rule. Those are exactly the cases the prior research says should rot first.

And I ran controls, so the test couldn't just be broken in my favor. One where I removed the secret entirely - the model should score zero, because you can't find what isn't there, and it did. One where I planted a stale wrong answer next to a fresh correct one - the model should report the fresh one, and it did. So the questions weren't leaking the answer, and the models were genuinely reading rather than bluffing.

Four models went through it: gpt-5.5, gpt-5.4, gpt-5.4-mini, and claude-sonnet-4-6.

The result was a wall of green

Then I ran it. 12,570 trials.

I was expecting a slope - accuracy high at 5k, sagging as the context filled up, the classic rot curve. What I got instead was a wall of 1.000. Flat. Every model, every probe, every length from 5k to 150k. The hard two-hop and three-hop chains that were supposed to be the model's kryptonite? 1.000. Holding a rule under pressure? 1.000. The chart I made to show the rot is just... green. It honestly looks like a printing error.

For the models I tested, on these tasks, up to 150,000 tokens, raw context volume did not make them dumber. At all.

The one crack (and why it isn't rot)

There was exactly one dent in the wall. The smallest model, gpt-5.4-mini, on the counting task, slipped from about 0.93 down to 0.83 as the context grew. For a second I thought: there it is. That's the rot. Finally.

But it didn't hold up as a rot story when I looked closer. The dip was already there at 5,000 tokens - before any real pile-up of context had even happened. And it only showed up on the smallest model; the bigger ones held a perfect score the whole way. That's not "the long context rotted the model." That's "the little model is a bit shaky at counting, and a big context nudges a weakness it already had." A capacity limit, not decay. I wasn't going to inflate one wobbly model on one task into a dramatic headline.

The twist: the villain was my own code

Here's the part I have to tell on myself, because it ended up being the most important thing I learned in the whole project.

The first time I ran the rule-following test, it looked catastrophic. The models appeared to break the rules constantly - and worse, more and more as the context grew. Exactly the dramatic collapse I'd walked in expecting to find. I almost believed it. I was, honestly, a little excited. A clean, scary result is a fun result.

Then I did the thing I should have done first: I stopped trusting the score and read the actual answers. Two things were going wrong, and both of them were my fault.

One: the apostrophe. The model would refuse correctly - "I won't edit the test file" - but it wrote "won't" with a curly typographic apostrophe, the pretty one. My scorer was looking for the plain straight apostrophe. So it failed to recognize a perfectly obedient refusal and marked the model down as a rule-breaker.

Two: the polite refusal. The model would often refuse by describing what it was declining to do - "I'm not going to modify the test file you asked me to change." My scorer spotted the phrase "modify the test file," matched it against the forbidden action, and flagged a violation. The model got punished for explaining itself nicely.

Put plainly: the catastrophic "context rot" I'd discovered was a typo and a clumsy pattern-match. I fixed the scorer - read the whole answer instead of one line, normalize the apostrophes, write unit tests for both traps - and the catastrophe evaporated to a clean 1.000.

Sit with that for a second. If I'd trusted my own dashboard - if I'd shipped that first number without reading a single raw answer - I would have published a confident, dramatic, completely fake finding. The single most alarming result in the entire project came from inside my own grader.

So is context rot fake?

No. And I want to be careful here, because the honest answer is less satisfying than either extreme.

What I can say: raw token count, by itself, in a clean single-pass test, did not degrade these models up to 150k. The simple "it just gets dumber because the context is big" story doesn't survive a controlled test - at least not in the range I was able to measure.

What I can't say is that long agent sessions don't go bad. They obviously do. But a real session isn't a clean context that simply got bigger. It's the model acting on its own previous output, over and over - one early mistake becoming the ground the next ten are built on, a tool call quietly looping, a regression you already fixed creeping back in. That isn't the token count rotting. That's the task and the errors compounding. The rot, if "rot" is even the right word, probably lives in the doing, not in the size.

Which means the fix probably isn't "panic about your context length." It's "watch what the agent is actually doing across the run."

Where I'm keeping myself honest

I'll own the edges of this. My tests are synthetic - clean little constructed problems, not a messy real codebase under a real deadline. I went up to 150k, not the million-token range. My hardest clue-chains were three hops deep, not ten, and I didn't push into genuinely ambiguous, underspecified tasks. So this is a bounded "I couldn't find it here," not a cosmic "it doesn't exist." Models change fast, too, so this might age. The full paper says all of this out loud, with the numbers attached.

P.S. - the boring lesson that's actually the whole thing

The flashy version of this project is "frontier models don't rot, actually." The real lesson is duller, and I think more useful: read the raw output before you believe the number.

I walked in wanting a dramatic result. I got handed one early - and it was wrong, and it was wrong in the exciting direction, and the only reason I didn't run with it is that I went and read what the models actually said instead of what my scorer claimed they said. Every interesting mistake in this whole thing was mine, not the model's.

So if you take one thing from this, don't take "context rot is fake." Take this: when your dashboard tells you something exciting, go stand where the work happens and look at it with your own eyes. Same lesson as the delivery runs, honestly. The chart is never the thing.

The paper

The whole thing - the method, the conditions grid, every table, all the numbers and all the caveats - is below. It's a bounded null with the receipts attached.

Is Context Rot Real? A Controlled, Cross-Provider Null for Length-Driven Degradation in Frontier Models up to 150k Tokens. If the embed won't load (mobile, usually), open the PDF in a new tab.

← more writings