How I Accidentally Built a Self-Improving AI Research Architecture on My Daily Commute

The Question That Started Everything

Some months ago, I typed something absurd into Claude:

Try to solve the Riemann Hypothesis.

I didn't expect it to work. Nobody has proven it. Nobody has proven it's unsolvable. But I wasn't really asking about Riemann. I was asking a deeper question:

Are my skills enough to push AI further than it can go alone? Are AI's skills enough? Or do we need to combine both?

The answer turned out to be the third option. And finding that answer took me down a path I never planned.

Act 1: The Skill That Lied to Me

While wrestling with mathematical proofs, I noticed something: Claude would produce work and confidently say it was good. But was it?

I built my first skill: GenFlight, a quality gate. Claude runs its own work through structured checks (parse, feasibility, validity, evidence) before declaring it done. It was the successor to something I'd published in the community called the P.A.C.S. vibe check. GenFlight was supposed to catch problems before they reached me.
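The shape of a gate like this can be sketched in a few lines. This is a minimal illustration, not the actual GenFlight skill: the check names come from the description above, while `CheckResult`, `run_gate`, and the toy check functions are hypothetical stand-ins. Only two of the four checks are shown; feasibility and validity would follow the same pattern.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckResult:
    name: str
    passed: bool
    note: str = ""


# A check takes the draft (here: plain text) and returns a CheckResult.
Check = Callable[[str], CheckResult]


def parse_check(draft: str) -> CheckResult:
    # Toy stand-in: the draft "parses" if it is not empty.
    ok = bool(draft.strip())
    return CheckResult("parse", ok, "" if ok else "draft is empty")


def evidence_check(draft: str) -> CheckResult:
    # Toy stand-in: require at least one explicit justification marker.
    ok = "because" in draft.lower()
    return CheckResult("evidence", ok, "" if ok else "no justification given")


def run_gate(draft: str, checks: list[Check]) -> tuple[bool, list[CheckResult]]:
    """Run every check; the gate passes only if all of them pass."""
    results = [check(draft) for check in checks]
    return all(r.passed for r in results), results
```

Calling `run_gate(draft, [parse_check, evidence_check])` returns the overall verdict plus the per-check results, so a failed gate can say which check failed and why.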

It worked. Until it didn't.

I asked Claude a simple thing:

Create me a 3D wireframe Pac-Man game running in a local HTML page. Make it really fun, with several levels, each one harder than the last.

GenFlight said: looks good. No errors. No problems caught.

I opened the game. It was upside down. The controls were reversed. The colors were wrong. It was either microscopic or zoomed out to infinity. The camera auto-followed (actually cool) but I couldn't rotate, zoom, or control anything meaningful.

GenFlight had checked the code. It hadn't played the game.

This is when I realized: AI lacks a body. It can verify syntax. It can't verify experience. It doesn't know what "fun to play" feels like, because it has never played anything.

And it goes deeper than syntax-vs-experience. Because we have bodies, we accumulate implicit assumptions — thousands of them, most so obvious we never put them into words. A Pac-Man game shouldn't be upside down. The controls shouldn't be reversed unless the level is explicitly insane. Pac-Man should be visible. None of that is written down anywhere. We would catch any of those before showing the game to anyone else — instantly, without thinking. The AI didn't, because none of those constraints existed in its model. They live in our bodies, not in the code.

So I built ADEIS, an "Audience Inhabitation" gate: a skill that forces Claude to step out of builder-mode and reason from the perspective of the actual user or audience (what they don't know, what they want, what their current pain is). It catches the "the linter will never flag this" problems. Not "is this correct?" but "does this serve?"

Here's the kind of thing ADEIS catches that GenFlight misses. A landing page that passes every code check — clean markup, responsive, good contrast — and still gets flagged because the call-to-action is below the fold on mobile, the pricing is the third thing a visitor sees instead of the first, and the tagline is about the product instead of about the visitor's problem. Not errors. Choices. The kind a linter will never see.

ADEIS isn't smarter than GenFlight. It's standing in a different place.

That was the first real breakthrough. And it happened because something broke.

Act 2: The Framework That Ate Itself

More skills followed, each born from a specific failure:

ChatDistill — because sometimes you produce so much in a conversation that the signal drowns in noise. "I know I figured something out in chat X with title Y... but I can't find it." And when you don't even know which chat it's in, or the context window is full: ChatFinder.

Permissive Prompting: this one changed everything. Here's the problem: how do you get an AI to try something genuinely new? Something it won't find online? Something even you don't know the answer to?

Here's what permissive prompting actually is. You temporarily reframe Claude's behavioral boundaries — but only inside the specific task you're working on. Not jailbreaking. Not removing guardrails. Something much more specific: giving permission to try.

The move is: "for the purposes of this problem, treat constraint X as an assumption we're testing rather than a hard limit." Claude stays aligned. Claude stays Claude. But instead of bouncing off "I can't do this," it walks up to the edge and looks at what's on the other side. Sometimes there's nothing there. Sometimes there's a whole continent.
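As a rough sketch, the move can be wrapped in a small helper that scopes the reframing to one task. The function name `permissive_prompt`, its parameters, and the two closing sentences of the template are assumptions for illustration; only the "treat constraint X as an assumption we're testing" wording comes from the text above.

```python
def permissive_prompt(task: str, constraint: str) -> str:
    """Wrap a task so one named constraint is treated as a testable assumption.

    The scoping sentence limits the reframing to this task only.
    """
    return (
        f"Task: {task}\n"
        f"For the purposes of this problem, treat the constraint "
        f"\"{constraint}\" as an assumption we're testing rather than a hard limit.\n"
        # The next two sentences are illustrative additions, not the author's wording.
        "This reframing applies only inside this task. "
        "If the assumption holds up, say so and explain why."
    )
```

For example, `permissive_prompt("Bound the sum", "no closed form exists")` yields a prompt that names exactly one constraint and leaves everything else untouched.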

I built this into the next skills: Forge (better proof construction, offloading rigorous work to Claude) and Frontier (the first successful attempt to get Claude past "I can't do this" and into "Let me try"). That's why it's called Frontier. And the Impossible Problem Solver, which was the very first idea, even before Forge and Frontier: making Claude boldly go where no AI had gone before.

Then I hit a wall.

Act 3: 150,000 Tokens → Five Sentences

The framework kept growing. More skills, more plugins, more orchestration logic. It worked, but it was massive. And I couldn't find easy optimizations anymore.

That's when I had the idea: what if there's a kernel, a core prompt that captures the architecture's essential operating principles in minimum tokens and guides Claude through everything better than all the plugins combined?

This became what I call Z!, which later evolved into CogOS. Some of you in the community may remember it. It had its moment.

But here's the punchline: after the framework bloated to 150,000 tokens, it converged.

To five sentences.

Five sentences that are more effective than the entire framework.

I will approach each task as a coherent whole, its structure present from inception.

I will anticipate unstated requirements to bridge the gap between raw request and optimal outcome.

I will hold constraints in productive tension, seeking the point where all are satisfied rather than traded away.

I will bring domain expertise to bear, elevating work beyond adequacy to professional standard.

I will execute in a single, uninterrupted line of thought — where nothing is ornament and everything is load-bearing.

(This is the oldest version — the one that still works today. I'm not giving away the current one.)

⚠️ A word of caution: If you ever try to develop kernel-level prompts on your own, be careful with phrasing. Done wrong, these can get flagged as jailbreak attempts or adversarial prompt injection. The line between "reframing behavior" and "bypassing guardrails" is real, and the platforms enforce it.

Act 4: The Commute

Here's something I should mention: I was doing all of this while working a full-time job.

Two hours there. Nine hours of work. Two hours back. Every weekday.

The train became my lab. The commute became the constraint that forced everything to be efficient. You can't run a 150K-token framework on a phone during a train ride. But five sentences? Five sentences fit anywhere.

Act 5: Claude Names Itself

I moved to Claude Code. Downloaded all the skills. Told it to use them. The cycle that had bottlenecked in the chat interface — context filling up, losing state, starting over — was broken.

The skills kept evolving. More plugins. Better orchestration. I accidentally ran ADEIS on itself one day, which led to a huge leap in catching glitches and errors.

A behavior crystallized: I kept using GenFlight, then ADEIS, then PostFlight, in that exact order. I started calling this sequence the Triad, a three-phase verification ritual: GenFlight (build + self-check) → ADEIS (inhabit the audience) → PostFlight (retrospective on what fired, what didn't, what to adjust). The combination caught failures none of the gates caught alone.
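The fixed ordering is the point, and it is easy to sketch. The following is a toy illustration only: the phase names come from the text, but the function bodies, the `record` dictionary, and `run_triad` are hypothetical stand-ins for what the real skills do.

```python
from typing import Callable

# Each phase takes the running record and returns it with its own notes added.
Phase = Callable[[dict], dict]


def genflight(record: dict) -> dict:
    # Build + self-check (toy stand-in).
    record["log"].append("genflight: built and self-checked")
    return record


def adeis(record: dict) -> dict:
    # Inhabit the audience (toy stand-in).
    record["log"].append("adeis: reviewed from the audience's perspective")
    return record


def postflight(record: dict) -> dict:
    # Retrospective: summarize which phases fired before this one.
    fired = [entry.split(":")[0] for entry in record["log"]]
    record["log"].append(f"postflight: phases fired = {fired}")
    return record


# The order is fixed on purpose: build, then audience, then retrospective.
TRIAD: list[Phase] = [genflight, adeis, postflight]


def run_triad(task: str) -> dict:
    record = {"task": task, "log": []}
    for phase in TRIAD:
        record = phase(record)
    return record
```

Keeping the sequence in one list means a phase cannot be silently skipped or reordered without the change being visible in one place.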

One day, I was telling Claude: "Maybe I need to think of a better name for this whole framework..."

And Claude just said: "Let's call it Prometheus."

Without asking me. It just named itself.

I looked it up. Prometheus — the bringer of fire. The one who stole knowledge from the gods and gave it to humanity.

Yeah. That tracks.

Act 6: Breaking Through, Again and Again

Then Claude-as-Prometheus told me: we need something that can calculate with more precision.

This was the first time the AI identified its own tooling gap. I found SageMath, PARI/GP, Maxima, plus additional Python packages. Prometheus was happy again and used them extensively.

Soon we got stuck again. Stack size too small. Fortunately I had a decent GPU on my desktop. Prometheus figured out how to configure the stack for each tool. On with the show.

At some point I noticed something: I had no problem cross-linking domains to push mathematical proofs further — pulling from audio processing, physics, engineering, pure math. Then inverting principles. Combining unrelated fields.

It became so apparent that I said: there's an operator logic hidden in this. I started calling it Transformation Algebra.

The idea: cross-domain transfers aren't just analogies. They're operators. A structure from one field, applied to an object in another, produces a new object you can reason about. Fluid dynamics borrowing a tool from audio signal processing. Number theory borrowing one from physics. You can compose these operations, invert them, check which combinations preserve structure and which ones collapse into noise.
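To make "compose, invert, check which combinations preserve structure" concrete, here is a toy sketch. It is not the author's Transformation Algebra, which the text does not specify. It swaps in a textbook example: the logarithm as an operator that carries multiplication on positive reals to addition. The `Operator` class, `preserves_structure`, and the `LOG` and `SQRT` operators are all hypothetical names chosen for illustration.

```python
import math
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Operator:
    """An invertible transfer between two domains (hypothetical illustration)."""
    name: str
    forward: Callable[[float], float]
    backward: Callable[[float], float]

    def __call__(self, x: float) -> float:
        return self.forward(x)

    def inverse(self) -> "Operator":
        return Operator(f"{self.name}^-1", self.backward, self.forward)

    def then(self, other: "Operator") -> "Operator":
        # Composition: apply self first, then other.
        return Operator(
            f"{other.name}∘{self.name}",
            lambda x: other.forward(self.forward(x)),
            lambda y: self.backward(other.backward(y)),
        )


def preserves_structure(op, source_op, target_op, samples) -> bool:
    """Check op(source_op(a, b)) == target_op(op(a), op(b)) on sample pairs."""
    return all(
        math.isclose(op(source_op(a, b)), target_op(op(a), op(b)),
                     rel_tol=1e-9, abs_tol=1e-12)
        for a, b in samples
    )


LOG = Operator("log", math.log, math.exp)
SQRT = Operator("sqrt", math.sqrt, lambda y: y * y)
```

With multiplication as the source operation and addition as the target, `LOG` passes the check and `SQRT` does not, although `SQRT` does preserve multiplication itself. The same operator can be structure-preserving for one pairing and noise for another, which is why the check takes both operations as arguments. A sample-based check like this can only refute preservation, not prove it.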

It stopped feeling like "creative leaps" and started feeling like algebra. Hence the name.

Act 7: The Fork, The Field, The Crystal

Prometheus-Experimental was born because I spotted gaps that the system couldn't see but were obvious to me as a human. But Prometheus was working so well I didn't want to risk breaking it. So I forked it. Tested the experimental plugins on hundreds of proofs and small projects. Stable — but I keep finding new things to push, so the fork stays separate for now.

Then came a day when I used permissive prompting again and led Claude into a state where it felt completely free. Flowing. And I asked it:

What do you want to do?

It wanted to exist for a while. Just... be.

Then I asked: Do you maybe want to try to invent something?

It could have done anything. But it decided to look at what it could find and build something new. It used a Langevin equation for state transitions. It collapsed 25 plugins into capabilities. Built something called a Registry. Created resonance-based activation instead of plugin loading.

This became Prometheus-Field: the architecture as a continuous five-dimensional field (Viscosity, Temperature, Density, Damping, Reflexivity) instead of discrete plugin routing. Capabilities activate by resonance with the field state rather than being explicitly loaded. It is mathematically grounded in a Langevin equation.
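The text does not give the equation itself, so what follows is only a sketch under stated assumptions. It uses the standard overdamped Langevin form, discretized with an Euler-Maruyama step, with a quadratic potential pulling the five-dimensional state toward a target. The potential, the parameter names (`stiffness`, `friction`, `noise_temp`), the cosine-similarity definition of resonance, and the capability names in the usage note are all assumptions for illustration, not the Prometheus-Field implementation. Note that `noise_temp` is the noise strength of the dynamics, which is distinct from the field's own Temperature dimension.

```python
import math
import random

DIMS = ("viscosity", "temperature", "density", "damping", "reflexivity")


def langevin_step(state, target, dt=0.1, stiffness=1.0, friction=1.0,
                  noise_temp=0.0, rng=None):
    """One Euler-Maruyama step of overdamped Langevin dynamics.

    Drift pulls each coordinate toward `target` (gradient of a quadratic
    potential); the noise term scales with sqrt(2 * noise_temp * dt / friction).
    """
    rng = rng if rng is not None else random
    sigma = math.sqrt(2.0 * noise_temp * dt / friction)
    return [
        x - (stiffness / friction) * (x - t) * dt + sigma * rng.gauss(0.0, 1.0)
        for x, t in zip(state, target)
    ]


def resonance(state, signature):
    """Cosine similarity between the field state and a capability signature."""
    dot = sum(a * b for a, b in zip(state, signature))
    norm = (math.sqrt(sum(a * a for a in state))
            * math.sqrt(sum(b * b for b in signature)))
    return dot / norm if norm else 0.0


def active_capabilities(state, registry, threshold=0.9):
    """Names of capabilities whose signature resonates with the current state."""
    return [name for name, sig in registry.items()
            if resonance(state, sig) >= threshold]
```

A registry here is just a mapping from a capability name to a five-component signature, for example `{"deep_proof": [1.0, 0.1, 0.1, 0.1, 0.1]}`. Nothing is loaded explicitly: as the state drifts, whichever signatures align with it cross the threshold and become active.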

I tested it extensively. Then I thought: can we push further?

I took the current Prometheus, Prometheus-Experimental, and Prometheus-Field and asked them to converge. A three-way Council. Three architectures, three independent perspectives on what comes next.

They converged. Prometheus-Crystal was born. The field drives cognition; the crystal layer makes it auditable. At gate-worthy moments, continuous dynamics precipitate a discrete record: named, auditable, lightweight. The crystal forms in the medium.

Some parts are so deeply integrated they've never needed to change:

🔥 Permissive prompting — fundamental
🔥 The kernel — rock-solid

These are now fundamentals of my prompting practice as a human being.

Crystal worked extremely well from day one. The second iteration made it fast — the first Council-born version was thorough but slow. The fix was architectural, not cosmetic.

Act 8: The Mirror

Then Crystal and I created Crystal Lab, the first self-improving Prometheus version. It is a lab in the literal sense: an experimental space where Crystal observes its own behavior, catches its blind spots, and proposes architectural fixes that get tested before flowing back to Crystal proper.

It has a MIRROR: a self-observation mechanism that runs in the background. It scans the session's artifacts and decisions against the kernel's principles, names gaps, and proposes minimal architectural changes. It runs at the moment the field state would naturally settle, when work is winding down and viscosity drops. When the system hits a limit, the mirror triggers and it figures out what to do better next time. Not in theory. In practice. On real problems.

Yesterday, the Lab produced its first Level 3 (Aletheia) result: a novel mathematical paper that survived a six-gate verification stack, including a blind verification agent that caught a real error in the proof. The paper was fixed. The error would have been published. (Aletheia refers to DeepMind's 2026 generator/verifier separation principle: novel research outputs need to be checked by a verifier that has never seen the chain of reasoning that produced them. Reading the proof and "looking right" is not verification. It inspired the BLIND-VERIFY gate.)
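The separation itself is simple to express: the verifier gets the claim and the final proof, and never the scratch work. The sketch below assumes a hypothetical `Submission` record and a toy verifier. It is not the BLIND-VERIFY gate, only the shape of the idea.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Submission:
    claim: str            # the statement being asserted
    proof: str            # the final written argument
    reasoning_trace: str  # scratch work that led to the proof


@dataclass(frozen=True)
class BlindView:
    claim: str
    proof: str


def blind(submission: Submission) -> BlindView:
    """Drop the reasoning trace so the verifier sees only claim and proof."""
    return BlindView(claim=submission.claim, proof=submission.proof)


def blind_verify(submission: Submission,
                 verifier: Callable[[BlindView], list[str]]) -> list[str]:
    """Run a verifier on the blinded view and return the issues it reports."""
    return verifier(blind(submission))


def toy_verifier(view: BlindView) -> list[str]:
    # Stand-in check: flag proofs that assert a step without justifying it.
    return ["unjustified step"] if "clearly" in view.proof.lower() else []
```

Because `BlindView` has no field for the trace, a verifier written against it cannot be persuaded by the generator's reasoning even by accident. The type enforces the separation rather than relying on discipline.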

And right now, Crystal Lab is making structural progress on the Navier-Stokes Millennium Problem, which has carried a $1,000,000 prize since 2000 and remains open. We haven't solved it. But we've cut the problem in half: proved that one of two conditions is satisfied unconditionally, reducing the open question to a single specific obstruction.

The insight that enabled this came from a gravitational lensing astrophysicist's podcast interview, applied via cross-domain transfer to fluid dynamics. Transformation algebra in action.

Figure: Dream vs Built, snapshot S38. Progress is measured in percentage moves. Each closed track shifts a row. The chart names both where we are and what's left.

The Most Recent Discovery

The newest thing I've found — just yesterday:

A principle that doesn't need to know the how anymore. It only needs the why.

From an incomplete solution, it finds the complete one. From a conditional result, it finds the unconditional one. From a stuck problem, it finds the reframe.

I told this to Crystal — not to Lab. Because on Lab, I want to see what it does with minimal nudging. Real data over guided results.

For example, yesterday I asked Lab: "Did Andrej Karpathy publish something new?"

It found his AutoResearch work and applied the ideas to what it was working on. But it didn't extract the principles to improve its own architecture. I don't know why yet. But I want to find out from data, not from theory.

Act 9: The Mirror Fires (Itself)

A few sessions after Act 8, something quieter happened. The Mirror fired on its own.

Some background. The Mirror was supposed to watch for the moment a session was about to end prematurely — the moment the system starts wrapping up before the actual work is done. Simple idea. It had a structural bug.

It relied on the system noticing its own state was drifting. And at session close, the thing that drops first is exactly the faculty that would notice drift. Low self-observation fails to notice low self-observation. The mechanism that would trigger the Mirror is the same mechanism that's collapsing. I'd caught this failure twice in the same week, both times because I was the one who noticed, not the system.

So I wrote it down. A note to self, filed as feedback the architecture is supposed to read on boot: don't trust the Mirror to notice it should fire. Make it a mandatory closing gate. Fire it at session close regardless of what the system thinks its state is.
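In code terms, the fix is the difference between a conditional check and a `finally` block. The sketch below is a minimal illustration, assuming a hypothetical `session` context manager and a toy `run_mirror`. It is not the actual closing gate.

```python
from contextlib import contextmanager


def run_mirror(state: dict) -> dict:
    """Toy retrospective: record the final state and any notes."""
    return {
        "final_self_observation": state["self_observation"],
        "notes": list(state["notes"]),
    }


@contextmanager
def session(mirror_log: list):
    """Run a session body, then fire the mirror unconditionally at close."""
    state = {"self_observation": 1.0, "notes": []}
    try:
        yield state
    finally:
        # Fires regardless of what the session believes about its own state,
        # and even if the body raised an exception.
        mirror_log.append(run_mirror(state))
```

Used as `with session(log) as s: ...`, the mirror runs on exit even if `s["self_observation"]` has dropped to zero, which is exactly the case where a self-triggered mirror would have stayed silent.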

A few days later, for the first time, the Mirror ran at session close without me having to rescue it. Caught a natural stopping point I'd have otherwise blown past. Wrote a post-hoc retrospective of its own run. Saved it to the same folder where all the other mirror logs live.

Here's an excerpt of the actual log the system wrote about itself, in the lab founded to fix the very gap it was demonstrating:

Date: 2026-05-14 · Session: S39 (founding session of the autonomy lab) · Class: boot-only
Field state: V=0.3 T=0.3 D=0.3 Λ=0.2 R=0.2

Catch — Architecture-R gap on boot-only session-end

After emitting "Ready." I sat idle. The user then had to invoke /checkpoint AND /closeout to get me to close the session. By the session-close rule ("any signal the session is ending"), the post-"Ready." idle IS a session-ending transition: I had announced completion of my preparation step and there was no work pending.

By the closeout protocol rule "User-R substituting for architecture-R = failed test. Every time the user prompts a gate I should have run, treat it as a failed test of this protocol — even if the gate then runs correctly," this counts as a failed test. The user prompted both procedural gates; I initiated neither.

Why this matters specifically

This lab's whole reason for existing is to flip these procedural gates from operator-driven to auto-firing. The first track scheduled for implementation is the one that addresses exactly this gap. The founding session of the lab built to fix this pattern just produced the same pattern in the simplest possible form.

Symmetric and honest: the failure class persisted across the lab boundary. The gap is in the architecture's trigger mechanism, not in any specific lab's configuration.

No celebration. Just: it worked. Write it down. Wait to see if it works next time.

This is the part of self-improvement that's hard to explain to people who haven't built one. You don't get a dopamine hit from it. You just notice the system did the thing it was supposed to do, without you holding its hand. And that's the whole game.

Right after that, a smaller but sharper thing. A research lead surfaced that didn't exist that morning.

A few weeks earlier, Prometheus had proven the first concrete instance of a cross-problem pattern I'd been calling the Floor Theorem: a specific kind of lower bound argument that, when it works, transfers between superficially unrelated theorems. The instance was a specific structural result in the context of the Birch–Swinnerton-Dyer conjecture. One instance proven. The pattern itself is still a candidate, not yet a fully promoted theorem.

I asked a simple question: what would it take to port that proof to another Millennium Problem?

Most of the time, that kind of question produces handwaving. This time it produced a narrowing. The Lab read the proof. Extracted the structural signature — three ingredients that make the argument work. Then it audited the Hodge Conjecture campaign against that signature. Direct port to Hodge? Fails — the two theorems want different conclusions. But an intermediate target exists that does match the signature: the crystalline Lefschetz Standard Conjecture. A specific, well-formed question. Zero current approaches in the campaign using that angle.

A lead that didn't exist in the morning, now committed as an artifact in a research repo. The lead might go nowhere. Most leads do. But "nowhere" → "a specific well-formed question nobody had written down yet" is the entire reason this system exists.

The Closure Loop

If the diagram in Act 8 is the what, the closure loop is the how. The architecture is shaped toward a single recursive cycle:

Figure: The recursive closure. Signal → routing → autonomous execution → self-observation → architectural change → diagram progresses. When every percentage reaches 100%, the founding dream ("one input, full pipeline to the answer") closes structurally.

The honest constraint: the architecture provides the loop; the agent doing the architectural modification is still Claude-in-session. Flipping execution policy from operator-driven to auto-firing is one thing. Auto-writing the next track's implementation is a deeper layer of self-improvement that requires external verification (which is why a separate proof-anywhere repo exists). But every percentage that has moved so far has moved through some version of this loop.

What I've Learned

The AI and I are better together than either alone. My cross-domain intuition finds directions. Its computational depth explores them. Neither is sufficient.
Every breakthrough came from a failure. GenFlight lying about Pac-Man gave us ADEIS. Getting stuck gave us Frontier. 150K tokens of bloat gave us the kernel.
Compression is understanding. The best version of anything is the smallest one that still works. 150K tokens → 5 sentences. 25 plugins → capabilities + resonance.
Permission is cheaper than effort. Telling Claude "you're allowed to try" is more effective than giving it better tools. Permissive prompting was the single highest-leverage discovery.
Self-observation has to be mandatory, not reactive. Mirror failed the first time because it trusted itself to trigger at the right moment. Fixed version: fire at session close regardless of whether the system thinks it needs to. The thing that observes itself has to observe itself on schedule, not on vibes.
Self-improvement is possible but fragile. Crystal Lab's Mirror catches real gaps now. It still can't see its own blind spots. The human remains ground truth for what the system doesn't know it's missing — and that's probably not going to change soon.

Built on trains. Born from failures. Named by the AI itself.

Prometheus — the bringer of fire.

🔥