# Why eigenhelm
CodeRabbit, Copilot code review, and similar tools use LLMs to read your code and offer suggestions. They catch real bugs and give useful advice. eigenhelm does something different.
## The problem with LLM-only review
LLM reviewers operate on text. They read code the way a human does — scanning for patterns, reasoning about intent, generating natural-language feedback. This works well for logic errors, naming, documentation, and security smells. It does not work well for:
- **Structural quality.** Is this file doing too many things? Is the complexity distributed well or concentrated in one function? An LLM can guess, but it's reasoning from vibes, not measurement.
- **Consistency across a codebase.** LLM reviews typically operate per-PR without a stable project-wide structural baseline. Some tools ingest repo context, but none produce a comparable numeric score across runs. Each review is an independent opinion.
- **Determinism.** Run the same LLM review twice and you'll get different comments. Change the prompt and the severity shifts. There is no stable baseline to track over time.
- **Agent self-review.** When a coding agent generates code and an LLM reviews it, you have one language model checking another's work. The failure modes correlate. Both struggle with the same blind spots — verbose boilerplate that reads fine but is structurally degenerate, copy-paste patterns that pass a logic check but compress poorly.
## What eigenhelm measures instead
eigenhelm doesn't read your code. It parses the AST, extracts a 69-dimensional structural fingerprint, and projects it into an eigenspace trained on curated high-quality corpora. The score comes from five dimensions:
| Dimension | What it catches |
|---|---|
| Manifold drift | Code that is structurally unlike anything in the training corpus |
| Manifold alignment | Code that drifts along the wrong axes of variation |
| Token entropy | Repetitive, low-information-density code (or chaotic, over-compressed code) |
| Compression structure | Poor structural regularity — the opposite of elegant |
| NCD exemplar distance | Code that is dissimilar to the nearest high-quality example |
None of these require understanding what the code does. They measure how it's built — the same properties that experienced engineers sense intuitively but can't articulate in a linter rule.
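Two of these dimensions map onto standard information-theoretic measures. The sketch below is a minimal illustration using only Python's standard library — the `token_entropy` and `ncd` functions and the sample snippets are invented for this example and are not eigenhelm's implementation:

```python
import math
import zlib
from collections import Counter

def token_entropy(tokens: list[str]) -> float:
    """Shannon entropy of the token distribution, in bits per token.
    Very low values flag repetitive code; very high values flag chaotic code."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def ncd(a: bytes, b: bytes) -> float:
    """Normalized compression distance: near 0 for structurally similar
    inputs, approaching 1 for unrelated ones."""
    ca, cb = len(zlib.compress(a)), len(zlib.compress(b))
    cab = len(zlib.compress(a + b))
    return (cab - min(ca, cb)) / max(ca, cb)

repetitive = "x = 1\n" * 50                  # degenerate copy-paste pattern
varied = "def f(a, b):\n    return a * b + len(str(a))\n"

print(token_entropy(repetitive.split()))     # low: only 3 distinct tokens
print(token_entropy(varied.split()))         # higher: more information per token
print(ncd(repetitive.encode(), varied.encode()))
```

In a real scorer these measures would be computed against a corpus of exemplars rather than a single pair, but the signal is the same: repetitive code compresses to almost nothing and carries little information per token.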
## Complementary, not competing
The right setup is both, not either/or:
```
                   ┌─────────────┐
Agent writes       │  eigenhelm  │      Structural gate
code ────────────► │ (AST + math)│ ───► accept / marginal / reject
                   └─────────────┘      + directives → agent fixes
                          │
                          ▼
                   ┌─────────────┐
PR opened          │  CodeRabbit │      Logic + intent review
─────────────────► │    (LLM)    │ ───► comments → human reviews
                   └─────────────┘
```
eigenhelm runs first, in the inner loop. It gives the agent fast, deterministic feedback before a human ever sees the code. CodeRabbit runs second, on the PR, where its strengths — contextual reasoning, security analysis, documentation suggestions — add the most value.
## Concrete differences
| | eigenhelm | LLM reviewer |
|---|---|---|
| Input | AST structure (69-dim vector) | Source text |
| Deterministic | Yes — same code, same score, every time | No |
| Trainable on your corpus | Yes — `eh train` on your best code | No (prompt tuning only) |
| Hard CI gate | Yes — with calibrated thresholds | Suggestions only |
| Tracks quality over time | Yes — scores are comparable across runs | No stable metric |
| Agent-proof | Yes — can't be talked past | Susceptible to plausible-sounding code |
| Catches logic bugs | No | Yes |
| Reviews naming/docs | No | Yes |
| Cost per evaluation | Zero (local, no API calls) | Per-token LLM cost |
## Why structure matters — not just aesthetics
The most common objection to code quality tooling is: "it works, tests pass, ship it." That's a reasonable position for human-authored code at human pace. It breaks down with agent-generated code for three reasons.
Structural complexity predicts defects. This isn't opinion — it's one of the most replicated findings in empirical software engineering. Modules with high cyclomatic density, concentrated Halstead effort, and low structural regularity produce more post-merge defects. The 75-line function that "works fine" is the one that breaks when requirements change, because every change touches every branch.
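The branch-counting behind that finding is easy to sketch. The following is a rough, illustrative approximation of cyclomatic complexity using Python's `ast` module — not eigenhelm's actual feature extraction, and not a faithful McCabe count:

```python
import ast

# Decision points that open an extra path through a function.
BRANCH_NODES = (ast.If, ast.IfExp, ast.For, ast.While,
                ast.ExceptHandler, ast.BoolOp)

def cyclomatic(func: ast.FunctionDef) -> int:
    # One base path, plus one per decision point found in the body.
    return 1 + sum(isinstance(node, BRANCH_NODES) for node in ast.walk(func))

source = """
def flat(x):
    return x + 1

def branchy(x):
    if x > 0:
        for i in range(x):
            if i % 2:
                x -= 1
    return x
"""
tree = ast.parse(source)
for fn in tree.body:
    print(fn.name, cyclomatic(fn))   # flat scores 1, branchy scores 4
```

A 75-line function with a dozen nested branches scores far higher than the same logic split into small functions — which is exactly the concentration of complexity the defect studies point at.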
Agents generate debt at machine speed. A human writing 200 lines per day accumulates structural debt slowly enough that review catches it. An agent writing 2,000 lines per hour doesn't give you that buffer. Without measurement, you won't notice the drift until the cost of changing the code exceeds the cost of rewriting it.
Review doesn't scale to agent output. A team that carefully reviews a 50-line diff cannot maintain the same rigor across ten 500-line agent PRs per day. The human eye glazes over. Structural issues that would have been caught in a small diff pass unnoticed in a large one. This is the gap where quality silently erodes.
The fix is cheapest early. An agent that receives a `[high] reduce_complexity` directive and refactors before the PR exists costs seconds of compute. The same structural problem discovered six months later during an incident costs days of debugging and a production outage. eigenhelm moves the feedback to where the cost of acting on it is nearly zero.
eigenhelm doesn't ask you to write perfect code. It doesn't gate on style or naming or comment density. It measures structural properties that predict maintenance cost, and it gives you a number so you can make informed tradeoffs instead of uninformed ones.
The "prayers" part¶
Without structural measurement, the quality assurance process for AI-generated code is:
1. Agent writes code
2. Tests pass (necessary but insufficient — tests check behavior, not structure)
3. Human squints at the diff
4. Merge and hope
Step 3 doesn't scale. eigenhelm replaces hope with measurement. A score of 0.38 today and 0.52 next month is a signal. When used as a CI gate, a reject classification can block the merge before a human has to spend time on code that should have been caught automatically.
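That gate logic can be sketched in a few lines. The threshold values below and the convention that higher scores are worse are assumptions for illustration, not eigenhelm's calibrated defaults:

```python
# Hypothetical drift-score gate: lower is better, thresholds are invented.
ACCEPT_MAX = 0.45   # at or below this: accept
REJECT_MIN = 0.60   # at or above this: reject

def classify(score: float) -> str:
    if score <= ACCEPT_MAX:
        return "accept"
    if score >= REJECT_MIN:
        return "reject"
    return "marginal"

history = [0.38, 0.41, 0.47, 0.52]   # scores across successive runs
print(classify(history[-1]))          # the 0.52 run lands in "marginal"
print(f"drift since baseline: {history[-1] - history[0]:+.2f}")
```

In CI, a `reject` classification would map to a nonzero exit code, so the merge blocks before anyone spends review time on it.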
## When you don't need eigenhelm
- Solo projects where you read every line. If you're the only author and reviewer, your taste is the quality gate.
- Generated code that won't be maintained. One-off scripts, prototypes, throwaway experiments — score them if you're curious, don't gate on them.
- Codebases with no quality baseline. If you don't know what "good" looks like for your project yet, start with an LLM reviewer and run `eh train` later when you have exemplars.