AI Agent Skills¶
eigenhelm publishes agent skills for coding assistants. When an agent loads the skill, it gains a structured protocol for evaluating and improving code quality — with guardrails that prevent over-iteration.
Install the skill¶
What the skill enforces¶
The skill encodes five rules based on controlled testing:

- **One pass, not a loop.** Evaluate once after writing a file. If the result is reject, apply obvious fixes and re-evaluate once. Two passes maximum.
- **Never sacrifice correctness for score.** If a directive suggests refactoring that would change behavior, skip it. A working reject is better than a broken marginal.
- **Directives are suggestions, not commands.** `extract_repeated_logic` means "this region has structural repetition"; it doesn't mean you must extract. Sometimes repetition is the right choice.
- **Tests gate, eigenhelm advises.** Run tests before and after any eigenhelm-motivated change. If a change breaks tests, revert it immediately.
- **Marginal is an acceptable end state.** Do not iterate to improve a marginal score.
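The "tests gate, eigenhelm advises" rule can be modeled as a small guard. This is a hypothetical helper, not part of eigenhelm; the `apply_change`, `revert_change`, and `run_tests` callables stand in for whatever your agent harness actually does:

```python
from typing import Callable

def apply_if_tests_pass(
    apply_change: Callable[[], None],
    revert_change: Callable[[], None],
    run_tests: Callable[[], bool],
) -> bool:
    """Keep an eigenhelm-motivated change only if the test suite stays green.

    Hypothetical sketch of the "tests gate, eigenhelm advises" rule;
    eigenhelm does not ship this helper.
    """
    if not run_tests():       # tests must already pass before any change
        return False
    apply_change()
    if run_tests():           # change survives only if tests still pass
        return True
    revert_change()           # change broke tests: revert immediately
    return False
```

The score never enters this function: correctness alone decides whether a change is kept.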
Why these rules matter¶
In controlled benchmarks, we tested two approaches:
Loop-until-accept (old approach): An agent rewrote code repeatedly to chase a better score. It introduced a fatal algorithm bug while optimizing for structure — the topological sort ran backwards because the agent refactored it to lower the WL hash penalty.
Skill contract (current approach): The same agent, same spec, but following the rules above. It produced code rated 46% higher on design, robustness, and spec compliance by an independent reviewer — with zero correctness regressions.
Benchmark results (3 scenarios, 6 builds)¶
| Category (out of 15) | With skill | Without | Improvement |
|---|---|---|---|
| Bugs (higher score = fewer bugs) | 10 | 7 | +43% |
| Design | 12 | 9 | +33% |
| Robustness | 12 | 7 | +71% |
| Clarity | 13 | 10 | +30% |
| Spec Compliance | 10 | 6 | +67% |
| Total (out of 75) | 57 | 39 | +46% |
The eigenhelm agent won all 3 scenarios: CLI tool, data pipeline library, and message protocol.
Agent workflow¶
1. Write code
2. Run tests → all pass
3. Run: `eh evaluate src/mymodule.py --classify`
4. Read output:
- accept/marginal → done, move on
- reject → read directives, apply obvious fixes
5. If you changed anything: run tests again
6. Re-evaluate once. Accept the result regardless.
7. Done. Two passes maximum.
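The workflow above can be sketched as a small protocol function. This is an illustrative model, not eigenhelm code: `evaluate` stands in for a call to `eh evaluate <file> --classify`, and `apply_obvious_fixes` / `run_tests` are assumed agent-side hooks:

```python
from typing import Callable

def evaluate_with_skill(
    evaluate: Callable[[], str],             # returns "accept" | "marginal" | "reject"
    apply_obvious_fixes: Callable[[], bool], # returns True if anything changed
    run_tests: Callable[[], bool],
) -> str:
    """Two-pass protocol sketch: evaluate, optionally fix once, re-evaluate once."""
    verdict = evaluate()                     # pass 1
    if verdict in ("accept", "marginal"):
        return verdict                       # done, move on
    changed = apply_obvious_fixes()          # reject: apply only obvious fixes
    if changed and not run_tests():
        raise RuntimeError("fixes broke tests; revert before continuing")
    return evaluate()                        # pass 2: accept the result regardless
```

Note there is no loop anywhere: `evaluate` is called at most twice, and a marginal result on either pass is a terminal state.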
Directive reference¶
| Category | Meaning | Action |
|---|---|---|
| `extract_repeated_logic` | Structural repetition detected in AST | Look for shared abstractions, but only if the repetition is logic, not data |
| `reduce_complexity` | High cyclomatic/Halstead metrics | Break up complex functions |
| `review_structure` | Code structure diverges from corpus norms | Informational; review but don't necessarily change |
| `review_token_distribution` | Unusual token entropy | Usually noise; ignore unless extreme |
| `improve_compression` | Low information density | May indicate boilerplate; review for opportunities to simplify |
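The logic-versus-data distinction behind `extract_repeated_logic` is easiest to see side by side. A hypothetical illustration (none of these names come from eigenhelm):

```python
# Repetition worth extracting: the same validation logic appears twice,
# so it becomes one helper.
def _require_positive(name: str, value: float) -> float:
    if value <= 0:
        raise ValueError(f"{name} must be positive, got {value}")
    return value

def set_width(value: float) -> float:
    return _require_positive("width", value)

def set_height(value: float) -> float:
    return _require_positive("height", value)

# Repetition NOT worth extracting: these rows look structurally similar,
# but they are data. An abstraction here would only obscure the table.
HTTP_RETRY_POLICY = {
    429: {"retry": True,  "backoff": 2.0},
    500: {"retry": True,  "backoff": 1.0},
    501: {"retry": False, "backoff": 0.0},
    503: {"retry": True,  "backoff": 1.0},
}
```

A directive fired on either region reads the same; deciding which case you are in is the agent's job.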
Declaration-aware behavior¶
eigenhelm automatically handles files that are primarily type definitions or data tables:
- **Pure type-definition files** (dataclasses, interfaces, structs): Skipped entirely; the WL hash signal is syntax noise.
- **Data table files** (const arrays, probability tables): Scored with dampened drift/alignment and `review_structure` directives instead of `extract_repeated_logic`. Output includes a `[declaration-heavy]` tag.
- **Barrel files** (`__init__.py`, `index.ts`, `mod.rs`): Skipped when >=80% imports/exports.
Agents don't need to handle these cases — eigenhelm filters them automatically.
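For intuition, the >=80% barrel-file rule can be approximated for Python files with a few lines of `ast`. This is a rough sketch of the idea only; eigenhelm's actual detector (and its handling of other languages) may differ:

```python
import ast

def looks_like_barrel(source: str, threshold: float = 0.8) -> bool:
    """Rough sketch of a barrel-file heuristic: the share of top-level
    statements that are imports or an `__all__` export list."""
    stmts = ast.parse(source).body
    if not stmts:
        return False
    import_like = sum(isinstance(s, (ast.Import, ast.ImportFrom)) for s in stmts)
    # treat `__all__ = [...]` as an export declaration
    export_like = sum(
        isinstance(s, ast.Assign)
        and any(getattr(t, "id", None) == "__all__" for t in s.targets)
        for s in stmts
    )
    return (import_like + export_like) / len(stmts) >= threshold
```

A typical `__init__.py` that only re-exports symbols scores 1.0 and is skipped; any file with real function bodies falls well below the threshold.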
Example agent prompt¶
If you can't use a skill file, add this to your system prompt or project instructions:
```text
After writing or modifying code, evaluate it with:

    eh evaluate <file> --classify

Rules:
- Only evaluate after tests pass
- accept or marginal → move on, do not iterate
- reject → read directives, apply what's obviously correct, re-evaluate ONCE
- Never rewrite working code to improve a score
- Two evaluation passes maximum per file
```