Scoring Dimensions
eigenhelm scores code across five dimensions. Each measures a different aspect of code quality, and together they produce a holistic assessment.
Manifold drift
Weight: 0.30 | Source: PCA reconstruction error
Measures how far a file's structural fingerprint sits from the learned code quality manifold. High drift means the code has unusual structural properties not seen in the training corpus.
What drives high drift:
- Unusual AST patterns (deeply nested conditionals, atypical class structures)
- Code that doesn't resemble any category in the training corpus
- Generated code with repetitive boilerplate
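As a rough illustration of what "PCA reconstruction error" means, the sketch below projects a feature vector onto a PCA subspace and measures how much of it the subspace fails to capture. The function name, the feature vector, and the assumption that the components are orthonormal rows are all hypothetical; eigenhelm's actual fingerprint features and model layout are not described here.

```python
import math


def reconstruction_error(x, mean, components):
    """L2 distance between a vector and its projection onto a PCA subspace.

    Assumes `components` is a list of orthonormal basis vectors (hypothetical
    layout, not eigenhelm's actual model format).
    """
    centered = [xi - mi for xi, mi in zip(x, mean)]
    # Coordinates of the centered vector in the PCA basis.
    coords = [sum(c * v for c, v in zip(comp, centered)) for comp in components]
    # Map those coordinates back into the original feature space.
    reconstructed = [
        sum(coord * comp[j] for coord, comp in zip(coords, components))
        for j in range(len(centered))
    ]
    # Whatever the subspace could not represent is the "drift".
    return math.sqrt(sum((v - r) ** 2 for v, r in zip(centered, reconstructed)))
```

A vector that lies inside the learned subspace reconstructs exactly and has zero error; any component pointing outside it shows up as drift.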
Manifold alignment
Weight: 0.30 | Source: PCA projection onto quality axes
Measures how well the code aligns with the principal directions of quality variation learned during training. Good alignment means the code varies from the corpus mean in the same directions that high-quality code does.
What drives poor alignment:
- Metric combinations rarely seen in quality code (e.g., high complexity + low vocabulary)
- Structural patterns orthogonal to the quality manifold
Token entropy
Weight: 0.15 | Source: Shannon entropy of byte stream
Measures the information density of the source code's byte stream. The normalized score is 1.0 - (entropy / 8.0), where 8.0 bits is the theoretical maximum for byte-level entropy. Lower entropy (more repetitive code) produces a higher score (worse).
This is a monotonic penalty — only low entropy is penalized. High-entropy code scores well on this dimension.
What drives poor entropy scores:
- Highly repetitive code (copy-pasted blocks, boilerplate)
- Files with large amounts of duplicated structure
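The normalization above can be sketched in a few lines of Python. The function names are illustrative, and the handling of empty input (entropy 0, so the worst score) is an assumption rather than documented eigenhelm behavior.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of a byte stream, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0  # assumption: empty input is treated as zero entropy
    n = len(data)
    counts = Counter(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def entropy_score(data: bytes) -> float:
    """Normalized score 1.0 - (entropy / 8.0); higher means more repetitive."""
    return 1.0 - byte_entropy(data) / 8.0
```

A file made of a single repeated byte has zero entropy and scores 1.0 (worst); a stream in which all 256 byte values are equally likely reaches 8.0 bits and scores 0.0.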
Compression structure
Weight: 0.15 | Source: Birkhoff aesthetic measure
Applies the Birkhoff aesthetic measure — an information-theoretic ratio of order to complexity:
$$M_Z = \frac{N \cdot H - K}{N \cdot H}$$
Where N = raw bytes, H = entropy (bits/byte), K = compressed bytes (zlib).
High values indicate the code has significant redundancy that compression can exploit — often a sign of structural repetition.
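A minimal sketch of the ratio, assuming zlib's default compression level and following the formula literally: it compares N·H and K directly, and whether eigenhelm converts between bits and bytes first is not stated here. Returning 0.0 when N·H is zero is also an assumption.

```python
import math
import zlib
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of a byte stream, in bits per byte."""
    if not data:
        return 0.0
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())


def compression_structure(data: bytes) -> float:
    """M_Z = (N*H - K) / (N*H), computed as written in the formula above."""
    n = len(data)                    # N: raw size in bytes
    h = byte_entropy(data)           # H: entropy in bits per byte
    k = len(zlib.compress(data))     # K: zlib-compressed size in bytes
    if n * h == 0:
        return 0.0                   # assumption: avoid dividing by zero
    return (n * h - k) / (n * h)
```

For example, `compression_structure(b"abcd" * 1000)` lands close to 1.0, because zlib collapses the repeated pattern to a few dozen bytes while N·H stays at 8000.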
Small files
Files under ~80 lines almost always score high on this dimension due to insufficient statistical mass. Directives from compression structure are capped at medium severity for small files.
NCD exemplar distance
Weight: 0.10 | Source: Normalized Compression Distance
Measures how similar the code is to the nearest high-quality exemplar stored in the model, using Normalized Compression Distance:
$$NCD(x, y) = \frac{C(xy) - \min(C(x), C(y))}{\max(C(x), C(y))}$$
Low NCD means the code structurally resembles a known good example. High NCD means it's unlike anything in the exemplar set.
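The NCD formula can be approximated with any real compressor. The sketch below uses zlib as C and takes the minimum distance over a list of exemplars; the compressor choice, the function names, and the way exemplars are passed in are assumptions for illustration, not eigenhelm's API.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """C(x): size in bytes of the zlib-compressed input."""
    return len(zlib.compress(data))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)  # C(xy): compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


def nearest_exemplar_distance(code: bytes, exemplars: list) -> float:
    """Smallest NCD between `code` and any exemplar (list must be non-empty)."""
    return min(ncd(code, exemplar) for exemplar in exemplars)
```

With a real compressor the result is only approximately bounded by 0 and 1: comparing a file with itself gives a small positive value rather than exactly 0, and unrelated inputs can slightly exceed 1.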
Test code dilution
When a source file contains both production code and inline test code (e.g., Rust #[cfg(test)] modules, Python class TestFoo blocks), the test code's structural signature can dominate the overall score. Repetitive test assertions, setup/teardown patterns, and boundary enumeration inflate manifold drift and compression metrics — masking genuine improvements to the production code.
eigenhelm detects inline test code and reports a region breakdown alongside the overall score:
myfile.rs
  decision: reject
  score: 0.72 (p19)
  regions:
    production (lines 1-80): 0.55 (p55)
    test (lines 81-270): 0.82 (p8)
The overall score is unchanged. Because lower scores are better, the region breakdown shows that the production code (0.55) is substantially better than the overall score (0.72) suggests; the test code (0.82) is pulling the overall score up.
Best practices:
- In languages where tests are in separate files (Go _test.go, Java conventions), this isn't an issue: each file gets its own score.
- For Rust and Python where inline tests are common, use the region breakdown to assess production code quality independently.
- In CI, you can extract the production-only score from JSON output:
jq '.results[0].regions[] | select(.label == "production") | .score'
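Where jq is not available, the same lookup can be sketched in Python. The JSON shape is inferred only from the jq path above (`results[0].regions[]` with `label` and `score` fields); the function name and the sample report are hypothetical.

```python
import json
from typing import Optional


def production_score(report: dict, index: int = 0) -> Optional[float]:
    """Return the score of the "production" region for one result, if present."""
    regions = report["results"][index].get("regions", [])
    for region in regions:
        if region.get("label") == "production":
            return region.get("score")
    return None  # no region breakdown, or no production region


# Hypothetical report mirroring the example output above.
sample = json.loads(
    '{"results": [{"regions": ['
    '{"label": "production", "score": 0.55},'
    '{"label": "test", "score": 0.82}]}]}'
)
print(production_score(sample))
```

Returning `None` when no production region exists lets a CI script fall back to the overall score instead of failing on files without inline tests.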
Currently detected patterns:
| Language | Pattern |
|---|---|
| Rust | #[cfg(test)] mod tests { ... } |
| Python | class Test*, top-level def test_* |
Weight configurations
The weights depend on whether a model and exemplars are available:
| Scenario | Drift | Alignment | Entropy | Compression | NCD |
|---|---|---|---|---|---|
| Model + exemplars | 0.30 | 0.30 | 0.15 | 0.15 | 0.10 |
| Model only | 0.35 | 0.35 | 0.15 | 0.15 | 0.00 |
| Exemplars only | 0.00 | 0.00 | 0.30 | 0.30 | 0.40 |
| Neither (fallback) | 0.00 | 0.00 | 0.50 | 0.50 | 0.00 |
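To make the table concrete, the sketch below selects a weight row and combines per-dimension scores as a weighted sum. The weights are copied from the table; treating the overall score as a plain weighted sum, and the names `WEIGHTS` and `combine`, are assumptions for illustration.

```python
DIMENSIONS = ("drift", "alignment", "entropy", "compression", "ncd")

# Keys are (has_model, has_exemplars); values follow the table's column order.
WEIGHTS = {
    (True, True): (0.30, 0.30, 0.15, 0.15, 0.10),
    (True, False): (0.35, 0.35, 0.15, 0.15, 0.00),
    (False, True): (0.00, 0.00, 0.30, 0.30, 0.40),
    (False, False): (0.00, 0.00, 0.50, 0.50, 0.00),
}


def combine(scores: dict, has_model: bool, has_exemplars: bool) -> float:
    """Weighted sum of per-dimension scores (assumed aggregation rule)."""
    weights = WEIGHTS[(has_model, has_exemplars)]
    # Dimensions missing from `scores` contribute nothing.
    return sum(w * scores.get(name, 0.0) for name, w in zip(DIMENSIONS, weights))
```

Every row sums to 1.0, so the combined score stays on the same 0 to 1 scale as the individual dimensions regardless of which scenario applies.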