Models¶
eigenhelm models are .npz files containing a trained PCA eigenspace, exemplar vectors, and calibration data.
Bundled models¶
| Model | Languages | PCs | Vectors | Use case |
|---|---|---|---|---|
general-polyglot-v1.npz |
Python, JS, TS, Go, Rust | 36 | 8,228 | General-purpose (default) |
lang-python.npz |
Python | — | — | Python-specific |
lang-javascript.npz |
JavaScript | — | — | JavaScript-specific |
lang-typescript.npz |
TypeScript | — | — | TypeScript-specific |
lang-go.npz |
Go | — | — | Go-specific |
lang-rust.npz |
Rust | — | — | Rust-specific |
pattern-cli.npz |
Mixed | — | — | CLI tool patterns |
baseline.npz |
Python | — | — | Baseline reference |
When no model is specified, eigenhelm uses the bundled polyglot model.
Choosing a model¶
- Language-specific models are more discriminating for their target language
- The polyglot model works across all supported languages and is the best default
- Custom models can be trained on your own curated corpora for domain-specific evaluation
Training a custom model¶
# Prepare a corpus manifest
cat > my-corpus.toml << 'EOF'
[corpus]
name = "my-team-best"
language = "python"
class = "A"
[[target]]
name = "my-code"
url = "https://github.com/my-org/my-repo"
ref = "main"
include = ["src/**/*.py"]
description = "Our best production code"
EOF
# Sync the corpus to a local directory
eh corpus sync my-corpus.toml corpus/my-team-best
# Train
eh train corpus/my-team-best --language python -o models/my-model.npz
See eh train for the full training reference.
Model contents¶
Each .npz model contains:
- PCA components: The principal component matrix defining the quality manifold
- Corpus statistics: Mean and standard deviation of the training feature vectors
- Calibration data: Empirical score distribution (p25/p75 thresholds) from the training corpus
- Exemplar refs: Compressed code content from selected high-quality samples, used for NCD comparison
- Metadata: Language, corpus class, training date, file count