Models¶

eigenhelm models are .npz files containing a trained PCA eigenspace, exemplar vectors, and calibration data.

Bundled models¶

Model	Languages	PCs	Vectors	Use case
`general-polyglot-v1.npz`	Python, JS, TS, Go, Rust	36	8,228	General-purpose (default)
`lang-python.npz`	Python	—	—	Python-specific
`lang-javascript.npz`	JavaScript	—	—	JavaScript-specific
`lang-typescript.npz`	TypeScript	—	—	TypeScript-specific
`lang-go.npz`	Go	—	—	Go-specific
`lang-rust.npz`	Rust	—	—	Rust-specific
`pattern-cli.npz`	Mixed	—	—	CLI tool patterns
`baseline.npz`	Python	—	—	Baseline reference

When no model is specified, eigenhelm uses the bundled polyglot model.

Choosing a model¶

Language-specific models are more discriminating for their target language
The polyglot model works across all supported languages and is the best default
Custom models can be trained on your own curated corpora for domain-specific evaluation

Training a custom model¶

# Prepare a corpus manifest
cat > my-corpus.toml << 'EOF'
[corpus]
name = "my-team-best"
language = "python"
class = "A"

[[corpus.sources]]
type = "local"
path = "/path/to/curated/code"
EOF

# Sync the corpus
eh corpus sync my-corpus.toml

# Train
eh train --corpus corpus/my-team-best --language python --output models/my-model.npz

See eh train for the full training reference.

Model contents¶

Each .npz model contains:

PCA components: The principal component matrix defining the quality manifold
Corpus statistics: Mean and standard deviation of the training feature vectors
Calibration data: Empirical score distribution (p25/p75 thresholds) from the training corpus
Exemplar blobs: Selected high-quality code samples for NCD comparison
Metadata: Language, corpus class, training date, vector count