Models

eigenhelm models are .npz files containing a trained PCA eigenspace, exemplar vectors, and calibration data.
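Because .npz is NumPy's zipped archive format, a model file can be inspected with np.load. The following is a minimal sketch, not part of eigenhelm: the helper name summarize_npz and the array names in the demo archive ("components", "mean") are placeholders chosen for illustration, since the actual key names inside an eigenhelm model are not documented here.

```python
import os
import tempfile

import numpy as np


def summarize_npz(path):
    """Return {array_name: (shape, dtype)} for every array stored in an .npz file."""
    # allow_pickle=False refuses object arrays, which is the safe default
    # for files that come from an untrusted source.
    with np.load(path, allow_pickle=False) as archive:
        return {
            name: (archive[name].shape, str(archive[name].dtype))
            for name in archive.files
        }


# Demo on a throwaway archive. The array names below are hypothetical
# stand-ins, not the real keys of an eigenhelm model.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.npz")
np.savez(demo_path, components=np.zeros((3, 5)), mean=np.zeros(5))

summary = summarize_npz(demo_path)
for name, (shape, dtype) in summary.items():
    print(f"{name}: shape={shape}, dtype={dtype}")
```

Pointing summarize_npz at a real model file lists whatever arrays that file actually holds. If a model stores object arrays, np.load raises an error under allow_pickle=False; enable pickling only for files you trust.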

Bundled models

Model                    Languages                 PCs  Vectors  Use case
general-polyglot-v1.npz  Python, JS, TS, Go, Rust  36   8,228    General-purpose (default)
lang-python.npz          Python                                  Python-specific
lang-javascript.npz      JavaScript                              JavaScript-specific
lang-typescript.npz      TypeScript                              TypeScript-specific
lang-go.npz              Go                                      Go-specific
lang-rust.npz            Rust                                    Rust-specific
pattern-cli.npz          Mixed                                   CLI tool patterns
baseline.npz             Python                                  Baseline reference

When no model is specified, eigenhelm uses the bundled polyglot model (general-polyglot-v1.npz).

Choosing a model

  • Language-specific models are more discriminating for their target language
  • The polyglot model works across all supported languages and is the best default
  • Custom models can be trained on your own curated corpora for domain-specific evaluation
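The guidance above can be sketched as a small lookup: prefer the language-specific model when one exists, otherwise fall back to the polyglot default. This helper is hypothetical and not part of eigenhelm. The model file names come from the bundled models table, while the file-extension mapping and the function name pick_model are assumptions made for illustration.

```python
from pathlib import Path

# File names are taken from the bundled models table.
# The extension-to-model mapping is an illustrative assumption.
DEFAULT_MODEL = "general-polyglot-v1.npz"
MODEL_BY_EXTENSION = {
    ".py": "lang-python.npz",
    ".js": "lang-javascript.npz",
    ".ts": "lang-typescript.npz",
    ".go": "lang-go.npz",
    ".rs": "lang-rust.npz",
}


def pick_model(source_path):
    """Return the bundled model name suggested for a source file."""
    suffix = Path(source_path).suffix.lower()
    # Unknown or missing extensions fall back to the polyglot default.
    return MODEL_BY_EXTENSION.get(suffix, DEFAULT_MODEL)


print(pick_model("src/app.py"))
print(pick_model("README.md"))
```

How the chosen file is then handed to eigenhelm is left out on purpose; consult the CLI reference for the supported option.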

Training a custom model

# Prepare a corpus manifest
cat > my-corpus.toml << 'EOF'
[corpus]
name = "my-team-best"
language = "python"
class = "A"

[[target]]
name = "my-code"
url = "https://github.com/my-org/my-repo"
ref = "main"
include = ["src/**/*.py"]
description = "Our best production code"
EOF

# Sync the corpus to a local directory
eh corpus sync my-corpus.toml corpus/my-team-best

# Train
eh train corpus/my-team-best --language python -o models/my-model.npz

See eh train for the full training reference.

Model contents

Each .npz model contains:

  • PCA components: The principal component matrix defining the quality manifold
  • Corpus statistics: Mean and standard deviation of the training feature vectors
  • Calibration data: Empirical score distribution (25th and 75th percentile thresholds, p25/p75) from the training corpus
  • Exemplar refs: Compressed code content from selected high-quality samples, used for normalized compression distance (NCD) comparison
  • Metadata: Language, corpus class, training date, file count
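To make two of these pieces concrete, here is a minimal sketch of how a PCA matrix with corpus statistics, and compressed exemplars, are typically used. It is not eigenhelm's implementation. The function names, the array layout (one principal component per row), the choice to standardize before projecting, and the use of zlib as the compressor are all assumptions.

```python
import zlib

import numpy as np


def project(features, mean, std, components):
    """Standardize a feature vector, then project it onto the principal components.

    Assumes `components` has shape (n_components, n_features).
    """
    safe_std = np.where(std > 0, std, 1.0)  # avoid division by zero
    z = (features - mean) / safe_std
    return components @ z


def ncd(a: bytes, b: bytes) -> float:
    """Normalized compression distance, with zlib standing in for the compressor."""
    ca = len(zlib.compress(a))
    cb = len(zlib.compress(b))
    cab = len(zlib.compress(a + b))
    return (cab - min(ca, cb)) / max(ca, cb)


# Toy data: 4 features projected onto 2 components.
coords = project(
    features=np.array([1.0, 2.0, 3.0, 4.0]),
    mean=np.zeros(4),
    std=np.ones(4),
    components=np.eye(2, 4),
)
print(coords)

snippet = b"def add(a, b):\n    return a + b\n" * 20
unrelated = bytes(range(256)) * 3
print(ncd(snippet, snippet), ncd(snippet, unrelated))
```

A lower NCD means the two inputs share more compressible structure, so code that resembles an exemplar scores closer to 0 and unrelated content scores closer to 1.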