Development
This page is for contributors and integrators: how the source tree is organized, how to run the tests, and the conventions the codebase follows. End users do not need to read this.
Repository layout
cherimoya/
├── cherimoya/              # The Python package
│   ├── __init__.py         # Public API re-exports: Cherimoya, CheriBlock, EMA
│   ├── cherimoya.py        # Cherimoya model + EMA wrapper + fit/save/load
│   ├── cheri.py            # CheriBlock + Triton kernels + dispatcher
│   ├── io.py               # PeakGenerator + PeakNegativeSampler
│   ├── losses.py           # Profile MNLL + log1pMSE mixture loss
│   └── performance.py      # Evaluation metrics
├── cherimoya_cli/          # The CLI entry-point package
│   ├── __main__.py         # Argparse driver and subcommand registry
│   ├── defaults.py         # All default JSON parameter dicts
│   ├── utils.py            # JSON merging and parameter helpers
│   └── commands/           # One file per subcommand
│       ├── pipeline.py
│       ├── pipeline_json.py
│       ├── batch.py
│       ├── fit.py
│       ├── evaluate.py
│       ├── attribute.py
│       ├── seqlets.py
│       ├── marginalize.py
│       └── negatives.py
├── tests/                  # Pytest suite (see below)
├── docs/                   # Sphinx docs (this site)
├── imgs/                   # Architecture / pipeline diagrams
├── bench_kernels.py        # Standalone forward-path benchmark
└── pyproject.toml          # Build, deps, and tooling config
There are two top-level packages: cherimoya is the model and data plumbing,
and cherimoya_cli is the command-line tool. The dependency runs one way only:
cherimoya_cli imports cherimoya, never the reverse.
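The one-way dependency can be checked mechanically. The following is a minimal sketch, not part of the repository: the helper names `imported_top_level_modules` and `find_reverse_imports` are hypothetical, and the check only inspects static `import` statements using the standard-library `ast` module.

```python
import ast
from pathlib import Path


def imported_top_level_modules(source: str) -> set[str]:
    """Return the top-level names of all absolute imports in a source string."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # level == 0 means an absolute import; relative imports are skipped.
            found.add(node.module.split(".")[0])
    return found


def find_reverse_imports(package_dir: str, forbidden: str = "cherimoya_cli") -> list[str]:
    """List files under package_dir that import the forbidden package."""
    offenders = []
    for path in sorted(Path(package_dir).rglob("*.py")):
        source = path.read_text(encoding="utf-8")
        if forbidden in imported_top_level_modules(source):
            offenders.append(str(path))
    return offenders


if __name__ == "__main__":
    # Assumes the script is run from the repository root.
    for offender in find_reverse_imports("cherimoya"):
        print(f"reverse import found in {offender}")
```

A check like this could run as an ordinary test so that an accidental import of the CLI package from the model package fails early.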
Public vs. private API
The convention is the standard Python one: anything prefixed with an underscore is private, and may change or be removed without notice. Explicitly:
- Public symbols, re-exported from cherimoya.__init__: Cherimoya, CheriBlock, EMA.
- Public module-level symbols: PeakGenerator(), PeakNegativeSampler, fused_dilated_conv_norm(), FusedDilatedConvNormFunc, calculate_performance_measures() and its component metrics, and _mixture_loss() (despite the underscore, it is the trainer’s loss function and its API is stable).
- Private and may change: anything else, including the Triton kernels (_fwd_*, _bwd_*, _fwd_inf_*), the CPU fallback (_cheri_conv_norm_cpu), the CheriBlock weight cache (_w_cache), and the model’s checkpoint-payload helper (_init_kwargs).
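As an illustration of the naming rule only, the sketch below classifies names by the underscore convention with a single documented exception. The helpers `is_public` and `public_names` are hypothetical and do not exist in the package; the sketch also does not capture the "anything else" clause for names that lack an underscore but are not listed above.

```python
# Names this page lists as stable despite the leading underscore.
STABLE_UNDERSCORE_NAMES = frozenset({"_mixture_loss"})


def is_public(name: str) -> bool:
    """Apply the underscore convention, honoring the documented exception."""
    return not name.startswith("_") or name in STABLE_UNDERSCORE_NAMES


def public_names(names) -> list[str]:
    """Return the sorted subset of names that the convention treats as public."""
    return sorted(n for n in names if is_public(n))


if __name__ == "__main__":
    sample = ["Cherimoya", "EMA", "_mixture_loss", "_w_cache", "_init_kwargs"]
    print(public_names(sample))
```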
Development install
For development, install in editable mode with the docs extra:
git clone https://github.com/jmschrei/cherimoya.git
cd cherimoya
pip install -e ".[docs]"
The docs extra adds sphinx, furo, and
sphinx-copybutton, which you need to build this documentation
locally:
cd docs
sphinx-build -b html . _build
The build produces docs/_build/index.html. Read the Docs runs the
same command with the same dependency set.
Running the tests
The test suite lives in tests/ and uses pytest.
pytest tests/
Test files:
| File | Covers |
|---|---|
| | Cheri Block forward parity (CPU vs training Triton vs inference megakernel), backward parity against CPU autograd, weight-cache invalidation, dtype matrix. |
| | Full Cherimoya forward/backward parity, no_grad == grad-enabled equivalence, EMA-applied save/load round trip. |
| | |
| | EMA update/apply/restore semantics. |
| | |
| | Evaluation-metric correctness. |
| | End-to-end fit step on tiny data: confirms optimizers, schedulers, EMA, and checkpoint paths are wired correctly. |
| | JSON merge and default-handling helpers. |
Markers:
- @pytest.mark.cuda: requires a CUDA device; skipped on CPU-only hosts.
- @pytest.mark.triton: requires both a CUDA device and a Triton install.
Both markers are wired through tests/conftest.py, which also
disables torch.compile for the suite so tests don’t pay the
several-minute autotune cost on every run.
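For readers unfamiliar with how such markers are usually wired, here is a minimal sketch of a conftest.py collection hook. It is not the project's actual conftest.py: the helper names `skip_reason` and `_cuda_available` are hypothetical, and the step that disables torch.compile is left out because this page does not say how it is done.

```python
import importlib.util


def skip_reason(marker_names, has_cuda, has_triton):
    """Return a skip reason for a test with these markers, or None to run it."""
    if "triton" in marker_names and not (has_cuda and has_triton):
        return "requires a CUDA device and a Triton install"
    if "cuda" in marker_names and not has_cuda:
        return "requires a CUDA device"
    return None


def _cuda_available():
    """Detect CUDA without failing on hosts where torch is not installed."""
    if importlib.util.find_spec("torch") is None:
        return False
    import torch
    return torch.cuda.is_available()


def pytest_configure(config):
    # Register the markers so pytest does not warn about unknown marks.
    config.addinivalue_line("markers", "cuda: requires a CUDA device")
    config.addinivalue_line("markers", "triton: requires a CUDA device and Triton")


def pytest_collection_modifyitems(config, items):
    import pytest  # imported here so the helpers above stay importable without pytest

    has_cuda = _cuda_available()
    has_triton = importlib.util.find_spec("triton") is not None
    for item in items:
        names = {marker.name for marker in item.iter_markers()}
        reason = skip_reason(names, has_cuda, has_triton)
        if reason is not None:
            item.add_marker(pytest.mark.skip(reason=reason))
```

Keeping the decision in a plain function makes the skip logic itself testable on any host, with or without a GPU.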
To run only the CPU-safe subset:
pytest tests/ -m "not cuda and not triton"
To run only the GPU parity tests:
pytest tests/ -m "cuda or triton"
Benchmarking
bench_kernels.py at the repo root is a standalone script that
times the three forward paths (CPU fallback, training Triton kernels,
and inference megakernel) and checks that their outputs agree to within
machine precision. It is intentionally not packaged with the
install. Run it with:
python bench_kernels.py
See Benchmarks for the published numbers and the measurement methodology.
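The general shape of such a script (time several implementations of the same computation, then check that they agree) can be sketched with the standard library alone. This is not the contents of bench_kernels.py: the three `path_*` functions are hypothetical stand-ins that subtract the mean from a list, and the tolerances are arbitrary.

```python
import math
import time


def bench(fn, arg, repeats=5):
    """Return (best wall-clock seconds over repeats, result of the last call)."""
    best, result = math.inf, None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(arg)
        best = min(best, time.perf_counter() - start)
    return best, result


def agree(a, b, rel_tol=1e-6, abs_tol=1e-9):
    """True if two sequences have equal length and elementwise-close values."""
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol) for x, y in zip(a, b)
    )


# Hypothetical stand-ins for three implementations of one computation.
def path_reference(xs):
    mean = sum(xs) / len(xs)
    return [x - mean for x in xs]


def path_loop(xs):
    total = 0.0
    for x in xs:
        total += x
    inv_n = 1.0 / len(xs)
    return [x - total * inv_n for x in xs]


def path_fsum(xs):
    mean = math.fsum(xs) / len(xs)
    return [x - mean for x in xs]


def compare_paths(xs):
    """Time every path and report whether all outputs match the reference."""
    paths = {"reference": path_reference, "loop": path_loop, "fsum": path_fsum}
    timings, outputs = {}, {}
    for name, fn in paths.items():
        timings[name], outputs[name] = bench(fn, xs)
    all_agree = all(agree(outputs["reference"], out) for out in outputs.values())
    return timings, all_agree


if __name__ == "__main__":
    timings, all_agree = compare_paths([0.5 * i for i in range(1000)])
    print(timings, all_agree)
```

When timing GPU code, wall-clock measurements like these are only meaningful if the device is synchronized before each clock read (for example with `torch.cuda.synchronize()`), since kernel launches are asynchronous.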
Coding conventions
- Tabs, not spaces. The codebase uses tab indentation throughout.
- Channels-last layout (N, L, C) is used inside the Cheri Block backbone. The input stem and output heads do the necessary transpositions. New blocks should follow the same convention.
- fp32 for normalization statistics even under bf16 autocast. Both the CPU fallback and the Triton kernels accumulate sum/sq_sum in fp32; this is load-bearing for stability and shouldn’t be changed casually.
- Triton autotune keys. Kernels are keyed by (C, L) so the same configuration is reused across batches with the same shapes. Adding a new kernel that depends on a new shape parameter should add that parameter to the key.
- No public bias terms inside Cheri Blocks. The input stem, profile head, and count head use biases; the block layers do not. This is intentional (see Architecture).
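To see why the fp32 accumulation rule matters, the following pure-Python simulation compares accumulating sum/sq_sum with every intermediate rounded to bfloat16 against rounding to float32. It does not use the project's kernels or PyTorch; the rounding helpers are illustrative, handle only finite values, and emulate each format by rounding a Python float after every addition.

```python
import struct


def round_to_fp32(x: float) -> float:
    """Round a Python float (fp64) to the nearest float32 value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]


def round_to_bf16(x: float) -> float:
    """Round to the nearest bfloat16 value (round-to-nearest-even, finite inputs only)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias for the 16 discarded bits
    bits &= 0xFFFF0000                   # keep sign, exponent, and 7 mantissa bits
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def accumulate(values, round_fn):
    """Accumulate sum and sum of squares, rounding after every addition."""
    total = sq_total = 0.0
    for v in values:
        total = round_fn(total + v)
        sq_total = round_fn(sq_total + v * v)
    return total, sq_total


if __name__ == "__main__":
    values = [1.0] * 4096
    print(accumulate(values, round_to_bf16))  # (256.0, 256.0): the sum stalls
    print(accumulate(values, round_to_fp32))  # (4096.0, 4096.0)
```

With an 8-bit significand, bfloat16 cannot represent 257, so adding 1.0 to a running total of 256.0 rounds back to 256.0 and the statistics stop growing. A mean or variance computed from such totals would be badly wrong for long sequences.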