CLI Pipeline Walkthrough¶

This page is the user’s guide to the end-to-end Cherimoya CLI: how to take raw BAM/BED files and a reference genome through peak calling, training, attribution, seqlet calling, motif discovery, and motif marginalization in a single reproducible run.

For a complete list of every CLI flag, every JSON key, and every default value, see CLI Reference. For assay-specific recipes (TF ChIP-seq, ATAC-seq, DNase-seq) see the recipe pages.

Prerequisites¶

You will need:

Cherimoya installed (see Installation).
A reference genome FASTA file (e.g. hg38.fa).
One or more signal files (BAM/SAM, fragment BED/TSV, or bigWig).
A motif database in MEME format (optional, used by TF-MoDISco reports and marginalization).
Optional: a BED of peak coordinates. If you don’t provide one, the pipeline calls peaks with MACS3.
Optional: BED of GC-matched negative regions. If you don’t provide one, the pipeline samples them automatically.

All file inputs can be remote URLs (http://, https://, s3://, gs://). bam2bw streams the data directly without downloading the file first; the validation step before the run only checks paths that look local.

Subcommands at a glance¶

The two commands you actually run, in the order you run them:

cherimoya pipeline-json — emit a fully-populated JSON config from a handful of CLI pointers.
cherimoya pipeline — run the full end-to-end pipeline from that JSON.

Every other subcommand (fit, evaluate, attribute, seqlets, marginalize, negatives, batch) corresponds to an individual pipeline stage and can be run on its own. Each is driven by its own JSON, which the pipeline writes alongside its outputs. See CLI Reference for the full subcommand reference.

Two-step workflow¶

The expected workflow is two steps: generate a JSON, then run it.

Step 1: generate a pipeline JSON¶

For stranded ChIP-seq with input controls:

cherimoya pipeline-json \
    -s hg38.fa -p peaks.narrowPeak \
    -i input1.bam -i input2.bam \
    -c control1.bam -c control2.bam \
    -m JASPAR_2024.meme -n my_experiment -o pipeline.json

For unstranded paired-end ATAC-seq with the standard +4 / -4 fragment shift:

cherimoya pipeline-json \
    -s hg38.fa -p peaks.narrowPeak \
    -i fragments.bam -m JASPAR_2024.meme \
    -n atac_experiment -o pipeline.json \
    -ps 4 -ns -4 -u -f -pe

Repeating -i adds another signal file. Repeating -c adds another control. Repeating -p adds another peak file. The full pipeline-json flag list is in CLI Reference.

The resulting JSON has every parameter the pipeline uses, filled in from cherimoya_cli.defaults.default_pipeline_parameters. You only need to edit the values you want to change from defaults; the runtime re-merges with defaults before each step.

Step 2: edit and run¶

Open the JSON and override any defaults — model width, training and validation chromosomes, seqlet p-value threshold, MoDISco settings, anything in the JSON.

cherimoya pipeline -p pipeline.json

What this does, in order:

MACS3 peak calling (skipped if loci is set).
bam2bw conversion to bigWig (skipped if signals are already bigWigs).
GC-matched negative sampling (skipped if negatives is set).
Model training — writes {name}.torch (best checkpoint by validation count Pearson) and {name}.final.torch (EMA weights at end of training), plus {name}.log.
Attribution via saturation mutagenesis over the central 400 bp of each example, saved as {name}.attributions.{ohe,attr}.npz and {name}.attributions.idxs.npy.
Seqlet identification with TF-MoDISco-style recursive seqlet calling on the (attribution × one-hot) signal, written to {name}.seqlets.bed.
tomtom-lite seqlet annotation against the motif database, if motifs was provided; results in {name}.seqlets_annotated.bed and a counts table in {name}.motif_seqlet_count.tsv.
TF-MoDISco motif discovery and HTML report; results in {name}_modisco_results.h5 and {name}_modisco/.
Marginalization — measures the predicted effect of inserting each motif into negative backgrounds. Output in {name}_marginalize/.

Each sub-step writes its own JSON snapshot ({name}.fit.json, {name}.attribute.json, …) so individual stages can be re-run in isolation, and pre-existing snapshots can be edited and re-run if you only need to change one stage.

Running individual steps¶

Each stage has its own subcommand and JSON schema. You can run them directly:

cherimoya fit -p my_experiment.fit.json
cherimoya evaluate -p my_experiment.evaluate.json
cherimoya attribute -p my_experiment.attribute.json
cherimoya seqlets -p my_experiment.seqlets.json
cherimoya marginalize -p my_experiment.marginalize.json

The defaults for each command are in cherimoya_cli.defaults.default_*_parameters; the merged JSON snapshots written by pipeline make them concrete.

Every step JSON also supports "skip": true to no-op that step.

Batch mode¶

For training the same configuration across many datasets in parallel on multiple GPUs:

cherimoya batch -p batch.json

Minimal batch.json:

{
    "name": null,
    "device": "*",
    "signals": "/path/to/data/*.bam",
    "sequences": "/path/to/hg38.fa"
}

Behavior:

"device": "*" is expanded to the full list of available CUDA devices (cuda:0, cuda:1, …).
When "signals" is a glob and "name" is null, names are derived from the signal filenames automatically.
Each derived job is written to its own {name}.pipeline.json and run via cherimoya pipeline. Joblib distributes jobs round-robin across the device list.

To use batch mode with custom names or paired-up controls/loci/negatives, provide same-length lists in those fields; element i is consumed by job i.

About bam2bw¶

The pipeline does not call BAM/SAM/fragment files directly into the training step — it converts them to bigWig first using the bam2bw tool, which is a hard dependency. bam2bw streams the input (local or remote URL), counts reads or fragments into per-base coverage, optionally applies a ± shift (used for Tn5 / DNase corrections), and emits one bigWig (unstranded) or two bigWigs (+ / - stranded).

This conversion is what enables remote URLs as inputs: bam2bw fetches reads via byte-range requests rather than downloading the whole file. The resulting bigWigs are written into the working directory and re-used by every downstream stage.

If your signals are already bigWigs, the pipeline skips this step automatically. To use bam2bw standalone (outside the pipeline), invoke it directly — see its own documentation.

Calling negatives independently¶

cherimoya pipeline calls negatives for you when the JSON’s negatives field is null. If you want to sample GC-matched negatives without running the full pipeline (e.g. you’re going to train a non-Cherimoya model on the same regions), use the negatives subcommand:

cherimoya negatives \
    -i peaks.narrowPeak \
    -f hg38.fa \
    -b signal.bw \
    -o negatives.bed \
    --bin_width 0.02 --max_n_perc 0.1 --beta 0.5

The output is a 3-column BED of regions matched by GC content to the input peaks, with at most --max_n_perc fraction of N bases and (optionally) signal below --beta × min(peak_counts).

Running on a non-hg38 reference¶

The defaults assume hg38. To run on a different reference (mouse mm10, non-human, or a different hg version), override three keys in the pipeline JSON:

fit_parameters.training_chroms — chromosomes used to train. Replace with the appropriate list for your reference (e.g. mm10: ["chr1", "chr2", …, "chr19", "chrX", "chrY"] minus the two validation chromosomes you choose).
fit_parameters.validation_chroms — held-out chromosomes for validation. Two chromosomes is enough.
preprocessing_parameters.callpeaks_gsize — MACS3 effective genome size. Use "mm" for mouse, "ce" for C. elegans, "dm" for fly, or a numeric value (e.g. "2.7e9" for hg38) for any other organism.

For non-chromosome reference contigs (scaffolds, alternate haplotypes, viral integrations), exclude them by listing them explicitly in the training/validation chromosome lists, or by passing an exclusion_lists BED.

Outputs¶

A successful pipeline run leaves the following in the working directory (with {name} from the -n flag in step 1):

File	Contents
`{name}.torch`	Best-by-validation-count-Pearson checkpoint (config + state_dict).
`{name}.final.torch`	Final EMA-applied checkpoint at end of training.
`{name}.log`	Per-epoch training and validation metrics (TSV).
`{name}.performance.tsv`	Final held-out chromosome metrics (single row TSV).
`{name}.+.bw` / `{name}.-.bw`	bigWigs produced by `bam2bw` for the stranded signal.
`{name}.bw`	bigWig produced by `bam2bw` for unstranded signal.
`{name}.control.{+,-}.bw`	bigWigs produced by `bam2bw` for stranded controls.
`{name}_peaks.narrowPeak`	Peaks called by MACS3 (when `loci` not provided).
`{name}.negatives.bed`	GC-matched negative regions (when `negatives` not provided).
`{name}.attributions.ohe.npz`	One-hot encoded sequences over the central 400 bp window.
`{name}.attributions.attr.npz`	Hypothetical importance scores (saturation mutagenesis).
`{name}.attributions.idxs.npy`	Boolean mask into the original loci list selecting examples that had no Ns over the window.
`{name}.seqlets.bed`	Recursive seqlets in genome coordinates.
`{name}.seqlets_annotated.bed`	Seqlets with closest-motif annotation from tomtom-lite.
`{name}.motif_seqlet_count.tsv`	Count of seqlets matched per motif.
`{name}_modisco_results.h5`	TF-MoDISco pattern HDF5.
`{name}_modisco/`	TF-MoDISco HTML report.
`{name}_marginalize/`	Motif marginalization report (HTML plus CSVs).
`{name}.{pipeline,fit,evaluate,attribute,seqlets,marginalize}.json`	Per-step JSON snapshots of the actual parameters used.