CLI Pipeline Walkthrough ======================== This page is the user's guide to the end-to-end Cherimoya CLI: how to take raw BAM/BED files and a reference genome through peak calling, training, attribution, seqlet calling, motif discovery, and motif marginalization in a single reproducible run. .. image:: ../../imgs/pipeline.png :align: center :width: 70% :alt: Cherimoya end-to-end pipeline | For a complete list of every CLI flag, every JSON key, and every default value, see :doc:`../cli`. For assay-specific recipes (TF ChIP-seq, ATAC-seq, DNase-seq) see the recipe pages. Prerequisites ------------- You will need: - Cherimoya installed (see :doc:`../installation`). - A reference genome FASTA file (e.g. ``hg38.fa``). - One or more signal files (BAM/SAM, fragment BED/TSV, or bigWig). - A motif database in MEME format (optional, used by TF-MoDISco reports and marginalization). - Optional: a BED of peak coordinates. If you don't provide one, the pipeline calls peaks with MACS3. - Optional: BED of GC-matched negative regions. If you don't provide one, the pipeline samples them automatically. All file inputs can be remote URLs (``http://``, ``https://``, ``s3://``, ``gs://``). ``bam2bw`` streams the data directly without downloading the file first; the validation step before the run only checks paths that look local. Subcommands at a glance ----------------------- The two commands you actually run, in the order you run them: * ``cherimoya pipeline-json`` — emit a fully-populated JSON config from a handful of CLI pointers. * ``cherimoya pipeline`` — run the full end-to-end pipeline from that JSON. Every other subcommand (``fit``, ``evaluate``, ``attribute``, ``seqlets``, ``marginalize``, ``negatives``, ``batch``) corresponds to an individual pipeline stage and can be run on its own. Each is driven by its own JSON, which the pipeline writes alongside its outputs. See :doc:`../cli` for the full subcommand reference. Two-step workflow ----------------- The expected workflow is two steps: generate a JSON, then run it. Step 1: generate a pipeline JSON ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ For stranded ChIP-seq with input controls: .. code-block:: bash cherimoya pipeline-json \ -s hg38.fa -p peaks.narrowPeak \ -i input1.bam -i input2.bam \ -c control1.bam -c control2.bam \ -m JASPAR_2024.meme -n my_experiment -o pipeline.json For unstranded paired-end ATAC-seq with the standard +4 / -4 fragment shift: .. code-block:: bash cherimoya pipeline-json \ -s hg38.fa -p peaks.narrowPeak \ -i fragments.bam -m JASPAR_2024.meme \ -n atac_experiment -o pipeline.json \ -ps 4 -ns -4 -u -f -pe Repeating ``-i`` adds another signal file. Repeating ``-c`` adds another control. Repeating ``-p`` adds another peak file. The full ``pipeline-json`` flag list is in :doc:`../cli`. The resulting JSON has every parameter the pipeline uses, filled in from ``cherimoya_cli.defaults.default_pipeline_parameters``. You only need to edit the values you want to change from defaults; the runtime re-merges with defaults before each step. Step 2: edit and run ~~~~~~~~~~~~~~~~~~~~ Open the JSON and override any defaults — model width, training and validation chromosomes, seqlet p-value threshold, MoDISco settings, anything in the JSON. .. code-block:: bash cherimoya pipeline -p pipeline.json What this does, in order: 1. **MACS3 peak calling** (skipped if ``loci`` is set). 2. **bam2bw conversion** to bigWig (skipped if signals are already bigWigs). 3. **GC-matched negative sampling** (skipped if ``negatives`` is set). 4. **Model training** — writes ``{name}.torch`` (best checkpoint by validation count Pearson) and ``{name}.final.torch`` (EMA weights at end of training), plus ``{name}.log``. 5. **Attribution** via saturation mutagenesis over the central 400 bp of each example, saved as ``{name}.attributions.{ohe,attr}.npz`` and ``{name}.attributions.idxs.npy``. 6. **Seqlet identification** with TF-MoDISco-style recursive seqlet calling on the (attribution × one-hot) signal, written to ``{name}.seqlets.bed``. 7. **tomtom-lite seqlet annotation** against the motif database, if ``motifs`` was provided; results in ``{name}.seqlets_annotated.bed`` and a counts table in ``{name}.motif_seqlet_count.tsv``. 8. **TF-MoDISco motif discovery** and HTML report; results in ``{name}_modisco_results.h5`` and ``{name}_modisco/``. 9. **Marginalization** — measures the predicted effect of inserting each motif into negative backgrounds. Output in ``{name}_marginalize/``. Each sub-step writes its own JSON snapshot (``{name}.fit.json``, ``{name}.attribute.json``, …) so individual stages can be re-run in isolation, and pre-existing snapshots can be edited and re-run if you only need to change one stage. Running individual steps ------------------------ Each stage has its own subcommand and JSON schema. You can run them directly: .. code-block:: bash cherimoya fit -p my_experiment.fit.json cherimoya evaluate -p my_experiment.evaluate.json cherimoya attribute -p my_experiment.attribute.json cherimoya seqlets -p my_experiment.seqlets.json cherimoya marginalize -p my_experiment.marginalize.json The defaults for each command are in ``cherimoya_cli.defaults.default_*_parameters``; the merged JSON snapshots written by ``pipeline`` make them concrete. Every step JSON also supports ``"skip": true`` to no-op that step. Batch mode ---------- For training the same configuration across many datasets in parallel on multiple GPUs: .. code-block:: bash cherimoya batch -p batch.json Minimal ``batch.json``: .. code-block:: json { "name": null, "device": "*", "signals": "/path/to/data/*.bam", "sequences": "/path/to/hg38.fa" } Behavior: * ``"device": "*"`` is expanded to the full list of available CUDA devices (``cuda:0``, ``cuda:1``, …). * When ``"signals"`` is a glob and ``"name"`` is ``null``, names are derived from the signal filenames automatically. * Each derived job is written to its own ``{name}.pipeline.json`` and run via ``cherimoya pipeline``. Joblib distributes jobs round-robin across the device list. To use batch mode with custom names or paired-up controls/loci/negatives, provide same-length lists in those fields; element ``i`` is consumed by job ``i``. About bam2bw ------------ The pipeline does not call BAM/SAM/fragment files directly into the training step — it converts them to bigWig first using the ``bam2bw`` tool, which is a hard dependency. ``bam2bw`` streams the input (local or remote URL), counts reads or fragments into per-base coverage, optionally applies a ± shift (used for Tn5 / DNase corrections), and emits one bigWig (unstranded) or two bigWigs (``+`` / ``-`` stranded). This conversion is what enables remote URLs as inputs: ``bam2bw`` fetches reads via byte-range requests rather than downloading the whole file. The resulting bigWigs are written into the working directory and re-used by every downstream stage. If your signals are already bigWigs, the pipeline skips this step automatically. To use ``bam2bw`` standalone (outside the pipeline), invoke it directly — see its own documentation. Calling negatives independently ------------------------------- ``cherimoya pipeline`` calls negatives for you when the JSON's ``negatives`` field is null. If you want to sample GC-matched negatives without running the full pipeline (e.g. you're going to train a non-Cherimoya model on the same regions), use the ``negatives`` subcommand: .. code-block:: bash cherimoya negatives \ -i peaks.narrowPeak \ -f hg38.fa \ -b signal.bw \ -o negatives.bed \ --bin_width 0.02 --max_n_perc 0.1 --beta 0.5 The output is a 3-column BED of regions matched by GC content to the input peaks, with at most ``--max_n_perc`` fraction of ``N`` bases and (optionally) signal below ``--beta × min(peak_counts)``. Running on a non-hg38 reference ------------------------------- The defaults assume hg38. To run on a different reference (mouse mm10, non-human, or a different hg version), override three keys in the pipeline JSON: * ``fit_parameters.training_chroms`` — chromosomes used to train. Replace with the appropriate list for your reference (e.g. mm10: ``["chr1", "chr2", …, "chr19", "chrX", "chrY"]`` minus the two validation chromosomes you choose). * ``fit_parameters.validation_chroms`` — held-out chromosomes for validation. Two chromosomes is enough. * ``preprocessing_parameters.callpeaks_gsize`` — MACS3 effective genome size. Use ``"mm"`` for mouse, ``"ce"`` for *C. elegans*, ``"dm"`` for fly, or a numeric value (e.g. ``"2.7e9"`` for hg38) for any other organism. For non-chromosome reference contigs (scaffolds, alternate haplotypes, viral integrations), exclude them by listing them explicitly in the training/validation chromosome lists, or by passing an ``exclusion_lists`` BED. Outputs ------- A successful pipeline run leaves the following in the working directory (with ``{name}`` from the ``-n`` flag in step 1): .. list-table:: :header-rows: 1 :widths: 50 50 * - File - Contents * - ``{name}.torch`` - Best-by-validation-count-Pearson checkpoint (config + state_dict). * - ``{name}.final.torch`` - Final EMA-applied checkpoint at end of training. * - ``{name}.log`` - Per-epoch training and validation metrics (TSV). * - ``{name}.performance.tsv`` - Final held-out chromosome metrics (single row TSV). * - ``{name}.+.bw`` / ``{name}.-.bw`` - bigWigs produced by ``bam2bw`` for the stranded signal. * - ``{name}.bw`` - bigWig produced by ``bam2bw`` for unstranded signal. * - ``{name}.control.{+,-}.bw`` - bigWigs produced by ``bam2bw`` for stranded controls. * - ``{name}_peaks.narrowPeak`` - Peaks called by MACS3 (when ``loci`` not provided). * - ``{name}.negatives.bed`` - GC-matched negative regions (when ``negatives`` not provided). * - ``{name}.attributions.ohe.npz`` - One-hot encoded sequences over the central 400 bp window. * - ``{name}.attributions.attr.npz`` - Hypothetical importance scores (saturation mutagenesis). * - ``{name}.attributions.idxs.npy`` - Boolean mask into the original loci list selecting examples that had no Ns over the window. * - ``{name}.seqlets.bed`` - Recursive seqlets in genome coordinates. * - ``{name}.seqlets_annotated.bed`` - Seqlets with closest-motif annotation from tomtom-lite. * - ``{name}.motif_seqlet_count.tsv`` - Count of seqlets matched per motif. * - ``{name}_modisco_results.h5`` - TF-MoDISco pattern HDF5. * - ``{name}_modisco/`` - TF-MoDISco HTML report. * - ``{name}_marginalize/`` - Motif marginalization report (HTML plus CSVs). * - ``{name}.{pipeline,fit,evaluate,attribute,seqlets,marginalize}.json`` - Per-step JSON snapshots of the actual parameters used.