CLI Pipeline Tutorial

This tutorial walks through using the Cherimoya command-line tool to run the full end-to-end pipeline from raw sequencing data to motif analysis.

Prerequisites

Before starting, make sure you have:

  • Cherimoya installed (see Installation)

  • A reference genome FASTA file (e.g., hg38.fa)

  • Signal files (BAM, SAM, BED, or bigWig format)

  • (Optional) Control signal files

Overview of CLI Commands

Command         Description
-------------   -------------------------------------------------
negatives       Sample GC-content-matched negative regions
pipeline-json   Generate a pipeline configuration JSON file
fit             Train a Cherimoya model
evaluate        Evaluate a trained model
attribute       Calculate sequence attributions
seqlets         Identify important subsequences from attributions
marginalize     Run marginalization experiments for motifs
pipeline        Run the full end-to-end pipeline
batch           Run multiple pipelines in parallel

Step 1: Generate a Pipeline Configuration

The pipeline is driven by a JSON configuration file. You can generate one using the pipeline-json command:

cherimoya pipeline-json \
    -s /path/to/hg38.fa \
    -i /path/to/signal.bam \
    -n my_ctcf_experiment \
    -o ctcf.pipeline.json

For stranded experiments with controls:

cherimoya pipeline-json \
    -s /path/to/hg38.fa \
    -i /path/to/treatment.bam \
    -c /path/to/control.bam \
    -n ctcf_stranded \
    -o ctcf_stranded.pipeline.json

Tip

Use -u for unstranded data and -f if your input consists of fragment files rather than aligned reads.
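When configurations are needed for several experiments, it can be convenient to assemble the pipeline-json command from a script. The following is a minimal Python sketch that only builds the argument list from the flags shown above. The helper name pipeline_json_command is hypothetical, and treating -u and -f as boolean switches that take no value is an assumption based on the tip, not something confirmed here.

```python
import shlex
from typing import List, Optional


def pipeline_json_command(
    sequences: str,
    signal: str,
    name: str,
    output: str,
    control: Optional[str] = None,
    unstranded: bool = False,
    fragments: bool = False,
) -> List[str]:
    """Build the argv list for `cherimoya pipeline-json` (hypothetical helper)."""
    cmd = ["cherimoya", "pipeline-json", "-s", sequences, "-i", signal]
    if control is not None:
        cmd += ["-c", control]
    if unstranded:
        cmd.append("-u")  # assumed to be a switch without a value
    if fragments:
        cmd.append("-f")  # assumed to be a switch without a value
    cmd += ["-n", name, "-o", output]
    return cmd


cmd = pipeline_json_command(
    "/path/to/hg38.fa",
    "/path/to/treatment.bam",
    name="ctcf_stranded",
    output="ctcf_stranded.pipeline.json",
    control="/path/to/control.bam",
)
# Print the command in copy-pasteable shell form.
print(shlex.join(cmd))
# To actually execute it: subprocess.run(cmd, check=True)
```

Building a list rather than a single string avoids quoting problems when paths contain spaces.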

Tip

If you are using ATAC-seq or DNase-seq data, you may need to shift the reads. Many packages shift ATAC-seq reads by +4/-5, but the recommended shift here is +4/-4. Use -ps and -ns to apply the shift, and make sure the input data has not already been shifted, otherwise the offset is applied twice.
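To make the shift arithmetic concrete, here is a small Python illustration. It is only a sketch of the +4/-4 convention: the function shift_five_prime is hypothetical, and how Cherimoya applies -ps and -ns internally (including its coordinate convention) is not shown in this tutorial, so treat the details as assumptions.

```python
def shift_five_prime(position: int, strand: str,
                     pos_shift: int = 4, neg_shift: int = -4) -> int:
    """Shift a read's 5' end by a strand-specific offset (illustrative only)."""
    if strand == "+":
        return position + pos_shift
    if strand == "-":
        return position + neg_shift
    raise ValueError(f"strand must be '+' or '-', got {strand!r}")


# A forward-strand read and a reverse-strand read with the same 5' coordinate.
forward = shift_five_prime(1000, "+")
reverse = shift_five_prime(1000, "-")
print(forward, reverse)

# Shifting data that was already shifted doubles the offset,
# which is why the input should be unshifted.
double_shifted = shift_five_prime(forward, "+")
print(double_shifted)
```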

Step 2: Customize the Parameters (Optional)

Open the generated JSON file and adjust parameters as needed. Key parameters include:

{
    "fit_parameters": {
        "n_filters": 64,
        "n_layers": 9,
        "max_epochs": 100,
        "lr": 0.004,
        "batch_size": 512,
        "max_jitter": 500,
        "early_stopping": 15
    }
}
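Instead of editing the file by hand, the parameters can also be changed from a script, which helps when the same overrides are applied to many configurations. The sketch below assumes only that the file has a top-level "fit_parameters" object as shown above; the helper override_fit_parameters is hypothetical, and the demo runs against a small stand-in file rather than a real generated configuration.

```python
import json
import tempfile
from pathlib import Path


def override_fit_parameters(config_path, **overrides):
    """Update existing keys under "fit_parameters" and write the file back."""
    path = Path(config_path)
    config = json.loads(path.read_text())
    fit = config.setdefault("fit_parameters", {})
    # Refuse keys that are not already present, to catch typos early.
    unknown = sorted(set(overrides) - set(fit))
    if unknown:
        raise KeyError(f"not present in fit_parameters: {unknown}")
    fit.update(overrides)
    path.write_text(json.dumps(config, indent=4) + "\n")
    return config


# Demo on a stand-in file; a real generated file will contain more keys.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "ctcf.pipeline.json"
    demo.write_text(json.dumps({"fit_parameters": {"max_epochs": 100, "batch_size": 512}}))
    updated = override_fit_parameters(demo, max_epochs=50)
    print(updated["fit_parameters"])
```

Rejecting unknown keys is a design choice of this sketch: a misspelled parameter name fails loudly instead of being written to the file unnoticed.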

Note

The pipeline will automatically call peaks with MACS3, convert BAMs to bigWigs, and generate GC-matched negatives if these are not explicitly provided in the JSON.

Step 3: Run the Pipeline

cherimoya pipeline -p ctcf.pipeline.json

The pipeline will execute the following steps in order:

  1. Peak calling (if loci is not provided) — uses MACS3

  2. Data conversion (if signal files are BAMs) — uses bam2bw

  3. Negative sampling (if negatives is not provided) — GC-matched

  4. Model training — trains a Cherimoya model with dual optimizers

  5. Attribution calculation — saturation mutagenesis on validation chroms

  6. Seqlet identification — extract important subsequences

  7. TF-MoDISco — motif discovery and report generation

  8. Marginalization — motif effect size estimation
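The conditional steps above can be summarized as a small decision function. This Python sketch is not Cherimoya's implementation; it only mirrors the rules listed in this section. The key names "loci" and "negatives" come from the list above, "signals" is assumed to hold the signal file path or paths, and planned_steps is a hypothetical helper.

```python
from typing import List


def planned_steps(config: dict) -> List[str]:
    """Return the steps that would run for a configuration (illustrative)."""
    steps = []
    if not config.get("loci"):
        steps.append("peak calling (MACS3)")
    signals = config.get("signals") or []
    if isinstance(signals, str):
        signals = [signals]
    if any(str(s).lower().endswith(".bam") for s in signals):
        steps.append("data conversion (bam2bw)")
    if not config.get("negatives"):
        steps.append("negative sampling (GC-matched)")
    # The remaining steps are listed without a condition.
    steps += [
        "model training",
        "attribution calculation",
        "seqlet identification",
        "TF-MoDISco",
        "marginalization",
    ]
    return steps


# Only a BAM signal is given, so all three optional steps are planned.
minimal = planned_steps({"signals": ["/path/to/signal.bam"]})
for number, step in enumerate(minimal, start=1):
    print(number, step)
```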

Running Individual Steps

You can also run steps individually using their respective JSON configuration files.

Training:

cherimoya fit -p fit_parameters.json

Evaluation:

cherimoya evaluate -p evaluate_parameters.json

Attribution:

cherimoya attribute -p attribute_parameters.json
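When several individual steps are run back to back, a short driver script can stop at the first failure. The following Python sketch uses only the three commands shown above and the standard subprocess module; the helper names and the dry_run option are hypothetical conveniences, not part of Cherimoya.

```python
import subprocess
from typing import Dict, List

# Step name -> JSON configuration file, in the order the steps should run.
STEP_CONFIGS: Dict[str, str] = {
    "fit": "fit_parameters.json",
    "evaluate": "evaluate_parameters.json",
    "attribute": "attribute_parameters.json",
}


def step_commands(step_configs: Dict[str, str]) -> List[List[str]]:
    """Build one `cherimoya <step> -p <config>` command per step."""
    return [["cherimoya", step, "-p", path] for step, path in step_configs.items()]


def run_steps(commands: List[List[str]], dry_run: bool = True) -> None:
    """Run commands in order; check=True aborts on the first non-zero exit."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


commands = step_commands(STEP_CONFIGS)
run_steps(commands)  # dry run: prints the commands without executing them
```

Call run_steps(commands, dry_run=False) to execute the steps once the printed commands look right.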

Batch Processing

To train models on many datasets in parallel, use the batch command:

cherimoya batch -p batch_parameters.json

The batch command will automatically distribute jobs across available GPUs. Use "device": "*" in the JSON to auto-detect all available CUDA devices.

{
    "name": null,
    "device": "*",
    "signals": "/path/to/data/*.bam",
    "sequences": "/path/to/hg38.fa"
}

When signals contains a glob pattern and name is null, names are automatically derived from the signal filenames.
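As a rough picture of what the batch command does with a glob pattern and several GPUs, the Python sketch below derives one name per signal file and assigns jobs to devices in round-robin order. Both rules are assumptions made for illustration: the tutorial does not specify how names are derived or how jobs are scheduled, and derive_names and assign_devices are hypothetical helpers.

```python
from pathlib import Path
from typing import Dict, List


def derive_names(signal_paths: List[str]) -> Dict[str, str]:
    """Map a name (file name without its extension) to each signal path."""
    return {Path(p).stem: p for p in sorted(signal_paths)}


def assign_devices(names: List[str], n_devices: int) -> Dict[str, str]:
    """Assign jobs to CUDA devices in round-robin order (assumed policy)."""
    if n_devices < 1:
        raise ValueError("at least one device is required")
    return {name: f"cuda:{i % n_devices}" for i, name in enumerate(names)}


# In real use the paths would come from glob.glob("/path/to/data/*.bam").
paths = ["/path/to/data/ctcf.bam", "/path/to/data/gata1.bam", "/path/to/data/sp1.bam"]
names = derive_names(paths)
devices = assign_devices(list(names), n_devices=2)
for name, device in devices.items():
    print(name, device, names[name])
```

Note that files sharing the same base name in different directories would collide under this naming rule, so distinct file names are assumed.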