CLI Pipeline Tutorial

This tutorial walks through using the Cherimoya command-line tool to run the full end-to-end pipeline from raw sequencing data to motif analysis.

Prerequisites

Before starting, make sure you have:

- Cherimoya installed (see Installation)
- A reference genome FASTA file (e.g., hg38.fa)
- Signal files (BAM, SAM, BED, or bigWig format)
- (Optional) Control signal files

Overview of CLI Commands

Command	Description
`negatives`	Sample GC-content-matched negative regions
`pipeline-json`	Generate a pipeline configuration JSON file
`fit`	Train a Cherimoya model
`evaluate`	Evaluate a trained model
`attribute`	Calculate sequence attributions
`seqlets`	Identify important subsequences from attributions
`marginalize`	Run marginalization experiments for motifs
`pipeline`	Run the full end-to-end pipeline
`batch`	Run multiple pipelines in parallel

Step 1: Generate a Pipeline Configuration

The pipeline is driven by a JSON configuration file. You can generate one using the `pipeline-json` command:

cherimoya pipeline-json \
    -s /path/to/hg38.fa \
    -i /path/to/signal.bam \
    -n my_ctcf_experiment \
    -o ctcf.pipeline.json

For stranded experiments with controls:

cherimoya pipeline-json \
    -s /path/to/hg38.fa \
    -i /path/to/treatment.bam \
    -c /path/to/control.bam \
    -n ctcf_stranded \
    -o ctcf_stranded.pipeline.json

Tip

Use -u for unstranded data and -f if your input consists of fragment files rather than aligned reads.

Tip

If you are using ATAC-seq or DNase-seq data, you may need to shift the reads. Many packages shift ATAC-seq reads by +4/-5, but the recommended shift here is +4/-4. You can apply the shift with -ps and -ns, but make sure the input data has not already been shifted; otherwise the shift is applied twice.
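To make the shift arithmetic concrete, here is a minimal Python sketch of what a +4/-4 shift does to the 5' end of a read. The helper `shift_cut_site` and its arguments are hypothetical and exist only for illustration; this is not Cherimoya's implementation, and the exact semantics of -ps and -ns should be checked against the tool's own help output.

```python
def shift_cut_site(five_prime_pos, strand, pos_shift=4, neg_shift=-4):
    """Return the shifted 5' position of a read.

    Hypothetical helper: positive-strand reads move by `pos_shift`,
    negative-strand reads move by `neg_shift`.
    """
    if strand == "+":
        return five_prime_pos + pos_shift
    if strand == "-":
        return five_prime_pos + neg_shift
    raise ValueError(f"strand must be '+' or '-', got {strand!r}")


# The +4/-4 shift recommended in the tip above.
plus_shifted = shift_cut_site(1000, "+")
minus_shifted = shift_cut_site(2000, "-")

# The +4/-5 convention used by many other packages, for comparison.
minus_shifted_alt = shift_cut_site(2000, "-", neg_shift=-5)

# Shifting data that was already shifted moves the position twice.
double_shifted = shift_cut_site(plus_shifted, "+")
```

The last line shows why pre-shifted input is a problem: the position ends up 8 bases from the original instead of 4.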

Step 2: Customize the Parameters (Optional)

Open the generated JSON file and adjust parameters as needed. Key parameters include:

{
    "fit_parameters": {
        "n_filters": 64,
        "n_layers": 9,
        "max_epochs": 100,
        "lr": 0.004,
        "batch_size": 512,
        "max_jitter": 500,
        "early_stopping": 15
    }
}
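If you prefer to adjust these values from a script rather than by hand, the following Python sketch shows one way to do it. It assumes that `fit_parameters` is a top-level key in the generated file, as in the snippet above; the helper name `update_fit_parameters` is hypothetical and not part of Cherimoya.

```python
import json
import tempfile
from pathlib import Path


def update_fit_parameters(config_path, **overrides):
    """Load a pipeline JSON, update its fit_parameters, and write it back."""
    path = Path(config_path)
    config = json.loads(path.read_text())
    # setdefault covers a config that has no fit_parameters block yet.
    config.setdefault("fit_parameters", {}).update(overrides)
    path.write_text(json.dumps(config, indent=4) + "\n")
    return config


# Demo on a throwaway file so the sketch runs without a real config.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.pipeline.json"
    demo_path.write_text(
        json.dumps({"fit_parameters": {"max_epochs": 100, "lr": 0.004}})
    )
    updated = update_fit_parameters(demo_path, max_epochs=50, early_stopping=10)
```

On a real run you would pass the path of your generated file, for example `update_fit_parameters("ctcf.pipeline.json", max_epochs=50)`. Keys that are not overridden keep their generated values.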

Note

The pipeline will automatically call peaks with MACS3, convert BAMs to bigWigs, and generate GC-matched negatives if these are not explicitly provided in the JSON.
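For readers unfamiliar with GC matching, the Python sketch below illustrates the general idea: each peak is paired with a background sequence of similar GC content. This is a generic illustration under simple assumptions (fixed-width GC bins, one negative per peak) and not the algorithm used by the `negatives` command; the function names are hypothetical.

```python
import random
from collections import defaultdict


def gc_fraction(seq):
    """Fraction of G/C bases in a sequence (case-insensitive)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def sample_gc_matched(peaks, candidates, bin_width=0.02, seed=0):
    """For each peak sequence, draw one unused candidate from the same GC bin."""
    rng = random.Random(seed)
    bins = defaultdict(list)
    for cand in candidates:
        bins[int(gc_fraction(cand) / bin_width)].append(cand)
    for pool in bins.values():
        rng.shuffle(pool)

    negatives = []
    for peak in peaks:
        pool = bins.get(int(gc_fraction(peak) / bin_width))
        if pool:  # skip peaks with no remaining candidate in their bin
            negatives.append(pool.pop())
    return negatives


# Toy example: "GCAT" (50% GC) matches neither peak and is left unused.
negatives = sample_gc_matched(
    peaks=["GGCC", "ATAT"],
    candidates=["CCGG", "TTAA", "GCAT"],
)
```

Matching on GC content keeps the model from separating peaks from background using base composition alone.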

Step 3: Run the Pipeline

cherimoya pipeline -p ctcf.pipeline.json

The pipeline will execute the following steps in order:

1. Peak calling (if `loci` is not provided): uses MACS3
2. Data conversion (if signal files are BAMs): uses bam2bw
3. Negative sampling (if `negatives` is not provided): GC-matched
4. Model training: trains a Cherimoya model with dual optimizers
5. Attribution calculation: saturation mutagenesis on validation chromosomes
6. Seqlet identification: extracts important subsequences
7. TF-MoDISco: motif discovery and report generation
8. Marginalization: motif effect size estimation
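The conditional steps above can be sketched as a small Python function that reports which stages would run for a given configuration. The key names `loci`, `negatives`, and `signals` are taken from this tutorial, but the layout of the pipeline JSON and the function `planned_steps` are assumptions for illustration; Cherimoya's actual logic may differ.

```python
def planned_steps(config):
    """Return the pipeline stages implied by a configuration dictionary."""
    steps = []
    if not config.get("loci"):
        steps.append("peak calling (MACS3)")

    signals = config.get("signals") or []
    if isinstance(signals, str):  # allow a single path as well as a list
        signals = [signals]
    if any(str(s).lower().endswith(".bam") for s in signals):
        steps.append("data conversion (bam2bw)")

    if not config.get("negatives"):
        steps.append("negative sampling (GC-matched)")

    # The remaining stages always run.
    steps += [
        "model training",
        "attribution calculation",
        "seqlet identification",
        "TF-MoDISco",
        "marginalization",
    ]
    return steps


# Only a BAM is supplied, so all three optional stages are included.
full_run = planned_steps({"signals": ["/path/to/signal.bam"]})

# Peaks, negatives, and bigWigs are supplied, so the optional stages are skipped.
short_run = planned_steps(
    {"loci": "peaks.bed", "negatives": "negatives.bed", "signals": ["signal.bw"]}
)
```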

Running Individual Steps

You can also run steps individually using their respective JSON configuration files.

Training:

cherimoya fit -p fit_parameters.json

Evaluation:

cherimoya evaluate -p evaluate_parameters.json

Attribution:

cherimoya attribute -p attribute_parameters.json
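If you want to chain these individual steps from a script, a Python sketch along the following lines may help. It uses only the three commands and the `-p` flag shown above; the JSON filenames are the placeholders from this tutorial, and the helper names are hypothetical. The run is a dry run by default, so nothing is executed unless you opt in.

```python
import shlex
import subprocess

# Step name and parameter file, in the order they should run.
STEP_CONFIGS = [
    ("fit", "fit_parameters.json"),
    ("evaluate", "evaluate_parameters.json"),
    ("attribute", "attribute_parameters.json"),
]


def build_commands(step_configs=STEP_CONFIGS):
    """Build one argument list per step."""
    return [["cherimoya", step, "-p", config] for step, config in step_configs]


def run_steps(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops the chain at the first failing step.
            subprocess.run(cmd, check=True)


commands = build_commands()
run_steps(commands)  # dry run: prints the commands without executing them
```

Calling `run_steps(commands, dry_run=False)` would execute the steps in order and raise `subprocess.CalledProcessError` if one of them fails.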

Batch Processing

To train models on many datasets in parallel, use the batch command:

cherimoya batch -p batch_parameters.json

The batch command will automatically distribute jobs across available GPUs. Use "device": "*" in the JSON to auto-detect all available CUDA devices.

{
    "name": null,
    "device": "*",
    "signals": "/path/to/data/*.bam",
    "sequences": "/path/to/hg38.fa"
}

When `signals` contains a glob pattern and `name` is `null`, names are automatically derived from the signal filenames.
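To preview which files a glob pattern will pick up and what names they might receive, you could use a Python sketch like the one below. The tutorial does not specify the exact naming rule, so using the filename stem (the filename without its extension) is an assumption, and `derive_names` is a hypothetical helper rather than part of Cherimoya.

```python
import glob
import tempfile
from pathlib import Path


def derive_names(pattern):
    """Map each file matching `pattern` to a name taken from its filename stem."""
    return {Path(p).stem: p for p in sorted(glob.glob(pattern))}


# Demo on a throwaway directory so the sketch runs without real data.
with tempfile.TemporaryDirectory() as tmp:
    for filename in ("ctcf.bam", "gata1.bam", "notes.txt"):
        (Path(tmp) / filename).touch()
    names = derive_names(str(Path(tmp) / "*.bam"))
```

In the demo, only the two `.bam` files match the pattern, and `notes.txt` is ignored. Checking the matches this way before launching a batch can catch a pattern that is too broad or too narrow.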