CLI Pipeline Tutorial
This tutorial walks through using the Cherimoya command-line tool to run the full end-to-end pipeline from raw sequencing data to motif analysis.
Prerequisites
Before starting, make sure you have:
Cherimoya installed (see Installation)
A reference genome FASTA file (e.g., hg38.fa)
Signal files (BAM, SAM, BED, or bigWig format)
(Optional) Control signal files
Overview of CLI Commands
| Command | Description |
|---|---|
| | Sample GC-content-matched negative regions |
| pipeline-json | Generate a pipeline configuration JSON file |
| fit | Train a Cherimoya model |
| evaluate | Evaluate a trained model |
| attribute | Calculate sequence attributions |
| | Identify important subsequences from attributions |
| | Run marginalization experiments for motifs |
| pipeline | Run the full end-to-end pipeline |
| batch | Run multiple pipelines in parallel |
Step 1: Generate a Pipeline Configuration
The pipeline is driven by a JSON configuration file. You can generate one using
the pipeline-json command:
cherimoya pipeline-json \
-s /path/to/hg38.fa \
-i /path/to/signal.bam \
-n my_ctcf_experiment \
-o ctcf.pipeline.json
For stranded experiments with controls:
cherimoya pipeline-json \
-s /path/to/hg38.fa \
-i /path/to/treatment.bam \
-c /path/to/control.bam \
-n ctcf_stranded \
-o ctcf_stranded.pipeline.json
Tip
Use -u for unstranded data and -f if your input consists of
fragment files rather than aligned reads.
Tip
If you are using ATAC-seq or DNase-seq data, you may need to shift the reads. Many packages shift ATAC-seq reads by +4/-5, but the recommended shift here is +4/-4. Use -ps and -ns to shift your data, but make sure the input data has not already been shifted, or the offset will be applied twice.
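To make the shift arithmetic concrete, here is a minimal Python sketch of what a +4/-4 shift does to the 5' end of a read on each strand. The function name and the convention of passing the negative-strand shift as a negative number are assumptions made for illustration. This is not Cherimoya's code, and the exact way -ps and -ns are applied internally should be checked against the tool's own help output.

```python
def shift_five_prime(position: int, strand: str,
                     pos_shift: int = 4, neg_shift: int = -4) -> int:
    """Return the shifted 5' end coordinate of a read.

    Hypothetical helper: positive-strand reads move by `pos_shift`,
    negative-strand reads move by `neg_shift`.
    """
    if strand == "+":
        return position + pos_shift
    if strand == "-":
        return position + neg_shift
    raise ValueError(f"strand must be '+' or '-', got {strand!r}")


# The +4/-4 shift recommended in the tip.
print(shift_five_prime(1000, "+"))  # 1004
print(shift_five_prime(2000, "-"))  # 1996

# The +4/-5 convention used by many other packages differs by one base
# on the negative strand.
print(shift_five_prime(2000, "-", neg_shift=-5))  # 1995

# Shifting data that was already shifted moves it twice as far.
print(shift_five_prime(shift_five_prime(1000, "+"), "+"))  # 1008
```

The last line shows why pre-shifted input is a problem: the offsets accumulate, so the cut site ends up 8 bases from the original position instead of 4.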
Step 2: Customize the Parameters (Optional)
Open the generated JSON file and adjust parameters as needed. Key parameters include:
{
    "fit_parameters": {
        "n_filters": 64,
        "n_layers": 9,
        "max_epochs": 100,
        "lr": 0.004,
        "batch_size": 512,
        "max_jitter": 500,
        "early_stopping": 15
    }
}
Note
The pipeline will automatically call peaks with MACS3, convert BAMs to bigWigs, and generate GC-matched negatives if these are not explicitly provided in the JSON.
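Rather than editing the file by hand, the same adjustment can be scripted, which helps when several configurations need identical changes. The sketch below uses only the Python standard library. The helper name update_fit_parameters is hypothetical, and the only assumption about the file layout is the top-level "fit_parameters" object shown in this step.

```python
import json
import tempfile
from pathlib import Path


def update_fit_parameters(config_path, **overrides):
    """Merge `overrides` into the "fit_parameters" object and save the file."""
    path = Path(config_path)
    config = json.loads(path.read_text())
    # Create the section if the file does not have one yet.
    fit = config.setdefault("fit_parameters", {})
    fit.update(overrides)
    path.write_text(json.dumps(config, indent=4) + "\n")
    return config


# Demonstration on a throwaway file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.pipeline.json"
    demo.write_text(json.dumps({"fit_parameters": {"max_epochs": 100, "lr": 0.004}}))
    updated = update_fit_parameters(demo, max_epochs=50, early_stopping=10)
    print(updated["fit_parameters"])
```

In practice you would point the helper at the file produced in Step 1, for example update_fit_parameters("ctcf.pipeline.json", max_epochs=50). All other keys in the file are left untouched.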
Step 3: Run the Pipeline
cherimoya pipeline -p ctcf.pipeline.json
The pipeline will execute the following steps in order:
1. Peak calling (if loci is not provided): uses MACS3
2. Data conversion (if signal files are BAMs): uses bam2bw
3. Negative sampling (if negatives is not provided): GC-matched
4. Model training: trains a Cherimoya model with dual optimizers
5. Attribution calculation: saturation mutagenesis on validation chroms
6. Seqlet identification: extract important subsequences
7. TF-MoDISco: motif discovery and report generation
8. Marginalization: motif effect size estimation
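The conditional steps in this list can be sketched as a small planning function. This illustrates the logic described above and is not Cherimoya's implementation. The keys loci, negatives, and signals come from this tutorial, but the shape of their values (a path string or a list of paths) and the check on the .bam extension are assumptions.

```python
def planned_steps(config: dict) -> list:
    """Return the pipeline steps that would run for a given configuration."""
    steps = []

    # Peaks are only called when no loci are supplied.
    if not config.get("loci"):
        steps.append("peak calling")

    # Assumption: BAM inputs are recognized by their file extension.
    signals = config.get("signals") or []
    if isinstance(signals, str):
        signals = [signals]
    if any(str(s).lower().endswith(".bam") for s in signals):
        steps.append("data conversion")

    # GC-matched negatives are only sampled when none are supplied.
    if not config.get("negatives"):
        steps.append("negative sampling")

    # The remaining steps always run, in this order.
    steps += [
        "model training",
        "attribution calculation",
        "seqlet identification",
        "TF-MoDISco",
        "marginalization",
    ]
    return steps


print(planned_steps({"signals": ["treatment.bam"]}))
print(planned_steps({"loci": "peaks.bed", "negatives": "neg.bed",
                     "signals": ["treatment.bw"]}))
```

With only a BAM file supplied, all eight steps are planned. With loci, negatives, and a bigWig supplied, the plan starts directly at model training.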
Running Individual Steps
You can also run steps individually using their respective JSON configuration files.
Training:
cherimoya fit -p fit_parameters.json
Evaluation:
cherimoya evaluate -p evaluate_parameters.json
Attribution:
cherimoya attribute -p attribute_parameters.json
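If you run the individual steps from a script, a thin wrapper around subprocess keeps them in order and stops at the first failure. The sketch below is one possible arrangement and only uses the commands and file names shown above. The helper names are hypothetical, and it assumes the cherimoya executable is on your PATH. It defaults to a dry run that prints the commands without executing them.

```python
import shlex
import subprocess

# Step name and parameter file, in the order used in this tutorial.
STEP_CONFIGS = [
    ("fit", "fit_parameters.json"),
    ("evaluate", "evaluate_parameters.json"),
    ("attribute", "attribute_parameters.json"),
]


def build_commands(steps=STEP_CONFIGS, executable="cherimoya"):
    """Return one argument list per step, e.g. ['cherimoya', 'fit', '-p', ...]."""
    return [[executable, step, "-p", config] for step, config in steps]


def run_commands(commands, dry_run=True):
    """Print each command and, unless dry_run is set, execute it."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError if a step fails,
            # so later steps do not run on incomplete outputs.
            subprocess.run(cmd, check=True)


run_commands(build_commands())  # dry run: prints the three commands
```

Call run_commands(build_commands(), dry_run=False) to actually execute the steps.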
Batch Processing
To train models on many datasets in parallel, use the batch command:
cherimoya batch -p batch_parameters.json
The batch command will automatically distribute jobs across available GPUs.
Use "device": "*" in the JSON to auto-detect all available CUDA devices.
{
    "name": null,
    "device": "*",
    "signals": "/path/to/data/*.bam",
    "sequences": "/path/to/hg38.fa"
}
When signals contains a glob pattern and name is null, names are
automatically derived from the signal filenames.
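To preview which datasets a glob pattern will pick up before launching a batch, you can expand it yourself. The sketch below mimics the behavior described above under one loudly stated assumption: the derived name is the signal filename without its extension. The exact rule Cherimoya uses may differ, and expand_batch is a hypothetical helper, not part of the tool.

```python
import glob
import os
import tempfile


def expand_batch(batch: dict) -> list:
    """Expand a batch config with a glob in "signals" into one job per file."""
    if batch.get("name") is not None:
        raise ValueError("this sketch only covers the case where name is null")

    jobs = []
    for path in sorted(glob.glob(batch["signals"])):
        # Assumption: name = filename without directory or extension.
        name = os.path.splitext(os.path.basename(path))[0]
        # Copy every other field (device, sequences, ...) unchanged.
        jobs.append({**batch, "name": name, "signals": path})
    return jobs


# Demonstration with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for stem in ("gata1", "ctcf"):
        open(os.path.join(tmp, stem + ".bam"), "w").close()

    batch = {
        "name": None,
        "device": "*",
        "signals": os.path.join(tmp, "*.bam"),
        "sequences": "/path/to/hg38.fa",
    }
    for job in expand_batch(batch):
        print(job["name"], job["signals"])
```

Checking the expanded list first is a cheap way to catch a pattern that matches nothing or that matches files you did not intend to train on.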