CLI Pipeline Tutorial ===================== This tutorial walks through using the Cherimoya command-line tool to run the full end-to-end pipeline from raw sequencing data to motif analysis. Prerequisites ------------- Before starting, make sure you have: - Cherimoya installed (see :doc:`../installation`) - A reference genome FASTA file (e.g., ``hg38.fa``) - Signal files (BAM, SAM, BED, or bigWig format) - (Optional) Control signal files Overview of CLI Commands ------------------------ .. list-table:: :header-rows: 1 :widths: 20 80 * - Command - Description * - ``negatives`` - Sample GC-content-matched negative regions * - ``pipeline-json`` - Generate a pipeline configuration JSON file * - ``fit`` - Train a Cherimoya model * - ``evaluate`` - Evaluate a trained model * - ``attribute`` - Calculate sequence attributions * - ``seqlets`` - Identify important subsequences from attributions * - ``marginalize`` - Run marginalization experiments for motifs * - ``pipeline`` - Run the full end-to-end pipeline * - ``batch`` - Run multiple pipelines in parallel Step 1: Generate a Pipeline Configuration ------------------------------------------ The pipeline is driven by a JSON configuration file. You can generate one using the ``pipeline-json`` command: .. code-block:: bash cherimoya pipeline-json \ -s /path/to/hg38.fa \ -i /path/to/signal.bam \ -n my_ctcf_experiment \ -o ctcf.pipeline.json For stranded experiments with controls: .. code-block:: bash cherimoya pipeline-json \ -s /path/to/hg38.fa \ -i /path/to/treatment.bam \ -c /path/to/control.bam \ -n ctcf_stranded \ -o ctcf_stranded.pipeline.json .. tip:: Use ``-u`` for unstranded data and ``-f`` if your input consists of fragment files rather than aligned reads. .. tip:: If using ATAC-seq or DNase-seq data you may need to shift the reads. Many packages use +4/-5 for ATAC-seq shifting, but the recommended shift here is actually +4/-4. You can use -ps and -ns to shift your data but make sure the original data is not shifted! Step 2: Customize the Parameters (Optional) -------------------------------------------- Open the generated JSON file and adjust parameters as needed. Key parameters include: .. code-block:: json { "fit_parameters": { "n_filters": 64, "n_layers": 9, "max_epochs": 100, "lr": 0.004, "batch_size": 512, "max_jitter": 500, "early_stopping": 15 } } .. note:: The pipeline will automatically call peaks with MACS3, convert BAMs to bigWigs, and generate GC-matched negatives if these are not explicitly provided in the JSON. Step 3: Run the Pipeline ------------------------ .. code-block:: bash cherimoya pipeline -p ctcf.pipeline.json The pipeline will execute the following steps in order: 1. **Peak calling** (if ``loci`` is not provided) — uses MACS3 2. **Data conversion** (if signal files are BAMs) — uses bam2bw 3. **Negative sampling** (if ``negatives`` is not provided) — GC-matched 4. **Model training** — trains a Cherimoya model with dual optimizers 5. **Attribution calculation** — saturation mutagenesis on validation chroms 6. **Seqlet identification** — extract important subsequences 7. **TF-MoDISco** — motif discovery and report generation 8. **Marginalization** — motif effect size estimation Running Individual Steps ------------------------ You can also run steps individually using their respective JSON configuration files. **Training:** .. code-block:: bash cherimoya fit -p fit_parameters.json **Evaluation:** .. code-block:: bash cherimoya evaluate -p evaluate_parameters.json **Attribution:** .. code-block:: bash cherimoya attribute -p attribute_parameters.json Batch Processing ---------------- To train models on many datasets in parallel, use the ``batch`` command: .. code-block:: bash cherimoya batch -p batch_parameters.json The batch command will automatically distribute jobs across available GPUs. Use ``"device": "*"`` in the JSON to auto-detect all available CUDA devices. .. code-block:: json { "name": null, "device": "*", "signals": "/path/to/data/*.bam", "sequences": "/path/to/hg38.fa" } When ``signals`` contains a glob pattern and ``name`` is null, names are automatically derived from the signal filenames.