Recipe: Differential / Conditional Analysis =========================================== This recipe covers comparing two (or more) conditions — treated vs. control, knockout vs. wildtype, time point A vs. B. The standard pattern in Cherimoya is to train one model per condition, then compare their predictions on a shared set of loci. This is simpler to set up than a single multi-output model and gives cleaner attribution and marginalization results. If you have only a single condition, use the assay-specific recipe (:doc:`chipseq_tf`, :doc:`atacseq`, or :doc:`dnaseq`) instead. Inputs ------ * Reference genome FASTA. * Per-condition signal BAMs (with replicates pooled or treated as separate ``-i`` files). * Per-condition control BAMs (for ChIP-seq) or none (for ATAC/DNase). * A motif database in MEME format. Step 1: train one model per condition -------------------------------------- Generate one pipeline JSON per condition and run them sequentially (or in parallel via the ``batch`` subcommand; see :doc:`../cli`): .. code-block:: bash cherimoya pipeline-json \ -s hg38.fa \ -i condA_rep1.bam -i condA_rep2.bam \ -c condA_input.bam \ -m JASPAR_2024.meme -n condA -o condA.pipeline.json cherimoya pipeline-json \ -s hg38.fa \ -i condB_rep1.bam -i condB_rep2.bam \ -c condB_input.bam \ -m JASPAR_2024.meme -n condB -o condB.pipeline.json cherimoya pipeline -p condA.pipeline.json cherimoya pipeline -p condB.pipeline.json Each run produces a model checkpoint (``condA.torch`` / ``condB.torch``), per-track bigWigs, attributions, seqlets, and a TF-MoDISco report scoped to that condition's peaks. Use the same ``training_chroms`` / ``validation_chroms`` in both JSONs so the held-out evaluation is comparable. Step 2: build a shared locus set -------------------------------- Decide what regions to compare across. The natural choices, in increasing order of conservatism: * **Union** of condition-A and condition-B peaks — catches gains and losses but includes weakly-supported regions. * **Intersection** — only regions called as peaks in both conditions; conservative but misses pure gains/losses. * **Union over a reference annotation** (e.g. promoters, GENCODE-defined TSSes) — gives a biologically interpretable set that doesn't depend on the peak calls. A typical union with ``bedtools``: .. code-block:: bash cat condA_peaks.narrowPeak condB_peaks.narrowPeak | \ sort -k1,1 -k2,2n | \ bedtools merge -i - > shared_loci.bed Step 3: predict from both models on the shared set -------------------------------------------------- .. code-block:: python from cherimoya import Cherimoya from tangermeme.io import extract_loci from tangermeme.predict import predict model_A = Cherimoya.load("condA.torch", device="cuda") model_B = Cherimoya.load("condB.torch", device="cuda") model_A.eval(); model_B.eval() X, _ = extract_loci( sequences="hg38.fa", loci="shared_loci.bed", chroms=["chr8", "chr20"], # the held-out chromosomes used during training in_window=2114, out_window=1000, max_jitter=0, ignore=list('QWERYUIOPSDFHJKLZXVBNM'), return_mask=True, ) y_prof_A, y_counts_A = predict(model_A, X, batch_size=64, device="cuda") y_prof_B, y_counts_B = predict(model_B, X, batch_size=64, device="cuda") # Differences in log counts (predicted log fold change). delta_log_counts = y_counts_B - y_counts_A Loci with large positive ``delta_log_counts`` are predicted to gain signal in condition B; large negative values are predicted losses. Step 4: identify differential motifs ------------------------------------ For motif-level differences, run marginalization on both models and compare. Each model's pipeline already runs marginalization on its own negative loci; to compare directly, run marginalization on a **shared** background: .. code-block:: bash cherimoya marginalize -p marginalize_A.json cherimoya marginalize -p marginalize_B.json with the two JSONs differing only in ``model`` and ``output_filename``, but identical in ``loci`` (the shared background) and ``motifs``. The delta in per-motif marginalization scores between A and B is a direct estimate of which motifs cause the condition-specific signal. Caveats ------- * **Held-out chromosomes must match.** A model trained with ``chr8``/``chr20`` as validation cannot be compared with one trained with ``chr1``/``chr12`` as validation on common loci — some of those loci were in the second model's training set. Use the same split in both JSONs. * **Replicate-to-replicate noise is the floor.** Before treating ``delta_log_counts`` as biological signal, compare to the technical-replicate baseline by training two models on independent replicates of the same condition and computing the same delta. Biological deltas should be larger than the replicate baseline. * **The two models share no parameters.** Each model is fit independently and the comparison happens only at prediction time. Multi-output single-model training is possible by passing ``-i condA.bw -i condB.bw`` to one pipeline (Cherimoya treats each signal as a separate output track), but in practice per-condition models give cleaner attribution and motif results.