Attribution & Motif Analysis

This tutorial covers how to calculate sequence attributions, identify important subsequences (seqlets), and discover motifs using TF-MoDISco.

What Are Attributions?

Attributions quantify how much each position in the input sequence contributes to the model’s prediction. Cherimoya computes these importance scores with saturation mutagenesis: it measures how the prediction changes under every possible single-nucleotide substitution.
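
As a rough illustration of the idea (not Cherimoya's or tangermeme's implementation), saturation mutagenesis can be sketched in plain Python. The `toy_model` scoring function, the `GATA` motif, and the averaging used to summarize each position are all made up for this example; a real model would operate on one-hot encoded tensors.

```python
ALPHABET = "ACGT"


def toy_model(seq):
    """Hypothetical stand-in for a trained model: counts 'GATA' occurrences."""
    return float(sum(seq[i:i + 4] == "GATA" for i in range(len(seq) - 3)))


def saturation_mutagenesis_sketch(model, seq):
    """Return {position: {base: score change}} for every single substitution."""
    reference = model(seq)
    effects = {}
    for pos in range(len(seq)):
        effects[pos] = {}
        for base in ALPHABET:
            mutant = seq[:pos] + base + seq[pos + 1:]
            # The observed base leaves the sequence unchanged, so its delta is 0.
            effects[pos][base] = model(mutant) - reference
    return effects


seq = "CCGATACC"
effects = saturation_mutagenesis_sketch(toy_model, seq)

# Per-position importance: how much the prediction drops, on average,
# when the observed base is replaced by one of the other three.
importance = [
    -sum(delta for base, delta in effects[pos].items() if base != seq[pos]) / 3
    for pos in range(len(seq))
]
```

In this toy case only the four positions covered by the motif receive a non-zero importance, because any substitution there removes the single `GATA` match.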

These scores can be used to:

Identify transcription factor binding sites
Discover de novo motifs
Understand regulatory grammar

Calculating Attributions via CLI

cherimoya attribute -p attribute_params.json

Example attribute_params.json:

{
    "model": "my_model.torch",
    "sequences": "/path/to/hg38.fa",
    "loci": "peaks.narrowPeak",
    "chroms": ["chr2", "chr4", "chr5"],
    "output": "counts",
    "batch_size": 512,
    "device": "cuda",
    "ohe_filename": "attributions.ohe.npz",
    "attr_filename": "attributions.attr.npz",
    "idx_filename": "attributions.idx.npy"
}

The output parameter controls what the attributions are calculated with respect to:

"counts" — attribute to total predicted counts (recommended for most analyses)
"profile" — attribute to the predicted profile shape

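If attributions are needed for both targets, the parameter files can be generated programmatically instead of edited by hand. The following is a minimal sketch that reuses only the keys from the example attribute_params.json; the `write_attribute_params` helper, the `attribute_<target>.json` file names, and the per-target naming of the result files are illustrative choices, not a Cherimoya convention.

```python
import json
import tempfile
from pathlib import Path

# Settings shared by both runs, copied from the example parameter file.
base_params = {
    "model": "my_model.torch",
    "sequences": "/path/to/hg38.fa",
    "loci": "peaks.narrowPeak",
    "chroms": ["chr2", "chr4", "chr5"],
    "batch_size": 512,
    "device": "cuda",
}


def write_attribute_params(output, directory):
    """Write one parameter file for the given attribution target."""
    if output not in ("counts", "profile"):
        raise ValueError(f"unknown output target: {output!r}")
    params = dict(base_params, output=output)
    # Name result files after the target so the two runs do not overwrite
    # each other (an illustrative naming scheme).
    params["ohe_filename"] = f"{output}.ohe.npz"
    params["attr_filename"] = f"{output}.attr.npz"
    params["idx_filename"] = f"{output}.idx.npy"
    path = Path(directory) / f"attribute_{output}.json"
    path.write_text(json.dumps(params, indent=4))
    return path


out_dir = tempfile.mkdtemp()  # use a project directory in practice
paths = [write_attribute_params(t, out_dir) for t in ("counts", "profile")]
```

Each generated file can then be passed to `cherimoya attribute -p <file>` as shown above.
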
Calculating Attributions via Python

import torch
from tangermeme.saturation_mutagenesis import saturation_mutagenesis
from bpnetlite.bpnet import ControlWrapper, CountWrapper

from cherimoya import Cherimoya

# Load model
model = Cherimoya.load("my_model.torch", device="cuda")

# Wrap for count-based attribution
if model.n_control_tracks > 0:
    model = ControlWrapper(model)
wrapper = CountWrapper(model)

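# X_sequences is assumed to be a one-hot encoded tensor of shape
# (n_sequences, 4, length) holding the sequences to explain.
# Calculate attributions (hypothetical importance scores) over the
# central 400 bp window of each sequence.
mid = X_sequences.shape[-1] // 2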
X_attr = saturation_mutagenesis(
    wrapper, X_sequences,
    batch_size=512,
    device='cuda',
    hypothetical=True,
    start=mid - 200,
    end=mid + 200,
)

Identifying Seqlets

Seqlets are contiguous subsequences with high attribution scores that likely correspond to functional elements like transcription factor binding motifs.

Via CLI:

cherimoya seqlets -p seqlet_params.json

Via Python:

from tangermeme.seqlet import recursive_seqlets

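# Project the hypothetical scores onto the observed bases. X_ohe is assumed
# to be the one-hot encoding of the same sequences, matching X_attr in shape,
# so the product keeps only the score of the base present at each position.
importance = (X_attr * X_ohe).sum(dim=1)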

seqlets = recursive_seqlets(
    importance,
    threshold=0.01,
    min_seqlet_len=4,
    max_seqlet_len=25,
    additional_flanks=3,
)
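
To make the two steps concrete, here is a simplified pure-Python sketch: it projects hypothetical scores onto the observed bases, then groups consecutive high-scoring positions into spans. The thresholding is a deliberately naive stand-in and is not the algorithm behind `recursive_seqlets`; the function names `project` and `naive_seqlets` and the toy values are made up for illustration.

```python
def project(hyp_scores, one_hot):
    """Keep, at each position, the score of the base that is actually present.

    Both inputs are 4 x length nested lists (rows = A, C, G, T).
    """
    length = len(one_hot[0])
    return [
        sum(hyp_scores[b][i] * one_hot[b][i] for b in range(4))
        for i in range(length)
    ]


def naive_seqlets(importance, threshold, min_len=2):
    """Return (start, end) half-open spans where |score| stays >= threshold.

    Assumes threshold > 0 so that the trailing sentinel closes the last run.
    """
    spans, start = [], None
    for i, score in enumerate(importance + [0.0]):
        if abs(score) >= threshold and start is None:
            start = i
        elif abs(score) < threshold and start is not None:
            if i - start >= min_len:
                spans.append((start, i))
            start = None
    return spans


# Toy example: the sequence "ACGTAC" with high scores over positions 1-3.
one_hot = [
    [1, 0, 0, 0, 1, 0],  # A
    [0, 1, 0, 0, 0, 1],  # C
    [0, 0, 1, 0, 0, 0],  # G
    [0, 0, 0, 1, 0, 0],  # T
]
hyp_scores = [
    [0.00, 0.1, 0.2, 0.0, 0.01, 0.3],
    [0.20, 0.5, 0.1, 0.1, 0.20, 0.0],
    [0.10, 0.0, 0.6, 0.2, 0.30, 0.2],
    [0.30, 0.2, 0.0, 0.4, 0.10, 0.1],
]
importance = project(hyp_scores, one_hot)
seqlets = naive_seqlets(importance, threshold=0.25)
```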

TF-MoDISco Motif Discovery

TF-MoDISco clusters similar seqlets into motif patterns. It runs automatically as part of the pipeline, but it can also be run manually:

modisco motifs \
    -s attributions.ohe.npz \
    -a attributions.attr.npz \
    -n 100000 \
    -o modisco_results.h5

modisco report \
    -i modisco_results.h5 \
    -o modisco_report/ \
    -s ./

Marginalization Experiments

Marginalization experiments measure the causal effect of motif instances by inserting them into neutral backgrounds:

cherimoya marginalize -p marginalize_params.json

This produces a report showing how each motif affects the predicted profile and counts when inserted into negative (non-peak) sequences.
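
Conceptually, a marginalization compares predictions on background sequences before and after a motif is substituted into their center. The sketch below illustrates this with a made-up scoring function and randomly generated backgrounds; it is not how `cherimoya marginalize` is implemented, and a real experiment would use the trained model and genuine non-peak sequences.

```python
import random


def toy_model(seq):
    """Hypothetical stand-in for a trained model: counts 'GATA' occurrences."""
    return float(sum(seq[i:i + 4] == "GATA" for i in range(len(seq) - 3)))


def insert_motif(background, motif):
    """Overwrite the center of the background with the motif (length preserved)."""
    start = (len(background) - len(motif)) // 2
    return background[:start] + motif + background[start + len(motif):]


def marginal_effect(model, backgrounds, motif):
    """Average change in prediction caused by inserting the motif."""
    deltas = [model(insert_motif(bg, motif)) - model(bg) for bg in backgrounds]
    return sum(deltas) / len(deltas)


rng = random.Random(0)
# Backgrounds drawn from C/T only, so they cannot contain 'GATA' by chance.
backgrounds = ["".join(rng.choice("CT") for _ in range(40)) for _ in range(100)]

effect_gata = marginal_effect(toy_model, backgrounds, "GATA")
effect_ctrl = marginal_effect(toy_model, backgrounds, "CCCC")
```

Averaging over many backgrounds is what separates the effect of the motif itself from the effect of any particular flanking sequence.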

Note

Marginalization requires a motif database file in MEME format, which most motif databases provide. If you need a database, we recommend JASPAR2026.
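
For orientation, a MEME-format file lists each motif as a `MOTIF` line followed by a letter-probability matrix with one row per position. The parser below is a minimal sketch for inspecting such a file, assuming the common `w= <width>` spacing in the matrix header and a DNA alphabet in A, C, G, T order; the motif name `EXAMPLE_1` and its matrix are invented, and this is not the reader Cherimoya uses.

```python
EXAMPLE_MEME = """MEME version 4

ALPHABET= ACGT

MOTIF EXAMPLE_1
letter-probability matrix: alength= 4 w= 4 nsites= 20 E= 0
 0.0 0.0 1.0 0.0
 1.0 0.0 0.0 0.0
 0.0 0.0 0.0 1.0
 1.0 0.0 0.0 0.0
"""


def parse_meme(text):
    """Parse minimal MEME text into {motif_id: list of [A, C, G, T] rows}."""
    motifs, current, rows_left = {}, None, 0
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "MOTIF":
            current = fields[1]
            motifs[current] = []
        elif fields[0] == "letter-probability" and current is not None:
            # The header carries the motif width as "w= <int>".
            rows_left = int(fields[fields.index("w=") + 1])
        elif rows_left > 0:
            motifs[current].append([float(x) for x in fields])
            rows_left -= 1
    return motifs


def consensus(matrix):
    """Most probable base at each position."""
    return "".join("ACGT"[row.index(max(row))] for row in matrix)


motifs = parse_meme(EXAMPLE_MEME)
example_consensus = consensus(motifs["EXAMPLE_1"])
```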