Attribution & Motif Analysis
This tutorial covers how to calculate sequence attributions, identify important subsequences (seqlets), and discover motifs using TF-MoDISco.
What Are Attributions?
Attributions quantify how much each base pair in the input sequence contributes to the model’s prediction. Cherimoya computes importance scores with saturation mutagenesis, which measures the effect of every possible single-nucleotide mutation on the prediction.
These scores can be used to:
Identify transcription factor binding sites
Discover de novo motifs
Understand regulatory grammar
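To make the idea concrete, here is a minimal, dependency-free sketch of saturation mutagenesis on a single sequence. The `toy_model` function is a hypothetical stand-in for a trained model (it only counts occurrences of GATA), and `saturation_mutagenesis_toy` is an illustrative name, not part of Cherimoya or tangermeme, whose batched implementation may score and normalize mutations differently.

```python
ALPHABET = "ACGT"


def toy_model(seq):
    """Hypothetical stand-in for a trained model: counts 'GATA' occurrences."""
    return float(sum(seq[i:i + 4] == "GATA" for i in range(len(seq) - 3)))


def saturation_mutagenesis_toy(model, seq):
    """Return a len(seq) x 4 table of prediction changes, one per substitution."""
    reference = model(seq)
    effects = []
    for pos in range(len(seq)):
        row = []
        for base in ALPHABET:
            mutated = seq[:pos] + base + seq[pos + 1:]
            # The observed base reproduces the reference, so its entry is 0.
            row.append(model(mutated) - reference)
        effects.append(row)
    return effects


seq = "CCGATACC"
effects = saturation_mutagenesis_toy(toy_model, seq)

# Per-position importance: the mean drop in prediction across the three
# possible substitutions at that position.
importance = [sum(-e for e in row) / (len(ALPHABET) - 1) for row in effects]
print(importance)
```

In this toy case only the four positions covered by the GATA motif receive a nonzero score, because any substitution there removes the only motif match.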
Calculating Attributions via CLI
cherimoya attribute -p attribute_params.json
Example attribute_params.json:
{
    "model": "my_model.torch",
    "sequences": "/path/to/hg38.fa",
    "loci": "peaks.narrowPeak",
    "chroms": ["chr2", "chr4", "chr5"],
    "output": "counts",
    "batch_size": 512,
    "device": "cuda",
    "ohe_filename": "attributions.ohe.npz",
    "attr_filename": "attributions.attr.npz",
    "idx_filename": "attributions.idx.npy"
}
The output parameter controls what the attributions are calculated with respect to:
"counts": attribute to the total predicted counts (recommended for most analyses)
"profile": attribute to the predicted profile shape
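If you prefer to generate the parameter file from a script, the sketch below builds the same dictionary and serializes it. The helper name `make_attribute_params` is hypothetical, the keys and default values are copied from the example above, and the check on `output` only reflects the two values documented here.

```python
import json

# The two values documented for the "output" parameter.
VALID_OUTPUTS = {"counts", "profile"}


def make_attribute_params(output="counts", **overrides):
    """Build a parameter dict mirroring the example attribute_params.json."""
    if output not in VALID_OUTPUTS:
        raise ValueError(
            f"output must be one of {sorted(VALID_OUTPUTS)}, got {output!r}"
        )
    params = {
        "model": "my_model.torch",
        "sequences": "/path/to/hg38.fa",
        "loci": "peaks.narrowPeak",
        "chroms": ["chr2", "chr4", "chr5"],
        "output": output,
        "batch_size": 512,
        "device": "cuda",
        "ohe_filename": "attributions.ohe.npz",
        "attr_filename": "attributions.attr.npz",
        "idx_filename": "attributions.idx.npy",
    }
    params.update(overrides)  # caller-supplied values replace the defaults
    return params


params = make_attribute_params(output="profile", batch_size=128)
params_json = json.dumps(params, indent=4)
print(params_json)
```

Writing `params_json` to attribute_params.json gives a file that can be passed to the CLI with the `-p` flag.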
Calculating Attributions via Python
import torch
from tangermeme.saturation_mutagenesis import saturation_mutagenesis
from bpnetlite.bpnet import ControlWrapper, CountWrapper
from cherimoya import Cherimoya
# Load model
model = Cherimoya.load("my_model.torch", device="cuda")
# Wrap for count-based attribution
if model.n_control_tracks > 0:
    model = ControlWrapper(model)
wrapper = CountWrapper(model)
# X_sequences is assumed to be a one-hot encoded tensor of shape
# (n_sequences, 4, length) that has already been loaded.
# Calculate attributions (hypothetical importance scores) over the
# central 400 bp window of each sequence.
mid = X_sequences.shape[-1] // 2
X_attr = saturation_mutagenesis(
    wrapper, X_sequences,
    batch_size=512,
    device='cuda',
    hypothetical=True,
    start=mid - 200,
    end=mid + 200,
)
Identifying Seqlets
Seqlets are contiguous subsequences with high attribution scores that likely correspond to functional elements like transcription factor binding motifs.
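As a rough illustration of what "contiguous subsequences with high attribution scores" means, the sketch below extracts runs of positions whose absolute score exceeds a fixed cutoff. This is a deliberately simple heuristic with a hypothetical name (`call_seqlets_toy`); it is not the recursive procedure implemented by tangermeme's `recursive_seqlets`, and its `threshold` is a raw score cutoff chosen for the example.

```python
def call_seqlets_toy(scores, threshold, min_len=4):
    """Return half-open (start, end) spans where every |score| > threshold.

    Assumes threshold >= 0 so that the trailing sentinel closes any open run.
    """
    spans = []
    start = None
    for i, score in enumerate(list(scores) + [0.0]):  # sentinel ends a final run
        above = abs(score) > threshold
        if above and start is None:
            start = i  # a new run begins here
        elif not above and start is not None:
            if i - start >= min_len:  # keep only runs that are long enough
                spans.append((start, i))
            start = None
    return spans


scores = [0.0, 0.01, 0.9, 1.1, 0.8, 1.0, 0.02, 0.0, 0.7, 0.0]
print(call_seqlets_toy(scores, threshold=0.5))
```

The isolated high score at index 8 is discarded because the run is shorter than `min_len`, which mirrors the role that a minimum seqlet length plays in real seqlet callers.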
Via CLI:
cherimoya seqlets -p seqlet_params.json
Via Python:
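from tangermeme.seqlet import recursive_seqlets
# Project the hypothetical attributions onto the observed bases to get one
# score per position. X_ohe is the one-hot encoding of the attributed
# sequences and must have the same shape as X_attr.
importance = (X_attr * X_ohe).sum(dim=1)
seqlets = recursive_seqlets(
    importance,
    threshold=0.01,
    min_seqlet_len=4,
    max_seqlet_len=25,
    additional_flanks=3,
)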
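The projection step in the snippet above, multiplying hypothetical attributions by the one-hot encoding and summing over the base axis, can be sketched without any dependencies. The attribution values below are made up for illustration, and `one_hot` and `project` are hypothetical helper names.

```python
ALPHABET = "ACGT"


def one_hot(seq):
    """Return a 4 x len(seq) nested list with 1.0 at each observed base."""
    return [[1.0 if base == letter else 0.0 for base in seq] for letter in ALPHABET]


def project(hyp_attr, ohe):
    """Sum hyp_attr * ohe over the base axis, leaving one score per position."""
    length = len(ohe[0])
    return [
        sum(hyp_attr[b][p] * ohe[b][p] for b in range(len(ALPHABET)))
        for p in range(length)
    ]


seq = "GAT"
# Made-up hypothetical attributions: one row per base, one column per position.
hyp_attr = [
    [0.1, 0.9, -0.2],   # A
    [-0.3, -0.1, 0.0],  # C
    [0.8, -0.4, -0.1],  # G
    [0.0, 0.2, 0.7],    # T
]
importance = project(hyp_attr, one_hot(seq))
print(importance)
```

Each position keeps only the score of the base that is actually present, so the result is a single track of scores per sequence, which is the input that seqlet calling works on.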
TF-MoDISco Motif Discovery
TF-MoDISco groups similar seqlets into motif patterns. This step runs automatically as part of the pipeline, but it can also be run manually:
modisco motifs \
    -s attributions.ohe.npz \
    -a attributions.attr.npz \
    -n 100000 \
    -o modisco_results.h5
modisco report \
    -i modisco_results.h5 \
    -o modisco_report/ \
    -s ./
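If you want to drive these two commands from a Python script, one possible sketch is shown below. The arguments are copied verbatim from the commands above; the `run_modisco` helper and its `dry_run` switch are hypothetical conveniences, not part of Cherimoya or TF-MoDISco.

```python
import shlex
import subprocess

# Argument lists copied from the shell commands above.
motifs_cmd = [
    "modisco", "motifs",
    "-s", "attributions.ohe.npz",
    "-a", "attributions.attr.npz",
    "-n", "100000",
    "-o", "modisco_results.h5",
]
report_cmd = [
    "modisco", "report",
    "-i", "modisco_results.h5",
    "-o", "modisco_report/",
    "-s", "./",
]


def run_modisco(commands, dry_run=True):
    """Print each command and, unless dry_run is set, execute it in order."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError if the command fails,
            # so the report step is skipped when motif discovery fails.
            subprocess.run(cmd, check=True)


run_modisco([motifs_cmd, report_cmd], dry_run=True)
```

Calling `run_modisco(..., dry_run=False)` executes the commands, which requires the `modisco` executable to be on your PATH and the attribution files to exist.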
Marginalization Experiments
Marginalization experiments measure the causal effect of motif instances by inserting them into neutral backgrounds:
cherimoya marginalize -p marginalize_params.json
This produces a report showing how each motif affects the predicted profile and counts when inserted into negative (non-peak) sequences.
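Conceptually, a marginalization experiment compares predictions on background sequences before and after a motif is placed in them, then averages the difference. The sketch below shows that logic on toy data; `toy_model` is a hypothetical stand-in for a trained model, the backgrounds are hand-written rather than real non-peak sequences, and the helper names are illustrative only.

```python
def toy_model(seq):
    """Hypothetical stand-in for a trained model: counts 'GATA' occurrences."""
    return float(sum(seq[i:i + 4] == "GATA" for i in range(len(seq) - 3)))


def insert_motif(background, motif):
    """Overwrite the center of the background with the motif (same length out)."""
    start = (len(background) - len(motif)) // 2
    return background[:start] + motif + background[start + len(motif):]


def marginalize_toy(model, backgrounds, motif):
    """Return the mean change in prediction caused by inserting the motif."""
    before = [model(seq) for seq in backgrounds]
    after = [model(insert_motif(seq, motif)) for seq in backgrounds]
    return sum(a - b for a, b in zip(after, before)) / len(backgrounds)


# Hand-written backgrounds that contain no GATA match.
backgrounds = ["C" * 20, "T" * 20, "ACAC" * 5]
effect = marginalize_toy(toy_model, backgrounds, "GATA")
print(effect)
```

Averaging over many backgrounds is what makes the estimate a marginal effect: it reflects what the motif does on its own rather than in one particular sequence context.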
Note
Marginalization requires a motif database file in MEME format. Most motif databases offer a download in this format. If you are looking for a database to use, we recommend JASPAR2026.
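For orientation, a MEME-format file lists each motif under a `MOTIF` line followed by a `letter-probability matrix` header and one row of base probabilities per position. The sketch below is a simplified parser, assuming the standard ACGT column order and the usual `w= <width>` header field; the function names and the example motif are made up for illustration, and production code should use an established reader instead.

```python
MEME_TEXT = """MEME version 4

ALPHABET= ACGT

MOTIF GATA_toy
letter-probability matrix: alength= 4 w= 4 nsites= 20 E= 0
0.0 0.0 1.0 0.0
1.0 0.0 0.0 0.0
0.0 0.0 0.0 1.0
1.0 0.0 0.0 0.0
"""


def parse_meme(text):
    """Return {motif_name: list of per-position probability rows}."""
    motifs = {}
    name = None
    rows_left = 0
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "MOTIF":
            name = fields[1]
            motifs[name] = []
            rows_left = 0
        elif fields[0] == "letter-probability" and name is not None:
            # The motif width follows the "w=" token in the matrix header.
            rows_left = int(fields[fields.index("w=") + 1])
        elif rows_left > 0:
            motifs[name].append([float(x) for x in fields])
            rows_left -= 1
    return motifs


def consensus(rows, alphabet="ACGT"):
    """Return the most probable base at each position."""
    return "".join(alphabet[row.index(max(row))] for row in rows)


motifs = parse_meme(MEME_TEXT)
print({name: consensus(rows) for name, rows in motifs.items()})
```

Lines outside a matrix, such as the version and alphabet headers, are ignored by this sketch.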