Python API Tutorial
===================

This tutorial demonstrates how to use the Cherimoya Python API to train a
model, make predictions, and evaluate performance.


Creating a Model
----------------

A Cherimoya model is a standard ``torch.nn.Module``:

.. code-block:: python

   from cherimoya import Cherimoya

   model = Cherimoya(
       n_filters=96,           # Width of the convolutional backbone (default 96)
       n_layers=9,             # Number of Cheri Blocks (dilation: 1, 2, ..., 256)
       n_outputs=2,            # Output tracks (2 for stranded data)
       n_control_tracks=0,     # Control input tracks (0 if no controls)
       expansion=2,            # MLP expansion factor inside each Cheri Block (default 2)
       residual_scale=0.15,    # Fixed scalar on each residual connection (default 0.15)
       single_count_output=True,  # Single scalar count vs. per-track counts
       name="my_model",        # Model name (used for save filenames)
   ).cuda()


Understanding the Input/Output Shapes
--------------------------------------

.. list-table::
   :header-rows: 1
   :widths: 25 35 40

   * - Tensor
     - Shape
     - Description
   * - ``X`` (input)
     - ``(batch, 4, in_window)``
     - One-hot encoded DNA sequence
   * - ``X_ctl`` (optional)
     - ``(batch, n_control, in_window)``
     - Control signal per position
   * - ``y_profile`` (output)
     - ``(batch, n_outputs, out_window)``
     - Predicted profile logits
   * - ``y_counts`` (output)
     - ``(batch, n_count_outputs)``
     - Predicted log counts

The default ``in_window`` is 2114 bp and ``out_window`` is 1000 bp. The
difference (``trimming = (2114 - 1000) / 2 = 557``) accounts for the receptive
field of the dilated convolution stack.


Loading Training Data
---------------------

The :func:`~cherimoya.io.PeakGenerator` function handles all data I/O:

.. code-block:: python

   from cherimoya.io import PeakGenerator

   training_data = PeakGenerator(
       peaks="peaks.narrowPeak",         # Peak regions (BED/narrowPeak)
       negatives="negatives.bed",        # GC-matched negative regions
       sequences="hg38.fa",             # Reference genome
       signals=["signal.+.bw", "signal.-.bw"],  # Signal tracks
       controls=None,                    # Control tracks (or list of bigWigs)
       chroms=["chr2", "chr4", "chr5"],  # Training chromosomes
       in_window=2114,
       out_window=1000,
       max_jitter=128,                   # Random position jitter (bp)
       negative_ratio=0.1,              # Ratio of negatives to peaks
       reverse_complement=True,          # Augment with reverse complements
       batch_size=64,
       num_workers=1,                    # Async prefetch workers
       random_state=0,                   # Sampler seed (reproducible)
   )

.. tip::

   Use ``verbose=True`` to see progress bars during data loading and summary
   statistics about the number of peaks, negatives, and filtered regions.


Sampling Strategy
^^^^^^^^^^^^^^^^^

The underlying :class:`~cherimoya.io.PeakNegativeSampler` is fully
deterministic given ``random_state``. ``__getitem__(idx)`` is a pure
function of ``idx`` and the current epoch, with no dependence on call
history. As a consequence:

- Each epoch yields exactly ``n_peaks + int(n_peaks * negative_ratio)``
  examples; every peak appears exactly once and the peak/negative
  interleaving is random but reproducible.
- Setting ``num_workers > 1`` produces the **same** sequence of batches
  as ``num_workers = 1`` — just faster. All workers compute the same
  data for any given index.
- Per-position augmentations (jitter, reverse-complement) are also
  drawn from the per-epoch RNG, so two runs with the same seed produce
  bit-identical training data.


Preparing Validation Data
--------------------------

Validation data is loaded as a single block of tensors:

.. code-block:: python

   from tangermeme.io import extract_loci

   valid_data = extract_loci(
       sequences="hg38.fa",
       signals=["signal.+.bw", "signal.-.bw"],
       loci="peaks.narrowPeak",
       chroms=["chr8", "chr20"],  # Held-out validation chromosomes
       in_window=2114,
       out_window=1000,
       max_jitter=0,              # No jitter for validation
       ignore=list('QWERYUIOPSDFHJKLZXVBNM'),  # Ignore non-standard chroms
   )

   X_valid, y_valid = valid_data  # Without controls
   # X_valid, y_valid, X_ctl_valid = valid_data  # With controls


Setting Up Optimizers
---------------------

Cherimoya uses a dual-optimizer strategy — **Muon** for 2D projection layers
and **AdamW** for everything else:

.. code-block:: python

   from torch.optim import AdamW, Muon
   from torch.optim.lr_scheduler import (
       LinearLR, CosineAnnealingLR, SequentialLR
   )

   # Route parameters to the appropriate optimizer, with Muon only handling
   # internal 2D matrices and leaving all other parameters, including the
   # head and tail layers, to AdamW
   muon_params, adam_params = [], []
   for name, p in model.named_parameters():
       if p.ndim == 2 and "weight" in name and name != "linear.weight":
           muon_params.append(p)
       else:
           adam_params.append(p)

   # Create optimizers
   muon_optimizer = Muon(muon_params, lr=0.01, weight_decay=0.0)
   adam_optimizer = AdamW(adam_params, lr=0.004, weight_decay=0.0)

   # Warmup + cosine decay schedules
   n_warmup_iters = len(training_data) * 5      # 5 epoch warmup
   n_total_iters = len(training_data) * 50       # 50 total epochs

   muon_scheduler = SequentialLR(muon_optimizer, schedulers=[
       LinearLR(muon_optimizer, start_factor=0.01, total_iters=n_warmup_iters),
       CosineAnnealingLR(muon_optimizer, T_max=n_total_iters, eta_min=1e-5),
   ], milestones=[n_warmup_iters])

   adam_scheduler = SequentialLR(adam_optimizer, schedulers=[
       LinearLR(adam_optimizer, start_factor=0.01, total_iters=n_warmup_iters),
       CosineAnnealingLR(adam_optimizer, T_max=n_total_iters, eta_min=1e-5),
   ], milestones=[n_warmup_iters])


Training the Model
------------------

.. code-block:: python

   model.fit(
       training_data,
       muon_optimizer, adam_optimizer,
       muon_scheduler, adam_scheduler,
       X_valid=X_valid,
       X_ctl_valid=None,    # Pass control tensors here if using controls
       y_valid=y_valid,
       max_epochs=50,
       batch_size=64,
       early_stopping=15,   # Stop after 15 epochs without improvement
       dtype='float32',     # Or 'bfloat16' for mixed precision
       device='cuda',
   )

During training, the model saves:

- ``my_model.torch`` — best model checkpoint
- ``my_model.final.torch`` — final model after training
- ``my_model.log`` — training/validation metrics log


Saving and Loading Models
-------------------------

Cherimoya checkpoints store the constructor arguments next to the parameter
state dict. This makes loading robust to package layout changes and lets the
checkpoint be loaded with PyTorch's ``weights_only=True`` setting.

**Saving**:

.. code-block:: python

   from cherimoya import Cherimoya

   model = Cherimoya(n_filters=96, n_layers=9, n_outputs=2)
   # ... train ...
   model.save("my_model.torch")

**Loading**:

.. code-block:: python

   from cherimoya import Cherimoya

   # Load to CPU (default) — fine for inspection or CPU-only inference.
   model = Cherimoya.load("my_model.torch")

   # Load directly onto GPU.
   model = Cherimoya.load("my_model.torch", device="cuda")

The checkpoint is a dictionary with two keys, ``config`` (the kwargs
passed to ``Cherimoya.__init__``) and ``state_dict`` (the parameter
tensors). Older checkpoints saved with ``torch.save(model, ...)`` are not
compatible with this loader and should be retrained or migrated.


Making Predictions
------------------

.. code-block:: python

   from tangermeme.predict import predict

   model.eval()
   y_profile, y_counts = predict(
       model, X_test,
       batch_size=64,
       device='cuda',
       dtype='float32',
   )

For reverse-complement averaging (often improves performance):

.. code-block:: python

   import torch

   y_profile_rc, y_counts_rc = predict(
       model, torch.flip(X_test, dims=(-1, -2)),
       batch_size=64, device='cuda',
   )

   y_profile_avg = (y_profile + torch.flip(y_profile_rc, dims=(-1, -2))) / 2
   y_counts_avg = (y_counts + y_counts_rc) / 2


Evaluating Performance
----------------------

.. code-block:: python

   from cherimoya.performance import calculate_performance_measures

   measures = calculate_performance_measures(
       y_profile, y_valid, y_counts,
       measures=['profile_pearson', 'count_pearson', 'profile_jsd'],
   )

   for name, values in measures.items():
       print(f"{name}: {values.mean():.4f}")