cherimoya.io

Data loading utilities for training Cherimoya. The dataset class, PeakNegativeSampler, is the peak/negative mixture sampler; the function-style entry point PeakGenerator() is the typical way to build a training DataLoader.

PeakGenerator

cherimoya.io.PeakGenerator(peaks, negatives, sequences, signals, controls=None, chroms=None, in_window=2114, out_window=1000, max_jitter=50, negative_ratio=0.1, reverse_complement=True, shuffle=True, min_counts=None, max_counts=None, summits=False, exclusion_lists=None, random_state=None, pin_memory=True, num_workers=1, batch_size=32, verbose=False)[source]

This is a constructor function that handles all IO.

This function extracts signal from all signal and control files, passes it into a DataGenerator, and wraps the result in a PyTorch DataLoader. For typical training, it is the only function you need to call.

Parameters:
  • peaks (str or pandas.DataFrame or list/tuple of such) – A BED-formatted file containing peak coordinates. This can be either the string path to the BED file or a pandas DataFrame object containing three columns: chrom, start, and end. Alternatively, this can be a list of such objects whose coordinates will be interleaved.

  • negatives (str or pandas.DataFrame or list/tuple of such) – A BED-formatted file containing negative coordinates. This can be either the string path to the BED file or a pandas DataFrame object containing three columns: chrom, start, and end. Alternatively, this can be a list of such objects whose coordinates will be interleaved.

  • sequences (str or dictionary) – Either the path to a FASTA file to read from or a dictionary where the keys are the unique set of chromosomes and the values are one-hot encoded sequences as numpy arrays or memory maps.

  • signals (list of strs or list of dictionaries) – A list of filepaths to bigwig files, where each filepath will be read using pyBigWig, or a list of dictionaries where the keys are the same set of unique chromosomes and the values are numpy arrays or memory maps.

  • controls (list of strs or list of dictionaries or None, optional) – A list of filepaths to bigwig files, where each filepath will be read using pyBigWig, or a list of dictionaries where the keys are the same set of unique chromosomes and the values are numpy arrays or memory maps. If None, no control tensor is returned. Default is None.

  • chroms (list or None, optional) – A set of chromosomes to extract loci from. Loci on other chromosomes in the locus file are ignored. If None, all loci are used. Default is None.

  • in_window (int, optional) – The input window size. Default is 2114.

  • out_window (int, optional) – The output window size. Default is 1000.

  • max_jitter (int, optional) – The maximum amount of jitter to add, in either direction, to the midpoints that are passed in. Default is 50.

  • negative_ratio (float, optional) – The ratio of negatives to peaks in each batch. A value of 1 means that each batch is balanced, and a value of 10 means that there are 10 negatives for each peak. Note that this is independent of the number of peaks and negatives provided: even if the peaks input has 10x as many coordinates as the negatives input, a ratio of 1 means each batch during training will be balanced (on average). Default is 0.1.

  • reverse_complement (bool, optional) – Whether to reverse complement-augment half of the data. Default is True.

  • shuffle (bool, optional) – Whether to randomly sample peaks, if True, or to proceed sequentially through them, if False. Negatives are always randomly sampled. Default is True.

  • min_counts (float or None, optional) – The minimum number of counts, summed across the length of each example and across all tasks, required for an example to be kept. If None, no minimum. Default is None.

  • max_counts (float or None, optional) – The maximum number of counts, summed across the length of each example and across all tasks, allowed for an example to be kept. If None, no maximum. Default is None.

  • summits (bool, optional) – Whether to return a region centered around the summit instead of the midpoint between the start and end. If True, the 10th column (index 9) is added to the start to get the center of the window, so the data must be in narrowPeak format. Default is False.

  • exclusion_lists (list or None, optional) – A list of strings of filenames to BED-formatted files containing exclusion lists, i.e., regions where overlapping loci should be filtered out. If None, no filtering is performed based on exclusion zones. Default is None.

  • random_state (int or None, optional) – Base seed for the sampler’s deterministic per-epoch RNG. If None, a seed is captured once from system entropy. Default is None.

  • pin_memory (bool, optional) – Whether to pin page memory to make data loading onto a GPU easier. Default is True.

  • num_workers (int, optional) – The number of processes fetching data at a time to feed into a model. If 0, data is fetched from the main process (synchronous, can become a bottleneck because each batch blocks the GPU). Default is 1, which runs one async prefetch worker. Higher values are safe and produce the same sequence of batches as num_workers = 1, just faster: __getitem__(idx) is a pure function of idx and the current epoch, so all workers compute the same data for any given index.

  • batch_size (int, optional) – The number of data elements per batch. Default is 32.

  • verbose (bool, optional) – Whether to display a progress bar while loading. Default is False.
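As a rough illustration of how summits, max_jitter, min_counts, and max_counts could act on a single locus, the sketch below uses three hypothetical helpers (locus_center, padded_window, keep_example) that are not part of cherimoya. Integer midpoints and inclusive count bounds are assumptions made for the example; the library's actual rounding and boundary handling may differ.

```python
def locus_center(start, end, summit_offset=None, summits=False):
    """Return the center coordinate of a locus.

    With summits=True the center is start + summit_offset, where the offset
    is the narrowPeak column at index 9. Otherwise it is the midpoint of
    start and end (integer division is an assumption of this sketch).
    """
    if summits:
        if summit_offset is None:
            raise ValueError("summits=True requires a narrowPeak summit offset")
        return start + summit_offset
    return (start + end) // 2


def padded_window(center, window, max_jitter):
    """Return (start, end) of a window padded by max_jitter on each side.

    The padded length is window + 2 * max_jitter, which leaves room to shift
    the final crop by up to max_jitter in either direction.
    """
    start = center - window // 2 - max_jitter
    return start, start + window + 2 * max_jitter


def keep_example(total_counts, min_counts=None, max_counts=None):
    """Return True if the summed counts pass the optional bounds.

    Bounds are treated as inclusive here, which is an assumption.
    """
    if min_counts is not None and total_counts < min_counts:
        return False
    if max_counts is not None and total_counts > max_counts:
        return False
    return True
```

With the defaults in_window=2114 and max_jitter=50, padded_window returns a span of 2214 positions around the chosen center.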

Returns:

X – A PyTorch DataLoader wrapping a DataGenerator object.

Return type:

torch.utils.data.DataLoader
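A minimal usage sketch follows. The keyword names come from the signature above, but every file path and chromosome name is a hypothetical placeholder, and build_training_loader is an illustrative wrapper rather than part of the library. The import is deferred so that the argument dictionary can be inspected without cherimoya installed.

```python
# Hypothetical inputs: replace the paths and chromosome names with your own.
loader_kwargs = dict(
    peaks="peaks.narrowPeak",        # narrowPeak is required when summits=True
    negatives="negatives.bed",
    sequences="genome.fa",
    signals=["signal.plus.bw", "signal.minus.bw"],
    controls=None,                   # no control tensor is returned
    chroms=["chr1", "chr2", "chr3"],
    in_window=2114,
    out_window=1000,
    max_jitter=50,
    negative_ratio=0.1,
    summits=True,
    random_state=0,                  # fixed seed for reproducible sampling
    num_workers=1,
    batch_size=32,
)


def build_training_loader(**kwargs):
    """Build the training DataLoader; cherimoya is imported lazily."""
    from cherimoya.io import PeakGenerator
    return PeakGenerator(**kwargs)


# Typical use, left commented out because it needs real data files:
# training_data = build_training_loader(**loader_kwargs)
# for batch in training_data:
#     ...  # run one training step per batch
```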

PeakNegativeSampler

class cherimoya.io.PeakNegativeSampler(*args, **kwargs)[source]

Bases: Dataset

A data generator mimicking the BPNet data loading procedure.

Here, a set of peaks and negatives are separately loaded. These sets can be any size. From these sets, batches of given size are sampled that are a mixture of peaks and negatives.

Sampling is fully deterministic given random_state and the epoch number. __getitem__(idx) is a pure function of idx and the current epoch, so num_workers > 1 produces the same per-index data tuples as num_workers = 1 — the DataLoader yields identical batch sequences, just faster.

Each peak is drawn exactly once per epoch; the peak/negative interleaving and all augmentations are reproducible from (random_state, epoch).
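One way to picture a per-epoch plan with these properties is sketched below. This is not the library's implementation: epoch_plan is a hypothetical function, the seed derivation from (random_state, epoch) is made up for the example, and drawing round(n_peaks * negative_ratio) negatives per epoch is an assumption.

```python
import random


def epoch_plan(n_peaks, n_negatives, negative_ratio, random_state, epoch,
               shuffle=True):
    """Return one epoch as a list of ("peak", i) and ("negative", j) entries."""
    # Derive the RNG from (random_state, epoch) only, so the same pair always
    # yields the same plan. The mixing constant is arbitrary.
    rng = random.Random(random_state * 1_000_003 + epoch)

    # Every peak index appears exactly once, optionally in shuffled order.
    peak_order = list(range(n_peaks))
    if shuffle:
        rng.shuffle(peak_order)

    # Negatives are always drawn at random, with replacement.
    n_draws = round(n_peaks * negative_ratio)
    negative_draws = [rng.randrange(n_negatives) for _ in range(n_draws)]

    # Choose which positions in the epoch hold negatives; peaks fill the rest
    # in their (possibly shuffled) order.
    total = n_peaks + n_draws
    negative_slots = set(rng.sample(range(total), n_draws))

    peaks_iter = iter(peak_order)
    negatives_iter = iter(negative_draws)
    plan = []
    for slot in range(total):
        if slot in negative_slots:
            plan.append(("negative", next(negatives_iter)))
        else:
            plan.append(("peak", next(peaks_iter)))
    return plan
```

Calling epoch_plan twice with the same random_state and epoch returns the same list, and changing only the epoch reseeds the RNG, so each epoch gets its own ordering.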

In the documentation below, mj = max_jitter.

Parameters:
  • peak_sequences (torch.tensor, shape=(n_peaks, 4, in_window+2*mj)) – A tensor of peak sequences that are one-hot encoded.

  • peak_signals (torch.tensor, shape=(n_peaks, t, out_window+2*mj)) – A tensor of signals to predict, usually base-pair resolution integer counts.

  • peak_controls (torch.tensor, shape=(n, t, out_window+2*mj) or None, optional) – Optional control input track for peak examples.

  • negative_sequences (torch.tensor, shape=(n, 4, in_window+2*mj)) – One-hot encoded negative sequences.

  • negative_signals (torch.tensor, shape=(n, t, out_window+2*mj)) – Negative sequence signals.

  • negative_controls (torch.tensor or None, optional) – Optional control input track for negative examples.

  • negative_ratio (float, optional) – Ratio of negatives to peaks per epoch. 0 means no negative draws. Default 0.1.

  • in_window (int, optional) – The input window size. Default 2114.

  • out_window (int, optional) – The output window size. Default 1000.

  • max_jitter (int, optional) – Maximum jitter (in either direction) applied to peaks. Default 0.

  • reverse_complement (bool, optional) – Whether to reverse complement-augment half of the data. Default False.

  • shuffle (bool, optional) – Whether to shuffle the peak ordering each epoch. Default True.

  • random_state (int or None, optional) – Base seed for the deterministic per-epoch RNG. If None, a random seed is captured once at construction time so that all forked worker processes share it.
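The extra 2*mj positions in the shapes above leave room to shift the final crop. The sketch below shows how a jittered crop and a reverse-complement flip could be applied to one stored example, using nested Python lists in place of tensors. The helper names are hypothetical, and the A, C, G, T channel order is an assumption of this example, not something the documentation states.

```python
def one_hot(sequence, alphabet="ACGT"):
    """One-hot encode a string as 4 rows (one per base) of 0/1 columns."""
    return [[1 if base == letter else 0 for base in sequence]
            for letter in alphabet]


def crop_with_jitter(sequence, signal, in_window, out_window, max_jitter,
                     jitter):
    """Crop padded rows to in_window and out_window columns.

    sequence rows have in_window + 2 * max_jitter columns and signal rows
    have out_window + 2 * max_jitter columns. Because both are padded by the
    same amount, one offset shifts both crops by the same number of bases.
    """
    if not -max_jitter <= jitter <= max_jitter:
        raise ValueError("jitter must lie in [-max_jitter, max_jitter]")
    start = max_jitter + jitter
    seq_crop = [row[start:start + in_window] for row in sequence]
    sig_crop = [row[start:start + out_window] for row in signal]
    return seq_crop, sig_crop


def reverse_complement(encoded):
    """Reverse-complement a one-hot sequence with rows ordered A, C, G, T.

    Reversing the row order swaps A with T and C with G; reversing each row
    flips the sequence end to end.
    """
    return [row[::-1] for row in encoded[::-1]]
```

For example, with in_window=4, out_window=2, and max_jitter=1, a stored sequence "ACGTAC" cropped with jitter=1 yields the encoding of "GTAC", and the reverse complement of the encoding of "AAC" is the encoding of "GTT".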

Constructor

__init__(peak_sequences, peak_signals, negative_sequences, negative_signals, peak_controls=None, negative_controls=None, negative_ratio=0.1, in_window=2114, out_window=1000, max_jitter=0, reverse_complement=False, shuffle=True, random_state=None)[source]

The sampler is fully deterministic given random_state and the epoch number. __getitem__(idx) is a pure function of idx and the current epoch, so num_workers > 1 yields the same batch sequence as num_workers = 1 and two runs with the same seed produce bit-identical training data.
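The worker-independence property can be illustrated with a small stand-in. This sketch assumes an RNG derived from (random_state, epoch, idx); the actual seed derivation in the library is not documented here, and example_rng and get_item are hypothetical names.

```python
import random


def example_rng(random_state, epoch, idx):
    """Build an RNG that depends only on (random_state, epoch, idx)."""
    return random.Random(f"{random_state}:{epoch}:{idx}")


def get_item(idx, epoch, random_state=0, max_jitter=50):
    """Stand-in for __getitem__: return the augmentation choices for idx."""
    rng = example_rng(random_state, epoch, idx)
    jitter = rng.randint(-max_jitter, max_jitter)
    flip = rng.random() < 0.5  # reverse-complement roughly half the examples
    return idx, jitter, flip


# One process handling every index in order.
serial = [get_item(i, epoch=0) for i in range(8)]

# Two simulated workers handling alternating indices, merged back by index.
worker_a = [get_item(i, epoch=0) for i in range(0, 8, 2)]
worker_b = [get_item(i, epoch=0) for i in range(1, 8, 2)]
merged = sorted(worker_a + worker_b)
```

Because no state is shared between calls, merged equals serial regardless of how the indices are split across workers.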