evoaug.augment

Library of data augmentations for genomic sequence data.

To contribute a custom augmentation, use the following syntax:

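import torch
from evoaug.augment import AugmentBase

class CustomAugmentation(AugmentBase):
    def __init__(self, param1, param2):
        self.param1 = param1
        self.param2 = param2

    def __call__(self, x: torch.Tensor) -> torch.Tensor:
        # Perform the augmentation on x (shape: (N, A, L)) and store the result in x_aug.
        x_aug = x  # placeholder: replace with the actual transformation
        return x_aug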

Module Contents

Classes

AugmentBase

Base class for EvoAug augmentations for genomic sequences.

RandomDeletion

Randomly deletes a contiguous stretch of nucleotides from sequences in a training batch, with a random length between a user-defined delete_min and delete_max.

RandomInsertion

Randomly inserts a contiguous stretch of nucleotides into sequences in a training batch, with a random length between a user-defined insert_min and insert_max.

RandomTranslocation

Randomly cuts each sequence in a training batch into two pieces and swaps their order.

RandomInversion

Randomly inverts a contiguous stretch of nucleotides in sequences in a training batch according to a user-defined invert_min and invert_max.

RandomMutation

Randomly mutates sequences in a training batch according to a user-defined mutate_frac.

RandomRC

Randomly applies a reverse-complement transformation to each sequence in a training batch according to a user-defined probability, rc_prob.

RandomNoise

Randomly adds Gaussian noise to a batch of sequences according to a user-defined noise_mean and noise_std.

class evoaug.augment.AugmentBase

Base class for EvoAug augmentations for genomic sequences.

abstract __call__(x)

Return an augmented version of x.

Parameters:

x (torch.Tensor) – Batch of one-hot sequences (shape: (N, A, L)).

Returns:

Batch of one-hot sequences with random augmentation applied.

Return type:

torch.Tensor
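
As an illustration of the (N, A, L) input shape, the following minimal sketch builds a one-hot batch from DNA strings. The helper name one_hot_batch and the A, C, G, T channel order are assumptions made for this example; the text above does not specify a channel order.

```python
import torch

ALPHABET = "ACGT"  # assumed channel order, not specified by the documentation

def one_hot_batch(seqs):
    """Encode equal-length DNA strings as a float tensor of shape (N, A, L)."""
    N, A, L = len(seqs), len(ALPHABET), len(seqs[0])
    x = torch.zeros(N, A, L)
    for n, seq in enumerate(seqs):
        for pos, base in enumerate(seq):
            x[n, ALPHABET.index(base), pos] = 1.0
    return x

x = one_hot_batch(["ACGTAC", "TTGACA"])
print(x.shape)  # torch.Size([2, 4, 6])
```

Here N is the batch size, A the alphabet size, and L the sequence length, so every column of every sequence sums to 1.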

class evoaug.augment.RandomDeletion(delete_min=0, delete_max=20)

Bases: AugmentBase

Randomly deletes a contiguous stretch of nucleotides from each sequence in a training batch. The deletion length is a random number between the user-defined delete_min and delete_max, and a different deletion is applied to each sequence.

Parameters:
  • delete_min (int, optional) – Minimum size for random deletion (defaults to 0).

  • delete_max (int, optional) – Maximum size for random deletion (defaults to 20).

__call__(x)

Randomly delete segments in a set of one-hot DNA sequences.

Parameters:

x (torch.Tensor) – Batch of one-hot sequences (shape: (N, A, L)).

Returns:

Sequences with randomly deleted segments (padded to the correct shape with random DNA).

Return type:

torch.Tensor
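
The behaviour described above can be sketched in plain PyTorch as follows. This is an illustrative sketch, not the library's implementation: the function names are hypothetical, and placing the random padding at the right end of the sequence is an assumption.

```python
import torch

def random_dna(A, L):
    """Uniform random one-hot DNA of shape (A, L)."""
    idx = torch.randint(0, A, (L,))
    dna = torch.zeros(A, L)
    dna[idx, torch.arange(L)] = 1.0
    return dna

def random_deletion_sketch(x, delete_min=0, delete_max=20):
    """Delete one random stretch per sequence, then pad back to length L."""
    N, A, L = x.shape
    out = []
    for seq in x:
        # Deletion length and start position are drawn independently per sequence.
        n_del = torch.randint(delete_min, delete_max + 1, (1,)).item()
        start = torch.randint(0, L - n_del + 1, (1,)).item()
        kept = torch.cat([seq[:, :start], seq[:, start + n_del:]], dim=1)
        # Assumption: random DNA padding is appended on the right.
        pad = random_dna(A, n_del).to(seq.dtype)
        out.append(torch.cat([kept, pad], dim=1))
    return torch.stack(out)

torch.manual_seed(0)
x = torch.stack([random_dna(4, 50) for _ in range(8)])
x_aug = random_deletion_sketch(x, delete_min=0, delete_max=20)
print(x_aug.shape)  # same shape as x
```

The sketch requires delete_max to be no larger than the sequence length L.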

class evoaug.augment.RandomInsertion(insert_min=0, insert_max=20)

Bases: AugmentBase

Randomly inserts a contiguous stretch of random nucleotides into each sequence in a training batch. The insertion length is a random number between the user-defined insert_min and insert_max, and a different insertion is applied to each sequence. Each sequence is padded with random DNA so that all sequences have the same shape.

Parameters:
  • insert_min (int, optional) – Minimum size for random insertion, defaults to 0.

  • insert_max (int, optional) – Maximum size for random insertion, defaults to 20.

__call__(x)

Randomly inserts segments of random DNA into a set of one-hot DNA sequences.

Parameters:

x (torch.Tensor) – Batch of one-hot sequences (shape: (N, A, L)).

Returns:

Sequences with randomly inserted segments of random DNA. All sequences are padded with random DNA to ensure the same shape.

Return type:

torch.Tensor
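
A minimal sketch of the insertion-plus-padding idea, under stated assumptions: the function names are hypothetical, the output length is taken to be L + insert_max, and the padding is appended on the right. The library may place the padding differently.

```python
import torch

def random_dna(A, L):
    """Uniform random one-hot DNA of shape (A, L)."""
    idx = torch.randint(0, A, (L,))
    dna = torch.zeros(A, L)
    dna[idx, torch.arange(L)] = 1.0
    return dna

def random_insertion_sketch(x, insert_min=0, insert_max=20):
    """Insert random DNA at a random position; pad every sequence to L + insert_max."""
    N, A, L = x.shape
    out = []
    for seq in x:
        n_ins = torch.randint(insert_min, insert_max + 1, (1,)).item()
        pos = torch.randint(0, L + 1, (1,)).item()
        insert = random_dna(A, n_ins).to(seq.dtype)
        # Assumption: the remaining insert_max - n_ins positions are padded on the right.
        pad = random_dna(A, insert_max - n_ins).to(seq.dtype)
        out.append(torch.cat([seq[:, :pos], insert, seq[:, pos:], pad], dim=1))
    return torch.stack(out)

torch.manual_seed(0)
x = torch.stack([random_dna(4, 50) for _ in range(8)])
x_aug = random_insertion_sketch(x, insert_min=0, insert_max=20)
print(x_aug.shape)  # torch.Size([8, 4, 70])
```

Padding to a fixed length is what lets sequences with different insertion sizes be stacked into one tensor.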

class evoaug.augment.RandomTranslocation(shift_min=0, shift_max=20)

Bases: AugmentBase

Randomly cuts each sequence in a training batch into two pieces and swaps their order. This is implemented as a roll transformation whose shift size lies between the user-defined shift_min and shift_max. A different roll (positive or negative) is applied to each sequence. Because a roll preserves sequence length, the output has the same shape as the input.

Parameters:
  • shift_min (int, optional) – Minimum size for random shift, defaults to 0.

  • shift_max (int, optional) – Maximum size for random shift, defaults to 20.

__call__(x)

Randomly shifts sequences in a batch, x.

Parameters:

x (torch.Tensor) – Batch of one-hot sequences (shape: (N, A, L)).

Returns:

Sequences with random translocations.

Return type:

torch.Tensor
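
The roll transformation described above can be sketched with torch.roll. The function name is hypothetical, and the 50/50 choice between positive and negative shifts is an assumption.

```python
import torch

def random_translocation_sketch(x, shift_min=0, shift_max=20):
    """Roll each sequence along its length axis by a random signed shift."""
    N = x.shape[0]
    shifts = torch.randint(shift_min, shift_max + 1, (N,))
    # Assumption: each shift is made negative with probability 0.5.
    signs = torch.randint(0, 2, (N,)) * 2 - 1
    shifts = shifts * signs
    rolled = [torch.roll(seq, int(s), dims=-1) for seq, s in zip(x, shifts)]
    return torch.stack(rolled)

torch.manual_seed(0)
idx = torch.randint(0, 4, (8, 50))
x = torch.nn.functional.one_hot(idx, num_classes=4).permute(0, 2, 1).float()
x_aug = random_translocation_sketch(x, shift_min=0, shift_max=20)
print(x_aug.shape)  # same shape as x
```

A roll only reorders positions, so the nucleotide composition of each sequence is unchanged.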

class evoaug.augment.RandomInversion(invert_min=0, invert_max=20)

Bases: AugmentBase

Randomly inverts a contiguous stretch of nucleotides in each sequence in a training batch, with a length between the user-defined invert_min and invert_max. A different inversion is applied to each sequence. Inverting a segment in place preserves sequence length, so the output has the same shape as the input.

Parameters:
  • invert_min (int, optional) – Minimum size for random inversion, defaults to 0.

  • invert_max (int, optional) – Maximum size for random inversion, defaults to 20.

__call__(x)

Randomly inverts segments in a set of one-hot DNA sequences.

Parameters:

x (torch.Tensor) – Batch of one-hot sequences (shape: (N, A, L)).

Returns:

Sequences with randomly inverted segments.

Return type:

torch.Tensor
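
A minimal sketch of a per-sequence inversion. The function name is hypothetical, and two assumptions are made: the inverted segment is reverse-complemented (not merely reversed), and the channel order is A, C, G, T so that flipping the channel axis gives the complement.

```python
import torch

def random_inversion_sketch(x, invert_min=0, invert_max=20):
    """Reverse-complement one random segment per sequence, in place."""
    N, A, L = x.shape
    out = []
    for seq in x:
        n_inv = torch.randint(invert_min, invert_max + 1, (1,)).item()
        start = torch.randint(0, L - n_inv + 1, (1,)).item()
        segment = seq[:, start:start + n_inv]
        # Flip the length axis (reverse) and the channel axis (complement, assuming ACGT order).
        flipped = torch.flip(segment, dims=[0, 1])
        out.append(torch.cat([seq[:, :start], flipped, seq[:, start + n_inv:]], dim=1))
    return torch.stack(out)

torch.manual_seed(0)
idx = torch.randint(0, 4, (8, 50))
x = torch.nn.functional.one_hot(idx, num_classes=4).permute(0, 2, 1).float()
x_aug = random_inversion_sketch(x, invert_min=0, invert_max=20)
print(x_aug.shape)  # same shape as x
```

The sketch requires invert_max to be no larger than the sequence length L.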

class evoaug.augment.RandomMutation(mutate_frac=0.05)

Bases: AugmentBase

Randomly mutates sequences in a training batch according to a user-defined mutate_frac. A different set of mutations is applied to each sequence.

Parameters:

mutate_frac (float, optional) – Probability of mutation for each nucleotide, defaults to 0.05.

__call__(x)

Randomly introduces mutations to a set of one-hot DNA sequences.

Parameters:

x (torch.Tensor) – Batch of one-hot sequences (shape: (N, A, L)).

Returns:

Sequences with randomly mutated DNA.

Return type:

torch.Tensor
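
One simple way to realise a per-nucleotide mutation probability is sketched below. This is an assumption-laden illustration, not the library's algorithm: the function name is hypothetical, and each selected position is redrawn uniformly from all A nucleotides, so a redraw can reproduce the original base.

```python
import torch
import torch.nn.functional as F

def random_mutation_sketch(x, mutate_frac=0.05):
    """Redraw each position with probability mutate_frac."""
    N, A, L = x.shape
    # Boolean mask of positions to redraw, independent per sequence and position.
    mask = torch.rand(N, L) < mutate_frac
    new_idx = torch.randint(0, A, (N, L))
    random_seq = F.one_hot(new_idx, num_classes=A).permute(0, 2, 1).to(x.dtype)
    # Broadcast the (N, 1, L) mask over the channel axis.
    return torch.where(mask.unsqueeze(1), random_seq, x)

torch.manual_seed(0)
idx = torch.randint(0, 4, (8, 50))
x = F.one_hot(idx, num_classes=4).permute(0, 2, 1).float()
x_aug = random_mutation_sketch(x, mutate_frac=0.05)
print(x_aug.shape)  # same shape as x
```

With A = 4, a uniform redraw changes the base three times out of four on average, so the observed fraction of changed positions in this sketch is lower than mutate_frac.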

class evoaug.augment.RandomRC(rc_prob=0.5)

Bases: AugmentBase

Randomly applies a reverse-complement transformation to each sequence in a training batch according to a user-defined probability, rc_prob. This is applied to each sequence independently.

Parameters:

rc_prob (float, optional) – Probability to apply a reverse-complement transformation, defaults to 0.5.

__call__(x)

Randomly transforms sequences in a batch with a reverse-complement transformation.

Parameters:

x (torch.Tensor) – Batch of one-hot sequences (shape: (N, A, L)).

Returns:

Sequences with random reverse-complements applied.

Return type:

torch.Tensor
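
A minimal sketch of an independent, per-sequence reverse-complement. The function name is hypothetical, and the channel order is assumed to be A, C, G, T, so that flipping the channel axis maps each base to its complement.

```python
import torch

def random_rc_sketch(x, rc_prob=0.5):
    """Reverse-complement each sequence independently with probability rc_prob."""
    # Flip channels (complement under ACGT order) and positions (reverse).
    x_rc = torch.flip(x, dims=[1, 2])
    apply = torch.rand(x.shape[0]) < rc_prob
    return torch.where(apply.view(-1, 1, 1), x_rc, x)

# "AACG" as channel indices under the assumed ACGT order.
idx = torch.tensor([[0, 0, 1, 2]])
x = torch.nn.functional.one_hot(idx, num_classes=4).permute(0, 2, 1).float()
x_rc = random_rc_sketch(x, rc_prob=1.0)
print(x_rc.argmax(dim=1))  # tensor([[1, 2, 3, 3]]), i.e. "CGTT"
```

Setting rc_prob to 1.0 transforms every sequence, and 0.0 leaves the batch unchanged.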

class evoaug.augment.RandomNoise(noise_mean=0.0, noise_std=0.2)

Bases: AugmentBase

Randomly adds Gaussian noise to a batch of sequences according to a user-defined noise_mean and noise_std. A different noise sample is applied to each sequence.

Parameters:
  • noise_mean (float, optional) – Mean of the Gaussian noise, defaults to 0.0.

  • noise_std (float, optional) – Standard deviation of the Gaussian noise, defaults to 0.2.

__call__(x)

Randomly adds Gaussian noise to a set of one-hot DNA sequences.

Parameters:

x (torch.Tensor) – Batch of one-hot sequences (shape: (N, A, L)).

Returns:

Sequences with random noise.

Return type:

torch.Tensor
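
Adding Gaussian noise can be sketched in a few lines. The function name is hypothetical, and drawing the noise independently for every element of the batch is an assumption consistent with the description above.

```python
import torch

def random_noise_sketch(x, noise_mean=0.0, noise_std=0.2):
    """Add element-wise Gaussian noise with the given mean and standard deviation."""
    noise = noise_mean + noise_std * torch.randn_like(x)
    return x + noise

torch.manual_seed(0)
idx = torch.randint(0, 4, (8, 50))
x = torch.nn.functional.one_hot(idx, num_classes=4).permute(0, 2, 1).float()
x_aug = random_noise_sketch(x, noise_mean=0.0, noise_std=0.2)
print(x_aug.shape)  # same shape as x
```

The result keeps the input shape but is no longer strictly one-hot, since every entry is perturbed.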