rendseq Algorithm
Preprocessing Rend-seq files
rendseq will assist in the normalization of files after they have been aligned to genomes.
- It takes an input a
.sam
file from alignment of sequencing files to the genome of interest. - The first step in the rendseq analysis pipeline is to convert these aligned files into Rend-seq tracks. There are four Rend-seq tracks for every experiment, one for each the 5′ and 3′ mapping reads on both the forward and the reverse strands.
- While making these files, rendseq will handle the occasional issue which arises when non-templated additions are made to the 5′ end of reads during the library preparation process, and will remove the nucleotides which do not match at the 5′ end before generating the 5′ end Rend-seq tracks. Each of these Rend-seq tracks is saved in a wiggle, or .wig format.
After the creation of the Rend-seq track files, each must be preprocessed before peak calling can proceed. This is because the density of reads mapping at any location can vary by several orders of magnitude, and the distribution of read density surrounding any particular read can also vary drastically. The step of pre-processing takes each Rend-seq track and converts it into a form where it is easier to make comparisons across disparate parts of the genome.
- The first step of pre-processing is to pad the zero-values of the Rend-seq track. This padding is done using gaussian generated values, with a mean of 0 and a standard deviation of 1. It is important to pad with values with a non-zero standard deviation, rather than just setting the values to 0.
- As the peak calling protocol ultimately uses surrounding distribution’s standard deviation to identify peaks if the standard deviation is artificially low it can lead to a high false positive rate of peaks.
- Even though true reads densities do not exist as floats, in practice treating these area of low read coverage as floats increased the performance of the peak-calling algorithm in areas with low read density.
After padding, each genomic position is replaced by its normalized value, which I will refer to as a “score”, which is calculated via a function of the read density at that genomic position, and the surrounding read density.
Parameters
Only scores above a certain threshold of 5 are kept, as in practice scores of less than 5 do not correspond to true peaks. The score-generating function is a modified Z-score which considers the Z-score of the position with respect to both the (optionally normalized) upstream and downstream read-density distributions. It has several parameters: gap size (an integer), window size (an integer), percent trim (a float in the range [0, 1]), and winsorize (a boolean). We will explain each of these here and the possible effects that each can have on score-generation.
Gap Size:
- The gap size is the distance upstream and downstream of the given genomic position to exclude from any distribution calculations.
- This exclusion is important as peaks can often be distributed across many read positions, meaning that they can artificially inflate the estimation of the local distribution’s standard deviation, and may deflate the score we assign to true peaks, potentially leading to a higher false negative rate.
- The default of the program is to set this value to 5.
Window Size:
- The window size refers to the total number of reads to include in the distribution.
- This number ought to be large enough to provide a useful estimate of the mean and standard deviation, but not so large as to cause it to include other genes or features which would further distort the gaussian assumption of those distributions.
- In the program it defaults to 50.
Percent Trim
- Percent trim is a value which aids in the turning of the upstream and downstream distributions before the mean and standard deviation are called.
- It will remove the top x % of reads, based on their read density.
- The default is 0, though the user can supply a different percentage, which in certain cases may be useful in removing extremely high values, such as peaks. S
- Setting a value greater than ~10% is not recommended however, as there are cases where a high percentage of trimming will allow other features, such as noise, or the shadows which can accompany peaks, to appear as peaks as well.
Winsorize
- Winsorize serves a similar purpose as percent trim - although it facilitates the removal of reads based on standard deviation.
- For a given window - any read which exceeds a standard deviation of 1.5 will be dropped.
- We use a low standard deviation here, as it is important to remember extreme outliers can lead to a very large standard deviation.
- We have found that a threshold for winsorization of 1.5 strikes a good balance between aiding in the elimination of true outliers (other peaks) while not appearing to disrupt regions without any extreme outliers.