MACRO-PERFECTOS-APE — MAtrix CompaRisOn & PrEdicting Regulatory Functional Effect of SNPs by Approximate P-value Estimation – User Manual – version 2.0.0 May 16, 2015 1 Abstract Here we present MACRO-APE and PERFECTOS-APE software designed for practical sequence analysis involving classic mononucleotide and dinucleotide position weight matrices (PWMs) of DNA sequence patterns often called motifs. The common usage case for DNA motifs is representation of transcription factor binding sites. The software allows (1) comparing different PWMs using a variant of Jaccard similarity measure, e.g. scanning a motif collection for motifs similar to a given query, (2) analysing single-nucleotide variants for possible regulatory effect through transcription factor affinity changes, (3) performing basic PWM analysis (P-value and threshold estimation). 2 Technical notes MACRO- and PERFECTOS-APE require Java Runtime Environment 1.6 (or newer) to run. Thus *-APE should be able to function under most modern operating systems. Several existing motif collections such as HOCOMOCO as well as several individual PWM examples are available to be used with the *-APE package: HOCOMOCO [http://autosome.ru/HOCOMOCO/] TFBS model collection and several examples of PWMs (motifs) can be downloaded with MACROPERFECTOS-APE at [http://opera.autosome.ru/downloads/all_collections_ pack.tar.gz]. Windows users can get the latest Java directly from Oracle: [http://www. java.com]. Modern Linux distributions typically have OpenJDK preinstalled, otherwise it should be available via a distribution-specific package manager. 1 The latest MACRO-PERFECTOS-APE package can be found at [http:// opera.autosome.ru/downloads/ape.jar]. Source codes are distributed under WTFPL public license. They are available in a github repository: [https: //github.com/prijutme4ty/macro-perfectos-ape] and as a single archive at [http://opera.autosome.ru/downloads/macro-perfectos-ape_src.jar]. Web version (only basic functionality available) can be found at [http:// opera.autosome.ru]. This manual is also hosted on github in a repository: [https://github. com/prijutme4ty/macro-perfectos-ape-manual]. 3 Overview All tools are packed in a jar-file with compiled Java classes. There are three main packages for tools: ru.autosome.ape, ru.autosome.macroape and ru.autosome.perfectosape. APE in ru.autosome.ape stands for Approximate P-value Estimation, this package contains basic tools: • FindThreshold — to estimate a PWM score threshold for a given P-value • FindPvalue — to estimate a PWM P-value corresponding to a given score threshold • PrecalculateThresholds — to precalculate lists of thresholds tabulated by P-values for a given motif collection MACRO-APE in ru.autosome.macroape denotes MAtrix CompaRisOn by Approximate P-value Estimation. Package consists of several tools related to motif comparison: • EvalSimilarity — to evaluate similarity for a given pair of PWMs. • ScanCollection — to search a collection of motifs for PWMs similar to a given query. PERFECTOS-APE in ru.autosome.perfectosape denotes Predicting Regulatory Functional Effect of SNPs by Approximate P-value Estimation. Package contains a single tool: • SNPScan — to search a pack of sequences with SNVs or SNPs against a collection of PWMs for (SNV, PWM) pairs, such that single nucleotide substitution induces significant change of predicted affinity for a given PWM. Please note, that *-APE tools by default consider all given matrices as positional weight matrices with additive scores already passed counts-to-weights transformation (e.g. log-odds). The usage of count matrices (PCMs) or frequency matrices (PPMs) is also possible with additional command-line keys (see the respective sections). 2 3.1 Command line format All tools use similar command-line format. The examples are shown under the assumtion that the *-APE package ape.jar is located in the current folder (working directory). A typical command line will look like: java -cp ape.jar ru.autosome.ToolName <required arguments> [options] Each tool can be used with --help or -h options to display a detailed help message describing order of arguments and a list of optional parameters. Each tool is provided in mononucleotide and dinucleotide versions for monoand diPWMs and respective background models. Generally, mononucleotide version has wider application range, since most of existing motif collections provide only basic mononucleotide PWMs. Naming convention is the same for all tools: mononucleotide version is located in package’s root, dinucleotide version has the same name but is located in a subpackage ".di". E.g. for ape.FindThreshold the full class names are: • ru.autosome.ape.FindThreshold for mononucleotide version • ru.autosome.ape.di.FindThreshold for dinucleotide version. Please note, that dinucleotide tools use special input formats for dinucleotide Position Weight Matrices (diPWM) and respective background models. Input data formats are described in a special section. 3.1.1 Output formats All tools except PrecalculateThresholds print their results into the standard output stream (stdout). PrecalculateThresholds stores its results in a set of output files created in a specified folder. For each tool the output can be redirected to a file using OS syntax, e.g. with a ”>”-sign. For example: java -cp ape.jar ru.autosome.ape.FindPvalue motifs/KLF4 f2.pwm 3.3 5.0 7.1 > KLF4 P-values.txt Output generally consists of two types of lines. Lines starting with "#" character (comments) show input parameters and descriptions. The results are presented in non-commented lines. 4 Basic APE tools APE tools are designed to properly convert PWM thresholds to P-values and vice versa. Position weight matrix (PWM) of DNA motifs assigns a score to each ”word” (nucleotide sequence of a fixed length l). It makes possible to range the words by their scores, e.g. corresponding to predicted transcription affinity for PWMs 3 of transcription factor binding sites (TFBS). Given a threshold, one can divide all l-mers into two subsets: words whose score are not less than the threshold and the rest. Typically, the words passing the score threshold are selected for downstream analysis, e.g. they are considered as putative transcription factor binding sites. What is important, the threshold values are not directly comparable for different PWMs. One strategy to have a unified scale is to use motif P-values instead. The P-value of a certain PWM and a score threshold is the probability to generate a word with the score not less than the threshold at random. Inverse task is to estimate a threshold for a predefined P-value. In particular this allows to select a PWM score threshold corresponding to a predefined positive prediction rate across the l-mer dictionary (e.g. only x% of words are predicted as putative TFBS). Our tools perform threshold – P-value conversion implementing a dynamic programming algorithm on a granulated (discretized) PWM models using a simplified approach comparing to that described in Touzet et al. [2007]. More details on P-values, thresholds and the algorithm are provided in the MACRO-APE paper. Vorontsov et al. [2013] [http://www.almob.org/ content/8/1/23] 4.1 FindThreshold This is a stand-alone tool to search for a score threshold corresponding to a given P-value for a given PWM. FindThreshold requires a PWM and a P-value as input and returns a threshold for which the set of words scoring with this PWM no less than the given threshold has the aggregated probability equal to the given P-value. The program can process a set of P-values, and return a set of thresholds. This tool implements a simplified algorithm derived from that implemented in the TFM-Pvalue software of Helen Touzet [http://bioinfo. lifl.fr/TFM/TFMpvalue/] but with the fixed predefined discretization level (see section 9.1). Usage: java -cp ape.jar ru.autosome.ape.FindThreshold <motif file> [list of P-values] Example (motif file KLF4 f2.pat, P-value of 0.001 and 0.0005): java -cp ape.jar ru.autosome.ape.FindThreshold motifs/KLF4 f2.pat 0.001 0.0005 NOTE! By default FindThreshold looks for threshold large enough to obtain P-value not greater than requested (lower boundary for P-value). For details see --boundary option description in section 8.4. 4 4.2 FindPvalue FindPvalue is a stand-alone tool to find the P-value corresponding to a given threshold level for a given PWM. Usage: java -cp ape.jar ru.autosome.ape.FindPvalue <motif file> <list of thresholds> Example (motif file KLF4 f2.pat, thresholds of 4.1719 and 5.2403): java -cp ape.jar ru.autosome.ape.FindPvalue motifs/KLF4 f2.pat 4.1719 5.2403 4.3 PrecalculateThresholds This tool is intended to process the motif collection (a folder containing separate files for each motif) and to store precomputed score distributions of motif PWMs. Each score distributions is saved as a sorted list of (threshold,P-value) pairs with P-values taken at uniform intervals at quantiles of score distribution. It allows for faster score – P-value conversion performing binary search through a list of thresholds or P-values. PrecalculateThresholds doesn’t store precise score distribution because for a non-disretized PWM it can be extremely large with unpractical precision. Practically it’s sufficient to estimate P-value with a specified error level of e.g. 5%. In order to use precalculated distribution several *-APE tools have --precalc option which takes a folder containing results of PrecalculateThresholds. Note: Precalculation allows notably increase speed of threshold to P-value calculation (up to 100x). Unfortunately it deals with a file system to load the precalculated data. Thus it’s recomended to use precalculated score distribution for tasks where the same motif P-value evaluation is performed multiple times so that the score distribution is loaded once and used multiple times. At a moment the only use case is – perfectosape.SNPScan which assesses each of multiple SNPs against the same motif collection. Note: Resulting score distribution depends on a discretization level and on a specified background model. It is up to the user to control that a score distribution was precomputed with the proper parameters. The parameters values are not anyhow stored after score distribution precalculation and are not implicitly contolled when reusing precomputed data. Usage: java -cp ape.jar ru.autosome.ape.PrecalculateThresholds <motif collection folder> <output folder> [options] Example: java -cp ape.jar ru.autosome.ape.PrecalculateThresholds ./motifs/ ./motif thresholds/ 5 This will create ./motif thresholds/ folder (if not already exist) and multiple files inside, one file per motif in ./motifs/ folder. For a given motif output file will be named as <name of motif>.thr. Each file contains lines in the following format: <threshold> tab <corresponding P-value> Lines are sorted with thresholds ascending (P-value descending). It takes about half a minute to preprocess the collection of ∼400 mononucleotide PWMs with default parameters using 1.5 GHz CPU. During precalculation task progress will be printed to standard error stream. To suppress output use --silent option. To alter granularity of resulting P-values list one can use --pvalues option in the following format: --pvalues <from,to,step,mode> Parameters set the P-values progression in the resulting list. P-values can use arithmetic or geometric progession which corresponds to add or mul value of mode. from and to represent progression boundaries and step corresponds to a common difference (add) or a common ratio (mul) of progression. Parameters are comma-separated without spaces between. For example, default progression can be written as follows: --pvalues 1.0,1e-15,1.05,mul It means that PrecalculateThresholds collect thresholds for each of these P-values: 1.0, 1.0/1.05, 1.0/1.052 , 1.0/1.053 , . . . , 10−15 To specify relative error of use geometric progression with common ratio of 1 + and boundaries: from 1.0 to a minimal expected non-zero P-value. 5 MACRO-APE: Matrix Comparison by Approximate P-value Estimation Let us have two PWMs with given threshold levels. The similarity between PWMs is related to the number of words recognized by both PWMs (or the aggregated probability of the word set under the given i.i.d. model). To calculate this value we use generalized approach described in Touzet et al. [2007] for two PWMs simultaneously in a way similar to that in Pape et al. [2008]. The number of words recognized by both PWMs can be used to construct a variant of Jaccard similarity measure for motifs considered as sets of allowed words scoring no less than predefined thresholds. Typical methods of PWM comparison are based on direct evaluation of matrix elements, for instance by comparing matrices column by column (where different columns correspond to different positions of a transcription factor binding site). On the other hand, in applications PWM is used as TFBS model to identify binding sites by scanning a given sequence and identifying words with 6 scores no less than a threshold. Thus, in reality a TFBS model is related to the set of words scoring no less than the given threshold for the given PWM. It is desirable to construct a similarity measure for TFBS models based on the similarity between word sets recognized by the matrices with given thresholds, rather than on similarity between matrices per se. Moreover, comparison-by-elements strategy requires the matrices to have algebraically comparable values (either frequencies or specifically scaled weights) which is not necessary if sets of recognized TFBS are compared. MACRO-APE computes a similarity measure which directly accounts for similarity of recognized word sets. This measure does not require PWM elements to be algebraically comparable and so it can be used to compare weight matrices constructed by different normalization / conversion strategies (e.g. log-odds with different pseudocounts and/or background normalization). Let us have a position weight matrix of length l. The whole set of ACGTalphabet words of length l will be called the dictionary of size N = 4l . For a fixed threshold level t one can calculate the fraction of the dictionary (i.e. the number of words n) scoring no less than the threshold. We will call the value of n/N as the motif P-value. Suppose we have two PWMs m1 , m2 of length l and some P-value levels p1 , p2 . For m1 and m2 we can estimate the thresholds t1 and t2 corresponding to p1 , p2 . Having PWMs with the corresponding thresholds we can estimate the fraction f of the dictionary recognized by both models, i.e. the size of the set of words scoring no less than t1 on m1 and no less than t2 on m2 . Moreover one can construct the Jaccard index J= A∩B , A∪B (1) where A and B are sets of words recognized by m1 and m2 with the thresholds t1 and t2 . If necessary one also can construct a Jaccard distance as d(A, B) = 1 − J . (2) In the general case we have two PWMs of different widths, unknown optimal mutual alignment and orientation. For each possible alignment shift and orientation the matrices can be extended to the same length by adding zero-columns (not affecting either score or threshold) and then compared as the two models of the same width. Then one can determine the optimal shift and orientation by selecting the case with the highest Jaccard similarity. More formal and detailed explanation can be found in the corresponding macroape paper Vorontsov et al. [2013]. NOTE! The reverse complementary transformation can be necessary to optimally align a given pair of matrices, thus the background nucleotide composition for matrix comparison tools should be symmetrical, i.e. p(A) = p(T) and p(C) = p(G). 7 5.1 EvalSimilarity EvalSimilarity computes the similarity of two given motifs defined as a Jaccard similarity of sets of words recognized by each motif. Optimal mutual alignment of the motifs is also estimated. Sets of recognized words are given by a PWM accompanied with threshold or a P-value. By default a set of recognized words is defined as top 0.05% of words (i.e. P-value level of 0.0005) ranked by a PWM. It’s possible to set required P-value with --pvalue <P-value> option or to specify thresholds explicitly so that word sets contain all words passing corresponding thresholds. It can be accomplished using --first-threshold <threshold> and --second-threshold <threshold>. In order to get intuition of Jaccard similarity scale and to better catch our output format, try these examples and take a look at corresponding motif logos (see the sample data): Example (rather similar motifs KLF4 f2 and SP1 f1, see fig. 1): java -cp ape.jar ru.autosome.macroape.EvalSimilarity motifs/KLF4 f2.pat motifs/SP1 f1.pat Figure 1: Sequence logo corresponding to a motif alignment. Example (the same motif SP1 f1 in opposite orientations): java -cp ape.jar ru.autosome.macroape.EvalSimilarity motifs/SP1 f1 revcomp.pat motifs/SP1 f1.pat Example (significantly different motifs SP1 f1 and GABPA f1): java -cp ape.jar ru.autosome.macroape.EvalSimilarity motifs/SP1 f1.pat motifs/GABPA f1.pat By default EvalSimilarity tests all possible mutual motif alignments in both orientations. A special option --position will force evaluating similarity with the explicitly specified motif alignment: --position <shift>,<direct|revcomp> Option parameters are comma-separated, spaces not allowed; the position is defined for the second motif relative to the first. Try the following examples: Example (rather similar motifs KLF4 f2 and SP1 f1 at optimal alignment): java -cp ape.jar ru.autosome.macroape.EvalSimilarity motifs/KLF4 f2.pat 8 motifs/SP1 f1.pat --position -1,direct Example (rather similar motifs KLF4 f2 and SP1 f1 at completely wrong alignment): java -cp ape.jar ru.autosome.macroape.EvalSimilarity motifs/KLF4 f2.pat motifs/SP1 f1.pat --position 3,revcomp Note! By default EvalSimilarity selects the thresholds corresponding to the P-value not less than requested (upper boundary) possibly making compared word sets larger (not to miss words with scores too close to the threshold). This differs from FindThreshold approach which, by default, uses lower boundary for P-valuethus controlling the prediction rate more strictly. It is very important to select upper P-value boundary for short PWMs. In case of given low P-values they can recognize no words at all (so the Jaccard measure may have zero numerator and zero denominator). For reasonable threshold levels both upper and lower boundaries usually produce very close similarity values, see the MACRO-APE paper for details Vorontsov et al. [2013]. Nevertheless, one can override this behavior with --boundary lower option. In such a case if any of supplied PWMs recognizes no words for a selected P-value, then similarity can not be correctly determined and macroape will report the similarity value of −1. 5.1.1 ScanCollection This tool uses a collection of motifs to find PWMs similar to a given query. The list of similar PWMs is sorted by similarity in descending order so the PWMs similar to the query are located at the top of the list. NOTE! The shift and orientation are reported for PWMs from the collection relative to the query PWM. Example(search for motifs similar to KLF4 f2, HOCOMOCO collection): java -cp ape.jar ru.autosome.macroape.ScanCollection motifs/KLF4 f2.pat ./hocomoco/ The two-pass search mode is available to recheck the top part of the list using a more precise discretization level. Second pass is executed only if --precise [min similarity=0.01] key is specified. The precise search will recheck only the PWMs similar to the query with a similarity no less than min similarity. The results of the second pass will be marked by asterisk(*). One can specify similarity cutoff with option --similarity-cutoff <similarity cutoff> or -c <similarity cutoff> to discard comparison results with the resulting similarity less than a given value (the 1st pass results are used). By default, records with similarity less than 0.05 are not shown. In order to print comparison results for all PWMs in collection --all option can be used. 9 Example(search PWMs similar to KLF4 f2, extended precision for the most similar PWMs): java -cp ape.jar ru.autosome.macroape.ScanCollection motifs/KLF4 f2.pat ./jaspar/ --precise To find similar PWMs using a particular P-value level one should use the "--pvalue" option. Default P-value is 0.0005. Example (): java -cp ape.jar ru.autosome.macroape.ScanCollection motifs/KLF4 f2.pat ./selex/ --pvalue 0.001 --similarity-cutoff 0.06 --precise 0.1 6 PERFECTOS-APE: Predicting Regulatory Functional Effect of SNPs by Approximate P-value Estimation. Variations in genome sequences are quite common. One widespread type of variations is represented by single nucleotide substitutions called single nucleotide variants (SNVs) or, for a given population, single nucleotide polymorphisms (SNPs). SNVs in gene regulatory regions may affect gene expression through alterations in transcription factor binding sites. PWM of transcription factor binding sites provides a score for any putative TFBS. This score roughly represents binding affinity, thus allowing to estimate the impact of a given substitution through change in a score value. As discussed earlier (section 4) scores are not directly comparable and do not have a unified scale. More convenient measure is the P-value - the probability to find a high-scoring word at random. PERFECTOS-APE computes motif P-values for each sequence variant and calculates P-value fold change of a given substitution. Detailed algorithm for evaluating a fold change for a given TF and a substituion: • Calculate PWM scores for putative TFBS overlapping a sequence variant. • Choose the best position and score for both sequence variants independently. • Estimate P-values for the best scores. • Compute fold change as the rate of P-values. PERFECTOS-APE tests given SNVs against a whole collection of PWMs and yields (SNV, TF) pairs of SNVs that may significantly affect TF affinity. More details on the algorithm are provided in the PERFECTOS-APE paper. Vorontsov et al. [2015] [http://dx.doi.org/10.5220/0005189301020108] 10 6.1 SNPScan SNPScan takes a list of SNVs with flanking sequences and a motif collection and returns a list of predicted TFBS which were possibly disrupted by or emerged after a certain SNV. If flanking sequences around SNVs are too short for some TFBS models, the sequences are extended by poly-N tails up to necessary length. Usage: java -cp ape.jar ru.autosome.perfectosape.SNPScan <path to the collection of motifs> <path to the file with the list of SNVs> [options] SNPScan has two filters. The first discards (SNV, TF) pairs without TFBS prediction at any of nucleotide variants. SNPScan treat a word as a putative TFBS if P-value of this word’s score is not greater than the predefined threshold (0.0005 by default, changed via --pvalue-cutoff option: --pvalue-cutoff <maximal P-value to be considered> or in short form: -P <maximal P-value to be considered>. The second filter requires check P-value fold change to be large enough. By default fold change threshold is equal to 5. It means that only SNVs causing P-value change of 5x and more (F oldChange ≥ 5 or F oldChange ≤ 1/5) will be included in results. Fold change threshold can be specified using --fold-change-cutoff: --fold-change-cutoff <minimal fold change to be considered> or in short form: -F <minimal fold change to be considered> --log-fold-change option changes fold change from P-value1 into log2 P-value1 P-value2 P-value2 both in command-line parameter settings and output. Option --expand-region <length> allows PWM hits to be located nearby but not strictly overlap the position with the nucleotide substitution. When this option is specified, the PWM occurrence can be located up to length bp away from the SNV position. This option is intended for analysis involving control data with SNVs not necessarily overlapping the binding sites. The last but the most useful option is --precalc which forces SNPScan to work with precalculated P-value,thresholds pairs performing binary search to evaluate the P-value instead of calculating motif score distribution each time from scratch. It can reduce total computation time in hundreds of times for large datasets. As an input it requires a folder with precalculated (P-value,threshold) pairs - one for each motif: --precalc <path to a folder with precalculated P-value, threshold pairs> These precalculated score distributions are to be produced by a PreprocessCollection from APE toolbox. Please refer to the respective section for details. Example: java -cp ape.jar ru.autosome.perfectosape.SNPScan ./hocomoco/pwms/ snp.txt --precalc ./collection thresholds 11 java -cp ape.jar ru.autosome.perfectosape.SNPScan ./hocomoco/pcms/ snp.txt --pcm --discretization 10 --background 0.2,0.3,0.3,0.2 6.1.1 Output data format SNPScan prints all results to standard output, errors and messages go into standard error stream. First line of output is a header of table. Latter lines are rows of this table. Columns are: • Name of sequence containing SNV • TF motif name • for the first allele variant: – the best position and strand of putative TF-DNA binding – nucleotide word corresponding to the best binding sequence among all other words in sequence, intersecting SNV • the same two columns for the second allele variant • allele variants • P-value for the first allele variant • P-value for the second allele variant • fold change (the first P-value divided by the second P-value) Position of the best binding place is given for the leftmost boundary of a binding sequence (independent of strand orientation). The SNV location is at zero, so the TFBS coordinates are always less or equal to zero. Strand is denoted as ‘direct‘ or ‘revcomp‘. Words are given at the relevant strand (i.e. reverse-complement transformation is applied if necessary). More compact output format can be produced using the --compact option. The resulting table will have the following columns: • Name of sequence containing SNV • TF motif name • P-value for the first allele variant • P-value for the second allele variant • the best position and strand of putative TF-DNA binding for the first allele variant • the best position and strand of putative TF-DNA binding for the second allele variant 12 Please note that fold change and word sequences are not shown (comparing to the default output). Strand information is given as +/- form (versus direct/revcomp in the default output). P-valuesare rounded up to three significant digits. This option is intended to process huge lists of SNVs and reduce the output ( 2.5x less size). 7 7.1 Data formats Position matrix format description All tools in the *-APE package use the following matrix file format (each binding site position corresponds to a separate line): some header pos1 A weight pos1 C weight pos1 G weight pos1 T weight ... posw A weight posw C weight posw G weight posw T weight Position matrix format is appliable for all kinds of positional matrices: positional weight(PWM), count(PCM) and probability/frequency(PPM). Positonal count matrices are allowed to contain floating point numbers (e.g. in the case the counts were derived from somehow weighted alignments). The total number of lines corresponds to the PWM width (minus the header line). If given, header will be treated as a motif name, otherwise filename will stand for motif name. Header may carry an optional ">" sign at line start (like in fasta files). If necessary it’s possible to read transposed matrices, with nucleotides in rows and positions in columns using --transpose option. Example (PWM similar to HOCOMOCO transcription factor motif for KLF4): >KLF4_f2 0.308 -2.254 0.135 0.328 -1.227 -4.814 1.305 -4.908 -2.443 -4.648 1.358 -4.441 -2.717 -3.807 1.356 -3.504 -0.556 0.534 -3.614 0.527 -1.868 -4.381 1.337 -3.815 -2.045 -2.384 0.719 0.544 -1.373 -3.006 1.285 -2.502 -2.103 -1.894 1.249 -1.428 -1.327 0.898 -0.808 -0.181 Example (Transposed PWM similar to HOCOMOCO transcription factor motif for KLF4): 13 > KLF4_f2 1233.5 264.0 93.2 5.3 1036.6 3347.6 1258.3 4.7 76.8 6.6 3529.5 8.6 57.9 18.1 3520.3 25.2 518.2 1545.9 22.4 1535.0 138.0 9.3 3456.3 17.9 115.3 81.5 1861.9 1562.8 227.8 42.8 3278.6 72.2 108.7 134.5 3162.9 215.4 238.5 2226.0 402.4 754.7 More real-life examples are provided with the package in respective motif collections. Dinucleotide versions of *-APE tools use dinucleotide motifs. Dinucleotide positional matrices have similar format but contain 16 columns instead of 4. Columns go in order: AA, AC, AG, AT, CA, CC, . . . , TT. It’s also possible to use mononucleotide motifs in dinucleotide tools (e.g. to use dinucleotide background). For rationales and details take a look at --from-mono option. 7.2 SNVs/SNPs format SNPScan uses a list of sequences with SNVs as input data. The list of sequences with SNVs should be given in a single plain text file. Each sequence should be presented at a separate line using the following format: <SNV name> <left flank>[<variant 1>/<variant 2>]<right flank> SNV name shouldn’t contain empty delimiters (spaces or tabs). Sequence consists of two allele variants in square brackets, separated with ‘/‘, and flanking sequences at both sides. Length of flanking sequences should be sufficient to place the longest motif of a given collection (so it is advised to provide 25-30bp at each side) into all positions relative to a nucleotide substitution position. So, first two columns are SNV name and SNV sequence. Later columns (if present) are ignored, thus can contain any data. Example (SNV list): # Text after "#" doesn’t matter # It’s possible to include any number of comment lines into input rs10040172 gattgcagttactga[G/A]tggtacagacatcgt Anything rs10116271 gtggggaagaggtct[C/T]gtagaggcgatgatt can go rs10208293 ttatgtccagtacct[A/G]tggaccctccttgtg after first rs10431961 ggtcaggcggcgtcg[C/T]cggtacgctctgagc two columns Note that lines starting with # are considered as comments and thus ignored by SNPScan. 8 Additional command-line options Many additional options are available for *-APE tools. The options should be provided after the required arguments. There are common options among all *-APE tools as well as tool-specific options (already described in the respective sections). This section covers common options: those altering input data format and those affecting calculation parameters. The first class of options allows using 14 input motifs as different matrices (counts, PCM or probabilities, PPM) instead of default weights (PWM), load matrices in transposed format and use mononucleotide motifs in dinucleotide tools. The second class of options allows to set the background model, select P-value evaluation mode, limit memory consumption and so on. For a full list of options for a particular tool please run the tool with the --help command line option. 8.1 Option families Options are grouped into ”families” of options with similar names but different prefixes. For example macroape.di.EvalSimilarity tool, has an option --from-mono. This option creates dinucleotide motifs by loading mononucleotide matrices. In turn, --first-from-mono options forces loading of the first motif from mononucleotide input and --second-from-mono does the same for the second motif. Similar options for macroape.ScanCollection are named --query-from-mono and --collection-from-mono. Option --query-from-mono requires mononucleotide query matrix, and --collection-from-mono means that each motif in collection should be loaded from mononucleotide matrix. The same is appliable for --background. Such triples of options are typically listed in the help string like this: --[first-|second-]from-mono. It means that one can use both prefixed and non-prefixed options. Possible prefixes are given in square brackets separated with a pipe sign "|". Note: Prefixed options exist only in a long form. E.g. one can use both -b and --background as synonymous but for there is no short analogue for --first-background. Note: Presence of separate options for each of used motifs doesn’t necessarily involve existence of a common option. E.g. macroape.EvalSimilarity has --first-threshold and --second-threshold options but doesn’t have --threshold since it generally makes no sense to use the same algebraical threshold value for two independent PWMs (common P-value level in turn is a reasonable parameter). 8.2 Motif loading options By default motifs are expected to be provided as position weight matrices in a nucleotides-in-columns plain text format. Basic tools use mononucleotide positional matrices, dinucleotide tools use dinucleotide matrices. However, many motif collections provide position frequency matrices (PFMs, or probability matrices, PPMs) or position count matrices (PCMs). *-APE tools can convert these matrices to PWMs internally (using a log-odds-like transformation as in Lifanov et al. [2003], see the section 9). 8.2.1 Obtaining PWM from PCM and PPM models To load motif from position count matrices there is a special --pcm option. A similar option --ppm words for positional probability matrices (see fig. 2). The PCM → PWM or PPM → PWM data model transformations can be configured. 15 PPM --ppm −→ diP P M --ppm −→ PWM• | --from-mono ↓ diP W M • --pcm ←− P CM --pcm ←− diP CM Figure 2: Command-line options to read a motif from non-PWM motif models. Conversion end-points are marked with bullets. PPM diP P M --effective-count −→ P CM --effective-count −→ diP CM --background --pseudocount −→ PWM• ↓ --background --pseudocount −→ diP W M • Figure 3: Motif transformations configuration options. Conversion end-points are marked with bullets. The PCM → PWM conversion is described in a section 9.2. It’s possible to manually specify a fixed pseudocount a with --pseudocount <a> option. When not specified, pseudocount is derived from alignment weight W : a = ln(max(W, 2)) We manage case W < 2 as if W = 2 to avoid zero and negative pseudocount values. √ Another pseudocount option --pseudocount sqrt sets pseudocount as a = W. PPM → PWM conversion is done in two stages. At first PPM is multiplied by a constant alignment weight W to obtain a PCM. Then this PCM is converted to a PWM as described above. For the PPM → PWMconversion, a user should supply alignment weight W (for example it can be the total count of words in the initial alignment) explicitly by the --effective-count <W> option. If this information is not given, alignment weight of 100.0 will be used as a default assumption. PCM → PWM conversion will take the user-specified background into account. DiPCMs are converted to diPWMs using the same formula as for PCM → PWM conversion, the only difference is that now nucleotide index goes through 16 dinucleotides at each position instead of 4 nucleotides. Possible configuration options can be seen on a fig. 3. 16 8.2.2 Obtaining dinucleotide motifs from mononucleotide ones Dinucleotide *-APE tools take dinucleotide motifs as input parameters. But there is an option --from-mono which allows to use basic mononucleotide motifs instead so that PWM → diPWM will be done internally. It can be useful in following cases: • Comparison of dinucleotide motif against mononucleotide one. In this case one motif should be loaded as dinucleotide motif, the rest - as mononucleotide motif internally converted to a dinucleotide motif. Further comparison performs on two dinucleotide motifs. • Study of mononucleotide motif properties on dinucleotide background. It isn’t possible to specify dinucleotide background for a mononucleotide tool, but is possible to specify mononucleotide motif and dinucleotide background for a dinucleotide tool. PWM → diPWM is done in such a way that each word has the same score on diPWM as it had on PWM. Notice: scores of words on discreted PWM and corresponding diPWM can be slightly different due to a discretization step performed after PWM → diPWM conversion. This discrepancy shouldn’t worry you, it’s small enough and goes to zero with discretization increase. When both --pcm and --from-mono options are specified, the conversion is done in two stages (see fig. 3). First, PCM → PWM transformation is applied and then PWM → diPWM transformation is applied. Background model used in PCM → PWM conversion should be given as mononucleotide letter frequencies (”Bernoulli” i.i.d. random model). Background provided to a dinucleotide tool should be given as dinucleotide frequencies. In this case mononucleotide frequencies are estimated by averaging dinucleotide background: P P β pαβ + β pβα (3) pα = 2 8.2.3 --transpose option One can load motifs from nucleotides-in-rows using --transpose option. The only difference in format is matrix orientation, header remains the same (see section 7.1). 8.3 Background model options Nucleotide frequencies of a background model can be specified in optional arguments, e.g. --background or --query-background. All background options use the same format with a single required argument: --background <value>. Default background model is a wordwise model. It means that all our calculations assume uniform nucleotide distribution and the exact number of words is used everywhere instead of probabilities of a word set. E.g. FindPvalue will 17 calculate not the probability of a random word score to pass the threshold but a fraction of words scoring greater than threshold estimating the exact number of such words. A number of words is a more natural and intuitive to use, especially if the background model cannot be properly selected thus we suggest ”wordwise” mode by default. Wordwise mode can be specified explicitly, e.g. using --background wordwise key. All following formats are different ways to specify frequencies of each nucleotide: • The most simple nucleotide background model is uniform, each nucleotide has the same probability to occur. Option format is: --background uniform. This is close to wordwise mode, but word set probabilities are used and reported instead of raw counts of words. • It is also possible to specify a fixed GC-content (in range 0 to 1): --background <GC-content>. E.g. "--background 0.6" • The most detailed format is to explicitly specify nucleotide frequencies: --background <pA , pC , pG , pT >. E.g. "--background 0.2,0.3,0.3,0.2" will define the same frequencies as for GC-content of 0.6. Note that nucleotide frequencies should be given in alphabetical ACGT-order separated with commas. Note: No spaces between frequencies are allowed (commas only). Sum of frequencies should be equal to 1.0. 8.3.1 Dinucleotide background options Dinucleotide background for dinucleotide tools has the same variants: wordwise, uniform, GC-content and full list of dinucleotide frequencies. Wordwise, uniform and GC-content backgrounds are effectively the same as mononucleotide ones and don’t carry any nucleotide interdependencies. Dinucleotide frequencies require additional clarification. Dinucleotide frequencies should be given in an alphabetical order: AA, AC, AG, . . . , TT — 16 terms. Each value corresponds to a probability of a specific dinucleotide. These probabilities are not conditional probabilities used by an algorithm internally, but actual dinucleotide frequencies. Please be careful if you already got used to use Markov model background. Again, list of probabilities is comma-separated, no spaces allowed, sum of probabilities should be equal to 1.0. Also one can specify mononucleotide ACGT-frequencies background for dinucleotide tools. It will be recognized automatically when 4 values are specified instead of 16. 18 8.4 8.4.1 Additional command-line options Specifying custom discretization level For a more precise result --discretization <discretization rate> or -d <discretization rate> command line key can be used to explicitly set the discretization level for PWM elements, like "--discretization 100000" (see the section 9.1). The discretization level of 105 corresponds to the precision of PWM elements up to 5 decimal places. A larger number of decimal places results in increased precision and computational time. The default setting of 104 for single-motif tools and 101 for motif comparison tools gives reasonable ”time-precision” tradeoff. 8.4.2 Specifying custom P-value level All tools in MACRO-APE package estimate motif threshold by a P-value for further use. By default P-value level of 0.0005 is assumed. It can be overriden with --pvalue <P-value> or -p <P-value> option key. 8.4.3 Choose proper threshold by a P-value All *-APE tools except ape.FindPvalue and perfectosape.SNPScan perform internal P-value to threshold conversion. Since PWM P-values have discrete distribution a given P-value can be achieved only approximately. A fixed threshold corresponds to the actual P-value which is smaller or larger than the requested P-value. The boundary selection can be done using --boundary <lower|upper>. For model comparison by default we use the upper boundary for the P-value (so even at low given P-values PWMs recognize some words and thus the models can be compared). If searching for a threshold corresponding to the given P-value we report the lower boundary of the P-value by default (to properly control the positive prediction rate corresponding to a given threshold). Note: lower boundary means that P-value will be not greater than the requested one. The threshold for lower P-value will be greater than the threshold for upper boundary P-value. 8.4.4 Limiting CPU and memory consumption It’s possible to create an artificially arranged PWM whose score distribution will grow exponentially with length and thus can take a lot of memory and time for computation. This option is mostly designed to prevent *-APE tools from unnormal CPU and memory consumption. If hash size exceeded a given limit, tools cancel calculations with "Hash overflow" error message. In such case user can manually expand hash size limits or lower discretization level. • --max-hash-size <size>: set the internal hash (used for score distribution calculation) size limit. Default value is 107 19 • --max-2d-hash-size <size>: set the internal two-dimensional hash size limit (used for PWM comparison in MACRO-APE toolbox). Default value is 104 . 9 9.1 Formal math PWM discretization Following the general idea described in Touzet et al. [2007] we can effectively calculate the P-value for a given PWM with a fixed precision and a given threshold value. The algorithm of Touzet et al. efficiently processes matrices with integer elements. The matrices with real values are transformed into integer value matrices by multiplying each value by discretization constant and truncating the decimals. Effectively this is similar to rounding up real values leaving only the fixed number of decimal places. The higher discretization level will result in a more accurate P-value calculation and an increased computational time. Please note, that in contrast to the original Touzet algorithm here we applying ”ceil” operation to the matrix elements (instead of ”floor” in the original paper of Touzet). This allows us to have a strict upper boundary of the threshold for a given P-value. We use the default discretization level of 104 to perform calculations with accuracy up to four significant digits for single-PWM tools from APE toolbox. For motif comparison the straightforward discretization by rounding up to the nearest integer is used by default for a fast and rough search through the motif collection. The default level of 10 (one decimal place) is used for a more precise search of similar motifs. Thus in our case discretization is the transformation as follows: discretized S is S multiplied by discretization level V and rounded up to the nearest integer value. Example: S = 1.6734 discretization V=1 discretized S = d1.6734e = 2 discretization V=10 discretized S = d16.734e = 17 discretization V=100 discretized S = d167.34e = 168 Discretization will generally preserve the word score ranking with the common exception for words that would obtain identical scores. The main advantage of the discretization is decreasing of the number of possible scores so the set of all possible scores can be enumerated more effectively. 9.2 PCM to PWM conversion algorithm Matrix of positional counts (PCM) can be transformed to PWM using the following formula Lifanov et al. [2003]: P W Mα,j = ln P CMα,j + aqα , (W + a)qα 20 (4) where α is a nucleotide (or dinucleotide) index and j is a position index; W is the total weight of the alignment (or the number of aligned words), a is the pseudocount value, and qα is the background probability of nucleotide letter α. Pseudocount is taken by default as the ln W but can be explicitly specified by user. Alignment weight W is typically a total number of aligned words and can be calculated from P a given PCM as a sum of nucleotide counts in a particular column : Wi = α P CMα,i . Wi is the alignment weight for i-th position. Typically each position has the same alignment weight W , but multiple local alignment algorithms may produce positional count matrices with different weights Wi of words covering each position (e.g. flanks can have less weight than a central part of motif). Thus the weight is safer to calculate separately for each motif position. For PCM → PWM conversion *-APE tools use a slightly modified formula: P W Mα,j = ln P CMα,j + aj qα (Wj + aj )qα (5) Here aj is a pseudocount related to j-th position. It can be either fixed for each position or equal to a logarithm of corresponding alignment weight: aj = ln Wj By default all tools accept weight matrices (i.e. already converted using any similar procedure). References Alexander P Lifanov, Vsevolod J Makeev, Anna G Nazina, and Dmitri A Papatsenko. Homotypic regulatory clusters in drosophila. Genome research, 13 (4):579–588, 2003. Utz J Pape, Sven Rahmann, and Martin Vingron. Natural similarity measures between position frequency matrices with an application to clustering. Bioinformatics, 24(3):350–357, 2008. Hélène Touzet, Jean-Stéphane Varré, et al. Efficient and accurate p-value computation for position weight matrices. Algorithms Mol Biol, 2(1510.1186): 1748–7188, 2007. Ilya E Vorontsov, Ivan V Kulakovskiy, and Vsevolod Makeev. Jaccard index based similarity measure to compare transcription factor binding site models. Algorithms for Molecular Biology, 8(1):23, 2013. Ilya E. Vorontsov, Ivan V. Kulakovskiy, Grigory Khimulya, Daria D. Nikolaeva, and Vsevolod J. Makeev. PERFECTOS-APE – Predicting Regulatory Functional Effect of SNPs by Approximate P-value Estimation. In Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms (BIOINFORMATICS-2015), pages 102–108, 2015. ISBN 978-989-758070-3. 21

* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project

Download PDF

advertisement