GeneTrail 2 workflow

Input data

GeneTrail 2 offers two main options for data input: you can either download preprocessed and normalized expression data from GEO or provide a precomputed list of scores. If you want to upload your own data, GeneTrail 2 supports several line-based file formats. In all formats, identifiers must not contain any whitespace, and scores must be given in decimal or scientific notation using "." as the decimal mark. The following section describes the requirements and restrictions associated with these formats.

GSM368771 GSM368772 GSM368773 GSM368774 GSM368775
60496 8.12378 8.37528 8.02755 7.80555 8.49108
605 7.9787 7.95646 8.54249 8.6799 8.38875
6050 7.97796 7.96793 7.80971 7.78868 8.14545
60506 5.49484 5.38719 5.50641 5.78784 5.37093
60509 7.80633 7.85155 7.35756 7.82349 7.3896
6051 8.27485 8.5358 8.26115 8.2277 8.30632
Matrices must have a header line that contains one unique identifier for each sample. Every other line contains one identifier and one expression value per sample.
hsa-let-7a-3p 0.128145025
hsa-let-7b-5p -0.24955694
hsa-let-7b-3p 0.594572760000001
hsa-let-7d-5p 0.00343912500000165
hsa-let-7d-3p 0.349173635
hsa-let-7e-5p -0.0185707299999986
Score lists contain one identifier and its corresponding score per line, separated by whitespace. The given scores should indicate the importance of the identifier.
rs9262636
rs9262635
rs9262615
rs10859313
rs3130000
rs4947296
Identifier lists contain one identifier per line. Depending on the algorithm used, they are assumed to be sorted.
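The format rules above can be illustrated with a small parser. This is a minimal sketch and not the parser used by GeneTrail 2; the function names `parse_score_list` and `parse_matrix` are hypothetical, and the error handling is an assumption about what a strict reader might do.

```python
import io


def parse_score_list(handle):
    """Read 'identifier<whitespace>score' lines into a dict."""
    scores = {}
    for line_number, line in enumerate(handle, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 2:
            raise ValueError(f"line {line_number}: expected 2 fields, got {len(fields)}")
        identifier, value = fields
        # float() accepts decimal and scientific notation with "." as decimal mark
        scores[identifier] = float(value)
    return scores


def parse_matrix(handle):
    """Read a header of sample names followed by 'identifier value ...' rows."""
    samples = handle.readline().split()
    if len(set(samples)) != len(samples):
        raise ValueError("sample identifiers in the header must be unique")
    rows = {}
    for line_number, line in enumerate(handle, start=2):
        fields = line.split()
        if not fields:
            continue
        if len(fields) != len(samples) + 1:
            raise ValueError(f"line {line_number}: expected {len(samples) + 1} fields")
        rows[fields[0]] = [float(value) for value in fields[1:]]
    return samples, rows


# Usage with shortened versions of the examples shown above
scores = parse_score_list(io.StringIO("hsa-let-7a-3p 0.128145025\nhsa-let-7b-5p -0.24955694\n"))
samples, rows = parse_matrix(io.StringIO("GSM368771 GSM368772\n60496 8.12378 8.37528\n"))
```

Splitting on arbitrary whitespace works here only because identifiers are assumed to contain none.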

The Gene Expression Omnibus (GEO) is a MIAME-compliant online database for functional genomics data. Normalized data is stored in the GEO SOFT format, whereas unprocessed data is stored in a platform-dependent raw format. When using a record from GEO, GeneTrail 2 relies on the proper normalization of the stored data. If you want to normalize the data yourself, you will need to obtain and process the raw data from GEO and upload a score file.

The SOFT format is supported for GEO Datasets (GDS) and GEO Series (GSE).

GSE files
are collections of related samples and provide a description of the study design.
GDS files
are curated collections of statistically comparable GEO samples. These samples originate from GSE files that are curated and reassembled by GEO employees.
GeneTrail 2 requires you to select either one GSE record and distribute the contained samples into a sample and reference set or select two GDS records that define your sample and reference set.

Identifier level statistic

Whereas identifier lists and score lists can be used directly as input for computing enrichments or subgraphs, expression matrices first need to be processed into identifier-level scores. This step is needed to assess the amount of differential expression for each biological entity. This section describes methods that can be used to quantify the difference between the sample and reference group.

• Independent Shrinkage t-test
• Independent Student's t-test
• Paired Student's t-test

Parametric tests are hypothesis tests, which assume the data to be generated by a certain probability distribution and that estimate the parameters of this distribution from given samples [1]. Parametric tests can achieve a higher accuracy and a higher precision than non-parametric ones, if the assumptions about the probability distribution are correct [2]. But if the assumptions are incorrect, these methods might be deceptive.

In our work, we focus on parametric tests called t-tests. T-tests are a family of statistical hypothesis tests [3] whose test statistics follow a Student's t distribution [4]. They can be used to test assumptions about the population mean. All presented t-tests are accurate if the populations are normally distributed and may be regarded as approximate if this is not the case [5].
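To make the test statistics concrete, the following sketch computes the textbook independent (pooled-variance) and paired Student's t statistics for the expression values of a single identifier. It is an illustration under these textbook definitions, not the GeneTrail 2 implementation, and it does not cover the shrinkage t-test. The function names are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def independent_t(sample, reference):
    """Student's t statistic for two independent groups (pooled variance)."""
    n1, n2 = len(sample), len(reference)
    pooled = ((n1 - 1) * variance(sample) + (n2 - 1) * variance(reference)) / (n1 + n2 - 2)
    return (mean(sample) - mean(reference)) / sqrt(pooled * (1 / n1 + 1 / n2))


def paired_t(sample, reference):
    """Student's t statistic for paired observations."""
    if len(sample) != len(reference):
        raise ValueError("paired test needs the same number of values in both groups")
    differences = [s - r for s, r in zip(sample, reference)]
    return mean(differences) / sqrt(variance(differences) / len(differences))


# One score per identifier: positive if the sample group is higher on average
t_independent = independent_t([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
t_paired = paired_t([2.0, 4.0, 7.0], [1.0, 2.0, 4.0])
```

Both functions need at least two values per group, because the sample variance is undefined otherwise.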

• In case your data uses a quantitative scale the t-tests should be appropriate.
• The Shrinkage t-test is more robust than the standard t-tests, since this approach makes it possible to control the influence of outliers. For this reason, this method should always be preferred when sample sizes are small.
• The paired Student's t-test should be applied if all samples of the two groups are obtained in pairs. Apart from population differences, the observations in each pair should be carried out under identical, or almost identical, conditions [5].
• Wilcoxon Rank Sum Test
• Wilcoxon Matched Pairs Signed Rank Test

In comparison to parametric tests, non-parametric methods make fewer assumptions about the analyzed data; for example, they do not rely on probability distributions of the assessed variables [6]. Because they rely on fewer assumptions, these approaches are more robust and may be applied in situations where less is known about the analyzed data. For example, non-parametric methods can be applied to samples that have a ranking but no clear numerical interpretation, such as when assessing preferences.

Both implemented tests are based solely on the order of the values in the two samples. They can be used to test if two samples are drawn from populations with the same underlying distribution.
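The dependence on order alone can be seen in a small sketch of the rank-sum statistic. The code below assigns average ranks to tied values, which is a common convention but an assumption here; it computes only the statistic, not a p-value, and the function names are hypothetical.

```python
def midranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def rank_sum(sample, reference):
    """Sum of the ranks of the sample group within the pooled values."""
    ranks = midranks(list(sample) + list(reference))
    return sum(ranks[:len(sample)])


# Rescaling the values does not change the statistic as long as the order is kept
statistic_a = rank_sum([1, 2], [3, 4])
statistic_b = rank_sum([10, 20], [300, 4000])
```

Because only ranks enter the statistic, a single extreme value cannot dominate the result the way it can for a mean-based test.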

• The Wilcoxon Matched Pairs Signed Rank Test should be applied if all samples of the two groups are obtained in pairs.
• Pearson correlation coefficient
• Spearman correlation coefficient

As an alternative to hypothesis tests, correlation coefficients can be applied if both groups contain more than 15 samples. Correlation coefficients are measures for linear dependence between two variables X and Y. They range from -1 to 1. A value of 1 implies that the relationship between X and Y is perfectly described by a linear function, with all data points lying on a line for which both X and Y increase. A value of -1 implies that all data points lie on a line for which X increases as Y decreases. A value of 0 implies that there is no linear dependence between the variables.
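The two coefficients can be sketched as follows. The code illustrates only the coefficients themselves; how GeneTrail 2 derives identifier-level scores from them is not described here. The Spearman variant is written as the Pearson coefficient of the ranks and, as a simplifying assumption, does not handle tied values. All function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator  # fails if one sequence is constant


def ranks_without_ties(values):
    """1-based ranks; assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return pearson(ranks_without_ties(x), ranks_without_ties(y))


# A monotonic but non-linear relationship
x = [1, 2, 3, 4]
y = [1, 4, 9, 16]
r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
```

In this example the Spearman coefficient reaches its maximum because the order of the values agrees perfectly, while the Pearson coefficient stays below 1 because the points do not lie on a line.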

• The Spearman correlation should be used if the order of the samples is more important than the actual value.
• Z-score
• Log-Mean-Fold-Quotients
• Mean-Fold-Quotients
• Mean-Fold-Difference

If, however, the sample group consists of only one measurement (e.g. for diagnostic purposes), the Z-score or the fold change has to be used, as none of the other methods is applicable in this case.

• For better interpretability, it is common to use the logarithm of the fold change.
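The scores listed above are not defined formally in this section. The sketch below uses the usual textbook definitions as an assumption: the quotient or difference of the group means, the base 2 logarithm of the quotient, and a Z-score that standardizes a single measurement against the reference group. The function names and the choice of base 2 are illustrative and may differ from the GeneTrail 2 implementation.

```python
from math import log2
from statistics import mean, stdev


def mean_fold_quotient(sample, reference):
    """Ratio of the group means (assumes a non-zero reference mean)."""
    return mean(sample) / mean(reference)


def log_mean_fold_quotient(sample, reference):
    """Base 2 logarithm of the ratio of the group means (assumed base)."""
    return log2(mean_fold_quotient(sample, reference))


def mean_fold_difference(sample, reference):
    """Difference of the group means."""
    return mean(sample) - mean(reference)


def z_score(value, reference):
    """Standardize one measurement against the reference group."""
    return (value - mean(reference)) / stdev(reference)


quotient = mean_fold_quotient([8.0, 8.0], [2.0, 2.0])
log_quotient = log_mean_fold_quotient([8.0, 8.0], [2.0, 2.0])
single_sample_score = z_score(7.0, [1.0, 2.0, 3.0])
```

On the logarithmic scale, a doubling and a halving give scores of equal magnitude and opposite sign, which is what makes the logarithm easier to interpret.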

Score transformation

In some cases the result of an analysis can be improved by transforming the original scores. For example, Ackermann and Strimmer [7] show that squared values improve the detection of categories containing both up and down-regulated genes.

Users can choose from the following transformations:

• Absolute scores
• Logarithmized scores (natural logarithm)
• Logarithmized scores (base 2)
• Logarithmized scores (base 10)
• Squared scores
• Square root of scores
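The listed transformations map each score to a new value independently. A minimal sketch follows, in which the dictionary keys and the function name `transform_scores` are hypothetical. How GeneTrail 2 treats scores for which a transformation is undefined (for example the logarithm of a negative score) is not stated here; in this sketch such a score simply raises an error.

```python
import math

# Hypothetical names for the six transformations listed above
TRANSFORMATIONS = {
    "abs": abs,
    "ln": math.log,
    "log2": math.log2,
    "log10": math.log10,
    "square": lambda score: score * score,
    "sqrt": math.sqrt,
}


def transform_scores(scores, name):
    """Apply one named transformation to every score in a dict."""
    function = TRANSFORMATIONS[name]
    return {identifier: function(score) for identifier, score in scores.items()}


scores = {"geneA": -2.0, "geneB": 3.0}
squared = transform_scores(scores, "square")
absolute = transform_scores(scores, "abs")
```

Squaring and taking absolute values both discard the direction of regulation, which is why they help with categories that contain up- and down-regulated genes at the same time.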

Enrichment analysis

High-throughput techniques such as genome sequencing, microarrays, and mass spectrometry have revolutionized biomedical research by enabling comprehensive monitoring of huge biological systems. Irrespective of the technology used, analysis of high-throughput data typically yields a list of differentially expressed biological entities such as genes, miRNAs or proteins. This list is extremely useful in identifying entities that may have important roles in pathological mechanisms. Enrichment analysis of molecular signatures is a natural extension of the study of individual genes or proteins. The general idea of all enrichment methods is to assess whether a certain category $C$ is significantly enriched or depleted in the analyzed data. A category is a set of biological entities like genes, proteins, or metabolites that are associated with a certain biological process, molecular function, or any molecular signature that might be of interest. The category is used to divide the input data into two groups: entries that are contained in the category and entries that are not. Based on this information, a statistical test is applied that computes the differences between these two groups. Focusing on groups rather than on individual biological entities has several benefits. From a mathematical point of view, the analysis of groups instead of individual entities is advantageous, as this typically increases power and reduces the dimensionality of the underlying statistical problem [7]. From the biological perspective, identifying molecular signatures that differ between two conditions can have more explanatory power than a simple list of differentially expressed genes or miRNAs [8].

In the end a category is declared significantly enriched if the upper-tailed p-value of a test is significant and depleted if the lower-tailed p-value is significant.

$$P_{enriched} = P(X \ge x)$$ $$P_{depleted} = P(X \le x)$$
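As an illustration of the two tails, the sketch below computes both p-values for an over-representation analysis using the hypergeometric distribution. The hypergeometric test is a common choice for this kind of analysis, but it is an assumption here and not necessarily the test GeneTrail 2 applies; the function and parameter names are hypothetical.

```python
from math import comb


def ora_p_values(hits, population, category_size, test_set_size):
    """Upper- and lower-tailed hypergeometric p-values.

    hits           number of test set entries that belong to the category (x)
    population     number of entries in the reference set
    category_size  number of reference entries that belong to the category
    test_set_size  number of entries in the test set
    """
    def pmf(k):
        return (comb(category_size, k)
                * comb(population - category_size, test_set_size - k)
                / comb(population, test_set_size))

    lowest = max(0, test_set_size - (population - category_size))
    highest = min(category_size, test_set_size)
    p_enriched = sum(pmf(k) for k in range(hits, highest + 1))  # P(X >= x)
    p_depleted = sum(pmf(k) for k in range(lowest, hits + 1))   # P(X <= x)
    return p_enriched, p_depleted


# All 5 test set entries fall into a category covering 5 of 10 reference entries
p_enriched, p_depleted = ora_p_values(hits=5, population=10, category_size=5, test_set_size=5)
```

Both tails include the observed value $x$, so the two p-values add up to slightly more than 1.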
• Over-representation analysis (ORA)
• Weighted gene set enrichment analysis
• Gene set enrichment analysis (GSEA)
• Averaging methods (mean, median, sum)
• Maxmean statistic
• One sample t-test
• Welch's t-test
• Wilcoxon rank-sum test

Extensive reviews on enrichment methods ([7], [9], [10], [11], [12], [13], [14]) have been published and reveal that no real gold standard exists. This is because each of the proposed methods is based on a different definition of enriched categories (different null hypotheses), which makes their results incomparable in general. Instead of using a single “magic bullet”, an appropriate algorithm needs to be chosen carefully for each individual research task.

• In case you want to analyze a small set of biological entities (like the most significant ones), or there are no scores that indicate the importance of each entry in the data, an Over-Representation Analysis (ORA) has to be performed.
• However, if information about the extent of regulation (e.g., fold-changes, t-scores, etc.) is present, one of the other methods should be used instead. For non-expert users, we recommend using the Gene Set Enrichment Analysis (GSEA), as this is a popular and robust method.
• GeneTrail 2 offers the possibility to perform multiple enrichments and to compare them in order to reach an even higher sensitivity or specificity. For this reason two modes are available. While the union mode displays all categories that are significant in at least one enrichment, the intersection mode only displays categories that are significant in all. Whereas the union mode is useful for detecting variability between related enrichments, the intersection mode can be used to reduce the number of false positives by computing and comparing two or more enrichments using different algorithms. Using these modes, the user is able to effectively balance the sensitivity and specificity of an analysis in a straightforward manner.
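The two comparison modes correspond to plain set operations on the significant categories of each enrichment. A minimal sketch, in which the function name and the category names are made up for illustration:

```python
def combine_enrichments(significant_sets, mode):
    """Combine the significant categories of several enrichments.

    significant_sets  one collection of category names per enrichment
    mode              "union" or "intersection"
    """
    sets = [set(categories) for categories in significant_sets]
    if not sets:
        return set()
    if mode == "union":
        return set.union(*sets)
    if mode == "intersection":
        return set.intersection(*sets)
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical results of two enrichments computed with different algorithms
first = {"apoptosis", "cell cycle", "p53 signaling"}
second = {"cell cycle", "p53 signaling", "DNA repair"}

in_any = combine_enrichments([first, second], "union")
in_all = combine_enrichments([first, second], "intersection")
```

The intersection can only shrink as more enrichments are added, whereas the union can only grow, which is the trade-off between specificity and sensitivity described above.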

Multiple testing correction

In an enrichment analysis, multiple categories are tested simultaneously. For each individual test, the same significance threshold $\alpha$ is used to judge whether a category is significant. This means that each test has probability $\alpha$ of making a false positive prediction (Type-I-Error). The problem with multiple testing is that this probability accumulates over the tests.

For $k$ independent tested hypotheses, this probability is:

$$P(\text{at least one significant result}) = 1-(1-\alpha)^k$$
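A short calculation shows how quickly this probability grows. The sketch evaluates the formula above directly; the function name is hypothetical, and the formula itself presumes independent tests for which all null hypotheses are true.

```python
def prob_at_least_one_false_positive(alpha, k):
    """Evaluate 1 - (1 - alpha)^k for k independent tests."""
    return 1 - (1 - alpha) ** k


single_test = prob_at_least_one_false_positive(0.05, 1)
twenty_tests = prob_at_least_one_false_positive(0.05, 20)
hundred_tests = prob_at_least_one_false_positive(0.05, 100)
```

With $\alpha = 0.05$, twenty tests already give a probability of roughly 0.64 of at least one false positive, and one hundred tests bring it above 0.99.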

Multiple testing procedures adjust p-values derived from multiple statistical tests to correct for the number of false positive predictions (Type-I-Error). See [15] [16] for a general overview of p-value adjustment algorithms.

• Bonferroni
• Sidak
• Holm-Sidak
• Finner
• Hochberg

When performing multiple hypotheses tests, the familywise error rate (FWER) is the probability of making at least one false positive prediction, or Type-I-Error, among all the tested null hypotheses [15].

$$FWER = Pr(|FP| > 0)$$
• Benjamini-Hochberg
• Benjamini-Yekutieli

The false discovery rate (FDR) is the expected proportion of Type-I-Errors among all rejected null hypotheses [15].

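$FDR=E\left(\frac{FP}{FP+TP}\right)$, where $\frac{FP}{FP+TP}=0$ when $FP=TP=0$. Here, $FP$ is the number of false positives and $TP$ the number of true positives, so that $FP+TP$ is the number of rejected null hypotheses.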

FDR-controlling adjustments are less conservative than adjustments controlling the familywise error rate [17] [15].

All methods described above can be used to control the number of false positive predictions. This is generally needed to improve the interpretation of the results. The choice of the p-value adjustment method can be used to adapt the results in order to achieve a higher sensitivity or specificity. Conservative methods like the Bonferroni correction or the Benjamini-Yekutieli procedure can be used to obtain a higher specificity, while more liberal methods, like the Hochberg method or the Benjamini-Hochberg adjustment, can be used to achieve a high sensitivity and still reduce the number of false positive predictions.
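To show what such an adjustment does in practice, the following sketch implements one FWER-controlling method (Bonferroni) and one FDR-controlling method (Benjamini-Hochberg) in their standard textbook form. It is not taken from the GeneTrail 2 code base, and the function names are hypothetical.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    # Visit the p-values from largest to smallest
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running_minimum = 1.0
    for position, index in enumerate(order):
        rank = m - position  # 1-based rank in ascending order
        running_minimum = min(running_minimum, p_values[index] * m / rank)
        adjusted[index] = running_minimum
    return adjusted


p_values = [0.01, 0.04, 0.03, 0.005]
fwer_adjusted = bonferroni(p_values)
fdr_adjusted = benjamini_hochberg(p_values)
```

For the same input, the Benjamini-Hochberg values are never larger than the Bonferroni values, which reflects the more liberal behavior of FDR control.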

Subgraph analysis

The deregulation of biochemical pathways is known to play a crucial role in diseases like cancer or Parkinson's disease. Hence, calculating such deregulated pathways may help to gain new insights into pathogenic mechanisms and may open novel avenues for therapy stratification in the sense of personalized medicine. Subgraph analysis algorithms make it possible to detect deregulated pathways in biological networks such as KEGG [18] or STRING [19] based on gene expression data.

The following algorithms are currently integrated:

• Subgraph ILP
• FiDePa

• FiDePa detects the most deregulated paths in the network.
• The Subgraph ILP detects the most deregulated subgraphs.
• For non-expert users, we recommend using our Subgraph ILP, as this method delivers more interpretable results.

Bibliography

1. Geisser, Seymour and Johnson, Wesley O, Modes of parametric statistical inference, John Wiley & Sons
2. Hoskin, Tanya, Parametric and nonparametric: Demystifying the terms, Mayo Clinic CTSA BERD Resource. Retrieved from http://www.mayo.edu/mayo-edudocs/center-for-translational-science-activities-documents/berd-5-6.pdf
3. Student, The probable error of a mean, Biometrika, JSTOR
4. Rinne, Horst, Taschenbuch der Statistik, Harri Deutsch Verlag
5. Kanji, Gopal K, 100 statistical tests, Sage
6. Corder, Gregory W and Foreman, Dale I, Nonparametric statistics for non-statisticians: a step-by-step approach, John Wiley & Sons
7. Ackermann, Marit and Strimmer, Korbinian, A general modular framework for gene set enrichment analysis, BMC Bioinformatics
8. Glazko, Galina V and Emmert-Streib, Frank, Unite and conquer: univariate and multivariate approaches for finding differentially expressed gene sets, Bioinformatics, Oxford Univ Press
9. Huang, Da Wei and Sherman, Brad T and Lempicki, Richard A, Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists, Nucleic acids research, Oxford Univ Press
10. Hung, Jui-Hung and Yang, Tun-Hsiang and Hu, Zhenjun and Weng, Zhiping and DeLisi, Charles, Gene set enrichment analysis: performance evaluation and usage guidelines, Briefings in bioinformatics, Oxford Univ Press
11. Khatri, Purvesh and Sirota, Marina and Butte, Atul J, Ten years of pathway analysis: current approaches and outstanding challenges, PLoS computational biology, Public Library of Science
12. Naeem, Haroon and Zimmer, Ralf and Tavakkolkhah, Pegah and Küffner, Robert, Rigorous assessment of gene set enrichment tests, Bioinformatics, Oxford Univ Press
13. Nam, Dougu and Kim, Seon-Young, Gene-set approach for expression pattern analysis, Briefings in bioinformatics, Oxford Univ Press
14. Song, Sarah and Black, Michael A, Microarray-based gene set analysis: a comparison of current methods, BMC bioinformatics, BioMed Central Ltd
15. SAS, p-Value Adjustments - SAS/STAT(R) 9.22 User's Guide
16. Westfall, Peter H, Resampling-based multiple testing: Examples and methods for p-value adjustment, John Wiley & Sons
17. Hochberg, Yosef and Benjamini, Yoav, More powerful procedures for multiple significance testing, Statistics in medicine, Wiley Online Library
18. Kanehisa, Minoru and Goto, Susumu, KEGG: kyoto encyclopedia of genes and genomes, Nucleic acids research, Oxford Univ Press
19. Szklarczyk, Damian and Franceschini, Andrea and Wyder, Stefan and Forslund, Kristoffer and Heller, Davide and Huerta-Cepas, Jaime and Simonovic, Milan and Roth, Alexander and Santos, Alberto and Tsafou, Kalliopi P and others, STRING v10: protein-protein interaction networks, integrated over the tree of life, Nucleic acids research, Oxford Univ Press