Acknowledgements

This research project was conducted under the supervision of A/Prof. Lê Cao Kim-Anh for University of Melbourne Mathematics and Statistics Vacation Scholarships Program 2021. This project was funded by a Norman Beischer Medical Research Foundation Grant (2020).

I would like to express my sincere gratitude to my advisor A/Prof. Lê Cao from Melbourne Integrative Genomics for the continuous support of my research. Her immense knowledge in the multi-omics analysis has guided me in all the time of researching and exploring the bioinformatics field.

Furthermore, I would like to thank Prof. Eva Dimitriadis and Dr. Ellen Menkhorst from Gynaecology Research Centre, The Royal Women’s Hospital, for their kind provision of the datasets and invaluable biological inputs that enhances my statistical research results.

This web tool provides an interactive interface to explore the results from multi-omics methods used to study preeclampsia biomarkers. To be used in complementary to the main summary poster here.

Datasets

  • Illumina Novaseq and shotgun proteomics were used to sequence chorionic villus sampling from a cohort of 18 women to obtain (A) mRNA, (B) short RNA, and (C) proteomics count datasets.
  • Of the samples, 8 had a healthy pregnancy, 4 experienced early onset (preterm) preeclampsia, and 6 experienced late onset (term) preeclampsia. This information was encoded in a class vector (D).
  • Genes with low counts were removed and the datasets were then normalised using the TMM method (McCarthy, Chen, & Smyth, 2012).
Schema for datasets. Adapted from @mixOmicsbook

Figure 1: Schema for datasets. Adapted from Lê Cao & Welham (2021)

edgeR

RNA-Seq Differential Expression (DE) analysis univariate tools adopt an inferential approach in statistically testing for a set of individual genes that are differentially expressed between conditions.

  • We used the edgeR pipeline (McCarthy, Chen, & Smyth, 2012) to fit a negative binomial distribution for each gene in (A) and (B).
  • Pairwise comparisons between the three groups were then performed with a likelihood ratio test and adjusted for multiple testing using the Benjamini-Hochberg (Benjamini & Hochberg, 1995) method.
  • Significant genes were labelled upregulated if the log fold change is positive and downregulated otherwise.
Schema for edgeR method

Figure 2: Schema for edgeR method

Click here to see the interactive DE results - Warning: Large data will be loaded

PLS-DA

  • One major drawback of using univariate methods like edgeR in identifying biomarkers is that they do not recognize the interactions between groups of variables as a source of phenotype variations.
  • Projection to Latent Structures (PLS) is a family of multivariate methods that can addressed this issue. PLS-based methods perform matrix decomposition that takes one (single) or more (multi) omics expression matrices as input, and outputs a component matrix and corresponding loading vectors for each omics.
  • Different PLS-based methods define component (a group of related biomarkers) differently according to a specific statistical criterion that is maximised, such as variance, covariance, or correlation (Rohart, Gautier, Singh, & Lê Cao, 2017).
  • Unsupervised methods (e.g., PCA, PLS) only take the omics data as input.
  • Meanwhile, supervised methods (e.g., PLS-DA, DIABLO) take a class vector as an additional input, which is included in the matrix decomposition process. This ensures that the identified components also have maximal covariance with the phenotype / class information, suitable for experimental designs that aim to identify differentially expressed biomarkers in different conditions such as the current study. Hence, the following results will focus on supervised methods.
  • Schematic comparison between unsupervised (PCA) and supervised (PLS-DA) methods for single omics is shown in 3.
Schematic comparison of single-omics methods. Adapted from @mixOmicsbook

Figure 3: Schematic comparison of single-omics methods. Adapted from Lê Cao & Welham (2021)

  • In the sample plot below 4, each sample (i.e., patient) is projected onto the first 2 components using PCA (unsupervised) versus PLS-DA (supervised).
  • Compared to PCA, the plots show that the PLS-DA components are better at separating the samples based on conditions.
  • Most notably, the genes in Component 1 (horizontal axis) may be used to separate PE.term with the other cases, whereas the genes in Component 2 (vertical axis) may be used to separate PE.preterm with the others.

Figure 4: Toggle the slider to see how different projections between PCA (unsupervised) and PLS-DA (supervised).

  • The number of variables per component can be explicitly stated, which will result in applying L1 lasso penalisation on the loading vectors (see figure 3). This ensures that only the most relevant variables are selected.
  • The number of components to decompose into and the number of variables per component are tuned by treating the output as a classification problem and cross-validating using leave-one-out schema.
  • The optimal PLS-DA parameters for the mRNA dataset in this study are 4 components with 20 mRNAs per component.

Figure 5: PLS-DA correlation plot with each variable scaled to its loading factor

Interpreting PLS-DA Results

  • PLS-DA correlation plot 5 shows that the 50 identified mRNAs in the first 2 latent components tend to have strong correlation (the interior and exterior circle corresponds to correlation levels of 0.5 and 1 respectively) and clustered well into separate subsets.
  • The contributions of each gene to the component score are summarized in the table below.
  • Genes on page 1 are from Component 1 (highlighted in beige), while genes on page 2 are from Component 2 (not highlighted).
  • The Loadings values also do not indicate overexpression or underexpression like log fold change in DE Analysis. It simply indicates whether an increase in a specific gene in the combinations increases or decreases the overall component score.
  • As a stochastic algorithm, different runs of sPLSDA can result in different combinations. The Stability values indicate how confident we are that a gene is part of a component by repeatedly training PLS-DA models using cross-validation.
  • The Min Exp and Max Exp columns indicate the group with the lowest / highest median expression of each gene. Note that these also do not indicate overexpression or underexpression, but only as a reference for downstream analysis.

DIABLO

  • DIABLO (block PLS-DA) is the multi-omics variant of PLS-DA.
  • DIABLO integrates all (A), (B), (C) and (D) as its input to maximise covariance between the latent components.
  • Lasso regularisation is applied on each of (A), (B), and (C) to identify the top key variables in each omics layer.
Schema for DIABLO multi-omics method. Adapted from @mixOmicsbook

Figure 6: Schema for DIABLO multi-omics method. Adapted from Lê Cao & Welham (2021)

Figure 7: Toggle the slider to see how different projections between the omics layers

  • Toggling between different omics layers in the sample plot 7 shows that the differences between the sample projections onto the components are not too significant across datasets. This suggests that there is a high level of agreement between the 3 dataset for all samples.
  • The samples are best clustered with the sRNA components, compared to mRNA and proteomics.
  • Sample C6471 is potentially an outlier, as the data from all omics layers agree on its outlying position.
Circos plot for correlations between DIABLO variables

Figure 8: Circos plot for correlations between DIABLO variables

  • DIABLO circos plot 8 identifies a set of strongly correlated 15 mRNAs, 5 sRNAs and 5 proteins that maximises covariance.
  • Orange links (e.g., ERGIC2 and SLC44A3-AS1) indicate strong, positive correlation between the expressions of those genes (increase in 1 gene is usually associated with increase in the other), while purple links indicate strong, negative correlation.
  • Correlations between omics are predominantly negative, implying potential patterns of expression in preeclampsia conditions.
  • The contributions of each variable to the component scores are summarized in the table below, with a similar layout to the PLS-DA result table.
  • Resulting mRNAs are highlighted in blue, sRNAs are highlighted in grey, and proteins are highlighted in orange.

References

Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1), 289–300.
Lê Cao, K.-A., & Welham, Z. (2021). Multivariate data integration using R: Methods and applications with the mixOmics package (in preparation).
McCarthy, D. J., Chen, Y., & Smyth, G. K. (2012). Differential expression analysis of multifactor RNA-seq experiments with respect to biological variation. Nucleic Acids Research, 40(10), 4288–4297. https://doi.org/10.1093/nar/gks042
Rohart, F., Gautier, B., Singh, A., & Lê Cao, K.-A. (2017). mixOmics: An R package for ’omics feature selection and multiple data integration. PLoS Computational Biology, 13(11), e1005752. Retrieved from http://www.mixOmics.org