Similarity-driven multi-view embeddings from high-dimensional biomedical data

A Publisher Correction to this article was published on 04 March 2021

This article has been updated

A preprint version of the article is available at arXiv.

Abstract

Diverse, high-dimensional modalities collected in large cohorts present new opportunities for the formulation and testing of integrative scientific hypotheses. Similarity-driven multi-view linear reconstruction (SiMLR) is an algorithm that exploits inter-modality relationships to transform large scientific datasets into smaller, better-powered and more interpretable low-dimensional spaces. SiMLR contributes an objective function for identifying joint signal, a regularization based on sparse matrices representing prior within-modality relationships, and an implementation that permits joint reduction of large data matrices. We demonstrate that SiMLR outperforms closely related methods on supervised learning problems in simulated data, a multi-omics cancer survival prediction dataset and multi-modality neuroimaging datasets. Taken together, these results show that SiMLR may be applied to joint signal estimation from disparate modalities and may yield practically useful results in a variety of application domains.

Fig. 1: An overview of the SiMLR workflow.
Fig. 2: SiMLR simulation study results for sensitivity to noise and ability to recover the signal.
Fig. 3: Fully supervised brain age prediction and performance comparison with SGCCA for PTBP data.

Data availability

All visualized plots in the main manuscript are generated from our code capsule (ref. 80), which contains both the specific data sources and software calls necessary to reproduce the figures.

The simulation data are built dynamically in R. The scripts that generate the data are publicly available in our code capsule (ref. 80).

We downloaded evaluation data from the multi-omic cancer benchmark (ref. 47) website at http://acgt.cs.tau.ac.il/multi_omic_benchmark/download.html. Data are available in our code capsule (ref. 80) along with the relevant statistical details and calls needed to reproduce the results reported here. The data are free to use with no restrictions.

The brain age data used here were obtained from PTBP (ref. 81). These data were originally downloaded from https://figshare.com/articles/dataset/The_Pediatric_Template_of_Brain_Perfusion_PTBP_/923555. The relevant subset is available in our code capsule (ref. 80). The data are free to use with no restrictions.

Supplementary data used here were obtained from the PING study database (https://chd.ucsd.edu/research/ping-study.html). PING requires a user to register and request data. The review of the request may also require institutional support and justification of data use. We originally gained access to these data in 2013 as part of the PING-in-a-box service, which is now defunct.

Data used here were also obtained from the ADNI database (http://adni.loni.usc.edu). ADNI was launched in 2003 as a public-private partnership, led by M. W. Weiner. The primary goal of ADNI has been to test whether serial magnetic resonance imaging, positron emission tomography, other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment and early Alzheimer’s disease. For up-to-date information, see http://adni.loni.usc.edu. The investigators within ADNI contributed to the design and implementation of ADNI and/or provided data, but did not participate in the analysis or writing of this report. A complete listing of ADNI investigators can be found at http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Authorship_List.pdf. ADNI requires a user to register and request data. The review of the request may also require institutional support and justification of data use. We originally gained access to these data in 2008. The version used in the Supplementary Information was downloaded in August 2020 from LONI.

Code availability

ANTsR is open source and freely available at https://github.com/ANTsX/ANTsR. Development of the code on GitHub is ongoing. The specific release version of the code and the scripts used for the analysis and generation of figures in the main body of this manuscript are available in our code capsule (ref. 80).

References

  1. Cole, J. H., Marioni, R. E., Harris, S. E. & Deary, I. J. Brain age and other bodily ‘ages’: implications for neuropsychiatry. Mol. Psychiatry 24, 266–281 (2019).

  2. Wray, N. R. et al. Genome-wide association analyses identify 44 risk variants and refine the genetic architecture of major depression. Nat. Genet. 50, 668–681 (2018).

  3. Habeck, C., Stern, Y. & Alzheimer’s Disease Neuroimaging Initiative. Multivariate data analysis for neuroimaging data: overview and application to Alzheimer’s disease. Cell Biochem. Biophys. 58, 53–67 (2010).

  4. Shamy, J. L. et al. Volumetric correlates of spatiotemporal working and recognition memory impairment in aged rhesus monkeys. Cereb. Cortex 21, 1559–1573 (2011).

  5. McKeown, M. J. et al. Analysis of fMRI data by blind separation into independent spatial components. Hum. Brain Mapp. 6, 160–188 (1998).

  6. Calhoun, V. D., Adali, T., Pearlson, G. D. & Pekar, J. J. A method for making group inferences from functional MRI data using independent component analysis. Hum. Brain Mapp. 14, 140–151 (2001).

  7. Calhoun, V. D., Liu, J. & Adali, T. A review of group ICA for fMRI data and ICA for joint inference of imaging, genetic, and ERP data. Neuroimage 45, S163–S172 (2009).

  8. Avants, B. B., Cook, P. A., Ungar, L., Gee, J. C. & Grossman, M. Dementia induces correlated reductions in white matter integrity and cortical thickness: a multivariate neuroimaging study with sparse canonical correlation analysis. Neuroimage 50, 1004–1016 (2010).

  9. de Pierrefeu, A. et al. Structured sparse principal components analysis with the TV-elastic net penalty. IEEE Trans. Med. Imaging 37, 396–407 (2018).

  10. Du, L. et al. Structured sparse canonical correlation analysis for brain imaging genetics: an improved GraphNet method. Bioinformatics 32, 1544–1551 (2016).

  11. Avants, B. et al. Sparse unbiased analysis of anatomical variance in longitudinal imaging. In Proc. International Conference on Medical Image Computing and Computer-Assisted Intervention (eds Jiang, T. et al.) 324–331 (Springer, 2010).

  12. Avants, B. B. et al. Sparse canonical correlation analysis relates network-level atrophy to multivariate cognitive measures in a neurodegenerative population. Neuroimage 84, 698–711 (2014).

  13. Du, L. et al. in Brain Informatics and Health (eds Guo, Y. et al.) 275–284 (Springer, 2015).

  14. Guigui, N. et al. Network regularization in imaging genetics improves prediction performances and model interpretability on Alzheimer’s disease. In Proc. IEEE 16th International Symposium on Biomedical Imaging. 1403–1406 (IEEE, 2019).

  15. Lee, D. D. & Seung, H. S. Learning the parts of objects by non-negative matrix factorization. Nature 401, 788–791 (1999).

  16. Chalise, P. & Fridley, B. L. Integrative clustering of multi-level ‘omic data based on non-negative matrix factorization algorithm. PLoS ONE 12, e0176278 (2017).

  17. Dhillon, P. et al. Subject-specific functional parcellation via Prior Based Eigenanatomy. Neuroimage 99, 14–27 (2014).

  18. Tikhonov, A. N. On the stability of inverse problems. Dokl. Akad. Nauk SSSR 39, 195–198 (1943).

  19. Bell, J. B. Solutions of ill-posed problems. Math. Comput. 32, 1320–1322 (1978).

  20. Smilde, A. K., Westerhuis, J. A. & de Jong, S. A framework for sequential multiblock component methods. J. Chemom. 17, 323–337 (2003).

  21. Tenenhaus, A. & Tenenhaus, M. Regularized generalized canonical correlation analysis. Psychometrika 76, 257–284 (2011).

  22. Tenenhaus, M., Tenenhaus, A. & Groenen, P. J. Regularized generalized canonical correlation analysis: a framework for sequential multiblock component methods. Psychometrika 82, 737–777 (2017).

  23. Zhan, Z., Ma, Z. & Peng, W. Biomedical data analysis based on multi-view intact space learning with geodesic similarity preserving. Neural Processing Lett. 49, 1381–1398 (2019).

  24. Baltrušaitis, T., Ahuja, C. & Morency, L. P. Multimodal machine learning: a survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 41, 423–443 (2018).

  25. Kettenring, J. R. Canonical analysis of several sets of variables. Biometrika 58, 433–451 (1971).

  26. Tenenhaus, A. et al. Variable selection for generalized canonical correlation analysis. Biostatistics 15, 569–583 (2014).

  27. Rohart, F., Gautier, B., Singh, A. & Lê Cao, K.-A. mixOmics: an R package for ‘omics feature selection and multiple data integration. PLoS Comput. Biol. 13, e1005752 (2017).

  28. Garali, I. et al. A strategy for multimodal data integration: application to biomarkers identification in spinocerebellar ataxia. Brief. Bioinform. 19, 1356–1369 (2017).

  29. Gloaguen, A. et al. Multiway generalized canonical correlation analysis. Biostatistics https://doi.org/10.1093/biostatistics/kxaa010 (2020).

  30. Hotelling, H. The most predictable criterion. J. Educ. Psychol. 26, 139–142 (1935).

  31. Hotelling, H. Relations between two sets of variates. Biometrika 28, 321–377 (1936).

  32. Lock, E. F., Hoadley, K. A., Marron, J. S. & Nobel, A. B. Joint and individual variation explained (JIVE) for integrated analysis of multiple data types. Ann. Appl. Stat. 7, 523–542 (2013).

  33. Yu, Q., Risk, B. B., Zhang, K. & Marron, J. S. JIVE integration of imaging and behavioral data. Neuroimage 152, 38–49 (2017).

  34. Ceulemans, E., Wilderjans, T. F., Kiers, H. A. & Timmerman, M. E. MultiLevel simultaneous component analysis: a computational shortcut and software package. Behav. Res. Methods 48, 1008–1020 (2016).

  35. Argelaguet, R. et al. Multi-omics factor analysis–a framework for unsupervised integration of multi-omics data sets. Mol. Syst. Biol. 14, e8124 (2018).

  36. Carmichael, I. et al. Joint and individual analysis of breast cancer histologic images and genomic covariates. Preprint at https://arxiv.org/abs/1912.00434 (2019).

  37. McMillan, C. T. et al. White matter imaging helps dissociate tau from TDP-43 in frontotemporal lobar degeneration. J. Neurol. Neurosurg. Psychiatry 84, 949–955 (2013).

  38. McMillan, C. T. et al. Genetic and neuroanatomic associations in sporadic frontotemporal lobar degeneration. Neurobiol. Aging 35, 1473–1482 (2014).

  39. Cook, P. A. et al. Relating brain anatomy and cognitive ability using a multivariate multimodal framework. Neuroimage 99, 477–486 (2014).

  40. Hyvärinen, A. & Oja, E. Independent component analysis: a tutorial. In Notes for International Joint Conference on Neural Networks (IJCNN, 1999).

  41. Hyvärinen, A. & Oja, E. Independent component analysis: algorithms and applications. Neural Networks 13, 411–430 (2000).

  42. Haykin, S. & Chen, Z. The cocktail party problem. Neural Comput. 17, 1875–1902 (2005).

  43. Andersen, P. K. & Gill, R. D. Cox’s regression model for counting processes: a large sample study. Ann. Stat. 10, 1100–1120 (1982).

  44. Fox, J. & Weisberg, S. An R Companion to Applied Regression 2nd edn (2011).

  45. Huang, L. et al. Development and validation of a prognostic model to predict the prognosis of patients who underwent chemotherapy and resection of pancreatic adenocarcinoma: a large international population-based cohort study. BMC Med. 17, 1–16 (2019).

  46. Neums, L., Meier, R., Koestler, D. C. & Thompson, J. A. Improving survival prediction using a novel feature selection and feature reduction framework based on the integration of clinical and molecular data. Pac. Symp. Biocomput. 25, 415–426 (2020).

  47. Rappoport, N. & Shamir, R. Multi-omic and multi-view clustering algorithms: review and cancer benchmark. Nucleic Acids Res. 46, 10546–10562 (2018).

  48. Goodwin, S., McPherson, J. D. & McCombie, W. R. Coming of age: ten years of next-generation sequencing technologies. Nat. Rev. Genet. 17, 333–351 (2016).

  49. Yong, W.-S., Hsu, F.-M. & Chen, P.-Y. Profiling genome-wide DNA methylation. Epigenetics Chromatin 9, 1–16 (2016).

  50. Ozsolak, F. & Milos, P. M. RNA sequencing: advances, challenges and opportunities. Nat. Rev. Genet. 12, 87–98 (2011).

  51. Witten, D. M., Tibshirani, R. & Hastie, T. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 10, 515–534 (2009).

  52. Barnhart, H. X., Haber, M. & Song, J. Overall concordance correlation coefficient for evaluating agreement among multiple observers. Biometrics 58, 1020–1027 (2002).

  53. Avants, B. B. et al. The pediatric template of brain perfusion. Sci. Data 2, 1–17 (2015).

  54. Kandel, B. M., Wang, D. J., Detre, J. A., Gee, J. C. & Avants, B. B. Decomposing cerebral blood flow MRI into functional and structural components: a non-local approach based on prediction. Neuroimage 105, 156–170 (2015).

  55. Tustison, N. J. et al. Logical circularity in voxel-based analysis: normalization strategy may induce statistical bias. Hum. Brain Mapp. 35, 745–759 (2014).

  56. Franke, K. & Gaser, C. Ten years of BrainAGE as a neuroimaging biomarker of brain aging: what insights have we gained? Front. Neurol. 10, 789 (2019).

  57. Jernigan, T. L. et al. The pediatric imaging, neurocognition, and genetics (PING) data repository. Neuroimage 124, 1149–1154 (2016).

  58. Bro, R., Kjeldahl, K., Smilde, A. K. & Kiers, H. A. Cross-validation of component models: a critical look at current methods. Anal. Bioanal. Chem. 390, 1241–1251 (2008).

  59. Bickel, S. & Scheffer, T. Multi-view clustering. In Proc. IEEE International Conference on Data Mining. 19–26 (ICDM, 2004).

  60. Wang, Y., Wu, L., Lin, X. & Gao, J. Multiview spectral clustering via structured low-rank matrix factorization. IEEE Trans. Neural Netw. Learn. Syst. 29, 4833–4843 (2018).

  61. De Vito, R., Bellio, R., Trippa, L. & Parmigiani, G. Multi-study factor analysis. Biometrics 75, 337–346 (2019).

  62. Eddelbuettel, D. & Balamuta, J. J. Extending R with C++: a brief introduction to Rcpp. Am. Stat. 72, 28–36 (2018).

  63. Avants, B. B., Johnson, H. J. & Tustison, N. J. Neuroinformatics and the Insight Toolkit. Front. Neuroinform. 9, 5 (2015).

  64. Avants, B. B. et al. A reproducible evaluation of ANTs similarity metric performance in brain image registration. Neuroimage 54, 2033–2044 (2011).

  65. Muschelli, J. et al. Neuroconductor: an R platform for medical imaging analysis. Biostatistics 20, 218–239 (2019).

  66. Zou, H., Hastie, T. & Tibshirani, R. Sparse principal component analysis. J. Comput. Graph. Stat. 15, 265–286 (2006).

  67. Shen, H. & Huang, J. Z. Sparse principal component analysis via regularized low rank matrix approximation. J. Multivar. Anal. 99, 1015–1034 (2008).

  68. Jolliffe, I. T., Trendafilov, N. T. & Uddin, M. A modified principal component technique based on the LASSO. J. Comput. Graph. Stat. 12, 531–547 (2003).

  69. Lin, C. J. Projected gradient methods for nonnegative matrix factorization. Neural Comput. 19, 2756–2779 (2007).

  70. Jain, P., Netrapalli, P. & Sanghavi, S. Low-rank matrix completion using alternating minimization. In Proc. 45th Annual ACM Symposium on Theory of Computing. 665–674 (ACM, 2013).

  71. Blumensath, T. & Davies, M. E. Iterative hard thresholding for compressed sensing. Appl. Comput. Harmon. Anal. 27, 265–274 (2009).

  72. Pustina, D., Avants, B., Faseyitan, O. K., Medaglia, J. D. & Coslett, H. B. Improved accuracy of lesion to symptom mapping with multivariate sparse canonical correlations. Neuropsychologia 115, 154–166 (2018).

  73. Hanafi, M. PLS path modelling: computation of latent variables with the estimation mode B. Comput. Stat. 22, 275–292 (2007).

  74. Tenenhaus, A., Philippe, C. & Frouin, V. Kernel generalized canonical correlation analysis. Comput. Stat. Data Anal. 90, 114–131 (2015).

  75. Malkov, Y. A. & Yashunin, D. A. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Trans. Pattern Anal. Mach. Intell. 42, 824–836 (2018).

  76. Hill, W. G. & Robertson, A. Linkage disequilibrium in finite populations. Theor. Appl. Genet. 38, 226–231 (1968).

  77. Bahmani, S. & Raj, B. A unifying analysis of projected gradient descent for ℓp-constrained least squares. Appl. Comput. Harmon. Anal. 34, 366–378 (2013).

  78. Martí, R., Resende, M. G. & Ribeiro, C. C. Multi-start methods for combinatorial optimization. Eur. J. Oper. Res. 226, 1–8 (2013).

  79. Jernigan, T. L. et al. The pediatric imaging, neurocognition, and genetics (PING) data repository. Neuroimage 124, 1149–1154 (2016).

  80. Avants, B. B., Tustison, N. J. & Stone, J. R. SiMLR in ANTsR: interpretable, similarity-driven multi-view embeddings from high-dimensional biomedical data. Code Ocean https://doi.org/10.24433/CO.3087836.v2 (2021).

  81. Avants, B. B., Tustison, N. J. & Wang, D. J. J. The pediatric template of brain perfusion (PTBP). figshare https://doi.org/10.6084/m9.figshare.923555.v20 (2013).

Acknowledgements

This work is supported by a combined grant from Cohen Veterans Bioscience (CVB-461) and the Office of Naval Research (N00014-18-1-2440) as well as the National Institutes of Health (K01-ES025432-01).

Supplementary data used in the preparation of this article were obtained from the PING study database (https://chd.ucsd.edu/research/ping-study.html). The investigators within PING contributed to the design and implementation of the PING database and/or provided data, but did not participate in the analysis or writing of this report. A complete listing of investigators of the PING study can be found at ref. 79.

Supplementary data collection and sharing for this project was funded by ADNI (National Institutes of Health Grant U01 AG024904) and the Department of Defense ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Araclon Biotech; BioClinica; Biogen; Bristol Myers Squibb; CereSpir; Cogstate; Eisai; Elan Pharmaceuticals; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche and its affiliated company Genentech; Fujirebio; GE Healthcare; IXICO; Janssen Alzheimer Immunotherapy Research & Development; Johnson & Johnson Pharmaceutical Research & Development; Lumosity; Lundbeck; Merck & Co.; Meso Scale Diagnostics; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (https://fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Therapeutic Research Institute at the University of Southern California. ADNI data are disseminated by the Laboratory of Neuro Imaging at the University of Southern California.

Author information

Contributions

B.B.A., N.J.T. and J.R.S. made substantial contributions to the conception and design of the work, and the analysis and interpretation of data. B.B.A. and N.J.T. created the software. All authors drafted and revised the manuscript.

Corresponding author

Correspondence to Brian B. Avants.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature Computational Science thanks Steve Marron, Cathy Philippe and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Fernando Chirigati was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Figs. 1–5, Tables 1–4 and discussion.

About this article

Cite this article

Avants, B.B., Tustison, N.J. & Stone, J.R. Similarity-driven multi-view embeddings from high-dimensional biomedical data. Nat Comput Sci 1, 143–152 (2021). https://doi.org/10.1038/s43588-021-00029-8

  • DOI: https://doi.org/10.1038/s43588-021-00029-8
