A Bayesian framework for pathway-guided identification of cancer subgroups by integrating multiple types of genomic data.
Published: 2023 Sep 15
Authors:
Zequn Sun, Dongjun Chung, Brian Neelon, Andrew Millar-Wilson, Stephen P Ethier, Feifei Xiao, Yinan Zheng, Kristin Wallace, Gary Hardiman
Source:
Statistics in Medicine
Abstract:
© 2023 The Authors. Statistics in Medicine published by John Wiley & Sons Ltd.
In recent years, comprehensive cancer genomics platforms, such as The Cancer Genome Atlas (TCGA), have provided access to an enormous amount of high-throughput genomic data for each patient, including gene expression, DNA copy number alterations, DNA methylation, and somatic mutations. While the integration of these multi-omics datasets has the potential to provide novel insights that can lead to personalized medicine, most existing approaches focus only on gene-level analysis and lack the ability to facilitate biological findings at the pathway level. In this article, we propose Bayes-InGRiD (Bayesian Integrative Genomics Robust iDentification of cancer subgroups), a novel pathway-guided Bayesian sparse latent factor model for the simultaneous identification of cancer patient subgroups (clustering) and key molecular features (variable selection) within a unified framework, based on the joint analysis of continuous, binary, and count data. By utilizing pathway (gene set) information, Bayes-InGRiD not only enhances the accuracy and robustness of cancer patient subgroup and key molecular feature identification, but also promotes biological understanding and interpretation. Finally, to facilitate efficient posterior sampling, an alternative Gibbs sampler for logistic and negative binomial models is proposed, using Pólya-Gamma mixtures of normals to represent latent variables for binary and count data, which yields a conditionally Gaussian representation of the posterior. The R package "INGRID" implementing the proposed approach is currently available on our research group's GitHub webpage (https://dongjunchung.github.io/INGRID/).
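To illustrate the Pólya-Gamma idea mentioned in the abstract, the following is a minimal Python sketch of a Gibbs sampler for ordinary Bayesian logistic regression. It is not the INGRID implementation and does not include the sparse latent factor model, pathway information, or the negative binomial component. It only shows how augmenting each binary observation with a Pólya-Gamma latent variable makes the coefficient update conditionally Gaussian. The function names `sample_pg_approx` and `gibbs_logistic`, the N(0, `prior_var` I) prior, and the truncated sum-of-gammas approximation used in place of an exact Pólya-Gamma sampler are all assumptions made for this example.

```python
import numpy as np


def sample_pg_approx(b, c, rng, n_terms=200):
    """Approximate draws from PG(b, c_i) for each entry of c.

    Uses the infinite sum-of-gammas representation of the Polya-Gamma
    distribution, truncated to n_terms terms (an approximation chosen
    for simplicity, not an exact sampler).
    """
    c = np.asarray(c, dtype=float)
    k = np.arange(1, n_terms + 1)
    denom = (k - 0.5) ** 2 + (c[:, None] ** 2) / (4.0 * np.pi ** 2)
    g = rng.gamma(shape=b, scale=1.0, size=(c.shape[0], n_terms))
    return (g / denom).sum(axis=1) / (2.0 * np.pi ** 2)


def gibbs_logistic(X, y, n_iter=500, burn_in=100, prior_var=10.0, seed=0):
    """Gibbs sampler for logistic regression with a N(0, prior_var * I) prior.

    Alternates between omega_i | beta ~ PG(1, x_i' beta) and the Gaussian
    conditional beta | omega, y ~ N(m, V), where
    V = (X' diag(omega) X + I / prior_var)^{-1} and m = V X' (y - 1/2).
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = np.zeros(p)
    kappa = y - 0.5
    prior_prec = np.eye(p) / prior_var
    draws = np.empty((n_iter - burn_in, p))
    for it in range(n_iter):
        # Step 1: draw the latent Polya-Gamma variables given beta.
        omega = sample_pg_approx(1.0, X @ beta, rng)
        # Step 2: draw beta from its Gaussian full conditional.
        prec = X.T @ (omega[:, None] * X) + prior_prec
        mean = np.linalg.solve(prec, X.T @ kappa)
        chol = np.linalg.cholesky(prec)  # prec = chol @ chol.T
        beta = mean + np.linalg.solve(chol.T, rng.standard_normal(p))
        if it >= burn_in:
            draws[it - burn_in] = beta
    return draws


if __name__ == "__main__":
    # Simulated data with an intercept and one covariate (illustrative only).
    rng = np.random.default_rng(42)
    n = 300
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    prob = 1.0 / (1.0 + np.exp(-(X @ np.array([-0.5, 1.5]))))
    y = (rng.random(n) < prob).astype(float)
    samples = gibbs_logistic(X, y)
    print("posterior mean of coefficients:", samples.mean(axis=0))
```

For count data, the same augmentation applies to a negative binomial likelihood with dispersion r by drawing the latent variable from PG(y + r, ψ) and using κ = (y − r)/2, which again leaves a Gaussian conditional for the regression coefficients.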