Research Updates

SpiderLearner: An ensemble approach to Gaussian graphical model estimation.

Published: 2023 Apr 02
Authors: Katherine H Shutta, Laura B Balzer, Denise M Scholtens, Raji Balasubramanian
Source: Disease Models & Mechanisms

Abstract:

Gaussian graphical models (GGMs) are a popular form of network model in which nodes represent features in multivariate normal data and edges reflect conditional dependencies between these features. GGM estimation is an active area of research. Currently available tools for GGM estimation require investigators to make several choices regarding algorithms, scoring criteria, and tuning parameters. An estimated GGM may be highly sensitive to these choices, and the accuracy of each method can vary based on structural characteristics of the network such as topology, degree distribution, and density. Because these characteristics are a priori unknown, it is not straightforward to establish universal guidelines for choosing a GGM estimation method. We address this problem by introducing SpiderLearner, an ensemble method that constructs a consensus network from multiple estimated GGMs. Given a set of candidate methods, SpiderLearner estimates the optimal convex combination of results from each method using a likelihood-based loss function. K-fold cross-validation is applied in this process, reducing the risk of overfitting. In simulations, SpiderLearner performs better than or comparably to the best candidate methods according to a variety of metrics, including relative Frobenius norm and out-of-sample likelihood. We apply SpiderLearner to publicly available ovarian cancer gene expression data including 2013 participants from 13 diverse studies, demonstrating our tool's potential to identify biomarkers of complex disease. SpiderLearner is implemented as flexible, extensible, open-source code in the R package ensembleGGM at https://github.com/katehoffshutta/ensembleGGM. © 2023 John Wiley & Sons Ltd.
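The ensemble idea described in the abstract can be illustrated with a minimal Python sketch. It is not the ensembleGGM implementation, and nothing here reflects that package's API. The candidate estimators (two ridge-regularized inverse covariances and a diagonal independence model), the function names, the coarse grid search over the weight simplex, and the simulated data are all assumptions made for illustration. The sketch fits each candidate on the training folds, scores convex combinations of the resulting precision matrices by Gaussian negative log-likelihood on the held-out folds, and keeps the weights with the lowest cross-validated loss.

```python
import itertools
import numpy as np

def ridge_precision(X, lam):
    """Hypothetical candidate: inverse of a ridge-regularized sample covariance."""
    S = np.cov(X, rowvar=False)
    return np.linalg.inv(S + lam * np.eye(S.shape[0]))

# Hypothetical candidate library; each entry maps a data matrix to a precision matrix.
CANDIDATES = [
    lambda X: ridge_precision(X, 0.01),
    lambda X: ridge_precision(X, 0.5),
    lambda X: np.diag(1.0 / X.var(axis=0, ddof=1)),  # independence model
]

def heldout_loss(theta, X_test, mu):
    """Gaussian negative log-likelihood of held-out rows, up to additive constants and a factor of 1/2."""
    Xc = X_test - mu
    S = Xc.T @ Xc / len(X_test)
    _, logdet = np.linalg.slogdet(theta)
    return len(X_test) * (np.trace(S @ theta) - logdet)

def fit_weights(X, k=5, step=0.05, seed=0):
    """Pick convex weights by grid search on the K-fold cross-validated loss."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    fold_fits = []
    for test_idx in folds:
        train = np.delete(X, test_idx, axis=0)
        thetas = [fit(train) for fit in CANDIDATES]  # candidates see training rows only
        fold_fits.append((thetas, X[test_idx], train.mean(axis=0)))

    def cv_loss(w):
        return sum(
            heldout_loss(sum(wi * t for wi, t in zip(w, thetas)), X_te, mu)
            for thetas, X_te, mu in fold_fits
        )

    # Enumerate nonnegative weights that sum to one on a grid of width `step`.
    n_steps = round(1 / step)
    grid = [
        np.array(c + (n_steps - sum(c),)) / n_steps
        for c in itertools.product(range(n_steps + 1), repeat=len(CANDIDATES) - 1)
        if sum(c) <= n_steps
    ]
    return min(grid, key=cv_loss), cv_loss

# Simulated data with an AR(1)-style covariance (illustrative only).
rng = np.random.default_rng(1)
p = 5
cov = 0.6 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
X = rng.multivariate_normal(np.zeros(p), cov, size=200)

weights, cv_loss = fit_weights(X)
# Assumed final step: refit every candidate on all data and combine with the CV weights.
ensemble_theta = sum(w * fit(X) for w, fit in zip(weights, CANDIDATES))
print("weights:", weights)
```

Because the grid includes the weight vectors that put all mass on a single candidate, the selected combination can never have a higher cross-validated loss than the best individual candidate in this sketch. A grid search only scales to a handful of candidates; a larger library would call for a constrained convex optimizer instead.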