Batch normalization followed by merging is powerful for phenotype prediction integrating multiple heterogeneous studies.
Published: 2023 Oct 16
Authors:
Yilin Gao, Fengzhu Sun
Source:
PLoS Computational Biology
Abstract:
Heterogeneity across genomic studies compromises the performance of machine learning models in cross-study phenotype prediction. Overcoming this heterogeneity when combining studies is a challenging and critical step in developing machine learning algorithms whose prediction performance is reproducible on independent datasets. We investigated the best approaches for integrating different studies of the same type of omics data under a variety of heterogeneities. We developed a comprehensive workflow to simulate different types of heterogeneity and to evaluate the performance of different integration methods together with batch normalization using ComBat. We also demonstrated the results through realistic applications on six colorectal cancer (CRC) metagenomic studies and six tuberculosis (TB) gene expression studies. We showed that heterogeneity across genomic studies can markedly reduce the reproducibility of machine learning classifiers. ComBat normalization improved the prediction performance of machine learning classifiers when heterogeneous populations were present, and it successfully removed batch effects within the same population. We also showed that the prediction accuracy of machine learning classifiers can decrease markedly as the underlying disease models of the training and test populations become more different. Comparing merging and integration methods, we found that neither is uniformly better: each can outperform the other in different scenarios. In the realistic applications, prediction accuracy improved when ComBat normalization was applied with either merging or integration methods in both the CRC and TB studies. We illustrated that batch normalization is essential for mitigating both population differences between studies and batch effects. We also showed that both the merging strategy and integration methods can achieve good performance when combined with batch normalization. In addition, we explored the potential of boosting phenotype prediction performance with rank aggregation methods and showed that their performance was similar to that of other ensemble learning approaches.
Copyright: © 2023 Gao, Sun. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
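To make the contrast between merging, integration, and rank aggregation concrete, the following is a minimal Python sketch on simulated data. It is an illustration only, not the authors' workflow. The per-study standardization is a simplified stand-in for ComBat (which additionally uses empirical Bayes shrinkage), the centroid-based classifier is a toy choice, and every function name, study size, shift, and scale is an assumption made for this example. Averaging per-sample ranks is one simple reading of "rank aggregation"; the abstract does not specify which methods were used.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_study(n, shift, scale, n_features=20, effect=1.5):
    # Hypothetical study: 5 informative features plus a study-specific shift/scale.
    y = np.arange(n) % 2                      # balanced case/control labels
    X = rng.normal(size=(n, n_features))
    X[:, :5] += effect * y[:, None]           # disease signal shared by all studies
    return X * scale + shift, y               # study-level batch effect

def standardize(X):
    # Per-study location/scale adjustment; a simplified stand-in for ComBat.
    return (X - X.mean(axis=0)) / X.std(axis=0)

def fit(X, y):
    # Toy linear classifier: direction between the two class centroids.
    mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    w = mu1 - mu0
    return w, -w @ (mu0 + mu1) / 2

def score(model, X):
    w, b = model
    return X @ w + b

def accuracy(pred, y):
    return float(np.mean(pred == (y == 1)))

# Two training studies and one test study, each with its own batch effect.
train = [simulate_study(200, 3.0, 1.0), simulate_study(200, -2.0, 2.0)]
X_test, y_test = simulate_study(200, 5.0, 0.5)

# Baseline: merge the raw studies without any normalization.
X_raw = np.vstack([X for X, _ in train])
y_all = np.concatenate([y for _, y in train])
acc_raw = accuracy(score(fit(X_raw, y_all), X_test) > 0, y_test)

# Normalize each study separately (test labels are never used).
train_norm = [(standardize(X), y) for X, y in train]
X_test_norm = standardize(X_test)

# Merging: pool the normalized studies and train a single classifier.
X_merged = np.vstack([X for X, _ in train_norm])
acc_merged = accuracy(score(fit(X_merged, y_all), X_test_norm) > 0, y_test)

# Integration (ensemble): one classifier per study, then average their scores.
scores = np.array([score(fit(X, y), X_test_norm) for X, y in train_norm])
acc_ensemble = accuracy(scores.mean(axis=0) > 0, y_test)

# Rank aggregation: average each test sample's rank across the classifiers
# and call the top half positive (assumes a balanced test set).
ranks = scores.argsort(axis=1).argsort(axis=1)
mean_rank = ranks.mean(axis=0)
acc_rank = accuracy(mean_rank > (len(y_test) - 1) / 2, y_test)

print(acc_raw, acc_merged, acc_ensemble, acc_rank)
```

In this toy setup the studies differ only by a shift and a scale, so simple standardization is enough to align them. Real studies can also differ in the underlying disease model, which is the harder case the abstract describes and which this sketch does not simulate.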