Evaluation of input data modality choices on functional gene embeddings.

Published: Dec 2023
Authors: Felix Brechtmann, Thibault Bechtler, Shubhankar Londhe, Christian Mertes, Julien Gagneur
Source: NAR Genomics and Bioinformatics

Abstract:

Functional gene embeddings, numerical vectors capturing gene function, provide a promising way to integrate functional gene information into machine-learning models. These embeddings are learnt by applying self-supervised machine-learning algorithms to various data types, including quantitative omics measurements, protein-protein interaction networks and literature. However, downstream evaluations comparing alternative data modalities used to construct functional gene embeddings have been lacking. Here we benchmarked functional gene embeddings obtained from various data modalities for predicting disease-gene lists, cancer drivers, phenotype-gene associations and scores from genome-wide association studies. Off-the-shelf predictors trained on precomputed embeddings matched or outperformed dedicated state-of-the-art predictors, demonstrating their high utility. Embeddings based on literature and protein-protein interactions inferred from low-throughput experiments outperformed embeddings derived from genome-wide experimental data (transcriptomics, deletion screens and protein sequence) when predicting curated gene lists. In contrast, they did not perform better when predicting genome-wide association signals and were biased towards highly studied genes. These results indicate that embeddings derived from literature and low-throughput experiments appear favourable in many existing benchmarks because they are biased towards well-studied genes and should therefore be considered with caution. Altogether, our study and precomputed embeddings will facilitate the development of machine-learning models in genetics and related fields. © The Author(s) 2023. Published by Oxford University Press on behalf of NAR Genomics and Bioinformatics.
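
To make the idea of evaluating precomputed embeddings on a curated gene list more concrete, the following is a minimal sketch and not the authors' benchmark. It assumes toy two-dimensional embeddings with made-up values, hypothetical gene names (GENE_A to GENE_F) and a hypothetical disease-gene list. In place of the off-the-shelf predictors mentioned in the abstract, it swaps in a much simpler leave-one-out nearest-centroid scorer and summarises ranking quality with the area under the ROC curve.

```python
import math

# Toy 2-D "embeddings" with made-up values (assumption for illustration only);
# real functional gene embeddings have many more dimensions.
embeddings = {
    "GENE_A": [1.0, 0.1],
    "GENE_B": [0.9, 0.2],
    "GENE_C": [0.8, 0.0],
    "GENE_D": [0.0, 1.0],
    "GENE_E": [0.1, 0.9],
    "GENE_F": [-0.2, 0.8],
}

# Hypothetical curated disease-gene list (the positive class).
disease_genes = {"GENE_A", "GENE_B", "GENE_C"}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def leave_one_out_scores(embeddings, positives):
    """Score each gene by similarity to the centroid of the positive genes.

    The gene being scored is excluded from the centroid, so a positive gene
    never contributes to its own score.
    """
    scores = {}
    for gene, vector in embeddings.items():
        training = [embeddings[p] for p in positives if p != gene]
        scores[gene] = cosine(vector, centroid(training))
    return scores


def auroc(scores, positives):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for g, s in scores.items() if g in positives]
    neg = [s for g, s in scores.items() if g not in positives]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


scores = leave_one_out_scores(embeddings, disease_genes)
print("AUROC:", auroc(scores, disease_genes))
```

The same evaluation loop could be repeated for embeddings built from different data modalities to compare them on one gene list. As the abstract cautions, a high score on a curated list may partly reflect bias towards well-studied genes rather than better capture of gene function.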