Research Updates

Splicing signature database development to delineate cancer pathways using literature mining and transcriptome machine learning.

Publication date: 2023
Authors: Kyubin Lee, Daejin Hyung, Soo Young Cho, Namhee Yu, Sewha Hong, Jihyun Kim, Sunshin Kim, Ji-Youn Han, Charny Park
Source: Disease Models & Mechanisms

Abstract:

Alternative splicing (AS) events modulate certain pathways and phenotypic plasticity in cancer. Although previous studies have analyzed splicing events computationally, it remains a challenge to uncover the biological functions induced by reliable AS events among a vast number of candidates. To provide essential splicing event signatures for assessing pathway regulation, we developed a database by collecting two datasets: (i) reported literature and (ii) cancer transcriptome profiles. The former comprises knowledge-based splicing signatures for 202 pathways, collected from 63,229 PubMed abstracts using natural language processing. The latter comprises machine learning-based splicing signatures identified from pan-cancer transcriptomes for 16 cancer types and 42 pathways. We established six different learning models to classify pathway activities, using splicing profiles as the learning dataset. The AS events ranked highest by each model's feature importance became the signature for each pathway. To validate our learning results, we evaluated them by (i) performance metrics, (ii) differential AS sets acquired from external datasets, and (iii) our knowledge-based signatures. The area under the receiver operating characteristic curve (AUROC) values did not differ drastically among the learning models. However, the random forest model clearly performed best when compared against the AS sets identified from external datasets and our knowledge-based signatures. Therefore, we used the signatures obtained from the random forest model. Our database provides the clinical characteristics of the AS signatures, including survival tests, molecular subtypes, and the tumor microenvironment. Regulation by splicing factors was additionally investigated. The database of developed signatures supports a retrieval and visualization system. © 2023 The Authors.
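
The evaluation steps described in the abstract (an AUROC-style performance metric, a signature built from top-ranked feature importances, and a comparison against an external AS set) can be illustrated with a minimal Python sketch. This is not the authors' pipeline. The event names, scores, and importance values are made up, the cutoff `k` is arbitrary, and the Jaccard index is only an assumed overlap measure, since the abstract does not state how the comparison was scored.

```python
from typing import Dict, List, Sequence, Set


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC via the pairwise (Mann-Whitney) identity; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def top_k_signature(importance: Dict[str, float], k: int) -> List[str]:
    """Return the k AS events with the highest feature importance."""
    # Sort by descending importance; break ties by event name for determinism.
    ranked = sorted(importance.items(), key=lambda kv: (-kv[1], kv[0]))
    return [event for event, _ in ranked[:k]]


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard index of two event sets (assumed overlap measure)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical pathway-activity labels (1 = active) and model scores.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
model_auroc = auroc(labels, scores)

# Hypothetical feature importances for five AS events in one pathway.
importance = {
    "event_a": 0.40,
    "event_b": 0.25,
    "event_c": 0.20,
    "event_d": 0.10,
    "event_e": 0.05,
}
signature = top_k_signature(importance, k=3)

# Hypothetical differential AS set from an external dataset.
external_set = {"event_a", "event_c", "event_f"}
overlap = jaccard(set(signature), external_set)

print(f"AUROC={model_auroc:.3f} signature={signature} overlap={overlap:.2f}")
```

In a sketch like this, two models with similar AUROC values could still differ in how well their top-ranked events agree with an independent event set, which is the kind of distinction the abstract reports for the random forest model.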