Predicting gene and protein expression levels from DNA and protein sequences with Perceiver.
Publication date: 2023 Mar 22
Authors:
Matteo Stefanini, Marta Lovino, Rita Cucchiara, Elisa Ficarra
Source:
Epigenetics & Chromatin
Abstract:
The functions of an organism and its biological processes result from the expression of genes and proteins. Quantifying and predicting mRNA and protein levels is therefore a crucial aspect of scientific research. For mRNA level prediction, the available approaches feed the sequence upstream and downstream of the Transcription Start Site (TSS) to neural networks. State-of-the-art models (e.g., Xpresso and Basenji) predict mRNA levels using Convolutional Neural Networks (CNNs) or Long Short-Term Memory (LSTM) networks. However, CNN predictions depend on the convolutional kernel size, and LSTMs struggle to capture long-range dependencies in the sequence. For protein level prediction, as far as we know, no model predicts protein levels from gene or protein sequences. Here, we apply a new model type, the Perceiver, to mRNA and protein level prediction. It is a Transformer-based architecture whose attention module attends to long-range interactions in the sequences, and it overcomes the quadratic complexity of standard Transformer architectures. This work's contributions are: (1) the DNAPerceiver model, which predicts mRNA levels from the sequence upstream and downstream of the TSS; (2) the ProteinPerceiver model, which predicts protein levels from the protein sequence; and (3) the Protein&DNAPerceiver model, which predicts protein levels from TSS and protein sequences. The models are evaluated on cell lines, mice, glioblastoma, and lung cancer tissues. The results show the effectiveness of Perceiver-type models in predicting mRNA and protein levels. This paper presents a Perceiver architecture for mRNA and protein level prediction. In the future, inserting regulatory and epigenetic information into the model could improve mRNA and protein level predictions. The source code is freely available at https://github.com/MatteoStefanini/DNAPerceiver.
Copyright © 2023. Published by Elsevier B.V.
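The abstract's claim that a Perceiver avoids the quadratic cost of a standard Transformer can be illustrated with a small sketch. It is not the authors' DNAPerceiver code: the function names (`one_hot`, `cross_attend`), the sizes, the random untrained weights, and the mean-pooling readout are all assumptions made for illustration, and positional encodings are left out for brevity. The idea shown is that a small array of N latent vectors queries the M input positions, so the attention score matrix is N x M rather than M x M.

```python
import math
import random

NUCLEOTIDES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as 4-dim one-hot rows; unknown bases (e.g. N) become all zeros."""
    return [[1.0 if base == nt else 0.0 for nt in NUCLEOTIDES] for base in seq.upper()]

def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, rng):
    """Stand-in for learned parameters (hypothetical, untrained)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def cross_attend(latents, inputs, w_q, w_k, w_v):
    """One cross-attention step: N latent queries attend over M input positions.

    The score matrix has shape N x M, so for a fixed number of latents
    the cost grows linearly with the sequence length M.
    """
    q = matmul(latents, w_q)                  # N x d
    k = matmul(inputs, w_k)                   # M x d
    v = matmul(inputs, w_v)                   # M x d
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]      # d x M
    scores = matmul(q, k_t)                   # N x M
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v), weights        # N x d summary, N x M attention weights

rng = random.Random(0)
# Toy stand-in for a window of sequence around a TSS (random bases, not real data).
seq = "".join(rng.choice(NUCLEOTIDES) for _ in range(200))
inputs = one_hot(seq)                         # M x 4

num_latents, dim = 8, 16                      # assumed sizes, chosen only for the demo
latents = random_matrix(num_latents, dim, rng)
w_q = random_matrix(dim, dim, rng)
w_k = random_matrix(4, dim, rng)
w_v = random_matrix(4, dim, rng)

summary, weights = cross_attend(latents, inputs, w_q, w_k, w_v)

# Hypothetical regression head: average the latents, then project to one scalar
# that would play the role of a predicted expression level after training.
pooled = [sum(col) / num_latents for col in zip(*summary)]
w_out = random_matrix(dim, 1, rng)
prediction = matmul([pooled], w_out)[0][0]
```

With these toy sizes, the cross-attention computes 8 × 200 = 1,600 scores, whereas self-attention over the same 200 positions would compute 200 × 200 = 40,000. A full Perceiver would follow this step with self-attention among the latents only, which keeps the cost independent of the input length.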