Deep protein representations enable recombinant protein expression prediction

A crucial process in the production of industrial enzymes is recombinant gene expression, which aims to induce enzyme overexpression of the genes in a host microbe. Current approaches for securing overexpression rely on molecular tools such as adjusting the recombinant expression vector, adjusting c...

Bibliographic Details
Main Authors: Armenteros, J.J.A. (Author), Johansen, A.R. (Author), Martiny, H.-M. (Author), Nielsen, H. (Author), Salomon, J. (Author)
Format: Article
Language: English
Published: Elsevier Ltd 2021
Online Access: View Fulltext in Publisher (https://doi.org/10.1016/j.compbiolchem.2021.107596)
LEADER 03261nam a2200517Ia 4500
001 10.1016-j.compbiolchem.2021.107596
008 220427s2021 CNT 000 0 und d
020 |a 14769271 (ISSN) 
245 1 0 |a Deep protein representations enable recombinant protein expression prediction 
260 0 |b Elsevier Ltd  |c 2021 
856 |z View Fulltext in Publisher  |u https://doi.org/10.1016/j.compbiolchem.2021.107596 
520 3 |a A crucial process in the production of industrial enzymes is recombinant gene expression, which aims to induce enzyme overexpression of the genes in a host microbe. Current approaches for securing overexpression rely on molecular tools such as adjusting the recombinant expression vector, adjusting cultivation conditions, or performing codon optimizations. However, such strategies are time-consuming, and an alternative strategy would be to select genes for better compatibility with the recombinant host. Several methods for predicting soluble expression are available; however, they are all optimized for the expression host Escherichia coli and do not consider the possibility of an expressed protein not being soluble. We show that these tools are not suited for predicting expression potential in the industrially important host Bacillus subtilis. Instead, we build a B. subtilis-specific machine learning model for expressibility prediction. Given millions of unlabeled proteins and a small labeled dataset, we can successfully train such a predictive model. The unlabeled proteins provide a performance boost relative to using amino acid frequencies of the labeled proteins as input. On average, we obtain a modest performance of 0.64 area-under-the-curve (AUC) and 0.2 Matthews correlation coefficient (MCC). However, we find that this is sufficient for the prioritization of expression candidates for high-throughput studies. Moreover, the predicted class probabilities are correlated with expression levels. A number of features related to protein expression, including base frequencies and solubility, are captured by the model. © 2021 The Authors 
650 0 4 |a Bacillus subtilis 
650 0 4 |a bacterial protein 
650 0 4 |a Bacterial Proteins 
650 0 4 |a Bacteriology 
650 0 4 |a Cultivation 
650 0 4 |a Cultivation conditions 
650 0 4 |a Enzymes 
650 0 4 |a Escherichia coli 
650 0 4 |a Expression vectors 
650 0 4 |a Forecasting 
650 0 4 |a Gene expression 
650 0 4 |a gene expression regulation 
650 0 4 |a Gene Expression Regulation 
650 0 4 |a genetics 
650 0 4 |a Industrial enzymes 
650 0 4 |a machine learning 
650 0 4 |a Machine Learning 
650 0 4 |a Molecular tools 
650 0 4 |a Overexpressions 
650 0 4 |a Performance 
650 0 4 |a Recombinant expression 
650 0 4 |a Recombinant gene expressions 
650 0 4 |a recombinant protein 
650 0 4 |a Recombinant protein expression 
650 0 4 |a Recombinant proteins 
650 0 4 |a Recombinant Proteins 
700 1 |a Armenteros, J.J.A.  |e author 
700 1 |a Johansen, A.R.  |e author 
700 1 |a Martiny, H.-M.  |e author 
700 1 |a Nielsen, H.  |e author 
700 1 |a Salomon, J.  |e author 
773 |t Computational Biology and Chemistry