Summary: | <p> High-throughput sequencing technologies and new computational techniques for analyzing population genetics data are rapidly improving our understanding of disease susceptibility in humans and adaptation in a wide variety of organisms. These studies often discover nonsynonymous variation with large effects as even a single amino acid change can disrupt the folding, catalytic activity, and physical interactions of proteins. Current estimates predict that every human genome contains 10,000-11,000 nonsynonymous variations and, while we cannot currently characterize all this diversity experimentally, many variants that alter protein function can be identified computationally from destabilization of structural models or amino acid conservation. Methods for annotating variant effects in genome-wide association studies and exome sequencing studies use conservation and other sequence-based features to identify damaging variants but cannot predict the effect these variants have on protein function. Recent studies of de novo variants have demonstrated the power of these methods but also the need for additional information, such as physical models from the Protein Data Bank, to identify causal variants in disease association studies. </p><p> I present VIPUR, a computational framework that integrates sequence analysis and structural modeling using the Rosetta protein modeling suite to identify and interpret deleterious protein variants. To train VIPUR, I collected 9,477 protein variants with known effects on protein function from multiple organisms and curated structural models for each variant from crystal structures and homology models. VIPUR can be applied to variants in any organism’s proteome with improved generalized accuracy (AUROC .83) and interpretability (AUPR .87) compared to other methods. I show that VIPUR’s predictions of deleteriousness match the biological phenotypes for pathogenicity in ClinVar despite being trained on a different label. 
I use VIPUR to interpret mutations associated with inflammation and diabetes, demonstrating the structural diversity of disrupted functional sites and improved interpretation of functional effects. </p><p> Generalizable tools for interpreting genetic variants are especially needed with individualized exome sequencing, where clear indications of confident predictions are necessary to identify causal variation. I demonstrate VIPUR’s ability to select candidate variants associated with human diseases by predicting the effects of <i>de novo</i> variants associated with Autism Spectrum Disorders (ASD) in the Simons Simplex Collection. Compared to existing methods, VIPUR’s deleterious predictions have the greatest enrichment for mutations found in children with ASD. VIPUR’s predictions of deleterious effects are easily combined with other protein functional data to produce a small set of candidate genes and variants with specific mechanistic predictions. </p><p> Although designed to aid in the discovery of causal variants, VIPUR can also simulate mutations to better understand specific protein functions. The distribution of VIPUR scores across all positions in a protein can be used to highlight conserved residues and provides an overall measure of protein conservation. When applied to levoglucosan kinase, a bacterial enzyme of interest for biofuel processing, VIPUR’s neutral predictions have a fivefold enrichment for beneficial growth mutations. While VIPUR is not designed to detect gain-of-function mutations, this enrichment suggests VIPUR scores can identify potentially beneficial mutations by removing clearly deleterious ones. When applied to TP53, a human protein that is mutated in nearly half of all cancers, VIPUR score trends highlight the most common mutations in the COSMIC database, suggesting other variants that may have similar effects on tumor growth. 
VIPUR and the large-scale data analysis empowering it will aid in the interpretation of protein variation by providing a detailed feature space to characterize protein functional effects and confident predictions of deleterious variation in genome-wide association studies, exome sequencing initiatives, and protein engineering. </p>
|