TheSNPpit-A High Performance Database System for Managing Large Scale SNP Data.

The fast development of high throughput genotyping has opened up new possibilities in genetics while at the same time producing considerable data handling issues. TheSNPpit is a database system for managing large amounts of multi panel SNP genotype data from any genotyping platform. With an increasi...

Bibliographic Details
Main Authors: Eildert Groeneveld, Helmut Lichtenberg
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2016-01-01
Series: PLoS ONE
Online Access: http://europepmc.org/articles/PMC5079601?pdf=render
id doaj-dd97ea19731245428b653d4e47fc4c24
record_format Article
spelling doaj-dd97ea19731245428b653d4e47fc4c24; 2020-11-25T00:25:36Z; eng; Public Library of Science (PLoS); PLoS ONE; 1932-6203; 2016-01-01; 11; 10; e0164043; 10.1371/journal.pone.0164043; TheSNPpit-A High Performance Database System for Managing Large Scale SNP Data.; Eildert Groeneveld; Helmut Lichtenberg; http://europepmc.org/articles/PMC5079601?pdf=render
collection DOAJ
language English
format Article
sources DOAJ
author Eildert Groeneveld
Helmut Lichtenberg
title TheSNPpit-A High Performance Database System for Managing Large Scale SNP Data.
publisher Public Library of Science (PLoS)
series PLoS ONE
issn 1932-6203
publishDate 2016-01-01
description The fast development of high throughput genotyping has opened up new possibilities in genetics while at the same time producing considerable data handling issues. TheSNPpit is a database system for managing large amounts of multi panel SNP genotype data from any genotyping platform. With an increasing rate of genotyping in areas like animal and plant breeding as well as human genetics, hundreds of thousands of individuals already need to be managed. While the common database design with one row per SNP can manage hundreds of samples, this approach becomes progressively slower as the size of the data sets increases until it finally fails completely once tens or even hundreds of thousands of individuals need to be managed. TheSNPpit implements three ideas to accommodate such large scale experiments: highly compressed vector storage in a relational database, set based data manipulation, and a very fast export written in C, with Perl as the base for the framework and PostgreSQL as the database backend. Its novel subset system allows the creation of named subsets based on the filtering of SNPs (by major allele frequency, no-calls, and chromosomes) and on manually applied sample and SNP lists at negligible storage cost, thus avoiding the issue of proliferating file copies. The named subsets are exported for downstream analysis. PLINK ped and map files are processed as inputs and outputs. TheSNPpit allows management of different panel sizes in the same population of individuals when higher density panels replace previous lower density versions, as occurs in animal and plant breeding programs. A completely generalized procedure allows storage of phenotypes. TheSNPpit occupies only 2 bits per stored SNP, implying a capacity of 4 million SNPs per 1 MB of disk storage.
To investigate performance scaling, a database with more than 18.5 million samples was created, holding 3.4 trillion SNPs from 12 panels ranging from 1000 to 20 million SNPs and resulting in a database of 850 GB. Import and export performance scales linearly with the number of SNPs and is largely independent of panel and database size. Import speed is around 6 million SNPs/sec; export speed is between 60 and 120 million SNPs/sec. Being command line based, imports and exports can easily be integrated into pipelines. TheSNPpit is available under the Open Source GNU General Public License (GPL) Version 2.
url http://europepmc.org/articles/PMC5079601?pdf=render
_version_ 1725347978348068864
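
The abstract's storage figures (2 bits per SNP, 4 million SNPs per 1 MB, 3.4 trillion SNPs in roughly 850 GB) can be checked with a small sketch. The Python below is only an illustration of 2-bit vector packing in general; the genotype code values (0, 1, 2 for allele counts, 3 for a no-call) and all function names are assumptions made for this example, not TheSNPpit's documented encoding or API.

```python
# Illustrative 2-bit genotype packing. The code values and function names
# are assumptions for this sketch, not TheSNPpit's actual storage format.
NO_CALL = 3  # hypothetical code for a missing genotype; 0, 1, 2 = allele counts


def pack_genotypes(codes):
    """Pack genotype codes (each 0..3) into bytes, four codes per byte."""
    packed = bytearray((len(codes) + 3) // 4)
    for i, code in enumerate(codes):
        if not 0 <= code <= 3:
            raise ValueError(f"genotype code out of range: {code}")
        # Each code occupies 2 bits; slot 0 sits in the lowest bits of the byte.
        packed[i // 4] |= code << (2 * (i % 4))
    return bytes(packed)


def unpack_genotypes(packed, n_snps):
    """Recover the first n_snps genotype codes from a packed byte string."""
    return [(packed[i // 4] >> (2 * (i % 4))) & 0b11 for i in range(n_snps)]


def storage_bytes(n_snps):
    """Bytes needed for n_snps genotypes at 2 bits each, rounded up."""
    return (2 * n_snps + 7) // 8
```

Under these assumptions, `storage_bytes(4_000_000)` gives 1,000,000 bytes, which matches the stated 4 million SNPs per 1 MB, and `storage_bytes(3_400_000_000_000)` gives 850,000,000,000 bytes, in line with the reported 850 GB database. This counts the raw genotype payload only and ignores any database overhead.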