OCMA: Fast, Memory-Efficient Factorization of Prohibitively Large Relationship Matrices

Matrices representing genetic relatedness among individuals (i.e., Genomic Relationship Matrices, GRMs) play a central role in genetic analysis. The eigen-decomposition of GRMs (or its alternative that generates fewer top singular values using genotype matrices) is a necessary step for many analyses...

Full description

Bibliographic Details
Main Authors: Zhi Xiong, Qingrun Zhang, Alexander Platt, Wenyuan Liao, Xinghua Shi, Gustavo de los Campos, Quan Long
Format: Article
Language:English
Published: Oxford University Press 2019-01-01
Series:G3: Genes, Genomes, Genetics
Subjects:
Online Access:http://g3journal.org/lookup/doi/10.1534/g3.118.200908
Description
Summary:Matrices representing genetic relatedness among individuals (i.e., Genomic Relationship Matrices, GRMs) play a central role in genetic analysis. The eigen-decomposition of GRMs (or its alternative that generates fewer top singular values using genotype matrices) is a necessary step for many analyses including estimation of SNP-heritability, Principal Component Analysis (PCA), and genomic prediction. However, the GRMs and genotype matrices provided by modern biobanks are too large to be stored in active memory. To accommodate the current and future “bigger-data”, we develop a disk-based tool, Out-of-Core Matrices Analyzer (OCMA), using state-of-the-art computational techniques that can nimbly perform eigen and Singular Value Decomposition (SVD) analyses. By integrating memory mapping (mmap) and the latest matrix factorization libraries, our tool is fast and memory-efficient. To demonstrate the impressive performance of OCMA, we test it on a personal computer. For full eigen-decomposition, it solves an ordinary GRM (N = 10,000) in 55 sec. For SVD, a commonly used faster alternative of full eigen-decomposition in genomic analyses, OCMA solves the top 200 singular values (SVs) in half an hour, top 2,000 SVs in 0.95 hr, and all 5,000 SVs in 1.77 hr based on a very large genotype matrix (N = 1,000,000, M = 5,000) on the same personal computer. OCMA also supports multi-threading when running in a desktop or HPC cluster. Our OCMA tool can thus alleviate the computing bottleneck of classical analyses on large genomic matrices, and make it possible to scale up current and emerging analytical methods to big genomics data using lightweight computing resources.
ISSN:2160-1836