Fast analysis of scATAC-seq data using a predefined set of genomic regions [version 2; peer review: 2 approved]

Background: Analysis of scATAC-seq data has been recently scaled to thousands of cells. While processing of other types of single cell data was boosted by the implementation of alignment-free techniques, pipelines available to process scATAC-seq data still require large computational resources. We p...

Bibliographic Details
Main Authors: Valentina Giansanti, Ming Tang, Davide Cittaro
Format: Article
Language: English
Published: F1000 Research Ltd 2020-05-01
Series: F1000Research
Online Access: https://f1000research.com/articles/9-199/v2
id doaj-79ccd3134f2845538ae9508202288026
record_format Article
spelling doaj-79ccd3134f2845538ae9508202288026 | 2020-11-25T01:23:19Z | eng | F1000 Research Ltd | F1000Research | 2046-1402 | 2020-05-01 | 9 | 10.12688/f1000research.22731.2 | 26547
Fast analysis of scATAC-seq data using a predefined set of genomic regions [version 2; peer review: 2 approved]
Valentina Giansanti (0), Ming Tang (1), Davide Cittaro (2)
(0) Department of Informatics, Systems and Communication, University of Milano-Bicocca, Milan, Italy
(1) FAS informatics, Harvard University, Cambridge, MA, USA
(2) Center for Omics Sciences, IRCCS San Raffaele Institute, Milan, Italy
Background: Analysis of scATAC-seq data has recently been scaled to thousands of cells. While processing of other types of single cell data was boosted by the implementation of alignment-free techniques, pipelines available to process scATAC-seq data still require large computational resources. We propose here an approach based on pseudoalignment, which reduces execution times and hardware needs at little cost in precision.
Methods: Public data for 10k PBMC were downloaded from the 10x Genomics website. Reads were aligned to various references derived from DNase I Hypersensitive Sites (DHS) using kallisto and quantified with bustools. We compared our results with the publicly available ones produced by cellranger-atac. We subsequently tested our approach on scATAC-seq data for the K562 cell line.
Results: We found that kallisto does not introduce biases in the quantification of known peaks; the cell groups identified are consistent with those identified by the standard method. We also found that cell identification is robust when the analysis is performed using a DHS-derived reference in place of de novo identification of ATAC peaks. Lastly, we found that our approach is suitable for reliable quantification of gene activity based on scATAC-seq signal, and thus allows for efficient labelling of cell groups based on marker genes.
Conclusions: Analysis of scATAC-seq data by means of kallisto produces results in line with standard pipelines while being considerably faster; using a set of known DHS sites as reference does not affect the ability to characterize the cell populations.
https://f1000research.com/articles/9-199/v2
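The idea summarized in the Methods, matching reads against a fixed set of region sequences instead of calling peaks de novo and then counting per cell, can be illustrated with a toy sketch. This is not the kallisto or bustools implementation: the function names, the region names (`dhs1`, `dhs2`), the sequences, and the strict "every k-mer must match" rule are assumptions made for illustration, and real tools additionally deal with reverse complements, sequencing errors, barcode correction, and reads compatible with several targets.

```python
from collections import Counter, defaultdict


def build_kmer_index(regions, k):
    """Map each k-mer to the set of region names whose sequence contains it."""
    index = defaultdict(set)
    for name, seq in regions.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def pseudoalign(read, index, k):
    """Return the regions compatible with every k-mer of the read.

    Toy rule (an assumption): a single unseen k-mer rejects the read.
    """
    compatible = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k])
        if not hits:
            return set()
        compatible = set(hits) if compatible is None else compatible & hits
        if not compatible:
            return set()
    return compatible or set()


def count_matrix(reads, index, k):
    """Count reads per (cell barcode, region), keeping unambiguous reads only."""
    counts = Counter()
    for barcode, seq in reads:
        regions = pseudoalign(seq, index, k)
        if len(regions) == 1:
            counts[(barcode, next(iter(regions)))] += 1
    return counts


# Hypothetical reference regions and barcoded reads.
regions = {"dhs1": "ACGTACGTTAGC", "dhs2": "TTGACCGGATAC"}
reads = [
    ("cellA", "ACGTACG"),
    ("cellA", "GACCGGA"),
    ("cellB", "CGTTAGC"),
    ("cellB", "GGGGGGG"),  # matches no region and is dropped
]
index = build_kmer_index(regions, k=5)
counts = count_matrix(reads, index, k=5)
```

Because the reference is fixed in advance, the index is built once and reused for every sample, and the resulting sparse (cell, region) counts play the role of the cell-by-peak matrix that a peak-calling pipeline would produce.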
collection DOAJ
language English
format Article
sources DOAJ
author Valentina Giansanti
Ming Tang
Davide Cittaro
title Fast analysis of scATAC-seq data using a predefined set of genomic regions [version 2; peer review: 2 approved]
publisher F1000 Research Ltd
series F1000Research
issn 2046-1402
publishDate 2020-05-01
url https://f1000research.com/articles/9-199/v2