CEM500K, a large-scale heterogeneous unlabeled cellular electron microscopy image dataset for deep learning

Automated segmentation of cellular electron microscopy (EM) datasets remains a challenge. Supervised deep learning (DL) methods that rely on region-of-interest (ROI) annotations yield models that fail to generalize to unrelated datasets. Newer unsupervised DL algorithms require relevant pre-training...

Bibliographic Details
Main Authors: Ryan Conrad, Kedar Narayan
Format: Article
Language: English
Published: eLife Sciences Publications Ltd 2021-04-01
Series: eLife
Subjects: vEM
Online Access: https://elifesciences.org/articles/65894
id doaj-37d37a4176d44e1f8754b29f69cd5dbe
record_format Article
spelling doaj-37d37a4176d44e1f8754b29f69cd5dbe
2021-05-05T22:57:59Z
eng
eLife Sciences Publications Ltd
eLife
2050-084X
2021-04-01
volume 10
doi 10.7554/eLife.65894
CEM500K, a large-scale heterogeneous unlabeled cellular electron microscopy image dataset for deep learning
Ryan Conrad
Kedar Narayan, https://orcid.org/0000-0001-7982-6494
Affiliation (both authors): Center for Molecular Microscopy, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, United States; Cancer Research Technology Program, Frederick National Laboratory for Cancer Research, Frederick, United States
collection DOAJ
language English
format Article
sources DOAJ
author Ryan Conrad
Kedar Narayan
title CEM500K, a large-scale heterogeneous unlabeled cellular electron microscopy image dataset for deep learning
publisher eLife Sciences Publications Ltd
series eLife
issn 2050-084X
publishDate 2021-04-01
description Automated segmentation of cellular electron microscopy (EM) datasets remains a challenge. Supervised deep learning (DL) methods that rely on region-of-interest (ROI) annotations yield models that fail to generalize to unrelated datasets. Newer unsupervised DL algorithms require relevant pre-training images; however, pre-training on currently available EM datasets is computationally expensive and shows little value for unseen biological contexts, as these datasets are large and homogeneous. To address this issue, we present CEM500K, a nimble 25 GB dataset of 0.5 × 10^6 unique 2D cellular EM images curated from nearly 600 three-dimensional (3D) and 10,000 two-dimensional (2D) images from >100 unrelated imaging projects. We show that models pre-trained on CEM500K learn features that are biologically relevant and resilient to meaningful image augmentations. Critically, we evaluate transfer learning from these pre-trained models on six publicly available and one newly derived benchmark segmentation task and report state-of-the-art results on each. We release the CEM500K dataset, pre-trained models and curation pipeline for model building and further expansion by the EM community. Data and code are available at https://www.ebi.ac.uk/pdbe/emdb/empiar/entry/10592/ and https://git.io/JLLTz.
topic electron microscopy
deep learning
segmentation
vEM
neural network
image dataset
url https://elifesciences.org/articles/65894