Parallel data analysis for atmospheric science

Data sizes are growing in atmospheric science, as climate models increase to higher resolutions to improve the representation of atmospheric phenomena, and larger numbers of ensemble members are used so as to better capture the variability in the atmosphere. New methods need to be developed to handl...

Full description

Bibliographic Details
Main Author: Jones, Matthew
Published: University of Reading 2018
Online Access:https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.749359
id ndltd-bl.uk-oai-ethos.bl.uk-749359
record_format oai_dc
spelling ndltd-bl.uk-oai-ethos.bl.uk-7493592018-09-11T03:22:54ZParallel data analysis for atmospheric scienceJones, Matthew2018Data sizes are growing in atmospheric science, as climate models increase to higher resolutions to improve the representation of atmospheric phenomena, and larger numbers of ensemble members are used so as to better capture the variability in the atmosphere. New methods need to be developed to handle the increasing size of data – traditional analysis scripts often inefficiently read and process data, leading to excessive analysis times. Research into large data analysis often focuses on providing solutions in the form of software, or hardware, rather than providing quantitative results on what factors can reduce performance in an application. This thesis quantitatively investigates these factors in the software-hardware stack, in order to make decisions how to handle large data sizes during application development and data management. This is done in the context of an atmospheric science workflow in a high-performance computing environment. A major bottleneck in analysis in atmospheric science is reading data. Two of the primary factors which are commonly known to affect the read time are the read pattern, and the read size. These factors are found in this work to reduce the read rate by up to 10-50 times for poor combinations. Other factors which could affect the read rate for atmospheric analysis include: the programming language, the libraries used, and the file layout. NetCDF4 is one of the most commonly used data formats in atmospheric science, and the Python library netCDF4-python is one of the main interfaces used. As part of the NetCDF4 file format, there are options for chunking (multidimensional tiling), and inbuilt compression, which can be used to improve read and write performance from the files. It was found that at peak performance the netCDF4-python library performs 40% worse than the underlying C NetCDF4 library. With respect to chunking and compression, poor combinations of chunking, and inbuilt compression, were found to reduce the performance by over 100 times. One solution to reduced performance, or a way to reduce analysis times on large datasets, is to run applications in parallel. It is important to understand how, on a particular platform, application relevant parallel reads will scale in order design an efficient application. The parallel scaling of the JASMIN super-data cluster was analysed. The investigation methodology, and conclusions from the investigation can be applied to other platforms. A case study was used to apply the results from this work in a real atmospheric science workflow – a space-time spectral analysis technique. It confirmed that these results do indeed apply to real workflows.University of Readinghttps://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.749359http://centaur.reading.ac.uk/77840/Electronic Thesis or Dissertation
collection NDLTD
sources NDLTD
description Data sizes are growing in atmospheric science, as climate models increase to higher resolutions to improve the representation of atmospheric phenomena, and larger numbers of ensemble members are used so as to better capture the variability in the atmosphere. New methods need to be developed to handle the increasing size of data – traditional analysis scripts often inefficiently read and process data, leading to excessive analysis times. Research into large data analysis often focuses on providing solutions in the form of software, or hardware, rather than providing quantitative results on what factors can reduce performance in an application. This thesis quantitatively investigates these factors in the software-hardware stack, in order to make decisions how to handle large data sizes during application development and data management. This is done in the context of an atmospheric science workflow in a high-performance computing environment. A major bottleneck in analysis in atmospheric science is reading data. Two of the primary factors which are commonly known to affect the read time are the read pattern, and the read size. These factors are found in this work to reduce the read rate by up to 10-50 times for poor combinations. Other factors which could affect the read rate for atmospheric analysis include: the programming language, the libraries used, and the file layout. NetCDF4 is one of the most commonly used data formats in atmospheric science, and the Python library netCDF4-python is one of the main interfaces used. As part of the NetCDF4 file format, there are options for chunking (multidimensional tiling), and inbuilt compression, which can be used to improve read and write performance from the files. It was found that at peak performance the netCDF4-python library performs 40% worse than the underlying C NetCDF4 library. With respect to chunking and compression, poor combinations of chunking, and inbuilt compression, were found to reduce the performance by over 100 times. One solution to reduced performance, or a way to reduce analysis times on large datasets, is to run applications in parallel. It is important to understand how, on a particular platform, application relevant parallel reads will scale in order design an efficient application. The parallel scaling of the JASMIN super-data cluster was analysed. The investigation methodology, and conclusions from the investigation can be applied to other platforms. A case study was used to apply the results from this work in a real atmospheric science workflow – a space-time spectral analysis technique. It confirmed that these results do indeed apply to real workflows.
author Jones, Matthew
spellingShingle Jones, Matthew
Parallel data analysis for atmospheric science
author_facet Jones, Matthew
author_sort Jones, Matthew
title Parallel data analysis for atmospheric science
title_short Parallel data analysis for atmospheric science
title_full Parallel data analysis for atmospheric science
title_fullStr Parallel data analysis for atmospheric science
title_full_unstemmed Parallel data analysis for atmospheric science
title_sort parallel data analysis for atmospheric science
publisher University of Reading
publishDate 2018
url https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.749359
work_keys_str_mv AT jonesmatthew paralleldataanalysisforatmosphericscience
_version_ 1718732791215030272