SparkBLAST: utilização da ferramenta Apache Spark para a execução do BLAST em ambiente distribuído e escalável [SparkBLAST: using the Apache Spark tool to run BLAST in a distributed and scalable environment]


Bibliographic Details
Main Author: Castro, Marcelo Rodrigo de
Other Authors: Senger, Hermes
Language: Portuguese
Published: Universidade Federal de São Carlos 2017
Subjects:
Online Access: https://repositorio.ufscar.br/handle/ufscar/9114
Description
Summary: Submitted by Aelson Maciera (aelsoncm@terra.com.br) on 2017-09-06T18:32:40Z. No. of bitstreams: 1. DissMRC.pdf: 1562148 bytes, checksum: 9921840ad67ef82d956e399ab96dd78c (MD5) === Approved for entry into archive by Ronildo Prado (ronisp@ufscar.br) on 2017-09-25T16:56:27Z (GMT) and on 2017-09-25T16:56:34Z (GMT) === Made available in DSpace on 2017-09-25T17:05:03Z (GMT). Previous issue date: 2017-02-13 === Other === Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) === Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) === Fundação de Amparo à Pesquisa do Estado do Rio de Janeiro (FAPERJ) === With the evolution of next-generation sequencing devices, the cost of obtaining genomic data has fallen significantly. With lower sequencing costs, the amount of genomic data to be processed has increased exponentially. This data growth outpaces the rate at which computing power increases year after year through hardware and software evolution. The higher rate of data growth in bioinformatics therefore raises the need for more efficient and scalable techniques based on parallel and distributed processing, including platforms such as clusters and cloud computing. BLAST is a widely used tool for genomic sequence alignment, and it has native support for multicore-based parallel processing. However, its scalability is limited to a single machine. Cloud computing, on the other hand, has emerged as an important technology for supporting rapid and elastic provisioning of large amounts of resources. 
Current frameworks such as Apache Hadoop and Apache Spark support the execution of distributed applications. These environments provide mechanisms for embedding external applications in order to compose large distributed jobs that can be executed on clusters and cloud platforms. In this work, we used Spark to support a highly scalable and efficient parallelization of BLAST (Basic Local Alignment Search Tool) that runs on dozens to hundreds of processing cores on a cloud platform. As a result, our prototype demonstrated better performance and scalability than CloudBLAST, a Hadoop-based parallelization of BLAST. === With falling costs and the evolution of the mechanisms that perform genome sequencing, there has been a large increase in the amount of data from genomics studies. This data has grown at higher rates than the industry has managed to increase computing power each year. To better meet the need for data processing and analysis in bioinformatics, parallel and distributed systems are used, for example clusters, grids, and computational clouds. However, many tools, such as BLAST, which align sequences against databases, were not developed to be processed in a distributed and scalable way. The current frameworks Apache Hadoop and Apache Spark allow applications to run in a distributed and parallel way, provided that the applications can be properly adapted and parallelized. Studies aimed at improving the performance of bioinformatics applications have become a continuous effort. Spark has proven to be a robust tool for massive data processing. In this master's research, Apache Spark was used to support the parallelization of the BLAST tool (Basic Local Alignment Search Tool). 
Experiments carried out on the Google Cloud and Microsoft Azure clouds show that the performance (speedup) obtained was similar to or better than that of comparable work already developed on Hadoop.
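
The summary describes embedding an external application (BLAST) in a larger distributed job. The record does not say how the thesis does this, so the following is only a minimal single-machine sketch of the general scatter/gather pattern, written in plain Python rather than Spark: split the query sequences into partitions, pipe each partition through an external command, and concatenate the outputs. In Spark, one mechanism of this kind is the `RDD.pipe` transformation; whether the thesis uses it is not stated here. The names `split_fasta`, `run_external`, `scatter_gather`, and `STAND_IN_COMMAND` are hypothetical, and the stand-in command only echoes query identifiers; it is not BLAST.

```python
import math
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Assumption: a stand-in for the external tool. It reads FASTA on stdin and
# prints the identifier of each query. A real run would use a BLAST command.
STAND_IN_COMMAND = [
    sys.executable,
    "-c",
    "import sys\n"
    "for line in sys.stdin:\n"
    "    if line.startswith('>'):\n"
    "        print(line[1:].split()[0])\n",
]


def split_fasta(text, n_parts):
    """Split FASTA text into at most n_parts contiguous chunks of whole records."""
    records = []
    for line in text.splitlines():
        if line.startswith(">"):
            records.append([line])          # header line starts a new record
        elif records and line.strip():
            records[-1].append(line)        # sequence line of the current record
    if not records:
        return []
    size = math.ceil(len(records) / n_parts)
    chunks = [records[i:i + size] for i in range(0, len(records), size)]
    return ["\n".join("\n".join(r) for r in chunk) + "\n" for chunk in chunks]


def run_external(command, chunk):
    """Pipe one chunk through the external command and return its stdout."""
    result = subprocess.run(
        command, input=chunk, capture_output=True, text=True, check=True
    )
    return result.stdout


def scatter_gather(fasta_text, command, n_parts=4):
    """Run the command on each chunk in parallel and concatenate outputs in order."""
    chunks = split_fasta(fasta_text, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        outputs = list(pool.map(lambda c: run_external(command, c), chunks))
    return "".join(outputs)


if __name__ == "__main__":
    queries = ">q1 first\nACGT\n>q2\nGG\nCC\n>q3\nTTTT\n"
    print(scatter_gather(queries, STAND_IN_COMMAND, n_parts=2))
```

Splitting on record boundaries matters because each query must reach the external tool intact. Contiguous chunks keep the merged output in the same order as the input queries. In a cluster setting the chunks would be distributed across worker nodes instead of local threads.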