Systém pro sběr XML dat a metadat z Internetu (Collecting XML data and meta-data from the Internet)

This diploma thesis aims to design and implement a system for collecting XML-family data from the Internet. The goal is to automate the data collection process and to download the full structures of XML documents. A comparison of four existing data collection systems took place at the be...


Bibliographic Details
Main Author: Sochna, Jan
Other Authors: Žemlička, Michal
Format: Dissertation
Language: Czech
Published: 2010
Online Access: http://www.nusl.cz/ntk/nusl-282045
id ndltd-nusl.cz-oai-invenio.nusl.cz-282045
record_format oai_dc
spelling ndltd-nusl.cz-oai-invenio.nusl.cz-282045 2017-06-27T04:40:57Z Systém pro sběr XML dat a metadat z Internetu Collecting XML data and meta-data from the Internet Žemlička, Michal Sochna, Jan Bednárek, David This diploma thesis aims to design and implement a system for collecting XML-family data from the Internet. The goal is to automate the data collection process and to download the full structures of XML documents. Four existing data collection systems were first compared in order to choose one as the basis of the solution; the open-source web crawler Apache Nutch was identified as the most suitable. The necessary extensions and modifications of the crawler were then designed and implemented to make it efficient at downloading XML-family documents. The downloaded XML-family data were analyzed and evaluated using the Analyzer application, which was enhanced within this thesis in order to process the data. The main outcome of the thesis is a usable system that collects XML-family documents from the Internet. The implemented modifications and extensions eliminate the download of "useless" documents, improving the ratio of targeted XML-family documents. 2010 info:eu-repo/semantics/masterThesis http://www.nusl.cz/ntk/nusl-282045 cze info:eu-repo/semantics/restrictedAccess
collection NDLTD
language Czech
format Dissertation
sources NDLTD
description This diploma thesis aims to design and implement a system for collecting XML-family data from the Internet. The goal is to automate the data collection process and to download the full structures of XML documents. Four existing data collection systems were first compared in order to choose one as the basis of the solution; the open-source web crawler Apache Nutch was identified as the most suitable. The necessary extensions and modifications of the crawler were then designed and implemented to make it efficient at downloading XML-family documents. The downloaded XML-family data were analyzed and evaluated using the Analyzer application, which was enhanced within this thesis in order to process the data. The main outcome of the thesis is a usable system that collects XML-family documents from the Internet. The implemented modifications and extensions eliminate the download of "useless" documents, improving the ratio of targeted XML-family documents.
author2 Žemlička, Michal
author_facet Žemlička, Michal
Sochna, Jan
author Sochna, Jan
spellingShingle Sochna, Jan
Systém pro sběr XML dat a metadat z Internetu
author_sort Sochna, Jan
title Systém pro sběr XML dat a metadat z Internetu
title_short Systém pro sběr XML dat a metadat z Internetu
title_full Systém pro sběr XML dat a metadat z Internetu
title_fullStr Systém pro sběr XML dat a metadat z Internetu
title_full_unstemmed Systém pro sběr XML dat a metadat z Internetu
title_sort systém pro sběr xml dat a metadat z internetu
publishDate 2010
url http://www.nusl.cz/ntk/nusl-282045
work_keys_str_mv AT sochnajan systemprosberxmldatametadatzinternetu
AT sochnajan collectingxmldataandmetadatafromtheinternet
_version_ 1718469127456161792