Systém pro sběr XML dat a metadat z Internetu
This diploma thesis designs and implements a system for collecting XML-family data from the Internet. The aim is to automate the data collection process and to download complete structures of XML documents. A comparison of four existing data collection systems took place at the be...
Main Author: | Sochna, Jan |
---|---|
Other Authors: | Žemlička, Michal |
Format: | Dissertation |
Language: | Czech |
Published: | 2010 |
Online Access: | http://www.nusl.cz/ntk/nusl-282045 |
id | ndltd-nusl.cz-oai-invenio.nusl.cz-282045 |
---|---|
record_format | oai_dc |
spelling | ndltd-nusl.cz-oai-invenio.nusl.cz-282045 2017-06-27T04:40:57Z Systém pro sběr XML dat a metadat z Internetu Collecting XML data and meta-data from the Internet Žemlička, Michal Sochna, Jan Bednárek, David This diploma thesis designs and implements a system for collecting XML-family data from the Internet. The aim is to automate the data collection process and to download complete structures of XML documents. Four existing data collection systems were first compared in order to choose one as the basis of the solution, and the open-source web crawler Apache Nutch was identified as the most suitable. The necessary extensions and modifications of the crawler were then designed and implemented to make it efficient at downloading XML-family documents. The downloaded XML-family data were analyzed and evaluated using the Analyzer application, which was enhanced within this thesis so that it could process the data. The main outcome of the thesis is a usable system for collecting XML-family documents from the Internet. The implemented modifications and extensions eliminate the download of "useless" documents and so improve the ratio of targeted XML-family documents. 2010 info:eu-repo/semantics/masterThesis http://www.nusl.cz/ntk/nusl-282045 cze info:eu-repo/semantics/restrictedAccess |
collection | NDLTD |
language | Czech |
format | Dissertation |
sources | NDLTD |
description | This diploma thesis designs and implements a system for collecting XML-family data from the Internet. The aim is to automate the data collection process and to download complete structures of XML documents. Four existing data collection systems were first compared in order to choose one as the basis of the solution, and the open-source web crawler Apache Nutch was identified as the most suitable. The necessary extensions and modifications of the crawler were then designed and implemented to make it efficient at downloading XML-family documents. The downloaded XML-family data were analyzed and evaluated using the Analyzer application, which was enhanced within this thesis so that it could process the data. The main outcome of the thesis is a usable system for collecting XML-family documents from the Internet. The implemented modifications and extensions eliminate the download of "useless" documents and so improve the ratio of targeted XML-family documents. |
author2 | Žemlička, Michal |
author_facet | Žemlička, Michal Sochna, Jan |
author | Sochna, Jan |
spellingShingle | Sochna, Jan Systém pro sběr XML dat a metadat z Internetu |
author_sort | Sochna, Jan |
title | Systém pro sběr XML dat a metadat z Internetu |
title_short | Systém pro sběr XML dat a metadat z Internetu |
title_full | Systém pro sběr XML dat a metadat z Internetu |
title_fullStr | Systém pro sběr XML dat a metadat z Internetu |
title_full_unstemmed | Systém pro sběr XML dat a metadat z Internetu |
title_sort | systém pro sběr xml dat a metadat z internetu |
publishDate | 2010 |
url | http://www.nusl.cz/ntk/nusl-282045 |
work_keys_str_mv | AT sochnajan systemprosberxmldatametadatzinternetu AT sochnajan collectingxmldataandmetadatafromtheinternet |
_version_ | 1718469127456161792 |