Systém pro sběr XML dat a metadat z Internetu (Collecting XML data and meta-data from the Internet)

This diploma thesis aims to design and implement a system for collecting XML-family data from the Internet. The goal is to automate the data collection process and download the full structures of XML documents. A comparison of four existing data collection systems took place at the be...

Bibliographic Details
Main Author:	Sochna, Jan
Other Authors:	Žemlička, Michal
Format:	Dissertation
Language:	Czech
Published:	2010
Online Access:	http://www.nusl.cz/ntk/nusl-282045

id	ndltd-nusl.cz-oai-invenio.nusl.cz-282045
record_format	oai_dc
spelling	ndltd-nusl.cz-oai-invenio.nusl.cz-2820452017-06-27T04:40:57Z Systém pro sběr XML dat a metadat z Internetu Collecting XML data and meta-data from the Internet Žemlička, Michal Sochna, Jan Bednárek, David The Diploma Thesis is targeted to design and implement the system for collecting XML-family data from the Internet. The aim of the task is to automate the data collection process and download full structures of XML documents. A comparison of four existing data collection systems took place at the beginning to choose one of the systems as a base of the solution. The open source web crawler Apache Nutch was identified as the most suitable. Then necessary extensions and modifications of the crawler were designed and implemented in order to make the crawler efficient in downloading XML-family documents. Downloaded XML-family data were analyzed and evaluated using the Analyzer application, which was enhanced within this Diploma Thesis in order to process the data. The main outcome of Diploma Thesis is an exploitable system collecting the XML-family documents from the Internet. Implemented modification and extensions of the system lead to elimination of "useless" documents download, improving the ratio targeted XML-family documents. 2010 info:eu-repo/semantics/masterThesis http://www.nusl.cz/ntk/nusl-282045 cze info:eu-repo/semantics/restrictedAccess
collection	NDLTD
language	Czech
format	Dissertation
sources	NDLTD
description	This diploma thesis aims to design and implement a system for collecting XML-family data from the Internet. The goal is to automate the data collection process and download the full structures of XML documents. Four existing data collection systems were first compared in order to choose one as the basis of the solution; the open-source web crawler Apache Nutch was identified as the most suitable. The necessary extensions and modifications of the crawler were then designed and implemented to make it efficient at downloading XML-family documents. The downloaded XML-family data were analyzed and evaluated using the Analyzer application, which was enhanced within this thesis in order to process the data. The main outcome of the thesis is a usable system that collects XML-family documents from the Internet. The implemented modifications and extensions eliminate the download of "useless" documents, improving the ratio of targeted XML-family documents.
author2	Žemlička, Michal
author_facet	Žemlička, Michal Sochna, Jan
author	Sochna, Jan
spellingShingle	Sochna, Jan Systém pro sběr XML dat a metadat z Internetu
author_sort	Sochna, Jan
title	Systém pro sběr XML dat a metadat z Internetu
title_short	Systém pro sběr XML dat a metadat z Internetu
title_full	Systém pro sběr XML dat a metadat z Internetu
title_fullStr	Systém pro sběr XML dat a metadat z Internetu
title_full_unstemmed	Systém pro sběr XML dat a metadat z Internetu
title_sort	systém pro sběr xml dat a metadat z internetu
publishDate	2010
url	http://www.nusl.cz/ntk/nusl-282045
work_keys_str_mv	AT sochnajan systemprosberxmldatametadatzinternetu AT sochnajan collectingxmldataandmetadatafromtheinternet
_version_	1718469127456161792
