A Web Scraper For Forums : Navigation and text extraction methods

Web forums are a popular way of exchanging information and discussing various topics. These websites usually have a special structure, divided into boards, threads and posts. Although the structure might be consistent across forums, the layout of each forum is different. The way a web forum presents...

Bibliographic Details
Main Authors:	Palma, Michael; Zhou, Shidi
Format:	Others
Language:	English
Published:	KTH, Skolan för informations- och kommunikationsteknik (ICT) 2017
Subjects:	Data mining; Web Scraper; Java; Web forums; Text-extraction; Link Duplicates; Computer and Information Sciences; Data- och informationsvetenskap
Online Access:	http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-219903

id	ndltd-UPSALLA1-oai-DiVA.org-kth-219903
record_format	oai_dc
spelling	ndltd-UPSALLA1-oai-DiVA.org-kth-219903 | 2018-01-14T05:10:25Z | A Web Scraper For Forums : Navigation and text extraction methods | eng | Palma, Michael | Zhou, Shidi | KTH, Skolan för informations- och kommunikationsteknik (ICT) | 2017 | Data mining | Web Scraper | Java | Web forums | Text-extraction | Link Duplicates | Computer and Information Sciences | Data- och informationsvetenskap | Student thesis | info:eu-repo/semantics/bachelorThesis | text | http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-219903 | TRITA-ICT-EX ; 2017:98 | application/pdf | info:eu-repo/semantics/openAccess
collection	NDLTD
language	English
format	Others
sources	NDLTD
topic	Data mining; Web Scraper; Java; Web forums; Text-extraction; Link Duplicates; Computer and Information Sciences; Data- och informationsvetenskap
description	Web forums are a popular way of exchanging information and discussing various topics. These websites usually share a common structure, divided into boards, threads and posts. Although this structure tends to be consistent across forums, the layout of each forum is different. The way a web forum presents user posts also differs greatly from how a news website presents a single piece of information. All of this makes navigation and text extraction a hard task for web scrapers. The focus of this thesis is the development of a web scraper specialized in forums. Three text extraction methods are implemented and tested before the most appropriate one is chosen: Word Count, Text-Detection Framework and Text-to-Tag Ratio. The handling of duplicate links is also considered and is solved by implementing a multi-layer Bloom filter. The thesis applies a qualitative methodology. The results indicate that Text-to-Tag Ratio has the best overall performance and gives the most desirable results on web forums. It was therefore the method kept in the final version of the web scraper.
author	Palma, Michael; Zhou, Shidi
author_sort	Palma, Michael
title	A Web Scraper For Forums : Navigation and text extraction methods
publisher	KTH, Skolan för informations- och kommunikationsteknik (ICT)
publishDate	2017
url	http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-219903
_version_	1718609269416263680
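
The description above names Text-to-Tag Ratio as the extraction method kept in the final scraper but does not show how it works. The following is a minimal Java sketch of the general idea, not the authors' implementation: for each line of HTML, divide the number of non-tag characters by the number of tags, and keep lines whose ratio is high. The class name `TextToTagRatio`, the regex-based tag matcher, and the fixed threshold are all assumptions made for illustration; published variants of the technique typically smooth or cluster the ratios rather than using a fixed cutoff.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TextToTagRatio {
    // Naive tag matcher (assumption): treats anything between '<' and '>' as a tag.
    // A real scraper would use an HTML parser instead of a regex.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    /** Non-tag characters divided by tag count; the text length itself if there are no tags. */
    public static double ratio(String line) {
        Matcher m = TAG.matcher(line);
        int tags = 0;
        while (m.find()) {
            tags++;
        }
        int textChars = TAG.matcher(line).replaceAll("").trim().length();
        return tags == 0 ? textChars : (double) textChars / tags;
    }

    /** Keeps the text of every line whose ratio reaches the (hypothetical) threshold. */
    public static List<String> extract(String html, double threshold) {
        List<String> kept = new ArrayList<>();
        for (String line : html.split("\n")) {
            if (ratio(line) >= threshold) {
                String text = TAG.matcher(line).replaceAll("").trim();
                if (!text.isEmpty()) {
                    kept.add(text);
                }
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String html = "<div><a href=\"/\">Home</a></div>\n"
                + "<p>This is the body of a forum post.</p>\n"
                + "<li><a href=\"/faq\">FAQ</a></li>";
        // Navigation lines have many tags and little text, so they fall below the cutoff.
        System.out.println(extract(html, 5.0)); // prints [This is the body of a forum post.]
    }
}
```

Lines dominated by markup, such as menus and link lists, score low, while post bodies score high, which is why a ratio of this kind can suit forum pages where many posts share one layout.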

Similar Items