Data mining in text streams using suffix trees

Data mining in text streams, or text stream mining, is an increasingly important topic for a number of reasons, including the recent explosion in the availability of textual data and an increasing need for people and organisations to process and understand as much of that information as possible...


Bibliographic Details
Main Author:	Snowsill, Tristan
Published:	University of Bristol 2012
Subjects:	006.312
Online Access:	http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.556708

id	ndltd-bl.uk-oai-ethos.bl.uk-556708
record_format	oai_dc
spelling	ndltd-bl.uk-oai-ethos.bl.uk-5567082015-03-20T05:43:47ZData mining in text streams using suffix treesSnowsill, Tristan2012Data mining in text streams, or text stream mining, is an increasingly important topic for a number of reasons, including the recent explosion in the availability of textual data and an increasing need for people and organisations to process and understand as much of that information as possible, from single users to multinational corporations and governments. In this thesis we present a data structure based on a generalised suffix tree which is capable of solving a number of text stream mining tasks. It can be used to detect changes in the text stream, detect when chunks of text are reused and detect events through identifying when the frequencies of phrases change in a statistically significant way. Suffix trees have been used for many years in the areas of combinatorial pattern matching and computational genomics. In this thesis we demonstrate how the suffix tree can become more widely applicable by making it possible to use suffix trees to analyse streams of data rather than static data sets, opening up a number of future avenues for research. The algorithms which we present are designed to be efficient in an on-line setting by having time complexity independent of the total amount of text seen and polynomial in the rate at which text is seen. We demonstrate the effectiveness of our methods on a large text stream comprising thousands of documents every day. This text stream is the stream of text news coming from over 600 online news outlets and the results obtained are of interest to news consumers, journalists and social scientists.006.312University of Bristolhttp://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.556708Electronic Thesis or Dissertation
collection	NDLTD
sources	NDLTD
topic	006.312
spellingShingle	006.312 Snowsill, Tristan Data mining in text streams using suffix trees
description	Data mining in text streams, or text stream mining, is an increasingly important topic for a number of reasons, including the recent explosion in the availability of textual data and an increasing need for people and organisations to process and understand as much of that information as possible, from single users to multinational corporations and governments. In this thesis we present a data structure based on a generalised suffix tree which is capable of solving a number of text stream mining tasks. It can be used to detect changes in the text stream, detect when chunks of text are reused and detect events through identifying when the frequencies of phrases change in a statistically significant way. Suffix trees have been used for many years in the areas of combinatorial pattern matching and computational genomics. In this thesis we demonstrate how the suffix tree can become more widely applicable by making it possible to use suffix trees to analyse streams of data rather than static data sets, opening up a number of future avenues for research. The algorithms which we present are designed to be efficient in an on-line setting by having time complexity independent of the total amount of text seen and polynomial in the rate at which text is seen. We demonstrate the effectiveness of our methods on a large text stream comprising thousands of documents every day. This text stream is the stream of text news coming from over 600 online news outlets and the results obtained are of interest to news consumers, journalists and social scientists.
author	Snowsill, Tristan
author_facet	Snowsill, Tristan
author_sort	Snowsill, Tristan
title	Data mining in text streams using suffix trees
title_short	Data mining in text streams using suffix trees
title_full	Data mining in text streams using suffix trees
title_fullStr	Data mining in text streams using suffix trees
title_full_unstemmed	Data mining in text streams using suffix trees
title_sort	data mining in text streams using suffix trees
publisher	University of Bristol
publishDate	2012
url	http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.556708
work_keys_str_mv	AT snowsilltristan dataminingintextstreamsusingsuffixtrees
_version_	1716793944550932480

