Data mining in text streams using suffix trees

Data mining in text streams, or text stream mining, is an increasingly important topic for a number of reasons, including the recent explosion in the availability of textual data and an increasing need for people and organisations to process and understand as much of that information as possible...

Bibliographic Details
Main Author: Snowsill, Tristan
Published: University of Bristol 2012
Online Access: http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.556708
Format: Electronic Thesis or Dissertation
Collection: NDLTD
Topic: 006.312
Description: Data mining in text streams, or text stream mining, is an increasingly important topic for a number of reasons, including the recent explosion in the availability of textual data and an increasing need for people and organisations to process and understand as much of that information as possible, from single users to multinational corporations and governments. In this thesis we present a data structure based on a generalised suffix tree which is capable of solving a number of text stream mining tasks. It can be used to detect changes in the text stream, detect when chunks of text are reused and detect events through identifying when the frequencies of phrases change in a statistically significant way. Suffix trees have been used for many years in the areas of combinatorial pattern matching and computational genomics. In this thesis we demonstrate how the suffix tree can become more widely applicable by making it possible to use suffix trees to analyse streams of data rather than static data sets, opening up a number of future avenues for research. The algorithms which we present are designed to be efficient in an on-line setting by having time complexity independent of the total amount of text seen and polynomial in the rate at which text is seen. We demonstrate the effectiveness of our methods on a large text stream comprising thousands of documents every day. This text stream is the stream of text news coming from over 600 online news outlets and the results obtained are of interest to news consumers, journalists and social scientists.