Tweet Collect: short text message collection using automatic query expansion and classification

The growing number of Twitter users creates large amounts of messages that contain valuable information for market research. These messages, called tweets, are short, contain Twitter-specific writing styles, and are often idiosyncratic, which gives rise to a vocabulary mismatch between typically chosen...


Bibliographic Details
Main Author:	Ward, Erik
Format:	Others
Language:	English
Published:	Uppsala universitet, Institutionen för informationsteknologi 2013
Online Access:	http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-194961

id	ndltd-UPSALLA1-oai-DiVA.org-uu-194961
record_format	oai_dc
spelling	ndltd-UPSALLA1-oai-DiVA.org-uu-194961 | 2013-02-20T15:58:30Z | Tweet Collect: short text message collection using automatic query expansion and classification | eng | Ward, Erik | Uppsala universitet, Institutionen för informationsteknologi | 2013 | Student thesis | info:eu-repo/semantics/bachelorThesis | text | http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-194961 | UPTEC IT, 1401-5749 ; 13 003 | application/pdf | info:eu-repo/semantics/openAccess
collection	NDLTD
language	English
format	Others
sources	NDLTD
description	The growing number of Twitter users creates large amounts of messages that contain valuable information for market research. These messages, called tweets, are short, contain Twitter-specific writing styles, and are often idiosyncratic, which gives rise to a vocabulary mismatch between the keywords typically chosen for tweet collection and the words used to describe television shows. A method is presented that uses a new form of query expansion that generates pairs of search terms and takes the language usage of Twitter into consideration, giving access to user data that would otherwise be missed. Supervised classification, without manually annotated data, is used to maintain precision by comparing collected tweets with external sources. The method is implemented in Java as the Tweet Collect system, using many processing steps to improve performance. The evaluation was carried out by collecting tweets about five different television shows during their time of airing. It indicates, on average, a 66.5% increase in the number of relevant tweets compared with using the title of the show as the search terms, at 68.0% total precision. Classification gives a slightly lower average increase of 55.2% in the number of tweets and a greatly increased total precision of 82.0%. The utility of an automatic system for tracking topics that can find additional keywords is demonstrated. Implementation considerations and possible improvements that can lead to improved performance are discussed.
author	Ward, Erik
spellingShingle	Ward, Erik Tweet Collect: short text message collection using automatic query expansion and classification
author_facet	Ward, Erik
author_sort	Ward, Erik
title	Tweet Collect: short text message collection using automatic query expansion and classification
title_short	Tweet Collect: short text message collection using automatic query expansion and classification
title_full	Tweet Collect: short text message collection using automatic query expansion and classification
title_fullStr	Tweet Collect: short text message collection using automatic query expansion and classification
title_full_unstemmed	Tweet Collect: short text message collection using automatic query expansion and classification
title_sort	tweet collect: short text message collection using automatic query expansion and classification
publisher	Uppsala universitet, Institutionen för informationsteknologi
publishDate	2013
url	http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-194961
work_keys_str_mv	AT warderik tweetcollectshorttextmessagecollectionusingautomaticqueryexpansionandclassification
_version_	1716578085162188800


Similar Items