Summary: | Microblogging is an increasingly popular form of social media, and Twitter is one of the most popular microblogging services. The number of messages posted to Twitter each day is extremely large, so it is hard for users to sort through them and find the ones that interest them. Twitter offers search mechanisms, but they are relatively simple and the results can be lacklustre. Through participation in the 2011 Text Retrieval Conference (TREC) Microblog Track, this thesis examines real-time ad hoc search using standard information retrieval approaches without microblog- or Twitter-specific modifications. It finds that pseudo-relevance feedback based upon a language model derived from Twitter posts, called tweets, used in conjunction with standard ranking methods, performs competitively with advanced retrieval systems as well as with microblog- and Twitter-specific retrieval systems. Furthermore, possible modifications, both Twitter-specific and otherwise, that could increase retrieval performance are discussed.
Twitter has also spawned an interesting phenomenon called hashtags. Twitter users add hashtags to denote that a message belongs to a particular topic or conversation. Unfortunately, tweets have a 140-character limit, so a tweet cannot always contain every relevant hashtag. Thus, Twitter users cannot easily find tweets that should contain a hashtag they are interested in but do not. This thesis investigates the problem in three ways using learning methods. First, learning methods are used to determine whether it is possible to discriminate between two topically different sets of tweets. Second, the thesis investigates whether tweets that lack a particular hashtag, but discuss the same topic as that hashtag, can be separated from random tweets. This case mimics the real-world scenario of users having to sift through random tweets to find tweets related to a topic they are interested in. The investigation is performed by removing hashtags from tweets and attempting to distinguish those tweets from random tweets. Finally, the thesis investigates whether topically similar tweets can also be distinguished based upon a sub-topic, in an almost identical manner to the second case.
This thesis finds that topically distinct tweets can be distinguished and, more importantly, that standard learning methods are able to determine that a tweet with a hashtag removed should have that hashtag. In addition, this hashtag reconstruction performs well even with very few examples of what a tweet with and without the particular hashtag should look like. This provides evidence that it may be possible to separate tweets a user may be interested in from random tweets using only the hashtags they are interested in. Furthermore, because hashtag presence was taken to be the ground truth in all experiments, the success of the hashtag reconstruction also provides evidence that users do not misuse or abuse hashtags. Finally, the applicability of the hashtag reconstruction results to the TREC Microblog Track and to a mobile application is presented.
|