Summary: | The Digital Library Research Laboratory (DLRL) has collected over 1.5 billion tweets
for the Integrated Digital Event Archiving and Library (IDEAL) and Global Event
Trend Archive Research (GETAR) projects. Researchers across varying disciplines
have an interest in leveraging DLRL's collections of tweets for their own analyses.
However, due to the steep learning curve involved with the required tools (Spark,
Scala, HBase, etc.), simply converting the Twitter data into a workable format can
be a cumbersome task in itself. This prompted the effort to build a framework that
helps developers write code to analyze the Twitter data, run that code on arbitrary
tweet collections, and reuse projects designed with this general use in mind. The
intent of this thesis work is to create an extensible framework of
tools and data structures to represent Twitter data at a higher level and eliminate the
need to work with raw text, so as to make the development of new analytics tools
faster, easier, and more efficient.
To represent this data, several data structures were designed to operate on top of
the Hadoop and Spark libraries of tools. The first set of data structures is an abstract
representation of a tweet at a basic level, along with several concrete implementations
that capture varying levels of detail, corresponding to common sources
of tweet data. The second major data structure is a collection structure designed to
represent collections of tweet data structures and provide ways to filter, clean, and
process the collections. All of these data structures went through an iterative design
process based on the needs of the developers.
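
As a rough illustration of the two kinds of data structures described above, the following minimal sketch uses plain Python rather than the Spark and Hadoop stack the framework is built on. Every name in it (Tweet, SimpleTweet, DetailedTweet, TweetCollection, and their fields and methods) is a hypothetical stand-in chosen for this example and is not taken from the framework itself; the cleaning step (dropping URLs and collapsing whitespace) is likewise an assumed placeholder.

```python
import re
from abc import ABC, abstractmethod
from dataclasses import dataclass, replace
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Tweet(ABC):
    """Abstract, basic view of a tweet (hypothetical interface)."""
    tweet_id: str
    text: str

    @abstractmethod
    def source(self) -> str:
        """Label for the kind of data source this tweet came from."""


@dataclass(frozen=True)
class SimpleTweet(Tweet):
    """Concrete tweet with only an ID and text (assumed minimal source)."""

    def source(self) -> str:
        return "text-only"


@dataclass(frozen=True)
class DetailedTweet(Tweet):
    """Concrete tweet with extra fields (assumed richer source)."""
    user: str = ""
    lang: str = ""

    def source(self) -> str:
        return "full-record"


def _clean_text(text: str) -> str:
    # Assumed cleaning rule: drop URLs, then collapse runs of whitespace.
    without_urls = re.sub(r"https?://\S+", "", text)
    return " ".join(without_urls.split())


class TweetCollection:
    """Named collection of tweets with filter, clean, and process steps."""

    def __init__(self, name: str, tweets: Iterable[Tweet]) -> None:
        self.name = name
        self.tweets: List[Tweet] = list(tweets)

    def filter(self, predicate: Callable[[Tweet], bool]) -> "TweetCollection":
        # Return a new collection; the original is left unchanged.
        return TweetCollection(self.name, [t for t in self.tweets if predicate(t)])

    def clean(self) -> "TweetCollection":
        # replace() copies each frozen tweet, preserving its concrete type.
        return TweetCollection(
            self.name, [replace(t, text=_clean_text(t.text)) for t in self.tweets]
        )

    def process(self, fn: Callable[[Tweet], object]) -> list:
        # Apply an arbitrary analysis function to every tweet.
        return [fn(t) for t in self.tweets]
```

A collection built this way could be cleaned and then passed to an analysis function, for example `TweetCollection("demo", tweets).clean().process(lambda t: len(t.text.split()))` to count words per tweet. Returning a new collection from each step, instead of mutating in place, is a design choice made for this sketch only.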
The effectiveness of this effort was demonstrated in four distinct case studies. In
the first case study, the framework was used to build a new tool that selects Twitter
data from DLRL's archive of tweets, cleans those tweets, and performs sentiment
analysis within the topics of a collection's topic model. The second case study applies
the provided tools to sociolinguistic studies. The third case
study explores large datasets by running all available analyses on them and accumulating the results.
The fourth case study builds metadata by expanding the shortened URLs contained
in the tweets and storing them as metadata about the collections. The framework
proved useful and cut development time in all four case studies. === Master of Science
|