Summary: | In recent years, public transit service disruptions have become more frequent because of aging infrastructure, system failures, and the regular need for maintenance. As usage of these transit networks grows, so does the need to detect such disruptions promptly. Any disruption in these networks can cause delays that have major implications for daily passengers.
Most current disruption detection systems either do not operate in real time or lack transit network coverage. This thesis leverages Twitter data to help detect service disruptions earlier. The work develops a pure data mining approach and two approaches based on Graph Neural Networks to identify transit disruption information in Tweets from a live Twitter stream related to the Washington Metropolitan Area Transit Authority (WMATA) metro system. After developing three models (a Dynamic Query Expansion model, a Tweet-GCN, and a Tweet-Level GCN) to represent the data corpus, we performed experiments and benchmark evaluations against existing baseline models to assess the efficacy of our approaches. The Tweet-GCN and Tweet-Level GCN achieved average accuracies of approximately 87.3% and 89.9%, respectively. From these results we conclude that these two graph neural models are not only superior for basic NLP text classification but also outperform the other models in identifying transit disruptions.
=== Master of Science ===
Millions of people worldwide rely on public transit networks for their daily commutes and day-to-day travel. As more people use these services, more daily passengers are affected by service disruptions. This thesis proposes and develops three approaches to aid the timely detection of these disruptions. We developed a pure data mining approach along with two deep learning models that use neural networks and live data from Twitter to identify these disruptions. The data mining approach uses a set of disruption-related input keywords to identify similar keywords within the live Twitter data.
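The keyword-based idea described above can be illustrated with a minimal sketch. It is not the Dynamic Query Expansion model developed in the thesis; it only assumes that "similar" keywords are those that most often co-occur with the seed keywords in the same Tweet. The function name `expand_keywords`, the stopword list, and the sample Tweets are all hypothetical.

```python
from collections import Counter

# Hypothetical stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "is", "at", "on", "to", "of", "and", "in", "for"}


def expand_keywords(seeds, tweets, top_k=2):
    """Return the seeds plus the top_k words that co-occur with them most often."""
    seeds = {s.lower() for s in seeds}
    counts = Counter()
    for tweet in tweets:
        tokens = set(tweet.lower().split())
        if tokens & seeds:  # the Tweet mentions at least one seed keyword
            counts.update(tokens - seeds - STOPWORDS)
    new_terms = [word for word, _ in counts.most_common(top_k)]
    return seeds | set(new_terms)


# Made-up Tweets, for illustration only.
tweets = [
    "red line delay at metro center",
    "delay on red line due to track problem",
    "single tracking causes delay on red line",
    "great coffee near metro center",
]
expanded = expand_keywords({"delay"}, tweets, top_k=2)
```

In this toy corpus the words "red" and "line" appear in every Tweet that mentions "delay", so they are the two terms added to the seed set.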
By collecting historical data, we were able to create deep learning models that represent the vocabulary of the disruption-related Tweets in the form of a graph. A graph is a collection of data points that are connected to one another based on their relationships. A longer chain of connections between two words indicates a weaker relationship; a shorter chain indicates a stronger one. In our graph, words with similar contextual meanings are connected over shorter distances than words with different meanings. Finally, we use a neural network as a classifier that scans this graph to learn the semantic relationships within our data. This learned information can then be used to accurately classify the disruption-related Tweets within a pool of random Tweets. Once all the proposed approaches were developed, we performed a benchmark evaluation against other existing text classification techniques to assess their effectiveness. The final results indicate that the proposed graph-based models achieved higher accuracy than the data mining model and also outperformed all the other baseline models. Our Tweet-Level GCN had the highest accuracy, 89.9%.
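The notion of a word graph in which a shorter chain means a stronger relationship can be made concrete with a small sketch. It does not reproduce the Tweet-GCN or Tweet-Level GCN; it assumes, purely for illustration, that two words are linked when they appear next to each other in a Tweet, and it measures chain length with a breadth-first search. The function names and sample Tweets are hypothetical.

```python
from collections import defaultdict, deque


def build_word_graph(tweets):
    """Link each pair of adjacent words (an assumed edge rule, for illustration)."""
    graph = defaultdict(set)
    for tweet in tweets:
        words = tweet.lower().split()
        for a, b in zip(words, words[1:]):
            if a != b:
                graph[a].add(b)
                graph[b].add(a)
    return graph


def chain_length(graph, start, goal):
    """Return the number of edges on the shortest path, or None if unconnected."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        word, dist = queue.popleft()
        for neighbor in graph.get(word, ()):
            if neighbor == goal:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


# Made-up Tweets, for illustration only.
tweets = [
    "train delay reported",
    "delay reported downtown",
    "sunny weather downtown",
]
graph = build_word_graph(tweets)
near = chain_length(graph, "train", "delay")
far = chain_length(graph, "train", "weather")
```

Here "train" and "delay" are directly linked, while "train" and "weather" are joined only through a longer chain of intermediate words, which in the terms used above indicates a weaker relationship.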
|