Summary: | Faced with the enormous volumes of data available today, we try to extract useful information by properly modeling and characterizing the data. In this thesis, we focus on one particular type of semantic data: online movie reviews, which can be found on all major movie websites. Our objective is to mine movie review data for quantifiable patterns between reviews of the same movie or reviews by the same reviewer. We present a novel approach to achieve this goal. The key idea is to convert a movie review text into a list of tuples, where each tuple contains four elements: a feature word, the category of the feature word, an opinion word, and the polarity of the opinion word. We then convert each tuple into an 18-dimensional vector. Given a multinomial distribution representing a movie review, we can systematically and consistently quantify the similarity and the dependence between reviews made by the same or different reviewers, using metrics including KL distance and distance correlation, respectively. Such comparisons allow us to find reviewers whose generated multinomial distributions are similar, or whose reviews exhibit correlation patterns to a certain extent. Among the identified pairs of frequent reviewers, we further investigate the category-wise dependency relationships between the two reviewers, which we capture with our proposed ordinary least squares estimation models. The proposed data processing approaches and the corresponding modeling framework could be further leveraged to develop classification, prediction, and common randomness extraction algorithms for semantic movie review data.
|
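The step from review tuples to a multinomial distribution, and the KL-distance comparison between two reviews, can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the category and polarity labels, the function names `review_distribution` and `kl_divergence`, the sample tuples, and the additive smoothing are all hypothetical, and the six-outcome encoding is a stand-in for the thesis's 18-dimensional vector, whose construction the summary does not specify.

```python
import math
from collections import Counter

# Hypothetical labels for illustration only; the summary does not list
# the actual categories or describe the 18-dimensional encoding.
CATEGORIES = ["plot", "acting", "visuals"]
POLARITIES = ["pos", "neg"]
OUTCOMES = [(c, p) for c in CATEGORIES for p in POLARITIES]


def review_distribution(tuples, alpha=1.0):
    """Estimate a multinomial over (category, polarity) outcomes.

    Each input tuple is (feature word, category, opinion word, polarity).
    Additive smoothing (alpha) is an assumption of this sketch; it keeps
    every probability positive so the KL distance stays finite.
    """
    counts = Counter((cat, pol) for _feature, cat, _opinion, pol in tuples)
    total = sum(counts[o] for o in OUTCOMES) + alpha * len(OUTCOMES)
    return [(counts[o] + alpha) / total for o in OUTCOMES]


def kl_divergence(p, q):
    """KL distance D(p || q) in nats between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Made-up review tuples, not data from the thesis.
review_a = [
    ("story", "plot", "gripping", "pos"),
    ("cast", "acting", "wooden", "neg"),
    ("effects", "visuals", "stunning", "pos"),
]
review_b = [
    ("story", "plot", "dull", "neg"),
    ("cast", "acting", "wooden", "neg"),
]

p = review_distribution(review_a)
q = review_distribution(review_b)
print(kl_divergence(p, q))
```

Note that KL distance is not symmetric, so `kl_divergence(p, q)` and `kl_divergence(q, p)` generally differ; whether the thesis symmetrizes it is not stated in the summary.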