Corpus and sentiment analysis
Information extraction/retrieval has been of interest to researchers since the early 1960s. A series of conferences and competitions held by DARPA/NIST since the late 1980s has resulted in the analysis of news reports and government reports in English and other languages, notab...
Main Author: | Cheng, Tai Wai David |
---|---|
Published: | University of Surrey, 2007 |
Subjects: | 006.35 |
Online Access: | http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.442666 |
id |
ndltd-bl.uk-oai-ethos.bl.uk-442666 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-bl.uk-oai-ethos.bl.uk-442666; 2015-08-04T03:33:15Z; Corpus and sentiment analysis; Cheng, Tai Wai David; 2007; 006.35; University of Surrey; http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.442666; http://epubs.surrey.ac.uk/2744/; Electronic Thesis or Dissertation |
collection |
NDLTD |
sources |
NDLTD |
topic |
006.35 |
description |
Information extraction/retrieval has been of interest to researchers since the early 1960s. A series of conferences and competitions held by DARPA/NIST since the late 1980s has resulted in the analysis of news reports and government reports in English and other languages, notably Chinese and Arabic. A number of methods have been developed for analysing 'free' natural language texts. Furthermore, a number of systems for understanding messages have been developed, focusing on named entity extraction and on templates for dealing with certain kinds of news. The templates were handcrafted, and a lot of ad hoc knowledge went into the creation of such systems. Seven of these systems have been reviewed. Although IE systems built for different tasks often differ from each other, their core elements are shared by nearly every extraction system. Some of these core elements, such as the parser and the part-of-speech (POS) tagger, are tuned for optimal performance in a specific domain or on text with pre-defined structures. The extensive use of gazetteers and manually crafted grammar rules further limits the portability of existing IE systems across languages and domains. The goal of this thesis is to develop an algorithm that can extract information unambiguously from free texts, in our case financial news text, and from arbitrary domains. We believe that corpus linguistics and statistical techniques are more appropriate and efficient for this task than approaches that rely on machine learning, POS taggers, parsers, and so on, which are tuned to work for a predefined domain. Based on this belief, a framework that uses corpus linguistics and statistical techniques to extract information as unambiguously as possible from arbitrary domains was developed.
A contrastive evaluation has been carried out not only in the domains of financial texts and movie reviews, but also with multilingual texts (Chinese and English). The results are encouraging. Our preliminary evaluation, based on the correlation between a time series of positive (negative) sentiment word (phrase) counts and a time series of indices produced by stock exchanges (Financial Times Stock Exchange, Dow Jones Industrial Average, Nasdaq, S&P 500, Hang Seng Index, Shanghai Index, and Shenzhen Index), showed that when the positive (negative) sentiment series correlates with the stock exchange index, the negative (positive) series shows a smaller degree of correlation and in many cases a degree of anti-correlation. Any interpretation of our result requires a careful, econometrically well-grounded analysis of the financial time series; this is beyond the scope of this work. |
author |
Cheng, Tai Wai David |
title |
Corpus and sentiment analysis |
publisher |
University of Surrey |
publishDate |
2007 |
url |
http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.442666 |
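
The evaluation described in the abstract correlates a time series of sentiment word counts with a time series of stock index values. A minimal sketch of that idea follows. It is not the thesis's implementation: the word lists `POSITIVE` and `NEGATIVE`, the helper names `daily_counts` and `pearson`, the choice of the Pearson coefficient, and the toy news and index values are all assumptions made up for illustration.

```python
from math import sqrt

# Hypothetical mini-lexicons; the abstract does not give the thesis's word lists.
POSITIVE = {"rise", "gain", "profit", "strong"}
NEGATIVE = {"fall", "loss", "weak", "drop"}


def daily_counts(docs_by_day, lexicon):
    """Count lexicon hits per day; docs_by_day holds one text string per day."""
    counts = []
    for text in docs_by_day:
        tokens = text.lower().split()
        counts.append(sum(1 for token in tokens if token in lexicon))
    return counts


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have equal length of at least 2")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Toy data, invented for this example only (not results from the thesis).
news = [
    "shares rise on strong profit",
    "markets fall after loss",
    "gain gain gain strong",
    "weak data drop",
]
index_values = [100.0, 98.0, 103.0, 97.0]

pos_series = daily_counts(news, POSITIVE)
neg_series = daily_counts(news, NEGATIVE)

print("positive vs index:", pearson(pos_series, index_values))
print("negative vs index:", pearson(neg_series, index_values))
```

On this toy input the positive series moves with the index and the negative series moves against it, which mirrors the kind of correlation and anti-correlation pattern the abstract reports. As the abstract itself cautions, reading anything into such figures on real data requires proper econometric analysis.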