Um modelo temporal-relacional para classificação de documentos (A temporal-relational model for document classification)

Automatic Document Classification (ADC) is one of the most relevant and challenging research problems in Information Retrieval. Despite the large number of ADC techniques already proposed, few of them take into consideration characteristics of the human language. As discussed in recent studies...

Bibliographic Details
Main Author: Fernando Henrique de Jesus Mourao
Other Authors: Wagner Meira Junior
Format: Others
Language: Portuguese
Published: Universidade Federal de Minas Gerais 2009
Online Access: http://hdl.handle.net/1843/SLSS-7Z8MWL
id ndltd-IBICT-oai-bibliotecadigital.ufmg.br-MTD2BR-SLSS-7Z8MWL
record_format oai_dc
spelling ndltd-IBICT-oai-bibliotecadigital.ufmg.br-MTD2BR-SLSS-7Z8MWL 2019-01-21T18:06:39Z
Um modelo temporal-relacional para classificação de documentos
Fernando Henrique de Jesus Mourao; Wagner Meira Junior; Altigran Soares da Silva; Edleno Silva de Moura; Marcos Andre Goncalves

Automatic Document Classification (ADC) is one of the most relevant and challenging research problems in Information Retrieval. Despite the large number of ADC techniques already proposed, few of them take into consideration characteristics of the human language. As discussed in recent studies [Montejo-Raez et al., 2008; Chen, 1995], understanding and considering such characteristics may benefit ADC. Therefore, in this work we propose a new network-based representation for textual documents that is based on fundamental concepts of Linguistics, in particular those associated with relationships between terms. Using the proposed model, we also introduce a relational algorithm for ADC which exploits such relationships. Experimental evaluation of this algorithm shows that it achieves results comparable to SVM on four real datasets. In addition, its simplicity, execution efficiency and simple parameter tuning make our algorithm an interesting alternative to SVM.

A deeper analysis also shows that there are several dimensions in which relational algorithms may be enhanced. Due to its relevance, particular attention is given to the temporal dimension. In fact, changes occur spontaneously at every moment, affecting the settings and observations previously made on the term network. Considering this evolving behavior may be very useful in the area of Information Retrieval [Alonso et al., 2007]. In order to incorporate the temporal dimension into our algorithm, we attach to every relationship of our network information about the moment of its construction. The evaluation of simple temporal versions of the proposed algorithm showed that considering the temporal evolution improved the performance of our relational classifier by providing more accurate information about the behavior of each term. A preliminary assessment of other dimensions of analysis, such as information scarcity and the use of attributes of relationships, also showed that more elaborate techniques to address such dimensions may benefit the proposed algorithm. Further, considering the generality of the linguistic concepts incorporated in this work, we believe that our proposal may be equally successful in various ADC application domains.

Portuguese abstract, translated: Automatic Document Classification (ADC) is one of the most relevant research problems in Information Retrieval. Despite the large number of existing techniques and the importance of characteristics of human language, few techniques take such characteristics into account. In this work we therefore propose a document representation, in the form of a term network, based on linguistic concepts of relationships between terms. Using this representation, we present a relational algorithm for ADC. Experimental evaluations of this algorithm show results comparable to SVM on four real datasets. A detailed analysis also showed that considering the temporal evolution of language can improve our algorithm. Simple temporal versions of the proposed algorithm were able to improve the performance of our classifier. In addition, its simplicity and execution efficiency make our algorithm an interesting alternative to SVM.

2009-11-23 info:eu-repo/semantics/publishedVersion info:eu-repo/semantics/masterThesis http://hdl.handle.net/1843/SLSS-7Z8MWL por info:eu-repo/semantics/openAccess text/html Universidade Federal de Minas Gerais 32001010004P6 - CIÊNCIA DA COMPUTAÇÃO UFMG BR reponame:Biblioteca Digital de Teses e Dissertações da UFMG instname:Universidade Federal de Minas Gerais instacron:UFMG
collection NDLTD
sources NDLTD