Automatic Detection of Section Title and Prose Text in HTML Documents Using Unsupervised and Supervised Learning

Bibliographic Details
Main Author: Mysore Gopinath, Abhijith Athreya
Language: English
Published: University of Cincinnati / OhioLINK 2018
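Subjects: Computer Science; HTML Structure Analysis; Natural Language Processing; Topicality Detection in HTML; Machine Learning; Privacy Policies
Online Access: http://rave.ohiolink.edu/etdc/view?acc_num=ucin1535371714338677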
Summary: Web documents are one of the most important sources of publicly available information, and researchers in need of textual data often scour the web for it. Most web documents organize their textual content into sections based on the topicality of the text. Each section contains two distinguishable parts: (1) the title, which summarizes the text that follows it, and (2) the text that follows the title, also known as prose text. Apart from its aesthetic appeal, this organization could be helpful in natural language processing (NLP) tasks such as question answering, information extraction, text summarization and text classification. The section title acts as an index or a quick summary of the prose content that follows it. Just like searching for information using a table of contents in a book, these indexes can be used to focus on content relevant to a search. Each section is lexically cohesive and, at the same time, lexically distinct from other sections.
Current methods of web text extraction are agnostic of these textual demarcations, as they cannot identify titles and prose text. One reason is the inherent difficulty in determining sections: two documents with the same appearance can be structured in many different ways, so a rule-based method may not work well across websites. The complex nesting of HTML tags and the copious presence of unrelated data further complicate processing.
In this thesis, we address the problem of automatically identifying section titles and prose text.
We developed two methods: an unsupervised, domain-independent approach and a supervised, domain-dependent approach. In the domain-independent approach, we use lexical and morphological features of the text to perform k-means clustering and identify title labels. Further techniques are then used to determine the prose text corresponding to each title. In the domain-dependent approach, we train a neural network classifier on dense word embeddings of title and prose text collected from a domain. The system produces a simplified version of the original HTML page that can be machine-read using simple rules. Along with these novel methods, we have also created a corpus of web documents containing privacy policies, terms of service agreements and miscellaneous web documents. This corpus includes both the original version and the simplified output of every HTML document.
To test our assumptions and methods, we used online privacy policies, terms of service agreements and miscellaneous web documents. We evaluated the models on two fronts: (1) the traditional precision, recall and F-1 scores for segment identification, and (2) a metric we name coverage, which measures the amount of the original legitimate text reproduced in the final output. The domain-independent approach achieved an overall precision of 0.82, recall of 0.98 and coverage of 0.97. The domain-dependent model achieved an accuracy of 0.99, recall of 0.75 and coverage of 0.93. These results demonstrate that our system is largely accurate and robust.
Rights: Unrestricted access. This thesis or dissertation is protected by copyright: all rights reserved. It may not be copied or redistributed beyond the terms of applicable copyright laws.
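The two evaluation fronts named in the summary can be illustrated with a minimal sketch. It is not taken from the thesis: the function names, the toy section titles and the token-overlap definition of coverage are all assumptions made for illustration, since the record does not state how coverage is actually computed.

```python
from collections import Counter


def precision_recall_f1(predicted, gold):
    """Score predicted segments against gold segments (exact-match sets)."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def coverage(original_text, output_text):
    """ASSUMED definition: share of original whitespace-separated tokens
    that are reproduced in the output, counted as a multiset overlap."""
    original = Counter(original_text.split())
    output = Counter(output_text.split())
    total = sum(original.values())
    if total == 0:
        return 0.0
    kept = sum((original & output).values())
    return kept / total


# Hypothetical toy data, not drawn from the thesis corpus.
gold_titles = {"Information We Collect", "Cookies", "Contact Us"}
predicted_titles = {"Information We Collect", "Cookies", "Contact Us",
                    "Last updated 2018"}

p, r, f1 = precision_recall_f1(predicted_titles, gold_titles)
cov = coverage("we collect your email address", "we collect your email")
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f} coverage={cov:.2f}")
```

In this toy case one spurious title lowers precision while recall stays perfect, which is the same direction of trade-off as the domain-independent figures reported above (high recall, lower precision).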