Automatic Detection of Section Title and Prose Text in HTML Documents Using Unsupervised and Supervised Learning
Main Author: | Mysore Gopinath, Abhijith Athreya |
---|---|
Language: | English |
Published: | University of Cincinnati / OhioLINK, 2018 |
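Subjects: | Computer Science; HTML Structure Analysis; Natural Language Processing; Topicality Detection in HTML; Machine Learning; Privacy Policies |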
Online Access: | http://rave.ohiolink.edu/etdc/view?acc_num=ucin1535371714338677 |
id |
ndltd-OhioLink-oai-etd.ohiolink.edu-ucin1535371714338677 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-OhioLink-oai-etd.ohiolink.edu-ucin1535371714338677 2021-08-03T07:08:18Z
Automatic Detection of Section Title and Prose Text in HTML Documents Using Unsupervised and Supervised Learning
Mysore Gopinath, Abhijith Athreya
Computer Science; HTML Structure Analysis; Natural Language Processing; Topicality Detection in HTML; Machine Learning; Privacy Policies
Web documents are one of the most important sources of publicly available information, and researchers in need of textual data often scour the web for it. Most web documents organize their textual content into sections based on the topicality of the text. Each section contains two distinguishable parts: (1) the title, which summarizes the text that follows it, and (2) the text that follows the title, also known as prose text. Apart from its aesthetic appeal, this organization could be helpful in natural language processing (NLP) tasks such as question answering, information extraction, text summarization and text classification. The section title acts as an index or a quick summary of the prose content that follows it. Just like searching for information using a table of contents in a book, these indexes can be used to focus on content relevant to a search. Each section is lexically cohesive and, at the same time, lexically distinct from other sections.
Current methods of web text extraction are agnostic of these textual demarcations, as they cannot identify titles and prose text. One reason is the inherent difficulty in determining sections: two documents with the same appearance can be structured in many different ways, so a rule-based method may not work well across websites. The complex nesting of HTML tags and the copious presence of unrelated data also complicate processing.
In this thesis, we solve the problem of automatic identification of section titles and prose text.
We developed two methods: an unsupervised domain-independent approach and a supervised domain-dependent approach. In the domain-independent approach, we use lexical and morphological features of the text to perform k-means clustering to identify title labels. Further techniques are then used to determine the corresponding prose text for each title. In the domain-dependent approach, we train a neural network classifier on the dense word embeddings of title and prose text collected from a domain. The system produces a simplified output of the original HTML page which can be machine-read using simple rules. Along with these novel methods, we have also created a corpus of web documents containing privacy policies, terms of service agreements and miscellaneous web documents. This corpus includes both the original version and the simplified output of all HTML documents.
To test our assumptions and methods, we used online privacy policies, terms of service agreements and miscellaneous web documents. We evaluated the models on two fronts: (1) the traditional precision, recall and F1 scores for segment identification, and (2) a metric we name coverage, which measures the amount of the original legitimate text reproduced in the final output. The domain-independent approach achieved an overall precision of 0.82, recall of 0.98 and coverage of 0.97. The domain-dependent model returned an accuracy of 0.99, recall of 0.75 and coverage of 0.93. These results demonstrate that our system is largely accurate and robust.
2018 English text University of Cincinnati / OhioLINK
http://rave.ohiolink.edu/etdc/view?acc_num=ucin1535371714338677
unrestricted
This thesis or dissertation is protected by copyright: all rights reserved. It may not be copied or redistributed beyond the terms of applicable copyright laws. |
collection |
NDLTD |
language |
English |
sources |
NDLTD |
topic |
Computer Science; HTML Structure Analysis; Natural Language Processing; Topicality Detection in HTML; Machine Learning; Privacy Policies |
author |
Mysore Gopinath, Abhijith Athreya |
title |
Automatic Detection of Section Title and Prose Text in HTML Documents Using Unsupervised and Supervised Learning |
publisher |
University of Cincinnati / OhioLINK |
publishDate |
2018 |
url |
http://rave.ohiolink.edu/etdc/view?acc_num=ucin1535371714338677 |
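The abstract in this record evaluates the system with precision, recall and F1 for segment identification, plus a "coverage" metric for how much of the original legitimate text survives in the output. The following is a minimal sketch of how such scores could be computed, not the thesis's implementation. The function names, the use of sets of segment identifiers, and the whitespace-token definition of coverage are all illustrative assumptions.

```python
from collections import Counter


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(predicted, gold):
    """Score predicted segments against gold segments.

    Both arguments are sets of hashable segment identifiers
    (an assumption made for this sketch).
    """
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f1_score(precision, recall)


def coverage(original_text, output_text):
    """Fraction of the original tokens that reappear in the output.

    Assumption: tokens are lowercased whitespace-separated words and
    repeated words are counted with multiplicity. The thesis may define
    coverage differently.
    """
    original = Counter(original_text.lower().split())
    output = Counter(output_text.lower().split())
    total = sum(original.values())
    if total == 0:
        return 0.0
    kept = sum((original & output).values())  # multiset intersection
    return kept / total


if __name__ == "__main__":
    p, r, f1 = precision_recall_f1({"a", "b", "c"}, {"a", "b", "d", "e"})
    print(round(p, 3), round(r, 3), round(f1, 3))
    print(coverage("Privacy Policy We collect data", "privacy policy we collect"))
    # F1 derived from the precision and recall quoted in the abstract for the
    # domain-independent approach; the abstract itself does not report it.
    print(round(f1_score(0.82, 0.98), 3))
```

Under these assumptions, the precision of 0.82 and recall of 0.98 quoted for the domain-independent approach correspond to an F1 of roughly 0.89. That figure is derived here and is not stated in the record.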