An Efficient Minimal Text Segmentation Method for URL Domain Names

Text segmentation of URL domain names is a straightforward and convenient way to analyze users’ online behavior and is crucial for determining their areas of interest. However, popular word segmentation tools perform relatively poorly because of the unique structure of the website domain...


Bibliographic Details
Main Authors: Yiqian Li, Tao Du, Lianjiang Zhu, Shouning Qu
Format: Article
Language: English
Published: Hindawi Limited 2021-01-01
Series: Scientific Programming
Online Access: http://dx.doi.org/10.1155/2021/9946729
ISSN: 1875-919X
Description: Text segmentation of URL domain names is a straightforward and convenient way to analyze users’ online behavior and is crucial for determining their areas of interest. However, popular word segmentation tools perform relatively poorly on website domain names because of their unique structure (extremely short length, irregular naming, and no contextual relationships). To address this issue, this paper proposes an efficient minimal text segmentation (EMTS) method for URL domain names to achieve efficient adaptive text mining. We first design a targeted hierarchical task model to reduce noise interference in minimal texts. We then present a novel method that integrates a conflict game into the two-directional (bidirectional) maximum matching algorithm, so that words with higher weight and greater probability are selected, thereby improving recognition accuracy. Next, Chinese Pinyin and English mapping are embedded in the word segmentation rules. In addition, we incorporate a correction factor that accounts for text length into the F1-score to improve the performance evaluation of text segmentation. Experimental results show that EMTS outperforms other word segmentation tools by around 20 percentage points in accuracy and topic extraction, providing high-quality data for subsequent text analysis.
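
For readers unfamiliar with the baseline that the description builds on, the following is a minimal sketch of classic two-directional maximum matching applied to a domain-name string. It is not the authors' EMTS method: the conflict game, word weights, hierarchical task model, and Pinyin/English mapping are not reproduced here. The function names, the tie-breaking heuristic (fewer tokens, then fewer single-character tokens, with ties going to the backward pass), and the tiny vocabulary in the usage note are all assumptions made for illustration.

```python
def forward_max_match(text, vocab, max_len):
    """Greedy left-to-right segmentation, always trying the longest piece first."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when no dictionary word fits.
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_max_match(text, vocab, max_len):
    """Greedy right-to-left segmentation, always trying the longest piece first."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                j -= size
                break
    tokens.reverse()  # pieces were collected from the end of the string
    return tokens


def bidirectional_max_match(text, vocab):
    """Run both passes and keep the result with the lower heuristic cost.

    Assumed heuristic (a common textbook choice, not taken from the paper):
    prefer fewer tokens, then fewer single-character tokens.
    """
    max_len = max(map(len, vocab), default=1)
    fwd = forward_max_match(text, vocab, max_len)
    bwd = backward_max_match(text, vocab, max_len)

    def cost(tokens):
        singles = sum(1 for t in tokens if len(t) == 1)
        return (len(tokens), singles)

    # On a tie the backward result is returned; this choice is arbitrary here.
    return bwd if cost(bwd) <= cost(fwd) else fwd
```

With the hypothetical vocabulary `{"news", "new", "sport"}`, the forward pass splits `"newsport"` into `news` followed by four single characters, while the backward pass yields `new` and `sport`, so `bidirectional_max_match` returns the backward result. When both passes produce equally plausible splits, this simple heuristic cannot tell them apart, which is the kind of conflict the description says EMTS resolves using word weights and probabilities.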