Summary: | Master's === National Cheng Kung University === Department of Computer Science and Information Engineering (Master's and Doctoral Program) === 94 === In recent years, search engines have come to play a key role among Web applications, and link analysis algorithms are the major methods for measuring the importance of Web pages. These algorithms use the conventional flat Web graph, built from Web pages and the links between them, to obtain the relative importance of Web objects. Previous research has observed that PageRank-like link analysis algorithms are biased against newly created Web pages. A ranking algorithm called Page Quality was proposed to address this issue. Page Quality anticipates future ranking values from the rate of change between current and previous ranking values. In this paper, we propose a new algorithm called DRank that diminishes the bias of PageRank-like link analysis and improves on the performance of Page Quality. In this algorithm, we model the Web graph as a three-layer graph consisting of a Host Graph, a Directory Graph and a Page Graph, using the hierarchical structure of URLs and the link structure of Web pages. First, we discuss how link relations aggregate at the host level and the directory level and, according to what we observe, assign different weights to different types of links. We then calculate the importance of hosts, directories and pages using the weighted graph we have built. We find two phenomena. The first is that hosts or directories with higher rank values contain the majority of important pages; we also observe that the directory level is the better block level for showing that new pages created within an important block have a higher probability of being important pages. The second is that ranking values within a directory form a ladder-shaped graph when sorted in decreasing order. By combining the Page Quality algorithm with these two phenomena, we can predict page importance more accurately and so diminish the bias against newly created pages. 
Experimental results on our data show that the DRank algorithm works well at anticipating the future ranking values of pages, and that DRank performs better than Page Quality.
|
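The three-layer model described in the summary relies on splitting each URL into host, directory and page levels and on telling different types of links apart. The following is a minimal sketch of that decomposition only, not the thesis's implementation: the function names (`url_levels`, `link_type`, `aggregate`), the three link-type labels and the rule that a directory is everything up to the last `/` in the path are all assumptions made for illustration. The summary does not state the weights assigned to each link type, so none are given here.

```python
from collections import Counter
from urllib.parse import urlsplit
import posixpath


def url_levels(url):
    """Return (host, directory, page) keys for a URL.

    Assumption: the directory is the URL path up to and including
    the last '/', prefixed with the lower-cased host name.
    """
    parts = urlsplit(url)
    host = parts.netloc.lower()
    path = parts.path or "/"
    directory = posixpath.dirname(path)
    if not directory.endswith("/"):
        directory += "/"
    return host, host + directory, host + path


def link_type(src, dst):
    """Classify a page-level link by the highest level it crosses.

    The three labels are hypothetical names, not taken from the thesis.
    """
    s_host, s_dir, _ = url_levels(src)
    d_host, d_dir, _ = url_levels(dst)
    if s_host != d_host:
        return "inter-host"
    if s_dir != d_dir:
        return "intra-host"
    return "intra-directory"


def aggregate(links):
    """Collapse page-level links into host-level and directory-level
    edge counts, ignoring links that stay inside one host or directory."""
    host_edges = Counter()
    dir_edges = Counter()
    for src, dst in links:
        s_host, s_dir, _ = url_levels(src)
        d_host, d_dir, _ = url_levels(dst)
        if s_host != d_host:
            host_edges[(s_host, d_host)] += 1
        if s_dir != d_dir:
            dir_edges[(s_dir, d_dir)] += 1
    return host_edges, dir_edges
```

A weighted Host Graph or Directory Graph could then be built from these counts, with a per-type weight (a free parameter in this sketch) applied to each edge before running a PageRank-style computation at each level.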