Design and Implementation of an Index System for Super Scale Search Engines

Bibliographic Details
Main Authors: Hsien-Tsung Chang, 張賢宗
Other Authors: Sun Wu
Format: Others
Language: en_US
Published: 2007
Online Access: http://ndltd.ncl.edu.tw/handle/85196733496003994312
Description
Summary: PhD === National Chung Cheng University === Graduate Institute of Computer Science and Information Engineering === 95 === Search engines are among the most important and critical technologies of the information age. Search technology is used not only for search engine services; it is also the fundamental building block for many other information services and for techniques of information and knowledge management. As the World Wide Web grows into a huge information space comprising tens or even hundreds of billions of web objects, building a global search engine is an extreme technological challenge. It requires sophisticated design of data structures and algorithms, as well as an efficient system implementation that can deliver a robust and efficient search service with high-quality search results. The goal of our research is to design and implement a very efficient search engine technology that can meet this challenge and, moreover, to provide a solution that is much more cost-effective than current popular search engines. This dissertation focuses on the index part of search engine technology. We present a high-performance index system that is capable of indexing up to approximately one hundred million web pages on one PC-level server in a few days, while the index structure provides sub-second query processing for most query patterns. Using our index system, indexing 10 billion pages requires only approximately one hundred PC-level servers. By contrast, search engines such as Google handle only 2.5 million web pages per index server, because their approach stores the index in memory to provide fast query processing. Comparing indexing capacity and query performance with the datasheet of the Google Appliance [35] shows that our approach is much more cost-effective and, because far fewer servers are needed, more scalable and manageable.
Besides the highly efficient index system, we also design a new index method that provides approximate matching at the index level, which is not available in any popular search engine. Our approximate index scheme is based on our newly proposed index structure called Listance Bounded Subsequence Index, which can be used to correct spelling errors or find similar words at very high speed. Our index system is also applied in a new information system called NUWeb [70], which aims to build a new Web cyberspace that brings users' information into the search space, that is, a search engine covering the existing WWW information space plus users' personal web space. Lastly, we present a query cache index and dynamic proxy system that can significantly improve search service performance at low cost.
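
The server counts quoted in the summary follow from simple division, and the short sketch below works through that arithmetic. The per-server capacities (one hundred million pages for the disk-based index, 2.5 million pages for an in-memory index server) and the 10 billion page corpus are the figures stated in the summary; the function and constant names are hypothetical and chosen only for illustration.

```python
def servers_needed(total_pages: int, pages_per_server: int) -> int:
    """Return the smallest number of servers that can hold total_pages."""
    if pages_per_server <= 0:
        raise ValueError("pages_per_server must be positive")
    # Integer ceiling division, so a partially filled server still counts.
    return -(-total_pages // pages_per_server)


# Figures taken from the summary; the names are illustrative assumptions.
TOTAL_PAGES = 10_000_000_000            # 10 billion pages
DISK_INDEX_PER_SERVER = 100_000_000     # about one hundred million pages per PC-level server
MEMORY_INDEX_PER_SERVER = 2_500_000     # 2.5 million pages per in-memory index server

disk_servers = servers_needed(TOTAL_PAGES, DISK_INDEX_PER_SERVER)
memory_servers = servers_needed(TOTAL_PAGES, MEMORY_INDEX_PER_SERVER)

print(f"disk-based index:   {disk_servers} servers")
print(f"in-memory index:    {memory_servers} servers")
print(f"ratio:              {memory_servers // disk_servers}x")
```

Under these stated figures alone, the in-memory approach would need roughly forty times as many index servers for the same corpus; the sketch ignores replication, query load, and hardware differences, none of which the summary quantifies.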