Measuring on Large-Scale Read-Intensive Web sites


Bibliographic Details
Main Authors: Ruud, Jørgen; Tveiten, Olav Gisle
Format: Others
Language: English
Published: Norges teknisk-naturvitenskapelige universitet, Institutt for datateknikk og informasjonsvitenskap 2005
Online Access: http://urn.kb.se/resolve?urn=urn:nbn:no:ntnu:diva-9226
Description
Summary: In this thesis we have continued the work started in our project: exploring the practical and economic feasibility of assessing the scalability of a read-intensive, large-scale Internet site. To do this, we installed the main components of a news site using open source software. The scalability exploration was driven by the scaling scenario of increased article size. We managed to assess the scalability of our system well, but doing so was more time-consuming and knowledge-demanding than expected. The feasibility of such a study is therefore lower than we expected, but applying the experience and method of this thesis should make such a study more feasible.

We have assessed the scalability of a general web architecture, which means that our approach can be applied to all read-intensive web sites and not just the one examined in our earlier project [prosjekt]. This general focus is one of the strengths of this thesis. One of our objectives was to build a resource function workbench (RFW), a framework that aids measurement and data interpretation. We consider the RFW one of the most important outcomes of this thesis, because it should be easy to reuse, saving time for future projects and making such a study more feasible.

One of the most important findings is that the impact of increased article size on throughput is bigger than expected. A small increase in article size, especially image size, leads to a clear decrease in throughput. This reduction is larger for small image sizes than for large ones. This has wide implications for news sites, as many of them expect to increase article size while keeping the same system.

Another major finding is that it is hard to predict the effect that scaling up one or more components (non-uniform scaling) will have on throughput. This is because throughput depends on the components to different degrees at different image and text sizes. In our case we performed a non-uniform scaling in which we increased the CPU by a factor of 2.4 and the disk by a factor of 1.1. The effect on throughput varied between image sizes: an increase by a factor of 4.5 at an image size of 100 KB, but only by a factor of 3.2 at 300 KB. For some image and text sizes overall throughput increased by a factor of 10, while for others there was almost no improvement. The implication for web sites is that it is hard to predict how system alterations will affect overall throughput, as the effect depends on the current image and article size.

It was an open question whether a dynamic model of the system could be constructed and solved. We managed to construct the dynamic model, but its predictions are somewhat crude. However, we feel that creating it has been very useful, and we believe it can make valuable predictions if the accuracy of the parameters is improved. This should be feasible, as our measurements should be easy to recreate.

This thesis has been very demanding, because scalability requires knowledge from a wide range of fields (statistics, hardware, software, programming, measurement, etc.). This has made the work very instructive, as we have gained knowledge in many different aspects of computer science. Ideally, the thesis would have had a longer time span, as there are many time-consuming phases on which it would have been interesting to spend more time. As a consequence of this short time span, there is further work that can be conducted to gain additional valuable knowledge.
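The workload dependence of non-uniform scaling described in the summary can be illustrated with a minimal sketch. It assumes a simple two-resource bottleneck bound in which the busiest resource limits throughput. This is not the thesis's dynamic model or its resource function workbench, and the per-request service demands below are hypothetical values chosen for illustration, not measurements from the thesis. Only the scaling factors 2.4 (CPU) and 1.1 (disk) come from the summary.

```python
def throughput(cpu_demand_s, disk_demand_s, cpu_speedup=1.0, disk_speedup=1.0):
    """Upper bound on requests per second for a two-resource system.

    Assumption: the resource with the largest per-request service time
    is the bottleneck and alone limits throughput.
    """
    cpu_time = cpu_demand_s / cpu_speedup
    disk_time = disk_demand_s / disk_speedup
    return 1.0 / max(cpu_time, disk_time)


def scaling_gain(cpu_demand_s, disk_demand_s, cpu_speedup, disk_speedup):
    """Ratio of bounded throughput after scaling to throughput before."""
    before = throughput(cpu_demand_s, disk_demand_s)
    after = throughput(cpu_demand_s, disk_demand_s, cpu_speedup, disk_speedup)
    return after / before


# Hypothetical per-request service demands in seconds (CPU, disk).
# These are illustrative only and are NOT taken from the thesis.
workloads = {
    "cpu_bound": (0.010, 0.002),
    "mixed": (0.010, 0.006),
    "disk_bound": (0.002, 0.010),
}

# Scaling factors from the summary: CPU x2.4, disk x1.1.
gains = {
    name: scaling_gain(cpu, disk, cpu_speedup=2.4, disk_speedup=1.1)
    for name, (cpu, disk) in workloads.items()
}

for name, gain in gains.items():
    print(f"{name}: throughput gain x{gain:.2f}")
```

Under these assumptions the same hardware change yields a different gain for each workload, because the bottleneck shifts between CPU and disk. In this simple bound the gain can never exceed the largest component speedup, so it does not account for the larger factors reported in the summary; it only shows why the gain depends on the workload.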