INTRODUCTION
Popular search engines, such as Google, Yahoo and Altavista have played an important role in the World Wide Web. Search engine has been a killing application for internet users. A world wide web without a search engine is unimaginable nowadays. The search engine allow users to find certain information in the gigantic hyperspace database as web faster than it can be imagine before. Search engine also ensures that it produces relevant and important query results, which are achieved through a democratic ranking system (Langville, 2005). Google as the top most search engine in the internet nowadays, states that it has crawled through >4.2 billion web pages. This is done using their popular PageRank method as its ranking system.
The calculation of page ranks using the PageRank method is currently done offline in Google. Crawler results are converted into Markov chains, with the size of nxn. With around 4.2 billion pages handled by Google, this Markov chains is a gigantic chains. The Pagerank method calculates and ranks an eigen value of each web page. This decremented order of eigen values represents the relevance and importance of the query results. This calculation takes >3 days. With the increasing number of web pages, this calculation will tend to take longer in the near future.
This research proposes the use of Quotient Rayleigh to find the dominant eigen value, which improves the convergence time of Markov chains calculation. With the decreasing of convergence time, it is expected to reduce the PageRank computation time to <3 days, which means that the PageRank method performance will be improved significantly.
MATERIALS AND METHODS
PageRank was first developed by Larry Page at Stanford University (Vise and Malseed, 2005). Later on, the research was joined by Sergey Brin on 1995 and continued as part of a research project about a new kind of search engine. This later produced a functional prototype, called Google. While, just one of many factors, which determine the ranking of Google search results, PageRank becomes the foundation of web search engines of Google since then. PageRank is a method that Google used to measure the relevance of a web page to the query term given by user. It is based on citation analysis founded by Garfield (1979) at the University of Pennsylvania. When all searching result factors, such as title, tag and key words have been used, then Google uses PageRank to shows the order of importance of pages that will be displayed as in the main search results.
In principle, search engine does the following processes when processing a user query:
| • | Finding pages that match the given key word (s) |
|
| • |
Ranking the pages locally based on traditional information
retrieval methods, such as key word matching, keyword closeness, hypertext
tags level, occurrence frequency, etc. |
|
| • | Merging
this result with the global ranking scores produced through PageRank as
follows: |
|
Where:
| PR(A) |
= |
Page rank of page (A) |
| D |
= |
Damping factor (0.8) |
| PR(T1) |
= |
Page rank to page (T1) |
| C(T1) |
= |
Number of link in page (T1) |
| PR(Tn)/C(Tn) |
= |
Ratio of pages have link to (A) |
|
Markov chains: Determination process of PageRank begins by forming an
nxn matrix, called hyperlink matrix A. Where, n is the number of web pages.
If a web page i has link
i≥1
to other web pages and web pages i has a link to webpage j, then element at
line- i and column-j of matrix A (Aij) = 1/li, while other element of Ai = 0.
Therefore, Aij represents the likelihood of a random surfer may choose a link
from webpage i to webpage j. This matrix of link graphs is called a markov matrix
or markov chains. In this research, the markov matrix is obtained from the dataset
from previous researches, where markov matrix is stochastic and irreducible
(Austin, 2006).
A markov chain is a line of random variable, X1, X2, X3,
. given a markov
property that is a state that currently happens, where past and current states
are independent and unbounded.
Pr (X n+1-x|X n ,... , X 1,
X 0-x0) - Pr (X n+1-x|X n-xn) |
|
A probability of moving from state i (page i) to state j (state j) in n steps
(click) is defined as:
and for 1-step transition can be defined as:
The markov matrix is calculated until it produces convergence of eigen values.
A markov chain is called ergodic if there is a path from a state to another
state, or if all states are aperiodic and positive recurrent, given that the
probability is not zero.
For any markov chain that is ergodic, there is a unique average number of visitations for each state. It means that for a period of time, a number of visitations of each state is proportional regardless the original state of the visitation (Austin, 2006).
Eigen vector and eigen value: If A is a nxn matrix, then a nonzero vector x in Rn is an eigen vector of A if Ax is a magnitude multiplication of x that is:
for any scalar λ. λ is eigen value of A and x is eigen vector that corresponds with λ. To find an eigen values of matrix A of size nxn, Eq. 1 can be rewritten as follows:
or equally written as:
In order the scalar λ is considered as an eigen value, there should be
a nonzero solution of this equation is:
The scalar that fulfills this equation is eigen value of A. This is called
a characteristic equation of A. Thus, det (λI-A) is a polynomial of λ
that is, a characteristic polynomial of A.
If A is matrix of size nxn, therefore the characteristic polynomial of A should
be n and coefficient of λn are 1. Therefore, the characteristics
polynomial of matrix nxn has the form of:
det (λI -A) = λn + c1 λn-1+......+cn
|
|
Power method: In practical problems, computing large matrix to find
eigen values is a complex computation problem, that implies time and computation
resource consuming.
In this research, we propose the use of a simpler algorithm to find the eigen values, that is, power method. Power method produces approximate eigen values given the highest absolute value of corresponding eigen vector.
An eigen value of matrix A is called dominant eigen value of A if its absolute value is bigger than the rest of absolute eigen values. The eigen vector that corresponds with the dominant eigen value is called dominant eigen vector of A.
Let assume A is matrix of size nxn that can be diagonalized by dominant eigen
value. If x0 is any non-zero vector in Rn, then vector:
is usually a good approximation of dominant eigen vector if exponent p is big
(Langville, 2005). Iteration process of such vector computation is called power
method. Power method most of the time produce vector with large component.
To solve this, the approximation eigen vector is scaled down in each iteration so that the component will be positioned between +1 and -1. This can be achieved by multiplying the approximation eigen vector with its inverse component that has the biggest absolute value.
For example, given matrix A that has eigen value λ and x is correspond eigen vector. If <,> represents the result in Euclids, then:
Therefore, if x is the approximation of dominant eigen value, then the dominant
eigen value of λ1 can be approximated by:
which is called Quotient Rayleigh (Gumerov, 2003).
Eigen values estimation using power method: The proposed solution is focused on improving power method which is utilized for estimating eigen values during Pagerank calculation. In Fig. 1a (Brin and Page, 1998), the iteration will stop when the eigen value is convergence on corresponding eigen vector. While, in Fig. 1b, there is a filtering mechanism on the eigen value, so that the eigen value resulted from Quotient Rayleigh will move faster toward the convergence value. And this is proven.
The approximation of eigen value by power method uses a simple iteration, which continues until it reaches a value where, it is convergence. The value is called eigen value. Given a markov matrix A with the size of nxn with an n-size non-zero eigen vector, so that
Where:
if u(0) is any vector where
| | Fig. 1: | Flowchart
of eigen values estimation with power method, (a) existing method and
(b) proposed method |
|
Given a symmetrical matrix A, Eq. 4 can be computed as follows: While not convergence (norm(vk+1, uk)) do
End while.
It clearly shows that u(k+1) = ck+1 Au(k), therefore:
So that,
Because eigen vector is unbounded-linear, then any vector can be made as a
linear combination from eigen vector. Hence,
then:
and because
then:
If it is assumed that:
Then,
thus,
where, |γk| tends to be constant even if k→∞. Every
linear combination of the eigen vector that corresponds with the same eigen
value is also also eigen vector.
Then, base on the Eq. 4, the eigen value computation algorithm using power method is scaled down using this mechanism. This will result in the order of x0, x1 , x2 ,.... which, their approximation against dominant eigen vector is increasingly improved.
Eigen value estimation using modified power method: The motivation is
to ensure that the convergence of eigen values can be reached faster than the
previous method.
The eigen value approximation, i.e. dominant eigen value, by adopting power method with Quotient Rayleigh is the major issue in this research. Quotient Rayleigh is used because matrix A is symmetrical (Chowdury and Dasgupta, 2003).
Given any vector v and a value μ so that ||Av-μv||2 is minimum. By writing k = vH Av and assuming that ||v||2 (Wilkinson, 1965), we can conclude that:
If it is observed that the minimum can be reached for μ = k = vH
Av, the quantity is introduce as Quotient Rayleigh, which corresponding to v,
which we written as μR. The word Quotient is used if and only
if the scalar of v cannot be obtained, then it should be vHAv/vHv,
it is clear that if:
therefore,
For any μ, it can be obtained:
Then,
Which is a circular disc with its center μR, which has a small
radius corresponds to any given μ.
Given any normal matrix A, Av - μRv = η, where ||v||2 = 1 and ||η||2 = ε. It is known that n-1 eigen values satisfy the relation: |λi - μR|≥α, where, α is a constant. Then a relevant eigen values λ2, λ3,......, λn can be obtained. If A is normal, then it has an eigen vector ortonormal system xi, (i = 1,....., n), thus, it can be written as:
Given the previous equation:
so that:
therefore, it can be deducted that:
if θ is defined with the following relation:
hence,
then:
Therefore, if a>ε, then ve-iθ is the best approximation
on x1. The constant factor e-iθ is a less significant
factor. As the value α tends to decreased, then the result will also become
smaller. When α is equal to ε, then the result will be insignificant.
Hence that μR is Quotient Rayleigh (Gumerov, 2003).
For any symmetrical matrix A, a temporary assumption is used on eigen matrix x0. Then, the iterative computation is conducted in order to obtain the approximtion over the corresponding eigen vector with eigen value l. In each iteration a Quotient Rayleigh is used to know the maximum eigen value, so that it can be convergence.
While not convergence abs (lk - lk-1) do
percent estimation of dominant eigen values
End while.
RESULTS AND DISCUSSION
To measure the performance of the proposed power method, several tests have been conducted using a number variety of dataset. The result is also compared and analyzed against the existing power method to find the accuracy of the computation model in processing PageRank. This test also measures the efficiency of the proposed method in producing the dominant eigen values. To project the behavior of the system that uses the proposed model in n-size pages, we do an interpolation.
Experimental data: This experimentation is carried out by using dataset from previous researches. Table 1 explains the sources of the dataset used in the testing scenario.
Standford University datasets were crawled in 2002 and used by Sepandar D. Kamvar in his research on ranking the search results of search engine. The datasets that were downloaded for the test are Stanford and Stanford Berkeley web matrixes. Stanford web matrix has 281, 903 pages or around 2.3 million paths with the size of 64.2 MB. While, Stanford Berkeley web matrix has 683, 446 pages or around 7.6 million paths with the size of 125.8 MB.
Toronto University datasets were crawled in 2004 and is used by a number of researches at Toronto University. There are 34 datasets that has been used for the testing, with the size starting from 742 until 11, 659 pages. These datasets was crawled based on the keyword searching and each has 3 separate files which also use for other researches related to path analysis.
Milano University datasets were crawled between 2000 and 2005. Datasets were crawled by many researches from various universities. It contains several datasets, starting from one that has 325, 577 pages until 1, 382, 909 pages. Table 2 shows the dataset from University Milano.
Result and analysis: The tests on the proposed method have been conducted
at the Software Engineering Laboratory of ITS (Institut Teknologi Sepuluh Nopember).
The tests were conducted using a PC Intel P4-3 Giga Quad Core with 5 GB memory.
Table 3 shows the result of the experiment, where power method
with Quotient Rayleigh performs more efficient, in the term of computation time,
rather than the existing method, i.e. between 6.25 and 74.72%. It also shows
that the number of pages is corresponding to the computation time.
| Table 1: | Dataset
for testing |
 |
|
| Table 2: | Dataset
from University Milano |
 |
| Boldi (2004) |
|
| Table 3: |
Computation time of pagerank |
 |
|
Table 4 shows the resulted eigen vector from Pagerank algorithm
using former Power Method and Modified Power Method. It shows that both methods
produce identical eigen values. Which means that the proposed methods behave.
Interpolation: The behavior of the modified power method is also observed
by conducting an interpolation over n-size pages. The interpolation is conducted
using SPSS 7 version 14. The behavior of Pagerank is observed based on the power
function. Figure 2 shows the behavior of both methods.
Based on the power function, the behavior of both methods is observed and analyzed, whether the proposed method always performs better than existing method.
This is proven as follow: from the statistical analysis over the data using the tool, it can produce function of each method.
| Table 4: | Eigen
values of both methods on stanford dataset |
 |
|
| | Fig. 2: | The
computation time of pagerank on existing and proposed methods with dataset
≥11.659 |
|
Where:
| Ypm |
= |
Y (computation time) functional equation of pagerank using
existing power method |
| Yrayleigh |
= |
Y (computation time) functional equation of pagerank using modified poer
method |
| x |
= |
Number of pages |
|
Because the objective is to find any adjacent point, then both equations is
written as follow:
There is only one adjacent point present for the equations, i.e., x ≅
0 and it is shown from Fig. 2 that the function of proposed
method is always above the existing method. This means that it is always true
that the proposed method always performs better than existing methods for any
number of pages bigger than 0. Given the number of existing pages in Google
is 4.2 billion, it will take the existing method around 7.54 days to find the
eigen values, while the proposed method will need only 1.75 day.
CONCLUSION
Ranking web page is the heart of search engine which used for determining significant searching results. One of the best ranking methods currently is PageRank, which is used by Google. From the preliminary test on sample dataset, it is shown that the advanced PageRank, which makes use of Quotient Rayleigh in the computation of power method performs significantly faster that is 64.4%, than the existing method. The interpolation also shows that given the amount of existing collection of pages that Google has the proposed method will reduce 25% of the existing computation time. This is due to the computation characteristic of Quotient Rayleigh, which allows the convergence of eigen value to happen far earlier. There is still a further research on observing the behavior of the new method on a system that similar to the existing search engines architecture, such as Google.
ACKNOWLEDGEMENTS
We would like to thank Massimo Santini and Sebastian Vigno for allowing us to use University Milano dataset for testing our method.