Implementing Quotient Rayleigh in Power Method to Improve the Computation Speed of Pagerank

Authors : Daniel Siahaan and Zainal Arifin

Abstract: PageRank is one of popular web page ranking mechanisms, which is used by Google. It works based on link analysis. The computation of existing pagerank consumes a very significant time, i.e., in the scale of days, due to the fact that it has to find Eigen values of billion of web pages off-line. Previous attempts on reducing the computation time by shorting the time for Eigen value convergence have been carried out based on several methods, such as extrapolation, sparse linear system and quadratic extrapolation. This study proposes a new approach, i.e., using Quotient Rayleigh in the power method in order to shorten the convergence of dominant Eigen values. This modified power method resulted in a significant improvement on PageRank performance. The interpolation over n-size web pages also shows a consistent performance of the proposed approach.

How to cite this article:

Daniel Siahaan and Zainal Arifin, 2009. Implementing Quotient Rayleigh in Power Method to Improve the Computation Speed of Pagerank. Asian Journal of Information Technology, 8: 104-111.

URL: https://medwelljournals.com/abstract/?doi=ajit.2009.104.111

INTRODUCTION

Popular search engines, such as Google, Yahoo and Altavista have played an important role in the World Wide Web. Search engine has been a killing application for internet users. A world wide web without a search engine is unimaginable nowadays. The search engine allow users to find certain information in the gigantic hyperspace database as web faster than it can be imagine before. Search engine also ensures that it produces relevant and important query results, which are achieved through a democratic ranking system (Langville, 2005). Google as the top most search engine in the internet nowadays, states that it has crawled through >4.2 billion web pages. This is done using their popular PageRank method as its ranking system.

The calculation of page ranks using the PageRank method is currently done offline in Google. Crawler results are converted into Markov chains, with the size of nxn. With around 4.2 billion pages handled by Google, this Markov chains is a gigantic chains. The Pagerank method calculates and ranks an eigen value of each web page. This decremented order of eigen values represents the relevance and importance of the query results. This calculation takes >3 days. With the increasing number of web pages, this calculation will tend to take longer in the near future.

This research proposes the use of Quotient Rayleigh to find the dominant eigen value, which improves the convergence time of Markov chains calculation. With the decreasing of convergence time, it is expected to reduce the PageRank computation time to <3 days, which means that the PageRank method performance will be improved significantly.

MATERIALS AND METHODS

PageRank was first developed by Larry Page at Stanford University (Vise and Malseed, 2005). Later on, the research was joined by Sergey Brin on 1995 and continued as part of a research project about a new kind of search engine. This later produced a functional prototype, called Google. While, just one of many factors, which determine the ranking of Google search results, PageRank becomes the foundation of web search engines of Google since then. PageRank is a method that Google used to measure the relevance of a web page to the query term given by user. It is based on citation analysis founded by Garfield (1979) at the University of Pennsylvania. When all searching result factors, such as title, tag and key words have been used, then Google uses PageRank to shows the order of importance of pages that will be displayed as in the main search results.

In principle, search engine does the following processes when processing a user query:

•	Finding pages that match the given key word (s)

•	Ranking the pages locally based on traditional information retrieval methods, such as key word matching, keyword closeness, hypertext tags level, occurrence frequency, etc.

•	Merging this result with the global ranking scores produced through PageRank as follows:

Where:

PR(A)	=	Page rank of page (A)
D	=	Damping factor (0.8)
PR(T₁)	=	Page rank to page (T₁)
C(T₁)	=	Number of link in page (T₁)
PR(T_n)/C(T_n)	=	Ratio of pages have link to (A)

Markov chains: Determination process of PageRank begins by forming an nxn matrix, called hyperlink matrix A. Where, n is the number of web pages. If a web page i has link i≥1 to other web pages and web pages i has a link to webpage j, then element at line- i and column-j of matrix A (Aij) = 1/li, while other element of Ai = 0. Therefore, Aij represents the likelihood of a random surfer may choose a link from webpage i to webpage j. This matrix of link graphs is called a markov matrix or markov chains. In this research, the markov matrix is obtained from the dataset from previous researches, where markov matrix is stochastic and irreducible (Austin, 2006).

A markov chain is a line of random variable, X1, X2, X3,…. given a markov property that is a state that currently happens, where past and current states are independent and unbounded.

Pr (X_n+1-x|X _n ,... , X ₁, X ₀-x₀) - Pr (X n+1-x|X _n-x_n)

A probability of moving from state i (page i) to state j (state j) in n steps (click) is defined as:

and for 1-step transition can be defined as:

The markov matrix is calculated until it produces convergence of eigen values. A markov chain is called ergodic if there is a path from a state to another state, or if all states are aperiodic and positive recurrent, given that the probability is not zero.

For any markov chain that is ergodic, there is a unique average number of visitations for each state. It means that for a period of time, a number of visitations of each state is proportional regardless the original state of the visitation (Austin, 2006).

Eigen vector and eigen value: If A is a nxn matrix, then a nonzero vector x in Rⁿ is an eigen vector of A if Ax is a magnitude multiplication of x that is:

(1)

for any scalar λ. λ is eigen value of A and x is eigen vector that corresponds with λ. To find an eigen values of matrix A of size nxn, Eq. 1 can be rewritten as follows:

(2)

or equally written as:

(3)

In order the scalar λ is considered as an eigen value, there should be a nonzero solution of this equation is:

det (λI-A) = 0

The scalar that fulfills this equation is eigen value of A. This is called a characteristic equation of A. Thus, det (λI-A) is a polynomial of λ that is, a characteristic polynomial of A.

If A is matrix of size nxn, therefore the characteristic polynomial of A should be n and coefficient of λⁿ are 1. Therefore, the characteristics polynomial of matrix nxn has the form of:

det (λI -A) = λⁿ + c₁ λ^n-1+......+c_n

Power method: In practical problems, computing large matrix to find eigen values is a complex computation problem, that implies time and computation resource consuming.

In this research, we propose the use of a simpler algorithm to find the eigen values, that is, power method. Power method produces approximate eigen values given the highest absolute value of corresponding eigen vector.

An eigen value of matrix A is called dominant eigen value of A if its absolute value is bigger than the rest of absolute eigen values. The eigen vector that corresponds with the dominant eigen value is called dominant eigen vector of A.

Let assume A is matrix of size nxn that can be diagonalized by dominant eigen value. If x₀ is any non-zero vector in Rⁿ, then vector:

A^p x₀

(4)

is usually a good approximation of dominant eigen vector if exponent p is big (Langville, 2005). Iteration process of such vector computation is called power method. Power method most of the time produce vector with large component.

To solve this, the approximation eigen vector is scaled down in each iteration so that the component will be positioned between +1 and -1. This can be achieved by multiplying the approximation eigen vector with its inverse component that has the biggest absolute value.

For example, given matrix A that has eigen value λ and x is correspond eigen vector. If <,> represents the result in Euclids, then:

Therefore, if x is the approximation of dominant eigen value, then the dominant eigen value of λ₁ can be approximated by:

which is called Quotient Rayleigh (Gumerov, 2003).

Eigen values estimation using power method: The proposed solution is focused on improving power method which is utilized for estimating eigen values during Pagerank calculation. In Fig. 1a (Brin and Page, 1998), the iteration will stop when the eigen value is convergence on corresponding eigen vector. While, in Fig. 1b, there is a filtering mechanism on the eigen value, so that the eigen value resulted from Quotient Rayleigh will move faster toward the convergence value. And this is proven.

The approximation of eigen value by power method uses a simple iteration, which continues until it reaches a value where, it is convergence. The value is called eigen value. Given a markov matrix A with the size of nxn with an n-size non-zero eigen vector, so that

Where:

if u⁽⁰⁾ is any vector where


Fig. 1:	Flowchart of eigen values estimation with power method, (a) existing method and (b) proposed method

Given a symmetrical matrix A, Eq. 4 can be computed as follows: While not convergence (norm(v^k+1, u^k)) do

End while.

It clearly shows that u^(k+1) = c_k+1 Au^(k), therefore:

So that,

Because eigen vector is unbounded-linear, then any vector can be made as a linear combination from eigen vector. Hence,

then:

and because

then:

If it is assumed that:

Then,

thus,

where, |γ_k| tends to be constant even if k→∞. Every linear combination of the eigen vector that corresponds with the same eigen value is also also eigen vector.

Then, base on the Eq. 4, the eigen value computation algorithm using power method is scaled down using this mechanism. This will result in the order of x₀, x₁ , x₂ ,.... which, their approximation against dominant eigen vector is increasingly improved.

Eigen value estimation using modified power method: The motivation is to ensure that the convergence of eigen values can be reached faster than the previous method.

The eigen value approximation, i.e. dominant eigen value, by adopting power method with Quotient Rayleigh is the major issue in this research. Quotient Rayleigh is used because matrix A is symmetrical (Chowdury and Dasgupta, 2003).

Given any vector v and a value μ so that ||Av-μv||₂ is minimum. By writing k = v^H Av and assuming that ||v||² (Wilkinson, 1965), we can conclude that:

If it is observed that the minimum can be reached for μ = k = v^H Av, the quantity is introduce as Quotient Rayleigh, which corresponding to v, which we written as μ_R. The word Quotient is used if and only if the scalar of v cannot be obtained, then it should be v^HAv/v^Hv, it is clear that if:

therefore,

For any μ, it can be obtained:

Then,

Which is a circular disc with its center μ_R, which has a small radius corresponds to any given μ.

Given any normal matrix A, Av - μ_Rv = η, where ||v||₂ = 1 and ||η||₂ = ε. It is known that n-1 eigen values satisfy the relation: |λ_i - μ_R|≥α, where, α is a constant. Then a relevant eigen values λ₂, λ₃,......, λ_n can be obtained. If A is normal, then it has an eigen vector ortonormal system x_i, (i = 1,....., n), thus, it can be written as:

Given the previous equation:

so that:

therefore, it can be deducted that:

if θ is defined with the following relation:

hence,

then:

Therefore, if a>ε, then ve^-iθ is the best approximation on x₁. The constant factor e^-iθ is a less significant factor. As the value α tends to decreased, then the result will also become smaller. When α is equal to ε, then the result will be insignificant. Hence that μ_R is Quotient Rayleigh (Gumerov, 2003).

For any symmetrical matrix A, a temporary assumption is used on eigen matrix x₀. Then, the iterative computation is conducted in order to obtain the approximtion over the corresponding eigen vector with eigen value l. In each iteration a Quotient Rayleigh is used to know the maximum eigen value, so that it can be convergence.

While not convergence abs (l_k - l_k-1) do

percent estimation of dominant eigen values

End while.

RESULTS AND DISCUSSION

To measure the performance of the proposed power method, several tests have been conducted using a number variety of dataset. The result is also compared and analyzed against the existing power method to find the accuracy of the computation model in processing PageRank. This test also measures the efficiency of the proposed method in producing the dominant eigen values. To project the behavior of the system that uses the proposed model in n-size pages, we do an interpolation.

Experimental data: This experimentation is carried out by using dataset from previous researches. Table 1 explains the sources of the dataset used in the testing scenario.

Standford University datasets were crawled in 2002 and used by Sepandar D. Kamvar in his research on ranking the search results of search engine. The datasets that were downloaded for the test are Stanford and Stanford Berkeley web matrixes. Stanford web matrix has 281, 903 pages or around 2.3 million paths with the size of 64.2 MB. While, Stanford Berkeley web matrix has 683, 446 pages or around 7.6 million paths with the size of 125.8 MB.

Toronto University datasets were crawled in 2004 and is used by a number of researches at Toronto University. There are 34 datasets that has been used for the testing, with the size starting from 742 until 11, 659 pages. These datasets was crawled based on the keyword searching and each has 3 separate files which also use for other researches related to path analysis.

Milano University datasets were crawled between 2000 and 2005. Datasets were crawled by many researches from various universities. It contains several datasets, starting from one that has 325, 577 pages until 1, 382, 909 pages. Table 2 shows the dataset from University Milano.

Result and analysis: The tests on the proposed method have been conducted at the Software Engineering Laboratory of ITS (Institut Teknologi Sepuluh Nopember). The tests were conducted using a PC Intel P4-3 Giga Quad Core with 5 GB memory. Table 3 shows the result of the experiment, where power method with Quotient Rayleigh performs more efficient, in the term of computation time, rather than the existing method, i.e. between 6.25 and 74.72%. It also shows that the number of pages is corresponding to the computation time.

Table 1:	Dataset for testing

Table 2:	Dataset from University Milano

Boldi (2004)

Table 3:	Computation time of pagerank

Table 4 shows the resulted eigen vector from Pagerank algorithm using former Power Method and Modified Power Method. It shows that both methods produce identical eigen values. Which means that the proposed methods behave.

Interpolation: The behavior of the modified power method is also observed by conducting an interpolation over n-size pages. The interpolation is conducted using SPSS 7 version 14. The behavior of Pagerank is observed based on the power function. Figure 2 shows the behavior of both methods.

Based on the power function, the behavior of both methods is observed and analyzed, whether the proposed method always performs better than existing method.

This is proven as follow: from the statistical analysis over the data using the tool, it can produce function of each method.

Table 4:	Eigen values of both methods on stanford dataset


Fig. 2:	The computation time of pagerank on existing and proposed methods with dataset ≥11.659

Where:

Y_pm	=	Y (computation time) functional equation of pagerank using existing power method
Y_rayleigh	=	Y (computation time) functional equation of pagerank using modified poer method
x	=	Number of pages

Because the objective is to find any adjacent point, then both equations is written as follow:

There is only one adjacent point present for the equations, i.e., x ≅ 0 and it is shown from Fig. 2 that the function of proposed method is always above the existing method. This means that it is always true that the proposed method always performs better than existing methods for any number of pages bigger than 0. Given the number of existing pages in Google is 4.2 billion, it will take the existing method around 7.54 days to find the eigen values, while the proposed method will need only 1.75 day.

CONCLUSION

Ranking web page is the heart of search engine which used for determining significant searching results. One of the best ranking methods currently is PageRank, which is used by Google. From the preliminary test on sample dataset, it is shown that the advanced PageRank, which makes use of Quotient Rayleigh in the computation of power method performs significantly faster that is 64.4%, than the existing method. The interpolation also shows that given the amount of existing collection of pages that Google has the proposed method will reduce 25% of the existing computation time. This is due to the computation characteristic of Quotient Rayleigh, which allows the convergence of eigen value to happen far earlier. There is still a further research on observing the behavior of the new method on a system that similar to the existing search engines’ architecture, such as Google.

ACKNOWLEDGEMENTS

We would like to thank Massimo Santini and Sebastian Vigno for allowing us to use University Milano dataset for testing our method.

Related Links

Journals By Subject

Asian Journal of Information Technology

Implementing Quotient Rayleigh in Power Method to Improve the Computation Speed of Pagerank

How to cite this article: