Journal of Animal and Veterinary Advances

Year: 2009
Volume: 8
Issue: 3
Page No. 413 - 419

Comparison of Principal Component Analysis and Multidimensional Scaling Methods for Clustering Some Honey Bee Genotypes

Authors: Irfan Ozturk, Hikmet Orhan and Zeki Dogan

Abstract: The aim of this study was to make use of Cluster Analysis, Principal Component Analysis and Multidimensional Scaling for clustering and for assigning the units to the correct clusters. Honey bee genotypes (Apis mellifera L.) belonging to 30 provinces of Turkey were clustered according to the cluster methods cited below. Of these methods, McQuitty, Single Linkage, Complete Linkage, Average Linkage and Centroid Linkage gave similar results, and these results were in good agreement with the separation graphics obtained by the Principal Component Analysis and Multidimensional Scaling methods, while different results were found with the median, centroid linkage and k-means methods.

How to cite this article:

Irfan Ozturk, Hikmet Orhan and Zeki Dogan, 2009. Comparison of Principal Component Analysis and Multidimensional Scaling Methods for Clustering Some Honey Bee Genotypes. Journal of Animal and Veterinary Advances, 8: 413-419.

INTRODUCTION

Regarding the similarities and differences in their characteristics, living organisms constitute a multidimensional model of variation in nature (Steffes, 2007). The variation produced by both ecological and genetic effects among the subsections of a species gives rise to the population structure of that species.

For classification, populations and population groups are compared with each other and arranged according to defined rules. The most important task in a classification is to group the organisms and rank them in a hierarchic category. Systematic zoology, using the differences and similarities between organisms, places populations into species or higher taxa and also puts forward ideas about the biological reasons for the similarities and differences between taxa (Dufrene and Legendre, 1997).

Today, the data obtained from research and experiments are evaluated by various statistical methods. Similarly, for the correct classification of plants and animals, appropriate statistical analyses have to be used. For this aim, a number of methods for cluster analysis have been developed. However, conflicting results obtained from different methods leave researchers in a contradiction. To overcome this difficulty, the researcher should know the distribution of the data at hand and which method is appropriate for these data. This is not an easy task. Therefore, in this study, we have investigated whether methods such as Multidimensional Scaling (MDS) and Principal Component Analysis (PCA) are of help in determining the clusters.

In hierarchical clustering methods, once an element is assigned to a cluster, it cannot move to another cluster at all. In the presence of some extreme values, this situation can lead to the formation of wrong clusters. It has been argued that, by obtaining the first 2 principal component scores, approximate scatter diagrams can be drawn, giving the researcher an idea about the correctness of the cluster analysis results and a preview of their interpretation (Anderberg, 1973).

In MDS, using the matrix of distances between n elements, the aim is to determine the scatter of the elements, in as few dimensions as possible, as close as possible to its original form. From the scatter diagram, the distribution of the elements in the multidimensional space is obtained. By observing this scatter diagram, it is possible to decide which clustering methods can cluster the elements better, without any divergence (Johnson and Wichern, 1992; Dufrene and Legendre, 1997).

MATERIALS AND METHODS

The data of this study consisted of honeybee (Apis mellifera L.) genotype values obtained from 30 different locations of Turkey (Ozturk et al., 1992). From each of the honeybee genotypes, 12 measurements pertaining to morphological characteristics were taken and, via distance values based on these measurements, the bee genotypes were assigned to appropriate clusters.

Measuring scale and variable types should be correctly determined in order to obtain appropriate results from the applied methods. Since the morphological characteristics of honeybees depend on measurement, ratio scaling is used in the statistical analysis. However, because these morphological characteristics have different properties and measuring units, they are standardized and, consequently, a standard Z-matrix is obtained (Dueck et al., 2005). Using this standard Z-matrix, the distances between the paired honeybee genotypes are calculated by means of the Euclidean distance, probably the most commonly used and most familiar distance measure, where the distance between points i and j, denoted by dij, is defined as:

dij = √[ Σk=1..p (Xik − Xjk)² ]

where:
Xik = The value of the kth variable for the ith entity
dij = The distance between observations i and j
p = The number of variables (Everitt, 1979; Sandoz, 2006)
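
As a minimal illustration of this step (not the authors' original SPSS/Minitab workflow), the following Python sketch standardizes a hypothetical 30×12 measurement matrix and computes the pairwise Euclidean distances; the variable names and the random placeholder data are assumptions for illustration only.

import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical stand-in for the 30 genotypes x 12 morphological measurements
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 12))

# Standardize each variable (column) to zero mean and unit variance -> standard Z-matrix
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Pairwise Euclidean distances d_ij between the standardized genotype profiles
D = squareform(pdist(Z, metric="euclidean"))   # 30 x 30 symmetric distance matrix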

Cluster methods: Although there are many clustering methods for the classification of data, in this study we have applied methods that are in common use and take place in many software packages such as SPSS (2006), Minitab Inc. (2003) and Chirpaz et al. (2004). These methods are Single Linkage, Complete Linkage, Average Linkage, Centroid Linkage, Median Linkage, McQuitty, Ward Clustering and k-means.
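
The linkage rules listed above are also available in open-source tools; the sketch below, which assumes the standardized matrix Z from the previous sketch, shows how such cluster structures might be produced with SciPy and scikit-learn (the study itself used SPSS and Minitab, so this is only an illustrative equivalent; "weighted" is SciPy's name for McQuitty linkage).

from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

# SciPy equivalents of the hierarchical methods named above
methods = ["single", "complete", "average", "centroid", "median", "weighted", "ward"]

labels = {}
for m in methods:
    # Pass the standardized observations so centroid/median/ward use Euclidean geometry correctly
    tree = linkage(Z, method=m, metric="euclidean")
    labels[m] = fcluster(tree, t=4, criterion="maxclust")   # cut the dendrogram into 4 clusters

# Non-hierarchical counterpart: k-means with the same number of clusters
labels["k-means"] = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(Z)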

Principal Components Analysis (PCA): Although it is easy to carry out an analysis in the multivariable case, drawing inferences from its results is not an easy task. In cluster analysis, there are many distance measures and many methods based on these measures. Depending on the distance measure or the selected method, the results of cluster analysis can differ, and this can leave the researcher in uncertainty. That is why, in recent years, Principal Components have been widely used in cluster analysis. In this way, on the one hand the number of variables is reduced; on the other hand, the correlation pattern between variables, which negatively affects multivariable analysis methods, is removed. Furthermore, it is possible to derive detailed information from the plot of the observations over the first 2 principal components. The resulting diagram can give the researcher an idea about the correctness and interpretation of the cluster analysis results (Bensmail et al., 1997).
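
A short sketch of this idea, again assuming the standardized matrix Z from the earlier sketches, plots the scores on the first 2 principal components so the cluster structure can be inspected visually.

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

pca = PCA(n_components=2)
scores = pca.fit_transform(Z)              # scores on the first 2 principal components

plt.scatter(scores[:, 0], scores[:, 1])
for i, (x, y) in enumerate(scores, start=1):
    plt.annotate(str(i), (x, y))           # label each genotype by its province number
plt.xlabel("PC1 (%.1f%% of variance)" % (100 * pca.explained_variance_ratio_[0]))
plt.ylabel("PC2 (%.1f%% of variance)" % (100 * pca.explained_variance_ratio_[1]))
plt.show()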

The rationale of Principal Components can be given as follows: consider the position, in p-dimensional space, of a data matrix X that contains n individuals for each of p variables. The data matrix can be expressed as a cloud consisting of many points, each of which is determined by an individual. Owing to the absence of exact independence between the variables, the axes of this cloud-shaped geometric form will not be orthogonal to each other and, on this account, the form cannot be described simply. If, instead, these points are referred to the orthogonal axes of an ellipsoid, more detailed and explanatory information can be derived. For this aim, in the transformation applied, the new axes are brought into orthogonality without changing the total variance of the points along the first axes.

In practice, since the measuring units of the variables differ from each other, we do not use the raw data matrix X of dimension p×n; instead, we make use of the standard matrix Z of dimension p×n, which consists of the standardized values of the X variables. In this case, with T being a p×p transformation matrix, we have the relation Y = TZ. After the transformation, from the correlated Zij values we obtain Yij values that are uncorrelated with each other.

In principal component analysis, to account for the variance-covariance structure, k principal components, which are linear combinations of the p variables, are used instead of the p variables themselves. The method of PCA provides a basic structure for the ensuing multivariable statistical analyses, such as multivariable regression, cluster and factor analysis (Sharma, 1996).

In PCA, the basic instrument used for the analysis is either the variance-covariance matrix (Σ) or the correlation matrix. Let the vector of variables be of the form:

X: {X1, X2, . . . , Xp}

The eigenvalues related to these variables are:

λ1 ≥ λ2 ≥ λ3 ≥ . . . ≥ λp ≥ 0

and the eigenvectors are:

ei: {e1, e2, e3, . . . , ep}

To define briefly the aforementioned eigenvalues and eigenvectors, let A be a square matrix of order p. With λ a constant, the characteristic equation of this matrix is formed as the following equality:

|A − λI| = 0

The p roots obtained from this p-th order polynomial are called the eigenvalues (or characteristic roots) of the matrix A. For a matrix A to have real eigenvalues, it must be symmetric, that is A = AT; the original matrix and its transpose should be equal. Otherwise its eigenvalues will be complex numbers.

On the other hand, if λ1, λ2, . . . , λp are the eigenvalues of A, then

|A| = λ1 λ2 . . . λp

In this case, if a matrix A of order p has rank less than p, then at least one of the eigenvalues of this matrix is zero and its determinant is zero as well.

Let A be a matrix of order p×p and λ a constant. If a non-zero vector e of order p×1 satisfies the following condition, then this vector is said to be an eigenvector of A associated with the eigenvalue λ:

(A − λI)e = 0

The eigenvectors of a symmetric matrix are orthogonal to each other and this property provides the very basis of many multivariable analyses (Chang, 1983).

Using this property, the principal components can be found via the following equality:

Yi = ei′Z = ei1Z1 + ei2Z2 + . . . + eipZp,  i = 1, 2, . . . , p
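
The properties stated above can be checked numerically; the sketch below (an illustration, not part of the original analysis) obtains the eigenvalues and eigenvectors of the correlation matrix of the standardized data Z from the earlier sketches, verifies their orthogonality and the determinant-product relation, and forms the principal component scores.

import numpy as np

R = np.corrcoef(Z, rowvar=False)         # p x p correlation matrix (symmetric)

eigvals, eigvecs = np.linalg.eigh(R)     # real eigenvalues, since R is symmetric
order = np.argsort(eigvals)[::-1]        # sort so that lambda1 >= lambda2 >= ... >= lambdap
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Eigenvectors of a symmetric matrix are orthogonal
assert np.allclose(eigvecs.T @ eigvecs, np.eye(R.shape[0]))
# The determinant equals the product of the eigenvalues
assert np.isclose(np.linalg.det(R), np.prod(eigvals))

# Principal component scores: each component is a linear combination e_i' Z
Y = Z @ eigvecs                          # columns of Y are uncorrelated component scores
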
Multidimensional Scaling (MDS) methods: This method tries to determine the scatter of n individuals, measured on p variables, in a k-dimensional (k < p) space.

Cluster analysis is a method that reveals the relations between n individuals depending on p variables and makes use of dendrograms formed from the calculated distances. However, especially in hierarchical cluster methods, an individual assigned to a cluster cannot be reassigned at a later stage to a more suitable cluster by taking it out of the cluster to which it was first assigned. For some data this can cause wrong cluster structures to form. Therefore, to compare the structure of the formed clusters, it is necessary to consider the scatter of the original variables in k-dimensional space (Malhotra, 1987).

As in the case of cluster analysis, MDS also makes use of the distances between individuals. These original distances are taken as absolute distances and entered into the process.


Table 1: Kruskal’s stress statistics

According to the mentioned distances, indicator distances that are quite close to the original ones are produced on the coordinate system. The goodness of fit between these indicator distances and the original ones is determined by Kruskal's stress measure:

Stress = √[ Σ (dij − d̂ij)² / Σ dij² ]

where dij are the configuration distances and d̂ij the fitted (disparity) distances.

The goodness of fit between the resulting scatter diagram and that of the real variables in space is most reliably tested by the stress coefficient. For this reason, the fit of the determined configuration to the original one can be evaluated by the stress values reported by Ozdamar (1999) (Table 1).

Another statistical advantage of the MDS method is that it does not require an assumption of a specific distribution for the population. Since our data are on a ratio scale, the predicted configuration distances (d̂ij) are calculated by the linear regression equation:

dij = a + bδij + e

These predicted distances are called disparities (differences) and the disparity values d̂ij are distance data that are quite close to the dij values and represent the δij values (Ozdamar, 1999).
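
A possible way to reproduce this kind of analysis with open-source tools is sketched below: it embeds the distance matrix D from the earlier sketch in 2, 3 and 4 dimensions with scikit-learn's metric MDS (a SMACOF implementation, so the stress values will only approximate those of a classical Kruskal procedure) and evaluates a Stress-1-type measure for each solution.

import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

def kruskal_stress(D, coords):
    """Stress-1-type measure: sqrt( sum (d_ij - dhat_ij)^2 / sum d_ij^2 ) over the upper triangle."""
    Dhat = squareform(pdist(coords))               # distances in the fitted configuration
    iu = np.triu_indices_from(D, k=1)
    return np.sqrt(np.sum((D[iu] - Dhat[iu]) ** 2) / np.sum(D[iu] ** 2))

for k in (2, 3, 4):
    mds = MDS(n_components=k, dissimilarity="precomputed", random_state=0)
    coords = mds.fit_transform(D)                  # k-dimensional configuration of the 30 genotypes
    print(k, round(kruskal_stress(D, coords), 5))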

Hotelling-Lawley trace statistics: This method has been found to be the most appropriate criterion for determining the number of clusters (Milligan and Cooper, 1985; Dincer and Ozdamar, 1993). In this method, with B and W being, respectively, the between-groups and within-groups sums of squares matrices, the rth eigenvalue of the product matrix BW⁻¹ is denoted by λr. The trace of the product matrix BW⁻¹, defined as the sum of its diagonal elements, tr(BW⁻¹), which is equal to the sum of the eigenvalues, gives the Hotelling T² statistic

T² = tr(BW⁻¹) = Σr λr

(Giurcaneanu and Tabus, 2004). A large value of T² implies rejection of the H0 hypothesis. For sufficiently large values of n, using the approximation of:

the table values of the chi-square (χ²) distribution with p(k-1) degrees of freedom are used as the critical value. Furthermore, having applied the transformation

the F table values with v1 = (k-1) p2 and v2 = p (n-p-1) + 2 degrees of freedom can also be used as a critical value (Everitt, 1979).
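
For illustration, tr(BW⁻¹) can be computed directly from a candidate cluster labelling; the sketch below is an assumption-laden illustration, not the authors' exact test procedure, and it builds the between-groups and within-groups sums of squares matrices from the standardized data Z and a label vector such as one returned by fcluster in the earlier clustering sketch.

import numpy as np

def hotelling_lawley_trace(Z, labels):
    """tr(B W^-1), where B and W are the between- and within-groups SSCP matrices."""
    grand_mean = Z.mean(axis=0)
    groups = [Z[labels == g] for g in np.unique(labels)]
    # Within-groups sums of squares and cross-products
    W = sum((g - g.mean(axis=0)).T @ (g - g.mean(axis=0)) for g in groups)
    # Between-groups sums of squares and cross-products
    B = sum(len(g) * np.outer(g.mean(axis=0) - grand_mean, g.mean(axis=0) - grand_mean)
            for g in groups)
    return np.trace(B @ np.linalg.inv(W))   # equals the sum of the eigenvalues of BW^-1

# Example: score the 4-cluster average-linkage solution from the clustering sketch
# print(hotelling_lawley_trace(Z, labels["average"]))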

RESULTS

According to the aforementioned clustering methods, the honeybees belonging to the 30 regions have been clustered so as to form 2, 3, 4 and 5 clusters. The results of the most significant cluster structures are reported in Table 2.

Having applied PCA to the variables, the distribution of the bee genotypes was obtained according to the resulting first 2 principal components, as in Fig. 1. With the help of this graphic, a visual impression of the scatter and structure of the clusters was obtained, showing that the bee genotypes can be divided into 4 or 5 clusters.

In a similar way, for the honeybee genotypes, the MDS analysis and the resulting scatter in multidimensional space were obtained as in Fig. 2. The analysis was evaluated separately for 2, 3 and 4 dimensions. The related stress values obtained according to Kruskal's stress statistics were calculated as 0.07995, 0.04079 and 0.02119, respectively. Naturally, when the scatter of genotypes was evaluated in 4 dimensions, the stress value proved to be a minimum. However small the stress value of the diagram obtained in the multidimensional scaling analysis, the number of dimensions is also required to be a minimum, because when the number of dimensions increases, graphical illustrations become far less easy to understand. The preferred solution is one of 3 or fewer (k≤3) dimensions. For this reason, a three-dimensional scaling of the data was decided upon, and the resulting scale values and loci of the genotypes were determined as in Fig. 2. Moreover, the stress value derived from three dimensions is 0.04079 and the significance attached to this value is 0.025-0.05; that is, the goodness of fit of the configuration distances to the original ones is excellent.


Table 2: Clusters obtained according to different cluster methods

Table 3: The tests of cluster structures by Hotelling-Lawley trace statistics and their significance levels

Fig. 1: Scatter diagram of the honeybee genotypes obtained from first 2 PCs

Fig. 2: Three-dimensional scatter diagram of honeybee genotypes using MDS method

Considering Fig. 2, it is observed that the honeybee genotypes of the 30 regions could be assigned to 4 or 5 clusters. These results have come out to be similar to the scatters obtained from PCA, and no significant differences across the methods were found. As a matter of fact, both the PCA and MDS methods have given mutually supporting results for the scatter of the honeybee genotypes.

In the case of 4 clusters, the Single Linkage, Complete Linkage, Average Linkage, McQuitty and Ward methods came out with the most similar cluster structures. In the case of 5 clusters, the Single Linkage, Complete Linkage, Average Linkage, McQuitty, Centroid Linkage and Median methods came out with the most similar cluster structures. To make this point clearer, the Hotelling-Lawley trace test statistic was applied to each of the resulting cluster structures and the test results are reported in Table 3.

DISCUSSION

After the analyses and tests, many cluster structures were eliminated. The Hotelling-Lawley trace test, applied to the clustering methods and structures that seemed most significant in Table 3, ended in significant p-values in all cases. However, at the p<0.001 significance level, we have observed that the four-cluster solution provides the best fitting cluster structure, formed by methods such as McQuitty, Single Linkage, Complete Linkage, Average Linkage, Centroid Linkage and Ward, with the highest F-value (F = 97.5), and producing graphics that comply with those of the PCA and MDS analyses (Table 3).

Furthermore, when the genotypes of the honeybees are considered, the Sanliurfa bees, which belong to location number 27 and form cluster 4 by themselves, do not show any similarity to those of the other regions. This result confirms the findings of Ruttner (1988), who regarded the bees of the Southeast Anatolian region as one of the 6 ecotypes of the Iranian bee (A. m. meda).

CONCLUSION

Using cluster analysis by itself is not sufficient for assigning the individuals to the right clusters and, therefore, it can be said that use should be made of the other aforementioned methods. In this study, although they have given sufficient explanatory preliminary information in determining the number of clusters, it has been established that in different applications they do not always provide the same explanatory information. In this situation, when the subject matter is the clustering of individuals, it can be said that following the process below could prove useful:

First of all, one should bring about a profile of the individuals by a cluster analysis
In the next stage, by Multidimensional Scaling or Principal Components Analysis, the number of clusters and the clustering methods suitable for assigning the individuals to the correct clusters have to be determined
In the third stage, by making use of statistics such as the Hotelling-Lawley trace value, the best feasible cluster structure and its significance level (p) should be determined
