Abstract: Distributed clustering is an emerging research area in the broader field of Knowledge discovery in databases. Normalization is an essential preprocessing step in data mining, to standardize values of all attributes or features from different dynamic range into a specified range. In this study, distributed K-Means clustering algorithm is extended by applying global normalization before performing the clustering on distributed datasets, without necessarily downloading all the data into a single site. The performance of proposed normalization based distributed K-Means clustering algorithm is compared against distributed K-Means clustering algorithm and normalization based centralized K-Means clustering algorithm. The quality of clustering is also compared by three normalization procedures, namely Min-max, Z-score and decimal scaling for the proposed distributed clustering algorithm. The comparative analysis shows that the distributed clustering results depend on the type of normalization procedure. The experiments are carried out for various numerical datasets of UCI machine learning data repository.
N. Karthikeyani Visalakshi and K. Thangavel, 2009. Impact of Normalization in Distributed K-Means Clustering. International Journal of Soft Computing, 4: 168-172.
Clustering methods seek to organize a set of objects into clusters such that objects within a given cluster have a high degree of similarity, whereas objects belonging to different clusters have a high degree of dissimilarity. These methods have been widely applied in various areas such as taxonomy, image processing, information retrieval, data mining, etc. (Jain et al., 1999). Todays large-scale datasets are usually logically and physically distributed, requiring a distributed approach to clustering. Huge amounts of data are stored in autonomous, geographically distributed sources over networks with limited bandwidth and large number of computational resources (Folino et al., 2006).
Traditional clustering methods require all data to be located at the place, where they are analyzed and cannot be applied in the case of multiple distributed datasets, unless all data are transferred to a single location and clustered. Due to technical, economical or security reasons, it is not always possible to transmit all data from different local sites to single location and then perform global clustering. It is obvious that alternate distributed clustering algorithms (Ghosh and Merugu, 2003) reduce the communication overhead, central storage requirements and computation times by exchanging few data and avoiding synchronization as much as possible.
Most of the existing distributed clustering algorithms available in the literature (Sanghamitra et al., 2006; Jin et al., 2006; Jeong et al., 2007) aim to provide hard clusters based on K-Means algorithm. The K-Means (Jain et al., 1999) typically uses Euclidean or squared Euclidean distance to measure the distortion between a data object and its cluster centroid. These distances are usually computed from raw data and not from standardized data. While, using Euclidean distances, the distance between any two objects is not affected by the addition of new objects to the analysis. However, the clustering results can be greatly affected by differences in scale among the dimension from, which the distances are computed. Data normalization is the linear transformation of data to a specific range. Therefore, it is worthwhile to enhance clustering quality by normalizing the dynamic range of input data objects into specific range (de Souto et al., 2008).
The effects of normalization are evaluated for different conventional clustering methods like K-Means, fuzzy C-Means, Partitioning around Mediods and Hierarchical clustering and showed that the clustering results depend on the normalization method, but only in centralized environment (Doherty et al., 2007; Seo Young Kim and Toshimitsu, 2008; de Souto et al., 2008). To the best of our knowledge, there is no research on normalization in distributed clustering. Hence, an attempt is made in this study, to propose a novel Normalization based Distributed K-Means (NDKM) clustering algorithm, by extending Genlins Distributed K-Means (DKM) algorithm (Genlin and Ling, 2007) with global normalization procedure. A comparative study is made on DKM, NDKM and Normalization based Centralized K-Means (NCKM) clustering, where all the data are merged into a single data source, normalized and clustered using K-Means algorithm. This study also takes an additional effort to compare the performance of three different normalization procedures (Luai et al., 2006), while clustering homogeneously distributed numerical datasets.
K-Means clustering: Traditionally, clustering algorithms have been classified into two categories: hierarchical and partitional. Commonly used algorithms in the hierarchical category are single linkage and complete linkage algorithms. K-Means is a widely used algorithm in the partitional class (Jain et al., 1999).
The K-Means method partitions the data set into K subsets such that all objects in a given subset are closest to the same centroid. In detail, it randomly selects K of the objects to represent the cluster centroids. Based on the selected objects, all remaining objects are assigned to their closer centroid one by one. The Euclidean distance between the object and every centroid is computed and the object is moved to one of the cluster centroid, which yields minimum distance. The value of selected centroid is recalculated by taking the mean of all data objects belonging to the same cluster. The operation is iterated for all the objects. If K cannot be known ahead of time, various values of K can be evaluated until the most suitable one is found.
Distributed clustering: Distributed clustering assumes that the objects to be clustered reside on different sites. This process is carried out in two different levels: local level and global level. In local level, all sites carry out clustering process independently from each other. After having completed the clustering, a local model such as cluster centroids is determined, which should reflect an optimum trade-off between complexity and accuracy. Next, the local model is transferred to a central site, where the local models are merged in order to form a global model. The resultant global model is again transmitted to local sites to update the local models (Januzaj et al., 2004).
The key idea of distributed clustering is to achieve a global clustering that is as good as the best centralized clustering algorithm with limited communication required to collect the local models or local representatives into a single location, regardless of the crucial choice of any clustering technique in local site. Distributed clustering algorithms (Sanghamitra et al., 2006) can be classified along two independent dimensions such as classification based on data distribution and classification based on data communication.
A common classification based on data distribution in the study (Ghosh and Merugu, 2003) is those, which apply to homogeneously distributed or heterogeneously distributed data. Homogeneous datasets contain the same set of attributes across distributed data sites. Examples include local weather, databases at different geographical locations and market-basket data collected at different locations of a grocery chain. Heterogeneous data model supports different data sites with different schemata. For example, a disease emergence detection problem may require collective information from a disease database, a demographic database and biological surveillance databases.
According to the type of data communication, distributed clustering algorithms are classified into two categories: multiple communications round algorithms and centralized ensemble-based algorithms. The first group consists of methods requiring multiple rounds of message passing. These methods require a significant amount of synchronization, whereas the second group works asynchronously. Many of the distributed clustering algorithms work in an asynchronous manner, first generating the local clusters and then combining those at the central site (Park and Kargupta, 2003).
Normalization: Preprocessing Luai et al. (2006) is often required before using any data mining algorithms to improve the results performance. Data normalization is one of the preprocessing procedures in data mining, where the attribute data are scaled so as to fall within a small specified range such as -1.0 to 1.0 or 0.0 to 1.0. Normalization before clustering is specially needed for distance metric, such as Euclidian distance, which are sensitive to differences in the magnitude or scales of the attributes. In real applications, because of the differences in range of attributes value, one attribute might overpower the other one. Normalization prevents outweighing attributes with large range like salary over attributes with smaller range like age. The goal is to equalize the size or magnitude and the variability of these attributes.
There are many methods for data normalization, which include Min-max normalization, Z-score normalization and normalization by decimal scaling. Min-max normalization performs a linear transformation on the original data. Suppose that mina and maxa are the minimum and the maximum values for attribute A. Min-max normalization maps a value v of A-v in the range (0, 1) by computing:
In Z-score normalization, the values for an attribute A are normalized based on the mean and standard deviation of A. A value v of A is normalized to v by computing:
where, and σA are the mean and the standard deviation, respectively of attribute A. This method of normalization is useful when the actual minimum and maximum of attribute A are unknown, or when there are outliers that dominate the min-max normalization.
Normalization by decimal scaling normalizes by moving the decimal point of values of attribute A. The number of decimal points moved depends on the maximum absolute value of A. A value v of A is normalized to v by computing:
where, j is the smallest integer such that Max(|v|)<1.
MATERIALS AND METHODS
Normalization based distributed K-Means clustering: The Distributed K-Means algorithm is centralized ensemble based distributed clustering algorithm introduced in Genlin and Ling (2007) for clustering homogeneously distributed datasets. The DKM first does clustering in local site using K-Means, then sent all centroid values to central site, finally global centroid values of underlying global clustering are obtained by using K-Means again. The NDKM is extended version of DKM, where global normalization is performed to standardize the data objects into specific range. The step by step procedure of proposed algorithm is described in Fig. 1. First minimum and maximum values of each feature vectors are extracted from all local datasets and transmitted to central place, where global minimum and maximum values are identified. These two values are transmitted to local sites to perform global normalization using min-max normalization procedure. Next, normalized objects are clustered using K-Means algorithm to obtain centroids matrix and cluster index for each dataset. All local centroids are merged and clustered using K-Means algorithm, to group similar centroids and obtain global centroids. The global centroids is now transmitted to local site, where the Euclidean distance of each object from the global set of centroids are computed and assigned to the nearest cluster centroid.
The proposed algorithm can also be implemented for other types of normalization procedure, by simply modifying first two steps in the algorithm.
|Fig. 1:||Normalization based distributed K-Means algorithm|
In case of Z-score normalization method, global mean and standard deviation values are to be calculated based on local mean and standard deviation of each attribute from each dataset. Similarly, global maximum absolute value of each attribute is to be computed, to perform normalization on individual datasets.
RESULTS AND DISCUSSION
The main objective of this research is to explore the impact of normalization in the process of distributed K-Means clustering. The experimental analysis is performed with six benchmark datasets in two aspects. First, the efficiency of NDKM with Min-max normalization is compared against DKM and NCKM with Min-max normalization procedure. Next, the performance of three different normalization procedures is evaluated for NDKM. The information about the datasets available in the UCI machine learning data repository (Merz and Murphy, 1998) is shown in Table 1. The performance of the clustering algorithm is measured in terms of three external validity measures (Halkidi et al., 2002; Hui et al., 2006) namely rand index, F-measure and entropy. In case of F-measure and rand index, the value 1 indicates that the data clusters are exactly same. But, the value 0 indicates that the data clusters are exactly same for Entropy measure. For the purpose of experimental setup, the dataset is divided into three disjoint subsets and each subset is considered as distributed data source.
Experiment 1: This experiment is to show the significance of Min-max normalization in distributed clustering. The results of NDKM, in comparison with the results of DKM and NCKM, in terms of rand index, F-measure and entropy are shown in Table 2-4, respectively.
|Table 1:||Details of datasets|
|Table 2:||Performance analysis of NDKM based on rand index|
|Table 3:||Performance analysis of NDKM based on F-Measure|
|Table 4:||Performance analysis of NDKM based on entropy|
From the Table 2-4, it is observed that NDKM algorithm yields better results than DKM for all datasets in terms of rand index, F-measure and entropy. The values of all three measures are highly appreciable for breast cancer datasets, with NDKM algorithm. It is noted that the quality of clusters produced by NDKM is as good as NCKM for all datasets, in terms of all three measures. Moreover, the performance of NDKM is slightly higher than the NCKM for Australian and Image Segmentation datasets.
Experiment 2: The second experiment compares the quality of clusters obtained by NDKM with min-max normalization procedure against two other normalization procedures Z-score and decimal scaling. The domino effect of three normalization procedures based on the external validity measures, rand index, F-measure and entropy is shown in Table 5-7, respectively. From these Table 5-7, the following observations are identified. All three normalization procedures produce almost same quality clusters for Breast cancer, Mammography and Pen digit datasets.
|Table 5:||Comparative analysis of normalization methods based on rand index|
|Table 6:||Comparative analysis of normalization methods based on F-measure|
|Table 7:||Comparative analysis of normalization methods based on entropy|
Both min-max and Z-score methods yield better performance than Decimal Scaling for Satellite image dataset. The Min-max is suitable for Image Segmentation dataset, whereas Z-score is suitable for Australian dataset. Hence, it is identified that there is no unique normalization procedure, which yields better quality clusters for all datasets and so every dataset supports specific normalization method. In general, it is also observed that the methods Min-max and Z-score are suitable for more datasets than Decimal Scaling. So, it may be concluded that the selection of normalization method depends on specific domain.
A novel method of distributed K-Means clustering using global normalization is proposed to produce optimum quality clusters in distributed environment. Comprehensive experiments on six benchmark numerical datasets have been conducted to study the impact of normalization and to compare the effect of three different normalization procedures in distributed K-Means clustering. It can be concluded that the normalization before distributed clustering leads to obtain better quality clusters. Also, it is important to select specific normalization procedure, according to the nature of datasets. In future, appropriate normalization procedure can be applied in distributed fuzzy clustering and its variants to improve the performance.