Abstract: Class imbalance is a challenging problem in many real-world machine learning and data mining applications and a major cause of degraded classifier performance. A data set is said to be imbalanced if its instances are not distributed in equal proportion across the classes. Researchers working on class imbalance have identified that class overlap and high-dimensional data, in combination with class imbalance, are crucial factors in the deterioration of classifier performance. To overcome this problem, a model with two phases of preprocessing is proposed. The objective of the proposed model is threefold: to increase the number of minority class instances and thereby address the class imbalance problem, to remove class overlap, and to reduce the Type I and Type II errors of the classifier, i.e., the false positive and false negative rates. A false positive is a patient who does not have the disease but is predicted to have it, and a false negative is the reverse; both are serious in practice because a patient's life may be at stake. The proposed model was evaluated using precision, recall, F-measure, AUC, accuracy, kappa, false positive rate and false negative rate. The results showed that the proposed model is more efficient than existing models in the literature: all 9 classifiers achieved accuracy above 99% on all three datasets, with a significant reduction in false positive and false negative rates. The proposed model thus overcame the class overlap and class imbalance issues associated with real-world data sets and improved classifier performance in terms of false positive rate, false negative rate, AUC and accuracy.
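As an illustration of the evaluation measures named in the abstract, the following minimal sketch computes them from the four cells of a binary confusion matrix using their standard definitions. The function name `binary_metrics` and the example counts are hypothetical and are not taken from the paper; AUC is omitted because it requires ranked classifier scores rather than counts.

```python
def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def binary_metrics(tp, fp, fn, tn):
    """Standard measures from confusion-matrix counts (positive = has disease)."""
    total = tp + fp + fn + tn
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f_measure = safe_div(2 * precision * recall, precision + recall)
    accuracy = safe_div(tp + tn, total)
    # Type I error rate: healthy patients predicted as diseased.
    false_positive_rate = safe_div(fp, fp + tn)
    # Type II error rate: diseased patients predicted as healthy.
    false_negative_rate = safe_div(fn, fn + tp)
    # Cohen's kappa: agreement beyond what chance alone would produce.
    p_yes = safe_div(tp + fp, total) * safe_div(tp + fn, total)
    p_no = safe_div(fn + tn, total) * safe_div(fp + tn, total)
    expected = p_yes + p_no
    kappa = safe_div(accuracy - expected, 1 - expected)
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
        "kappa": kappa,
        "false_positive_rate": false_positive_rate,
        "false_negative_rate": false_negative_rate,
    }


if __name__ == "__main__":
    # Hypothetical imbalanced test set: 45 diseased and 955 healthy patients.
    for name, value in binary_metrics(tp=40, fp=10, fn=5, tn=945).items():
        print(f"{name}: {value:.4f}")
```

On imbalanced data such as this hypothetical example, accuracy can stay high while the false negative rate remains substantial, which is one reason to report the error rates and kappa alongside accuracy.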
B.V. Sumana and T. Santhanam, 2016. Prediction of Imbalanced Data Using Cluster Based Approach. Asian Journal of Information Technology, 15: 3022-3042.