Abstract: Text categorization, which consists of automatically assigning documents to a set of categories, must deal with a huge number of features. Feature selection is one of the most important and frequently used data preprocessing techniques in data mining. It removes irrelevant, redundant, or noisy data and has an immediate effect on data mining applications. In this study, we propose a filter system for feature set extraction based on a similarity distance measure. Although past literature has suggested that using features from irrelevant categories can improve text categorization performance, we believe that incorporating only relevant features can be highly effective. An experimental comparison is carried out between the distance measure and four well-known classification techniques: C4.8, Multilayer Perceptron, Least Mean Square, and Linear Regression. The results show that our proposed method performs comparably to the other classification techniques, especially on a collection of highly overlapping topics; C4.8 was also found to be a better classifier than the other techniques.
Christy, A. and P. Thambidurai, 2006. Feature Selection for Efficient Text Categorization and Knowledge Discovery Using Classification Techniques. Asian Journal of Information Technology, 5: 872-876.