We then have one extracted feature for each cluster. The extracted feature corresponding to a cluster is a weighted combination of the words contained in that cluster. Three ways of weighting are introduced: hard, soft, and mixed. With this algorithm, the derived membership functions match closely with, and properly describe, the real distribution of the training data. The user need not specify the number of extracted features in advance, so the trial-and-error process of determining an appropriate number of extracted features is avoided. Experiments on real-world data sets show that our method runs faster and obtains better extracted features than other methods.
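The excerpt does not spell out the three weighting schemes, so the following is only a minimal sketch of one plausible reading, in which hard weighting lets every word in a cluster contribute equally, soft weighting scales each word by a per-word membership value, and mixed weighting interpolates between the two. The membership array `mu` and the mixing parameter `gamma` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def extract_feature(x, cluster_idx, mu=None, gamma=0.5, mode="hard"):
    """Combine one document's word values into a single extracted
    feature for one cluster, under three weighting schemes.

    x           -- 1-D array of the document's word values (length m)
    cluster_idx -- indices of the words belonging to this cluster
    mu          -- assumed per-word membership values (soft/mixed only)
    gamma       -- assumed mix between hard and soft weighting
    """
    hard = float(x[cluster_idx].sum())
    if mode == "hard":
        # Hard: every word in the cluster contributes equally.
        return hard
    # Soft: each word is weighted by its membership value.
    soft = float(np.dot(mu[cluster_idx], x[cluster_idx]))
    if mode == "soft":
        return soft
    # Mixed: convex combination of the hard and soft results.
    return gamma * hard + (1.0 - gamma) * soft
```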


Feature Reduction

In general, there are two ways of doing feature reduction: feature selection and feature extraction. In feature selection approaches, a new feature set $W' = \{w'_1, w'_2, \ldots, w'_k\}$ is obtained, which is a subset of the original feature set $W$; $W'$ is then used as the input for classification tasks. Information Gain (IG) is frequently employed in feature selection [10]. It measures, with an information-theoretic criterion, how much knowing a word reduces uncertainty about the class, and it gives each word a weight accordingly (a code sketch of this computation follows Section III below).

Feature Clustering

Feature clustering is an efficient approach for feature reduction [25], [29]. It groups all features into clusters, where the features in a cluster are similar to one another. The feature clustering methods proposed in [24], [25], [27], [29] are "hard" clustering methods, where each word of the original feature set belongs to exactly one word cluster; each word therefore contributes to the synthesis of only one new feature. Each new feature is obtained by summing up the words belonging to one cluster. Let $D$ be the matrix consisting of all the original documents, with $m$ features, and let $D'$ be the matrix consisting of the converted documents, with $k$ new features. The new feature set $W' = \{w'_1, w'_2, \ldots, w'_k\}$ corresponds to a partition $\{W_1, W_2, \ldots, W_k\}$ of the original feature set $W$, i.e., $W_t \cap W_q = \emptyset$ for $1 \le q, t \le k$ and $t \ne q$. Note that a cluster corresponds to an element of the partition. The $t$th feature value of the converted document $d_i$ is then calculated as

$$d'_{it} = \sum_{w_j \in W_t} d_{ij},$$

which is a linear sum of the feature values in $W_t$ (see the code sketch following Section III).

III. OUR METHOD (PREPROCESSING)

There are some issues pertinent to most of the existing feature clustering methods. First, the parameter $k$, the desired number of extracted features, has to be specified in advance. This places a burden on the user, since trial and error has to be carried out until an appropriate number of extracted features is found. Second, when calculating similarities, the variance of the underlying cluster is not considered; intuitively, the distribution of the data in a cluster is an important factor in the calculation of similarity. Third, all words in a cluster have the same degree of contribution to the resulting extracted feature. Sometimes it may be better to allow more similar words to have larger degrees of contribution. Our feature clustering algorithm is proposed to deal with these issues. Suppose we are given a document set $D$ of $n$ documents $d_1, d_2, \ldots, d_n$.
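As referenced in the Feature Reduction section above, here is a minimal sketch of how Information Gain can weight words for feature selection. The binary term-presence formulation and the helper names `ig_score` and `select_top_k` are assumptions made for this sketch; the excerpt cites [10] but does not give the computation.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def ig_score(presence, labels):
    """Information Gain of one word: the reduction in class
    uncertainty from knowing whether the word occurs."""
    gain = entropy(labels)
    for value in (True, False):
        mask = presence == value
        if mask.any():
            gain -= mask.mean() * entropy(labels[mask])
    return gain

def select_top_k(doc_term, labels, k):
    """Feature selection: keep the k words with the highest IG."""
    scores = [ig_score(doc_term[:, j] > 0, labels)
              for j in range(doc_term.shape[1])]
    return np.argsort(scores)[::-1][:k]
```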
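And here is the hard-clustering conversion $d'_{it} = \sum_{w_j \in W_t} d_{ij}$ from the Feature Clustering section, written out as code. Representing the partition as a list of column-index arrays is an assumption made for the sketch.

```python
import numpy as np

def convert_documents(D, partition):
    """Convert an n x m document matrix D into an n x k matrix D',
    where column t of D' sums the columns of D whose words belong
    to cluster W_t (hard feature clustering)."""
    D_new = np.zeros((D.shape[0], len(partition)))
    for t, cluster in enumerate(partition):
        # d'_{it} = sum of d_{ij} over the words w_j in W_t
        D_new[:, t] = D[:, cluster].sum(axis=1)
    return D_new

# Usage: 4 documents, 5 words, partitioned into 2 clusters.
D = np.array([[1, 0, 2, 0, 1],
              [0, 1, 0, 3, 0],
              [2, 2, 0, 0, 1],
              [0, 0, 1, 1, 0]])
partition = [np.array([0, 2, 4]), np.array([1, 3])]
print(convert_documents(D, partition))  # -> 4 x 2 matrix
```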