

A Brief Survey of Text Mining (LDV-FORUM)

Silhouette Coefficient

... are very close to each other and are far away from other clusters. The cluster algorithm identified the structure very well. For silhouette coefficients in the range from 0.5 to 0.7, the objects are clearly assigned to the appropriate clusters. A silhouette coefficient in the range from 0.25 to 0.5 indicates a larger level of noise in the data set, although clusters are still identifiable; in this case the cluster algorithm could not clearly assign many objects to a single cluster. At values under 0.25 it is practically impossible to identify a cluster structure or to calculate cluster centers that are meaningful from the application's point of view: the cluster algorithm more or less "guessed" the clustering.

Comparative Measures

The purity measure is based on the well-known precision measure for information retrieval (cf. Pantel & Lin (2002)). Each resulting cluster P from a partitioning $\mathcal{P}$ of the overall document set D is treated as if it were the result of a query. Each set L of documents of a partitioning $\mathcal{L}$, which is obtained by manual labe...
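The two ideas in this excerpt can be illustrated with a minimal Python sketch. It rests on several assumptions that are not stated in the preview: the function names `interpret_silhouette` and `purity` are hypothetical; the top band (above 0.7) is inferred from the range that follows it, because the text is cut off there; values exactly on a boundary are assigned to the better band; and purity is computed in its commonly used form, where each cluster is matched to the manual label it shares the most documents with and the resulting precision is weighted by cluster size. The excerpt ends before it gives its own formula, so this may differ in detail from the authors' definition.

```python
from collections import Counter


def interpret_silhouette(s):
    """Map a silhouette coefficient to the qualitative bands described in the text."""
    if s > 0.7:
        # Assumption: this band is inferred; the source text is truncated here.
        return "structure very well identified"
    if s >= 0.5:
        return "objects clearly assigned"
    if s >= 0.25:
        return "noisy, clusters still identifiable"
    return "no identifiable cluster structure"


def purity(cluster_ids, label_ids):
    """Size-weighted best precision of each cluster against the manual labels.

    cluster_ids[i] is the cluster assigned to document i,
    label_ids[i] is its manually assigned label.
    """
    if len(cluster_ids) != len(label_ids):
        raise ValueError("cluster_ids and label_ids must have the same length")
    if not cluster_ids:
        raise ValueError("at least one document is required")

    # Count how often each manual label occurs inside each cluster.
    label_counts = {}
    for cluster, label in zip(cluster_ids, label_ids):
        label_counts.setdefault(cluster, Counter())[label] += 1

    # sum_P (|P| / |D|) * max_L (|P ∩ L| / |P|)  ==  sum_P max_L |P ∩ L| / |D|
    best_overlap = sum(max(counts.values()) for counts in label_counts.values())
    return best_overlap / len(cluster_ids)


if __name__ == "__main__":
    clusters = [0, 0, 0, 1, 1, 1]
    labels = ["a", "a", "b", "b", "b", "b"]
    print(interpret_silhouette(0.6))
    print(purity(clusters, labels))
```

In the small example, cluster 0 shares two of its three documents with label "a" and cluster 1 shares all three with label "b", so five of the six documents sit in the best-matching label of their cluster.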