The entropy can be computed as:

W(t) = 1 + \frac{1}{\log_2 |D|} \sum_{d \in D} P(d,t) \log_2 P(d,t), \quad \text{with} \quad P(d,t) = \frac{tf(d,t)}{\sum_{l=1}^{n} tf(d_l, t)} \qquad (1)

Here the entropy gives a measure of how well a word is suited to separate documents by keyword search. For instance, words that occur in many documents will have low entropy. The entropy can be seen as a measure of the importance of a word in the given domain context. As index words, one can choose a number of words that have a high entropy relative to their overall frequency, i.e. among words occurring equally often, those with the higher entropy are preferred.

In order to obtain a fixed number of index terms that appropriately cover the documents, a simple greedy strategy can be applied: From the first document in the collection, select the term with the highest relative entropy (or information gain, as described in Sect. 3.1.1) as an index term. Then mark this document and all other documents containing this term. From the first of the remaining unmarked documents, select again the term with the highest relative entropy as an index term, and repeat this process until all documents are marked.
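A minimal sketch of the entropy weight of Eq. (1) might look as follows. The corpus, the tokenization (pre-split word lists), and the function name entropy_weight are illustrative assumptions, not part of the original text:

```python
import math

def entropy_weight(term, docs):
    """Eq. (1): W(t) = 1 + (1 / log2 |D|) * sum_d P(d,t) * log2 P(d,t).

    `docs` is a list of token lists; P(d,t) is the term frequency in d
    normalized by the term's total frequency over the collection.
    """
    tfs = [doc.count(term) for doc in docs]   # tf(d, t) for each document d
    total = sum(tfs)                          # sum over l of tf(d_l, t)
    if total == 0 or len(docs) < 2:           # guard: unseen term / degenerate |D|
        return 0.0
    acc = 0.0
    for tf in tfs:
        if tf > 0:                            # 0 * log 0 is taken as 0
            p = tf / total                    # P(d, t)
            acc += p * math.log2(p)
    return 1.0 + acc / math.log2(len(docs))

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs".split(),
]
print(entropy_weight("the", docs))  # spread over many documents -> low weight
print(entropy_weight("cat", docs))  # concentrated in one document -> weight 1.0
```

As the text notes, a term spread evenly over all documents gets weight 0, while a term concentrated in a single document gets weight 1.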
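The greedy covering strategy can be sketched on top of the weight function above; select_index_terms is a hypothetical helper, and resolving ties by sorted order is an assumption made only for reproducibility:

```python
def select_index_terms(docs):
    """Greedy covering sketch: repeatedly take the highest-weight term
    from the first unmarked document, then mark every document that
    contains it, until all documents are marked."""
    index_terms = []
    unmarked = list(range(len(docs)))
    while unmarked:
        first = docs[unmarked[0]]
        # pick the term of this document with the highest weight W(t)
        best = max(sorted(set(first)), key=lambda t: entropy_weight(t, docs))
        index_terms.append(best)
        # mark this document and all other documents containing the term
        unmarked = [i for i in unmarked if best not in docs[i]]
    return index_terms

print(select_index_terms(docs))  # e.g. ['cat', 'dog', 'and'] for the toy corpus
```

Each iteration marks at least the first unmarked document, so the loop terminates, and the number of selected index terms is bounded by the number of documents.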