... computed:

\[
W(t) = 1 + \frac{1}{\log_2 |D|} \sum_{d \in D} P(d,t) \log_2 P(d,t)
\quad \text{with} \quad
P(d,t) = \frac{\mathrm{tf}(d,t)}{\sum_{l=1}^{n} \mathrm{tf}(d_l,t)}
\tag{1}
\]

Here the entropy gives a measure of how well a word is suited to separate documents by keyword search. For instance, words that occur in many documents
will have low entropy. The entropy can be seen as a measure of the importance
of a word in the given domain context. As index words, one can choose words
that have a high entropy relative to their overall frequency, i.e. among
words occurring equally often, those with the higher entropy are preferred.
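
As a minimal sketch of Eq. (1), assuming each document is given as a list of tokens and that tf(d, t) is the raw count of t in d, the weight W(t) could be computed as follows. The function name `entropy_weight` and the toy documents are illustrative assumptions, not part of the original text.

```python
import math


def entropy_weight(term, docs):
    """Return W(t) = 1 + (1 / log2|D|) * sum_d P(d,t) * log2 P(d,t).

    docs is a list of token lists; tf(d, t) is taken to be the raw count
    of `term` in document d (an assumption made for this sketch).
    """
    if len(docs) < 2:
        raise ValueError("need at least two documents, since log2|D| must be nonzero")
    counts = [doc.count(term) for doc in docs]  # tf(d_l, t) for every document
    total = sum(counts)                         # denominator of P(d, t)
    if total == 0:
        raise ValueError("term does not occur in the collection")
    acc = 0.0
    for tf in counts:
        if tf > 0:                              # 0 * log2(0) is treated as 0
            p = tf / total
            acc += p * math.log2(p)
    return 1.0 + acc / math.log2(len(docs))


# Hypothetical toy collection of four documents.
docs = [["a", "b"], ["a", "c"], ["a", "d"], ["a", "b"]]
w_a = entropy_weight("a", docs)  # "a" is spread evenly over all documents: 0.0
w_b = entropy_weight("b", docs)  # "b" occurs in two of four documents: 0.5
w_c = entropy_weight("c", docs)  # "c" occurs in a single document: 1.0
```

In this toy collection the word that appears in every document receives the lowest weight and the word confined to one document the highest, which matches the statement that words occurring in many documents have low entropy.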
In order to obtain a fixed number of index terms that appropriately cover the
documents, a simple greedy strategy can be applied: From the first document in
the collection select the term with the highest relative entropy (or information
gain as described in Sect. 3.1.1) as an index term. Then mark this document
and all other documents containing this term. From the first of the remaining
unmarked documents select again the term with the hig...
LDV-FORUM, A Brief Survey of Text Mining
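
The greedy steps described above can be sketched roughly as follows. Because the source sentence is cut off, the continuation (repeating until no unmarked document remains or an optional limit is reached) is an assumption. The names `greedy_index_terms`, `score`, and `max_terms` are likewise hypothetical, and `score` stands in for whatever relative-entropy or information-gain measure is actually used.

```python
def greedy_index_terms(docs, score, max_terms=None):
    """Greedily pick index terms so that the chosen terms cover the documents.

    docs      : list of token lists, in collection order
    score     : callable mapping a term to its score (placeholder for
                relative entropy or information gain)
    max_terms : optional cap on the number of index terms (assumption)
    """
    doc_terms = [set(doc) for doc in docs]
    marked = [False] * len(docs)
    index_terms = []
    for i, terms in enumerate(doc_terms):
        if max_terms is not None and len(index_terms) >= max_terms:
            break
        if marked[i] or not terms:
            continue  # skip documents that are already covered or empty
        # Best-scoring term of the first unmarked document; sorting makes
        # ties resolve deterministically (alphabetically first wins).
        best = max(sorted(terms), key=score)
        index_terms.append(best)
        # Mark this document and every other document containing the term.
        for j, other in enumerate(doc_terms):
            if best in other:
                marked[j] = True
    return index_terms


# Hypothetical toy data: made-up scores, for illustration only.
toy_docs = [["apple", "pear"], ["pear", "kiwi"], ["kiwi", "fig"], ["plum"]]
toy_scores = {"apple": 0.9, "pear": 0.5, "kiwi": 0.7, "fig": 0.2, "plum": 0.1}
selected = greedy_index_terms(toy_docs, toy_scores.get)
# selected == ["apple", "kiwi", "plum"]
```

Iterating over the documents in order and skipping marked ones is one way to realise "the first of the remaining unmarked documents"; other stopping rules are equally compatible with the visible text.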
This note was uploaded on 06/19/2011 for the course IT 2258 taught by Professor Aymenali during the Summer '11 term at Abu Dhabi University.