Cluster-Evaluation_cost-based

Cluster Cost-based Evaluation Algorithm for Topic Detection

The clustering task: A set of messages is to be clustered according to target identity. The goal of the clustering is that each cluster should contain messages from a single target and that the number of clusters should equal the number of targets. Practically, the clusters should be as pure as possible and the number of clusters should be as small as possible.

How well the clustering algorithm achieves this goal is evaluated using an algorithm that assigns a cost for examining a message, Cexam, and a cost for missing a target message, Cmiss. Here is how the algorithm works:

1) For each target, index = k:
   a) For each cluster, index = i:
      Ncluster(i) is the number of messages in cluster i.
      Ntarget(i,k) is the number of target messages in cluster i for target k.
      i) Examine a message chosen at random from the cluster. This incurs a cost of Cexam. The probability that this message is a target message is Ptarget(i,k) = Ntarget(i,k)/Ncluster(i).
         (1) If the message is a target message, all the other messages in the cluster are also examined. This results in a total incurred cost for the cluster of Ncluster(i) Cexam.
         (2) If the message is not a target message, the cluster is abandoned. This results in a total incurred cost for the cluster of Cexam + Cmiss Ntarget(i,k).

This algorithm gives the following clustering cost measure:

$$
\begin{aligned}
C_{\mathrm{cluster}}
&= \sum_k \sum_i \Big\{ P_{\mathrm{target}}(i,k)\,\big[ C_{\mathrm{exam}}\, N_{\mathrm{cluster}}(i) \big]
 + \big( 1 - P_{\mathrm{target}}(i,k) \big)\,\big[ C_{\mathrm{exam}} + C_{\mathrm{miss}}\, N_{\mathrm{target}}(i,k) \big] \Big\} \\
&= \sum_k \sum_i \left\{ \frac{N_{\mathrm{target}}(i,k)}{N_{\mathrm{cluster}}(i)}\,\big[ C_{\mathrm{exam}}\, N_{\mathrm{cluster}}(i) \big]
 + \left( 1 - \frac{N_{\mathrm{target}}(i,k)}{N_{\mathrm{cluster}}(i)} \right) \big[ C_{\mathrm{exam}} + C_{\mathrm{miss}}\, N_{\mathrm{target}}(i,k) \big] \right\} \\
&= \sum_k \sum_i \left\{ C_{\mathrm{exam}} \left[ N_{\mathrm{target}}(i,k) + 1 - \frac{N_{\mathrm{target}}(i,k)}{N_{\mathrm{cluster}}(i)} \right]
 + C_{\mathrm{miss}}\, N_{\mathrm{target}}(i,k)\, \frac{N_{\mathrm{cluster}}(i) - N_{\mathrm{target}}(i,k)}{N_{\mathrm{cluster}}(i)} \right\}
\end{aligned}
$$

Interpretation of Ccluster is difficult because Ccluster is a function of the message set and tends to increase with the number of target messages.
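The expected-cost procedure for Ccluster described above can be sketched in Python as follows. This is a minimal illustration, not the evaluation code actually used: the function name `cluster_cost`, its parameter names, and the nested-list layout of the counts are assumptions made for this example.

```python
from typing import Sequence


def cluster_cost(
    n_target: Sequence[Sequence[int]],
    n_cluster: Sequence[int],
    c_exam: float,
    c_miss: float,
) -> float:
    """Expected clustering cost, summed over all targets and clusters.

    n_target[k][i] is the number of messages of target k in cluster i
    (hypothetical layout); n_cluster[i] is the size of cluster i.
    """
    total = 0.0
    for counts_for_target in n_target:  # loop over targets k
        for n_t, n_c in zip(counts_for_target, n_cluster):  # clusters i
            p_target = n_t / n_c  # chance the sampled message is on target
            cost_if_target = c_exam * n_c  # whole cluster is examined
            cost_if_not = c_exam + c_miss * n_t  # cluster is abandoned
            total += p_target * cost_if_target + (1.0 - p_target) * cost_if_not
    return total


# Two clusters of sizes 4 and 2; target 0 has 3 messages in the first
# cluster, target 1 has 1 message in the first and 2 in the second.
example_cost = cluster_cost([[3, 0], [1, 2]], [4, 2], c_exam=1.0, c_miss=1.0)
```

With Cexam = Cmiss = 1, the example evaluates to 9.5: target 0 contributes 4 + 1 and target 1 contributes 2.5 + 2.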
Knowledge of the minimum and maximum values of Ccluster, given the size of the message set and the number of target messages for each target, would aid interpretation.

Perfect (minimum cost) clustering would require that each target have one cluster with Ptarget = 1 and all others with Ptarget = 0, and that the number of clusters equal the number of targets. In this case, Ccluster would be:

$$
\begin{aligned}
C_{\min}
&= \sum_k \big\{ C_{\mathrm{exam}}\, N_{\mathrm{target}}(k) + C_{\mathrm{exam}}\, ( N_{\mathrm{clusters}} - 1 ) \big\} \\
&= C_{\mathrm{exam}} \sum_k \big\{ N_{\mathrm{target}}(k) + ( N_{\mathrm{clusters}} - 1 ) \big\} \\
&= C_{\mathrm{exam}} \Big[ \sum_k N_{\mathrm{target}}(k) + N_{\mathrm{targets}}\, ( N_{\mathrm{clusters}} - 1 ) \Big] \\
&= C_{\mathrm{exam}} \big[ N_{\mathrm{total}} + N_{\mathrm{targets}}\, ( N_{\mathrm{targets}} - 1 ) \big]
\end{aligned}
$$

where Ntarget(k) is the number of target messages for target k, Ntargets is the number of different targets, Nclusters is the number of different clusters (= Ntargets), and Ntotal is the total number of messages to be clustered. (But note that if Cmin is calculated for only a subset of targets, then the sum of the target messages will be less than Ntotal.)

The worst possible (maximum cost) clustering would require that Ntarget(i,k) be independent of i and that all clusters contain only one message. In this case, Ccluster would be:

$$
\begin{aligned}
C_{\max}
&= \sum_k \sum_i \Big\{ P_{\mathrm{target}}(k)\,\big[ C_{\mathrm{exam}} \big]
 + \big( 1 - P_{\mathrm{target}}(k) \big)\,\big[ C_{\mathrm{exam}} + C_{\mathrm{miss}}\, P_{\mathrm{target}}(k)\, N_{\mathrm{cluster}}(i) \big] \Big\} \\
&= N_{\mathrm{total}} \sum_k \Big\{ P_{\mathrm{target}}(k)\,\big[ C_{\mathrm{exam}} \big]
 + \big( 1 - P_{\mathrm{target}}(k) \big)\,\big[ C_{\mathrm{exam}} + C_{\mathrm{miss}}\, P_{\mathrm{target}}(k) \big] \Big\} \\
&= N_{\mathrm{total}} \sum_k \Big\{ C_{\mathrm{exam}} + C_{\mathrm{miss}}\, P_{\mathrm{target}}(k)\, \big( 1 - P_{\mathrm{target}}(k) \big) \Big\} \\
&= C_{\mathrm{exam}}\, N_{\mathrm{targets}}\, N_{\mathrm{total}}
 + C_{\mathrm{miss}}\, N_{\mathrm{total}} \sum_k \frac{N_{\mathrm{target}}(k)}{N_{\mathrm{total}}} \left( \frac{N_{\mathrm{total}} - N_{\mathrm{target}}(k)}{N_{\mathrm{total}}} \right)
\end{aligned}
$$

The normalized value of Ccluster is then:

$$
C_{\mathrm{norm}} = \frac{C_{\mathrm{cluster}} - C_{\min}}{C_{\max} - C_{\min}}
$$

For TDT purposes, a target is a reference topic, a message is a story, and a cluster is a system-defined topic. The following chart shows topic detection results for all TDT2 detection systems running on the default data sources (newswire + ASR) for a range of Cmiss values. Notice that as the ratio of Cmiss to Cexam increases the performances of six systems improve while those of two degrade. This ...
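The bounds Cmin and Cmax and the normalized cost defined above can be evaluated with a short Python sketch. The function names `min_cost`, `max_cost`, and `normalized_cost` and their parameters are hypothetical, chosen only for this illustration. `min_cost` uses the sum over Ntarget(k) rather than Ntotal, so that it also covers the case where only a subset of targets is scored.

```python
from typing import Sequence


def min_cost(n_target_k: Sequence[int], c_exam: float) -> float:
    """Cost of a perfect clustering: one pure cluster per target."""
    n_targets = len(n_target_k)
    # Each target examines its own cluster fully, plus one message in
    # each of the other (n_targets - 1) clusters.
    return c_exam * (sum(n_target_k) + n_targets * (n_targets - 1))


def max_cost(
    n_target_k: Sequence[int], n_total: int, c_exam: float, c_miss: float
) -> float:
    """Cost when every cluster holds exactly one message."""
    n_targets = len(n_target_k)
    # Equals n_total * sum_k P(k) * (1 - P(k)) with P(k) = n_k / n_total.
    miss_term = sum(n_k * (n_total - n_k) / n_total for n_k in n_target_k)
    return c_exam * n_targets * n_total + c_miss * miss_term


def normalized_cost(c_cluster: float, c_min: float, c_max: float) -> float:
    """Rescale a clustering cost using the minimum and maximum bounds."""
    return (c_cluster - c_min) / (c_max - c_min)


# Two targets with 3 messages each in a 6-message set, unit costs.
c_min = min_cost([3, 3], c_exam=1.0)
c_max = max_cost([3, 3], n_total=6, c_exam=1.0, c_miss=1.0)
c_norm = normalized_cost(9.5, c_min, c_max)
```

For this example Cmin = 8 and Cmax = 15, so a clustering with Ccluster = 9.5 has a normalized cost of 1.5/7, roughly 0.21.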

