on the processing time required to build those hierarchies.
Figure 1 shows the top-level clusters for a small corpus (972 documents), a large portion of which are news stories. The first cluster (239 documents) is about “quake, earthquake, richter scale,” etc., so we can conclude that it contains documents about earthquakes. Each cluster is labeled with related terms (from the cluster centroid) that convey a coherent topic. The seventh cluster (84 documents) appears to be on medical issues (“heart, patient, euthanasia, doctor, drug, hospital, medical,” etc.). We have expanded this cluster to show the nine sub-clusters (which represent medical subtopics); two of these are documents (i.e., from a cluster of size one), shaded in gray.
2. SYSTEM DESCRIPTION
The primary steps in generating a hierarchy are the extraction of features from documents, followed by clustering or grouping based on those features. Over forty configurable parameters control the algorithms in this process; we will describe the most significant variations.
2.1 Feature Extraction
Feature extraction maps each document into a concise representation of its topic. These extracted features are used to compare documents and to label their content for users. We use the vector space model [11] to represent documents as points in a high-dimensional topic space, where each dimension corresponds to a unique word or concept from the corpus. The mapping process extracts a list of unique terms from each document, assigns each a weight, and represents the document using the n highest-weighted terms. We refer to the parameter n as the feature vector length; our default is 25.
Several parameters control term selection and weighting. By default, we remove stop words. The weight assigned to each remaining term is a function of either the tf (term frequency) or tf·idf (term frequency times inverse document frequency). To use tf·idf, the system makes an initial pass over the text collection to create a “baseline”; the baseline stores the total number of documents that each unique term occurs in (i.e., the document frequency). Feature extraction consults this baseline to set term weights as a function of both the term’s frequency within the document and the number of documents the term occurs in. An important word is one that appears frequently within the current document, but infrequently across the other documents in the collection. For example, the word ‘excavation’ is uncommon in most corpora, so if it appeared frequently in a particular document, it would be given a high topic weight. Conversely, words like ‘say’ that might appear in almost every document are assigned a low weight (even if they occur frequently in a given document).
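The weighting described above can be sketched as follows. The exact formula is not specified here, so this assumes the common tf·idf form tf × log(N/df) and the default feature vector length n = 25; the function and variable names are illustrative, not the system's own.

```python
import math
from collections import Counter

def top_features(doc_tokens, doc_freq, num_docs, n=25):
    """Weight each unique term by tf * log(N / df); keep the n highest."""
    tf = Counter(doc_tokens)
    weights = {term: count * math.log(num_docs / doc_freq[term])
               for term, count in tf.items()}
    return sorted(weights, key=weights.get, reverse=True)[:n]

docs = [["quake", "quake", "richter"],
        ["say", "quake"],
        ["say", "heart", "doctor"]]
# The "baseline": number of documents each unique term occurs in.
df = Counter(t for d in docs for t in set(d))
print(top_features(docs[0], df, len(docs)))   # → ['richter', 'quake']
```

Note that ‘say’, which occurs in two of the three toy documents, would receive a lower weight than the rarer ‘richter’, matching the intuition above.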
Because feature extraction produces generic points in space as input to the clustering algorithms, the system is not limited to text; by implementing new feature extraction modules to map data to high-dimensional points, we could cluster any type of data.
2.2 Clustering
The goal of clustering is to group the points in a feature space optimally based on proximity, in order to form a hierarchy of clusters. We unified near-linear time complexity techniques from k-means ([8], [4]) and Scatter/Gather [1]. The techniques are all partitional, meaning that they simply separate a flat collection of items into a single set of “bins.” A hierarchy is built by recursively applying a partitional algorithm. The partitional algorithms each run in O(N) with respect to the number of documents, N, so the overall hierarchy is generated in O(N log N) time (assuming a balanced hierarchy).
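The recursive construction can be sketched as follows; `partition` here is a hypothetical stand-in (a simple round-robin split) for the real O(N) partitional step, so only the recursion structure is meaningful.

```python
def partition(points, k):
    """Hypothetical stand-in for the O(N) flat partitional step
    (the real system uses k-means / Scatter-Gather variants)."""
    bins = [[] for _ in range(k)]
    for i, p in enumerate(points):
        bins[i % k].append(p)
    return bins

def build_hierarchy(points, k=2, min_size=2):
    """Recursively apply the flat step; with balanced splits the
    recursion depth is O(log N), giving O(N log N) work overall."""
    if len(points) <= min_size:
        return points                  # leaf: the documents themselves
    return [build_hierarchy(b, k, min_size)
            for b in partition(points, k) if b]

print(build_hierarchy(list(range(8))))  # → [[[0, 4], [2, 6]], [[1, 5], [3, 7]]]
```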
Because documents and clusters are represented as points in space, we can compare them using vector cosine. Clusters include a “center” or “centroid” vector that is the weighted average of the documents or clusters they contain. To prevent longer documents from dominating centroid calculations, we normalize all document vectors to unit length. To compare a document to a cluster, we simply calculate the cosine between the document vector and the cluster’s centroid vector.
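A minimal sketch of these comparisons, using dictionaries as sparse vectors. For simplicity the centroid below is a plain (unweighted) average of already-normalized members, a simplification of the weighted average described above.

```python
import math

def normalize(vec):
    """Scale a {term: weight} document vector to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()}

def cosine(a, b):
    """Dot product; equal to the cosine when both vectors are unit length."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def centroid(member_vecs):
    """Plain average of (already normalized) member vectors."""
    c = {}
    for v in member_vecs:
        for t, w in v.items():
            c[t] = c.get(t, 0.0) + w / len(member_vecs)
    return c

cluster = [normalize({"quake": 2.0, "richter": 1.0}),
           normalize({"quake": 1.0, "scale": 1.0})]
print(cosine(cluster[0], centroid(cluster)))
```

Because every document vector is unit length, a long document contributes no more to the centroid than a short one.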
The partitional algorithms have three stages: seed selection, center adjustment, and cluster refinement. Seed selection is the process of choosing k candidate points in the feature space to serve as centers for the partitions². During center adjustment, documents are repeatedly assigned to the nearest center, and the center is recalculated based on the average location of all documents assigned to it, thereby moving it through the feature space. This process may be repeated multiple times. Afterwards, all documents are removed from the centers, and reassigned to the new closest center. Thus, it is important that the centers be distributed effectively enough that they each attract sufficient nearby, topically related documents. Cluster refinement is an optional final step for improving the new partitions.
2.2.1 Seed Selection
Seed selection picks centers to which the system can assign each
point in the input set to form a partition. We implemented three
seed selection algorithms: random, buckshot, and fractionation.
Random is the simplest; it picks k points randomly from the input
set as the initial centers.
The second method is buckshot, described by [1]. Buckshot picks √(k·n) points randomly from the input set of n items, and clusters them using a high-quality O(N²) clustering algorithm. The k centroids resulting from this clustering become the initial centers. For the O(N²) algorithm we use the group average variation of greedy agglomerative clustering, as did [1]. We will also refer to this as the “cluster subroutine.”
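A sketch of buckshot under these definitions. The naive group-average merge below recomputes pairwise similarities on every pass and uses plain dot products as the similarity measure; both are simplifying assumptions, and all names are illustrative.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_vector(group):
    """Component-wise average of equal-length vectors."""
    return [sum(v[d] for v in group) / len(group)
            for d in range(len(group[0]))]

def buckshot_seeds(points, k):
    """Sample ~sqrt(k*n) points, agglomeratively merge them down to k
    groups (group-average criterion), and return the group centroids."""
    m = min(len(points), max(k, int(math.sqrt(k * len(points)))))
    clusters = [[p] for p in random.sample(points, m)]   # singletons
    while len(clusters) > k:
        best, pair = -float("inf"), (0, 1)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sims = [dot(a, b) for a in clusters[i] for b in clusters[j]]
                score = sum(sims) / len(sims)            # group average
                if score > best:
                    best, pair = score, (i, j)
        i, j = pair
        clusters[i].extend(clusters.pop(j))              # merge best pair
    return [mean_vector(c) for c in clusters]
```

Because the subroutine runs on only √(k·n) sampled points, its quadratic cost stays linear in n overall.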
The third method, fractionation, is also described by [1]. It uses the same cluster subroutine to build a bottom-up hierarchy from the initial input set, clustering fixed-size groups of points at each step to maintain a linear time complexity. The top-level clusters of this hierarchy become the initial seeds.
2.2.2 Center Adjustment
Once k seeds are selected as centers, the system can iteratively
assign each point in the input set to the closest center and adjust
that center accordingly.
If a point’s similarity
to every center is
below the assignment similarity
threshold, t, it is not assigned to
any center. By default, we use a small non-zero fixed value for t,
though we are investigating
techniques for setting t dynamically.
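The assignment loop with the threshold t can be sketched as follows. The value t = 0.05 and all names here are illustrative assumptions (the text says only that t is a small non-zero fixed value), and similarities are plain dot products on unit-length vectors.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_vector(group):
    """Component-wise average of equal-length vectors."""
    return [sum(v[d] for v in group) / len(group)
            for d in range(len(group[0]))]

def adjust_centers(docs, centers, t=0.05, iterations=3):
    """Assign each unit-length doc vector to its closest center (leaving
    docs unassigned when every similarity is below t), then move each
    center to the mean of its assigned docs; repeat."""
    assigned = [[] for _ in centers]
    for _ in range(iterations):
        assigned = [[] for _ in centers]
        for d in docs:
            sims = [dot(d, c) for c in centers]
            best = max(range(len(centers)), key=sims.__getitem__)
            if sims[best] >= t:              # below t: left unassigned
                assigned[best].append(d)
        centers = [mean_vector(a) if a else c
                   for a, c in zip(assigned, centers)]
    return centers, assigned
```

Each pass moves a center toward the mean of the documents it attracted, so well-separated seeds settle into topically coherent regions.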
Continuous k-means [4] consists of following random seed selection with some number of iterations of center adjustment. In
² k defaults to 9 in our system.