Model-based Overlapping Clustering

Arindam Banerjee, Chase Krumpelman, Joydeep Ghosh
Dept. of Electrical and Computer Engineering, University of Texas at Austin, Austin, TX 78712, USA

Sugato Basu, Raymond J. Mooney
Dept. of Computer Sciences, University of Texas at Austin, Austin, TX 78712, USA

ABSTRACT

While the vast majority of clustering algorithms are partitional, many real world datasets have inherently overlapping clusters. Several approaches to finding overlapping clusters have come from work on analysis of biological datasets. In this paper, we interpret an overlapping clustering model proposed by Segal et al. [23] as a generalization of Gaussian mixture models, and we extend it to an overlapping clustering model based on mixtures of any regular exponential family distribution and the corresponding Bregman divergence. We provide the necessary algorithm modifications for this extension, and present results on synthetic data as well as subsets of 20-Newsgroups and EachMovie datasets.

Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications - Data Mining; I.2.6 [Artificial Intelligence]: Learning

General Terms
Algorithms

Keywords
Overlapping clustering, exponential model, Bregman divergences, high-dimensional clustering, graphical model.

1. INTRODUCTION

Most clustering methods partition the data into non-overlapping regions, where each point belongs to only one cluster. In a variety of important applications, though, overlapping clustering, wherein some items are allowed to be members of two or more discovered clusters, is more appropriate. For example, in biology, genes often simultaneously participate in multiple processes; therefore, when clustering micro-array gene expression data, it is appropriate to assign genes to multiple, overlapping clusters [23, 4]. Similarly, when clustering documents into topic categories, documents may contain multiple relevant topics and an overlapping clustering might be more relevant [22].
In the 20-Newsgroups benchmark dataset, articles with multiple topics are cross-posted to multiple newsgroups. Ideally, a clustering algorithm applied to this data would allow articles to be assigned to multiple topic labels and would rediscover the original cross-posted articles. In the EachMovie dataset used to test recommender systems, many movies belong to more than one genre, such as Aliens, which is listed in the action, horror and science fiction genres. An overlapping clustering algorithm applied to this data should automatically discover such multi-genre movies.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. KDD'05, August 21-24, 2005, Chicago, Illinois, USA. Copyright 2005 ACM 1-59593-135-X/05/0008 ...$5.00.

In this paper, we generalize an approach to overlapping clustering introduced by Segal et al. [23], hereafter referred to as the SBK model. The original method was presented as a specialization of a Probabilistic Relational Model (PRM) [14] and was specifically designed for clustering gene expression data. We present an alternative view of their basic approach as a generalization of standard mixture models. While the original model maximized likelihood over constant variance Gaussians, we generalize it to work with any regular exponential family distribution, and corresponding Bregman divergences, thereby making the model applicable for a wide variety of clustering distance functions [2].
This generalization is critical to the effective application of the approach to high-dimensional sparse data, such as those typically encountered in text mining and recommender systems, where Gaussian models and Euclidean distance are known to perform poorly. Further, we propose a novel algorithm, dynamicM, that assigns instances to multiple clusters for the general model. We also outline an alternating minimization algorithm that monotonically improves the objective function for overlapping models for any regular exponential family distribution.

In order to demonstrate the generality and effectiveness of our approach, we present experiments in which we produced and evaluated overlapping clusterings for subsets of the 20-Newsgroups and EachMovie data sets mentioned above. An alternative "straw man" algorithm for overlapping clustering is to produce a standard probabilistic "soft" clustering by mixture modeling and then make a hard assignment of each item to one or more clusters using a threshold on the cluster membership probability. The ability of thresholded soft clustering to produce good overlapping clusterings is an open question. Consequently, we experimentally compare our approach to an appropriate thresholded soft clustering and show that the proposed overlapping clustering model produces groupings that are more similar to the original overlapping categories in the 20-Newsgroups and EachMovie data.

A brief word on notation: uppercase letters such as X signify a matrix, whose ith row vector is represented as $X_i$, whose jth column vector is represented as $X^j$, and whose entry in row i and column j is represented as $X_i^j$ as well as $X_{ij}$.

2. BACKGROUND

In this section, we give a brief introduction to the PRM-based SBK model. Probabilistic Relational Models (PRMs) [14] extend the basic concepts of Bayesian networks into a framework for representing and reasoning with probabilistic relationships between entities in a relational structure.
The SBK model is an instantiation of a PRM for capturing the relationships between genes, processes, and measured expression values on DNA microarrays. The structure of the instantiated model succinctly captures the underlying biological understanding of the mechanism generating the observed microarray values: namely, that genes participate in processes, experimental conditions cause the invocation of processes at varying levels, and the observed expression value in any particular microarray spot is due to the combined contributions of several different processes. The SBK model places no constraints on the number of processes in which any gene might participate, and thus gene membership in multiple processes, i.e., overlapping clustering, naturally follows.

The SBK model works with three matrices: the observed real expression matrix X (genes × experiments), a hidden binary membership matrix M (genes × processes) containing the membership of each gene in each process, and a hidden real activity matrix A (processes × conditions) containing the activity of each process for each experimental condition. The SBK model assumes that the expression value $X_i^j$ corresponding to gene i in experiment j has a Gaussian distribution with constant variance $\sigma^2$. The mean of the distribution is equal to the sum of the activity levels $A_h^j$ of the processes h in which gene i participates, so that

$$p(X_i^j \mid M_i, A^j) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\Big( -\frac{1}{2\sigma^2} \big( X_i^j - M_i A^j \big)^2 \Big).$$

The SBK model further assumes that M and A are independent, so that $p(M, A) = p(M)\,p(A)$, and that the $X_i^j$'s are conditionally independent given $M_i$ and $A^j$. M and A are assumed to be component-wise independent as well. Assuming that the elements of A are uniformly distributed, and considering the log-likelihood of the joint distribution, we have

$$\max_{M,A} \log p(X, M, A) \;\equiv\; \min_{M,A} \Big[ \frac{1}{2\sigma^2} \| X - MA \|^2 - \log p(M) \Big].$$

To find the value of the hidden variables M and A, the SBK model uses an EM approach [12].
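To make the generative assumption and the squared-error objective above concrete, here is a minimal NumPy sketch. It samples a small binary M and real-valued A, forms X = MA plus Gaussian noise, and evaluates the reconstruction term of the objective. The toy sizes, the membership probability 0.3, the noise level `sigma`, and the helper name `sbk_loss` are illustrative assumptions, not values or code from the paper.

```python
import numpy as np

def sbk_loss(X, M, A, sigma=1.0):
    """Reconstruction term of the SBK objective: ||X - MA||^2 / (2 sigma^2)."""
    residual = X - M @ A
    return float(np.sum(residual ** 2) / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
n, k, d = 6, 3, 4                                # genes, processes, conditions (toy sizes)
M = (rng.random((n, k)) < 0.3).astype(float)     # binary memberships; a row may hold several 1s
A = rng.normal(size=(k, d))                      # real-valued process activities
sigma = 0.1                                      # assumed noise level
X = M @ A + sigma * rng.normal(size=(n, d))      # observed expression values

print("loss with the true M:", sbk_loss(X, M, A, sigma))
print("loss with an all-zero M:", sbk_loss(X, np.zeros_like(M), A, sigma))
```

The prior term, $-\log p(M)$, is left out of this sketch; only the data-fit term is shown.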
The E step involves finding the best estimates of the binary gene-process memberships M. The M step involves computing the prior probability of gene membership in each process p(M) and the process-condition activations A.

The core parameter estimation problem is much easier to understand if we recast it as a matrix decomposition problem, initially ignoring the priors. With the knowledge that there are k relevant processes in the observations, we want to find a decomposition of the observed expression matrix $X \in \mathbb{R}^{n \times d}$ into a binary membership matrix $M \in \{0, 1\}^{n \times k}$ and a real valued activation matrix $A \in \mathbb{R}^{k \times d}$ such that $\| X - MA \|^2$ is minimized. Hence, the problem is one of matrix factorization, and the difficulty arises from the fact that M is a binary matrix.

3. THE MODEL

In this section, we first outline a simplistic way of getting overlaps from soft-clustering based on mixture models. Then, we propose our model for overlapping clustering, hereafter referred to as MOC, as a generalization of the SBK model.

3.1 Overlapping Clustering with Mixture Model

Given a set of n data points $\{X_i\}_{i=1}^{n}$ in $\mathbb{R}^d$ represented by an n × d matrix X, fitting a mixture model to X is equivalent to assuming that each data point $X_i$ is drawn independently from a probability density

$$p(X_i \mid \Theta) = \sum_{h=1}^{k} \alpha_h \, p_h(X_i \mid \theta_h),$$

where $\Theta = \{\theta_h\}_{h=1}^{k}$, k is the number of mixture components, $p_h$ is the probability density function of the hth mixture component with parameters $\theta_h$, and $\alpha_h$ are the component mixing coefficients such that $\alpha_h \ge 0$ and $\sum_{h=1}^{k} \alpha_h = 1$.

In mixture model estimation, each point $X_i$ is assumed to be generated from only one underlying mixture component. Let Z be an n × k boolean matrix such that $Z_{ij}$ is 1 if the jth component density was selected to generate $X_i$, and 0 otherwise. Let $z_i$ be a hidden random variable corresponding to the index of the 1 in each row $Z_i$: every $z_i$ is therefore a multinomial random variable, since it can take one of k discrete values. Since the Z matrix is unknown, the optimum parameters $\Theta$ of the mixture model can be obtained using the well-known iterative Expectation Maximization (EM) algorithm [12]. The probability value $p(z_i = h \mid X_i, \Theta)$ after convergence of the EM algorithm gives the probability of the point $X_i$ being generated from the hth mixture component. Using these probabilities, mixture models are often used to generate a partitional clustering of the data, where the points estimated to be most probably generated from the hth mixture model component are considered to constitute the hth partition.

In order to use the mixture model to get overlapping clustering, where a point can deterministically belong to multiple clusters, one can choose a threshold value $\lambda$ such that $X_i$ belongs to the hth partition if $p(z_i = h \mid X_i, \Theta) > \lambda$. Such a thresholding technique can enable $X_i$ to belong to multiple clusters. However, there are two problems with this method. The first is the choice of the parameter $\lambda$, which is difficult to learn given only X. The second is that this is not a natural generative model for overlapping clustering. In the mixture model, the underlying model assumption is that a point is generated from only one mixture component, and $p(z_i = h \mid X_i, \Theta)$ simply gives the probability of $X_i$ being generated from the hth mixture component. However, an overlapping clustering model should generate $X_i$ by simultaneously activating multiple mixture components. We describe one such model in the next section.

3.2 Proposed Overlapping Clustering Model

The overlapping clustering model that we present here is a generalization of the SBK model described in Section 2. The SBK model minimizes the squared loss between X and MA, and their proposed algorithm is not applicable for estimating the optimal M and A corresponding to other loss functions. In MOC, we generalize the SBK model to work with a broad class of probability distributions, instead of just Gaussians, and propose an alternating minimization algorithm for the general model.

The most important difference between MOC and the mixture model is that we remove the multinomial constraint on the matrix Z, so that it can now be an arbitrary boolean matrix. To distinguish it from the constrained matrix Z, we denote this unconstrained boolean matrix as the membership matrix M. Every point $X_i$ now has a corresponding k-dimensional boolean membership vector $M_i$: the hth component $M_i^h$ of this membership vector is a Bernoulli random variable indicating whether $X_i$ belongs to the hth cluster. So, a membership vector $M_i$ with multiple 1's directly encodes the fact that the point $X_i$ belongs to multiple clusters.

Let us now consider the probability of generating the observed data points in MOC. A is the activity matrix of this model, where $A_h^j$ represents the activity of cluster h while generating the jth feature of the data. The probability of generating all the data points is

$$p(X \mid \Theta) = p(X \mid M, A) = \prod_{i,j} p(X_i^j \mid M_i, A^j), \qquad (1)$$

where $\Theta = \{M, A\}$ are the parameters of p, and the $X_i^j$'s are conditionally independent given $M_i$ and $A^j$. In MOC, we assume p to be the density function of any regular exponential family distribution, and also assume that the expectation parameter corresponding to $X_i$ is of the form $M_i A$, so that $E[X_i] = M_i A$. In other words, using vector notation, we assume that each $X_i$ is generated from an exponential family density whose mean $M_i A$ is determined by taking the sum of the activity levels of the components that contribute to the generation of $X_i$, i.e., $M_i^h$ is 1 for the active components. Using the above assumptions and the bijection between regular exponential distributions and regular Bregman divergences [2], the conditional density can be represented as:

$$p(X_i^j \mid M_i, A^j) \propto \exp\{ -d_\phi(X_i^j, M_i A^j) \}, \qquad (2)$$

where $d_\phi$ is the Bregman divergence corresponding to the chosen exponential density p.
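The two Bregman divergences used in the experiments of this paper, squared Euclidean distance and I-divergence, can be written in a few lines. The sketch below is illustrative only: the function names and the clipping constant `eps` (used to keep the logarithm finite) are assumptions, not part of the paper.

```python
import numpy as np

def squared_euclidean(x, y):
    """Bregman divergence generated by phi(x) = ||x||^2: sum_j (x_j - y_j)^2."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sum((x - y) ** 2))

def i_divergence(x, y, eps=1e-12):
    """Un-normalized relative entropy: sum_j x_j log(x_j / y_j) - x_j + y_j."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    y = np.maximum(y, eps)                       # guard against division by zero
    # The term x log(x / y) is taken to be 0 wherever x == 0.
    log_term = np.where(x > 0, x * np.log(np.maximum(x, eps) / y), 0.0)
    return float(np.sum(log_term - x + y))

x = np.array([1.0, 2.0, 0.0])
y = np.array([1.0, 1.0, 1.0])
print(squared_euclidean(x, y), i_divergence(x, y))
```

Both functions return 0 when the two arguments coincide and are non-negative otherwise, which is the property the clustering objective relies on.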
For example, if p is the Poisson density, $d_\phi$ is the I-divergence; if p is the Gaussian density, $d_\phi$ is the squared Euclidean distance [2].

Similar to the SBK model, the overlapping clustering model tries to optimize the following joint distribution of X, M and A:

$$p(X, M, A) = p(M, A)\,p(X \mid M, A) = p(M)\,p(A)\,p(X \mid M, A) = \prod_{i,h} p(M_i^h) \prod_{h,j} p(A_h^j) \prod_{i,j} p(X_i^j \mid M_i, A^j).$$

Making similar model assumptions as in Section 2, we assume that M and A are independent of each other a priori and that A is distributed uniformly over a sufficiently large compact set, implying that $p(M, A) = p(M)\,p(A) \propto p(M)$. Then, maximizing the log-likelihood of the joint distribution gives

$$\max_{M,A} \log p(X, M, A) \;\equiv\; \max_{M,A} \Big[ \sum_{i,h} \log p(M_i^h) - \sum_{i,j} d_\phi(X_i^j, M_i A^j) \Big] \;\equiv\; \min_{M,A} \Big[ \sum_{i,j} d_\phi\big(X_{ij}, (MA)_{ij}\big) - \sum_{i,h} \log \alpha_{ih} \Big], \qquad (3)$$

where $\alpha_{ih} = p(M_i^h)$ is the (Bernoulli) prior probability of the i-th point having a membership $M_i^h$ to the h-th cluster.

4. ALGORITHMS AND ANALYSIS

In this section, we propose and analyze algorithms for estimating the overlapping clustering model given an observation matrix X. In particular, from a given observation matrix X, we want to estimate the prior matrix $\alpha$, the membership matrix M and the activity matrix A so as to maximize p(M, A, X), the joint distribution of (X, M, A). The key idea behind the estimation is an alternating minimization technique that alternates between updating $\alpha$, M and A.

4.1 Updating $\alpha$

The prior matrix $\alpha$ can be directly calculated from the current estimate of M. If $\pi_h$ denotes the prior probability of any point belonging to cluster h, then, for a particular point i, we have

$$\alpha_{ih} = \pi_h^{M_i^h} (1 - \pi_h)^{1 - M_i^h}.$$

Since $\pi_h$ is the probability of a Bernoulli random variable, and the Bernoulli distribution is a member of the exponential family, the maximum likelihood estimate is just the sample mean of the sufficient statistic [2]. Since the sufficient statistic for Bernoulli is just the indicator of the event, the maximum likelihood estimate of the prior $\pi_h$ of cluster h is just

$$\pi_h = \frac{1}{n} \sum_{i} \mathbf{1}\{ M_i^h = 1 \}.$$

Thus, one can compute the prior matrix $\alpha$ using these update equations.

4.2 Updating M

In the main alternating minimization technique, for a given X, A, the update for M has to minimize

$$\sum_{i,j} d_\phi\big(X_{ij}, (MA)_{ij}\big).$$

Since M is a binary matrix, this is an integer optimization problem and there is no known polynomial time algorithm to solve it exactly. The explicit enumeration method involves evaluating all $2^k$ possibilities for every data point, which can be prohibitive for even moderate values of k. So, we investigate simple techniques of updating M so that the loss function is minimized.

There can be two ways of coming up with an algorithm for updating M. The first one is to consider a real relaxation of the problem and allow M to take real values in [0, 1]. For particular choices of the Bregman divergence, specific algorithms can be devised to solve the real relaxed version of the problem. For example, when the Bregman divergence is the squared loss, the corresponding problem is just the bounded least squares (BLS) problem given by

$$\min_{M :\, 0 \le M_i^h \le 1} \| X - MA \|^2,$$

for which there are well studied algorithms [6]. Now, from the real bounded matrix M, one can get the cluster membership by rounding the $M_i^h$ values either by proper thresholding [23] or by randomized rounding. If $k_0$ clusters get turned "on" for a particular data point, the SBK model performs an explicit $2^{k_0}$ search over the "on" clusters in order to get improved results. Another alternative could be to keep M in its real relaxed version until the overall alternating minimization method has converged, and round it at the very end. The update equations of the priors $\pi_h$ and $\alpha_{ih}$ have to be appropriately changed in this case.

Although the real relaxation approach seems simple enough for the squared loss case, it is not necessarily so for all Bregman divergences. In the general case, one may have to solve an optimization problem (not necessarily convex) with inequality constraints, before applying the heuristics outlined above. In order to avoid that, we outline a second approach that directly tries to solve the integer optimization problem without doing real relaxation. We begin by making two observations regarding the problem of estimating M: (1) in a realistic setting, a data point is more likely to be in very few clusters rather than most of them; and (2) for each data point i, estimating $M_i$ is a variant of the subset sum problem that uses a Bregman divergence to measure loss.

Taking the first observation a step further, if for a domain it is well understood (or desirable) that each data point can belong to at most $k_0$ clusters, for some $k_0$ possibly significantly smaller than k, then it may be computationally feasible to perform an explicit search over all the possibilities:

$$\binom{k}{1} + \binom{k}{2} + \cdots + \binom{k}{k_0} \le \Big( \frac{e k}{k_0} \Big)^{k_0},$$

where the last inequality holds if $k_0 \le k/2$. Note that for $k_0 = 1$, the overlapping clustering model essentially reduces to the regular mixture model. However, in general, such a brute-force search may only be feasible for very small values of $k_0$. Further, it is perhaps not easy to decide on such a $k_0$ a priori for a given problem. So, we focus on designing an efficient way of searching through the relevant possibilities using the second observation.

The subset sum problem is one of the hard knapsack problems [9] that tries to solve the following: given a set of k natural numbers $a_1, \ldots, a_k$ and a target number x, find a subset S of the numbers such that $\sum_{a_h \in S} a_h = x$. In a more realistic setting, one works with a set of real numbers, and tries to find a subset such that the sum over the subset is the closest possible to x. In our case, we measure closeness using a Bregman divergence and we have multiple target numbers to which we want the sum to be close. In particular, the problem is to find $M_i$ such that

$$M_i = \operatorname*{argmin}_{M_i \in \{0,1\}^k} d_\phi(X_i, M_i A) = \operatorname*{argmin}_{M_i \in \{0,1\}^k} \sum_{j=1}^{m} d_\phi\Big( X_{ij}, \sum_{h=1}^{k} M_i^h A_h^j \Big).$$

Thus, there are m target numbers $X_{i1}, \ldots, X_{im}$ (one per feature, so m = d here), and for each target number $X_{ij}$ the subset is to be chosen from $A_1^j, \ldots, A_k^j$. The total loss is the sum of the individual losses, and the problem is to find a single $M_i$ that minimizes the total loss.

Using the inherent bias of natural overlapping problems to put each point in a low number of clusters, and the similarities of our formulation to the subset sum problem, we propose the algorithm dynamicM (Algorithm 1). The algorithm is motivated by the Apriori class of algorithms in data mining and Shapley value computation in co-operative game theory [17]. It is important to note that no theoretical claim is being made regarding the optimality of dynamicM. The belief is that such an efficient algorithm will work well in practice, as the empirical evidence in Section 5 suggests.
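The search performed by dynamicM (Algorithm 1) can be sketched for a single data point as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `dynamic_m`, the callable `dist`, and the toy activity matrix are hypothetical, and each "thread" is simply run sequentially. One thread starts from each cluster and greedily turns on further clusters while the loss strictly decreases; the best result over all threads is kept unless the initial guess `m0` is at least as good.

```python
import numpy as np

def dynamic_m(x, A, m0, dist):
    """Greedy multi-start search for a boolean vector m with low dist(x, m @ A)."""
    k = A.shape[0]
    m0 = np.asarray(m0, dtype=float)
    best_m, best_loss = m0.copy(), dist(x, m0 @ A)     # fall back to the initial guess
    for h in range(k):                                 # one search thread per starting cluster
        m = np.zeros(k)
        m[h] = 1.0
        loss = dist(x, m @ A)
        while True:
            # Try turning on each cluster that is still off, one at a time.
            candidates = []
            for p in np.flatnonzero(m == 0):
                trial = m.copy()
                trial[p] = 1.0
                candidates.append((dist(x, trial @ A), p))
            if not candidates:
                break                                  # every cluster is already on
            new_loss, p = min(candidates)
            if new_loss >= loss:
                break                                  # no single addition lowers the loss
            m[p], loss = 1.0, new_loss
        if loss < best_loss:
            best_m, best_loss = m.copy(), loss
    return best_m.astype(int), best_loss

# Toy usage: x is exactly the sum of the activities of clusters 0 and 2.
A = np.eye(3)
x = A[0] + A[2]
squared = lambda a, b: float(np.sum((a - b) ** 2))
m, loss = dynamic_m(x, A, np.zeros(3), squared)
print(m, loss)
```

Each thread evaluates at most k candidates in each of at most k rounds, and there are k threads, which matches the cubic worst case in the number of loss evaluations.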
The algorithm dynamicM starts with 1 cluster turned "on" and greedily looks for the next best cluster to turn "on" so as to minimize the loss function. If such a cluster is found, then it has 2 clusters turned "on". Then, it repeats the process with the 2 clusters turned "on". In general, if h clusters are turned "on", dynamicM considers turning each one of the remaining (k − h) clusters "on", one at a time, and computes the loss corresponding to the membership vector with (h + 1) clusters turned "on". If, at any stage, turning on each one of the remaining (k − h) clusters increases the loss function, the search process is terminated. Otherwise, it picks the best (h + 1)th cluster to turn "on", and repeats the search for the next best among the remaining (k − h − 1) clusters.

Such a procedure will of course depend on the order in which clusters are considered to be turned "on". In particular, the choice of the first cluster to be turned "on" will partly determine which other clusters will get turned "on". The permutation dependency of the problem is somewhat similar in flavor to that of pay-off computation in a co-operative game. If h players are already in cooperation, the value-add of the (h + 1)th partner will depend on the permutation following which the first h were chosen. In order to design a fair pay-off strategy, one computes the average value-add of a player, better known as the Shapley value, over all permutations of forming co-operations [17]. Then, in theory, dynamicM should consider all possible permutations, keep turning clusters "on" following each permutation to figure out the lowest loss achieved along that particular permutation, and finally compute the best membership vector among all permutations. Clearly, such an approach would be infeasible in practice. Instead, dynamicM starts with k threads, one corresponding to each one of the k clusters turned "on". Then, in each thread, it performs the search outlined above for adding the next "on" cluster, until no such clusters are found, or all of them have been turned "on". The search is similar in flavor to the Apriori algorithms, or dynamic programming algorithms in general, where an optimal substructure property is assumed to hold so that the search for the best membership vector with (h + 1) clusters turned "on" starts from that with h clusters turned "on". Effectively, dynamicM searches over k permutations, each starting with a different cluster turned "on". The other entries of the permutation are obtained greedily on the fly. Since dynamicM runs k threads to achieve partial permutation independence, the best membership vector over all the threads is selected at the end. The algorithm has a worst case running time of $O(k^3)$ and is capable of running with any distance function.

Algorithm 1 dynamicM
Input: Row vector $[x]_{1 \times d}$, distance function d, activity matrix $[A]_{k \times d}$, initial guess $[m_0]_{1 \times k}$
Output: Boolean membership vector $[m]_{1 \times k}$ that gives a low value for d(x, mA)
Method:
  Initialize assignment vector $[m]_{1 \times k}$ to all zeros
  {Separate search thread for each initial cluster turned "on"}
  for h = 1 to k do
    Turn on only the h-th cluster, i.e., set m(h) = 1, m(i) = 0 if i ≠ h
    Set the h-th thread $t_h$ to be active
    Compute objective function $\ell_h = d(x, mA)$
    {Run over all possible sizes (> 1) of clusters turned "on"}
    for r = 2 to k do
      if thread $t_h$ is still active then
        Set $\ell_h^{old} = \ell_h$
        From the rest (k − r + 1) clusters, find best cluster to turn on
        if best cluster to turn on is p then
          Turn on the p-th cluster, i.e., m(p) = 1
          Compute objective function $\ell_h = d(x, mA)$
          if $\ell_h \ge \ell_h^{old}$ then
            Set $\ell_h = \ell_h^{old}$
            Set the h-th thread $t_h$ to be inactive
  Set $m = m_0$, $\ell = d(x, m_0 A)$
  Find the best m over all threads using $\ell_h$, h = 1, ..., k
  If best m over threads is worse than $m_0$, set $m = m_0$
  Output $[m]_{1 \times k}$

4.3 Updating A

We now focus on updating the activity matrix A. Since there are no restrictions on A as such, the update step is simpler than that for M. Note that the only constraint that such an update needs to satisfy is that MA stays in the domain of $\phi$. We give exact updates for particular choices of Bregman divergences: the squared loss and the I-divergence, since we use only these in Section 5.

In the case of the squared loss, since the domain of $\phi$ is $\mathbb{R}$, the problem $\min_A \| X - MA \|^2$ is just the standard least squares problem that can be exactly solved by $A = M^{\dagger} X$, where $M^{\dagger}$ is the pseudo-inverse of M, and is equal to $(M^T M)^{-1} M^T$ in case $M^T M$ is invertible. In the case of I-divergence or un-normalized relative entropy, the problem

$$\min_A d_I(X, MA) = \min_A \sum_{i,j} \Big[ X_{ij} \log \frac{X_{ij}}{(MA)_{ij}} - X_{ij} + (MA)_{ij} \Big]$$

has been studied as a non-negative matrix factorization technique [19]. The update for A, for given X and M, is multiplicative and is given by

$$A_h^j \leftarrow A_h^j \, \frac{\sum_i M_i^h \, X_i^j / (MA)_i^j}{\sum_i M_i^h}. \qquad (4)$$

In order to prevent a division by 0, it makes sense to use $\max((MA)_i^j, \epsilon)$ and $\max(\sum_i M_i^h, \epsilon)$ as the denominators for some small constant $\epsilon > 0$. With the above updates, the respective loss functions are provably non-increasing. In the case of a general Bregman divergence, the update steps need not necessarily be as simple and will be investigated as future work.

5. EXPERIMENTS

This section describes the details of our experiments that demonstrate the superior performance of MOC on real-world data sets, compared to the thresholded mixture model.

5.1 Methodology

We run experiments on three types of datasets: synthetic data, movie recommendation data, and text documents. For the high-dimensional movie and text data, we create subsets from the original datasets, which have the characteristic of having a small number of points compared to the dimensionality of the space.
The purpose of performing experiments on these subsets is to scale down the sizes of the datasets for computational reasons but at the same time not scale down the difficulty of the tasks, since clustering a small number of points in a high-dimensional space is a comparatively difficult task.

Synthetic data: In [23], apart from demonstrating their approach on gene microarray data and evaluating on standard biology databases, Segal et al. also showed results on microarray-like synthetic data with a clear ground truth, since the biology databases are generally believed to be lacking in coverage. The synthetic data was generated by sampling points from the overlapping clustering model and subsequently adding noise. We used a similar technique to create three synthetic datasets of different sizes: (1) small-synthetic: a dataset with n = 75, d = 30 and k = 10; (2) medium-synthetic: a dataset with n = 200, d = 50 and k = 30; and (3) large-synthetic: a dataset with n = 1000, d = 150 and k = 30. For the synthetic datasets we used squared Euclidean distance as the cluster distortion measure in the overlapping clustering algorithm, since Gaussian densities were used to generate the noise-free datasets.

Movie Recommendation data: The EachMovie dataset (footnote 1) has 5-point user ratings for the 74,424 movies in the collection. The corresponding movie genre information is extracted from the Internet Movie Database (IMDB) collection (footnote 2). If each genre is considered as a separate category or cluster, then this dataset also has naturally overlapping clusters since many movies are annotated in IMDB as belonging to multiple genres, e.g., Aliens belongs to 3 genre categories: action, horror and science fiction. We created 2 subsets from the EachMovie dataset: (1) movie-taa: 300 movies from the 3 genres thriller, action and adventure; and (2) movie-afc: 300 movies from the 3 genres animation, family, and comedy.
We clustered the movies based on the user recommendations to rediscover genres, based on the belief that similarity in recommendation profiles of movies gives an indication about whether they are in related genres. For this domain we use I-divergence with Laplace smoothing as the cluster distortion measure.

Text data: Experiments were also run on 3 text datasets derived from the 20-Newsgroups collection (footnote 3), which has 20,000 documents from 20 Usenet newsgroups. We processed the original newsgroup articles to recover the multiple newsgroup labels on each message posting. From the full dataset, a subset was created having 100 postings in each of the 20 newsgroups, from which the following three data subsets were created with varying levels of overlap in the topics: (1) news-similar-3; (2) news-related-3; and (3) news-different-3. Details of these datasets are outlined in [3]. The vector-space model of each data subset was created using standard text pre-processing methods [13], and each data subset has 300 points in a high-dimensional space (> 1000 words). In this case, I-divergence was again used as the Bregman divergence for overlapping clustering, with suitable Laplace smoothing.

We used an experimental methodology similar to the one used to demonstrate the effectiveness of the SBK model [23]. For each dataset, we initialized the overlapping clustering by running k-means clustering, where the additive inverse of the corresponding Bregman divergence was used as the similarity measure and the number of clusters was set by the number of underlying categories in the dataset. The resulting clustering was used to initialize our overlapping clustering algorithm. To evaluate the clustering results, precision, recall, and F-measure were calculated over pairs of points.
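One way to compute pairwise precision, recall, and F-measure for overlapping clusterings is sketched below. It assumes that both the predicted clustering and the ground truth are given as boolean membership matrices with one row per point, and that a pair counts as "together" when the two points share at least one cluster (or category). The function name `pairwise_prf` and the toy matrices are hypothetical, and this reading of the measures is an assumption rather than the authors' evaluation code.

```python
import numpy as np

def pairwise_prf(pred, truth):
    """Pairwise precision, recall and F-measure for boolean membership matrices."""
    pred = np.asarray(pred, dtype=int)
    truth = np.asarray(truth, dtype=int)
    # Entry (i, j) is True if points i and j share at least one cluster.
    pred_pairs = (pred @ pred.T) > 0
    true_pairs = (truth @ truth.T) > 0
    upper = np.triu(np.ones(pred_pairs.shape, dtype=bool), k=1)   # count each unordered pair once
    tp = np.sum(pred_pairs & true_pairs & upper)
    n_pred = np.sum(pred_pairs & upper)
    n_true = np.sum(true_pairs & upper)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return float(precision), float(recall), float(f_measure)

# Toy usage: point 1 truly belongs to both categories, but the prediction misses one.
truth = [[1, 0], [1, 1], [0, 1]]
pred = [[1, 0], [1, 0], [0, 1]]
print(pairwise_prf(pred, truth))
```

In the toy example the single predicted pair is correct, but one of the two true pairs is missed, so precision is high and recall is lower.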
For each pair of points that share at least one cluster in the overlapping clustering results, these measures try to estimate whether the prediction of this pair as being in the same cluster was correct with respect to the underlying true categories in the data. Precision is calculated as the fraction of pairs correctly put in the same cluster, recall is the fraction of actual pairs that were identified, and F-measure is the harmonic mean of precision and recall.

Footnote 1: http://research.compaq.com/SRC/eachmovie

5.2 Results

Table 1 presents the results of MOC versus the standard mixture model for the datasets described in Section 5.1. Each reported result is an average over ten trials. For the synthetic data sets, we compared our approach to thresholded Gaussian mixture models; for the text and movie data sets, the baselines were thresholded multinomial mixture models. Table 1 shows that for all domains, even though the thresholded mixture model has slightly better precision in most cases, it has significantly worse recall: therefore MOC consistently outperforms the thresholded mixture model in terms of overall F-measure, by a large margin in most cases. Table 1 also shows that the performance of MOC improves empirically as the ratio of the data set size to the number of processes increases.

Table 2 compares the performance of using the dynamicM algorithm versus the bounded least squares (BLS) algorithm followed by local search, in the M estimation step in MOC. BLS/search gets better results on precision, which is expected since BLS is the optimal solution for the real relaxation of the M estimation problem for the Gaussian model. However, dynamicM outperforms BLS/search on the overall F-measure. Moreover, BLS is only applicable for squared Euclidean distance, whereas dynamicM can be applied for M estimation with any distance function.
Detailed inspection of the results revealed that MOC gets overlapping clusterings that are closer to the ground truths for the text and the movie data. For example, for movie-afc, the average number of clusters a movie is assigned to in the ground truth is 1.19, whereas MOC clustering has an average of 1.13 clusters per movie. The thresholded mixture model got posterior probability values very close to 0 or 1, as is very common in mixture model estimation for high-dimensional data. As a result, there was almost no cluster overlap for various choices of the threshold value, and points were assigned to 1.00 clusters on average in the thresholded mixture models. MOC was also able to recover the correct underlying multiple genres in many cases, e.g., the movie Toy Story in the movie-afc dataset belongs to all three genres of animation, family and comedy in this dataset, and MOC correctly put it in all 3 clusters.

The main purpose of the experiments in this section is to illustrate that the overlapping clustering model can be generalized to work for exponential models beyond Gaussians. We have not provided results on the biological datasets in this section due to lack of space. However, note that if we run our algorithm on the biological data using BLS/search and a Gaussian model, then we will get exactly the same results as the SBK model [23].

Footnote 2: http://www.imdb.com
Footnote 3: http://www.ai.mit.edu/people/jrennie/20Newsgroups

6. RELATED WORK

Possibility theory, developed in the fuzzy logic community, allows an object to belong to multiple sets in the sense of having high membership values to more than one set [5]. In particular, unlike probabilities, the sum of membership values may be more than one [22]. One of the earlier works on overlapping clustering techniques with the possibility of not clustering all points was presented in [20]. Most recent work in overlapping clustering has been primarily driven by the needs of microarray analysis.
Several methods for obtaining overlapping gene clusters, including gene shaving [16] and mean square residue bi-clustering [8], have been proposed. Before the PRM-based SBK model was proposed, one of the most notable efforts was the plaid model [18], wherein the gene-expression matrix was modeled as a superposition of several layers of plaids (subsets of genes and conditions).

Table 1: Comparison of results of MOC and thresholded mixture models on all datasets

Data                F-measure                 Precision                 Recall
                    MOC          Mixture      MOC          Mixture      MOC          Mixture
small-synthetic     0.64 ± 0.12  0.36 ± 0.08  0.83 ± 0.07  0.80 ± 0.07  0.53 ± 0.14  0.24 ± 0.07
medium-synthetic    0.71 ± 0.06  0.24 ± 0.01  0.73 ± 0.05  0.60 ± 0.03  0.70 ± 0.09  0.15 ± 0.01
large-synthetic     0.87 ± 0.04  0.33 ± 0.01  0.85 ± 0.06  0.87 ± 0.04  0.89 ± 0.05  0.20 ± 0.01
movie-taa           0.62 ± 0.03  0.50 ± 0.04  0.55 ± 0.01  0.56 ± 0.01  0.71 ± 0.07  0.46 ± 0.08
movie-afc           0.76 ± 0.03  0.61 ± 0.07  0.80 ± 0.01  0.81 ± 0.02  0.72 ± 0.06  0.50 ± 0.09
news-different-3    0.45 ± 0.01  0.41 ± 0.05  0.34 ± 0.01  0.40 ± 0.05  0.68 ± 0.05  0.41 ± 0.06
news-related-3      0.54 ± 0.02  0.39 ± 0.02  0.42 ± 0.01  0.44 ± 0.02  0.76 ± 0.08  0.35 ± 0.01
news-similar-3      0.35 ± 0.02  0.28 ± 0.01  0.23 ± 0.01  0.24 ± 0.01  0.69 ± 0.06  0.34 ± 0.01

Table 2: Results: dynamicM vs Bounded Least Squares (with search) for synthetic data

Data                F-measure                 Precision                 Recall
                    dynamicM     BLS/search   dynamicM     BLS/search   dynamicM     BLS/search
small-synthetic     0.64 ± 0.12  0.55 ± 0.20  0.83 ± 0.07  0.98 ± 0.03  0.52 ± 0.14  0.41 ± 0.19
medium-synthetic    0.71 ± 0.06  0.65 ± 0.05  0.73 ± 0.05  0.91 ± 0.06  0.70 ± 0.09  0.51 ± 0.06
large-synthetic     0.87 ± 0.04  0.87 ± 0.02  0.85 ± 0.06  0.92 ± 0.02  0.89 ± 0.05  0.83 ± 0.04

Bregman divergences were conceived and have been extensively studied in the convex optimization community [7]. Over the past few years, they have been successfully applied to a variety of machine learning issues, for example to unify seemingly disparate concepts of boosting and logistic regression [11]. More recently, they have been studied in the context of clustering [2].
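As a concrete illustration of a Bregman divergence, the I-divergence used for the movie and text experiments can be written directly and also recovered from the generic Bregman form with the generating function φ(x) = Σ x_j log x_j. The sketch below is illustrative only: the helper names are hypothetical, and the smoothing rule (adding a constant `alpha` to every count) is an assumption, since the text says only that "suitable Laplace smoothing" was applied.

```python
import math


def laplace_smooth(counts, alpha=1.0):
    """Add alpha to every count so no entry is zero (assumed smoothing rule)."""
    return [c + alpha for c in counts]


def i_divergence(x, y):
    """I-divergence: sum_j x_j*log(x_j/y_j) - x_j + y_j, for positive entries."""
    return sum(a * math.log(a / b) - a + b for a, b in zip(x, y))


def bregman(phi, grad_phi, x, y):
    """Generic Bregman divergence: phi(x) - phi(y) - <x - y, grad phi(y)>."""
    g = grad_phi(y)
    return phi(x) - phi(y) - sum((a - b) * gi for a, b, gi in zip(x, y, g))


def phi(v):
    """Generating function of the I-divergence: sum_j v_j*log(v_j)."""
    return sum(a * math.log(a) for a in v)


def grad_phi(v):
    """Gradient of phi: log(v_j) + 1 in each coordinate."""
    return [math.log(a) + 1.0 for a in v]


def nearest_centroid(x, centroids):
    """Index of the centroid with the smallest divergence from x.

    Minimizing the divergence is the same as maximizing its additive
    inverse, the similarity used for the k-means initialization.
    """
    return min(range(len(centroids)), key=lambda k: i_divergence(x, centroids[k]))
```

Smoothing matters here because the I-divergence is undefined when an entry of its second argument is zero, which happens often with sparse word or rating counts.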
Our formulation has some similarities to generalized linear models (GLMs) [21, 10]. However, there are a few very important differences. In GLMs [21], a multidimensional regression problem of the form $d_\phi(Y, f(BZ))$ is solved, where Z is the (known) input variable, Y is the (known) response and f is the so-called canonical link function derived from $\phi$. The problem is to find B, and it can be solved using iteratively re-weighted least squares (IRLS) in the general case. Extension to the case where both B and Z are unknown and one alternates between updating B and Z has been studied by Collins et al. [10] while extending PCA to the exponential families. Although several extensions [15] of the basic GLM model to matrix factorization have been studied, except for the well known instance of non-negative matrix factorization (NMF) using I-divergence [19], all formulations use the canonical link function and hence are different from our formulation. Moreover, our model constrains M to be a binary matrix, which is never a standard constraint in GLMs.

7. CONCLUSIONS

In contrast to traditional partitional clustering, overlapping clustering allows items to belong to multiple clusters. In several important applications in bioinformatics, text management, and other areas, overlapping clustering provides a more natural way to discover interesting and useful classes in data. This paper has introduced a broad generative model for overlapping clustering, MOC, based on generalizing the SBK model presented in [23]. It has also provided a generic alternating minimization algorithm for efficiently and effectively fitting this model to empirical data. Finally, we have presented experimental results on both artificial data and real newsgroup and movie data, which show that MOC is more general and effective than an alternative naive method based on thresholding the results of a traditional mixture model.

A few issues regarding the practical applicability of MOC need further investigation. It may often be desirable to use different exponential family models for different subsets of features. MOC allows such modeling in theory, as long as the total divergence is a convex combination of the individual ones. Further, MOC can potentially benefit from semi-supervision [3] as well as be extended to a co-clustering framework [1].

Acknowledgements: The research was supported in part by NSF grants IIS 0325116, IIS 0307792, and an IBM PhD fellowship.

8. REFERENCES

[1] A. Banerjee, I. Dhillon, J. Ghosh, S. Merugu, and D. Modha. A generalized maximum entropy approach to Bregman co-clustering and matrix approximation. In KDD, 2004.
[2] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Clustering with Bregman divergences. In SDM, 2004.
[3] S. Basu, M. Bilenko, and R. J. Mooney. A probabilistic framework for semi-supervised clustering. In KDD, 2004.
[4] A. Battle, E. Segal, and D. Koller. Probabilistic discovery of overlapping cellular processes and their regulation using gene expression data. In RECOMB, 2004.
[5] J. C. Bezdek and S. K. Pal. Fuzzy Models for Pattern Recognition. IEEE Press, 1992.
[6] A. Bjorck. Numerical Methods for Least Squares Problems. Society for Industrial & Applied Math (SIAM), 1996.
[7] Y. Censor and S. Zenios. Parallel Optimization: Theory, Algorithms, and Applications. Oxford University Press, 1998.
[8] Y. Cheng and G. M. Church. Biclustering of expression data. In ISMB, 2000.
[9] V. Chvátal. Hard knapsack problems. Operations Research, 28(6):1402-1412, 1980.
[10] M. Collins, S. Dasgupta, and R. Schapire. A generalization of principal component analysis to the exponential family. In NIPS, 2001.
[11] M. Collins, R. E. Schapire, and Y. Singer. Logistic regression, AdaBoost and Bregman distances. In COLT, 2000.
[12] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39:1-38, 1977.
[13] I. S. Dhillon and D. S. Modha. Concept decompositions for large sparse text data using clustering. Machine Learning, 42:143-175, 2001.
[14] N. Friedman, L. Getoor, D. Koller, and A. Pfeffer. Learning probabilistic relational models. In IJCAI, 1999.
[15] G. Gordon. Generalized² linear² models. In NIPS, 2001.
[16] T. Hastie, R. Tibshirani, M. B. Eisen, A. Alizadeh, R. Levy, L. Staudt, W. C. Chan, D. Botstein, and P. Brown. Gene shaving as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology, 2000.
[17] J. Kleinberg, C. H. Papadimitriou, and P. Raghavan. On the value of private information. In Proc. 8th Conf. on Theoretical Aspects of Rationality and Knowledge, 2001.
[18] L. Lazzeroni and A. B. Owen. Plaid models for gene expression data. Statistica Sinica, 12(1):61-86, 2002.
[19] D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In NIPS, 2001.
[20] W. T. McCormick, P. J. Schweitzer, and T. W. White. Problem decomposition and data reorganization by a clustering technique. Operations Research, 20:993-1009, 1972.
[21] P. McCullagh and J. A. Nelder. Generalized Linear Models. Chapman & Hall/CRC, 1989.
[22] M. Sahami, M. Hearst, and E. Saund. Applying the Multiple Cause Mixture Model to Text Categorization. In ICML, 1996.
[23] E. Segal, A. Battle, and D. Koller. Decomposing gene expression into cellular processes. In PSB, 2003.