CS 124 / LINGUIST 180 - Winter 2010
Homework 4: Person Name ('Named Entity') Classification
Due: Thurs Feb 3, 9:30am

For this assignment, you will build a maximum entropy Markov model (MEMM) for identifying person names in newswire text. We have provided all of the machinery for training and testing your MEMM, but we have left the feature set woefully inadequate. Your job is to modify the feature-generation code so that it produces a much more sensible, complete, and higher-performing set of features.

This assignment comes with significantly more starter code than your previous (or future) assignments. Because of this, we highly recommend that you use our starter code and program in Java. However, if you really want to reimplement an entire MEMM in another language, you are allowed to do so. In that case, please email the staff (cs124-win0910-staff@lists.stanford.edu) and we can give you some pointers on getting started.

Adding Features to the Code

The file you will be modifying is FeatureFactory.java. The class currently looks something like this:

    import java.util.ArrayList;
    import java.util.List;

    public class FeatureFactory {
        // Builds the features for the word at `position`, given the previous label.
        public List computeFeatures(List<String> words, String previousLabel, int position) {
            List features = new ArrayList();
            // ADD YOUR FEATURES HERE
            return features;
        }
    }

You will create the features for the word at the given position, with the given previous label. You may condition on any word in the sequence (and its relative position), not just the current word, because all of the words are observed. You may not condition on any labels other than the previous one.

Each feature you build will be a binary function. The features are stored in a list because we are using a sparse representation. Features that have a value of true (or 1.0) will be present in
