last modified Mar 30

EECS 281 Project 5: More clustering!

Please note that you are responsible for any "sticky" phorum threads, as they are an extension of this document and may contain corrections or clarifications.

Due

This project is due on April 3, 2008 at 3:00PM. You also have 4 late days per semester, but a maximum of 2 may be used on this project. Therefore the autograder closes completely on April 5, 2008 at 3:00PM.

Introduction

In this project you will implement different algorithms for clustering. You will read in files and compute a similarity matrix as in project 4. Then you will cluster these files into groups using a couple of different methods. The general objective of these algorithms is to obtain K clusters such that the diameter of the largest cluster is minimized. The diameter of a group is the greatest distance between two files in that group. Intuitively, we want to grow very small clusters and avoid bloating clusters that have already gotten large.

Program Behavior

Your Makefile should produce an executable called p5exe which takes exactly 4 arguments.
./p5exe N K A dirname

Most of these arguments are the same as in project 4. N is a frequency threshold: you should effectively ignore words appearing at least this many times in a particular file. K is the number of clusters that your algorithm should create. A is either "1", "2", or "3", and dictates which clustering algorithm you should use. The final argument is the name of a single directory, which can be a relative or absolute path. You are guaranteed that the arguments we pass your program are valid.
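The cluster diameter defined in the introduction (the greatest pairwise distance within a group) can be sketched as below. This is only an illustration, not part of the handout: the function name and the assumption that you have a symmetric distance matrix (derived from the project 4 similarity computation) are hypothetical.

```cpp
#include <algorithm>
#include <vector>

// Sketch: the diameter of a cluster is the greatest pairwise distance
// between any two files in it. `dist` is assumed to be a symmetric
// matrix of distances between files (names here are illustrative).
double clusterDiameter(const std::vector<std::vector<double> >& dist,
                       const std::vector<int>& cluster)
{
    double diameter = 0.0;
    for (size_t i = 0; i < cluster.size(); ++i)
        for (size_t j = i + 1; j < cluster.size(); ++j)
            diameter = std::max(diameter, dist[cluster[i]][cluster[j]]);
    return diameter;  // 0.0 for clusters of size 0 or 1
}
```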
GIVEN You will be given the test case directories and corresponding similarity matrices. This information will be posted on CTools to ensure that your parsing and similarity calculations from project 4 are correct.
OUTPUT Your output for each algorithm will be in the same format as in project 4. That is, you should print "Similarity matrix:" on its own line, then the similarity matrix, then two newlines, then "Clustered similarity matrix:" on its own line, then the clustered similarity matrix. You may use the instructor's project 4 solution for proper formatting.
1. GREEDY ALGORITHM

First put each file in its own cluster. Then sort the files lexicographically and consider each file in turn. For each file, consider merging its cluster with every other cluster, and merge the cluster of this file with the cluster which results in the smallest maximum cluster diameter. Consider the other clusters in lexicographic order, so that if multiple potential merges are equally good, you choose the one that comes first lexicographically. Continue this process until there are K clusters (so, with F files, you perform F - K merges; note that F is the number of files, not the frequency threshold N).
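One plausible reading of the greedy pass is sketched below. Everything here is illustrative, not the instructor's solution: the function names, the assumption that file indices already reflect lexicographic order, and the interpretation that "lexicographic order" for tie-breaking means the earliest cluster encountered are all assumptions you should check against the handout and phorum.

```cpp
#include <algorithm>
#include <vector>

typedef std::vector<int> Cluster;
typedef std::vector<std::vector<double> > Matrix;

// Greatest pairwise distance within any cluster after tentatively
// merging clusters a and b. `clusters` is taken by value so the
// tentative merge does not disturb the caller's state.
static double diameterAfterMerge(const Matrix& dist,
                                 std::vector<Cluster> clusters,
                                 size_t a, size_t b)
{
    clusters[a].insert(clusters[a].end(),
                       clusters[b].begin(), clusters[b].end());
    clusters.erase(clusters.begin() + b);
    double worst = 0.0;
    for (size_t c = 0; c < clusters.size(); ++c)
        for (size_t i = 0; i < clusters[c].size(); ++i)
            for (size_t j = i + 1; j < clusters[c].size(); ++j)
                worst = std::max(worst, dist[clusters[c][i]][clusters[c][j]]);
    return worst;
}

// Sketch of the greedy pass: cycle through files (assumed already in
// lexicographic order as indices 0..F-1), each time merging the current
// file's cluster with the partner that keeps the maximum diameter
// smallest, ties going to the earliest candidate, until k clusters remain.
std::vector<Cluster> greedyCluster(const Matrix& dist, size_t k)
{
    std::vector<Cluster> clusters(dist.size());
    for (size_t f = 0; f < dist.size(); ++f)
        clusters[f].push_back(static_cast<int>(f));  // one file per cluster

    size_t f = 0;
    while (clusters.size() > k) {
        size_t mine = 0;  // locate the cluster holding file f
        for (size_t c = 0; c < clusters.size(); ++c)
            if (std::find(clusters[c].begin(), clusters[c].end(),
                          static_cast<int>(f)) != clusters[c].end())
                mine = c;
        size_t best = (mine == 0) ? 1 : 0;
        double bestDiam = diameterAfterMerge(dist, clusters, mine, best);
        for (size_t c = 0; c < clusters.size(); ++c) {
            if (c == mine || c == best) continue;
            double d = diameterAfterMerge(dist, clusters, mine, c);
            if (d < bestDiam) { bestDiam = d; best = c; }  // strict <: ties keep earlier
        }
        clusters[mine].insert(clusters[mine].end(),
                              clusters[best].begin(), clusters[best].end());
        clusters.erase(clusters.begin() + best);
        f = (f + 1) % dist.size();
    }
    return clusters;
}
```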
2. SIMULATED ANNEALING

This heuristic starts with the greedy solution above and attempts to improve the maximum cluster diameter by moving files between clusters. Use the following annealing schedule:
1  start with the greedy solution
2  do the following 1000 times
3      pick a random cluster and then a random file to potentially reassign to it
4      if the proposed reassignment increases the max diameter
5          T = max diameter of clusters / 5
6          I = amount of the increase
7          accept the reassignment with probability e^(-I/T) by comparing
           e^(-I/T) with a random number between 0 and 100
8      else if the proposed reassignment decreases the max diameter or leaves it unchanged
9          accept the reassignment

If at line 3 the potential reassignment results in an empty cluster, or the random file is already in the random cluster, continue to the next iteration of the loop immediately. Use rand() for your random number generator, and be sure to use it to pick a random cluster before picking a random file. You should use "rand() % num_files" to get a random number between 0 and num_files - 1. Do not call srand(), so that your rand() sequence is seeded the same as the autograder's. To "accept the reassignment with probability e^(-I/T)" you should compute e^(-I/T) as a percentage and then compare it (using the < operator) with a number randomly generated between 0 and 100, inclusive. This means that for each iteration of the loop, you should be calling rand() twice, and then a third time if line 7 is reached.
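The acceptance test at line 7 can be sketched as below. This is only an illustration of the arithmetic, with hypothetical names; in particular, the direction of the < comparison between the percentage and the random roll is one plausible reading of the spec, and you should confirm it against the autograder.

```cpp
#include <cmath>
#include <cstdlib>

// e^(-I/T) expressed as a percentage, with T = max diameter / 5 as the
// annealing schedule specifies. Names here are illustrative.
double acceptProbability(double maxDiam, double increase)
{
    double T = maxDiam / 5.0;
    return std::exp(-increase / T) * 100.0;
}

// Sketch of line 7: roll a number between 0 and 100 inclusive and
// compare it against the percentage with the < operator. Whether the
// roll or the percentage goes on the left of < is an assumption;
// match whichever reading the autograder expects.
bool acceptWorseMove(double maxDiam, double increase)
{
    int roll = std::rand() % 101;  // 0..100 inclusive
    return roll < acceptProbability(maxDiam, increase);
}
```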
3. BRANCH AND BOUND

This produces the optimal solution by creating a decision tree and then pruning it. You may want to do some research on branch and bound algorithms if you are unfamiliar with them. For this project, sort the files lexicographically and use the greedy algorithm's solution as an initial bound. Then place the first file in the first cluster. From here, there are two choices: either place the second file in the first cluster or start a new cluster. For the third file, there are three choices for each of the two previous choices (assuming K >= 3), making a total of 6 possibilities. Continue in this manner. Use the best solution so far in the bounding step, eliminating branches of the decision tree that cannot result in the optimal solution. If multiple solutions have the same minimum max diameter, treat the first one found as the best solution.

Grading

The grading for this project is broken down as follows:

Greedy algorithm: 20 points
Simulated annealing: 40 points
Branch and bound: 40 points

Even though this project is not explicitly graded on style, your code should be readable. You should make proper use of comments, indentation, the const keyword, and the Standard library. Also be sure to separate your code into different functions (avoiding a do-everything main function) and to make use of header files for function declarations. If you have questions about any of these concepts, please ask on the phorum or in office hours. Your code should also be reasonably efficient: for example, using three nested loops where one loop would do will be frowned upon. We reserve the right to take off points if your code does not conform to any of these standards.

Submitting

You will be submitting all of your code (.h and .cpp files) and your Makefile (we should be able to run "make" to produce the executable).

Autograder

First of all, be sure that your code works on a Unix/Linux workstation using g++ version 4.1.2.
Then put all the required files (as explained in the "Submitting" section) into their own directory and cd to that directory. Make sure that no .o files or executables are in this directory (running "make clean" should achieve this). Then make a gzipped tar archive by running the following command at the command prompt.
tar -zcf submission.tar.gz *

To submit to the autograder, visit http://grader5.eecs.umich.edu and sign in. Then select the correct project from the drop-down and click "View". From there you can upload your gzipped tarball. The autograder will send you an email with the results; if it does not arrive, check your spam box. Depending on the load on the autograder, it may take a while to get your results. You may submit 3 times per day for instant-ish feedback; any other submissions will be graded, but you will not receive a grade report. We will use your most recent submission as your final grade. The autograder will only tell you the results of the test cases. The other parts are graded by hand, so you will not know your entire grade until those parts are evaluated. To make that very clear: the autograder does not assign grades. If you have problems submitting, post on the phorum. If it is a technical error (if something really strange happens), email Tom at firstname.lastname@example.org.
This note was uploaded on 04/11/2008 for the course EECS 281 taught by Professor Jag during the Winter '08 term at University of Michigan.