ps42002

ps42002 - Harvard-MIT Division of Health Sciences and...

Problem Set 4

Question 0: Project Teams & Times (5 points)

List the name(s) of your partner(s) for the final project: _______________________
____________________________________________________________________

Oral presentations for final projects will take place at the following times and locations. Please indicate below whether your team can or cannot present during each timeslot. If your team cannot present at a given time, please also list the reason(s).

Lec#      Time           Can attend    Cannot attend (+ reason)
Lec 12    12-2pm
Lec 12    5:30-7:30pm
Lec 13    12-2pm
Lec 13    5:30-7:30pm
Lec 14    12-2pm
Lec 14    5:30-7:30pm

Problem 1: Clustering (33 points)

Microarray and DNA chip technologies have made it possible to study the expression patterns of thousands of genes simultaneously. The amount of data coming out of these efforts is overwhelming. A powerful strategy for analyzing microarray data is to cluster the expression profiles, either by gene or by condition. Golub et al. (Science, 286, 531-7; pdf, supplemental website) clustered expression data from different types of leukemia using non-hierarchical self-organizing maps (SOMs). Now you will write a Perl program to cluster the same data using an alternative, hierarchical clustering algorithm.

I) Briefly describe the two major goals of this paper. (2 pts)

II) Describe the major steps of the SOM training algorithm without using code. (4 pts)

III) The authors used the Affymetrix GeneChip, which measures RNA expression levels very differently from ratio-based cDNA microarrays. Data from several different GeneChip microarrays should be normalized before being compared to each other. Describe why normalization is needed, and how the authors normalized their data. (4 pts)
IV) A brief summary of the hierarchical clustering algorithm that you are asked to implement can be found here. Your assignment is to cluster the normalized expression data of 50 predictor genes from Golub et al. using Euclidean distance with both single-linkage and complete-linkage clustering. (11 pts)

  a. Partial credit is given for the following tasks:
     i.   Reading input data (2 pts)
     ii.  Constructing distance matrix (2 pts)
     iii. Updating distance matrix (3 pts)
     iv.  Output clustering result (4 pts)
  b. Use the sample dataset of 5 samples and its clustering result to verify your code. Print the group members and distance matrix at each iteration.
  c. Please attach your well-annotated Perl code to the end of your problem set. You may use this template (ps4-1-template.pl) for your program.

Sample dataset of 5 samples:
http://www.courses.fas.harvard.edu/~bphys101/problemsets/ps4-1-sample.txt

Clustering result of sample dataset using complete-linkage Euclidean distance:
http://www.courses.fas.harvard.edu/~bphys101/problemsets/ps4-1-result.txt

Normalized training dataset:
http://www.courses.fas.harvard.edu/~bphys101/problemsets/ps4-1-train.txt

Clustering result of normalized training dataset using complete/single-linkage Euclidean
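
The tasks listed in part IV a (constructing a distance matrix, merging the closest pair of groups, and updating the matrix) can be illustrated with a minimal sketch. It is written in Python rather than the Perl the assignment calls for, and it is not the course template or reference solution. The function names (euclidean, build_distance_matrix, cluster), the in-memory dictionary of profiles, and the toy 1-D data are all assumptions made for illustration; the format of the course data files is not shown here, so no file reading is attempted.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_distance_matrix(profiles):
    """Pairwise distances, keyed by an unordered pair of group labels."""
    names = list(profiles)
    dist = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            dist[frozenset((a, b))] = euclidean(profiles[a], profiles[b])
    return dist


def cluster(profiles, linkage="complete", verbose=False):
    """Agglomerative clustering; returns a list of (members, merge distance).

    linkage="complete" keeps the larger of the two old distances when two
    groups are merged; linkage="single" keeps the smaller one.
    """
    combine = max if linkage == "complete" else min
    # Each group label maps to the tuple of sample names it contains.
    members = {name: (name,) for name in profiles}
    dist = build_distance_matrix(profiles)
    merges = []
    while len(members) > 1:
        # Find the closest pair of groups and remove its entry.
        pair = min(dist, key=dist.get)
        a, b = sorted(pair)
        d = dist.pop(pair)
        merged = "(" + a + "," + b + ")"
        new_members = members.pop(a) + members.pop(b)
        # Update step: distance from the merged group to every other group.
        for other in members:
            da = dist.pop(frozenset((a, other)))
            db = dist.pop(frozenset((b, other)))
            dist[frozenset((merged, other))] = combine(da, db)
        members[merged] = new_members
        merges.append((new_members, d))
        if verbose:
            # Report the groups and the remaining distances at this iteration.
            print("merged", new_members, "at distance", d)
            for key, value in dist.items():
                print("   ", sorted(key), value)
    return merges


# Hypothetical toy data: four samples with one expression value each.
toy = {"A": [0.0], "B": [1.0], "C": [3.0], "D": [7.0]}
for members_, distance in cluster(toy, "complete"):
    print(members_, distance)
```

On the toy data the two linkage rules first diverge at the second merge: complete linkage joins C to the group (A, B) at the larger of the two old distances, while single linkage joins it at the smaller. Passing verbose=True prints the group members and remaining distances at each iteration, in the spirit of part IV b.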

This note was uploaded on 01/24/2010 for the course HST.508, taught by Professor George Church during the Fall '02 term at MIT.
