STAT530lecture1Introduction - STAT 530 STAT Introduction to...


STAT 530: Introduction to Computational Biology
Ping Ma

Outline
• Course information
• Course structure
• Literature retrieval and mining

STAT 530
• Lecture: MW 10:00-10:50pm
• Instructor: Ping Ma
  – pingma@uiuc.edu
  – Office hours: Monday 2:00-4:00pm
  – 116 D Illini Hall

STAT 530
• Course webpage: http://www.stat.uiuc.edu/~pingma/stat530.html
• Material
  – Textbooks are recommended but not required
  – The course website will provide web resources and suggested papers
• Programming languages
  – We prefer R and Perl
  – You can use whatever works for you

Textbook
• Recommended textbooks
  – Speed, T.P. (2003) Statistical Analysis of Gene Expression Microarray Data. Chapman and Hall.
  – Durbin, R., Eddy, S., Krogh, A., and Mitchison, G. (1998) Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press.
• Main references
  – Wilkinson, D. J. (2006) Stochastic Modelling for Systems Biology. Chapman and Hall.
  – Brown (2006) Genomes (3rd Edition). Garland Science.

Grading Policy
• Homework: 60%
  – Six sets of homework
  – Each set consists of three components: statistical reasoning, statistical programming, and biological interpretation of results.

Homework
• You are strongly encouraged to form a team (no more than 3 persons) to work together.
• Each team submits only one solution.
• Each solution consists of the solution itself and an author contribution statement.
• Sample author contribution: N.S. and H.Z. designed research; N.S., R.J.C., and H.Z. performed research; N.S. analyzed data; and N.S. and H.Z. wrote the paper.

Grading Policy
• Final project: 40%
  – You are encouraged to find a project on your own.
  – If you cannot find one, I have some projects for you to choose from.
  – Submit your project pre-summary by Oct 27th.
  – A final project paper (no fewer than 15 pages, not including tables and figures) must be submitted.
  – One team member will present the project.

STAT 530
• This is NOT a “recipe” course
• This is NOT a “button-clicking” course
• It involves serious statistical reasoning

Literature Retrieval
• PubMed
  – Comprehensive biomedical literature database
  – Has simple and useful tools
  – Compares papers to produce Related Articles
• Faculty of 1000
  – Quickly read the most important, relevant, and up-to-date papers in my area
• Google Scholar
  – Quick solution for citations and downloads

PubMed
• PubMed (NCBI, NLM, NIH)
  – Biomedical literature database
  – Grew out of MEDLINE
  – >12M citations in MEDLINE since the 1960s
  – 400K added annually
  – Entrez retrieval system
• PubMed entry
  – Citation (paper) published
  – Citation indexed in PubMed with a PubMed ID assigned
  – Citation indexed with MeSH (Medical Subject Headings) terms

Faculty of 1000
• Highlights papers by scientific merit instead of by journal
• Papers are rated by selected leading researchers
  – F1000 factors: recommended, must read, exceptional; the factor increases with more ratings
  – Comments: 1 sentence on significance, and 2-3 sentences summarizing the paper
  – Links to PubMed, and also lists related F1000 papers

Google Scholar
• Quick way to check the number of times a paper is cited
• Directly download from UIUC
  – Set the library in Google Scholar preferences
• Quick way to download PDFs not subscribed to by UIUC

Literature Mining Terms
• Corpus: Collection of documents
• Term frequency: Number of times a word appears in a document
• Document frequency: Number of documents a word appears in
• Collection frequency: Total number of times a word appears in a corpus
• Stop words: Words in the corpus that contribute little to meaning, e.g. to, is, an
• Stemming: Grouping together different variations of the same word, e.g. activate vs. activated vs. activating

Documents as Vectors
”Our analysis includes comparison of amino acid environments with random control environments as well as with each of the other amino acid environments.”

  Word:   acid  amino  analysis  comparison  control  environments  […]  Our
  Count:  2     2      1         1           1        3             […]  1

• A document is summarized as a vector of word counts.
• Each dimension contains the number of times a word appears.
• The similarity between two documents can be calculated by comparing their vectors.

Comparing Two Documents
• The similarity of two documents can be measured by the correlation of their vectors.
• Correlation measures the strength of the linear relationship between two random variables.

  a = c(1, 3, 5, 1, 8, 20, 0, 0, 0, 3, 1)
  b = c(2, 3, 4, 0, 10, 25, 1, 0, 2, 4, 3)
  c = c(2, 0, 1, 10, 2, 4, 7, 1, 5, 0, 8)
  cor(a, b)   # 0.985615: correlated
  cor(b, c)   # -0.110328: not correlated

Term Weighting Considerations
• Give different terms different weights
• Global weight
  – Document frequency: the fewer documents a term appears in, the more weight: log(N / df)
• Local weight
  – Term frequency: the more frequent a term is in a document, the more weight: 1 + log(tf)
  – Document length: less weight for longer documents

Related Articles
• Related Articles
  – Similarity between two documents: Σ over all terms (local wt1 × local wt2 × global wt)
  – Related articles are pre-computed for each citation
  – Rank ordered by relevance
• How to evaluate
  – Tradeoff between precision and recall
  – Precision = # relevant hits in hitlist / # hits in hitlist
  – Recall = # relevant hits in hitlist / # relevant documents in the corpus

...
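The definitions on the Literature Mining Terms slide can be made concrete with a toy corpus. The following is a minimal Python sketch, not course code: the three-sentence corpus is invented for illustration, the stop-word set contains only the slide's three examples, and tokenization by whitespace is an assumption. Stemming is not applied, so "activate" and "activated" are counted as different terms.

```python
from collections import Counter

# Only the slide's three example stop words; a real list would be much longer.
STOP_WORDS = {"to", "is", "an"}

# Hypothetical toy corpus, invented for this illustration.
corpus = [
    "an enzyme is activated to bind",
    "the enzyme is an enzyme",
    "to bind is to activate",
]

# Tokenize on whitespace (an assumption) and drop stop words.
docs = [[w for w in text.split() if w not in STOP_WORDS] for text in corpus]

# Term frequency: one counter per document.
term_freq = [Counter(doc) for doc in docs]

# Document frequency: count each word at most once per document.
doc_freq = Counter(w for doc in docs for w in set(doc))

# Collection frequency: count every occurrence across the corpus.
coll_freq = Counter(w for doc in docs for w in doc)

print(term_freq[1]["enzyme"])  # occurrences of "enzyme" in the second document
print(doc_freq["enzyme"])      # number of documents containing "enzyme"
print(coll_freq["enzyme"])     # occurrences of "enzyme" in the whole corpus
```

Here "enzyme" has term frequency 2 in the second document, document frequency 2, and collection frequency 3, which shows how the three counts differ for the same word.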
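The Documents as Vectors and Comparing Two Documents slides can be reproduced in a few lines. This is a hedged Python sketch rather than the R used on the slide: the helper names `word_counts` and `pearson` are hypothetical, and the tokenizer (lowercase, runs of letters only) is an assumption. The vectors `a`, `b`, and `c` are copied from the slide.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Count lowercase alphabetic tokens (assumed tokenization rule)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


sentence = (
    "Our analysis includes comparison of amino acid environments with "
    "random control environments as well as with each of the other "
    "amino acid environments."
)
counts = word_counts(sentence)
print(counts["amino"], counts["acid"], counts["environments"])

# Word-count vectors from the slide.
a = [1, 3, 5, 1, 8, 20, 0, 0, 0, 3, 1]
b = [2, 3, 4, 0, 10, 25, 1, 0, 2, 4, 3]
c = [2, 0, 1, 10, 2, 4, 7, 1, 5, 0, 8]

print(pearson(a, b))  # the slide reports 0.985615 from R's cor(a, b)
print(pearson(b, c))  # the slide reports -0.110328 from R's cor(b, c)
```

Note that correlation assumes both vectors use the same word order, so in practice both documents would first be mapped onto a shared vocabulary.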
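The weighting formulas on the Term Weighting Considerations slide and the similarity sum on the Related Articles slide can be combined as follows. This is only a sketch under stated assumptions, not PubMed's actual implementation: natural logarithms are assumed (the slides do not give a base), a term with tf = 0 is given weight 0, the document-length adjustment is omitted because the slides give no formula for it, both documents are assumed to belong to the corpus, and the function names and toy corpus are hypothetical.

```python
import math
from collections import Counter


def local_weight(tf):
    """Local weight 1 + log(tf); 0 for absent terms (assumption)."""
    return 1.0 + math.log(tf) if tf > 0 else 0.0


def global_weight(n_docs, df):
    """Global weight log(N / df): rarer terms get more weight."""
    return math.log(n_docs / df)


def similarity(doc1, doc2, corpus):
    """Sum over shared terms of local wt1 * local wt2 * global wt.

    doc1 and doc2 are token lists that are assumed to be in corpus,
    so every shared term has a document frequency of at least 1.
    """
    n_docs = len(corpus)
    df = Counter(term for doc in corpus for term in set(doc))
    tf1, tf2 = Counter(doc1), Counter(doc2)
    shared = set(tf1) & set(tf2)  # other terms contribute 0 to the sum
    return sum(
        local_weight(tf1[t]) * local_weight(tf2[t]) * global_weight(n_docs, df[t])
        for t in shared
    )


# Hypothetical toy corpus of already-tokenized documents.
corpus = [
    ["gene", "expression", "microarray", "gene"],
    ["gene", "expression", "analysis"],
    ["protein", "sequence", "analysis"],
    ["protein", "structure"],
]

sim_01 = similarity(corpus[0], corpus[1], corpus)  # share "gene", "expression"
sim_02 = similarity(corpus[0], corpus[2], corpus)  # share no terms
print(sim_01, sim_02)
```

Ranking every other document by this score against a given document yields a related-articles list in the sense of the slide.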
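The precision and recall definitions on the Related Articles slide translate directly into code. A minimal Python sketch follows; the function name, the document IDs, and the convention of returning 0 for an empty hitlist or an empty relevant set are assumptions made for illustration.

```python
def precision_recall(hitlist, relevant):
    """Return (precision, recall) for a retrieved hitlist.

    precision = relevant hits in hitlist / hits in hitlist
    recall    = relevant hits in hitlist / relevant documents in the corpus
    """
    hits = set(hitlist)
    relevant = set(relevant)
    relevant_hits = hits & relevant
    precision = len(relevant_hits) / len(hits) if hits else 0.0
    recall = len(relevant_hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents retrieved, 5 relevant in the corpus,
# and 2 of the retrieved documents (d1, d3) are relevant.
hitlist = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5", "d6", "d7"]

p, r = precision_recall(hitlist, relevant)
print(p, r)  # 2/4 and 2/5
```

The tradeoff mentioned on the slide shows up when the hitlist grows: returning more documents can only keep or raise recall, but it usually lowers precision.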