STAT 530
Introduction to Computational Biology
Ping Ma

Outline
• Course information
• Course structure
• Literature retrieval and mining

STAT 530
• Lecture: MW 10:00-10:50pm
• Instructor: Ping Ma
  – [email protected]
  – Office hour: Monday 2:00-4:00pm, 116 D Illini Hall

STAT 530
• Course webpage: http://www.stat.uiuc.edu/~pingma/stat530.html
• Material
  – Textbooks are recommended but not required
  – The course website will provide web resources and suggested papers
• Programming languages
  – We prefer R and Perl
  – You can use whatever works for you

Textbook
Recommended Textbooks
• Speed, T.P. (2003) Statistical Analysis of Gene Expression Microarray Data. Chapman and Hall.
• Durbin, R., Eddy, S., Krogh, A., and Mitchison, G. (1998) Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press.
Main References
• Wilkinson, D. J. (2006) Stochastic Modelling for Systems Biology. Chapman and Hall.
• Brown (2006) Genomes (3rd Edition). Garland Science.

Grading Policy
• Homework: 60%
  – Six sets of homework
  – Each set of homework consists of three components: statistical reasoning, statistical programming, and biological interpretation of results

Homework
• You are strongly encouraged to form a team (no more than 3 persons) to work together.
• Each team submits only one solution.
• Each solution consists of the solution per se and an author contribution statement.
• Sample author contribution: N.S. and H.Z. designed research; N.S., R.J.C., and H.Z. performed research; N.S. analyzed data; and N.S. and H.Z. wrote the paper.

Grading Policy
• Final project: 40%
  – You are encouraged to find a project on your own. If you cannot find one, I have some projects for you to choose from.
  – Submit your project pre-summary by Oct 27th.
  – A final project paper (no fewer than 15 pages, not including tables and figures) will be submitted.
  – One team member will present the project.

STAT 530
• This is NOT a “recipe” course
• This is NOT a “button-clicking” course
• It involves serious statistical reasoning

Literature Retrieval
• PubMed
  – Comprehensive biomedical literature database
  – Has simple and useful tools
  – Compare papers for Related Articles
• Faculty of 1000
  – Quickly read most important, relevant and up-to-date papers in my area
• Google Scholar
  – Quick solution for citations and download

PubMed
• PubMed (NCBI, NLM, NIH)
  – Biomedical literature database
  – Grew out of MEDLINE
  – >12M citations in MEDLINE since the 1960s
  – 400K added annually
  – Entrez retrieval system
• PubMed entry
  – Citation (paper) published
  – Citation indexed in PubMed with PubMedID assigned
  – Citation indexed with MeSH (Medical Subject Headings) terms

Faculty of 1000
• Highlights papers by scientific merit instead of the journal
• Papers rated by selected leading researchers
  – F1000 factors: recommended, must read, exceptional; the factor increases with more ratings
  – Comments: one sentence on significance and 2-3 sentences summarizing the paper
  – Links to PubMed and to related F1000 papers

Google Scholar
• Quick way to check the number of times a paper is cited
• Directly download from UIUC
  – Set library in Google Scholar preferences
• Quick way to download PDFs of papers not subscribed to by UIUC

Literature Mining Terms
• Corpus: collection of documents
• Term frequency: number of times a word appears in a document
• Document frequency: number of documents a word appears in
• Collection frequency: total number of times a word appears in a corpus
• Stop words: words in the corpus that contribute little to meaning, e.g. to, is, an
• Stemming: group together different variations of the same word, e.g. activate vs. activated vs. activating
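
The terms above can be made concrete with a small R sketch. The three-document corpus, the stop-word list, and all function names (`tokenize`, `term_frequency`, `document_frequency`, `collection_frequency`, `crude_stem`) are made up for illustration, and the suffix-stripping "stemmer" is only a stand-in for a real stemming algorithm.

```r
# Toy corpus: three hypothetical one-sentence documents (illustrative only).
corpus <- c(
  doc1 = "The kinase binds the receptor and the receptor is activated",
  doc2 = "Activating the receptor is an early step",
  doc3 = "An inhibitor blocks the kinase"
)

# Split a document into lower-case word tokens.
tokenize <- function(text) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  words[words != ""]
}
tokens <- lapply(corpus, tokenize)

# Remove stop words (a tiny hand-picked list, assumed for this example).
stop_words <- c("the", "to", "is", "an", "and")
tokens <- lapply(tokens, function(w) w[!(w %in% stop_words)])

# Term frequency: number of times a word appears in one document.
term_frequency <- function(word, doc_tokens) sum(doc_tokens == word)

# Document frequency: number of documents a word appears in.
document_frequency <- function(word, all_tokens) {
  sum(vapply(all_tokens, function(w) word %in% w, logical(1)))
}

# Collection frequency: total number of times a word appears in the corpus.
collection_frequency <- function(word, all_tokens) {
  sum(unlist(all_tokens) == word)
}

# Very crude stemming: strip a few common suffixes (not a real stemmer).
crude_stem <- function(words) sub("(ing|ed|e|s)$", "", words)

term_frequency("receptor", tokens$doc1)               # 2
document_frequency("receptor", tokens)                # 2
collection_frequency("receptor", tokens)              # 3
crude_stem(c("activate", "activated", "activating"))  # all become "activat"
```

In practice one would use an established stemmer and stop-word list rather than these hand-rolled ones; the point here is only how the three frequencies differ.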

Documents as Vectors
“Our analysis includes comparison of amino acid environments with random control environments as well as with each of the other amino acid environments.”

  acid  amino  analysis  comparison  control  environments  […]  Our
  2     2      1         1           1        3             […]  1

• A document is summarized as a vector of word counts.
• Each dimension contains the number of times a word appears.
• Similarity between two documents can be calculated by comparing their vectors.

Comparing Two Documents
• Two documents' similarity can be compared by calculating the correlation of their vectors.
• Correlation measures the strength of the linear relationship between two random variables.

  a <- c(1, 3, 5, 1, 8, 20, 0, 0, 0, 3, 1)
  b <- c(2, 3, 4, 0, 10, 25, 1, 0, 2, 4, 3)
  c <- c(2, 0, 1, 10, 2, 4, 7, 1, 5, 0, 8)
  cor(a, b)   # 0.985615: correlated
  cor(b, c)   # -0.110328: not correlated
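
To connect the word-count vectors with the correlation comparison, here is a minimal R sketch that turns raw text into count vectors over a shared vocabulary and then compares them with `cor`. The first document is the example sentence from the slides. The second and third documents, and the helper names `tokenize` and `count_vector`, are assumptions made up for this illustration.

```r
# Document 1 is the example sentence from the slides.
# Documents 2 and 3 are hypothetical.
doc1 <- paste("Our analysis includes comparison of amino acid environments",
              "with random control environments as well as with each of the",
              "other amino acid environments.")
doc2 <- "Amino acid environments differ from random control environments."
doc3 <- "Gene expression is measured with microarrays."

# Split a document into lower-case word tokens.
tokenize <- function(text) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  words[words != ""]
}

# Shared vocabulary: every distinct word seen in any of the documents.
vocabulary <- sort(unique(unlist(lapply(c(doc1, doc2, doc3), tokenize))))

# Word-count vector of one document, with one dimension per vocabulary word.
count_vector <- function(text, vocabulary) {
  counts <- table(factor(tokenize(text), levels = vocabulary))
  setNames(as.integer(counts), vocabulary)
}

v1 <- count_vector(doc1, vocabulary)
v2 <- count_vector(doc2, vocabulary)
v3 <- count_vector(doc3, vocabulary)

v1[c("acid", "amino", "environments")]  # counts for a few dimensions

cor(v1, v2)  # positive: the documents share their most frequent words
cor(v1, v3)  # negative: the documents share almost no vocabulary
```

Both vectors must be built over the same vocabulary, in the same order, for the correlation to be meaningful.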

Term Weighting Considerations
Give different terms different weights
• Global weight
  – Document frequency: fewer documents, more weight: log(N / df), where N is the number of documents in the corpus and df is the term's document frequency
• Local weight
  – Term frequency: more frequent, more weight: 1 + log(tf)
  – Document length: less weight for longer documents
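
The two weighting formulas can be sketched in R as follows. The function names, the corpus size of 1000 documents, and the example frequencies are hypothetical. Setting the local weight to 0 when a term is absent (tf = 0) is an assumption, since 1 + log(0) is undefined. The document-length adjustment is left out because the slides give no formula for it.

```r
# Global weight: terms found in fewer documents get more weight.
# df = document frequency of the term, n_docs = number of documents (N).
global_weight <- function(df, n_docs) log(n_docs / df)

# Local weight: more frequent terms get more weight, damped by the log.
# Assumption: a term that does not occur in the document gets weight 0.
local_weight <- function(tf) ifelse(tf > 0, 1 + log(tf), 0)

n_docs <- 1000                   # hypothetical corpus size
df <- c(common = 500, rare = 10) # hypothetical document frequencies
global_weight(df, n_docs)        # log(2) for "common", log(100) for "rare"

tf <- c(0, 1, 10)                # hypothetical term frequencies
local_weight(tf)                 # 0, 1, and 1 + log(10)
```

The logarithm keeps a term that occurs 10 times from counting 10 times as much as a term that occurs once.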

Related Articles
• Related Articles
  – Similarity between two documents: Σ over all terms of (local wt1 × local wt2 × global wt)
  – Pre-computed related articles for each citation
  – Rank ordered by relevance
• How to evaluate
  – Tradeoff between precision and recall
  – Precision = # relevant hits in hitlist / # hits in hitlist
  – Recall = # relevant hits in hitlist / # relevant documents in the corpus
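
The similarity score and the two evaluation measures can be sketched in R as below. This follows the formulas as written on the slide and is not a description of PubMed's actual implementation. The function names, weights, and document identifiers are hypothetical.

```r
# Similarity between two documents: sum over all terms of
# (local weight in doc 1) x (local weight in doc 2) x (global weight).
# The three vectors must list the same terms in the same order.
similarity <- function(local_wt1, local_wt2, global_wt) {
  sum(local_wt1 * local_wt2 * global_wt)
}

# Precision: fraction of the hit list that is relevant.
precision <- function(hits, relevant) {
  length(intersect(hits, relevant)) / length(hits)
}

# Recall: fraction of all relevant documents that made it into the hit list.
recall <- function(hits, relevant) {
  length(intersect(hits, relevant)) / length(relevant)
}

# Hypothetical weights for two terms shared by two documents.
similarity(local_wt1 = c(1, 2), local_wt2 = c(3, 4), global_wt = c(1, 0.5))  # 7

# Hypothetical retrieval result: 4 hits, 5 relevant documents in the corpus.
hits     <- c("p1", "p2", "p3", "p4")
relevant <- c("p2", "p4", "p7", "p8", "p9")
precision(hits, relevant)  # 0.5
recall(hits, relevant)     # 0.4
```

Returning more hits can only keep or raise recall but usually lowers precision, which is the tradeoff noted above.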
This note was uploaded on 11/21/2010 for the course STAT 530 taught by Professor Ma during the Spring '07 term at University of Illinois, Urbana Champaign.