STAT 530
Introduction to Computational Biology
Ping Ma
Outline
Course information
Course structure
Literature retrieval and mining
Lecture: MW 10:00-10:50pm
Instructor: Ping Ma
pingma@uiuc.edu
Office hours: Monday 2:00-4:00pm, 116 D Illini Hall
Course webpage: http://www.stat.uiuc.edu/~pingma/stat530.html
Material
– A textbook is recommended but not required
– The course website will provide web resources and suggested papers
Programming languages
– We prefer R and Perl
– You can use whatever works for you
Textbook
Recommended Textbooks
– Speed, T.P. (2003) Statistical Analysis of Gene Expression Microarray Data. Chapman and Hall.
– Durbin, R., Eddy, S., Krogh, A., and Mitchison, G. (1998) Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press.
Main References
– Wilkinson, D.J. (2006) Stochastic Modelling for Systems Biology. Chapman and Hall.
– Brown (2006) Genomes (3rd Edition). Garland Science.
Grading Policy
Homework: 60%
Six sets of homework
Each set of homework consists of three components: statistical reasoning, statistical programming, and biological interpretation of results.
You are strongly encouraged to form a team (no more than 3 persons) to work together.
Each team submits only one solution.
Each submission consists of the solution itself and an author contribution statement.
Sample author contribution: N.S. and H.Z. designed research; N.S., R.J.C., and H.Z. performed research; N.S. analyzed data; and N.S. and H.Z. wrote the paper.
Grading Policy
Final project: 40%
You are encouraged to find a project on your own. If you cannot find one, I have some projects for you to choose from.
Submit your project pre-summary by Oct 27th.
A final project paper (no fewer than 15 pages, not including tables and figures) must be submitted.
One team member will present the project.
STAT 530
This is NOT a "recipe" course
This is NOT a "button-clicking" course
It involves serious statistical reasoning
Literature Retrieval
PubMed
– Comprehensive biomedical literature DB
– Has simple and useful tools
– Compare papers for Related Articles
Faculty of 1000
– Quickly read the most important, relevant, and up-to-date papers in my area
Google Scholar
– Quick solution for citations and downloads
PubMed
PubMed (NCBI, NLM, NIH)
– Biomedical literature database
– Grew out of MEDLINE
– >12M citations in MEDLINE since the 1960s
– 400K added annually
– Entrez retrieval system
PubMed entry
– Citation (paper) published
– Citation indexed in PubMed with PubMedID assigned
– Citation indexed with MeSH (Medical Subject Heading) terms
Faculty of 1000
Highlights papers by scientific merit rather than by journal
Papers rated by selected leading researchers
Google Scholar
Quick way to check the number of times a paper is cited
Directly download from UIUC
– Set library in Google Scholar preferences
Quick way to download PDFs of papers not subscribed to by UIUC
Literature Mining Terms
Corpus: Collection of documents
Term frequency: Number of times a word appears in a document
Document frequency: Number of documents a word appears in
Collection frequency: Total number of times a word appears in a corpus
Stop words: Words in the corpus that contribute little to meaning, e.g. to, is, an
Stemming: Group together different variations of the same word, e.g. activate vs. activated vs. activating
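The terms above can be illustrated with a minimal sketch in base R. The three documents, the stop-word list, and the helper names `tokenize` and `crude_stem` are all invented for illustration; in particular, `crude_stem` is a naive suffix stripper, not a real stemming algorithm.

```r
# Toy corpus: hypothetical documents, invented for illustration
corpus <- c(
  doc1 = "The kinase is activated to bind the receptor",
  doc2 = "Activating the receptor is an early step",
  doc3 = "The kinase is an enzyme and the kinase binds"
)

# Assumed stop-word list (not a standard one)
stop_words <- c("the", "to", "is", "an", "and")

# Lowercase, split on non-letters, and drop stop words
tokenize <- function(text) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  words <- words[nzchar(words)]
  words[!words %in% stop_words]
}

tokens <- lapply(corpus, tokenize)

# Term frequency: count of each word within one document
term_freq <- lapply(tokens, table)

# Document frequency: number of documents that contain each word
doc_freq <- table(unlist(lapply(tokens, unique)))

# Collection frequency: total occurrences of each word in the corpus
coll_freq <- table(unlist(tokens))

# Naive stemming: strip a trailing "e", "ed", or "ing"
crude_stem <- function(words) sub("(e|ed|ing)$", "", words)
stems <- crude_stem(c("activate", "activated", "activating"))

print(term_freq$doc3)
print(doc_freq)
print(coll_freq)
print(stems)
```

In this toy corpus "kinase" appears in two documents but three times overall, which shows how document frequency and collection frequency can differ.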
Documents as Vectors
"Our analysis includes comparison of amino acid environments with random control environments as well as with each of the other amino acid environments."
acid  amino  analysis  comparison  control  environments  […]  Our
2     2      1         1           1        3             […]  1
• A document is summarized as a vector of word counts.
• Each dimension contains the number of times a word appears.
• Can calculate similarity between two documents by comparing their vectors
Comparing Two Documents
The similarity of two documents can be measured by the correlation of their word-count vectors
Correlation measures the strength of the linear relationship between two random variables
a = c(1, 3, 5, 1, 8, 20, 0, 0, 0, 3, 1)
b = c(2, 3, 4, 0, 10, 25, 1, 0, 2, 4, 3)
c = c(2, 0, 1, 10, 2, 4, 7, 1, 5, 0, 8)
cor(a, b) = 0.985615 (correlated)
cor(b, c) = -0.110328 (not correlated)
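As a rough sketch of both ideas in base R: the first part turns the quoted sentence into a word-count vector, and the second recomputes the correlations for the three count vectors given above. The `tokenize` helper is hypothetical, and the third vector is named `c_vec` here (an arbitrary choice) so that it does not reuse the name of R's `c()` function.

```r
# Sentence quoted in the "Documents as Vectors" slide
sentence <- paste(
  "Our analysis includes comparison of amino acid environments with",
  "random control environments as well as with each of the other",
  "amino acid environments."
)

# Hypothetical helper: lowercase and split on anything that is not a letter
tokenize <- function(text) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  words[nzchar(words)]
}

# The document as a vector of word counts, one dimension per distinct word
doc_vector <- table(tokenize(sentence))
print(doc_vector)

# Word-count vectors from the slide
a     <- c(1, 3, 5, 1, 8, 20, 0, 0, 0, 3, 1)
b     <- c(2, 3, 4, 0, 10, 25, 1, 0, 2, 4, 3)
c_vec <- c(2, 0, 1, 10, 2, 4, 7, 1, 5, 0, 8)

# Pearson correlation between pairs of documents
r_ab <- cor(a, b)
r_bc <- cor(b, c_vec)
print(c(r_ab = r_ab, r_bc = r_bc))
```

Correlation is only one possible similarity measure; it is used here because it is the one the slide names.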
Term Weighting Considerations
Give different terms different weights
Global weight
– Document frequency: fewer documents, more weight: log(N / df)
Local weight
– Term frequency: more frequent, more weight: 1 + log(tf)
– Document length: less weight for longer documents
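The two formulas above can be sketched in R as follows. The function names and example numbers are invented for illustration, and giving an absent term (tf = 0) a local weight of 0 is an assumption, since the slide does not say how that case is handled. Document-length weighting is left out because no formula is given for it.

```r
# Global weight: terms found in fewer documents get more weight.
# N  = number of documents in the corpus
# df = number of documents containing the term
global_weight <- function(df, N) log(N / df)

# Local weight: terms that occur more often in a document get more weight.
# tf = number of times the term appears in the document
# Assumption: an absent term (tf = 0) gets weight 0.
local_weight <- function(tf) ifelse(tf > 0, 1 + log(tf), 0)

# Hypothetical corpus of 1000 documents
N <- 1000
df <- c(common_term = 900, rare_term = 10)
print(global_weight(df, N))

# Hypothetical term counts within one document
tf <- c(0, 1, 10)
print(local_weight(tf))
```

The logarithm dampens both effects: a term that appears ten times in a document is weighted more than a term that appears once, but far less than ten times as much.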
Related Articles
Related Articles
– Similarity between two documents: Σ over all terms (local wt1 × local wt2 × global wt)
– Precomputed related articles for each citation
– Rank ordered by relevance
How to evaluate
– Tradeoff between precision and recall
– Precision = # relevant hits in hitlist / # hits in hitlist
– Recall = # relevant hits in hitlist / # relevant documents in the corpus
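A minimal sketch of the similarity sum and the two evaluation measures, in R. It assumes the local and global weights are the 1 + log(tf) and log(N / df) forms from the term-weighting slide; whether PubMed's Related Articles uses exactly these weights is not stated here. The function names, term counts, and document IDs are all hypothetical.

```r
# Similarity: sum over all terms of local wt1 * local wt2 * global wt
# tf1, tf2 = term counts in the two documents (same term order)
# df       = number of documents containing each term; N = corpus size
similarity <- function(tf1, tf2, df, N) {
  local1 <- ifelse(tf1 > 0, 1 + log(tf1), 0)  # assumed local weight
  local2 <- ifelse(tf2 > 0, 1 + log(tf2), 0)
  sum(local1 * local2 * log(N / df))          # assumed global weight
}

# Precision: fraction of the hit list that is relevant
precision <- function(hits, relevant) {
  length(intersect(hits, relevant)) / length(hits)
}

# Recall: fraction of the relevant documents that made the hit list
recall <- function(hits, relevant) {
  length(intersect(hits, relevant)) / length(relevant)
}

# Hypothetical two-term example: only the first term is shared
sim <- similarity(tf1 = c(1, 0), tf2 = c(1, 3), df = c(10, 10), N = 100)
print(sim)

# Hypothetical document IDs: 4 hits returned, 5 relevant documents in corpus
hits     <- c(101, 102, 103, 104)
relevant <- c(102, 104, 105, 106, 107)
print(precision(hits, relevant))  # 2 relevant hits out of 4 hits = 0.5
print(recall(hits, relevant))     # 2 relevant hits out of 5 relevant = 0.4
```

Returning a longer hit list can only keep or raise recall but usually lowers precision, which is the tradeoff noted above.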
