Automated linking PUBMED documents with GO terms using SVM

SU-SHING CHEN* AND HYUNKI KIM
Computer and Information Science and Engineering Department, University of Florida, Gainesville, Florida 32611, USA
* To whom correspondence should be addressed.

Abstract

Summary: We have developed a scheme that automatically links PUBMED citations to GO terms using the SVM (Support Vector Machine) classification algorithm. The PUBMED database, with over 12 million citations, has become essential to life science researchers. More recently, GO (Gene Ontology) has provided a graph structure for the biological process, cellular component, and molecular function of genomic data. By mining the text of PUBMED citations and associating them with GO terms, we have built an ontological map between the two databases, so that users can search PUBMED via GO terms and, conversely, find GO entries via PUBMED classification. Consequently, some interesting and unexpected knowledge may be captured for further data analysis and biological experimentation. This paper reports our results on the SVM implementation and on the need to parallelize the training phase.

Availability: The PUBMED/GO linking software will be available upon request.

Contact: [email protected]

1 Introduction

With the exponential growth of biomedical data, life science researchers face a new challenge: how to systematically exploit the relationships between genes, sequences, and the biomedical literature. Most known genes are described in the biomedical literature, and PUBMED is a valuable database for this kind of information. PUBMED, developed by the U.S. National Library of Medicine (NLM), is a database of indexed bibliographic citations and abstracts. It covers over 4,600 biomedical journals. PUBMED citations and abstracts are searchable via PUBMED (1) or the NLM Gateway (2).
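The linking scheme summarized in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes bag-of-words features, one bias-free linear SVM per GO term (one-vs-rest) trained by subgradient descent on the regularized hinge loss, and a decision threshold of zero. The training sentences, the identifiers PMID-X and PMID-Y, and all function names are hypothetical and exist only for illustration.

```python
from collections import Counter, defaultdict

def featurize(text):
    """Bag-of-words term counts for one abstract (toy tokenizer)."""
    return Counter(text.lower().split())

def score(w, x):
    """Linear decision value w . x (no bias term in this sketch)."""
    return sum(w.get(t, 0.0) * c for t, c in x.items())

def train_linear_svm(xs, ys, epochs=20, lr=0.1, lam=0.01):
    """Subgradient descent on lam/2 * |w|^2 + hinge loss; ys are +1 or -1."""
    w = {}
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            violated = y * score(w, x) < 1.0  # margin checked before updating
            for t in w:                       # shrinkage from the L2 penalty
                w[t] *= 1.0 - lr * lam
            if violated:                      # hinge-loss subgradient step
                for t, c in x.items():
                    w[t] = w.get(t, 0.0) + lr * y * c
    return w

def link(train, unlabeled):
    """Train one classifier per GO term, then build the two-way map."""
    terms = sorted({go for _, go in train})
    xs = [featurize(text) for text, _ in train]
    models = {
        go: train_linear_svm(xs, [1 if g == go else -1 for _, g in train])
        for go in terms
    }
    pubmed_to_go, go_to_pubmed = defaultdict(set), defaultdict(set)
    for pmid, text in unlabeled.items():
        x = featurize(text)
        for go, w in models.items():
            if score(w, x) > 0:
                pubmed_to_go[pmid].add(go)
                go_to_pubmed[go].add(pmid)
    return pubmed_to_go, go_to_pubmed

# Hypothetical toy training set: (abstract text, GO term).
TRAIN = [
    ("caspase activation triggers apoptosis in tumor cells", "GO:0006915"),
    ("apoptosis and programmed cell death require caspase signaling", "GO:0006915"),
    ("dna polymerase drives replication fork progression", "GO:0006260"),
    ("origin licensing controls dna replication timing", "GO:0006260"),
]
# Hypothetical unlabeled citations keyed by placeholder identifiers.
NEW_DOCS = {
    "PMID-X": "caspase dependent apoptosis",
    "PMID-Y": "replication of dna",
}

pubmed_to_go, go_to_pubmed = link(TRAIN, NEW_DOCS)
print(dict(pubmed_to_go))
print(dict(go_to_pubmed))
```

Keeping both dictionaries is what allows lookups in either direction, from a GO term to citations and from a citation to GO terms. A realistic system would need far more training data, better tokenization, and a tuned regularization constant.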
The biomedical literature has much to say about gene sequences, but it also seems that sequences can tell us much about the biomedical literature. Currently, highly trained biologists read the literature and manually select appropriate Gene Ontology (GO) terms to annotate it. The Gene Ontology database was created more recently to provide an ontological graph structure for the biological process, cellular component, and molecular function of genomic data. McCray et al. show that GO is suitable as a resource for natural language processing (NLP) applications because a large percentage (79%) of the GO terms passed the NLP parser. They also show that 35% of the GO terms were found in a corpus collected from the MEDLINE database and 27% of the GO terms were found in the current edition of the Unified Medical Language System (UMLS). A recent study by Raychaudhuri et al. employs a "maximum entropy" technique to categorize 21 GO terms using training and test documents extracted from PUBMED using handcrafted keyword queries. Their study reports that their models trained on PUBMED handcrafted keyword queries....

(1) http://www.ncbi.nlm.nih.gov/entrez/query.fcgi
(2) http://gateway.nlm.nih.gov/gw/Cmd
This note was uploaded on 01/15/2012 for the course COP 4600 taught by Professor Yavuz-kahveci during the Spring '07 term at University of Florida.