An Approach to Protein Name Extraction using Heuristics and a Dictionary

Kazuhiro Seki
Laboratory of Applied Informatics Research, Indiana University, 1320 East Tenth Street, LI 011, Bloomington, Indiana 47405-3907. Email: [email protected]

Javed Mostafa
Laboratory of Applied Informatics Research, Indiana University, 1320 East Tenth Street, LI 011, Bloomington, Indiana 47405-3907. Email: [email protected]

This paper proposes a method for protein name extraction from biological texts. Our method exploits hand-crafted rules based on heuristics and a set of protein names (dictionary). In contrast to previously proposed methods, our approach avoids the use of natural language processing tools such as part-of-speech taggers and syntactic parsers so as to improve processing speed. We implemented a prototype system for protein name extraction based on our method and conducted evaluation experiments. The result showed that our system produces results comparable to the state-of-the-art protein name extraction system on multiple corpora.

Introduction

Ever-growing digitized texts have resulted in a demand for automated techniques to extract novel information. Message Understanding Conferences (MUCs) (Grishman and Sundheim, 1996) represent one of the major attempts to develop information extraction (IE) techniques targeting general texts (newswire articles) in which the participants independently implement IE systems and compare their system performance on a common test set.

IE is crucial and urgent also in the field of molecular biology because of a demand for automatically discovering molecular pathways and interactions in the literature, which is, even for human experts, labor-intensive and time-consuming. Therefore, much research has been done to explore IE techniques on biological texts (Friedman et al., 2001; Ng and Wong, 1999; Proux et al., 1998; Sekimizu et al., 1998; Thomas et al., 2000).

Our ultimate goal is to realize an automated system to discover information in the biological literature, specifically, relations and interactions between specific proteins and cancer, which is expected to be beneficial for developing new medicines and treatments specific to cancer. To accomplish our goal, we begin with extracting protein names appearing in biological texts. However, automatic protein name extraction is not a trivial task. There are no common standards or fixed nomenclatures for protein names. As new proteins continue to be discovered and named, fixed protein name dictionaries are not necessarily helpful in extracting new protein names. Additionally, protein names often appear in shortened, abbreviated, or slightly altered forms (e.g., differences in capitalization and hyphenation). Therefore, even protein names that are not new and are supposed to be contained in a dictionary might be overlooked due to the way they are actually written....
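
The problem of surface variation described above can be illustrated with a minimal sketch. It is not the method proposed in this paper, whose rules are not shown in this excerpt; it only assumes that differences in case, hyphens, and spacing are collapsed before a dictionary lookup. The toy dictionary and the function names `normalize` and `lookup` are hypothetical.

```python
import re

def normalize(name: str) -> str:
    """Lowercase the name and drop hyphens and whitespace,
    so that surface variants collapse to the same key."""
    return re.sub(r"[-\s]+", "", name.lower())

# Hypothetical toy dictionary; a real protein dictionary would be far larger.
DICTIONARY = ["NF-kappa B", "Interleukin-2", "p53"]

# Map each normalized key back to its canonical dictionary entry.
NORMALIZED = {normalize(entry): entry for entry in DICTIONARY}

def lookup(mention: str):
    """Return the canonical dictionary entry for a mention,
    or None if no entry matches after normalization."""
    return NORMALIZED.get(normalize(mention))

if __name__ == "__main__":
    for mention in ["NF-kappaB", "interleukin 2", "P53", "Raf-1"]:
        print(mention, "->", lookup(mention))
```

Such normalization recovers variants of names already in the dictionary, but it cannot recognize newly coined names, which is why a dictionary alone is not sufficient for this task.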
This note was uploaded on 09/21/2009 for the course CS 580 taught by Professor Fdfdf during the Spring '09 term at University of Toronto.