BIOS 735: Statistical Computing
Michael Wu
Lecture 7: More Text Processing, Intro to Regular Expressions, File I/O
September 13, 2011

Administrative Details

Homework 1 is out and has been updated (both the problem statement and the sample data set for problem 4).

Handing in homework:
- Please e-mail your homework to Baiming Zou (DO NOT CC THE INSTRUCTOR).
  - E-mails should originate from a valid UNC account.
  - The e-mail subject should read: BIOS735 Homework 1 <Name>
  - Example: BIOS735 Homework 1 MikeWu
- The functions for each problem should be in a separate text file.
  - The file name for each problem should be: <Name>-problem*.txt
  - Example: MikeWu-problem1.txt or MikeWu-problem4.txt, etc.
- No late homework. Please address questions to the grader or instructor well before the due date.
- In your e-mail, the body must contain the UNC honor statement: "I have neither given nor received unauthorized assistance while preparing this assignment."
- Please cite anybody you may have worked with in the body of your e-mail, for example: "I worked with Joe Blow and Bob Slob."

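String Manipulation: String Matching (grep, regexpr)

    a = c("asdfA", "asdfB", "12365)C", "asdfD", "asdfqwerty")
    grep("df", a)      # indices of the elements that contain "df"
    grepl("df", a)     # logical vector: does each element contain "df"?
    regexpr("df", a)   # position of the first match in each element (-1 if none)
    regexpr("d", a)
    gregexpr("a", a[1], ignore.case = TRUE)   # positions of all matches, ignoring case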
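
To make the return values concrete, here is a minimal sketch that reruns the calls on the same vector and stores the results. The variable names (idx, hit, pos, len, all_a) are introduced here for illustration only and are not part of the slide.

```r
a <- c("asdfA", "asdfB", "12365)C", "asdfD", "asdfqwerty")

idx <- grep("df", a)              # integer indices of the matching elements
hit <- grepl("df", a)             # one TRUE/FALSE per element
pos <- regexpr("df", a)           # start of the first match, -1 when there is none
len <- attr(pos, "match.length")  # length of each match, -1 when there is none

# gregexpr() returns a list with one entry per input string;
# each entry holds the start positions of all matches in that string.
all_a <- gregexpr("a", a[1], ignore.case = TRUE)[[1]]

print(idx)
print(hit)
print(as.integer(pos))
print(as.integer(all_a))
```

Both a[idx] and a[hit] select the same elements; grepl() is usually the more convenient form inside logical conditions.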

String Manipulation: Example

Suppose, for instance, we have text we wish to split into its constituent words.

    dat = readLines("http://www.bios.unc.edu/~mwu/bios735/Strings/abstract.txt")[[1]]
    strsplit(dat, " ")
    strsplit(dat, "asdf")
    strsplit(dat, "[asdf]")

What's going on? Regular expressions.

    strsplit(dat, "[[:punct:]]")
    strsplit(dat, "[ [:punct:]]")
    strsplit(dat, "[[:space:]]")
    strsplit(dat, "[[:space:][:punct:]]")
    strsplit(dat, ".")
    strsplit(dat, "[.]")
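
The differences between these splits are easier to see on a short string. The sketch below is only an illustration: the string x is a hypothetical stand-in for the downloaded abstract, and the variable names are made up for this example.

```r
# Hypothetical stand-in for the downloaded abstract text.
x <- "Gene sets, pathways."

words_space <- strsplit(x, " ")[[1]]             # split on a literal space
no_split    <- strsplit(x, "asdf")[[1]]          # the sequence "asdf" never occurs
by_chars    <- strsplit(x, "[asdf]")[[1]]        # split at any one of a, s, d, f
by_punct    <- strsplit(x, "[[:punct:]]")[[1]]   # split at punctuation
by_both     <- strsplit(x, "[ [:punct:]]")[[1]]  # split at a space or punctuation

# An unescaped "." matches any single character, so every character acts as a
# separator and only empty strings remain; "[.]" matches a literal period.
any_char <- strsplit("a.b", ".")[[1]]
literal  <- strsplit("a.b", "[.]")[[1]]

print(words_space)
print(by_chars)
print(by_both)
print(any_char)
print(literal)
```

Note that by_both contains an empty string between "sets" and "pathways", because the comma and the following space are treated as two separate separators.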
String Manipulation: Regular Expressions

Regular expressions: a description of a “codified method of SEARCHING”; R calls it a pattern that describes a set of strings. We will focus on “Extended Regular Expressions”, which are the default in R.
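
As a small illustration of "a pattern that describes a set of strings", the sketch below checks one pattern against several strings. The fixed = TRUE argument is not mentioned on the slide; it is shown here only to contrast regular-expression matching with literal matching, and the variable names are made up for this example.

```r
a <- c("asdfA", "asdfB", "12365)C", "asdfD", "asdfqwerty")

# One pattern, several strings: "asdf[AB]" matches any string that
# contains "asdfA" or "asdfB".
in_set <- grepl("asdf[AB]", a)

# By default the pattern is treated as a regular expression, so "." means
# "any single character"; fixed = TRUE requests a literal comparison instead.
regex_dot   <- grepl(".", "abc")
literal_dot <- grepl(".", "abc", fixed = TRUE)

print(in_set)
print(regex_dot)
print(literal_dot)
```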

