Lectures12-Regexp-2up - Regular Expressions 279 Regular...

Regular Expressions

Regular expressions give us a powerful way of matching patterns in text data.

Example 1: Election data from three different datasets. We know these are the same places, but how can the computer recognize that?
Example 2: Creating variables that predict whether an email is SPAM
- numbers or underscores in the sending address
- all capital letters in the subject line
- fake “words” like Vi@graa
- number of exclamation points in the subject line
- received time in the current time zone

Example 3: Text mining the State of the Union addresses
- How long are the speeches?
- How do the distributions of certain words change over time?
- Which presidents have given “similar” speeches?
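A few of the spam variables in Example 2 can be sketched in R with base functions. This is only an illustration under stated assumptions: the addresses and subject lines are made up, the variable names are hypothetical, and "all capital letters" is read here as "has at least one capital letter and no lowercase letters".

```r
# Hypothetical example data; these addresses and subjects are invented.
from    <- c("jane.doe@example.com", "win_big_77@example.net")
subject <- c("Meeting notes for Tuesday", "FREE PRIZE!!! CLAIM NOW!")

# TRUE if the sending address contains a digit or an underscore.
has_num_or_underscore <- grepl("[0-9_]", from)

# TRUE if the subject has at least one capital letter and no lowercase letters.
all_caps <- grepl("[A-Z]", subject) & !grepl("[a-z]", subject)

# Number of exclamation points in each subject line:
# gregexpr() finds every match, regmatches() extracts them, lengths() counts them.
n_exclaim <- lengths(regmatches(subject, gregexpr("!", subject)))

spam_vars <- data.frame(has_num_or_underscore, all_caps, n_exclaim)
print(spam_vars)
```

Each variable is computed for the whole vector at once, so the same lines would apply unchanged to a full column of addresses or subjects.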
The language of regular expressions allows us to carry out many useful tasks, such as
- extracting pieces of text, for example, finding all the links in an HTML document
- creating variables from information found in text
- cleaning and transforming text into a uniform format, resolving inconsistencies in format between files
- mining text by treating documents directly as data
- “scraping” the web for data

Most importantly, we do this all programmatically rather than by hand, so that we can easily reproduce our work if needed.

A regular expression (sometimes abbreviated as regex or regexp) is a pattern that describes a set of strings. This set may be finite or infinite, depending on the particular regexp. We say the regexp “matches” each element of that set.

For example, we’ll see the regexp grey|gray matches the words grey and gray, whereas the regexp ^A.* matches any string starting with capital A.

The idea is somewhat similar to using wildcards in specifying UNIX file names, but with many more possibilities.
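The two patterns mentioned above, grey|gray and ^A.*, can be tried out with a short R sketch. The test strings are made up for illustration, and grepl() is just one of several base R functions that accept a regexp.

```r
# Made-up strings for trying out the two patterns from the text.
x <- c("grey", "gray", "Alabama", "alabama", "green")

# grey|gray: "|" means "or", so this matches strings containing grey or gray.
is_grey <- grepl("grey|gray", x)

# ^A.*: "^" anchors the match at the start of the string,
# so only strings beginning with a capital A match.
starts_with_A <- grepl("^A.*", x)

print(data.frame(x, is_grey, starts_with_A))
```

Note that grepl() reports whether the pattern matches anywhere in the string, so grey|gray would also match a longer string such as "greyhound".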
Regular expressions are used in many programs, UNIX utilities, and programming languages. We will focus on regular expressions as they’re used in R. However, the default treatment of regexps in R (called the “extended regular expressions” standard) is the same as in many UNIX utilities, with one small exception I’ll come back to later.

Regular expressions are constructed from three things:
- Literal characters are matched only by the character itself.
- A character class is matched by any single member of the specified class. For example, [A-Z] is matched by any capital letter.
- Modifiers
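The first two building blocks, literal characters and character classes, can be illustrated with a small R sketch. The strings and variable names are made up for the example.

```r
# Made-up strings that differ only in case and punctuation.
x <- c("Ohio", "ohio", "OHIO", "o-h-i-o")

# Literal characters: "hio" matches only where exactly h, i, o appear in order.
has_hio <- grepl("hio", x)

# Character class: [A-Z] matches any single capital letter anywhere in the string.
has_capital <- grepl("[A-Z]", x)

# sub() replaces the first match only; here the first capital letter becomes "_".
first_cap_masked <- sub("[A-Z]", "_", x)

print(data.frame(x, has_hio, has_capital, first_cap_masked))
```

The literal pattern is case sensitive, which is why "OHIO" does not match "hio" even though it contains the same letters.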