jurafsky&martin_3rdEd_17 (1).pdf

There are, however, ways to enforce non-greedy matching, using another meaning of the ? qualifier. The operator *? is a Kleene star that matches as little text as possible. The operator +? is a Kleene plus that matches as little text as possible.

2.1.3 A Simple Example

Suppose we wanted to write a RE to find cases of the English article the. A simple (but incorrect) pattern might be:

    /the/

One problem is that this pattern will miss the word when it begins a sentence and hence is capitalized (i.e., The). This might lead us to the following pattern:

    /[tT]he/

But we will still incorrectly return texts with the embedded in other words (e.g., other or theology). So we need to specify that we want instances with a word boundary on both sides:

    /\b[tT]he\b/

Suppose we wanted to do this without the use of /\b/. We might want this since /\b/ won’t treat underscores and numbers as word boundaries; but we might want to find the in some context where it might also have underlines or numbers nearby (the_ or the25). We need to specify that we want instances in which there are no alphabetic letters on either side of the the:

    /[^a-zA-Z][tT]he[^a-zA-Z]/

But there is still one more problem with this pattern: it won’t find the word the when it begins a line. This is because the regular expression [^a-zA-Z], which we used to avoid embedded instances of the, implies that there must be some single (although non-alphabetic) character before the the. We can avoid this by specifying that before the the we require either the beginning-of-line or a non-alphabetic character, and the same at the end of the line:

    /(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)/

The process we just went through was based on fixing two kinds of errors: false positives, strings that we incorrectly matched like other or there, and false negatives, strings that we incorrectly missed, like The. Addressing these two kinds of errors comes up again and again in implementing speech and language processing systems. Reducing the overall error rate for an application thus involves two antagonistic efforts:

  • Increasing precision (minimizing false positives)
  • Increasing recall (minimizing false negatives)
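The non-greedy operators and the successive patterns for the can be tried out directly. The following is a minimal sketch, assuming Python's re module (the text does not name a tool); the sample lines, their labels, and the helper precision_recall are hypothetical and chosen only for illustration. In particular, labeling the_ and the25 as wanted matches is an assumption taken from the scenario described above.

```python
import re

# Greedy vs. non-greedy quantifiers on the same input.
text = "once upon a time"
greedy = re.search(r"[a-z]*", text).group()      # as much text as possible
lazy_star = re.search(r"[a-z]*?", text).group()  # as little as possible (may be empty)
lazy_plus = re.search(r"[a-z]+?", text).group()  # as little as possible, at least one char

# The successive patterns for the article "the", in the order discussed.
PATTERNS = {
    "plain": r"the",
    "either_case": r"[tT]he",
    "word_boundary": r"\b[tT]he\b",
    "non_alpha": r"[^a-zA-Z][tT]he[^a-zA-Z]",
    "anchored": r"(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)",
}

# Hypothetical labeled lines: True means the line contains a wanted "the".
LABELED = [
    ("The cat sat.", True),
    ("I saw the dog.", True),
    ("other theology", False),
    ("see the25 here", True),
    ("the_ flag", True),
    ("nothing relevant", False),
]


def precision_recall(pattern, labeled):
    """Return (precision, recall) of a pattern over (line, wanted) pairs."""
    tp = fp = fn = 0
    for line, wanted in labeled:
        found = re.search(pattern, line) is not None
        if found and wanted:
            tp += 1
        elif found and not wanted:
            fp += 1  # false positive: matched a line we did not want
        elif wanted:
            fn += 1  # false negative: missed a line we wanted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    print(repr(greedy), repr(lazy_star), repr(lazy_plus))
    for name, pattern in PATTERNS.items():
        print(name, precision_recall(pattern, LABELED))
```

On this toy data, each revision trades one kind of error for the other until the final anchored pattern removes both: the word-boundary pattern has no false positives but misses the_ and the25, while the case-insensitive pattern finds every wanted line but also matches other.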

Chapter 2: Regular Expressions, Text Normalization, Edit Distance

2.1.4 A More Complex Example

Let’s try out a more significant example of the power of REs. Suppose we want to build an application to help a user buy a computer on the Web. The user might want “any machine with more than 6 GHz and 500 GB of disk space for less than $1000”. To do this kind of retrieval, we first need to be able to look for expressions like 6 GHz or 500 GB or Mac or $999.99. In the rest of this section we’ll work out some simple regular expressions for this task.

First, let’s complete our regular expression for prices. Here’s a regular expression for a dollar sign followed by a string of digits:

    /$[0-9]+/

Note that the $ character has a different function here than the end-of-line function we discussed earlier. Regular expression parsers are in fact smart enough to realize
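
How an unescaped $ is treated depends on the regular expression engine. As a minimal sketch, assuming Python's re module (not a tool the text names), the dollar sign in the price pattern has to be escaped, because there an unescaped $ is always the end-of-line anchor. The sample sentence is hypothetical.

```python
import re

# Hypothetical sample text containing two prices.
text = "The laptop costs $999.99, or $1000 with a case."

# Unescaped: Python's re treats $ as the end anchor, so no digits can follow it.
unescaped = re.findall(r"$[0-9]+", text)

# Escaped: \$ matches a literal dollar sign followed by a string of digits.
escaped = re.findall(r"\$[0-9]+", text)

if __name__ == "__main__":
    print(unescaped)
    print(escaped)
```

Note that the escaped pattern stops at the decimal point, so it captures only the whole-dollar part of $999.99; it covers exactly "a dollar sign followed by a string of digits" and nothing more.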