Wk9_RegExp-4pp - Election Study Text Data the county center...

10/18/12

Text Data

Election Study

Geographic data: longitude and latitude of the county center
Population data from the census for each county
Election results from 2008 for each county (scraped from a Web site)

We want to match/merge the information from these three different sources.

What issues arise in matching? What problems need resolving to match counties across sources?

Capitalization: qui vs Qui
County/Parish missing
St. vs St
DeWitt vs De Witt
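The matching issues listed above can be handled by reducing each county name to a canonical key before merging. The following is a minimal sketch in Python, not the method used in the course; the function name `normalize_county` and the rule of deleting all whitespace are assumptions made for illustration.

```python
import re

def normalize_county(name):
    """Return a simplified key for matching county names across sources.

    Hypothetical helper: the rules below mirror the issues on the slide.
    """
    key = name.lower()                              # capitalization differences
    key = re.sub(r"\b(county|parish)\b", "", key)   # County/Parish present or missing
    key = key.replace(".", "")                      # St. vs St
    key = re.sub(r"\s+", "", key)                   # DeWitt vs De Witt
    return key

print(normalize_county("De Witt County"))  # dewitt
print(normalize_county("DeWitt"))          # dewitt
print(normalize_county("St. Clair"))       # stclair
print(normalize_county("St Clair County")) # stclair
```

Deleting all whitespace is aggressive and could merge two genuinely different names, so in practice the key would probably be compared within a state rather than across the whole country.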

Text mining State of Union Addresses

How long are the speeches?
How do the distributions of certain words change over time?
Which presidents have given similar speeches?

***
State of the Union Address
George Washington
December 8, 1790

Fellow-Citizens of the Senate and House of Representatives:

In meeting you again I feel much satisfaction in being able to repeat my congratulations on the favorable prospects which continue to distinguish our public affairs. The abundant fruits of another year have blessed our country with plenty and with the means of a flourishing commerce.

Text mining State of Union Addresses

All speeches are in one large plain text file.
Each speech starts with "***" on a line, followed by 3 lines of information about who gave the speech and when.
To mine the speeches, we want to create a word vector for each speech, which tracks the counts of how many times a particular word was said in each speech.
Words such as nation, national, nations should collapse to the same "word".

Web behavior

Every time you visit a Web site, information is recorded about the visit:
the page visited
date and time of visit
browser used
operating system
IP address
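The file layout described above ("***" on its own line, then 3 lines of information, then the speech) suggests a simple way to build the word vectors. The sketch below is one possible Python version under those assumptions; `split_speeches` is a hypothetical name, and the step that collapses nation, national, nations into one "word" (stemming) is deliberately left out.

```python
import re
from collections import Counter

def split_speeches(text):
    """Split the combined file into speeches and count words in each one."""
    speeches = []
    # Assumption: each speech begins with a line holding only "***".
    for chunk in re.split(r"^\*\*\*[ \t]*$", text, flags=re.MULTILINE):
        lines = [ln.strip() for ln in chunk.strip().splitlines()]
        if len(lines) < 3:
            continue  # skip anything before the first "***"
        title, speaker, date = lines[:3]      # the 3 lines of information
        body = " ".join(lines[3:]).lower()
        words = re.findall(r"[a-z]+", body)   # crude tokenizer: runs of letters
        speeches.append({"speaker": speaker, "date": date,
                         "counts": Counter(words)})
    return speeches

sample = """***
State of the Union Address
George Washington
December 8, 1790

Fellow-Citizens of the Senate and House of Representatives:
"""

speeches = split_speeches(sample)
print(speeches[0]["speaker"])        # George Washington
print(speeches[0]["counts"]["of"])   # 2
```

Each `Counter` plays the role of the word vector for one speech; the speech length is the sum of its counts.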
Two lines of the Web log

169.237.46.168 - - [26/Jan/2004:10:47:58 -0800] "GET /stat141/Winter04 HTTP/1.1" 301 328 "http://anson.ucdavis.edu/courses/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)"

169.237.46.168 - - [26/Jan/2004:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)"

The information in the log has a lot of structure; for example, the date always appears in square brackets.
However, the information is not consistently separated by the same characters, as it would be in a csv file, nor is it placed consistently in the same columns in the file.

Spam filtering: Anatomy of an email message

Three parts: header, body, attachments (optional).
Like regular mail, the header is the envelope and the body is the letter.
Plain text
Header: date, sender, and subject; message id, who are the carbon-copy recipients, return path.
SYNTAX - KEY:VALUE
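Because the Web log fields are marked by brackets and quotes rather than by one separator, a regular expression is a natural way to pull them apart. The following Python sketch is written against the two example lines only; the pattern, the group names, and the function `parse_log_line` are assumptions, and other log lines may need a looser pattern.

```python
import re

# Assumed layout, read off the two example lines:
# ip, two "-" fields, [timestamp], "request", status, size, "referrer", "agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" '
    r'"(?P<agent>[^"]*)"'
)

def parse_log_line(line):
    """Return a dict of fields, or None if the line does not fit the pattern."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

line = ('169.237.46.168 - - [26/Jan/2004:10:47:58 -0800] '
        '"GET /stat141/Winter04 HTTP/1.1" 301 328 '
        '"http://anson.ucdavis.edu/courses/" '
        '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)"')

fields = parse_log_line(line)
print(fields["timestamp"])  # 26/Jan/2004:10:47:58 -0800
print(fields["status"])     # 301
```

Returning `None` for lines that do not match makes it easy to count and inspect the lines the pattern fails on.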

Example header

Date: Mon, 2 Feb 2004 22:16:19 -0800 (PST)
From: [email protected]
X-X-Sender: [email protected]
To: Txxxx Uxxx <[email protected]>
Subject: Re: prof: did you receive my hw?
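The KEY:VALUE syntax of the header can be read by splitting each line at its first colon. The sketch below is a simplified Python illustration; `parse_header` is a hypothetical name, and it assumes one field per line with no repeated keys, which real mail headers do not guarantee (Python's standard `email` module handles those cases).

```python
def parse_header(header_text):
    """Turn KEY:VALUE header lines into a dict (simplified sketch)."""
    fields = {}
    for line in header_text.splitlines():
        # Split at the FIRST colon only: values such as the Date and the
        # Subject contain colons of their own.
        key, sep, value = line.partition(":")
        if sep:  # ignore lines that have no colon at all
            fields[key.strip()] = value.strip()
    return fields

header = """Date: Mon, 2 Feb 2004 22:16:19 -0800 (PST)
Subject: Re: prof: did you receive my hw?"""

fields = parse_header(header)
print(fields["Date"])     # Mon, 2 Feb 2004 22:16:19 -0800 (PST)
print(fields["Subject"])  # Re: prof: did you receive my hw?
```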