Wk9_RegExp-4pp

10/18/12

Text Data

Election Study
Geographic data: longitude and latitude of the county center
Population data from the census for each county
Election results from 2008 for each county (scraped from a Website)
Want to match/merge the information from these three different sources

What issues arise in matching?
What problems need resolving to match counties across sources?
Capitalization: qui vs Qui
County/Parish missing
St. vs St
DeWitt vs De Witt
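
The matching problems listed above (capitalization, a missing "County"/"Parish" suffix, "St." vs "St", "DeWitt" vs "De Witt") can be handled by reducing each name to a canonical key before merging. The following Python sketch is one possible approach, not the method used in the course; the function name `normalize_county` and the sample names in the usage lines are illustrative assumptions. Note that deleting all whitespace could merge two genuinely different names, so the keys should be checked for collisions within a state.

```python
import re

def normalize_county(name: str) -> str:
    """Reduce a county name to a canonical key for matching across sources."""
    # Capitalization: compare everything in lower case.
    key = name.strip().lower()
    # County/Parish missing: drop a trailing "county" or "parish".
    key = re.sub(r"\s+(county|parish)$", "", key)
    # "St." vs "St": remove periods.
    key = key.replace(".", "")
    # "DeWitt" vs "De Witt": remove all remaining whitespace.
    key = re.sub(r"\s+", "", key)
    return key

# Usage with illustrative names: both spellings map to the same key.
same_key = normalize_county("De Witt County") == normalize_county("DeWitt")
```
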
Text mining State of Union Addresses
How long are the speeches?
How do the distributions of certain words change over time?
Which presidents have given “similar” speeches?

***
State of the Union Address
George Washington
December 8, 1790

Fellow-Citizens of the Senate and House of Representatives:
In meeting you again I feel much satisfaction in being able to repeat my congratulations on the favorable prospects which continue to distinguish our public affairs. The abundant fruits of another year have blessed our country with plenty and with the means of a flourishing commerce.

Text mining State of Union Addresses
All speeches in one large plain text file
Each speech starts with “***” on a line followed by 3 lines of information about who gave the speech and when
To mine the speeches, we want to create a word vector for each speech, which tracks the counts of how many times a particular word was said in each speech.
Words such as nation, national, nations should collapse to the same “word”

Web behavior
Every time you visit a Web site, information is recorded about the visit:
the page visited
date and time of visit
browser used
operating system
IP address
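
Returning to the State of the Union file layout described above ("***" on its own line, then 3 lines of information, then the speech), splitting the file and counting words might look like the Python sketch below. The function names `split_speeches` and `word_counts` are hypothetical, and the sketch only lower-cases words; collapsing nation, national, nations to one "word" would need a stemming step that is not shown here.

```python
import re
from collections import Counter

def split_speeches(text):
    """Split the file on '***' lines; return one dict per speech."""
    speeches = []
    for chunk in re.split(r"(?m)^\*\*\*[ \t]*$", text):
        lines = chunk.strip().splitlines()
        if len(lines) < 3:
            continue  # skips the empty chunk before the first '***'
        # The 3 information lines: title, who gave the speech, and when.
        title, speaker, date = (ln.strip() for ln in lines[:3])
        body = "\n".join(lines[3:]).strip()
        speeches.append({"title": title, "speaker": speaker,
                         "date": date, "body": body})
    return speeches

def word_counts(body):
    """Count lower-cased alphabetic words in one speech (no stemming)."""
    return Counter(re.findall(r"[a-z]+", body.lower()))

sample = """***
State of the Union Address
George Washington
December 8, 1790

The abundant fruits of another year have blessed our country with
plenty and with the means of a flourishing commerce.
"""
speech = split_speeches(sample)[0]
counts = word_counts(speech["body"])
```

Stacking the per-speech counters over a shared vocabulary gives the word vectors used to compare speeches.
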
Two lines of the Web log
169.237.46.168 - - [26/Jan/2004:10:47:58 -0800] "GET /stat141/Winter04 HTTP/1.1" 301 328 "http://anson.ucdavis.edu/courses/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)"
169.237.46.168 - - [26/Jan/2004:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)"
The information in the log has a lot of structure, for example the date always appears in square brackets.
However, the information is not consistently separated by the same characters such as in a csv file, nor is it placed consistently in the same columns in the file.

Spam filtering: Anatomy of email message
Three parts: header, body, attachments (optional).
Like regular mail, the header is the envelope and the body is the letter.
Plain text
Header: date, sender, and subject; message id, who are the carbon-copy recipients, return path.
SYNTAX – KEY:VALUE
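
For the two Web log lines shown earlier, the structure (date in square brackets, request and browser strings in double quotes) is what a regular expression can exploit where splitting on a single separator cannot. The Python sketch below assumes every line has the same seven-part layout as those two lines; the field names (`ip`, `timestamp`, and so on) and the function name `parse_log_line` are labels chosen here for illustration.

```python
import re

# One named group per piece of the log line.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ '            # IP address, then two unused fields
    r'\[(?P<timestamp>[^\]]+)\] '      # date and time inside [ ]
    r'"(?P<request>[^"]*)" '           # e.g. "GET /path HTTP/1.1"
    r'(?P<status>\d{3}) (?P<size>\d+|-) '  # status code and byte count
    r'"(?P<referrer>[^"]*)" '          # page the visitor came from
    r'"(?P<agent>[^"]*)"'              # browser and operating system
)

def parse_log_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

sample_line = (
    '169.237.46.168 - - [26/Jan/2004:10:47:58 -0800] '
    '"GET /stat141/Winter04 HTTP/1.1" 301 328 '
    '"http://anson.ucdavis.edu/courses/" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)"'
)
record = parse_log_line(sample_line)
```

Returning None for a non-matching line makes it easy to collect and inspect the lines that break the assumed layout.
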
Example header
Date: Mon, 2 Feb 2004 22:16:19 -0800 (PST)
From: [email protected]
X-X-Sender: [email protected]
To: Txxxx Uxxx <[email protected]>
Subject: Re: prof: did you receive my hw?
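
Because each header line follows the KEY:VALUE syntax, a first pass at reading a header can split every line at its first colon only, since values such as the date and the subject contain colons of their own. The Python sketch below is a simplified illustration with a hypothetical function name `parse_header`; it assumes one field per line and keeps only the last value for a repeated key, which real mail headers do not guarantee.

```python
def parse_header(header_text):
    """Map each KEY to its VALUE, splitting at the first colon only."""
    fields = {}
    for line in header_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # no colon on this line, so not a KEY:VALUE pair
        fields[key.strip()] = value.strip()
    return fields

sample_header = (
    "Date: Mon, 2 Feb 2004 22:16:19 -0800 (PST)\n"
    "Subject: Re: prof: did you receive my hw?\n"
)
fields = parse_header(sample_header)
```

Python's standard `email` package offers a fuller parser for real messages; the sketch above only shows the KEY:VALUE idea.
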