Open Source Text Mining
Mathew Flynn, PhD
Louise Francis, FCAS, MAAA

Rationale For Paper
- Text mining is a new and promising technology for analyzing unstructured text data.
- Commercial text mining software can be expensive and difficult to learn.
- Several free open source languages can perform text mining, but without help they can be difficult to learn.
- In this session we provide a mini-tutorial for two open source products.

Two Open Source Products for Text Mining
- Perl, a text processing language
- R, a statistical and analytical language with text mining functionality provided by the text mining package tm

The Data
Text mining can be applied to many common tasks:
- Internet searches
- Screening emails for spam
- Analyzing free-form fields in underwriting and claims files
- Analyzing survey data
We illustrate the last two. The survey data can be downloaded from the CAS web site.

Mini Tutorial
- We will give a tutorial on using Perl and R for text mining.
- Download the survey data.
- Follow our examples.

The Survey Data
- From the 2008 CAS Quinquennial Survey.
- Survey question: What are the top two issues that will impact the CAS in the next five years?

Survey Question: Top Two Issues Affecting CAS (sample responses)
- A crisis that could affect our ability to "regulate" ourselves.
- A need to deal more thoroughly with non-traditional risk management approaches.
- Ability of members to prove they are more than just number crunchers.
- Ability to convince non-insurance companies of the value/skills offered by CAS members.

Perl
- Go to www.Perl.org and download Perl.
- Run the executable file (or install ActivePerl).
- The Windows command search path must be set correctly for Windows to find the desired perl.exe.

Good References
- Practical Text Mining with Perl by Bilisoly (2008) is an excellent resource for text mining in Perl.
- Perl for Dummies (Hoffman, 2003) provides a basic introduction, including the needed header information.

Some Key Things
- Perl must be run from DOS. One gets to DOS by finding the Command Prompt on the Programs menu.
- Before running Perl, switch to the Perl directory (i.e., if Perl was installed in the folder named Perl, in DOS type "cd C:\Perl").
- Programs need to be saved in text processing software. We recommend Notepad rather than Word, as some of the features of Word cause unexpected results when running a program. We recommend using the extension .pl.

Some Key Things cont.
- The header line of a Perl program is dependent on the operating system.
- To run a Perl program, type the following at the command prompt (a concrete sketch follows this list):
  perl program_name input_file_name output_file_name
- The input and output files are only required if the program requires a file and the file name is not contained in the program itself.
- The input and output file names may be contained in the program code, as illustrated in some of our examples, rather than entered at runtime.
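As a concrete illustration of the header line and the command above, here is a minimal sketch; the file name hello.pl and the printed message are our own choices and are not part of the survey examples.

#!perl -w
# hello.pl
# Minimal program to check that Perl runs from the command prompt
print "Perl is installed and working\n";

Save the lines above as hello.pl, switch to the Perl directory at the command prompt, and type: perl hello.pl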
Parsing Text
- Identify the spaces, punctuation, and other non-alphanumeric characters found in text documents, and separate the words from these other characters.
- Most computer languages (and spreadsheets) have text functions that perform the search and substring operations needed to do this.
- Perl has special functions for parsing text.

The split function
  split(/separating character(s)/, string)
Example:
  $Response = "Ability of members to prove they are more than just number crunchers";
  @words = split(/ /, $Response);

Complications of the split function
- More than one space:
  @words = split(/[\s+]/, $Response);
- Other separators: use the substitution operator.
(A short sketch contrasting the /\s+/ and /[\s+]/ patterns appears after the Word Search slide below.)

Simple parse program: Parse2.pl

#!perl -w
# Parse2.pl
# Program to parse a text string using one or more spaces as separator
$Response = "Ability of members to prove they are more than just number crunchers";
@words = split(/\s+/, $Response);   # parse words in string
# Loop through words in word array and print them
foreach $word (@words) {
    print "$word\n";
}

Less simple parse program: Parse3.pl

#!perl -w
# Parse3.pl
# Program to parse a sentence and remove punctuation
$Test = "A crisis that could affect our ability to 'regulate' ourselves.";   # a test string with punctuation
@words = split(/[\s+]/, $Test);   # parse the string using spaces
# Loop through words to find non-punctuation characters
foreach $word (@words) {
    while ($word =~ /(\w+)/g) {   # match one or more alphanumeric characters; these will be the words excluding punctuation
        print "$1 \n";            # print the first match, which will be the word of alphanumeric characters
    }
}

Read in survey data and parse

#!perl -w
# Enter file name with text data here
$TheFile = "Top2Iss.txt";
# open the file
open(INFILE, $TheFile) or die "File not found";
# read in one line at a time
while (<INFILE>) {
    chomp;                    # eliminate end of line character
    s/[.?!"()'{},&;]//g;      # replace punctuation with null
    s/\// /g;                 # replace slash with space
    s/\-//g;                  # replace dash with null
    s/^ //g;                  # replace beginning of line space
    print "$_\n";             # print cleaned line out
    @word = split(/[\s+]/);   # parse line
}

Print it out also: parsecomplex.pl
parsecomplex.pl is the same program, with a loop added at the end of the while block to print each parsed word followed by a comma:

    foreach $word (@word) { print "$word,"; }
    print "\n";

Word Search
- First, read in the accident description field.
- For each claim, read in each word.
- If the lower case of the target word is found, output a 1 for the new indicator variable; otherwise output a 0.
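The programs above use two different whitespace patterns with split: /\s+/ (one or more whitespace characters) and the character class /[\s+]/ (a single whitespace character, or a literal "+"). The sketch below, with an invented test string, shows how they differ when a response contains repeated spaces; the /\s+/ pattern used in Parse2.pl collapses them, while /[\s+]/ produces empty fields.

#!perl -w
# splitdemo.pl - illustrative sketch only, not one of the paper's programs
$Text = "Ability  of  members";    # two spaces between words
@a = split(/\s+/, $Text);          # split on one or more whitespace characters
@b = split(/[\s+]/, $Text);        # split on a single whitespace character (or a literal '+')
print scalar(@a), " fields: @a\n"; # 3 fields: the three words
print scalar(@b), " fields: @b\n"; # 5 fields, two of them empty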
SearchTarget.pl

$target = "(regulation)";
# initialize file variable containing file with text data
$TheFile = "Top2Iss1.txt";
# open the file
open(INFILE, $TheFile) or die "File not found";
# initialize identifier variables used when search is successful
$i = 0;
$flag = 0;
# read each line
while (<INFILE>) {
    chomp;
    ++$i;
    # put input line into new variable
    $Sentence = $_;
    # parse line of text
    @words = split(/[\s+]/, $Sentence);
    $flag = 0;
    foreach $x (@words) {
        if (lc($x) =~ /$target/) {
            $flag = 1;
        }
    }
    # print lines with target variable to screen
    print "$i $flag $Sentence \n";
}

Using Target in Analysis
Homeowner claim mean severity by whether the target word was found:

  Target found   Mean severity
  No             2,376.6
  Yes            6,221.1

Text Statistics
The length of each word is tabulated within a loop. A key line of code is:
  $count[length($x)] += 1;   # increment counter for words of this length

Perl Program for Word Lengths: Length.pl

#!perl -w
# Length.pl
# Enter file name with text data here
$TheFile = "Top2Iss.txt";
# open the file
open(INFILE, $TheFile) or die "File not found";
# read in one line at a time
while (<INFILE>) {
    chomp;                    # eliminate end of line character
    s/[.?!"()'{},&;]//g;      # replace punctuation with null
    s/\// /g;                 # replace slash with space
    s/\-//g;                  # replace dash with null
    s/^ //g;                  # replace beginning of line space
    print "$_\n";             # print cleaned line out
    @word = split(/[\s+]/);   # parse line
    # count length of each word in array @count
    foreach $x (@word) { $count[length($x)] += 1; }
}
$mxcount = $#count;
# print out largest word size and frequency of each count
print "Count $mxcount\n";
for ($i = 0; $i <= $#count; ) {
    # does a word of that size exist?
    if (exists($count[$i])) {
        print "There are $count[$i] words of length $i\n";
    }
    $i += 1;   # increment loop counter
}

Hashes
- A hash is like an array, but can be distinguished from an array in a number of ways.
- An array is typically indexed with zero and integer values, while a hash can be indexed with a letter or a word.
- Instead of an index, the hash has a key that maps to a specific array value.
- For instance, while the first entry in a Perl array is $array[0], the first entry in a hash might be $hash{'a'} or $hash{'x'} or even $hash{'hello'}. (Note that the order is not relevant.)
- Because the time to locate a value in a hash table is independent of its size, hashes can be very efficient for processing large amounts of data.

Hashes cont.
- A hash variable begins with a %. A hash holding a list of words might be denoted %words.
- A hash holding the counts of words from a document might be %count, and the indices of the hash can be the words themselves.
- A specific value of a hash is referenced by using a dollar sign ($) in front of the hash variable name and referencing the specific item with braces.
- For example, the specific word homeowner is referenced as $words{'homeowner'}.
- Thus the indices of the hash are strings, not numbers.
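A minimal sketch of the hash syntax described above; the keys and counts are invented for illustration and do not come from the survey data.

#!perl -w
# hashdemo.pl - illustrative sketch only, not one of the paper's programs
%count = ();                 # a hash of word counts, indexed by the words themselves
$count{'homeowner'} = 3;     # assign a value to the key 'homeowner'
$count{'claim'} = 5;
++$count{'claim'};           # increment an existing entry
# keys returns the hash indices (the words) in no particular order
foreach $word (keys %count) {
    print "$word appears $count{$word} times\n";
}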
Use a Hash: Distribution of Word Frequencies

Testhash.pl

#!perl -w
# Testhash.pl
# Usage: testhash.pl <datafile> <outputfile>
# input datafile must be present as a command line arg, such as Top2Iss.txt
open(MYDATA, $ARGV[0]) or die("Error: cannot open file '$ARGV[0]'\n");
# output datafile must be present as a command line arg
open(OUTP, ">$ARGV[1]") or die("Cannot open file '$ARGV[1]' for writing\n");
print OUTP "Output results for ".$ARGV[0]."\n";
# read in the file, get rid of newline and punctuation chars
while ($line = <MYDATA>) {
    chomp($line);
    # eliminate punctuation
    $line =~ s/[-.?!"()'{}&;]//g;
    $line =~ s/\s+/ /g;
    @words = split(/ /, $line);
    foreach $word (@words) {
        ++$counts{lc($word)};
    }
}
# sort by value (lowest to highest using counts for the key)
# and write to the output file and the screen
foreach $value (sort { $counts{$a} cmp $counts{$b} } keys %counts) {
    # print the word and the count for the word
    print "$value $counts{$value} \n";
    print OUTP "$value $counts{$value} \n";
}
# close the files
close MYDATA;
close OUTP;

Word Frequencies

  Rank   Word           Count   P(Rank=k)
  1      of             102     0.05
  2      the            80      0.04
  3      to             57      0.03
  4      and            53      0.03
  5      in             42      0.02
  6      actuaries      34      0.02
  7      other          27      0.01
  8      from           26      0.01
  9      for            25      0.01
  10     erm            24      0.01
  726    alternative    1       0.00
  727    thin           1       0.00
  728    information    1       0.00
  729    industries     1       0.00
  730    retire         1       0.00

Zipf's Law

Stop Words
- Frequently occurring words: the, a, to, it.
- They do not contribute to the meaning of a record of text, so eliminate them.
- Use the substitution operator. Thus, to eliminate the word "the", use the code:
  s/the//g;
- Apply the same approach to the multiple terms you want to eliminate, for example:
  s/[-.?!"()'{}&;]//g;

Term Document Matrix
- A table of indicator variables.
- If a word is present, a 1; otherwise a 0.

Term Data Matrix (each row shows a term's indicators across the first ten responses)

  Ourselves                 1 0 0 0 0 0 0 0 0 0
  cas                       0 0 0 1 0 0 0 0 0 0
  Not                       0 0 0 0 0 0 0 1 0 0
  That                      1 0 0 0 0 0 0 0 0 0
  communicators/executive   0 0 0 0 0 0 0 1 0 0
  our                       1 0 0 0 0 0 0 0 0 0
  approaches                0 1 0 0 0 0 0 0 0 0

Word Lengths

  Word Length   GL Data   Survey Data
  1             1,062     21
  2             4,172     309
  3             5,258     298
  4             5,418     215
  5             2,982     153
  6             2,312     143
  7             2,833     213
  8             1,572     161
  9             1,048     216
  10            591       146
  11            111       92
  12            156       44
  13            78        61
  14            19        2
  15            0         3
  16            1         1
  17            2         0
  18            1         0
  19            1         0

Stopwords.pl

# StopWords.pl
# This program eliminates stop words and computes the term-document matrix
# A key part is to tabulate the indicator/count of every term - usually a word
# It may then be used to find groupings of words that create content
# This would be done in a separate program
# Usage: termdata.pl <datafile> <outputfile>
$TheFile = "Top2Iss.txt";
# $Outp1 = "OutInd1.txt";
open(MYDATA, $TheFile) or die("Error: cannot open file");
open(OUTP1, ">OutInd1.txt") or die("Cannot open file for writing\n");
open(OUTP2, ">OutTerms.txt") or die("Cannot open file for writing\n");
# read in the file one line at a time and create a hash of words
# create grand dictionary of all words
# initialize line counter
$i = 0;
while (<MYDATA>) {
    chomp($_);
    s/[-.?!"()'{}&;]//g;
    s/^ //g;
    s/,//g;
    s/\d/ /g;
    s/(\sof\s)/ /g;
    s/(\sto\s)/ /g;
    s/(\sthe\s)/ /g;
    s/(\sand\s)/ /g;
    s/(\sin\s)/ /g;
    s/(The\s)/ /g;
    s/(\sfor\s)/ /g;
    s/(\sa\s)/ /g;
    s/(A\s)/ /g;
    s/(\sin\s)/ /g;
    s/(\swith\s)/ /g;
    s/(\san\s)/ /g;
    s/(\swith\s)/ /g;
    s/(\sare\s)/ /g;

Stopwords.pl cont.

    s/(\sthey\s)/ /g;
    s/(\sthan\s)/ /g;
    s/(\sas\s)/ /g;
    s/(\sby\s)/ /g;
    s/\s+/ /g;
    if (not /^$/) {   # ignore empty lines
        @words = split(/ /);
        foreach $word (@words) {
            ++$response[$i]{lc($word)};
            ++$granddict{lc($word)};
        }
        ++$i;
    }
}
$nlines = $i - 1;
# write one row of the term-document indicator matrix for each response
for $i (0..$nlines) {
    foreach $word (keys %granddict) {
        if (exists($response[$i]{$word})) { ++$indicator[$i]{$word}; }
        else { $indicator[$i]{$word} = 0; }
        print OUTP1 "$indicator[$i]{$word},";
    }
    print OUTP1 "\n";
}
# write each term and its overall count
foreach $word (keys %granddict) {
    print OUTP2 "$word,$granddict{$word}\n";
}
# close the files
close MYDATA;
close OUTP1;
close OUTP2;

Output Matrix

  1 0 0 0 0 0 0 0 0 0
  0 0 0 1 0 0 0 0 0 0
  0 0 0 0 0 0 0 1 0 0
  1 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 1 0 0
  1 0 0 0 0 0 0 0 0 0
  0 1 0 0 0 0 0 0 0 0
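The comments in Stopwords.pl note that the indicator matrix "may then be used to find groupings of words that create content" in a separate program. As one possible next step, the sketch below reads the two files written by Stopwords.pl and reports how many responses contain each term. It assumes the file layout produced by the code above (comma-separated indicators in OutInd1.txt, one response per row, and "word,count" lines in OutTerms.txt, with columns and lines written in the same keys %granddict order); the program name readmatrix.pl is our own and is not part of the paper.

#!perl -w
# readmatrix.pl - illustrative sketch only, not one of the paper's programs
# read the term list so the columns of the indicator matrix can be labeled
open(TRM, "OutTerms.txt") or die("Cannot open OutTerms.txt\n");
@terms = ();
while (<TRM>) {
    chomp;
    ($word) = split(/,/);   # keep the word, drop its overall count
    push @terms, $word;
}
close TRM;
# sum each column of the indicator matrix
open(IND, "OutInd1.txt") or die("Cannot open OutInd1.txt\n");
@colsum = ();
while (<IND>) {
    chomp;
    @cells = split(/,/);
    for $j (0..$#cells) { $colsum[$j] = ($colsum[$j] || 0) + $cells[$j]; }
}
close IND;
# report the number of responses containing each term
for $j (0..$#colsum) {
    print "$terms[$j] appears in $colsum[$j] responses\n";
}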