CS480 Principles of Data Management, Spring 2013 (2/22/13)
Sangmi Lee Pallickara, CS480, Spring 2012

... 0)
–  MARTHA vs. MARHTA (t = 2)
•  m: number of matching characters within the maximum distance (0.5 × min(|s1|, |s2|))

Jaro Similarity: example - continued
•  Example
–  σ = {r, ., _, J, o, h, n, _, D, o, e}
–  where _ denotes a space character
•  |s1| = 14
•  |s2| = 12
•  The maximum distance between two common characters is 0.5 × min(12, 14) = 6
•  m = 11
•  t = 0

Jaro Similarity: example - continued
•  The final Jaro similarity is

\[
\mathrm{JaroSim}(s_1, s_2) = \frac{1}{3} \times \left( \frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - 0.5\,t}{m} \right)
= \frac{1}{3} \times \left( \frac{11}{14} + \frac{11}{12} + \frac{11 - 0}{11} \right) \approx 0.9
\]

•  Performs well for strings with slight spelling variations
•  Does not cope well with longer strings separating common characters
–  It has the restriction that common characters have to occur within a certain distance from each other
–  e.g., Professor_John_Doe and John_Doe

Jaro-Winkler Similarity
•  The problems that occur in the Jaro Similarity happen very often in a person's name.
•  Jaro-Winkler similarity
–  Extension of the...
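
The Jaro computation described in the slides above (count m matching characters within a window of 0.5 × min(|s1|, |s2|), count t matched characters that are out of order, then average the three ratios) can be sketched in Python. This is a minimal sketch under stated assumptions, not the course's reference implementation: the function names `jaro_from_counts` and `jaro_similarity` are hypothetical, and the greedy left-to-right matching is an assumption, since the preview does not spell out the matching procedure. The window follows the slides' definition; other formulations commonly use floor(max(|s1|, |s2|) / 2) − 1 instead.

```python
def jaro_from_counts(m, t, len1, len2):
    """Three-term average from the slides.

    m: number of matching characters, t: number of matched characters
    that are out of order (so 0.5 * t is the number of transpositions).
    """
    if m == 0:
        return 0.0
    return (m / len1 + m / len2 + (m - 0.5 * t) / m) / 3.0


def jaro_similarity(s1, s2):
    """Jaro similarity using the slides' window, 0.5 * min(|s1|, |s2|).

    Assumption: characters are matched greedily, left to right.
    """
    if not s1 or not s2:
        return 0.0
    window = min(len(s1), len(s2)) // 2
    used = [False] * len(s2)   # positions of s2 that are already matched
    matched1 = []              # matched characters in s1 order
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not used[j] and s2[j] == ch:
                used[j] = True
                matched1.append(ch)
                break
    # The same matched characters, read in s2 order.
    matched2 = [s2[j] for j in range(len(s2)) if used[j]]
    m = len(matched1)
    t = sum(a != b for a, b in zip(matched1, matched2))
    return jaro_from_counts(m, t, len(s1), len(s2))


if __name__ == "__main__":
    # Slide example: m = 11, t = 0, |s1| = 14, |s2| = 12
    print(round(jaro_from_counts(11, 0, 14, 12), 4))           # 0.9008
    # MARTHA vs. MARHTA: m = 6, t = 2
    print(round(jaro_similarity("MARTHA", "MARHTA"), 4))       # 0.9444
    # Low score even though the second string is a suffix of the first
    print(round(jaro_similarity("Professor John Doe", "John Doe"), 4))
```

The last call illustrates the restriction noted in the slides: the long prefix "Professor" pushes most of the shared characters outside the window, so only a few of them count as matches.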