1995 using information content to evaluate semantic


simpath(nickel,money) = 1/6 = .17
simpath(coinage,Richter scale) = 1/6 = .17

Dan Jurafsky

Problem with basic path-based similarity
•  Assumes each link represents a uniform distance
•  But nickel to money seems to us to be closer than nickel to standard
•  Nodes high in the hierarchy are very abstract
•  We instead want a metric that
   •  Represents the cost of each edge independently
   •  Words connected only through abstract nodes are less similar

Information content similarity metrics
Resnik 1995. Using information content to evaluate semantic similarity in a taxonomy. IJCAI
•  Let's define P(c) as:
   •  The probability that a randomly selected word in a corpus is an instance of concept c
   •  Formally: there is a distinct random variable, ranging over words, associated with each concept in the hierarchy
   •  For a given concept, each observed noun is either
      •  a member of that concept with probability P(c)
      •  not a member of that concept with probability 1-P(c)
•  All words are members of the root node (Entity)
   •  P(root)=1
•  The lower a node in hierarchy, the lower its probability

Information content similarity
[Figure: hierarchy fragment, top to bottom: entity; …; geological-formation; natural elevation, cave, shore; hill, ridge, grotto, coast]
•  Train by counting in a corpus
   •  Each instance of hill counts toward frequency of natural elevation, geological formation, entity, etc.
   •  Let words(c) be the set of all words that are children of node c
      •  words("geo-formation") = {hill,ridge,grotto,coast,cave,shore,natural elevation}
      •  words("natural elevation") = {hill, ridge}

$$P(c) = \frac{\sum_{w \in \mathrm{words}(c)} \mathrm{count}(w)}{N}$$

Information content similarity
•  WordNet hierarchy augmented with probabilities P(c)
D. Lin. 1998. An Information-Theoretic Definition of Similarity. ICML 1998

Information content: definitions
•  Information content: IC(c) = -log P(c)
•  Most informative subsumer (Lowest common subsumer)
   LCS(c1,c2) = The most informative (lowest) node in the hierarchy subsum...
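
The counting scheme for P(c) and the definition IC(c) = -log P(c) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the edges in `CHILDREN` are a guess at the example hierarchy (the elided levels between the root and the geological formation node are collapsed into one edge), the counts in `TOY_COUNTS` are invented, and N is taken to be the total count of all words below the root so that P(root) = 1. The function names are hypothetical, not from the slides.

```python
import math

# Assumed hierarchy for illustration only (parent -> children).
# The slide elides the levels between the root and geological-formation;
# they are collapsed into a single edge here.
CHILDREN = {
    "entity": ["geological-formation"],
    "geological-formation": ["natural elevation", "cave", "shore"],
    "natural elevation": ["hill", "ridge"],
    "cave": ["grotto"],
    "shore": ["coast"],
}

# Invented corpus counts, not real data.
TOY_COUNTS = {
    "hill": 4, "ridge": 2, "grotto": 1, "coast": 3,
    "cave": 2, "shore": 3, "natural elevation": 1,
}


def words(c):
    """Return every word below concept c in the hierarchy."""
    below = set()
    for child in CHILDREN.get(c, []):
        below.add(child)
        below |= words(child)
    return below


def prob(c, counts, root="entity"):
    """P(c): summed counts of words(c) divided by N.

    Assumption: N is the total count of all words below the root.
    """
    n = sum(counts.get(w, 0) for w in words(root))
    return sum(counts.get(w, 0) for w in words(c)) / n


def information_content(c, counts):
    """IC(c) = -log P(c); infinite when P(c) is zero."""
    p = prob(c, counts)
    return math.inf if p == 0 else -math.log(p)


if __name__ == "__main__":
    for concept in ["entity", "geological-formation", "natural elevation"]:
        print(concept, prob(concept, TOY_COUNTS),
              information_content(concept, TOY_COUNTS))
```

Because `words(c)` here contains only the words below c, a leaf such as hill has an empty word set and gets P = 0 and infinite IC in this sketch. A practical implementation would likely also count a word toward its own concept, but that choice goes beyond what the slide states.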

This document was uploaded on 02/14/2014.
