6 Mining the Web
Outline:
1. Dynamic itemset counting: Searching for interesting sets of items in a space too large ever to consider even each pair of items.
2. "Books and authors": Sergey Brin's intriguing experiment to mine the Web for relational data.

6.1 Finding Unusual Itemsets

The problem is to find sets of words that appear together "unusually often" on the Web, e.g., "New" and "York" or {"Duchess", "of", "York"}. "Unusually often" can be defined in various ways, in order to capture the idea that the number of Web documents containing the set of words is much greater than what one would expect if words were sprinkled at random, each word with its own probability of occurrence in a document. One appropriate way is entropy per word in the set. Formally, the interest of a set of words $S$ is

$$\frac{1}{|S|} \log_2 \frac{\mathrm{prob}(S)}{\prod_{w \in S} \mathrm{prob}(w)}$$

Note that we divide by the size of $S$ to avoid the "Bonferroni effect," where there are so many sets of a given size that some, by chance alone, will appear to be correlated.

Example: If words $a$, $b$, and $c$ each appear in 1% of all documents, and $S = \{a, b, c\}$ appears in 0.1% of documents, then the interestingness of $S$ is

$$\frac{1}{3} \log_2 \frac{.001}{.01 \times .01 \times .01} = \frac{\log_2 1000}{3}$$

or about 3.3.

Technical problem: interest is not monotone, or "downwards closed," the way high support is. That is, we can have a set $S$ with a high value of interest, yet some, or even all, of its immediate proper subsets are not interesting. In contrast, if $S$ has high support, then all of its subsets have support at least as high.

Technical problem: With more than $10^8$ different words appearing in the Web, it is not possible even to consider all pairs of words.
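As a small illustration of the interest measure defined in Section 6.1, the sketch below evaluates it for the example with three words. The function name `interest` and its argument layout are assumptions made for this illustration; they do not come from the notes.

```python
import math

def interest(prob_set, word_probs):
    """Interest of a word set: log2(observed / expected) divided by set size.

    prob_set   -- fraction of documents containing every word of the set
    word_probs -- fraction of documents containing each individual word
    """
    # Expected co-occurrence if the words were sprinkled independently.
    expected = math.prod(word_probs)
    return math.log2(prob_set / expected) / len(word_probs)

# The example from the text: a, b, c each in 1% of documents,
# and {a, b, c} in 0.1% of documents.
value = interest(0.001, [0.01, 0.01, 0.01])
print(round(value, 1))  # about 3.3, as in the text
```

Dividing by `len(word_probs)` is the per-word normalization that the text motivates with the Bonferroni effect.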
6.2 The DICE Engine

DICE (dynamic itemset counting engine) repeatedly visits the pages of the Web, in a round-robin fashion. At all times, it is counting occurrences of certain sets of words, and of the individual words in those sets. The number of sets being counted is small enough that the counts fit in main memory.

From time to time, say every 5000 pages, DICE reconsiders the sets that it is counting. It throws away those sets that have the lowest interest, and replaces them with other sets.

The choice of new sets is based on the heavy edge property, which is an experimentally justified observation that those words that appear in a high-interest set are more likely than others to appear in other high-interest sets. Thus, when selecting new sets to start counting, DICE is biased in favor of words that already appear in high-interest sets. However, it does not rely on those words exclusively, or else it could never find high-interest sets composed of the many words it has never looked at. Some (but not all) of the constructions that DICE uses to create new sets are:

1. Two random words. This is the only rule that is independent of the heavy edge assumption, and helps new words get into the pool.
2. A word in one of the interesting sets and one random word.
3. Two words from two different interesting pairs.
4. The union of two interesting sets whose intersection is of size 2 or more.
5. $\{a, b, c\}$ if all of $\{a, b\}$, $\{a, c\}$, and $\{b, c\}$ are found to be interesting.
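The five constructions listed above might be sketched as follows. This is not the DICE implementation: the function `propose_sets`, the uniform random choice among rules, and the `max_tries` bound are all assumptions made for illustration, and the word lists are invented sample data.

```python
import random

def propose_sets(interesting, vocabulary, how_many, rng, max_tries=10_000):
    """Propose new word sets to count, drawing at random from five rules."""
    interesting = [frozenset(s) for s in interesting]
    known = set(interesting)
    pairs = [s for s in interesting if len(s) == 2]

    def pick_word(s):
        return rng.choice(sorted(s))

    def rule1():  # two random words
        return frozenset(rng.sample(vocabulary, 2))

    def rule2():  # a word from an interesting set plus one random word
        if not interesting:
            return None
        return frozenset({pick_word(rng.choice(interesting)),
                          rng.choice(vocabulary)})

    def rule3():  # one word from each of two different interesting pairs
        if len(pairs) < 2:
            return None
        p, q = rng.sample(pairs, 2)
        return frozenset({pick_word(p), pick_word(q)})

    def rule4():  # union of two interesting sets sharing 2 or more words
        if len(interesting) < 2:
            return None
        s, t = rng.sample(interesting, 2)
        return s | t if len(s & t) >= 2 else None

    def rule5():  # {a, b, c} when {a, b}, {a, c}, {b, c} are all interesting
        if len(pairs) < 2:
            return None
        p, q = rng.sample(pairs, 2)
        triple = p | q
        # p ^ q is the third pair, e.g. {b, c} for p = {a, b}, q = {a, c}.
        return triple if len(triple) == 3 and (p ^ q) in known else None

    rules = [rule1, rule2, rule3, rule4, rule5]
    proposed = set()
    for _ in range(max_tries):
        if len(proposed) >= how_many:
            break
        candidate = rng.choice(rules)()  # assumption: rules chosen uniformly
        if candidate and len(candidate) >= 2 and candidate not in known:
            proposed.add(candidate)
    return proposed

# Invented sample data, for illustration only.
interesting_sets = [{"new", "york"}, {"new", "jersey"}, {"york", "jersey"},
                    {"duchess", "of", "york"}, {"duchess", "of", "kent"}]
words = ["new", "york", "jersey", "duchess", "of", "kent", "city", "web"]
new_sets = propose_sets(interesting_sets, words, 5, random.Random(0))
```

A rule that cannot fire (for instance, rule 5 when fewer than two interesting pairs exist) simply returns `None` and another rule is tried on the next iteration.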
Of course, there are generally too many options to do all of the above in all possible ways, so a random selection among options, giving some choices to each of the rules, is used.

6.3 Books and Authors

The general idea is to search the Web for facts of a given type, typically what might form the tuples of a relation such as Books(title, author). The computation is suggested by Fig. 13.
[Figure 13: Extracting relations from the Web. Diagram labels: Sample data, Find patterns, Current patterns, Find data, Current data.]

1. Start with a sample of the tuples one would like to find. In the example discussed in the Brin paper, five examples of book titles and their authors were used.
2. Given a set of known examples, find where that data appears on the Web. If a pattern is found that identifies several examples of known tuples, and is sufficiently specific that it is unlikely to identify too much, then accept this pattern.
3. Given a set of accepted patterns, find the data that appears in these patterns, and add it to the set of known data.
4. Repeat steps 2 and 3 several times. In the example cited, four rounds were used, leading to 15,000 tuples; about 95% were true title-author pairs.

6.4 What is a Pattern?

The notion suggested consists of five elements:

1. The order; i.e., whether the title appears prior to the author in the text, or vice-versa. In a more general case, where tuples have more than 2 components, the order would be the permutation of components.
2. The URL prefix.
3. The prefix of text, just prior to the first of the title or author.
4. The middle: text appearing between the two data elements.
5. The suffix of text following the second of the two data elements.

Both the prefix and suffix were limited to 10 characters.

Example 6.1: A possible pattern might consist of the following:
1. Order: title then author.
2. URL prefix: www.stanford.edu/class/
3. Prefix, middle, and suffix of the following form:

   <LI><I>title</I> by author<P>

   Here the prefix is <LI><I>, the middle is </I> by (including the blank after "by"), and the suffix is <P>. The title is whatever appears between the prefix and middle; the author is whatever appears between the middle and suffix.

The intuition behind why this pattern might be good is that there are probably lots of reading lists among the class pages at Stanford. □

To focus on patterns that are likely to be accurate, Brin used several constraints on patterns, as follows:

Let the specificity of a pattern be the product of the lengths of the prefix, middle, suffix, and URL prefix. Roughly, the specificity measures how likely we are to find the pattern; the higher the specificity, the fewer occurrences we expect.

Then a pattern must meet two conditions to be accepted:

1. There must be at least 2 known data items that appear in this pattern.
2. The product of the specificity of the pattern and the number of occurrences of data items in the pattern must exceed a certain threshold $T$ (not specified).
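To make the specificity measure and the two acceptance conditions concrete, here is a minimal sketch. The class `Pattern`, the order label `"title-author"`, the function names, and the threshold value 5000 are assumptions for illustration; the notes leave the threshold unspecified. The sample pattern follows Example 6.1, with the HTML tags and URL slashes written out as this sketch assumes them.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Pattern:
    order: str        # "title-author" or "author-title" (labels assumed here)
    url_prefix: str
    prefix: str
    middle: str
    suffix: str

def specificity(p: Pattern) -> int:
    """Product of the lengths of the prefix, middle, suffix, and URL prefix."""
    return len(p.prefix) * len(p.middle) * len(p.suffix) * len(p.url_prefix)

def is_accepted(p: Pattern, n_items: int, n_occurrences: int, threshold: int) -> bool:
    """Condition 1: at least 2 known data items appear in the pattern.
    Condition 2: specificity times the number of occurrences exceeds the threshold."""
    return n_items >= 2 and specificity(p) * n_occurrences > threshold

# The pattern of Example 6.1 (reconstructed form, see lead-in).
example = Pattern("title-author", "www.stanford.edu/class/",
                  "<LI><I>", "</I> by ", "<P>")
print(specificity(example))  # 7 * 8 * 3 * 23 = 3864

T = 5000  # hypothetical threshold, chosen only for this demonstration
print(is_accepted(example, n_items=2, n_occurrences=2, threshold=T))  # True
print(is_accepted(example, n_items=1, n_occurrences=1, threshold=T))  # False
```

Because the lengths are multiplied, a pattern with an empty prefix, middle, or suffix has specificity 0 and can never pass the second condition.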
6.5 Data Occurrences

An occurrence of a tuple is associated with a pattern in which it occurs; i.e., the same title and author might appear in several different patterns. Thus, a data occurrence consists of:

1. The particular title and author.
2. The complete URL, not just the prefix as for a pattern.
3. The order, prefix, middle, and suffix of the pattern in which the title and author occurred.

6.6 Finding Data Occurrences Given Data

If we have some known title-author pairs, our first step in finding new patterns is to search the Web to see where these titles and authors occur. We assume that there is an index of the Web, so given a word, we can find pointers to all the pages containing that word. The method used is essentially a-priori:

1. Find pointers to all those pages containing any known author. Since author names generally consist of 2 words, use the index for each first name and last name, and check that the occurrences are consecutive in the document.
2. Find pointers to all those pages containing any known title. Start by finding pages with each word of a title, and then checking that the words appear in order on the page.
3. Intersect the sets of pages that have an author and a title on them. Only these pages need to be searched to find the patterns in which a known title-author pair is found. For the prefix and suffix, take the 10 surrounding characters, or fewer if there are not as many as 10.

6.7 Building Patterns from Data Occurrences

1. Group the data occurrences according to their order and middle. For example, one group in the "group-by" might correspond to the order "title-then-author" and the middle </I> by .
2. For each group, find the longest common prefix, suffix, and URL prefix.
3. If the specificity test for this pattern is met, then accept the pattern.
4. If the specificity test is not met, then try to split the group into two by extending the length of the URL prefix by one character, and repeat from step 2.
   If it is impossible to split the group (because there is only one URL), then we fail to produce a pattern from the group.
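The grouping and splitting procedure of Section 6.7 can be sketched as below. Several things here are assumptions rather than statements from the notes: the tuple types `Occurrence` and `Pattern`, the threshold value, the choice to take the shared tail of the stored prefixes and the shared head of the stored suffixes (the characters nearest the data), and the decision to bucket by the next URL character, which may yield more than two subgroups. The titles and authors are invented placeholders; the URLs are two of the Stanford class URLs used in Example 6.2, with slashes written out.

```python
from collections import defaultdict, namedtuple
from os.path import commonprefix

Occurrence = namedtuple("Occurrence", "title author url order prefix middle suffix")
Pattern = namedtuple("Pattern", "order url_prefix prefix middle suffix")

def common_suffix(strings):
    """Longest string that every input ends with."""
    return commonprefix([s[::-1] for s in strings])[::-1]

def passes_test(pattern, group, threshold):
    """Specificity test from the text, with a caller-chosen threshold."""
    spec = (len(pattern.prefix) * len(pattern.middle)
            * len(pattern.suffix) * len(pattern.url_prefix))
    n_items = len({(o.title, o.author) for o in group})
    return n_items >= 2 and spec * len(group) > threshold

def patterns_from_group(order, middle, group, threshold):
    urls = [o.url for o in group]
    candidate = Pattern(
        order=order,
        url_prefix=commonprefix(urls),
        prefix=common_suffix([o.prefix for o in group]),   # assumed: shared tail
        middle=middle,
        suffix=commonprefix([o.suffix for o in group]),    # assumed: shared head
    )
    if passes_test(candidate, group, threshold):
        return [candidate]
    if len(set(urls)) == 1:
        return []  # only one URL: the group cannot be split, so no pattern
    # Extend the URL prefix by one character and regroup.
    cut = len(candidate.url_prefix) + 1
    buckets = defaultdict(list)
    for o in group:
        buckets[o.url[:cut]].append(o)
    found = []
    for sub in buckets.values():
        found.extend(patterns_from_group(order, middle, sub, threshold))
    return found

def build_patterns(occurrences, threshold):
    groups = defaultdict(list)
    for o in occurrences:
        groups[(o.order, o.middle)].append(o)  # step 1: group by order and middle
    found = []
    for (order, middle), group in groups.items():
        found.extend(patterns_from_group(order, middle, group, threshold))
    return found

# Placeholder data: five occurrences on one page, one on another.
base = "www.stanford.edu/class/"
occs = [Occurrence(f"Title {i}", f"Author {i}", base + "cs345/index.html",
                   "title-author", "<LI><I>", "</I> by ", "<P>") for i in range(5)]
occs.append(Occurrence("Title 5", "Author 5", base + "cs145/intro.html",
                       "title-author", "<LI><I>", "</I> by ", "<P>"))
patterns = build_patterns(occs, threshold=30000)  # hypothetical threshold
```

With this placeholder data the whole group fails the test at the chosen threshold, the split on the character after "cs" separates the two pages, and only the five-occurrence subgroup yields a pattern; the single occurrence on the other page fails because it has one data item and one URL.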
Example 6.2: Suppose our group contains the three URLs:

   www.stanford.edu/class/cs345/index.html
   www.stanford.edu/class/cs145/intro.html
   www.stanford.edu/class/cs140/readings.html

The common prefix is www.stanford.edu/class/cs. If we have to split the group, then the next character, 3 versus 1, breaks the group into two, with those data occurrences in the first page (there could be many such occurrences) going into one group, and those occurrences on the other two pages going into another. □

6.8 Finding Occurrences Given Patterns

1. Find all URLs that match the URL prefix in at least one pattern.
2. For each of those pages, scan the text using a regular expression built from the pattern's prefix, middle, and suffix.
3. Extract from each match the title and author, according to the order specified in the pattern.
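The three steps of Section 6.8 might look like the sketch below. It is a simplification under stated assumptions: a dictionary from URL to page text stands in for the Web index, the lazy wildcard `(.*?)` stands in for whatever character classes the original experiment used for titles and authors, and the names `Pattern`, `pattern_regex`, and `find_tuples` are invented here. The sample page is made up for the demonstration.

```python
import re
from collections import namedtuple

Pattern = namedtuple("Pattern", "order url_prefix prefix middle suffix")

def pattern_regex(p):
    """Regular expression built from the pattern's prefix, middle, and suffix."""
    return re.compile(
        re.escape(p.prefix) + r"(.*?)" + re.escape(p.middle)
        + r"(.*?)" + re.escape(p.suffix)
    )

def find_tuples(pages, patterns):
    """pages maps URL -> page text; returns a set of (title, author) pairs."""
    found = set()
    for p in patterns:
        rx = pattern_regex(p)
        for url, text in pages.items():
            if not url.startswith(p.url_prefix):  # step 1: URL prefix must match
                continue
            for first, second in rx.findall(text):  # step 2: scan the page
                # Step 3: assign the fields according to the pattern's order.
                if p.order == "title-author":
                    found.add((first, second))
                else:
                    found.add((second, first))
    return found

# Made-up page contents for illustration.
pages = {
    "www.stanford.edu/class/cs345/index.html":
        "<UL><LI><I>Foundation</I> by Isaac Asimov<P>"
        "<LI><I>Dune</I> by Frank Herbert<P></UL>",
    "www.example.com/list.html":
        "<LI><I>Ignored</I> by Nobody<P>",
}
reading_list = Pattern("title-author", "www.stanford.edu/class/",
                       "<LI><I>", "</I> by ", "<P>")
tuples = find_tuples(pages, [reading_list])
```

Here the second page is skipped because its URL does not start with the pattern's URL prefix, even though its text would match the regular expression.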
This note was uploaded on 01/31/2011 for the course CS 345 taught by Professor Dunbar,a during the Fall '07 term at UC Davis.