EstimatingSizeOfTheInternet

1. Estimating Search Result Sizes using Probability

We can use some basic facts about probability, together with some terms that are "likely to be independent", to estimate the size of the Internet. Actually, it is not the size of the Internet that we will estimate; it is the size of the part of the Internet that is indexed by the particular search engine that we are using. For example, we can search for the documents that contain any one of the words "oreo", "rock" and "fish".

Before we start, let's think about what probability theory tells us. We will use the expression $p_R$ for the probability that a web page has the word "rock" in it (and similarly $p_O$ and $p_F$ for "oreo" and "fish"). Now, from the basic definition of probability, the number of pages $N_W$ that have a particular word, $W$, is related to the probability of that word by the equation:

$$N_W = p_W X$$

where $X$ is the (unknown) size of the whole index. So we can write three equations:

$$N_O = p_O X, \qquad N_R = p_R X, \qquad N_F = p_F X$$

and we can multiply these together to get:

$$N_O N_R N_F = (p_O \, p_R \, p_F \, X) \, X^2$$

I have rearranged the factors, and broken the three factors of $X$ into $X$ and $X^2$. We will see why in a moment.

Now, if these words are really unrelated to each other, then the event that one of them appears on a page is independent of the event that another one of them appears, and the probability that all three appear in the same document is the product of the probabilities that the individual words occur. This is the rule that if events are independent, then the probabilities multiply. So we can write "ORF" to mean "all three words occur". If they are independent, then we know that:

$$p_{ORF} = p_O \, p_R \, p_F$$

So we can say that the number of documents that have all three words is:

$$N_{ORF} = p_{ORF} X = p_O \, p_R \, p_F \, X$$
Notice now that this expression is exactly what appears in our earlier equation for the product of the three numbers, based on how often the individual words appear.
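
One way to carry this observation through, as a sketch rather than a definitive method: if the number of pages containing all three words, N_ORF, equals p_O p_R p_F X, then the product of the three single-word counts equals N_ORF times X squared, and X is the square root of (N_O N_R N_F / N_ORF). The Python snippet below computes that value. The function name `estimate_index_size` and all of the counts are hypothetical, made up so that the arithmetic is easy to check; they are not real search-engine results, and the estimate is only as good as the assumption that the three words are independent.

```python
import math


def estimate_index_size(n_o, n_r, n_f, n_orf):
    """Estimate the index size X from four result counts.

    Assumes the three words occur independently, so that
    n_o * n_r * n_f = n_orf * X**2.
    """
    if n_orf <= 0:
        # With no pages containing all three words, the ratio is undefined.
        raise ValueError("n_orf must be positive")
    return math.sqrt(n_o * n_r * n_f / n_orf)


# Made-up counts (NOT real search results), consistent with an index of
# 1,000,000 pages and p_O = 0.01, p_R = 0.02, p_F = 0.05.
n_o, n_r, n_f = 10_000, 20_000, 50_000
n_orf = 10  # 0.01 * 0.02 * 0.05 * 1,000,000

print(estimate_index_size(n_o, n_r, n_f, n_orf))  # 1000000.0
```

In practice the four inputs would come from the result counts a search engine reports for each single-word query and for the query containing all three words.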