Quiz time

a. One of the most important statistical properties of text is Zipf’s law. Write down Zipf’s law: [5pts]

b. Explain Zipf’s law in a sentence or two. [5pts]

c. Now we’re going to consider Zipf’s law in action. First, suppose that you have a collection of extremely simple children’s books. There are only four different words in your collection: alice, banana, chocolate, dandelion. There are no other words. Suppose there are 5,000 tokens in your collection and that the frequency order is alice > banana > chocolate > dandelion. Assuming that Zipf’s law holds exactly for this collection, what are the frequencies of the four words? [10pts]
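As a warm-up for this kind of calculation, here is a minimal sketch on a different, made-up collection (3 distinct words, 1,100 tokens). It assumes the simplest form of Zipf's law, where the frequency of the word at rank r is C / r with exponent exactly 1. The function name `zipf_frequencies` and the toy numbers are illustrative only and are not part of the quiz.

```python
from fractions import Fraction

def zipf_frequencies(total_tokens, vocab_size):
    """Frequencies by rank, assuming f(r) = C / r holds exactly (exponent 1)."""
    # The frequencies must sum to the token count:
    #   C * (1 + 1/2 + ... + 1/vocab_size) = total_tokens
    harmonic = sum(Fraction(1, r) for r in range(1, vocab_size + 1))
    c = Fraction(total_tokens) / harmonic
    # Exact rational arithmetic avoids floating-point rounding.
    return [c / r for r in range(1, vocab_size + 1)]

# Hypothetical toy collection: 3 distinct words, 1,100 tokens.
freqs = zipf_frequencies(1100, 3)
print([int(f) for f in freqs])  # → [600, 300, 200]
```

The only step that needs care is solving for the constant C from the total token count; once C is known, each frequency follows by dividing by the rank.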
Document 1: “The Longhorns? The Longhorns? No ... Aggies! Aggies! Aggies! Aggies!”
Document 2: “Aggies? Aggies? The Aggies will defeat the Longhorns. :-)”

Remove all punctuation and perform case folding (convert everything to lowercase). Now write down the modified documents: [4pts]

Now, we’re going to view our documents in terms of the vector space model. How many dimensions (axes) are there in the vector space

