The Google Books n-gram corpora are available as a public dataset on Amazon Web Services (AWS). This dataset is commonly used for testing and learning big-data tools.
The format for each line is:
ngram TAB year TAB match_count TAB volume_count NEWLINE
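As a minimal sketch of reading this format in Python, the snippet below splits one line on tab characters and converts the three numeric fields to integers. The function name `parse_ngram_line` and the sample line are made up for illustration; they are not part of the corpus or of the exercise skeleton.

```python
def parse_ngram_line(line):
    """Split one corpus line into (ngram, year, match_count, volume_count)."""
    # Strip only the trailing newline so the n-gram text itself stays untouched.
    ngram, year, match_count, volume_count = line.rstrip("\n").split("\t")
    return ngram, int(year), int(match_count), int(volume_count)


# Made-up sample line, for illustration only.
sample = "a small test of parsing\t1999\t12\t7\n"
print(parse_ngram_line(sample))
```

Note that the split uses the tab character `"\t"`, since the n-gram itself contains spaces.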
In this exercise, we will develop an MRJob job to find the longest 5-gram (counting all characters except whitespace) in a small test dataset. Next week we will use EMR to find the longest 5-gram at large scale.
Question 5: What is an n-gram? Name one application of n-grams.
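As a hint for what the term refers to at the word level, the sketch below lists every run of n consecutive words in a sentence. The helper name `word_ngrams` and the example sentence are hypothetical and chosen only for illustration.

```python
def word_ngrams(text, n):
    """Return every run of n consecutive words in text, as tuples."""
    words = text.split()
    # A text with fewer than n words yields an empty list.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


# Made-up example sentence: its 2-grams (bigrams).
print(word_ngrams("the quick brown fox", 2))
```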
Question 6: MRJob for the longest 5-gram. Please fill in the parts marked "code".
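# -*- coding: utf-8 -*-
from mrjob.protocol import RawProtocol
from mrjob.job import MRJob
from mrjob.step import MRStep

# (class definition not shown in the source)

    # (steps method definition not shown in the source)
        return [MRStep(mapper=self.mapper,
                       reducer=self.reducer)]

    def mapper(self, _, line):
        line = line.strip()
        # Fields are tab-separated, so split on the tab character "\t"
        line = line.split("\t")
        #--Get n_gram and count code
        for word in words:
            # (loop body not shown in the source)

    def reducer(self, length, n_gram):
        # (reducer body not shown in the source)

if __name__ == '__main__':
    # (run statement not shown in the source)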
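To make the intended data flow concrete, here is a plain-Python simulation of the map and reduce steps that runs without mrjob. It is a sketch under assumptions, not the expected solution: it assumes the mapper emits (length, n-gram) pairs, as the `reducer(self, length, n_gram)` signature suggests, and that length means the number of non-whitespace characters. The names `ngram_length` and `longest_ngrams` and the test lines are made up for illustration.

```python
from collections import defaultdict


def ngram_length(ngram):
    """Count the characters of an n-gram, ignoring all whitespace."""
    return sum(len(word) for word in ngram.split())


def mapper(line):
    """Emit (length, ngram) for one tab-separated corpus line."""
    fields = line.strip().split("\t")
    ngram = fields[0]
    yield ngram_length(ngram), ngram


def reducer(length, ngrams):
    """Collapse repeated n-grams (the corpus has one line per year)."""
    yield length, sorted(set(ngrams))


def longest_ngrams(lines):
    """Run map, group-by-key, and reduce in memory, then keep the largest key."""
    groups = defaultdict(list)
    for line in lines:
        for length, ngram in mapper(line):
            groups[length].append(ngram)
    reduced = dict(pair for length, values in groups.items()
                   for pair in reducer(length, values))
    best = max(reduced)
    return best, reduced[best]


# Made-up test lines, for illustration only.
lines = [
    "a b c d e\t1990\t5\t3\n",
    "alpha beta gamma delta epsilon\t1991\t2\t1\n",
    "alpha beta gamma delta epsilon\t1992\t4\t2\n",
    "one two three four five\t1990\t9\t4\n",
]
print(longest_ngrams(lines))
```

In a distributed job, each reducer call sees only one key, so the final "keep the largest key" step would need its own stage (for example, a second step that funnels all lengths to a single key).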