The Google Books n-gram corpora are available as a public data set on AWS (Amazon Web Services). This data set is commonly used for testing and learning big-data tools.
The format for each line is:
ngram TAB year TAB match_count TAB volume_count NEWLINE
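As a small illustration of the format above, one line can be split on tabs into its four fields. This is only a sketch: the names `NgramRecord` and `parse_line` and the sample line are made up for the example and are not part of the data set or of any library.

```python
from typing import NamedTuple


class NgramRecord(NamedTuple):
    """One line of the corpus (hypothetical helper type for this example)."""
    ngram: str
    year: int
    match_count: int
    volume_count: int


def parse_line(line: str) -> NgramRecord:
    """Split a tab-separated corpus line into its four fields."""
    # Strip only the trailing newline so the spaces inside the n-gram survive.
    ngram, year, match_count, volume_count = line.rstrip("\n").split("\t")
    return NgramRecord(ngram, int(year), int(match_count), int(volume_count))


# Hypothetical sample line, invented for illustration only.
record = parse_line("one two three four five\t1999\t7\t4\n")
print(record.ngram, record.year, record.match_count, record.volume_count)
```

The n-gram itself contains spaces, so the split has to be on the tab character rather than on general whitespace.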
In this exercise, we will develop an MRJob job to find the longest 5-gram (counting all characters except whitespace) in a small test data set. Next week we will use EMR to find the longest 5-gram at large scale.
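The idea can be tried out in plain Python before writing any MapReduce code. The sketch below is not the MRJob solution; it only mimics a map phase (emit a length for each n-gram) and a reduce phase (keep the maximum). The function names and the sample lines are assumptions made for this example, and "length" is taken to mean the number of non-whitespace characters, as described above.

```python
from typing import Iterable, Iterator, Tuple


def ngram_length(ngram: str) -> int:
    """Count the characters in an n-gram, ignoring whitespace."""
    return sum(1 for ch in ngram if not ch.isspace())


def map_line(line: str) -> Iterator[Tuple[int, str]]:
    """Map phase: emit (length, ngram) for one tab-separated corpus line."""
    ngram = line.rstrip("\n").split("\t")[0]
    yield ngram_length(ngram), ngram


def longest_ngram(lines: Iterable[str]) -> Tuple[int, str]:
    """Reduce phase: keep the pair with the greatest length.

    On ties, max() returns the first pair it encountered.
    """
    pairs = (pair for line in lines for pair in map_line(line))
    return max(pairs, key=lambda pair: pair[0])


# Hypothetical sample lines, invented for illustration only.
sample_lines = [
    "a b c d e\t2000\t3\t2",
    "one two three four five\t1999\t7\t4",
]
print(longest_ngram(sample_lines))
```

The same n-gram appears once per year in the corpus, so a real job may see it many times. Taking the maximum is unaffected by those repeats.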
Question 5: What is an n-gram? Name one application of n-grams.
Question 6: MRJob for the longest 5-gram. Fill in the code wherever it says "code".
# -*- coding: utf-8 -*-
from mrjob.protocol import RawProtocol
from mrjob.job import MRJob
from mrjob.step import MRStep


class MRLongestNgram(MRJob):  # class line is missing in the handout; this name is a placeholder

    def steps(self):
        return [MRStep(mapper=self.mapper,
                       reducer=self.reducer)]

    def mapper(self, _, line):
        line = line.strip()
        line = line.split("\t")
        #--Get n_gram and count
        # code
        for word in words:
            # code

    def reducer(self, length, n_gram):
        # code


if __name__ == '__main__':
    # code