Google Books 5-gram


The Google Books n-gram corpora are available as a public dataset on AWS (Amazon Web Services). This dataset is often used for testing and learning big-data tools.

The format for each line is:

ngram TAB year TAB match_count TAB volume_count NEWLINE
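As a quick illustration of the line format above, here is a minimal sketch that splits one record into typed fields. The names `NgramRecord` and `parse_record` are hypothetical helpers, and the sample line is made up for illustration; it is not taken from the corpus.

```python
from typing import NamedTuple


class NgramRecord(NamedTuple):
    """One parsed line of the corpus (hypothetical helper type)."""
    ngram: str
    year: int
    match_count: int
    volume_count: int


def parse_record(line: str) -> NgramRecord:
    """Split one tab-separated corpus line into typed fields."""
    ngram, year, match_count, volume_count = line.rstrip("\n").split("\t")
    return NgramRecord(ngram, int(year), int(match_count), int(volume_count))


# Made-up sample line, for illustration only.
sample = "analysis is often described as\t1991\t1\t1\n"
record = parse_record(sample)
print(record.ngram, record.year)
```

Note that the separator is the tab character `"\t"`, so the spaces inside the 5-gram itself are left untouched.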

In this exercise, we will develop an MRJob job to find the longest 5-gram (counting all characters except whitespace) in a small test dataset. Next week we will use EMR to find the longest 5-gram at scale.

Question 5: What is an n-gram? Name an application for an n-gram.
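For orientation, an n-gram is commonly understood as a contiguous run of n items from a sequence (words, in this corpus), and one typical application is language modeling, for example next-word suggestion. The sketch below builds word-level n-grams from a sentence; `word_ngrams` is a hypothetical helper and the simple whitespace tokenization is an assumption.

```python
from typing import List, Tuple


def word_ngrams(text: str, n: int) -> List[Tuple[str, ...]]:
    """Return every contiguous run of n whitespace-separated tokens."""
    tokens = text.split()
    # range() is empty when there are fewer than n tokens, so the result is [].
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "the quick brown fox jumps over the lazy dog"
five_grams = word_ngrams(sentence, 5)
print(len(five_grams))  # 9 tokens give 9 - 5 + 1 = 5 five-grams
print(" ".join(five_grams[0]))
```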

Question 6: MRJob for longest 5-gram. Fill in the parts marked "code".


#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re

import mrjob
from mrjob.protocol import RawProtocol
from mrjob.job import MRJob
from mrjob.step import MRStep


class longest5gram(MRJob):

  def steps(self):
    return [MRStep(mapper=self.mapper,
                   reducer=self.reducer)]

  def mapper(self, _, line):
    # Each input line is: ngram TAB year TAB match_count TAB volume_count
    line = line.strip()
    line = line.split("\t")

    #--Get n_gram and count code

    for word in words:

    yield #--code

  def reducer(self, length, n_gram):

    yield #--code


if __name__ == '__main__':
  longest5gram.run()
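One possible way to fill in the skeleton, sketched with the standard library only so it can be tried without installing mrjob or a cluster. This is not the official solution. It assumes the mapper emits a constant key (`None`) so that a single reduce call sees every candidate, which differs from the skeleton's `reducer(self, length, n_gram)` signature. The function `run_local` is a hypothetical stand-in for the shuffle phase, and the sample lines are made up.

```python
from itertools import groupby
from typing import Iterable, Iterator, List, Optional, Tuple

Candidate = Tuple[int, str]  # (length without whitespace, n-gram)


def mapper(line: str) -> Iterator[Tuple[Optional[str], Candidate]]:
    """Emit (constant key, (length, ngram)) for one corpus line."""
    fields = line.strip().split("\t")
    if len(fields) != 4:
        return  # skip malformed lines
    ngram = fields[0]
    # Length counts every character except whitespace.
    length = len("".join(ngram.split()))
    yield None, (length, ngram)


def reducer(key: Optional[str], values: Iterable[Candidate]) -> Iterator[Candidate]:
    """Keep only the longest candidate; ties go to the later n-gram in sort order."""
    yield max(values)


def run_local(lines: Iterable[str]) -> List[Candidate]:
    """Stand-in for the shuffle: group mapper output by key, then reduce."""
    pairs = [pair for line in lines for pair in mapper(line)]
    pairs.sort(key=lambda kv: str(kv[0]))
    results: List[Candidate] = []
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        results.extend(reducer(key, (value for _, value in group)))
    return results


# Made-up sample lines, for illustration only.
sample_lines = [
    "a b c d e\t1990\t3\t2",
    "international organizations frequently coordinate responses\t2001\t5\t4",
    "one two three four five\t1985\t7\t6",
]
print(run_local(sample_lines))
```

The same bodies could be moved into the `mapper` and `reducer` methods of the MRJob class. Routing everything through one key is simple but sends all records to a single reducer, so on the full corpus a combiner that keeps only the local maximum per mapper would likely be worth adding.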
