Google Books 5-gram
The Google Books n-gram corpora are available as a public data set on Amazon Web Services (AWS). The data set is commonly used for learning and testing big-data tools.
The format for each line is:
ngram TAB year TAB match_count TAB volume_count NEWLINE
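As a minimal sketch of how one such line could be read, the snippet below splits a record on tabs and converts the numeric fields. The names `NgramRecord` and `parse_line` and the sample line are made up for illustration; they are not part of the data set or of any library.

```python
from typing import NamedTuple


class NgramRecord(NamedTuple):
    """One record of the n-gram file (hypothetical helper type)."""
    ngram: str
    year: int
    match_count: int
    volume_count: int


def parse_line(line: str) -> NgramRecord:
    """Split a tab-separated line into its four fields."""
    # Drop the trailing newline, then split on the TAB separators.
    ngram, year, match_count, volume_count = line.rstrip("\r\n").split("\t")
    return NgramRecord(ngram, int(year), int(match_count), int(volume_count))


# Illustrative sample line, not taken from the real corpus.
sample = "analysis is often described as\t1991\t1\t1\n"
record = parse_line(sample)
print(record.ngram, record.year, record.match_count, record.volume_count)
```

The words of the 5-gram are separated by single spaces, so splitting on tabs only keeps the n-gram intact as one field.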
The task is to use Python's MRJob library to find the longest 5-gram (counting all characters except whitespace) in a small test data set.
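The map and reduce logic for this task might look like the sketch below. It is written in plain Python so that it runs without the mrjob package; in an MRJob job, the two functions would roughly become the `mapper` and `reducer` methods of an `MRJob` subclass. The function names, the `run` driver, and the test lines are assumptions made for illustration.

```python
from typing import Iterable, Iterator, List, Tuple


def mapper(line: str) -> Iterator[Tuple[None, Tuple[int, str]]]:
    """Emit (length, ngram) under a single shared key."""
    # The n-gram is the first tab-separated field.
    ngram = line.split("\t", 1)[0]
    # Count every character except whitespace.
    length = len("".join(ngram.split()))
    yield None, (length, ngram)


def reducer(key: None, values: Iterable[Tuple[int, str]]) -> Iterator[Tuple[str, int]]:
    """Keep the pair with the largest length."""
    # Tuples compare by length first; ties fall back to the n-gram text.
    length, ngram = max(values)
    yield ngram, length


def run(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Tiny local driver standing in for the map-reduce framework."""
    values = [value for line in lines for _, value in mapper(line)]
    return list(reducer(None, values))


# Illustrative test lines, not taken from the real corpus.
test_lines = [
    "a b c d e\t2000\t3\t2",
    "the quick brown fox jumps\t1999\t5\t4",
    "to be or not to\t2001\t7\t6",
]
result = run(test_lines)
print(result)  # [('the quick brown fox jumps', 21)]
```

Sending every record to one key keeps the sketch short but funnels all data through a single reducer. On a large input, a combiner that computes a local maximum per mapper would reduce the data shuffled between the two phases.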
Question 5: What is an n-gram? Name an application of n-grams.