with each process taking a different subset of documents. Text is also highly compressible, which makes Zip files, which preserve the directory structure on disk, an ideal distribution and storage format. In fact, NLTK CorpusReader objects, which we will discuss in the next section, can read from either a path to a directory or a path to a Zip file. Finally, corpora stored on disk are generally static and treated as a whole, fulfilling the requirement for WORM storage presented in the previous section.

Storing a single document per file could lead to some challenges, however. Consider smaller documents like emails or tweets, which don't make sense to store as individual files. Email is typically stored in the MBox format, a plaintext format that uses separators to delimit multipart MIME messages containing text, HTML, images, and attachments. The MBox format can be read in Python with the email module that comes with the standard library, making it easy to parse with a reader, but difficult to split into multiple documents per file. On the other hand, most email clients store one MBox file per folder (or label), e.g., the Inbox MBox, the Starred MBox, the Archive MBox, and so forth, which suggests that a corpus of MBox files organized by category is a sensible structure.

Tweets are generally small JSON data structures that include not just the text of the tweet but also other metadata such as user or location. The typical way to store multiple tweets is newline-delimited JSON, sometimes called the JSON lines format. This format makes it easy to read one tweet at a time by parsing only a single line at a time, and also to seek to different tweets in the file. A single file of tweets can be large, so organizing tweets into files by user, location, or day can reduce overall file sizes and again create a disk structure of multiple files. Another technique is simply to write files with a maximum size limit: keep writing data to the file, respecting document boundaries, until it reaches some size limit (e.g., 128 MB), then open a new file and continue writing there.
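The size-limited approach can be sketched with nothing but the standard library. The following is a minimal illustration, not a prescribed implementation: the function names `write_jsonl_chunks` and `read_jsonl`, the `tweets` file prefix, and the `.jsonl` extension are all hypothetical choices made for this example. It writes one JSON document per line and rolls over to a new file before the current one would exceed the limit, so no document is ever split across two files.

```python
import json
import os


def write_jsonl_chunks(records, directory, prefix="tweets",
                       max_bytes=128 * 1024 * 1024):
    """Write records as newline-delimited JSON, starting a new file
    whenever the next line would push the current file past max_bytes.

    Names and defaults here are illustrative assumptions. Returns the
    list of file paths that were written.
    """
    os.makedirs(directory, exist_ok=True)
    paths = []
    handle = None
    written = 0
    try:
        for record in records:
            # One document per line; json.dumps never emits a raw newline.
            line = (json.dumps(record) + "\n").encode("utf-8")

            # Open the first file, or roll over when the limit would be
            # exceeded. A non-empty check keeps an oversized single
            # document from producing an endless run of empty files.
            if handle is None or (written > 0 and written + len(line) > max_bytes):
                if handle is not None:
                    handle.close()
                path = os.path.join(directory, f"{prefix}-{len(paths):05d}.jsonl")
                handle = open(path, "wb")
                paths.append(path)
                written = 0

            handle.write(line)
            written += len(line)
    finally:
        if handle is not None:
            handle.close()
    return paths


def read_jsonl(path):
    """Yield one parsed document at a time, reading a single line per step."""
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)
```

Because each file holds only whole lines, a reader can process any one file independently of the others, which fits the multi-file disk structure described above. A document larger than the limit is still written intact to its own file rather than being truncated.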