
might be worthwhile in large applications, consider the benefits of using a file-based approach: compression techniques on directories are well suited to text information, and the use of a file synchronization service provides automatic replication. The construction of a corpus in a database is thus beyond the scope of this book. In order to access our text corpora, we will plan to structure our data on disk in a meaningful way, which we will explore in the next section.

Corpus Disk Structure

The simplest and most common method of organizing and managing a text-based corpus is to store individual documents in a file system on disk. By organizing the corpus into subdirectories, corpora can be categorized or meaningfully partitioned by meta information like dates. By maintaining each document as its own file, readers can seek quickly to different subsets of documents and processing can be parallelized, with each process taking a different subset of documents. Text is also the most compressible format, making Zip files, which leverage directory structures on disk, an ideal distribution and storage format. In fact, NLTK CorpusReader objects, which we will discuss in the next section, can read from either a path to a directory or a path to a Zip file. Finally, corpora stored on disk are generally static and treated as a whole, fulfilling the requirement for WORM storage presented in the previous section.

Storing a single document per file could lead to some challenges, however. Consider smaller document sizes like emails or tweets, which don't make sense to store as individual files. Email is typically stored in an MBox format, a plaintext format that uses separators to delimit multipart MIME messages containing text, HTML, images, and attachments. The MBox format can be read in Python with the email module that comes with the standard library, making it easy to parse with a reader, but difficult to split into multiple documents per file. On the other hand, most email clients store an MBox file per folder (or label), e.g. the Inbox MBox, the Starred MBox, the Archive MBox, and so forth, which suggests that a corpus of MBox files organized by category is a sensible structure.

Tweets are generally small JSON data structures that include not just the text of the tweet but other metadata like user or location. The typical way to store multiple tweets is in newline-delimited JSON, sometimes called the JSON lines format. This format makes it easy to read one tweet at a time by parsing only a single line at a time, and also to seek to different tweets in the file. A single file of tweets can be large, so organizing tweets in files by user, location, or day can reduce overall file sizes and again create a disk structure of multiple files. Another technique is simply to write files with a maximum size limit. For example, keep writing data to the file, respecting document boundaries, until it reaches some size limit (e.g. 128 MB), then open a new file and continue writing there.
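The size-limited writing technique can be sketched in a few lines of standard-library Python. This is a minimal illustration rather than a prescribed implementation: the function names `write_jsonl_chunks` and `read_jsonl`, the `tweets-00000.jsonl` naming scheme, and the default 128 MB limit are all assumptions made for the example.

```python
import json
import os


def write_jsonl_chunks(records, directory, max_bytes=128 * 1024 * 1024,
                       prefix="tweets"):
    """Write records as newline-delimited JSON, one record per line.

    A new file is started whenever the next record would push the
    current file past max_bytes. Records are never split across files,
    so document boundaries are respected. Returns the paths written.
    (Hypothetical helper; names and defaults are illustrative.)
    """
    paths = []
    handle = None
    written = 0
    try:
        for record in records:
            line = (json.dumps(record) + "\n").encode("utf-8")
            # Roll over when the record does not fit. A single oversized
            # record still gets a file of its own.
            needs_new_file = handle is None or (
                written > 0 and written + len(line) > max_bytes
            )
            if needs_new_file:
                if handle is not None:
                    handle.close()
                name = f"{prefix}-{len(paths):05d}.jsonl"
                path = os.path.join(directory, name)
                handle = open(path, "wb")
                paths.append(path)
                written = 0
            handle.write(line)
            written += len(line)
    finally:
        if handle is not None:
            handle.close()
    return paths


def read_jsonl(path):
    """Yield one parsed record per non-empty line of a JSON lines file."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)
```

Because each line is a complete JSON document, a reader can stream any one of the resulting files without loading it fully into memory, and separate processes can each take a different file.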
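For the email case described above, a corpus of MBox files organized by category might be read as follows. This sketch assumes a flat directory of files named after their folder or label (for example `Inbox.mbox`), which is a layout chosen for illustration. It uses the standard library's `mailbox` module, whose messages are built on the `email` package; the helper names `iter_mbox_documents` and `plain_text` are hypothetical.

```python
import mailbox
from pathlib import Path


def plain_text(message):
    """Concatenate the text/plain parts of a (possibly multipart) message."""
    parts = []
    for part in message.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)
            if payload is not None:
                charset = part.get_content_charset() or "utf-8"
                parts.append(payload.decode(charset, errors="replace"))
    return "\n".join(parts)


def iter_mbox_documents(corpus_dir):
    """Yield (category, subject, body) for each message in each *.mbox file.

    The category is taken from the file name, e.g. "Inbox" for Inbox.mbox
    (an assumed naming convention, not a requirement of the format).
    """
    for path in sorted(Path(corpus_dir).glob("*.mbox")):
        box = mailbox.mbox(str(path), create=False)
        try:
            for message in box:
                yield path.stem, message.get("Subject", ""), plain_text(message)
        finally:
            box.close()
```

Each MBox file still holds many documents, so the category is the unit of partitioning here; the individual messages are only separated at read time.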