Assignment1 - Assignment #1; also known as: I'm Pretty Sure...

Sara Kazemi
09/14/2007
Ling 571
Assignment #1; also known as:

Format: In the source file brown.txt, all letter strings are separated from non-letter strings (e.g. punctuation), and each line holds exactly one sentence, delimited by a period (or by a question or exclamation mark where appropriate).

Sentences: The command [ cat brown.txt | wc -l ] reports 52,105 lines in the Brown corpus. Because the file is formatted with one sentence per line, this should equal the number of sentences.

Tokens: The command [ tr A-Z a-z <brown.txt | wc -w ] returns a count of 1,170,811. Since wc -w counts every whitespace-separated string and does not discriminate between alphabetical and non-alphabetical strings, this number covers both words and non-words and gives the total number of tokens in the corpus. It does not, however, reflect the number of unique tokens in the corpus.

Words:
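As a rough illustration of the sentence and token counts described above, and of one way the unique-token count they leave out might be obtained, here is a minimal sketch. It runs on a tiny made-up two-line sample rather than on brown.txt itself; the sample text, the variable names, and the sort -u step are assumptions for illustration, not part of the original assignment.

```shell
# Hypothetical stand-in for brown.txt: one sentence per line,
# punctuation already separated from letter strings.
sample=$(mktemp)
printf '%s\n' 'The dog saw the cat .' 'Did the cat see the dog ?' > "$sample"

# Sentences: one sentence per line, so count the lines.
sentences=$(wc -l < "$sample" | tr -d ' ')

# Tokens: every whitespace-separated string, words and punctuation alike.
tokens=$(tr 'A-Z' 'a-z' < "$sample" | wc -w | tr -d ' ')

# Unique tokens: lowercase, put one token per line, deduplicate, count.
types=$(tr 'A-Z' 'a-z' < "$sample" | tr -s ' ' '\n' | sort -u | wc -l | tr -d ' ')

echo "sentences=$sentences tokens=$tokens types=$types"
rm -f "$sample"
```

On the real corpus the same three pipelines would read from brown.txt instead of the temporary file. Lowercasing before deduplication means "The" and "the" count as one unique token.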

This note was uploaded on 09/06/2009 for the course LING 571 taught by Professor Staff during the Fall '08 term at San Diego State.
