Why should we care about document standardization the

  • CUHK
  • STAT 3008
  • Essay
  • fuzziest
  • 249

Why should we care about document standardization? The main advantage of standardizing the data is that the mining tools can be applied without having to consider the pedigree of the document. For harvesting information from a document, it is irrelevant what editor was used to create it or what the original format was. The software tools need to read data just in one format, and not in the many different formats they came in originally.
2.3 Tokenization

Assume the document collection is in XML format and we are ready to examine the unstructured text to identify useful features. The first step in handling text is to break the stream of characters into words or, more precisely, tokens. This is fundamental to further analysis. Without identifying the tokens, it is difficult to imagine extracting higher-level information from the document. Each token is an instance of a type, so the number of tokens is much higher than the number of types. As an example, in the previous sentence there are two tokens spelled “the.” These are both instances of a type “the,” which occurs twice in the sentence. Properly speaking, one should always refer to the frequency of occurrence of a type, but loose usage also talks about the frequency of a token.

Breaking a stream of characters into tokens is trivial for a person familiar with the language structure. A computer program, though, being linguistically challenged, would find the task more complicated. The reason is that certain characters are sometimes token delimiters and sometimes not, depending on the application. The characters space, tab, and newline we assume are always delimiters and are not counted as tokens. They are often collectively called whitespace. The characters ( ) < > ! ? ” are always delimiters and may also be tokens. The characters . , : - ’ may or may not be delimiters, depending on their environment. A period, comma, or colon between numbers would not normally be considered a delimiter but rather part of the number. Any other comma or colon is a delimiter and may be a token. A period can be part of an abbreviation (e.g., if it has a capital letter on both sides). It can also be part of an abbreviation when followed by a space (e.g., Dr.). However, some of these are really ends of sentences. The problem of detecting when a period is an end of sentence and when it is not will be discussed later.
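Returning to the token/type distinction introduced earlier in this section, a small count may make it concrete. The sketch below is only an illustration: it uses a naive whitespace split with punctuation stripping and lowercasing as a stand-in for a real tokenizer, and the variable names are our own.

```python
from collections import Counter

sentence = ("Each token is an instance of a type, so the number of tokens "
            "is much higher than the number of types.")

# Naive stand-in for a tokenizer (an assumption for this example only):
# split on whitespace, strip trailing periods and commas, and lowercase.
tokens = [word.strip(".,").lower() for word in sentence.split()]

# Each distinct spelling is a type; the Counter maps a type to its frequency.
type_counts = Counter(tokens)

print(len(tokens))         # number of tokens: 21
print(len(type_counts))    # number of types: 16
print(type_counts["the"])  # frequency of the type "the": 2
```

The two occurrences of “the” are two tokens but a single type, which is why the token count exceeds the type count.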
For the purposes of tokenization, it is probably best to treat any ambiguous period as a word delimiter and also as a token.

The apostrophe also has a number of uses. When preceded and followed by non-delimiters, it should be treated as part of the current token (e.g., isn’t or D’angelo). When followed by an unambiguous terminator, it might be a closing internal quote or might indicate a possessive (e.g., Tess’). An apostrophe preceded by a terminator is unambiguously the beginning of an internal quote, so it is possible to distinguish the two cases by keeping track of opening and closing internal quotes.
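The delimiter rules described in this section can be sketched as a single character-by-character pass. This is a minimal sketch under stated assumptions, not a definitive implementation: the function name `tokenize` and the sets `ALWAYS` and `MAYBE` are hypothetical, a hyphen between alphanumeric characters is assumed to stay inside the token (the text gives no rule for it), and the tracking of opening and closing internal quotes is not attempted, so any apostrophe that is not inside a word is simply emitted as its own token.

```python
# Characters that always end a token and are themselves emitted as tokens.
ALWAYS = set('()<>!?"\u201c\u201d')
# Characters whose role depends on their neighbours.
MAYBE = set(".,:-'\u2019")


def tokenize(text):
    """Split text into tokens following the rules sketched in the section."""
    tokens, current = [], []

    def flush():
        # Close the token being built, if any.
        if current:
            tokens.append("".join(current))
            current.clear()

    for i, ch in enumerate(text):
        prev = text[i - 1] if i > 0 else " "
        nxt = text[i + 1] if i + 1 < len(text) else " "
        if ch.isspace():
            flush()                      # whitespace: delimiter, never a token
        elif ch in ALWAYS:
            flush()
            tokens.append(ch)            # delimiter that is also a token
        elif ch in MAYBE:
            if ch in ".,:" and prev.isdigit() and nxt.isdigit():
                current.append(ch)       # part of a number, e.g. 1,250.75
            elif ch == "." and prev.isupper() and nxt.isupper():
                current.append(ch)       # capital letter on both sides
            elif ch in "'\u2019-" and prev.isalnum() and nxt.isalnum():
                current.append(ch)       # inside a word (hyphen rule assumed)
            else:
                flush()
                tokens.append(ch)        # ambiguous: delimiter and token
        else:
            current.append(ch)
    flush()
    return tokens


print(tokenize("Dr. Smith isn't paying 1,250.75 (really)!"))
# ['Dr', '.', 'Smith', "isn't", 'paying', '1,250.75', '(', 'really', ')', '!']
```

Note that the period after “Dr” is emitted as a separate token, which follows the suggestion to treat any ambiguous period as both a delimiter and a token; deciding whether it ends a sentence is left to a later step.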

Term
Spring
Professor
C.Y.YAU