Entropic Analysis of the Latin Language

Erik Peterson
Department of Computer Science, UC Santa Barbara, Santa Barbara, CA 93106
[email protected]
June 1, 2006

Abstract

In this paper I explore the entropy of the Latin language, with particular attention to how entropy differs between authors and between periods when Latin was a living language. I expect to find not only that the texts of individual authors can be distinguished by their entropies, but also that the entropy of the language changes gradually from the earliest Latin texts to later 'Church Latin' texts. This investigation will involve processing large numbers of Latin texts, available in digital format from the Tufts University Perseus Project.

1 Background

The automatic identification of a text's language or author is not a new idea. It has been shown that the language and author of a text can be determined algorithmically, as can the composer of a piece of music. The method relies on the notion of the entropy of a discrete random variable, as described by Shannon. The entropy of a random variable is a measure of how 'random' the variable is, and is defined as

$$H(X) = -\sum_{x} p(x) \log p(x)$$

The entropy is tightly bounded below by 0, implying no randomness, and above by the quantity $\log |\mathcal{X}|$, where $|\mathcal{X}|$ is the size of the alphabet of $X$, implying total randomness.

In the analysis of languages, a language is modeled as a discrete random variable with some theoretical probability distribution over its letters. As a result, one can discuss the entropy of a natural language. An alternate interpretation of entropy is the amount of information carried by each symbol drawn from the alphabet of a discrete random variable. Languages with higher entropies can be thought of as more succinct; languages with lower entropies as more redundant.

There are different methods for approximating the entropy of a language from a text.
The most naive method is to count the frequency of each letter, including spaces, and use those frequencies to compute the entropy from the above definition. This method gives an upper bound on the ideal entropy of the source language, but it is not a good measure of the language's theoretical entropy, because it ignores dependencies between neighboring letters. A better method is to apply a universal compression algorithm to the text and measure how much compression is achieved. This approximates calculating the entropy from the frequencies of sequences of letters, rather than of letters alone. Algorithms such as Lempel-Ziv compression achieve a compression rate that approaches the entropy of the source as the sample size grows....
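The two estimation methods described above can be sketched in a few lines of Python. This is only an illustration, not the code used in this paper: the function names `unigram_entropy` and `compressed_bits_per_char` are hypothetical, the sample string is made up, and `zlib` (DEFLATE, which combines an LZ77-style dictionary coder with Huffman coding and uses a fixed 32 KB window) is assumed as a convenient stand-in for a Lempel-Ziv compressor. Both estimates are reported in bits per character.

```python
import math
import zlib
from collections import Counter


def unigram_entropy(text: str) -> float:
    """Naive estimate: entropy in bits per character from single-character counts."""
    total = len(text)
    if total == 0:
        return 0.0
    counts = Counter(text)  # spaces are counted like any other character
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def compressed_bits_per_char(text: str) -> float:
    """Compression-based estimate: compressed size in bits divided by text length."""
    if not text:
        return 0.0
    raw = text.encode("utf-8")
    # Level 9 is zlib's strongest setting; header and checksum overhead is
    # included, so the estimate is only meaningful for long texts.
    return 8 * len(zlib.compress(raw, 9)) / len(text)


# Made-up, highly repetitive sample purely for demonstration.
sample = "gallia est omnis divisa in partes tres " * 200
print("unigram bits/char:   ", unigram_entropy(sample))
print("compressed bits/char:", compressed_bits_per_char(sample))
```

On a repetitive sample like this one, the compression-based figure falls well below the letter-frequency figure, because the compressor exploits repeated sequences that single-letter counts cannot see. On short texts the fixed overhead of the compressed format can reverse that ordering.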
This note was uploaded on 12/27/2011 for the course CMPSC 225 taught by Professor Vandam during the Fall '09 term at UCSB.