TextTiling: A Quantitative Approach to Discourse Segmentation

Marti A. Hearst
Computer Science Division, 571 Evans Hall
University of California, Berkeley
Berkeley, CA 94720
email@example.com

Abstract

This paper presents TextTiling, a method for partitioning full-length text documents into coherent multi-paragraph units. The layout of text tiles is meant to reflect the pattern of subtopics contained in an expository text. The approach uses lexical analyses based on tf.idf, an information retrieval measurement, to determine the extent of the tiles, incorporating thesaural information via a statistical disambiguation algorithm. The tiles have been found to correspond well to human judgements of the major subtopic boundaries of science magazine articles.

1 Introduction

Expository texts such as science magazine articles and environmental impact reports can be viewed as being composed of a few main topics and a series of short, sometimes densely discussed, subtopics. For example, consider a 23-paragraph article from Discover magazine whose main topic is the exploration of Venus by the Magellan space probe. A reader divided this text into the following segments, with the labels shown, where the numbers indicate paragraph numbers:

1-2 Intro to Magellan space probe
3-4 Intro to Venus
5-7 Lack of craters
8-11 Evidence of volcanic action
12-15 River Styx
16-18 Crustal spreading
19-21 Recent volcanism
22-23 Future of Magellan

The capability to automate the recognition of this kind of structure in a full-text document should be useful for improving a variety of computational tasks, e.g., hypertext, text summarization and information retrieval. Toward this end, this paper describes TextTiling, a computational approach to segmenting written expository text into contiguous, non-overlapping discourse units that correspond to the pattern of subtopics in a text.
(Skorochodko 1972) has suggested discovering a text's structure by dividing it up into sentences and seeing how much word overlap appears among the sentences. The overlap forms a kind of intra-structure; fully connected graphs might indicate dense discussions of a topic, while long spindly chains of connectivity might indicate a sequential account. The crucial idea is that of defining the structure of a text as a function of the connectivity patterns of the terms that comprise it. This is in contrast with segmenting guided primarily by fine-grained discourse cues such as register change, focus shift, and cue words. From a computational viewpoint, deducing textual topic structure from lexical connectivity alone is appealing, both because it is easy to compute, and also because discourse cues are sometimes misleading with respect to the topic structure (Brown & Yule 1983, ch. 3)....
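
The sentence-overlap idea described above can be illustrated with a minimal Python sketch. It is not Skorochodko's procedure or the TextTiling algorithm; it assumes a naive punctuation-based sentence splitter, a tiny hand-picked stopword list, and a simple edge-density measure, and the names `split_sentences`, `content_words`, `overlap_graph`, and `density` are hypothetical choices made for illustration.

```python
import re
from itertools import combinations

# Assumption: a tiny illustrative stopword list, not one taken from the paper.
STOPWORDS = frozenset({"the", "a", "an", "of", "and", "is", "in", "to", "it"})


def split_sentences(text):
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(sentence, stopwords=STOPWORDS):
    """Return the set of lowercased alphabetic tokens that are not stopwords."""
    return {w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in stopwords}


def overlap_graph(text, stopwords=STOPWORDS):
    """Connect every pair of sentences that shares at least one content word.

    Returns the sentence list and a dict mapping (i, j) index pairs, i < j,
    to the number of shared content words.
    """
    sentences = split_sentences(text)
    words = [content_words(s, stopwords) for s in sentences]
    edges = {}
    for i, j in combinations(range(len(sentences)), 2):
        shared = words[i] & words[j]
        if shared:
            edges[(i, j)] = len(shared)
    return sentences, edges


def density(num_sentences, edges):
    """Fraction of possible sentence pairs that are connected (1.0 = fully connected)."""
    possible = num_sentences * (num_sentences - 1) // 2
    return len(edges) / possible if possible else 0.0


if __name__ == "__main__":
    dense = "Venus has volcanoes. The volcanoes on Venus are young. Magellan mapped Venus."
    chain = "Cats chase mice. Mice eat cheese. Cheese comes from milk."
    for label, text in (("dense", dense), ("chain", chain)):
        sentences, edges = overlap_graph(text)
        print(label, sorted(edges), round(density(len(sentences), edges), 2))
```

In this toy input, the first text links every sentence to every other (a fully connected graph, suggesting dense discussion of one topic), while the second links only adjacent sentences (a chain, suggesting a sequential account). Exact word matching is the simplest possible notion of overlap; stemming or weighting terms would change which edges appear.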
This note was uploaded on 09/21/2009 for the course CS 580 taught by Professor Fdfdf during the Spring '09 term at University of Toronto- Toronto.