68746 - TextTiling: A Quantitative Approach to Discourse...


TextTiling: A Quantitative Approach to Discourse Segmentation

Marti A. Hearst
Computer Science Division, 571 Evans Hall
University of California, Berkeley
Berkeley, CA 94720
marti@cs.berkeley.edu

Abstract

This paper presents TextTiling, a method for partitioning full-length text documents into coherent multi-paragraph units. The layout of text tiles is meant to reflect the pattern of subtopics contained in an expository text. The approach uses lexical analyses based on tf.idf, an information retrieval measurement, to determine the extent of the tiles, incorporating thesaural information via a statistical disambiguation algorithm. The tiles have been found to correspond well to human judgements of the major subtopic boundaries of science magazine articles.

1 Introduction

Expository texts such as science magazine articles and environmental impact reports can be viewed as being composed of a few main topics and a series of short, sometimes densely discussed, subtopics. For example, consider a 23-paragraph article from Discover magazine whose main topic is the exploration of Venus by the Magellan space probe. A reader divided this text into the following segments, with the labels shown, where the numbers indicate paragraph numbers:

1-2 Intro to Magellan space probe
3-4 Intro to Venus
5-7 Lack of craters
8-11 Evidence of volcanic action
12-15 River Styx
16-18 Crustal spreading
19-21 Recent volcanism
22-23 Future of Magellan

The capability to automate the recognition of this kind of structure in a full-text document should be useful for improving a variety of computational tasks, e.g., hypertext, text summarization and information retrieval. Toward this end, this paper describes TextTiling, a computational approach to segmenting written expository text into contiguous, non-overlapping discourse units that correspond to the pattern of subtopics in a text.

(Skorochodko 1972) has suggested discovering a text's structure by dividing it up into sentences and seeing how much word overlap appears among the sentences. The overlap forms a kind of intra-structure; fully connected graphs might indicate dense discussions of a topic, while long spindly chains of connectivity might indicate a sequential account. The crucial idea is that of defining the structure of a text as a function of the connectivity patterns of the terms that comprise it. This is in contrast with segmenting guided primarily by fine-grained discourse cues such as register change, focus shift, and cue words. From a computational viewpoint, deducing textual topic structure from lexical connectivity alone is appealing, both because it is easy to compute, and also because discourse cues are sometimes misleading with respect to the topic structure (Brown & Yule 1983) (ch. 3)....
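
The sentence-overlap idea attributed to Skorochodko above can be illustrated with a small sketch. This is not the TextTiling algorithm, nor a reconstruction of Skorochodko's own procedure: the tokenizer, the lack of stopword filtering, and the names `sentence_terms`, `overlap_graph`, `density`, and `min_shared` are all assumptions made for illustration. It only shows how word overlap among sentences yields a connectivity graph whose density can be inspected.

```python
import re
from itertools import combinations


def sentence_terms(sentence):
    """Return the set of lowercased alphabetic tokens in a sentence.

    Assumption: a crude regex tokenizer with no stemming or stopword removal.
    """
    return set(re.findall(r"[a-z]+", sentence.lower()))


def overlap_graph(sentences, min_shared=1):
    """Build a sentence connectivity graph from word overlap.

    Returns a dict mapping each sentence-index pair (i, j), i < j, to the
    number of terms the two sentences share, keeping only pairs that share
    at least `min_shared` terms (a hypothetical threshold parameter).
    """
    terms = [sentence_terms(s) for s in sentences]
    edges = {}
    for i, j in combinations(range(len(terms)), 2):
        shared = len(terms[i] & terms[j])
        if shared >= min_shared:
            edges[(i, j)] = shared
    return edges


def density(edges, n_sentences):
    """Fraction of all possible sentence pairs that are connected.

    Values near 1 correspond to a nearly fully connected graph; low values
    correspond to sparse connectivity.
    """
    possible = n_sentences * (n_sentences - 1) // 2
    return len(edges) / possible if possible else 0.0


if __name__ == "__main__":
    sentences = [
        "Venus has few craters.",
        "The craters on Venus are young.",
        "Magellan maps the surface.",
    ]
    edges = overlap_graph(sentences, min_shared=2)
    print(edges)
    print(density(edges, len(sentences)))
```

Because no stopwords are removed, function words such as "the" count as overlap at the default threshold; raising `min_shared` or filtering common words would be one way to keep such terms from creating spurious links.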

This note was uploaded on 09/21/2009 for the course CS 580 taught by Professor Fdfdf during the Spring '09 term at University of Toronto- Toronto.
