### Cluster-Evaluation_cost-based

Course: TDT 3, Fall 2009
School: UPenn
Cluster Cost-based Evaluation Algorithm for Topic Detection

The clustering task: A set of messages is to be clustered according to target identity. The goal of the clustering is that each cluster should contain messages from a single target and that the number of clusters should equal the number of targets. Practically, the clusters should be as pure as possible and the number of clusters should be as small as possible.

How well the clustering algorithm achieves this goal is evaluated using an algorithm that assigns a cost for examining a message, $C_{exam}$, and a cost for missing a target message, $C_{miss}$. Here is how the algorithm works:

1. For each target, index = $k$:
   - For each cluster, index = $i$, where $N_{cluster}(i)$ is the number of messages in cluster $i$ and $N_{target}(i,k)$ is the number of target messages in cluster $i$ for target $k$:
     - Examine a message chosen at random from the cluster. This incurs a cost of $C_{exam}$. The probability that this message is a target message is $P_{target}(i,k) = N_{target}(i,k) / N_{cluster}(i)$.
       - If the message is a target message, all the other messages in the cluster are also examined. This results in a total incurred cost for the cluster of $N_{cluster}(i)\, C_{exam}$.
       - If the message is not a target message, the cluster is abandoned. This results in a total incurred cost for the cluster of $C_{exam} + C_{miss}\, N_{target}(i,k)$.

This algorithm gives the following clustering cost measure:

$$
\begin{aligned}
C_{cluster} &= \sum_k \sum_i \Big\{ P_{target}(i,k)\,\big[C_{exam}\, N_{cluster}(i)\big] + \big(1 - P_{target}(i,k)\big)\,\big[C_{exam} + C_{miss}\, N_{target}(i,k)\big] \Big\} \\
&= \sum_k \sum_i \Big\{ \frac{N_{target}(i,k)}{N_{cluster}(i)}\, C_{exam}\, N_{cluster}(i) + \Big(1 - \frac{N_{target}(i,k)}{N_{cluster}(i)}\Big)\big[C_{exam} + C_{miss}\, N_{target}(i,k)\big] \Big\} \\
&= \sum_k \sum_i \Big\{ C_{exam}\Big[N_{target}(i,k) + 1 - \frac{N_{target}(i,k)}{N_{cluster}(i)}\Big] + C_{miss}\, N_{target}(i,k)\Big(1 - \frac{N_{target}(i,k)}{N_{cluster}(i)}\Big) \Big\}
\end{aligned}
$$

Interpretation of $C_{cluster}$ is difficult because $C_{cluster}$ is a function of the message set and tends to increase with the number of target messages.
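The cost measure above can be sketched in a few lines of Python. This is a minimal illustration, not the original evaluation software: the function name `cluster_cost`, the list-of-lists layout of the counts, and the sample numbers are all assumptions made for the example.

```python
def cluster_cost(n_target, n_cluster, c_exam=1.0, c_miss=1.0):
    """Expected clustering cost C_cluster.

    n_target[k][i] -- messages of target k in cluster i (hypothetical layout)
    n_cluster[i]   -- total messages in cluster i
    """
    total = 0.0
    for counts_k in n_target:                      # loop over targets k
        for n_tk, n_c in zip(counts_k, n_cluster):  # loop over clusters i
            p = n_tk / n_c                          # P_target(i, k)
            examined = p * (c_exam * n_c)           # first message is on target
            abandoned = (1 - p) * (c_exam + c_miss * n_tk)  # cluster dropped
            total += examined + abandoned
    return total


# Hypothetical example: 6 messages, 2 targets, 2 clusters.
n_cluster = [4, 2]
n_target = [[3, 0],   # target 0: 3 messages in cluster 0, none in cluster 1
            [1, 2]]   # target 1: 1 message in cluster 0, 2 in cluster 1
cost = cluster_cost(n_target, n_cluster, c_exam=1.0, c_miss=10.0)
```

Passing the counts as plain lists keeps the sketch dependency-free; a real evaluation would first have to build these counts from the reference topic labels and the system's cluster assignments.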
Knowledge of the minimum and maximum values of $C_{cluster}$, given the size of the message set and the number of target messages for each target, would aid interpretation.

Perfect (minimum cost) clustering would require that each target have one cluster with $P_{target} = 1$ and all others with $P_{target} = 0$, and that the number of clusters equal the number of targets. In this case, $C_{cluster}$ would be:

$$
\begin{aligned}
C_{min} &= \sum_k \big\{ C_{exam}\, N_{target}(k) + C_{exam}\,(N_{clusters} - 1) \big\} \\
&= C_{exam} \sum_k \big\{ N_{target}(k) + (N_{clusters} - 1) \big\} \\
&= C_{exam} \Big[ \sum_k N_{target}(k) + N_{targets}\,(N_{clusters} - 1) \Big] \\
&= C_{exam} \big[ N_{total} + N_{targets}\,(N_{targets} - 1) \big]
\end{aligned}
$$

where $N_{target}(k)$ is the number of target messages for target $k$, $N_{targets}$ is the number of different targets, $N_{clusters}$ is the number of different clusters ($= N_{targets}$), and $N_{total}$ is the total number of messages to be clustered. (But note that if $C_{min}$ is calculated for only a subset of targets, then the sum of the target messages will be less than $N_{total}$.)

The worst possible (maximum cost) clustering would require that $N_{target}(i,k)$ be independent of $i$ and that all clusters contain only one message. Each of the $N_{total}$ single-message clusters then has the same target probability, $P_{target}(k) = N_{target}(k) / N_{total}$, and $N_{cluster}(i) = 1$. In this case, $C_{cluster}$ would be:

$$
\begin{aligned}
C_{max} &= \sum_k \sum_i \Big\{ P_{target}(k)\,\big[C_{exam}\big] + \big(1 - P_{target}(k)\big)\,\big[C_{exam} + C_{miss}\, P_{target}(k)\, N_{cluster}(i)\big] \Big\} \\
&= N_{total} \sum_k \Big\{ P_{target}(k)\,\big[C_{exam}\big] + \big(1 - P_{target}(k)\big)\,\big[C_{exam} + C_{miss}\, P_{target}(k)\big] \Big\} \\
&= N_{total} \sum_k \Big\{ C_{exam} + C_{miss}\, P_{target}(k)\,\big(1 - P_{target}(k)\big) \Big\} \\
&= C_{exam}\, N_{targets}\, N_{total} + C_{miss} \sum_k N_{target}(k)\, \frac{N_{total} - N_{target}(k)}{N_{total}}
\end{aligned}
$$

The normalized value of $C_{cluster}$ is then:

$$
C_{norm} = \frac{C_{cluster} - C_{min}}{C_{max} - C_{min}}
$$

For TDT purposes, a target is a reference topic, a message is a story, and a cluster is a system-defined topic. The following chart shows topic detection results for all TDT2 detection systems running on the default data sources (newswire + ASR) for a range of $C_{miss}$ values. Notice that as the ratio of $C_{miss}$ to $C_{exam}$ increases, the performance of six systems improves while that of two degrades. This ...
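A minimal Python sketch of the bounds $C_{min}$ and $C_{max}$ and of the normalized value $C_{norm}$ defined above may help with checking the arithmetic. The function names, the argument layout, and the sample numbers (including the observed cost of 23.0) are assumptions made for illustration only.

```python
def min_cost(n_target_totals, c_exam=1.0):
    """C_min: one pure cluster per target, N_clusters == N_targets."""
    n_targets = len(n_target_totals)
    # Summing the per-target counts (rather than using N_total) also covers
    # the case where only a subset of targets is evaluated.
    return c_exam * (sum(n_target_totals) + n_targets * (n_targets - 1))


def max_cost(n_target_totals, n_total, c_exam=1.0, c_miss=1.0):
    """C_max: every cluster holds one message, P_target(k) = N_target(k) / N_total."""
    n_targets = len(n_target_totals)
    miss_term = sum(n * (n_total - n) / n_total for n in n_target_totals)
    return c_exam * n_targets * n_total + c_miss * miss_term


def normalized_cost(c_cluster, c_min, c_max):
    """C_norm = (C_cluster - C_min) / (C_max - C_min)."""
    return (c_cluster - c_min) / (c_max - c_min)


# Hypothetical example: 6 messages, 2 targets with 3 messages each.
n_target_totals = [3, 3]
c_min = min_cost(n_target_totals, c_exam=1.0)
c_max = max_cost(n_target_totals, n_total=6, c_exam=1.0, c_miss=10.0)
c_norm = normalized_cost(23.0, c_min, c_max)  # 23.0 is an assumed observed C_cluster
```

With these definitions a perfect clustering maps to 0 and the single-message worst case maps to 1, so systems evaluated on different message sets can be compared on a common scale.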
