CS 124/LINGUIST 180: From Languages to Information
Lecture 2: Tokenization/Segmentation, Minimum Edit Distance
Thanks to Chris Manning and Serafim Batzoglou for slides!

Outline
- Tokenization
  - Word tokenization
  - Normalization
  - Lemmatization and stemming
  - Sentence tokenization
- Minimum edit distance
  - Levenshtein distance
  - Needleman-Wunsch
  - Smith-Waterman

Tokenization
- Needed for:
  - Information retrieval
  - Information extraction (detecting named entities, etc.)
  - Spell-checking
- Three tasks:
  - Segmenting/tokenizing words in running text
  - Normalizing word formats
  - Segmenting sentences in running text
- Why not just periods and white-space?
  - Mr. Sherwood said reaction to Sea Containers' proposal has been "very positive." In New York Stock Exchange composite trading yesterday, Sea Containers closed at $62.625, up 62.5 cents.
  - "I said, 'what're you? Crazy?'" said Sadowsky. "I can't afford to do that."

What's a word?
- "I do uh main- mainly business data processing"
  - Fragments
  - Filled pauses
- Are cat and cats the same word?
- Some terminology:
  - Lemma: a set of lexical forms having the same stem, major part of speech, and rough word sense (cat and cats = same lemma)
  - Wordform: the full inflected surface form (cat and cats = different wordforms)
  - Token/Type

How many words?
- "they lay back on the San Francisco grass and looked at the stars": 13 tokens (or 12), 12 types (or 11)
- The Switchboard corpus of American telephone conversation: 2.4 million wordform tokens, ~20,000 wordform types
- Brown et al. (1992), a large corpus of text: 583 million wordform tokens, 293,181 wordform types
- Shakespeare: 884,647 wordform tokens, 31,534 wordform types
- Let N = number of tokens, V = vocabulary = number of types
- General wisdom: V > O(sqrt(N))

Issues in Tokenization
- Finland's capital → Finland? Finlands? Finland's?
- what're, I'm, isn't → what are, I am, is not
- Hewlett-Packard → Hewlett and Packard as two tokens?
- state-of-the-art: break up?
- lowercase, lower-case, lower case?
- San Francisco, New York: one token or two?
- Words with punctuation: m.p.h., PhD.
(Slide from Chris Manning)

Tokenization: language issues
- French: L'ensemble → one token or two? L? L'? Le? We want l'ensemble to match with un ensemble.
- German noun compounds are not segmented: Lebensversicherungsgesellschaftsangestellter, 'life insurance company employee'. German retrieval systems benefit greatly from a compound splitter module.
(Slide from Chris Manning)

Tokenization: language issues
- Chinese and Japanese have no spaces between words:
  - 莎拉波娃现在居住在美国东南部的佛罗里达。
  - 莎拉波娃 现在 居住 在 美国 东南部 的 佛罗里达
  - Sharapova now lives in US southeastern Florida
- Further complicated in Japanese, with multiple alphabets intermingled, and dates/amounts in multiple formats:
  - フォーチュン500社は情報不足のため時間あた$500K(約6,000万円) — mixing Katakana, Hiragana, Kanji, and Romaji
  - End-users can express a query entirely in hiragana!
(Slide from Chris Manning)

Word Segmentation in Chinese
- Words are composed of characters
- Characters are generally 1 syllable and 1 morpheme
- The average word is 2.4 characters long
- Standard segmentation algorithm: Maximum Matching (also called Greedy)

Maximum Matching Word Segmentation Algorithm
Given a wordlist of Chinese and a string:
1) Start a pointer at the beginning of the string
2) Find the longest word in the dictionary that matches the string starting at the pointer
3) Move the pointer over that word in the string
4) Go to step 2
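To make the greedy procedure concrete, here is a minimal sketch of maximum matching in Python. The wordlist, the max_word_len lookahead bound, and the single-character fallback when nothing matches are illustrative assumptions, not part of the slides.

```python
def max_match(text, wordlist, max_word_len=5):
    """Greedy (maximum matching) segmentation sketch.

    wordlist is assumed to be a set of known words; max_word_len caps how
    far ahead we look (an assumption for this sketch). If no dictionary
    word matches at the pointer, emit a single character and move on.
    """
    words, pointer = [], 0
    while pointer < len(text):
        # Try the longest candidate first, then shrink toward one character.
        for end in range(min(len(text), pointer + max_word_len), pointer, -1):
            candidate = text[pointer:end]
            if candidate in wordlist or end == pointer + 1:
                words.append(candidate)
                pointer = end
                break
    return words

# The English example discussed below:
wordlist = {"the", "table", "down", "there", "theta", "bled", "own"}
print(max_match("thetabledownthere", wordlist))   # -> ['theta', 'bled', 'own', 'there']
```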
English failure example (Palmer 00)
- "the table down there" → thetabledownthere → "theta bled own there"
- But maximum matching works astonishingly well in Chinese:
  - 莎拉波娃现在居住在美国东南部的佛罗里达。
  - 莎拉波娃 现在 居住 在 美国 东南部 的 佛罗里达
- Modern algorithms do better still: probabilistic segmentation, using "sequence models" like HMMs that we'll see in two weeks

Normalization
- Need to "normalize" terms: for IR, indexed text and query terms must have the same form (we want to match U.S.A. and USA)
- We most commonly implicitly define equivalence classes of terms, e.g., by deleting periods in a term
- The alternative is asymmetric expansion:
  - Enter: window → Search: window, windows
  - Enter: windows → Search: Windows, windows, window
  - Enter: Windows → Search: Windows
- Potentially more powerful, but less efficient
(Slide from Chris Manning)

Case folding
- For IR: reduce all letters to lower case
  - Exception: upper case in mid-sentence? e.g., General Motors; Fed vs. fed; SAIL vs. sail
  - Often best to lowercase everything, since users will use lowercase regardless of "correct" capitalization
- For sentiment analysis, MT, and information extraction, case is helpful ("US" versus "us" is important)
(Slide from Chris Manning)

Lemmatization
- Reduce inflectional/variant forms to the base form, e.g.:
  - am, are, is → be
  - car, cars, car's, cars' → car
  - "the boy's cars are different colors" → "the boy car be different color"
- Lemmatization implies doing "proper" reduction to the dictionary headword form
(Slide from Chris Manning)

Stemming
- Reduce terms to their "roots" before indexing
- "Stemming" is crude chopping of "affixes" (language dependent), e.g., automate(s), automatic, automation all reduced to automat
- "for example compressed and compression are both accepted as equivalent to compress" becomes "for exampl compress and compress ar both accept as equival to compress"
(Slide from Chris Manning)

Porter's algorithm
- The commonest algorithm for stemming English
- A sequence of phases; each phase consists of a set of rules:
  - sses → ss
  - ies → i
  - ational → ate
  - tional → tion
- Some rules only apply to multi-syllable words, e.g. (syl > 1) EMENT → ø: replacement → replac, but cement → cement
(Slide from Chris Manning)

More on Morphology
- Morphology: how words are built up from smaller meaningful units called morphemes
  - Stems: the core meaning-bearing units
  - Affixes: bits and pieces that adhere to stems to change their meanings and grammatical functions

Dealing with complex morphology is sometimes necessary
- Machine translation: need to know that the Spanish words quiero ('I want') and quieres ('you want') are both related to querer 'want'
- Other languages require segmenting morphemes, e.g. Turkish:
  - Uygarlastiramadiklarimizdanmissinizcasina: '(behaving) as if you are among those whom we could not civilize'
  - Uygar 'civilized' + las 'become' + tir 'cause' + ama 'not able' + dik 'past' + lar 'plural' + imiz 'p1pl' + dan 'abl' + mis 'past' + siniz '2pl' + casina 'as if'

Sentence Segmentation
- ! and ? are relatively unambiguous
- The period "." is quite ambiguous:
  - Sentence boundary
  - Abbreviations like Inc. or Dr.
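To see why the period is the hard case, here is a deliberately naive splitter that breaks on sentence-final punctuation followed by whitespace; the regex and the example are illustrative assumptions for this sketch, and the classifier approach described next is what you would actually want.

```python
import re

def naive_sentence_split(text):
    """Split on ., !, or ? followed by whitespace -- a purposely naive sketch."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

sample = ("Mr. Sherwood said reaction to Sea Containers' proposal has been "
          "very positive. Sea Containers closed at $62.625.")
print(naive_sentence_split(sample))
# The abbreviation "Mr." yields a spurious boundary:
# ['Mr.', "Sherwood said reaction to Sea Containers' proposal has been very positive.",
#  'Sea Containers closed at $62.625.']
```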
General idea: build a binary classifier that
- Looks at a "."
- Decides EndOfSentence / NotEOS
- Could be hand-written rules, sequences of regular expressions, or machine learning

Determining if a word is end-of-utterance: a decision tree [slide shows an example tree]

More sophisticated decision tree features
- Prob(word with "." occurs at end-of-sentence)
- Prob(word after "." occurs at beginning-of-sentence)
- Length of word with "."
- Length of word after "."
- Case of word with ".": Upper, Lower, Cap, Number
- Case of word after ".": Upper, Lower, Cap, Number
- Punctuation after "." (if any)
- Abbreviation class of word with "." (month name, unit-of-measure, title, address name, etc.)
(From Richard Sproat's slides)

Learning Decision Trees
- DTs are rarely built by hand; hand-building is only possible for very simple features and domains
- Lots of algorithms exist for DT induction

II. Minimum Edit Distance

Spell-checking
- Non-word error detection: detecting "graffe"
- Non-word error correction: figuring out that "graffe" should be "giraffe"
- Context-dependent error detection and correction: figuring out that "war and piece" should be "war and peace"

Non-word error detection
- Any word not in a dictionary: assume it's a spelling error
- Need a big dictionary!

Isolated word error correction
- How do I fix "graffe"?
- Search through all words: graf, craft, grail, giraffe, ...
- Pick the one that's closest to graffe
- What does "closest" mean? We need a distance metric.
- The simplest one: edit distance
- (More sophisticated probabilistic ones: noisy channel)

Edit Distance
- The minimum edit distance between two strings is the minimum number of editing operations
  - Insertion
  - Deletion
  - Substitution
  needed to transform one into the other

Minimum Edit Distance
- Example: aligning "intention" and "execution" [slide shows the alignment and its edit transcript]
- If each operation has a cost of 1, the distance between them is 5
- If substitutions cost 2 (Levenshtein), the distance between them is 8

Defining Min Edit Distance
- For two strings, S1 of length n and S2 of length m
- distance(i,j) or D(i,j) is the edit distance of S1[1..i] and S2[1..j], i.e., the minimum number of edit operations needed to transform the first i characters of S1 into the first j characters of S2
- The edit distance of S1, S2 is D(n,m)
- We compute D(n,m) by computing D(i,j) for all i (0 ≤ i ≤ n) and j (0 ≤ j ≤ m)

Defining Min Edit Distance
- Base conditions: D(i,0) = i, D(0,j) = j
- Recurrence relation:
    D(i,j) = min of
      D(i-1, j) + 1
      D(i, j-1) + 1
      D(i-1, j-1) + 2 if S1(i) ≠ S2(j), + 0 if S1(i) = S2(j)
  (substitution cost 2, as in the Levenshtein table below)

Dynamic Programming
- A tabular computation of D(n,m)
- Bottom-up: we compute D(i,j) for small i, j, and compute larger D(i,j) from previously computed smaller values

The Edit Distance Table
The bottom row and left column hold the base conditions D(0,j) = j and D(i,0) = i; filling in the rest of the table gives:

    N  9  8  9 10 11 12 11 10  9  8
    O  8  7  8  9 10 11 10  9  8  9
    I  7  6  7  8  9 10  9  8  9 10
    T  6  5  6  7  8  9  8  9 10 11
    N  5  4  5  6  7  8  9 10 11 10
    E  4  3  4  5  6  7  8  9 10  9
    T  3  4  5  6  7  8  7  8  9  8
    N  2  3  4  5  6  7  8  7  8  7
    I  1  2  3  4  5  6  7  6  7  8
    #  0  1  2  3  4  5  6  7  8  9
       #  E  X  E  C  U  T  I  O  N

Suppose we want the alignment too
- We can keep a "backtrace"
- Every time we enter a cell, remember where we came from
- Then when we reach the end, we can trace back from the upper right corner to get an alignment

Backtrace
[The same table, with each cell annotated with a pointer to the neighboring cell (left, down, or diagonal) it was computed from; the arrows are not reproduced here.]
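A minimal sketch of the tabular computation with a backtrace, using the Levenshtein costs above (insertion and deletion cost 1, substitution cost 2). The function name, the pointer encoding, and the returned edit script are choices made for this sketch, not from the slides.

```python
def min_edit_distance(source, target, ins_cost=1, del_cost=1, sub_cost=2):
    """Bottom-up DP for minimum edit distance, with a backtrace.

    Defaults match the Levenshtein setting used in the table above
    (substitution = 2). Returns the distance and the edit operations
    recovered by tracing pointers back from D[n][m].
    """
    n, m = len(source), len(target)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]

    # Base conditions: D(i,0) = i, D(0,j) = j (with unit costs).
    for i in range(1, n + 1):
        D[i][0], ptr[i][0] = D[i - 1][0] + del_cost, "del"
    for j in range(1, m + 1):
        D[0][j], ptr[0][j] = D[0][j - 1] + ins_cost, "ins"

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = D[i - 1][j - 1] + (0 if source[i - 1] == target[j - 1] else sub_cost)
            choices = [(D[i - 1][j] + del_cost, "del"),
                       (D[i][j - 1] + ins_cost, "ins"),
                       (diag, "sub/match")]
            D[i][j], ptr[i][j] = min(choices, key=lambda c: c[0])

    # Trace back from the corner to recover one optimal edit script.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        op = ptr[i][j]
        ops.append(op)
        if op == "del":
            i -= 1
        elif op == "ins":
            j -= 1
        else:
            i, j = i - 1, j - 1
    return D[n][m], list(reversed(ops))

distance, script = min_edit_distance("intention", "execution")
print(distance)   # -> 8, matching the top-right cell of the table above
```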
Adding Backtrace to MinEdit
- Base conditions: D(i,0) = i, D(0,j) = j
- Recurrence relation:
    D(i,j) = min of
      D(i-1, j) + 1          (Case 1)
      D(i, j-1) + 1          (Case 2)
      D(i-1, j-1) + 2 if S1(i) ≠ S2(j), + 0 if S1(i) = S2(j)   (Case 3)
- ptr(i,j) records which case produced the minimum: LEFT, DOWN, or DIAG

MinEdit with Backtrace [slide shows the worked table annotated with pointer arrows]

Performance
- Time: O(nm)
- Space: O(nm)
- Backtrace: O(n+m)

Weighted Edit Distance
- Why would we add weights to the computation? How?
- Confusion matrix [slide shows a spelling confusion matrix]

Weighted Minimum Edit Distance [slide shows the recurrence with per-character insertion, deletion, and substitution costs]

Why "Dynamic Programming"?
"I spent the Fall quarter (of 1950) at RAND. My first task was to find a name for multistage decision processes. An interesting question is, Where did the name, dynamic programming, come from? The 1950s were not good years for mathematical research. We had a very interesting gentleman in Washington named Wilson. He was Secretary of Defense, and he actually had a pathological fear and hatred of the word, research. I'm not using the term lightly; I'm using it precisely. His face would suffuse, he would turn red, and he would get violent if people used the term, research, in his presence. You can imagine how he felt, then, about the term, mathematical. The RAND Corporation was employed by the Air Force, and the Air Force had Wilson as its boss, essentially. Hence, I felt I had to do something to shield Wilson and the Air Force from the fact that I was really doing mathematics inside the RAND Corporation. What title, what name, could I choose? In the first place I was interested in planning, in decision making, in thinking. But planning, is not a good word for various reasons. I decided therefore to use the word, "programming." I wanted to get across the idea that this was dynamic, this was multistage, this was time-varying. I thought, let's kill two birds with one stone. Let's take a word that has an absolutely precise meaning, namely dynamic, in the classical physical sense. It also has a very interesting property as an adjective, and that is it's impossible to use the word, dynamic, in a pejorative sense. Try thinking of some combination that will possibly give it a pejorative meaning. It's impossible. Thus, I thought dynamic programming was a good name. It was something not even a Congressman could object to. So I used it as an umbrella for my activities."
— Richard Bellman, "Eye of the Hurricane: An Autobiography," 1984

Other uses of Edit Distance in text processing
- Evaluating machine translation and speech recognition (S = substitution, I = insertion, D = deletion):

    R  Spokesman confirms      senior government adviser was shot
    H  Spokesman said     the  senior            adviser was shot dead
                 S        I           D                            I

- Entity extraction and coreference:
  - "IBM Inc. announced today" ... "IBM's profits"
  - "Stanford President John Hennessy announced yesterday" ... "for Stanford University President John Hennessy"
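Returning to the weighted variant mentioned above: it only changes where the costs come from. Below is a hypothetical sketch in which per-character insertion, deletion, and substitution costs are supplied as functions (in practice they might be estimated from a spelling confusion matrix); the particular cost values are made up for illustration.

```python
def weighted_edit_distance(source, target, ins_cost, del_cost, sub_cost):
    """Minimum edit distance where each operation's cost depends on the characters.

    ins_cost(y), del_cost(x), and sub_cost(x, y) are caller-supplied functions;
    estimating them from a confusion matrix is up to the caller (hypothetical here).
    """
    n, m = len(source), len(target)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + del_cost(source[i - 1])
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + ins_cost(target[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if source[i - 1] == target[j - 1] else sub_cost(source[i - 1], target[j - 1])
            D[i][j] = min(D[i - 1][j] + del_cost(source[i - 1]),
                          D[i][j - 1] + ins_cost(target[j - 1]),
                          D[i - 1][j - 1] + sub)
    return D[n][m]

# Made-up example costs: confusing "a" and "e" is cheap, any other substitution costs 2.
cheap_pairs = {("a", "e"), ("e", "a")}
print(weighted_edit_distance(
    "graffe", "giraffe",
    ins_cost=lambda y: 1.0,
    del_cost=lambda x: 1.0,
    sub_cost=lambda x, y: 0.5 if (x, y) in cheap_pairs else 2.0,
))
```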
Edit distance in computational genomics

[Background slides, © 1997-2005 Coriell Institute for Medical Research: The Cell; Chromosomes (telomere, centromere; DNA wrapped around nucleosomes of ~146bp, histones H1, H2A, H2B, H3, H4, chromatin); Nucleotide (base) structure (purines A and G, pyrimidines C and T; 5' and 3' ends of the backbone, e.g., "AGACC")]

Genes & Proteins
- Double-stranded DNA is transcribed into single-stranded RNA, which is translated into protein [slide shows an example DNA sequence, its RNA transcript, and the resulting protein]
- Proteins are chains of amino acids; there are 20 amino acids

Sequence Alignment
    AGGCTATCACCTGACCTCCAGGCCGATGCCC
    TAGCTATCACGACCGCGGTCGATTTGCCCGAC

    -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
    TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Definition: Given two strings x = x1x2...xM and y = y1y2...yN, an alignment is an assignment of gaps to positions 0,...,M in x and 0,...,N in y, so as to line up each letter in one sequence with either a letter or a gap in the other sequence.
(Slide from Serafim Batzoglou)

Why sequence alignment?
- Comparing genes or regions from different species: to find important regions, determine function, and uncover evolutionary forces
- Assembling fragments to sequence DNA
- Comparing individuals to look for mutations

DNA sequencing
- How we obtain the sequence of nucleotides of a species [slide shows a stretch of sequence: ...ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT ...]
(Slide from Serafim Batzoglou)

Whole Genome Shotgun Sequencing
- A genomic segment is cut many times at random ("shotgun")
- Get one or two reads from each segment, ~900 bp each
(Slide from Serafim Batzoglou)

Fragment Assembly [slide shows reads being assembled into a genome]
(Slide from Serafim Batzoglou)

Steps to Assemble a Genome
1. Find overlapping reads
2. Merge some "good" pairs of reads into longer contigs
3. Link contigs to form supercontigs
4. Derive consensus sequence  ..ACGATTACAATAGGTT..

Some Terminology
- read: a 500-900 long word that comes out of the sequencer
- mate pair: a pair of reads from the two ends of the same insert
- contig: a contiguous sequence formed by several overlapping reads with no gaps
- supercontig (scaffold): an ordered and oriented set of contigs, usually linked by mate pairs
- consensus: the sequence derived from the multiple alignment of reads in a contig
(Slide from Serafim Batzoglou)

Find Overlapping Reads
- Create local multiple alignments from the pairwise read alignments, e.g.:
    TAGATTACACAGATTACTGA
    TAG TTACACAGATTATTGA
    TAGATTACACAGATTACTGA
    (etc.)
(Slide from Serafim Batzoglou)

Derive Consensus Sequence
- [slide shows several aligned reads, e.g. TAGATTACACAGATTACTGACTTGATGGCGTAA CTA, with occasional disagreements between reads]
- Can derive each consensus base by weighted voting, etc.
(Slide from Serafim Batzoglou)
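As a toy illustration of this last step, here is a sketch that derives a consensus by simple per-column majority voting over a gapped multiple alignment; the reads shown are made up, and the quality-weighted voting used in real assemblers is omitted.

```python
from collections import Counter

def consensus(aligned_reads):
    """Per-column majority vote over a gapped multiple alignment (toy sketch).

    aligned_reads are equal-length strings over A/C/G/T/'-', where '-' marks
    positions a read does not cover; weighting by base quality is omitted.
    """
    out = []
    for col in range(len(aligned_reads[0])):
        votes = Counter(read[col] for read in aligned_reads if read[col] != "-")
        out.append(votes.most_common(1)[0][0] if votes else "-")
    return "".join(out)

reads = [
    "TAGATTACACAGATTACTGA----",
    "----TTACACAGATTATTGACTTC",
    "TAGATTACACAGATTACTGACTT-",
]
print(consensus(reads))
```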
Evolution at the DNA level
- Sequence edits: mutation and deletion, e.g.
    ...ACGGTGCAGTTACCA...
    ...AC----CAGTCCACCA...
- Rearrangements: inversion, translocation, duplication
(Slide from Serafim Batzoglou)

Evolutionary Rates
[slide: from one generation to the next, some mutations are OK, others (X) are not — still OK?]
(Slide from Serafim Batzoglou)

Sequence conservation implies function
- Alignment is the key to:
  - Finding important regions
  - Determining function
  - Uncovering the evolutionary forces
(Slide from Serafim Batzoglou)

Sequence Alignment
(The alignment example and definition from the earlier "Sequence Alignment" slide are repeated here.)

Alignments in two fields
- In Natural Language Processing, we generally talk about distance (minimized) and weights
- In Computational Biology, we generally talk about similarity (maximized) and scores

Scoring Alignments
Rough intuition:
- Similar sequences evolved from a common ancestor
- Evolution changed the sequences from this ancestral sequence by mutations:
  - Replacements: one letter replaced by another
  - Deletion: deletion of a letter
  - Insertion: insertion of a letter
- Scoring of sequence similarity should examine how many operations took place

What is a good alignment?
AGGCTAGTT, AGCGAAGTTT

    AGGCTAGTT-
    AGCGAAGTTT        6 matches, 3 mismatches, 1 gap

    AGGCTA-GTT-
    AG-CGAAGTTT       7 matches, 1 mismatch, 3 gaps

    AGGC-TA-GTT-
    AG-CG-AAGTTT      7 matches, 0 mismatches, 5 gaps
(Slide from Serafim Batzoglou)

Scoring Function
- Sequence edits to AGGCCTC:
  - Mutations: AGGACTC
  - Insertions: AGGGCCTC
  - Deletions: AGG.CTC
- Scoring function: Match: +m; Mismatch: -s; Gap: -d
- Score F = (# matches) × m - (# mismatches) × s - (# gaps) × d
(Slide from Serafim Batzoglou)

Example
x = AGTA, y = ATA; match score m = 1, mismatch score -1, gap score -1

    F(i,j)     i=0    A    G    T    A
    j=0          0   -1   -2   -3   -4
    j=1   A     -1    1    0   -1   -2
    j=2   T     -2    0    0    1    0
    j=3   A     -3   -1   -1    0    2

F(1,1) = max{ F(0,0) + s(A,A), F(0,1) - d, F(1,0) - d } = max{ 0 + 1, -1 - 1, -1 - 1 } = 1

Optimal alignment (score 2):
    AGTA
    A-TA
(Slide from Serafim Batzoglou)

The Needleman-Wunsch Matrix
- Rows and columns are indexed by x1...xM and y1...yN
- Every nondecreasing path from (0,0) to (M,N) corresponds to an alignment of the two sequences
- An optimal alignment is composed of optimal subalignments
(Slide from Serafim Batzoglou)

The Needleman-Wunsch Algorithm
1. Initialization
   a. F(0,0) = 0
   b. F(0,j) = -j × d
   c. F(i,0) = -i × d
2. Main iteration: filling in partial alignments
   For each i = 1...M, for each j = 1...N:
     F(i,j) = max of
       F(i-1, j-1) + s(xi, yj)   [case 1]
       F(i-1, j) - d             [case 2]
       F(i, j-1) - d             [case 3]
     Ptr(i,j) records which case gave the maximum (DIAG, UP, or LEFT)
3. Termination: F(M,N) is the optimal score, and from Ptr(M,N) we can trace back the optimal alignment
(Slide from Serafim Batzoglou)

A variant of the basic algorithm
- Maybe it is OK to have an unlimited number of gaps at the beginning and end:
    ----------CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGC
    GCGAGTTCATCTATCAC--GACCGC--GGTCG--------------
- If so, we don't want to penalize gaps at the ends
(Slide from Serafim Batzoglou)

Different types of overlaps
- Example: two overlapping "reads" from a sequencing project
- Example: searching for a mouse gene within a human chromosome
(Slide from Serafim Batzoglou)

The Overlap Detection variant
Changes:
1. Initialization: for all i, j: F(i,0) = 0 and F(0,j) = 0
2. Termination: F_OPT = max( max_i F(i,N), max_j F(M,j) )
(Slide from Serafim Batzoglou)
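A compact sketch of the Needleman-Wunsch recurrence just described, scoring +m for a match, -s for a mismatch, and -d per gap as in the AGTA/ATA example; the function and pointer names are mine, and ties are broken arbitrarily.

```python
def needleman_wunsch(x, y, m=1, s=1, d=1):
    """Global alignment by the Needleman-Wunsch recurrence (sketch).

    A match scores +m, a mismatch -s, and each gap -d, matching the
    AGTA / ATA example above. Returns F(M, N) and one optimal alignment.
    """
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    ptr = [[None] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        F[i][0], ptr[i][0] = -i * d, "up"
    for j in range(1, N + 1):
        F[0][j], ptr[0][j] = -j * d, "left"
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            diag = F[i - 1][j - 1] + (m if x[i - 1] == y[j - 1] else -s)
            F[i][j], ptr[i][j] = max((diag, "diag"),
                                     (F[i - 1][j] - d, "up"),
                                     (F[i][j - 1] - d, "left"),
                                     key=lambda t: t[0])
    # Trace the pointers back from (M, N) to recover an alignment.
    ax, ay, i, j = [], [], M, N
    while i > 0 or j > 0:
        step = ptr[i][j]
        if step == "diag":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i, j = i - 1, j - 1
        elif step == "up":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[M][N], "".join(reversed(ax)), "".join(reversed(ay))

print(needleman_wunsch("AGTA", "ATA"))   # -> (2, 'AGTA', 'A-TA'), as in the worked example
```

Changing the initialization to zeros and adding 0 as a fourth option in the max would give the Smith-Waterman local-alignment variant discussed next.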
The local alignment problem
- Given two strings x = x1...xM and y = y1...yN, find substrings x', y' whose similarity (optimal global alignment value) is maximum
- e.g., x = aaaacccccggggtta, y = ttcccgggaaccaacc
(Slide from Serafim Batzoglou)

Why local alignment — examples
- Genes are shuffled between genomes
- Portions of proteins (domains) are often conserved
(Slide from Serafim Batzoglou)

Cross-species genome similarity
- 98% of genes are conserved between any two mammals
- >70% average similarity in protein sequence
- [slide shows a four-way alignment of the "atoh" enhancer in human, mouse, rat, and fugu fish: the three mammalian sequences are nearly identical, while the fugu sequence diverges]
(Slide from Serafim Batzoglou)

The Smith-Waterman algorithm
- Idea: ignore badly aligning regions
- Modifications to Needleman-Wunsch:
  - Initialization: F(0,j) = F(i,0) = 0
  - Iteration: F(i,j) = max of
      0
      F(i-1, j) - d
      F(i, j-1) - d
      F(i-1, j-1) + s(xi, yj)
(Slide from Serafim Batzoglou)

The Smith-Waterman algorithm: termination
1. If we want the best local alignment: F_OPT = max over i,j of F(i,j); find F_OPT and trace back
2. If we want all local alignments scoring > t: for all i, j find F(i,j) > t and trace back? This is complicated by overlapping local alignments
(Slide from Serafim Batzoglou)

Local Alignment Example (worked over several slides)
s = TAATA, t = TACTAA
[the filled Smith-Waterman tables are not reproduced here]
(Slides from Hasan Oğul)

Summary
- Tokenization
  - Word tokenization
  - Normalization
  - Lemmatization and stemming
  - Sentence tokenization
- Minimum edit distance
  - Levenshtein distance
  - Needleman-Wunsch (weighted global alignment)
  - Smith-Waterman (local alignment)
- Applications to:
  - Spell correction, machine translation, entity extraction
  - DNA fragment combination, evolutionary similarity, mutation detection