L5-CFG - Introduction to Compiler Design Syntax Analysis...


Introduction to Compiler Design
Syntax Analysis: Context-Free Grammar
Professor Yi-Ping You, Department of Computer Science
http://www.cs.nctu.edu.tw/~ypyou/
Introduction to Compiler Design, Spring 2010

Outline
- Overview of Syntax Analysis
- Context-Free Grammars
- Writing a Grammar
- Grammar Transformation
- Other Issues on CFGs

Where Is the Syntax Analyzer?
- Source text: if (b == 0) a = b;
- The lexical analyzer (scanner) turns it into the token stream: if ( b == 0 ) a = b ;
- The syntax analyzer (parser) turns the token stream into an abstract syntax tree or parse tree: an if node whose children are the comparison == (operands b and 0) and the assignment = (operands a and b)

Parsing Analogy
- Syntax analysis for natural languages
  - Recognize whether a sentence is grammatically correct
  - Identify the function of each word
- Example: "I gave him the book"
  - The sentence consists of a subject (I), a verb (gave), an indirect object (him), and an object
  - The object is a noun phrase made of an article (the) and a noun (book)

Syntax Analysis Overview
- Goal: determine if the input token stream satisfies the syntax of the program
- What do we need to do this?
  - An expressive way to describe the syntax
  - A mechanism that determines if the input token stream satisfies the syntax description
- For lexical analysis
  - Regular expressions describe tokens
  - Finite automata = mechanisms to generate tokens from the input stream

Just Use Regular Expressions?
- REs can expressively describe tokens
  - Easy to implement via DFAs
- So just use them to describe the syntax of a programming language?
  - NO! They don't have enough power to express any nontrivial syntax
  - Example: nested constructs (blocks, expressions, statements)
  - Detect balanced braces: {{} {} {{} { }}} or { { { { { ... } } } } }
    - We need unbounded counting!
    - FAs cannot count except in a strictly modulo fashion
    - Counting needs FAs with stacks, i.e., pushdown automata
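The "unbounded counting" point on the regular-expressions slide can be made concrete with a short sketch. The function name `braces_balanced` is a hypothetical helper, not something from the slides, and the integer counter stands in for the stack of a pushdown automaton, which is enough here only because there is a single kind of bracket. A DFA has a fixed number of states, so it cannot play the role of `depth`, which may grow without bound.

```python
def braces_balanced(text):
    """Return True if every '{' in text is closed by a later '}'.

    Hypothetical helper for illustration; not taken from the slides.
    """
    depth = 0  # number of currently open braces; has no upper bound
    for ch in text:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:  # a '}' appeared with no matching '{'
                return False
        # every other character (e.g. a space) is ignored
    return depth == 0  # balanced only if nothing is left open


if __name__ == "__main__":
    print(braces_balanced("{{} {} {{} { }}}"))  # the balanced example from the slide
    print(braces_balanced("{ { { { {"))         # five opening braces, never closed
```

With several bracket kinds a real stack would be needed instead of a counter, which is the pushdown-automaton view mentioned on the slide.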

Chomsky Hierarchy
- Grammar classes, from most to least general: recursively enumerable (R.E.), context-sensitive grammar, context-free grammar, regular expression

Context-Free Grammars (CFGs)
- Definition: G = (T, N, S, P)
  - T: terminals = token (name) or ε
  - N: nonterminals = syntactic variables
  - S: start symbol = special nonterminal
  - P: productions of the form head → body
    - head = a single nonterminal
    - body = zero or more terminals and nonterminals
    - Productions specify how nonterminals may be expanded
- E.g.,
    stmt → if expr then stmt
    stmt → if expr then stmt else stmt
    stmt → ε
    expr → term relop term
    expr → term
    term → id
    term → number
- Regular definitions for the tokens
    digit  → [0-9]
    digits → digit+
    number → digits(.digits)?(E[+-]?digits)?
    letter → [A-Za-z]
    id     → letter(letter|digit)*
    if     → if
    then   → then
    else   → else
    relop  → < | > | <= | >= | = | <>

Notational Conventions
- Terminals
  - Lowercase letters early in the alphabet (a, b, c)
  - Operator symbols (+, *)
  - Punctuation symbols (parentheses, comma)
  - Digits (0, 1, ..., 9)
  - Boldface strings (id, if)
- Nonterminals
  - Uppercase letters early in the alphabet (A, B, C)
  - The letter S, usually the start symbol
  - Lowercase italic names (expr, stmt)
  - Program constructs: uppercase letters, e.g., E, T, F for expression, term, and factor, respectively
- Uppercase letters late in the alphabet (X, Y, Z): grammar symbols (i.e., either nonterminals or terminals)
- Lowercase letters late in the alphabet (u, v, ..., z): strings of terminals
- Lowercase Greek letters (α, β, γ): strings of grammar symbols
- Shorthand notation: a vertical bar combines productions that share a head
    term → id
    term → number
  is written as
    term → id | number
- Unless stated otherwise, the head of the first production is the start symbol

Notational Convention: An Example
- Grammar for simple arithmetic expressions
    expression → expression + term
    expression → expression - term
    expression → term
    term → term * factor
    term → term / factor
    term → factor
    factor → ( expression )
    factor → id
- The same grammar in shorthand
    E → E + T | E - T | T
    T → T * F | T / F | F
    F → ( E ) | id

Derivations
- The purpose of a grammar is to derive strings in the language defined by the grammar
- E.g., consider the grammar
    E → E + E | E * E | - E | ( E ) | id
  - E → - E signifies that if E denotes an expression, then - E must also denote an expression, i.e., E can be replaced by - E
  - E derives - E, written as E ⇒ - E
  - E ⇒ - E ⇒ - (E) ⇒ - (id)

Derivations (Cont'd)
- α ⇒ β: β can be derived from α in one step
- ⇒+ : derives in one or more steps
- ⇒* : derives in any number of steps (zero or more)
  - α ⇒* α
  - If α ⇒* β and β ⇒* γ, then α ⇒* γ
- If S ⇒* α (S is the start symbol of a grammar G), α is a sentential form of G
  - A sentential form may contain both terminals and nonterminals
  - A sentence of G is a sentential form with no nonterminals

Context-Free Languages
- A language L(G) that can be generated by a context-free grammar G is called a context-free language
- L(G): the set of strings of terminals derived from the start symbol by repeatedly applying the productions, i.e., the set of sentences
  - L(G) = {w | S ⇒* w}, where S is the start symbol and w is a sequence of terminals
  - w is a sentence of G

More on Derivations
- E → E + E | E * E | - E | ( E ) | id
- The string - (id + id) is a sentence of the above grammar because there is a derivation
    E ⇒ - E ⇒ - (E) ⇒ - (E + E) ⇒ - (id + E) ⇒ - (id + id)
- An alternative derivation
    E ⇒ - E ⇒ - (E) ⇒ - (E + E) ⇒ - (E + id) ⇒ - (id + id)
- At each step in a derivation, there are two choices to be made
  - Choose which nonterminal to replace
  - Choose a production with that nonterminal as head

Derivation Order
- Leftmost derivation, written ⇒_lm: always substitute the leftmost nonterminal
    E ⇒_lm - E ⇒_lm - (E) ⇒_lm - (E + E) ⇒_lm - (id + E) ⇒_lm - (id + id)
  - If S ⇒*_lm α (S is the start symbol of a grammar G), α is a left-sentential form of G
- Rightmost derivation (canonical derivation), written ⇒_rm: always substitute the rightmost nonterminal
  - If S ⇒*_rm α (S is the start symbol of a grammar G), α is a right-sentential form of G

Parse Trees and Derivations
- Sequence of parse trees for E ⇒ - E ⇒ - (E) ⇒ - (E + E) ⇒ - (id + E) ⇒ - (id + id)
  [Figure: the parse tree after each derivation step, growing from the root E to the complete tree for - (id + id)]
- The leaves of a parse tree (from left to right) constitute a sentential form, called the yield or frontier of the tree

Parse Trees and Derivations (Cont'd)
- Many-to-one relationship between derivations and parse trees, since a parse tree ignores the derivation order
  - E.g., E ⇒* - (id + id): the parse trees of the leftmost and rightmost derivations are the same
    [Figure: the parse tree for - (id + id)]
- A parse tree always has a unique leftmost derivation and a unique rightmost derivation
  - One-to-one relationship

More on CFGs
- CFGs are powerful enough to express the syntax of most programming languages
- Derivation
  - Successive application of productions starting from S
- Acceptance?
  - Determine if there is a derivation for an input token stream

A Parser
- Inputs: a context-free grammar G and a token stream s (from the scanner)
- Output: yes, if s is in L(G); no, otherwise, together with error messages
- Syntax analyzers (parsers) = CFG acceptors which also output the corresponding derivation when the token stream is accepted
  - Various kinds: LL(k), LR(k), SLR, LALR

Syntax Error-Recovery Strategies
- Panic mode: simple
  - Discard input symbols until one of the synchronizing tokens is found
  - Synchronizing tokens are usually delimiters, such as semicolons or braces
- Phrase level
  - Perform local correction on the remaining input
  - E.g., replace a comma with a semicolon, insert or delete a semicolon, ...
- Error productions
  - Augment the grammar with productions that generate erroneous constructs
- Global correction: theoretical
  - Choose a minimal sequence of changes to obtain a globally least-cost correction

Writing a Grammar
- Context-free grammars are capable of describing most, but not all, of the syntax of programming languages
  - E.g., "identifiers should be declared before they are used" cannot be described by a CFG
  - Leave it to semantic analysis

Common Grammar Problems
- Lists: zero or more IDs separated by commas
- Note that it is easy to express one or more IDs:
    idlist → idlist , id | id
  which generates [id], [id, id], [id, id, id], ...
- For zero or more IDs,
    idlist → ε | id | idlist , idlist
  won't work because it can generate: id, , id
    idlist → ε | idlist , id | id
  won't work either because it can generate: , id, id
- We should separate out the empty list from the general list of one or more IDs
    idlist → ε | nonEmptyIdlist
    nonEmptyIdlist → nonEmptyIdlist , id | id

Grammar Transformations
- Make a grammar more suitable for parsing
  - Eliminating ambiguity
  - Eliminating left recursion
  - Left factoring
- Left-recursion elimination and left factoring are useful for making grammars suitable for top-down parsing
  [Figure labels: top-down parsing, bottom-up parsing]

Ambiguity
- Ambiguous grammar: E → E + E | E * E | - E | ( E ) | id
- The grammar permits two distinct leftmost derivations for the sentence id + id * id
  - E ⇒ E + E ⇒ id + E ⇒ id + E * E ⇒ id + id * E ⇒ id + id * id
    (parse tree with + at the root and E * E as its right operand)
  - E ⇒ E * E ⇒ E + E * E ⇒ id + E * E ⇒ id + id * E ⇒ id + id * id
    (parse tree with * at the root and E + E as its left operand)

Ambiguous Grammar
- Ambiguity implies multiple parse trees
  - A grammar G is ambiguous if for the same sentence it produces more than one leftmost derivation or more than one rightmost derivation
- Problems with an ambiguous grammar
  - Can make parsing difficult
  - Can impact the semantics of the language
    - Thus, program meaning is not defined!
    - E.g., precedence is not uniquely defined: for 1 + 2 * 3, the tree for 1 + (2 * 3) evaluates to 7, while the tree for (1 + 2) * 3 evaluates to 9

Eliminating Ambiguity
- Rewrite the grammar to eliminate ambiguity
  - There are many ways to rewrite the grammar
  - The new grammar should accept the same language
- For each input string, there may be multiple parse trees
  - Each has a different semantic meaning
  - Which one do we want?
- The rewritten grammar should be based on the desired semantics
- There is no general algorithm to rewrite ambiguous grammars

Rewrite Ambiguous Grammar
- Express precedence correctly
  - Use one nonterminal for each precedence level
  - Start with the lower precedence
- Original grammar
    E → E + E | E * E | - E | ( E ) | id
- Rewritten grammar
    E → E + E | T
    T → T * T | F
    F → - E | ( E ) | id
- Input: id + id * id
  [Figure: a tree that starts with E ⇒ T ⇒ T * T ends in an error; the surviving tree starts with E ⇒ E + E, and its right E derives T * T]

More Problems with Associativity
- However, the above grammar is still ambiguous, and parse trees may not express the associativity of "+" and "*"
    E → E + E | T
    T → T * T | F
    F → - E | ( E ) | id
- Input: id + id + id
  [Figure: two parse trees, one grouping the input as (id+id)+id and the other as id+(id+id)]

Rewrite Ambiguous Grammar (Cont'd)
- Rewrite grammars considering associativity rules
- Problems with associativity:
  - The rule E → E + E has E on both sides of "+"
  - Need to make the second E some other nonterminal that is parsed earlier
  - Similarly for the multiplication rule T → T * T
- Before
    E → E + E | T
    T → T * T | F
    F → - E | ( E ) | id
- After
    E → E + T | T
    T → T * F | F
    F → - E | ( E ) | id
  [Figure: the only parse tree for id + id + id now groups it as (id+id)+id]

Rules for Associativity
- Recursive productions:
  - E → E + T is called a left recursive production: A ⇒+ Aα
  - E → T + E is called a right recursive production: A ⇒+ αA
  - E → E + E is both left and right recursive
- If one wants left associativity, use left recursion
- If one wants right associativity, use right recursion

Ambiguity: Another Example (1/2)
- if statement
    stmt → if-stmt | while-stmt | ...
    if-stmt → if expr then stmt else stmt
            | if expr then stmt
- Input: if (a) then if (b) then x = c else x = d
- Two parse trees are possible
  - if (a) then {if (b) then x = c} else x = d
  - if (a) then {if (b) then x = c else x = d}

Ambiguity: Another Example (2/2)
- How to rewrite the if-stmt grammar to eliminate ambiguity? (desired semantics: match the else with the closest if)
  - By defining different if statements: unmatched and matched
    - Matched: then and else parts are always in pairs
  - By disallowing a matched stmt to go back to an unmatched stmt
- Solution
    if-stmt → unmatched-stmt | matched-stmt
    matched-stmt → if expr then matched-stmt else matched-stmt | others
    unmatched-stmt → if expr then matched-stmt else unmatched-stmt
    unmatched-stmt → if expr then if-stmt
  - Once in matched-stmt, never go back to unmatched-stmt
  - In unmatched-stmt, both matched and unmatched statements need to be considered
    - Anything in the then part has to be matched if there is an else part
    - It is possible to have an unmatched statement in the else part

Ambiguity
- Rewritten grammar
  - Less intuitive
  - Harder to comprehend for the language designer as well as the user of the language
- Current practice
  - Expressions: precedence is desired, so it is good to use the grammar with precedence
  - If: the language definition still has the ambiguous grammar
    - Use some ad hoc methods to resolve the problem (which is also easy to deal with)

Elimination of Left Recursion
- A grammar is left recursive if it has a nonterminal A such that there is a derivation A ⇒+ Aα for some string α
- Top-down parsing methods cannot handle left-recursive grammars, e.g., A → Aα
  [Figure: A expands to Aα again and again without consuming any input]
- Eliminating left recursion
    A → Aα | β
  becomes
    A → βA'
    A' → αA' | ε
- E.g.,
    E → E + T | T
    T → T * F | F
    F → - E | ( E ) | id
  becomes
    E → T E'
    E' → + T E' | ε
    T → F T'
    T' → * F T' | ε
    F → - E | ( E ) | id

Elimination of Left Recursion (Cont'd)
- The general case
    A → Aα1 | Aα2 | ... | Aαm | β1 | β2 | ... | βn
  becomes
    A → β1A' | β2A' | ... | βnA'
    A' → α1A' | α2A' | ... | αmA' | ε
- This only eliminates immediate left recursion
  - It does not work for left recursion involving derivations of two or more steps
  - E.g.,
      S → A a | b
      A → A c | S d | ε

Elimination of Left Recursion (Cont'd)
- Input: grammar G with no cycles or ε-productions (no A ⇒+ A and no A → ε)
- Output: an equivalent grammar with no left recursion

    arrange the nonterminals in some order A1, A2, ..., An
    for (each i from 1 to n) {
        for (each j from 1 to i-1) {    // j < i
            replace each production Ai → Aj γ by the productions
                Ai → δ1 γ | δ2 γ | ... | δk γ,
            where Aj → δ1 | δ2 | ... | δk are the current Aj-productions
        }
        eliminate immediate left recursion among the Ai-productions
        // Ak is "clean" for all k ≤ i; i.e., Ak → Am α must have m > k
    }

Left-Recursion Elimination: An Example
    S → A a | b
    A → A c | S d | ε
- The algorithm is not guaranteed to work because there is a production A → ε
- Order the nonterminals: S (A1), A (A2)
- i = 1: nothing happens (no inner loop, no immediate left recursion)
- i = 2: replace S in A → S d, using the rule A2 → A1 γ ⇒ A2 → δ1 γ | δ2 γ | ... | δk γ, where A1 → δ1 | δ2 | ... | δk
    A → A c | A a d | b d | ε
- Eliminate immediate left recursion
    S → A a | b
    A → b d A' | A'
    A' → c A' | a d A' | ε

Left Factoring
- A grammar transformation that is useful for producing a grammar suitable for top-down parsing
  - Used when the choice between two alternative A-productions is not clear
  - E.g.,
      stmt → if expr then stmt else stmt
           | if expr then stmt
- Transformation
    A → αβ1 | αβ2
  becomes
    A → αA'
    A' → β1 | β2
- In general,
    A → αβ1 | αβ2 | ... | αβn | γ
  becomes
    A → αA' | γ
    A' → β1 | β2 | ... | βn

Left Factoring: An Example
- Dangling-else grammar
    stmt → if expr then stmt else stmt
         | if expr then stmt
         | a
    expr → b
- After left factoring
    stmt → if expr then stmt stmt' | a
    stmt' → else stmt | ε
    expr → b

Verifying the Language Generated by a Grammar
- To prove that a grammar G generates a language L
  - Every string generated by G is in L
  - Every string in L can be generated by G
- Ex: G: S → (S) S | ε
  - L(G) = ? Strings of balanced parentheses
  - How to prove that?
    - Every sentence derived from S is balanced
    - Every balanced string is derived from S

Every sentence derived from S is balanced
- Grammar: S → (S) S | ε
- Basis: number of derivation steps n = 1
  - The only one-step derivation yields the empty string: balanced
- Induction: assume that all derivations of fewer than n steps produce balanced sentences, and consider a leftmost derivation of exactly n steps; it is of the form
    S ⇒_lm (S)S ⇒*_lm (x)S ⇒*_lm (x)y
  - The derivations of x and y from S take fewer than n steps
    ⇒ x, y: balanced
    ⇒ (x)y: balanced

Every balanced string is derivable from S
- Grammar: S → (S) S | ε
- Basis: length of string 0
  - Empty string: derivable
- Induction: every balanced string has even length
  - Assume that every balanced string of length less than 2n is derivable from S
  - Consider a balanced string w of length 2n, n ≥ 1
    - w begins with '('
    - Let (x) be the shortest nonempty prefix of w having an equal number of left and right parentheses
    - Then w can be written as w = (x)y, where x and y are balanced
    - x and y are of length less than 2n ⇒ x and y are derivable from S
      ⇒ We can find a derivation: S ⇒ (S)S ⇒* (x)S ⇒* (x)y
    - w = (x)y: also derivable from S

CFGs vs. REs
- Grammars are a more powerful notation than regular expressions
  - RE is a subset of CFG (regular grammar)
- We can inductively build a grammar for each RE

    RE         Grammar
    ε          S → ε
    a          S → a
    R1 R2      S → S1 S2
    R1 | R2    S → S1 | S2
    R1*        S → S1 S | ε

  where G1 = grammar for R1, with start symbol S1, and G2 = grammar for R2, with start symbol S2

NFA to CFG
- Constructing a grammar from an NFA
  - For each state i of the NFA, create a nonterminal Ai
  - If state i has a transition to state j on input a, add the production Ai → aAj
  - If i is an accepting state, add Ai → ε
  - If i is the start state, make Ai the start symbol
- Example NFA: start state 0 loops on a and b, 0 goes to 1 on a, 1 goes to 2 on b, 2 goes to 3 on b, and 3 is the accepting state
    A0 → aA0 | bA0 | aA1
    A1 → bA2
    A2 → bA3
    A3 → ε

Why Use REs for Lexical Syntax?
- Separating the syntax into lexical and non-lexical parts provides a convenient way of modularizing components
- Lexical rules of a language are quite simple
- Regular expressions provide a more concise and easier-to-understand notation for tokens
- Efficient lexical analyzers can be constructed automatically from regular expressions

Why REs Cannot Describe L = {a^n b^n | n ≥ 1}
- Suppose L can be described by a regular expression
- Then we could construct a DFA D with a finite number (k) of states to accept L
- For an input beginning with more than k a's, D must enter some state s_i twice: once after reading a^i and again after reading a^j, with j > i
  - Path labeled a^i from s_0 to s_i
  - Path labeled a^m from s_i back to s_i, where m = j - i and j > i, i.e., m > 0
  - Path labeled b^i from s_i to an accepting state f (a^i b^i is in the language)
- Then a^(i+m) b^i = a^(i+(j-i)) b^i = a^j b^i is also accepted, although j ≠ i (contradiction!)

Non-Context-Free Language Examples
- L = {a^n b^m c^n d^m | n ≥ 1 and m ≥ 1}
  - a^n b^m are the formal parameters defined in a procedure
  - c^n d^m are the matching numbers of actual parameters
- L = {wcw | w ∈ (a|b)*}
  - A variable should be declared before its use

Chomsky Hierarchy
- From most to least general: recursively enumerable (R.E.), context-sensitive grammar, context-free grammar, regular expression

Beyond Context-Free Languages
- Context-sensitive grammar
  - Also allows rules of the form αAβ → αγβ
    - Replace A by γ only if it is found in the context of α and β
    - The left side does not have to be a single nonterminal
  - α, β ∈ (N ∪ T)*
  - γ ∈ (N ∪ T)* − {ε} (no erase rule)
- Recursively enumerable grammar
  - Allows rules of the form α → β
  - α ∈ (N ∪ T)* N (N ∪ T)*, i.e., at least one nonterminal
  - β ∈ (N ∪ T)*

Next Topic: Parsers
- Inputs: a context-free grammar G and a token stream s (from the scanner)
- Output: yes, if s is in L(G); no, otherwise, together with error messages
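As a preview of the parser described above (answer yes if s is in L(G), no otherwise), here is a minimal sketch of a top-down acceptor for the left-recursion-free expression grammar from the Elimination of Left Recursion slide. Several things are assumptions for illustration rather than part of the slides: the function name `accepts`, the representation of tokens as plain strings, the ASCII "-" for the unary minus, and the choice of a hand-written recursive-descent style. It only recognizes; it does not build a parse tree or report where an error occurred.

```python
def accepts(tokens):
    """Return True if the token list is a sentence of the grammar

        E  -> T E'
        E' -> + T E' | ε
        T  -> F T'
        T' -> * F T' | ε
        F  -> - E | ( E ) | id

    Hypothetical helper; tokens are strings such as "id", "+", "*", "-", "(", ")".
    """
    pos = 0  # index of the next unread token

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def expect(tok):
        nonlocal pos
        if peek() != tok:
            raise SyntaxError(f"expected {tok!r} at position {pos}")
        pos += 1

    def parse_e():
        parse_t()
        while peek() == "+":  # E' -> + T E'; leaving the loop is E' -> ε
            expect("+")
            parse_t()

    def parse_t():
        parse_f()
        while peek() == "*":  # T' -> * F T'; leaving the loop is T' -> ε
            expect("*")
            parse_f()

    def parse_f():
        if peek() == "-":      # F -> - E
            expect("-")
            parse_e()
        elif peek() == "(":    # F -> ( E )
            expect("(")
            parse_e()
            expect(")")
        else:                  # F -> id
            expect("id")

    try:
        parse_e()
    except SyntaxError:
        return False
    return pos == len(tokens)  # accept only if every token was consumed


if __name__ == "__main__":
    print(accepts(["id", "+", "id", "*", "id"]))
    print(accepts(["id", "+"]))
```

Each nonterminal becomes one function, and the tail-recursive E' and T' productions are written as loops. With the original left-recursive rule E → E + T, `parse_e` would call itself before consuming any token and never terminate, which is why the transformation is needed for top-down parsing.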

This note was uploaded on 12/25/2010 for the course ALL 0204 taught by Professor 79979 during the Spring '10 term at National Chiao Tung University.
