01-01lexicalAnalysis


CSE 450: Compilers
K. Stirewalt

Lexical analysis

Topics:
– Issues and complexity of lexical analysis
– Regular expressions

Recall: Structure of a Compiler

[Figure: compiler pipeline from Source Language to Target Language. Stages: Lexical Analyzer, Syntax Analyzer, Semantic Analyzer, Int. Code Generator, Code Optimizer, Target Code Generator. Labels: Front End, Intermediate Code, Back End.]

Today!

[Figure: the same compiler structure diagram.]

What exactly is lexing?

Consider the code:

    if (i==j);
        z=1;
    else;
        z=0;
    endif;

This is really nothing more than a string of characters:

    i f _ ( i = = j ) ; \n \t z = 1 ; \n e l s e ; \n \t z = 0 ; \n e n d i f ;

Lexical analysis (aka scanning) divides this string into meaningful, multi-character chunks called tokens.

Tokens

Meaningful units of input text.

Languages generally contain a small number of token types.
• English tokens are things like parts of speech (e.g., “noun”, “verb”, “adjective”), punctuation, etc.
• In a program, a token could be an “identifier”, a “floating-point number”, a “math symbol”, a “keyword”, etc.

More abstract than the substrings they represent.
• E.g., IDENTIFIER vs. “employeeName”
• E.g., BLOCK_COMMAND vs. “if”

Identifying Tokens

The string that an instance of a token denotes is called a lexeme.

The set of all possible lexemes denoted by a given type of token is described by a pattern.

For example, the pattern for an identifier (e.g., a user-defined variable, method name, etc.) is a string of letters, digits, or underscores that does not begin with a digit.

Patterns are typically described using regular expressions.
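The identifier pattern described above can be written as a regular expression. The following is a minimal sketch in Python, not something taken from the slides: the names IDENTIFIER and is_identifier are hypothetical, and the pattern assumes ASCII letters only.

```python
import re

# Assumed pattern (ASCII only): one letter or underscore, followed by
# any number of letters, digits, or underscores.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_identifier(lexeme: str) -> bool:
    """Return True if the entire string is a lexeme of the identifier pattern."""
    return IDENTIFIER.fullmatch(lexeme) is not None
```

Under this sketch, is_identifier("employeeName") holds and is_identifier("1abc") does not. Note that "if" also matches the pattern, so a lexer built this way would need a separate step to tell keywords apart from identifiers.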

Implementation

Lexical analyzer must be able to:
1. Remove all whitespace and comments from the input
2. Tokenize the remaining input
3. Associate with each found token the lexeme it denotes, as well as the line number it was found on

How do we go about implementing this?

Example

Input:

    i f _ ( i = = j ) ; \n \t z = 1 ; \n e l s e ; \n \t z = 0 ; \n e n d i f ;

Line  Token          Lexeme
1     BLOCK_COMMAND  if
1     OPEN_PAREN     (
1     ID             i
1     OP_RELATION    ==
1     ID             j
1     CLOSE_PAREN    )
1     ENDLINE        ;
2     ID             z
2     ASSIGN         =
2     NUMBER         1
2     ENDLINE        ;
3     BLOCK_COMMAND  else
Etc…

Lexical ambiguity and lookahead

Tokens are typically read from left to right and recognized one at a time from the input string.
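One way to approach the three implementation requirements is a regular-expression-driven scanner. The sketch below is an illustration under stated assumptions, not the course's implementation: the token names are taken from the example table, while Token, KEYWORDS, TOKEN_SPEC, and tokenize are hypothetical names. The slides do not define a comment syntax, so comment removal is left out. Listing "==" before "=" is one simple way to handle the case where one lexeme is a prefix of another.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    line: int
    kind: str
    lexeme: str


# Assumed keyword set for the toy language in the example.
KEYWORDS = {"if", "else", "endif"}

# Alternatives are tried in order, so "==" must come before "=".
TOKEN_SPEC = [
    ("NEWLINE", r"\n"),
    ("SKIP", r"[ \t]+"),
    ("NUMBER", r"[0-9]+"),
    ("ID", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("OP_RELATION", r"=="),
    ("ASSIGN", r"="),
    ("OPEN_PAREN", r"\("),
    ("CLOSE_PAREN", r"\)"),
    ("ENDLINE", r";"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source: str) -> Iterator[Token]:
    """Yield (line, kind, lexeme) tokens, scanning left to right."""
    line = 1
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} on line {line}")
        kind, lexeme = match.lastgroup, match.group()
        pos = match.end()
        if kind == "NEWLINE":
            line += 1  # track line numbers; newlines produce no token
        elif kind == "SKIP":
            continue  # drop spaces and tabs
        else:
            if kind == "ID" and lexeme in KEYWORDS:
                kind = "BLOCK_COMMAND"  # keywords match the ID pattern first
            yield Token(line, kind, lexeme)


if __name__ == "__main__":
    sample = "if (i==j);\n\tz=1;\nelse;\n\tz=0;\nendif;"
    for token in tokenize(sample):
        print(token.line, token.kind, token.lexeme)
```

Run on the sample input, the sketch yields tokens in the same order as the example table, starting with BLOCK_COMMAND for "if" on line 1. Keywords are recognized by matching the identifier pattern first and then checking a keyword set, which is a common design choice but only one of several possible.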

This note was uploaded on 07/25/2008 for the course CSE 450 taught by Professor Stirewalt during the Spring '08 term at Michigan State University.
