June 22, 2011
Handout written by Maggie Johnson and Julie Zelenski, with edits by Keith.
Lexical analysis is the process where the stream of characters making up the
source program is read from left-to-right and grouped into tokens: sequences
of characters with a collective meaning. There are usually only a small number of token
classes for a programming language: constants (integer, double, char, string, etc.), operators
(arithmetic, relational, logical), punctuation, and reserved words. For example, consider the fragment:
while (i > 0)
i = i - 2;
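One way to picture this grouping is a pattern-table scanner that tries a regular expression for each token class at every position of the input. This is only an illustrative sketch, not the handout's required implementation, and the token names (T_While, T_Identifier, etc.) are assumed for the example:

```python
import re

# One regular expression per token class, tried in order; whitespace is
# matched but discarded rather than emitted as a token.
TOKEN_SPEC = [
    ("T_While",       r"\bwhile\b"),      # reserved word
    ("T_IntConstant", r"\d+"),            # integer constant
    ("T_Identifier",  r"[A-Za-z_]\w*"),   # variable name
    ("T_Operator",    r"[><=+\-]"),       # operators
    ("T_Punct",       r"[();]"),          # punctuation
    ("SKIP",          r"\s+"),            # whitespace: discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def scan(source):
    """Yield (token class, lexeme) pairs for the source string."""
    for m in MASTER.finditer(source):
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())

print(list(scan("while (i > 0) i = i - 2;")))
```

Run on the fragment above, this produces the stream T_While, T_Punct("("), T_Identifier("i"), T_Operator(">"), T_IntConstant("0"), and so on, which is the kind of output the lexical analyzer hands to later phases.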
The lexical analyzer takes a source program as input, and produces a stream of tokens as
output. The lexical analyzer might recognize particular instances of tokens such as:
	a numeral for an integer constant token
	a quoted sequence of characters for a string constant token
	a name for a variable token
Such specific instances are called lexemes. A lexeme is the actual character sequence
that forms a specific instance of a token; the token is the general class that a lexeme belongs to.
Some tokens have exactly one
lexeme (e.g., a single-character operator such as the
> in the fragment above); for others, there are many lexemes (e.g., integer constants).
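The distinction can be made concrete as a small data structure. This is an assumed sketch (the field and token names are illustrative, not the course's definitions): a token pairs the general class with the lexeme that was actually matched.

```python
from collections import namedtuple

# A token records its general class plus the matched character sequence.
Token = namedtuple("Token", ["type", "lexeme"])

three = Token("T_IntConstant", "3")    # two different lexemes...
big   = Token("T_IntConstant", "255")  # ...belonging to the same class
gt    = Token("T_Greater", ">")        # a class with exactly one lexeme

print(three.type == big.type)  # → True: integer constants share one type
print(gt.lexeme)               # → >
```

Later phases of the compiler mostly look at the `type` field; the `lexeme` is kept so that constants and identifiers retain their actual values.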
The scanner is tasked with determining that the input stream can be divided into valid
symbols in the source language, but has no smarts about which token should come
where. Few errors can be detected at the lexical level alone because the scanner has a
very localized view of the source program without any context. The scanner can report
about characters that are not valid tokens (e.g., an illegal or unrecognized symbol) and a
few other malformed entities (illegal characters within a string constant, an unterminated
string constant, etc.). It does not look for or detect garbled sequences, tokens out of place,
undeclared identifiers, misspelled keywords, mismatched types, and the like. For
example, the following input will not generate any errors in the lexical analysis phase,