Reference Book:
Compilers: Principles, Techniques and Tools
Token: a two tuple abstract symbol <name, attribute value>
Pattern: description of the form or representation of lexemes.
Lexeme: sequence of characters that match with a pattern of a token identified by lexical analyzer as instance of token.
Eg- printf("Sum=%d\n",total);
printf and total are lexemes matching the pattern for token id, "sum=%d\n" is lexeme matching literal.
Specification of Tokens
Regular expressions are used to specify lexeme patterns.
Strings and Languages
Alphabet - finite set of symbols. Ex - {0,1} is binary alphabet, ASCII is a popular alphabet.
String - finite sequence of symbols belonging to alphabet.
Language - finite set of strings over a specific alphabet.
Substring - a smaller set of consecutive elements of string. Ex- pil from compiler.
Subsequence - a smaller set of elements in any order from string obtained by deleting zero or more elements. Ex - cpile from compiler.
Regular Expressions
Mathematically, a regular expression is defined as -
1. ε is a regular expression
L ( ε ) = { ε }
It is the language consisting of only the empty string.
2. r = a is another regular expression for the language -
L ( a ) = { a }
3. ( a ) + ( b ) → L ( a ) U L ( b )
( a ) | ( b ) → L ( a ) U L ( b )
( a ) . ( b ) → L ( a ) . L ( b )
( a ) * → ( L ( a ) ) *
( ( a ) ) → L ( a )
Regular Definitions : Production rules for regular expressions.
production rule ⤇symbol ← d1 → r1 → regular expression