About Lex
This project is maintained by ckshitij
sudo apt-get install flex
sudo apt-get install byacc
yacc -d filename.y
lex filename.l
gcc lex.yy.c y.tab.c -o lex_yacc
./lex_yacc
A lex program has the following basic structure:
%{
C declarations and includes
%}
Lex macro definitions and directives
%%
Lex Specification
in the form of pattern/action statements like this:
keyword { my_c_code(yytext); }
%%
C language program (the rest)
However, the only mandatory part is the first %%
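Putting the pieces together, a minimal but complete lex source file might look like this (a sketch only; it needs flex and a C compiler to build, and the WORD-printing rule is our own invention, not from the text above):

```lex
%{
/* C declarations and includes */
#include <stdio.h>
%}

%%
 /* pattern/action rules */
[a-zA-Z]+   { printf("WORD: %s\n", yytext); }  /* report each word */
.|\n        ;                                  /* ignore everything else */
%%

/* C language program (the rest) */
int main(void)   { return yylex(); }
int yywrap(void) { return 1; }
```

Running `lex` (or `flex`) on this file produces `lex.yy.c`, which compiles to a scanner that prints every alphabetic word in its input.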
Lex turns the user’s expressions and actions (called source in this memo) into the host general-purpose language; the generated program is named yylex. The yylex program will recognize expressions in a stream (called input in this memo) and perform the specified actions for each expression as it is detected.
+-------+
Source -> | Lex | -> yylex
+-------+
+-------+
Input -> | yylex | -> Output
+-------+
An overview of Lex
For a trivial example, consider a program to delete from the input all blanks or tabs at the ends of lines.
%%
[ \t]+$ ;
is all that is required. The program contains a %% delimiter to mark the beginning of the rules, and one rule. To also squeeze any remaining strings of blanks or tabs down to a single blank, add a second rule:
%%
[ \t]+$ ;
[ \t]+ printf(" ");
Lex can be used alone for simple transformations, or for analysis and statistics gathering on a lexical level. Lex can also be used with a parser generator to perform the lexical analysis phase; it is particularly easy to interface Lex and Yacc [3]. Lex programs recognize only regular expressions; Yacc writes parsers that accept a large class of context-free grammars, but require a lower-level analyzer to recognize input tokens. Thus, a combination of Lex and Yacc is often appropriate. When used as a preprocessor for a later parser generator, Lex is used to partition the input stream, and the parser generator assigns structure to the resulting pieces.
lexical grammar
rules rules
| |
v v
+---------+ +---------+
| Lex | | Yacc |
+---------+ +---------+
| |
v v
+---------+ +---------+
Input -> | yylex | -> | yyparse | -> Parsed input
+---------+ +---------+
Lex with Yacc
So far, we have been using lex in a “stand-alone” mode, and linking in the (hitherto mysterious) libl library using LDLIBS=-ll.
In fact, lex does not generate a complete program. Lex generates a single function,
(int) yylex()
and some associated global variables.
yyin defaults to the standard input.
When the scanner reaches the end of the file it is reading, it calls a function, (int) yywrap(). If yywrap() returns non-zero, yylex() returns a zero value. If yywrap() returns zero, yylex() keeps scanning, from where it left off, with whatever input is available on yyin. This is only useful if yywrap() has changed yyin to provide for additional input.
The libl library supplies default versions of two routines:
main()
...which simply calls yylex(), and
yywrap()
...which always returns non-zero.
%{
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
int file_num;
int file_num_max;
char **files;
%}
%%
(ftp|http):\/\/[^ \n<>"]* printf("%s\n",yytext);
.|\n ;
%%
int main(int argc, char *argv[]) {
file_num=1;
file_num_max = argc;
files = argv;
if ( argc > 1 ) {
if ( (yyin = fopen(argv[file_num],"r")) == 0 ) {
perror(argv[file_num]);
exit(1);
}
}
while( yylex() )
;
return 0;
}
int yywrap() {
fclose(yyin);
if ( ++file_num < file_num_max ) {
if ( (yyin = fopen(files[file_num],"r")) == 0 ) {
perror(files[file_num]);
exit(1);
}
return 0;
} else {
return 1;
}
}
make rip-url
Although none of our examples have so far done so, it is valid to execute a return() statement within a lex rule.
yylex() is of type int, so a non-zero integer value would normally be returned. Returning zero would be ambiguous, because zero is what yylex() returns when it encounters an end-of-file and yywrap() returns non-zero.
After yylex() has returned, it is possible to call it again and again, and the scanner will continue exactly where it left off each time. If any start-condition was in force when the return() was executed, it will still apply when yylex() is called again.
This aspect of yylex() plays a key role when lex is being used as a front-end to a parser, such as yacc. When writing a stand-alone lex program, it is generally not required to have a return() statement within a lex rule.
%{
/* C declarations and includes */
%}
/* Yacc token and type declarations */
%%
/* Yacc Specification
in the form of grammar rules like this:
*/
symbol : symbols tokens
{ $$ = my_c_code($1); }
;
%%
/* C language program (the rest) */
menu_item : LABEL EXEC
;
This rule defines a non-terminal symbol, menu_item in terms of the two tokens LABEL and EXEC. Tokens are also known as “terminal symbols”, because the parser does not need to expand them any further. Conversely, menu_item is a “non-terminal symbol” because it can be expanded into LABEL and EXEC.
menu_item : LABEL EXEC '\n'
{ C-program statements }
;
So far, we have only considered the tokens LABEL and EXEC as single-valued integers which are passed from the lexer to the parser. What we really need, is access to the text-strings associated with these tokens (ie their semantic value).
We could do this using a global variable (like token_txt in our spam-checking program), except that yacc executes the action after it has read all the tokens up to that point.
Consider the MENU keyword, in our case. Yacc has to check whether it is followed by another string or a newline, before it can decide whether it is being used to introduce a sub-menu within the same file, or an external menu-file.
In any case, yacc provides a formal method for dealing with the semantic value of tokens. It begins with the lexer. Every time the lexer returns a value, it should also set the external variable yylval to the value of the token. Yacc will then retain the association between the token and the corresponding value of yylval.
%union {
char *str;
int num;
}
This defines yylval as being a union of the types (char *) and (int). This is a classical C-program union, so any number of types may be defined, and the union may even contain struct types, etc. For now, we’ll just have these two types.
Now we need to modify the lexer. This is done by including the line:
yylval.str = strdup(yytext);
just before the lexer returns the LABEL and EXEC tokens. We'll also include the line:
yylval.num = atoi(yytext);
just before the lexer returns the __INT__ token.
Now that we understand the hows and whys of yacc rules, we can try our hand at writing a basic parser. Before we can successfully compile it, there’s a few more things we’ll need.
Yacc generates a single function called yyparse(). This function requires no parameters, and returns either 0 on success or 1 on failure. “Failure” in the case of the parser means “it encountered a syntax error”.
The yyerror() function is called when yacc encounters invalid syntax. yyerror() is passed a single string (char *) argument. Unfortunately, this string usually just says “parse error”, so on its own, it’s pretty useless. Error recovery is an important topic which we will cover in more detail later on. For now, we just want a basic yyerror() function like this:
void yyerror(char *err) {
    fprintf(stderr, "%s\n", err);
}
See the section "Interface" in the Bison documentation for more information.