Lexical analyser and parser communication
Asked by rajatkhanduja on 2012-02-23T21:20:56


Most of the resources on lexical analyzers and parsers illustrate use of streams to communicate between them (or so I understand).

It is explained that the parser asks for the next token, say by calling a function getNextToken(), and the lexer responds to it by returning the next token. Are we supposed to think of them as two objects interacting within the same program or two different programs interacting through streams?

Also, I haven't been able to understand why a serial approach isn't chosen, i.e. the lexical analyzer runs till the end of the provided source, and only then does the parser use the output of the lexical analyzer for parsing. To be precise, if the lexical analyzer reads the next lexeme only when the parser asks for the next token, how is an error handled? Especially if the error occurs towards the end of the file, all the computation done by the parser might be wasted due to the error (assuming a very basic parser without any error handling capabilities). Is recent output cached?

Copyright notice: content by 「rajatkhanduja」, reproduced under the CC 4.0 BY-SA license with a link to the original source and this disclaimer.
Link to original article:https://stackoverflow.com/questions/9413893/lexical-analyser-and-parser-communication

Answers
Vicky Chijwani 2012-02-23T13:59:03

My answer below is in reference to the Flex-Bison (or Lex-Yacc) model of compilation only. I have little knowledge of other models.

I think of the lexer / parser combination as two co-operating modules in the same program. When you use Flex with Bison you see that the latter calls a function yylex() provided by the former (yylex() is equivalent to the getNextToken() function in your question). So it makes more sense to think of them as co-operating units in a single program rather than as two different programs. Moreover, if the lexer and parser were two different programs, you'd have to deal with inter-process communication, shared memory, and related issues, further complicating the task at hand.

To answer your second question: one important issue with the parser coming into action only after the lexer has finished reading all the input is memory usage, which would be enormous for even moderate-sized programs, since you would have to store a data structure for every token in memory (think of tokens like , and = occupying multiple bytes each and you'll quickly see why that doesn't scale).

As for error handling: if the lexer cannot match the input against any regular expression, then yylex() should return -1 to the parser, using a Flex rule like the following:

    .    { return -1; }

(Note the near-invisible period in the first column, which matches any input symbol except \n.)

(NOTE: this rule should be the last rule to appear in your Flex file, because the order of the rules determines priority: a token is matched by Flex using the first possible rule in the file.)

A return value of -1 from the lexer indicates a tokenization error, and the Bison parser handles it by calling yyerror(char *) (ideally defined by you); likewise, if the input contains a parsing error, the parser again calls yyerror(char *).

Also, if you want to display the erroneous piece of code when an error is encountered, you need a way to get back from the defective token to the related source code, which means the read-everything-then-parse approach would not work at all unless you stored the associated source text alongside each token while tokenizing, essentially making a memory behemoth of a compiler.

