C++ Producer Guide

March 1998

3.4. Parsing C++

The parser used in the C++ producer is generated using the sid tool. Because the generated code is large (1.3MB), the sid output is run through a simple program, sidsplit, which splits it into a number of more manageable modules. sidsplit also transforms the code to use the PROTO macros used in the rest of the program.

sid is designed as a parser generator for grammars which can be transformed into LL(1) grammars. The distinguishing feature of these grammars is that the parser can always decide what to do next based on the current terminal. This is not the case in C++; in some circumstances a potentially unlimited look-ahead is required to distinguish, for example, declaration statements from expression statements. In technical terms, the C++ grammar is not LL(1), and because the look-ahead required is unbounded it is not LL(k) for any fixed k. Fortunately there are relatively few such situations, and sid provides a mechanism, predicates, for bypassing the normal parsing mechanism in these cases. Thus it is possible, although difficult, to express C++ as a sid grammar.
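
The declaration/expression ambiguity mentioned above can be seen in a small fragment. This is an illustrative sketch only: the type T and the function ambiguity_demo are hypothetical names introduced here, not taken from the producer. Both marked statements begin with the tokens T ( identifier ), and only what follows the closing bracket shows which is a declaration and which is an expression. Since the bracketed part can be nested to any depth, no fixed amount of look-ahead is enough to decide.

```cpp
#include <cassert>

// Hypothetical type, used only to make both statements below well-formed.
struct T {
    int m;
    T() : m(0) {}
    T(int v) : m(v) {}
    T* operator->() { return this; }
};

int ambiguity_demo() {
    int a = 1;
    int seen = -1;
    {
        T(b);            // declaration statement: equivalent to "T b ;"
        seen = b.m;      // b was default-constructed, so m is 0
        assert(seen == 0);
    }
    T(a)->m = 7;         // expression statement: converts a to a temporary T
                         // and assigns to that temporary's member
    return seen + a;     // a itself is unchanged
}
```

A parser that looks only at the current terminal sees T in both cases and cannot choose between the two productions; this is the kind of situation in which a predicate is needed.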

The sid grammar file, syntax.sid, is closely based on the ISO C++ grammar. In particular, the same production names have been used. The grammar has been extended slightly to allow common syntactic errors to be detected elegantly. Other parsing errors are handled by sid's exception mechanism. At present there is only limited recovery after such errors.

The lexical analysis routines in the C++ producer are hand-crafted, based on an initial version generated by the simple lexical analyser generator, lexi. lexi has been used more directly to generate the lexical analysers for some of the other automatic code generation tools used in the producer, including calculus.

The sid grammar contains a number of entry points. The most important is parse_file, which is used to parse a complete C++ translation unit. The syntax for the #pragma TenDRA directives is included within the same grammar with two entry points, parse_tendra in normal use, and parse_preproc for use in preprocessing mode. There are also entry points in the grammar for each of the kinds of token argument. The parsing routines for token and template arguments are largely hand-crafted, based on these primitives.

Certain parsing operations are performed before control passes to the sid grammar. As mentioned above, these include the processing of token and template applications. The other important case concerns nested name specifiers. For example, in:

	class A {
	    class B {
		static int c ;
	    } ;
	} ;

	int A::B::c = 0 ;

the qualified identifier A::B::c is split into two terminals, a nested name specifier, A::B::, and an identifier, c, which is looked up in the corresponding namespace. Note that it is at this stage that name look-up occurs. An identifier can be mapped to one of a number of terminals, including keywords, type names, namespace names and other identifiers, according to the result of this look-up. If the look-up gives a macro then this is expanded at this stage.
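
The splitting and look-up steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the producer's implementation: it works on plain strings rather than on the preprocessed token stream, it ignores macro expansion, and the names Terminal, split_qualified, ScopeTable and classify are all hypothetical.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Hypothetical terminal kinds; the producer's real set of terminals is larger.
enum class Terminal { Keyword, TypeName, NamespaceName, Identifier };

// Split a qualified identifier such as "A::B::c" into its nested name
// specifier ("A::B::") and its final identifier ("c").
std::pair<std::string, std::string> split_qualified(const std::string& qid) {
    assert(!qid.empty());
    const std::string::size_type pos = qid.rfind("::");
    if (pos == std::string::npos) {
        return {std::string(), qid};   // unqualified: empty specifier
    }
    return {qid.substr(0, pos + 2), qid.substr(pos + 2)};
}

// Toy stand-in for name look-up: each nested name specifier maps to the
// names declared in the corresponding scope.
using Scope = std::map<std::string, Terminal>;
using ScopeTable = std::map<std::string, Scope>;

// Choose the terminal for an identifier from the result of looking it up
// in the scope named by its nested name specifier.
Terminal classify(const ScopeTable& scopes, const std::string& qid) {
    const auto parts = split_qualified(qid);
    const auto scope = scopes.find(parts.first);
    if (scope == scopes.end()) {
        return Terminal::Identifier;
    }
    const auto entry = scope->second.find(parts.second);
    return entry == scope->second.end() ? Terminal::Identifier : entry->second;
}
```

With a table in which the scope "A::" declares B as a type name, classify applied to "A::B" yields Terminal::TypeName, so the grammar receives a type name terminal rather than a plain identifier. Names that are not found fall back to Terminal::Identifier in this sketch.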