C++ Producer Guide

March 1998

2.4.1 - Lexical elements
2.4.2 - Overall syntax
2.4.3 - File locations
2.4.4 - Identifiers
2.4.5 - Types
2.4.6 - Sorts
2.4.7 - Token applications
2.4.8 - Errors
2.4.9 - File inclusions
2.4.10 - String literals

2.4. Symbol table dump

The symbol table dump provides a method whereby third-party tools can interface with the C and C++ producers. The producer outputs information on the identifiers declared within a source file, their uses, etc. into a file which can then be post-processed by a separate tool. Any error messages and warnings can also be included in this file, allowing more sophisticated error presentation tools to be written.

The file to be used as the symbol table output file, plus details of what information is to be included in the dump file, can be specified using the -d command-line option. The format of the dump file is described below; a summary of the syntax is given as an annex.

2.4.1. Lexical elements

A symbol table dump file consists of a sequence of characters giving information on identifiers, errors etc. arising from a translation unit. The fundamental lexical tokens are a number, consisting of a sequence of decimal digits, and a string, consisting of a sequence of characters enclosed in angle brackets. A string can have one of two forms:

	string :
		<characters>
		&number<characters>

In the first form, the characters are terminated by the first > character encountered. In the second form, the number of characters is given by the preceding number. No white space is allowed either before or after the number. To aid parsers, the C++ producer always uses the second form for strings containing more than 100 characters. There are no escape characters in strings; a string can contain any characters, including newlines and #, except that the first form cannot contain a > character.

Space, tab and newline characters are white space. Comments begin with # and run to the end of the line. Comments are treated as white space. All other characters are treated as distinct lexical tokens.
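
The lexical rules above can be sketched as a small tokenizer. This is a minimal illustration in Python, not part of the producer; the function name tokenize and the token kinds "number", "string" and "char" are assumptions made for this example. It also assumes that the counted string form is closed by a > character, as the grammar suggests.

```python
DIGITS = "0123456789"

def tokenize(text):
    """Yield ("number", int), ("string", str) or ("char", str) tokens."""
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch in " \t\n":
            i += 1                          # white space
        elif ch == "#":
            while i < n and text[i] != "\n":
                i += 1                      # comment runs to end of line
        elif ch in DIGITS:
            j = i
            while j < n and text[j] in DIGITS:
                j += 1
            yield ("number", int(text[i:j]))
            i = j
        elif ch == "<":
            end = text.index(">", i + 1)    # first form: ends at first '>'
            yield ("string", text[i + 1:end])
            i = end + 1
        elif ch == "&":
            j = i + 1
            while j < n and text[j] in DIGITS:
                j += 1
            length = int(text[i + 1:j])     # second form: explicit length
            start = j + 1                   # index just past the opening '<'
            stop = start + length
            if text[j:start] != "<" or text[stop:stop + 1] != ">":
                raise ValueError("malformed counted string at offset %d" % i)
            yield ("string", text[start:stop])
            i = stop + 1
        else:
            yield ("char", ch)              # any other character is a token
            i += 1
```

Note that the counted form allows > and # inside the string, which is why a parser cannot simply strip comments before looking for strings.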

2.4.2. Overall syntax

A symbol table dump file takes the form of a list of commands of various kinds conveying information on the analysed file. This can be represented as follows:

	dump-file :
		command-list_opt

	command-list :
		command command-list_opt

	command :
		version-command
		identifier-command
		scope-command
		override-command
		base-command
		api-command
		template-command
		promotion-command
		error-command
		path-command
		file-command
		include-command
		string-command

The various kinds of command are discussed below. The first command in the dump file should be of the form:

	version-command :
		V number number string

where the two numbers give the version of the dump file format (the version described here is 1.1 so both numbers should be 1) and the string gives the language being represented, for example, <C++>.
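
A reader would typically check the version command before processing anything else. The following Python sketch shows one way to do this; the name read_version is hypothetical, and the code assumes that the command is the first thing in the file (after optional white space) and that the language string uses the first, uncounted string form.

```python
import re

# V number number string, with the string in its <characters> form.
VERSION_RE = re.compile(r"V[ \t\n]*([0-9]+)[ \t\n]+([0-9]+)[ \t\n]*<([^>]*)>")

def read_version(text, supported=(1, 1)):
    """Return (major, minor, language) or raise ValueError."""
    m = VERSION_RE.match(text.lstrip(" \t\n"))
    if m is None:
        raise ValueError("dump file does not start with a version command")
    major, minor, language = int(m.group(1)), int(m.group(2)), m.group(3)
    if (major, minor) != supported:
        raise ValueError("unsupported dump format %d.%d" % (major, minor))
    return major, minor, language
```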

2.4.3. File locations

A location within a source file can be specified using three numbers and two strings. These give, respectively, the column number, the line number taking #line directives into account, the line number not taking #line directives into account, the file name taking #line directives into account, and the file name not taking #line directives into account. Any or all of the trailing elements can be replaced by * to indicate that they have not changed relative to the last location given. Note that for the two line numbers, unchanged means that the difference of the line numbers, taking #line directives into account or not, is unchanged. Thus:

	location :
		number number number string string
		number number number string *
		number number number *
		number number *
		number *
		*

Note that there is a concept of the current file location, relative to which other locations are given. The initial value of the current file location is undefined. Unless otherwise stated, all location elements update the current file location.
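
The way a partial location is completed from the current file location can be sketched as follows. This Python fragment is one reading of the rules above, not a definitive implementation: the Location class and the resolve function are names invented for the example, and the caller is assumed to pass only the elements that appeared before the * (or all five).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Location:
    column: int
    line: int        # taking #line directives into account
    real_line: int   # not taking #line directives into account
    file: str        # taking #line directives into account
    real_file: str   # not taking #line directives into account

def resolve(current, given):
    """Complete the elements in `given` (0 to 5 of them) from `current`."""
    if len(given) == 5:
        return Location(*given)
    if current is None:
        # The initial current location is undefined, so '*' cannot be used yet.
        raise ValueError("'*' used before any complete location was given")
    column = given[0] if len(given) > 0 else current.column
    line = given[1] if len(given) > 1 else current.line
    if len(given) > 2:
        real_line = given[2]
    else:
        # "Unchanged" means the difference between the two line numbers
        # is the same as in the previous location.
        real_line = line + (current.real_line - current.line)
    file = given[3] if len(given) > 3 else current.file
    return Location(column, line, real_line, file, current.real_file)
```

For example, after a full location with line numbers 10 and 12, the form "7 20 *" yields line numbers 20 and 22. The returned value would normally become the new current location.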

2.4.4. Identifiers

Each identifier is represented in the symbol table dump by a unique number. The same number always represents the same identifier.

Identifier names

The number representing an identifier is introduced in the first declaration or use of that identifier and thereafter the number alone is used to denote the identifier:

	identifier :
		number = identifier-name access_opt scope-identifier
		number

The identifier name is given by:

	identifier-name :
		string
		C type
		D type
		O string
		T type

denoting respectively, a simple identifier name, a constructor for a type, a destructor for a type, an overloaded operator function name, and a conversion function name. The empty string is used for anonymous identifiers.

The optional identifier access is given by:

	access :
		N
		B
		P

denoting public, protected and private respectively. An absent access is equivalent to public. Note that all identifiers, not just class members, can have access specifiers; however the access of a non-member is always public.

The scope (i.e. class, namespace, block etc.) in which an identifier is declared is given by:

	scope-identifier :
		identifier
		*

denoting either a named or an unnamed scope.

Identifier uses

Each declaration or use of an identifier is represented by a command of the form:

	identifier-command :
		D identifier-info type-info
		M identifier-info type-info
		T identifier-info type-info
		Q identifier-info
		U identifier-info
		L identifier-info
		C identifier-info
		W identifier-info type-info

where:

	identifier-info :
		identifier-key location identifier

gives the kind of identifier being declared or used, the location of the declaration or use, and the number associated with the identifier. Each declaration may, depending on the identifier-key, associate various type-info with the identifier, giving its type etc.

The various kinds of identifier-command are described below. Any can be preceded by I to indicate an implicit declaration or use. D denotes a definition. M (make) denotes a declaration. T denotes a tentative definition (C only). Q denotes the end of a definition, for those identifiers such as classes and functions whose definitions may be spread over several lines. U denotes an undefine operation (such as #undef for macro identifiers). C denotes a call to a function identifier; L (load) denotes other identifier uses. Finally W denotes implicit type information such as the C producer gleans from its weak prototype analysis.

The various identifier-keys and their associated type-info fields are given by the following table:

Key	Type information	Description
`K`	`*`	keyword
`MO`	sort	object macro
`MF`	sort	function macro
`MB`	sort	built-in macro
`TC`	type	class tag
`TS`	type	structure tag
`TU`	type	union tag
`TE`	type	enumeration tag
`TA`	type	`typedef` name
`NN`	`*`	namespace name
`NA`	scope-identifier	namespace alias
`VA`	type	automatic variable
`VP`	type	function parameter
`VE`	type	`extern` variable
`VS`	type	`static` variable
`FE`	type identifier_opt	`extern` function
`FS`	type identifier_opt	`static` function
`FB`	type identifier_opt	built-in operator function
`CF`	type identifier_opt	member function
`CS`	type identifier_opt	`static` member function
`CV`	type identifier_opt	virtual member function
`CM`	type	data member
`CD`	type	`static` data member
`E`	type	enumerator
`L`	`*`	label
`XO`	sort	object token
`XF`	sort	procedure token
`XP`	sort	token parameter
`XT`	sort	template parameter

The function identifier keys can optionally be followed by C indicating that the function has C linkage, and I indicating that the function is inline. By default, functions declared in a C++ dump file have C++ linkage and functions declared in a C dump file have C linkage. The optional identifier which forms part of the type-info of these functions is used to form linked lists of overloaded functions.
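
A tool reading the dump might decode identifier keys with a simple lookup, as in the Python sketch below. The names IDENTIFIER_KEYS, FUNCTION_KEYS and decode_key are hypothetical, and the code assumes that the key has already been collected into a single string and that, when both suffixes are present, C precedes I.

```python
IDENTIFIER_KEYS = {
    "K": "keyword", "MO": "object macro", "MF": "function macro",
    "MB": "built-in macro", "TC": "class tag", "TS": "structure tag",
    "TU": "union tag", "TE": "enumeration tag", "TA": "typedef name",
    "NN": "namespace name", "NA": "namespace alias",
    "VA": "automatic variable", "VP": "function parameter",
    "VE": "extern variable", "VS": "static variable",
    "FE": "extern function", "FS": "static function",
    "FB": "built-in operator function", "CF": "member function",
    "CS": "static member function", "CV": "virtual member function",
    "CM": "data member", "CD": "static data member", "E": "enumerator",
    "L": "label", "XO": "object token", "XF": "procedure token",
    "XP": "token parameter", "XT": "template parameter",
}

# Only these keys may carry the optional C (C linkage) and I (inline) suffixes.
FUNCTION_KEYS = {"FE", "FS", "FB", "CF", "CS", "CV"}

def decode_key(key):
    """Return (description, has_c_linkage_suffix, is_inline)."""
    base, flags = key, ""
    if key[:2] in FUNCTION_KEYS:
        base, flags = key[:2], key[2:]
    if base not in IDENTIFIER_KEYS or flags not in ("", "C", "I", "CI"):
        raise ValueError("unknown identifier key: %r" % key)
    return IDENTIFIER_KEYS[base], "C" in flags, "I" in flags
```

The second element of the result only records whether the C suffix was present; the default linkage still depends on the language given in the version command.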

Identifier scopes

Each identifier belongs to a scope, called its parent scope, in which it is declared. For example, the parent of a member of a class is the class itself. This information is expressed in an identifier declaration using a scope-identifier. In addition to the obvious scopes such as classes and namespaces, there are other scopes such as blocks in function definitions. It is possible to introduce dummy identifiers to name such scopes. The parent of such a dummy identifier will be the enclosing scope identifier, so these dummy identifiers naturally represent the block structure. The parent of the top-level block in a function definition can be considered to be the function itself.

Information on the start and end of such scopes is given by:

	scope-command :
		SS scope-key location identifier
		SE scope-key location identifier

where:

	scope-key :
		N
		S
		B
		D
		H
		CT
		CF
		CC

gives the kind of scope involved: a namespace, a class, a block, some other declarative scope, a declaration block (see below), a true conditional scope, a false conditional scope or a target dependent conditional scope.

A declaration block is a sequence of declarations enclosed in directives of the form:

	#pragma TenDRA declaration block identifier begin
	....
	#pragma TenDRA declaration block end

This allows the sequence of declarations to be associated with the given identifier in the symbol dump file. This technique is used in the API description files to aid analysis tools in determining which declarations are part of the API.
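
The scope commands can be used to recover the nesting structure of a translation unit. The Python sketch below works on records that have already been parsed into (command, scope-key, identifier-number) triples, omitting the location; this record shape and the name nest_scopes are assumptions made for the example. It further assumes that scopes are properly nested and that each SE command repeats the key and identifier of the matching SS command.

```python
SCOPE_KINDS = {
    "N": "namespace", "S": "class", "B": "block",
    "D": "other declarative scope", "H": "declaration block",
    "CT": "true conditional scope", "CF": "false conditional scope",
    "CC": "target dependent conditional scope",
}

def nest_scopes(records):
    """Return a list of (depth, kind, identifier) in order of scope start."""
    stack, result = [], []
    for command, key, ident in records:
        if command == "SS":
            result.append((len(stack), SCOPE_KINDS[key], ident))
            stack.append((key, ident))
        elif command == "SE":
            if not stack or stack[-1] != (key, ident):
                raise ValueError("scope end does not match innermost scope")
            stack.pop()
        else:
            raise ValueError("not a scope command: %r" % command)
    if stack:
        raise ValueError("dump ended inside %d open scope(s)" % len(stack))
    return result
```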

Other identifier information

Other information associated with an identifier may be expressed using other dump commands. For example:

	override-command :
		O identifier identifier

is used to express the fact that the two identifiers are virtual member functions, the first of which overrides the second.

The command:

	base-command :
		B identifier-key identifier base-graph

	base-graph :
		base-class
		base-class ( base-list )

	base-class :
		number = V_opt access_opt type-name
		number :

	base-list :
		base-graph base-list_opt

associates a base class graph with a class identifier. Any class which does not have an associated base-command can be assumed to have no base classes. Each node in the graph is a type-name with an associated list of base classes. A V is used to indicate a virtual base class. Each node is numbered; duplicate numbers are used to indicate bases identified via the virtual base class structure. Any base class can then be referred to as:

	base-number :
		number : type-name

indicating the base class with the given number in the given class.

The command:

	api-command :
		X identifier-key identifier string

associates the external token name given by the string with the given tokenised identifier.

The command:

	template-command :
		Z identifier-key identifier token-application specialise-info

is used to introduce an identifier corresponding to an instance of a template, token-application. This instance may correspond to a specialisation of the primary template; this information is represented by:

	specialise-info :
		identifier
		token-application
		*

where * indicates a non-specialised instance.

2.4.5. Types

The built-in types are represented in the symbol table dump as follows:

Type	Encoding	Type	Encoding
char	`c`	float	`f`
signed char	`Sc`	double	`d`
unsigned char	`Uc`	long double	`r`
signed short	`s`	void	`v`
unsigned short	`Us`	(bottom)	`u`
signed int	`i`	bool	`b`
unsigned int	`Ui`	ptrdiff_t	`y`
signed long	`l`	size_t	`z`
unsigned long	`Ul`	wchar_t	`w`
signed long long	`x`	-	-
unsigned long long	`Ux`	-	-

Named types (classes, enumeration types etc.) can be represented by the corresponding identifier or token application:

	type-name :
		identifier
		token-application

Composite and qualified types are represented in terms of their subtypes as follows:

Type	Encoding
`const` type	`C` type
`volatile` type	`V` type
pointer type	`P` type
reference type	`R` type
pointer to member type	`M` type-name `:` type
function type	`F` type parameter-types
array type	`A` nat_opt `:` type
bitfield type	`B` nat `:` type
template type	`t` parameter-list_opt `:` type
promotion type	`p` type
arithmetic type	`a` type `:` type
integer literal type	`n` lit-base_opt lit-suffix_opt
weak function prototype (C only)	`W` type parameter-types
weak parameter type (C only)	`q` type

Other types can be represented by their textual representation using the form Q string, or by *, indicating an unknown type.
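
As an illustration of how these encodings compose, the following Python sketch turns a small subset of type encodings into readable descriptions. It handles only the built-in types and the single-letter const, volatile, pointer, reference and promotion prefixes; named types, functions, arrays and the other forms are deliberately left out. The names BUILTIN_TYPES, TYPE_PREFIXES and describe_type are hypothetical, and the encoding is assumed to contain no white space.

```python
BUILTIN_TYPES = {
    "c": "char", "Sc": "signed char", "Uc": "unsigned char",
    "s": "signed short", "Us": "unsigned short",
    "i": "signed int", "Ui": "unsigned int",
    "l": "signed long", "Ul": "unsigned long",
    "x": "signed long long", "Ux": "unsigned long long",
    "f": "float", "d": "double", "r": "long double",
    "v": "void", "u": "(bottom)", "b": "bool",
    "y": "ptrdiff_t", "z": "size_t", "w": "wchar_t",
}

TYPE_PREFIXES = {
    "C": "const ", "V": "volatile ", "P": "pointer to ",
    "R": "reference to ", "p": "promotion of ",
}

def describe_type(encoding):
    """Describe an encoding made of prefixes followed by a built-in type."""
    words = []
    i = 0
    while i < len(encoding) and encoding[i] in TYPE_PREFIXES:
        words.append(TYPE_PREFIXES[encoding[i]])
        i += 1
    rest = encoding[i:]
    if rest not in BUILTIN_TYPES:
        raise ValueError("unsupported type encoding: %r" % encoding)
    return "".join(words) + BUILTIN_TYPES[rest]
```

Under these assumptions PCc describes a pointer to const char, while CPc describes a const pointer to char.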

The parameter types for a function type are represented as follows:

	parameter-types :
		: exception-spec_opt func-qualifier_opt :
		. exception-spec_opt func-qualifier_opt :
		. exception-spec_opt func-qualifier_opt .
		, type parameter-types

where the :: form indicates that there are no further parameters, the .: form indicates that the parameters are terminated by an ellipsis, and the .. form indicates that no information is available on the further parameters (this can only happen with non-prototyped functions in C). The function qualifiers are given by:

	func-qualifier :
		C func-qualifier_opt
		V func-qualifier_opt

representing const and volatile member functions. The function exception specifier is given by:

	exception-spec :
		( exception-list_opt )

	exception-list :
		type
		type , exception-list

with an absent exception specifier, as in C++, indicating that any exception may be thrown.

Array and bitfield sizes are represented as follows:

	nat :
		+ number
		- number
		identifier
		token-application
		string

where a string is used to hold a textual representation of complex values.

Template types are represented by a list of template parameters, which will have previously been declared using the XT identifier key, followed by the underlying type expressed in terms of these parameters. The parameters are represented as follows:

	parameter-list :
		identifier
		identifier , parameter-list

Integer literal types are represented by the value of the literal followed by a representation of the literal base and suffix. These are given by:

	lit-base :
		O
		X

representing octal and hexadecimal literals respectively (decimal is the default), and:

	lit-suffix :
		U
		l
		Ul
		x
		Ux

representing the U, L, UL, LL and ULL suffixes respectively.

Target dependent integral promotion types are represented using p, so for example the promotion of unsigned short is represented as pUs. Information on the other cases, where the promotion type is known, can be given in a command of the form:

	promotion-command :
		P type : type

Thus the fact that the promotion of short is int would be expressed by the command Ps:i.
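
A consumer might collect the promotion commands into a table and fall back on the p encoding when no command has been seen for a type. The Python sketch below shows this under simplifying assumptions: each command is supplied as a separate string, only the built-in character and integer encodings are recognised, and the names read_promotions and promoted are invented for the example.

```python
import re

# Encodings accepted here: c, Sc, Uc, s, Us, i, Ui, l, Ul, x, Ux.
_INT = r"(?:[SU]?c|U?[silx])"
PROMOTION_RE = re.compile(r"P\s*(%s)\s*:\s*(%s)" % (_INT, _INT))

def read_promotions(commands):
    """Build a {type: promoted type} table from 'P type : type' commands."""
    table = {}
    for text in commands:
        m = PROMOTION_RE.fullmatch(text.strip())
        if m is None:
            raise ValueError("not a recognised promotion command: %r" % text)
        table[m.group(1)] = m.group(2)
    return table

def promoted(type_encoding, table):
    """Known promotion if one was dumped, else the target dependent form."""
    return table.get(type_encoding, "p" + type_encoding)
```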

2.4.6. Sorts

A sort in the symbol table dump corresponds to the sort of a token declared in the #pragma token syntax. Expression tokens are represented as follows:

	expression-sort :
		ZEL type
		ZER type
		ZEC type
		ZN

corresponding to lvalue, rvalue and const EXP tokens of the given type, and NAT or INTEGER tokens, respectively. Statement tokens are represented by:

	statement-sort :
		ZS

Type tokens are represented as follows:

	type-sort :
		ZTO
		ZTI
		ZTF
		ZTA
		ZTP
		ZTS
		ZTU

corresponding to TYPE, VARIETY, FLOAT, ARITHMETIC, SCALAR, STRUCT or CLASS, and UNION tokens respectively. There are corresponding TAG forms:

	tag-type-sort :
		ZTTS
		ZTTU

Member tokens are represented using:

	member-sort :
		ZM type : type-name

where the first type gives the member type and the second gives the parent structure or union type.

Procedure tokens can be represented using:

	proc-sort :
		ZPG parameter-list_opt ; parameter-list_opt : sort
		ZPS parameter-list_opt : sort

The first form corresponds to the more general form of PROC token, that expressed using { .... | .... }, which has separate lists of bound and program parameters. These token parameters will have previously been declared using the XP identifier key. The second form corresponds to the case where the bound and program parameter lists are equal, that expressed as a PROC token using ( .... ). A more specialised version of this second form is a FUNC token, which is represented as:

	func-sort :
		ZF type

As noted above, template parameters are represented by a sort. Template type parameters are represented by ZTO, while template expression parameters are represented by ZEC (recall that such parameters are always constant expressions). The remaining case, template template parameters, can be represented as:

	template-sort :
		ZTt parameter-list_opt :

Finally, the number of parameters in a macro definition is represented by a sort of the form:

	macro-sort :
		ZUO
		ZUF number

corresponding to an object-like macro and a function-like macro with the given number of parameters, respectively.

2.4.7. Token applications

Given an identifier representing a PROC token or a template, an application of that token or an instance of that template can be represented using:

	token-application :
		T identifier , token-argument-list :

where the token or template arguments are given by:

	token-argument-list :
		token-argument
		token-argument , token-argument-list

Note that the case where there are no arguments is generally just represented by identifier; this case is specified separately in the rest of the grammar.

A token-argument can represent a value of any of the sorts listed above: expressions, integer constants, statements, types, members, functions and templates. These are given respectively by:

	token-argument :
		E expression
		N nat
		S statement
		T type
		M member
		F identifier
		C identifier

where:

	expression :
		nat

	statement :
		expression

	member :
		identifier
		string

2.4.8. Errors

Each error in the C++ error catalogue is represented by a number. These numbers happen to correspond to the position of the error within the catalogue, but in general this need not be the case. The first use of each error introduces the error number by associating it with a string giving the error name. This has the form cpp.error where error gives an error name from the C++ (cpp) error catalogue. Thus:

	error-name :
		number = string
		number

Each error message written to the symbol table dump has the form:

	error-command :
		ES location error-info
		EW location error-info
		EI location error-info
		EF location error-info
		EC error-info
		EA error-argument

denoting constraint errors, warnings, internal errors, fatal errors, continuation errors and error arguments respectively. Note that an error message may consist of several components: the initial error plus a number of continuation errors. Each error message may also have a number of error arguments associated with it. This error information is given by:

	error-info :
		error-name number number

where the first number gives the number of error arguments which should be read, and the second is nonzero to indicate that a continuation error should be read.
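
The two numbers in error-info tell a reader how many further commands belong to the same message. The Python sketch below groups already-parsed records into complete messages. It is only one plausible reading of the format: it assumes that the EA commands for an error follow it immediately and precede any EC continuation. Records are simplified to (command, error-name, argument-count, continuation-flag) tuples with the location omitted, EA records are (command, argument) pairs, and the error names used in the usage note are invented placeholders.

```python
SEVERITY = {"ES": "constraint error", "EW": "warning",
            "EI": "internal error", "EF": "fatal error"}

def read_messages(records):
    """Group records into (severity, [(error name, arguments), ...])."""
    names = {}       # error number -> name string, filled in on first use
    messages = []
    stream = iter(records)
    for record in stream:
        command = record[0]
        if command not in SEVERITY:
            raise ValueError("unexpected command: %r" % command)
        parts = []
        while True:
            name, arg_count, has_continuation = record[1:]
            if isinstance(name, tuple):      # first use: number = string
                number, text = name
                names[number] = text
            else:                            # later use: number only
                number = name
            arguments = []
            for _ in range(arg_count):
                argument = next(stream)
                if argument[0] != "EA":
                    raise ValueError("expected an EA command")
                arguments.append(argument[1])
            parts.append((names[number], arguments))
            if not has_continuation:
                break
            record = next(stream)
            if record[0] != "EC":
                raise ValueError("expected an EC command")
        messages.append((SEVERITY[command], parts))
    return messages
```

For instance, the records ("ES", (12, "cpp.example_a"), 1, 1), ("EA", "x"), ("EC", (40, "cpp.example_b"), 0, 0) form a single constraint error with one argument and one continuation.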

Each error argument has one of the forms:

	error-argument :
		B base-number
		C scope-identifier
		E expression
		H identifier-name
		I identifier
		L location
		N nat
		S string
		T type
		V number
		V - number

corresponding to the various syntactic categories described above. Note that a location error argument, while expressed relative to the current file location, does not change this location.

2.4.9. File inclusions

It is possible to include information on header files within the symbol table dump. Firstly a number is associated with each directory on the #include search path:

	path-command :
		FD number = string string_opt

The first string gives the directory pathname; the second, if present, gives the associated directory name as specified in the -N command-line option.

Now the start and end of each file are marked using:

	file-command :
		FS location directory
		FE location

where directory gives the number of the directory in the search path where the file was found, or * if the file was found by other means. It is worth noting that if, for example, a function definition is the last item in a file, the FE command will appear in the symbol table dump before the QFE command for the end of the function definition. This is because lexical analysis, where the end of file is detected, takes place before parsing, where the end of function is detected.

A #include directive, whether explicit or implicit, can be represented using:

	include-command :
		FIA location string
		FIQ location string
		FIN location string
		FIS location string
		FIE location string
		FIR location

the first three corresponding to header names of the forms <....>, "...." and [....] respectively, the next two corresponding to start-up and end-up files, and the final form being used to resume the original file after the #include directive has been processed.
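
The include commands might be rendered back into a readable form as in the Python sketch below. The name describe_include is hypothetical, and the code assumes that the string in the dump holds the header name without its delimiters.

```python
# Delimiters for the three header name forms.
HEADER_DELIMITERS = {"FIA": ("<", ">"), "FIQ": ('"', '"'), "FIN": ("[", "]")}

def describe_include(command, name=None):
    """Return a readable description of an include-command."""
    if command in HEADER_DELIMITERS:
        opening, closing = HEADER_DELIMITERS[command]
        return "#include %s%s%s" % (opening, name, closing)
    if command == "FIS":
        return "start-up file %s" % name
    if command == "FIE":
        return "end-up file %s" % name
    if command == "FIR":
        return "resume including file"   # FIR carries no string
    raise ValueError("not an include command: %r" % command)
```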

2.4.10. String literals

It is possible to dump information on string literals to the symbol table dump file using the commands:

	string-command :
		A location string
		AC location string
		AL location string
		ACL location string

representing string literals, character literals, wide string literals and wide character literals respectively. The string gives the text of the literal.