TDF is an abstraction of high-level languages.
The translation of a capsule to and from the corresponding syntax
tree is totally unambiguous; moreover, TDF has a "universal"
semantic interpretation as defined in the TDF specification.
FIGURE 2. TDF Capsule Structure
FIGURE 3. TDF Tokens
FIGURE 4. TDF Tokens (with Arguments)
FIGURE 5. TDF Compilation Phases
Note also that because the information in the target independent headers
describes abstractly the contents of the API and not some particular
implementation of it, the producer is in effect checking the program
against the API itself.
The syntactic details of the implementation are to be found in the
system headers. The process of defining the tokens describing the
API (called TDF library building) consists of comparing the implementation
of the API as given in the system headers with the abstract description
of the tokens comprising the API given in the target independent headers.
The token definitions thus produced are stored as TDF libraries, which
are just archives of TDF capsules.
These token definitions are created using exactly the same C ->
TDF translation program as is used in the producer phase. This program
knows nothing about the distinction between target independent and
target dependent TDF, it merely translates the C it is given (whether
from a program or a system header) into TDF. It is the compilation
process itself which enables the separation of target independent
and target dependent TDF.
In addition to the tokens made explicit in the API, the implicit tokens
built into the producer must also have their definitions inserted
into the TDF libraries. The method of definition of these tokens is
slightly different. The definitions are automatically deduced by,
for example, looking in the target machine's limits.h header.
Note that what we are doing in the main library build is checking
the actual implementation of the API against the abstract syntactic
description. Any variations of the syntactic aspects of the implementation
from the API will therefore show up. Thus library building is an effective
way of checking the syntactic conformance of a system to an API. Checking
the semantic conformance is far more difficult - we shall return to
this issue later.
FIGURE 6. Example Compilation
Furthermore, because all the producer's operations are performed at
this very abstract level, it is a simple matter to put in extra portability
checks. For example, it would be a relatively simple task to put most
of the functionality of lint or gcc's -Wall option into the producer.
These extra checks are switched on and off by using #pragma statements.
Because these checks can be turned off as well as on it is possible
to relax as well as strengthen portability checking. Thus if a program
is only intended to work on 32-bit machines, it is possible to switch
off certain portability checks. The whole ethos underlying the producer
is that these portability assumptions should be made explicit, so
that the appropriate level of checking can be done.
As has been previously mentioned, the use of a single front-end to
any compiler not only virtually eliminates the problems of differing
code interpretation and compiler quirks, but also reduces the exposure
to compiler bugs. Of course, this also applies to the TDF compiler,
which has a single front-end (the producer) and multiple back-ends
(the installers). As regards the syntax and semantics of the C language,
the producer is by default a strictly ANSI C compliant compiler. (Addition
to the October 1993 revision : Alas, this is no longer true; however
strict ANSI can be specified by means of a simple command line option
(see [1]). The decision whether to make the default strict and allow
people to relax it, or to make the default lenient and allow people
to strengthen it, is essentially a political one. It does not really
matter in technical terms provided the user is made aware of exactly
what each compilation mode means in terms of syntax, semantics and
portability checking.) However it is possible to change its behaviour,
again by means of #pragma statements.
Once the syntactic and semantic differences between the various
C dialects are overcome, the C -> TDF mapping is quite
straightforward. In a hierarchy from high level
to low level languages C and TDF are not that dissimilar - both come
towards the bottom of what may legitimately be regarded as high level
languages. Thus the constructs in C map easily onto the constructs
of TDF (there are a few exceptions, for example coercing integers
to pointers, which are discussed in [3]). Eccentricities of the C
language specification such as doing all integer arithmetic in the
promoted integer type are translated explicitly into TDF.
A number of issues arise when tokens are introduced.
In TDF tags, tokens and other externally named objects occupy separate
namespaces, and there are no constructs which can cut across these
namespaces in the way that the C macros do. There still remains the
problem that the only way to know that two tokens, say, in different
capsules are actually the same is if they have the same name. This,
as we have already seen in the case of system linking, can cause objects
to be identified wrongly.
In the main TDF linking phase - linking in the token definitions at
the start of the installation - we are primarily linking on token
names, these tokens being those arising from the use of the target
independent headers. Potential namespace problems are virtually eliminated
by the use of unique external names for the tokens in these headers
(such as ansi.stdio.FILE).
We can illustrate the "clean" nature of TDF linking by considering
the st_atime example given in section 2.2.3.
TDF linking also opens up new ways of combining code which may solve
some other namespace problems, as the open example in section 2.2.3 shows.
In fact this type of complete program linking is not always feasible.
For very large programs the resulting TDF capsule can be too large
for the installer to cope with (it is the system assembler which tends
to cause the most problems). Instead it may be better to use a more
judiciously chosen partial linking and hiding scheme.
It is the TDF to assembly code translator which is the main part of
the installer. Although not strictly related to the question of portability,
the nature of the translator is worth considering. Like the producer
(and the assembler), it is a transformational, as opposed to a combinatorial,
compilation phase. But whereas the transformation from C to TDF is
"difficult" because of the syntax and semantics of C and
the need to represent everything in an architecture neutral manner,
the transformation from TDF to assembly code is much easier because
of the unambiguous syntax and uniform semantics of TDF, and because
now we know the details of the target machine, it is no longer necessary
to work at such an abstract level.
The whole construction of the current generation of TDF translators
is based on the concept of compilation as transformation. They represent
the TDF they read in as a syntax tree, virtually identical to the
syntax tree comprising the TDF. The translation process then consists
of continually applying transformations to this tree - in effect TDF
-> TDF transformations - gradually optimising it and changing it
to a form where the translation into assembly source code is a simple
transcription process (see [7]).
Even such operations as constant evaluation - replacing 1 + 1 by 2
in the example above - may be regarded as TDF -> TDF transformations.
But so may more complex optimisations such as taking constants out
of a loop, common sub-expression elimination, strength reduction and
so on. Some of these transformations are universally applicable, others
can only be applied on certain classes of machines. This transformational
approach results in high quality code generation (see [5]) while minimising
the risk of transformational errors. Moreover the sharing of so much
code - up to 70% - between all the TDF translators, like the introduction
of a common front-end, further reduces the exposure to compiler bugs.
Much of the machine ABI information is built into the translator in
a very simple way.
One interesting range of optimisations implemented by many of the
current translators consists of the inlining of certain standard procedure
calls.
Most of the process of describing an API consists of going through
its English language specification, transcribing the object specifications
it gives into the #pragma token syntax.
There is a continuing API description programme at DRA. The current
status (October 1993) is that ANSI (X3.159), POSIX (1003.1), XPG3
(X/Open Portability Guide 3) and SVID (System V Interface Definition,
3rd Edition) have been described and extensively tested. POSIX2 (1003.2),
XPG4, AES (Revision A), X11 (Release 5) and Motif (Version 1.1) have
been described, but not yet extensively tested.
There may be some syntactic information in the paper API specifications
which
As an example of a deliberate incompatibility, both XPG3 and SVID3
declare a structure
This example shows how introducing extra abstractions can resolve
potential conflicts between APIs. But it may also be used to resolve
conflicts between the API specification and the API implementations.
For example, POSIX specifies that the structure
Both the previous two examples are really concerned with the question
of determining the correct level of abstraction in API specification.
Abstraction is inclusive and allows for API evolution, whereas specialisation
is exclusive and may lead to dead-end APIs. The SVID3 method of allowing
for longer messages than XPG3 - changing the
The implementation checking aspect is considered below. Let us here
consider the program checking aspect by re-examining the examples
given in section 2.2.4.1.
The importance of this change may be summarised by observing that
previously all the unpleasantnesses happened in the left hand side
of the diagram (the program half), whereas in the TDF scheme they
are in the right hand side (the API half). So API implementation problems
are seen to be a genuinely separate issue from the main business of
writing programs; the ball is firmly in the API implementor's court
rather than the programmer's. Also the problems need to be solved
once per API rather than once per program.
It might be said that this has not advanced us very far towards actually
dealing with the implementation errors. The API implementation still
contains errors whoever's responsibility it is. But the TDF library
building process gives the API implementor a second chance. Many of
the syntactic implementation problems will be shown up as the library
builder compares the implementation against the abstract API description,
and it may be possible to build corrections into the TDF libraries
so that the libraries reflect, not the actual implementation, but
some improved version of it.
To show how this might be done, we reconsider the examples of API
implementation errors given in section 2.2.4.2.
In the two cases given above -
For example, the problem with
All of this has tended to paint the system vendors as the villains
of the piece for not providing correct API implementations, but this
is not entirely fair. The reason why API implementation errors may
persist over many operating system releases is that system vendors
have as many porting problems as anyone else - preparing a new operating
system release is in effect a huge porting exercise - and are understandably
reluctant to change anything which basically works. The use of TDF
libraries could be a low-risk strategy for system vendors to allow
users the benefits of API conformance without changing the underlying
operating system.
Of course, if the system vendor's porting problems could be reduced,
they would have more confidence to make their underlying systems more
API conformant, and thereby help reduce the normal programmer's porting
problems. So whereas using the TDF libraries might be a short-term
workaround for API implementation problems, the rest of the TDF porting
system might help towards a long-term solution.
Another interesting possibility arises. As we said above, many APIs,
for example POSIX and BSD, offer equivalent functionality by different
methods. It may be possible to use the TDF library building process
to express one in terms of the other. For example, in the
For the
On each target machine we need to create a token library giving the
local definitions of the objects in the API. We shall assume that
the library corresponding to the ANSI C API has already been constructed,
so that we only need to define the token representing
The installation process may be specified by adding the following
lines to the
The final step is installation. The target independent TDF,
Note the four stages of the compilation : API specification, production,
API implementation and installation, corresponding to the four regions
of the compilation diagram (Fig. 5).
These conditional compilation constructs are used by the C -> TDF
producer to translate certain statements containing:
Not all such
The producer interprets
In order to help detect such "installer macro" problems
the producer has a mode for detecting them. All
The existence of conditional compilation within TDF also gives flexibility
in how to approach expressing target dependent code. Instead of a
"full" abstraction of the user-defined API as target dependent
types, values and functions, it can be abstracted as a set of binary
tokens (like
The latter approach of a fully abstracted user-defined API may be
more time consuming in the short run, but this may well be offset
by the increased ease of porting. Also there is no reason why a user-defined
API, once specified, should not serve more than one program. Similar
programs are likely to require the same abstractions of target dependent
constructs. Because the API is a concrete object, it can be reused
in this way in a very simple fashion. One could envisage libraries
of private APIs being built up in this way.
3.1. Features of TDF

It is not the purpose of this paper to explain the exact specification
of TDF - this is described elsewhere (see [6] and [4]) - but rather
to show how its general design features make it suitable as an aid
to writing portable programs.

TDF contains such things as exps (abstractions of expressions and
statements), shapes (abstractions of types) and tags (abstractions
of variable identifiers). In general form it is an abstract syntax
tree which is flattened and encoded as a series of bits, called a
capsule. This fairly high level of definition (for a compiler
intermediate language) means that TDF is architecture neutral in the
sense that it makes no assumptions about the underlying processor
architecture.

3.1.1. Capsule Structure
A TDF capsule consists of a number of units of various types. These
are embedded in a general linkage scheme (see Fig. 2). Each unit contains
a number of variable objects of various sorts (for example, tags and
tokens) which are potentially visible to other units. Within the unit
body each variable object is identified by a unique number. The linking
is via a set of variable objects which are global to the entire capsule.
These may in turn be associated with external names. For example,
in Fig. 2, the fourth variable of the first unit is identified with
the first variable of the third unit, and both are associated with
the fourth external name.
This capsule structure means that the combination of a number of capsules
to form a single capsule is a very natural operation. The actual units
are copied unchanged into the resultant capsule - it is only the surrounding
linking information that needs changing. Many criteria could be used
to determine how this linking is to be organised, but the simplest
is to link two objects if and only if they have the same external
name. This is the scheme that the current TDF linker has implemented.
Furthermore such operations as changing an external name or removing
it altogether ("hiding") are very simple under this linking
scheme.

3.1.2. Tokens
So, the combination of program at this high level is straightforward.
But TDF also provides another mechanism which allows for the combination
of program at the syntax tree level, namely tokens. Virtually
any node of the TDF tree may be a token: a place holder which stands
for a subtree. Before the TDF can be decoded fully the definition
of this token must be provided. The token definition is then macro
substituted for the token in the decoding process to form the complete
tree (see Fig. 3).
Tokens may also take arguments (see Fig. 4). The actual argument values
(from the main tree) are substituted for the formal parameters in
the token definition.
As mentioned above, tokens are one of the types of variable objects
which are potentially visible to external units. This means that a
token does not have to be defined in the same unit as it is used in.
Nor do these units have originally to have come from the same capsule,
provided they have been linked before they need to be fully decoded.
Tokens therefore provide a mechanism for the low-level separation
and combination of code.

3.2. TDF Compilation Phases

For example, in the "Hello world" program, the API specifies that
there shall be a type FILE and an object stdout of type FILE *, but
the implementations of these may be different on all the target
machines. Thus we need to be able to abstract out the code for FILE
and stdout from the TDF output by the producer, and provide the
appropriate (target dependent) definitions for these objects in the
installation phase.
3.2.1. API Description (Top Right)
The method used for this separation is the token mechanism. Firstly
the syntactic element of the API is described in the form of a set
of target independent headers. Whereas the target dependent, system
headers contain the actual implementation of the API on a particular
machine, the target independent headers express to the producer what
is actually in the API, and which may therefore be assumed to be common
to all compliant target machines. For example, in the target independent
headers for the ANSI standard, there will be a file stdio.h
containing the lines:
#pragma token TYPE FILE # ansi.stdio.FILE
#pragma token EXP rvalue : FILE * : stdout # ansi.stdio.stdout
#pragma token FUNC int ( const char *, FILE * ) : fputs # ansi.stdio.fputs
These #pragma token directives are extensions to the C syntax which
enable the expression of abstract syntax information to the producer.
The directives above tell the producer that there exists a type called
FILE, an expression stdout which is an rvalue (that is, a non-assignable
value) of type FILE *, and a procedure fputs with prototype:
int fputs ( const char *, FILE * ) ;
and that it should leave their values unresolved by means of tokens
(for more details on the #pragma token directive see [3]). Note how
the information in the target independent header precisely reflects
the syntactic information in the ANSI API.

The names ansi.stdio.FILE etc. give the external names for these
tokens, those which will be visible at the outermost layer of the
capsule; they are intended to be unique (this is discussed below).
It is worth making the distinction between the internal names and
these external token names. The former are the names used to represent
the objects within C, and the latter the names used within TDF to
represent the tokens corresponding to these objects.

3.2.2. Production (Top Left)
Now the producer can compile the program using these target independent
headers. As will be seen from the "Hello world" example,
these headers contain sufficient information to check that the program
is syntactically correct. The produced, target independent, TDF will
contain tokens corresponding to the various uses of stdout, fputs
and so on, but these tokens will be left undefined.
In fact there will be other undefined tokens in the TDF. The basic
C types, int and char, are used in the program, and their implementations
may vary between target machines. Thus these types must also be
represented by tokens. However these tokens are implicit in the producer
rather than explicit in the target independent headers.

3.2.3. API Implementation (Bottom Right)
Before the TDF output by the producer can be decoded fully it needs
to have had the definitions of the tokens it has left undefined provided.
These definitions will be potentially different on all target machines
and reflect the implementation of the API on that machine.

For example, in the example implementation of stdio.h given in section
2.1.2, the token ansi.stdio.FILE will be defined as the TDF compound
shape corresponding to the structure defining the type FILE (recall
the distinction between internal and external names). __iob will be
an undefined tag whose shape is an array of 60 copies of the shape
given by the token ansi.stdio.FILE, and the token ansi.stdio.stdout
will be defined to be the TDF expression corresponding to a pointer
to the second element of this array. Finally the token ansi.stdio.fputs
is defined to be the effect of applying the procedure given by the
undefined tag fputs. (In fact, this picture has been slightly simplified
for the sake of clarity. See the section on C -> TDF mappings in
section 3.3.2.)

The definitions of the implicit tokens are deduced automatically by,
for example, looking in the target machine's limits.h header to find
the local values of CHAR_MIN and CHAR_MAX, and deducing the definition
of the token corresponding to the C type char from this. It will be
the variety (the TDF abstraction of integer types) consisting of all
integers between these values.

3.2.4. Installation (Bottom Left)
The installation phase is now straightforward. The target independent
TDF representing the program contains various undefined tokens (corresponding
to objects in the API), and the definitions for these tokens on the
particular target machine (reflecting the API implementation) are
to be found in the local TDF libraries. It is a natural matter to
link these to form a complete, target dependent, TDF capsule. The
rest of the installation consists of a straightforward translation
phase (TDF -> target) to produce a binary object file, and linking
with the system libraries to form a final executable. Linking with
the system libraries will resolve any tags left undefined in the TDF.

3.2.5. Illustrated Example
In order to help clarify exactly what is happening where, Fig. 6 shows
a simple example superimposed on the TDF compilation diagram.
The program to be translated is simply:
FILE f ;
and the API is as above, so that FILE is an abstract type. This API
is described as target independent headers containing the #pragma
token statements given above. The producer combines the program with
the target independent headers to produce a target independent capsule
which declares a tag f whose shape is given by the token representing
FILE, but leaves this token undefined. In the API implementation,
the local definition of the type FILE from the system headers is
translated into the definition of this token by the library building
process. Finally in the installation, the target independent capsule
is combined with the local token definition library to form a target
dependent capsule in which all the tokens used are also defined. This
is then installed further as described above.

3.3. Aspects of the TDF System
Let us now consider in more detail some of the components of the TDF
system and how they fit into the compilation scheme.

3.3.1. The C to TDF Producer
Above it was emphasised how the design of the compilation strategy
aids the representation of program in a target independent manner,
but this is not enough in itself. The C -> TDF producer must represent
everything symbolically; it cannot make assumptions about the target
machine. For example, the line of C containing the initialisation:
int a = 1 + 1 ;
is translated into TDF representing precisely that, 1 + 1, not 2,
because it does not know the representation of int on the target
machine. The installer does know this, and so is able to replace
1 + 1 by 2 (provided this is actually true).

As another example, in the structure:
struct tag {
int a ;
double b ;
} ;
the producer does not know the actual value in bits of the offset
of the second field from the start of the structure - it depends on
the sizes of int and double and the alignment rules on the target
machine. Instead it represents it symbolically (it is the size of
int rounded up to a multiple of the alignment of double). This level
of abstraction makes the tokenisation required by the target independent
API headers very natural. If we only knew that there existed a structure
struct tag with a field b of type double then it is perfectly simple
to use a token to represent the (unknown) offset of this field from
the start of the structure rather than using the calculated (known)
value. Similarly, when it comes to defining this token in the library
building phase (recall that this is done by the same C -> TDF
translation program as the production) it is a simple matter to define
the token to be the calculated value.
Most of the functionality of lint (excluding intermodular checking)
or gcc's -Wall option could be put into the producer, and moreover
these checks applied to an abstract machine rather than a particular
target machine. Indeed a number of these checks have already been
implemented. These checks are switched on and off by using #pragma
statements. (For more details on the #pragma
syntax and
which portability checks are currently supported by the producer see
[3].) For example, ANSI C states that any undeclared function is assumed
to return int, whereas for strict portability checking it is more
useful to have undeclared functions marked as an error (indeed for
strict API checking this is essential). This is done by inserting
the line:

#pragma no implicit definitions

either at the start of each file to be checked or, more simply, in
a start-up file - a file which can be #include'd at the start of each
source file by means of a command line option.

The producer's behaviour can also be changed (again by means of #pragma
statements) to implement many
of the features found in "traditional" or "K&R"
C. Hence it is possible to precisely determine how the producer will
interpret the C code it is given by explicitly describing the C dialect
it is written in, in terms of these #pragma statements.

3.3.2. C to TDF Mappings
The nature of the C -> TDF transformation implemented by the producer
is worth considering, although not all the features described in this
section are fully implemented in the current (October 1993) producer.
Although it is only indirectly related to questions of portability,
this mapping does illustrate some of the problems the producer has
in trying to represent program in an architecture neutral manner.

So to add two char's, they are promoted to int's, added together as
int's, and the result is converted back to a char. These rules are
not built directly into TDF because of the desire to support languages
other than C (and even other C dialects).

Consider for example the type size_t from the ANSI standard. This
is a target dependent integer type, so bearing in mind what was said
above it is natural for the producer to use a tokenised variety (the
TDF representation of integer types) to stand for size_t. This is
done by a #pragma token statement of the form:

#pragma token VARIETY size_t # ansi.stddef.size_t

But if we want to do arithmetic on size_t's we need to know the integer
type corresponding to the integral promotion of size_t. But this is
again target dependent, so it makes sense to have another tokenised
variety representing the integral promotion of size_t. Thus the simple
token directive above maps to (at least) two TDF tokens, the type
itself and its integral promotion.

As another example, suppose that we have a target dependent C type, type
say, and we define a procedure which takes an argument
of type type
. In both the procedure body and at any call
of the procedure the TDF we need to produce to describe how C passes
this argument will depend on type
. This is because C
does not treat all procedure argument types uniformly. Most types
are passed by value, but array types are passed by address. But whether
or not type
is an array type is target dependent, so
we need to use tokens to abstract out the argument passing mechanism.
For example, we could implement the mechanism using four tokens :
one for the type type
(which will be a tokenised shape),
one for the type an argument of type type
is passed as,
arg_type
say, (which will be another tokenised shape),
and two for converting values of type type
to and from
the corresponding values of type arg_type
(these will
be tokens which take one exp argument and give an exp). For most types,
arg_type
will be the same as type
and the
conversion tokens will be identities, but for array types, arg_type
will be a pointer to type
and the conversion tokens will
be "address of" and "contents of".#pragma
token
directives and TDF tokens one might expect. Each such
directive maps onto a family of TDF tokens, and this mapping in a
sense encapsulates the C language specification. Of course in the
TDF library building process the definitions of all these tokens are
deduced automatically from the local values.ansi.stdio.FILE
in the example above). This
means that there is a genuine one to one correspondence between tokens
and token names. Of course this relies on the external token names
given in the headers being genuinely unique. In fact, as is explained
below, these names are normally automatically generated, and uniqueness
of names within a given API is checked. Also incorporating the API
name into the token name helps to ensure uniqueness across APIs. However
the token namespace does require careful management. (Note that the
user does not normally have access to the token namespace; all variable
and procedure names map into the tag namespace.)

We can illustrate the "clean" nature of TDF linking by considering
the st_atime example given in section 2.2.3. In the TDF compilation
scheme it is only during the library building that the system header
sys/stat.h will be included.
Thus the procedure name st_atime and the field selector st_atime will
be seen to belong to genuinely different namespaces - there are no
macros to disrupt this. The former will be translated into a TDF tag
with external name st_atime, whereas the latter is translated into
a token with external name posix.stat.struct_stat.st_atime, say. In
the TDF library reflecting the API implementation, the token
posix.stat.struct_stat.st_atime will be defined precisely as the system
header intended, as the offset corresponding to the C field selector
st_atim.st__sec. The fact that this token is defined using a macro
rather than a conventional direct field selector is not important
to the library building process. Now the combination of the program
with the API implementation in this case is straightforward - not
only are the procedure name and the field selector name in the TDF
now different, but they also lie in distinct namespaces. This shows
how the separation of the API implementation from the main program
is cleaner in the TDF compilation scheme than in the traditional scheme.

In the open example in section 2.2.3, the procedure open
is meant
to be internal to the program. It is the fact that it is not treated
as such which leads to the problem. If the program consisted of a
single source file then we could make open
a static
procedure, so that its name does not appear in the external namespace.
But if the program consists of several source files the external name
is necessary for intra-program linking. The TDF linker allows this
intra-program linking to be separated from the main system linking.
In the TDF compilation scheme described above each source file is
translated into a separate TDF capsule, which is installed separately
to a binary object file. It is only the system linking which finally
combines the various components into a single program. An alternative
scheme would be to use the TDF linker to combine all the TDF capsules
into a single capsule in the production phase and install that. Because
all the intra-program linking has already taken place, the external
names required for it can be "hidden" - that is to say,
removed from the tag namespace. Only tag names which are used but
not defined (and so are not internal to the program) and main
should not be hidden. In effect this linking phase has made all the
internal names in the program (except main
) static.

3.3.4. The TDF Installers
The TDF installer on a given machine typically consists of four phases:
TDF linking, which has already been discussed, translating TDF to
assembly source code, translating assembly source code to a binary
object file, and linking binary object files with the system libraries
to form the final executable. The latter two phases are currently
implemented by the system assembler and linker, and so are identical
to the traditional compilation scheme.

Consider, for example, the offset of the field b in the structure struct tag above: the
producer has already done all the hard work, providing a formula for
the offset in terms of the sizes and alignments of the basic C types.
The translator merely provides these values and the offset is automatically
evaluated by the constant evaluation transformations. Other aspects
of the ABI, for example the procedure argument and result passing
conventions, require more detailed attention.

Translators can also perform optimisations such as evaluating standard library calls on constant arguments: for example, strlen ( "hello" ) is replaced
by 5. As it stands this optimisation appears to run the risk of corrupting
the programmer's namespace - what if strlen
was a user-defined
procedure rather than the standard library routine (cf. the open
example in section 2.2.3)? In the TDF compilation scheme, however, strlen will be implemented by means of
a uniquely named token, ansi.string.strlen
say. It is
by recognising this token name as the token is expanded that the translators
are able to ensure that this is really the library routine strlen
.
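The optimisation itself is easy to state in C terms (a sketch only; a real TDF translator performs it at install time, when it recognises the token ansi.string.strlen):

```c
#include <string.h>

/* Because the argument is a string literal, a translator that knows
   this call really is the library strlen may replace the whole call
   by the constant 5. */
unsigned long folded_length(void)
{
    return (unsigned long) strlen("hello");
}
```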
A similar example is alloca.
Many other compilers inline alloca
, or rather they inline
__builtin_alloca
and rely on the programmer to identify
alloca
with __builtin_alloca
. This gets
round the potential namespace problems by getting the programmer to
confirm that alloca
in the program really is the library
routine alloca
. By the use of tokens this information
is automatically provided to the TDF translators.

3.4. TDF and APIs
What the discussion above has emphasised is that the ability to describe
APIs abstractly as target independent headers underpins the entire
TDF approach to portability. We now consider this in more detail.

3.4.1. API Description
The process of transforming an API specification into its description
in terms of #pragma token
directives is a time-consuming
but often fascinating task. In this section we discuss some of the
issues arising from the process of describing an API in this way.

3.4.1.1. The Description Process
As may be observed from the example given in section 3.2.1, the
#pragma token
syntax is not necessarily intuitively obvious.
It is designed to be a low-level description of tokens which is capable
of expressing many complex token specifications. Most APIs are however
specified in C-like terms, so an alternative syntax, closer to C,
has been developed in order to facilitate their description. This
is then transformed into the corresponding #pragma token
directives by a specification tool called tspec
(see
[2]), which also applies a number of checks to the input and generates
the unique token names. For example, the description leading to the
example above was:
+TYPE FILE ;
+EXP FILE *stdout ;
+FUNC int fputs ( const char *, FILE * ) ;
Note how close this is to the English language specification of the
API given previously. (There are a number of open issues relating
to tspec
and the #pragma token
syntax, mainly
concerned with determining the type of syntactic statements that it
is desired to make about the APIs being described. The current scheme
is adequate for those APIs so far considered, but it may need to be
extended in future.)

tspec
is not capable of expressing the full power of
the #pragma token
syntax. Whereas this makes it easier
to use in most cases, for describing the normal C-like objects such
as types, expressions and procedures, it cannot express complex token
descriptions. Instead it is necessary to express these directly in
the #pragma token
syntax. However this is only rarely
required: the constructs offsetof
, va_start
and va_arg
from ANSI are the only examples so far encountered
during the API description programme at DRA. For example, va_arg
takes an assignable expression of type va_list
and a
type t
and returns an expression of type t
.
Clearly, this cannot be expressed abstractly in C-like terms; so the
#pragma token
description:
#pragma token PROC ( EXP lvalue : va_list : e, TYPE t )\
EXP rvalue : t : va_arg # ansi.stdarg.va_arg
must be used instead.

The routine part of the description process consists of transcribing the paper API specification into the tspec syntax (if the specification
is given in a machine readable form this process can be partially
automated). The interesting part consists of trying to interpret what
is written and reading between the lines as to what is meant. It is
important to try to represent exactly what is in the specification
rather than being influenced by one's knowledge of a particular implementation,
otherwise the API checking phase of the compilation will not be checking
against what is actually in the API but against a particular way of
implementing it.

There are also some aspects of API specifications which tspec
(and the #pragma token
syntax)
is not yet capable of expressing. In particular, some APIs go into
very careful management of namespaces within the API, explicitly spelling
out exactly what should, and should not, appear in the namespaces
as each header is included (see the appendix on namespaces and APIs
below). What is actually being done here is to regard each header
as an independent sub-API. There is not however a sufficiently developed
"API calculus" to allow such relationships to be easily
expressed.

3.4.1.2. Resolving Conflicts
Another consideration during the description process is to try to
integrate the various API descriptions. For example, POSIX extends
ANSI, so it makes sense to have the target independent POSIX headers
include the corresponding ANSI headers and just add the new objects
introduced by POSIX. This does present problems with APIs which are
basically compatible but have a small number of incompatibilities,
whether deliberate or accidental. As an example of an "accidental"
incompatibility, XPG3 is an extension of POSIX, but whereas POSIX
declares malloc
by means of the prototype:
void *malloc ( size_t ) ;
XPG3 declares it by means of the traditional procedure declaration:
void *malloc ( s )
size_t s ;
These are surely intended to express the same thing, but in the first
case the argument is passed as a size_t
and in the second
it is firstly promoted to the integer promotion of size_t
.
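The difference can be made concrete with a small sketch. Here my_size_t is a hypothetical implementation of size_t as unsigned short, chosen purely to make the promotion visible:

```c
/* Hypothetical implementation of size_t, chosen so that the default
   argument promotions change the type actually passed. */
typedef unsigned short my_size_t;

/* Under the default argument promotions, an unsigned short argument to
   an old-style (unprototyped) procedure is passed as int on any machine
   where int can represent every unsigned short value.  The expression
   s + 0 has exactly the promoted type of s. */
int promoted_to_int(void)
{
    my_size_t s = 1;
    return sizeof(s + 0) == sizeof(int);
}
```

On a typical machine with 16-bit short and 32-bit int, promoted_to_int() returns 1: the old-style declaration of malloc would receive an int, while the prototype receives a my_size_t.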
On most machines these are compatible, either because of the particular
implementation of size_t
, or because the procedure calling
conventions make them compatible. However in general they are incompatible,
so the target independent headers either have to reflect this or have
to read between the lines and assume that the incompatibility was
accidental and ignore it.

As an example of a "deliberate" incompatibility, both XPG3 and SVID3 declare a structure struct msqid_ds
in sys/msg.h
which has fields msg_qnum
and msg_qbytes
.
The difference is that whereas XPG3 declares these fields to have
type unsigned short
, SVID3 declares them to have type
unsigned long
. However for most purposes the precise
types of these fields is not important, so the APIs can be unified
by making the types of these fields target dependent. That is to say,
tokenised integer types __msg_q_t
and __msg_l_t
are introduced. On XPG3-compliant machines these will both be defined
to be unsigned short
, and on SVID3-compliant machines
they will both be unsigned long
. So, although strict
XPG3 and strict SVID3 are incompatible, the two extension APIs created
by adding these types are compatible. In the rare case when the precise
type of these fields is important, the strict APIs can be recovered
by defining the field types to be unsigned short
or unsigned
long
at produce-time rather than at install-time. (XPG4 uses
a similar technique to resolve this incompatibility. But whereas the
XPG4 types need to be defined explicitly, the tokenised types are
defined implicitly according to whatever the field types are on a
particular machine.)

The same technique can smooth over implementation errors. POSIX specifies that the structure struct flock
defined in fcntl.h
shall have a field l_pid
of type pid_t
. However on at least two of the POSIX implementations
examined at DRA, pid_t
was implemented as an int
,
but the l_pid
field of struct flock
was
implemented as a short
(this showed up in the TDF library
building process). The immediate reaction might be that these system
have not implemented POSIX correctly, so they should be cast into
the outer darkness. However for the vast majority of applications,
even those which use the l_pid
field, its precise type
is not important. So the decision was taken to introduce a tokenised
integer type, __flock_pid_t
, to stand for the type of
the l_pid
field. So although the implementations do not
conform to strict POSIX, they do to this slightly more relaxed extension.
Of course, one could enforce strict POSIX by defining __flock_pid_t
to be pid_t
at produce-time, but the given implementations
would not conform to this stricter API.

In retrospect, the SVID3 change discussed above - changing the msg_qnum
and msg_qbytes
fields of struct msqid_ds
from unsigned short
to unsigned long
- is
an over-specialisation which leads to an unnecessary conflict with
XPG3. The XPG4 method of achieving exactly the same end - abstracting
the types of these fields - is, by contrast, a smooth evolutionary
path.

3.4.1.3. The Benefits of API Description
The description process is potentially of great benefit to bodies
involved in API specification. While the specification itself stays
on paper the only real existence of the API is through its implementations.
Giving the specification a concrete form means not only does it start
to be seen as an object in its own right, rather than some fuzzy document
underlying the real implementations, but also any omissions, insufficient
specifications (where what is written down does not reflect what the
writer actually meant) or built-in assumptions are more apparent.
It may also be able to help show up the kind of over-specialisation
discussed above. The concrete representation also becomes an object
which both applications and implementations can be automatically checked
against. As has been mentioned previously, the production phase of
the compilation involves checking the program against the abstract
API description, and the library building phase checks the syntactic
aspect of the implementation against it.

To illustrate this checking, consider again some of the examples given earlier. The SIGKILL
example is
straightforward; SIGKILL
will appear in the POSIX version
of signal.h
but not the ANSI version, so if the program
is compiled with the target independent ANSI headers it will be reported
as being undefined. In a sense this is nothing to do with the #pragma
token
syntax, but with the organisation of the target independent
headers. The other examples do however rely on the fact that the #pragma
token
syntax can express syntactic information in a way which
is not possible directly from C. Thus the target independent headers
express exactly the fact that time_t
is an arithmetic
type, about which nothing else is known. Thus ( t & 1 )
is not type correct for a time_t t
because the binary
&
operator does not apply to all arithmetic types.
Similarly, for the type div_t
the target independent
headers express the information that there exists a structure type
div_t
and field selectors quot
and rem
of div_t
of type int
, but nothing about
the order of these fields or the existence of other fields. Thus any
attempt to initialise a div_t
will fail because the correspondence
between the values in the initialisation and the fields of the structure
is unknown. The struct dirent
example is entirely analogous,
except that here the declarations of the structure type struct
dirent
and the field selector d_name
appear in
both the POSIX and XPG3 versions of dirent.h
, whereas
the field selector d_ino
appears only in the XPG3 version.

3.4.2. TDF Library Building
As we have said, two of the primary problems with writing portable
programs are dealing with API implementation errors on the target
machines - objects not being defined, or being defined in the wrong
place, or being implemented incorrectly - and namespace problems -
particularly those introduced by the system headers. The most interesting
contrast between the traditional compilation scheme (Fig. 1) and the
TDF scheme (Fig. 5) is that in the former the program comes directly
into contact with the "real world" of messy system headers
and incorrectly implemented APIs, whereas in the latter there is an
"ideal world" layer interposed. This consists of the target
independent headers, which describe all the syntactic features of
the API where they are meant to be, and with no extraneous material
to clutter up the namespaces (like index
and the macro
st_atime
in the examples given in section 2.2.3),
and the TDF libraries, which can be combined "cleanly" with
the program without any namespace problems. All the unpleasantness
has been shifted to the interface between this "ideal world"
and the "real world"; that is to say, the TDF library building.

In the implementation problems discussed earlier - DBL_MAX
and size_t
- the necessary changes are probably "safe". DBL_MAX
is not a special value in any library routines, and changing size_t
from int
to unsigned int
does not affect
its size, alignment or procedure passing rules (at least not on the
target machines we have in mind) and so should not disrupt the interface
with the system library.

3.4.2.2. System Library Problems
Errors in the system libraries will not be detected by the TDF library
builder because they are semantic errors, whereas the library building
process is only checking syntax. The only realistic ways of detecting
semantic problems is by means of test suites, such as the Plum-Hall
or CVSA library tests for ANSI and VSX for XPG3, or by detailed knowledge
of particular API implementations born of personal experience. However
it may be possible to build workarounds for problems identified in
these tests into the TDF libraries.

For example, consider the problem with realloc discussed in section 2.2.4.4. A workaround could be built into the TDF libraries by defining the token representing realloc to be the equivalent of:
#define realloc ( p, s ) ( void *q = ( p ), q ? ( realloc ) ( q, s ) : malloc ( s ) )
(where the C syntax has been extended to allow variables to be introduced
inside expressions) or:
static void *__realloc ( void *p, size_t s )
{
if ( p == NULL ) return ( malloc ( s ) ) ;
return ( ( realloc ) ( p, s ) ) ;
}
#define realloc ( p, s ) __realloc ( p, s )
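A quick check of the wrapper's behaviour (a sketch; fixed_realloc plays the role of __realloc above, and only the p == NULL case that the workaround adds is of interest):

```c
#include <stdlib.h>
#include <string.h>

/* The workaround from the text: make realloc ( NULL, s ) behave as
   malloc ( s ), as ANSI requires. */
static void *fixed_realloc(void *p, size_t s)
{
    if (p == NULL) return malloc(s);
    return realloc(p, s);
}

/* Returns 1 if both the malloc path and the ordinary realloc path work. */
int check_fixed_realloc(void)
{
    char *p = fixed_realloc(NULL, 6);   /* the case a broken realloc rejects */
    if (p == NULL) return 0;
    strcpy(p, "hello");
    p = fixed_realloc(p, 64);           /* ordinary realloc path */
    {
        int ok = (p != NULL) && (strcmp(p, "hello") == 0);
        free(p);
        return ok;
    }
}
```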
Alternatively, the token definition could be encoded directly into
TDF (not via C), using the TDF notation compiler (see [9]).

3.4.2.3. TDF Library Builders
The discussion above shows how the TDF libraries are an extra layer
which lies on top of the existing system API implementation, and how
this extra layer can be exploited to provide corrections and workarounds
to various implementation problems. The expertise of particular API
implementation problems on particular machines can be captured once
and for all in the TDF libraries, rather than being spread piecemeal
over all the programs which use that API implementation. But being
able to encapsulate this expertise in this way makes it a marketable
quantity. One could envisage a market in TDF libraries: ranging from
libraries closely reflecting the actual API implementation to top
of the range libraries with many corrections and workarounds built
in.

Recall the struct dirent example in section 2.3.3: on BSD systems the corresponding structure is declared in sys/dir.h under the name struct direct. But this presents no problems to the TDF library builder: it is perfectly
simple to tell it to look in sys/dir.h
instead of dirent.h
, and to identify struct direct
with struct
dirent
. So it may be possible to build a partial POSIX lookalike
on BSD systems by using the TDF library mechanism.

3.5. TDF and Conditional Compilation
So far our discussion of the TDF approach to portability has been
confined to the simplest case, where the program itself contains no
target dependent code. We now turn to programs which contain conditional
compilation. As we have seen, many of the reasons why it is necessary
to introduce conditional compilation into the traditional compilation
process either do not arise or are seen to be distinct phases in the
TDF compilation process. The use of a single front-end (the producer)
virtually eliminates problems of compiler limitations and differing
interpretations and reduces compiler bug problems, so it is not necessary
to introduce conditionally compiled workarounds for these. Also API
implementation problems, another prime reason for introducing conditional
compilation in the traditional scheme, are seen to be isolated in
the TDF library building process, thereby allowing the programmer
to work in an idealised world one step removed from the real API implementations.
However the most important reason for introducing conditional compilation
is where things, for reasons of efficiency or whatever, are genuinely
different on different machines. It is this we now consider.

3.5.1. User Defined Tokens

Consider again the MSB example given in section 2.3. The target dependent construct MSB can be made part of a user-defined API, specified, in tspec terms, to be:
+MACRO unsigned char MSB ( unsigned int a ) ;
where the macro MSB
gives the most significant byte of
its argument, a
. Let us say that the corresponding #pragma
token
statement is put into the header msb.h
.
Then the program can be recast into the form:
#include <stdio.h>
#include "msb.h"
unsigned int x = 100000000 ;
int main ()
{
printf ( "%u\n", MSB ( x ) ) ;
return ( 0 ) ;
}
The producer will compile this into a target independent TDF capsule
which uses a token to represent the use of MSB
, but leaves
this token undefined. The only question that remains is how this token
is defined on the target machine; that is, how the user-defined API
is implemented. On each target machine a TDF library containing the
local definition of the token representing MSB
needs
to be built. There are two basic possibilities. Firstly the person
performing the installation could build the library directly, by compiling
a program of the form:
#pragma implement interface "msb.h"
#include "config.h"
#ifndef SLOW_SHIFT
#define MSB ( a ) ( ( unsigned char ) ( a >> 24 ) )
#else
#ifdef BIG_ENDIAN
#define MSB ( a ) *( ( unsigned char * ) &( a ) )
#else
#define MSB ( a ) *( ( unsigned char * ) &( a ) + 3 )
#endif
#endif
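The three candidate definitions can be checked against each other (a sketch; it assumes, like the example itself, a 32-bit unsigned int):

```c
/* The three implementations of MSB from the text. */
#define MSB_SHIFT(a)  ((unsigned char)((a) >> 24))
#define MSB_BIG(a)    (*((unsigned char *)&(a)))
#define MSB_LITTLE(a) (*((unsigned char *)&(a) + 3))

/* On any given machine the shift form must agree with whichever
   byte-addressing form matches that machine's byte order. */
int msb_definitions_agree(void)
{
    unsigned int x = 100000000; /* the value used in the example program */
    unsigned char ref = MSB_SHIFT(x);
    return (MSB_BIG(x) == ref) || (MSB_LITTLE(x) == ref);
}
```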
with the appropriate config.h
to choose the correct local
implementation of the interface described in msb.h
. Alternatively
the programmer could provide three alternative TDF libraries corresponding
to the three implementations, and let the person installing the program
choose between these. The two approaches are essentially equivalent,
they just provide for making the choice of the implementation of the
user-defined component of the API in different ways. An interesting
alternative approach would be to provide a short program which does
the selection between the provided API implementations automatically.
This approach might be particularly effective in deciding which implementation
offers the best performance on a particular target machine.

3.5.2. User Defined Tokens - Example
As an example of how to define a simple token consider the following
example. We have a simple program which prints "hello" in
some language, the language being target dependent. Our first task
is to choose an API. We choose ANSI C extended by a tokenised object
hello
of type char *
which gives the message
to be printed. This object will be an rvalue (i.e. it cannot be assigned
to). For convenience this token is declared in a header file, tokens.h
say. This particular case is simple enough to encode by hand;
it takes the form:
#pragma token EXP rvalue : char * : hello #
#pragma interface hello
consisting of a #pragma token
directive describing the
object to be tokenised, and a #pragma interface
directive
to show that this is the only object in the API. An alternative would
be to generate tokens.h
from a tspec
specification
of the form:
+EXP char *hello ;
The next task is to write the program conforming to this API. This
may take the form of a single source file, hello.c
, containing
the lines:
#include <stdio.h>
#include "tokens.h"
int main ()
{
printf ( "%s\n", hello ) ;
return ( 0 ) ;
}
The production process may be specified by means of a Makefile
.
This uses the TDF C compiler, tcc
, which is an interface
to the TDF system which is designed to be like cc
, but
with extra options to handle the extra functionality offered by the
TDF system (see [1]).
produce : hello.j
echo "PRODUCTION COMPLETE"
hello.j : hello.c tokens.h
echo "PRODUCTION : C->TDF"
tcc -Fj hello.c
The production is run by typing make produce
. The ANSI
API is the default, and so does not need to be specified to tcc
.
The program hello.c
is compiled to a target independent
capsule, hello.j
. This will use a token to represent
hello
, but it will be left undefined.

The next stage is the library building: defining, on the target machine, the token hello.
This is done by means of a short C program, tokens.c
,
which implements the tokens declared in tokens.h
. This
might take the form:
#pragma implement interface "tokens.h"
#define hello "bonjour"
to define hello
to be "bonjour". On a different
machine, the definition of hello
could be given as "hello",
"guten Tag", "zdrastvetye" (excuse my transliteration)
or whatever (including complex expressions as well as simple strings).
Note the use of #pragma implement interface
to indicate
that we are now implementing the API described in tokens.h
,
as opposed to the use of #include
earlier when we were
just using the API.

The library building and installation phases can be expressed by adding the following rules to the Makefile:
install : hello
echo "INSTALLATION COMPLETE"
hello : hello.j tokens.tl
echo "INSTALLATION : TDF->TARGET"
tcc -o hello -J. -jtokens hello.j
tokens.tl : tokens.j
echo "LIBRARY BUILDING : LINKING LIBRARY"
tcc -Ymakelib -o tokens.tl tokens.j
tokens.j : tokens.c tokens.h
echo "LIBRARY BUILDING : DEFINING TOKENS"
tcc -Fj -not_ansi tokens.c
The complete installation process is run by typing make install
.
Firstly the file tokens.c
is compiled to give the TDF
capsule tokens.j
containing the definition of hello
.
The -not_ansi flag is needed because tokens.c
contains no real C declarations or definitions, and such an empty translation unit is
not allowed in ANSI C. The next step is to turn the capsule tokens.j
into a TDF library, tokens.tl
, using the -Ymakelib
option to tcc
(with older versions of tcc
it may be necessary to change this option to -Ymakelib -M -Fj).
This completes the API implementation.

In the final installation phase the target independent capsule, hello.j,
is linked with the TDF libraries tokens.tl
and ansi.tl
(which is built into tcc
as default) to form a target
dependent TDF capsule with all the necessary token definitions, which
is then translated to a binary object file and linked with the system
libraries. All of this is under the control of tcc
.

3.5.3. Conditional Compilation within TDF
Although tokens are the main method used to deal with target dependencies,
TDF does have built-in conditional compilation constructs. For most
TDF sorts X
(for example, exp, shape or variety) there
is a construct X_cond
which takes an exp and two X
's
and gives an X
. The exp argument will evaluate to an
integer constant at install time. If this is true (nonzero), the result
of the construct is the first X
argument and the second
is ignored; otherwise the result is the second X
argument
and the first is ignored. By ignored we mean completely ignored -
the argument is stepped over and not decoded. In particular any tokens
in the definition of this argument are not expanded, so it does not
matter if they are undefined.
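A C-level sketch of the effect (hypothetical names; in real use HAVE_FANCY would be a tokenised integer whose value only arrives at install time, whereas here it is given a produce-time value so the fragment is self-contained):

```c
/* Stands in for an install-time tokenised integer. */
#define HAVE_FANCY 0

int feature_level(void)
{
#if HAVE_FANCY
    /* This branch corresponds to the ignored X argument of an X_cond
       construct: the token fancy_level need never be defined, because
       the installer steps over this sub-tree without decoding it. */
    return fancy_level();
#else
    return 0;
#endif
}
```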
These X_cond constructs correspond to the C conditional compilation construct #if condition where condition
is a target dependent value. Thus, because
it is not known which branch will be taken at produce time, the decision
is postponed to install time. If condition
is a target
independent value then the branch to be taken is known at produce
time, so the producer only translates this branch. Thus, for example,
code surrounded by #if 0
... #endif
will
be ignored by the producer.

Target dependent #if
statements can be translated into TDF
X_cond
constructs. The two branches of the #if
statement are translated into the two X
arguments of
the X_cond
construct; that is, into sub-trees of the
TDF syntax tree. This can only be done if each of the two branches
is syntactically complete.

The producer interprets #ifdef (and #ifndef) constructs to mean: is this macro defined (or undefined) at produce
time? Given the nature of pre-processing in C this is in fact the
only sensible interpretation. But if such constructs are being used
to control conditional compilation, what is actually intended is,
is this macro defined at install time? This distinction is necessitated
by the splitting of the TDF compilation into production and installation
- it does not exist in the traditional compilation scheme. For example,
in the mips example in section 2.3, whether or not mips
is defined is intended to be an installer property, rather than what
it is interpreted as, a producer property. The choice of the conditional
compilation path may be put off to install time by, for example, changing
#ifdef mips
to #if is_mips
where is_mips
is a tokenised integer which is either 1 (on those machines on which
mips
would be defined) or 0 (otherwise). In fact in view
of what was said above about syntactic completeness, it might be better
to recast the program as:
#include <stdio.h>
#include "user_api.h" /* For the spec of is_mips */
int main ()
{
if ( is_mips ) {
fputs ( "This machine is a mips\n", stdout ) ;
}
return ( 0 ) ;
}
because the branches of an if
statement, unlike those
of an #if
statement, have to be syntactically complete
in any case. The installer will optimise out the unnecessary test
and any unreached code, so the use of if ( condition )
is guaranteed to produce as efficient code as #if condition
.

As an aid to the programmer, #ifdef
and #ifndef
constructs in which the compilation path
to be taken is potentially target dependent are reported (see [3]
and [8]).

An alternative to defining a user-defined API is to introduce tokenised integers (like is_mips
in the example above) which are
used to control conditional compilation. This latter approach can
be used to quickly adapt existing programs to a TDF-portable form
since it is closer to the "traditional" approach of scattering
the program with #ifdef
's and #ifndef
's
to implement target dependent code. However the definition of a user-defined
API gives a better separation of target independent and target dependent
code, and the effort to define such an API may often be justified.
When writing a new program from scratch the API rather than the conditional
compilation approach is recommended.

3.5.4. Alternative Program Versions
Consider again the program described in section 2.3.4 which has
optional features for displaying its output graphically depending
on the boolean value HAVE_X_WINDOWS
. By making HAVE_X_WINDOWS
part of the user-defined API as a tokenised integer and using:
#if HAVE_X_WINDOWS
to conditionally compile the X Windows code, the choice of whether
or not to use this version of the program is postponed to install
time. If both POSIX and X Windows are implemented on the target machine
the installation is straightforward. HAVE_X_WINDOWS
is
defined to be true, and the installation proceeds as normal. The case
where only POSIX is implemented appears to present problems. The TDF
representing the program will contain undefined tokens representing
objects from both the POSIX and X Windows APIs. Surely it is necessary
to define these tokens (i.e. implement both APIs) in order to install
the TDF. But because of the use of conditional compilation, all the
applications of X Windows tokens will be inside X_cond
constructs on the branch corresponding to HAVE_X_WINDOWS
being true. If it is actually false then these branches are stepped
over and completely ignored. Thus it does not matter that these tokens
are undefined. Hence the conditional compilation constructs within
TDF give the same flexibility in the API implementation in this case
as do those in C.
Crown Copyright © 1998.