TDF is an abstraction of high-level languages.
The translation of a capsule to and from the corresponding syntax
tree is totally unambiguous; moreover, TDF has a "universal"
semantic interpretation as defined in the TDF specification.
FIGURE 2. TDF Capsule Structure
FIGURE 3. TDF Tokens
FIGURE 4. TDF Tokens (with Arguments)
FIGURE 5. TDF Compilation Phases
Note also that because the information in the target independent headers
describes abstractly the contents of the API and not some particular
implementation of it, the producer is in effect checking the program
against the API itself.
The syntactic details of the implementation are to be found in the
system headers. The process of defining the tokens describing the
API (called TDF library building) consists of comparing the implementation
of the API as given in the system headers with the abstract description
of the tokens comprising the API given in the target independent headers.
The token definitions thus produced are stored as TDF libraries, which
are just archives of TDF capsules.
These token definitions are created using exactly the same C ->
TDF translation program as is used in the producer phase. This program
knows nothing about the distinction between target independent and
target dependent TDF, it merely translates the C it is given (whether
from a program or a system header) into TDF. It is the compilation
process itself which enables the separation of target independent
and target dependent TDF.
In addition to the tokens made explicit in the API, the implicit tokens
built into the producer must also have their definitions inserted
into the TDF libraries. The method of definition of these tokens is
slightly different. The definitions are automatically deduced by,
for example, looking in the target machine's limits.h header.
Note that what we are doing in the main library build is checking
the actual implementation of the API against the abstract syntactic
description. Any variations of the syntactic aspects of the implementation
from the API will therefore show up. Thus library building is an effective
way of checking the syntactic conformance of a system to an API. Checking
the semantic conformance is far more difficult - we shall return to
this issue later.
FIGURE 6. Example Compilation
Furthermore, because all the producer's operations are performed at
this very abstract level, it is a simple matter to put in extra portability
checks. For example, it would be a relatively simple task to put most
of the functionality of lint or gcc's -Wall option into the producer.
These extra checks are switched on and off by using #pragma statements.
Because these checks can be turned off as well as on it is possible
to relax as well as strengthen portability checking. Thus if a program
is only intended to work on 32-bit machines, it is possible to switch
off certain portability checks. The whole ethos underlying the producer
is that these portability assumptions should be made explicit, so
that the appropriate level of checking can be done.
As has been previously mentioned, the use of a single front-end to
any compiler not only virtually eliminates the problems of differing
code interpretation and compiler quirks, but also reduces the exposure
to compiler bugs. Of course, this also applies to the TDF compiler,
which has a single front-end (the producer) and multiple back-ends
(the installers). As regards the syntax and semantics of the C language,
the producer is by default a strictly ANSI C compliant compiler. (Addition
to the October 1993 revision : Alas, this is no longer true; however
strict ANSI can be specified by means of a simple command line option
(see [1]). The decision whether to make the default strict and allow
people to relax it, or to make the default lenient and allow people
to strengthen it, is essentially a political one. It does not really
matter in technical terms provided the user is made aware of exactly
what each compilation mode means in terms of syntax, semantics and
portability checking.) However it is possible to change its behaviour,
again by means of #pragma statements.
Once the syntactic and semantic differences between the various
C dialects are overcome, the C -> TDF mapping is quite
straightforward. In a hierarchy from high level
to low level languages C and TDF are not that dissimilar - both come
towards the bottom of what may legitimately be regarded as high level
languages. Thus the constructs in C map easily onto the constructs
of TDF (there are a few exceptions, for example coercing integers
to pointers, which are discussed in [3]). Eccentricities of the C
language specification such as doing all integer arithmetic in the
promoted integer type are translated explicitly into TDF.
A number of issues arise when tokens are introduced.
In TDF tags, tokens and other externally named objects occupy separate
namespaces, and there are no constructs which can cut across these
namespaces in the way that the C macros do. There still remains the
problem that the only way to know that two tokens, say, in different
capsules are actually the same is if they have the same name. This,
as we have already seen in the case of system linking, can cause objects
to be identified wrongly.
In the main TDF linking phase - linking in the token definitions at
the start of the installation - we are primarily linking on token
names, these tokens being those arising from the use of the target
independent headers. Potential namespace problems are virtually eliminated
by the use of unique external names for the tokens in these headers
(such as ansi.stdio.FILE).
We can illustrate the "clean" nature of TDF linking by considering
the st_atime example given in section 2.2.3.
TDF linking also opens up new ways of combining code which may solve
some other namespace problems, as the open example in section 2.2.3 shows.
In fact this type of complete program linking is not always feasible.
For very large programs the resulting TDF capsule can be too large
for the installer to cope with (it is the system assembler which tends
to cause the most problems). Instead it may be better to use a more
judiciously chosen partial linking and hiding scheme.
It is the TDF to assembly code translator which is the main part of
the installer. Although not strictly related to the question of portability,
the nature of the translator is worth considering. Like the producer
(and the assembler), it is a transformational, as opposed to a combinatorial,
compilation phase. But whereas the transformation from C to TDF is
"difficult" because of the syntax and semantics of C and
the need to represent everything in an architecture neutral manner,
the transformation from TDF to assembly code is much easier because
of the unambiguous syntax and uniform semantics of TDF, and because
now we know the details of the target machine, it is no longer necessary
to work at such an abstract level.
The whole construction of the current generation of TDF translators
is based on the concept of compilation as transformation. They represent
the TDF they read in as a syntax tree, virtually identical to the
syntax tree comprising the TDF. The translation process then consists
of continually applying transformations to this tree - in effect TDF
-> TDF transformations - gradually optimising it and changing it
to a form where the translation into assembly source code is a simple
transcription process (see [7]).
Even such operations as constant evaluation - replacing 1 + 1 by 2
in the example above - may be regarded as TDF -> TDF transformations.
But so may more complex optimisations such as taking constants out
of a loop, common sub-expression elimination, strength reduction and
so on. Some of these transformations are universally applicable, others
can only be applied on certain classes of machines. This transformational
approach results in high quality code generation (see [5]) while minimising
the risk of transformational errors. Moreover the sharing of so much
code - up to 70% - between all the TDF translators, like the introduction
of a common front-end, further reduces the exposure to compiler bugs.
Much of the machine ABI information is built into the translator in
a very simple way.
One interesting range of optimisations implemented by many of the
current translators consists of the inlining of certain standard procedure
calls.
Most of the process of describing an API consists of going through
its English language specification, transcribing the object specifications
it gives into the #pragma token syntax.
There is a continuing API description programme at DRA. The current
status (October 1993) is that ANSI (X3.159), POSIX (1003.1), XPG3
(X/Open Portability Guide 3) and SVID (System V Interface Definition,
3rd Edition) have been described and extensively tested. POSIX2 (1003.2),
XPG4, AES (Revision A), X11 (Release 5) and Motif (Version 1.1) have
been described, but not yet extensively tested.
There may be some syntactic information in the paper API specifications
which
As an example of a deliberate incompatibility, both XPG3 and SVID3
declare a structure
This example shows how introducing extra abstractions can resolve
potential conflicts between APIs. But it may also be used to resolve
conflicts between the API specification and the API implementations.
For example, POSIX specifies that the structure
Both the previous two examples are really concerned with the question
of determining the correct level of abstraction in API specification.
Abstraction is inclusive and allows for API evolution, whereas specialisation
is exclusive and may lead to dead-end APIs. The SVID3 method of allowing
for longer messages than XPG3 - changing the
The implementation checking aspect is considered below. Let us here
consider the program checking aspect by re-examining the examples
given in section 2.2.4.1.
The importance of this change may be summarised by observing that
previously all the unpleasantnesses happened in the left hand side
of the diagram (the program half), whereas in the TDF scheme they
are in the right hand side (the API half). So API implementation problems
are seen to be a genuinely separate issue from the main business of
writing programs; the ball is firmly in the API implementor's court
rather than the programmer's. Also the problems need to be solved
once per API rather than once per program.
It might be said that this has not advanced us very far towards actually
dealing with the implementation errors. The API implementation still
contains errors whoever's responsibility it is. But the TDF library
building process gives the API implementor a second chance. Many of
the syntactic implementation problems will be shown up as the library
builder compares the implementation against the abstract API description,
and it may be possible to build corrections into the TDF libraries
so that the libraries reflect, not the actual implementation, but
some improved version of it.
To show how this might be done, we reconsider the examples of API
implementation errors given in section 2.2.4.2.
In the two cases given above -
For example, the problem with
All of this has tended to paint the system vendors as the villains
of the piece for not providing correct API implementations, but this
is not entirely fair. The reason why API implementation errors may
persist over many operating system releases is that system vendors
have as many porting problems as anyone else - preparing a new operating
system release is in effect a huge porting exercise - and are understandably
reluctant to change anything which basically works. The use of TDF
libraries could be a low-risk strategy for system vendors to allow
users the benefits of API conformance without changing the underlying
operating system.
Of course, if the system vendor's porting problems could be reduced,
they would have more confidence to make their underlying systems more
API conformant, and thereby help reduce the normal programmer's porting
problems. So whereas using the TDF libraries might be a short-term
workaround for API implementation problems, the rest of the TDF porting
system might help towards a long-term solution.
Another interesting possibility arises. As we said above, many APIs,
for example POSIX and BSD, offer equivalent functionality by different
methods. It may be possible to use the TDF library building process
to express one in terms of the other. For example, in the
For the
On each target machine we need to create a token library giving the
local definitions of the objects in the API. We shall assume that
the library corresponding to the ANSI C API has already been constructed,
so that we only need to define the token representing
The installation process may be specified by adding the following
lines to the
The final step is installation. The target independent TDF,
Note the four stages of the compilation : API specification, production,
API implementation and installation, corresponding to the four regions
of the compilation diagram (Fig. 5).
These conditional compilation constructs are used by the C -> TDF
producer to translate certain statements containing:
Not all such
The producer interprets
In order to help detect such "installer macro" problems
the producer has a mode for detecting them. All
The existence of conditional compilation within TDF also gives flexibility
in how to approach expressing target dependent code. Instead of a
"full" abstraction of the user-defined API as target dependent
types, values and functions, it can be abstracted as a set of binary
tokens (like
The latter approach of a fully abstracted user-defined API may be
more time consuming in the short run, but this may well be offset
by the increased ease of porting. Also there is no reason why a user-defined
API, once specified, should not serve more than one program. Similar
programs are likely to require the same abstractions of target dependent
constructs. Because the API is a concrete object, it can be reused
in this way in a very simple fashion. One could envisage libraries
of private APIs being built up in this way.
3.1. Features of TDF

It is not the purpose of this paper to explain the exact specification
of TDF - this is described elsewhere (see [6] and [4]) - but rather
to show how its general design features make it suitable as an aid
to writing portable programs.

TDF contains such things as exps (abstractions of expressions and
statements), shapes (abstractions of types) and tags (abstractions
of variable identifiers). In general form it is an abstract syntax
tree which is flattened and encoded as a series of bits, called a
capsule. This fairly high level of definition (for a compiler
intermediate language) means that TDF is architecture neutral in the
sense that it makes no assumptions about the underlying processor
architecture.

3.1.1. Capsule Structure
A TDF capsule consists of a number of units of various types. These
are embedded in a general linkage scheme (see Fig. 2). Each unit contains
a number of variable objects of various sorts (for example, tags and
tokens) which are potentially visible to other units. Within the unit
body each variable object is identified by a unique number. The linking
is via a set of variable objects which are global to the entire capsule.
These may in turn be associated with external names. For example,
in Fig. 2, the fourth variable of the first unit is identified with
the first variable of the third unit, and both are associated with
the fourth external name.
This capsule structure means that the combination of a number of capsules
to form a single capsule is a very natural operation. The actual units
are copied unchanged into the resultant capsule - it is only the surrounding
linking information that needs changing. Many criteria could be used
to determine how this linking is to be organised, but the simplest
is to link two objects if and only if they have the same external
name. This is the scheme that the current TDF linker has implemented.
Furthermore such operations as changing an external name or removing
it altogether ("hiding") are very simple under this linking
scheme.

3.1.2. Tokens
So, the combination of program at this high level is straightforward.
But TDF also provides another mechanism which allows for the combination
of program at the syntax tree level, namely tokens. Virtually
any node of the TDF tree may be a token: a place holder which stands
for a subtree. Before the TDF can be decoded fully the definition
of this token must be provided. The token definition is then macro
substituted for the token in the decoding process to form the complete
tree (see Fig. 3).
Tokens may also take arguments (see Fig. 4). The actual argument values
(from the main tree) are substituted for the formal parameters in
the token definition.
As mentioned above, tokens are one of the types of variable objects
which are potentially visible to external units. This means that a
token does not have to be defined in the same unit as it is used in.
Nor do these units have originally to have come from the same capsule,
provided they have been linked before they need to be fully decoded.
Tokens therefore provide a mechanism for the low-level separation
and combination of code.

3.2. TDF Compilation Phases

For example, in the "Hello world" program, the API specifies that
there shall be a type FILE and an object stdout of type FILE *, but
the implementations of these may be different on all the target
machines. Thus we need to be able to abstract out the code for FILE
and stdout from the TDF output by the producer, and provide the
appropriate (target dependent) definitions for these objects in the
installation phase.
3.2.1. API Description (Top Right)
The method used for this separation is the token mechanism. Firstly
the syntactic element of the API is described in the form of a set
of target independent headers. Whereas the target dependent, system
headers contain the actual implementation of the API on a particular
machine, the target independent headers express to the producer what
is actually in the API, and which may therefore be assumed to be common
to all compliant target machines. For example, in the target independent
headers for the ANSI standard, there will be a file stdio.h
containing the lines:
#pragma token TYPE FILE # ansi.stdio.FILE
#pragma token EXP rvalue : FILE * : stdout # ansi.stdio.stdout
#pragma token FUNC int ( const char *, FILE * ) : fputs # ansi.stdio.fputs
These #pragma token directives are extensions to the C syntax which
enable the expression of abstract syntax information to the producer.
The directives above tell the producer that there exists a type called
FILE, an expression stdout which is an rvalue (that is, a non-assignable
value) of type FILE *, and a procedure fputs with prototype:
int fputs ( const char *, FILE * ) ;
and that it should leave their values unresolved by means of tokens
(for more details on the #pragma token directive see [3]). Note how
the information in the target independent header precisely reflects
the syntactic information in the ANSI API.

The names ansi.stdio.FILE etc. give the external names for these
tokens, those which will be visible at the outermost layer of the
capsule; they are intended to be unique (this is discussed below).
It is worth making the distinction between the internal names and
these external token names. The former are the names used to represent
the objects within C, and the latter the names used within TDF to
represent the tokens corresponding to these objects.

3.2.2. Production (Top Left)
Now the producer can compile the program using these target independent
headers. As will be seen from the "Hello world" example,
these headers contain sufficient information to check that the program
is syntactically correct. The produced, target independent, TDF will
contain tokens corresponding to the various uses of stdout, fputs
and so on, but these tokens will be left undefined.
In fact there will be other undefined tokens in the TDF. The basic
C types, int and char, are used in the program, and their implementations
may vary between target machines. Thus these types must also be
represented by tokens. However these tokens are implicit in the producer
rather than explicit in the target independent headers.

3.2.3. API Implementation (Bottom Right)
Before the TDF output by the producer can be decoded fully it needs
to have had the definitions of the tokens it has left undefined provided.
These definitions will be potentially different on all target machines
and reflect the implementation of the API on that machine.

For example, in the example implementation of stdio.h given in section
2.1.2, the token ansi.stdio.FILE will be defined as the TDF compound
shape corresponding to the structure defining the type FILE (recall
the distinction between internal and external names). __iob will be
an undefined tag whose shape is an array of 60 copies of the shape
given by the token ansi.stdio.FILE, and the token ansi.stdio.stdout
will be defined to be the TDF expression corresponding to a pointer
to the second element of this array. Finally the token ansi.stdio.fputs
is defined to be the effect of applying the procedure given by the
undefined tag fputs. (In fact, this picture has been slightly simplified
for the sake of clarity. See the section on C -> TDF mappings in
section 3.3.2.)

The definitions of the implicit tokens are deduced automatically by,
for example, looking in the target machine's limits.h header to find
the local values of CHAR_MIN and CHAR_MAX, and deducing the definition
of the token corresponding to the C type char from this. It will be
the variety (the TDF abstraction of integer types) consisting of all
integers between these values.

3.2.4. Installation (Bottom Left)
The installation phase is now straightforward. The target independent
TDF representing the program contains various undefined tokens (corresponding
to objects in the API), and the definitions for these tokens on the
particular target machine (reflecting the API implementation) are
to be found in the local TDF libraries. It is a natural matter to
link these to form a complete, target dependent, TDF capsule. The
rest of the installation consists of a straightforward translation
phase (TDF -> target) to produce a binary object file, and linking
with the system libraries to form a final executable. Linking with
the system libraries will resolve any tags left undefined in the TDF.

3.2.5. Illustrated Example
In order to help clarify exactly what is happening where, Fig. 6 shows
a simple example superimposed on the TDF compilation diagram.
The program to be translated is simply:
FILE f ;
and the API is as above, so that FILE is an abstract type. This API
is described as target independent headers containing the #pragma
token statements given above. The producer combines the program with
the target independent headers to produce a target independent capsule
which declares a tag f whose shape is given by the token representing
FILE, but leaves this token undefined. In the API implementation,
the local definition of the type FILE from the system headers is
translated into the definition of this token by the library building
process. Finally in the installation, the target independent capsule
is combined with the local token definition library to form a target
dependent capsule in which all the tokens used are also defined. This
is then installed further as described above.

3.3. Aspects of the TDF System
Let us now consider in more detail some of the components of the TDF
system and how they fit into the compilation scheme.

3.3.1. The C to TDF Producer
Above it was emphasised how the design of the compilation strategy
aids the representation of program in a target independent manner,
but this is not enough in itself. The C -> TDF producer must represent
everything symbolically; it cannot make assumptions about the target
machine. For example, the line of C containing the initialisation:
int a = 1 + 1 ;
is translated into TDF representing precisely that, 1 + 1, not 2,
because it does not know the representation of int on the target
machine. The installer does know this, and so is able to replace
1 + 1 by 2 (provided this is actually true).

As another example, in the structure:
struct tag {
int a ;
double b ;
} ;
the producer does not know the actual value in bits of the offset
of the second field from the start of the structure - it depends on
the sizes of int and double and the alignment rules on the target
machine. Instead it represents it symbolically (it is the size of
int rounded up to a multiple of the alignment of double). This level
of abstraction makes the tokenisation required by the target independent
API headers very natural. If we only knew that there existed a structure
struct tag with a field b of type double then it is perfectly simple
to use a token to represent the (unknown) offset of this field from
the start of the structure rather than using the calculated (known)
value. Similarly, when it comes to defining this token in the library
building phase (recall that this is done by the same C -> TDF
translation program as the production) it is a simple matter to define
the token to be the calculated value.
Most of the functionality of lint (excluding intermodular checking)
or gcc's -Wall option could be put into the producer, and moreover
these checks applied to an abstract machine rather than a particular
target machine. Indeed a number of these checks have already been
implemented. These checks are switched on and off by using #pragma
statements. (For more details on the #pragma
syntax and
which portability checks are currently supported by the producer see
[3].) For example, ANSI C states that any undeclared function is assumed
to return int, whereas for strict portability checking it is more
useful to have undeclared functions marked as an error (indeed for
strict API checking this is essential). This is done by inserting
the line:

#pragma no implicit definitions

either at the start of each file to be checked or, more simply, in
a start-up file - a file which can be #include'd at the start of each
source file by means of a command line option.

The producer's behaviour can also be changed (again by means of #pragma
statements) to implement many
of the features found in "traditional" or "K&R"
C. Hence it is possible to precisely determine how the producer will
interpret the C code it is given by explicitly describing the C dialect
it is written in, in terms of these #pragma statements.

3.3.2. C to TDF Mappings
The nature of the C -> TDF transformation implemented by the producer
is worth considering, although not all the features described in this
section are fully implemented in the current (October 1993) producer.
Although it is only indirectly related to questions of portability,
this mapping does illustrate some of the problems the producer has
in trying to represent program in an architecture neutral manner.

So to add two char's, they are promoted to int's, added together as
int's, and the result is converted back to a char. These rules are
not built directly into TDF because of the desire to support languages
other than C (and even other C dialects).

Consider for example the type size_t from the ANSI standard. This
is a target dependent integer type, so bearing in mind what was said
above it is natural for the producer to use a tokenised variety (the
TDF representation of integer types) to stand for size_t. This is
done by a #pragma token statement of the form:

#pragma token VARIETY size_t # ansi.stddef.size_t

But if we want to do arithmetic on size_t's we need to know the integer
type corresponding to the integral promotion of size_t. But this is
again target dependent, so it makes sense to have another tokenised
variety representing the integral promotion of size_t. Thus the simple
token directive above maps to (at least) two TDF tokens, the type
itself and its integral promotion.

As another example, suppose that we have a target dependent C type, type
say, and we define a procedure which takes an argument
of type type
. In both the procedure body and at any call
of the procedure the TDF we need to produce to describe how C passes
this argument will depend on type
. This is because C
does not treat all procedure argument types uniformly. Most types
are passed by value, but array types are passed by address. But whether
or not type
is an array type is target dependent, so
we need to use tokens to abstract out the argument passing mechanism.
For example, we could implement the mechanism using four tokens :
one for the type type
(which will be a tokenised shape),
one for the type an argument of type type
is passed as,
arg_type
say, (which will be another tokenised shape),
and two for converting values of type type
to and from
the corresponding values of type arg_type
(these will
be tokens which take one exp argument and give an exp). For most types,
arg_type
will be the same as type
and the
conversion tokens will be identities, but for array types, arg_type
will be a pointer to type
and the conversion tokens will
be "address of" and "contents of".#pragma
token
directives and TDF tokens one might expect. Each such
directive maps onto a family of TDF tokens, and this mapping in a
sense encapsulates the C language specification. Of course in the
TDF library building process the definitions of all these tokens are
deduced automatically from the local values.ansi.stdio.FILE
in the example above). This
means that there is a genuine one to one correspondence between tokens
and token names. Of course this relies on the external token names
given in the headers being genuinely unique. In fact, as is explained
below, these names are normally automatically generated, and uniqueness
of names within a given API is checked. Also incorporating the API
name into the token name helps to ensure uniqueness across APIs. However
the token namespace does require careful management. (Note that the
user does not normally have access to the token namespace; all variable
and procedure names map into the tag namespace.)

We can illustrate the "clean" nature of TDF linking by considering
the st_atime example given in section 2.2.3. In the TDF compilation
scheme it is only during the library building that the system header
sys/stat.h will be included.
Thus the procedure name st_atime and the field selector st_atime will
be seen to belong to genuinely different namespaces - there are no
macros to disrupt this. The former will be translated into a TDF tag
with external name st_atime, whereas the latter is translated into
a token with external name posix.stat.struct_stat.st_atime, say. In
the TDF library reflecting the API implementation, the token
posix.stat.struct_stat.st_atime will be defined precisely as the system
header intended, as the offset corresponding to the C field selector
st_atim.st__sec. The fact that this token is defined using a macro
rather than a conventional direct field selector is not important
to the library building process. Now the combination of the program
with the API implementation in this case is straightforward - not
only are the procedure name and the field selector name in the TDF
now different, but they also lie in distinct namespaces. This shows
how the separation of the API implementation from the main program
is cleaner in the TDF compilation scheme than in the traditional scheme.

In the open example in section 2.2.3, the procedure open
is meant
to be internal to the program. It is the fact that it is not treated
as such which leads to the problem. If the program consisted of a
single source file then we could make open
a static
procedure, so that its name does not appear in the external namespace.
But if the program consists of several source files the external name
is necessary for intra-program linking. The TDF linker allows this
intra-program linking to be separated from the main system linking.
In the TDF compilation scheme described above each source file is
translated into a separate TDF capsule, which is installed separately
to a binary object file. It is only the system linking which finally
combines the various components into a single program. An alternative
scheme would be to use the TDF linker to combine all the TDF capsules
into a single capsule in the production phase and install that. Because
all the intra-program linking has already taken place, the external
names required for it can be "hidden" - that is to say,
removed from the tag namespace. Only tag names which are used but
not defined (and so are not internal to the program) and main
should not be hidden. In effect this linking phase has made all the
internal names in the program (except main
) static.

3.3.4. The TDF Installers
The TDF installer on a given machine typically consists of four phases:
TDF linking, which has already been discussed, translating TDF to
assembly source code, translating assembly source code to a binary
object file, and linking binary object files with the system libraries
to form the final executable. The latter two phases are currently
implemented by the system assembler and linker, and so are identical
to the traditional compilation scheme.

Consider, for example, the offset of the field b in the structure struct tag above: the
producer has already done all the hard work, providing a formula for
the offset in terms of the sizes and alignments of the basic C types.
The translator merely provides these values and the offset is automatically
evaluated by the constant evaluation transformations. Other aspects
of the ABI, for example the procedure argument and result passing
conventions, require more detailed attention.

Translators can also perform optimisations such as evaluating standard library calls on constant arguments: for example, strlen ( "hello" ) is replaced
by 5. As it stands this optimisation appears to run the risk of corrupting
the programmer's namespace - what if strlen
was a user-defined
procedure rather than the standard library routine (cf. the open
example in section 2.2.3)? In the TDF compilation scheme, however, strlen will be implemented by means of
a uniquely named token, ansi.string.strlen
say. It is
by recognising this token name as the token is expanded that the translators
are able to ensure that this is really the library routine strlen
.
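The optimisation itself is easy to state in C terms (a sketch only; a real TDF translator performs it at install time, when it recognises the token ansi.string.strlen):

```c
#include <string.h>

/* Because the argument is a string literal, a translator that knows
   this call really is the library strlen may replace the whole call
   by the constant 5. */
unsigned long folded_length(void)
{
    return (unsigned long) strlen("hello");
}
```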
A similar example is alloca.
Many other compilers inline alloca
, or rather they inline
__builtin_alloca
and rely on the programmer to identify
alloca
with __builtin_alloca
. This gets
round the potential namespace problems by getting the programmer to
confirm that alloca
in the program really is the library
routine alloca
. By the use of tokens this information
is automatically provided to the TDF translators.

3.4. TDF and APIs
What the discussion above has emphasised is that the ability to describe
APIs abstractly as target independent headers underpins the entire
TDF approach to portability. We now consider this in more detail.

3.4.1. API Description
The process of transforming an API specification into its description
in terms of #pragma token
directives is a time-consuming
but often fascinating task. In this section we discuss some of the
issues arising from the process of describing an API in this way.

3.4.1.1. The Description Process
As may be observed from the example given in section 3.2.1, the
#pragma token
syntax is not necessarily intuitively obvious.
It is designed to be a low-level description of tokens which is capable
of expressing many complex token specifications. Most APIs are however
specified in C-like terms, so an alternative syntax, closer to C,
has been developed in order to facilitate their description. This
is then transformed into the corresponding #pragma token
directives by a specification tool called tspec
(see
[2]), which also applies a number of checks to the input and generates
the unique token names. For example, the description leading to the
example above was:
+TYPE FILE ;
+EXP FILE *stdout ;
+FUNC int fputs ( const char *, FILE * ) ;
Note how close this is to the English language specification of the
API given previously. (There are a number of open issues relating
to tspec
and the #pragma token
syntax, mainly
concerned with determining the type of syntactic statements that it
is desired to make about the APIs being described. The current scheme
is adequate for those APIs so far considered, but it may need to be
extended in future.)

tspec
is not capable of expressing the full power of
the #pragma token
syntax. Whereas this makes it easier
to use in most cases, for describing the normal C-like objects such
as types, expressions and procedures, it cannot express complex token
descriptions. Instead it is necessary to express these directly in
the #pragma token
syntax. However this is only rarely
required: the constructs offsetof
, va_start
and va_arg
from ANSI are the only examples so far encountered
during the API description programme at DRA. For example, va_arg
takes an assignable expression of type va_list
and a
type t
and returns an expression of type t
.
Clearly, this cannot be expressed abstractly in C-like terms; so the
#pragma token
description:
#pragma token PROC ( EXP lvalue : va_list : e, TYPE t )\
EXP rvalue : t : va_arg # ansi.stdarg.va_arg
must be used instead.

The routine part of the description process consists of transcribing the paper API specification into the tspec syntax (if the specification
is given in a machine readable form this process can be partially
automated). The interesting part consists of trying to interpret what
is written and reading between the lines as to what is meant. It is
important to try to represent exactly what is in the specification
rather than being influenced by one's knowledge of a particular implementation,
otherwise the API checking phase of the compilation will not be checking
against what is actually in the API but against a particular way of
implementing it.

There are also some aspects of API specifications which tspec
(and the #pragma token
syntax)
is not yet capable of expressing. In particular, some APIs go into
very careful management of namespaces within the API, explicitly spelling
out exactly what should, and should not, appear in the namespaces
as each header is included (see the appendix on namespaces and APIs
below). What is actually being done here is to regard each header
as an independent sub-API. There is not however a sufficiently developed
"API calculus" to allow such relationships to be easily
expressed.

3.4.1.2. Resolving Conflicts
Another consideration during the description process is to try to
integrate the various API descriptions. For example, POSIX extends
ANSI, so it makes sense to have the target independent POSIX headers
include the corresponding ANSI headers and just add the new objects
introduced by POSIX. This does present problems with APIs which are
basically compatible but have a small number of incompatibilities,
whether deliberate or accidental. As an example of an "accidental"
incompatibility, XPG3 is an extension of POSIX, but whereas POSIX
declares malloc
by means of the prototype:
void *malloc ( size_t ) ;
XPG3 declares it by means of the traditional procedure declaration:
void *malloc ( s )
size_t s ;
These are surely intended to express the same thing, but in the first
case the argument is passed as a size_t
and in the second
it is firstly promoted to the integer promotion of size_t
.
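The difference can be made concrete with a small sketch. Here my_size_t is a hypothetical implementation of size_t as unsigned short, chosen purely to make the promotion visible:

```c
/* Hypothetical implementation of size_t, chosen so that the default
   argument promotions change the type actually passed. */
typedef unsigned short my_size_t;

/* Under the default argument promotions, an unsigned short argument to
   an old-style (unprototyped) procedure is passed as int on any machine
   where int can represent every unsigned short value.  The expression
   s + 0 has exactly the promoted type of s. */
int promoted_to_int(void)
{
    my_size_t s = 1;
    return sizeof(s + 0) == sizeof(int);
}
```

On a typical machine with 16-bit short and 32-bit int, promoted_to_int() returns 1: the old-style declaration of malloc would receive an int, while the prototype receives a my_size_t.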
On most machines these are compatible, either because of the particular
implementation of size_t
, or because the procedure calling
conventions make them compatible. However in general they are incompatible,
so the target independent headers either have to reflect this or have
to read between the lines and assume that the incompatibility was
accidental and ignore it.

As an example of a "deliberate" incompatibility, both XPG3 and SVID3 declare a structure struct msqid_ds
in sys/msg.h
which has fields msg_qnum
and msg_qbytes
.
The difference is that whereas XPG3 declares these fields to have
type unsigned short
, SVID3 declares them to have type
unsigned long
. However for most purposes the precise
types of these fields is not important, so the APIs can be unified
by making the types of these fields target dependent. That is to say,
tokenised integer types __msg_q_t
and __msg_l_t
are introduced. On XPG3-compliant machines these will both be defined
to be unsigned short
, and on SVID3-compliant machines
they will both be unsigned long
. So, although strict
XPG3 and strict SVID3 are incompatible, the two extension APIs created
by adding these types are compatible. In the rare case when the precise
type of these fields is important, the strict APIs can be recovered
by defining the field types to be unsigned short
or unsigned
long
at produce-time rather than at install-time. (XPG4 uses
a similar technique to resolve this incompatibility. But whereas the
XPG4 types need to be defined explicitly, the tokenised types are
defined implicitly according to whatever the field types are on a
particular machine.)

The same technique can smooth over implementation errors. POSIX specifies that the structure struct flock
defined in fcntl.h
shall have a field l_pid
of type pid_t
. However on at least two of the POSIX implementations
examined at DRA, pid_t
was implemented as an int
,
but the l_pid
field of struct flock
was
implemented as a short
(this showed up in the TDF library
building process). The immediate reaction might be that these system
have not implemented POSIX correctly, so they should be cast into
the outer darkness. However for the vast majority of applications,
even those which use the l_pid
field, its precise type
is not important. So the decision was taken to introduce a tokenised
integer type, __flock_pid_t
, to stand for the type of
the l_pid
field. So although the implementations do not
conform to strict POSIX, they do to this slightly more relaxed extension.
Of course, one could enforce strict POSIX by defining __flock_pid_t
to be pid_t
at produce-time, but the given implementations
would not conform to this stricter API.

In retrospect, the SVID3 change discussed above - changing the msg_qnum
and msg_qbytes
fields of struct msqid_ds
from unsigned short
to unsigned long
- is
an over-specialisation which leads to an unnecessary conflict with
XPG3. The XPG4 method of achieving exactly the same end - abstracting
the types of these fields - is, by contrast, a smooth evolutionary
path.

3.4.1.3. The Benefits of API Description
The description process is potentially of great benefit to bodies
involved in API specification. While the specification itself stays
on paper the only real existence of the API is through its implementations.
Giving the specification a concrete form means not only does it start
to be seen as an object in its own right, rather than some fuzzy document
underlying the real implementations, but also any omissions, insufficient
specifications (where what is written down does not reflect what the
writer actually meant) or built-in assumptions are more apparent.
It may also be able to help show up the kind of over-specialisation
discussed above. The concrete representation also becomes an object
which both applications and implementations can be automatically checked
against. As has been mentioned previously, the production phase of
the compilation involves checking the program against the abstract
API description, and the library building phase checks the syntactic
aspect of the implementation against it.

To illustrate this checking, consider again some of the examples given earlier. The SIGKILL
example is
straightforward; SIGKILL
will appear in the POSIX version
of signal.h
but not the ANSI version, so if the program
is compiled with the target independent ANSI headers it will be reported
as being undefined. In a sense this is nothing to do with the #pragma
token
syntax, but with the organisation of the target independent
headers. The other examples do however rely on the fact that the #pragma
token
syntax can express syntactic information in a way which
is not possible directly from C. Thus the target independent headers
express exactly the fact that time_t
is an arithmetic
type, about which nothing else is known. Thus ( t & 1 )
is not type correct for a time_t t
because the binary
&
operator does not apply to all arithmetic types.
Similarly, for the type div_t
the target independent
headers express the information that there exists a structure type
div_t
and field selectors quot
and rem
of div_t
of type int
, but nothing about
the order of these fields or the existence of other fields. Thus any
attempt to initialise a div_t
will fail because the correspondence
between the values in the initialisation and the fields of the structure
is unknown. The struct dirent
example is entirely analogous,
except that here the declarations of the structure type struct
dirent
and the field selector d_name
appear in
both the POSIX and XPG3 versions of dirent.h
, whereas
the field selector d_ino
appears only in the XPG3 version.

3.4.2. TDF Library Building
As we have said, two of the primary problems with writing portable
programs are dealing with API implementation errors on the target
machines - objects not being defined, or being defined in the wrong
place, or being implemented incorrectly - and namespace problems -
particularly those introduced by the system headers. The most interesting
contrast between the traditional compilation scheme (Fig. 1) and the
TDF scheme (Fig. 5) is that in the former the program comes directly
into contact with the "real world" of messy system headers
and incorrectly implemented APIs, whereas in the latter there is an
"ideal world" layer interposed. This consists of the target
independent headers, which describe all the syntactic features of
the API where they are meant to be, and with no extraneous material
to clutter up the namespaces (like index
and the macro
st_atime
in the examples given in section 2.2.3),
and the TDF libraries, which can be combined "cleanly" with
the program without any namespace problems. All the unpleasantness
has been shifted to the interface between this "ideal world"
and the "real world"; that is to say, the TDF library building.

In the implementation problems discussed earlier - DBL_MAX
and size_t
- the necessary changes are probably "safe". DBL_MAX
is not a special value in any library routines, and changing size_t
from int
to unsigned int
does not affect
its size, alignment or procedure passing rules (at least not on the
target machines we have in mind) and so should not disrupt the interface
with the system library.

3.4.2.2. System Library Problems
Errors in the system libraries will not be detected by the TDF library
builder because they are semantic errors, whereas the library building
process is only checking syntax. The only realistic ways of detecting
semantic problems is by means of test suites, such as the Plum-Hall
or CVSA library tests for ANSI and VSX for XPG3, or by detailed knowledge
of particular API implementations born of personal experience. However
it may be possible to build workarounds for problems identified in
these tests into the TDF libraries.

For example, consider the problem with realloc discussed in section 2.2.4.4. A workaround could be built into the TDF libraries by defining the token representing realloc to be the equivalent of:
#define realloc ( p, s ) ( void *q = ( p ), q ? ( realloc ) ( q, s ) : malloc ( s ) )
(where the C syntax has been extended to allow variables to be introduced
inside expressions) or:
static void *__realloc ( void *p, size_t s )
{
if ( p == NULL ) return ( malloc ( s ) ) ;
return ( ( realloc ) ( p, s ) ) ;
}
#define realloc ( p, s ) __realloc ( p, s )
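A quick check of the wrapper's behaviour (a sketch; fixed_realloc plays the role of __realloc above, and only the p == NULL case that the workaround adds is of interest):

```c
#include <stdlib.h>
#include <string.h>

/* The workaround from the text: make realloc ( NULL, s ) behave as
   malloc ( s ), as ANSI requires. */
static void *fixed_realloc(void *p, size_t s)
{
    if (p == NULL) return malloc(s);
    return realloc(p, s);
}

/* Returns 1 if both the malloc path and the ordinary realloc path work. */
int check_fixed_realloc(void)
{
    char *p = fixed_realloc(NULL, 6);   /* the case a broken realloc rejects */
    if (p == NULL) return 0;
    strcpy(p, "hello");
    p = fixed_realloc(p, 64);           /* ordinary realloc path */
    {
        int ok = (p != NULL) && (strcmp(p, "hello") == 0);
        free(p);
        return ok;
    }
}
```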
Alternatively, the token definition could be encoded directly into
TDF (not via C), using the TDF notation compiler (see [9]).

3.4.2.3. TDF Library Builders
The discussion above shows how the TDF libraries are an extra layer
which lies on top of the existing system API implementation, and how
this extra layer can be exploited to provide corrections and workarounds
to various implementation problems. The expertise of particular API
implementation problems on particular machines can be captured once
and for all in the TDF libraries, rather than being spread piecemeal
over all the programs which use that API implementation. But being
able to encapsulate this expertise in this way makes it a marketable
quantity. One could envisage a market in TDF libraries: ranging from
libraries closely reflecting the actual API implementation to top
of the range libraries with many corrections and workarounds built
in.

Recall the struct dirent example in section 2.3.3: on BSD systems the corresponding structure is declared in sys/dir.h under the name struct direct. But this presents no problems to the TDF library builder: it is perfectly
simple to tell it to look in sys/dir.h
instead of dirent.h
, and to identify struct direct
with struct
dirent
. So it may be possible to build a partial POSIX lookalike
on BSD systems by using the TDF library mechanism.

3.5. TDF and Conditional Compilation
So far our discussion of the TDF approach to portability has been
confined to the simplest case, where the program itself contains no
target dependent code. We now turn to programs which contain conditional
compilation. As we have seen, many of the reasons why it is necessary
to introduce conditional compilation into the traditional compilation
process either do not arise or are seen to be distinct phases in the
TDF compilation process. The use of a single front-end (the producer)
virtually eliminates problems of compiler limitations and differing
interpretations and reduces compiler bug problems, so it is not necessary
to introduce conditionally compiled workarounds for these. Also API
implementation problems, another prime reason for introducing conditional
compilation in the traditional scheme, are seen to be isolated in
the TDF library building process, thereby allowing the programmer
to work in an idealised world one step removed from the real API implementations.
However the most important reason for introducing conditional compilation
is where things, for reasons of efficiency or whatever, are genuinely
different on different machines. It is this we now consider.

3.5.1. User Defined Tokens

Consider again the MSB example given in section 2.3. The target dependent construct MSB can be made part of a user-defined API, specified, in tspec terms, to be:
+MACRO unsigned char MSB ( unsigned int a ) ;
where the macro MSB
gives the most significant byte of
its argument, a
. Let us say that the corresponding #pragma
token
statement is put into the header msb.h
.
Then the program can be recast into the form:
#include <stdio.h>
#include "msb.h"
unsigned int x = 100000000 ;
int main ()
{
printf ( "%u\n", MSB ( x ) ) ;
return ( 0 ) ;
}
The producer will compile this into a target independent TDF capsule
which uses a token to represent the use of MSB
, but leaves
this token undefined. The only question that remains is how this token
is defined on the target machine; that is, how the user-defined API
is implemented. On each target machine a TDF library containing the
local definition of the token representing MSB
needs
to be built. There are two basic possibilities. Firstly the person
performing the installation could build the library directly, by compiling
a program of the form:
#pragma implement interface "msb.h"
#include "config.h"
#ifndef SLOW_SHIFT
#define MSB ( a ) ( ( unsigned char ) ( a >> 24 ) )
#else
#ifdef BIG_ENDIAN
#define MSB ( a ) *( ( unsigned char * ) &( a ) )
#else
#define MSB ( a ) *( ( unsigned char * ) &( a ) + 3 )
#endif
#endif
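The three candidate definitions can be checked against each other (a sketch; it assumes, like the example itself, a 32-bit unsigned int):

```c
/* The three implementations of MSB from the text. */
#define MSB_SHIFT(a)  ((unsigned char)((a) >> 24))
#define MSB_BIG(a)    (*((unsigned char *)&(a)))
#define MSB_LITTLE(a) (*((unsigned char *)&(a) + 3))

/* On any given machine the shift form must agree with whichever
   byte-addressing form matches that machine's byte order. */
int msb_definitions_agree(void)
{
    unsigned int x = 100000000; /* the value used in the example program */
    unsigned char ref = MSB_SHIFT(x);
    return (MSB_BIG(x) == ref) || (MSB_LITTLE(x) == ref);
}
```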
with the appropriate config.h
to choose the correct local
implementation of the interface described in msb.h
. Alternatively
the programmer could provide three alternative TDF libraries corresponding
to the three implementations, and let the person installing the program
choose between these. The two approaches are essentially equivalent,
they just provide for making the choice of the implementation of the
user-defined component of the API in different ways. An interesting
alternative approach would be to provide a short program which does
the selection between the provided API implementations automatically.
This approach might be particularly effective in deciding which implementation
offers the best performance on a particular target machine.

3.5.2. User Defined Tokens - Example
As an example of how to define a simple token consider the following
example. We have a simple program which prints "hello" in
some language, the language being target dependent. Our first task
is to choose an API. We choose ANSI C extended by a tokenised object
hello
of type char *
which gives the message
to be printed. This object will be an rvalue (i.e. it cannot be assigned
to). For convenience this token is declared in a header file, tokens.h
say. This particular case is simple enough to encode by hand;
it takes the form:
#pragma token EXP rvalue : char * : hello #
#pragma interface hello
consisting of a #pragma token
directive describing the
object to be tokenised, and a #pragma interface
directive
to show that this is the only object in the API. An alternative would
be to generate tokens.h
from a tspec
specification
of the form:
+EXP char *hello ;
The next task is to write the program conforming to this API. This
may take the form of a single source file, hello.c
, containing
the lines:
#include <stdio.h>
#include "tokens.h"
int main ()
{
printf ( "%s\n", hello ) ;
return ( 0 ) ;
}
The production process may be specified by means of a Makefile
.
This uses the TDF C compiler, tcc
, which is an interface
to the TDF system which is designed to be like cc
, but
with extra options to handle the extra functionality offered by the
TDF system (see [1]).
produce : hello.j
echo "PRODUCTION COMPLETE"
hello.j : hello.c tokens.h
echo "PRODUCTION : C->TDF"
tcc -Fj hello.c
The production is run by typing make produce
. The ANSI
API is the default, and so does not need to be specified to tcc
.
The program hello.c
is compiled to a target independent
capsule, hello.j
. This will use a token to represent
hello
, but it will be left undefined.

The next stage is the library building: defining, on the target machine, the token hello.
This is done by means of a short C program, tokens.c
,
which implements the tokens declared in tokens.h
. This
might take the form:
#pragma implement interface "tokens.h"
#define hello "bonjour"
to define hello
to be "bonjour". On a different
machine, the definition of hello
could be given as "hello",
"guten Tag", "zdrastvetye" (excuse my transliteration)
or whatever (including complex expressions as well as simple strings).
Note the use of #pragma implement interface
to indicate
that we are now implementing the API described in tokens.h
,
as opposed to the use of #include
earlier when we were
just using the API.

The library building and installation phases can be expressed by adding the following rules to the Makefile:
install : hello
echo "INSTALLATION COMPLETE"
hello : hello.j tokens.tl
echo "INSTALLATION : TDF->TARGET"
tcc -o hello -J. -jtokens hello.j
tokens.tl : tokens.j
echo "LIBRARY BUILDING : LINKING LIBRARY"
tcc -Ymakelib -o tokens.tl tokens.j
tokens.j : tokens.c tokens.h
echo "LIBRARY BUILDING : DEFINING TOKENS"
tcc -Fj -not_ansi tokens.c
The complete installation process is run by typing make install
.
Firstly the file tokens.c
is compiled to give the TDF
capsule tokens.j
containing the definition of hello
.
The -not_ansi flag is needed because tokens.c
contains no real C declarations or definitions, and such an empty translation unit is
not allowed in ANSI C. The next step is to turn the capsule tokens.j
into a TDF library, tokens.tl
, using the -Ymakelib
option to tcc
(with older versions of tcc
it may be necessary to change this option to -Ymakelib -M -Fj).
This completes the API implementation.

In the final installation phase the target independent capsule, hello.j,
is linked with the TDF libraries tokens.tl
and ansi.tl
(which is built into tcc
as default) to form a target
dependent TDF capsule with all the necessary token definitions, which
is then translated to a binary object file and linked with the system
libraries. All of this is under the control of tcc
.

3.5.3. Conditional Compilation within TDF
Although tokens are the main method used to deal with target dependencies,
TDF does have built-in conditional compilation constructs. For most
TDF sorts X
(for example, exp, shape or variety) there
is a construct X_cond
which takes an exp and two X
's
and gives an X
. The exp argument will evaluate to an
integer constant at install time. If this is true (nonzero), the result
of the construct is the first X
argument and the second
is ignored; otherwise the result is the second X
argument
and the first is ignored. By ignored we mean completely ignored -
the argument is stepped over and not decoded. In particular any tokens
in the definition of this argument are not expanded, so it does not
matter if they are undefined.
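A C-level sketch of the effect (hypothetical names; in real use HAVE_FANCY would be a tokenised integer whose value only arrives at install time, whereas here it is given a produce-time value so the fragment is self-contained):

```c
/* Stands in for an install-time tokenised integer. */
#define HAVE_FANCY 0

int feature_level(void)
{
#if HAVE_FANCY
    /* This branch corresponds to the ignored X argument of an X_cond
       construct: the token fancy_level need never be defined, because
       the installer steps over this sub-tree without decoding it. */
    return fancy_level();
#else
    return 0;
#endif
}
```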
These X_cond constructs correspond to the C conditional compilation construct #if condition where condition
is a target dependent value. Thus, because
it is not known which branch will be taken at produce time, the decision
is postponed to install time. If condition
is a target
independent value then the branch to be taken is known at produce
time, so the producer only translates this branch. Thus, for example,
code surrounded by #if 0
... #endif
will
be ignored by the producer.

Target dependent #if
statements can be translated into TDF
X_cond
constructs. The two branches of the #if
statement are translated into the two X
arguments of
the X_cond
construct; that is, into sub-trees of the
TDF syntax tree. This can only be done if each of the two branches
is syntactically complete.

The producer interprets #ifdef (and #ifndef) constructs to mean: is this macro defined (or undefined) at produce
time? Given the nature of pre-processing in C this is in fact the
only sensible interpretation. But if such constructs are being used
to control conditional compilation, what is actually intended is,
is this macro defined at install time? This distinction is necessitated
by the splitting of the TDF compilation into production and installation
- it does not exist in the traditional compilation scheme. For example,
in the mips example in section 2.3, whether or not mips
is defined is intended to be an installer property, rather than what
it is interpreted as, a producer property. The choice of the conditional
compilation path may be put off to install time by, for example, changing
#ifdef mips
to #if is_mips
where is_mips
is a tokenised integer which is either 1 (on those machines on which
mips
would be defined) or 0 (otherwise). In fact in view
of what was said above about syntactic completeness, it might be better
to recast the program as:
#include <stdio.h>
#include "user_api.h" /* For the spec of is_mips */
int main ()
{
if ( is_mips ) {
fputs ( "This machine is a mips\n", stdout ) ;
}
return ( 0 ) ;
}
because the branches of an if
statement, unlike those
of an #if
statement, have to be syntactically complete
in any case. The installer will optimise out the unnecessary test
and any unreached code, so the use of if ( condition )
is guaranteed to produce as efficient code as #if condition
.

As an aid to the programmer, #ifdef
and #ifndef
constructs in which the compilation path
to be taken is potentially target dependent are reported (see [3]
and [8]).

An alternative to defining a user-defined API is to introduce tokenised integers (like is_mips
in the example above) which are
used to control conditional compilation. This latter approach can
be used to quickly adapt existing programs to a TDF-portable form
since it is closer to the "traditional" approach of scattering
the program with #ifdef
's and #ifndef
's
to implement target dependent code. However the definition of a user-defined
API gives a better separation of target independent and target dependent
code, and the effort to define such an API may often be justified.
When writing a new program from scratch the API rather than the conditional
compilation approach is recommended.

3.5.4. Alternative Program Versions
Consider again the program described in section 2.3.4 which has
optional features for displaying its output graphically depending
on the boolean value HAVE_X_WINDOWS
. By making HAVE_X_WINDOWS
part of the user-defined API as a tokenised integer and using:
#if HAVE_X_WINDOWS
to conditionally compile the X Windows code, the choice of whether
or not to use this version of the program is postponed to install
time. If both POSIX and X Windows are implemented on the target machine
the installation is straightforward. HAVE_X_WINDOWS
is
defined to be true, and the installation proceeds as normal. The case
where only POSIX is implemented appears to present problems. The TDF
representing the program will contain undefined tokens representing
objects from both the POSIX and X Windows APIs. Surely it is necessary
to define these tokens (i.e. implement both APIs) in order to install
the TDF. But because of the use of conditional compilation, all the
applications of X Windows tokens will be inside X_cond
constructs on the branch corresponding to HAVE_X_WINDOWS
being true. If it is actually false then these branches are stepped
over and completely ignored. Thus it does not matter that these tokens
are undefined. Hence the conditional compilation constructs within
TDF give the same flexibility in the API implementation in this case
as do those in C.
Crown Copyright © 1998.