Viktor T. Toth - W - Compiler architecture

Support my research and science communication efforts through Patreon.

As they teach in compiler-writer school, a compiler for a programming language consists of four essential components:

A lexical analyzer
A syntax analyzer
A symbol dictionary
A code generator

Simply put, the lexical analyzer reads letters and assembles them into words, or tokens. The syntax analyzer examines these tokens in context, putting meaning into program lines. The symbol dictionary manages any variable and other names that are defined in a program. Lastly, the code generator actually emits the machine language equivalent of the compiled program.

Lexical Analysis

A glance at the W syntax specifications shows what the smallest units of translation are in this language:

Symbols are sequences of alphanumeric characters (always beginning with a non-numeric character), such as A, x1, or my_variable.

Literals are decimal or hexadecimal numbers, character constants, or strings.

Operators are unary or binary operators.

It is the purpose of the lexical analyzer component in the W compiler to identify these tokens and return them, one by one, to the calling program. This is accomplished in the compiler by the get_token() function. In addition to returning a numeric value identifying the token's type, get_token() also sets certain global variables; for instance, if the token is a NUMBER, the global variable number will be set to the numeric value of the number read.

Symbol and Memory Management

Before considering how symbols are managed by the compiler, it is important to clarify what exactly those symbols mean; in other words how memory is used in a W program for functions, global, and local variables.

A compiled W program is an MS-DOS .COM executable. A .COM executable lives in a single, 64 kilobyte segment of memory, which is used for both program and data. The first 256 bytes of this segment are used as the Program Segment Prefix, containing program startup information; the actual program code begins at address 0x0100. When the program starts, the stack pointer is reset to 0x0000; therefore, the first usable stack location will be at the top of the segment, at 0xFFFE.

A W program uses the stack to allocate memory for parameters passed to a function, and for variables local to a function. Before a function is called, the calling program places the function's parameters on the stack, in left-to-right order. At the beginning of the function, the current position of the stack pointer (SP) is saved in the base pointer (BP) register (but first, the BP register is itself saved on the stack); for any local variables used within the function, the stack pointer is then decremented by the appropriate amount.

Consequently, it is possible to use the base pointer, BP, for referencing both local variables and function parameters. At the location pointed to by BP, the old value of BP can be found in the stack; at BP+2, the return address of the function is located. The parameters passed to the function can be found at BP+4, BP+6, etc., in right-to-left order.

When a new local variable is created, the stack pointer is decremented by two, leaving two bytes of additional room between BP and SP. local variables are therefore referenced by BP-2, BP-4, BP-6, etc.

So, code begins at address 0x0100; the stack begins at the top of the 64 kilobyte segment, and grows downward. Where does that leave room for global variables? The best place for them is right after the last program code instruction. So the memory model of a W program looks like this:

0xFFFF Top of stack
... local variables, function parameters, return addresses, temporary storage ...
Bottom of stack

Empty space

Top of heap
... global variables ...
Bottom of heap

0x0100
Top of code
... program instructions ...
Bottom of code

0x0000 MS-DOS Program Segment Prefix

This makes it clear that the W compiler has to be capable of managing three distinct types of symbols:

Code symbols are absolute addresses, and represent the location of program functions
Heap symbols are relative to the top of code space
Local symbols are relative to the stack pointer's location at the beginning of function execution

As discussed earlier, local symbols will always be relative to the value of the processor's BP register. The processor has another index register that is similar in function: the DI register. Since we don't know where heap symbols will be located until the entire program hasn't been compiled, rather than using absolute addresses, we can use addresses relative to the DI register for these symbols, and then ensure that the DI register is loaded with the proper value (pointing to the bottom of the heap) before program execution.

Another aspect of symbol management is the handling of name spaces. The same symbol name may occur multiple times in different contexts in a program. Consider the following example:

{
	x := 5
	{
		x := 10
		printf(x, "x is now %d\r\n\0", stdout)
	}
	printf(x, "x is now %d\r\n\0", stdout)
}

Make a guess: what will this program fragment print when you run it? If you're a seasoned C programmer, you know the answer:

x is now 10
x is now 5

In other words, although we redefine x in the inner compound statement, the "original" definition of x is not forgotten; when the inner compound statement terminates, the original definition comes back into effect.

Because of this, the symbol manager must be able to handle the occurrence of the same symbol multiple times, in multiple name spaces. Fortunately, it isn't as hard to do as it seems; at any given time, there's a straightforward hierarchy of name spaces, only the last one of which is used to actually add new symbols.

The symbol dictionary is implemented in the W compiler as a simple 32-kilobyte (16384-element) array. Space in this array in consumed from both ends. From the top growing downward, 2-word entries are added for each new symbol: one word points to the symbol name, the other word contains the symbol's 16-bit value. Symbols are preceded by a name space header that identifies the name space as a global or local namespace, and also provides links to the previous and next name space headers. From the bottom of the symbol dictionary, growing upward, null terminated symbol name strings are added. This is demonstrated by the following piece of ASCII art, taken directly from the W compiler source:

;Namespace structure:
;+-----------------------+
;| Previous namespace    |= 0 <-+
;| Next namespace        | -+   |
;| Flags                 |  |   |
;| Symbol space          |--+---+--+
;+-----------------------+  |   |  |
;| Symbol                |--+---+--+
;| Value                 |  |   |  |
;+-----------------------+  |   |  |
;| Symbol                |--+---+--+--+
;| Value                 |  |   |  |  |
;| ...                   |  |   |  |  |
;+-----------------------+  |   |  |  |
;| Previous namespace    |<-+ --+  |  | <--thisspace
;| Next namespace        |= 0      |  |
;| Flags                 |         |  |
;| Symbol space          |--+      |  |
;| ...                   |  |      |  | <--currspace
;|                       |  |      |  |
;| ...                   |  |      |  |
;| Symbol\0              |<-+      |  | <--symbspace
;| Symbol\0              |<--------+--+
;| Symbol\0(*)           |<--------+
;+-----------------------+

You may have noticed that this symbol table implementation lacks any indices, hash tables, or any such construct that would make searching for variables easier. Not including such searching aides was a conscious decision. Apart from the fact that the W compiler was meant to be as simple as possible, it is also unlikely that any name space in a typical W program would contain more than a small handful of symbols; in this case, a simple, sequential search is often faster than building, maintaining, and searching through an index of some sort.

The symbol table is managed through a simple set of subroutines in the compiler:

The newspace(flag) function creates a new name space. The flag parameter can be non-zero, indicating that we've begun the name space of a new function; if it's zero, the name space belongs to an inner compound statement. (This feature actually would make it possible to implement "local functions", i.e., functions defined within the body of another function; this capability, however, is not used in the W language.)

The oldspace(flag) function removes the last name space from the symbol table. The flag parameter will be used to check that a name space of the correct type is being removed.

The addname(name, value) function is used to add a new symbol to the current name space. If there's insufficient space (a rather unlikely event), an error is generated.

The findname(name, global) function is used to search for a symbol. Normally, the current name space is searched, and if the symbol name is not found, previous name spaces are examined up to, and including, the first name space encountered with a non-zero flag value. However, if the global parameter is non-zero, only the topmost name space (that containing global heap and code symbols) is considered.

Speaking of heap vs. code symbols, this structure does not provide a means to differentiate between them. That is accomplished by flipping the topmost bit from 0 to 1 in the first character of a heap symbol. This can be done without penalty, since extended ASCII characters cannot normally be part of a symbol name.

When the compiler encounters a symbol that is defined using the := operator, it adds the symbol to the current namespace using addname(). When a symbol is encountered as part of an expression, namespaces are searched using findname() for a local, global code, or global heap symbol, and the appropriate code is generated. For instance, if an expression consists of a symbol all by itself, the compiler may generate the following machine code instructions:

MOV AX,[1000]   ; symbol is a code symbol resolving into address 0x1000
MOV AX,[BP-2]   ; symbol is a the first local variable of the function
MOV AX,[BP+4]   ; symbol is the last parameter in a function's parameter list
MOV AX,[DI+100] ; symbol is a heap symbol at address 0x0100 from the bottom of the heap

Syntactical Analysis

Syntactical analysis consists of reading individual program words ("tokens") from the source file and analyzing these in the context of the program. Syntactical analysis and code generation comprises the bulk of the W compiler, and it closely follows the logic of the syntax specifications of the W language.

Take, for instance, the definition of a program:

program := definition [definition...]

Corresponding to this syntactical element, you'll find a subroutine in the W compiler named program() that contains, at its core, the following lines of code:

read_token()
p := $
token != FILE_END ?
{
	@definition() ? $ = p, error(16)
}

This function closely follows the logic of the syntax diagram. The entire program() function does, of course, contain additional code; namely, code that emits the W program preamble (a few instructions that load the heap base, process the command line, and call the program's main function), and fixups generated once the compilation is complete (such as loading the proper heap base address.)

Even simpler than program() is the _definition() function:

_definition() :=
{
	declarator() ?
	{
		token == ASSIGN_OPERATOR ?
		{
			p := $
			read_token()
			initializer()
		}, 0
	}, 0
}

Once again, the function closely follows the logic of the syntax specification. Things do get a bit more complicated afterwards; sometimes, it is necessary to "read ahead" in the source code to determine, for instance, if an expression is a constant expression (consisting entirely of constant values, meaning that it can be evaluated by the compiler at compile time).

Without going into too much detail, here's a list of the additional functions that perform syntactical analysis:

The initializer() function evaluates initializer expressions and assigns the resulting value to newly created variables.

The declarator() function processes the expression on the left side of a declarative assignment, :=. Code is generated to allocate space for local variables on the stack.

The _primary_expression() function checks for a constant expression; if it is found, code is emitted to load a constant into the accumulator, otherwise, compound expressions or conditional expressions are evaluated.

The compound_expression() function allocates a new local name space and evalutes the expressions found in between a pair of curly braces.

The constant_expression() reads, and computes the result of, an expression comprising entirely of constant values. The evaluate() function is used to evaluate operators.

The simple_expression() evaluates expressions connected with binary operators. Code corresponding to operators is emitted by the execute() function.

The unary_expression() is the core expression evaluator function that can handle complex expressions encoded with parentheses, expressions with unary operators (including the address-of and dereference operators), and expressions referencing the instruction pointer. It uses the loadadr() function to load the address of a variable; the loadargs() function to place the arguments of a function call on the stack; and the loadvar() function to load the value of a variable. The fixstack() function is used after a function call to restore the stack pointer (which was altered when the function call's arguments were placed on the stack.)

All of the above functions make use of read_token() to read the next program word; and read_position() and set_position() for repositioning the file read pointer (and adjust the line number counter) during read-ahead operations.

Code Generation

So how does code become generated by the compiler? To understand that, it is necessary to first understand what kinds of code fragments the compiler generates. The program() function was already mentioned as the place where the W program preamble gets generated. The following additional code fragments are generated at various points by the compiler:

The function preamble consists of instructions to save the Base Pointer register, and set it up for a new stack frame:

PUSH BP
MOV  BP, SP

The function postamble restores the stack pointer, the base pointer, and returns control to the calling program:

MOV  SP, BP
POP  BP
RET

Memory is allocated when a new symbol is declared, and deallocated when the end of the compound instruction in which symbols were declared is reached:

SUB  SP, 2
ADD  SP, 2

When an expression is evaluated, the result is left in the accumulator, AX. When binary operators are evaluated, the current contents of AX are pushed to the stack; the next subexpression is evaluted; then the contents of the stack are popped (e.g., into the BX register) and the binary operation is executed. The actual details vary from one operator to the next; these code fragments are generated by the compiler's execute() function. Expression evaluation is performed using a logical stack for operators and arguments; operator precedence is observed.

Conditional expressions are implemented as follows. First, the condition expression is evaluated; depending on the result, a conditional jump is executed. If there is an "else" clause, a second, uncondtional jump is used to skip it of the "if" clause is executed:

JNZ  +3
JMP  _else
... if clause instructions
JMP  _end
_else:
... else clause instructions
_end:

Expressions containing the instruction pointer symbol ($) are evaluated as follows: if the $ symbol appears on the right hand of an assignment instruction, a CALL instruction is executed to the next instruction address, the return address is then popped from the stack and adjusted. The adjustment takes into account any room needed to then load the address of the variable on the left side of the assignment instruction; NOP instructions may be used to ensure that the construct is of fixed length. If the $ symbol appears on the left side of an assignment instruction, a simple unconditional jump is executed:

CALL _here
_here:
POP  AX
ADD  AX, 8
... store result in the desired variable
...
JMP  AX   ; executing a $= instruction

The W compiler is not an optimizing compiler, as evidenced by its execution speed (the C version of the W compiler itself, compiled with Microsoft Visual C++ 1.52c, executes approximately 6 times faster than the W version.) That said, some simple optimizations are performed anyway, to make the resulting code slightly more efficient. For instance, when an expression begins with the symbols 0!=, this part is suppressed. (This construct is often used to resolve an ambiguity caused by the fact that W doesn't use an instruction termination character, such as a semicolon.) Additionally, expressions that consist entirely of constants are evaluated at compile time, making the resulting code both faster and smaller.

The compiler contains two functions that are responsible for writing generated code to disk files: writecode() and writeheap(). The writecode() function writes directly to the executable file. The writeheap() function is used to write data (heap variables) to a temporary file; when the compilation process is finished, the contents of this temporary file are appended to the code file, and the heap address in the program preamble is adjusted.

Compiler Diagnostics

When an error is encountered during the compilation process, the compiler prints the appropriate error message on standard error. No attempt is made to try and resume the compilation process; the process usually stops after printing two or three error messages.

Whether or not the compilation process was successful, the compiler prints a symbol table of global (code and heap) symbols on standard output. This symbol table can be used for diagnostic purposes (for instance, when analyzing a running program with a machine language debugger.) It is possible to redirect the compiler's standard output to save the symbol table to a file, while leaving error messages to appear on the display.

Nav view search

Navigation

Search

W - Compiler architecture

Lexical Analysis

Symbol and Memory Management

Syntactical Analysis

Code Generation

Compiler Diagnostics

`0xFFFF`	Top of stack ... local variables, function parameters, return addresses, temporary storage ... Bottom of stack
	Empty space
	Top of heap ... global variables ... Bottom of heap
`0x0100`	Top of code ... program instructions ... Bottom of code
`0x0000`	MS-DOS Program Segment Prefix