Compiler Construction Niklaus Wirth This is a slightly revised version of the book published by Addison-Wesley in 1996 ISBN 0-201-40353-6 Zürich, November 2005
Theory and Techniques of Compiler Construction An Introduction Niklaus Wirth
Preface This book has emerged from my lecture notes for an introductory course in compiler design at ETH Zürich. Several times I have been asked to justify this course, since compiler design is considered a somewhat esoteric subject, practised only in a few highly specialized software houses. Because nowadays everything which does not yield immediate profits has to be justified, I shall try to explain why I consider this subject as important and relevant to computer science students in general. It is the essence of any academic education that not only knowledge, and, in the case of an engineering education, know-how is transmitted, but also understanding and insight. In particular, knowledge about system surfaces alone is insufficient in computer science; what is needed is an understanding of contents. Every academically educated computer scientist must know how a computer functions, and must understand the ways and methods in which programs are represented and interpreted. Compilers convert program texts into internal code. Hence they constitute the bridge between software and hardware. Now, one may interject that knowledge about the method of translation is unnecessary for an understanding of the relationship between source program and object code, and even much less relevant is knowing how to actually construct a compiler. However, from my experience as a teacher, genuine understanding of a subject is best acquired from an in-depth involvement with both concepts and details. In this case, this involvement is nothing less than the construction of an actual compiler. Of course we must concentrate on the essentials. After all, this book is an introduction, and not a reference book for experts. Our first restriction to the essentials concerns the source language. It would be beside the point to present the design of a compiler for a large language. The language should be small, but nevertheless it must contain all the truly fundamental elements of programming languages. We have chosen a subset of the language Oberon for our purposes. The second restriction concerns the target computer. It must feature a regular structure and a simple instruction set. Most important is the practicality of the concepts taught. Oberon is a general-purpose, flexible and powerful language, and our target computer reflects the successful RISC-architecture in an ideal way. And finally, the third restriction lies in renouncing sophisticated techniques for code optimization. With these premisses, it is possible to explain a whole compiler in detail, and even to construct it within the limited time of a course. Chapters 2 and 3 deal with the basics of language and syntax. Chapter 4 is concerned with syntax analysis, that is the method of parsing sentences and programs. We concentrate on the simple but surprisingly powerful method of recursive descent, which is used in our exemplary compiler. We consider syntax analysis as a means to an end, but not as the ultimate goal. In Chapter 5, the transition from a parser to a compiler is prepared. The method depends on the use of attributes for syntactic constructs. After the presentation of the language Oberon-0, Chapter 7 shows the development of its parser according to the method of recursive descent. For practical reasons, the handling of syntactically erroneous sentences is also discussed. In Chapter 8 we explain why languages which contain declarations, and which therefore introduce dependence on context, can nevertheless be treated as syntactically context free. 
Up to this point no consideration of the target computer and its instruction set has been necessary. Since the subsequent chapters are devoted to the subject of code generation, the specification of a target becomes unavoidable (Chapter 9). It is a RISC architecture with a small instruction set and a set of registers. The central theme of compiler design, the generation of instruction sequences, is thereafter distributed over three chapters: code for expressions and assignments to variables (Chapter 10), for
conditional and repeated statements (Chapter 11) and for procedure declarations and calls (Chapter 12). Together they cover all the constructs of Oberon-0. The subsequent chapters are devoted to several additional, important constructs of general-purpose programming languages. Their treatment is more cursory in nature and less concerned with details, but they are referenced by several suggested exercises at the end of the respective chapters. These topics are further elementary data types (Chapter 13), and the constructs of open arrays, of dynamic data structures, and of procedure types called methods in object-oriented terminology (Chapter 14). Chapter 15 is concerned with the module construct and the principle of information hiding. This leads to the topic of software development in teams, based on the definition of interfaces and the subsequent, independent implementation of the parts (modules). The technical basis is the separate compilation of modules with complete checks of the compatibility of the types of all interface components. This technique is of paramount importance for software engineering in general, and for modern programming languages in particular. Finally, Chapter 16 gives a brief overview of problems of code optimization. It is necessary because of the semantic gap between source languages and computer architectures on the one hand, and our desire to use the available resources as well as possible on the other.

Acknowledgements

I express my sincere thanks to all who contributed with their suggestions and criticism to this book which matured over the many years in which I have taught the compiler design course at ETH Zürich. In particular, I am indebted to Hanspeter Mössenböck and Michael Franz who carefully read the manuscript and subjected it to their scrutiny. Furthermore, I thank Stephan Gehring, Stefan Ludwig and Josef Templ for their valuable comments and cooperation in teaching the course.

N. W. December 1995
Contents

Preface
1. Introduction
2. Language and Syntax
   2.1. Exercises
3. Regular Languages
4. Analysis of Context-free Languages
   4.1. The method of recursive descent
   4.2. Table-driven top-down parsing
   4.3. Bottom-up parsing
   4.4. Exercises
5. Attributed Grammars and Semantics
   5.1. Type rules
   5.2. Evaluation rules
   5.3. Translation rules
   5.4. Exercises
6. The Programming Language Oberon-0
7. A Parser for Oberon-0
   7.1. The scanner
   7.2. The parser
   7.3. Coping with syntactic errors
   7.4. Exercises
8. Consideration of Context Specified by Declarations
   8.1. Declarations
   8.2. Entries for data types
   8.3. Data representation at run-time
   8.4. Exercises
9. A RISC Architecture as Target
10. Expressions and Assignments
   10.1. Straight code generation according to the stack principle
   10.2. Delayed code generation
   10.3. Indexed variables and record fields
   10.4. Exercises
11. Conditional and Repeated Statements and Boolean Expressions
   11.1. Comparisons and jumps
   11.2. Conditional and repeated statements
   11.3. Boolean operations
   11.4. Assignments to Boolean variables
   11.5. Exercises
12. Procedures and the Concept of Locality
   12.1. Run-time organization of the store
   12.2. Addressing of variables
   12.3. Parameters
   12.4. Procedure declarations and calls
   12.5. Standard procedures
   12.6. Function procedures
   12.7. Exercises
13. Elementary Data Types
   13.1. The types REAL and LONGREAL
   13.2. Compatibility between numeric data types
   13.3. The data type SET
   13.4. Exercises
14. Open Arrays, Pointers and Procedure Types
   14.1. Open arrays
   14.2. Dynamic data structures and pointers
   14.3. Procedure types
   14.5. Exercises
15. Modules and Separate Compilation
   15.1. The principle of information hiding
   15.2. Separate compilation
   15.3. Implementation of symbol files
   15.4. Addressing external objects
   15.5. Checking configuration consistency
   15.6. Exercises
16. Code Optimizations and the Frontend/backend Structure
   16.1. General considerations
   16.2. Simple optimizations
   16.3. Avoiding repeated evaluations
   16.4. Register allocation
   16.5. The frontend/backend compiler structure
   16.6. Exercises
Appendix A: Syntax
   A.1. Oberon-0
   A.2. Oberon
   A.3. Symbol files
Appendix B: The ASCII character set
Appendix C: The Oberon-0 compiler
   C.1. The scanner
   C.2. The parser
   C.3. The code generator
References
1. Introduction

Computer programs are formulated in a programming language and specify classes of computing processes. Computers, however, interpret sequences of particular instructions, but not program texts. Therefore, the program text must be translated into a suitable instruction sequence before it can be processed by a computer. This translation can be automated, which implies that it can be formulated as a program itself. The translation program is called a compiler, and the text to be translated is called source text (or sometimes source code).

It is not difficult to see that this translation process from source text to instruction sequence requires considerable effort and follows complex rules. The construction of the first compiler for the language Fortran (formula translator) around 1956 was a daring enterprise, whose success was not at all assured. It involved about 18 man-years of effort, and therefore figured among the largest programming projects of the time.

The intricacy and complexity of the translation process could be reduced only by choosing a clearly defined, well structured source language. This occurred for the first time in 1960 with the advent of the language Algol 60, which established the technical foundations of compiler design that still are valid today. For the first time, a formal notation was also used for the definition of the language's structure (Naur, 1960).

The translation process is now guided by the structure of the analysed text. The text is decomposed, parsed into its components according to the given syntax. For the most elementary components, their semantics is recognized, and the meaning (semantics) of the composite parts is the result of the semantics of their components. Naturally, the meaning of the source text must be preserved by the translation. The translation process essentially consists of the following parts:

1. The sequence of characters of a source text is translated into a corresponding sequence of symbols of the vocabulary of the language. For instance, identifiers consisting of letters and digits, numbers consisting of digits, delimiters and operators consisting of special characters are recognized in this phase, which is called lexical analysis.

2. The sequence of symbols is transformed into a representation that directly mirrors the syntactic structure of the source text and lets this structure easily be recognized. This phase is called syntax analysis (parsing).

3. High-level languages are characterized by the fact that objects of programs, for example variables and functions, are classified according to their type. Therefore, in addition to syntactic rules, compatibility rules among types of operators and operands define the language. Hence, verification of whether these compatibility rules are observed by a program is an additional duty of a compiler. This verification is called type checking.

4. On the basis of the representation resulting from step 2, a sequence of instructions taken from the instruction set of the target computer is generated. This phase is called code generation. In general it is the most involved part, not least because the instruction sets of many computers lack the desirable regularity. Often, the code generation part is therefore subdivided further.

A partitioning of the compilation process into as many parts as possible was the predominant technique until about 1980, because until then the available store was too small to accommodate the entire compiler.
Only individual compiler parts would fit, and they could be loaded one after the other in sequence. The parts were called passes, and the whole was called a multipass compiler. The number of passes was typically 4 - 6, but reached 70 in a particular case (for PL/I) known to the author. Typically, the output of pass k served as input of pass k+1, and the disk served as intermediate storage (Figure 1.1). The very frequent access to disk storage resulted in long compilation times.
lexical analysis -> syntax analysis -> code generation
Figure 1.1. Multipass compilation. Modern computers with their apparently unlimited stores make it feasible to avoid intermediate storage on disk. And with it, the complicated process of serializing a data structure for output, and its reconstruction on input can be discarded as well. With single-pass compilers, increases in speed by factors of several thousands are therefore possible. Instead of being tackled one after another in strictly sequential fashion, the various parts (tasks) are interleaved. For example, code generation is not delayed until all preparatory tasks are completed, but it starts already after the recognition of the first sentential structure of the source text. A wise compromise exists in the form of a compiler with two parts, namely a front end and a back end. The first part comprises lexical and syntax analyses and type checking, and it generates a tree representing the syntactic structure of the source text. This tree is held in main store and constitutes the interface to the second part which handles code generation. The main advantage of this solution lies in the independence of the front end of the target computer and its instruction set. This advantage is inestimable if compilers for the same language and for various computers must be constructed, because the same front end serves them all. The idea of decoupling source language and target architecture has also led to projects creating several front ends for different languages generating trees for a single back end. Whereas for the implementation of m languages for n computers m * n compilers had been necessary, now m front ends and n back ends suffice (Figure 1.2).
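To make the saving concrete: for, say, m = 3 source languages and n = 4 target computers, 3 * 4 = 12 complete compilers would otherwise have to be written, whereas 3 front ends and 4 back ends, that is 7 components, cover all combinations.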
Pascal, Modula, Oberon -> syntax tree -> MIPS, SPARC, ARM
Figure 1.2. Front ends and back ends. This modern solution to the problem of porting a compiler reminds us of the technique which played a significant role in the propagation of Pascal around 1975 (Wirth, 1971). The role of the structural tree was assumed by a linearized form, a sequence of commands of an abstract computer. The back end consisted of an interpreter program which was implementable with little effort, and the linear instruction sequence was called P-code. The drawback of this solution was the inherent loss of efficiency common to interpreters. Frequently, one encounters compilers which do not directly generate binary code, but rather assembler text. For a complete translation an assembler is also involved after the compiler. Hence, longer translation times are inevitable. Since this scheme hardly offers any advantages, we do not recommend this approach.
Increasingly, high-level languages are also employed for the programming of microcontrollers used in embedded applications. Such systems are primarily used for data acquisition and automatic control of machinery. In these cases, the store is typically small and is insufficient to carry a compiler. Instead, software is generated with the aid of other computers capable of compiling. A compiler which generates code for a computer different from the one executing the compiler is called a cross compiler. The generated code is then transferred - downloaded - via a data transmission line. In the following chapters we shall concentrate on the theoretical foundations of compiler design, and thereafter on the development of an actual single-pass compiler.
2. Language and Syntax Every language displays a structure called its grammar or syntax. For example, a correct sentence always consists of a subject followed by a predicate, correct here meaning well formed. This fact can be described by the following formula: sentence = subject predicate. If we add to this formula the two further formulas subject = "John" | "Mary". predicate = "eats" | "talks". then we define herewith exactly four possible sentences, namely John eats John talks
Mary eats Mary talks
where the symbol | is to be pronounced as or. We call these formulas syntax rules, productions, or simply syntactic equations. Subject and predicate are syntactic classes. A shorter notation for the above omits meaningful identifiers: S = AB. A = "a" | "b". B = "c" | "d".
L = {ac, ad, bc, bd}
We will use this shorthand notation in the subsequent, short examples. The set L of sentences which can be generated in this way, that is, by repeated substitution of the left-hand sides by the right-hand sides of the equations, is called the language. The example above evidently defines a language consisting of only four sentences. Typically, however, a language contains infinitely many sentences. The following example shows that an infinite set may very well be defined with a finite number of equations. The symbol ∅ stands for the empty sequence. S = A. A = "a" A | ∅.
L = {∅, a, aa, aaa, aaaa, ... }
The means to do so is recursion, which allows a substitution (here of A by "a"A) to be repeated arbitrarily often. Our third example is again based on the use of recursion. But it generates not only sentences consisting of an arbitrary sequence of the same symbol, but also nested sentences:
S = A. A = "a" A "c" | "b".
L = {b, abc, aabcc, aaabccc, ... }
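To see the effect of the recursion, consider how the sentence aabcc is generated; each step replaces A by one of its two alternatives:

S → A → aAc → aaAcc → aabcc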
It is clear that arbitrarily deep nestings (here of As) can be expressed, a property particularly important in the definition of structured languages. Our fourth and last example exhibits the structure of expressions. The symbols E, T, F, and V stand for expression, term, factor, and variable.

E = T | E "+" T.
T = F | T "*" F.
F = V | "(" E ")".
V = "a" | "b" | "c" | "d".
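To illustrate how this grammar assigns a structure, here is a derivation of the sentence a*b+c; the final step abbreviates the remaining substitutions of T, F and V by the corresponding letters, and the multiplication is necessarily subordinate to the addition:

E → E + T → T + T → T * F + T → a * b + c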
From this example it is evident that a syntax does not only define the set of sentences of a language, but also provides them with a structure. The syntax decomposes sentences into their constituents as shown in the example of Figure 2.1. The graphical representations are called structural trees or syntax trees.
a*b+c        (a+b)*(c+d)        a+b*c
Figure 2.1. Structure of expressions

Let us now formulate the concepts presented above more rigorously. A language is defined by the following:

1. The set of terminal symbols. These are the symbols that occur in its sentences. They are said to be terminal, because they cannot be substituted by any other symbols. The substitution process stops with terminal symbols. In our first example this set consists of the elements a, b, c and d. The set is also called vocabulary.

2. The set of nonterminal symbols. They denote syntactic classes and can be substituted. In our first example this set consists of the elements S, A and B.

3. The set of syntactic equations (also called productions). These define the possible substitutions of nonterminal symbols. An equation is specified for each nonterminal symbol.

4. The start symbol. It is a nonterminal symbol, in the examples above denoted by S.

A language is, therefore, the set of sequences of terminal symbols which, starting with the start symbol, can be generated by repeated application of syntactic equations, that is, substitutions.

We also wish to define rigorously and precisely the notation in which syntactic equations are specified. Let nonterminal symbols be identifiers as we know them from programming languages, that is, as sequences of letters (and possibly digits), for example, expression, term. Let terminal symbols be character sequences enclosed in quotes (strings), for example, "=", "|". For the definition of the structure of these equations it is convenient to use the tool just being defined itself:

syntax     = production syntax | ∅.
production = identifier "=" expression "." .
expression = term | expression "|" term.
term       = factor | term factor.
factor     = identifier | string.
identifier = letter | identifier letter | identifier digit.
string     = stringhead """.
stringhead = """ | stringhead character.
letter     = "A" | ... | "Z".
digit      = "0" | ... | "9".
This notation was introduced in 1960 by J. Backus and P. Naur in almost identical form for the formal description of the syntax of the language Algol 60. It is therefore called Backus Naur Form (BNF) (Naur, 1960). As our example shows, using recursion to express simple repetitions is rather detrimental to readability. Therefore, we extend this notation by two constructs to express repetition and optionality. Furthermore, we allow expressions to be enclosed within parentheses. Thereby an extension of BNF called EBNF (Wirth, 1977) is postulated, which again we immediately use for its own, precise definition:

syntax     = {production}.
production = identifier "=" expression "." .
expression = term {"|" term}.
term       = factor {factor}.
factor     = identifier | string | "(" expression ")" | "[" expression "]" | "{" expression "}".
identifier = letter {letter | digit}.
string     = """ {character} """.
letter     = "A" | ... | "Z".
digit      = "0" | ... | "9".
A factor of the form {x} is equivalent to an arbitrarily long sequence of x, including the empty sequence. A production of the form A = AB | ∅. is now formulated more briefly as A = {B}. A factor of the form [x] is equivalent to "x or nothing", that is, it expresses optionality. Hence, the need for the special symbol ∅ for the empty sequence vanishes. The idea of defining languages and their grammar with mathematical precision goes back to N. Chomsky. It became clear, however, that the presented, simple scheme of substitution rules was insufficient to represent the complexity of spoken languages. This remained true even after the formalisms were considerably expanded. In contrast, this work proved extremely fruitful for the theory of programming languages and mathematical formalisms. With it, Algol 60 became the first programming language to be defined formally and precisely. In passing, we emphasize that this rigour applied to the syntax only, not to the semantics. The use of the Chomsky formalism is also responsible for the term programming language, because programming languages seemed to exhibit a structure similar to spoken languages. We believe that this term is rather unfortunate on the whole, because a programming language is not spoken, and therefore is not a language in the true sense of the word. Formalism or formal notation would have been more appropriate terms. One wonders why an exact definition of the sentences belonging to a language should be of any great importance. In fact, it is not really. However, it is important to know whether or not a sentence is well formed. But even here one may ask for a justification. Ultimately, the structure of a (well formed) sentence is relevant, because it is instrumental in establishing the sentence's meaning. Owing to the syntactic structure, the individual parts of the sentence and their meaning can be recognized independently, and together they yield the meaning of the whole. Let us illustrate this point using the following, trivial example of an expression with the addition symbol. Let E stand for expression, and N for number: E = N | E "+" E. N = "1" | "2" | "3" | "4" . Evidently, "4 + 2 + 1" is a well-formed expression. It may even be derived in several ways, each corresponding to a different structure, as shown in Figure 2.2.
(4 + 2) + 1        4 + (2 + 1)
Figure 2.2. Differing structural trees for the same expression. The two differing structures may also be expressed with appropriate parentheses, namely as (4 + 2) + 1 and as 4 + (2 + 1), respectively. Fortunately, thanks to the associativity of addition both yield the same value 7. But this need not always be the case. The mere use of subtraction in place of addition yields a counter example which shows that the two differing structures also yield a different interpretation and result: (4 - 2) - 1 = 1, 4 - (2 - 1) = 3. The example illustrates two facts: 1. Interpretation of sentences always rests on the recognition of their syntactic structure. 2. Every sentence must have a single structure in order to be unambiguous. If the second requirement is not satisfied, ambiguous sentences arise. These may enrich spoken languages; ambiguous programming languages, however, are simply useless. We call a syntactic class ambiguous if it can be attributed several structures. A language is ambiguous if it contains at least one ambiguous syntactic class (construct).
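The ambiguity in this small example can be avoided by formulating the syntax such that the structure is fixed. One possible remedy, for instance, is to admit recursion on one side only, or equivalently to use repetition in EBNF:

E = N | E "+" N.        E = N {"+" N}.

With the recursive formulation every sentence has exactly one structure, in which the operators associate to the left; the EBNF form expresses the same language without recursion.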
2.1. Exercises

2.1. The Algol 60 Report contains the following syntax (translated into EBNF):

primary = unsignedNumber | variable | "(" arithmeticExpression ")" | ... .
factor = primary | factor "↑" primary.
term = factor | term ("×" | "/" | "÷") factor.
simpleArithmeticExpression = term | ("+" | "-") term | simpleArithmeticExpression ("+" | "-") term.
arithmeticExpression = simpleArithmeticExpression | "IF" BooleanExpression "THEN" simpleArithmeticExpression "ELSE" arithmeticExpression.
relationalOperator = "=" | "≠" | "≤" | "<" | "≥" | ">" .
relation = arithmeticExpression relationalOperator arithmeticExpression.
BooleanPrimary = logicalValue | variable | relation | "(" BooleanExpression ")" | ... .
BooleanSecondary = BooleanPrimary | "¬" BooleanPrimary.
BooleanFactor = BooleanSecondary | BooleanFactor "∧" BooleanSecondary.
BooleanTerm = BooleanFactor | BooleanTerm "∨" BooleanFactor.
implication = BooleanTerm | implication "⊃" BooleanTerm.
simpleBoolean = implication | simpleBoolean "≡" implication.
BooleanExpression = simpleBoolean | "IF" BooleanExpression "THEN" simpleBoolean "ELSE" BooleanExpression.

Determine the syntax trees of the following expressions, in which letters are to be taken as variables:

x + y + z
x × y + z
x + y × z
(x - y) × (x + y)
-x ÷ y a+b 0 DO IF x MOD 2 = 1 THEN z := z + y END ; y := 2*y; x := x DIV 2 END ; Write(x); Write(y); Write(z); WriteLn END Multiply; PROCEDURE Divide; VAR x, y, r, q, w: INTEGER; BEGIN Read(x); Read(y); r := x; q := 0; w := y; WHILE w y DO q := 2*q; w := w DIV 2; IF w = . , : ) ] OF THEN DO ( [ ~ := ; END ELSE ELSIF IF WHILE ARRAY RECORD CONST TYPE VAR PROCEDURE BEGIN MODULE The words written in upper-case letters represent single, terminal symbols, and they are called reserved words. They must be recognized by the scanner, and therefore cannot be used as identifiers. In addition to the symbols listed, identifiers and numbers are also treated as terminal symbols. Therefore the scanner is also responsible for recognizing identifers and numbers. It is appropriate to formulate the scanner as a module. In fact, scanners are a classic example of the use of the module concept. It allows certain details to be hidden from the client, the parser, and to make accessible (to export) only those features which are relevant to the client. The exported facilities are summarized in terms of the module's interface definition: DEFINITION OSS; (*Oberon Subset Scanner*) IMPORT Texts; CONST IdLen = 16; (*symbols*) null = 0; times = 1; div = 3; mod = 4; and = 5; plus = 6; minus = 7; or = 8; eql = 9; neq = 10; lss = 11; geq = 12; leq = 13; gtr = 14; period = 18; comma = 19; colon = 20; rparen = 22; rbrak = 23; of = 25; then = 26; do = 27; lparen = 29; lbrak = 30; not = 32; becomes = 33; number = 34; ident = 37; semicolon = 38; end = 40; else = 41; elsif = 42; if = 44; while = 46; array = 54; record = 55; const = 57; type = 58; var = 59; procedure = 60; begin = 61; module = 63; eof = 64; TYPE Ident = ARRAY IdLen OF CHAR; VAR val: LONGINT; id: Ident; error: BOOLEAN; PROCEDURE Mark(msg: ARRAY OF CHAR); PROCEDURE Get(VAR sym: INTEGER); PROCEDURE Init(T: Texts.Text; pos: LONGINT); END OSS. The symbols are mapped onto integers. The mapping is defined by a set of constant definitions. Procedure Mark serves to output diagnostics about errors discovered in the source text. Typically, a short explanation is written into a log text together with the position of the discovered error. Procedure Get represents the actual scanner. It delivers for each call the next symbol recognized. The procedure performs the following tasks (the complete listing is shown in Appendix C): 1. Blanks and line ends are skipped. 2. Reserved words, such as BEGIN and END, are recognized. 3. Sequences of letters and digits starting with a letter, which are not reserved words, are recognized as identifiers. The parameter sym is given the value ident, and the character sequence itself is assigned to the global variable id.
4. Sequences of digits are recognized as numbers. The parameter sym is given the value number, and the number itself is assigned to the global variable val.

5. Combinations of special characters, such as := and <=, are recognized as a symbol.

(Table: follow symbols of the constructs factor, term, SimpleExpression, expression, assignment, ProcedureCall, statement, StatementSequence, FieldList, type, FPSection, FormalParameters, ProcedureHeading, ProcedureBody, ProcedureDeclaration and declarations.)
The subsequent checks of the rules for determinism show that this syntax of Oberon-0 may indeed be handled by the method of recursive descent using a lookahead of one symbol. A procedure is constructed corresponding to each nonterminal symbol. Before the procedures are formulated, it is useful to investigate how they depend on each other. For this purpose we design a dependence graph (Figure 7.1). Every procedure is represented as a node, and an edge is drawn to all nodes on which the procedure depends, that is, calls directly or indirectly. Note that some nonterminal symbols do not occur in this graph, because they are included in other symbols in a trivial way. For example, ArrayType and RecordType are contained in type only and are therefore not explicitly drawn. Furthermore we recall that the symbols ident and integer occur as terminal symbols, because they are treated as such by the scanner.

(Nodes of the graph: module, ProcedureDeclaration, declarations, FPSection, IdentList, type, StatSequence, expression, SimpleExpression, term, factor, selector.)
Figure 7.1. Dependence diagram of parsing procedures Every loop in the diagram corresponds to a recursion. It is evident that the parser must be formulated in a language that allows recursive procedures. Furthermore, the diagram reveals how procedures may possibly be nested. The only procedure which is not called by another procedure is Module. The structure of the program mirrors this diagram. The program is listed in Appendix C in an already augmented form. The parser, like the scanner, is also formulated as a module.
7.3. Coping with syntactic errors

So far we have considered only the rather simple task of determining whether or not a source text is well formed according to the underlying syntax. As a side-effect, the parser also recognizes the structure of the text read. As soon as an unacceptable symbol turns up, the task of the parser is completed, and the process of syntax analysis is terminated. For practical applications, however, this proposition is unacceptable. A genuine compiler must indicate an error diagnostic message and thereafter proceed with the analysis. It is then quite likely that further errors will be detected. Continuation of parsing after an error detection is, however, possible only under the assumption of certain hypotheses about the nature of the error. Depending on this assumption, a part of the subsequent text must be skipped, or certain symbols must be inserted. Such measures are necessary even when there is no intention of correcting or executing the erroneous source program. Without an at least partially correct hypothesis, continuation of the parsing process is futile (Graham and Rhodes, 1975; Rechenberg and Mössenböck, 1985).
The technique of choosing good hypotheses is complicated. It ultimately rests upon heuristics, as the problem has so far eluded formal treatment. The principal reason for this is that the formal syntax ignores factors which are essential for the human recognition of a sentence. For instance, a missing punctuation symbol is a frequent mistake, not only in program texts, but an operator symbol is seldom omitted in an arithmetic expression. To a parser, however, both kinds of symbols are syntactic symbols without distinction, whereas to the programmer the semicolon appears as almost redundant, and a plus symbol as the essence of the expression. This kind of difference must be taken into account if errors are to be treated sensibly. To summarize, we postulate the following quality criteria for error handling: 1. As many errors as possible must be detected in a single scan through the text. 2. As few additional assumptions as possible about the language are to be made. 3. Error handling features should not slow down the parser appreciably. 4. The parser program should not grow in size significantly. We can conclude that error handling strongly depends on a concrete case, and that it can be described by general rules only with limited success. Nevertheless, there are a few heuristic rules which seem to have relevance beyond our specific language, Oberon. Notably, they concern the design of a language just as much as the technique of error treatment. Without doubt, a simple language structure significantly simplifies error diagnostics, or, in other words, a complicated syntax complicates error handling unnecessarily. Let us differentiate between two cases of incorrect text. The first case is where symbols are missing. This is relatively easy to handle. The parser, recognizing the situation, proceeds by omitting one or several calls to the scanner. An example is the statement at the end of factor, where a closing parenthesis is expected. If it is missing, parsing is resumed after emitting an error message: IF sym = rparen THEN Get(sym) ELSE Mark(" ) missing") END Virtually without exception, only weak symbols are omitted, symbols which are primarily of a syntactic nature, such as the comma, semicolon and closing symbols. A case of wrong usage is an equality sign instead of an assignment operator, which is also easily handled. The second case is where wrong symbols are present. Here it is unavoiable to skip them and to resume parsing at a later point in the text. In order to facilitate resumption, Oberon features certain constructs beginning with distinguished symbols which, by their nature, are rarely misused. For example, a declaration sequence always begins with the symbol CONST, TYPE, VAR, or PROCEDURE, and a structured statement always begins with IF, WHILE, REPEAT, CASE, and so on. Such strong symbols are therefore never skipped. They serve as synchronization points in the text, where parsing can be resumed with a high probability of success. In Oberon's syntax, we establish four synchronization points, namely in factor, statement, declarations and type. At the beginning of the corresponding parser procedures symbols are being skipped. The process is resumed when either a correct start symbol or a strong symbol is read. PROCEDURE factor; BEGIN (*sync*) IF sym < lparen THEN Mark("ident?"); REPEAT Get(sym) UNTIL sym >= lparen END ; ... END factor; PROCEDURE StatSequence; BEGIN (*sync*) IF sym < ident THEN Mark("Statement?"); REPEAT Get(sym) UNTIL sym >= ident END ; ... END StatSequence;
PROCEDURE Type; BEGIN (*sync*) IF (sym # ident) & (sym >= const) THEN Mark("type?"); REPEAT Get(sym) UNTIL (sym = ident) OR (sym >= array) END ; ... END Type; PROCEDURE declarations; BEGIN (*sync*) IF sym < const THEN Mark("declaration?"); REPEAT Get(sym) UNTIL sym >= const END ; ... END declarations; Evidently, a certain ordering among symbols is assumed at this point. This ordering had been chosen such that the symbols are grouped to allow simple and efficient range tests. Strong symbols not to be skipped are assigned a high ranking (ordinal number) as shown in the definition of the scanner's interface. In general, the rule holds that the parser program is derived from the syntax according to the recursive descent method and the explained translation rules. If a read symbol does not meet expectations, an error is indicated by a call of procedure Mark, and analysis is resumed at the next synchronization point. Frequently, follow-up errors are diagnosed, whose indication may be omitted, because they are merely consequences of a formerly indicated error. The statement which results for every synchronization point can be formulated generally as follows: IF ~(sym IN follow(SYNC)) THEN Mark(msg); REPEAT Get(sym) UNTIL sym IN follow(SYNC) END where follow(SYNC) denotes the set of symbols which may correctly occur at this point. In certain cases it is advantageous to depart from the statement derived by this method. An example is the construct of statement sequence. Instead of Statement; WHILE sym = semicolon DO Get(sym); Statement END we use the formulation LOOP (*sync*) IF sym < ident THEN Mark("ident?"); ... END ; Statement; IF sym = semicolon THEN Get(sym) ELSIF sym IN follow(StatSequence) THEN EXIT ELSE Mark("semicolon?") END END This replaces the two calls of Statement by a single call, whereby this call may be replaced by the procedure body itself, making it unnecessary to declare an explicit procedure. The two tests after Statement correspond to the legal cases where, after reading the semicolon, either the next statement is analysed or the sequence terminates. Instead of the condition sym IN follow(StatSequence) we use a Boolean expression which again makes use of the specifically chosen ordering of symbols: (sym >= semicolon) & (sym < if) OR (sym >= array) The construct above is an example of the general case where a sequence of identical subconstructs which may be empty (here, statements) are separated by a weak symbol (here, semicolon). A second, similar case is manifest in the parameter list of procedure calls. The statement
IF sym = lparen THEN Get(sym); expression; WHILE sym = comma DO Get(sym); expression END ; IF sym = rparen THEN Get(sym) ELSE Mark(") ?") END END is being replaced by IF sym = lparen THEN Get(sym); LOOP expression; IF sym = comma THEN Get(sym) ELSIF (sym = rparen) OR (sym >= semicolon) THEN EXIT ELSE Mark(") or , ?") END END ; IF sym = rparen THEN Get(sym) ELSE Mark(") ?") END END A further case of this kind is the declaration sequence. Instead of IF sym = const THEN ... END ; IF sym = type THEN ... END ; IF sym = var THEN ... END ; we employ the more liberal formulation LOOP IF sym = const THEN ... END ; IF sym = type THEN ... END ; IF sym = var THEN ... END ; IF (sym >= const) & (sym = array) END ; IF sym = ident THEN find(obj); Get(sym); IF obj.class = Typ THEN type := obj.type ELSE Mark("type?") END ELSIF sym = array THEN Get(sym); IF sym = number THEN n := val; Get(sym) ELSE Mark("number?"); n := 1 END ; IF sym = of THEN Get(sym) ELSE Mark("OF?") END ; Type1(tp); NEW(type); type.form := Array; type.base := tp; type.len := n; type.size := type.len * tp.size ELSIF sym = record THEN Get(sym); NEW(type); type.form := Record; type.size := 0; OpenScope; LOOP IF sym = ident THEN IdentList(Fld, first); Type1(tp); obj := first; WHILE obj # guard DO obj.type := tp; obj.val := type.size; INC(type.size, obj.type.size); obj := obj.next END END ; IF sym = semicolon THEN Get(sym) ELSIF sym = ident THEN Mark("; ?") ELSE EXIT END END ; type.fields := topScope.next; CloseScope; IF sym = end THEN Get(sym) ELSE Mark("END?") END ELSE Mark("ident?") END END Type1; Following a longstanding tradition, addresses of variables are assigned negative values, that is, negative offsets to the common base address determined during program execution. The auxiliary procedures OpenScope and CloseScope ensure that the list of record fields is not intermixed with the list of variables. Every record declaration establishes a new scope of visibility of field identifiers, as required by the definition of the language Oberon. Note that the list into which new entries are inserted is rooted in the global variable topScope.
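Procedure Type1 accesses the symbol table through the fields class, type, val and next of objects and form, base, len, size and fields of type descriptors. As a reminder of their shape, here is a sketch of the two record declarations; the field names are taken from their use above, whereas the exact field types and the complete field list (which also records, for example, the declaration level) are assumptions of this sketch, the authoritative declarations being those of Appendix C:

Object = POINTER TO ObjDesc;
ObjDesc = RECORD       (*symbol table entry*)
  class: INTEGER;      (*e.g. Typ, Fld, ...*)
  next: Object;        (*next entry in the same scope*)
  type: Type;
  name: Ident;         (*identifier as delivered by the scanner*)
  val: LONGINT         (*address offset or constant value*)
END ;

Type = POINTER TO TypeDesc;
TypeDesc = RECORD      (*type descriptor*)
  form: INTEGER;       (*e.g. Array, Record, ...*)
  fields: Object;      (*list of record fields*)
  base: Type;          (*element type of an array*)
  size, len: LONGINT   (*size in bytes; number of array elements*)
END ;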
8.4. Exercises 8.1. The scope of identifiers is defined to extend from the place of declaration to the end of the procedure in which the declaration occurs. What would be necessary to let this range extend from the beginning to the end of the procedure? 8.2. Consider pointer declarations as defined in Oberon. They specify a type to which the declared pointer is bound, and this type may occur later in the text. What is necessary to accommodate this relaxation of the rule that all referenced entities must be declared prior to their use?
9. A RISC-Architecture as Target It is worth noticing that our compiler, up to this point, could be developed without reference to the target computer for which it is to generate code. But why indeed should the target machine's structure influence syntactic analysis and error handling? On the contrary, such an influence should consciously be avoided. As a result, code generation for an arbitrary computer may be added according to the principle of stepwise refinement to the existing, machine independent parser, which serves like a scaffolding. Before undertaking this task, however, a specific target architecture must be selected. To keep both the resulting compiler reasonably simple and the development clear of details that are of relevance only for a specific machine and its idiosyncrasies, we postulate an architecture according to our own choice. Thereby we gain the considerable advantage that it can be tailored to the needs of the source language. This architecture does not exist as a real machine; it is therefore virtual. But since every computer executes instructions according to a fixed algorithm, it can easily be specified by a program. A real computer may then be used to execute this program which interprets the generated code. The program is called an interpreter, and it emulates the virtual machine, which therefore can be said to have a semi-real existence. It is not the aim of this text to present the motivations for our choice of the specific virtual architecture with all its details. This chapter is rather intended to serve as a descriptive manual consisting of an informal introduction and a formal definition of the computer in the form of the interpretive program. This formalization may even be considered as an example of an exact specification of a processor. In the definition of this computer we intentionally follow closely the line of RISC-architectures. The acronym RISC stands for reduced instruction set computer, where "reduced" is to be understood as relative to architectures with large sets of complex instructions, as these were dominant until about 1980. This is obviously not the place to explain the essence of the RISC architecture, nor to expound on its various advantages. Here it is obviously attractive because of its simplicity and clarity of concepts, which simplify the description of the instruction set and the choice of instruction sequences corresponding to specific language constructs. The architecture chosen here is almost identical to the one presented by Hennessy and Patterson (1990) under the name DLX. The small deviations are due to our desire for increased regularity. Among commercial products, the MIPS and ARM architectures are closest to our virtual machine. Resources and registers From the viewpoints of the programmer and the compiler designer the computer consists of an arithmetic unit, a control unit and a store. The arithmetic unit contains 16 registers R0 – R15, with 32 bits each. The control unit consists of the instruction register IR, holding the instruction currently being executed, and the program counter PC, holding the address of the instruction to be fetched next (Figure 9.1). The program counter is included in the set of data registers: PC = R15. Branch instructions to subroutines implicitly use register R14 to store the return address. The memory consists of 32-bit words, and it is byte-addressed, that is, word addresses are multiples of 4. There are three types of instructions and instruction formats. 
Register instructions operate on registers only and feed data through a shifter and the arithmetic logic unit ALU. Memory instructions fetch and store data in memory. Branch instructions affect the program counter.

1. Register instructions (formats F0 and F1)

MOV  a, c      R.a := Shift(R.c, b)               MOVI a, im      R.a := Shift(im, b)
MVN  a, c      R.a := - Shift(R.c, b)             MVNI a, im      R.a := - Shift(im, b)
ADD  a, b, c   R.a := R.b + R.c                   ADDI a, b, im   R.a := R.b + im
SUB  a, b, c   R.a := R.b - R.c                   SUBI a, b, im   R.a := R.b - im
MUL  a, b, c   R.a := R.b * R.c                   MULI a, b, im   R.a := R.b * im
DIV  a, b, c   R.a := R.b DIV R.c                 DIVI a, b, im   R.a := R.b DIV im
MOD  a, b, c   R.a := R.b MOD R.c                 MODI a, b, im   R.a := R.b MOD im
CMP  b, c      Z := R.b = R.c; N := R.b < R.c     CMPI b, im      Z := R.b = im; N := R.b < im
register bank, ALU, shifter, instruction register, program counter with incrementer, instruction decoder, memory

Figure 9.1. Block diagram of the RISC structure
F0   00   op (4 bits)   a (4)   b (4)   c (register number in the low 4 of the remaining 18 bits)
F1   01   op (4 bits)   a (4)   b (4)   im (18 bits)
F2   10   op (4 bits)   a (4)   b (4)   disp (18 bits)
F3   11   op (4 bits)   disp (26 bits)

Figure 9.2. Instruction formats.
In the case of register instructions there are two variants. Either the second operand is an immediate value (F1), and the 18-bit constant im is sign extended to 32 bits. Or the second operand is a register (F0). The comparison instruction CMP affects the status bits Z and N.

2. Memory instructions (format F2)

LDW  a, b, im   R.a := Mem[R.b + disp]               load word
LDB  a, b, im   R.a := Mem[R.b + disp] MOD 100H      load byte
POP  a, b, im   R.a := Mem[R.b]; R.b := R.b + disp   pop
STW  a, b, im   Mem[R.b + disp] := R.a               store word
STB  a, b, im   Mem[R.b + disp] := ...               store byte
PSH  a, b, im   R.b := R.b - disp; Mem[R.b] := R.a   push
3. Branch instructions (format F3, word address PC-relative)

BEQ  disp   PC := PC + disp*4, if Z           BNE  disp   PC := PC + disp*4, if ~Z
BLT  disp   PC := PC + disp*4, if N           BGE  disp   PC := PC + disp*4, if ~N
BLE  disp   PC := PC + disp*4, if Z or N      BGT  disp   PC := PC + disp*4, if ~(Z or N)
BR   disp   PC := PC + disp*4
BSR  disp   R14 := PC; PC := PC + disp*4
RET  disp   PC := R.c
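As a small illustration of these definitions (the register numbers and the offset are chosen arbitrarily for the example), a statement such as x := x + 1, with the variable x stored at offset 4 from a base address held in R13, can be encoded as

LDW   0, 13, 4       R0 := Mem[R13 + 4]
ADDI  0, 0, 1        R0 := R0 + 1
STW   0, 13, 4       Mem[R13 + 4] := R0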
The virtual computer is defined by the following interpreter program in more detail. Note that register PC holds word addresses instead of byte addresses, and that Z and N are status bits set be comparison instructions. MODULE RISC; (*NW 27. 11. 05*) IMPORT SYSTEM, Texts; CONST MemSize* = 4096; ProgOrg = 2048; (*in bytes*) MOV = 0; MVN = 1; ADD = 2; SUB = 3; MUL = 4; Div = 5; Mod = 6; CMP = 7; MOVI = 16; MVNI = 17; ADDI = 18; SUBI = 19; MULI = 20; DIVI = 21; MODI = 22; CMPI = 23; CHKI = 24; LDW = 32; LDB = 33; POP = 34; STW = 36; STB = 37; PSH = 38; RD = 40; WRD= 41; WRH = 42; WRL = 43; BEQ = 48; BNE = 49; BLT = 50; BGE = 51; BLE = 52; BGT = 53; BR = 56; BSR = 57; RET = 58; VAR IR: LONGINT; N, Z: BOOLEAN; R*: ARRAY 16 OF LONGINT; M*: ARRAY MemSize DIV 4 OF LONGINT; W: Texts.Writer; (* R15] is PC, R[14] used as link register by BSR instruction*) PROCEDURE Execute*(start: LONGINT; VAR in: Texts.Scanner; out: Texts.Text); VAR opc, a, b, c, nxt: LONGINT; BEGIN R[14] := 0; R[15] := start + ProgOrg; LOOP (*interpretation cycle*) nxt := R[15] + 4; IR := M[R[15] DIV 4]; opc := IR DIV 4000000H MOD 40H; a := IR DIV 400000H MOD 10H; b := IR DIV 40000H MOD 10H; c := IR MOD 40000H; IF opc < MOVI THEN (*F0*) c := R[IR MOD 10H] ELSIF opc < BEQ THEN (*F1, F2*) c := IR MOD 40000H; IF c >= 20000H THEN DEC(c, 40000H) END (*sign extension*) ELSE (*F3*) c := IR MOD 4000000H; IF c >= 2000000H THEN DEC(c, 4000000H) END (*sign extension*) END ; CASE opc OF MOV, MOVI: R[a] := ASH(c, b) (*arithmetic shift*) | MVN, MVNI: R[a] := -ASH(c, b) | ADD, ADDI: R[a] := R[b] + c | SUB, SUBI: R[a] := R[b] - c | MUL, MULI: R[a] := R[b] * c | Div, DIVI: R[a] := R[b] DIV c | Mod, MODI: R[a] := R[b] MOD c | CMP, CMPI: Z := R[b] = c; N := R[b] < c | CHKI: IF (R[a] < 0) OR (R[a] >= c) THEN R[a] := 0 END | LDW: R[a] := M[(R[b] + c) DIV 4]
    | LDB: (*not implemented*)
    | POP: R[a] := M[(R[b]) DIV 4]; INC(R[b], c)
    | STW: M[(R[b] + c) DIV 4] := R[a]
    | STB: (*not implemented*)
    | PSH: DEC(R[b], c); M[(R[b]) DIV 4] := R[a]
    | RD: Texts.Scan(in); R[a] := in.i
    | WRD: Texts.Write(W, " "); Texts.WriteInt(W, R[c], 1)
    | WRH: Texts.WriteHex(W, R[c])
    | WRL: Texts.WriteLn(W); Texts.Append(out, W.buf)
    | BEQ: IF Z THEN nxt := R[15] + c*4 END
    | BNE: IF ~Z THEN nxt := R[15] + c*4 END
    | BLT: IF N THEN nxt := R[15] + c*4 END
    | BGE: IF ~N THEN nxt := R[15] + c*4 END
    | BLE: IF Z OR N THEN nxt := R[15] + c*4 END
    | BGT: IF ~Z & ~N THEN nxt := R[15] + c*4 END
    | BR: nxt := R[15] + c*4
    | BSR: nxt := R[15] + c*4; R[14] := R[15] + 4
    | RET: nxt := R[c MOD 10H]; IF nxt = 0 THEN EXIT END
    END ;
    R[15] := nxt
  END
END Execute;

PROCEDURE Load*(VAR code: ARRAY OF LONGINT; len: LONGINT);
  VAR i: INTEGER;
BEGIN i := 0;
  WHILE i < len DO M[i + ProgOrg DIV 4] := code[i]; INC(i) END
END Load;

BEGIN Texts.OpenWriter(W)
END RISC.

Additional notes:

1. Instructions RD, WRD, WRH and WRL are not typical computer instructions. We have added them here to provide a simple and effective way for input and output. Compiled and interpreted programs can thus be tested and obtain a certain reality.

2. Instructions LDB and STB load and store a single byte. Without them, it would not make sense to speak about a byte-oriented computer. However, we refrain from specifying them here, because such program statements would hardly mirror their implementation in hardware truthfully.

3. Instructions PSH and POP behave like STW and LDW, whereby the value of the base register R.b is incremented or decremented by the amount c. They will allow the handling of procedure parameters in a convenient way (see Chapter 12).

4. The CHKI instruction simply replaces an index value which is outside the index bounds by 0, because the RISC does not provide a trap facility.
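A compiled program is run by first depositing its instructions in the memory M and then starting the interpretation cycle at its entry point. A minimal sketch of the necessary calls follows; code, codelen and entry stand for the instruction array, its length and the entry offset in bytes as delivered by a code generator, and S and log for a scanner and a text prepared by the caller for the RD and WRx instructions, all of them assumptions of this sketch:

RISC.Load(code, codelen);     (*copy the instruction words to M, starting at ProgOrg*)
RISC.Execute(entry, S, log)   (*begin execution at address ProgOrg + entry*)

Interpretation terminates when a RET instruction finds the value 0 in its designated register; since Execute initializes R14 to 0, a program that returns through R14 at its end stops the interpreter.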
10. Expressions and Assignments

10.1. Straight code generation according to the stack principle

The third example in Chapter 5 showed how to convert an expression from conventional infix form into its equivalent postfix form. Our ideal computer would be capable of directly interpreting postfix notation. As also shown, such an ideal computer requires a stack for holding intermediate results. Such a computer architecture is called a stack architecture. Computers based on a stack architecture are not in common use. Sets of explicitly addressable registers are preferred to a stack. Of course, a set of registers can easily be used to emulate a stack. Its top element is indicated by a global variable representing the stack pointer (SP) in the compiler. This is feasible, since the number of intermediate results is known at compile time, and the use of a global variable is justified because the stack constitutes a global resource.

To derive the program for generating the code corresponding to specific constructs, we first postulate the desired code patterns. This method will also be successfully employed later for other constructs beyond expressions and assignments. Let the code for a given construct K be given by the following table:

K                 code(K)                                      side effect

ident             LDW i, 0, adr(ident)                         INC(SP)
number            MOVI i, 0, value                             INC(SP)
( exp )           code(exp)
fac0 * fac1       code(fac0) code(fac1) MUL i, i, i+1          DEC(SP)
term0 + term1     code(term0) code(term1) ADD i, i, i+1        DEC(SP)
ident := exp      code(exp) STW i, adr(ident)                  DEC(SP)
To begin, we restrict our attention to simple variables as operands, and we omit selectors for structured variables. First, consider the assignment u := x*y + z*w:

Instruction          meaning            stack            stack pointer

LDW  R0, base, x     R0 := x            x                1
LDW  R1, base, y     R1 := y            x, y             2
MUL  R0, R0, R1      R0 := R0 * R1      x*y              1
LDW  R1, base, z     R1 := z            x*y, z           2
LDW  R2, base, w     R2 := w            x*y, z, w        3
MUL  R1, R1, R2      R1 := R1 * R2      x*y, z*w         2
ADD  R0, R0, R1      R0 := R0 + R1      x*y + z*w        1
STW  R0, base, u     u := R0            -                0
From this it is quite evident how the corresponding parser procedures must be extended: PROCEDURE factor; VAR obj: Object; BEGIN IF sym = ident THEN find(obj); Get(sym); INC(RX); Put(LDW, RX, 0, -obj.val) ELSIF sym = number THEN INC(RX); Put(MOVI, RX, 0, val); Get(sym) ELSIF sym = lparen THEN Get(sym); expression; IF sym = rparen THEN Get(sym) ELSE Mark(" ) missing") END
ELSIF ... END END factor; PROCEDURE term; VAR op: INTEGER; BEGIN factor; WHILE (sym = times) OR (sym = div) DO op := sym; Get(sym); factor; DEC(RX); IF op = times THEN Put(MUL, RX, RX, RX+1) ELSIF op = div THEN Put(DIV, RX, RX, RX+1) END END END term; PROCEDURE SimpleExpression; VAR op: INTEGER; BEGIN IF sym = plus THEN Get(sym); term ELSIF sym = minus THEN Get(sym); term; Put(SUB, RX, 0, RX) ELSE term END ; WHILE (sym = plus) OR (sym = minus) DO op := sym; Get(sym); term; DEC(RX); IF op = plus THEN Put(ADD, RX, RX, RX+1) ELSIF op = minus THEN Put(SUB, RX, RX, RX+1) END END END SimpleExpression; PROCEDURE Statement; VAR obj: Object; BEGIN IF sym = ident THEN find(obj); Get(sym); IF sym = becomes THEN Get(sym); expression; Put(STW, RX, 0, obj.val); DEC(RX) ELSIF ... END ELSIF ... END END Statement; Here we have introduced the generator procedure Put. It can be regarded as the counterpart of the scanner procedure Get. We assume that it deposits an instruction in a global array, using the variable pc as index denoting the next free location in the array. With this assumption, procedure Put is formulated in Oberon as follows, whereby LSH(x, n) is a function yielding the value x left shifted by n bit positions: PROCEDURE Put(op, a, b, d: INTEGER); BEGIN code[pc] := LSH(LSH(LSH(op, 4) + a, 4) + b, 18) + (d MOD 40000H); INC(pc) END Put Addresses of variables are indicated here by simply using their identifier. In reality, the address values obtained from the symbol table stand in place of the identifiers. They are offsets to a base address computed at run time, that is, the offsets are added to the base address to yield the effective address. This holds not only for our RISC machine, but also for virtually all common computers. We take this fact into account by specifying addresses using pairs consisting of the offset a and the base (register) r.
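Returning to procedure Put: as a small check of this encoding (the operand values are chosen arbitrarily), consider the call Put(ADD, 1, 1, 2), which emits ADD R1, R1, R2 as used in the pattern ADD i, i, i+1 above. With ADD = 2 it deposits the word

((2*16 + 1)*16 + 1) * 40000H + 2 = 8440002H

from which the interpreter of Chapter 9 recovers opc = 2, a = 1, b = 1 and c = 2 again.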
10.2. Delayed code generation

Consider as a second example the expression x + 1. According to the scheme presented in Section 10.1, we obtain the corresponding code

LDW   0, base, x      R0 := x
MOVI  1, 0, 1         R1 := 1
ADD   0, 0, 1         R0 := R0 + R1
This shows that the generated code is correct, but certainly not optimal. The flaw lies in the fact that the constant 1 is loaded into a register, although this is unnecessary, because our computer features an instruction which lets constants be added immediately to a register (immediate addressing mode). Evidently some code has been emitted prematurely. The remedy must be to delay code emission in certain cases until it is definitely known that there is no better solution. How is such a delayed code generation to be implemented? In general, the method consists in associating the information which would have been used for the selection of the emitted code with the resulting syntactic construct. From the principle of attributed grammars presented in Chapter 5, this information is retained in the form of attributes. Code generation therefore depends not only on the syntactically reduced symbols, but also on the values of their attributes. This conceptual extension is reflected by the fact that parser procedures obtain a result parameter which represents these attributes. Because there are usually several attributes, a record structure is chosen for these parameters; we call their type Item (Wirth and Gutknecht, 1992).

In the case of our second example, it is necessary to indicate whether the value of a factor, term or expression is held (at run time) in a register, as has been the case so far, or whether the value is a known constant. The latter case will quite likely lead to a later instruction with immediate mode. It now becomes clear that the attribute must indicate the mode in which the factor, term or expression is, that is, where the value is stored and how it is to be accessed. This attribute mode corresponds to the addressing mode of computer instructions, and its range of possible values depends on the set of addressing modes which the target computer features. For each addressing mode offered, there is a corresponding item mode. A mode is also implicitly introduced by object classes. Object classes and item modes partially overlap. In the case of our RISC architecture, there are only three modes:

Item mode    Object class    Addressing mode    Additional attributes

Var          Var             Direct             a    Value in memory at address a
Const        Const           Immediate          a    Value is the constant a
Reg          -               Register           r    Value held in register R[r]
With this in mind, we declare the data type Item as a record structure with fields mode, type, a and r. Evidently, the type of the item is also an attribute. It will not be mentioned any further below, because we shall consider only the single type Integer. The parser procedures now emerge as functions with result type Item. Programming considerations, however, suggest using proper procedures with a result (VAR) parameter instead of function procedures.

Item = RECORD mode: INTEGER;
    type: Type;
    a, r: LONGINT
  END

Let us now return to our example to demonstrate the generation of code for the expression x+1. The process is shown in Figure 10.1. The transformation of a Var-Item into a Reg-Item is accompanied by the emission of an LDW instruction, and the transformation of a Reg-Item and a Const-Item into a Reg-Item is accompanied by emitting an ADDI instruction.
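For our example this means that the constant 1 never occupies a register: the expression x + 1 now compiles into the following two instructions (the register numbers are merely illustrative).

LDW   0, base, x        R0 := x
ADDI  0, 0, 1           R0 := R0 + 1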
Figure 10.1. Generating items and instructions for the expression x+1. Note the similarity of the two types Item and Object. Both describe objects, but whereas Objects represent declared, named objects, whose visibility reaches beyond the construct of their declaration, Items describe objects which are always strictly bound to their syntactic construct. Therefore, it is strongly recommended not to allocate Items dynamically (in a heap), but rather to declare them as local parameters and variables. PROCEDURE factor(VAR x: Item); BEGIN IF sym = ident THEN find(obj); Get(sym); x.mode := obj.class; x.a := obj.adr; x.r := 0 ELSIF sym = number THEN x.mode := Const; x.a := val; Get(sym) ELSIF sym = lparen THEN Get(sym); expression(x); IF sym = rparen THEN Get(sym) ELSE Mark(" ) missing") END ELSIF ... END END factor; PROCEDURE term(VAR x: Item); VAR y: Item; op: INTEGER; BEGIN factor(x); WHILE (sym = times) OR (sym = div) DO op := sym; Get(sym); factor(y); Op2(op, x, y) END END term; PROCEDURE SimpleExpression(VAR x: Item); VAR y: Item; op: INTEGER; BEGIN IF sym = plus THEN Get(sym); term(x) ELSIF sym = minus THEN Get(sym); term(x); Op1(minus, x) ELSE term(x) END ; WHILE (sym = plus) OR (sym = minus) DO op := sym; Get(sym); term(y); Op2(op, x, y) END END SimpleExpression; PROCEDURE Statement; VAR obj: Object; x, y: Item; BEGIN IF sym = ident THEN find(obj); Get(sym); x.mode := obj.class; x.a := obj.adr; x.r := 0; IF sym = becomes THEN
Get(sym); expression(y); IF y.mode # Reg THEN load(y) END ; Put(STW, y.r, 0, x.a) ELSIF ... END ELSIF ... END END Statement; The code generating statements are now merged into two procedures, Op1 and Op2. The principle of delayed code emission is also used here to avoid the emission of arithmetic instructions if the compiler can perform the operation itself. This is the case when both operands are constants. The technique is known as constant folding. PROCEDURE Op1(op: INTEGER; VAR x: Item); (* x := op x *) VAR t: LONGINT; BEGIN IF op = minus THEN IF x.mode = Const THEN x.a := -x.a ELSE IF x.mode = Var THEN load(x) END ; Put(MVN, x.r, 0, x.r) END ... END END Op1; PROCEDURE Op2(op: INTEGER; VAR x, y: Item); (* x := x op y *) BEGIN IF (x.mode = Const) & (y.mode = Const) THEN IF op = plus THEN x.a := x.a + y.a ELSIF op = minus THEN x.a := x.a - y.a ... END ELSE IF op = plus THEN PutOp(ADD, x, y) ELSIF op = minus THEN PutOp(SUB, x, y) ... END END END Op2; PROCEDURE PutOp(cd: LONGINT; VAR x, y: Item); BEGIN IF x.mode # Reg THEN load(x) END ; IF y.mode = Const THEN Put(cd+MVI, x.r, x.r, y.a) ELSE IF y.mode # Reg THEN load(y) END ; Put(cd, x.r, x.r, y.r); EXCL(regs, y.r) END END PutOp; PROCEDURE load(VAR x: Item); VAR r: INTEGER; BEGIN (*x.mode # Reg*) IF x.mode = Var THEN GetReg(r); Put(LDW, r, x.r, x.a); x.r := r ELSIF x.mode = Const THEN IF x.a = 0 THEN x.r := 0 ELSE GetReg(x.r); Put(MOVI, x.r, 0, x.a) END
END ;
  x.mode := Reg
END load;

Whenever arithmetic expressions are evaluated, the inherent danger of overflow exists. The evaluating statements should therefore be suitably guarded. In the case of addition, the guards can be formulated as follows:

IF x.a >= 0 THEN
  IF y.a <= MAX(INTEGER) - x.a THEN x.a := x.a + y.a ELSE Mark("overflow") END
ELSE
  IF y.a >= MIN(INTEGER) - x.a THEN x.a := x.a + y.a ELSE Mark("underflow") END
END

The essence of delayed code generation is that code is not emitted before it is clear that no better solution exists. For example, an operand is not loaded into a register before this is known to be unavoidable. We also abandon allocation of registers according to the rigid stack principle. This is advantageous in certain cases which will be explained later. Procedure GetReg delivers and reserves any one of the free registers. The set of free registers is suitably represented by a global variable regs. Of course, care has to be taken to release registers whenever their value is no longer relevant.

PROCEDURE GetReg(VAR r: LONGINT);
  VAR i: INTEGER;
BEGIN i := 1;
  WHILE (i < 15) & (i IN regs) DO INC(i) END ;
  INCL(regs, i); r := i
END GetReg;

The principle of delayed code generation is also useful in many other cases, but it becomes indispensable when considering computers with complex addressing modes, for which reasonably efficient code has to be generated by making good use of the available complex modes. As an example we consider code emission for a CISC architecture. It typically offers instructions with two operands, one of them also representing the result. Let us consider the expression u := x + y*z and obtain the following instruction sequence:

MOV   y, R0       R0 := y
MUL   z, R0       R0 := R0 * z
ADD   x, R0       R0 := R0 + x
MOV   R0, u       u := R0
This is obtained by delaying the loading of variables until they are to be joined with another operand. Because the instruction replaces the first operand with the operation's result, the operation cannot be performed on the variable's original location, but only on an intermediate location, typically a register. The copy instruction is not issued before this is recognized as unavoidable. A side effect of this measure is that, for example, the simple assignment x := y does not transfer via a register at all, but occurs directly through a copy instruction, which both increases efficiency and decreases code length:

MOV   y, x        x := y
10.3. Indexed variables and record fields So far we have considered simple variables only in expressions and assignments. Access to elements of structured variables, arrays and records, necessitates the selection of the element according to a computed index or a field identifier, respectively. Syntactically, the variable's identifier is followed by one or several selectors. This is mirrored in the parser by a call of the procedure selector within factor and also in statement: find(obj); Get(sym); x.mode := obj.class; x.a := obj.adr; x.r := 0; selector(x)
Procedure selector processes not only a single selection, but if needed an entire sequence of selections. The following formulation shows that the attribute type of the operand x is also relevant.

PROCEDURE selector(VAR x: Item);
  VAR y: Item; obj: Object;
BEGIN
  WHILE (sym = lbrak) OR (sym = period) DO
    IF sym = lbrak THEN
      Get(sym); expression(y);
      IF x.type.form = Array THEN Index(x, y) ELSE Mark("not an array") END ;
      IF sym = rbrak THEN Get(sym) ELSE Mark("]?") END
    ELSE Get(sym);
      IF sym = ident THEN
        IF x.type.form = Record THEN
          FindField(obj, x.type.fields); Get(sym);
          IF obj # guard THEN Field(x, obj) ELSE Mark("undef") END
        ELSE Mark("not a record")
        END
      ELSE Mark("ident?")
      END
    END
  END
END selector;

The address of the selected element is given by the formulas derived in Section 8.3. In the case of a field identifier the address computation is performed by the compiler. The address is the sum of the variable's address and the field's offset.

PROCEDURE Field(VAR x: Item; y: Object); (* x := x.y *)
BEGIN INC(x.a, y.val); x.type := y.type
END Field;

In the case of an indexed variable, code is emitted according to the formula

adr(a[k]) = adr(a) + k * size(T)

Here a denotes the array variable, k the index, and T the type of the array's elements. Index computation requires two instructions; the scaled index is added to the register component of the address. Let the index be stored in register R.j, and let the array address be stored in register R.i.

MULI  j, j, size(T)
ADD   j, i, j
Procedure Index emits the above index code, checks whether the indexed variable is indeed an array, and computes the element's address directly if the index is a constant. PROCEDURE Index(VAR x, y: Item); (* x := x[y] *) VAR z: Item; BEGIN IF y.type # intType THEN Mark("index not integer") END ; IF y.mode = Const THEN IF (y.a < 0) OR (y.a >= x.type.len) THEN Mark("index out of range") END ; x.a := x.a + y.a * x.type.base.size ELSE IF y.mode # Reg THEN load(y) END ; Put(MULI, y.r, y.r, x.type.base.size); Put(ADD, y.r, x.r, y.r); EXCL(regs, x.r); x.r := y.r END; x.type := x.type.base END Index;
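As a concrete instance of the constant case, consider a[2] in the program below, where adr(a) = -24 and size(INTEGER) = 4. The compiler evaluates adr(a[2]) = -24 + 2*4 = -16 at compile time, so that no index code is emitted at all; only the single access instruction LDW 0, base, -16 appears in the listing that follows.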
We can now show the code resulting from the following program fragment containing one- and two-dimensional arrays:

PROCEDURE P1;
  VAR i, j: INTEGER;                        (* adr -4, -8 *)
    a: ARRAY 4 OF INTEGER;                  (* adr -24 *)
    b: ARRAY 3 OF ARRAY 5 OF INTEGER;       (* adr -84 *)
BEGIN
  i := a[j]; i := a[2]; i := a[i+j];
  i := b[i][j]; i := b[2][4]; i := a[a[i]]
END P1.

LDW   0, base, -8        i := a[j]
MULI  0, 0, 4
ADD   0, base, 0
LDW   1, 0, -24
STW   1, base, -4

LDW   0, base, -16       i := a[2]
STW   0, base, -4

LDW   0, base, -4        i := a[i+j]
LDW   1, base, -8
ADD   0, 0, 1
MULI  0, 0, 4
ADD   0, base, 0
LDW   1, 0, -24
STW   1, base, -4

LDW   0, base, -4        i := b[i][j]
MULI  0, 0, 20
ADD   0, base, 0
LDW   1, base, -8
MULI  1, 1, 4
ADD   1, 0, 1
LDW   0, 1, -84
STW   0, base, -4

LDW   0, base, -28       i := b[2][4]
STW   0, base, -4

LDW   0, base, -4        i := a[a[i]]
MULI  0, 0, 4
ADD   0, base, 0
LDW   1, 0, -24
MULI  1, 1, 4
ADD   1, base, 1
LDW   0, 1, -24
STW   0, base, -4
Note that the validity of the index can be checked only if the index is a constant, that is, it is of known value. Otherwise the index cannot be checked until run time. Although the test is of course redundant in correct programs, its omission is not recommended. In order to safeguard the abstraction of the array structure the test is wholly justified. However, the compiler designer should attempt to achieve utmost efficiency. The test takes the form of the statement IF (k < 0) OR (k >= n) THEN HALT END where k is the index and n the array's length. For our virtual computer, we simply postulate a corresponding instruction. In other cases a suitable instruction sequence must be found, whereby the following must be considered: since in Oberon the lower array bound is always 0, a single comparison suffices if the index value is considered as an unsigned integer. This is so because negative values in
complement representation appear with a sign bit value 1, yielding an unsigned value larger than the highest (signed) integer value. Procedure Index is extended accordingly, generating a CHK instruction, which results in termination of the computation in case of an invalid index.

IF y.mode # Reg THEN load(y) END ;
Put(CHKI, y.r, 0, x.type.len);
Put(MULI, y.r, y.r, x.type.base.size);

Finally, an example of a program is shown with nested data structures. It demonstrates clearly how the special treatment of constants in selectors simplifies the code for address computations. Compare the code resulting for variables indexed by expressions with those indexed by constants. CHK instructions have been omitted for the sake of brevity.

PROCEDURE P2;
  TYPE R0 = RECORD x, y: INTEGER END ;
    R1 = RECORD
      u: INTEGER;                                  (* offset 0 *)
      v: ARRAY 4 OF R0;                            (* offset 4 *)
      w: INTEGER                                   (* offset 36 *)
    END ;
  VAR i, j, k: INTEGER;                            (* adr -4, -8, -12 *)
    s: ARRAY 2 OF R1;                              (* adr -92 *)
BEGIN
  k := s[i].u; k := s[1].w; k := s[i].v[j].x;
  k := s[1].v[2].y; s[0].v[i].y := k
END P2.

LDW   0, base, -4        k := s[i].u
MULI  0, 0, 40
ADD   0, base, 0
LDW   1, 0, -92
STW   1, base, -12

LDW   0, base, -16       k := s[1].w
STW   0, base, -12

LDW   0, base, -4        k := s[i].v[j].x
MULI  0, 0, 40
ADD   0, base, 0
LDW   1, base, -8
MULI  1, 1, 8
ADD   1, 0, 1
LDW   0, 1, -88
STW   0, base, -12

LDW   0, base, -28       k := s[1].v[2].y
STW   0, base, -12

LDW   0, base, -4        s[0].v[i].y := k
MULI  0, 0, 8
ADD   0, base, 0
LDW   1, base, -12
STW   1, 0, -84
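The constant addresses in this listing follow directly from the formula of Section 10.3. With size(R1) = 4 + 4*8 + 4 = 40 and adr(s) = -92, we obtain for instance adr(s[1].w) = -92 + 1*40 + 36 = -16 and adr(s[1].v[2].y) = -92 + 1*40 + 4 + 2*8 + 4 = -28, which are exactly the offsets appearing in the LDW instructions of the two constant-indexed assignments.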
A desire to keep target-dependent parts of the compiler separated from target-independent parts suggests that code generating statements should be collected in the form of procedures in a separate module. We shall call this module OSG and present its interface. It contains several of the generator procedures encountered so far. The others will be explained in Chapters 11 and 12.
DEFINITION OSG; IMPORT OSS, Texts, Fonts, Display; CONST Head = 0; Var = 1; Par = 2; Const = 3; Fld = 4; Typ = 5; Proc = 6; SProc = 7; Boolean = 0; Integer = 1; Array = 2; Record = 3; TYPE Object = POINTER TO ObjDesc; ObjDesc = RECORD class, lev: INTEGER; next, dsc: Object; type: Type; name: OSS.Ident; val: LONGINT; END ; Type = POINTER TO TypeDesc; TypeDesc = RECORD form: INTEGER; fields: Object; base: Type; size, len: INTEGER; END ; Item = RECORD mode, lev: INTEGER; type: Type; a: LONGINT; END ; VAR boolType, intType: Type; curlev, pc: INTEGER; PROCEDURE FixLink (L: LONGINT); PROCEDURE IncLevel (n: INTEGER); PROCEDURE MakeConstItem (VAR x: Item; typ: Type; val: LONGINT); PROCEDURE MakeItem (VAR x: Item; y: Object); PROCEDURE Field (VAR x: Item; y: Object); PROCEDURE Index (VAR x, y: Item); PROCEDURE Op1 (op: INTEGER; VAR x: Item); PROCEDURE Op2 (op: INTEGER; VAR x, y: Item); PROCEDURE Relation (op: INTEGER; VAR x, y: Item); PROCEDURE Store (VAR x, y: Item); PROCEDURE Parameter (VAR x: Item; ftyp: Type; class: INTEGER); PROCEDURE CJump (VAR x: Item); PROCEDURE BJump (L: LONGINT); PROCEDURE FJump (VAR L: LONGINT); PROCEDURE Call (VAR x: Item); PROCEDURE IOCall (VAR x, y: Item); PROCEDURE Header (size: LONGINT); PROCEDURE Enter (size: LONGINT); PROCEDURE Return (size: LONGINT); PROCEDURE Open; PROCEDURE Close (VAR S: Texts.Scanner; globals: LONGINT); END OSG.
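To illustrate how the parser uses this interface, consider the assignment x := x + 1 (a sketch only, following the parser of Appendix C; objX is assumed to denote the symbol-table entry of a global integer variable x, and x, y, z are Items local to the parser procedures). The parser issues roughly the following sequence of calls, and the generator emits a load, an immediate addition and a store.

OSG.MakeItem(x, objX);                  (* left-hand side: a Var item for x *)
OSG.MakeItem(y, objX);                  (* factor x within the expression *)
OSG.MakeConstItem(z, OSG.intType, 1);   (* factor 1: a Const item, no code emitted yet *)
OSG.Op2(OSS.plus, y, z);                (* loads x into a register, then adds 1 in immediate mode *)
OSG.Store(x, y)                         (* stores the register back into x *)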
10.4. Exercises 10.1. Improve the Oberon-0 compiler in such a way that multiplication and division instructions are replaced by efficient shift and mask instructions, if a factor or the divisor is a power of 2. 10.2. Improve the Oberon-0 compiler in such a way that the access code for array elements includes a test verifying that the index value lies within the range given by the array's declaration. 10.3. Had the assignment statement in Oberon been defined in a form where the assigned expression occurs to the left of the variable, that is for example by the form e =: v, would compilation of assignments be simpler in any way? 10.4. Consider the introduction of a multiple assignment in Oberon of the form e =: x0 =: x1 =: ... =: xn. Implement it. Does the definition of its semantics present any problems? 10.5. Change the definition of expressions in Oberon to that of Algol 60 (see Exercise 2.1) and implement the changes. Discuss the advantages and disadvantages of the two forms.
11. Conditional and Repeated Statements and Boolean Expressions

11.1. Comparisons and jumps

Conditional and repeated statements are implemented with the aid of jump instructions, also called branch instructions. As a first example, let us consider the simplest form of conditional statement:

IF x = y THEN StatSequence END

Its mapping into a sequence of instructions is straightforward:

IF x = y THEN StatSequence END          EQL  x, y
                                        BF   L
                                        code(StatSequence)
                                    L   ...

Our considerations are based once again on a stack architecture. Instruction EQL tests the two operands for equality and replaces them on the stack by the Boolean result. The subsequent branch instruction BF (branch if FALSE) leads to the destination label L if this result is FALSE, and removes it from the stack. Similarly to EQL, conditional branch instructions are postulated for the remaining relations #, <, <=, >= and >. Unfortunately, however, such compiler-friendly computers are hardly widespread. Rather more common are computers whose branch instructions depend on the comparison of a register value with 0. We denote them as BNE (branch if not equal), BLT (branch if less than), BGE (branch if greater or equal), BLE (branch if less or equal), and BGT (branch if greater than). The code sequence corresponding to the above example is

IF x = y THEN StatSequence END          code (Ri := x - y)
                                        BNE  Ri, L
                                        code(StatSequence)
                                    L   ...

The use of subtraction (x - y ≥ 0 standing for x ≥ y) has an implicit pitfall: subtraction may lead to overflow, resulting in either program termination or a wrong result. Therefore a specific comparison instruction CMP is used in place of subtraction, which avoids overflow, but correctly indicates whether the difference is zero, positive or negative. The result is typically stored in a special register called the condition code, consisting of the two bits denoted by N and Z, indicating whether the difference is negative or zero respectively. All conditional branch instructions then implicitly refer to this register as argument.

IF x = y THEN StatSequence END          CMP  x, y
                                        BNE  L
                                        code(StatSequence)
                                    L   ...
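For a relation other than equality the scheme is the same; only the branch condition is negated. For instance, with the instruction set postulated above, IF x < y THEN StatSequence END yields

CMP  x, y
BGE  L
code(StatSequence)
L    ...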
11.2. Conditional and repeated statements

The question of how a Boolean value is to be represented by an item now arises. In the case of a stack architecture the answer is easy: since the result of a comparison lies on the stack like any other result, no special item mode is necessary. A CMP instruction, however, requires further thought. We shall first restrict our consideration to the simple cases of pure comparisons without further Boolean operators. In the case of an architecture with a CMP scheme, it is necessary to indicate in the resulting item which register holds the computed difference, and which relation is represented by the comparison. For the latter a new attribute is required; we call the new mode Cond and its new attribute (record field) c. The mapping of relations to values of c is defined by

=     0
#     1
<     2
>=    3
<=    4
>     5
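This numbering lines up with the ordering of the branch instructions BEQ, BNE, BLT, BGE, BLE, BGT in the code generator of Appendix C, and the negation of a relation is obtained by flipping the least significant bit of c. For the comparison x < y, for example, c = 2; the conditional jump guarding the THEN branch is then emitted as BEQ + negated(2) = BGE, that is, the jump is taken exactly when the condition fails.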
The construct containing comparisons is the expression. Its syntax is expression = SimpleExpression [("=" | "#" | "=") SimpleExpression]. The corresponding, extended parser procedure is easily derived: PROCEDURE expression(VAR x: Item); VAR y: Item; op: INTEGER; BEGIN SimpleExpression(x); IF (sym >= eql) & (sym = times) & (sym = plus) & (sym 0 DO z := x+y; i := i-1 END Since x and y remain unchanged during the repetition, the sum need be computed only once. The compiler must pull the assignment to z out of the loop. The technical term for this feat is loop invariant code motion. In all the latter cases code can only be improved by selective analysis of context. But this is precisely what increases the effort during compilation very significantly. The compiler presented for Oberon-0 does not constitute a suitable basis for this kind of optimization. Related to the pulling out of constant expressions from loops is the technique of simplifying expressions by using the values computed in the previous repetition, that is, by considering recurrence relations. If, for example, the address of an array element is given by adr(a[i]) = k*i + a0, then adr(a[i+1]) = adr(a[i]) + k. This case is particularly frequent and therefore relevant. For instance, the addresses of the indexed variables in the statement FOR i := 0 TO N-1 DO a[i] := b[i] * c[i] END can be computed by a simple addition of a constant to their previous values. This optimization leads to significant reductions in computation time. A test with the following example of a matrix multiplication showed surprising results: FOR i := 0 TO 99 DO FOR j := 0 TO 99 DO FOR k := 0 TO 99 DO a[i, j] := a[i, j] + b[i, k] * c[k, j] END END END The use of registers instead of memory locations to hold index values and sums, and the elimination of index bound tests resulted in a speed increased by a factor of 1.5. The replacement of indexed addressing by linear progression of addresses as described above yielded a factor of 2.75. And the additional use of a combined multiplication and addition instruction to compute the scalar products increased the factor to 3.90. Unfortunately, not even consideration of simple context information suffices in this case. A sophisticated control and data flow analysis is required, as well as detection of the fact that in each repetition an index is incremented monotonically by 1.
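The effect of such a recurrence can be illustrated at the source level. The following small module is only an illustration and not part of the compiler; it assumes the Oakwood library module Out, and the base address a0 and element size k are taken from the example of Section 10.3. It compares the address of a[i] computed by multiplication with the value obtained by adding the element size in each repetition; both agree, but the second form needs one addition per step instead of a multiplication.

MODULE Recurrence;  (* illustrative sketch of adr(a[i+1]) = adr(a[i]) + k *)
  IMPORT Out;
  CONST a0 = -24;  (* assumed base address of a *)
    k = 4;         (* element size *)
    N = 10;
  VAR i, adrMul, adrInc: INTEGER;
BEGIN
  adrInc := a0;
  FOR i := 0 TO N-1 DO
    adrMul := a0 + k*i;        (* address by multiplication *)
    ASSERT(adrMul = adrInc);   (* the recurrence yields the same address *)
    Out.Int(adrInc, 6); Out.Ln;
    adrInc := adrInc + k       (* address by one addition per repetition *)
  END
END Recurrence.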
16.4. Register allocation The dominant theme in the subject of optimization is the use and allocation of processor registers. In the Oberon-0 compiler presented registers are used exclusively to hold anonymous intermediate results during the evaluation of expressions. For this purpose, usually a few registers suffice. Modern processors, however, feature a significant number of registers with access times considerably shorter than that of main memory. Using them for intermediate results only would imply a poor utilization of the most valuable resources. A primary goal of good code optimization is the most effective use of registers in order to reduce the number of accesses to the relatively slow main memory. A good strategy of register usage yields more advantages than any other branch of optimization. A widespread technique is register allocation using graph colouring. For every value occurring in a computation, that is, for every expression the point of its generation and the point of its last use are determined. They delimit its range of relevance. Obviously, values of different expressions can be stored in the same register, if and only if their ranges do not overlap. The ranges are represented by the nodes of a graph, in which an edge between two nodes signifies that the two ranges overlap. The allocation of N available registers to the occurring values may then be understood as the colouring of the graph with N colours in such a way that neighbouring nodes always have different colours. This implies that values with overlapping ranges are always allocated to different registers. Furthermore, selected, scalar, local variables are no longer allocated in memory at all, but rather in dedicated registers. In order to approach an optimal register utilization, sophisticated algorithms are employed to determine which variables are accessed most frequently. Evidently, the necessary
bookkeeping about variable accesses grows, and thereby compilation speed suffers. Also, care has to be taken that register values are saved in memory before procedure calls and are restored after the procedure return. The lurking danger is that the effort necessary for this task surpasses the savings obtained. In many compilers, local variables are allocated to registers only in procedures which do not contain any calls themselves (leaf procedures), and which therefore are also called most frequently, as they constitute the leaves in the tree representing the procedure call hierarchy. A detailed treatment of all these optimization problems is beyond the scope of an introductory text about compiler construction. The above outline shall therefore suffice. In any case such techniques make it clear that for a nearly optimal code generation significantly more information about context must be considered than is the case in our relatively simple Oberon-0 compiler. Its structure is not well-suited to achieving a high degree of optimization. But it serves excellently as a fast compiler producing quite acceptable, although not optimal code, as is appropriate in the development phase of a system, particularly for educational purposes. Section 16.5 indicates another, somewhat more complex compiler structure which is better suited for the incorporation of optimization algorithms.
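Before turning to that structure, the essential idea of allocation by graph colouring can be made concrete with a few lines of Oberon. The following module is a deliberately naive sketch and is not taken from any compiler: ranges of relevance are given as intervals, two ranges interfere if they overlap, and registers (colours) are assigned greedily in sequence. A real allocator orders the nodes far more carefully and handles spilling and procedure calls.

MODULE ColourSketch;  (* greedy register assignment over an interference relation; illustrative only *)
  IMPORT Out;
  CONST NofVals = 6; NofRegs = 3;
  VAR i: INTEGER;
    from, to: ARRAY NofVals OF INTEGER;   (* range of relevance of each value *)
    colour: ARRAY NofVals OF INTEGER;     (* assigned register; -1 means spill *)

  PROCEDURE Overlap(i, j: INTEGER): BOOLEAN;
  BEGIN RETURN (from[i] < to[j]) & (from[j] < to[i])
  END Overlap;

  PROCEDURE Allocate;
    VAR i, j, r: INTEGER; used: SET;
  BEGIN
    FOR i := 0 TO NofVals-1 DO
      used := {};
      FOR j := 0 TO i-1 DO   (* colours of interfering, already coloured neighbours *)
        IF Overlap(i, j) & (colour[j] >= 0) THEN INCL(used, colour[j]) END
      END ;
      r := 0;
      WHILE (r < NofRegs) & (r IN used) DO INC(r) END ;
      IF r < NofRegs THEN colour[i] := r ELSE colour[i] := -1 END
    END
  END Allocate;

BEGIN
  (* ranges, e.g. determined from the points of generation and last use of six values *)
  from[0] := 0; to[0] := 4;   from[1] := 1; to[1] := 3;   from[2] := 2; to[2] := 6;
  from[3] := 4; to[3] := 7;   from[4] := 5; to[4] := 8;   from[5] := 6; to[5] := 9;
  Allocate;
  FOR i := 0 TO NofVals-1 DO
    Out.Int(i, 3); Out.String(" -> R"); Out.Int(colour[i], 2); Out.Ln
  END
END ColourSketch.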
16.5. The frontend/backend compiler structure

The most significant characteristic of the compiler developed in Chapters 7 - 12 is that the source text is read exactly once. Code is thereby generated on the fly. At each point, information about the operands is restricted to the items denoting the operands and to the symbol table representing declarations. The so-called frontend/backend compiler structure, which was briefly mentioned in Chapter 1, deviates decisively in this respect. The frontend part also reads the source text once only, but instead of generating code it builds a data structure representing the program in a form suitably organized for further processing. All information contained in statements is mapped into this data structure. It is called a syntax tree, because it also mirrors the syntactic structure of the text. Somewhat oversimplifying the situation, we may say that the frontend compiles declarations into the symbol table and statements into the syntax tree. These two data structures constitute the interface to the backend part, whose task is code generation. The syntax tree allows fast access to practically all parts of a program, and it represents the program in a preprocessed form. The resulting compilation process is shown in Figure 16.1.
Figure 16.1. Compiler consisting of front end and back end We pointed out one significant advantage of this structure in Chapter 1: the partitioning of a compiler in a target-independent front end and a target-dependent back end. In the following, we focus on the interface between the two parts, namely the structure of the syntax tree. Furthermore, we show how the tree is generated.
Exactly as in a source program where statements refer to declarations, so does the syntax tree refer to entries in the symbol table. This gives rise to the understandable desire to declare the elements of the symbol table (objects) in such a fashion that it is possible to refer to them from the symbol table itself as well as from the syntax tree. As basic type we introduce the type Object which may assume different forms as appropriate to represent constants, variables, types, and procedures. Only the attribute type is common to all. Here and subsequently we make use of Oberon's feature called type extension (Reiser and Wirth, 1992).

Object     = POINTER TO ObjDesc;
ObjDesc    = RECORD type: Type END ;
ConstDesc  = RECORD (ObjDesc) value: LONGINT END ;
VarDesc    = RECORD (ObjDesc) adr, level: LONGINT END ;
The symbol table consists of lists of elements, one for each scope (see Section 8.2). The elements consist of the name (identifier) and a reference to the identified object. Ident IdentDesc
= =
POINTER TO IdentDesc; RECORD name: ARRAY 32 OF CHAR; obj: Object; next: Ident END ; POINTER TO ScopeDesc; RECORD first: Ident; dsc: Scope END ;
Scope = ScopeDesc =
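How these structures are used can be indicated by two small procedures. They are only a sketch; neither their names nor their exact form are taken from the compiler. openScope pushes a new, initially empty scope, and insert creates an identifier entry referring to a given object in the current scope.

VAR topScope: Scope;

PROCEDURE openScope;
  VAR s: Scope;
BEGIN NEW(s); s.first := NIL; s.dsc := topScope; topScope := s
END openScope;

PROCEDURE insert(name: ARRAY OF CHAR; obj: Object);
  VAR id: Ident;
BEGIN NEW(id); COPY(name, id.name); id.obj := obj;
  id.next := topScope.first; topScope.first := id   (* prepend to the list of the current scope *)
END insert;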
The syntax tree is best conceived as a binary tree. We call its elements Nodes. If a syntactic construct has the form of a list, it is represented as a degenerate tree in which the last element has an empty branch.

Node = POINTER TO NodeDesc;
NodeDesc = RECORD (ObjDesc) op: INTEGER; left, right: Object END

Let us consider the following brief excerpt of a program text as an example:

VAR x, y, z: INTEGER;
BEGIN z := x + y - 5; ...

The front end parses the source text and builds the symbol table and the syntax tree as shown in Figure 16.2. Representations of data types are omitted.
Figure 16.2. Symbol table (below) and syntax tree (above). Representations of procedure calls, the IF and WHILE statements and the statement sequence are shown in Figures 16.3 - 16.5.
Figure 16.3. Procedure call.
Figure 16.4. IF and WHILE statements.
Figure 16.5. Statement sequence.

To conclude, the following examples demonstrate how the described data structures are generated. The reader should compare these compiler excerpts with the corresponding procedures of the Oberon-0 compiler listed in Appendix C. All subsequent algorithms make use of the auxiliary procedure New, which generates a new node.

PROCEDURE New(op: INTEGER; x, y: Object): Node;
  VAR z: Node;
BEGIN NEW(z); z.op := op; z.left := x; z.right := y; RETURN z
END New;

PROCEDURE factor(): Object;
  VAR x: Object; c: Constant;
BEGIN
  IF sym = ident THEN x := This(name); Get(sym); x := selector(x)
  ELSIF sym = number THEN NEW(c); c.value := number; Get(sym); x := c
  ELSIF sym = lparen THEN Get(sym); x := expression();
    IF sym = rparen THEN Get(sym) ELSE Mark(22) END
  ELSIF sym = not THEN Get(sym); x := New(not, NIL, factor())
  ELSE ...
  END ;
  RETURN x
END factor;
PROCEDURE term(): Object; VAR op: INTEGER; x: Object; BEGIN x := factor(); WHILE (sym >= times) & (sym ": Texts.Read(R, ch); IF ch = "=" THEN Texts.Read(R, ch); sym := geq ELSE sym := gtr END | ";": Texts.Read(R, ch); sym := semicolon | ",": Texts.Read(R, ch); sym := comma | ":": Texts.Read(R, ch); IF ch = "=" THEN Texts.Read(R, ch); sym := becomes ELSE sym := colon END | ".": Texts.Read(R, ch); sym := period | "(": Texts.Read(R, ch); IF ch = "*" THEN comment; Get(sym) ELSE sym := lparen END | ")": Texts.Read(R, ch); sym := rparen | "[": Texts.Read(R, ch); sym := lbrak | "]": Texts.Read(R, ch); sym := rbrak | "0".."9": Number; | "A" .. "Z", "a".."z": Ident | "~": Texts.Read(R, ch); sym := not ELSE Texts.Read(R, ch); sym := null END END END Get; PROCEDURE Init*(T: Texts.Text; pos: LONGINT); BEGIN error := FALSE; errpos := pos; Texts.OpenReader(R, T, pos); Texts.Read(R, ch) END Init; PROCEDURE EnterKW(sym: INTEGER; name: ARRAY OF CHAR); BEGIN keyTab[nkw].sym := sym; COPY(name, keyTab[nkw].id); INC(nkw) END EnterKW; BEGIN Texts.OpenWriter(W); error := TRUE; nkw := 0; EnterKW(null, "BY"); EnterKW(do, "DO"); EnterKW(if, "IF"); EnterKW(null, "IN"); EnterKW(null, "IS"); EnterKW(of, "OF"); EnterKW(or, "OR"); EnterKW(null, "TO"); EnterKW(end, "END"); EnterKW(null, "FOR"); EnterKW(mod, "MOD"); EnterKW(null, "NIL"); EnterKW(var, "VAR"); EnterKW(null, "CASE"); EnterKW(else, "ELSE"); EnterKW(null, "EXIT"); EnterKW(then, "THEN"); EnterKW(type, "TYPE"); EnterKW(null, "WITH"); EnterKW(array, "ARRAY"); EnterKW(begin, "BEGIN"); EnterKW(const, "CONST"); EnterKW(elsif, "ELSIF"); EnterKW(null, "IMPORT"); EnterKW(null, "UNTIL"); EnterKW(while, "WHILE"); EnterKW(record, "RECORD");
EnterKW(null, "REPEAT"); EnterKW(null, "RETURN"); EnterKW(null, "POINTER"); EnterKW(procedure, "PROCEDURE"); EnterKW(div, "DIV"); EnterKW(null, "LOOP"); EnterKW(module, "MODULE"); END OSS.
C.2. The parser MODULE OSP; (* NW 23.9.93 / 9.2.95*) IMPORT Viewers, Texts, Oberon, MenuViewers, TextFrames, OSS, OSG; CONST WordSize = 4; VAR sym: INTEGER; loaded: BOOLEAN; topScope, universe: OSG.Object; (* linked lists, end with guard *) guard: OSG.Object; W: Texts.Writer; PROCEDURE NewObj(VAR obj: OSG.Object; class: INTEGER); VAR new, x: OSG.Object; BEGIN x := topScope; guard.name := OSS.id; WHILE x.next.name # OSS.id DO x := x.next END ; IF x.next = guard THEN NEW(new); new.name := OSS.id; new.class := class; new.next := guard; x.next := new; obj := new ELSE obj := x.next; OSS.Mark("mult def") END END NewObj; PROCEDURE find(VAR obj: OSG.Object); VAR s, x: OSG.Object; BEGIN s := topScope; guard.name := OSS.id; LOOP x := s.next; WHILE x.name # OSS.id DO x := x.next END ; IF x # guard THEN obj := x; EXIT END ; IF s = universe THEN obj := x; OSS.Mark("undef"); EXIT END ; s := s.dsc END END find; PROCEDURE FindField(VAR obj: OSG.Object; list: OSG.Object); BEGIN guard.name := OSS.id; WHILE list.name # OSS.id DO list := list.next END ; obj := list END FindField; PROCEDURE IsParam(obj: OSG.Object): BOOLEAN; BEGIN RETURN (obj.class = OSG.Par) OR (obj.class = OSG.Var) & (obj.val > 0) END IsParam; PROCEDURE OpenScope; VAR s: OSG.Object; BEGIN NEW(s); s.class := OSG.Head; s.dsc := topScope; s.next := guard; topScope := s END OpenScope;
PROCEDURE CloseScope; BEGIN topScope := topScope.dsc END CloseScope; (* -------------------- Parser ---------------------*) PROCEDURE^ expression(VAR x: OSG.Item); PROCEDURE selector(VAR x: OSG.Item); VAR y: OSG.Item; obj: OSG.Object; BEGIN WHILE (sym = OSS.lbrak) OR (sym = OSS.period) DO IF sym = OSS.lbrak THEN OSS.Get(sym); expression(y); IF x.type.form = OSG.Array THEN OSG.Index(x, y) ELSE OSS.Mark("not an array") END ; IF sym = OSS.rbrak THEN OSS.Get(sym) ELSE OSS.Mark("]?") END ELSE OSS.Get(sym); IF sym = OSS.ident THEN IF x.type.form = OSG.Record THEN FindField(obj, x.type.fields); OSS.Get(sym); IF obj # guard THEN OSG.Field(x, obj) ELSE OSS.Mark("undef") END ELSE OSS.Mark("not a record") END ELSE OSS.Mark("ident?") END END END END selector; PROCEDURE factor(VAR x: OSG.Item); VAR obj: OSG.Object; BEGIN (*sync*) IF sym < OSS.lparen THEN OSS.Mark("ident?"); REPEAT OSS.Get(sym) UNTIL sym >= OSS.lparen END ; IF sym = OSS.ident THEN find(obj); OSS.Get(sym); OSG.MakeItem(x, obj); selector(x) ELSIF sym = OSS.number THEN OSG.MakeConstItem(x, OSG.intType, OSS.val); OSS.Get(sym) ELSIF sym = OSS.lparen THEN OSS.Get(sym); expression(x); IF sym = OSS.rparen THEN OSS.Get(sym) ELSE OSS.Mark(")?") END ELSIF sym = OSS.not THEN OSS.Get(sym); factor(x); OSG.Op1(OSS.not, x) ELSE OSS.Mark("factor?"); OSG.MakeItem(x, guard) END END factor; PROCEDURE term(VAR x: OSG.Item); VAR y: OSG.Item; op: INTEGER; BEGIN factor(x); WHILE (sym >= OSS.times) & (sym = OSS.plus) & (sym = OSS.eql) & (sym = OSS.ident END ; IF sym = OSS.ident THEN find(obj); OSS.Get(sym); OSG.MakeItem(x, obj); selector(x); IF sym = OSS.becomes THEN OSS.Get(sym); expression(y); OSG.Store(x, y) ELSIF sym = OSS.eql THEN OSS.Mark(":= ?"); OSS.Get(sym); expression(y) ELSIF x.mode = OSG.Proc THEN par := obj.dsc; IF sym = OSS.lparen THEN OSS.Get(sym); IF sym = OSS.rparen THEN OSS.Get(sym) ELSE LOOP parameter(par); IF sym = OSS.comma THEN OSS.Get(sym)
ELSIF sym = OSS.rparen THEN OSS.Get(sym); EXIT ELSIF sym >= OSS.semicolon THEN EXIT ELSE OSS.Mark(") or , ?") END END END END ; IF obj.val < 0 THEN OSS.Mark("forward call") ELSIF ~IsParam(par) THEN OSG.Call(x) ELSE OSS.Mark("too few parameters") END ELSIF x.mode = OSG.SProc THEN IF obj.val = OSS.semicolon) & (sym < OSS.if) OR (sym >= OSS.array) THEN EXIT ELSE OSS.Mark("; ?") END END END StatSequence; PROCEDURE IdentList(class: INTEGER; VAR first: OSG.Object); VAR obj: OSG.Object; BEGIN IF sym = OSS.ident THEN NewObj(first, class); OSS.Get(sym); WHILE sym = OSS.comma DO OSS.Get(sym); IF sym = OSS.ident THEN NewObj(obj, class); OSS.Get(sym) ELSE OSS.Mark("ident?") END END;
IF sym = OSS.colon THEN OSS.Get(sym) ELSE OSS.Mark(":?") END END END IdentList; PROCEDURE Type(VAR type: OSG.Type); VAR obj, first: OSG.Object; x: OSG.Item; tp: OSG.Type; BEGIN type := OSG.intType; (*sync*) IF (sym # OSS.ident) & (sym < OSS.array) THEN OSS.Mark("type?"); REPEAT OSS.Get(sym) UNTIL (sym = OSS.ident) OR (sym >= OSS.array) END ; IF sym = OSS.ident THEN find(obj); OSS.Get(sym); IF obj.class = OSG.Typ THEN type := obj.type ELSE OSS.Mark("type?") END ELSIF sym = OSS.array THEN OSS.Get(sym); expression(x); IF (x.mode # OSG.Const) OR (x.a < 0) THEN OSS.Mark("bad index") END ; IF sym = OSS.of THEN OSS.Get(sym) ELSE OSS.Mark("OF?") END ; Type(tp); NEW(type); type.form := OSG.Array; type.base := tp; type.len := SHORT(x.a); type.size := type.len * tp.size ELSIF sym = OSS.record THEN OSS.Get(sym); NEW(type); type.form := OSG.Record; type.size := 0; OpenScope; LOOP IF sym = OSS.ident THEN IdentList(OSG.Fld, first); Type(tp); obj := first; WHILE obj # guard DO obj.type := tp; obj.val := type.size; INC(type.size, obj.type.size); obj := obj.next END END ; IF sym = OSS.semicolon THEN OSS.Get(sym) ELSIF sym = OSS.ident THEN OSS.Mark("; ?") ELSE EXIT END END ; type.fields := topScope.next; CloseScope; IF sym = OSS.end THEN OSS.Get(sym) ELSE OSS.Mark("END?") END ELSE OSS.Mark("ident?") END END Type; PROCEDURE declarations(VAR varsize: LONGINT); VAR obj, first: OSG.Object; x: OSG.Item; tp: OSG.Type; L: LONGINT; BEGIN (*sync*) IF (sym < OSS.const) & (sym # OSS.end) THEN OSS.Mark("declaration?"); REPEAT OSS.Get(sym) UNTIL (sym >= OSS.const) OR (sym = OSS.end) END ; LOOP IF sym = OSS.const THEN OSS.Get(sym); WHILE sym = OSS.ident DO NewObj(obj, OSG.Const); OSS.Get(sym); IF sym = OSS.eql THEN OSS.Get(sym) ELSE OSS.Mark("=?") END; expression(x); IF x.mode = OSG.Const THEN obj.val := x.a; obj.type := x.type ELSE OSS.Mark("expression not constant") END;
IF sym = OSS.semicolon THEN OSS.Get(sym) ELSE OSS.Mark(";?") END END END ; IF sym = OSS.type THEN OSS.Get(sym); WHILE sym = OSS.ident DO NewObj(obj, OSG.Typ); OSS.Get(sym); IF sym = OSS.eql THEN OSS.Get(sym) ELSE OSS.Mark("=?") END ; Type(obj.type); IF sym = OSS.semicolon THEN OSS.Get(sym) ELSE OSS.Mark(";?") END END END ; IF sym = OSS.var THEN OSS.Get(sym); WHILE sym = OSS.ident DO IdentList(OSG.Var, first); Type(tp); obj := first; WHILE obj # guard DO obj.type := tp; obj.lev := OSG.curlev; varsize := varsize + obj.type.size; obj.val := -varsize; obj := obj.next END ; IF sym = OSS.semicolon THEN OSS.Get(sym) ELSE OSS.Mark("; ?") END END END ; IF (sym >= OSS.const) & (sym = OSG.Array THEN OSS.Mark("no struct params") END ; ELSE parsize := WordSize END ; obj := first; WHILE obj # guard DO obj.type := tp; INC(parblksize, parsize); obj := obj.next END END FPSection; BEGIN (* ProcedureDecl *)
OSS.Get(sym); IF sym = OSS.ident THEN procid := OSS.id; NewObj(proc, OSG.Proc); OSS.Get(sym); parblksize := marksize; OSG.IncLevel(1); OpenScope; proc.val := -1; IF sym = OSS.lparen THEN OSS.Get(sym); IF sym = OSS.rparen THEN OSS.Get(sym) ELSE FPSection; WHILE sym = OSS.semicolon DO OSS.Get(sym); FPSection END ; IF sym = OSS.rparen THEN OSS.Get(sym) ELSE OSS.Mark(")?") END END ELSIF OSG.curlev = 1 THEN OSG.EnterCmd(procid) END ; obj := topScope.next; locblksize := parblksize; WHILE obj # guard DO obj.lev := OSG.curlev; IF obj.class = OSG.Par THEN DEC(locblksize, WordSize) ELSE obj.val := locblksize; obj := obj.next END END ; proc.dsc := topScope.next; IF sym = OSS.semicolon THEN OSS.Get(sym) ELSE OSS.Mark(";?") END; locblksize := 0; declarations(locblksize); WHILE sym = OSS.procedure DO ProcedureDecl; IF sym = OSS.semicolon THEN OSS.Get(sym) ELSE OSS.Mark(";?") END END ; proc.val := OSG.pc; OSG.Enter(locblksize); IF sym = OSS.begin THEN OSS.Get(sym); StatSequence END ; IF sym = OSS.end THEN OSS.Get(sym) ELSE OSS.Mark("END?") END ; IF sym = OSS.ident THEN IF procid # OSS.id THEN OSS.Mark("no match") END ; OSS.Get(sym) END ; OSG.Return(parblksize - marksize); CloseScope; OSG.IncLevel(-1) END END ProcedureDecl; PROCEDURE Module(VAR S: Texts.Scanner); VAR modid: OSS.Ident; varsize: LONGINT; BEGIN Texts.WriteString(W, " compiling "); IF sym = OSS.module THEN OSS.Get(sym); OSG.Open; OpenScope; varsize := 0; IF sym = OSS.ident THEN modid := OSS.id; OSS.Get(sym); Texts.WriteString(W, modid); Texts.WriteLn(W); Texts.Append(Oberon.Log, W.buf) ELSE OSS.Mark("ident?") END ; IF sym = OSS.semicolon THEN OSS.Get(sym) ELSE OSS.Mark(";?") END; declarations(varsize); WHILE sym = OSS.procedure DO ProcedureDecl; IF sym = OSS.semicolon THEN OSS.Get(sym) ELSE OSS.Mark(";?") END END ; OSG.Header(varsize);
IF sym = OSS.begin THEN OSS.Get(sym); StatSequence END ; IF sym = OSS.end THEN OSS.Get(sym) ELSE OSS.Mark("END?") END ; IF sym = OSS.ident THEN IF modid # OSS.id THEN OSS.Mark("no match") END ; OSS.Get(sym) ELSE OSS.Mark("ident?") END ; IF sym # OSS.period THEN OSS.Mark(". ?") END ; CloseScope; IF ~OSS.error THEN COPY(modid, S.s); OSG.Close(S, varsize); Texts.WriteString(W, "code generated"); Texts.WriteInt(W, OSG.pc, 6); Texts.WriteLn(W); Texts.Append(Oberon.Log, W.buf) END ELSE OSS.Mark("MODULE?") END END Module; PROCEDURE Compile*; VAR beg, end, time: LONGINT; S: Texts.Scanner; T: Texts.Text; v: Viewers.Viewer; BEGIN loaded := FALSE; Texts.OpenScanner(S, Oberon.Par.text, Oberon.Par.pos); Texts.Scan(S); IF S.class = Texts.Char THEN IF S.c = "*" THEN v := Oberon.MarkedViewer(); IF (v.dsc # NIL) & (v.dsc.next IS TextFrames.Frame) THEN OSS.Init(v.dsc.next(TextFrames.Frame).text, 0); OSS.Get(sym); Module(S) END ELSIF S.c = "@" THEN Oberon.GetSelection(T, beg, end, time); IF time >= 0 THEN OSS.Init(T, beg); OSS.Get(sym); Module(S) END END ELSIF S.class = Texts.Name THEN NEW(T); Texts.Open(T, S.s); OSS.Init(T, 0); OSS.Get(sym); Module(S) END END Compile; PROCEDURE Decode*; VAR V: MenuViewers.Viewer; T: Texts.Text; X, Y: INTEGER; BEGIN T := TextFrames.Text(""); Oberon.AllocateSystemViewer(Oberon.Par.frame.X, X, Y); V := MenuViewers.New( TextFrames.NewMenu("Log.Text", "System.Close System.Copy System.Grow Edit.Search Edit.Store"), TextFrames.NewText(T, 0), TextFrames.menuH, X, Y); OSG.Decode(T) END Decode; PROCEDURE Load*; VAR S: Texts.Scanner; BEGIN IF ~OSS.error & ~loaded THEN Texts.OpenScanner(S, Oberon.Par.text, Oberon.Par.pos); OSG.Load(S); loaded := TRUE END END Load; PROCEDURE Exec*;
VAR S: Texts.Scanner; BEGIN IF loaded THEN Texts.OpenScanner(S, Oberon.Par.text, Oberon.Par.pos); Texts.Scan(S); IF S.class = Texts.Name THEN OSG.Exec(S) END END END Exec; PROCEDURE enter(cl: INTEGER; n: LONGINT; name: OSS.Ident; type: OSG.Type); VAR obj: OSG.Object; BEGIN NEW(obj); obj.class := cl; obj.val := n; obj.name := name; obj.type := type; obj.dsc := NIL; obj.next := topScope.next; topScope.next := obj END enter; BEGIN Texts.OpenWriter(W); Texts.WriteString(W, "Oberon0 Compiler 9.2.95"); Texts.WriteLn(W); Texts.Append(Oberon.Log, W.buf); NEW(guard); guard.class := OSG.Var; guard.type := OSG.intType; guard.val := 0; topScope := NIL; OpenScope; enter(OSG.Typ, 1, "BOOLEAN", OSG.boolType); enter(OSG.Typ, 2, "INTEGER", OSG.intType); enter(OSG.Const, 1, "TRUE", OSG.boolType); enter(OSG.Const, 0, "FALSE", OSG.boolType); enter(OSG.SProc, 1, "Read", NIL); enter(OSG.SProc, 2, "Write", NIL); enter(OSG.SProc, 3, "WriteHex", NIL); enter(OSG.SProc, 4, "WriteLn", NIL); universe := topScope END OSP.
C.3. The code generator MODULE OSG; (* NW 18.12.94 / 10.2.95 / 24.3.96 / 25.11.05*) IMPORT Oberon, Texts, OSS, RISC; CONST maxCode = 1000; maxRel = 200; NofCom = 16; (* class / mode*) Head* = 0; Var* = 1; Par* = 2; Const* = 3; Fld* = 4; Typ* = 5; Proc* = 6; SProc* = 7; Reg = 10; Cond = 11; (* form *) Boolean* = 0; Integer* = 1; Array* = 2; Record* = 3; MOV = 0; MVN = 1; ADD = 2; SUB = 3; MUL = 4; Div = 5; Mod = 6; CMP = 7; MOVI = 16; MVNI = 17; ADDI = 18; SUBI = 19; MULI = 20; DIVI = 21; MODI = 22; CMPI = 23; CHKI = 24; LDW = 32; LDB = 33; POP = 34; STW = 36; STB = 37; PSH = 38; RD = 40; WRD= 41; WRH = 42; WRL = 43; BEQ = 48; BNE = 49; BLT = 50; BGE = 51; BLE = 52; BGT = 53; BR = 56; BSR = 57; RET = 58; FP = 12; SP = 13; LNK = 14; PC = 15; (*reserved registers*) TYPE Object* = POINTER TO ObjDesc; Type* = POINTER TO TypeDesc; Item* = RECORD mode*, lev*: INTEGER;
type*: Type; a*, b, c, r: LONGINT; END ; ObjDesc*= RECORD class*, lev*: INTEGER; next*, dsc*: Object; type*: Type; name*: OSS.Ident; val*: LONGINT END ; TypeDesc* = RECORD form*: INTEGER; fields*: Object; base*: Type; size*, len*: INTEGER END ; VAR boolType*, intType*: Type; curlev*, pc*: INTEGER; cno: INTEGER; entry, fixlist: LONGINT; regs: SET; (* used registers *) W: Texts.Writer; code: ARRAY maxCode OF LONGINT; comname: ARRAY NofCom OF OSS.Ident; (*commands*) comadr: ARRAY NofCom OF LONGINT; mnemo: ARRAY 64, 5 OF CHAR; (*for decoder*) PROCEDURE GetReg(VAR r: LONGINT); VAR i: INTEGER; BEGIN i := 0; WHILE (i < FP) & (i IN regs) DO INC(i) END ; INCL(regs, i); r := i END GetReg; PROCEDURE Put(op, a, b, c: LONGINT); BEGIN (*emit instruction*) IF op >= 32 THEN DEC(op, 64) END ; code[pc] := ASH(ASH(ASH(op, 4) + a, 4) + b, 18) + (c MOD 40000H); INC(pc) END Put; PROCEDURE PutBR(op, disp: LONGINT); BEGIN (*emit branch instruction*) code[pc] := ASH(op-40H, 26) + (disp MOD 4000000H); INC(pc) END PutBR; PROCEDURE TestRange(x: LONGINT); BEGIN (*18-bit entity*) IF (x >= 20000H) OR (x < -20000H) THEN OSS.Mark("value too large") END END TestRange; PROCEDURE load(VAR x: Item); VAR r: LONGINT;
BEGIN (*x.mode # Reg*) IF x.mode = Var THEN IF x.lev = 0 THEN x.a := x.a - pc*4 END ; GetReg(r); Put(LDW, r, x.r, x.a); EXCL(regs, x.r); x.r := r ELSIF x.mode = Const THEN TestRange(x.a); GetReg(x.r); Put(MOVI, x.r, 0, x.a) END ; x.mode := Reg END load; PROCEDURE loadBool(VAR x: Item); BEGIN IF x.type.form # Boolean THEN OSS.Mark("Boolean?") END ; load(x); x.mode := Cond; x.a := 0; x.b := 0; x.c := 1 END loadBool; PROCEDURE PutOp(cd: LONGINT; VAR x, y: Item); BEGIN IF x.mode # Reg THEN load(x) END ; IF y.mode = Const THEN TestRange(y.a); Put(cd+16, x.r, x.r, y.a) ELSE IF y.mode # Reg THEN load(y) END ; Put(cd, x.r, x.r, y.r); EXCL(regs, y.r) END END PutOp; PROCEDURE negated(cond: LONGINT): LONGINT; BEGIN IF ODD(cond) THEN RETURN cond-1 ELSE RETURN cond+1 END END negated; PROCEDURE merged(L0, L1: LONGINT): LONGINT; VAR L2, L3: LONGINT; BEGIN IF L0 # 0 THEN L2 := L0; LOOP L3 := code[L2] MOD 40000H; IF L3 = 0 THEN EXIT END ; L2 := L3 END ; code[L2] := code[L2] - L3 + L1; RETURN L0 ELSE RETURN L1 END END merged; PROCEDURE fix(at, with: LONGINT); BEGIN code[at] := code[at] DIV 400000H * 400000H + (with MOD 400000H) END fix; PROCEDURE FixWith(L0, L1: LONGINT); VAR L2: LONGINT; BEGIN WHILE L0 # 0 DO L2 := code[L0] MOD 40000H; fix(L0, L1-L0); L0 := L2 END END FixWith; PROCEDURE FixLink*(L: LONGINT);
VAR L1: LONGINT; BEGIN WHILE L # 0 DO L1 := code[L] MOD 40000H; fix(L, pc-L); L := L1 END END FixLink; (*-----------------------------------------------*) PROCEDURE IncLevel*(n: INTEGER); BEGIN INC(curlev, n) END IncLevel; PROCEDURE MakeConstItem*(VAR x: Item; typ: Type; val: LONGINT); BEGIN x.mode := Const; x.type := typ; x.a := val END MakeConstItem; PROCEDURE MakeItem*(VAR x: Item; y: Object); VAR r: LONGINT; BEGIN x.mode := y.class; x.type := y.type; x.lev := y.lev; x.a := y.val; x.b := 0; IF y.lev = 0 THEN x.r := PC ELSIF y.lev = curlev THEN x.r := FP ELSE OSS.Mark("level!"); x.r := 0 END ; IF y.class = Par THEN GetReg(r); Put(LDW, r, x.r, x.a); x.mode := Var; x.r := r; x.a := 0 END END MakeItem; PROCEDURE Field*(VAR x: Item; y: Object); (* x := x.y *) BEGIN INC(x.a, y.val); x.type := y.type END Field; PROCEDURE Index*(VAR x, y: Item); (* x := x[y] *) BEGIN IF y.type # intType THEN OSS.Mark("index not integer") END ; IF y.mode = Const THEN IF (y.a < 0) OR (y.a >= x.type.len) THEN OSS.Mark("bad index") END ; INC(x.a, y.a * x.type.base.size) ELSE IF y.mode # Reg THEN load(y) END ; Put(CHKI, y.r, 0, x.type.len); Put(MULI, y.r, y.r, x.type.base.size); Put(ADD, y.r, x.r, y.r); EXCL(regs, x.r); x.r := y.r END; x.type := x.type.base END Index; PROCEDURE Op1*(op: INTEGER; VAR x: Item); (* x := op x *) VAR t: LONGINT; BEGIN IF op = OSS.minus THEN IF x.type.form # Integer THEN OSS.Mark("bad type") ELSIF x.mode = Const THEN x.a := -x.a ELSE IF x.mode = Var THEN load(x) END ; Put(MVN, x.r, 0, x.r) END ELSIF op = OSS.not THEN IF x.mode # Cond THEN loadBool(x) END ;
x.c := negated(x.c); t := x.a; x.a := x.b; x.b := t ELSIF op = OSS.and THEN IF x.mode # Cond THEN loadBool(x) END ; PutBR(BEQ + negated(x.c), x.a); EXCL(regs, x.r); x.a := pc-1; FixLink(x.b); x.b := 0 ELSIF op = OSS.or THEN IF x.mode # Cond THEN loadBool(x) END ; PutBR(BEQ + x.c, x.b); EXCL(regs, x.r); x.b := pc-1; FixLink(x.a); x.a := 0 END END Op1; PROCEDURE Op2*(op: INTEGER; VAR x, y: Item); (* x := x op y *) BEGIN IF (x.type.form = Integer) & (y.type.form = Integer) THEN IF (x.mode = Const) & (y.mode = Const) THEN (*overflow checks missing*) IF op = OSS.plus THEN INC(x.a, y.a) ELSIF op = OSS.minus THEN DEC(x.a, y.a) ELSIF op = OSS.times THEN x.a := x.a * y.a ELSIF op = OSS.div THEN x.a := x.a DIV y.a ELSIF op = OSS.mod THEN x.a := x.a MOD y.a ELSE OSS.Mark("bad type") END ELSE IF op = OSS.plus THEN PutOp(ADD, x, y) ELSIF op = OSS.minus THEN PutOp(SUB, x, y) ELSIF op = OSS.times THEN PutOp(MUL, x, y) ELSIF op = OSS.div THEN PutOp(Div, x, y) ELSIF op = OSS.mod THEN PutOp(Mod, x, y) ELSE OSS.Mark("bad type") END END ELSIF (x.type.form = Boolean) & (y.type.form = Boolean) THEN IF y.mode # Cond THEN loadBool(y) END ; IF op = OSS.or THEN x.a := y.a; x.b := merged(y.b, x.b); x.c := y.c ELSIF op = OSS.and THEN x.a := merged(y.a, x.a); x.b := y.b; x.c := y.c END ELSE OSS.Mark("bad type") END ; END Op2; PROCEDURE Relation*(op: INTEGER; VAR x, y: Item); (* x := x ? y *) BEGIN IF (x.type.form # Integer) OR (y.type.form # Integer) THEN OSS.Mark("bad type") ELSE PutOp(CMP, x, y); x.c := op - OSS.eql; EXCL(regs, y.r) END ; x.mode := Cond; x.type := boolType; x.a := 0; x.b := 0 END Relation; PROCEDURE Store*(VAR x, y: Item); (* x := y *) VAR r: LONGINT; BEGIN IF (x.type.form IN {Boolean, Integer}) & (x.type.form = y.type.form) THEN IF y.mode = Cond THEN Put(BEQ + negated(y.c), y.r, 0, y.a); EXCL(regs, y.r); y.a := pc-1; FixLink(y.b); GetReg(y.r); Put(MOVI, y.r, 0, 1); PutBR(BR, 2); FixLink(y.a); Put(MOVI, y.r, 0, 0)
ELSIF y.mode # Reg THEN load(y) END ; IF x.mode = Var THEN IF x.lev = 0 THEN x.a := x.a - pc*4 END ; Put(STW, y.r, x.r, x.a) ELSE OSS.Mark("illegal assignment") END ; EXCL(regs, x.r); EXCL(regs, y.r) ELSE OSS.Mark("incompatible assignment") END END Store; PROCEDURE Parameter*(VAR x: Item; ftyp: Type; class: INTEGER); VAR r: LONGINT; BEGIN IF x.type = ftyp THEN IF class = Par THEN (*Var param*) IF x.mode = Var THEN IF x.a # 0 THEN GetReg(r); Put(ADDI, r, x.r, x.a) ELSE r := x.r END ELSE OSS.Mark("illegal parameter mode") END ; Put(PSH, r, SP, 4); EXCL(regs, r) ELSE (*value param*) IF x.mode # Reg THEN load(x) END ; Put(PSH, x.r, SP, 4); EXCL(regs, x.r) END ELSE OSS.Mark("bad parameter type") END END Parameter; (*---------------------------------*) PROCEDURE CJump*(VAR x: Item); BEGIN IF x.type.form = Boolean THEN IF x.mode # Cond THEN loadBool(x) END ; PutBR(BEQ + negated(x.c), x.a); EXCL(regs, x.r); FixLink(x.b); x.a := pc-1 ELSE OSS.Mark("Boolean?"); x.a := pc END END CJump; PROCEDURE BJump*(L: LONGINT); BEGIN PutBR(BR, L-pc) END BJump; PROCEDURE FJump*(VAR L: LONGINT); BEGIN PutBR(BR, L); L := pc-1 END FJump; PROCEDURE Call*(VAR x: Item); BEGIN PutBR(BSR, x.a - pc) END Call; PROCEDURE IOCall*(VAR x, y: Item);
VAR z: Item; BEGIN IF x.a < 4 THEN IF y.type.form # Integer THEN OSS.Mark("Integer?") END END ; IF x.a = 1 THEN GetReg(z.r); z.mode := Reg; z.type := intType; Put(RD, z.r, 0, 0); Store(y, z) ELSIF x.a = 2 THEN load(y); Put(WRD, 0, 0, y.r); EXCL(regs, y.r) ELSIF x.a = 3 THEN load(y); Put(WRH, 0, 0, y.r); EXCL(regs, y.r) ELSE Put(WRL, 0, 0, 0) END END IOCall; PROCEDURE Header*(size: LONGINT); BEGIN entry := pc; Put(MOVI, SP, 0, RISC.MemSize - size); (*init SP*) Put(PSH, LNK, SP, 4) END Header; PROCEDURE Enter*(size: LONGINT); BEGIN Put(PSH, LNK, SP, 4); Put(PSH, FP, SP, 4); Put(MOV, FP, 0, SP); Put(SUBI, SP, SP, size) END Enter; PROCEDURE Return*(size: LONGINT); BEGIN Put(MOV, SP, 0, FP); Put(POP, FP, SP, 4); Put(POP, LNK, SP, size+4); PutBR(RET, LNK) END Return; PROCEDURE Open*; BEGIN curlev := 0; pc := 0; cno := 0; regs := {} END Open; PROCEDURE Close*(VAR S: Texts.Scanner; globals: LONGINT); BEGIN Put(POP, LNK, SP, 4); PutBR(RET, LNK); END Close; PROCEDURE EnterCmd*(VAR name: ARRAY OF CHAR); BEGIN COPY(name, comname[cno]); comadr[cno] := pc*4; INC(cno) END EnterCmd; (*-------------------------------------------*) PROCEDURE Load*(VAR S: Texts.Scanner); BEGIN RISC.Load(code, pc); Texts.WriteString(W, " code loaded"); Texts.WriteLn(W); Texts.Append(Oberon.Log, W.buf); RISC.Execute(entry*4, S, Oberon.Log) END Load; PROCEDURE Exec*(VAR S: Texts.Scanner); VAR i: INTEGER;
BEGIN i := 0; WHILE (i < cno) & (S.s # comname[i]) DO INC(i) END ; IF i < cno THEN RISC.Execute(comadr[i], S, Oberon.Log) END END Exec; PROCEDURE Decode*(T: Texts.Text); VAR i, w, op, a: LONGINT; BEGIN Texts.WriteString(W, "entry"); Texts.WriteInt(W, entry*4, 6); Texts.WriteLn(W); i := 0; WHILE i < pc DO w := code[i]; op := w DIV 4000000H MOD 40H; Texts.WriteInt(W, 4*i, 4); Texts.Write(W, 9X); Texts.WriteString(W, mnemo[op]); IF op < BEQ THEN a := w MOD 40000H; IF a >= 20000H THEN DEC(a, 40000H) (*sign extension*) END ; Texts.Write(W, 9X); Texts.WriteInt(W, w DIV 400000H MOD 10H, 4); Texts.Write(W, ","); Texts.WriteInt(W, w DIV 40000H MOD 10H, 4); Texts.Write(W, ",") ELSE a := w MOD 4000000H; IF a >= 2000000H THEN DEC(a, 4000000H) (*sign extension*) END END ; Texts.WriteInt(W, a, 6); Texts.WriteLn(W); INC(i) END ; Texts.WriteLn(W); Texts.Append(T, W.buf) END Decode; BEGIN Texts.OpenWriter(W); NEW(boolType); boolType.form := Boolean; boolType.size := 4; NEW(intType); intType.form := Integer; intType.size := 4; mnemo[MOV] := "MOV "; mnemo[MVN] := "MVN "; mnemo[ADD] := "ADD "; mnemo[SUB] := "SUB "; mnemo[MUL] := "MUL "; mnemo[Div] := "DIV "; mnemo[Mod] := "MOD "; mnemo[CMP] := "CMP "; mnemo[MOVI] := "MOVI"; mnemo[MVNI] := "MVNI"; mnemo[ADDI] := "ADDI"; mnemo[SUBI] := "SUBI"; mnemo[MULI] := "MULI"; mnemo[DIVI] := "DIVI"; mnemo[MODI] := "MODI"; mnemo[CMPI] := "CMPI"; mnemo[CHKI] := "CHKI"; mnemo[LDW] := "LDW "; mnemo[LDB] := "LDB "; mnemo[POP] := "POP "; mnemo[STW] := "STW "; mnemo[STB] := "STB "; mnemo[PSH] := "PSH "; mnemo[BEQ] := "BEQ "; mnemo[BNE] := "BNE "; mnemo[BLT] := "BLT "; mnemo[BGE] := "BGE ";
mnemo[BLE] := "BLE "; mnemo[BGT] := "BGT "; mnemo[BR] := "BR "; mnemo[BSR] := "BSR "; mnemo[RET] := "RET "; mnemo[RD] := "READ"; mnemo[WRD] := "WRD "; mnemo[WRH] := "WRH "; mnemo[WRL] := "WRL "; END OSG.
References

A. V. Aho, J. D. Ullman. Principles of Compiler Design. Reading MA: Addison-Wesley, 1985.
F. L. DeRemer. Simple LR(k) grammars. Comm. ACM, 14, 7 (July 1971), 453-460.
M. Franz. The case for universal symbol files. Structured Programming 14 (1993), 136-147.
S. L. Graham, S. P. Rhodes. Practical syntax error recovery. Comm. ACM, 18, 11 (Nov. 1975), 639-650.
J. L. Hennessy, D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, 1990.
C. A. R. Hoare. Notes on data structuring. In Structured Programming, O.-J. Dahl, E. W. Dijkstra, C. A. R. Hoare, Acad. Press, 1972.
U. Kastens. Uebersetzerbau. Oldenbourg, 1990.
D. E. Knuth. On the translation of languages from left to right. Information and Control, 8, 6 (Dec. 1965), 607-639.
D. E. Knuth. Top-down syntax analysis. Acta Informatica 1 (1971), 79-110.
W. R. LaLonde, et al. An LALR(k) parser generator. Proc. IFIP Congress 71, North-Holland, 153-157.
J. G. Mitchell, W. Maybury, R. Sweet. Mesa Language Manual. Xerox Palo Alto Research Center, Technical Report CSL-78-3.
P. Naur (Ed.). Report on the algorithmic language Algol 60. Comm. ACM, 3 (1960), 299-314, and Comm. ACM, 6, 1 (1963), 1-17.
P. Rechenberg, H. Mössenböck. Ein Compiler-Generator für Mikrocomputer. C. Hanser, 1985.
M. Reiser, N. Wirth. Programming in Oberon. Wokingham: Addison-Wesley, 1992.
N. Wirth. The programming language Pascal. Acta Informatica 1 (1971).
N. Wirth. Modula - a programming language for modular multiprogramming. Software - Practice and Experience, 7 (1977), 3-35.
N. Wirth. What can we do about the unnecessary diversity of notation for syntactic definitions? Comm. ACM, 20, 11 (1977), 822-823.
N. Wirth. Programming in Modula-2. Heidelberg: Springer-Verlag, 1982.
N. Wirth and J. Gutknecht. Project Oberon. Wokingham: Addison-Wesley, 1992.