A compiler takes as input a source program and produces as output an equivalent sequence of machine instructions. This process is so complex that it is divided into a series of sub-processes called phases. The different phases of the compiler are as follows:
Lexical Analyzer or Scanner
The first phase of the compiler, called Lexical Analyzer or Scanner reads the source program one character at a time, carving the source program into a sequence of atomic units called tokens. The usual tokens are identifiers, keywords, constants, operators and punctuation symbols such as comma and parenthesis.
Each token is a sub-string of the source program that is to be treated as a single unit. The Lexical analyzer examines successive character in the source program starting from the first character not yet grouped into a token. It may be required to search many characters beyond the next token in order to determine what the next token actually is.
Phases of a Compiler
Syntax Analyzer or Parser
The second phase of the compiler, called the Syntax Compiler or Parser receives a stream of tokens as the output of the lexical analyzer.
The syntax analyzer groups tokens together into syntactic structure called as expression. Expression may further be combined to form statements. The syntactic structure can be regarded as a tree whose leaves are the token called as parse trees. The parser has two functions:
- Firstly, it checks if the tokens from lexical analyzer, occur in pattern that are permitted by the specification for the source language.
It also imposes on tokens that are permitted by the specification for the source language. It also imposes on tokens a tree-like structure that is used by the subsequent phases of the compiler.
- Secondly, it makes explicit the hierarchical structure of the incoming token stream by identifying which parts of the token stream should be grouped.
Intermediate Code Generation
The third phase of the compiler, called Intermediate Code Generation uses the structure produced by the syntax analyzer to create a stream of simple instructions. The output of the syntax analyzer is some representation of a parse tree. The intermediate code generation phase transforms this parse tree into an intermediate language representation of the source program.
Code Optimization
The fourth phase of the compiler, called Code Optimization is an optional phase designed to improve the intermediate code so that the ultimate object program runs faster or takes less space. Its output is another intermediate code program that does the same job as the original, but in a way that saves time and/or space.
Local Optimization
There are local transformations that can be applied to a program to make improvement.
Loop Optimization
Another important source of optimization concerns about increasing the speed of loops.
A typical loop improvement is to move a computation that produces the same result each time around the loop to a point, in the program just before the loop is entered. Then this computation is done only once each time the loop is entered, rather than once for the each iteration of the loop. Such a computation is called loop variant.
Code Generation
The fifth phase or the final phase of the compiler, called Code Generation produces the object code by deciding on the memory locations for data, selecting code to access each datum and selecting the registers in which each computation is to be done.
Many computers have a few high-speed registers in which computations can be performed quickly. A good code generator would attempt to utilize these registers as efficiently as possible. This aspect of code generation, called register allocation, is particularly difficult to do optimally, but some heuristic approaches can give reasonably good results. Designing a code generator that produces truly efficient object programs is one of the most difficult parts of a compiler design. During all these phases of the compiler, they are being interacted upon by the following two important data structures:
Table Management or Book-Keeping.
A compiler needs to collect information about all the data objects that appear in the source program. The information about data objects is collected by the early phases of the compiler – lexical and syntactic analyzers. The data structure used to record this information is called as symbol table. The determination of the type of intermediate results and check to check for, if the type of the arguments is legal for the application of an operation is termed as semantic analysis. It can be done during the syntactic analysis phase, the intermediate code generation phase or in the final code generation phase.
Error Handling
One of the most important functions of a compiler is the detection and reporting of errors in the source program. The error messages should allow the programmer to determine exactly where the errors have occurred. Errors may occur in all of the phases of a compiler. Whenever a phase of the compiler discovers an error, it must report the error to the dhandler, which issues an appropriate diagnostic message. Both the table management and error handling routines interact with all phases of the compiler.