2) DLR - Scanner

In the first chapter I described common architecture of .NET language compiler. There are usually three parts - scanner, parser and generator of CIL code. But building a language on the DLR slightly differs because instead of CIL code is produced DLR abstract syntax tree (DLR AST).

Here is an example demonstrating how code string is processed by a compiler built on the DLR:

As you can see DLR AST is quite similar to a language specific AST built by the parser, the difference is that DLR AST is build from nodes (operations) provided by the DLR and so DLR understands it and can interpret it while language specific AST is a tree structure defined by the language developer.
You could also build DLR AST directly from tokens, but I do not recommend it. It’s more convenient to separate the generation of AST into two steps, where at first parser builds IPL AST. Then generator not only transforms IPL AST into DLR AST, but also it manages variables and variable scopes, converts values into appropriate type, etc.

Implemented language

In this tutorial I’m going to describe implementation of new language I call IPL. It is a simple language similar to ToyScript (also language intended for learning the DLR and provided together with the DLR source codes but lacking any documentation).
First version of IPL language (let’s call it IPL1) will be expression language, that means there will be no statements but only an expression evaluated into a single value.

First version of the IPL (let’s call it IPL1) will support several essential features:

  1. variables
  2. values of type int and string
  3. let-in-end blocks
  4. if-else conditions
  5. lambda functions

Lambda Functions

Lambda function is the same as lambda expression in the C# 3.0. If you aren’t familiar with this concept, it’s a simple function, where body is an expression that is evaluated and returned when lambda function is called.

add = fun x y => x + y

Calling add(5, 6) results in 11.

Let-in-end blocks

Let-in-end blocks are used to declare and use variables inside the block. Blocks can be nested and each block has its own variable scope. Implementation of let-in-end blocks will demonstrate how to implement variable scopes in the DLR.

Here are some examples:

Here is an implementation of factorial in IPL1:

let val x = 4
    val fac = fun n => if n > 1 then n * self(n - 1)
                       else 1
in fac(x) end

First version of the IPL (let’s call it IPL1) will be expression language, that means there will be no statements but only an expression evaluated into a single value. In the next chapters the language is going to be extended by new features such as mentioned statements, functions, .NET interoperability (calling and using .NET libraries), etc.

Building IPL language

Lexer and parser generators

How to build a scanner and a parser? There are several ways, you could implement those parts by yourself, but this can be quite time expensive and you need some experience. Another, much simpler way, is to use lexer and parser generators. I prefer using generators because those tools are specialized on building scanners and parsers and they considerably simplify the implementation. In .NET there are several generators available, I prefer F# generators fslex (lexer) and fsyacc (parser).

Don’t be affraid if you’re not familiar with F#, actually there won’t be more like 20 lines of F# code together for scanner and the parser because fslex and fsyacc follow it’s own specification. Rest, the generator, will be implemented in the C# and while F# as well as C# are .NET languages, there won’t be a problem to use those parts together.

Let’s get started

As I already mentioned, scanner will be implemented using F# lexer generator fslex, that is part of the F# distribution. You can download F# on http://research.microsoft.com/fsharp/release.aspx ^.

If you have F# already installed, you can simply add the F# Lex (fslex) and F# Yacc (fsyacc) specification files into the project:

Scanner

Here is the scanner’s specification file scanner.fsl of the IPL1 language:

// scanner.fsl
{
open System
open Parser
open Lexing
}

let digit = ['0'-'9']
let char = ['a'-'z' 'A'-'Z']
let string = '"'([^'"'])*'"'
let whitespace = [' ' '\t' '\r' '\n']

rule scan = parse
// Keywords
| "else"			{ KeyELSE }
| "end"				{ KeyEND }
| "false"			{ KeyFALSE }
| "fun"				{ KeyFUN }
| "if"				{ KeyIF }
| "in"				{ KeyIN }
| "let"				{ KeyLET }
| "then"			{ KeyTHEN }
| "true"			{ KeyTRUE }
| "val"				{ KeyVAL }
// Operators
| "<"				{ OpLESS }
| "*"				{ OpSTAR }
| "-"				{ OpDASH }
| "+"				{ OpPLUS }
| "="				{ OpEQUAL }
| "=="				{ OpEQUALEQUAL }
| "=<"				{ OpEQUALLESS }
| "=>”				{ OpEQUALGREATER }
| “>”				{ OpGREATER }
| “>=”				{ OpGREATEREQUAL }
| “/”				{ OpSLASH }
| “!=”				{ OpEXCLAMATIONEQUAL }
// Special
| “(”				{ SpLEFTPAREN }
| “)”				{ SpRIGHTPAREN }
| “,”				{ SpCOMMA }
// others
| digit+			{ NUM (Int32.Parse(lexeme lexbuf)) }
| string			{ STRING ((lexeme lexbuf).Trim([|’\”‘|])) }
| char(char | digit)*		{ IDENT (lexeme lexbuf) }
| whitespace			{ scan lexbuf }
| eof				{ EOF }

Scanner constists of plenty of rules. For each keyword, operator, special character and value type of the language, there is one rule. Each rule starts with ‘|’ character, after that follows regular expression parsing the string and on the right side of the rule, there is a token produced by the scanner. Tokens NUM, STRING and IDENT are generated for integer and string values and identificators and they keep the parsed value (that is specified in the brackets, e.g. NUM(5), IDENT(“x”), etc).

Example:
This scanner parses a string “x = 5 + 3 * 6” into the list of tokens:
[IDENT(“x”), OpEQUAL, NUM(5), OpADD, NUM(3), OpMUL, NUM(6)]

Parser

While this chapter is focused only on the scanner implementation, still parser specification must be created. That’s because in the parser file, there are defined tokens and without tokens definitions scanner wouldn’t be complete.

Here is the parser.fsy file. There aren’t parsing rules yet, those will be added in the next chapter focused on the parser implementation:

// parser.fsy
%token  NUM
%token  STRING
%token  IDENT

// Keywords
%token KeyELSE KeyEND KeyFALSE KeyFUN KeyIF KeyIN KeyLET KeyTHEN KeyTRUE
       KeyVAL

// Operators
%token OpLESS OpSTAR OpDASH OpPLUS OpEQUAL OpEQUALEQUAL
       OpEQUALLESS OpEQUALGREATER OpGREATER OpGREATEREQUAL OpSLASH
       OpEXCLAMATIONEQUAL

// Special
%token SpLEFTPAREN SpRIGHTPAREN SpCOMMA

// others
%token EOF

// ** ignore rest of this file for now
%start start
%type  start
%%
start: { “” }

Now scanner implementation is complete and both files can be compiled into .NET using the fslex.exe and fsyacc.exe tools. Those tools are in the C:\Program Files\FSharp-[version]\bin\ folder.

Compiling scanner.fsl and parser.fsy:

fslex.exe scanner.fsl
fsyacc.exe parser.fsy

Two files are generated - scanner.fs and parser.fs that have to be added into the project.

File order matters! F# files are executed sequentially according to their position in the project. Because scanner is using tokens that are defined in the parser file, parser.fs must be placed before scanner.fs in the project. If your files are in different order than you want, there are two ways how to change it - in Visual Studio renaming the file will always put the file onto the last position or another way is to open project file (*.fsharpp) in a text editor and change the order manually.

Running a scanner

Here is a function in F# that calls the scanner on given code string and returns list of tokens:

// main.fs #light open System // returns list of tokens let runScanner code = let mutable cont = true // continue scanning? let mutable tokens = [] // empty list of tokens let lexbuf = Lexing.from_string code // initialize lexer while cont = true do let token = Scanner.scan lexbuf // parse one token tokens <- tokens @ [token] // store token // end of the string? if token = Parser.token.EOF then cont <- false tokens

Project

Project source codes are available for download (link is at the end of the chapter). Besides scanner implementation and token definitions, there are also unit test in the ScannerTests.cs.

Conclusion

If the scanner implementation is still not clear to you, you can look at much simpler project which is a parser of arithmetic expressions (download here).


IPL1_01scanner.zip

Leave a Reply