README
¶
s-lang
This repository contains a compiler for a minimal programming language, targeting linux/amd64 systems. It seems that languages are traditionally named after their creators, or have a single-letter name. This has both of course - on the basis that it is being developed as a learning exercise I have no expectation that anyone other than myself will ever use it I can do that!
The generated code contains no external dependencies, so when compiled static binaries are produced which do not depend upon libC, etc. The standard library routines which are not used may be removed by the linker, reducing size, and generated binaries start at around 2k in size.
- Written in Golang for portability.
- Although the generated code is obviously Linux/AMD64-specific.
- We have a real lexer, and parser.
- Internally an AST is constructed which is then walked to generate the assembly language representation of the program.
- The compiler may automatically invoke the external
asandldbinaries to compile and link if desired.
In terms of features:
- Single-pass compiler which generates an assembly output for programs.
- Parsing uses a Pratt-style parser with precedence layers:
- Maths operations:
+,-,*,/,%, and^. - Comparison operations:
<,<=,>,>=, for integers, floats, and mixtures of the two. - Equality operations
==,!=work for integers, floats, and even strings.- You can compare integer with integer, float with float, string with string, and mixed operations where you compare either a float or an integer.
- You cannot compare a string with anything other than a string.
- Logical operations:
&&and||. - Postfix decrement/increment support for variables (
i++;, orindex--;for example). - Unary operations
-,+, and!.
- Maths operations:
- Support for integers, floats, and strings.
- String literals are interned.
- So you can call "
print("Steve");" 100 times and still see the text "Steve" in the binary only once.
- So you can call "
- String literals are interned.
- There is support for string/memory indexing.
- This is byte-based by default, but you can use a
pragmato treat memory as arrays of 8, 16, 32, or 64 bit entries. This is the closest we come to types. - Bounds checking is enabled and enforced.
- This is byte-based by default, but you can use a
- The ability to include inline assembly via
inline { .. }.inlinestatements are generated as they are encountered.- If you want to add new sections then use a
data { .. }-block, that is guaranteed to be inserted at the end of the assembly-generation. So you can add ".section blah .. .." without fear of breaking things.
- Looping is available with the
whilestatement, including standard support forbreakandcontinue. - Conditional support with
ifwithelsebranch too. - We support
switchstatements, albeit only with integer/character literals for thecasematches.defaultis supported too, of course.
- Some support for catching signals via "magic functions":
sigint()is called, if defined, when SIGINT is received (i.e. Ctrl-C is pressed).- We use this in examples/life.in to clear the screen, and restore the cursor.
sigfpe()is called, if defined, when SIGFPE is received.- This is the floating-point exception generated upon division by zero.
- Cleanup via
at_exit()which is called, if defined, when the program terminates, either due to signals being caught, or at an ordinary exit. - User defined functions, with default values.
Anti-features, or limitations:
- The language is built around numbers (integers&floats), and strings.
- We have no support for arrays, hashes, or structures.
- That said you can fake arrays via indexing into characters of strings, or
malloc'd areas of memory.- You can see that done in this test/jumptable1.in where we use it to implement a simple dynamic dispatch routine.
- test/jumptable2.in is cleaner approach where we directly get/set 64-bit values.
- There are only a few functions in the standard library.
- There is no general purpose support for types "u8 x = 8", "u16 y = 16384", etc.
- We can allocate memory with
malloc()and index it with "m[n]" - by default the memory will be treated as arrays of bytes.pragmacan be used to reinterpret a pointer as an arrays of 8, 16, 32, or 64-byte entries- (See test/jumptable2.in for an example of that.)
- We can allocate memory with
That said the code is clean, commented/documented, and contains a fair number of test-cases (both of the internal golang packages, and the compilation and execution of programs).
The intent behind this project was to learn, and increase my knowledge of low-level stuff. So everything here works, and everything is commented, but this is not a production-grade general-purpose language by any means.
Example Programs
There's a good SUMMARY.md of the syntax, and implementation details, which documents all supported features and syntax, as well as a collection of examples/ showing real programs.
A couple of highlights from the examples:
- examples/brainfuck.in - Naive brainfuck interpreter.
- Contains three hardcoded programs inline:
- The classic "Hello World" program.
- A simple "cat", which copies STDIN to STDOUT.
- The impressive mandelbrot generation program.
- If executed with the path to a file containing a brainfuck program it will read and execute that instead of any of the inline programs.
- This interpreter takes approximately 2 minutes to render the mandelbrot example program.
- Contains three hardcoded programs inline:
- examples/jit.in - JIT brainfuck Compiler
- This runs the same programs as the previous version, and also has the ability to load other programs if you specify their path.
- This example compiles the specified brainfuck programs to x86 assembly, before executing it.
- This runs the mandelbrot generation example in approximately three seconds.
- examples/life.in
- Conway's Game of Life.
- Randomly populate 20% of the arena, and evolves it until Ctrl-C is pressed.
- Math examples:
- examples/factorial.in - Calculate factorials 1-20.
- examples/fibonacci.in - Calculate fibonacci sequence, using recursion.
- examples/fizzbuzz.in - Calculate fizzbuzz 0-100.
- examples/primes.in - Calculate first 100 prime numbers.
- examples/num2hex.in - Convert a decimal number to a hex string.
- Misc examples
- examples/example.in - Misc. examples to demonstrate some of the facilities.
- examples/sort.in - Sort arrays of characters, or characters in strings.
Syntax
The following is a tour of our language, again check SUMMARY.md for a concrete list of examples, syntax, and caveats along with implementation notes:
# Comments are prefixed with "#" and last until the end of the line.
// You can use C-style comments if you prefer.
# Set a variable and print it.
let a = 3;
print( a );
/* Multi-
line
comments
work
too
*/
# Indexing works - and is bound-checked at run-time.
# LET is not required to declare/update a variable.
s = "Steve";
print(s[0],"\n");
# Updating too - again bounds checking is enforced.
s[1] = 42;
print(s,"\n");
# Printing a newline is common.
newline();
# simple loops with "while"
let x = 10;
# Looping on a variable is the same as "while ( x > 0 ) .."
while(x) {
print("The value in my loop is ", x, "\n");
x--;
}
inline {
# Inline assembly here
}
# Conditional expressions are present
if (x >= 3) {
print("x >= 3\n");
} else {
print("x is not >= 3\n");
}
# Printing of integer and string literal works too.
# (Our print function is variadic.)
print( "steve", " ", 21);
# Exit with the given status (7)
exit(1 + 2 * 3);
Trailing semicolons are mandatory, and brackets around the if/while tests, because that simplifies the parser!
There is an emacs lisp mode, s-lang.el providing syntax highlighting for our language, although there are no other features beyond that.
Usage
Once built (and optionally installed) the s-lang binary may be used to
generate, compile, or inspect the output of various stages via a number of
sub-commands.
Here we see the available sub-commands that you might choose to use, though in practice only the last three are useful for users.
lex
This is an internal command to show what the lexer makes of a given input file:
s-lang lex examples/example.in
parse
This is an internal command to show what the parser makes of a given input file:
s-lang parse examples/example.in
generate
This is one of the main commands, and generates an assembly language version of the input file:
s-lang generate [-output out.s] examples/example.in
You could assemble that output, and link it, like so:
as -msyntax=intel -mnaked-reg out.s -o out.o
ld -s -o out out.o
Confirm the assembly is sane:
objdump -M intel -S out
Then run it:
./out
The compile sub-command automates the process of generating source, compiling it, and linking it to produce a final binary.
compile
This performs the same generation as in the generate sub-command, but also runs the assembler and linker for you (the linking step is pretty aggressive, we remove unused sections, and strip):
s-lang compile [-output a.out] examples/example.in
Typically you'd run something like this to generate and execute in one go:
s-lang compile examples/example.in && ./a.out
(Or use the execute sub-command to create a binary and run it in one step.)
execute
This performs the same generation as in the compile sub-command, but also runs the resulting binary for you:
s-lang execute [-output a.out] examples/example.in
version
Report the version number of this binary, using the git-information that go embeds within generated binaries.
STDLIB
We embed a small number of functions within the generated programs, our so-called "standard library". These are functions which seemed to be useful enough to include globally, and each function that accepts arguments has type-checking, both at compile-time and run-time.
addr(PTR)- Return the address of a pointer returned by
mmap(N)ormalloc(N). - Necessary if you write a JIT, but not otherwise.
- Return the address of a pointer returned by
argc()- Return the count of supplied command-line arguments.
argv(N)- Return the Nth command-line argument, as a string.
call(N|POINTER)- Call the given address. Expected to be used for jumptables, etc.
exit(N)- Terminate execution with the given exit-code.
filesize(STR)- Return the size of the given file.
float2int(F)- Convert the given floating-point number to an integer.
getc()- Read a single character from STDIN, returns 0 on EOF.
getenv(STR)- Return the contents of the environmental variable with the given name.
int2float(N)- Convert the given integer to a floating point.
malloc(N)- Allocate N bytes on the heap.
- Note memory is not executable; use mmap() if you need that.
- NOTE: We have no corresponding
free.
memlen(PTR|STR)- Return the length of the given string/pointer-allocation as an integer.
mmap(N)- Allocate N bytes of readable/writable/executable memory via MMAP.
newline()- Print a newline to STDOUT.
panic(STR)- Print the given message, and exit.
- Do not invoke
at_exit().
print(...,...,...)- This function is variadic; it will accept any number of arguments of any type.
- Print each argument in turn.
putc(N)- Print the ASCII character corresponding to the given integer to STDOUT, i.e
putc(42);will print*.
- Print the ASCII character corresponding to the given integer to STDOUT, i.e
rand(N)- Return a random number between 0-(N-1).
readfile(STR)- Return the contents of the given file.
sleep(N|F)- Sleep for the given duration, integer or float.
sqrt(N|F)- Calculate the square root of the given integer/float.
- Always returns a floating-point result.
strcat(STR, STR)- Combine the two strings, and return the new string.
strcmp(STR, STR)- Compare two strings for equality, return
0if equal. - You can use
if ( str == str ) { ..to directly test for equality now, but this was not previously possible. Similarly!=works for inequality testing.
- Compare two strings for equality, return
strdup(STR)- Allocate a copy of the given string, and return it.
strlen(STR)- Return the length of the given string.
str2int(STR)- Convert a string into an integer.
str2float(STR)- Convert a string into a floating-point number.
You can find the implementation of our standard library routines beneath the compiler/templates/stdlib directory.
Adding to the standard library
If you wish to add a new function which will be available to all compiled programs you need to add it to a new file beneath compiler/templates/stdlib, then rebuild the compiler. Template files beneath compiler/templates are embedded within the compiler - there is need to configure load-paths, or similar.
As an example if you wished to define the new function foo:
- Create
compiler/templates/stdlib/foo.tmpl - Inside there define a new function, with the label
foo:. - Rebuild the compiler (with "
go build .")
Once that is done your prgrams can immmediately call it:
let a = foo();
This works because internally a call to foo( [args] ) is converted into a call to the assembly-language function named foo. (i.e. Defined with the label foo:). You can use inline to define/call such a function manually if you wish, providing you follow our function ABI.
It is assumed your function will check the types of any arguments it receives, but you can add an entry to the type-checking package, described later, if you wish to add some additional compile-time type checking.
Type Checking
There are two forms of possible type-checking:
- Type checking at compilation time.
- Type checking at run time.
At compilation time we can detect invalid argument counts for standard library functions and user-defined functions. For example this is caught:
function foo( x ) { print( "I got: ", x , "\n" );
foo(); # ERROR - Expected one argument, received zero.
For checking actual types at compile time we're limited, we can detect this error:
strlen(3); # Wrong type, expected string but got int
However this is permitted:
let a = 3;
print(strlen(a)) # Type information didn't survive the assignment
NOTE: Compile-time type checking of standard-library functions requires an explicit definition within our check/ package. If you add a new function please do add an entry there.
Run-time checking of types is deferred to our standard library routines, and they should all check their argument types are valid before they execute their jobs. They will return an error string "strlen: expected STRING", or similar, instead of their normal result.
Optimizing Generated Binary Size
When the compile sub-command is executed we run generate NAME.s from NAME.in, and then:
as -msyntax=intel -mnaked-reg NAME.s -o NAME.o
ld --gc-sections -s -no-pie -z noseparate-code -o NAME NAME.o
The linker command here shrinks our binaries significantly, if you don't want/need that size-saving then a more typical usage will suffice:
ld -o NAME NAME.o
The difference in the two approaches can be seen by our brainfuck example:
- Just using
ld -o brainfuck brainfuck.o:-rwxr-xr-x 1 skx skx 38648 May 24 13:27 brainfuck
- With our longer
ldusage:-rwxr-xr-x 1 skx skx 18208 May 24 13:28 brainfuck
We went from 25k to 18k, which is a good saving.
You can see some tips on debugging with gdb in DEBUGGING.md, if you want to ease your debugging it is recommended you assemble and link yourself, that way there will be debug information available to you.
Development & Testing
Development is nearing completion now. There are a few small and obvious things to add, but at the same time the scripting language itself is pretty complete, the standard library is complex enough to write real programs, and I suspect my urge to add new things will diminish over time.
- If any compiled program terminates with a segfault at runtime that's a bug I will definitely fix.
Updates should be contributed by pull-requests which address open issues, but sometimes I'm less strict with myself than I should be.
I've written test-cases covering most of the implementation, which you can run in the standard manner:
$ go test ./...
ok s-lang 0.005s
ok s-lang/compiler 0.009s
ok s-lang/lexer 0.006s
ok s-lang/parser 0.003s
There is also support for the fuzz-testing that golang provides, you can run five minutes of fuzz-testing by executing the following (remove the -fuzztime=300s to run forever, and remove -parallel=1 to run more than a single instance at a time):
go test -fuzztime=300s -parallel=1 -fuzz=FuzzProject -v
In addition to the golang tests, and fuzzer, we have some functional test-cases beneath test/, there is a trivial driver which executes each of the sample programs, and compares the output produced to known-good results:
$ cd test && make
expr.in
compiling expr.in to expr
executing expr > expr.out
comparing expr.out to expr.expected
cleanup
...
Running make test should run both of those things.
Lessons Learned
- Not having (strong) types meant it was easy to get started, but in the long-run it made several things very limited.
- Storing type-data in the lower bits was a decision which worked, but it becomes pervasive.
- I should have started with a Pratt-style parser, rather than recursive descent.
- It's much easier to extend and manage precedence levels this way.
- It's surprising how small a standard-library you actually need to write complex, or "useful", programs.
- Implementing a real AST was very useful, and a good decision.
- I had been tempted to start with a BASIC-style language and skip the use of an AST.
- You can see hints of that in my use of "
LET x = y".
- Just because it is assembly doesn't mean it is fast.
- The mandelbrot example takes two minutes to run, which is 1:58 too slow.
- Using a real ABI would make functions faster and easier to manage.
- Using a flexible register allocation would also have made things faster, but it was easy to just assume values were stored in RAX, etc.
- There's a lot of value to be obtained by writing functional tests, not just implementation tests.
- These genuinely caught regressions.
- Especially when I updated parts of the standard-library and forgot where parts call each other.
- Putting functions in separate sections allows the linker to optimize very effectively.
- It wasn't immediately obvious that I didn't need to write code for both
==(equality-checking) and the reverse (!=/inequality checking).- Instead I needed to only write one routine and invert the result for the other operation.
- Similarly ">" is the same as "not <=", and "<" is the same as "not >=".
Source Files
¶
Directories
¶
| Path | Synopsis |
|---|---|
|
Package check provides some limited support for type-checking arguments to known standard-library routines.
|
Package check provides some limited support for type-checking arguments to known standard-library routines. |
|
Package compiler implements our compiler.
|
Package compiler implements our compiler. |
|
Package lexer contains a simple lexer, which consumes text from our input language and returns a series of tokens.
|
Package lexer contains a simple lexer, which consumes text from our input language and returns a series of tokens. |
|
Package parser parses input programs, via our lexer, and allows AST nodes to be generated.
|
Package parser parses input programs, via our lexer, and allows AST nodes to be generated. |