For the past six months I have been working on a programming language (PL) called Pinecone. I would not dare call it complete, but it is already usable: it has enough features for that, such as variables, functions, and user-defined data structures.
I’m not an expert. When I started this project I had no idea what I was doing, and I still don’t. I never purposefully studied the principles of language design – I just read some materials on the Web, and even in them I found almost nothing useful.
However, I wrote a completely new language. And it works. I guess I’m doing something right.
In this article, I’ll try to show how Pinecone (and other programming languages) turn source code into what many consider magic. I will also look at the situations where I had to make trade-offs, and explain why I made the decisions I did.
This text does not claim to be a complete guide to creating a programming language, but for the curious it will be a good starting point.
The first steps
“Where do I even start?” is a question other developers often ask when they find out I am writing my own language. In this part I will try to answer it in detail.
Compiled or Interpreted?
The compiler analyzes the entire program, turns it into machine code and saves it for later execution. The interpreter parses and executes the program line by line in real time.
Technically, any language can be both compiled and interpreted. But for each language one of the methods is more suitable than the other, and the choice of the paradigm in the early stages determines the further design. In general, interpretation is flexible and compilation provides high performance, but this is just the tip of an extremely complex topic.
I wanted to create a simple yet performant language – there aren’t many of those – so I decided from the beginning to make Pinecone compiled. Nevertheless, Pinecone also has an interpreter: at first, running programs was possible only through it; I will explain why later.
A kind of meta-step: a programming language is itself a program, which has to be written in some language. I chose C++ for its performance and rich feature set, and simply because I like it.
But in general, advice can be given like this:
- it is highly recommended to write an interpreted PL in a compiled PL (C, C++, Swift) – otherwise the performance losses will snowball while the host interpreter interprets your interpreter;
- a compiled PL can be written in an interpreted PL (Python, JS) – the compilation time will increase, but not the program’s execution time.
A programming language’s pipeline goes through several stages between the source code and the executable file; at each stage the data is represented in a certain format, and there are functions that move it between stages. Let’s talk about this in more detail.
Lexical analyzer / lexer
The line of source code goes through the lexer and becomes a list of tokens.
The first step in most programming languages is lexical analysis. In simple terms, it is the breakdown of text into tokens – the units of the language: variables, function names (identifiers), operators, numbers. Thus, giving the lexer a string of source code as input, we get a list of all the tokens it contains as output.
The source code will not be accessed again in later stages, so the lexer must return all the information they will need.
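To make this concrete, here is a minimal hand-written lexer sketch in C++. The token categories and the `lex` function are illustrative assumptions, not Pinecone’s actual code:

```cpp
#include <cctype>
#include <string>
#include <vector>

// A toy lexer: splits a source string into identifier, number,
// and single-character operator tokens (hypothetical token set).
enum class TokenType { Identifier, Number, Operator };

struct Token {
    TokenType type;
    std::string text;
};

std::vector<Token> lex(const std::string& source) {
    std::vector<Token> tokens;
    size_t i = 0;
    while (i < source.size()) {
        char c = source[i];
        if (std::isspace(static_cast<unsigned char>(c))) {
            ++i;                        // skip whitespace between tokens
        } else if (std::isalpha(static_cast<unsigned char>(c)) || c == '_') {
            size_t start = i;           // identifier: letters, digits, underscores
            while (i < source.size() &&
                   (std::isalnum(static_cast<unsigned char>(source[i])) || source[i] == '_'))
                ++i;
            tokens.push_back({TokenType::Identifier, source.substr(start, i - start)});
        } else if (std::isdigit(static_cast<unsigned char>(c))) {
            size_t start = i;           // number literal: a run of digits
            while (i < source.size() && std::isdigit(static_cast<unsigned char>(source[i])))
                ++i;
            tokens.push_back({TokenType::Number, source.substr(start, i - start)});
        } else {
            tokens.push_back({TokenType::Operator, std::string(1, c)});
            ++i;                        // anything else: single-character operator
        }
    }
    return tokens;
}
```

Feeding it `"x1 + 42"` yields three tokens: the identifier `x1`, the operator `+`, and the number `42` – everything the later stages need, with the raw text left behind.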
When creating a language, the first thing I did was write a lexer. Later, I explored tools that could make lexical analysis easier and reduce bugs.
One of the main such tools is Flex, a lexical analyzer generator. It accepts a file with a description of the grammar of the language as input, and then creates a program in C, which, in turn, parses the string and produces the desired result.
I decided to keep the analyzer I had written. In the end, I did not see any particular advantage in Flex, and using it would only create additional dependencies that complicate the build. Besides, my own code gives more flexibility – for example, I can add an operator to the language without editing multiple files.
Syntax analyzer / parser
The list of tokens goes through the parser and turns into a tree.
The next stage is the parser. It converts the source text – that is, the list of tokens – into an abstract syntax tree, taking into account parentheses and the order of operations; the tree structurally represents the rules of the created language. The process itself is simple enough, but as the number of language constructs grows, it becomes much more complicated.
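As an illustration, here is a sketch of a small recursive-descent parser in C++. The grammar (just `+`, `*`, and parentheses) and all the names are hypothetical, far simpler than a real language’s parser:

```cpp
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// A toy AST node: leaves hold a number/identifier, inner nodes hold an operator.
struct Node {
    std::string value;
    std::shared_ptr<Node> left, right;  // null for leaf nodes
};

using NodePtr = std::shared_ptr<Node>;

class Parser {
public:
    explicit Parser(std::vector<std::string> tokens) : tokens_(std::move(tokens)) {}

    NodePtr parseExpression() {         // expression := term ('+' term)*
        NodePtr node = parseTerm();
        while (peek() == "+") {
            advance();
            node = std::make_shared<Node>(Node{"+", node, parseTerm()});
        }
        return node;
    }

private:
    NodePtr parseTerm() {               // term := factor ('*' factor)*
        NodePtr node = parseFactor();
        while (peek() == "*") {
            advance();
            node = std::make_shared<Node>(Node{"*", node, parseFactor()});
        }
        return node;
    }

    NodePtr parseFactor() {             // factor := '(' expression ')' | atom
        if (peek() == "(") {
            advance();
            NodePtr node = parseExpression();
            if (peek() != ")") throw std::runtime_error("expected ')'");
            advance();
            return node;
        }
        NodePtr leaf = std::make_shared<Node>(Node{peek(), nullptr, nullptr});
        advance();
        return leaf;
    }

    std::string peek() const { return pos_ < tokens_.size() ? tokens_[pos_] : ""; }
    void advance() { ++pos_; }

    std::vector<std::string> tokens_;
    size_t pos_ = 0;
};
```

Note how operator precedence falls out of the grammar itself: `parseTerm` binds `*` tighter than `+`, so `1 + 2 * 3` parses as `1 + (2 * 3)` without any special-case code.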
At this step I also considered a third-party library, looking at Bison to generate the parser. It is a lot like Flex – a custom grammar file is turned into a C program – but again I decided against the automation.
Benefits of custom programs
With the lexer, my decision to write and use my own code (about 200 lines) was fairly obvious: I love puzzles, and this one is relatively trivial. The parser is a different story: it is now 750 lines of code, and this is already the third attempt (the first two were just awful).
However, I decided to do the parser myself. Here are the main reasons:
- minimizing context switching;
- simplifying the build;
- the desire to cope with the task on my own.
I would not advise using lexer and parser generators or other so-called “compiler compilers”. Writing a lexer and a parser does not take much time, while using a generator firmly binds you to it in all future work (which matters when porting the compiler to a new platform). Generators are also notorious for producing unhelpful error messages.
Abstract semantic graph
Moving from a syntax tree to a semantic graph
In this part, I have implemented a structure that is inherently closest to the “intermediate representation” in LLVM. There is a small but important difference between an abstract syntax tree (AST) and an abstract semantic graph (ASG).
ASG vs AST
Roughly speaking, a semantic graph is a syntax tree with context. It contains information such as what type a function returns or in which places the same variable is used. Because the graph has to recognize and remember all this context, the code that generates it needs the support of many different lookup tables.
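One piece of that context can be sketched as a symbol table that records each variable’s type so later stages can look it up. This is a deliberately simplified illustration – the names and the string-based types are assumptions, not Pinecone’s actual tables:

```cpp
#include <map>
#include <stdexcept>
#include <string>

// A toy symbol table: the kind of lookup structure that turns a bare
// syntax tree into a semantic graph by remembering declared types.
class SymbolTable {
public:
    void declare(const std::string& name, const std::string& type) {
        symbols_[name] = type;          // remember the declared type
    }

    std::string typeOf(const std::string& name) const {
        auto it = symbols_.find(name);  // an undeclared name is a semantic error
        if (it == symbols_.end())
            throw std::runtime_error("undeclared variable: " + name);
        return it->second;
    }

private:
    std::map<std::string, std::string> symbols_;
};
```

A syntax tree alone cannot answer “what is the type of `count` here?”; with a table like this attached, every node of the graph can.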
Once the graph is built, running the program becomes a fairly straightforward task: each node contains a function implementation that receives some data as input, does what is programmed (including possible calls to helper functions), and returns a result. This is the interpreter in action.
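A tree-walking interpreter of this kind can be sketched as follows – the node layout and helper functions are assumptions for illustration, not the real Pinecone code:

```cpp
#include <memory>
#include <stdexcept>
#include <string>

// A toy expression node: either an integer leaf or a binary operation.
struct ExprNode {
    std::string op;                     // "+", "*", or "" for a number leaf
    int value = 0;                      // used only by leaves
    std::shared_ptr<ExprNode> left, right;
};

// Each node "runs itself": evaluate the children, then combine the results.
int evaluate(const std::shared_ptr<ExprNode>& node) {
    if (node->op.empty()) return node->value;   // leaf: literal number
    int lhs = evaluate(node->left);             // recursively run the children
    int rhs = evaluate(node->right);
    if (node->op == "+") return lhs + rhs;
    if (node->op == "*") return lhs * rhs;
    throw std::runtime_error("unknown operator: " + node->op);
}

// Small builders to keep test trees readable.
std::shared_ptr<ExprNode> leaf(int v) {
    return std::make_shared<ExprNode>(ExprNode{"", v, nullptr, nullptr});
}

std::shared_ptr<ExprNode> binary(const std::string& op,
                                 std::shared_ptr<ExprNode> l,
                                 std::shared_ptr<ExprNode> r) {
    return std::make_shared<ExprNode>(ExprNode{op, 0, std::move(l), std::move(r)});
}
```

Evaluating the tree for `1 + 2 * 3` simply walks it top-down and returns `7` – no machine code involved, which is exactly why an interpreter is so much easier to get working first.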
You may be wondering where an interpreter came from if I originally defined Pinecone as a compiled language. The point is that compilation is much harder than interpretation – I mentioned earlier that I ran into some problems at this step.
Write your own compiler
At first I liked this idea – I love doing things myself, and I had long wanted to learn assembly language. But creating a cross-platform compiler from scratch means writing machine-code generation for every element of the language on every platform. I found this idea completely impractical and not worth the investment.
LLVM is a collection of compiler tools used by the developers of Swift, Rust, and Clang, among others. I settled on this option, but once again underestimated the complexity of the task I had set myself. The problem for me was not mastering assembly but working with a huge multi-component library.
I still needed some kind of solution, so I wrote something that would definitely work: a transpiler from Pinecone to C++ – it compiles source code to source code – and added the ability to automatically compile the output with GCC. This approach is neither scalable nor truly cross-platform, but for now it works for almost all Pinecone programs, and that is already good.
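The source-to-source idea can be sketched like this: walk the tree and print equivalent C++ expression text, which can then be handed to GCC for ordinary compilation. The node layout is hypothetical and vastly simpler than a real transpiler:

```cpp
#include <memory>
#include <string>

// A toy AST node for the transpiler sketch: a leaf holds a literal or
// identifier, an inner node holds a binary operator.
struct AstNode {
    std::string text;
    std::shared_ptr<AstNode> left, right;   // null for leaves
};

// Emit equivalent C++ expression text instead of evaluating the tree.
std::string transpile(const std::shared_ptr<AstNode>& node) {
    if (!node->left) return node->text;     // leaf: emit as-is
    // Parenthesize every binary expression so operator precedence
    // survives the round trip through plain text.
    return "(" + transpile(node->left) + " " + node->text + " " +
           transpile(node->right) + ")";
}
```

For the tree of `x + 2 * 3` this emits `(x + (2 * 3))` – valid C++ that any existing compiler can finish the job on, which is the whole appeal of transpiling as a stopgap.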
I still lack the necessary experience, but in the future I intend to implement the Pinecone compiler with LLVM from start to finish – I like the tool, and the manuals for it are good. For now, the interpreter is enough for primitive programs, and the transpiler handles the more complex ones.
I hope this article proves useful to someone. I highly recommend at least trying to write your own language: despite all the implementation details you will have to work through, it is an educational, instructive, and simply interesting experiment.
Here are general tips from me (quite subjective, of course):
- if you have no preference and are in doubt whether to write a compiled or an interpreted language, choose the latter. Interpreted languages are usually easier to design, build, and learn;
- with lexers and parsers, do whatever you want. The use of automation tools depends on your desire, experience and specific situation;
- if you are not ready or do not want to spend time and effort (a lot of time and effort) on devising your own strategy, follow the sequence of steps described in this article. I put a lot of effort into it, and it works;
- again, if you don’t have enough time / motivation / experience / desire or something else to write a classic PL, try to write an esoteric one, like