# How to create a (slow) interpreter - Part 1: Tokens

Published on  2019-07-22

# How to create a (slow) interpreter - Part 1: Tokens Photo by Andreas Brücker on Unsplash  Last time, we had a simple syntax to our language, it's now time to get our hands dirty and start writing some code.

## Divide and conquer

First, we need to think about what our interpreter needs to compute. As an input, we will feed it a text containing the code and as an output it will either return a list of errors or a text output. We must ask ourselves how it will read this initial text and understand commands.

It's already pretty obvious regarding how we defined the syntax that we can safely split along the line breaks to get the instruction lines. So let's take that for granted.

When you read a text like `The cat is blue`, we can easily count words as we are familiar with this type of sentences. Now imagine we happen to read the line `VAR A = 169`, how many words do you see? That's a tricky question, answers may vary considering how you are reading this. Some may argue that only `VAR` is a valid word, others would say that there are 2 or 3 in total.

Since we cannot define instructions from words, we will use another object that holds information: tokens. In the last example we can see 4 tokens:

• `VAR`: a word of 3 letters
• `A`: a word of 1 letter
• `=`: a symbol
• `169` a number

Let's try that on harder example: `WHILE var != var2 + 1 && var < (5.2 - var3) ^ -2`. How many tokens do you think there are?

And the anwser is… (click)

If you guessed 17, you were right, the full decomposition is:

`WHILE`, `var`, `!=`, `var2`, `+`, `1`, `&&`, `var`, `<`, `(`, `5.2`, `-`, `var3`, `)`, `^`, `-`, `2`

Working on this case, the first rule we realize is that splitting along spaces will not work (and should not work) as, for example, `VAR A=169` has a valid syntax. We can break up "tokenizing" into these 6 rules:

1. A word is starting with a letter and can contain letters, numbers and underscores, like built-in commands `WHILE`, `IF` or variable names `myAwesomeVar_12`.
2. Spaces are meaningless excepted between 2 words: we can shrink `A +` but not `IF A`.
3. Some symbols are acting as separators like `(` or `+`.
4. Numbers are digits and can include a floating point like `2.03`.
5. Some symbols must be merged together like `&&` or `>=`.
6. Numbers and variables can possess a leading leading sign like `-2` or `-varA`.*

* The last one is special as we will handle this task here and it will only be used later in another function.

Following these indications, we can start writing our `tokenize` function that will split an instruction line into meaningful tokens.

## Now you're thinking with tokens

``````/**
* split a string expression into tokens
* @param {string} exp
* @return {string[]} tokens
*/
function tokenize(exp){``````

So as we said before, there are 2 types of spaces: between words and others. We must differentiate both by changing the first temporarily. We can also declare an output list and a variable keeping the first index of a word.

``````// save the space between 2 a-z words by replacing it with ¤ and ditch other spaces
const exp2 = exp.replace(/([a-zA-Z0-9_]*)[\\s]([a-zA-Z0-9_]*)/g, '\$1¤\$2')
.replace(/[\\s]/g,'') // remove blank chars
.replace(/¤/g,' '); // recover saved space
const output = []; // saved tokens
let i0 = 0; // starting index of a word``````

We will now iterate over each character the following way:

``````    VAR var2=169
^                 i0=0
VAR var2=169
|^
VAR var2=169
| ^
VAR var2=169
|  ^              separator, new token: VAR
VAR var2=169
^             i0=4
VAR var2=169
|^
VAR var2=169
| ^
VAR var2=169
|  ^
VAR var2=169
|   ^         separator, new tokens: var2, =
VAR var2=169
^        i0=9
VAR var2=169
|^
VAR var2=169
| ^
VAR var2=169
|  ^     end of string, new token: 169
``````

So we will loop until we encounter a separator:

``````let i;
for (i = 0; i < exp2.length; i++) {
if ('+-*/%^()=&| ><!'.includes(exp2[i])) { // if current char is a separator``````

Whenever it happens, we first have to save the previous token if it exists.
Next we save the separator as a token when it's not a space between words.
Then we update the starting point of the following token.

``````if (i > i0) // save token before
output.push(exp2.substr(i0, i - i0));
if (exp2[i] !== ' ') // save separator as token if not a word separator
output.push(exp2[i]);
i0 = i + 1;
}``````

At the end of the loop, let's not forget to save the remaining token:

``````}
if (i > i0) // save last token
output.push(exp2.substr(i0, i - i0));``````

Pretty easy, right? So where are we now?

1. words aren't containing symbols*.
2. spaces are removed except between words.
3. symbols act as separators.
4. numbers can contains points. (as `.` is not in the symbol list)
5. symbols aren't merged. (`A!=B` became `A`, `!`, `=`, `B`)
6. inverse operation is not separated from subtraction.

* words might actually contain symbols if there aren't in the separators list (like `@`) but we will check their validity another time.

We now want to merge symbols when it's required. To do so, we will create a testing function that takes two neighbor tokens and checks their merge ability:

``````const canMerge = function(a,b){
return ('&|='.includes(a) && a.length === 1 && a === b) || // double separator
(b === '=' && '!<>'.includes(a) && a.length === 1); // >= or <= or !=
};``````

We only need to iterate over our `output` list (in reverse to avoid index jumps) and merge whenever it's possible:

``````for (let i = output.length - 1; i > 0; i--) {
if (canMerge(output[i - 1], output[i]))
output.splice(i - 1, 2, output[i - 1] + output[i]); // remove 2 elements and add merged``````

We will take this opportunity to iterate over the output to identify the inverse operation:

Inverse operation is identified by a minus sign after a symbol except for closing parenthesis. For example, in `(A-B)-C*-D`, only the final one is different.

``````else if ('+-*/%^(=&| ><!'.includes(output[i - 1]) && output[i] === '-')
output[i] = '.-'; // differentiate this operation from subtraction``````

Now that all the rules are fulfilled, we can simply return our output:

``````}
return output;
}``````

We can now enjoy our fully working function:

Full `tokenize` function (click)

``````/**
* split a string expression into tokens
* @param {string} exp
* @return {string[]} tokens
*/
function tokenize(exp){
// save the space between 2 a-z words by replacing it with ¤ and ditch other spaces
const exp2 = exp.replace(/([a-zA-Z0-9_]*)[\\s]([a-zA-Z0-9_]*)/g, '\$1¤\$2')
.replace(/[\\s]/g,'') // remove blank chars
.replace(/¤/g,' '); // recover saved space
const output = []; // saved tokens
let i;
let i0 = 0; // starting index of a word
for (i = 0; i < exp2.length; i++) {
if ('+-*/%^()=&| ><!'.includes(exp2[i])) { // if current char is a separator
if (i > i0) // save token before
output.push(exp2.substr(i0, i - i0));
if (exp2[i] !== ' ') // save separator as token if not a word separator
output.push(exp2[i]);
i0 = i + 1;
}
}
if (i > i0) // save last token
output.push(exp2.substr(i0, i - i0));

const canMerge = function(a,b){
return ('&|='.includes(a) && a.length === 1 && a === b) || // double separator
(b === '=' && '!<>'.includes(a) && a.length === 1); // >= or <= or !=
};

for (let i = output.length - 1; i > 0; i--) {
if (canMerge(output[i - 1], output[i]))
output.splice(i - 1, 2, output[i - 1] + output[i]); // remove 2 elements and add merged
else if ('+-*/%^(=&| ><!'.includes(output[i - 1]) && output[i] === '-')
output[i] = '.-'; // differentiate this operation from subtraction
}
return output;
}``````

Let's test this out with the examples described above:

``````console.log(tokenize('VAR A = 169').join('  /  '));
console.log(tokenize('WHILE var != var2 + 1 && var < (5.2 - var3) ^ -2').join('  /  '));``````

Here's the console output from the tests that just ran into your browser:

That's it, we can now move on to our next task in this adventure which is How to create a (slow) interpreter - Part 2: Shunting Yard (soon)

See you next time! Klemek
Junior software engineer

Go to top - Back to home -