With all those source lines being sent to Mecrisp, it’s time to look a little more under the hood.
Let’s start by pointing out that this article focuses on an STM32F103 µC running at 72 MHz, which is an ARM Cortex M3. Performance figures for the JeeNode Zero’s STM32L052 µC will differ, because it runs at a lower clock speed and because it’s a “simpler” ARM Cortex M0+.
Many Forth implementations use a “threaded interpreter” design, but Mecrisp is different in that it compiles Forth directly to ARM machine code. The code in embello includes a “micros” word, which returns the time of the systick hardware counter as microseconds. We can use this to measure the duration of operations - here’s how to find out how long “nop” takes:
: a micros nop micros swap - . ; a 1 ok.
So that was quick: < 1 µs! - and not so surprising, considering that the clock runs at 72 MHz.
Let’s go through the above code, just so that the logic is made very clear:
- “: a” - starts the definition of a word called “a”
- “micros” - pushes the current µs counter value on the stack
- “nop” - does nothing, even though it takes a clock cycle to do so
- “micros” - pushes the current µs counter value on the stack
- “swap” - reverses the top two items on the stack
- “-” - replaces the top two items on the stack with their difference
- “.” (yes, that’s a single period) - pops the top value and prints it
- “;” - ends the definition of “a”
- “a” - calls a
- “1” - is printed by Forth, i.e. the output from calling “.”
- “ok.” - tells us Forth is ready for the next command
It’s very important to compile the code into a definition and then run it, otherwise we’ll be measuring the lookup and compilation time. Interesting, but not what we’re after here:
micros nop micros swap - . 2483 ok.
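If we want to repeat such measurements, we can wrap the pattern in a tiny helper. This is just a sketch, not something from embello: it takes an execution token, so we’d call it as “' nop timeit” - the dictionary lookup then happens before the timing starts, and only the execution itself gets measured:

\ sketch of a reusable timing word, built on the micros word shown above
\ only suitable for words with no net stack effect - usage:  ' nop timeit
: timeit ( xt -- ) micros swap execute micros swap - . ;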
Now we can have a look at raw performance. Let’s run an empty loop a million times:
: a micros 1000000 0 do loop micros swap - . ; a 83386 ok.
That’s 83.4 ns per loop iteration, i.e. 12 MHz. Since the system clock is 72 MHz, we can deduce that Mecrisp compiles the empty loop to something which takes 6 clock cycles per iteration. It generated this code in just a few milliseconds, by the way - i.e. unnoticeable in interactive use.
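We can even let Forth check that arithmetic for us at the prompt - 83386 µs total, times 72 cycles per µs, divided by one million iterations:

83386 72 * 1000000 / . 6 ok.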
More interesting is how fast we can toggle an I/O pin. Let’s take PC13, the LED on a Blue Pill:
: a micros 1000000 0 do pc13 iox! loop micros swap - . ; a 236240 ok.
That’s 236 ns per iteration, so the pin toggles over 4 million times per second. If we subtract the 6-cycle loop overhead measured above, the pin toggle itself takes about 153 ns.
Actually, iox! is still a bit slower than just setting or clearing an I/O pin:
: a micros 1000000 0 do pc13 ios! loop micros swap - . ; a 166752 ok.
Again, without the loop overhead, setting a pin takes a mere 83 nanoseconds. Not too shabby for a self-contained µC board with an interactive prompt - all implemented in 20 KB of code!
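As a small aside, and purely as a sketch (this word is not part of embello), the set and clear words can be combined into a very short pulse:

\ sketch: emit one brief pulse on PC13 using the set/clear words above
\ note: on the Blue Pill the LED is wired active low, so "set" turns it off
: pulse ( -- ) pc13 ios! pc13 ioc! ;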
This super performance is only possible because Matthias Koch used his expertise to help the compiler perform optimally on io@, ios!, ioc!, and iox! - so now everyone benefits from it.
Now let’s dive in deeper… a lot deeper. There’s a disassembler (written in Forth, what else?), which we can load into flash memory with the following commands (it uses about 6 KB):
compiletoflash
!s ../flib/mecrisp/disassembler-m3.fs
compiletoram
With this, we can examine the actual machine code generated for “a” by Mecrisp’s compiler:
: a micros nop micros swap - . ; ok.
see a
0000CDE0: B500 push { lr }
0000CDE2: F7FA bl 00006E12 --> micros
0000CDE4: F816
0000CDE6: F7F7 bl 00004762 --> nop
0000CDE8: FCBC
0000CDEA: F7FA bl 00006E12 --> micros
0000CDEC: F812
0000CDEE: CF08 ldmia r7 { r3 }
0000CDF0: 1AF6 subs r6 r6 r3
0000CDF2: F7F7 bl 0000434E --> .
0000CDF4: FAAC
0000CDF6: BD00 pop { pc }
That’s probably a bit too much assembler code to take in, so let’s start a bit simpler:
: a ( n -- n ) 17 + ; ok.
see a
20000454: 3611 adds r6 #11
20000456: 4770 bx lr
This word takes the top of the stack and adds 17 to it. It gets compiled into a routine with just two machine instructions (the stack top is always in register 6): add 17 (hex 11) and return.
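A quick sanity check at the prompt confirms what those two instructions do (hex 11 is decimal 17), so this should print 22:

5 a . 22 ok.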
Things are about to get a bit more interesting:
: b ( n -- n ) 7 + 10 + ; ok.
see b
20000470: 3607 adds r6 #7
20000472: 360A adds r6 #A
20000474: 4770 bx lr
No surprises yet: we first added 7 and then we added 10 (hex A) more. Now watch this:
: c ( n -- n ) 7 10 + + ; ok.
see c
20000498: 3611 adds r6 #11
2000049A: 4770 bx lr
Mecrisp does constant folding: the compiler keeps track of the topmost few items on the stack, and optimises things when they turn out to be constants - in this case, calculating 7+10 at compile time.
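This works with more than two constants as well. As a sketch (this word is not from the article), all three literals below should collapse into a single add of 60 at compile time, just like “c” above:

: c2 ( n -- n ) 10 20 30 + + + ;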
To show some more code generation at work, here are three equivalent definitions:
: d ( n -- n ) 2 * ; ok.
see d
200004B4: 2302 movs r3 #2
200004B6: 435E muls r6 r3
200004B8: 4770 bx lr
: e ( n -- n ) 2* ; ok.
see e
200004D0: 0076 lsls r6 r6 #1
200004D2: 4770 bx lr
: f ( n -- n ) dup + ; ok.
see f
200004E4: 19B6 adds r6 r6 r6
200004E6: 4770 bx lr
To see how constant folding actually works, we can look at the definition of bit, which takes a number on the stack and returns an int with just that bit set:
: bit ( u -- u ) 1 swap lshift 1-foldable ;
see bit
000058DE: 2301 movs r3 #1
000058E0: 40B3 lsls r3 r6
000058E2: 461E mov r6 r3
000058E4: 4770 bx lr
The special trick here is that we’ve defined this word as “1-foldable”, meaning: if the topmost element of the stack (just one, in this case) is known to be a constant, then instead of compiling a call, Mecrisp executes that code on the constant value right away and uses the resulting value instead. It’s optimising away the operation, helped by the unique property that run-time and compile-time contexts are very similar in Forth. Here is an example of using bit:
: g ( u -- u ) 3 bit and ; ok.
see g
0000CEF8: F016 ands r6 r6 #8
0000CEFA: 0608
0000CEFC: 4770 bx lr
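We can use the same mechanism in our own definitions. As a sketch (these words are not from embello): mark a doubling word as 1-foldable, and a constant argument should then be folded away at compile time, exactly as with bit:

\ sketch: a user-defined foldable word
: double ( n -- n ) 2* 1-foldable ;
: h ( -- n ) 21 double ;  \ should compile to a literal 42, with no call to double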
This really only scratches the surface of what Mecrisp is doing under the hood. Here are a few more examples - you’ll have to know something about ARM machine code and the assembly mnemonics, but as you can see, it’s all very compact:
: aa ( -- ) 10 0 do loop ; ok.
see aa
0000CE8E: B500 push { lr }
0000CE90: F847 str r6 [ r7 #-4 ]!
0000CE92: 6D04
0000CE94: B430 push { r4 r5 }
0000CE96: 2400 movs r4 #0
0000CE98: 250A movs r5 #A
0000CE9A: CF40 ldmia r7 { r6 }
0000CE9C: 3401 adds r4 #1
0000CE9E: 42AC cmp r4 r5
0000CEA0: D1FC bne 0000CE9C
0000CEA2: BC30 pop { r4 r5 }
0000CEA4: BD00 pop { pc }
: bb ( n -- n ) if 2 else 3 then ; ok.
see bb
0000CEB2: B500 push { lr }
0000CEB4: 2E00 cmp r6 #0
0000CEB6: CF40 ldmia r7 { r6 }
0000CEB8: D003 beq 0000CEC2
0000CEBA: F847 str r6 [ r7 #-4 ]!
0000CEBC: 6D04
0000CEBE: 2602 movs r6 #2
0000CEC0: E002 b 0000CEC8
0000CEC2: F847 str r6 [ r7 #-4 ]!
0000CEC4: 6D04
0000CEC6: 2603 movs r6 #3
0000CEC8: BD00 pop { pc }
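Calling it with a non-zero and a zero argument should print the two branch results, 2 and 3:

1 bb . 0 bb . 2 3 ok.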
: cc ( -- ) 2 3 * 4 + drop ; ok.
see cc
0000CED6: 4770 bx lr
: dd ( -- ) 2 3 * 4 + . ; ok.
see dd
0000CEE2: F847 str r6 [ r7 #-4 ]!
0000CEE4: 6D04
0000CEE6: 260A movs r6 #A
0000CEE8: B500 push { lr }
0000CEEA: F7F7 bl 0000434E --> .
0000CEEC: FA30
0000CEEE: BD00 pop { pc }
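And indeed, calling dd should simply print the folded result (the literal #A, i.e. 10):

dd 10 ok.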
It may seem overdone to have Mecrisp generate machine code and do all the optimisation on the fly, but when the lower levels are this performant, it becomes feasible to construct an entire application on top, even when it’s deeply nested and layered as many tiny “words” calling each other.
Which is a good thing in ultra low-power designs: the sooner the µC is done with its periodic tasks, the sooner it can return to its sub-µA sleep mode.
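To make that layered style concrete, here’s a sketch (these words are not from embello, and it assumes a millisecond delay word “ms” is available, as used elsewhere with Mecrisp):

\ sketch only: layering tiny words on top of the io primitives above
\ assumes a millisecond delay word "ms"; the Blue Pill LED on PC13 is active low
: led-on  ( -- ) pc13 ioc! ;
: led-off ( -- ) pc13 ios! ;
: blink   ( -- ) led-on 100 ms led-off 400 ms ;
: blinks  ( n -- ) 0 do blink loop ;

With that, “5 blinks” flashes the LED five times - four tiny words, each compiled to a handful of machine instructions.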
The best of both worlds: compiler-grade performance, with interpreter-style instant coding!