Let’s look at how the tricks in the previous article end up in code, and at their performance effects. First the trace.fs utility, which needs to insert special calls at the start and end of each word:
: a 2 . ; ok.
see a
20000394: F847 str r6 [ r7 #-4 ]!
20000396: 6D04
20000398: 2602 movs r6 #2
2000039A: B500 push { lr }
2000039C: F244 movw r0 #434F
2000039E: 304F
200003A0: 4780 blx r0 --> .
200003A2: BD00 pop { pc }
ok.
That was the normal untraced code. Here is a definition with tracing added:
: ta 2 . ; ok.
see ta
20002892: B500 push { lr }
20002894: F7FF bl 2000269E --> trace-entry
20002896: FF03
20002898: F847 str r6 [ r7 #-4 ]!
2000289A: 6D04
2000289C: 2602 movs r6 #2
2000289E: F244 movw r0 #434F
200028A0: 304F
200028A2: 4780 blx r0 --> .
200028A4: F7FF bl 20002736 --> trace-exit
200028A6: FF47
200028A8: BD00 pop { pc }
Even if you’re not familiar with ARM assembly code, you can see that two calls have been inserted, for a total of 8 extra bytes of code for each compiled word. There’s not much point investigating performance, since traced code will slow down enormously while printing out all the trace messages. It’s usually not a good idea to enable tracing on lots of nested words.
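In source terms, the traced definition behaves roughly as if it had been written like this (a sketch - trace-entry and trace-exit are the calls visible in the disassembly above; how they identify the word being traced is not shown here):

```
\ rough source-level equivalent of the traced "ta" definition
: ta  trace-entry  2 .  trace-exit ;
```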
Next, let’s see how profile.fs alters the way compilation works:
: a ; ok.
see a
20000394: 4770 bx lr
ok.
: ta ; ok.
see ta
2000263A: F242 movw r0 #2600
2000263C: 6000
2000263E: F2C2 movt r0 #2000
20002640: 0000
20002642: 6AC3 ldr r3 [ r0 #2C ]
20002644: 3301 adds r3 #1
20002646: 62C3 str r3 [ r0 #2C ]
20002648: 4770 bx lr
ok.
In this case, 14 bytes get added: to load the address of a variable, fetch it, increment it, and save the result. Plus another 22 bytes for each “(cnt)” variable that gets created in the dictionary. So this does add up a bit more: we will need an extra 36 bytes per word when enabling profiling.
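The injected instructions correspond roughly to the following Forth source (a sketch - the “(cnt)” variable is the per-word counter that profile.fs creates in the dictionary, as described above):

```
variable (cnt)       \ per-word call counter, created by profile.fs
: ta  1 (cnt) +!  ;  \ prologue: load, increment, and store the counter
```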
To see the impact on performance, the micros word can be used to measure the elapsed time when calling a word, say, 1,000,000 times. The general structure for this type of measurement code is as follows (it has to be compiled, since there is a loop involved):
: blah micros 1000000 0 do <some-operation> loop micros swap - . ; blah
This saves the current time (in µs) on the stack, performs the loop, saves the new time, reverses the top two stack elements, calculates their difference, and prints the result:
: dummy micros 1000000 0 do loop micros swap - . ; dummy 83385 ok.
As you can see, the overhead of a dummy loop is 83.4 nanoseconds per iteration (83385 µs over a million passes) - this is on an STM32F103 running at 72 MHz. Let’s repeat this for 1,000,000 calls of the unprofiled empty word:
: t1 micros 1000000 0 do a loop micros swap - . ; t1 83385 ok.
Looks like that call of an empty word got optimised away - oh well. Now the profiled version:
: t2 micros 1000000 0 do ta loop micros swap - . ; t2 222359 ok.
So profiling will add 139 ns to every call where profiling has been turned on. Let’s verify:
profile
20002794 00000001 t2
2000272C 00000001 t1
200026D4 00000000 tc
2000267C 00000000 tb
20002630 000F4240 ta
ok.
Yep, t1 and t2 were called once, and ta was called $F4240 times, i.e. 1,000,000 in decimal.
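The 139 ns figure follows directly from the two timings: subtract the dummy-loop baseline and spread the difference over a million calls. The subtraction can be checked right at the prompt:

```
222359 83385 - . 138974  ok.
```

That’s 138974 µs per 1,000,000 calls, i.e. ≈ 139 ns per call.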
Keep in mind that words tend to be small and numerous in well-written Forth code, so this will add up. But if you define all the low-level words before including the profile.fs code, and don’t need to track their counts, then in many cases this sort of overhead should be no problem at all: 7 million calls will still add less than one second to the execution time.
Not bad for a compiler + runtime + interactive prompt, running on a tiny little 64K flash µC!
The last utility we’ll look at here is the routes.fs code, which allows after-the-fact patching of words compiled to flash memory:
: ta ; ok.
see ta
200027EE: F242 movw r0 #2780
200027F0: 7080
200027F2: F2C2 movt r0 #2000
200027F4: 0000
200027F6: 6E03 ldr r3 [ r0 #60 ]
200027F8: F847 str r6 [ r7 #-4 ]!
200027FA: 6D04
200027FC: 461E mov r6 r3
200027FE: B500 push { lr }
20002800: 2E00 cmp r6 #0
20002802: D001 beq 20002808
20002804: F847 str r6 [ r7 #-4 ]!
20002806: 6D04
20002808: 2E00 cmp r6 #0
2000280A: CF40 ldmia r7 { r6 }
2000280C: D003 beq 20002816
2000280E: F642 movw r0 #2F23
20002810: 7023
20002812: 4780 blx r0 --> execute
20002814: BD00 pop { pc }
ok.
There’s no need to show the compiled code for a again; it’s the same as in the profiled case. The patchable version now uses 40 bytes instead of 2, plus again 22 bytes for each “(vec)” variable definition, for a total of 60 bytes of overhead per word.
Note that these examples are for RAM-based code. When compiled to flash, all addresses will end up in the lower 64K of memory, reducing calls and variables from 8-byte to 4-byte code.
The overhead for re-routable code is as follows:
: t micros 1000000 0 do td loop micros swap - . ; t 555866 ok.
That’s 472 ns of overhead per call to a re-routable word (555866 - 83385 = 472481 µs over a million calls). The reason for this is probably that the logic in each word is more complex: we not only have to fetch the patched value, we also have to check whether it is set (i.e. ≠ 0), and if so: call the replacement code and exit our original code prematurely. This requires an extra -dup, if/then, and exit. As you can see, this adds up.
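In source terms, the re-routable prologue behaves roughly like this (a sketch - “(vec)” is the per-word variable mentioned above, holding the address of the replacement word, or zero when unpatched):

```
variable (vec)   \ set by routes.fs when the word gets patched
: ta  (vec) @ -dup if execute exit then  ( ... original body ... ) ;
```

When (vec) is zero, -dup leaves a single zero for if to consume and the original body runs; once patched, execute runs the replacement and exit skips the rest.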
But let’s keep it all in perspective: both profiling and re-routing add well under a microsecond of overhead. It’s still virtually unnoticeable - until you start making millions of “instrumented” calls.