Articles on JeeLabs

The article about the ESP-Link ended with what turned out to be an ominous note, in hindsight:

There’s still a buglet in this setup: the “reset” word leads to a runaway loop of “Unhandled Interrupt 00000003” - but this can be recovered through […]

Something really strange was happening:

toggling the DTR pin from the ESP-Link did correctly reset the Olimexino
but doing a software reset sent the Olimexino into a tail spin

And here were the crazy bits:

the software reset worked fine when DTR from the ESP-Link was not connected
once in this endless fault loop, even pressing the RESET button did not work (!)

To summarise: the ESP-Link itself works brilliantly. It gets data across between the Olimexino serial FTDI port and WiFi, in both directions and at full 115,200 baud speed, without dropping a single byte. The Reset button on its “Console” web page allows getting control back from any runaway code loop or crash, and does this again without any hiccups whatsoever.

But enter the word “reset” followed by a newline, and you get this:

The yellow trace is the RESET (DTR) pin, blue is serial data from ESP to STM, and green is serial output data, i.e. from STM to ESP. Note that the three traces overlap to fit on the screen.

First the characters “reset” are sent and echoed, then a brief delay, then a CR is sent (and echoed as a space), and then… total havoc!

With the RESET / DTR pin disconnected, it all works exactly as expected:

The Mecrisp welcome greeting, and a little later a custom message from the “init” code.

The crucial observation: how on earth can a software reset behave differently when a hardware pin connection is changed? (thx to Matthias K for pointing this out) - a power glitch? noise?

After many, many hours of head-scratching, it turns out that the answer is hidden in this sneaky little diagram, mentioned in STM’s reference manual (RM0008, p.90):

The RESET pin is not just an input pin, it’s also driven low when a reset is generated internally! - with the pin tied to the “stiff” output pin on the ESP-Link, it was fighting against a high logic level and ended up causing a hardware fault interrupt. Note that the µC continued to work as it was sending out messages over serial at the proper baud rate. It just never reached a reset state…

Once identified, the solution was trivial: just add an extra diode, so that the ESP can only pull the µC’s RESET pin down. Now, internal resets no longer face the “1” output level when pulled down, and instead perform a clean reset, shaped by the RC circuit on the Olimexino board:

And indeed, everything works with the diode - here is the RESET pin during a software reset:

Spot-on: 10 kΩ + 100 nF = 1 ms RC rise time, at which point the RESET pin is ≈ 67% of Vcc (ignore the the scope’s 1.80 ms value for tr: it’s calculated from the 10%/90% points).

For the “JEM” JeeLabs Energy Monitor, we’re going to need to put the ADC on the Olimexino’s STM32F103 to some serious work: the goal is to acquire 4 ADC channels at 25 Khz each, so that we can capture a full cycle of the 50 Hz AC mains signal with a resolution of 500 samples, as well as collecting the readings of up to three current transformers.

Since AC mains voltage is being sampled via the negative peaks of the incoming 9V AC supply, we really only get half cycles, with flat segments in between. To be able to reconstruct a full cycle, we need to capture at least 3 segments: in the worst case, two flat ones with only one complete negative cycle. This requires a data sampling window of at least 30 ms.

As described earlier, we’re going to aim for the following setup:

single ADC, acquiring 4 channels every 40 µs
for each channel, two buffers of 800 samples
this gives an acquisition time of 32 ms per buffer

The STM32F103 has a very capable ADC subsystem, as seen in this diagram from the datasheet:

To distill some other relevant info from the datasheet for our use case:

the ADC can take up to 1 million samples per second
it’s slightly less when running at 72 MHz (max ≈ 850 Ksps)
there’s a “SCAN” mode to read 1..16 specific ADC channels in rapid succession
the ADC can be triggered to run from a hardware timer, set to 25 KHz in this case

So this means we’re getting one new ADC reading every 10 µs on average. There is one catch: in scan mode, the ADC can only be used in combination with DMA, which makes sense since these data rates would completely overwhelm the CPU if handled through interrupts.

A benefit of using a hardware timer + DMA is that the ADC acquisition timing will be rock solid.

That DMA controller itself is an equally sophisticated part of the µC chip, by the way:

Note that both diagrams include hardware which is not on the “low end” STM32F103RB used on the Olimexino-STM32 board, which has only one ADC and one DMA unit.

It takes quite some reading in the (1137-page!) reference manual for the STM32F1xx chips, to figure out all the settings needed to implement the above acquisition mode. Then again, once that’s done, the code is remarkably short.

Here’s the basic DMA-based acquisition cycle to keep the ADC permanently running:

: adc1-dma ( addr count pin rate -- )  \ continuous DMA-based conversion
  3 +timer        \ set the ADC trigger rate using timer 3
  +adc  adc drop  \ perform one conversion to set up the ADC
  2dup 0 fill     \ clear sampling buffer

    0 bit RCC-AHBENR bis!  \ DMA1EN clock enable
      2/ DMA1-CNDTR1 !     \ 2-byte entries
          DMA1-CMAR1 !     \ write to address passed as input
  ADC1-DR DMA1-CPAR1 !     \ read from ADC1

                0   \ register settings for CCR1 of DMA1:
  %01 10 lshift or  \ MSIZE = 16-bits
   %01 8 lshift or  \ PSIZE = 16 bits
          7 bit or  \ MINC
          5 bit or  \ CIRC
                    \ DIR = from peripheral to mem
          0 bit or  \ EN
      DMA1-CCR1 !

                 0   \ ADC1 triggers on timer 3 and feeds DMA1:
          20 bit or  \ EXTTRIG
  %100 17 lshift or  \ timer 3 TRGO event
           8 bit or  \ DMA
           0 bit or  \ ADON
        ADC1-CR2 ! ;

It’s not so important at this stage how this works, just what it does:

a buffer + length is passed in, where the DMA unit will deposit all its readings
the DMA unit is set up to fill this buffer in circular mode, going on forever
the ADC is set up to acquire data on every timout of timer 3 at a specified rate

This was created in an earlier experiment, titled Reading ADC samples via DMA to implement an oscilloscope. That was for a single channel, whereas here we need four. Luckily, we can keep that DMA code as is and modify the ADC settings on the fly to switch to 4-channel scan mode:

: quad-adc ( -- )  \ configure ADC and DMA for quad-channel continuous sampling
  +adc  6 us  adc-calib 
  adata #abytes VAC-IN arate-clk adc1-dma
  VAC-IN adc#                 \ channel 0
  CT1    adc#  5 lshift or    \ channel 1
  CT2    adc# 10 lshift or    \ channel 2
  CT3    adc# 15 lshift or    \ channel 3
              ADC1-SQR3 !     \ set up the ADC scan channels
  3 20 lshift ADC1-SQR1 !     \ four scan channels
         8 bit ADC1-CR1 bis!  \ enable SCAN mode
;

The above code depends on a number of constants, defined as follows:

                               4 constant #adcs
                             800 constant #asamples
                               2 constant #abuffers
#adcs #asamples * #abuffers * 2* constant #abytes
                              40 constant arate-us
                   arate-us 72 * constant arate-clk

It also needs this definition of a 12.8 KB buffer to store all acquired data in:

#abytes buffer: adata

Note that the timer and DMA settings have not changed: timer 3 will fire once every 40 µs and trigger a burst of four ADC conversions, one for each channel. Each completed conversion then triggers a DMA transfer, filling up the circular buffer four times faster than before.

All this code looks complex, and of course in a way it is indeed - this is a complex use case for the ADC + DMA hardware contained in the µC, after all! But in actual use it couldn’t be simpler:

quad-adc

That’s it. Now - magically - the adata buffer will be continuously filled with ADC samples from all four channels, without the CPU doing any work at all. It’s all happening in the background, and perhaps most surprising of all: the current drawn for all this extra activity is only 2 mA!

The processing overhead is negligible: one 16-bit read and one 16-bit write by the DMA unit - once every 10 µs on average. Since both ADC and SRAM are on the fast internal bus, this will occupy that internal data bus less than 0.3% of the time.

We do need to be careful with timing and synchronise our processing to avoid DMA changing values while we’re still using them. This is solved by inspecting a few status bits in the DMA controller: there is one bit for when the buffer has been filled halfway and another bit when the buffer is full and the DMA unit starts over from the beginning. These happen 40 µs x 800 samples = 32 ms apart, so we can simply poll this in the main loop of our application. There is even no need to introduce interrupts - 32 ms is a very long time for a µC running at 72 MHz.

At the halfway point, we have 32 ms to process the 1st buffer. At the end point, we have another 32 ms to process the 2nd buffer. And so on. This is the circular equivalent of double buffering.

Another subtle issue, is that we can no longer use the ADC in polled mode. To read out the LiPo voltage for example, we need to somehow make the ADC read out an extra channel, without interfering with the above high-speed acquisition cycle. As it happens, the designers at STM thought of that too, and came up with the concept of “injected data channels”: it’s possible to make the ADC acquire 1..4 extra channels, and have it place the results in separate registers.

Using this mechanism, we could specify that we want to read PB0 as well for example (once!), and then simply wait for the ADC scan to pick that request up after it has taken care of all the regular channels. This will allow reading out a few other analog pins, with at most 40..50 µs delay - the worst-case time needed by the ADC to start again and process our “injected” request.

As you can see, modern ARM µCs are a lot more than just a CPU-with-some-memory!

Not long ago, Ken Boak very generously donated one of his assembled PCB designs to JeeLabs:

This is a break-out board for the STM32F746VG, an ARM Cortex M7 CPU with floating point and a whopping 1 MB flash + 320 KB RAM, all in a 100-pin SMD package.

Lots of I/O hardware, including USB and Ethernet, lots of analog I/O with three ADCs capable of millions of samples per second each, and a dual DAC. Lots of UART/I2C/SPI too, of course.

But the most interesting aspect of this chip, versus the lowly STM32F103 chip used in the HyTiny and Olimexino, is perhaps its speed: the STM32F7 series can run at up to 216 MHz, three times as fast as the F103. On first thought, it might seem that this would translate to “simply” running three times as many instructions in the same amount of time. Not so:

This is what the different columns represent:

µs/10k = microseconds to run 10,000 iterations of the loop
clk/loop = processor clock cycles per single loop iteration
iter/µs = iterations per µs (the same as: million iterations per second)
speedup = performance increase of F746 @ 216 MHz over F103 @ 72 MHz
efficiency = performance increase specific to Cortex M7 vs Cortex M3

That last column is the most interesting one: it compares the measured performance of some simple loops in Mecrisp Forth while dividing out the clock rate. So an empty loop runs about 4 times faster than could be explained by the clock speed difference alone.

The most likely explanation is a better cache, a better processing pipeline, or a better lookahead optimiser - or more likely: a mix of all this. Getting to the bottom of this would require much more investigation - for now, the point was simply to show how advances in µC technology can lead to more-than-linear performance increases.

The code used for the above timing results was as follows (running from RAM):

10000 buffer: buf

: j0 micros 10000 0 do                         loop micros swap - . ;
: j1 micros 10000 0 do nop                     loop micros swap - . ;
: j2 micros 10000 0 do 1 i buf + c!            loop micros swap - . ;
: j3 micros 10000 0 do     buf   c@       drop loop micros swap - . ;
: j4 micros 10000 0 do   i buf + c@       drop loop micros swap - . ;
: j5 micros 10000 0 do   i buf + c@ dup + drop loop micros swap - . ;
: j6 micros 10000 0 do   i buf + c@ dup * drop loop micros swap - . ;
: j7 micros 10000 0 do   i buf + c@ dup / drop loop micros swap - . ;
: jn j0 j1 j2 j3 j4 j5 j6 j7 ;

It’s not a very comprehensive timing suite - just a quick set of explorations which came to mind. Let’s not even try to suggest that this would be representative in any way or for any purpose.

One aspect stands out, though: the amazing speed of this code. It can be typed into the console interactively, yet the resulting performance levels are orders of magnitude higher than other interactive languages, which tend to be interpreted (especially in such a constrained µC context).

The range of power consumption modes is equally impressive, from drawing a few dozen mA when the F103 runs at 72 MHz and about 150 mA when the F746 runs at 216 MHz, to just a few microamps when entering standby mode. Computers have come a long way since the PDP-8!

P.S. - Here is a different kind of performance comparison: running 1,000,000,000 iterations of an empty loop takes about 26 s on an STM32F746 @ 216 MHz, 7 s on a core i7 @ 2.8 GHz, using Qemu in a Linux VM (via qemu-arm-static), and 1 s on an Odroid C1+ @ 1.7 GHz. Whereby those last two both use the Linux ARM build (all these tests were done with “Mecrisp 2.2.5 RA”).

There are three pulse counters for measuring power at JeeLabs - one for solar PV production and two for the kitchen stove and the rest, respectively:

These generate 2000 pulses per kWh, that’s one pulse per 0.5 Wh, and are optically isolated. Reading them out is super simple: add a 1 kΩ series resistor and power them from 3.3V .. 5V. The result is a series of clean “1” pulses, each 100 ms long (there’s no contact bounce).

At the maximum rated current of 16A each, which corresponds to 3680 Watt for a nominal 230 Vac feed, we get about 2 pulses per second. With room to detect surges at least 5 times as high.

The main trick is to measure the time between these pulses fairly accurately, as this provides a measure of the actual current consumption. With 1s between pulses, we know the power is 1800 W, and with 10s between pulses, it’ll be 180 W - averaged out over those periods, that is.

It’s easy to measure time in a µC, especially on a millisecond scale. There’s a SysTick counter in the ARM µC, which is set up to run at 1000 Hz, i.e. one tick per millisecond. See this code.

So all we need to do is set up three “external interrupts” to trigger on the rising edge, and then count and timestamp each event:

0 0 2variable pulses1  \ last millis and pulse count #1
0 0 2variable pulses2  \ last millis and pulse count #2
0 0 2variable pulses3  \ last millis and pulse count #3

: ext3-tick ( -- )  \ interrupt handler for EXTI3
  3 bit EXTI-PR !  \ clear interrupt
  millis pulses1 1 over +! cell+ ! ;

: ext4-tick ( -- )  \ interrupt handler for EXTI4
  4 bit EXTI-PR !  \ clear interrupt
  millis pulses2 1 over +! cell+ ! ;

: ext5-tick ( -- )  \ interrupt handler for EXTI9_5
  5 bit EXTI-PR !  \ clear interrupt
  millis pulses3 1 over +! cell+ ! ;

: count-pulses ( -- )  \ set up and start the external interrupts
       ['] ext3-tick irq-exti3 !     \ install interrupt handler EXTI 3
       ['] ext4-tick irq-exti4 !     \ install interrupt handler EXTI 4
       ['] ext5-tick irq-exti5 !     \ install interrupt handler EXTI 5-9

               9 bit NVIC-EN0R bis!  \ enable EXTI3 interrupt 9
  %0010 12 lshift AFIO-EXTICR1 bis!  \ select P<C>3
                3 bit EXTI-IMR bis!  \ enable PC<3>
               3 bit EXTI-RTSR bis!  \ trigger on PC<3> rising edge

              10 bit NVIC-EN0R bis!  \ enable EXTI4 interrupt 10
            %0010 AFIO-EXTICR2 bis!  \ select P<C>4
                4 bit EXTI-IMR bis!  \ enable PC<4>
               4 bit EXTI-RTSR bis!  \ trigger on PC<4> rising edge

              23 bit NVIC-EN0R bis!  \ enable EXTI9_5 interrupt 23
   %0010 4 lshift AFIO-EXTICR2 bis!  \ select P<C>5
                5 bit EXTI-IMR bis!  \ enable PC<5>
               5 bit EXTI-RTSR bis!  \ trigger on PC<5> rising edge
;

This code is only so long because of the repetition. It’s all fairly straightforward, once you go through the reference manual and find all the register settings.

Once count-pulses has been called, we end up with three variables of 2 words each, containing automatically-updating pulse counts and the last millisecond timestamp.

And because this is based on interrupts and running in the background, we still have Mecrisp’s interactive command loop to peek and poke around, and look at these variables:

pulses1 2@ . . 10 8819  ok.
pulses2 2@ . . 9 12928  ok.
pulses3 2@ . . 29 17537  ok.

Pulse counter #1 has pulsed 10 times since started, the last one being 8819 milliseconds since the last µC reset. It’s all working like a charm and it doesn’t involve any code or attention to keep this running, all we need to do is pick up these values when we want to report them. Onwards!

This is a pure sine wave, captured by the ADC + DMA code, as described previously:

The plot above consists of 800 samples, sampled 40 µs apart, i.e. at 25 kHz - for a total of 32 ms. A quick calculation would seem to indicate that we’re seeing 1.6 cycles of a 50 Hz sine wave.

Except that it’s not… the incoming signal used here was a 24,950 Hz sine wave!

There is no way to tell what the frequency of a sampled signal is without further information. The reason for this is aliasing, an important aspect in any situation where continuous signals are sampled - as in the case of an ADC. We’ll get the same result with 25050/49950/50050/… Hz.

The math behind all this is locked up inside the Niquist-Shannon sampling theorem, but the intuition for this phenomenon is actually quite easy to pick up.

Here is a high-frequency pure sine wave, sampled at a - too low - rate (SVG from Wikipedia):

Each successive sample is picking up a slightly earlier piece of the sine wave. Unfortunately, when you drop the real signal (the red line) and look at only the sampled value (the black dots), it all ends up suggesting an aliased sine wave of a much lower frequency (the blue line). Yet this last waveform is totally fake - it is not present in the original signal!

When sampling at frequency X, signals of frequency Y, X-Y, X+Y, 2*X-Y, 2*X+Y, … all look the same - no matter how fast that ADC circuit is, or how clean and noise-free that input signal is.

The fact that the 24,950 Hz capture above looked so much like a 50 Hz signal is in fact a tribute to the accuracy of the sampling obtained by running our ADC off a hardware timer. If you think about it: any jitter in the timing of the sampling interval would lead to a highly distorted sample in the case of 24,950 Hz, since the real input signal varies much more quickly than with 50 Hz.

When X is 25,000 Hz, any frequency from 12,501 Hz to 24,999 Hz will “flip over” and alias back into our sampled data as if they were signals of 12,499 Hz to 1 Hz, respectively. And the same will happen over and over again for any frequency above X.

This also points to a solution: if we filter out high frequencies before sampling, then all is well. As long as we keep all frequencies > X/2 out of the ADC. This is called the Nyquist frequency.

Ideally, we’d need a perfect low pass, which passes all signals under X/2 as is, and suppresses everything above. In the real world, such a filter does not exist, but we can choose a filter which starts filtering at X/4 or even X/10, and rely on the roll-off of a simple RC filter to do the job. With a sample and hold capacitor on the input of an ADC, a well-chosen resistor can also work.

There is also another approach to avoid aliasing: assume that we know that frequencies > Z are not present in our signal. Perhaps the properties of the input circuit are such that they already limit the frequency response. Then we could oversample at a higher rate 2*Z, perform some digital filtering, and then throw away the extra samples to end up with exactly the data rate we need. This last step is called decimation. This doesn’t change the fact that sampling should never see frequencies above half the sampling rate - we’ve merely moved part of the work over to DSP.

For the JeeLabs Energy Monitor prototype, the plan is to sample at 25 KHz - so we have to block all signals above 12.5 KHz. What this means for the actual circuit still needs to be determined…

Until now, most of the wireless sensor nodes here at JeeLabs have been using a simple “map C/C++ struct as binary” approach as payload format. The advantage of this is that it simplifies the code in C (once you wrap your mind around how structs and binary data are stored in memory, that is) - but it’s also a bit C-specific, compiler-dependent, and not truly flexible.

With Forth now taking over both sides of the wireless RF link, it’s time to revisit this approach.

Here is a new design, which can efficiently package and send a number of numeric values in a generic form, and which is also easy to decode at the receiving end. This format and design is not language-specific, in fact it has been the basis of several commercial and open-source products since 1991. It’s also used in the p1scanner code in a previous release of HouseMon.

The main idea is to transform integers into variable-length byte sequences. A packet is then simply a series of such values, starting with a packet format ID.

This particular encoding represents small positive integers more compactly than larger ones. And it’ll work with arbitrarily large integers (the current Forth implementation can deal with up to 64-bit values). Negative integers can be handled in several ways, to be described further on.

The following example is for 32-bit integers, the most common size on ARM µC’s, since that’s the size of a native int and of a value on the Forth stack.

Here is how a 32-bit integer is converted to a sequence of 1..5 bytes:

As you can see, bits are grouped 7 at a time, with the high bit set only in the last byte. So all a decoder has to do is to advance through input bytes until it finds one with the high bit set, while accumulating and shifting the result 7 bits at a time.

Here’s the trick: leading zero’s are skipped. Encoded byte sequences never start with a 0-byte:

If the entire value is 0, then this emits a single byte with only the high bit set, i.e. 0x80.

For negative values, with all the high bits set to one, we have several options:

just ignore the issue, and accept that negative values will always be encoded as 5 bytes
add an offset to make it positive (and subtract it after decoding) - probably the simplest solution, but it requires knowledge in the decoder about which values to adjust and how
convert the value N to abs(N) * 2 + sign(N) - and convert it back in the decoder

This approach leads to a compact binary data packet for simple integer values. By adding the convention that the first integer (a single byte, i.e. 0..127) represents the type of packet, we can play various tricks to make things more compact for certain cases.

Here is an actual encoded packet sent out by the JeeLabs Energy Monitor:

8102A8078E808895

There are 6 bytes with the high bit set, therefore 6 values:

0x81 = 1
0x02A8 = 2 * 128 + 40 (i.e. 0xA8-0x80) = 296
0x078E = 7 * 128 + 14 (i.e. 0x8E-0x80) = 910
0x80 = 0
0x88 = 8
0x95 = 21

One potentially very useful property of this encoding is that the 0-byte never appears as the first byte in these variable-byte encodings. Which means that the 0-byte can be inserted as special “escape mark” for all sorts of purposes. It could be followed by an alternate representation for a small string for example, or even be the start of a much richer “separator” convention to allow encoding complete JSON data structures. Note that 0-bytes can appear inside byte sequences when some of the intermediate 7 bit groups are all zero.

And since only the last byte of each encoded byte sequence has its high bit set, we could also traverse multiple values in reverse order, if needed.

For now, the special 0-bytes are not used. They leave the door open for future enhancements.

The smart meter at JeeLabs looks like this:

It’s a Landis & Gyr E350, which monitors all power coming into the house and going out (when solar PV production exceeds local consumption). There’s an RJ12 jack on the bottom right, with serial data coming out at 9,600 baud (newer units send at 115,200 baud).

Every 10 seconds, a “telegram” of information is sent out, which looks something like this:

/XMX5XMXABCE000046099

0-0:96.1.1(30313337323430332020202020202020)
1-0:1.8.1(00003.540*kWh)
1-0:1.8.2(00011.199*kWh)
1-0:2.8.1(00000.000*kWh)
1-0:2.8.2(00004.667*kWh)
0-0:96.14.0(0002)
1-0:1.7.0(0000.35*kW)
1-0:2.7.0(0000.00*kW)
0-0:17.0.0(999*A)
0-0:96.3.10(1)
0-0:96.13.1()
0-0:96.13.0()
0-1:96.1.0(3131323838323030303336383037303132)
0-1:24.1.0(03)
0-1:24.3.0(121129160000)(00)(60)(1)(0-1:24.2.0)(m3)
(00014.684)
0-1:24.4.0(2)
!

This unit has been in operation since end 2012, with a JeeNode attached to pick up P1 data and send out wireless RFM12 packets, using the same variable format described in a recent article:

OK 18 129 1 83 111 232 1 47 58 201 1 55 1 142 3 26 45 233 130 144 [...]

The sketch used to extract data from P1 packets is called p1scanner and can be found on GitHub.

Here is an essentially equivalent re-implementation in Mecrisp Forth:

       8 constant p1#
p1# cells buffer: p1.buf
       0 variable p1.type
       0 variable p1.value

: p1clear p1.buf p1# cells 0 fill ;
: p1save ( pos -- ) cells p1.buf +  p1.value @ swap ! ;
: p1dump cr p1# 0 do i cells p1.buf + @ . loop ;

: p1select ( type -- )  \ these values are for a Landys & Gyr E350 meter:
  case
      181 of 0 p1save endof  \ cumulative electricity consumption, normal
      182 of 1 p1save endof  \ cumulative electricity consumption, low
      281 of 2 p1save endof  \ cumulative electricity production normal
      282 of 3 p1save endof  \ cumulative electricity production low
    96140 of 4 p1save endof  \ tariff
      170 of 5 p1save endof  \ actual consumption
      270 of 6 p1save endof  \ actual production
     2420 of 7 p1save endof  \ cumulative gas consumption
  endcase ;

: p1char ( c -- )
  case
    [char] / of p1clear endof
    [char] : of 0 p1.type ! 0 p1.value ! endof
    [char] ( of p1.type @ 0= if p1.value @ p1.type ! then 0 p1.value ! endof
    [char] ) of p1.type @ p1select endof
    [char] ! of p1dump endof
             dup digit if p1.value @ 10 * + p1.value ! then
  endcase
;

: p1test
  begin
    uart-irq-key? if uart-irq-key p1char then
  key? until ;

The p1select word filters specific values in the data and stores them in an array of 8 entries. The p1test word simply listens to the 2nd serial port, and feeds all incoming characters to p1char. This 2nd port uses interrupts with a ring buffer to avoid losing incoming data. So at 9600 baud and with a 128-byte ring buffer, serial processing needs to start within ≈ 120 ms.

The logic of p1char is the same as the p1_scanner() function in the original C++ code. It plays the same tricks to ignore many of the incoming characters, only triggering on a few specific ones, while also identifying and parsing each numeric value.

Here is a test run, with p1test reporting on Mecrisp’s serial UART1 console, while the above test packet was manually pasted three times into a second terminal session tied to UART2:

p1test 
3540 11199 0 4667 2 35 0 14684 
3540 11199 0 4667 2 35 0 14684 
3540 11199 0 4667 2 35 0 14684

As you can see, all the important values have been properly isolated and parsed, ready to be sent out in the reporting section of the JeeLabs Energy Monitor’s code.

The actual hookup will need some more testing. The interface is described in an old P1 revisited weblog post, but it’s not clear yet whether this will also work at 3.3V and perhaps even without inverting transistor stage. Some more experimentation is needed…

With an increasing number of sensing and reporting activities taking place on the JeeLabs Energy Monitor (JEM) prototype, things are starting to become a bit more complicated.

How can we deal with such a multitude of tasks, each with their own timing requirements?

The traditional (or perhaps one should say: modern?) answer to this is to include a Real-Time Operating System, which has built-in task switching and can offer some hard guarantees on how quickly a task can be triggered to run on specific external events.

Forth, on the other hand, has very early on (and that means decades ago!) implemented a very low-key “cooperative” form of multi-tasking, whereby tasks voluntarily pass control to other tasks in a round-robin fashion.

There is much to be said in favour of this approach, over the “pre-emptive” style, which can stop and start tasks in very abrupt ways. The benefit of pre-emption, is that it’s better at guaranteeing a specific maximum response time. The drawback is that it massively complicates the code, to avoid getting interrupted in troublesome ways, such as in the middle of incrementing a counter.

One advantage of collaborative vs. pre-emptive, is that processing becomes deterministic again. This makes it far easier to reason about what is happening, and in what order.

With the collaborative approach, all tasks must be written in such a way that they periodically relinquish control. The longer some (any!) task waits to do so, the longer the worst-case delay will be when servicing pending requests.

In the case of JEM, all the hard timing requirements are relatively lenient, i.e. in the order of several milliseconds. As long as a task never spends more than a few ms before passing control to the next task, we’ll be fine.

The key trick is to handle all really strict timing demands with interrupts or DMA:

acquiring 4 ADC channels @ 25 KHz: this is handled by DMA, with buffers large enough to allow servicing within 10..20 ms, and requests coming in once every 30 ms on average
counting pulses on 3 pins: this is handled using external interrupts, which perform all critical timing and counting tasks - there are no strict requirements for further processing
reading and parsing serial data from the smart meter’s P1 port at 9600 baud - this uses interrupts with a 128-byte buffer, which can hold over 100 ms of incoming data
parsing the DCF77 radio pulse stream - these come in at 1 pulse per second, each pulse must be timed to distinguish between 0.1 and 0.2 second wide - this has not yet been implemented, but can probably be done with a hardware timer interrupt once every 10 ms
sending out wireless packets every few seconds - this is so relatively slow that it can easily be done in a main loop, while keeping track of elapsed time

None of these tasks take much processing time. But there is one activity missing: 6) processing the four acquired ADC signals to detect the zero crossings, and to calculate voltage times current for each of the three Current Transformers. It should be relatively easy to relinquish control at least once per millisecond, even when some calculations might take much longer than that.

The Mecrisp multitasker can be found here. It’s written in Forth and supports dynamically adding and removing tasks, as well as waking up tasks from interrupt handlers. It’s very lightweight and works by having tasks call the built-in “pause” word once in a while.

Multi-tasking needs one stack per task (eh, two in Forth: a data and a return stack). These stacks must be sized for the worst case, i.e. maximum stack use (including interrupts). Allocating a task and its stack(s) for each item in the above list would require quite a lot of RAM space, but we can in fact do it all with just two tasks: 1) the command interpreter, and 2) everything else.

All we need is a way to “frequently enough” check a few cases, and trigger some activity when processing is required. Each of these cases can be dealt with sequentially. There’s no need to interrupt workflows in the middle of what they’re doing, and resume them later.

So what we can do is keep one task for Mecrisp’s command-line interpreter, and perform all the real work in a single second task. This way, we can continue to interactively type in commands, while all the main JEM activity continues in a separate task - i.e. in the background, essentially.

Here’s a general outline of how that second task in JEM could be structured:

a main loop, which processes ADC data when a new buffer has been collected
this main loop it needs to call a “chores” word at least once every millisecond or so
this chores word is set up to go through (i.e. call) several different, ehm… chores
important: each chore must leave the data and return stacks unaffected, once it is done
each chore checks for some condition, i.e. time to send an RF packet, or time to report the pulse counts, or new P1 data has arrived, etc.
when needed, each of these chores can do some processing, as long as it takes no more than say one millisecond (send an RF packet, parse P1 data, etc)

This approach is considerably simpler than switching between multiple independent tasks. There is merely a main loop, which branches off to do a few other things once in a while.

The reason this approach should be good enough here, is that we’ve been careful to do all time-critical work in interrupts or via DMA. It’ll be ok if some chores take a few ms once in a while.

There’s a major convenience with the above design w.r.t. development: it allows us to continue entering Forth commands at any time, using the Mecrisp UART1 console. This includes peeking and poking in a running system, but also restarting or even re-flashing the JEM board.

Mecrisp Forth comes with a serial-port command line interface. This makes both tinkering and uploading new code a breeze, but it’s nevertheless a fairly limited setup:

no input buffering: if characters come in while the code is busy, they can get lost
no output buffering: sending any text to the console will block until all data is sent
the greeting sent to USART1 cannot be changed or redirected (at least in Mecrisp 2.2.7)
even if the console is reconfigured later on, a reset will still revert to USART1

It’s very easy to redirect console I/O, using a built-in mechanism to re-vector 4 essential words:

key?            ( -- Flag ) Checks if a key is waiting
key             ( -- Char ) Waits for and fetches the pressed key
emit?           ( -- Flag ) Ready to send a character ?
emit            ( Char -- ) Emits a character.

This can be done by assigning new handlers to these corresponding 4 variables:

hook-key?       ( -- a-addr )   terminal IO
hook-key        ( -- a-addr )     on the fly
hook-emit?      ( -- a-addr ) Hooks for redirecting
hook-emit       ( -- a-addr )

After reset, those variables are set as follows to use USART1 in polled mode:

' serial-key?  hook-key?  !
' serial-key   hook-key   !
' serial-emit? hook-emit? !
' serial-emit  hook-emit  !

If we want to change to an interrupt-based USART2 driver, for which an implementation has been created here and here, all we need to do is include those files and add this init code:

compiletoflash

: init ( -- )
  init  1000 ms  key? if eraseflash then  \ safety escape hatch
  +uart-irq
  ['] uart-irq-key? hook-key?  !
  ['] uart-irq-key  hook-key   !
  ['] uart-emit?    hook-emit? !
  ['] uart-emit     hook-emit  !
  cr init ;

This points the input vectors to the interrupt-based driver, and the output vectors to the (polled) driver for USART2. Note the compiletoflash - this code needs to be in flash to survive a reset.

The first line allows recovering from this setup. With “init”, it’s extremely important to prepare for the worst, as this code gets called after every reset. If there is any error in this code, we’ll never get control back! With the extra line, we can hit a key on USART1 to restore Mecrisp to its original state and remove this additional init word.

The above code relies on other code to generate a 1000 ms delay, which is why there needs to be a call to an earlier initinside the above code. In addition, init is called again just before exit, so that the custom greeting gets sent to the new console output device, i.e. USART2.

The above works well: on power-up and reset, the console is automatically adjusted to USART2, with all input stored in a ring buffer, so that incoming data is no longer at risk of being dropped.

But there’s still a risk: if we enter any of eraseflash, eraseflashfrom, or flashpageerase - then we could lose console access via USART2, since this can wipe out the above init override.

The simplest solution is: don’t do that… i.e. never enter these commands if you want to keep the console functioning as is. It can be inconvenient, but luckily we can still easily erase the last few definitions in flash and replace them with new ones using a cornerstone, defined as follows:

: cornerstone ( "name" -- )  \ define a flash memory cornerstone
  <builds begin here dup flash-pagesize 1- and while 0 h, repeat
  does>   begin dup  dup flash-pagesize 1- and while 2+   repeat  cr
  eraseflashfrom ;

Followed by this re-definition (thanks to Matthias Koch for this neat suggestion):

cornerstone eraseflash

What this does is to create a “reference point” for clearing flash definitions. When called, it will partially clear flash memory and remove all definitions entered after this one. Since this name overrides the earlier definition by appearing later in the dictionary, it effectively hides that older one. So from now on, typing eraseflash will clear flash memory, but keep the USART2 console implementation and the corresponding init definition.

This does not prevent calling the other two erasing words, but those are harder to use and intended for internal reference anyway (such as in the cornerstone definition itself).

What if we want to revert the code and restore a pristine USART1-based Mecrisp setup? Well… given that all previously-defined words are still present in the dictionary, that’s also possible:

enter the “words” commands and look for the address of the original eraseflash word
then enter “<address> execute” to run it - or, alternatively…
enter “$4000|$5000 eraseflashfrom” (the address depends on the Mecrisp build)

To summarise: with the above tricks, we can make Mecrisp (semi-) permanently use a different console I/O channel (serial or anything else, really), yet still regain control and restore the original polled USART1 implementation when absolutely needed.

The only risk is when we mess up and install an incorrect init word: in the worst case, we lose console access for good and can’t recover anymore. At that point, there will be no other recourse than to re-flash the entire µC memory by other means (see this summary for some options).

Note that all of the above could also have been used to turn USART1 into an interrupt-driven & buffered console. USART2 was used as example here because it’s easier during development to switch between two separate interfaces.

A while back, an article was posted about the lack of USB on STM32F103 µCs, when it comes to Mecrisp Forth, that is. Unfortunately, getting the built-in USB device-mode hardware working is quite a challenging task. Not only is the USB protocol fairly complex - the actual USB interface hardware on that particular model STM µC is in fact ridiculously messy!

It has all the appearances of a rushed-to-market design, just to get that USB feature shipping!

Fortunately, there are a number of solid working implementations for this hardware, including the libopencm3 and ChibiOS code (both in C), as well as an older (tentative?) implementation in Forth by Eckhart Köppen. This last one has turned out to be the catalyst to move forward on implementing a serial-over-USB driver for (and in) Mecrisp Forth.

First the good news - it works! - even though there are still a few quirks:

$ folie -p /dev/cu.usbmodemC934CC31
Connected to: /dev/cu.usbmodemC934CC31
  ok.
1 2 + . 3  ok.
^D
$

USB output appears to work flawlessly, and is a lot faster than the normal serial console’s 115,200 Baud setting (this is not surprising, given that the USB full-speed rate is 12 MHz).

A few important bugs still remain as of this writing (May 2016):

input only works when data is coming in as chunks of 16 bytes or less - which includes normal typing, and even auto-repeat - the problem only occurs with larger chunks, in which case all characters are lost, for some as yet unexplained reason…
output stalls when there is no connection, i.e. when USB is not plugged in - this means that Forth code will wait (and appear to hang) once the 128-character buffer fills up

The first problem is very unfortunate: it means that Folie can’t be used to upload source code over USB yet. The second problem is more benign, but will become important for unattended use (i.e. powering up and running when no USB host is listening).

The problem with the dropped input is probably related to the lack of back-pressure, i.e. Mecrisp has to let the host know when it can or cannot accept more incoming data. The stalled output is probably solvable by keeping track of the attach state, and then simply dropping all output when there is no active connection (just like a serial port, there is not much else we could do anyway).

It’s important to note that USB is a host-driven protocol. Devices may only respond when the host polls them - including data from device to host! Luckily, the hardware takes care of that.

The new USB driver implementation is 100% pure Forth and can be found on GitHub. At ≈ 350 lines of code, it’s quite long and far from readable or optimised. The main tasks of this code are:

configure the µC clock to run at 72 MHz and derive the 48 MHz clock required by USB
set up the USB hardware interface on the STM32F103 µC, with all its quirks
get through the initial USB “enumeration” sequence to tell the host who we are
receive all incoming packets and place the data in a 128-byte ring buffer
send data out whenever present, using a (probably superfluous) 64-byte ring-buffer
deal with all the possible idling, attach, detach, and re-attach scenarios
keep the USB protocol going through polling (interrupts are not enabled in this driver!)
provide a set of I/O hooks for key?, key, emit?, and emit as a public API for this driver
define a custom init word, which sets up and redirects all console I/O to USB

The result is the session shown above. The current implementation, with some boilerplate code not strictly required for USB, needs ≈ 9 KB of extra flash memory. With the extended Mecrisp “RA” core image, the entire build fills just under 29 KB of flash memory. That still leaves over 32 KB of flash for application use, on even the smallest and very common STM32F103C8 boards.

Note that there is a small dependency on the type of board, since each board has a different way of tying the 1.5 kΩ resistor onto the USB lines, as needed for the reset phase of USB signalling. On a HyTiny, the PA0 pin is handled by this code. Slight mods will be needed for other boards.

With a warm thank you to Matthias Koch, Mecrisp’s author, for his tips, support, and continued encouragement to get all this going. Although the current code is not quite ready for prime time and still needs a major cleanup round, that “only-USB” moment sure is getting a lot closer now!

If you search for “stm32f103c8t6 board” on eBay, you will get lots of hits for what is essentially a single product, to be called the “Blue Pill” from now on:

The price of these boards is absolutely incredible, you could get 10 of these for the same price as an official Arduino Uno. At these prices, which are lower than the official price quoted for the µC chip alone, it’s really tempting to permanently include one in every project.

The “Blue Pill” name, by the way, appears to have originated in the STM32duino forum.

As you can see, the board is very nicely labeled. All the pin names are clearly visible and correspond to the pin names in the STM32 datasheets for this chip. Some notes:

“G” is ground, and it’s available on 3 different pins
“3.3” is 3.3V, i.e. the output of the on-board linear regulator (on two pins)
“R” is RESET, “VB” can be connected to a 3V battery to keep the RTC running
“PC13” is connected to an on-board LED, which turns on when set to “0”
the mini-USB jack is connected to the USB interface built into the µC
the two yellow jumpers control the startup boot settings, default is as shown
4 pins on the right are for SWD programming (a bit like ATmega’s “ISP” mode)

There’s an 8 MHz crystal on-board (it starts up using the internal RC oscillator, but can switch over to the crystal on the fly), as well as a 32 KHz crystal to drive the internal Real Time Clock. On the bottom side, there’s a small 5V => 3.3V regulator and some resistors and capacitors:

The schematic can be found here (PDF) and was copied from the Haoyu product info page.

On to the microcontroller itself, which is labeled “STM32F103C8T6”:

STM = the vendor, STMicroelectronics
32 = 32-bit, i.e. ARM series
F103 = model, “F1” also indicates that this is an ARM Cortex M3
C = 48-pins
8 = flash memory size is 64 KB
T = package type is LQFP
6 = temperature range -40..+85 °C

Rumour has it that these chips actually have 128K flash inside, but they are labeled as 64K and they do report 64K when queried. Maybe the flash failed QA testing, or maybe it’s as harmless as someone having made a mistake with the laser-etched label on the outside - who knows?

There is 20 KB RAM, a huge amount compared to ATmega’s and other 8-bit AVR’s.

As for the rest of this chip’s hardware, there’s a good summary on the datasheet front page:

More details for this model can be found on page 10, in the 2nd column (STM32F103Cx):

The home page for this microcontroller is on the STM site. The datasheet is called DS5319 (PDF), and if you really want to know everything there is to know about this chip, you’ll also need to download the RM0008 Reference Manual (1137 pages!) and the ES096 Errata Sheet.

The 12-bit ADC for up to 1 Msps and USB, CAN, DMA… there really are a lot of quite advanced features in this little fingernail-sized package. And all that for just a few bucks…

Note: let’s use “F103” as shorthand to avoid typing “STM32F103C8T6” all the time

The F103 has a built-in ROM-based boot loader, so there’s no risk of ever damaging it, but it’s quite limited since it can only deal with a serial connection on the USART1 pins (PA9 for TX and PA10 for RX). There is no way to reflash an empty or messed-up F103 from USB.

In addition, every ARM chip has either JTAG hardware built-in, or its low-pincount variant, called SWD. This is a very powerful mechanism which is separate from the µC core itself, to access memory, peripherals, and even the chip’s pins with a “JTAG programmer”.

But the implication of all this is that an F103 requires additional hardware to be programmed: either a USB-serial (a.k.a. FTDI or USB-TTL) converter, or a JTAG/SWD programmer. Since SWD programming is fully supported on the F103, you could use either an ST-Link (many clones available from eBay) or a Black Magic Probe (which uses an F103 itself, by the way).

You can search this weblog for several older articles about ST-Link and the BMP. One benefit of the BMP is that it can also act as USB serial port via a separate header.

For very powerful debugging of firmware written in C or C++, JTAG/SWD offers advanced feaures such as hardware “breakpoints” and “watchpoints”. This requires “gdb” - the GNU Debugger - or one of the many front-ends made for it.

But there is more to embedded development than uploading and debugging. We need a way to communicate with our program during actual use, and we need to supply power to the board. With JTAG/SWD, before you know it, you’ll have to attach three cables to make it all usable.

Luckily, there’s a way out: a USB-serial adapter can be used to perform all three tasks through one cable: uploads, serial comms, and a 5V power feed. It won’t do hardware debugging, but you can always attach an extra BMP when you really have to dig that deep to solve problems.

Here’s one way to work fully through a single USB connection:

Four wires in total: 2 for power, 2 for RX/TX (note that RX and TX need to be crossed-over!).

If you’re feeling adventurous, you could also hook up a permanent FTDI header:

It’s not the most convenient setup, as will become clear below, but it’s the simplest possible setup that will work, and it should work with just about every USB-serial / FTDI / BUB / FRIEND / BOB / eBay clone out there. Perhaps most importantly: this hookup can restore every F103 to a working condition. As long as it hasn’t been electrically damaged, of course.

Note that regardless of which USB serial interface you use, you’ll need to get it working with your host computer - be it running Windows, MacOS, or Linux. This may require a driver installation - the USB interface vendor should be able to tell you. All we need is a working serial connection, i.e. the USB interface showing up as a “Virtual Comm Port” - COM<N>: on Windows, /dev/tty/cu.* on MacOS, and /dev/ttyUSB<N> or /dev/ttyACM<n> on Linux.

Two more hurdles remain: 1) putting the F103 in ROM boot mode, and 2) sending the proper commands to it to erase and re-flash its firmware.

Entering boot mode is easy. Change the yellow jumpers to look as follows, then press reset:

Boot mode is entered when BOOT0 is “1” and BOOT1 is “0” (normal run mode is both “0”).

The F103 will now be running its ROM boot loader, waiting for an autobaud sequence on the PA9/PA10 pins. To get out of boot mode: restore the original jumpers and press reset again.

The last step is to use the proper utility for sending uploads. This uses a special protocol (described in this PDF). The protocol requires the connection to be set to even parity 8-bit.

One option is to use stm32loader, which is a Python script and uses the PySerial package:

TTY=/dev/cu.usbmodem14123
APP=/path/to/your/compiled/app.bin
stm32loader.py -e -w -v -p $TTY -a 0x8000000 $APP

Where TTY and APP need to be adjusted to match your own requirements. Note that you can enter this command to get a brief command-line summary: “stm32loader.py --help”.

Another utility which works in a very similar fashion is stm32flash, a compiled C program.

Lastly, there is Folie v2. It’s written in Go and has pre-compiled binaries for all the common platforms on GitHub. To use Folie just for uploading of a .bin or .hex file, proceed as follows:

download the proper build for your platform from the releases page
or if you have Go installed, build from source: go get github.com/jeelabs/folie
launch as “folie -r” from the command line (or “/path/to/folie -r” if needed)
select the serial port to connect to from the list
enter “!upload /path/to/your/compiled/app.bin-or-hex” and press return
once the upload completes, enter CTRL-D to exit Folie

Note that all these approaches require the F103 to already be in boot mode.

Once you have succesfully uploaded your application to the F103, you need to restore the jumpers to the normal state (i.e. all on the left in the above images). Then press reset and the newly uploaded code will start running.

There is some inconvenience with all this, because you have to repeat these same steps for every single upload: change jumpers, reset, upload, restore jumpers, reset. An alternative which greatly streamlines this process will be presented in a future article. Stay tuned!

Now that the connections and uploading have been dealt with, let’s turn to generating code.

First of all, we need a C/C++ compiler: this one. GCC is now at version 5.0, but 4.8 is also fine (there have been some issues with 4.9 on ARM Cortex, it’s perhaps better to avoid that one…).

You could just go ahead and download from the links mentioned above, but there’s another way - which might not require a download at all if you’ve already been working with Arduino or JeeNode boards. Since there are now Ardunio boards with ARM chips (such as the Due and Zero), that gcc compiler toolchain may already be on your disk, or very easy to add through the Arduino Board Manager. Here’s how to check:

launch the Arduino IDE
go to Board: ... => Board Manager... in the Tools menu
in the list of packages, there will be two which we can use:

If neither of them is installed, you should install one (either one is ok). This will place the GCC toolchain in an area managed by the Arduino IDE - we just need to figure out where that is.

On MacOS, the GCC toolchain files are installed inside this area:

$HOME/Library/Arduino15/packages/arduino/tools/arm-none-eabi-gcc/

On Windows and Linux, the locations will be different. An easy way to find out, is to look for a file (not directory) called arm-none-eabi-gcc, which is the name of the GCC compiler itself. On MacOS and Linux, we can use the find command for this, as follows:

find $HOME -type f -name arm-none-eabi-gcc

This should return the path (or maybe more than one) where the file is found. The proper one is the one inside a .../bin/ directory.

Now that the location of the compiler is known, whether “borrowed” from the Arduino IDE install or by downloading the original ARM Embedded package, we can extend the global search path, by entering this command (adjust as needed, obviously):

PATH=$PATH:/the/location/we/just/found/ending/in/bin/

This setting will be lost on logout or reboot. To make it persist, you should also add this line to your login profile (usually called “~/.profile” or “~/.bashrc” or “~/.bash_profile”).

To check that everything works, we can launch the compiler and ask it to report its version:

arm-none-eabi-gcc -v

Here is the (shortened) output on MacOS using the packages inside Arduino IDE 1.6.13:

Using built-in specs.
COLLECT_GCC=arm-none-eabi-gcc
COLLECT_LTO_WRAPPER=/Users/jcw/Library/Arduino15/packages/arduino/tools/...
Target: arm-none-eabi
Configured with: /Users/build/GCC-4-8-build/gcc-arm-none-eabi-4_8-2014q1-...
Thread model: single
gcc version 4.8.3 20140228 (release) [ARM/embedded-4_8-branch revision ...

We’re good to go. We now have a toolchain to compile and build firmware for ARM µCs!

There is one more important installation step to go through: we need some kind of runtime library, i.e. code which takes care of the initial startup, as well as providing a collection of tested functions to access the hardware peripherals inside the microcontroller. These functions are very vendor-specific, so we need a library which supports the STM32 µCs.

In the Arduino world, you’d use what comes as part of the IDE, with functions such as pinMode(), digitalRead(), digitalWrite(). In the ARM world, these could easily be implemented, but the hardware peripherals offer considerably more sophisticated features, so the libraries tend to be much more elaborate.

One such very capable library is libopencm3. It’s on GitHub, it’s open source (LGPL3), it supports a very wide range of ARM chips by now (not just STM32), it’s well supported, and it’s still actively being extended further. There is a companion repository with example code, and there’s a central website for everything related to libopencm3.

Setting up libopencm3 is quite simple:

download or clone the GitHub repository and put it “somewhere”
chdir to the cloned directory and type “make” to build the library itself
make a note where that is and set up an environment variable, as follows:
```
export OPENCM3_DIR=/path/to/your/libopencm3
```
again, as with the above PATH setting, it’s best to include this in your profile, so you won’t have to re-enter this on every future login

That’s it, we’re done! It took a bit of preparation, but your computer is now ready to use with the Blue Pill boards and can cross-compile and upload your own code to them. Note that all the steps described in this article only need to be done once. The tedious part is over.

Compiling software for the F103 is very easy but – as so often – does take a little preparation. In this case, we will need set up a few build files and make sure they are correct for the F103.

But first, let’s create a new directory and “chdir” into it:

mkdir ~/myf103apps
cd ~/myf103apps

Then you need these 3 files, which can all be downloaded from GitHub:

Makefile.include - contains a few settings to select the F103
rules.mk - the main rules for “make”, copied from the libopencm3 project
stm32f103.ld - memory map and sizes for the F103 variant we’re using

These files don’t have to be adjusted or changed in any way, they just need to be present.

Now we can finally create a first demo, and of course the first one to try is going to be the “Hello world” of Physical Computing: blinking the on-board LED - so let’s get moving!

Step 1 - Create a directory inside the one you just set up, called “blink”, and cd again:

mkdir blink
cd blink

Step 2 - Create a small, project-specific “Makefile”, with this contents:

BINARY = blink
LDSCRIPT = ../stm32f103.ld
SCRIPT_DIR=..

default: $(BINARY).bin

include ../Makefile.include

Step 3 - Write the actual source code, in this case “blink.c”, with this contents:

#include <libopencm3/stm32/rcc.h>
#include <libopencm3/stm32/gpio.h>

int main (void) {
    rcc_periph_clock_enable(RCC_GPIOA);
    gpio_set_mode(GPIOA, GPIO_MODE_OUTPUT_2_MHZ,
                    GPIO_CNF_OUTPUT_PUSHPULL, GPIO1);

    for (;;) {
        gpio_toggle(GPIOA, GPIO1);

        // create a small delay by looping for a while
        for (int i = 0; i < 1000000; ++i) __asm("");
    }
}

Step 4 - Compile this new application by entering the following command:

make

Step 5 - Upload the generated blink.bin file to the Blue Pill:

This can be done as described in the previous article.

That’s it! - To make changes, simply edit blink.c and repeat steps 4 & 5.

Some notes:

this “edit, make, upload” cycle is the traditional way of building code from the command line, and it’s no different for embedded µC’s
there’s no IDE: you can work with whichever editor you prefer - enjoy the freedom!
the Makefile defines the rules used by the make command - but most of these build rules are in the two files we prepared earlier: ../Makefile.include and ../rules.mk
the above blink.c example is typical for a libopencm3-based program: include the files needed for the functions you’re going to call, then define a main function, which sets up the I/O pins and other hardware, and then enters an infinite loop doing whatever the program is supposed to do
the rcc_* and gpio_* functions are part of libopencm3 - you need to visit the main site and browse the documentation to see what’s available - be sure to look under the STM32F103 section, since not all µC’s have the same capabilities
if you have a Black Magic Probe, ST-Link, or OpenOCD hooked up, you can add some settings to your Makefile to automatically upload as last step of make
it’s easy to create additional programs, or variants, or little tests: simply create additional directories next to blink, each with their own Makefile and source code
for a similar Blink example, see the one on GitHub - it contains slightly more code, because it also works on a different board, called the HyTiny

Although it’s nice to see it working, a blinking LED isn’t going to keep you excited for long…

So let’s try something a little more complex: we periodically send out a message to the serial port, and when a <CR> (Carriage Return) character is received, we’ll toggle the on-board LED. In addition, we’ll write the code in C++ instead of C, and we’ll split up the code to use setup() and loop() functions, in the same way as the Arduino world is used to doing.

To reproduce this experiment, you need to make a copy of the “echo” area on GitHub.

The application-specific code is now in a file called “echo.cpp” (the rest is in “main.cpp”):

#include <libopencm3/stm32/gpio.h>
#include <stdio.h>

// defined in main.cpp
extern int serial_getc ();
extern uint32_t millis();

void setup () {
    // LED on HyTiny F103 is PA1, LED on BluePill F103 is PC13
    gpio_set_mode(GPIOA, GPIO_MODE_OUTPUT_2_MHZ,
                    GPIO_CNF_OUTPUT_PUSHPULL, GPIO1);
    gpio_set_mode(GPIOC, GPIO_MODE_OUTPUT_2_MHZ,
                    GPIO_CNF_OUTPUT_PUSHPULL, GPIO13);
}

void loop () {
    printf("\nHit <enter> to toggle the LED (%lu) ...", millis());

    // the following loop takes roughly one second
    for (int i = 0; i < 1650000; ++i) {
        if (serial_getc() == '\r') {
            gpio_toggle(GPIOA, GPIO1);
            gpio_toggle(GPIOC, GPIO13);
        }
    }
}

Note the convenience of having a “grown-up” printf() at our disposal. This makes it very easy to generate good-looking text output on the serial port (it also increases the size of the generated application quite considerably, alas).

As before, you can enter make to build this, and then upload echo.bin to the Blue Pill (or HyTiny). Here is a sample session using Folie for uploading:

$ make
  CXX     main.cpp
  CXX     echo.cpp
  LD      echo.elf
  OBJCOPY echo.bin
   text	   data	    bss	    dec	    hex	filename
  28528	   2220	     64	  30812	   785c	echo.elf
$ folie -r
Select the serial port:
  1: /dev/cu.Bluetooth-Incoming-Port
  2: /dev/cu.usbmodem3430DC31
? 2
Enter '!help' for additional help, or ctrl-d to quit.
[connected to /dev/cu.usbmodem3430DC31]
!u echo.bin
  30748b +V22 #0410 R .W .E writing: 121/121 done.

Hit <enter> to toggle the LED (0) ...
Hit <enter> to toggle the LED (1008) ...
Hit <enter> to toggle the LED (2020) ...
Hit <enter> to toggle the LED (3032) ...

Note that Folie switches into terminal mode once the upload is complete, so the serial output immediately starts showing up. Hitting the Enter key toggles the LED, as expected. It works!

In case you’re wondering: while called “echo”, this example doesn’t really echo characters at all! It is left as an exercise for the reader to change the above example into one which does…

It’s time to step things up a bit. Let’s create a wireless node, with an RFM69 attached to an F103 µC, and then see if we can make it sing…

We’re going to need two nodes, if we want to actually test packet transmission and reception. Here is one node, hacked together with a Blue Pill and some fancy soldering:

The connections to the radio module are as follows, apart from +3.3V and GND:

PA4 = SSEL
PA5 = SCLK
PA6 = MISO
PA7 = MOSI

The setup shown above has more pins connected, but these are just “for future use”.

It’s not quite good enough though, we also need the serial port - so let’s also hot-glue a little JeeLabs Proto Board underneath, and hook up a 6-pin male header as FTDI connector:

The pinouts for the FTDI connector are:

DTR = optional, connected to “R” (RESET) on the Blue Pill
RX = tied to µC’s TX, i.e. PA9
TX = tied to µC’s RX, i.e. PA10
+5V = power, 5V in
RTS = optional, connected to the BOOT0 jumper, middle pin
GND = power, ground

Now we can plug this whole thingamajig into a USB-serial adapter. It’s very convenient.

To try this out, we’ll need the radio/ directory from GitHub - it contains the following files:

Makefile - the make settings for this example
main.cpp - this is identical to the one used in the previous echo example
radio.cpp - the actual application code, with setup() and loop()
rf69.h - the RF69 driver (same as used in several older projects)
spi.h - a header which interfaces to the polled SPI implementation in libopencm3

Builds and downloads can be done in the same way as before for the blink and echo examples. Here is a transcript of the entire process, using Folie for uploads:

$ make
  CXX     main.cpp
  CXX     radio.cpp
  LD      radio.elf
  OBJCOPY radio.bin
   text	   data	    bss	    dec	    hex	filename
  29960	   2220	    204	  32384	   7e80	radio.elf
$ folie -r
Select the serial port:
  1: /dev/cu.Bluetooth-Incoming-Port
  2: /dev/cu.usbmodem3430DC31
? 2
Enter '!help' for additional help, or ctrl-d to quit.
[connected to /dev/cu.usbmodem3430DC31]
!u radio.bin
  32180b .+V22 #0410 R .W .E writing: 126/126 done.

[radio]
  Enter 't' to broadcast a test packet as node 61.
  Listening for packets on 868.6 MHz, group 6 ...

rf69 21ee068803006ec0010a 8102c54bab019c179a80  (68+110:3)
rf69 21ee0687030070c00107 8102c44bac8080  (67.5+112:3)
rf69 21ee068a03007ec0010a 8102bd4bad019d179b80  (69+126:3)

In this case, the code was modified to listen to group 6, where existing wireless sensor nodes are already active here at JeeLabs. And as you can see, this example is picking up a number of packets and reporting them, nicely formatted.

If you don’t have compatible nodes running, you’ll need to create a second board with RFM69 attached, and upload the same code there as well. Then, by entering “t”, you can cause it to send out a test packet, which the other node should be picking up and reporting – if all is well.

That’s it - the start of a Wireless Sensor Network, built with low-cost “Blue Pill” boards based on the powerful ARM STM32F103 µC, and running code compiled with GCC and linked to the libopencm3 library for easy access to the chip’s GPIO, serial port, and built-in SPI hardware.

All this has taken us just five articles, to address and overcome all the hurdles involved when starting on ARM from scratch.

Coming up: a much simpler way to upload and talk to the Blue Pill and other ARM boards…

So far, to upload new code to a Blue Pill, you had to change a jumper, press reset, perform the upload, change the jumper back, and press reset again - this will quickly become very tedious!

So why not let the host computer do this?

What we need is a way to control the RESET and BOOT0 pins of the ARM µC. Reset is easy, and can be done in way similar to the Arduino: tie it to DTR (“Data Terminal Ready”), and have the host computer control this “modem control” pin.

On the wiring side, all we need to do is to make a connection between the FTDI’s DTR pin (sometimes marked “RST”) and the “R” pin on the Blue Pill header.

Controlling BOOT0 is slightly more involved. What we can do is use another modem control output pin for this, called RTS (“Request To Send”). This is not normally connected to an FTDI pin, but it’s not very hard to modify a BUB to do this as well.

The problem though, is that apart from not having these pins available on all USB serial adapters out there, we can also run into nasty issues w.r.t. platform support. Apple’s FTDI adapter does not support RTS, and at least some Windows and Linux versions have issues.

Getting DTR and RTS right on all platforms and making it work with various USB serial adapters turns out to be fairly tricky, to put it mildly.

But there’s a simple solution, and it works really well.

Ingredients: one Blue Pill, one 6-pin female header, and one firmware download.

What we can do, is to move the problem to a place where we can solve it: an implementation of a USB-serial bridge, based on an F103. After all, from a µC we can easily control a few pins, switch to even parity, and do whatever else is needed to program an attached “target” board.

The trick is to stay away from everything which is specific for any platform, and to only use the USB serial connection for just that: serial data. One way to accomplish this is to bring in an old-but-proven serial port control mechanism, called Telnet. This has an escape convention to send special messages in between the normal data flow. The escape byte is 0xFF, which does not normally occur in plain ASCII exchanges (and it’s properly wrapped in case it does).

So the idea is that instead of this setup:

… we switch to this approach:

The Folie utility defaults to the Telnet protocol (with the “-r” command-line option to fall back to raw mode where needed). All we need is that “smart µC-based serial” thing to make it work.

And that’s exactly what SerPlus is. An implementation for F103 boards, which does the required interpretation and translation between Telnet escapes and the real world (it only handles a subset, not the full Telnet protocol). Folie and Serplus were made for each other.

To create this setup, you need a spare Blue Pill (you might as well have a bunch of them lying around anyway, if you plan to follow along with some of the upcoming projects on JeeLabs), and you need to build a little setup similar to this one:

The connections are (bottom view, FTDI header, top to bottom):

GND = power, ground
RTS = tied to a spare I/O pin, i.e. PA2
+5V = power, 5V in
TX = tied to µC’s TX, i.e. PA9
RX = tied to µC’s RX, i.e. PA10
DTR = tied to a spare I/O pin, i.e. PA3

Note that this is not the same as when connecting a male 6-pin FTDI header to a Blue Pill to use it as target board. The connector is flipped, there’s no RX/TX cross-over in the above connections, and the DTR/RTS are not tied to RESET/BOOT0, but to pins which will be used as GPIO outputs to control the target board.

Now the chicken-and-egg part of the story: to program this board, you still need another USB-serial adapter. It could be another SerPlus board, but it’ll most likely be some other type. Not to worry - just follow the wiring setup instructions given in a previous article. If you plan to do this more than once, you could also consider creating an “FTDI cross-over plug”, like this:

Nothing fancy, just a way to streamline things. For one-off’s, four jumper wires will be quicker.

Just to make things clear: we’re about to program a board and turn it into a “SerPlus”, i.e. a variant of a host-side USB serial interface, just like a BUB or some other FTDI-like board. The difference being that this one will have slightly more smarts, to handle Folie’s Telnet escapes.

Last step: flashing the SerPlus code into the Blue Pill we just prepared:

you can get the code from GitHub and build it yourself
… or just grab the compiled binary, called serplus.bin
then upload that to the Blue Pill, in the same way as with the blink, etc. examples

SerPlus is considerably more complex than the other examples so far, but it’s built in exactly the same way, with the same GCC and the same libopencm3. There is a USB driver in there, as well as an interrupt-driven serial port, a SysTick timer, and of course a Telnet protocol decoder (implemented as a small finite state machine). Its main function is to get incoming data from USB to serial, and get incoming serial data to USB - just like any USB-serial board.

But we’re not out of the woods yet. We’re only half-way in fact…

The last step is to make the target Blue Pills suitable for this new “extended” use of the FTDI header. We need to add two things: a connection from DTR to the F103’s RESET pin, and another one from RST to the F103’s BOOT0 pin:

Almost there! - the one remaining issue, is that we can’t control the BOOT0 pin while there’s a jumper on it, and that without the jumper, the F103 may not start up properly in all situations.

A small “mod” will be required - in the form of an added 10 kΩ resistor where the jumper was:

This will make sure that BOOT0 is pulled low by default, but that the RTS wire can pull it high when needed to start an upload.

The result of all this work is TWO Blue Pills, a target (on the left), and a SerPlus (on the right):

Plug them together, connect a USB cable on the right, and you get a very convenient setup:

$ cd ../blink
$ folie
Folie v2.7-1-g94cba5e
Select the serial port:
  1: /dev/cu.Bluetooth-Incoming-Port
  2: /dev/cu.usbmodem3430DC31
? 2
Enter '!help' for additional help, or ctrl-d to quit.
[connected to /dev/cu.usbmodem3430DC31]
!u blink.bin
  724b .+V22 #0410 R .W .E writing: 3/3 done.

(hit ctrl-d to exit Folie)

Note the absence of the -r option this time: we’re now using Folie’s default Telnet protocol, not raw mode. Resets and uploads can now be done without touching the Blue Pill at all.

And since the blink demo is built into Folie, you don’t even need to get a copy of blink.bin:

!u
These firmware images are built-in:
  1: F103-BMP         50096b  crc:F87A
  2: F103-Blink         724b  crc:4967
  3: F103-Mecrisp     20500b  crc:A585
  4: F103-SerPlus      7052b  crc:7DD0
Use '!u <n>' to upload a specific one.
!u 2
  724b .+V22 #0410 R +W +E writing: 3/3 done.

Here’s another upload, of the echo example this time:

$ cd ../echo
$ make
  CXX     main.cpp
  CXX     echo.cpp
  LD      echo.elf
  OBJCOPY echo.bin
   text	   data	    bss	    dec	    hex	filename
  28528	   2220	     64	  30812	   785c	echo.elf
$ folie
Folie v2.7-1-g94cba5e
Select the serial port:
  1: /dev/cu.Bluetooth-Incoming-Port
  2: /dev/cu.usbmodem3430DC31
? 2
Enter '!help' for additional help, or ctrl-d to quit.
[connected to /dev/cu.usbmodem3430DC31]
!u echo.bin
  30748b .+V22 #0410 R .W .E writing: 121/121 done.

Hit <enter> to toggle the LED (0) ...
Hit <enter> to toggle the LED (1008) ...
Hit <enter> to toggle the LED (2020) ...
!reset

Hit <enter> to toggle the LED (0) ...
Hit <enter> to toggle the LED (1008) ...
Hit <enter> to toggle the LED (2020) ...

As you can see, the upload was performed, the program was started, and since Folie then switches to terminal mode, the output immediately appears. Furthermore, hitting CTRL-C generates a reset of the target board, at which point it restarts from scratch.

For people used to the Arduino workflow, this is nothing spectacular - it’s the way the Arduino IDE has always worked (and it’s probably partly responsible for its runaway success).

And now, the same can be done on ARM: a single USB cable for uploads, for talking to the running application, and for supplying 5V power. Mission accomplished – onwards!

Embedded software development can usually be characterised by the following diagram:

In other words: a “host” computer with a cross-compiler, where you enter code, compile it, and upload it to a “target” system which by itself does not have the processing power and memory to handle such a “toolchain”. Maybe you’ll also be using a “debugger” which lets you single-step, set breakpoints, and view variables values and raw memory on the embedded µC.

There is a lot to like about this approach: a very comfortable and powerful host environment is fast, offers excellent editing tools, adds version control, and has lots of storage (and backups).

But it’s also a bit like endoscopy: you’re not developing on the embedded microcontroller, you’re developing for one, while peeping through a fairly restricted “hole” in the wall, peeking and poking at the µC state from a distance.

During development, everything can and will go wrong. That’s the nature of the beast: we need to reason our way through everything unexpected that’s happening - whether it’s a bad assumption, a lack of knowledge, faulty hardware, or even a bug in another part of the code.

And so we add “debugging code” to report progress, and keep a watch on what is zipping by as the code is trying to do its job. We use the debugger to slow down the events and carefully check the state as the logic advances, and we hook up a logic analyser or oscilloscope to make sure everything is going well, and to capture specific events when things are not right.

It’s often a big puzzle, as we stare at our screens, comparing our code, observations, and event logs we’re collecting. It’s even fun when we finally deduce what went wrong and how to fix it!

But isn’t this all a bit backwards? Do we really want to be keyhole surgeons? Isn’t there a way to be “on site” to view and change the microcontroller’s internal state continuously? Can’t we just set or clear a pin or register value while trying to understand how to implement a feature?

A lot of work tends to go into understanding how things work, so that we can find ways to implement what we want on top of the existing system. A lot of time goes into learning how a micrcontroller works, or one of its hardware interfaces, or one of the attached devices. You can’t just read everything, come up with the perfect solution, implement it, and be done. Not as a casual hobbyist with limited exposure to all this technology, anyway. It’s an exploration!

Wouldn’t it be nice therefore, if embedded development were considerably more interactive?

Instead of developing in cycles, we’d be looking straight at the µC, see the values of variables, memory, and hardware registers, change them to see what happens, and see the effects right away. Every observation tends to lead to a new insight about what is going on - interactively, trials can be iterated so quickly that understanding and implementation get “fused” into one.

Think about it for a moment: what would it be like to no longer have a compile cycle, or at least not all the time? Instead, typing new commands which respond immediately, allowing you to try out things on the fly - the moment they come to mind. No host-side preparations would be needed, the main mental effort would be taking place while you’re trying stuff out.

Is this even possible? Of course: everyone who has programmed in BASIC will recognise this. The “host” computer becomes a terminal front end, while the actual thinking and coding takes place where the action is. Which means, in this context: right on the microcontroller.

Except… we’re not going there. Interpreted BASIC is slow and clunky. There’s another way.

As you know, compilers generate code. They take one or more source files and turn them into an executable. In the case of embedded software development, the compiler is actually a cross-compiler because it’s not generating code for the machine it’s running on (your laptop), but for the attached board (a little “target” microcontroller).

The generated code is simply a chunk of binary data - gibberish, as far as we humans are concerned. We upload it to the target board, and then it presumably starts doing what we instructed it to do in the original source code. We rely totally on this 1-to-1 mapping.

There are several implications and drawbacks with this widely-used approach:

partial linking is uncommon in this context, so the binary we have to send each time is usually the entire application
there’s no way to change the code once it has been uploaded - every little change means going through the same cycle again: edit, compile, link, upload, run
interfering with the flow of the code requires a hardware debugger (i.e. gdb), which in turns requires a lot of extra debugger information (supplied by the GCC compiler), to make sense of all the machine code once it has been loaded onto the target
all this is fairly complex, especially when done well and transparently, so that the debugger is able to tie things back to locations in the original source code text
the debugger requires a connection to the target board, using JTAG or SWD, which requires a separate set of I/O pins and a separate cable
on the “host” side, e.g. the laptop, all this complexity adds up to a hefty “toolchain” - with over two dozen different tools in a 400+ MB install to make it happen

But there’s a way to avoid such complexity – and it was invented half a century ago.

It’s called “Forth”. Here’s how Forth manages to avoid the issues mentioned above:

the host is little more than a terminal front end, i.e. a keyboard plus screen
the code is sent as source code, i.e. text, to the target system
there’s no longer a need to distinguish host and target, it all happens on the target
code is executed on the fly, line by line, as it is received
some code has an immediate effect, other code gets compiled for use later
all the code is defined bottom up, there’s no need for a separate linking step
everything that is compiled can be run immediately, from the same command line
redefinitions are simply appended during development, and override all future uses
code can be saved to flash memory, so that it survives a crash or hard reset
tentative code can be compiled to RAM and easily discared by forcing a reset
there’s no interpreter overhead, Forth is compiled on-the-fly to machine code
code gets added in very small steps, many functions are just one or two lines
crashes are not the end of the world, they simply signal that there was a mistake
recovery from a crash is streamlined, resetting to a known state is instant

The central theme is: “learn as you go”. Go ahead: try something, crash, learn, tweak, repeat. Or to put it even more succinctly: Fail Fast! - there is no big wheel turning - sure, there is a workflow, but you can forget about the “work” part of it, it’s all about staying in the flow!

In Forth, there’s no fundamental distinction between compilation, running, and debugging. It’s all interaction, and everything “does something”. Some effects will be internal: viewing or changing values, other effects will be more permanent: defining new functions (Forth calls them “words”) in RAM or in flash memory, and yet other effects will cause I/O pins to change, some data to be sent or received, etc.

Forth is a programming language with its own syntax and conventions (to be described in an upcoming article), but it’s also a programming environment in that it prescribes how the entire development process takes place. All of it - yes, all of it! - lives on the microcontroller.

In a way, there are no safety nets - programming errors and even small mistakes can lead to fatal crashes where the µC stops responding with no clue as to what happened. But while this could be a hassle with a compiled approach, it’s very easily overcome in Forth: reset, and retry things in smaller steps, then inspect what is going on just before the failure you just ran into.

Yet – surprising as it may sound – an embedded Forth environment is also extra-ordinarily robust: all that can happen is that it stops. Simply reset the µC and you’re back on track!

Since the host is no more than a view into the microcontroller, there is no state on the host which can be affected by crashes or resets on the target board. There’s no toolchain, there are no apps to launch or re-launch when the target fails.

What about source code? And editing that code? And version control? And backups?

Ah, this is where we leave the realm of the actual Forth setup. There’s a tool called Folie which addresses all that. The name “Folie” stands for “Forth Live Explorer” – it acts as the terminal and bridge from host to target. In essence, Folie is simply a line-oriented terminal emulator.

But Folie also adds a number of features to make life in Forth-land more convenient:

line editing can be done in Folie, before it’s even sent off to Forth
there’s command history, with up-/down-arrow to access previously-entered text
hitting CTRL-C can reset the attached board - this is a huge time-saver
Folie can upload new firmware to an STM32 µC via the ROM boot loader
you can send the contents of a file to the target board, as if typed in

That last feature is what brings the target board into the 21st century. While you could just manually enter all your code and get it working and saved in flash memory that way, it would be extremely inconvenient to not have a permanent record of the final version saved on file.

With Folie, the idea is to write the main parts of your application as source code in your own preferred editor on your own comfortable host system. Then send this over to Forth and start using it. There are various ways to streamline this, so that the actual sending only takes place occasionally. In short: think on the target, try out things there (using the command line), but as soon as you have some code which is working well, enter it as source code definitions in the editor, and send these definitions over as needed - and whenever they need to be updated.

The normal modus operandi is to keep both your editor and a Folie session open, side by side.

There’s a lot more to say about Folie and the Mecrisp Forth implementation by Matthias Koch, which we’ll be using on ARM microcontrollers (Mecrisp is also available for ARM Linux, and there are variants for MSP430 and even FPGA’s). But that’ll have to wait for another time.

Anyway - the point of this story is that the traditional edit, compile, link, upload, run, debug cycle is not the only way to do things. The next few articles will try to illustrate how Forth’s interactive approach can be very effective and great fun for embedded software development.

Ok, so what is this “Forth” thing like, as programming language?

One way to answer this question, is to point to the Forth in 7 easy steps article, published a while back on this weblog. Go ahead and read it, it’ll definitely give you a first impression.

Forth was invented by Charles Moore, an independent thinker who is clearly not bound by “mainstream conventions”. Here’s an interview with him about Forth. And here is another one. Forth is an old language (just like C), but it’s still evolving in surprising directions, as shown in his 2013 presentation of an ultra-efficient asynchronous 144-core Forth chip.

It’s hard to capture the essence. If it had to be one word, perhaps that word should be “less”. Forth needs less resources and its notation is very minimalistic. That doesn’t make it crude or limited - but it does force you to think about accomplishing your tasks with less machinery.

A lot of what algorithmic programming languages brought into this world will be of little use when programming in Forth. Expressions are not written in algebraic form (”2*a+1”) but in Reverse Polish Notation) (”2 a * 1 +”). Local state is not in locally named variables, but on the data stack. Calls use an explicit return stack, which can be manipulated within Forth.

The reason for this is not to make life harder – though it might well be if you’ve grown used to thinking that everything has to be done in a certain way – but because Forth is a very carefully crafted compromise between capability and complexity. Making things as simple as possible, but no simpler could easily have been Forth’s guiding principle.

So the way to look at Forth is not “why on earth does Forth do it differently?” but: “how far can you extend a really simple and clean design without dragging in complexity?” - and the result is Forth: two stacks, a dictionary of words, RPN, if-else-then, short-but-rich names, and more.

The price paid is that it can take quite some effort to get used to all this. The conciseness is a consequence of the extreme expressiveness of the language. Math can’t be written in prose. Forth can’t be parsed and skimmed in the traditional sense - it’s not a series of statements.

Forth code tends to have few comments, and even then, many comments will be describing “stack effects”, not a sentence explaining what the code does. To clarify what the code does, you break it up into separate definitions (“words”) and give each one a descriptive name.

Forth is a concatenative programming language, which means – vaguely speaking – that you can take any phrase (i.e. sequence of Forth words) and replace it with a single new definition, without altering the behaviour of the code. When you see code such as this:

...
... a b c ...
...
... a b c ...
...

Then you can in general replace it with this:

: blah a b c ;
...
... blah ...
...
... blah ...
...

Even without having a clue what a, b, and c do! Artificial as it may look, this turns out be very useful - once you start shuffling code around like this, you will often discover that there is a perfectly good name for blah, which describes what it’s intent is. Re-factoring becomes a fascinating way to gain more insight in the code you’ve already written. Instead of the mantra “turn everything you plan to re-use into a function”, in Forth you end up writing code in whatever way makes sense right now, and then later discovering that there is more generality to be extracted and turned into new words.

The risk of top-down design is over-generalising before there is a need, and Forth constantly nudges you to stop doing that. Write what you need now. Do not parameterise a word, just because it might be re-used later. When that happens, you’ll either remember that you already wrote something similar and re-use it, or you won’t - in which case the visual similarity of the code will shout out at you later on.

Design is about patterns. Patterns of logic and patterns of notation. So is Forth.

Next up: installing Mecrisp Forth on an F103 and trying it out for real!

We’re about to try out Forth on real hardware. The implementation used here at JeeLabs is Mecrisp Forth by Mattias Koch. It’s fully open source (GPL3), it’s well-supported, it’s robust, it’s available on a range of platforms, and it’s documented (very concisely, as usual in Forth).

There’s a firmware image of Mecrisp Forth for F103 built into Folie, which makes it easy to get started. If you’ve been using Folie, you’re just 3 steps away from installing Forth on a Blue Pill. It’s particularly effortless if you’ve made yourself a SerPlus, as described in a recent article:

plug the SerPlus into USB, and attach the Blue Pill on which you want to install Forth

launch Folie, enter “!upload” (or “!u”) to see a list of built-in firmware images:

!upload
These firmware images are built-in:
  1: F103-BMP         50096b  crc:F87A
  2: F103-Blink         724b  crc:4967
  3: F103-Mecrisp     20500b  crc:A585
  4: F103-SerPlus      7052b  crc:7DD0
Use '!u <n>' to upload a specific one.

now enter this command, making sure the number matches what you see in your list:
```
!u 3
```
that’s it!

(Note: If you only see dots appearing, reset the target Blue Pill to kick the upload into action)

With other USB-serial adapters, you’ll need to put the Blue Pill in boot mode and restore/reset once the upload finishes. Apart from that, the process should be the same as described above. Don’t forget to specify the “-r” flag to Folie when not using the Telnet-aware SerPlus.

Here is a transcript of the entire process when using a SerPlus:

$ folie
Folie v2.7-1-g94cba5e
Select the serial port:
  1: /dev/cu.Bluetooth-Incoming-Port
  2: /dev/cu.usbmodem3430DC31
? 2
Enter '!help' for additional help, or ctrl-d to quit.
[connected to /dev/cu.usbmodem3430DC31]
!u 3
  20500b .+V22 #0410 R .W .E writing: 81/81 done.
Mecrisp-Stellaris RA 2.3.1 for STM32F103 by Matthias Koch
Erase block at  00005004  from Flash
Finished. Reset Mecrisp-Stellaris RA 2.3.1 for STM32F103 by Matthias Koch

The erase and duplicate welcome greeting only appear just after being re-flashed. After that, pressing reset (or hitting CTRL-C when using SerPlus) will show the above greeting once.

We’re in business! - Forth is now at your command (prompt) …

Go ahead, try a few things. Type this text (hitting ENTER sends it to the F103):

1 2 + . <enter>

Note that in Forth, the prompt is called “ok.” and that it usually appears at the end of the previous line. Also note how output shows up after the command, i.e. on the same line.

A good word to learn and use often, is “.s”, which displays the contents of the data stack:

.s <enter>
1 2 .s <enter>
+ .s <enter>
. <enter>
.s <enter>

Or all on one line:

.s 1 .s 2 .s + .s . .s <enter>

Here’s how to produce a huge list of all the words Forth currently knows about:

words <enter>

Sample output, shortened for the sake of brevity (the full output is hundreds of lines long):

Address: 00000150 [...] Name: --- Mecrisp-Stellaris Core ---
Address: 0000058C [...] Name: 2dup
Address: 000005B0 [...] Name: 2drop
Address: 000005CE [...] Name: 2swap
[...]
Address: 00004D4A [...] Name: irq-tim7
Address: 00004D72 [...] Name: irq-usbfs
Address: 00004D9A [...] Name: --- Flash Dictionary ---
 ok.

Now is a good time to revisit the Forth in 7 easy steps article and to browse through Mecrisp’s online glossary. Go ahead, try it - if the µC stops working, simply reset it to get a prompt back.

To get some feel for Forth – once you know the basics – check out Sam Falvo’s 1-hour video. He tackles a problem which is not so relevant for embedded computing, but his explanations, programming style, and way of creating a solution are good examples of advanced Forth use.

One thing you’ll notice is that there are no definitions for controlling GPIO pins, turning LEDs on and off, or for any of the built-on hardware peripherals such as I2C and SPI. The Mecrisp Forth core is just the base system: a command line, all the essential words needed to code in Forth, plus some more which are useful but not specific to the F103 ARM microcontroller.

There’s a lot packed into this Mecrisp core, including an on-the-fly compiler to ARM machine code with some fairly advanced optimisations such as constant folding and register allocation. But for convenience, we’ll first need to load a bit more Forth code in flash. Coming up next…

When an input pin isn't one

Great ADC/DMA performance

Some µC speed measurements

Tracking pulses w/ interrupts

Frequency aliasing in ADCs

Simple variable packet data

Parsing P1 smart meter info

The need for multitasking

Using a buffered serial console

USB serial in Forth, progress!

Let's start with a Blue Pill

USB serial is a good way to go

We need a toolchain and library

LED blinks and serial echoes

Hooking up an RFM69 radio

Fancy serial with a SerPlus

Stop staring at that screen!

Peeling off layers of complexity

Polish notation and tiny words

Let it burn, then press CTRL-C