
A new web front-end design


JET is going to need a web interface. In fact, it’s likely that a major part of the total development effort will end up being poured into this “front-end” aspect of the system.

After many, many explorations, a very specific set of tools has been picked for this task, here at JeeLabs. It’ll all be JavaScript-based (ES6 as much as possible), since that’s what web browsers want. But unlike a number of earlier trials and actual builds of HouseMon, this time we’ll go for a fairly plain approach: no CoffeeScript and … no ClojureScript. They’d complicate things too much for casual development, despite their attraction (ClojureScript is pretty amazing!).

We do, however, want a very capable development context, able to create a UI which is deeply responsive (“reactive”, even), and can keep everything on the screen up to date, in real time.

Here is the set of tools selected for upcoming front-end development in JET:

  • ReactJS - should be easier to learn than AngularJS (but also totally different!)
  • WebPack - transparently builds and re-builds code during development
  • Hot Reload - an incredible way to literally edit a running app without losing context
  • ImmutableJS - a new trend in ReactJS, coming from Clojure and other FP languages
  • PureCSS (probably) as a simple and clean grid-based CSS styling framework

This front end will be called JET/Web; it is based on the react-example-es2015 project template on GitHub. It has all the main pieces in place for a truly fluid mode of development. A very preliminary setup can be found in the “web/” directory inside the JET repository - but note that the current code still has sample content from the template project.

Front-end development is a lot different from back-end development, i.e. the JetPacks and the hub itself. In development mode, a huge amount of machinery is activated, with NodeJS driving WebPack, “injecting” live-reload hooks into the web pages, and automatic change detection across all the source files. Once ready for “deployment”, the front-end development ends with a “build” step, which generates all the “assets” as static files, and compresses (“uglifies”) all the final JavaScript code into a fairly small single file - optimised for efficient use by web browsers. In JET/Web, the final production code is only about 150 KB of JavaScript (including ReactJS).

If you’re new to any of the tools mentioned above - you may well find that there’s an immense amount of functionality to learn and get familiar with. This is completely unlike programming with a server-side templating approach, such as PHP, Ruby on Rails, or Django. Then again, each and every one of these tools is incredibly powerful - and it’s guaranteed to be fun!

That’s a consequence of the breakneck pace of progress in today’s web development world. But these choices have not been made lightly. Some of the considerations involved were:

  • a low-end (even under-powered, perhaps) server, which can’t handle a lot of processing
  • the desire to have everything on a web page updated in real-time, without page refreshes
  • the hub’s web server can’t be restarted, at least not for work on the web client software

In its current state, JET/Web is next to useless. It doesn’t even connect to the MQTT server yet, so there’s no dynamic behaviour other than inside the browser itself (see for yourself, the demo is quite interesting, especially when you start to look into how it’s all set up in the source code).

One final note about these “decisions”: obviously, you have to pick some set of software tools to be able to implement anything meaningful. But with JET, “big” decisions like these are actually quite inconsequential, because many different front ends can easily co-exist: anyone can add another JetPack, and implement a completely different (web or native) front end!


Inside the PDP-8 hardware


In 1965, computing history was made when DEC introduced a new computer, called the PDP-8. It was the start of a long series of incrementally improved models, later to be followed by the even more successful and ground-breaking PDP-11 series, and then the VAX. It’s a fascinating story, because so many technological trends of that time have shaped what we now use on a daily basis, from laptops, to mobile phones, to watches.

Let’s get some historical context first - we’re in the year 1965:

  • the transistor had been commercially available for about a dozen years
  • RAM memory was made of magnetic cores - it was used between 1955 and 1975
  • the ASR-33 “teletype” was introduced in 1962 and produced until 1981
  • block-addressable DECtape had just been invented, holding ≈ 280 KB per reel
  • hard disk drives were “a few megabytes” and extremely large and expensive
  • the first 4-bit “computer on a chip” microprocessor was still half a dozen years away
  • for a lot more context, see this timeline at the Computer History Museum

But perhaps the most telling metric of that time is the amazing revolution made possible by the introduction of the integrated circuit - before the PDP-8, everything had to be constructed from individual components. Here’s the IC’s development path, as summarised by Wikipedia:

Think about it: a few years after the first PDP-8 was produced, a “chip” (from the incredibly successful 7400-series introduced by TI) could replace no more than a handful of gates!

And to get a bit more perspective about the mindset of those days: Ken Olsen, the founder of Digital Equipment Corporation which created the PDPs, said in 1977 that:

“There is no reason for any individual to have a computer in his home.”

(note that he was talking about home automation, not ruling them out for other purposes)

So here we are, half a century ago. Computers were huge colossi, sitting in large noisy rooms, drawing kilowatts of power, and operated by a small group of specialists. Nobody considered these things useful, other than to speed up numerical calculations. Time on “the machine” was so costly, that everything was focused on optimising the computer’s time, not that of us people.

And then came the first minicomputers - most notably the DEC PDP-8 and the Data General Nova. Here’s a picture of one of the first PDP-8s, with the panels removed to show its innards:

Yeah, it’s called a mini-computer! (image from the SMECC museum in Arizona, US).

There have been many PDP-8 models, over the span of about a decade. Here are a few of the highlights, from Doug Jones’ site - a huge resource for everything related to these machines:

  • PDP-8 - 1965..1968 - 4K (12-bit word) memory, 1.5 µs memory cycle time - $18,000
  • PDP-8/i - 1968..1971 - M-series “flip-chips” with wirewrap backplane - $12,800
  • PDP-8/e - 1970..1978 - SSI/MSI 3-board design, bus instead of backplane - $6,500
  • PDP-8/a - 1974..1984 - single-board CPU, “workstation” with diskettes - $1,835

Here is a PDP-8/f, a slightly newer version of the PDP-8/e, from The Old Computer Hut site:

There’s a hefty switching power supply on the side, and cards which plug into an “OMNIBUS” connector board at the bottom. This unit still uses 4K words of core memory, expandable to a whopping 32K.

One of the distinguishing features of computers from this era is their “programmer’s console” - a row of switches and a bunch of lights which indicate the content of some registers in real time. You can stop the machine dead in its tracks by flipping the STOP switch, examine memory, even “deposit” new values in it, and then continue execution. How’s that for debugging, eh?

Computer peripherals in 1965


Ok, so now we have our computer. How does it interface to the real world? How do we talk to it? How do we tell it what to do? Do we login to it? Or is it all about switches and blinkenlights?

No tablets, no LCDs, no video screens, no internet yet, no Ethernet yet, no local area networks!

The main interface was the teletype, a big and noisy hardcopy printer, keyboard, paper tape reader, and paper tape punch, all in one. A marvel of mechanical (not electronic!) engineering:

The communication speed was 110 baud serial, using the same start/stop bit framing we still use today - at 11 bits per character, that’s about 10 characters per second. The paper punch ran in parallel with the output, so you could “save” what was being sent to you and then later re-enter it, as if you had typed it in.

In the early days - or if you had no budget for anything fancier - that was it!

Here’s a 3-minute video, showing a fairly “high-end” setup - s l o w l y ! - typing out a file listing.

To save 2 KB of text, i.e. roughly one typed page, on paper tape, you had to load blank paper tape into the punch, start the printout, and listen to a very loud paper punch pressing holes into the tape for well over three minutes. Oh, and don’t forget some “lead-in” and “lead-out”: blank pieces of tape at the front and the back, to make it possible to load the tape and run it through without problems.

The paper tape got jammed, you say? Pity. Just start over, if it wasn’t damaged too much.
You want to make a safety copy, just in case? Sure, just start the reader and punch in parallel.

Later on, much faster “high-speed” optical paper tape readers were introduced, which greatly reduced the noise and time spent, but paper tape just isn’t such a great medium when it comes to kilobytes of data. Not to mention the storage needs and keeping track of it all (in handwriting).

Meet the DECtape unit (image from the Computer History Museum):

Here is an 18-second video of how they worked. Each tape can store up to 280 KB of data, and because of the “tape marks” it was able to seek to any block on the tape and read or re-write it as needed. Beats paper tape, but it still took a lot of time just shuttling around to access each block.

The DECtape unit was quite expensive (a TU56 dual unit w/ controller was $5,500 in 1974), but the tapes themselves were cheap, so you could have virtually unlimited storage. Some technical specifications, from pdp8.net:

  • start/stop/turnaround time: 150/100/200 ms
  • tape speed: 2.4 m/sec - transfer rate: 8,325 12-bit words/sec
  • power consumption: 325 watts - weight: 36 kg
  • tape reels: 10 cm in diameter - tape length: 78 meters

DECtape was pretty convenient to handle, and one tape could store about 140 pages of text, but its Achilles’ heel was seek time: over half a minute (78 m at 2.4 m/s ≈ 33 s) just to get from one end of the tape to the other.

Just like today, technology evolved rapidly. Disks with fixed platters, as well as removable ones, became more widespread and more affordable year by year:

  • DF32 - fixed head - 32..128 (12-bit) kilowords - 17 ms seek, 16 kw/sec
  • RX02 - 8” floppy disk - 256 kilowords - 262 ms avg seek, 20 kw/sec
  • RK05 - removable pack - 1.6 megawords - 70 ms seek, 1500 rpm, 100 kw/sec xfer

Note the units: kilowords and megawords. That RK05 became a workhorse, also for the PDP-11 later on, but at a steep price: $7,900 per drive + controller. And to be practical you really needed two, otherwise there’s no way to make backup copies or move large amounts of data around!

Let’s compare this to today: a 16 GB µSD card costs around €9, read and write speeds are in the tens of MB/sec range, and there’s no seek time, as the card has no moving parts. Oh… and no controller either - any µC with 4 spare I/O pins can read and write from this thing. That’s 6,000 x the storage of an RK05 pack, 10,000 x as fast, and 1/1000th of its price (per unit, not byte!). Not to mention physical size and power consumption differences…

Hey PDP-8, meet Raspberry Pi!


If you would like to experience for yourself how a computer such as a PDP-8 looks and feels, there are several possible avenues to choose from:

  • get in touch with a museum, friend, or hobbyist who has “The Real Thing”, and see if you can get a demo or schedule a session
  • look for old equipment dumps, maybe some company, university, or individual wants to give, sell, or lend such a machine to you
  • get hold of schematics, spare parts, and go try and build one yourself, possibly re-using any original parts available to you
  • download a software simulator for the computer model you’re interested in, and have a go at running this virtual environment
  • design your own emulation, possibly adding some fancy lights and switches to make it more realistic and tangible than a software-only emulation
  • look for a kit and build it yourself, knowing that others have done the same, with support from the kit maker and/or other builders who went before you

This article is about that last option. Oscar Vermeulen has a site with the wonderful name of Obsolescence Guaranteed where he has collected a lot of information and offers a kit for what he calls the PiDP-8/i (note the “Pi” in there!). Here’s his PiDP-8/i kit in front of a real PDP-8/i:

The PiDP-8/i looks like a 2:3 scale model of the real thing, but inside is a Raspberry Pi, running SIMH, as an extremely elaborate and complete emulation of the PDP-8 (and others) as well as tons of peripherals. So you can make the machine think it has a paper tape reader, or a few DECtape drives, or some RK05 diskpacks, or all of them at the same time. Storage media will then be emulated as files on the Raspberry Pi’s SD card or on USB sticks.

The Obsolescence Guaranteed site is a joy to read, and has tons of details - about the kit, the assembly process, the original hardware, as well as things you can try out with it.

Two nice videos are the introduction (7 min) and the Hackaday 2015 presentation (20 min).

As noted, the PiDP is completely different inside - it has nothing to do with the original clunky, energy-slurping machines of 50 years ago. It just looks the same and it behaves very much like an original PDP (if you imagine the paper tape, teletype, and other peripherals yourself, that is).

Here’s the PiDP, with a Raspberry Pi A+ on the left, and running off a blue (18650-based) LiPo battery pack from eBay - there’s not much behind that front panel, as you can see:

Construction of this kit is very straightforward. It’s all very nicely documented on the website. You have to solder in 89 LEDs, a dozen or so resistors, and the most unusual part: a series of 22 switches (some toggle, some spring-action), carefully mounted and positioned to give the whole thing a nice well-spaced appearance. It took an afternoon - it’s not hard, it just takes patience …

For this build, the goal was to create a completely self-contained unit (hence the battery pack), and to control it entirely via a network connection over WiFi. To that end, an FTDI interface had to be brought out, both to charge the battery pack and to create a serial connection for adjusting WiFi settings. Nothing a bit of Dremel-cutting and hot glue can’t handle:

WiFi is a matter of inserting a WiFi dongle into the A+’s only USB port, but because this was a few millimetres too large to fit inside the box, its plastic cover has been removed - revealing a WiFi board which is even smaller than an ESP8266:

The last puzzle to solve was how to turn power on and off to this thing. The battery pack has a very convenient button, but it would require making another ugly hole in the box. The solution was to place the battery holder right behind the front panel (with a bit of cardboard behind it):

This way, if you know where to push on the front panel, you can bring this PiDP-8/i to life!

Based on a quick measurement, the PiDP-8/i draws about 230 mA, so it ought to last about a day on the LiPo battery before needing to be plugged in. How’s that for mobile computing?

That front panel is quite extraordinary, by the way: not only can you see (and change) the contents of memory and the accumulator, and of course single-step the whole beast - you can even single-cycle it, i.e. go through each of the different phases of an instruction, and see the instruction decoder in action on the right hand side of the panel.

See that vertical set of 8 lights? That’s the instruction type: the PDP-8 has only eight different opcodes, although one of them is sub-divided into additional “micro” operations. Since only the JMP and IOT instructions are lit here, the program must be idling, waiting for some I/O.

The PiDP-8/i comes with 32 Kwords of memory, the maximum supported in this architecture, and the simulator is able to connect every possible type of hardware to it, in a virtual sense that is. These options are part of SIMH and can be adjusted through a serial or SSH connection.

So what can you do with 8 banks of 4,096 words of memory, organised as 128-word pages?

Some amazing software feats


The introduction of the PDP-8 series was a disruptive, game-changing event, in that it made computers available to a large group of scientists, engineers, and tinkerers. For the first time, more or less, you could walk into a room, sit down at a teletype, and start programming. No more “batch jobs” and “reserving” time on a huge, scarce, over-booked, expensive machine.

Instructions and memory

The PDP-8’s instruction set is very well documented elsewhere. It only has 3 bits to store the “opcode”, i.e. 8 combinations in all. One is for I/O, one is for special “micro” instructions - that leaves a mere 6 operations: a jump, a subroutine call, and only four other instructions, with 2 special bits and a 7-bit operand field. Can this thing really be Turing-complete?

That’s not all: those 32 Kwords of a maximally-configured PDP-8 are split into eight 4 Kword “banks”, and each bank is split into thirty-two pages of 128 words each. Since a word is only 12 bits, an address can reach at most 4,096 locations, so you can only easily access words within a single 4 Kw memory bank - anything beyond that takes multiple instructions.

There is no “load accumulator” instruction, there is only “add to accumulator”. Storing the accumulator clears it! (which makes a lot of sense combined with add-to-accumulator: a “load” is simply a clear followed by an add) - for some interesting notes about the instruction set, see this page.

Let’s look at memory: in those days, random-access memory meant magnetic core memory. It has some very unusual properties by today’s standards: reading a memory address is destructive - the contents have to be written back after a read to be preserved! As a consequence, reading, writing, and even modifying a memory address all take the same amount of time.

And then this: core memory retains its contents when powered off. That means you can stop a PDP-8 from its front panel, turn the machine off, power it up again, and restart it.

Despite the limitations of a PDP-8, people have built various operating systems for this thing, and implemented more than half a dozen programming languages for it. It boggles the mind.

Languages

There are two categories of programming languages for the PDP-8:

Compiler-based languages - you write your code in some editor, then you save it (to paper tape, or magnetically if you’re lucky), then you start up the compiler, possibly multiple passes, then you start the linker, and at the end you have a binary, which you can start to see if it works.

This process is tedious, to put it mildly. With a disk, a Fortran compilation of a simple “Hello world” program takes 10 seconds or so, but that increases to about 10 minutes with DECtapes, and even more if you have to save to paper tape and also load each new program that way.

Only then will you know whether you mis-typed anything or forgot a comma.

Some languages for the PDP-8 were: Fortran II and IV, Algol, Pascal, Lisp (!), and Forth.

Many of these require at least 8 Kwords of core memory, sometimes even 28 Kw. If you only have 4 Kw, the minimal and default PDP-8 configuration, then all you could probably use is the machine-level instruction “assembly language”.

The compilers and linkers themselves were invariably written with an assembler. It’s hard to imagine how much time and effort it must have taken to design, implement, and test these elaborate software systems, fitting their code and data structures into that quirky 32 Kword, 4096 w/bank, 128 w/page memory layout. Text was stored as two 6-bit characters per word: no lowercase, and only a very limited set of special characters! Six-char var names, what a luxury!

Interpreted languages - imagine sitting at the teletype, entering some commands and getting a computed reply back within a second - nirvana!

That was the promise of interpreted programming languages then, and that’s the appeal of scripting languages today (that distinction is all but gone with today’s performance levels).

On the PDP-8, there was BASIC, which incidentally was designed at just about the same time as the PDP-8 hardware. It lets you enter commands in immediate mode, as well as edit them into a larger program by starting each line with a line number. You could enter strange things like:

20 GOTO 10
10 PRINT "HELLO WORLD"

And the computer would execute them in line-number order, creating an infinite loop in this case. By hitting Control-C (sound familiar?), you could abort the running program and regain control. The actual line-number values were irrelevant - only their order mattered - and by leaving gaps you could insert additional lines later, such as:

15 PRINT "1 + 2 =", 1+2

Typing “LIST” would print out the entire program:

10 PRINT "HELLO WORLD"
15 PRINT "1 + 2 =", 1+2
20 GOTO 10

All the essential tools were present for interactive use: a command line, a crude-but-effective editor (with “LOAD” and “SAVE” commands for paper tape or disk files), and your code, waiting to be executed, enhanced, or debugged. In many ways, we still use this same process today.

This approach, and BASIC in general, was definitely the mainstream model for the next twenty years, when 8-bit hobby computers and CP/M and MS-DOS became the dominant systems.

The other interpreted language on the PDP-8 was FOCAL, developed by DEC. Just like BASIC, this was a completely self-contained system. It ran (barely) in just 4 Kw, and there was no “operating system” in sight. Focal-69, the most widespread variant, was the operating system.

Again, considering the hardware this all ran on - and the fact that these systems themselves had to be written in assembly language - turning the PDP-8 into an interactive and highly responsive system was quite a revolutionary feat at the time.

Operating systems

Then came magnetic storage. Even if an expensive (but fast!) fixed-head DF32 with 4 platters could only hold 128 Kwords of data, it changed the landscape for good. Gone were the time sinks of loading, saving, re-loading, and damaged or lost paper tapes. The operating system turned these disks (and DECtapes) into data filing cabinets. That’s why they’re called “files”!

File names were up to 6 characters, with a 2-letter “extension” to indicate the type of file (does this ring a bell?). This was also the start of utilities such as “PIP”, the Peripheral Interchange Program which could shuttle data around from one file to the next, from paper tape to disk, from disk to teletype, and so on.

The computer was starting to become more of an information processor, and less of a purely computational / number-crunching engine. And the PDP-8 was right in the middle of all this, with well over half a million units in the field.

The PDP-8 was fertile territory for several groundbreaking operating systems:

OS/8 was the first and main one - a PDP-8 + disk or DECtape + OS/8 was all you needed to get tons of work done (or games). A slow but very respectable precursor of the Personal Computer.

And then more people wanted to join in on the game. Most of the time, all these computers were just sitting, twiddling their thumbs after all, waiting for a bunch of sluggish carbon-based life-forms to press the next key on their keyboard. What a silly waste of (the computer’s) time!

Meet TSS-8, the time-sharing system: it gave each user a “time slice” of a single shared PDP-8, swapping data to and from a disk, as needed to maintain the illusion. While one person was typing, another one could be running a calculation, and they’d both get good mileage out of the system. Just hook up a few more terminals to the machine, and you’re off. Apparently, up to 17 users could share a PDP-8, and its smallest configuration only needed 12 Kwords of RAM!

There’s also ETOS/8 - a virtualising OS, giving each user the illusion of a complete machine.

SIMH and the PiDP-8/i

Last, but certainly not least impressive, there’s the SIMH emulator and Oscar’s PiDP-8/i mod to display SIMH’s internal state (see the “Software” section on this page) - he does this by poking around in the (simulated) memory space - a clever way to let the simulator run full speed while still presenting a continuous glimpse inside via the LEDs. All thanks to multi-tasking in Linux.

Everything mentioned above can be tried on the PiDP-8/i. The front panel has a special trigger, where the three INST-FIELD switches in combination with the SINGLE-STEP toggle can be used to start up different software sets, as prepared by Oscar on his pipaOS-based SD card image (which will boot quite a bit faster than a standard Raspbian distro).

Here are the front-panel quick-launch cheat codes:

Octal  IF-sw  Description
--------------------------------------------------------------
  0     000    (used on power-up, set to same as slot 7)
  1     001    RIM Loader at 7756, paper tape/punch enabled
  2     010    TSS/8 multi-user system. Up to 7 telnet logins
  3     011    OS/8 on DECtape (a lot slower, also simulated)
  4     100    Spacewar! With vc8e output on localhost:2222
  5     101    (empty slot, 10-instr binary counter demo)
  6     110    ETOS/8, also with telnet logins on port 4000
  7     111    OS/8 on RK05 10MB disk cartridge

The best resource for all this is really the PiDP-8/i website. It has all the information you might want and lots of pointers to other documentation, software, and the pidp-8 discussion group.

Note that on a single core Raspberry Pi A+ or B+, SIMH runs flat-out, consuming nearly 100% CPU - yet the system remains fairly responsive, even when logged in via a network session. To regain most of the processor time, you can suspend SIMH by entering “ctrl-e” - and later enter “cont” to resume the simulation and blinking. You don’t need to quit simh to get a shell prompt: just type “ctrl-a ctrl-d” to suspend the session and “~/pdp.sh” to resume it.

That’s it for our brief excursion into the world of computing 50 years ago - fun from the 60’s!

Forget what you know, please


In the beginning, there were computers. Programmed with wires, then with binary data (the “stored-program” computer), then with assembly language, and from there on, a relentless stream of new programming languages. Today’s web browsers all “run” JavaScript.

Here’s a summary of that evolution again, in terms of technology:

  • wires: just hardware circuits and manually inserted patch cords, yikes!
  • binary data, i.e. “machine code”: a very tedious and low-level way of programming
  • assembly: symbolic notation, macros, you have to keep track of all registers in your head
  • Fortran, Cobol, Algol, Pascal, C: yeay, it gets compiled for us!
  • Basic, Lisp, Icon, Snobol, Perl, Python, Ruby - no compiler: immediate interpretation!
  • but interpreters are slow - we can compile on-the-fly and just-in-time
  • and today, with JavaScript / NodeJS, Java, etc: the compiler has become invisible

The story could end here, but then there is that embedded microcontroller world, with smart chips in just about anything powered by electricity. While powerful and capable of generating byte code and even machine code, they do not have the storage and memory to run a high-end optimising compiler. Even if projects such as Espruino and MicroPython have come a long way to bring complete self-contained environments to the µC world - they still depend heavily on a larger machine to produce those run-time environments we can flash into the µC.

This has an important implication: everything not implemented and linked into Espruino or MicroPython has to be written in the higher-level language (JavaScript or Python, respectively). That works and can be quite convenient, but you lose performance big time (think 1000-fold and more) - these are still interpreted languages, after all. For some cases, this is irrelevant - reading out an I2C sensor and analysing its values can easily be done slowly, if the I2C support is present and if we’re only reading out that sensor once a second or so.

But what if we want more performance? - or run on a smaller µC with 32..128 KB of flash?

One solution is the Arduino IDE: a cross compiler which runs on a large “host” and generates code for our very limited “target” µC. Or some similar “ARM embedded gcc toolchain”.

Which is where we stand today, in 2016: tethered software development, with the source code and tools living in one world (our laptops or the web), and the µC being sent a firmware upload to perform the task we’ve coded up for it, after translation by our toolchain:

  • you have to set up that toolchain (for your choice of Windows, Mac OSX, Linux)
  • you have to keep track of the source code, in case you ever need to change it
  • the µC will do its thing, but any change to it will require going back to the host
  • software debugging is tedious: add a print statement, compile, upload, try, rinse, repeat
  • hardware debugging requires proper toolchain support and maybe also learning “gdb”

What if we just want to investigate the hardware, check out a few settings in the chip, briefly toggle a pin or adjust a hardware register setting? Tough luck: you have to leave the flow of design and implementation, and enter the (completely different) world of remote debugging.

Our µC might as well be on Mars. With all our fancy tools (constantly updated, improved, changed) we’re virtually coding in the dark nowadays. We’re adding layer upon layer of technology and infrastructure, just to make that darn LED blink! Or read out a sensor, or make something turn, or respond to sensor changes, whatever. Does it really have to be so hard?

(speaking of Mars: Forth has been used in several NASA space missions)

What if we could talk to an embedded µC directly over a serial port connection - give it simple commands, tell it things to do now, save things for later, or have it run them continuously? As we gradually build up our application, the µC records what we’ve done, lets us change things as much and as often as we like, selectively wiping some previously saved definitions.

Forth can do that. It’s a programming language, but it’s also a full-blown development system. Once you store the Forth “core” into the µC, you’re done. From then on, you can type at it, make it do things, and go wild. If you make a mistake (as we all do, especially while trying out stuff), you simply press reset to return to a stable system.

There is hardly a toolchain involved. The Mecrisp Forth core is written in GNU “assembler”, producing a 16..20 KB “.bin” or “.hex” file, and that’s it. You never need to go back and change it. Everything else can be built on top. Mecrisp Forth is extremely fast, so what you write can also be. There’s an assembler for the ARM Cortex written in Forth: if you load it in, you can extend the core by adding assembler code (using a Forth-like syntax). There’s even a disassembler…

(please note that assembly language is there if you want it, but hardly ever needed in Forth)

But there is one major (and very painful) drawback, in today’s world with millions of lines of code written in C and C++: Forth and C don’t really mix. A µC running the Forth core cannot easily interoperate with C code, although it can be tricked into calling external routines with C linkage (Forth can generate assembler instructions for any purpose, after all).

To sum it all up: think of the Mecrisp Forth core as a boot loader - you have to get it onto the µC once, and then it becomes the new “communication language” for the chip. From there on, this µC will understand plain text Forth commands, including saving potentially large amounts of (your!) Forth definitions after its own flash memory area. All you need, is a serial port + terminal interface, plus a robust way to send larger amounts of Forth source code to the chip.

With Forth, you don’t have a “build environment”. Forth is the environment, and it’s running on the chip you’re programming for. It’s intensely interactive and there are no layers of complexity. There is no compiler, no linker, no uploader (other than a text-mode send tool), no bytecode, no firmware image, no object code, there are no binary libraries, no conditional compilation, no build flags.

For turnkey use, you can define a function called “init” and save it in flash memory. Then your chip will run that code on every reset. But beware: if you don’t include a mechanism to get back to command mode, then the only way to get back control is to reflash the chip with a fresh core…
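As a rough sketch (not from the Mecrisp documentation - “led-on”, “led-off”, and “ms” are hypothetical words you would define yourself), a turnkey setup with such an escape mechanism could look like this:

compiletoflash
: init ( -- )  \ runs automatically on every reset
  begin
    led-on 100 ms  led-off 400 ms  \ hypothetical blink words
  key? until ;  \ escape hatch: any keypress drops back to the prompt
compiletoram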

There is one other “drawback”: Forth blows away every notion of language syntax and software development methodology you’re probably used to - but that’s for the next articles…

DSLs and expressiveness


The KEY design choice in Forth is its dual data- and return-stack. Forth is a stack-oriented programming language. Another characterisation is that Forth is a concatenative language. Here’s what this means:

Data is manipulated on a stack - this example pushes “1”, then “2” on the stack, then applies the “+” operation on those two values, replacing them with their sum, and ends by running the “.” operation which pops-and-prints the top of the stack, i.e. “3”:

1 2 + .

Suppose we have this sequence, using the “/” division operator:

10 2 / . 20 2 / . 30 2 / .

The output will be “5 10 15”. What about this, then?

10 2 / 20 2 / 30 2 / . . .

A little mental exercise will tell you that it will print out “15 10 5”. Now try this:

10 2 20 2 30 2 / / / . . .

In this case, the output will be: “0 2 10” (whoops, that was a division by zero!). One more:

10 2 20 2 30 2 / . / . / .

Output: “15 10 5”. Hey, it’s the same as two examples back! What happened?

It may seem like silly mental gymnastics, but there’s something very deep going on here, with far-reaching implications. The first thing to note is that operations (they tend to be verbs) take input (if any) from the top of the data stack and leave output (if any) on the stack in return. Not just numbers: strings, pointers, objects, anything.

That’s where the concatenative aspect comes in: operators can be combined into new ones without having to know what they do to the stack! - we can define a “2/” operator for example:

: 2/ 2 / ;

Looks quirky, eh? What this means is: when you see the word “2/” (yes, “2/” is a valid name in Forth, see below), execute “2” followed by “/”. Now that first sequence above can be written as:

10 2/ . 20 2/ . 30 2/ .

Not very useful, but now we could make it more wordy - for example:

: half 2 / ;
10 half . 20 half . 30 half .

We could even define things like “: 3div / / / ;”, or “: triple-dot 3 0 do . loop ;” !

Here’s another example (let’s assume that “delay” expects milliseconds as input):

: seconds 1000 * ;
12 seconds delay

You can see how carefully chosen (natural?) names can lead to fairly readable phrases.

It’s still not a very sophisticated example, but the point is that you can look at any Forth source code and simply scan for repetitions without having a clue about the inner details. Any sequence occurring more than once is a hint that there’s something more general going on. Turning that into a definition with a more meaningful name is a very Forth’ish thing to do. And guess what? As you implement more code in Forth, the process becomes second nature as you write!

And that’s really what Forth is about: creating a Domain-Specific Language (DSL) while writing code. As you formulate your logic in Forth words, you invent steps and come up with names for them. Don’t look back too much, just keep on writing. At some point, any (near-) repetition will show through. Then you can redefine things slightly to make the same phrases become more widely usable. And before you know it, you’re in a universe of your own. Writing in terms uniquely adapted to the task at hand, combining very small existing pieces into larger ones.

The big surprise is that this effort coincides with getting to grips with the logic of a problem.

It’s not uncommon to write dozens, or even hundreds of new word definitions as you progress. One thing to keep in mind is that Forth is essentially bottom-up: a word can only use words which have been defined before it (there are ways around this, and writing recursive calls is in fact easy).

There’s another intriguing property of this stack-based approach: there are no local variables! This might seem like a disadvantage, but it also means that you don’t have to invent little names all the time - the code becomes breathtakingly short because of this. All emphasis should go to defining lots of words with well-chosen action names, each operating on only a few stack entries.
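Here’s a tiny made-up example of that style - no variables anywhere, just values flowing on the stack:

: squared ( n -- n*n ) dup * ;
: sum-squares ( a b -- a*a+b*b ) squared swap squared + ;
3 4 sum-squares .  \ prints 25

You can check it by tracing the stack: 3 4 becomes 3 16, then 16 3, then 16 9, then 25.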

As you can see, Forth code can look a bit unusual to the untrained - C/C++ influenced - eye:
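(The original post embeds the full listing here; as a stand-in, this hedged two-line sketch gives the flavour - the actual factoring in io-stm32f103.fs differs:)

: io ( port# pin# -- pin ) swap 8 lshift or ;  \ pack port and pin into one cell
: io# ( pin -- u ) $1F and ;  \ extract the bit position again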

These define the words used to perform low-level I/O pin operations. The line “: io ...” defines “io” as a new function, and text enclosed in parentheses is treated as a comment.

In Forth, there is only one syntax rule: a “word” is anything up to the next whitespace character. Some words can take over and read stuff after them, in which case the effects can be different. That’s why the “\” word can act as a comment: it eats up everything after it until the end-of-line.

These definitions are not so much an implementation (although, that too) as they are about defining a vocabulary and a notation which fits the task at hand - in this case defining I/O pins and getting/setting/flipping their value - “@” is a common idiom for fetches and “!” for stores.

And what to make of this implementation of bit-banged I2C, using the above?
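(Again, the full listing is embedded in the original post. To give a taste, the start and stop conditions might be factored like this - a hedged sketch, assuming SCL and SDA have been defined as io pins:)

: i2c-start ( -- ) SDA io-1! SCL io-1! SDA io-0! SCL io-0! ;  \ SDA falls while SCL is high
: i2c-stop ( -- ) SDA io-0! SCL io-1! SDA io-1! ;  \ SDA rises while SCL is high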

Yes, this will look daunting at first, but keep in mind the concatenative nature of it all. You can essentially ignore most of these lower-level definitions. The example at the bottom illustrates this - all you need to look at are these two lines, consisting mostly of comment text:

: rtc! ( v reg -- ) \ write v to RTC register
  ... ;
: rtc@ ( reg -- v ) \ read RTC register
  ... ;

The “( reg -- v )” comment documents the “stack effect”, i.e. what this function expects on the stack (reg) and what it puts back in return (v). Note that in a way, local variable names have crept back in as comments, but only for documentation purposes (and often as a data type, not a name). The code itself still runs off the stack, and is extremely concise because of it.

This is how to read out the seconds value from register 0 of the RTC over I2C and print it out:

0 rtc@ .

Everything else can be ignored - as long as the underlying code works properly, that is!

Is Forth low-level? Is it high-level? It definitely seems able to bridge everything from super-low bare silicon hardware access to the conceptually abstract application-level logic. You decide…

I/O, ADCs, OLEDs, and RFM69s


Software development in Forth is about “growing” the tools you need as you go along. Over time, you will end up with a set of “words” that are tailored for a specific task. Some words will end up being more reusable than others - there’s no need to aim for generality: it’ll happen all by itself!

Digital I/O

Let’s start with some examples for controlling GPIO pins on an STM32F103:

  • Define Port B pin 5 as an I/O pin:

    1 5 io constant PB5
    

    Actually, it’s easy to pre-define all of PA0..15, PB0..15, etc - see this code.

  • Set up PB5 as an open-drain output pin:

    OMODE-OD PB5 io-mode!
    
  • Here’s how to set, clear, and toggle that output:

    PB5 io-1!   PB5 io-0!   PB5 iox!
    
  • To read out the current value of PB5 and print the result (0 or 1), we can do:

    PB5 io@ .
    

There are some naming conventions which are very common in Forth, such as “@” for accessing a value and “!” for setting a value. There are many words with those characters in them.
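For example, with a plain variable (note the Mecrisp convention of supplying the initial value up front):

0 variable answer  \ create a variable, initialised to 0
42 answer !        \ store 42 into it
answer @ .         \ fetch it back and print: 42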

Here are all the public definitions from the io-stm32f103.fs source file on Github:

: io ( port# pin# -- pin )  \ combine port and pin into single int

: io-mode! ( mode pin -- )  \ set the CNF and MODE bits for a pin

: io@ ( pin -- u )  \ get pin value (0 or 1)
: io! ( f pin -- )  \ set pin value
: io-0! ( pin -- )  \ clear pin to low
: io-1! ( pin -- )  \ set pin to high
: iox! ( pin -- )  \ toggle pin

: io# ( pin -- u )  \ convert pin to bit position
: io-base ( pin -- addr )  \ convert pin to GPIO base address
: io-mask ( pin -- u )  \ convert pin to bit mask
: io-port ( pin -- u )  \ convert pin to port number (A=0, B=1, etc)

: io. ( pin -- )  \ display readable GPIO registers associated with a pin

Only the header of each word is shown, as produced with “grep '^: ' io-stm32f103.fs”.

Note that this API is just one of many we could have picked. The names were chosen for their mnemonic value and conciseness, so that small tasks can be written with only a few keystrokes.

Analog I/O

Here’s another “library”, to read out analog pins on the STM32F103 - see adc-stm32f103.fs:

: init-adc ( -- )  \ initialise ADC
: adc ( pin -- u )  \ read ADC value

Ah, now we’re cookin’ - only two simple words to remember in this case. Here’s an example:

init-adc   PB0 adc .

Not all pins support analog, but that’s a property of the underlying µC, not the code.

I2C and SPI

The implementation of a bit-banged I2C driver has already been presented in a previous article. Unlike the examples so far, the I2C code is platform-independent because it is built on top of the “io” vocabulary defined earlier. Yippie - we’re starting to move up in abstraction level a bit!

Here’s the API for a bit-banged SPI implementation:

: +spi ( -- ) ssel @ io-0! ;  \ select SPI
: -spi ( -- ) ssel @ io-1! ;  \ deselect SPI

: spi-init ( -- )  \ set up bit-banged SPI

: >spi> ( c -- c )  \ bit-banged SPI, 8 bits
: >spi ( c -- ) >spi> drop ;  \ write byte to SPI
: spi> ( -- c ) 0 >spi> ;  \ read byte from SPI

Some words are so simple that their code and comments will fit on a single line. That code can be very helpful to understand a word and should be included, as shown in these definitions.
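Assuming the pins from the “Generality” section below have been set up, a hypothetical exchange with some SPI device then becomes a one-liner:

spi-init
+spi $42 >spi spi> . -spi  \ made-up example: send register number $42, read one byte back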

Generality

You may be wondering which I/O pins are used for SPI and I2C. This is handled via naming: the above source code expects certain words to have been defined before it is loaded. For example:

PA4 variable ssel  \ can be changed at run time
PA5 constant SCLK
PA6 constant MISO
PA7 constant MOSI

The pattern emerging from all this, is that word definitions are grouped into logical units as source files, and that they each depend on other words to do their thing (and to load without errors, in fact). So the I2C code expects definitions for “SCL” + “SDA” and uses the “io” words.

It’s “turtles all the way down!”, as they say…

In Forth, you can define as many words as you like, and since a word can contain any characters (even UTF-8), there are a lot of opportunities to find nice mnemonics. When an existing word is re-defined, it will be used in every following reference to it. Re-definition will not affect the code already entered and saved in the Forth dictionary. Everything uses a stack, even word lookup.

If you need two bit-banged I2C interfaces, for example, you can redefine the SCL & SDA words and then include the I2C library a second time. This will generate some warnings, but it’ll work.
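A hedged sketch of that trick (the pin choices and file name here are hypothetical):

PB6 constant SCL  \ redefine the pin words...
PB7 constant SDA
include i2c-bitbang.fs  \ ...then pull the library in a second time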

RFM69 driver

With the above words in our toolbelt, we’re finally able to build up something somewhat more substantial, i.e. a driver for the RFM69 wireless radio module, which is connected over SPI:

: rf-init ( group freq -- )  \ init the RFM69 radio module
: rf-freq ( u -- )  \ change the frequency, supports any input precision
: rf-group ( u -- ) RF:SYN2 rf! ;  \ change the net group (1..250)
: rf-power ( n -- )  \ change TX power level (0..31)

: rf-recv ( -- b )  \ check whether a packet has been received, return #bytes
: rf-send ( addr count hdr -- )  \ send out one packet

With some utility code and examples thrown in to try it out:

: rf. ( -- )  \ print out all the RF69 registers
: rfdemo ( -- )  \ display incoming packets in RF12demo format
: rfdemox ( -- )  \ display incoming packets in RF12demo HEX format

This code is platform independent, i.e. once “io” and “spi” have been loaded, all the information is present to load this driver. The driver itself is ≈ 150 lines of Forth and compiles to < 3 KB.
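A hypothetical startup session, using nothing but the words listed above:

6 8686 rf-init  \ made-up values: net group 6, frequency 868.6 MHz
16 rf-power     \ moderate TX power (range is 0..31)
rfdemo          \ start printing incoming packets, RF12demo style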

… and more

If you want to see more, check out this driver for a 128x64 pixel OLED via I2C, plus a graphics library with lines, circles, texts which can drive that OLED. Or have a look at the usart2 code for access to the second h/w serial port. There’s even a cooperative multi-tasker written in Forth.

Everything mentioned will fit in 32 KB of flash and 2 KB RAM - including Mecrisp Forth itself.

But to make it practical we’ll need some more conventions. Where to put files, how to organise and combine them, etc. Take a look at this area for some ideas on how to set up a workflow.


Starting Forth on an STM32F1


Here is what we’re after, as Forth Development Environment (would that be an “FDE”?):

There are a number of steps needed to use Mecrisp-Stellaris Forth in your own projects (for these articles, we’ll be focusing on the STM32F103 µC series with 64..512 KB flash memory):

  1. getting the Mecrisp core “flashed” onto the microcontroller
  2. setting up a convenient connection between your laptop and the µC board
  3. streamlining the iterative coding cycle as much as possible

Note that this is in some way quite similar to hooking up an Arduino or JeeNode, and developing software for it through the Arduino IDE. But there are also some substantial differences:

  • pick your own editor, “whatever works for you” is by far the best choice
  • no compilers, no debuggers, no toolchain - just a simple way to talk to the µC
  • no binary code, no runtime libraries, just you, your code, and your terminal

The Arduino approach puts all complexity in the “host” laptop setup. The Mecrisp approach builds words in the µC, on the fly, when they’re typed in (or uploaded, i.e. “simulated typing”).

Installing Mecrisp

Step 1) is not Mecrisp-specific. It’s the same stumbling block with every µC setup which needs specific firmware. You need to download the latest Mecrisp-Stellaris release from SourceForge, and “get it onto that darn chip… somehow” !

Here are some ways to do this, depending on what interface tools you have and your O/S:

The firmware in the Mecrisp distribution is available in two versions, a “.bin” and a “.hex” file:

stm32f103/mecrisp-stellaris-stm32f103.bin
stm32f103/mecrisp-stellaris-stm32f103.hex

It depends on the upload mechanism as to which one you need. With a Black Magic Probe (BMP) and arm-none-eabi-gdb, for example, the following commands should do the trick:

% arm-none-eabi-gdb
[...]
(gdb) tar ext /dev/cu.usbmodemD5D1AAB1    (adjust as needed, of course)
(gdb) mon swdp
(gdb) at 1
(gdb) mon erase                                (essential for Mecrisp!)
(gdb) load mecrisp-stellaris-stm32f103.hex
(gdb) q

Then, again if you are using a BMP and running on Mac OSX or Linux:

% screen /dev/cu.usbmodemD5D1AAB3 115200
Mecrisp-Stellaris 2.2.1a for STM32F103 by Matthias Koch
  ok.
(quit with "ctrl-a ctrl-\" or "ctrl-a \" - depending on your setup)

The serial connection must be set up as 115200 Baud for Mecrisp - 8 bits, no parity, 1 stop bit.

If you’re using an ST-Link to upload the firmware, these two commands will do the trick:

st-flash erase                                # essential for Mecrisp!
st-flash write mecrisp-stellaris-stm32f103.bin 0x08000000

It’s very simple and quick, but only · a f t e r · you’ve got all those Pesky Little Details just right. Getting firmware onto a bare STM32F103 µC can still be a hit-and-miss affair. There are simply too many variables involved to come up with a procedure here which will work for everyone.

The good news is that with a little care, you will not have to repeat this step again. Mecrisp is quite good at keeping itself intact (it refuses to re-flash itself, for example).

Installing PicoCom

One of the things you’ll notice if you try out the above setup with screen, is that it doesn’t quite get the line endings right (which are bare LFs in Mecrisp, not CR+LF). It’s better to install a slightly more elaborate terminal emulator - and PicoCom is in fact a very good option for Mac OSX and Linux, as will become clear below. For Windows, there is TeraTerm.

To install PicoCom on Mac OSX with Homebrew, enter this in a command shell:

brew install picocom

To install PicoCom on Debian/Raspbian/Ubuntu Linux, type:

sudo apt-get install picocom

The benefit of PicoCom is that it allows specifying a program to use for uploads. We don’t just want to enter text manually; we also need to send entire source files to Mecrisp Forth over serial. The problem is that a bare Mecrisp installation only supports polled serial I/O without handshake. This can only handle text if it’s not coming in “too fast”. In Mecrisp, each word on a line needs to be looked up and compiled, and it all happens on a line-by-line basis. This means that you have to wait for its “ok.” prompt after each line, before sending more text.

One solution is to send all text · v e r y · s l o w l y · but that’ll make it extremely time-consuming.

Installing msend

A better solution is to send full speed and wait for that final prompt before sending the next line, to avoid input characters getting lost. This little utility has been created to do just that: msend.

If you have Go installed, getting msend (Mac OSX and Linux only, for now) is again a one-liner:

go get github.com/jeelabs/embello/tools/msend

Otherwise, you can get the latest binary release for a few platforms from GitHub.

With “msend” installed, PicoCom can now be started up as follows:

picocom -b 115200 --imap lfcrlf -s msend /dev/cu.usbmodemD5D1AAB3

Or even as “mcom /dev/cu.usbmodemD5D1AAB3” - if you add an alias to your .bashrc init file:

alias mcom='picocom -b 115200 --imap lfcrlf -s msend'

And now line endings not only work properly, you also get a very effective upload facility. This will be worth its own article, but you can see a transcript of an upload with includes over here.

Sending a file with PicoCom is triggered by typing “ctrl-a ctrl-s”.

To quit PicoCom, type “ctrl-a ctrl-x” - see also the manual page for further details.

Windows

Neither PicoCom nor msend are available for Windows, but there’s another solution:

  • install TeraTerm, which is a terminal emulator for Windows
  • look at this script file for TeraTerm, by Jean Jonethal

This combination should accomplish more or less the same as picocom + msend, i.e. terminal access, throttling text sends, and inserting “include” files.

Optimising workflow

Forth software development is about flow and insanely fast turnaround times between coming up with an idea and trying it out. There are no compilers or other tools to slow you down, and as a result you can type and try out an idea the moment it pops into your head. Total interactivity!

At the same time, the last thing we want is to constantly re-enter code, let alone lose it for good if the µC crashes. The challenge is to find a proper balance between volatile commands (typed in directly at the Mecrisp prompt, on the µC) and re-sending lots of text from a laptop all the time.

Mecrisp has an elegant and simple approach to help with this:

  • when you power it up, Mecrisp remembers only what it had stored in flash memory
  • all new definitions (i.e. “: myword ... ;”) are added and compiled into RAM
  • stack underflows (a common mistake) clear the stack but won’t lose RAM
  • a reset (whether in hardware or using the “reset” word) will lose everything in RAM
  • you can save your next definitions to flash memory by typing “compiletoflash
  • this will continue until you press reset or enter “compiletoram
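Here’s a short session sketch of that flash/RAM split (the word itself is made up):

compiletoflash
: hello ." Hello from flash! " ;  \ this definition will survive a reset
compiletoram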

The thing is that in Mecrisp Forth, a hard crash is no big deal - you should expect to run into stuck code, awful crashes, weird things happening, non-responsive terminal I/O, etc. There’s a reset button on the µC which will get you back to a working state the (sub-) second you use it.

It could be a typo. There could be a hint in what’s on the screen. But even if not, if you make your coding cycles short and frequent, then chances are that you’ll quickly discover what went wrong.

Otherwise… the interactive Forth prompt is your friend: examine the values of variables, or the settings in hardware registers, and invent whatever words you need to help figure out this issue. Words can be one-liners, written only for use in the next few minutes of your investigation!

The more loosely coupled your words are, i.e. called in sequence, not nested too deeply, the easier it will be to set up the stack and call any one of them, in isolation, from the prompt. If something fails, you can take over and repeat the rest of the words by hand, verifying that the stack is as expected (check out the “.s” word!), and peeking around to see what’s going on.

Looking at the diagram above, you’ll see that there are two kinds of permanence in this context: source code in files, and words defined in flash memory. The latter cannot easily be turned back into source, alas. That means they should be either one-offs or created by an earlier upload.

Although the best workflow has yet to be found, some comments on what is likely to work well:

  • new code, especially when it’s about getting the hardware interface right, needs to run on the µC and can be quickly explored ad-hoc - at the Forth prompt, no definitions needed
  • you can read / write to registers with “io@” / “io!” commands in a “peek and poke” style
  • lengthy setup code can be written in your editor, and then uploaded and saved to flash
  • hardware addresses are a lot easier to use as pre-defined Forth words (i.e. “constant”)
  • if you make uploaded code store itself in flash, you won’t have to re-upload it after a reset
  • the “cornerstone” word can partially unwind definitions from flash - great for uploads
  • make sure your terminal window keeps a lot of history - it’s a very effective historical log

Maybe the rlwrap tool can be made to work with PicoCom - for command history and editing.

There is a lot more to say about this. The “msend” utility recognizes lines starting with the word “include” as special requests to fetch a file from the system and send its contents (this can be nested). This allows keeping various word sets in their own files, and then selectively include them in each project. You can add “compiletoflash” to save the more stable words in flash.
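An upload file might then look like this - the paths are hypothetical, they depend on where you keep your copy of the Embello tree:

compiletoflash
include ../flib/stm32f1/io.fs   \ stable library code, saved to flash
include ../flib/stm32f1/adc.fs
compiletoram
include app.fs                  \ the code still being worked on stays in RAM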

For more ideas on how to organise the code, see the README in the Embello area on GitHub.

There is no need for large nested source file trees. Forth source code tends to be very compact - a single page of code is usually more than enough to implement a complete well-defined module. One directory with a few dozen files is plenty. Put them under source code control in GitHub or elsewhere, and you’ll have your entire project under control for the long-term. Each project can contain all the files it needs to be re-created (i.e. re-uploaded to a µC running Mecrisp Forth).

Enough for now, this’ll get you started. Now go Forth, and create lots of embedded µC projects!

Buffered serial port interrupts


Mecrisp only implements the minimal serial interface required, i.e. USART1 with polled I/O. This is very limited, because the serial port has no buffering capability: if we don’t poll it often enough (over 10,000x per second for 115200 baud!), we risk losing incoming input data.

The standard solution for this is interrupts: by enabling the RX interrupt, we can get the data out in time for the next one to be processed. Although this merely moves the problem around, we can then add a larger buffer in software to store that input data until it’s actually needed.

Let’s implement this - it’s a nice example of how to make hardware and software work together:

  • to avoid messing up the only communication we have to Forth, i.e. USART1, we’ll be much better off developing this first for USART2 - as changing the values to adapt it to USART1 will be trivial once everything works
  • we’re going to need some sort of buffer, implemented here as a generic “ring buffer”
  • we need to set up the USART2 hardware, the easiest way is to start off in polled mode
  • lastly, we’re going to add an interrupt-handling structure which ties everything together

Circular buffering

What we want for the incoming data is a FIFO queue, i.e. the incoming bytes are pushed in at one end of the buffer, and then pulled out in arrival order from the other end.

A ring buffer is really easy to implement - this Forth implementation is a mere 16 lines of code. Its public API is as follows - for initialisation, pushing a byte in, and pulling a byte out:

: init-ring ( addr size -- )  \ initialise a ring buffer
: >ring ( b ring -- )  \ save byte to end of ring buffer
: ring> ( ring -- b )  \ fetch byte from start of ring buffer

We also need to deal with “emptiness” and avoiding overrun:

: ring# ( ring -- u )  \ return current number of bytes in the ring buffer
: ring? ( ring -- f )  \ true if the ring can accept more data

Ring buffers are simplest when the size of the ring is a power of two (because modulo 2^N arithmetic can then be done using a bit mask). Setup requires a buffer with 4 extra bytes:

128 4 + buffer: myring
myring 128 init-ring

With this out of the way, we now have everything needed to buffer up to 127 bytes of input data.
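
A quick check of this API at the Forth prompt might look as follows (results shown as comments):

myring ring? .   \ -1, i.e. there is room for more data
65 myring >ring  \ push one byte in
myring ring# .   \ 1, one byte is now queued
myring ring> .   \ 65, pulled back out in FIFO order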

USART hardware driver

Setting up a hardware driver is by definition going to be hardware-specific. Here is a complete implementation for the STM32F103 µC series:

$40004400 constant USART2
   USART2 $00 + constant USART2-SR
   USART2 $04 + constant USART2-DR
   USART2 $08 + constant USART2-BRR
   USART2 $0C + constant USART2-CR1

: uart-init ( -- )
  OMODE-AF-PP OMODE-FAST + PA2 io-mode!
  IMODE-FLOAT PA3 io-mode!  \ the RX pin must be configured as an input
  17 bit RCC-APB1ENR bis!  \ set USART2EN
  $138 USART2-BRR ! \ set baud rate divider for 115200 Baud at PCLK1=36MHz
  %0010000000001100 USART2-CR1 ! ;

: uart-key? ( -- f ) 1 5 lshift USART2-SR bit@ ;
: uart-key ( -- c ) begin uart-key? until  USART2-DR @ ;
: uart-emit? ( -- f ) 1 7 lshift USART2-SR bit@ ;
: uart-emit ( c -- ) begin uart-emit? until  USART2-DR ! ;

Some constant definitions to access real hardware inside the STM32F103 chip, as gleaned from the datasheet, some tricky initialisation code, and then the four standard routines in Forth to check and actually read or write bytes.

It’s fairly tricky to get this going, but a test setup is extremely simple: just connect PA2 and PA3 to create a “loopback” test, i.e. all data sent out will be echoed back as new input.

During development, it’s useful if we can quickly inspect the values of all the hardware registers. Here’s a simple way to do that:

: uart. ( -- )
  cr ." SR " USART2-SR @ h.4
  ."  BRR " USART2-BRR @ h.4
  ."  CR1 " USART2-CR1 @ h.4 ;

Now, all we need to do to see the registers is to enter “uart.“:

uart. 
SR 00C0 BRR 0138 CR1 200C ok.

That’s after calling uart-init. Right after reset, the output would look like this instead:

SR 0000 BRR 0000 CR1 0000 ok.

To test this new serial port with the loopback wire inserted, we can now enter:

uart-init uart-key? . 33 uart-emit uart-key? . uart-key . uart-key? .

The output will be (note that in Forth, false = 0 and true = -1):

0 -1 33 0  ok.

I.e. no input, send one byte, now there is input, get it & print it, and then again there is no input.

Enabling input interrupts

So far so good, but there is no interrupt handling yet. We now have a second serial port, but unless we poll it constantly, it’ll still “overrun” and lose characters. Let’s fix that next.

Here is the implementation of an extra layer around the above ring and uart code:

128 4 + buffer: uart-ring

: uart-irq-handler ( -- )  \ handle the USART receive interrupt
  USART2-DR @  \ will drop input when there is no room left
  uart-ring dup ring? if >ring else 2drop then ;

$E000E104 constant NVIC-EN1R \ IRQ 32 to 63 Set Enable Register

: uart-irq-init ( -- )  \ initialise the USART2 using a receive ring buffer
  uart-init
  uart-ring 128 init-ring
  ['] uart-irq-handler irq-usart2 !
  6 bit NVIC-EN1R !  \ enable USART2 interrupt 38
  5 bit USART2-CR1 bis!  \ set RXNEIE
;

: uart-irq-key? ( -- f )  \ input check for interrupt-driven ring buffer
  uart-ring ring# 0<> ;
: uart-irq-key ( -- c )  \ input read from interrupt-driven ring buffer
  begin uart-irq-key? until  uart-ring ring> ;

This sets up a 128-byte ring buffer and initialises USART2 as before.

Then, we set up an “interrupt handler” and tie it to the USART2 interrupt (this requires Mecrisp 2.2.2, which is currently still in beta).

The rest is automatic: as if by magic, every new input character will end up being placed in the ring buffer, and so our key? and key code no longer accesses the USART itself - instead, we now treat the ring buffer as the source of our input data.
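
With the PA2/PA3 loopback wire still in place, a quick sanity check of these new words might look like this:

uart-irq-init
33 uart-emit      \ send one byte, the interrupt handler queues the echoed copy
uart-irq-key? .   \ -1, the ring buffer now holds data
uart-irq-key .    \ 33, read back via the ring buffer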

Interrupts require great care in terms of timing, because interrupt code can run at any time - including exactly while we’re checking for new input in our application code! In this case, it’s all handled by the ring buffer code, which has been carefully written to avoid any race conditions.

Note that interrupts are only used for incoming data, the outgoing side continues to operate in polled mode. The reason is that we cannot control when new data comes in, whereas slow output will simply throttle our data send code. If we don’t deal with input quickly, we lose it - whereas if we don’t keep the output stream going full speed, it’ll merely come out of the chip a little later.

What’s the point?

You might wonder what we’ve actually gained with these few dozen lines of code.

Without interrupts, at 115200 baud, there’s potentially one byte of data coming in every 86.8 µs. If we don’t read it out of the USART hardware before the next data byte is ready, it will be lost.

With a 128-byte ring buffer, the data will be saved up, and even with a full-speed input stream, we only need to check for data and read it (all!) out within 11 milliseconds. Note that - in terms of throughput - nothing has changed: if we want to be able to process a continuous stream of input, we’re going to have to deal with 11,520 bytes of data every second. But in terms of response time, we can now spend up to 11 ms processing the previous data, without worrying about new input.

For a protocol based on text lines for example, with no more than 80..120 characters each, this means our code can now operate in line-by-line mode without data loss.

One use for this is the Mecrisp Forth command line. The built-in polled-only mode is not able to keep up with new input, which is why msend needs to carefully throttle itself to avoid overruns. With interrupts and a ring buffer, this could be adjusted to handle a higher-rate input stream.

Much faster SPI with hardware

Unlike a USART-based serial port, SPI communication is not timing-critical, at least not on the SPI master side. Since the data clock is also sent as a separate signal, slowdowns only change the communication rate. That’s why SPI is so easy to implement in bit-banged mode, as shown here.

But software implementations are always going to be slower than dedicated hardware. So here’s a hardware version which drives the clock at 9 MHz, 1/8th the CPU’s 72 MHz master clock:

$40013000 constant SPI1  
     SPI1 $0 + constant SPI1-CR1
     SPI1 $4 + constant SPI1-CR2
     SPI1 $8 + constant SPI1-SR
     SPI1 $C + constant SPI1-DR

: +spi ( -- ) ssel @ io-0! ;  \ select SPI
: -spi ( -- ) ssel @ io-1! ;  \ deselect SPI

: >spi> ( c -- c )  \ hardware SPI, 8 bits
  SPI1-DR !  begin SPI1-SR @ 1 and until  SPI1-DR @ ;

\ single byte transfers
: spi> ( -- c ) 0 >spi> ;  \ read byte from SPI
: >spi ( c -- ) >spi> drop ;  \ write byte to SPI

: spi-init ( -- )  \ set up hardware SPI
  12 bit RCC-APB2ENR bis!  \ set SPI1EN
  %0000000001010100 SPI1-CR1 !  \ clk/8, i.e. 9 MHz, master
  2 bit SPI1-CR2 bis!  \ SS output enable
  OMODE-PP ssel @ io-mode! -spi
  OMODE-AF-PP PA5 io-mode!
  IMODE-FLOAT PA6 io-mode!
  OMODE-AF-PP PA7 io-mode! ;

Note the special hardware pin settings using the STM32’s “alternate function” mode.

The select I/O pin is configured in the ssel variable. Everything else is similar to the USART2 hardware: initialisation using lots of magic bit settings gleaned from the datasheet, and then a single “>spi>” primitive which transfers a single byte out and back in via the SPI registers.

At 9 MHz, this takes under 1 microsecond per byte. These high rates can only be used across short wires, but are nevertheless perfect to interface with a large variety of SPI-based chips.

Here’s a convenient utility to inspect the SPI hardware registers with a simple “spi.” word:

: spi. ( -- )  \ display SPI hardware registers
  cr ." CR1 " SPI1-CR1 @ h.4
    ."  CR2 " SPI1-CR2 @ h.4
     ."  SR " SPI1-SR @ h.4 ;

This driver is plug-compatible with the bit-banged one presented earlier. One or the other can be loaded and used with the RFM69 driver, for example.
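
As a quick hedged check with an RFM69 module attached (assuming its select pin is on PA4): reading register $10 should return the chip’s hardware version:

PA4 ssel !  spi-init
+spi  $10 >spi  spi> hex.  -spi  \ version register, $24 on an RFM69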

Talking to a 320x240 colour LCD

Now that we have a fast SPI driver, we can tackle the more ambitious task of driving a 320x240 colour LCD display. In this example, we’ll use the HyTiny board with this 3.2” display, because the two can be connected via a simple 12-pin FPC cable (included with the display).

When you do the math, you can see that there’s a lot of data involved: 240 x 320 x 16-bit colour (max 18-bit) requires 153,600 bytes of memory (172,800 bytes in 18-bit mode). And to refresh that entire screen, we’ll have to send all those pixels to the display.

Note that although SPI-connected LCD displays are fine for many purposes, they cannot handle video or moving images - you’ll need to use a faster parallel-mode connection for that (with a much higher wire count). At 10 MHz - the maximum specified rate for the ILI9325 LCD driver - each individual pixel takes 1.6 µs to send, i.e. almost a quarter second for the entire image.

Still, with only a few dozen lines of Forth, we can tie Mecrisp’s graphics library to such a display:

Here are some excerpts from this code, which is available in full on GitHub, as usual:

$0000 variable tft-bg
$FC00 variable tft-fg

: tft-init ( -- )
  PB0 ssel !  \ use PB0 to select the TFT display
  spi-init
\ switch to alternate SPI pins, PB3..5 iso PA5..7
  $03000001 AFIO-MAPR !  \ also disable JTAG & SWD to free PB3 PB4 PA15
  IMODE-FLOAT PA5 io-mode!
  IMODE-FLOAT PA6 io-mode!
  IMODE-FLOAT PA7 io-mode!
  OMODE-AF-PP PB3 io-mode!
  IMODE-FLOAT PB4 io-mode!
  OMODE-AF-PP PB5 io-mode!
  OMODE-PP PB2 io-mode!  PB2 io-1!
  %0000000001010110 SPI1-CR1 !  \ clk/16, i.e. 4.5 MHz, master, CPOL=1 (!)
  tft-config ;

\ clear, putpixel, and display are used by the graphics.fs code

: clear ( -- )  \ clear display memory
  0 $21 tft! 0 $20 tft!
  tft-bg @ 320 240 * 0 do dup $22 tft! loop drop ;
: putpixel ( x y -- )  \ set a pixel in display memory
  $21 tft! $20 tft! tft-fg @ $22 tft! ;
: display ( -- ) ;

We have to tinker a bit more with the hardware I/O settings to switch to a different set of pins, matching the HyTiny’s LCD connector. The way it’s done here is to initialise hardware SPI as before, and then undo those I/O pin configurations and redo a few others instead.

The call to tft-config sends a whole slew of little commands to the ILI9325, which needs quite some configuration before it can actually be used after reset.

A common trick to keep the colour details out of the drawing code, is to keep two colour values in variables, used as “background” and “foreground” colour, respectively - with the background used for clearing and filling areas, and the foreground used for lines and individual pixels. By changing these variables before calling a graphics command, you can draw with any colour.
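
For example, to draw a single red pixel on a white background - assuming the display is set up for RGB565 colour encoding, with red in the top 5 bits:

$FFFF tft-bg !  \ white background
$F800 tft-fg !  \ red foreground
clear
100 100 putpixel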

One surprise with this particular ILI9325 chip was that SPI needed to run in CPOL=1 mode. Subtle “gotchas” like this can eat up a lot of debug time!

Unlike the OLED driver presented earlier, we don’t have enough RAM to keep a full image buffer in memory. The clear and putpixel primitives defined above will need to immediately send their data to the display hardware. And because of this, the display code used to update what is shown on the screen is now a dummy call.

It takes almost 2 seconds to clear the entire screen with the implementation shown above. This could be optimized quite a bit further by sending all data as one long stream instead of pixel-by-pixel. But hey, as proof-of-concept, it’s fine!

For even more performance, the SPI hardware could be driven from DMA. While this requires some memory to transfer from, it can be useful to “fill” rectangles to a fixed colour by keeping the input address fixed. Still, the upper limit is 10 MHz serial, limiting frame rates to 4 Hz max.

The Dime-A-Dozen collection

One attraction of the STM32F103 series microcontrollers, is that there are lots of them available on eBay at ridiculously low prices. There are many variants of this µC, with flash memory sizes from 64K to 512K (and beyond, even), and with anything from 36 pins to 144 pins.

If you search on eBay for “stm32f103 board”, the first one to pop up might well be this one:

Here are a few more, all running Mecrisp Forth 2.2.2:

There is no USB driver support in Mecrisp at the moment, so these have each been wired up with USB-serial interfaces. This will be needed as a first step anyway, to flash Mecrisp onto the boards.

The procedure to upload Forth to such “Dime-A-Dozen” STM32F103 boards is always similar (although there are several alternatives):

  • set the BOOT0 jumper to “1” (i.e. VCC)
  • reset the board to put it into ROM-based serial boot mode
  • get the latest mecrisp-stellaris-stm32f103.bin from SourceForge

And, lastly, run a command such as this to perform the upload:

python stm32loader.py -ewv -b 115200 -a 0x08000000 \
    -p /dev/<your-tty-port> mecrisp-stellaris-stm32f103.bin

(or use one of the alternative tools listed in the above article, such as BMP or ST-Link)

Once loaded, restore the BOOT0 jumper to “0” (i.e. GND) and then press reset. You should now see a prompt such as this show up on the serial port (USART1 is on PA9 and PA10):

Mecrisp-Stellaris 2.2.2 for STM32F103 by Matthias Koch

Press return and you’ll get Mecrisp’s standard “ok.” prompt. You’re in business!

Something to keep in mind is that there is a single STM32F103 firmware image on SourceForge, which has been built for 64 KB flash and 20 KB RAM. Chips with more memory will work just fine, but Mecrisp won’t be aware of it - flash memory beyond 64K won’t be used for compiled code storage, and RAM beyond 20 KB won’t be allocated or used by Mecrisp (which could actually be an advantage if you want to manually allocate some large buffers).

This is just the tip of Mecrisp’s iceberg, though: there are over a dozen different builds for STM32 chips, including STM’s F3, F4, and F7 series. Each build makes assumptions about the serial port it starts up on, and may depend on having a crystal of a specific frequency installed - but these settings are fairly easily changed in the source code (even though it’s in assembler!).

Some other boards which have been verified to work are:

The above boards are particularly convenient since they include a serial port to USB interface (all Nucleo boards also have ST-Link support for uploading).

LCDs need a lot of bandwidth

So far, we have created two display implementations for Mecrisp Forth: a 128x64 OLED display, connected via (overclocked) I2C, and a 320x240 colour LCD, connected via hardware SPI clocked to 9 MHz. While quite usable, these displays are not terribly snappy:

  • the OLED display driver uses a 1 KB ram buffer, which it sends in full to the OLED whenever “display” is called - this currently requires about 60 milliseconds

  • the TFT display uses a much faster connection, but it also needs to handle a lot more data: 320x240 pixels as 16 bits per pixel is 150 KB of data - changes are written directly into the display controller, but this means that it now takes over 1.4 seconds to clear the entire screen!

Fortunately, there are much faster options available, even on low-end STM32F103 chips. They are based on STM’s Flexible Static Memory controller (FSMC), a hardware peripheral which can map various types of external memory into the ARM’s address space. This requires a lot of pins, because such interfaces to external memory will be either 8-bit or 16-bit wide.

But the results can be quite impressive. To access an LCD controller connected in this way, you can now simply write to specific memory addresses in code.

Let’s try it out, using the Hy-MiniSTM32V board from Haoyu. It has an STM32F103VC µC on board, i.e. 100 pins, 256K flash, 48K RAM. Still not enough to keep a complete display copy in RAM, but as you’ll see, this no longer matters. The implementation is available on GitHub.

The code is just under 100 lines, a bit lengthy for inclusion in this article. Some of the highlights:

: tft-pins ( -- )
  8 bit RCC-AHBENR bis!  \ enable FSMC clock

  OMODE-AF-PP OMODE-FAST +
  dup PE7  io-mode!  dup PE8  io-mode!  dup PE9  io-mode!  dup PE10 io-mode!
  dup PE11 io-mode!  dup PE12 io-mode!  dup PE13 io-mode!  dup PE14 io-mode!
  dup PE15 io-mode!  dup PD0  io-mode!  dup PD1  io-mode!  dup PD4  io-mode!
  dup PD5  io-mode!  dup PD7  io-mode!  dup PD8  io-mode!  dup PD9  io-mode!
  dup PD10 io-mode!  dup PD11 io-mode!  dup PD14 io-mode!  dup PD15 io-mode!
  drop ;

As mentioned, we need to set up a lot of GPIO pins for this, and of course they have to match the actual connections on this particular board.

Next, we need to set up three registers in the FSMC hardware (that last write enables the FSMC):

: tft-fsmc ( -- )
  [...] FSMC-BCR1 !
  [...] FSMC-BTR1 !
  [...] FSMC-BWTR1 !
  1 FSMC-BCR1 bis! ;

For full details, see GitHub and the - 1,100-page - STM32F103 Reference Manual (RM0008).

So much for the FSMC. We also need to initialise this particular “R61505U” LCD controller on our board, which requires sending it just the right magic mix of config settings on startup:

create tft:R61505U
hex
    E5 h, 8000 h,  00 h, 0001 h,  2B h, 0010 h,  01 h, 0100 h,  [...]
decimal align

: tft-init ( -- )
  tft-pins tft-fsmc
  tft:R61505U begin
    dup h@ dup $200 < while  ( addr reg )
    over 2+ h@ swap  ( addr val reg )
    dup $100 = if drop ms else tft! then
  4 + repeat 2drop ;

And that’s about it. But here is the interesting bit with respect to the FSMC:

: tft! ( val reg -- )  LCD-REG h! LCD-RAM h! ;

That little definition is our sole interface to the LCD, and it just writes two values to two different memory addresses, now mapped by the FSMC.

This same approach can probably be used with a huge variety of LCD displays out there, as long as they are connected via a parallel bus and the µC has support for FSMC. You “just” need to connect the LCD properly, set up all the GPIO pins and the FSMC to match (including proper read/write timing), and initialise the LCD controller with its matching power-up sequence.

The rest is mostly boilerplate to provide the 3 definitions needed by the display-independent graphics.fs library from Mecrisp:

$0000 variable tft-bg
$FFFF variable tft-fg

: clear ( -- )
  0 $20 tft!  0 $21 tft!  $22 LCD-REG h!
  tft-bg @  320 240 * 0 do dup LCD-RAM h! loop  drop ;

: putpixel ( x y -- )  \ set a pixel in display memory
  $21 tft! $20 tft! tft-fg @ $22 tft! ;

: display ( -- ) ;  \ update tft from display memory (ignored)

And here’s the result of running all this code with the Mecrisp graphics demo:

(with apologies for the low image quality of this snapshot)

So now we’re back to displaying stuff on the screen, just like the previous two display implementations. But with the above FSMC-based code, a clear screen takes just 30 ms!

As you can see, the “clear” word above simply brute-forces its way through, by setting each screen pixel in a big loop. That’s still over 2,500 16-bit writes per millisecond, i.e. a cycle time of roughly 400 ns.

Which goes to show that performance is the result of optimising (only) the right things!

The amazing world of DMA

There are a lot of features hiding in today’s microcontrollers - even the STM32F103 series includes some very nice peripherals:

  • 2 to 3 A-to-D converters, sampling up to a million times per second
  • on the larger devices: dual D-to-A converters, with 3 µs rise times
  • 2 to 3 hardware SPI interfaces, supporting up to 18 Mbit/s
  • 2 to 5 universal serial ports, some of them supporting up to 4.5 Mbit/s

That’s a lot of data, once you start using these peripherals.

With polling, it would be very hard to sustain any substantial data rates, let alone handle I/O from several peripherals all going on at the same time.

With interrupts, it becomes easier to deal with timing from different sources, but you also need to be extra careful to avoid race conditions - which can be very hard to debug and get 100% right.

But there’s also another problem with interrupts: overhead.

To “service” an interrupt, the CPU must stop what it’s doing, save the state, and switch to the interrupt handler. And when the handler returns, it must restore the state before the original code can be resumed. This can eat up quite a few clock cycles, if only to get that saved state in and out of memory. And it leads to latency, before the interrupt handler can perform its task.

In many situations, the sustained data rates are not actually that high. We may be receiving the bytes of a packet, or lines from a serial link, or sending out a reply to an earlier request. Even at top speed, all we really need is to efficiently collect (or emit) a certain number of bytes, and then we can deal with them all at once at a considerably slower pace.

One solution for this is to add FIFOs to each peripheral: that way they can collect all incoming bytes without losing any, even if the CPU isn’t using that data right away. Likewise for output: the CPU can fill an outbound FIFO as soon as it likes, and then move on to other tasks while the hardware clocks all those bytes out at the configured rate. But it’s expensive in terms of silicon.

Meet the Direct Memory Access controller: another brilliant hardware peripheral, whose only task is to move data around. In a way, it’s like a little CPU without computational capability - all it can do is fetch, store, count, and increment its internal address registers.

The DMA “engine” of an STM32F103 chip has 7 to 12 channels depending on chip model, which can each move data around independently. These can be set up to either send or receive data from an ADC, DAC, SPI, USART, etc.

As with interrupts, DMA performs data transfers without having to continuously poll. The code which is currently running need not be aware of it. The difference with interrupts, is that even the CPU is not aware of these data transfers: DMA operates next to the CPU, grabbing its own access to peripherals and memory, and “stealing” memory cycles to perform its transfers. There’s “arbitration” involved, to keep all these cats, eh, bus masters out of each other’s way.

Here is an overview from the STM32F103 Reference Manual:

Similar to the FSMC in the previous article, it takes a bit of tinkering to set up a DMA stream, but the gains can be substantial. Imagine pushing 1 KB of data from RAM to a Digital-to-Analog converter (present on higher-end chip models):

  • with DMA, the transfer of each 12-bit value will take one memory bus cycle
  • with interrupts, it’s more like 20..50 CPU and memory cycles, from interrupt begin to end

If you’re feeding the DAC with values at 1 million samples per second, then this overhead will add up - to the point that an interrupt-based implementation might not even be fast enough!

Let’s try this. We’re going to use the same Hy-MiniSTM32V as with the FSMC. We’ll set up DMA in circular mode, causing it to send out values to the DAC from a fixed-size buffer over and over again. And to get a bit fancy, we’ll store the values of a sine wave in that buffer, so that a real (analog!) sine wave should come out once this all starts running. Code on GitHub, as usual.

First some basic non-DMA code to initialise and send values to both DACs:

: 2dac! ( u1 u2 -- )  \ send values to each of the DACs
  16 lshift or DAC-DHR12RD ! ;

: +dac ( -- )  \ initialise the two D/A converters on PA4 and PA5
  29 bit RCC-APB1ENR bis!  \ DACEN clock enable
  IMODE-ADC PA4 io-mode!  \ analog mode, as needed for the DAC pins
  IMODE-ADC PA5 io-mode!
  $00010001 DAC-CR !  \ enable channel 1 and 2
  0 0 2dac!  ;

That’s the basic DAC peripheral. Fairly simple to set up and use from code.
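
Once loaded, producing a static output voltage is a one-liner - the values are 12-bit, so 0..4095 maps onto the 0..Vref output range:

+dac
2048 1024 2dac!  \ DAC1 at mid-scale, DAC2 at quarter-scale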

Here’s the gist of the DMA setup code (details omitted for brevity):

: dac1-dma ( addr count -- )  \ feed DAC1 from wave table at given address
  1 bit RCC-AHBENR bis!  \ DMA2EN clock enable
  [...] DMA2-CNDTR3 !
  [...] DMA2-CMAR3 !
  [...] DMA2-CPAR3 !
  [...] DMA2-CCR3 !
\ set up DAC1 to convert on each DMA request
  12 bit DAC-CR bis! ;

But we also need to use a timer, to drive this process, since there is no incoming event to trigger this stream. The timer period determines how fast new values will be sent to the DAC:

: dac1-awg ( u -- )  \ generate on DAC1 via DMA with given timer period
  6 +timer  +dac  wavetable 8192 dac1-dma  fill-sinewave ;

This, and the code to fill a wavetable with sine values can be found here.

And that’s it. If we enter “12 dac1-awg”, then the DAC will start producing a really nice and well-formed 4096-sample sine wave, as can be seen in this oscilloscope capture from pin PA4:

The resulting 675.67 Hz output frequency matches this calculation:

36 MHz <APB1-bus-freq> / 4096 <samples> / (12 <timer-limit> + 1)

In case you’re wondering: DMA is now driving our DAC at over 2.7 million samples per second.

The DAC actually has several other intriguing capabilities, such as generating triangle waves and even mixing pseudo-random noise into its output. See the code on GitHub for some examples.

But the most impressive part perhaps, is that all this is happening in the background. The µC continues to run Mecrisp Forth, and remains as responsive to our typed-in commands as before. The DAC has become totally autonomous, there is not even a single interrupt involved here!

Next up: let’s find out what DMA can do for us on the Analog-to-Digital side…


Reading ADC samples via DMA

Now that we have seen how to push out values to the DAC without CPU intervention… can we do the same for acquiring ADC sample data? The answer is a resounding “yes, of course!”

And it’s not even hard, requiring less than two dozen lines of code (full details on GitHub):

: adc1-dma ( addr count pin rate -- )  \ continuous DMA-based conversion
  3 +timer
  +adc  adc drop  \ perform one conversion to set up the ADC
  2dup 0 fill  \ clear sampling buffer

    0 bit RCC-AHBENR bis!  \ DMA1EN clock enable
      2/ DMA1-CNDTR1 !     \ 2-byte entries
          DMA1-CMAR1 !     \ write to address passed as input
  ADC1-DR DMA1-CPAR1 !     \ read from ADC1

  [...] DMA1-CCR1 !
  [...] ADC1-CR2 ! ;

The setup calls the “+adc” and “adc” words, defined earlier for simple polled use of the ADC, and also sets up a timer (again, to define the sampling rate) and the relevant DMA channel.

Let’s have some fun. Let’s first start the DAC via DMA to generate a sine wave, and let’s then also set up the ADC to read and sample this signal back into memory. As set up here, the ADC’s DMA channel saves its data in a circular fashion and keeps on overwriting old data until reconfigured.

And while we’re at it, let’s also plot that acquired data on the Hy-MiniSTM32V’s LCD screen - to create a little one-channel scope (but without triggering, so the screen won’t show a stable image while this code is running). Here is the main logic (see GitHub for the whole story, as usual):

602 buffer: trace

: scope ( -- )  \ very crude continuous ADC capture w/ graph plot
  tft-init  clear border grid  TFT-BL ios!

  11 dac1-awg
  adc-buffer PB0 501 adc1-dma

  begin
    \ grab and draw the trace
    301 0 do
      adc-buffer drop i 2* + h@ 20 / 1+
      dup trace i 2* + h!  \ also save a copy in a buffer
      i pp
    loop
    40 ms  \ leave the trace on the screen for a while
    \ bail out on key press, with the trace still showing
    key? 0= while
    \ clear the trace again
    tft-bg @ tft-fg !
    301 0 do
      trace i 2* + h@
      i pp
    loop
    $FFFF tft-fg !
    grid  \ redraw the grid
  repeat ;

(where “pp” is shorthand, defined as “: pp ( x y ) 10 + swap 20 + swap putpixel ;”)

The DAC is fed sine wave samples at 0.5 MHz, and the ADC is driven by a timer running at 502 cycles, i.e. about 71.7 KHz (just because that gave a reasonably stable display - there’s clearly aliasing involved at these two rates). The DAC has a 4096-sample buffer, the ADC has only 301.

The “begin ... key? 0= while ... repeat” loop then produces an oscilloscope-like result on the screen, continuously refreshed at about 20 frames per second. By tinkering a bit with the “border” and “grid” code, we can actually add a pretty neat graticule to this screen as well:

The ADC clock is set to 12 MHz (72 MHz on APB2 with prescaler 6), i.e. under the 14 MHz limit. The max sample rate for this setup is ≈ 833 KHz (each measurement needs 14 clock cycles). This corresponds to a minimum timer value of 43 (timers are on APB1, which is clocked at 36 MHz).

Let’s examine the above main loop in a bit more detail:

  1. the “301 0 do .. loop” code displays 301 samples from the ADC acquisition buffer
  2. we also save a copy of these displayed values in a secondary trace buffer
  3. do nothing for 40 milliseconds - this is to leave the image on the screen for a while
  4. rewrite the trace onto the screen once more, but now using the background colour (black)
  5. redraw the dotted grid inside the box, since some of the dots may have been overwritten
  6. rinse and repeat

The logic behind this approach is that clearing the entire display on every pass produces a highly flickering result, as display updates are not synchronised to what the LCD controller is doing. With 30 ms to clear the screen, we’d see part of the screen blanked out, and that at every pass.

So instead, we write the pixels of the trace as we capture them, leave them on the screen for a while, and then clear those (and only those!) pixels again. That’s 250x fewer pixels to update.

Bingo - a crude-and-simple (but pretty!) capture of analog data, constantly updated on the LCD. When a key is pressed, the loop exits and leaves the last trace on the screen. As mentioned before, there’s no triggering, no config, no scaling, no line-drawing interpolation in this demo.

With a loop which only takes 8 ms (plus the 40 ms wait), there is ample processor “headroom” for all kinds of improvements. Filtering, decimation, smoothing, sinx/x traces? Go for it …

Note how the DAC and ADC hardware is driven entirely by the two DMA engines, with the CPU free to perform the main logic and rendering. All this took under 500 lines of Forth code.

DMA is like having a multi-processor under the hood, all inside that one little STM32F103!

Diving deep into STM32F103's

Mecrisp Forth 2.2.2 has been flashed onto a new series of boards here at JeeLabs, all with an STM32F103 µC, but of different sizes and with different features on-board.

Haoyu Core Board One

Well, it’s called a HY-STM32F1xxCore144 Core/Dev Board, but “Core Board One” sounds nicer:

This board has a µC with a huge number of pins in a 144-TQFP package, most of which are brought out on 0.1” headers (two dual-row 30-pin, for a total of 120).

Not all of them can be used freely, though, because the board is covered on both sides with some massive memory chips:

  • 128 MByte NAND flash memory (multiplexed over an 8-bit bus)
  • 8 MByte PSRAM (PS = pseudo-static), with 16-bit wide data access
  • 16 MByte NOR flash memory, also supporting 16-bit wide data access

That’s a lot of memory, compared to most little µC boards out there.

The reason to use this board was to learn more about the “FSMC” controller (the same as used in a previous article for fast TFT LCD access). It takes a few dozen lines of Forth code to set up, but once done, all those 8 MB of PSRAM memory become standard RAM in terms of software, all mapped into addresses 0x60000000 to 0x607FFFFF. It’s not quite as fast as the built-in 64 KB SRAM, but pretty close - more than enough for data storage (and only a fraction slower than built-in flash memory for program execution). Also great for DMA-based massive data capture.

Apart from setup (”psram-init”), there is no API: PSRAM simply looks like extra memory.
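
A quick check from the prompt, as a sketch:

psram-init
$12345678 $60000000 !  \ store a word in external PSRAM
$60000000 @ hex.       \ prints 12345678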

The second type of memory is NAND flash. It too needs very little code, but behaves differently: more like the TFT LCD access mode, in that you get two addresses in memory to talk to the chip: one to send commands to, the other to read/and write data. NAND flash is accessed in pages, and is very fast to read, but somewhat slower to write - very much like an SD card, in fact.

The API for this NAND flash memory is:

: nand-init ( -- u )  \ init NAND flash access, return chip info: $409500F1
: nand-erase ( page -- f )  \ erase a block of 256 flash pages
: nand-write ( page addr -- f )  \ write one 512-byte flash page
: nand-read ( page addr -- )  \ read one 512-byte flash page

As with built-in flash memory, pages have to be erased before they can be re-written.
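
A possible session, sketched from the API above - the buffer name is made up, and the flags are assumed to report success:

512 buffer: nandbuf
nand-init hex.          \ chip info, i.e. $409500F1
0 nand-erase .          \ erase the block holding pages 0..255
0 nandbuf nand-write .  \ write page 0 from our buffer
0 nandbuf nand-read     \ ... and read it back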

NOR flash hasn’t been tried yet. It’s different from NAND flash in that the entire memory also gets mapped into the µC’s address space, like SRAM, and offers fast random read & exec access. Writing and erasing requires special code, which works in pages - so NOR flash is like the middle ground between SRAM / PSRAM on the one hand, and NAND flash / SD cards on the other.

Olimexino-STM32

This board from Olimex has several nice features:

  • there’s an STM32F103RB on it, i.e. 64-pin chip with 128 KB flash and 20 KB RAM
  • it’s Arduino-like (it was modeled after the old “Maple” board from LeafLabs)
  • there is room for adding extra headers on the inside to support proper 0.1” spacing
  • it includes a LiPo connector and charger, and supports very low power sleep
  • it has a CAN bus driver and connector (CAN and USB are exclusive on these F103’s)
  • there’s a µSD card slot on the back

That last one was the reason to try this board. Here is a first version of some code to initialise an SD card (in bit-banged SPI mode), and read data off it. And this is a first test to mount a FAT16-formatted card, read its root directory, and access data in one of the files.

Here is a transcript of a quick test, with a 2 GB µSD card and some files:

  ok.
sdtry #0 1 #55 1 #41 1 #55 1 #41 0 
17 0 23 
17 0 14 
17 0 12 
20004C50   60 02 00 00 40 00 00 00   84 00 00 00 41 2E 00 5F   `...@... ....A.._
20004C60   00 2E 00 54 00 72 00 0F   00 7F 61 00 73 00 68 00   ...T.r.. ..a.s.h.
20004C70   65 00 73 00 00 00 00 00   FF FF FF FF 7E 31 20 20   e.s..... ....~1  
20004C80   20 20 20 20 54 52 41 22   00 C0 89 23 6E 48 6E 48       TRA" ...#nHnH
20004C90   00 00 89 23 6E 48 03 00   00 10 00 00 41 42 43 44   ...#nH.. ....ABCD

LFN: ._.Trashes. #1 64 
     ~1      .TRA at: 3 
     ABCDEFGH.TXT at: 14 
LFN: .Trashes. #1 64 
     TRASHE~1.    at: 2 
LFN: 00. #2 64 
LFN: .Spotlight-V1 #1 0 
     SPOTLI~1.    at: 4  ok.
14 x 20004C5C 
17 0 12 
20004C50   60 02 00 00 40 00 00 00   84 00 00 00 4D 6F 6E 20   `...@... ....Mon 
20004C60   4D 61 72 20 31 34 20 31   31 3A 31 32 3A 31 34 20   Mar 14 1 1:12:14 
20004C70   43 45 54 20 32 30 31 36   0A 00 00 00 00 00 00 00   CET 2016 ........
20004C80   00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00   ........ ........
20004C90   00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00   ........ ........
 ok.

The “Trashes” and “Spotlight” files are hidden stuff Mac OSX insists on putting on everything it touches. The only non-hidden file in there is “ABCDEFGH.TXT”.

The code figures out where the FAT and root directory blocks are on the µSD card, shows the first 64 bytes of the disk definition block, some filename entries it found (including shreds of “Long FileName” entries, intermixed with the rest), and then reads and dumps some bytes from cluster 14, which corresponds to that 29-byte “ABCDEFGH.TXT” file on the card.

Note that the dump is aligned on a 16-byte boundary, but the read buffer starts at 0x20004C5C.

WaveShare Port103Z

This is one of many development boards available from WaveShare:

It has the same 144-pin STM32F103ZE µC as the Yellow Blue Board above, with 512 KB flash and 64 KB RAM memory. This one was loaded up with Mecrisp Forth mostly because it brings out every single pin, so it’s very easy to try out all the hardware functions available on the STM32F103 “High Density” devices.

There’s a 32,768 Hz crystal on the back of the board, so we can try out the Real-Time Clock (RTC) functionality of this chip. Here is some code for this, and this is the API it exposes:

: +rtc ( -- )  \ restart internal RTC using attached 32,768 Hz crystal
: now! ( u -- )  \ set current time
: now ( -- u )  \ return current time in seconds

There’s no calendar functionality, the built-in hardware simply counts seconds. Since it’s a 32-bit counter, it could easily track “Unix time”, i.e. seconds since Jan 1st, 1970, 0:00 UTC.

If you attach a 3V coin cell between the µC’s “Vbat” and “GND” (and remove a jumper that ties it to Vcc), the internal clock will continue ticking when power to the rest of the µC is off. To regain access, just call “+rtc” after power-up. The counter will then become readable with “now” again.
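
Keeping it on Unix time is then as simple as this (the timestamp is just an example):

+rtc             \ start the RTC, or re-attach to it after power-up
1458000000 now!  \ set the clock, e.g. from a Unix timestamp
now .            \ seconds, counting on from the value just set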

So many boards…

These are just a few examples of the things you can do with Mecrisp Forth on a large range of ARM boards. They illustrate that the amount of Forth code required to access fairly complex hardware peripherals inside the µC is often surprisingly small. But note also that once such code has been written, the API exposed by those newly-defined “words” can be extremely simple.

The code areas for each of the above boards are all in the Embello repository on GitHub, and are called cbo, oxs, and wpz, respectively. Most of the common code can be found in the flib area.

The lack of USB support

Those pictures you’ve been seeing in recent articles, with over a dozen boards by now, all have the same configuration in common: boards with a USB port on them, connected and powered through anything but that USB port…

There is some value in this hookup - in fact, either this or an SWD-based setup using a Black Magic Probe or an ST-Link is required to be able to upload the core Mecrisp Forth 2.2.2 image onto the board (once). All the STM32F103 µC models support only the serial port in “ROM boot loader mode”, which is all you’ve got as starting point on a blank chip.

But after that, not having the USB port as “normal” serial device is a major inconvenience.

This highlights what is probably the main drawback of using Mecrisp Forth on microcontrollers such as the STM32 series: Forth cannot be combined with C or C++ - not easily, anyway.

On the one hand, this is no big deal: as the recent “Dive into Forth” article series has shown, it has been surprisingly easy to implement most of the features offered by C runtimes such as the Wiring/Arduino library. Digital I/O, ADCs, DACs, PWM, I2C, SPI were all created with little effort. Even advanced DMA and LCD & memory chip interfaces were added with relative ease.

But USB is a completely different creature: it’s the combination of a fairly complex protocol, with “enumeration”, “endpoints”, and a hefty specification guide, intended to support a huge range of hardware, with lots of different USB modes and speeds. On top of all that, there’s STM32F103’s own USB hardware implementation, which appears to be a fairly early version of what has been greatly enhanced and improved in later STM32 chip series. Things like a 512-byte buffer which has to be accessed as 2-byte words while their addresses start on 4-byte boundaries don’t make the task particularly straightforward. STM32F1’s USB hardware looks like one big kludge…

There is a lot of sample code in C to use as guideline, but it tends to consist of layer upon layer of definitions, headers, and “low-level” vs “high-level” API calls, spread out over dozens of files. It looks more like an example of how to bury the logic of an implementation than anything else!

To put this in perspective: USB is not essential for remote wireless sensor nodes, since they are going to be used un-tethered anyway, but for development it sure would be convenient to just plug that board in, for both power and serial communication. Especially since Mecrisp Forth is entirely driven and uploaded via a serial connection. With USB in Full Speed (FS) mode, i.e. 12 Mbit/sec, a serial connection could also be faster than the current 115,200 baud serial link.

Nevertheless, the plan here at JeeLabs, is to work on getting a USB device implementation in Forth working. This might take a while - the C-based open source examples out there are all too large to make this task simple. Fortunately, the latest beta version of the Saleae Logic Analyser supports USB-FS decoding - this is going to be a huge help during debugging. Another option is to use Linux on the host side: it supports extensive logging and the USB traffic can be examined with WireShark.

A similar situation will arise with Ethernet. There are several examples of a “TCP/IP stack” in Forth, but getting it to work on STM32F107’s and STM32F407’s will probably require some serious time investment and sleuthing…

The good news: these issues do not reduce Forth’s usability for the JET project - stay tuned…

JET and Forth, some thoughts

The JET project is about “creating an infrastructure for home monitoring and automation” (it’s actually considerably more, but this is a big-enough bone to chew on already…).

Note that JET is not about individual µCs or boards, it’s about managing an entire set of nodes, warts and all, and heterogeneous from the start. JET is about bringing together (and bridging) lots of technologies, and about interfacing to existing ones as well. It’s also about evolution and long-term use - a decade or more - because redoing everything all the time is wasteful.

The JET infrastructure is not necessarily centralised, although it will be in the first iterations: a “hub”, a variety of remote “nodes”, and browser access to use and administer it all. This is easy to map onto actual hardware, at least for a simple setup:

  • the hub can be a Raspberry Pi or compatible, i.e. a Linux board
  • the nodes will be JeeNodes, both AVR- and (in the future) ARM-based
  • most communication will be wireless (sub-GHz, WiFi, whatever)
  • most sensor nodes will be ultra low-power and battery-powered
  • then again, control nodes could also run off USB chargers or similar

Nothing new so far, this story has not changed much in the past years, other than exploring different µC options and trying out some self-powered Micro Power Snitch ideas.

The hub is a recent introduction: a portable application written in Go, in combination with the Mosquitto MQTT server. It has been running here at JeeLabs for a few months now, dutifully collecting home monitoring data, mostly from room nodes, the smart meter, and the solar inverter. The hub itself does very little, but it provides a way to add “Jet Packs” to run arbitrary processes which can tie into the system.

The browser side of JET has not changed: it’ll continue to be written in JavaScript (ES6) and will most likely use ReactJS and PureCSS as foundations. A lot of software development time will end up going there - this is not different from any other modern web-based development.

The nodes have all been running Arduino IDE based C/C++ code, most of this is available in JeeLib on GitHub, as far as ATmega- and ATtiny-based JeeNodes are concerned. Some newer experimental code for ARM has been presented in the past year on the weblog, some for LPC8xx µCs, but recently more for STM32 µCs. That code can be found in the Embello repository, see for example the RF Node Watcher.

But that’s where things are about to change - drastically!

A new beginning

From now on, new work on remote nodes will be done in Forth. Since Mecrisp Forth has proven itself to be very stable and flexible, it’ll be flashed onto every node - very much like a fancy boot loader. This is equivalent to making each node speak Forth on its serial port, once and for all.

This approach has been chosen, because Forth (in particular Mecrisp Forth):

  • … is an interactive language
  • … can compile code to flash on the fly
  • … can clear (parts of) its flash memory again
  • … can run “risky” code in RAM, with simply a reset to restore its previous state
  • … could set up a watchdog to force such a reset, even on remote nodes
  • … has very little overhead, the code is incredibly efficient
  • … provides access to every feature available in hardware
  • … can be configured to run arbitrary code after power-up or a reset
  • … will fit in chips with as little as 32 KB flash, and just a few KB RAM
  • … works on several different ARM families, not just the STM32 series chips

Do I have to learn Forth?

There are several possible answers to that question:

  • if you only care about working code, the answer is “no” (just install firmware images)
  • if you want to play with projects published on the weblog, the answer is “a little”
  • if you want to dive in and explore everything, or change the code, the answer is “yes”

To explain that second answer: for trying out things written in Forth and made available on GitHub, you don’t need to program in Forth, you can just enter commands like “8686 42 rf-init”, “somevar @ .”, and such - easy stuff (once properly documented and explained!).

Is everything going to be in Forth?

Nooooo! - Forth is still merely an implementation language for little µCs. The plan is to use it to implement a compact dataflow engine for remote nodes, which will then present a “gadgets and circuits” model, somewhat like NoFlo, Node-RED, and Pure Data. All data-driven.

Once such a basic dataflow engine exists, we will have a considerably more abstract conceptual framework to build with and on. There will be gadgets to tie into actual hardware (pins, digital & analog I/O, timers, but also the RF driver), and gadgets to perform generic tasks (periodic execution, filtering, arithmetic, range checking, conditional execution, etc). These can then be combined into larger circuits, and sent to a node as definition of the behaviour of that node.

The reason why Forth looks like a perfect fit for this task, is that it allows growing a node’s functionality in small steps, once the Mecrisp core has been flashed into its flash memory. There will need to be a first layer to tie into the RFM69 radio modules (the RF69 driver already exists), and a way to robustly add and remove additional Forth code over the air. After that, we’ll need a dataflow core. Then, tons of “gadgets”, either coded in Forth or combined from existing ones.

At the end of the day/road/tunnel, Forth will end up being simply a low-level implementation language for manually coding only the bottom layers, with everything else generated from a visual diagram editor in the browser. The long-term goal is not to expose, but to bury Forth!

Yes, it will take a lot of work to get there - JET was never meant to be built in a day…

USB on STM32F10x µCs

Every µC from the STM32F10x family has hardware built-in to support USB. The earlier (i.e. smaller) STM32F103’s have a more limited implementation than more recent models. There are some really strange design choices in it - they look like a rushed-to-market implementation:

  • there’s a 512-byte buffer where everything USB-related happens, but this memory must be accessed as 256 16-bit words on 32-bit address boundaries, i.e. there’s a gap every 2 bytes
  • some registers live on the APB1 bus like the rest, others live inside this “Packet Memory”
  • some bits in the registers can’t be set directly, only toggled - this means you have to read the current value, determine what to toggle (via XOR), and then send those toggle bits (while carefully keeping others in a do-nothing state)

Having seen some FPGA designs recently, it’s clear that this is caused by the two different clock domains of the µC on the one hand (normally 72 MHz) and the USB hardware on the other (48 MHz, but not necessarily synchronised). Improving this would probably have required more silicon and engineering time, which perhaps wasn’t available when the STM32F103 came out.

It all leads to very convoluted code!

USB is a fairly complex protocol. There is an excellent USB in a NutShell resource on the web which goes into all the details. A brief summary and some notes:

  • USB is driven by the host, the device side only needs to respond to requests
  • there are control messages and actual data, these use different “endpoints”, a bit like different signals in a multi-pin connector - control is always via endpoint zero
  • all low-level bit-stuffing, framing, and checksumming is handled by the µC’s hardware
  • a “Full Speed” USB link runs at 12 Mbps, with a “Start Of Frame” packet every 1 ms
  • the USB driver requires little CPU overhead - idling is fully handled by the hardware
  • two pins (D+/D-) encode the data differentially and signal special states, e.g. bus reset
  • there’s a complex “enumeration” phase on startup, for the host to learn what device it is

But the good news is, the current work-in-progress implementation is working (with a big thank you to Eckhart Köppen for his help!). Here’s an extract from the Linux kernel log:

[98361.944068] usb 2-1: new full-speed USB device number 6 using uhci_hcd
[98362.144095] usb 2-1: New USB device found, idVendor=0483, idProduct=5740
[98362.144106] usb 2-1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[98362.144112] usb 2-1: Product: Forth Serial Port
[98362.144118] usb 2-1: Manufacturer: Mecrisp (STM32F10x)
[98362.144124] usb 2-1: SerialNumber: C934CC37
[98362.149505] cdc_acm 2-1:1.0: ttyACM0: USB ACM device
[98443.520186] usb 2-1: USB disconnect, device number 6

And here’s what it looks like on Mac OSX:

(this is an earlier version, the serial number was not yet filled in)

As you can see, the STM32F103 presents itself as an “ACM” modem-like device (with device names chosen to identify this as a Mecrisp Forth context). The huge advantage here, is that such devices do not need any additional drivers - at least not on Linux and Mac OSX.

There is one quirk on Linux: if there’s a ModemManager running, it’ll start sending “AT” commands to the device the moment it is plugged in. This will confuse the Forth interpreter - you have to kill that modemmanager process first (who uses modems anyway, nowadays?).

The other good news is that serial I/O actually works - with the driver set up to send back everything it receives, you can open the serial port, type at it, and see everything echoed back.

In short: the USB driver is essentially already working!

But now comes the tricky bit: for this to become the main interface to Mecrisp Forth, we will need to re-route its input and output to use the USB device. There are four words defined in Forth to help with this: key?, key, emit?, and emit. There are also four “hooks” which allow pointing these at other words, to “re-vector” all keyboard input and console output.
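
Re-vectoring itself is just a matter of storing new execution tokens into those hooks - something along these lines, where the usb-* words stand for the new driver’s (hypothetical) equivalents:

' usb-key?  hook-key?  !
' usb-key   hook-key   !
' usb-emit? hook-emit? !
' usb-emit  hook-emit  !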

Right now, a first attempt to redefine these words is still failing: for some reason, Mecrisp Forth refuses to start listening to the USB device. This may be related to some circular “ring buffers” added to solve the impedance mismatch between packet-oriented USB streams and the character-oriented mode required by Forth. It’s not clear at the moment where the problem originates - clearly more debugging will be needed to figure this out.

But then what? We can load the new USB driver in flash memory, on top of Mecrisp, and we can set it up to switch its I/O permanently to this code. But what if someone types “eraseflash”? That would wipe out our setup, and revert the entire system to its original serial-port-only state.

Whoops, bye bye USB - we will effectively have lost control of the system!

Here is one possible way out (but see below) - it requires a change in Mecrisp Forth:

  • introduce a new type of variable, perhaps “otp-var” (for “One-Time Variable”)
  • it behaves like any other variable, but its initial value can be changed once in Forth
  • the hook-key?, hook-key, etc vectors are changed to such otp-vars
  • the Mecrisp build uses the current settings, i.e. serial-key?, serial-key, etc.
  • two more variables are introduced, let’s call them flash-start and flash-limit
  • flash-start marks the lowest erase point for Mecrisp’s flash erasure
  • similarly, flash-limit marks the top limit, beyond which it never erases
  • as shipped, these are set to $04000 and $10000, i.e. 16K and 64K, as they are now
  • and while we’re at it, let’s also add ram-limit, preset to $5000 (20K) on STM32F103’s

The above changes would not alter the way Mecrisp Forth operates, as shipped. But they will make it possible to alter the system (once!) to include more Forth code in Flash memory, and to make sure that code can never be lost again.

The trade-off is that once you make such a change, it becomes permanent (which is the whole point) - to change it again, the entire chip will have to be erased and re-flashed with a clean Mecrisp firmware image.

The flash-limit and ram-limit variables are not strictly necessary, but they allow altering the system to use more or less of the available flash and RAM. This neatly addresses the fact that there are variations of the STM32F103 (and others), which differ only in flash and RAM size.

It also allows reserving flash and/or RAM for other uses. And even do crazy things like putting the RAM limit very very high and allow Forth to allocate memory in external RAM chips attached via FSMC, for example.

On to the implementation of otp-var: this might not be very difficult. Variables defined in flash already contain a copy of their initial value. Of course, these variables end up being allocated in RAM (as part of Mecrisp’s startup sequence), so maybe all that needs to be done is to add one extra 32-bit field to these variables in flash, preset to $FFFFFFFF, with logic to check this extra value first: if it’s still $FFFFFFFF, we use the original initial value, but if it isn’t, we use this new value instead. And then some code in Forth (it need not be part of the Mecrisp core) can figure out how to re-write that special value - once only.

One could even consider changing all variables in this way, i.e. allowing every variable defined in flash to be changed once. So the Mecrisp core can continue to set them up as usual, while still allowing an application to change them, once. This may lead to more cases where the entire chip will have to be erased and re-flashed (over serial or via SWD), but this reconfigurability really isn’t intended for mainstream use - it just adds the option to create an enhanced core, such as a USB-enabled variant of Mecrisp. It would also allow creating a smaller Mecrisp core, FWIW.

Update - Matthias Koch, Mecrisp’s author, has suggested some alternatives, which will avoid having to ever reflash the core (although you may still need a serial hookup to get out of one of the possible failure modes). Consider the above shelved: keeping the core as is, is preferable!
