📅︎ 26.06.2023

✏︎ by

Evy

⏱︎ 50 kg⋅eV ±12σ

Designing a Virtual CPU ISA

I found myself talking a few times about an old project of mine over at the good ol’ RPLCS. Today i finally thought: Why not blog about it? Introducing: Finvil, a virtual CPU ISA, meant to be embedded into applications as a bytecode interpreter for scripting. Fasten your seat belts, for we are gonna talk about different approaches and considerations in ISA and bytecode design.

As mentioned, code written for Finvil is meant to run in an embedded bytecode interpreter. It should run sandboxed code, quite possibly from different authors, and be capable of safely modifying parts of the embedding environment. Yet it should feel like you are writing code for a real CPU ISA you could implement in Verilog and test on an FPGA.

And yes, »should« and »meant to«. Finvil is far from production ready and in its current design didn’t even make it to the numeric opcode assignment yet. I plan to overhaul its design a lot, and in this article i’d like to share with you some of the old design decisions i made. Why were they made, why do some of them work, why don’t some of them work, and what are the lessons learned for the next version of Finvil?

We’ll begin by looking at Finvil’s register file. It will give you an idea of how a Filvil interpreter would’ve been used, and what kind of data Finvil code would’ve been able to operate over and how.
Then we’ll take a look at Finvil’s instruction formats. Here we’ll see the results of all the compromises that had to be made in order for Finvil to fit its intended purpose: How many operands do we have, how are they encoded, what are the limitations?
Then i’ll show you a few example Finvil instructions to get the full picture of how the embedding environment, the register file, and all the instruction formats play together. There you’ll also see how embedding would’ve worked.
Finally i’ll share my thoughts on what things are going to change in the next Finvil design and why.

The Register File

One of the biggest design challenges was the register file. First of all, i kept track of all register sizes in my spreadsheet, summing up their bit and byte widths. Then i made sure never to overshoot the hard limit of 256 bytes, and ideally not the soft limit of 128 bytes.

(I did overshoot the soft limit.)

This limit exists due to a concern i had during the early design phases: Long-running loops. The embedding environment must never get stuck running a Finvil script. Worst case that meant a script had to be paused and continued later, which in turn meant saving and restoring all register state to/from somewhere. And that somewhere was Finvil emulator components in scriptable game entities. The larger the file, the more specialised registers i could add, but the heavier each scriptable entity becomes, so minimising the register file size was critical.

With that out of the way, here it is, the register file:

There are at most 8 indexable registers per group, but not all groups are filled-up with registers. Some registers also have multiple names, their group names and their accumulator names. It will all make sense later.
Register Group	Size (bits)	Amount & Mnemonics								Sum (bits)
Pointer Register	16	p0 a0.p	p1 a1.p	p2	p3	p4	p5	p6	p7 sp	128
Scalar Register	32	x0 a0.x	x1 a1.x	x2	x3	x4	x5	x6	x7	256
Predicate Register	1	b0 a0.b	b1 a1.b	b2	b3	b4	b5	b6	b7	8
Special Pointer	16	pc	fi							32
Segment Register	64	fs	gs	ls						192
Call Stack	32	c0	c1	c2	c3	c4	c5	c6	c7	256
Call Stack	32	c8	c9	c10	c11	c12	c13	c14	c15	256

The »scalar register« name may have raised some eye brows. Indeed, an even earlier design of Finvil was equipped with 8 vector registers v0 … v7, which were fixed 4-lanes-vectors of scalars. The vector registers were ultimately dropped for two reasons:

They contained too much state to recover, 128 more bytes for every scriptable entity. And that for something that isn’t going to be used most of the time.
They caused a doubling in opcodes. Whatever you could do with the scalars, you should be able to do with vectors. Add in instructions to transfer data between scalars and vector lanes, and suddenly all remaining free opcodes are assigned. As a consequence, many useful opcodes the current design offers didn’t exist.

The other registers we’ll talk about later. For now it suffices to know they exist and let your imagination run wild.

Now, all the above registers add up to a total of 1128 bits, or 141 bytes. Far below 256, just above 128. Scratching the second call stack row would get us below the soft limit.

The Opcode Formats

Here’s a question for fellow bytecode designers: How big are your instructions?

Are they fixed-sized? These are very quick and easy to decode, but you either end up with instructions that are too short for a bunch of useful things or simply too big. Short instructions can’t encode a lot of operand data you may need, such as far jump offsets, big immediate constants, or just having more indexable registers over all. On the other hand, if your instructions are wider, you can easily encode all the crazy powerful instructions you’ll ever want, though also at a cost. By far most instructions in normal general-purpose code are easily compressible. You want to compress as much code as you can, so that more of it fits into a single cache line. This matters just as much to a bytecode interpreter as it matters to a real CPU.

Are they dynamically-sized? Your instructions can be of any length you need. You can have them range from 1 byte to 16 bytes or even more. Need a load instruction for an immediate 128 bit vector? Slap another byte containing the load immediate opcode on it and you get a single 17 bytes instruction. Just need to sign-flip a single register? What more than a byte does that instruction ever need? Now, however, you’re faced with two challenges. ① How do you deal with jumps? What happens if you jump into the middle of an instruction, and how — if at all — do you make sure that doesn’t happen? This is a design challenge quite similar to that of UTF-8. ② How do you know how long an instruction is? Can you know it right away, e.g. for there being a counter in the first byte, or do you have to walk through the entire instruction to know when it eventually ends, if at all? In real hardware, if you cannot know the size of an instruction up front, then decoding instructions in parallel becomes difficult and expensive. Worst case, decoding instructions in a sequence could cause stalls if it takes many cycles to find the end. And in software, you may end up spending more time fetching all parts of an instruction than performing something as simple as register-register-addition.

As you will see, i went for fixed-sized instructions with Finvil. Note, however, that i went for that design because Finvil is meant to be run by a bytecode interpreter, running only small amounts of code for game entity scripting. When designing CPUs for e.g. desktop computers or phones, i’d instead go for a variable length encoding that self-synchronises. Being fixed-sized, it becomes rather easy to ensure ahead of script execution that all jumps land on valid instructions. In fact, jumps into the middle of an instruction are made impossible, but more on that later.

Here’s the table of Finvil instruction formats, mapping fields of operands and opcodes to their positions within 16 bit instructions:

Just like many real CPU ISAs with short fixed-sized instructions, Finvil has a lot of instruction formats. They are a consequence of utilising the available bits as effectively as possible. Instructions aren’t perfectly symmetrical. Some need more registers, some less, some need big immediate operands, some don’t. Wider instruction encodings need fewer instruction formats, for many bits can just be left unused, »reserved for later«, making it possible to merge more specialised formats into fewer more general ones.
Instruction Format
	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
TRR	Type			Source 1			Target			Opcode
XRR	Opcode			Source 1			Target			Opcode
RRR	Source 2			Source 1			Target			Opcode
URR	uint3			Source 1			Target			Opcode
SRR	sint3			Source 1			Target			Opcode
UR	uint6						Target			Opcode
SR	sint6						Target			Opcode
XR	Opcode						Target			Opcode
UA	uint8								A	Opcode
SA	sint8								A	Opcode
FA	Q3.4								A	Opcode
UX	uint8								Op	Opcode
C	uint9									Opcode
OB	sint9									1	1	1	1	b0-b7
OPB	sint6						p0-p7			1	1	1	0	b0-b7
X	Opcode									Opcode
XUR	Op	uint5					Target			Opcode
RRRA	Source 2			Source 1			Target		A	Opcode
URRA	uint3			Source 1			Target		A	Opcode

Here we see how many design decisions of Finvil came together, for their requirements interlock:

A fixed-width 16 bit instruction size was chosen as a compromise between many factors.
Most RISC-y register machines have a »3 operand format«, meaning they can fetch data from 2 source registers and write a single result to 1 target register. Finvil uses 3 operands as well, where possible, for this format reduces »register allocation pressure«†, resulting in faster code.
If Finvil had 16 addressable registers, we’d need 4 bits per register index. With 3 operands that’s 12 bits of register indices, leaving just 4 bits for opcodes. That’s by far not enough, so Finvil has just 8 addressable registers per group, taking up 9 bits to encode 3 operands, leaving 7 bits for opcodes. Much better.
With 7 opcode bits we can encode up to 128 instructions. Finvil has over 160 opcodes, however. As the X-formats show, this is achieved by embedding more opcode bits where we’d otherwise encode operands.
Conditional jumps are faced with a challenge: They need a lot of operand bits to encode a wide jump range, but they also need a number of bits to encode a jump condition. Finvil handles conditional instructions with predicate registers: Boolean registers that can only be used for logical operations, such as storing comparison results or being tested for truthness. At the expense of a lot of opcodes, the OB and OPB formats handle the encoding of conditional jumps with a trick. All 9 operand bits are used for the jump offset in OB, giving us a range of ±256 instructions. The predicate register index tested is thus stored in the only location that is left: Within our 7 opcode bits, meaning that a conditional jump is actually 8 different opcodes, one per tested register.
It’s handy to have 8 bit immediates we could write to our target registers, e.g. to incrementally load big constants. However, out of 9 operand bits that leaves us with just 1 spare bit for target register addressing. We can’t do the OB trick again, for it takes away too many opcodes. And so 1 bit it shall be. We call it the »accumulator« bit, for like old 8-bit accumulator CPUs, it can only address one of two target registers. You have seen these accumulator regitsers in the register group table, shown as aliases. a0.x, for example, is »accumulator 0« of the x group, an alias for x0.
Speaking of 8 bit loads, the FA format has a strange Q3.4 operand. This is a fixed-point number, encoded with 1 sign bit, 3 integer bits, and 4 fraction bits. This format is good enough to encode most common short floating-point numbers, such as 2.5 or 1.25 or 0.0625 (¹⁄₁₆). And being fixed-point, we don’t have to deal with subnormals or other IEEE 754 surprises. Of course you can still incrementally load bit patterns of regular 32 bit IEEE 754 binary floating-point numbers.

A detail you may have been wondering about is the Type field of a TRR instruction. It is capable of distinguishing between 8, 16, 32 bit signed or unsigned integers, or 16, 32 bit IEEE 754 binary floating-point numbers. These instructions are used — as you likely have guessed — to reëncode/convert data as needed, e.g. by integer sign extension.

The Instructions

As mentioned, Finvil in this design ended up having over 160 different instructions. This is too much for a mere blog article about designing an ISA, especially given that most instructions are the usual junk one would expect: Basic arithmetic, boolean logic, bit shifting, etc. Thus, this section will focus on a few examples. Some mundane, just to show that this is overall a pretty standard ISA, some special, given the unique design challenges and goals of Finvil.

The Opcode Field(s)

As a reminder, we have an opcode field of just 7 bits, giving us 128 major opcodes to work with. Real CPU ISAs are often much leaner with how they lay out their opcodes, but Finvil is made to be emulated in software. We want the gathering of opcode bits during emulation to be as cheap as reasonable. Its major opcodes are…

… allocated at a fixed location within an instruction, so that a software emulator can easily mask off the opcode bits.
… allocated in the bottom-most instruction bits, so that after masking no further shifting is required to use the gathered opcode as, say, an index.
… an integer power of 2, such that by merely masking off the opcode bits, we get an array index that cannot overflow.
… to be used as an index for an array of 128 function pointers or computed goto labels.

This allows for a quite fast emulator main loop implementation.

All minor opcodes, which largely aren’t exhaustively assigned to useful operations, can then be individually bounds-checked within a major opcode emulation branch/function.

Helpful Tables

A quite useful tool during the design of Finvil were two tables summing up and further processing statistics on what operations were available and which instruction formats they used.

Opcode Format Statistics

The first table is about just the opcode format statistics, i.e. about what opcodes were assigned to which instruction format, and how much each such opcode counts to the available major opcode space:

A format’s »weight« tells us how many instructions of that format add up to 1 major opcode. `XUR` has a weight of ½, telling us that it has a 1 bit minor opcode.
A format’s »amount« is the sum of all assigned instructions to that format, multiplied by the format’s weight. A non-integer number here tells us that there are unassigned minor opcodes.
The last column indicates with a ✗ if that format has any yet unassigned minor opcodes. While one shouldn’t fill up the remaining opcode space with useless junk, it is desirable to utilise all opcode bits as best as one can. As such, each ✗ indicates a potentially missed opportunity to provide something useful.
Opcode Statistics: Format
Format	Weight	Amount
TRR	1	12	✓
XRR	¹⁄₈	6.25	✗
RRR	1	43	✓
URR	1	1	✓
SRR	1	6	✓
UR	1	4	✓
SR	1	6	✓
XR	¹⁄₆₄	¹⁄₃₂	✗
UA	1	7	✓
SA	1	2	✓
FA	1	4	✓
UX	1	10	✓
C	1	2	✓
OB	8	8	✓
OPB	8	8	✓
X	¹⁄₅₁₂	¹⁄₂₅₆	✗
XUR	¹⁄₂	4	✓
RRRA	1	2	✓
URRA	1	2	✓
Sum:		127.285
% Used:		99.44 %

As the 99.44 % tells us, all major opcodes are fully occupied, though there are a lot of unassigned minor ones left.

Additionally, this table shows opportunities for format reduction. For example, the URR format is used only once. Would it make sense to remove this format, then? The instruction assigned to URR turns out to be s.isli. It is a logical shift of an integer constant by a register value, such as x0 ← lsl 1, x1. This instruction is a highly useful and quite dense way to create common integer constants like 512, 768, or 4096. There is, however, not really a reason the constant operand has to be unsigned, so s.isli could be changed to use the SSR format instead.

On the same note, let’s have a look at X. Only 2 instructions have that format: ret and exit. The only reason X exists is because these two instructions do not take any operands. If i assign these to XRR instead, that format’s »amount« improves to a factor of 6.5 from 6.25, and we can scratch an entire format full of 510 unassigned opcodes. Then we could think of a way to make useful use of the two register operands, or define to ignore them. The same story goes for the XR format as well. Yes, this old version of Finvil is an unfinished project, how do you know? ☺

This is pretty much exactly how my opcode assignment process worked back in the day, and how i’d continue to do it.

Opcode Category Statistics

I also assigned each opcode a category, depending on what physical execution unit in a real CPU they’d be scheduled on, and then just summed up all assigned opcodes per category.

The numbers for »Jump« are off, for the 16 `OB` and `OPB` instructions are counted as just 2.
Opcode Statistics: Category
Category	Count
Constant	10
Memory	20
Comparison	6
Jump	8
Copy	14
Arithmetic	63
Logic	44

Unsurprisingly, arithmetic and logic operations take the lion’s share.

The instructions categorised as »copy« are all program stack push/pop instructions, except for 2: b.cp.p and b.cp.s. In other assembly languages one knows these as »conditional move«. b.cp.s is used as e.g. if b0: x1 ← x5, which assigns the value of x5 to x1, if the predicate b0 is true. b.cp.p performs the same operation, except on pointer registers.

I mainly used this table as a sort of »sanity check« of the assigned opcodes. When too many instructions are about data and control flow logistics instead of performing operations, then either i fell asleep thinking of useful opcodes, or the encoding space turned out to be just too small to be useful. Luckily, neither was the case.

On Opcode Naming

A short side note before jumping into the rest of the opcode assignments. You’ve seen me use two opcode namings so far, e.g. s.isli and x0 ← lsl 1, x1. The first one is the proper opcode mnemonic, the second one is assembly code. In assembly code, we don’t need all the Hungarian notation of our opcode mnemonics, for all that extra information can be derived from the given operands. Then, the arrow ← operator for assignment was chosen to make learning this assembly language easier†.

(† Yes, you there with the subpar keyboard layout that never heard of Unicode can also use the Latin-1 — or shall i say ASCII — substitute <-. ☺)

Boring Instructions

So, what is it Finvil can actually do? As we’ve seen in the statistics, it can perform the usual ALU operations.

; It can add…
x0 ← add.s x0,  7 ; … integers!
x1 ← add.f x1, x2 ; … floats!

; It can do the usual bit logic tricks.
x7 ← 31
x7 ← lsl    1, x7 ; aka. 1 << 31, see, s.isli is useful. =)
x1 ← add.f x2, x3 ; do some float math
x1 ← xor   x1, x7 ; sign-flip with bit hacks!

; It can do counted loops using pointer/index registers.
p1 ← 8
.loop_8_times:
  ; do stuff …

  p1 ← dec  1
  b1 ← ne  p1, 0

  if b1: goto .loop_8_times

Wait, what shenanigans were going on at the end there? Nothing special, really. ne p1, 0 uses the opcode p.ne.i, which assigns the boolean equality test result to a predicate register. The if goto uses b.goto.i, which conditionally, based on a predicate register’s state, jumps to a fixed location. b.goto.i is the instruction family that uses the OB format, allowing it to jump back up to 256 instructions.

That’s it, Finvil can demonstratably perform all the basic operations you’d expect of any small standard off-the-shelf CPU ISA. Let’s have a look at some more elaborate instructions, then.

Counted Loops & Array Indexing

Finvil not only has special pointer registers, but also dedicated pointer instructions. Most of them exist to make writing safe software easier and also faster.

Let’s start with counted loops. Say we have this high-level code which we want to write in Finvil assembly:

let mut a = 0;
let mut b = 0;

for i in 2..=13 {
    a += i;
}

// Let’s count from 7 down to 0!
for j in (0..<8).rev() {
    b -= j;
}

Finvil has special instructions to express both loops with little code.

x0 ← 0 ; a
x1 ← 0 ; b

p1 ←  2
p2 ← 13
.loop_up:
  x3     ← cvt.p2u p1    ; Conversion to read the index
  x0     ← add.s x0, x3  ; a += i
  b1, p1 ← inc 1 to p2   ; (p1++) < p2
  if b1: goto .loop_up

p3 ← 8
p0 ← 0
.loop_down:
  b1, p3 ← dec 1 to p0  ; (p3--) > p0
  x3     ← cvt.p2u p3
  x1     ← sub.s x1, x3 ; b -= j
  if b1: goto .loop_down

The p.dec/p.inc instructions here merge two operations into one for very common code patterns. In real hardware this would save an entire CPU cycle, in software emulation this saves us instruction dispatching branch overhead. As the title of this section suggests, it doesn’t end with loops. Finvil is also equipped with an instruction for quick bounds checking.

This code here:

let start: *const T = …;
let end  : *const T = …;
let p    : *const T = …;

if (p <  start)
|| (p >= end  ) {
    goto 'nooo;
}

Becomes that snippet:

p0 ← … ; start
p1 ← … ; end
p2 ← … ; p

b1 ← p2 in p0 ..< p1 ; right-exclusive range
                     ; an inclusive variant exists
if b1: goto .nooo

This instruction, mnemonic p.ini for inclusive and p.inx for exclusive ranges, even merges 3 basic operations: Two comparisons and an or-combination of the results. Making bounds checks easy and fast is critical to making people opt in to safer software design.

A downside of p.dec, p.inc, p.ini, p.inx is that they use the URRA and RRRA formats, the only ones with 4 distinct operands. The compromise here is that the target predicate register is an accumulator, i.e. only b0 or b1 can be written to. You also can’t use p4-p7 as the tested index register.

Incremental Loads

This is an instruction family i mentioned a few times already but so far never explained. The time has come. Incremental loads are yet another case of instructions that merge highly common instruction sequences into one.

Imagine you want to load a 32-bit constant into a scalar register. Handily, the UA instruction format has a target register index as well as an 8 bit unsigned integer constant. The incremental load instruction allows us to build bigger constants 8 bits at a time:

x7 ← 0xAABBCCDD ; pretend for a moment that this assembles
                ; to one instruction

x0 ← ldi 0xAA
x0 ← ldi 0xBB
x0 ← ldi 0xCC
x0 ← ldi 0xDD

b0 ← eq x0, x7 ; true

An incremental load performs the operation of first shifting the current contents up by 8 bits, then or-combining the shifted contents with the 8 bit constant of the instruction. While Finvil itself only has at most 32 bit addressable registers, this technique scales to arbitrarily large immediate values encoded directly into the instruction stream. A beefier hardware implementation or a clever emulator implementation may even attempt to perform macro-op fusion on successive incremental loads.

Remainder Division

The assembly syntax is pretty straight forward:

x0, x1 ← divrem.u x5, x9

This performs the operation x0 = x5 / x9 and x1 = x5 % x9, for those of you familiar with a few programming languages. There is a twist, however. x0, x1 cannot be changed, these are hard-coded target registers. The two divrem instructions (signed and unsigned integers) use the XRR format, which only has two register operands. To not clobber the input registers unless you explicitly want to, two other target registers had to be pulled out of thin air. I could have used the RRRA format for them, but divrem is not a common enough operation to warrant taking up two full major opcodes. And given that division is already a very slow operation, the added performance impact of possibly having to copy the old x0, x1 contents before divrem isn’t worth mentioning.

In case you are wondering why division and remainder, both very expensive instructions, are merged into one: These are actually one operation in real hardware.† When performing an integer division, you automatically get the remainder as a byproduct.

† Unless you are RISC-V, which makes you compute both separately, meaning you pay the price for the exact same computation twice in a row, only to get the other half of the result. But it’s okay, the RISC-V specification suggests your CPU implementation performs macro-op fusion on successive div, rem instructions. In RISC-V’s philosophy, having combined opcodes is »bloat«.
The full story is that RISC-V defined a 48 bit instruction format for their ISA, which would be capable of handling 4 register operands or 32 bit immediates. An old version of the Bitmanip extension even experimented with other uses for 48 bit instructions. But so far attempts at assigning any opcodes to wider-than-32-bits formats were met with high amounts of friction. A prominent reason that i vaguely remember is that the RISC-V developers do not want to burden hardware implementations with also having to decode 48 bit formats instead of immediately rejecting them, for just a hand full of new instructions.
YMMV

Function Address Space

There are more individual instructions we could talk about, but let’s instead start diving into the feature set that makes running sandboxed code possible. Here’s a weird instruction:

; mnemonic p.fn
p0, p1 ← fi, pc

We just fetched two hidden registers into our pointer registers p0 and p1. As many of you may be familiar with, pc is the »program counter«. This is the register that points at the next instruction for the CPU to execute. Jump instructions work by changing the value of this register behind the scenes. The difference to your off-the-shelf pc is that pc is relative to the start of the current function, and it is bounds-checked against the end of the current function. So for every function in your code, pc=7 means a different thing. That’s where fi comes into play: fi is the »function index« register. Your embedded code has a table of environment-provided functions it imports and a number of functions it exports. fi is an index into this table. So you could say that the »real« program counter is the pair of fi, pc.

If that makes you wonder how a call instruction works: Indeed it assigns the value of a pointer register to fi, selecting the function to call, and clears pc to zero. There is also a »BIOS« call instruction call.x, which can call into one of up to 512 built-in functions to communicate with the embedding environment. call.x uses the C instruction format.

Data Address Spaces

Finvil separates a program into multiple separate address spaces. Each address space comes with its own set of instructions, resulting in impossible to work-around access restrictions. For example, there is no instruction to read the contents of any function address space, especially not one to write.

Other address spaces are GRS, GWS, GSS, LRS, LWS, LSS. Here’s an example of reading a signed 16 bit integer from GRS:

; mnemonic s.ld.grs
x0 ← GRS:s16[p1]

The pointer contained in p1 is used relative to GRS. So like pc, p1 ← 7 means a different thing for GRS, LWS, or whichever else. Here’s what the address space names mean:

G-S: Global segment, i.e. shared among all scripting contexts.
L-S: Local segment, i.e. uniquely instanced per scripted entity.
-RS: Read-only segment. This data can be read directly from the script blob by the emulator, allowing for zero-copy deserialisation.
-WS: Read and write segment for temporary data. Intended to live for as long as the scripted entity lives, but no longer.
-SS: Read and write segment that is to be persisted across multiple instantiations of the scripted entity. In the language of game programming, this is your save data for the scripted entity.

As you would expect, there are no instructions for writing to -RS segments. Environment interaction was designed to happen transactionally. You’d first call.x a BIOS function to load the data you need into your address space. Then you modify this data as you need and finish by performing a committing call.x. Potentially broken invariants of the requested data are caught by enforcing this serialisation indirection. Of course this comes at a performance cost.

Opcode Numbers?

I’ve got a virtual CPU ISA now, it can compute things. Now what does this assembly code compile to? As in: What actual binary data is being spat out by an assembler? Well, the answer to that is… uhh… none yet? Okay, let’s try an example.

.label:
  b* ← xor 1 ; flip b0
  if b0: goto .label

Here’s two simple instructions. The b.xor.i instruction uses the UX instruction format. The b.goto.i instruction on the other hand uses OB. Here’s a reminder of how these are encoded:

Learn the full table by heart by next week for the exam!
Instruction Format
	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
UX	uint8								Op	Opcode
OB	sint9									1	1	1	1	b0-b7

To encode b.xor.i, the uint8 field is our constant 1. b.goto.i encodes our target register b0, index 0, right in the major opcode field. The sint9 index is −1 to jump back one instruction. −1 in binary 2’s complement is »all ones«, i.e. 1 1111 1111₂. Our encoding table now looks like this:

Hmm, something’s missing…
Instruction Format
	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
b* ← xor 1	0	0	0	0	0	0	0	1	Op	Opcode
if b0: goto .label	1	1	1	1	1	1	1	1	1	1	1	1	1	0	0	0

Aaand… that’s it. The actual opcode bits are yet unassigned. Why is that?

To put it simply, i was still finalising which opcodes i want to have in the end. Do i need more conversion ops? Do i want more granular address space handling? Do i want more or fewer data types? For example, i forgot to add cvt.p2u last time i worked on Finvil. I only found out about this while writing pseudo code for this post.

Without a final list of opcodes, it makes no sense to assign numbers to them. And it doesn’t just end there. Ideally these numbers are assigned in a way that minimises the amount of branching a real hardware implementation would have to perform. The numbers should be assigned by first sorting all opcodes by instruction category, then by instruction format. Or maybe not, maybe first by format, then category. That heavily depends on the distribution of opcodes within these two sorting criteria. It has performance implications quite like sorting loops in an SQL query plan.

So this is why, unfortunately, you cannot build a Finvil compiler and emulator yet. But the next version will get there.

Future Finvil

This all has been a long and exhausting read and write-up, so i’d like to keep the outlook on Finvil’s possible future short, focussed on the main pain points.

Turing Completeness Be Gone

A lot of design problems were caused by having to reduce the register file as much as i reasonably can, in order to limit the size impact on each scriptable entity or a scripted game as a whole.

(Did i mention Finvil was designed for game scripting?)

In a future version of Finvil i’d like to experiment with more high-level control flow, like Wasm. Except limited to if … else and counted loops†. A language where these are your only diverging control flow structures is known in computer science as FOR, which is not turing complete. Add a while loop and you get the language WHILE, which now is turing complete.

(† And a few predicated operations like conditional calls or conditional copies, like Finvil in its current design has.)

In addition to only having counted loops, i’d like to add a stack of active loop counters, so that code can indeed not manipulate the loop condition from within the loop itself. This has three major benefits for the embedding environment:

The environment knows that any random program is guaranteed to eventually finish.
Just by looking at the loop counter stack, the environment can judge whether entering yet another loop is »too expensive« and can thus cheaply abort a script on loop entering.
The abort logic for too long-running code now only has to run when starting a new loop, not like before during every instruction fetch.

With these the environment can now guarantee that any random script will finish in a given time, sucessfully or not. As a result, there is no register file we have to store per entity anymore, only pointers to local segments. And thanks to that, i can make the register file as big as i want and need. For example, i can without worrying add a loop counter stack. ☺

Return of the Vectors

I’d like my vector registers back. They’re just too useful for shuffling data around between segments, or for performing fast linear algebra. Game code is full of linear algebra. However, if i were to bring back vector registers, i need to come up with a way to encode the additional vector instructions.

One possibility is to go for variable length instructions and use wider encodings for vector operations. This could be doable, for if i end up using more high-level control flow opcodes, and if i forbid pointer-indirect jumps in favour of precomputed jump tables in a compiled function, i no longer have to worry about the possibility of computing addresses that would land in the middle of a wide instruction.

Another possibility is to perform a »mode switch«, where you switch from one opcode decoder that understands scalar opcodes to a second opcode decoder, which decodes every opcode the same, except that scalar opcodes are replaced with vector opcodes. Like high-level control flow, a mode switch instruction could be a pair of instructions v.begin, v.end, and the existence of such a pair could be used by the environment to conditionally allocate vector register space for a function or not. This also goes hand in hand with the general observation that vector operations aren’t that common and usually occur in number crunching loops or during bulk data transfers.

I’m leaning more towards the second option, the mode switch. This mode switch can be especially valuable for JIT compilers, which then have an easier time figuring out whether to clobber or ignore the host CPU’s vector registers. And for a software emulator the implementation is as easy as changing the pointer to the table of computed goto labels or interpreter functions.

Rethinking Registers

For the longest time i was going back and forth on how i want to handle function arguments, return values, and calling conventions. The status quo on all popular CPU ISAs is quite frankly dumb. A lot of time when calling a function is spent pushing or popping registers explicitly that are by convention allocated as temporary data, function arguments, or function return values. There is also a strange asymmetry in that in all calling conventions i know of, there is less register space assigned to return values than to arguments. You can squeeze a lot of extra performance out of your normal code just by tweaking the number and sizes of your function parameters and returned data, and i do mean a lot.

Currently Finvil handles calling conventions by not bothering. Like all retro CPUs, it is on you to come up with a calling convention and then stick to it. At the end of the day, however, this just means that Finvil has all the same performance pitfalls of regular CPUs, where pushing or popping registers to please calling conventions can take quite some time.

I’d like to take lessons from less popular CPU ISAs, which handle all the calling convention magic in hardware. Two CPU ISAs come to mind:

Itanium handles calling conventions by all local registers being part of a big ring buffer. Each function allocates a slice of this buffer as its addressable register space. To pass in arguments and accept return values, you allow part of the called function’s register slice to overlap with your own. The big benefit of Itanium’s design is that every function only ever allocates exactly as many registers as it will actually need. This does, however, come at an increased cost of checking whether register slice overlappings are valid on every call or return.

Sun SPARC took a different approach. It is equipped with a stack of overlapping register windows of fixed sizes. Like Itanium, you pass arguments to a called function by writing into overlapping registers, and through these overlapping registers you can receive return values. SPARC’s register window stack is much easier to implement and its static register allocation requires no extra bounds checks during calls or returns. However, this does have the downside that complex functions may run out of spare registers, needing to spill data to the stack, or that small functions will leave a lot of registers unused. For high performance hardware this turned out undesirable, which is why Itanium went for dynamic allocation. However, for a software emulator this register window stack seems highly attractive. We have tons of RAM at our disposal, what do a few unused registers in the middle of a call stack matter?

I’m considering experimenting with SPARC’s register window stack. However, i’ll have to rethink the instruction encoding then, for 8 addressable registers will not be enough for this overlapping to be efficient.

Here’s a table demonstrating how this works for a 3 functions deep call graph:

For each row, `I` are a function’s »input« registers, aka. function parameters/arguments, `C` are parameters for a function you want to call, and `W` are your »work« registers, where you can operate on temporary data no other function will be able to see. The registers marked `…` are currently in use, but are not visible to the current function. When a called function returns, its return values are written into your `C` registers. When *your* function returns, you write your return values into your `I` registers. A function that doesn’t call another function (yet) can safely use its `C` registers for more temporary data. This is often the case for »number crunchers«.
Call Stack	Register Window Stack
Function 1	I	W	C
Function 2	…	…	I	W	C
Function 3	…	…	…	…	I	W	C

A possible way out for the encoding problem of wider register groups may come from Lua. This programming language operates on a bytecode interpreter that uses a random access stack for storing live data, often referred to as a register machine. The trick to learn from is that in Lua’s bytecode, stack index operands can have asymmetric sizes. You can address data farther down the stack as a source operand than you can address the target of your operation.

A next version of Finvil could have a new WRR instruction format where Source 1 and Target can only address the 8 work registers, but Source 2 can also address 4 input and 4 output registers, spanning the full window of then 16 partially overlapping registers. This would at least make operating on function arguments more efficient. It is still desirable to also be able to write results back into I/O registers, however, without requiring extra transfer opcodes.

Why 16?

24 is a much better number, isn’t it? Fun and banter aside, i do mean it. I quite often encounter situations where 8 bits of data are too few, but 16 bits are too much. You know what would’ve been close to perfection for me? 12 bits. And for bigger data i just as often encounter the same issue. 16 bits is way too little, but 32 bits is way too much. 24 bits would be lovely. However, few programming languages i know have built-in support for 12 and 24 bit integer types. Zig being one exception. The primary reason for this is that to a CPU, dividing by 3 is rather annoying without special hardware. Keeping everything sized in powers of two allows for quite predictable, quite clean, quite cheap data fetching logic.

But this is Finvil. Software emulators can deal with multiples of 3, you just have to… convince them hard enough. Just think of RGB-encoded images, that’s multiples of 3 inside your data stream. And a real hardware implementation can always add a special »3-divider« circuit. Division of small numbers by small constants can be implemented rather efficiently in hardware. And software? See for yourself. Compilers have tricks to make remainder divisions by constants not that expensive.

Sooo… why not experiment with a 24-bit instruction encoding, hmm?

Wherefore Art Thou, Segment, Whither Dost Thou Go

The observant among you who didn’t yet fall asleep may have noticed that there is a little… disagreement in what i have presented you so far. A symptom of Finvil still being a work in progress.

In the register file table, there were 3 special registers, which as you may have guessed, store segment spans: fs, gs, ls. In the section »Data Address Spaces«, however, i mentioned the 6 segments GRS, GWS, GSS, LRS, LWS, LSS. These seem to relate to the gs, ls registers, but what about fs, then?

Put simply, i haven’t yet finalised how many segments for what purpose i want or need. This involves writing lots of pseudo code (like the code examples so far) to figure out the best way to pass around big blocks of data of different access permission and data persistency levels.

What i really want to have is something Wasm has a not yet implemented draft for. Per permission/persistency level, i want there to be a list of segments the current function can access. Global segment? That’s just the first entry. Local segment? Second entry. Wanna transfer big blobs of data to a called function? Throw it a segment generated from a slice of your own segments. Need to read or write state of the embedding environment? Ask for them to be mapped in as extra segments.

(Search keyword for those of you curious what this Wasm feature is: »Multiple Memories«)

This would require a mildly wider instruction encoding, however.

External & Internal Functions

This last one is short, i promise. Currently, Finvil has no way of splitting up big functions into smaller worker functions other than by exporting these worker functions. Given that function indices merely range from 0 to 65535, this precious index space could quickly be eaten up by functions other scripts shouldn’t be able to see. The solution is simple: Add a second pair of call instructions. I just haven’t come around to free up opcodes for them yet.

Conclusion

Designing custom CPU ISAs or bytecode machines for fun or special purposes or both is super fun. I recommend giving it a try! You can take it as far as i do, trying to optimise every detail before finalising a version 1 design, or you just go with the flow and let things evolve. Or you do neither and just seek to better understand what went through the minds of other designers of CPU ISAs or bytecode machines. ~~Or thou art henceforth forever horrified by the mere pensée of ISA or bytecode design.~~

Whichever it is, i hope you could draw some inspiration or better understanding out of me ranting about an old project of mine. There’s lessons to be learned from everything everywhere. But ultimately i just want you to have fun.

Bye!

Terms Used

Bytecode: Another name for instructions for when we don’t speak of arbitrary CPU ISAs, but about software interpreters. Most bytecode languages are equipped with high-level programming language features, such as instructions for instantiating objects of classes or dynamically looking up their properties, indexing array objects, and more. The reason i describe Finvil as a »virtual ISA« is that it doesn’t have any of these high-level features most bytecode languages have. You could implement the entirety of Finvil in hardware. Your embedding environment is then a much more powerful management CPU next to however many Finvil cores.
Component
Entity: In an Entity Component System, a software architecture style popular with some game engines, an entity is basically whatever you can spawn into and operate over in a game world. An entity can have a bunch of different components, which are special-purpose sets of properties. In Finvil’s old design, the Finvil register file would’ve been a scriptable entity’s component. And then there are the systems, which define behaviour over entities with specific combinations of components. For example, a Finvil script system would run over all entities that have a Finvil register file, segment list, and script component, and then execute these scripts over the segments and registers.
Embedded Interpreter: In the world of CPUs, embedded means a different thing, so let’s be clear in case there is some confusion: By »embedded interpreter« i mean you have this interpreter included as a software library and run it within your own application. A stand-alone interpreter would be something like Python. (Yes, technically you can embed Python, but only the chaotic evil do this.)
FPGA: A field-programmable gate array. This is basically a big block of basic configurable logic blocks and a lot of configurable routing between them. The logic blocks you can configure to perform the most basic of computations: adding, xor-combining, storing data in flip-flops, and the likes. Then you configure which logic blocks take their data where from and whither they move their results. All in all, it’s a programmable CPU-like device that you can make behave as if it was whichever CPU you like. (Assuming the code of the CPU you like fits into the FPGA.) A CPU model programmed onto an FPGA is what i’d call »hardware emulation«. In case you wondered why i’m adding the »software« in »software emulation« all the time, this is why.
ISA
CPU ISA: Short for »Instruction Set Architecture«. This is a CPU’s specification of software-observable features. I.e. an ISA is about what instructions a CPU has, how they are encoded, what a CPU’s memory model is, and all the other things related to how you actually program this thing. What’s not part of an ISA are things like pin voltage levels, bus capacitance, the exact transistor layout of an integer adder unit, etc.
Macro-Op Fusion: Your CPU is capable of performing a quite complex operation in one step, a very useful complex operation at that. However, your current ISA design is incapable of expressing this operation with a single opcode, be it for historical or design philosophical reasons. For example, your ISA may need two instructions to compare, jump if equal, although your CPU’s branch unit may have a direct connection to your comparator units. Or your ISA may need two instructions for one operation, because while the CPU produces two results for a wide operation, your instruction formats have no way of expressing a secondary target register. For example mul_lo, mul_hi to compute the lower and higher half of a result twice as big as your registers separately. Or you wish to load a big constant from the instruction stream, but your instruction formats can only carry small constants, so you have to build your big constants out of many instructions, like with s.isli. In all these cases, instead of adding more or wider instructions that can handle them in a single step, you could do the inverse and have your CPU detect these sequences and merge them into a single internal magical instruction before execution. That is macro-op fusion.
Mnemonic: An easier to remember name for something. In assembly languages, the mnemonics are the text names for opcodes which to the CPU are really just numbers. The CPU may see 1337, which means nothing to you. Your assembly language says xor, which means nothing to the CPU. The xor is the mnemonic for you, and your assembler compiles that to 1337. And it doesn’t end with just opcodes. Even register names are mnemonics.
Opcode: Short for »operation code«. For every distinct operation a CPU ISA can perform, there is one assigned opcode. You can add two registers together? One add opcode. You can add a register and a constant together? One add constant opcode. In a RISC design. In a CISC machine, constant or register is a property defined in the operands, not by the opcode itself, so there you would only have one add whatever to whatever opcode. The combination of an opcode, maybe some operands, and maybe some targets for the results of the operation, make up what is called an instruction. The name opcode is also used to refer to the numeric value of an opcode when encoded in an instruction. Unfortunate possible source of confusion.
Predicate Register: Most of you who are familiar with CPU design are familiar with the concept of »status flags registers«. These are registers that after some operations accumulate short-lived statistical data on the results of these operations. Things like: »The result was 0«, or: »The result was negative«, or »The result had uneven parity«. Status flags can improve code density and are very cheap to implement in hardware. However, they create instruction interdependencies that make parallelising operation execution in a real CPU difficult. It’s also difficult with flags to carry their state around for a while and use them later, because many common instructions on the way override their state. Ultimately what you want to do is perform a conditional branch based on the status flags. Like branching away if the result of the last computation was zero. Predicate registers are special registers that can only hold boolean values. Whenever you branch conditionally, you read one of the predicate registers. To set a predicate register to the value you need, you perform a boolean operation on other predicates, or comparisons on other registers. Predicate registers have a lot of benefits, such as making it easy to carry around branch conditions for branches that are yet to come, or to use a set of predicate registers for masking vector operations. The downside is, however, that predicate instructions require a lot more opcodes than a status flags register.
Spreadsheet: The eternal source of nightmares among IT professionals, plaguing the world ever since 1969. They pretend to be relational databases, with a really shiny, easy to use and modify GUI, but they aren’t relational databases. O bored programmers of the world, hear ye my prayer: Please make a user-friendly GUI app that can easily make new tables, modify them as needed, have computed columns, have graph and chart widgets derived from table data, and all the other pretty stuff, except sitting on top of SQLite. K thx bai.
Verilog: A hardware description language. Not the best one, it’s quite old. More modern HDLs exist, but they all usually compile to Verilog. It’s the C of hardware design, for better or worse. And yes, for quite some time now, hardware is programmed. You write the high-level logic of how your circuit behaves in e.g. Verilog, and then use proprietary magical tooling to compile Verilog to so-called »netlists«, a lower-level representation of your code that only involves the most basic of logic gates and a few common elements like multiplexers or latches. Optimise the layout of your netlist and your circuit design is ready for fabrication.