Designing a Virtual CPU ISA
I found myself talking a few times about an old project of mine over at the good ol’ RPLCS. Today i finally thought: Why not blog about it? Introducing: Finvil, a virtual CPU ISA, meant to be embedded into applications as a bytecode interpreter for scripting. Fasten your seat belts, for we are gonna talk about different approaches and considerations in ISA and bytecode design.
On This Page:
As mentioned, code written for Finvil is meant to run in an embedded bytecode interpreter. It should run sandboxed code, quite possibly from different authors, and be capable of safely modifying parts of the embedding environment. Yet it should feel like you are writing code for a real CPU ISA you could implement in Verilog and test on an FPGA.
And yes, »should« and »meant to«. Finvil is far from production ready and in its current design didn’t even make it to the numeric opcode assignment yet. I plan to overhaul its design a lot, and in this article i’d like to share with you some of the old design decisions i made. Why were they made, why do some of them work, why don’t some of them work, and what are the lessons learned for the next version of Finvil?
- We’ll begin by looking at Finvil’s register file. It will give you an idea of how a Filvil interpreter would’ve been used, and what kind of data Finvil code would’ve been able to operate over and how.
- Then we’ll take a look at Finvil’s instruction formats. Here we’ll see the results of all the compromises that had to be made in order for Finvil to fit its intended purpose: How many operands do we have, how are they encoded, what are the limitations?
- Then i’ll show you a few example Finvil instructions to get the full picture of how the embedding environment, the register file, and all the instruction formats play together. There you’ll also see how embedding would’ve worked.
- Finally i’ll share my thoughts on what things are going to change in the next Finvil design and why.
The Register File
One of the biggest design challenges was the register file. First of all, i kept track of all register sizes in my spreadsheet, summing up their bit and byte widths. Then i made sure never to overshoot the hard limit of 256 bytes, and ideally not the soft limit of 128 bytes.
(I did overshoot the soft limit.)
This limit exists due to a concern i had during the early design phases: Long-running loops. The embedding environment must never get stuck running a Finvil script. Worst case that meant a script had to be paused and continued later, which in turn meant saving and restoring all register state to/from somewhere. And that somewhere was Finvil emulator components in scriptable game entities. The larger the file, the more specialised registers i could add, but the heavier each scriptable entity becomes, so minimising the register file size was critical.
With that out of the way, here it is, the register file:
Register Group | Size (bits) | Amount & Mnemonics | Sum (bits) | |||||||
---|---|---|---|---|---|---|---|---|---|---|
Pointer Register | 16 | p0 a0.p |
p1 a1.p |
p2 | p3 | p4 | p5 | p6 | p7 sp |
128 |
Scalar Register | 32 | x0 a0.x |
x1 a1.x |
x2 | x3 | x4 | x5 | x6 | x7 | 256 |
Predicate Register | 1 | b0 a0.b |
b1 a1.b |
b2 | b3 | b4 | b5 | b6 | b7 | 8 |
Special Pointer | 16 | pc | fi | 32 | ||||||
Segment Register | 64 | fs | gs | ls | 192 | |||||
Call Stack | 32 | c0 | c1 | c2 | c3 | c4 | c5 | c6 | c7 | 256 |
Call Stack | 32 | c8 | c9 | c10 | c11 | c12 | c13 | c14 | c15 | 256 |
The »scalar register« name may have raised some eye brows.
Indeed, an even earlier design of Finvil was equipped with
8 vector registers v0 … v7
, which were fixed 4-lanes-vectors
of scalars. The vector registers were ultimately dropped for
two reasons:
- They contained too much state to recover, 128 more bytes for every scriptable entity. And that for something that isn’t going to be used most of the time.
- They caused a doubling in opcodes. Whatever you could do with the scalars, you should be able to do with vectors. Add in instructions to transfer data between scalars and vector lanes, and suddenly all remaining free opcodes are assigned. As a consequence, many useful opcodes the current design offers didn’t exist.
The other registers we’ll talk about later. For now it suffices to know they exist and let your imagination run wild.
Now, all the above registers add up to a total of 1128 bits, or 141 bytes. Far below 256, just above 128. Scratching the second call stack row would get us below the soft limit.
The Opcode Formats
Here’s a question for fellow bytecode designers: How big are your instructions?
Are they fixed-sized? These are very quick and easy to decode, but you either end up with instructions that are too short for a bunch of useful things or simply too big. Short instructions can’t encode a lot of operand data you may need, such as far jump offsets, big immediate constants, or just having more indexable registers over all. On the other hand, if your instructions are wider, you can easily encode all the crazy powerful instructions you’ll ever want, though also at a cost. By far most instructions in normal general-purpose code are easily compressible. You want to compress as much code as you can, so that more of it fits into a single cache line. This matters just as much to a bytecode interpreter as it matters to a real CPU.
Are they dynamically-sized? Your instructions can be of any length you need. You can have them range from 1 byte to 16 bytes or even more. Need a load instruction for an immediate 128 bit vector? Slap another byte containing the load immediate opcode on it and you get a single 17 bytes instruction. Just need to sign-flip a single register? What more than a byte does that instruction ever need? Now, however, you’re faced with two challenges. ① How do you deal with jumps? What happens if you jump into the middle of an instruction, and how — if at all — do you make sure that doesn’t happen? This is a design challenge quite similar to that of UTF-8. ② How do you know how long an instruction is? Can you know it right away, e.g. for there being a counter in the first byte, or do you have to walk through the entire instruction to know when it eventually ends, if at all? In real hardware, if you cannot know the size of an instruction up front, then decoding instructions in parallel becomes difficult and expensive. Worst case, decoding instructions in a sequence could cause stalls if it takes many cycles to find the end. And in software, you may end up spending more time fetching all parts of an instruction than performing something as simple as register-register-addition.
As you will see, i went for fixed-sized instructions with Finvil. Note, however, that i went for that design because Finvil is meant to be run by a bytecode interpreter, running only small amounts of code for game entity scripting. When designing CPUs for e.g. desktop computers or phones, i’d instead go for a variable length encoding that self-synchronises. Being fixed-sized, it becomes rather easy to ensure ahead of script execution that all jumps land on valid instructions. In fact, jumps into the middle of an instruction are made impossible, but more on that later.
Here’s the table of Finvil instruction formats, mapping fields of operands and opcodes to their positions within 16 bit instructions:
Instruction Format | ||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | |
TRR | Type | Source 1 | Target | Opcode | ||||||||||||
XRR | Opcode | Source 1 | Target | Opcode | ||||||||||||
RRR | Source 2 | Source 1 | Target | Opcode | ||||||||||||
URR | uint3 | Source 1 | Target | Opcode | ||||||||||||
SRR | sint3 | Source 1 | Target | Opcode | ||||||||||||
UR | uint6 | Target | Opcode | |||||||||||||
SR | sint6 | Target | Opcode | |||||||||||||
XR | Opcode | Target | Opcode | |||||||||||||
UA | uint8 | A | Opcode | |||||||||||||
SA | sint8 | A | Opcode | |||||||||||||
FA | Q3.4 | A | Opcode | |||||||||||||
UX | uint8 | Op | Opcode | |||||||||||||
C | uint9 | Opcode | ||||||||||||||
OB | sint9 | 1 | 1 | 1 | 1 | b0-b7 | ||||||||||
OPB | sint6 | p0-p7 | 1 | 1 | 1 | 0 | b0-b7 | |||||||||
X | Opcode | Opcode | ||||||||||||||
XUR | Op | uint5 | Target | Opcode | ||||||||||||
RRRA | Source 2 | Source 1 | Target | A | Opcode | |||||||||||
URRA | uint3 | Source 1 | Target | A | Opcode |
Here we see how many design decisions of Finvil came together, for their requirements interlock:
- A fixed-width 16 bit instruction size was chosen as a compromise between many factors.
- Most RISC-y register machines have a »3 operand format«, meaning they can fetch data from 2 source registers and write a single result to 1 target register. Finvil uses 3 operands as well, where possible, for this format reduces »register allocation pressure«†, resulting in faster code.
- If Finvil had 16 addressable registers, we’d need 4 bits per register index. With 3 operands that’s 12 bits of register indices, leaving just 4 bits for opcodes. That’s by far not enough, so Finvil has just 8 addressable registers per group, taking up 9 bits to encode 3 operands, leaving 7 bits for opcodes. Much better.
- With 7 opcode bits we can encode up to 128 instructions.
Finvil has over 160 opcodes, however. As the
X
-formats show, this is achieved by embedding more opcode bits where we’d otherwise encode operands. - Conditional jumps are faced with a challenge: They need a
lot of operand bits to encode a wide jump range, but they
also need a number of bits to encode a jump condition.
Finvil handles conditional instructions with predicate
registers: Boolean registers that can only be used for
logical operations, such as storing comparison results or
being tested for truthness. At the expense of a lot of opcodes,
the
OB
andOPB
formats handle the encoding of conditional jumps with a trick. All 9 operand bits are used for the jump offset inOB
, giving us a range of ±256 instructions. The predicate register index tested is thus stored in the only location that is left: Within our 7 opcode bits, meaning that a conditional jump is actually 8 different opcodes, one per tested register. - It’s handy to have 8 bit immediates we could write
to our target registers, e.g. to incrementally load big
constants. However, out of 9 operand bits that leaves us
with just 1 spare bit for target register addressing. We
can’t do the
OB
trick again, for it takes away too many opcodes. And so 1 bit it shall be. We call it the »accumulator« bit, for like old 8-bit accumulator CPUs, it can only address one of two target registers. You have seen these accumulator regitsers in the register group table, shown as aliases.a0.x
, for example, is »accumulator 0« of thex
group, an alias forx0
. - Speaking of 8 bit loads, the
FA
format has a strangeQ3.4
operand. This is a fixed-point number, encoded with 1 sign bit, 3 integer bits, and 4 fraction bits. This format is good enough to encode most common short floating-point numbers, such as2.5
or1.25
or0.0625
(¹⁄₁₆). And being fixed-point, we don’t have to deal with subnormals or other IEEE 754 surprises. Of course you can still incrementally load bit patterns of regular 32 bit IEEE 754 binary floating-point numbers.
A detail you may have been wondering about is the Type
field of a TRR
instruction. It is capable of distinguishing
between 8, 16, 32
bit signed or unsigned integers, or
16, 32
bit IEEE 754 binary floating-point numbers.
These instructions are used — as you likely have guessed —
to reëncode/convert data as needed, e.g. by integer sign
extension.
The Instructions
As mentioned, Finvil in this design ended up having over 160 different instructions. This is too much for a mere blog article about designing an ISA, especially given that most instructions are the usual junk one would expect: Basic arithmetic, boolean logic, bit shifting, etc. Thus, this section will focus on a few examples. Some mundane, just to show that this is overall a pretty standard ISA, some special, given the unique design challenges and goals of Finvil.
The Opcode Field(s)
As a reminder, we have an opcode field of just 7 bits, giving us 128 major opcodes to work with. Real CPU ISAs are often much leaner with how they lay out their opcodes, but Finvil is made to be emulated in software. We want the gathering of opcode bits during emulation to be as cheap as reasonable. Its major opcodes are…
- … allocated at a fixed location within an instruction, so that a software emulator can easily mask off the opcode bits.
- … allocated in the bottom-most instruction bits, so that after masking no further shifting is required to use the gathered opcode as, say, an index.
- … an integer power of 2, such that by merely masking off the opcode bits, we get an array index that cannot overflow.
- … to be used as an index for an array of 128 function pointers
or computed
goto
labels.
This allows for a quite fast emulator main loop implementation.
All minor opcodes, which largely aren’t exhaustively assigned to useful operations, can then be individually bounds-checked within a major opcode emulation branch/function.
Helpful Tables
A quite useful tool during the design of Finvil were two tables summing up and further processing statistics on what operations were available and which instruction formats they used.
Opcode Format Statistics
The first table is about just the opcode format statistics, i.e. about what opcodes were assigned to which instruction format, and how much each such opcode counts to the available major opcode space:
Opcode Statistics: Format | |||
---|---|---|---|
Format | Weight | Amount | |
TRR | 1 | 12 | ✓ |
XRR | ¹⁄₈ | 6.25 | ✗ |
RRR | 1 | 43 | ✓ |
URR | 1 | 1 | ✓ |
SRR | 1 | 6 | ✓ |
UR | 1 | 4 | ✓ |
SR | 1 | 6 | ✓ |
XR | ¹⁄₆₄ | ¹⁄₃₂ | ✗ |
UA | 1 | 7 | ✓ |
SA | 1 | 2 | ✓ |
FA | 1 | 4 | ✓ |
UX | 1 | 10 | ✓ |
C | 1 | 2 | ✓ |
OB | 8 | 8 | ✓ |
OPB | 8 | 8 | ✓ |
X | ¹⁄₅₁₂ | ¹⁄₂₅₆ | ✗ |
XUR | ¹⁄₂ | 4 | ✓ |
RRRA | 1 | 2 | ✓ |
URRA | 1 | 2 | ✓ |
Sum: | 127.285 | ||
% Used: | 99.44 % |
As the 99.44 % tells us, all major opcodes are fully occupied, though there are a lot of unassigned minor ones left.
Additionally, this table shows opportunities for format
reduction. For example, the URR
format is used only once.
Would it make sense to remove this format, then? The instruction
assigned to URR
turns out to be s.isli
. It is a logical
shift of an integer constant by a register value, such as
x0 ← lsl 1, x1
. This instruction is a highly useful and quite
dense way to create common integer constants like 512, 768, or 4096.
There is, however, not really a reason the constant operand has to
be unsigned, so s.isli
could be changed to use the SSR
format
instead.
On the same note, let’s have a look at X
. Only 2 instructions
have that format: ret
and exit
. The only reason X
exists
is because these two instructions do not take any operands.
If i assign these to XRR
instead, that format’s »amount«
improves to a factor of 6.5
from 6.25
, and we can scratch
an entire format full of 510 unassigned opcodes. Then we could
think of a way to make useful use of the two register operands,
or define to ignore them. The same story goes for the XR
format
as well. Yes, this old version of Finvil is an unfinished project,
how do you know? ☺
This is pretty much exactly how my opcode assignment process worked back in the day, and how i’d continue to do it.
Opcode Category Statistics
I also assigned each opcode a category, depending on what physical execution unit in a real CPU they’d be scheduled on, and then just summed up all assigned opcodes per category.
Opcode Statistics: Category | |
---|---|
Category | Count |
Constant | 10 |
Memory | 20 |
Comparison | 6 |
Jump | 8 |
Copy | 14 |
Arithmetic | 63 |
Logic | 44 |
Unsurprisingly, arithmetic and logic operations take the lion’s share.
The instructions categorised as »copy« are all program stack
push/pop instructions, except for 2: b.cp.p
and b.cp.s
.
In other assembly languages one knows these as »conditional move«.
b.cp.s
is used as e.g. if b0: x1 ← x5
, which assigns the
value of x5
to x1
, if the predicate b0
is true
.
b.cp.p
performs the same operation, except on pointer registers.
I mainly used this table as a sort of »sanity check« of the assigned opcodes. When too many instructions are about data and control flow logistics instead of performing operations, then either i fell asleep thinking of useful opcodes, or the encoding space turned out to be just too small to be useful. Luckily, neither was the case.
On Opcode Naming
A short side note before jumping into the rest of the opcode
assignments. You’ve seen me use two opcode namings so far,
e.g. s.isli
and x0 ← lsl 1, x1
. The first one is the
proper opcode mnemonic, the second one is assembly code. In
assembly code, we don’t need all the Hungarian notation
of our opcode mnemonics, for all that extra information can
be derived from the given operands. Then, the arrow ←
operator
for assignment was chosen to make learning this assembly language
easier†.
(† Yes, you there with the subpar keyboard layout that
never heard of Unicode can also use the Latin-1
— or shall i say ASCII — substitute <-
. ☺)
Boring Instructions
So, what is it Finvil can actually do? As we’ve seen in the statistics, it can perform the usual ALU operations.
; It can add…
x0 ← add.s x0, 7 ; … integers!
x1 ← add.f x1, x2 ; … floats!
; It can do the usual bit logic tricks.
x7 ← 31
x7 ← lsl 1, x7 ; aka. 1 << 31, see, s.isli is useful. =)
x1 ← add.f x2, x3 ; do some float math
x1 ← xor x1, x7 ; sign-flip with bit hacks!
; It can do counted loops using pointer/index registers.
p1 ← 8
.loop_8_times:
; do stuff …
p1 ← dec 1
b1 ← ne p1, 0
if b1: goto .loop_8_times
Wait, what shenanigans were going on at the end there? Nothing
special, really. ne p1, 0
uses the opcode p.ne.i
, which
assigns the boolean equality test result to a predicate register.
The if goto
uses b.goto.i
, which conditionally, based on a
predicate register’s state, jumps to a fixed location. b.goto.i
is the instruction family that uses the OB
format, allowing
it to jump back up to 256 instructions.
That’s it, Finvil can demonstratably perform all the basic operations you’d expect of any small standard off-the-shelf CPU ISA. Let’s have a look at some more elaborate instructions, then.
Counted Loops & Array Indexing
Finvil not only has special pointer registers, but also dedicated pointer instructions. Most of them exist to make writing safe software easier and also faster.
Let’s start with counted loops. Say we have this high-level code which we want to write in Finvil assembly:
let mut a = 0;
let mut b = 0;
for i in 2..=13
// Let’s count from 7 down to 0!
for j in .rev
Finvil has special instructions to express both loops with little code.
x0 ← 0 ; a
x1 ← 0 ; b
p1 ← 2
p2 ← 13
.loop_up:
x3 ← cvt.p2u p1 ; Conversion to read the index
x0 ← add.s x0, x3 ; a += i
b1, p1 ← inc 1 to p2 ; (p1++) < p2
if b1: goto .loop_up
p3 ← 8
p0 ← 0
.loop_down:
b1, p3 ← dec 1 to p0 ; (p3--) > p0
x3 ← cvt.p2u p3
x1 ← sub.s x1, x3 ; b -= j
if b1: goto .loop_down
The p.dec/p.inc
instructions here merge two operations into
one for very common code patterns. In real hardware this
would save an entire CPU cycle, in software emulation this
saves us instruction dispatching branch overhead. As the
title of this section suggests, it doesn’t end with loops.
Finvil is also equipped with an instruction for quick bounds
checking.
This code here:
let start: *const T = …;
let end : *const T = …;
let p : *const T = …;
if
||
Becomes that snippet:
p0 ← … ; start
p1 ← … ; end
p2 ← … ; p
b1 ← p2 in p0 ..< p1 ; right-exclusive range
; an inclusive variant exists
if b1: goto .nooo
This instruction, mnemonic p.ini
for inclusive and p.inx
for exclusive ranges, even merges 3 basic operations: Two
comparisons and an or
-combination of the results. Making
bounds checks easy and fast is critical to making people
opt in to safer software design.
A downside of p.dec, p.inc, p.ini, p.inx
is that they use
the URRA
and RRRA
formats, the only ones with 4 distinct
operands. The compromise here is that the target predicate
register is an accumulator, i.e. only b0
or b1
can be
written to. You also can’t use p4-p7
as the tested index
register.
Incremental Loads
This is an instruction family i mentioned a few times already but so far never explained. The time has come. Incremental loads are yet another case of instructions that merge highly common instruction sequences into one.
Imagine you want to load a 32-bit constant into a scalar
register. Handily, the UA
instruction format has a target
register index as well as an 8 bit unsigned integer
constant. The incremental load instruction allows us to build
bigger constants 8 bits at a time:
x7 ← 0xAABBCCDD ; pretend for a moment that this assembles
; to one instruction
x0 ← ldi 0xAA
x0 ← ldi 0xBB
x0 ← ldi 0xCC
x0 ← ldi 0xDD
b0 ← eq x0, x7 ; true
An incremental load performs the operation of first shifting
the current contents up by 8 bits, then or
-combining the
shifted contents with the 8 bit constant of the
instruction. While Finvil itself only has at most 32 bit
addressable registers, this technique scales to arbitrarily
large immediate values encoded directly into the instruction
stream. A beefier hardware implementation or a clever emulator
implementation may even attempt to perform macro-op fusion on
successive incremental loads.
Remainder Division
The assembly syntax is pretty straight forward:
x0, x1 ← divrem.u x5, x9
This performs the operation x0 = x5 / x9
and x1 = x5 % x9
,
for those of you familiar with a few programming languages.
There is a twist, however. x0, x1
cannot be changed, these
are hard-coded target registers. The two divrem
instructions
(signed and unsigned integers) use the XRR
format, which only has two register operands. To not clobber
the input registers unless you explicitly want to, two other
target registers had to be pulled out of thin air. I could
have used the RRRA
format for them, but divrem
is not a
common enough operation to warrant taking up two full major
opcodes. And given that division is already a very slow
operation, the added performance impact of possibly having
to copy the old x0, x1
contents before divrem
isn’t worth
mentioning.
Function Address Space
There are more individual instructions we could talk about, but let’s instead start diving into the feature set that makes running sandboxed code possible. Here’s a weird instruction:
; mnemonic p.fn
p0, p1 ← fi, pc
We just fetched two hidden registers into our pointer registers
p0
and p1
. As many of you may be familiar with, pc
is the
»program counter«. This is the register that points at the next
instruction for the CPU to execute. Jump instructions work by
changing the value of this register behind the scenes. The
difference to your off-the-shelf pc
is that pc
is relative
to the start of the current function, and it is bounds-checked
against the end of the current function. So for every function
in your code, pc=7
means a different thing. That’s where fi
comes into play: fi
is the »function index« register. Your
embedded code has a table of environment-provided functions it
imports and a number of functions it exports. fi
is an index
into this table. So you could say that the »real« program
counter is the pair of fi, pc
.
If that makes you wonder how a call
instruction works:
Indeed it assigns the value of a pointer register to fi
,
selecting the function to call, and clears pc
to zero.
There is also a »BIOS« call instruction call.x
, which can
call into one of up to 512 built-in functions to communicate
with the embedding environment. call.x
uses the C
instruction format.
Data Address Spaces
Finvil separates a program into multiple separate address spaces. Each address space comes with its own set of instructions, resulting in impossible to work-around access restrictions. For example, there is no instruction to read the contents of any function address space, especially not one to write.
Other address spaces are GRS, GWS, GSS, LRS, LWS, LSS
. Here’s
an example of reading a signed 16 bit integer from GRS
:
; mnemonic s.ld.grs
x0 ← GRS:s16[p1]
The pointer contained in p1
is used relative to GRS
.
So like pc
, p1 ← 7
means a different thing for GRS
,
LWS
, or whichever else. Here’s what the address space names
mean:
G-S
: Global segment, i.e. shared among all scripting contexts.L-S
: Local segment, i.e. uniquely instanced per scripted entity.-RS
: Read-only segment. This data can be read directly from the script blob by the emulator, allowing for zero-copy deserialisation.-WS
: Read and write segment for temporary data. Intended to live for as long as the scripted entity lives, but no longer.-SS
: Read and write segment that is to be persisted across multiple instantiations of the scripted entity. In the language of game programming, this is your save data for the scripted entity.
As you would expect, there are no instructions for writing to
-RS
segments. Environment interaction was designed to happen
transactionally. You’d first call.x
a BIOS function to load
the data you need into your address space. Then you modify this
data as you need and finish by performing a committing call.x
.
Potentially broken invariants of the requested data are caught
by enforcing this serialisation indirection. Of course this comes
at a performance cost.
Opcode Numbers?
I’ve got a virtual CPU ISA now, it can compute things. Now what does this assembly code compile to? As in: What actual binary data is being spat out by an assembler? Well, the answer to that is… uhh… none yet? Okay, let’s try an example.
.label:
b* ← xor 1 ; flip b0
if b0: goto .label
Here’s two simple instructions. The b.xor.i
instruction uses
the UX
instruction format. The b.goto.i
instruction on
the other hand uses OB
. Here’s a reminder of how these are
encoded:
Instruction Format | ||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | |
UX | uint8 | Op | Opcode | |||||||||||||
OB | sint9 | 1 | 1 | 1 | 1 | b0-b7 |
To encode b.xor.i
, the uint8
field is our constant 1
.
b.goto.i
encodes our target register b0
, index 0
, right
in the major opcode field. The sint9
index is −1
to jump
back one instruction. −1
in binary 2’s complement is »all ones«,
i.e. 1 1111 1111₂
. Our encoding table now looks like this:
Instruction Format | ||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | |
b* ← xor 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | Op | Opcode | ||||||
if b0: goto .label | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
Aaand… that’s it. The actual opcode bits are yet unassigned. Why is that?
To put it simply, i was still finalising which opcodes i want
to have in the end. Do i need more conversion ops? Do i want
more granular address space handling? Do i want more or fewer
data types? For example, i forgot to add cvt.p2u
last time
i worked on Finvil. I only found out about this while writing
pseudo code for this post.
Without a final list of opcodes, it makes no sense to assign numbers to them. And it doesn’t just end there. Ideally these numbers are assigned in a way that minimises the amount of branching a real hardware implementation would have to perform. The numbers should be assigned by first sorting all opcodes by instruction category, then by instruction format. Or maybe not, maybe first by format, then category. That heavily depends on the distribution of opcodes within these two sorting criteria. It has performance implications quite like sorting loops in an SQL query plan.
So this is why, unfortunately, you cannot build a Finvil compiler and emulator yet. But the next version will get there.
Future Finvil
This all has been a long and exhausting read and write-up, so i’d like to keep the outlook on Finvil’s possible future short, focussed on the main pain points.
Turing Completeness Be Gone
A lot of design problems were caused by having to reduce the register file as much as i reasonably can, in order to limit the size impact on each scriptable entity or a scripted game as a whole.
(Did i mention Finvil was designed for game scripting?)
In a future version of Finvil i’d like to experiment with more
high-level control flow, like Wasm. Except limited to if … else
and counted loop
s†. A language where these are your only diverging
control flow structures is known in computer science as FOR
,
which is not turing complete. Add a while
loop and you get
the language WHILE
, which now is turing complete.
(† And a few predicated operations like conditional calls or conditional copies, like Finvil in its current design has.)
In addition to only having counted loop
s, i’d like to add a
stack of active loop counters, so that code can indeed not
manipulate the loop condition from within the loop itself.
This has three major benefits for the embedding environment:
- The environment knows that any random program is guaranteed to eventually finish.
- Just by looking at the loop counter stack, the environment
can judge whether entering yet another loop is »too expensive«
and can thus cheaply abort a script on
loop
entering. - The abort logic for too long-running code now only has to
run when starting a new
loop
, not like before during every instruction fetch.
With these the environment can now guarantee that any random script will finish in a given time, sucessfully or not. As a result, there is no register file we have to store per entity anymore, only pointers to local segments. And thanks to that, i can make the register file as big as i want and need. For example, i can without worrying add a loop counter stack. ☺
Return of the Vectors
I’d like my vector registers back. They’re just too useful for shuffling data around between segments, or for performing fast linear algebra. Game code is full of linear algebra. However, if i were to bring back vector registers, i need to come up with a way to encode the additional vector instructions.
One possibility is to go for variable length instructions and use wider encodings for vector operations. This could be doable, for if i end up using more high-level control flow opcodes, and if i forbid pointer-indirect jumps in favour of precomputed jump tables in a compiled function, i no longer have to worry about the possibility of computing addresses that would land in the middle of a wide instruction.
Another possibility is to perform a »mode switch«, where you
switch from one opcode decoder that understands scalar opcodes
to a second opcode decoder, which decodes every opcode the same,
except that scalar opcodes are replaced with vector opcodes.
Like high-level control flow, a mode switch instruction could be
a pair of instructions v.begin, v.end
, and the existence of
such a pair could be used by the environment to conditionally
allocate vector register space for a function or not. This also
goes hand in hand with the general observation that vector
operations aren’t that common and usually occur in number
crunching loops or during bulk data transfers.
I’m leaning more towards the second option, the mode switch.
This mode switch can be especially valuable for JIT compilers,
which then have an easier time figuring out whether to clobber
or ignore the host CPU’s vector registers. And for a software
emulator the implementation is as easy as changing the pointer
to the table of computed goto
labels or interpreter functions.
Rethinking Registers
For the longest time i was going back and forth on how i want to handle function arguments, return values, and calling conventions. The status quo on all popular CPU ISAs is quite frankly dumb. A lot of time when calling a function is spent pushing or popping registers explicitly that are by convention allocated as temporary data, function arguments, or function return values. There is also a strange asymmetry in that in all calling conventions i know of, there is less register space assigned to return values than to arguments. You can squeeze a lot of extra performance out of your normal code just by tweaking the number and sizes of your function parameters and returned data, and i do mean a lot.
Currently Finvil handles calling conventions by not bothering. Like all retro CPUs, it is on you to come up with a calling convention and then stick to it. At the end of the day, however, this just means that Finvil has all the same performance pitfalls of regular CPUs, where pushing or popping registers to please calling conventions can take quite some time.
I’d like to take lessons from less popular CPU ISAs, which handle all the calling convention magic in hardware. Two CPU ISAs come to mind:
Itanium handles calling conventions by all local registers being part of a big ring buffer. Each function allocates a slice of this buffer as its addressable register space. To pass in arguments and accept return values, you allow part of the called function’s register slice to overlap with your own. The big benefit of Itanium’s design is that every function only ever allocates exactly as many registers as it will actually need. This does, however, come at an increased cost of checking whether register slice overlappings are valid on every call or return.
Sun SPARC took a different approach. It is equipped with a stack of overlapping register windows of fixed sizes. Like Itanium, you pass arguments to a called function by writing into overlapping registers, and through these overlapping registers you can receive return values. SPARC’s register window stack is much easier to implement and its static register allocation requires no extra bounds checks during calls or returns. However, this does have the downside that complex functions may run out of spare registers, needing to spill data to the stack, or that small functions will leave a lot of registers unused. For high performance hardware this turned out undesirable, which is why Itanium went for dynamic allocation. However, for a software emulator this register window stack seems highly attractive. We have tons of RAM at our disposal, what do a few unused registers in the middle of a call stack matter?
I’m considering experimenting with SPARC’s register window stack. However, i’ll have to rethink the instruction encoding then, for 8 addressable registers will not be enough for this overlapping to be efficient.
Here’s a table demonstrating how this works for a 3 functions deep call graph:
Call Stack | Register Window Stack | ||||||
---|---|---|---|---|---|---|---|
Function 1 | I | W | C | ||||
Function 2 | … | … | I | W | C | ||
Function 3 | … | … | … | … | I | W | C |
A possible way out for the encoding problem of wider register groups may come from Lua. This programming language operates on a bytecode interpreter that uses a random access stack for storing live data, often referred to as a register machine. The trick to learn from is that in Lua’s bytecode, stack index operands can have asymmetric sizes. You can address data farther down the stack as a source operand than you can address the target of your operation.
A next version of Finvil could have a new WRR
instruction
format where Source 1
and Target
can only address the
8 work registers, but Source 2
can also address 4 input
and 4 output registers, spanning the full window of then 16
partially overlapping registers. This would at least make
operating on function arguments more efficient. It is still
desirable to also be able to write results back into I/O
registers, however, without requiring extra transfer opcodes.
Why 16?
24 is a much better number, isn’t it? Fun and banter aside, i do mean it. I quite often encounter situations where 8 bits of data are too few, but 16 bits are too much. You know what would’ve been close to perfection for me? 12 bits. And for bigger data i just as often encounter the same issue. 16 bits is way too little, but 32 bits is way too much. 24 bits would be lovely. However, few programming languages i know have built-in support for 12 and 24 bit integer types. Zig being one exception. The primary reason for this is that to a CPU, dividing by 3 is rather annoying without special hardware. Keeping everything sized in powers of two allows for quite predictable, quite clean, quite cheap data fetching logic.
But this is Finvil. Software emulators can deal with multiples of 3, you just have to… convince them hard enough. Just think of RGB-encoded images, that’s multiples of 3 inside your data stream. And a real hardware implementation can always add a special »3-divider« circuit. Division of small numbers by small constants can be implemented rather efficiently in hardware. And software? See for yourself. Compilers have tricks to make remainder divisions by constants not that expensive.
Sooo… why not experiment with a 24-bit instruction encoding, hmm?
Wherefore Art Thou, Segment, Whither Dost Thou Go
The observant among you who didn’t yet fall asleep may have noticed that there is a little… disagreement in what i have presented you so far. A symptom of Finvil still being a work in progress.
In the register file table, there were 3 special registers,
which as you may have guessed, store segment spans: fs, gs, ls
.
In the section »Data Address Spaces«, however, i mentioned
the 6 segments GRS, GWS, GSS, LRS, LWS, LSS
. These seem to
relate to the gs, ls
registers, but what about fs
, then?
Put simply, i haven’t yet finalised how many segments for what purpose i want or need. This involves writing lots of pseudo code (like the code examples so far) to figure out the best way to pass around big blocks of data of different access permission and data persistency levels.
What i really want to have is something Wasm has a not yet implemented draft for. Per permission/persistency level, i want there to be a list of segments the current function can access. Global segment? That’s just the first entry. Local segment? Second entry. Wanna transfer big blobs of data to a called function? Throw it a segment generated from a slice of your own segments. Need to read or write state of the embedding environment? Ask for them to be mapped in as extra segments.
(Search keyword for those of you curious what this Wasm feature is: »Multiple Memories«)
This would require a mildly wider instruction encoding, however.
External & Internal Functions
This last one is short, i promise. Currently, Finvil has no way of splitting up big functions into smaller worker functions other than by exporting these worker functions. Given that function indices merely range from 0 to 65535, this precious index space could quickly be eaten up by functions other scripts shouldn’t be able to see. The solution is simple: Add a second pair of call instructions. I just haven’t come around to free up opcodes for them yet.
Conclusion
Designing custom CPU ISAs or bytecode machines for fun or
special purposes or both is super fun. I recommend giving
it a try! You can take it as far as i do, trying to optimise
every detail before finalising a version 1 design, or you
just go with the flow and let things evolve. Or you do neither
and just seek to better understand what went through the minds
of other designers of CPU ISAs or bytecode machines.
Or thou art henceforth forever horrified by the mere
pensée of ISA or bytecode design.
Whichever it is, i hope you could draw some inspiration or better understanding out of me ranting about an old project of mine. There’s lessons to be learned from everything everywhere. But ultimately i just want you to have fun.
Bye!
Terms Used
- Bytecode
- Another name for instructions for when we don’t speak of arbitrary CPU ISAs, but about software interpreters. Most bytecode languages are equipped with high-level programming language features, such as instructions for instantiating objects of classes or dynamically looking up their properties, indexing array objects, and more. The reason i describe Finvil as a »virtual ISA« is that it doesn’t have any of these high-level features most bytecode languages have. You could implement the entirety of Finvil in hardware. Your embedding environment is then a much more powerful management CPU next to however many Finvil cores.
- Component
- Entity
- In an Entity Component System, a software architecture style popular with some game engines, an entity is basically whatever you can spawn into and operate over in a game world. An entity can have a bunch of different components, which are special-purpose sets of properties. In Finvil’s old design, the Finvil register file would’ve been a scriptable entity’s component. And then there are the systems, which define behaviour over entities with specific combinations of components. For example, a Finvil script system would run over all entities that have a Finvil register file, segment list, and script component, and then execute these scripts over the segments and registers.
- Embedded Interpreter
- In the world of CPUs, embedded means a different thing, so let’s be clear in case there is some confusion: By »embedded interpreter« i mean you have this interpreter included as a software library and run it within your own application. A stand-alone interpreter would be something like Python. (Yes, technically you can embed Python, but only the chaotic evil do this.)
- FPGA
-
A field-programmable gate array. This is basically a big
block of basic configurable logic blocks and a lot of
configurable routing between them. The logic blocks you
can configure to perform the most basic of computations:
adding,
xor
-combining, storing data in flip-flops, and the likes. Then you configure which logic blocks take their data where from and whither they move their results. All in all, it’s a programmable CPU-like device that you can make behave as if it was whichever CPU you like. (Assuming the code of the CPU you like fits into the FPGA.) A CPU model programmed onto an FPGA is what i’d call »hardware emulation«. In case you wondered why i’m adding the »software« in »software emulation« all the time, this is why. - ISA
- CPU ISA
- Short for »Instruction Set Architecture«. This is a CPU’s specification of software-observable features. I.e. an ISA is about what instructions a CPU has, how they are encoded, what a CPU’s memory model is, and all the other things related to how you actually program this thing. What’s not part of an ISA are things like pin voltage levels, bus capacitance, the exact transistor layout of an integer adder unit, etc.
- Macro-Op Fusion
-
Your CPU is capable of performing a quite complex operation
in one step, a very useful complex operation at that. However,
your current ISA design is incapable of expressing this operation
with a single opcode, be it for historical or design philosophical
reasons. For example, your ISA may need two instructions to
compare, jump if equal
, although your CPU’s branch unit may have a direct connection to your comparator units. Or your ISA may need two instructions for one operation, because while the CPU produces two results for a wide operation, your instruction formats have no way of expressing a secondary target register. For examplemul_lo, mul_hi
to compute the lower and higher half of a result twice as big as your registers separately. Or you wish to load a big constant from the instruction stream, but your instruction formats can only carry small constants, so you have to build your big constants out of many instructions, like withs.isli
. In all these cases, instead of adding more or wider instructions that can handle them in a single step, you could do the inverse and have your CPU detect these sequences and merge them into a single internal magical instruction before execution. That is macro-op fusion. - Mnemonic
-
An easier to remember name for something. In assembly
languages, the mnemonics are the text names for opcodes
which to the CPU are really just numbers. The CPU may
see
1337
, which means nothing to you. Your assembly language saysxor
, which means nothing to the CPU. Thexor
is the mnemonic for you, and your assembler compiles that to1337
. And it doesn’t end with just opcodes. Even register names are mnemonics. - Opcode
-
Short for »operation code«. For every distinct operation
a CPU ISA can perform, there is one assigned opcode. You
can add two registers together? One
add
opcode. You can add a register and a constant together? Oneadd constant
opcode. In a RISC design. In a CISC machine, constant or register is a property defined in the operands, not by the opcode itself, so there you would only have oneadd whatever to whatever
opcode. The combination of an opcode, maybe some operands, and maybe some targets for the results of the operation, make up what is called an instruction. The name opcode is also used to refer to the numeric value of an opcode when encoded in an instruction. Unfortunate possible source of confusion. - Predicate Register
- Most of you who are familiar with CPU design are familiar with the concept of »status flags registers«. These are registers that after some operations accumulate short-lived statistical data on the results of these operations. Things like: »The result was 0«, or: »The result was negative«, or »The result had uneven parity«. Status flags can improve code density and are very cheap to implement in hardware. However, they create instruction interdependencies that make parallelising operation execution in a real CPU difficult. It’s also difficult with flags to carry their state around for a while and use them later, because many common instructions on the way override their state. Ultimately what you want to do is perform a conditional branch based on the status flags. Like branching away if the result of the last computation was zero. Predicate registers are special registers that can only hold boolean values. Whenever you branch conditionally, you read one of the predicate registers. To set a predicate register to the value you need, you perform a boolean operation on other predicates, or comparisons on other registers. Predicate registers have a lot of benefits, such as making it easy to carry around branch conditions for branches that are yet to come, or to use a set of predicate registers for masking vector operations. The downside is, however, that predicate instructions require a lot more opcodes than a status flags register.
- Spreadsheet
- The eternal source of nightmares among IT professionals, plaguing the world ever since 1969. They pretend to be relational databases, with a really shiny, easy to use and modify GUI, but they aren’t relational databases. O bored programmers of the world, hear ye my prayer: Please make a user-friendly GUI app that can easily make new tables, modify them as needed, have computed columns, have graph and chart widgets derived from table data, and all the other pretty stuff, except sitting on top of SQLite. K thx bai.
- Verilog
- A hardware description language. Not the best one, it’s quite old. More modern HDLs exist, but they all usually compile to Verilog. It’s the C of hardware design, for better or worse. And yes, for quite some time now, hardware is programmed. You write the high-level logic of how your circuit behaves in e.g. Verilog, and then use proprietary magical tooling to compile Verilog to so-called »netlists«, a lower-level representation of your code that only involves the most basic of logic gates and a few common elements like multiplexers or latches. Optimise the layout of your netlist and your circuit design is ready for fabrication.