Machine Code is Typed
Wait, what…? Sometimes, when discussions about type system designs arise, someone makes a slight remark that machine code, aka. that binary programming language your CPU speaks, is an example of an »untyped« language. I heard that claim again recently, listening to a podcast, and so i felt inspired to write about why i disagree with that claim.
On This Page:
Types & Type Systems
Before i get started, allow me to ensure we’re on the same page about what exactly a »type« and a »type system« is. To quote the English Wikipedia article on Data typeʷᵇ
And the article on Type systemʷᵇ claims:
Now, this gives us an easy to follow recipe to assess whether a specific machine language is »typed«:
- Check whether there are any operations which are only valid on some values.
- If so, this machine language has »data types«.
- If it has data types, it follows that there is a »type system« that dictates which machine code expressions are valid.
You may already have guessed what i’m getting at here. ☺
② and ③ are both consequences of ①, so that’s all we have to look out for.
Is it typed?
Unfortunately, i do not know every machine language in the world, so i cannot give you an exhaustive answer. Instead, let’s briefly explore some example machine languages. I picked a really simple one to get started, for the same principles apply to all other machine languages.
Is 6502 typed?
What’s a »value« or »datum« here? Put simply, it’s our register file. All registers store individual values, on which we’re performing operations producing new values, which in turn we store in registers. Where you’d write a + b
in, say, C++, you usually write something like add r1, r2
in a CPU’s assembly language†. What was our variable names in C++ became register names in assembly code.
On the 6502 we get 3 such registers: A
, X
, and Y
. Each of these is capable of storing an 8-bit value of any bit pattern. This means, that at least with regards to the »set of possible values«, A
, X
, and Y
have the same data type. And if there was indeed just one data type for all data in a language, then it is an untyped language. However, let’s not get hasty. There’s still the criterium of »allowed operations« to consider.
A
, X
, and Y
each have different types.
Let’s have a look at a 6502 instruction listingʷᵇ. The first 3 instructions there already suffice to support my claim. LDA
, LDX
, LDY
, they each operate on a different register. You cannot make LDX
load a value into A
, that’s only something LDA
can do. »Okay«, you may think, »but that’s unfair, because 6502 instructions can only take 1 operand!« It’s fine. So let’s accept LDA #10
as merely strange syntax for LD A, #10
. Now all 3 registers support the LD
operation, and thus we’re back to being untyped, right? Well, not quite. See that #10
? That’s just one possible kind of operand for LD
. LDA
supports this list of »addressing modes« for its operand:
Addressing Mode | What dat? |
---|---|
Immediate | An 8-bit value of any bit pattern. Sounds just like our register data, nice. |
Zero Page | Another 8-bit value of any bit pattern, but magic. |
Zero Page, X | Magic with X . |
Absolute | A… 16-bit value of any bit pattern. Well, that doesn’t fit into our registers! More magic? |
Absolute, X | Analogous to the »Zero Page« thing. |
Absolute, Y | Analogous to the »Zero Page« thing. |
(Indirect, X) | We’re back to 8-bit magic values. |
(Indirect), Y | Yes, the different bracing has subtle meanings. |
We don’t have to actually know what all these do. Let’s just compare this list to LDX
’s addressing modes instead:
- Immediate
- Zero Page
- Zero Page, Y
- Absolute
- Absolute, Y
And… it’s different! You cannot perform the operation LDX ($77, X)
, but you can do LDA ($77, X)
. So indeed, by restriction of available valid operations, values of A
are of another type than values of X
. And if we also check the list of addressing modes for LDY
, we see that values of X
are also of a different type than values of Y
:
- Immediate
- Zero Page
- Zero Page, X
- Absolute
- Absolute, X
Can we »cast« values between these 3 types? Like we can do (int) 6.282f
in C? Sure thing! 6502 calls its type casting »register transfers«. TAX
, for example, »casts« values of A
into values of X
by copying their bit patterns between these storage locations. You could thus say the 6502 is »storage location typed«. And because you cannot implicitly perform these »casts«, always requiring these explicit operations like TAX
, you can further state that 6502 has a »strong type system«.
So, A
, X
, Y
are typed data. But we’ve seen yet another data type already. (Many others, even.) The »absolute« addressing modes all of the sudden have 16-bit data. 16-bit data can represent values (e.g. 1024 when interpreted as an unsigned binary integer) which 8-bit data can’t. Yet, 16-bit data can represent every value an 8-bit datum can. That’s some sub-typing here. But it gets stranger.
What even is #10
or $77
? A value, so much we’ve established. And we can perform operations on them, like in the LDA ($77, X)
example before. If 6502 was truly untyped… we should be able to perform the LD
operation on these, too. Alas, at least on 6502, LD#10 A
‡ is not a valid operation, meaning that even if A
, X
, Y
were of the same type, 8-bit (and 16-bit) »immediates« are still different data types.
So here’s the conclusion:
6502 has a strong, storage-location-based type system.
It is not untyped.
Is x86 typed?
That’s now pretty trivial to demonstrate:
- IEEE-754 floating-point operations like
addss
are only available on the SIMD registersxmm0
up to who knows how many of these your CPU has by now. You cannot writeaddss eax, ebx
, the base registers are unavailable to this operation. However, you canaddss xmm0, xmm1
just fine. - You need explicit casting to transfer values between registers of different types, like
cvtss2si eax, xmm0
. - Of course, the same shenanigans as with 6502’s immediate values vs. register values also apply to x86.
For historical reasons, we get a little bit of extra fun. As implied, all xmm
registers have the same data type. x86 SIMD instructions don’t discriminate between them. They do, however, discriminate between values of the base registers eax
, ebx
, r9
, r12
, etc. While all the r{N}
registers have the same data type, all sharing the same set of possible values and operations, each of the »alphabet« registers like eax
and ebx
have distinct types.
The instruction div r12
, for example, divides a 128-bit integer by the current value of r12
. But where is that 128-bit value? In the registers rdx
and rax
combined. Your 128-bit value to divide cannot be anywhere else. And the results? rax
receives the quotient, rdx
the remainder. You cannot change these, either.
There are even examples of x86 restricting which values are allowed for some data, not just operations. For example, AAAA'AAAA'AAAA'AAAA₁₆
is an invalid memory address in x86, until an x86 CPU with a full 64-bit virtual address space comes around. For all others, addresses have less than 64 bits and must be sign-extended to 64 bits. AAA…₁₆
is not sign extended, it flips every other bit around. That’s in addition to the well-known error on integer divisions by zero. These checks are only performed at runtime, however, so x86 also has instances of dynamic typing.
x86 has a strong, storage-location-based type system, too.
Is RISC-V Zfinx typed?
Of course, we can still apply the same old trick from 6502, demonstrating that RISC-V at least makes a difference between immediate values and register values. Ignoring that, i sprinkled in this little Zfinx
ʷᵇ instruction set extension to make this section more interesting.
For those who don’t know, like x86 (or ARM, or MIPS, or…), RISC-V has a different set of registers for »integer-like« data and IEEE-754 data, allowing IEEE-754 floating-operations only on the latter, and requiring explicit casting to convert data. The integer-like registers are called x0…x31
, the floating-point registers are f0…f31
.
Now here’s the magic:
With the Zfinx
extension, these are no longer two separate register sets! f0
becomes x0
, f1
becomes x1
, and so on. This means, that with this extension, all† operations can now be equally performed on all data. This in turn means that (ignoring immediates) there is only one type, making RISC-V Zfinx
almost untyped!
RISC-V Zfinx
is the closest to untyped yet, but it ain’t.
Is Mill typed?
For those not familiar with the only radically different CPU ISA in decades that is slowly being worked on in the dark, hiding its great ideas behind an army of patents that expire no earlier than 2036: Mill is a new CPU ISA that is best known (assuming people know it) for its »belt«.
The »belt« is patented magic lingo for what we programmers like to call a »random read-access ring buffer«. A machine code instruction enque
s however many results it produces, which results in an equal amount of oldest values being deque
d. Operands of an instruction can be read from any slot in that ring buffer — sorry, i mean »belt«.
There is only one »belt«, and what traditional CPUs would call a »register« is a single slot in the ring buffer here. Now, if Mill instructions which take their operands from the »belt« all share the same belt, doesn’t that mean that all operations are available for all belt values? (You know the drill with immediates. ☺) If so, then belt values are »untyped«.
Don’t get too excited now. The Mill has another trick up its sleeve: Dynamic typing.
Instructions that produce belt values out of nowhere or fetch them from memory have to provide type annotations. loadu
has such an annotation in the argument width
. This informations is carried around through your operations and all belt slots, such that an eventual storeu
operation knows how many bytes to actually write to memory. We don’t want to, say, pad an 8-bit integer to the full 128-bit register width when saved to memory. So Mill has runtime type information, and values of different types sometimes behave differently.
So what about converting between these types? While there are explicit conversion instructions like u2ff
(unsigned integer to binary floating-point), most instructions are weakly typed. addu
will just assume your operands are unsigned integers of width
and scalarity
. addf
will do just the same. Of course, doing addf
on two random integers yields very different results to first u2ff
-converting, then adding them.
The Mill has type inferrence and is weakly typed.
Yes, they are typed… and…
… every CPU architecture out there has its own little type system nuances.
- Strong or Weak
- Are explicit conversion operations needed? Strong.
TAX
on the 6502.
- Are data bits just blindly accepted as-is? Weak.
addu
vs.addf
on Mill.- RISC-V
Zfinx
moves all floating-point operations to our integer registers.
- Are explicit conversion operations needed? Strong.
- Dynamic or Static
- Memory addresses are often dynamically typed.
AAA…₁₆
on x86 causes memory operations to fail at runtime.
- Most operations are statically typed.
- An integer addition will do integer addition, nothing else.
LDX
on 6502 cannot useX
as an index register.- There is no
LD#10
on 6502 to load data into the number 10.
- Some operations can use dynamic dispatch.
- Mill auto-vectorises operations based on the
scalarity
runtime type information. - Mill picks between 8-bit, 16-bit, etc. variants of operations based on the
width
runtime type information.
- Mill auto-vectorises operations based on the
- Memory addresses are often dynamically typed.
- Manifest or Inferred
addss
on x86 adds 32-bit binary floating-point scalars together. Manifest typing.addu
on Mill uses dynamic dispatch based on runtime type info to add across all vector lanes. Inferred typing.
Is there any untyped machine language?
Data in RAM carries no type information whatsoever. You can store what you like, load anything as anything. Ignoring the type differences of immediate values, the closest you are likely to get to an untyped CPU ISA are RISC-V with the Zfinx
extension, or the Propellerʷᵇ. This is a fun one, for on this CPU, all registers are stored in RAM. Register indices are just shorter names for actual memory addresses, meaning there is no distinction between register and memory operands. PICʷᵇ microcontrollers also have their registers stored in memory. Getting closer, but even these two have their special cases.
So really the answer is: Probably not.
Machine code is typed.