📅︎
✏︎ by
Evy
⏱︎ 14 min ±10%

Machine Code is Typed

Wait, what…? Sometimes, when discussions about type system designs arise, someone makes a slight remark that machine code, aka. that binary programming language your CPU speaks, is an example of an »untyped« language. I heard that claim again recently, listening to a podcast, and so i felt inspired to write about why i disagree with that claim.

On This Page:

Types & Type Systems

Before i get started, allow me to ensure we’re on the same page about what exactly a »type« and a »type system« is. To quote the English Wikipedia article on Data typeʷᵇ

[A] data type (or simply type) is a collection or grouping of data values, usually specified by a set of possible values, a set of allowed operations on these values, and/or a representation of these values as machine types.

And the article on Type systemʷᵇ claims:

[A] type system is a logical system comprising a set of rules that assigns a property called a type […] to every term […]. Usually the terms are various language constructs of a computer program, such as variables, expressions, functions, or modules. A type system dictates the operations that can be performed on a term.

Now, this gives us an easy to follow recipe to assess whether a specific machine language is »typed«:

  1. Check whether there are any operations which are only valid on some values.
  2. If so, this machine language has »data types«.
  3. If it has data types, it follows that there is a »type system« that dictates which machine code expressions are valid.

You may already have guessed what i’m getting at here. ☺

② and ③ are both consequences of ①, so that’s all we have to look out for.

Is it typed?

Unfortunately, i do not know every machine language in the world, so i cannot give you an exhaustive answer. Instead, let’s briefly explore some example machine languages. I picked a really simple one to get started, for the same principles apply to all other machine languages.

Is 6502 typed?

What’s a »value« or »datum« here? Put simply, it’s our register file. All registers store individual values, on which we’re performing operations producing new values, which in turn we store in registers. Where you’d write a + b in, say, C++, you usually write something like add r1, r2 in a CPU’s assembly language†. What was our variable names in C++ became register names in assembly code.

On the 6502 we get 3 such registers: A, X, and Y. Each of these is capable of storing an 8-bit value of any bit pattern. This means, that at least with regards to the »set of possible values«, A, X, and Y have the same data type. And if there was indeed just one data type for all data in a language, then it is an untyped language. However, let’s not get hasty. There’s still the criterium of »allowed operations« to consider.

Before digging up the evidence, let me make this claim:
A, X, and Y each have different types.

Let’s have a look at a 6502 instruction listingʷᵇ. The first 3 instructions there already suffice to support my claim. LDA, LDX, LDY, they each operate on a different register. You cannot make LDX load a value into A, that’s only something LDA can do. »Okay«, you may think, »but that’s unfair, because 6502 instructions can only take 1 operand!« It’s fine. So let’s accept LDA #10 as merely strange syntax for LD A, #10. Now all 3 registers support the LD operation, and thus we’re back to being untyped, right? Well, not quite. See that #10? That’s just one possible kind of operand for LD. LDA supports this list of »addressing modes« for its operand:

Addressing ModeWhat dat?
ImmediateAn 8-bit value of any bit pattern. Sounds just like our register data, nice.
Zero PageAnother 8-bit value of any bit pattern, but magic.
Zero Page, XMagic with X.
AbsoluteA… 16-bit value of any bit pattern. Well, that doesn’t fit into our registers! More magic?
Absolute, XAnalogous to the »Zero Page« thing.
Absolute, YAnalogous to the »Zero Page« thing.
(Indirect, X)We’re back to 8-bit magic values.
(Indirect), YYes, the different bracing has subtle meanings.

We don’t have to actually know what all these do. Let’s just compare this list to LDX’s addressing modes instead:

  • Immediate
  • Zero Page
  • Zero Page, Y
  • Absolute
  • Absolute, Y

And… it’s different! You cannot perform the operation LDX ($77, X), but you can do LDA ($77, X). So indeed, by restriction of available valid operations, values of A are of another type than values of X. And if we also check the list of addressing modes for LDY, we see that values of X are also of a different type than values of Y:

  • Immediate
  • Zero Page
  • Zero Page, X
  • Absolute
  • Absolute, X

Can we »cast« values between these 3 types? Like we can do (int) 6.282f in C? Sure thing! 6502 calls its type casting »register transfers«. TAX, for example, »casts« values of A into values of X by copying their bit patterns between these storage locations. You could thus say the 6502 is »storage location typed«. And because you cannot implicitly perform these »casts«, always requiring these explicit operations like TAX, you can further state that 6502 has a »strong type system«.

So, A, X, Y are typed data. But we’ve seen yet another data type already. (Many others, even.) The »absolute« addressing modes all of the sudden have 16-bit data. 16-bit data can represent values (e.g. 1024 when interpreted as an unsigned binary integer) which 8-bit data can’t. Yet, 16-bit data can represent every value an 8-bit datum can. That’s some sub-typing here. But it gets stranger.

What even is #10 or $77? A value, so much we’ve established. And we can perform operations on them, like in the LDA ($77, X) example before. If 6502 was truly untyped… we should be able to perform the LD operation on these, too. Alas, at least on 6502, LD#10 A‡ is not a valid operation, meaning that even if A, X, Y were of the same type, 8-bit (and 16-bit) »immediates« are still different data types.

So here’s the conclusion:

6502 has a strong, storage-location-based type system.

It is not untyped.

Is x86 typed?

That’s now pretty trivial to demonstrate:

  • IEEE-754 floating-point operations like addss are only available on the SIMD registers xmm0 up to who knows how many of these your CPU has by now. You cannot write addss eax, ebx, the base registers are unavailable to this operation. However, you can addss xmm0, xmm1 just fine.
  • You need explicit casting to transfer values between registers of different types, like cvtss2si eax, xmm0.
  • Of course, the same shenanigans as with 6502’s immediate values vs. register values also apply to x86.

For historical reasons, we get a little bit of extra fun. As implied, all xmm registers have the same data type. x86 SIMD instructions don’t discriminate between them. They do, however, discriminate between values of the base registers eax, ebx, r9, r12, etc. While all the r{N} registers have the same data type, all sharing the same set of possible values and operations, each of the »alphabet« registers like eax and ebx have distinct types.

The instruction div r12, for example, divides a 128-bit integer by the current value of r12. But where is that 128-bit value? In the registers rdx and rax combined. Your 128-bit value to divide cannot be anywhere else. And the results? rax receives the quotient, rdx the remainder. You cannot change these, either.

There are even examples of x86 restricting which values are allowed for some data, not just operations. For example, AAAA'AAAA'AAAA'AAAA₁₆ is an invalid memory address in x86, until an x86 CPU with a full 64-bit virtual address space comes around. For all others, addresses have less than 64 bits and must be sign-extended to 64 bits. AAA…₁₆ is not sign extended, it flips every other bit around. That’s in addition to the well-known error on integer divisions by zero. These checks are only performed at runtime, however, so x86 also has instances of dynamic typing.

x86 has a strong, storage-location-based type system, too.

Is RISC-V Zfinx typed?

Of course, we can still apply the same old trick from 6502, demonstrating that RISC-V at least makes a difference between immediate values and register values. Ignoring that, i sprinkled in this little Zfinxʷᵇ instruction set extension to make this section more interesting.

For those who don’t know, like x86 (or ARM, or MIPS, or…), RISC-V has a different set of registers for »integer-like« data and IEEE-754 data, allowing IEEE-754 floating-operations only on the latter, and requiring explicit casting to convert data. The integer-like registers are called x0…x31, the floating-point registers are f0…f31.

Now here’s the magic:

With the Zfinx extension, these are no longer two separate register sets! f0 becomes x0, f1 becomes x1, and so on. This means, that with this extension, all† operations can now be equally performed on all data. This in turn means that (ignoring immediates) there is only one type, making RISC-V Zfinx almost untyped!

RISC-V Zfinx is the closest to untyped yet, but it ain’t.

Is Mill typed?

For those not familiar with the only radically different CPU ISA in decades that is slowly being worked on in the dark, hiding its great ideas behind an army of patents that expire no earlier than 2036: Mill is a new CPU ISA that is best known (assuming people know it) for its »belt«.

The »belt« is patented magic lingo for what we programmers like to call a »random read-access ring buffer«. A machine code instruction enques however many results it produces, which results in an equal amount of oldest values being dequed. Operands of an instruction can be read from any slot in that ring buffer — sorry, i mean »belt«.

There is only one »belt«, and what traditional CPUs would call a »register« is a single slot in the ring buffer here. Now, if Mill instructions which take their operands from the »belt« all share the same belt, doesn’t that mean that all operations are available for all belt values? (You know the drill with immediates. ☺) If so, then belt values are »untyped«.

Don’t get too excited now. The Mill has another trick up its sleeve: Dynamic typing.

Instructions that produce belt values out of nowhere or fetch them from memory have to provide type annotations. loadu has such an annotation in the argument width. This informations is carried around through your operations and all belt slots, such that an eventual storeu operation knows how many bytes to actually write to memory. We don’t want to, say, pad an 8-bit integer to the full 128-bit register width when saved to memory. So Mill has runtime type information, and values of different types sometimes behave differently.

So what about converting between these types? While there are explicit conversion instructions like u2ff (unsigned integer to binary floating-point), most instructions are weakly typed. addu will just assume your operands are unsigned integers of width and scalarity. addf will do just the same. Of course, doing addf on two random integers yields very different results to first u2ff-converting, then adding them.

The Mill has type inferrence and is weakly typed.

Yes, they are typed… and…

… every CPU architecture out there has its own little type system nuances.

  • Strong or Weak
    • Are explicit conversion operations needed? Strong.
      • TAX on the 6502.
    • Are data bits just blindly accepted as-is? Weak.
      • addu vs. addf on Mill.
      • RISC-V Zfinx moves all floating-point operations to our integer registers.
  • Dynamic or Static
    • Memory addresses are often dynamically typed.
      • AAA…₁₆ on x86 causes memory operations to fail at runtime.
    • Most operations are statically typed.
      • An integer addition will do integer addition, nothing else.
      • LDX on 6502 cannot use X as an index register.
      • There is no LD#10 on 6502 to load data into the number 10.
    • Some operations can use dynamic dispatch.
      • Mill auto-vectorises operations based on the scalarity runtime type information.
      • Mill picks between 8-bit, 16-bit, etc. variants of operations based on the width runtime type information.
  • Manifest or Inferred
    • addss on x86 adds 32-bit binary floating-point scalars together. Manifest typing.
    • addu on Mill uses dynamic dispatch based on runtime type info to add across all vector lanes. Inferred typing.

Is there any untyped machine language?

Data in RAM carries no type information whatsoever. You can store what you like, load anything as anything. Ignoring the type differences of immediate values, the closest you are likely to get to an untyped CPU ISA are RISC-V with the Zfinx extension, or the Propellerʷᵇ. This is a fun one, for on this CPU, all registers are stored in RAM. Register indices are just shorter names for actual memory addresses, meaning there is no distinction between register and memory operands. PICʷᵇ microcontrollers also have their registers stored in memory. Getting closer, but even these two have their special cases.

So really the answer is: Probably not.

Machine code is typed.

Evyl Blue

Attempting world domination since 1997.