Disassembler working!

November 27, 2020

Looking into the toolchain more, the next step after the assembler seemed to be a disassembler. LLVM handles a lot of the details of this using the same data tables created for the assembler, so it seemed like it wouldn’t be a huge task. This will be very useful once we start working on a C compiler, to avoid having to hand-disassemble code when debugging.

It ended up being relatively straightforward to get working, apart from one issue that I spent a few days on, and I’m still not sure I solved it the “right” way. I’m following along with the RISC-V LLVM patchset, which has been a good reference, but since RISC-V has significant architectural differences from ECLair there have been some issues to work through.

In this case, we have a few instructions (such as ldi, load immediate) which can take as an operand either one of the general purpose registers A-D or one of the special registers SP, DP, FLAGS, and PTB. For the general purpose registers, the specific register is encoded in two bits of the opcode. The special registers instead get their own opcodes with no register field, since we only have 8 bits of opcode to work with. I had set LLVM up with a table specifying that the first operand to ‘ldi’ could be A-D, and how to decode that, and then separately that it could be each of the special registers. To accomplish this, I had register classes set up: one for A-D, then separate ones for FLAGS, PTB, SP, and DP, with one register in each. This worked fine for assembling, since the assembler only had to check that the operand matched one of the registers in the class, and with a single register per class that check is unambiguous. Once I got the disassembler basically working, though, all of the instructions with these single-register classes began to crash it with the vague error “Assertion `idx < size()’ failed”.
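As a rough illustration of the two encoding styles, here is a minimal sketch in Python. The opcode values, the choice of the low two bits for the register field, and the single-byte immediate are all made up for the example; they are not ECLair’s real encoding.

```python
# Hypothetical opcode layout, for illustration only (not ECLair's real values).
LDI_GPR_BASE = 0b0001_0000          # assumed: low two bits select A-D
GPR_INDEX = {"A": 0, "B": 1, "C": 2, "D": 3}

# Assumed: each special register gets a dedicated opcode with no register field.
LDI_SPECIAL = {
    "SP": 0b0010_0000,
    "DP": 0b0010_0001,
    "FLAGS": 0b0010_0010,
    "PTB": 0b0010_0011,
}


def encode_ldi(reg: str, imm: int) -> bytes:
    """Encode 'ldi reg, imm' as an opcode byte followed by an immediate byte."""
    if reg in GPR_INDEX:
        # The register is folded into two bits of the opcode.
        opcode = LDI_GPR_BASE | GPR_INDEX[reg]
    else:
        # The register is implied by the opcode itself.
        opcode = LDI_SPECIAL[reg]
    return bytes([opcode, imm & 0xFF])
```

With these made-up values, the A-D forms differ only in the two register bits, while the special-register forms carry no operand bits at all, which is the distinction that matters for the disassembler.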

After a few days of troubleshooting, it turned out the problem was related to these single-register classes. When setting up an LLVM backend you describe the instructions in a .td file, which a tool called TableGen reads and turns into C++ files that are compiled in. TableGen turns out to only generate decoding routines for operands which directly affect the encoding, hence the A-D versions working but the others failing. It’s essentially the same as having a regex with a single capture group and then asking it for the data in the second capture group. The failed idx < size() assertion meant that I was trying to access an item past the end of an array: nothing was captured for the register, but the disassembler still tried to decode it.
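The regex analogy can be made concrete with a few lines of Python. This is only the analogy, not LLVM code, and the pattern and input string are invented for the example.

```python
import re

# One capture group, just like an instruction with one encoded operand.
match = re.match(r"ldi (\w+)", "ldi sp")
first = match.group(1)  # the only captured value

# Asking for a second group that was never captured fails, much like
# indexing past the end of the decoded operand list.
try:
    match.group(2)
    second_group_failed = False
except IndexError:
    second_group_failed = True
```

Python raises an IndexError here instead of tripping an assertion, but the shape of the bug is the same: the consumer expects one more captured value than the pattern actually produces.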

The fix that worked was to drop the single-register classes, put the register name right in the instruction, and then specify with “Defs = [FLAGS]” that the instruction changes the FLAGS register even though that register is not encoded in the bits of the instruction. This seems to work for both assembly and disassembly, though I can’t find what the “proper” way to accomplish this is, so it’s possible it’ll cause different problems later.
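To show the idea behind the fix without any LLVM machinery, here is a small Python model of a disassembler where the special-register forms have the register name baked into the instruction text and listed as an implicit definition. The opcode values, the “ldi REG, imm” syntax, and every name in the code are assumptions made for the sketch, not the real backend.

```python
# Hypothetical opcode layout, for illustration only.
LDI_GPR_BASE = 0b0001_0000   # assumed: low two bits select A-D
GPR_MASK = 0b1111_1100
GPR_NAMES = "ABCD"

# Fixed-register forms: opcode -> (assembly template, implicitly written registers).
# No operand is decoded for the register; its name is part of the template.
LDI_FIXED = {
    0b0010_0000: ("ldi SP, {imm}", ["SP"]),
    0b0010_0001: ("ldi DP, {imm}", ["DP"]),
    0b0010_0010: ("ldi FLAGS, {imm}", ["FLAGS"]),
    0b0010_0011: ("ldi PTB, {imm}", ["PTB"]),
}


def disassemble(code: bytes) -> str:
    """Turn a two-byte 'ldi' encoding back into assembly text."""
    opcode, imm = code[0], code[1]
    if opcode & GPR_MASK == LDI_GPR_BASE:
        # Encoded operand: decode the register from the opcode bits.
        return f"ldi {GPR_NAMES[opcode & 0b11]}, {imm}"
    if opcode in LDI_FIXED:
        # Implicit operand: nothing to decode, the template names the register.
        template, _implicit_defs = LDI_FIXED[opcode]
        return template.format(imm=imm)
    raise ValueError(f"unknown opcode {opcode:#04x}")


def implicit_defs(opcode: int) -> list:
    """Registers an instruction writes without encoding them."""
    return LDI_FIXED[opcode][1] if opcode in LDI_FIXED else []
```

The decoder never tries to read a register field that does not exist, while the side table still records that the fixed forms write SP, DP, FLAGS, or PTB.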

While working on this I realized that I had only half-implemented the jmp* and call instructions, intending to move branches to being PC-relative in the future. These still use absolute addresses, so I want to fix that before doing any further work on the toolchain. After that I’ll probably keep working on the toolchain for a bit, since I’ve got some momentum on that currently.