04.23

Sarah Wittman describes “simultaneously designing top-down and bottom-up until your designs meet in the middle”. She goes into a little detail on the subject, and I recommend reading what she has to say. Like Sarah’s, my project (as it currently exists) has come together in the middle.

When I first started this effort I had an idea to make what I call a “serialized” processor. The idea was to be able to operate on a stream of bits rather than on a collection of bits. The theoretical advantages of this idea were as follows:

- minimal transistor count – only enough transistors to perform the various operations on a pair of bits would be required
- use of serial busses – again, minimizing the amount of resources required to actually build the processor. Another advantage is that it tends to be a bit easier to run serial busses at higher clock speeds.
- Easy variation of word size – The only difference between an eight bit word and a 16 bit word would be the number of clock cycles for which a given instruction ran. In theory, this would make it much easier to go from eight bits to 16, or from 16 to 32, etc.
- Because the entire processor is serial, shift registers can be used for memory. It would also allow for a neat trick where data is processed immediately as it is shifted out of memory, essentially eliminating any latency incurred by fetching data from memory(in theory at least).

I actually created a logic diagram for an adder based around this paradigm. Little did I know at the time, that an adder is probably the easiest of the arithmetic instructions to implement. Still, I had created it without looking at anybody else’s work, and I was proud of my design.

I never did build it, partly because I did not have a good way to test it. I wanted to have two shift registers to use for Input-1 and Input-2, and a third to use for the output. Those registers ended up being very important because while attempting to build them I completely changed the direction of the project, which is something I will discuss in more detail in a future post.

In retrospect, I do not know whether or not the design would have worked. This is largely due to the logic dealing with the carry. Also notice that there is no clock input. The idea at the time was simply to turn the entire circuit on and off. The flip-flop would be powered constantly (except between instructions), preserving the carry bit.

Initially, partly because of the advantages of a “serialized” design described above, I wanted to build a 16 bit processor instead of a four or eight bit processor. Why? At least in part because the mighty 8086 is 16 bit. There is also a great deal of code written for 16 bit processors, including CP/M and DOS. I guess I just could not resist trying to build something that could run all of that pre-existing code (despite knowing that I would have to write a compiler and re-compile it all).

You may be thinking that I am a bit insane for even thinking about basing a homebrew processor on the venerable 8086. Such an undertaking may not be quite as crazy as it sounds. Let us consider some intel on the Intel:

- initially clocked at up to five Mhz
- contained roughly 29,000 transistors
- implemented approximately 81 unique instructions
^{1}

While a processor of those specifications would be quite an undertaking, I think it would be possible. That said, I had no intention of trying to implement the entire x86 instruction set. A review of the 8086 instruction set indicated a number of instructions that were scenario-specific versions of other instructions, or were otherwise specific to the architecture of that particular chip. I did find the 8086 instruction set to be a useful starting point though.

Over time I came up with what I feel like is a fairly solid instruction set. The following instruction set should be fairly versatile, yet still relatively easy to implement. I still have one opcode without an associated instruction. I have a few different ideas for what I want to use that for, and when I finalize that decision I will post an update. I may still make some changes going forward, but at this point I do not expect any major changes. I should also point out that the “stage” column has nothing to do with pipelines, or architecture at all. The idea was to build the processor in multiple stages, each one building upon the last. The “stage” column indicates the stage in which the hardware is to be built.

As I was working on the processor in my head I came to realize that I could pretty much build the more complex instructions around the add and subtract instructions (when I first started this project I was not familiar with two’s complement). After all, what is multiplication but repeated addition and division but repeated subtraction? My idea for how to do this was quite simple, but for reasons that will eventually be self-evident, I am not going to go into the implementation details. I did not realize it at the time, but this would be a basic utilization of micro-code.

At this point, I realized a few problems in my design. Let us use the example of multiplication to illustrate. If I have two 16 bit numbers, let us say 65,535 and 1, then the micro-code has to be sufficiently advanced to realize that multiplying by one is a special case(as is multiplication by zero). Assuming that is the case, there is still one of two possible routes of multiplication by repetitive addition:

- add 65,535 to zero
- add 1 to zero 65,535 times

Obviously, option 1 is going to be much more desirable than option 2. In fact, in general, we are going to want to add the larger of two numbers to zero the smaller of two numbers times. This does impose an additional requirement on our microcode though, and now we need to be able to compare two numbers in order to determine which of the two is larger, and then order the two numbers in order to make the most efficient utilization of the adder.

Let us look at an additional (pun intended) example involving two different numbers. In this case let us consider 255 (11111111) and 255 (11111111). The product of these two eight bit integers is 65,025 (11111110 00000001). If we multiply these two integers by using repeated addition, the addition instruction would be executed 255 times. Assuming the use of the “serialized” adder described above, that would require 16 cycles per instruction (under optimal conditions) times 255, or 4080 cycles.

Carrying this example further, if we multiply 65,535 by 65,535 (knowing we will have a rather significant wrap-around error), the processor would require 16 times 65,535, or 1,048,560 cycles. Assuming the processor can run at 1 Mhz (which is a rather significant assumption), that instruction would take more than one second to execute. Clearly, this is unacceptable.

One way to fix that problem would be to add bound checking to the micro-code. Aside from further complicating the micro-code, we have added an inherent *assumption* about software to the hardware: that code would never want to multiply 65,535 by 65,535 and accept the rounding error. This specific assumption is not particularly problematic, but making these *types* of assumptions is a bad habit to get into.

Further, why would we want to stop with this scenario? The square root of 65,535 is just under 256 (the square root of 65,536 is 256), so why should the boundary limitation allow any number greater than 256 as an input? One reason is that 16,383 times three is 49,149, which is well within bounds of a 16 bit integer. The point is, the number of potential wrap-around cases for two 16 bit inputs is substantial, and checking for those cases in micro-code is impractical at best.

Somebody who knows a thing or two about this sort of thing might say, “*Yeah, but why would you use such a ridiculous algorithm in the first place?*” The short answer to that question is because repeated addition is what multiplication is. In response to this point though, I would not use this algorithm, it just makes a good example.

Let us consider (what I now know to be) a common approach that is a bit more realistic: shift left and add. This technique shifts an input by a number of places equal to the position of each high bit in the other input, and accumulates the results. I realize this may not be a very clear explanation of the method. Further explanation of this technique can be found here or here.

This still requires some work on the part of the micro-code. Ideally the inputs should still be ordered, but it is not necessary. Using this technique for a pair of 16 bit inputs, 16 one bit shift operations will be required. The number of addition operations would be equivalent to the word length in bits for the inputs squared, assuming a single two-input adder is used. This would require 16 times 16, or 256 addition operations. Assuming each shift operation requires 16 cycles and each addition operation requires 16 cycles this method would require a total of 4,352 cycles (again, under somewhat unrealistically optimal conditions).

This is not great, but it is much better than the worst case scenario of 1,048,560 cycles that was considered above. This method could be made considerably faster by using multiple adders or multi-input accumulators. The problem with that technique is the “serialized” nature of the adder described above. It could still be done, but doing so would introduce a few new considerations.

This leads to the next point, but this post is already quite long. In an effort to keep these logs a bit (get it?) easier to digest I have broken this into two separate posts. The next log will continue from here.

^{1} At that time, floating point arithmetic would have been handled in software or in a math co-processor such as the Intel 8087.

## No Comment.

Add Your Comment