Wellington Macintosh Society Incorporated

Megahertz Myth part II

For months, Apple has been fighting a battle to convince folks that megahertz (MHz) isn't the ultimate benchmark in determining the speed of a computer. They haven't made much headway, but in many ways they're right. In this three-part series we'll show you why. Today's installment: MacCentral's P4 -- PowerPC, Pentium, Pipelines and Parallelism.

Microprocessors are really lots of little processors brought together. They tackle instructions, which can be simple or complex. Complex instructions take more cycles. And there are major differences in the various microprocessor architectures, which make an across-the-board MHz comparison between processor families like the PowerPC and the Pentium questionable.

Microprocessors did things in three stages in the early days. First, they would pull data from memory to do the calculation, next they would do the calculation, then write the result back to memory, according to Peter N. Glaskowsky, analyst with Microprocessor Report, a resource for microprocessor information that regularly publishes processor performance information and routinely compares different processor architectures.

With this scheme, only one instruction could be worked on at a time. The clock frequency of the processor, even then rated in MHz, dictated its performance. On paper, a processor running at 8MHz could complete eight million instructions in one second, and one running at 12MHz could complete twelve million. However, the read and write parts of the three-step process each use up a clock cycle as well. So in the primitive model above, the 12MHz processor is actually completing only four million whole processes a second.

To recover these wasted cycles and make processors work faster, microprocessor designers very early in the game came up with processors that have different execution units. To continue with our primitive example, we could give the 8MHz processor a Load unit to fetch data from memory, an Integer unit to make our calculations, and a Store unit to send finished calculations back to memory. The frequency of the processor determines when each of these units will act. Like workers on an assembly line, each unit does its bit of the process each tick of the clock.

So, eight million times a second the processor loads data from memory, makes a calculation, and writes data to memory. By making three units (each with a specific duty), we can make each clock cycle result in a finished process. Each beat of the clock results in a load, execute, and store function being completed. And, most importantly, we get eight million finished instructions per second from our 8MHz processor -- twice the performance of our more primitive 12MHz processor.
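The arithmetic behind these figures can be checked with a short sketch. It assumes the simplified model used in this article: every instruction needs exactly three steps, and a full pipeline finishes one instruction per clock cycle. The function names are made up for illustration.

```python
STEPS_PER_INSTRUCTION = 3  # load, calculate, store


def unpipelined_rate(clock_hz):
    """Finished instructions per second when one instruction
    ties up the whole processor for three clock cycles."""
    return clock_hz / STEPS_PER_INSTRUCTION


def pipelined_rate(clock_hz):
    """Finished instructions per second once the pipeline is full:
    one instruction completes on every clock cycle."""
    return clock_hz


old_12mhz = unpipelined_rate(12_000_000)
old_8mhz = unpipelined_rate(8_000_000)
new_8mhz = pipelined_rate(8_000_000)

print(f"12MHz, no pipeline: {old_12mhz:,.0f} per second")
print(f" 8MHz, no pipeline: {old_8mhz:,.0f} per second")
print(f" 8MHz, pipelined:   {new_8mhz:,.0f} per second")
print(f"Pipelined 8MHz vs. primitive 12MHz: {new_8mhz / old_12mhz:.1f}x")
print(f"Pipelined 8MHz vs. primitive 8MHz:  {new_8mhz / old_8mhz:.1f}x")
```

Under these assumptions the pipelined 8MHz chip finishes twice as many instructions as the primitive 12MHz chip, and three times as many as the same 8MHz chip without a pipeline.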

This is essentially what is called pipelining. Take a look at the table below. The left column shows which clock cycle we are on -- one, two, and so on. The top row lists each of the three execution units that make up the processor. The letters in the middle show where each process -- the whole step of making one calculation -- is at each clock cycle. So, A is the first calculation that is being made, B is the second, and so on. Let's track what each unit on our simple processor is doing at each cycle.

Cycle   Load   Integer   Store
1       A      -         -
2       B      A         -
3       C      B         A
4       D      C         B

See how it still takes three steps to finish an instruction? "A" is loaded on cycle one, Integer calculated on cycle two, and stored on cycle three. I still need to load, calculate and store any single process. But, because of the pipeline I can have each unit work on its task and improve the overall output efficiency of the processor when compared with an imaginary non-pipelined processor.
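For readers who want to experiment, here is a minimal sketch that regenerates the table above for any list of instructions. It assumes the same idealized three-unit processor, with every instruction moving one unit to the right on each clock tick. The function name `pipeline_schedule` is an illustration only.

```python
UNITS = ["Load", "Integer", "Store"]


def pipeline_schedule(instructions, cycles):
    """Return one row per clock cycle; each row shows which
    instruction (or "-") each unit is holding on that cycle."""
    rows = []
    for cycle in range(cycles):
        row = []
        for stage in range(len(UNITS)):
            # The instruction in this unit entered the pipeline
            # `stage` cycles ago.
            index = cycle - stage
            in_range = 0 <= index < len(instructions)
            row.append(instructions[index] if in_range else "-")
        rows.append(row)
    return rows


schedule = pipeline_schedule(["A", "B", "C", "D"], cycles=4)

print("Cycle  " + "  ".join(f"{unit:<7}" for unit in UNITS))
for number, row in enumerate(schedule, start=1):
    print(f"{number:<5}  " + "  ".join(f"{slot:<7}" for slot in row))
```

Asking for more cycles than four shows the pipeline draining: the Load unit goes idle first, then Integer, then Store.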

However, there is a nasty problem with the above example. Let's say our little 8MHz processor needed to have a specific result from one of its calculations before it could make the next calculation. It would have to wait for that result to be written to memory before it could start on the next. Let's look at this on the flow diagram -- process Y requires data from process X.

Cycle   Load                             Integer   Store
1       X                                -         -
2       (can't load Y, no result yet)    X         -
3       -                                -         X
4       Y                                -         -

Y then gets integer calculated and Z is loaded in the next cycle.

Do you see the inefficiency here? We're back to three clock cycles per process. There are techniques to get around this bottleneck -- writing more logical code for the processor (optimization), adding different units on the processor that handle different things, etc. Nevertheless, every processor has a pipeline that must be managed. The main point is that longer pipelines can result in even greater inefficiencies if things don't go exactly to plan. More complex calculations that require multiple steps can bog down performance further.
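The cost of such a stall can be counted with a small sketch. It assumes the article's simplified rule: an instruction that depends on the previous one cannot be loaded until the cycle after that result has been stored. The function name `total_cycles` and the `depth` parameter are hypothetical.

```python
def total_cycles(depends_on_previous, depth=3):
    """Count the clock cycles needed to finish a run of instructions.

    depends_on_previous holds one boolean per instruction: True means
    the instruction needs the stored result of the one before it.
    depth is the number of pipeline steps (load, calculate, store).
    """
    if not depends_on_previous:
        return 0
    load_cycle = 1  # the first instruction is loaded on cycle 1
    for dependent in depends_on_previous[1:]:
        if dependent:
            # Wait for the previous result to be stored, then load.
            load_cycle += depth
        else:
            # Independent work enters the pipeline on the next cycle.
            load_cycle += 1
    # The last instruction still has to pass through every step.
    return load_cycle + depth - 1


independent = total_cycles([False, False, False, False])
chained = total_cycles([False, True, True, True])
deep_chain = total_cycles([False, True, True, True], depth=16)

print("Four independent instructions:", independent, "cycles")
print("Four chained instructions:    ", chained, "cycles")
print("Same chain, 16-step pipeline: ", deep_chain, "cycles")
```

Four independent instructions finish in six cycles, while a chain of four dependent ones takes twelve, or three cycles each. Raising `depth` shows why a longer pipeline pays a bigger penalty for each stall.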

Eventually, technology advanced so that multiple instructions could be worked on at the same time. Think of the above charts being eight or sixteen steps deep to get an idea of this complexity. With each additional unit added to the pipeline, each unit had a more and more specialized task. Such a specialized unit can do things more quickly (higher clock frequency), but at a penalty of greater inefficiencies if anything holds up the whole process.

With pipelining, "you could run the clock faster because you were doing less work on each thing that you were doing," Glaskowsky said. "You could do two things separately. This doubled clock speed, but that, in and of itself, doesn't double the amount of work you're getting done. What doubles this is how many different things are going on at the same time, or parallelism. That's the other major factor in getting a certain performance from a CPU and what's being sort of glossed over when only MHz is being discussed."

Parallelism is a method of attacking highly processor-intensive tasks. Essentially, you build your processor (or the execution units on your processor) to quickly divide up processes and resolve them at the same time. For a simplified example, let's turn all the pixels on our screen from white to blue. I could have four units of my processor all working on the problem at once in parallel. It doesn't matter what the result of the previous calculation was because we know it -- the pixel changed from white to blue. So, we never have to worry about what the other units are doing and slow down to wait for one or the other. The net result is that I can get all the pixels on the screen blue four times faster with four units working on the problem in parallel than with one unit working on the problem linearly.
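The pixel example can be sketched as a simple cycle count. This is a simulation rather than real parallel hardware: it assumes each unit paints one pixel per cycle, and the 640 by 480 screen size is an arbitrary choice for illustration.

```python
WHITE, BLUE = "white", "blue"


def paint_blue(pixels, units):
    """Paint every pixel blue, `units` pixels per cycle.
    Returns the number of cycles the job took."""
    cycles = 0
    for start in range(0, len(pixels), units):
        # Each unit takes one pixel from this batch. No unit needs
        # another unit's result, so nothing ever has to wait.
        for i in range(start, min(start + units, len(pixels))):
            pixels[i] = BLUE
        cycles += 1
    return cycles


screen = [WHITE] * (640 * 480)  # hypothetical screen size

one_unit = paint_blue(list(screen), units=1)
four_units = paint_blue(list(screen), units=4)

print("One unit:  ", one_unit, "cycles")
print("Four units:", four_units, "cycles")
print("Speedup:   ", one_unit / four_units)
```

The clean fourfold speedup depends on the pixels being independent of one another. Work where each step needs the previous result would not divide up this neatly.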

For the majority of things that most computers do (integer operations for drawing, taking your keystrokes, running spell checkers, etc.), different chip architectures such as the PowerPC and Pentium do about the same amount of work per clock period. Looked at this way, Intel chips are faster than their PowerPC rivals in proportion to their higher clock speeds, Glaskowsky said.

"So a Pentium III running at 1 GHz, for a lot of functions, really is twice as fast as, say, a 500 MHz G4. But for most of the types of tasks described, extra speed beyond a certain point doesn't really make much difference," said Glaskowsky. For example, both a 500MHz G4 and a 1GHz Pentium are regularly waiting on you to type. "The difference between 500MHz and 1 GHz isn't terribly important for most programs. The emphasis on performance today has shifted to multimedia processing that involve things such as digital video and MP3 encoding."

For this sort of work, both the Pentium III and the G4 have extra circuitry designed to accelerate multimedia functions: MMX extensions on the Intel chip and the AltiVec/Velocity Engine on the PowerPC.

"MMX and AltiVec are very different," Glaskowsky said. "And for the things that these units do, the AltiVec does about twice as much work per clock cycle as the Pentium 3. In some ways, it's even a little more efficient than that. This means that, for many multimedia tasks, the G4 at 500 MHz is effectively just as fast as a Pentium 3 at 1 GHz. One caveat is that you can find examples of code that may be better for one or the other." Both MMX and AltiVec take advantage of specific improvements derived from parallel processing, and both require specifically written code to derive a performance benefit.
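Glaskowsky's comparison boils down to multiplying clock speed by the work done per clock cycle. The sketch below uses his rough figures: equal work per cycle for everyday integer tasks, and about twice as much per cycle for AltiVec on multimedia tasks. The numbers are relative units for illustration, not benchmark results.

```python
def effective_rate(clock_mhz, work_per_cycle):
    """Relative throughput: clock speed times work done per cycle."""
    return clock_mhz * work_per_cycle


# Everyday integer work: about the same work per cycle on both chips.
g4_integer = effective_rate(500, 1.0)
p3_integer = effective_rate(1000, 1.0)

# Multimedia work: AltiVec assumed to do about twice the work
# per cycle of the Pentium's multimedia unit.
g4_multimedia = effective_rate(500, 2.0)
p3_multimedia = effective_rate(1000, 1.0)

print("Integer work, Pentium vs. G4:   ", p3_integer / g4_integer)
print("Multimedia work, Pentium vs. G4:", p3_multimedia / g4_multimedia)
```

On integer work the 1 GHz chip comes out twice as fast, while on multimedia work the two come out even. This is why MHz alone says little until you know how much gets done on each cycle.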

You can also boost system performance by adding dedicated secondary processors. These processors are a part of the embedded processor market -- processors designed for a specific consumer, industrial or commercial task. Uses range from cell phone controllers to anti-lock brakes to computers. However, development of these products is rarely influenced by consumer opinion. A proper ratio of performance, heat generation, power consumption and price determines the clock frequency of these processors.

For example, some tasks can be offloaded to graphics processors such as products from NVIDIA, ATI, and others. Dan Vivoli at NVIDIA credits the rapid growth of performance and physical size of the processors used in 3D to the fact that the problem of rendering 3D images can be readily handled by parallel processing. The GeForce3 is physically larger and has millions more transistors than the high-end G4, but the GeForce3 is designed to do one thing quickly -- render 3D images. And the reason it is so big and so speedy is parallel processing and dedicated design.

The embedded chip market may have the right idea. Whereas the desktop market has fallen prey to the marketing and public relations hype over MHz, in the embedded market there's more of a focus on real performance and how the embedded chip works in real life conditions. And no one complains that graphics cards from companies such as ATI and NVIDIA are slow.

Unfortunately, this level of specialization comes at a price. While specialized processors are incredibly efficient at what they do, they can't do much else. Graphics controllers usually run at around 250MHz on the high end and have good memory systems next to them to 'shove' data. One of the major bottlenecks in graphics is actually getting the data to the processor.

The good thing about the computer processor is that it can do basically everything, but it can't do everything extremely well. Some things it takes hits on, some it doesn't. That's why it's almost impossible to equitably compare the different types of processors. Different ones excel at different tasks.

Memory is one of the most important components of a computer -- and one of the most overlooked aspects of having a fast system. The microprocessor reads and writes to RAM constantly, so slow RAM can hit you coming and going. Having too little RAM can also adversely affect performance, as data will need to come from even slower components such as the hard drive. You can never have too much RAM or too fast RAM.

The memory bus itself -- a common pathway, or channel, between multiple devices -- is critical because it determines how fast the processor "talks" to the computer memory. In fact, it's probably second only in importance to the processor itself for the performance of an overall system.

Other factors that affect speed are the overhead of your particular operating system, the way applications are compiled, the amount of cache (a small, high-speed memory that holds frequently used data), and the type of system bus.

To oversimplify things, any processor can be made to run at any speed, but perhaps not effectively. The modern processor, with its dozens of execution units, is like a symphony. Different "instruments," or execution units, come together at certain times. Some have to wait a cycle or several cycles, and some are always working. The conductor of this symphony is the processor clock. Its job is to tirelessly beat out a time so that all of the other components don't trip over each other.

A faster clock often does result in faster performance, but difficulties arise when the processor's pipeline is extended simply to increase the clock frequency of the processor. What makes the best processor is a balance of pipeline efficiency, parallel processing, and clock frequency. Finally, the processor is only one part of a whole computer system. Factors such as memory speed, the operating system and other components also affect a whole computer's performance. Remember to evaluate the whole computer and not just the processor. Although the processor is a critical part, it is not the sole factor in a system's performance.

Wellington Macintosh Society Inc. 2002