CLK Cycle

Discussion in 'A+' started by Mof, Mar 25, 2008.

  1. Mof

    Mof Megabyte Poster

    I understand clock speeds, but why is there a minimum of two cycles per instruction?
     
    WIP: C++ and A+
  2. dmarsh

    dmarsh Petabyte Poster

    There isn't :biggrin

    What processor are you talking about? The number of cycles depends on the design of the processor, the relative complexity of the operation, and which storage areas, if any, the instruction references.


    With a superscalar design it can take time to fill the pipeline; an instruction may take one cycle to decode and one to execute, so a pipeline stall could cause a two-cycle duration. Normally there are stages like fetch, decode, execute and store in the pipeline. Each stage could potentially be subdivided, making a longer pipeline, or stages could be merged.

    http://en.wikipedia.org/wiki/Pipeline_%28computing%29
    http://en.wikipedia.org/wiki/Superscalar
    http://en.wikipedia.org/wiki/CPU_cache
    http://arstechnica.com/articles/paedia/cpu/p4andg4e.ars
    http://www.pcmech.com/article/pentium-4-calculation-controversy/
    http://www.hardwaresecrets.com/article/270/4
    http://softwarecommunity.intel.com/articles/eng/3089.htm

    The pipeline approach usually means one clock cycle per pipeline stage, so a longer pipeline takes longer to fill, and after a stall even a simple instruction can take many cycles to complete. With no stalls the processor can potentially process one or more instructions per cycle.

    With non-superscalar processors there's no reason something like a NOP or a register XOR can't take one cycle in absolute terms.
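    If you want to convince yourself, here's a rough C sketch (my own illustration, not from any book) that uses the x86 timestamp counter via the GCC/Clang __rdtsc() intrinsic to time a long chain of cheap register operations. It includes loop overhead, and on modern CPUs the TSC doesn't necessarily tick at core frequency, so treat the numbers as a rough indication only:

        #include <stdio.h>
        #include <stdint.h>
        #include <x86intrin.h>   /* GCC/Clang header providing __rdtsc() */

        int main(void)
        {
            const int n = 100000000;
            uint64_t x = 1;

            uint64_t start = __rdtsc();
            for (int i = 0; i < n; i++)
                x ^= (uint64_t)i;            /* a cheap register-to-register XOR */
            uint64_t end = __rdtsc();

            /* printing x stops the compiler optimising the loop away */
            printf("x=%llu, ~%.2f ticks per iteration\n",
                   (unsigned long long)x, (double)(end - start) / n);
            return 0;
        }

    On a recent out-of-order core you'll typically see something close to one tick per iteration, because the XOR overlaps with the loop housekeeping.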
     
  3. Mof

    Mof Megabyte Poster

    Reading Mike Meyers: the 'man in the box' analogy states there is a minimum of two cycles. I suppose it means one cycle to take the instruction off the EDB and a second to place the result back on the EDB.
     
    WIP: C++ and A+
  4. dmarsh

    dmarsh Petabyte Poster

    EDB ? :blink

    It depends on the design of the system; please bear in mind the PC is just a bunch of circuits somebody designed. It's not the only type of computer, and even PCs differ widely in architecture and processors.

    A processor can be designed so that all its operations take one clock cycle if desired.

    Processors, however, often have expensive and cheap operations. Various optimisations have been designed to get the most out of the processor; this normally means making expensive operations take multiple cycles while cheap operations take one cycle.

    Then pipelining, superscalar, branch prediction, caching, SMP, hyperthreading, dual core, etc. came along, further complicating the issue...

    The question is meaningless without some context: are you talking about best or worst conditions? When the pipeline and cache are fully loaded? What exact processor are you talking about?

    An instruction's result need never leave the processor if it is stored in a register, so the 'store' phase can be cheap.

    Any instruction involving memory will of course take many cycles, due to the inherent latency of slower memory and bus speeds.

    I imagine he's talking about the decode and execute stages of the pipeline, when there's no fetch or store and it's a simple instruction, but it's hard to be certain...
     
  5. Fergal1982

    Fergal1982 Petabyte Poster

    Doesn't it take one cycle to put the data into the processor and a second cycle to get it out again? So each cycle looks like this:

    Cycle: Previous instruction output / next instruction input

    If that's the case (and I seemed to think it was), then it would indeed take (or at least appear to take) a minimum of two cycles.
     
    Certifications: ITIL Foundation; MCTS: Visual Studio Team Foundation Server 2010, Administration
    WIP: None at present
  6. dmarsh

    dmarsh Petabyte Poster

    No Fergal, to my knowledge that's not correct.

    An instruction can take a value from a register, negate it and store it back in the same register for instance.

    Depending on the design of the processor this could take 1 to N cycles; it's all down to the design...

    There does not have to be a pipeline in the externally visible sense.

    If there is an externally visible pipeline, then the stages will typically take one cycle each.

    That's my understanding, but I've hardly coded any assembler in 13+ years and my memories of microprocessor architecture lectures are pretty fuzzy...

    Getting data in or out of the processor takes MANY cycles :-

    http://chip-architect.com/news/2000_10_13_Willamette_MPF.html

    http://www.intel.com/technology/itj/q12001/pdf/art_2.pdf

    (This is a quote from Pentium II days, before a lot of the advanced caching and new memory architectures!)

    This puts latency at 100-800 cycles; it really depends what you are measuring: fully cached best-case performance, general performance or worst-case performance. With today's processors we can only really give statistics, not absolutes.

    http://www.digit-life.com/articles2/cpu/rmma-p4-latency.html

    Your explanation is woefully oversimplified, and even my explanations miss all the advanced features of modern processors. The caching, instruction re-ordering and branch prediction logic is quite complex...
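    A rough way to see that memory latency for yourself is a pointer-chasing loop, where every load depends on the previous one so nothing can overlap. This is just a sketch of the classic technique (same __rdtsc() caveats as in my earlier snippet; Sattolo's shuffle makes the chain one big cycle the prefetcher can't easily guess, and rand() is crude but fine for a demo):

        #include <stdio.h>
        #include <stdlib.h>
        #include <stdint.h>
        #include <x86intrin.h>   /* __rdtsc() */

        int main(void)
        {
            const size_t n = 1 << 22;               /* ~4M entries (32 MB), far bigger than cache */
            size_t *next = malloc(n * sizeof *next);
            if (!next) return 1;

            /* Sattolo's algorithm: a random permutation forming one single cycle */
            for (size_t i = 0; i < n; i++) next[i] = i;
            for (size_t i = n - 1; i > 0; i--) {
                size_t j = (size_t)rand() % i;      /* j < i, never j == i */
                size_t t = next[i]; next[i] = next[j]; next[j] = t;
            }

            size_t p = 0;
            uint64_t start = __rdtsc();
            for (size_t i = 0; i < n; i++)
                p = next[p];                        /* each load depends on the last */
            uint64_t end = __rdtsc();

            printf("p=%zu, ~%.0f ticks per dependent load\n",
                   p, (double)(end - start) / n);
            free(next);
            return 0;
        }

    Once the array spills out of cache you should see tens to hundreds of ticks per load, which lines up with the 100-800 cycle figures above.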
     
  7. Fergal1982

    Fergal1982 Petabyte Poster

    Meh, I don't really know; my understanding of processors is fairly limited to be honest - I've always been much more software based. That's just how I always seemed to think of it.
     
    Certifications: ITIL Foundation; MCTS: Visual Studio Team Foundation Server 2010, Administration
    WIP: None at present
  8. dmarsh

    dmarsh Petabyte Poster

    It's all code, isn't it? :biggrin

    How would you write a JIT or an assembler without this knowledge?
     
  9. Fergal1982

    Fergal1982 Petabyte Poster

    Simple answer? I wouldn't. When I come to a point where I need to learn it, I'll learn it, but right now I have no reason to do so, and I haven't had need to up until this point.
     
    Certifications: ITIL Foundation; MCTS: Visual Studio Team Foundation Server 2010, Administration
    WIP: None at present
  10. Mof

    Mof Megabyte Poster

    I believe he's talking about the 8088.
     
    WIP: C++ and A+
  11. hbroomhall

    hbroomhall Petabyte Poster Gold Member

    I remember the 6502 had just one clock cycle for some instructions.

    The problem with the 8086 line of processors is that they have changed dramatically down the years, which is hardly surprising. So any general description of how they work has to be fairly simplistic!

    Harry.
     
    Certifications: ECDL A+ Network+ i-Net+
    WIP: Server+
  12. dmarsh

    dmarsh Petabyte Poster

    I guess you could have been describing the Fetch-Decode-Execute (FDX) cycle :-

    http://en.wikipedia.org/wiki/Instruction_cycle

    Here's a good summary at last :-

    http://en.kioskea.net/pc/processeur.php3

    The FDX cycle does not have to occur in one clock cycle in a pipelined design; the 'cycle' is used in the metaphorical sense, because it's a continuous process.
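    To make the FDX idea concrete, here's a toy interpreter in C: a made-up three-instruction machine with hypothetical opcodes, purely illustrative and nothing like any real processor's instruction set:

        #include <stdio.h>
        #include <stdint.h>

        enum { HALT = 0, LOAD = 1, ADD = 2 };      /* hypothetical opcodes */

        int main(void)
        {
            /* program: LOAD 40, ADD 2, HALT -> accumulator ends at 42 */
            uint8_t mem[] = { LOAD, 40, ADD, 2, HALT };
            int pc = 0, acc = 0, running = 1;

            while (running) {
                uint8_t op = mem[pc++];            /* fetch */
                switch (op) {                      /* decode */
                case LOAD: acc  = mem[pc++]; break;   /* execute */
                case ADD:  acc += mem[pc++]; break;
                case HALT: running = 0;      break;
                }
            }
            printf("acc = %d\n", acc);             /* prints acc = 42 */
            return 0;
        }

    In real hardware those fetch/decode/execute steps are overlapped by the pipeline rather than running strictly one after another.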

    Other design ideas like superscalar mean that under optimum conditions many instructions can be performed in parallel in one cycle.

    One thing the processor can be fairly certain of is that there will be more instructions, so there is a fetch buffer or instruction cache, and much work in recent years has gone into branch prediction, because a branch can make most of the instruction cache and the pipeline state irrelevant. Some instructions can also be performed in parallel or out of order to optimise performance; this is a sort of hardware parallelism where the processor determines the synchronisation points. That is necessary to make use of a superscalar design with multiple execution units.
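    The branch prediction point is easy to demonstrate with the old sorted-versus-unsorted trick; this rough sketch (not tied to any particular processor) sums only the 'large' elements of an array. With random data the branch goes either way about half the time and the predictor struggles; uncomment the qsort and the very same loop typically runs several times faster:

        #include <stdio.h>
        #include <stdlib.h>

        static int cmp(const void *a, const void *b)
        {
            return *(const int *)a - *(const int *)b;
        }

        int main(void)
        {
            enum { N = 1 << 20 };
            static int data[N];
            for (int i = 0; i < N; i++)
                data[i] = rand() % 256;

            /* qsort(data, N, sizeof data[0], cmp);   <- try with and without */

            long sum = 0;
            for (int pass = 0; pass < 100; pass++)
                for (int i = 0; i < N; i++)
                    if (data[i] >= 128)              /* ~50/50 branch on random data */
                        sum += data[i];

            printf("sum = %ld\n", sum);
            return 0;
        }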

    Processors have indeed changed a lot since the days of the 8086 and 68000, which were the processors in vogue when I was learning assembler. The RISC designs did take a different approach, making for easier-to-understand assembler in my mind.

    http://www.gamasutra.com/features/wyatts_world/19990528/pentium3_04.htm

    This shows that SIMD instructions can indeed be issued once per cycle under optimum conditions.

    http://homes.esat.kuleuven.be/~cosicart/pdf/AB-9600.pdf

     
  13. Mathematix

    Mathematix Megabyte Poster

    You guys do realise that the fetch-decode-execute (or fetch-execute) cycle is very different from the number of cycles taken to execute an instruction, right?

    Unless there are parallel processes going on under the hood, the fetch-execute cycle will always take more than one clock cycle on a serial architecture, whereas a single assembly instruction like incrementing a value can be executed in one cycle.

    Rather than going into the detail of Intel vs. AMD architectures, which obscures the important information, I'd research the following:

    1. Reduced instruction set computers (RISC) - an example being the UNIX-based Sun SPARCstation. My favourite computer at university! Did all my programming on it. :biggrin
    2. Complex instruction set computers (CISC) - examples being the humble PCs that we know and love.

    3-2-1 - research! :dry
     
    Certifications: BSc(Hons) Comp Sci, BCS Award of Merit
    WIP: Not doing certs. Computer geek.
  14. dmarsh

    dmarsh Petabyte Poster

    Well, I did my best to explain it, but like I said, it's been a while! :oops:

    My original links and the latest ones detail the superscalar, pipelining and CISC vs RISC arguments that I think are most important for understanding the subject.

    The Pentium series of processors are hybrids: a CISC instruction set on a RISC core.

    Yes, I thought I explained to Fergal, again in a roundabout way, that FDX is not a clock cycle! In some early processors, as well as small microcontrollers, it's quite possible that the instruction cycle and the FDX cycle were closely linked, but there will normally be at least a two-stage pipeline.

    Linking the two concepts completely would effectively halve the performance of the microprocessor while not reducing the complexity or transistor count by very much.

    In summary, to answer the original question: I don't know what Mike Meyers was trying to say. The 68000 had a minimum instruction time, whereby the simplest instruction's length in cycles is multiplied by the time for one cycle. Maybe he was trying to say the simplest instruction takes two cycles, but as far as I can tell this is not true; it could be one or more depending on the circumstances. As has also been said, the PC's architecture has changed so much over the years that any comment that doesn't specify a particular architecture is meaningless.

    So yes, I'd agree with Math and say learn the theory, but that's not what the question was about; it specified a two-cycle detail. If Mike was talking about theory, the fetch-execute cycle and the two-stage pipeline it implies, then that's what he should have said! Note the Pentium series does not have a two-stage pipeline! This is the problem with certs: sometimes you really do need the theory to understand. The 'let's skip the computer science bit' approach doesn't always cut it.
     
  15. Fergal1982

    Fergal1982 Petabyte Poster

    Perhaps the OP could quote the passage in the AIO book which mentions the two-cycle minimum?
     
    Certifications: ITIL Foundation; MCTS: Visual Studio Team Foundation Server 2010, Administration
    WIP: None at present
  16. Mathematix

    Mathematix Megabyte Poster

    One instruction per cycle. :biggrin
     
    Certifications: BSc(Hons) Comp Sci, BCS Award of Merit
    WIP: Not doing certs. Computer geek.
  17. Fergal1982

    Fergal1982 Petabyte Poster

    Ah, but what I mean is that Mof could point out the section he is talking about. Perhaps then we might be able to shed better light on exactly what is being discussed.
     
    Certifications: ITIL Foundation; MCTS: Visual Studio Team Foundation Server 2010, Administration
    WIP: None at present
  18. dmarsh
    Honorary Member 500 Likes Award

    dmarsh Petabyte Poster

    (Mike Meyers)

    Personally, I would have said: "In theory, one unit of work can be performed by the processor per cycle. A cycle is one pulse of the system clock. The system clock is normally driven by a quartz crystal used as a square-wave signal generator. The speed of the system clock is measured in cycles per second, or Hertz, and is also used as a rough performance rating of the processor. Processors should be run at the manufacturer's approved speed rating. The commands a processor can execute are called instructions, and an instruction can take one or more clock cycles to complete."

    I don't know where he gets the number two from, or why he thinks it's significant in this context...
     
  19. Mathematix

    Mathematix Megabyte Poster

    In its simplest terms, it says that a logic gate's state can change in one CPU clock cycle, which is perfectly reasonable. I'll try to present an example of how state can change in one cycle.

    You guys recall two's complement, or 'subtraction by addition', when subtracting a pair of binary numbers? A complex instruction set machine would have one instruction to perform this:

    sub b, a

    The above is pseudo-assembly that says to subtract 'a' from 'b' and store the result in 'b'.

    Now, a reduced instruction set computer would break the above into something like:

    str a
    str b
    not a (this 'not' can be performed in one clock cycle!)
    add a, 1 (and maybe this as well; 'a' now holds its own negation)
    add b, a ('b' now holds b - a)


    This is a very rough example that illustrates why RISC machines were built for execution speed at the expense of longer assembly code. (Do not take this particular example as a real-world implementation, because it isn't!)

    Of course, subtraction is included in real RISC instruction sets. Intensive instructions like multiplication and division are the ones replaced with additions, subtractions, bit shifts, etc.
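    If you want to sanity-check the two's complement identity itself, a few lines of C will do it (just an illustration of the arithmetic, nothing processor-specific):

        #include <stdio.h>
        #include <stdint.h>

        int main(void)
        {
            int32_t a = 13, b = 42;
            int32_t via_sub = b - a;               /* the CISC-style single subtract */
            int32_t via_add = b + (~a + 1);        /* invert a, add one, then add */
            printf("%d %d\n", via_sub, via_add);   /* prints 29 29 */
            return 0;
        }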
     
    Certifications: BSc(Hons) Comp Sci, BCS Award of Merit
    WIP: Not doing certs. Computer geek.
  20. dmarsh

    dmarsh Petabyte Poster

    While it's true that a transistor can only change state once per cycle, this does not limit the complexity of the processing that can be done in one clock cycle, given enough transistors.

    This is the old RISC vs CISC argument

    http://cse.stanford.edu/class/sophomore-college/projects-00/risc/risccisc/

    It is now widely accepted that it's not really the 'reduced instruction set' concept that's important; it's the resulting simplicity in design and transistor count that allows for other optimisations. In some situations, like FPUs and GPUs, the more complex instructions and the transistor counts they require are justified. Larger programs can in themselves cause slowdowns, as the instruction cache must be bigger.

    Here is a more up-to-date definition :-

    http://arstechnica.com/cpu/4q99/risc-cisc/rvc-1.html
     
