
Each of these stages should take one cycle to complete. Unfortunately, if the data isn't already in a register, it must be requested from the CPU cache or from system RAM. That is a lot slower, adding dozens or hundreds of clock cycles of latency, and in the meantime everything else has to wait, as no other data or instructions can be processed. This type of processor design is called subscalar, because it completes less than one instruction per clock cycle. In a subscalar processor with no pipeline, each part of each instruction is executed in order, and a design like this can only ever achieve less than one completed instruction per cycle.

Pipelining to scalar

A scalar processor can be achieved by applying a pipeline to the design. Each of the five stages of an instruction's execution runs in a different bit of hardware in the actual processor core, so if you're careful with the data you feed into the hardware for each stage, you can keep all of them busy every cycle. In a perfect world, this could lead to a 5x speed-up and a perfectly scalar processor, completing one full instruction per cycle. In a scalar pipelined processor, each stage of an instruction's execution can be performed once per clock cycle, which allows a maximum throughput of one completed instruction per cycle.
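To put a rough number on that 5x figure, here is a minimal Python sketch (my own illustration, not part of the original article) that counts cycles for a five-stage design with and without a pipeline. The stage names and the assumption that every stage always takes exactly one cycle with no stalls are simplifications.

```python
# Toy cycle count for executing N instructions on a five-stage design.
# Assumes every stage takes exactly one cycle and there are no stalls.
STAGES = ["fetch", "decode", "execute", "memory", "writeback"]

def cycles_without_pipeline(n_instructions: int) -> int:
    # Subscalar: each instruction occupies the core for all five stages.
    return n_instructions * len(STAGES)

def cycles_with_pipeline(n_instructions: int) -> int:
    # Scalar pipeline: once the pipe is full, one instruction finishes per cycle.
    return len(STAGES) + (n_instructions - 1)

if __name__ == "__main__":
    n = 1000
    print(cycles_without_pipeline(n))  # 5000 cycles
    print(cycles_with_pipeline(n))     # 1004 cycles, roughly 5x faster
```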


In reality, programs are complex and this reduces the throughput. For example, if you have two addition instructions, "a = b + c" and "d = e + f", these can be run through the pipeline with no issue. If, however, you have "a = b + c" followed by "d = a + e", you have a problem. With these two instructions directly after one another, the new value of "a" won't have been calculated, let alone written back to memory, before the second instruction reads the old value of "a" and gives the wrong answer for "d". This behaviour can be countered with the inclusion of a dispatcher that analyses upcoming instructions and ensures that no instruction dependent on another is run in too close succession. In effect, it runs the program in the "wrong" order to fix this, which works because many instructions don't necessarily rely on the result of the one immediately before them.
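As a rough picture of the check such a dispatcher performs, the following Python sketch (an illustration, not anything from the article) flags the read-after-write dependency in the example above. Real hardware compares register identifiers and uses forwarding logic, not strings.

```python
# Each instruction is written as (destination, (source1, source2)), e.g. a = b + c.
def has_raw_hazard(first, second):
    """True if `second` reads a value that `first` has not yet written back
    (a read-after-write dependency)."""
    dest, _ = first
    _, sources = second
    return dest in sources

add1 = ("a", ("b", "c"))   # a = b + c
add2 = ("d", ("e", "f"))   # d = e + f
add3 = ("d", ("a", "e"))   # d = a + e

print(has_raw_hazard(add1, add2))  # False: safe to run back to back
print(has_raw_hazard(add1, add3))  # True: the dispatcher must delay or reorder
```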

Expanding the pipeline to superscalar

A superscalar processor is capable of running more than one full instruction per cycle. One way of doing this is by expanding the pipeline so that there are two or more bits of hardware that can handle each stage; this way, two instructions can be in each stage of the pipeline in every cycle. This obviously increases design complexity, as hardware is duplicated, but it offers excellent performance scaling possibilities.
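Building on the same toy representation, the sketch below illustrates the basic idea of a dual-issue front end: issue up to two instructions per cycle, but never pair an instruction with one it depends on. The grouping rule is deliberately simplistic and only meant to show the principle.

```python
# Instructions as (destination, (source1, source2)), e.g. a = b + c.
def depends_on(later, earlier):
    # True if `later` reads the register that `earlier` writes (read-after-write).
    return earlier[0] in later[1]

def dual_issue(program):
    """Group instructions into issue slots of at most two per cycle,
    never pairing an instruction with one it depends on."""
    cycles = []
    i = 0
    while i < len(program):
        slot = [program[i]]
        # Pair the next instruction only if it is independent of this one.
        if i + 1 < len(program) and not depends_on(program[i + 1], program[i]):
            slot.append(program[i + 1])
            i += 2
        else:
            i += 1
        cycles.append(slot)
    return cycles

program = [("a", ("b", "c")), ("d", ("e", "f")), ("d", ("a", "e"))]
print(len(dual_issue(program)))  # 2 cycles for 3 instructions instead of 3
```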

With the increasing number of digital products in the market, the need for robust and highly configurable processors rises. This demand is met by the stable and extensible open-source RISC-V instruction set architecture, and RISC-V processors are becoming popular in many fields of application and research. This thesis presents a dual-issue superscalar RISC-V processor design with dynamic execution. The proposed design employs the global sharing scheme for branch prediction and the Tomasulo algorithm for out-of-order execution, and the processor is capable of speculative execution with five checkpoints. Data flow in the instruction dispatch and commit stages is optimized to achieve higher instruction throughput. The superscalar processor is extended with a customized vector instruction set for single-instruction-multiple-data computation, specifically to improve performance on machine-learning tasks. Following the definition of the proposed vector instruction set, scratchpad memory and element-wise arithmetic units are implemented in the vector co-processor.

Different test programs were evaluated on the fully tested superscalar processor. Compared with the reference work, the proposed design improves average instruction throughput by 18.9% and average prediction hit rate by 4.92%, with a 16.9% higher operating clock frequency when synthesized for the Intel Arria 10 FPGA board. The forward propagation of a convolutional neural network model was evaluated on the standalone superscalar processor and on its integration with the vector co-processor. The vector program with software-level optimizations achieves a 9.53× improvement in instruction throughput and a 10.18× improvement in real-time throughput. Moreover, the integration also provides 2.22× energy efficiency compared with the superscalar processor alone.
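The "global sharing scheme" mentioned in the abstract is usually taken to mean a gshare-style branch predictor, in which the global branch history is XORed with low program-counter bits to index a table of 2-bit saturating counters. The Python sketch below is a textbook illustration of that idea, not the predictor actually implemented in the thesis.

```python
# Toy gshare branch predictor: global history XOR low PC bits indexes
# a table of 2-bit saturating counters. Illustrative only.
class GsharePredictor:
    def __init__(self, index_bits=10):
        self.index_bits = index_bits
        self.table = [1] * (1 << index_bits)  # start weakly not-taken
        self.history = 0

    def _index(self, pc):
        mask = (1 << self.index_bits) - 1
        return ((pc >> 2) ^ self.history) & mask

    def predict(self, pc):
        # Counter values 2 and 3 mean "predict taken".
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
        # Shift the actual outcome into the global history register.
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.index_bits) - 1)

predictor = GsharePredictor()
print(predictor.predict(0x400080))       # initial prediction: not taken
predictor.update(0x400080, taken=True)   # train on the real outcome
```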
