G.1  Introduction: Exploiting Instruction-Level Parallelism Statically
G.2  Detecting and Enhancing Loop-Level Parallelism
G.3  Scheduling and Structuring Code for Parallelism
G.4  Hardware Support for Exposing Parallelism: Predicated Instructions
G.5  Hardware Support for Compiler Speculation
G.6  The Intel IA-64 Architecture and Itanium Processor
G.7  Concluding Remarks
Appendix G  Hardware and Software for VLIW and EPIC
The EPIC approach is based on the application of massive resources.
These resources include more load-store, computational, and branch
units, as well as larger, lower-latency caches than would be required for
a superscalar processor. Thus, IA-64 gambles that, in the future, power
will not be the critical limitation, and that massive resources, along with
the machinery to exploit them, will not penalize performance with their
adverse effect on clock speed, path length, or CPI factors.
M. Hopkins [2000], in a commentary on the EPIC approach and the IA-64 architecture
G.1  Introduction: Exploiting Instruction-Level Parallelism Statically
In this appendix we discuss compiler technology for increasing the amount of
parallelism that we can exploit in a program as well as hardware support for
these compiler techniques. The next section defines when a loop is parallel, how
a dependence can prevent a loop from being parallel, and techniques for
eliminating some types of dependences. The following section discusses the topic
of scheduling code to improve parallelism. These two sections serve as an
introduction to these techniques.
We do not attempt to explain the details of ILP-oriented compiler techniques,
since that would take hundreds of pages, rather than the 20 we have allotted.
Instead, we view this material as providing general background that will enable
the reader to have a basic understanding of the compiler techniques used to
exploit ILP in modern computers.
Hardware support for these compiler techniques can greatly increase their
effectiveness, and Sections G.4 and G.5 explore such support. The IA-64
represents the culmination of the compiler and hardware ideas for exploiting
parallelism statically and includes support for many of the concepts proposed by
researchers during more than a decade of research into the area of compiler-based
instruction-level parallelism. Section G.6 is a description and performance
analysis of the Intel IA-64 architecture and its second-generation implementation,
Itanium 2.
The core concepts that we exploit in statically based techniques—finding par-
allelism, reducing control and data dependences, and using speculation—are the
same techniques we saw exploited in Chapter 2 using dynamic techniques. The
key difference is that the techniques in this appendix are applied at compile time
by the compiler, rather than at run time by the hardware. The advantages of
compile time techniques are primarily two: they do not burden run time execution
with any inefficiency, and they can take into account a wider range of the
program than a run time approach might be able to incorporate. As an example of the
latter, the next section shows how a compiler might determine that an entire loop
can be executed in parallel; hardware techniques might or might not be able to
find such parallelism. The major disadvantage of static approaches is that they
can use only compile time information. Without run time information, compile
time techniques must often be conservative and assume the worst case.
G.2  Detecting and Enhancing Loop-Level Parallelism
Loop-level parallelism is normally analyzed at the source level or close to it,
while most analysis of ILP is done once instructions have been generated by the
compiler. Loop-level analysis involves determining what dependences exist
among the operands in a loop across the iterations of that loop. For now, we will
consider only data dependences, which arise when an operand is written at some
point and read at a later point. Name dependences also exist and may be removed
by renaming techniques like those we explored in Chapter 2.
The analysis of loop-level parallelism focuses on determining whether data
accesses in later iterations are dependent on data values produced in earlier
iterations; such a dependence is called a loop-carried dependence. Most of the
examples we considered in Section 2.2 have no loop-carried dependences and,
thus, are loop-level parallel. To see that a loop is parallel, let us first look
at the source representation:

    for (i=1000; i>0; i=i-1)
        x[i] = x[i] + s;

In this loop, there is a dependence between the two uses of x[i], but this
dependence is within a single iteration and is not loop carried. There is a
dependence between successive uses of i in different iterations, which is loop
carried, but this dependence involves an induction variable and can be easily
recognized and eliminated. We saw examples of how to eliminate dependences
involving induction variables during loop unrolling in Section 2.2, and we will
look at additional examples later in this section.
Because finding loop-level parallelism involves recognizing structures such
as loops, array references, and induction variable computations, the compiler can
do this analysis more easily at or near the source level, as opposed to the
machine-code level. Let's look at a more complex example.

Example  Consider a loop like this one:

    for (i=1; i<=100; i=i+1) {
        A[i+1] = A[i] + C[i];    /* S1 */
        B[i+1] = B[i] + A[i+1];  /* S2 */
    }

Assume that A, B, and C are distinct, nonoverlapping arrays. (In practice, the
arrays may sometimes be the same or may overlap. Because the arrays may be
passed as parameters to a procedure, which includes this loop, determining
whether arrays overlap or are identical often requires sophisticated,
interprocedural analysis of the program.) What are the data dependences among
the statements S1 and S2 in the loop?

Answer  There are two different dependences:

1. S1 uses a value computed by S1 in an earlier iteration, since iteration i
   computes A[i+1], which is read in iteration i+1. The same is true of S2 for
   B[i] and B[i+1].

2. S2 uses the value, A[i+1], computed by S1 in the same iteration.
These two dependences are different and have different effects. To see how
they differ, let’s assume that only one of these dependences exists at a time.
Because the dependence of statement S1 is on an earlier iteration of S1, this
dependence is loop carried. This dependence forces successive iterations of this
loop to execute in series.
The second dependence (S2 depending on S1) is within an iteration and is not
loop carried. Thus, if this were the only dependence, multiple iterations of the
loop could execute in parallel, as long as each pair of statements in an iteration
were kept in order. We saw this type of dependence in an example in Section 2.2,
where unrolling was able to expose the parallelism.
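As a sketch of that idea (an illustration added here, echoing the Section 2.2 treatment rather than reproducing it), unrolling the earlier loop on x[i], which has no loop-carried dependence apart from the induction variable, makes the available parallelism explicit:

    for (i=1000; i>0; i=i-4) {
        x[i]   = x[i]   + s;   /* the four updates touch distinct   */
        x[i-1] = x[i-1] + s;   /* elements, so they are independent */
        x[i-2] = x[i-2] + s;   /* of one another and can be issued  */
        x[i-3] = x[i-3] + s;   /* in parallel                       */
    }

Only the single update of the induction variable i remains loop carried, and it can be eliminated as described in Section 2.2.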
It is also possible to have a loop-carried dependence that does not prevent
parallelism, as the next example shows.
Example  Consider a loop like this one:

    for (i=1; i<=100; i=i+1) {
        A[i] = A[i] + B[i];    /* S1 */
        B[i+1] = C[i] + D[i];  /* S2 */
    }

What are the dependences between S1 and S2? Is this loop parallel? If not, show
how to make it parallel.
Answer  Statement S1 uses the value assigned in the previous iteration by statement S2, so
there is a loop-carried dependence between S2 and S1. Despite this loop-carried
dependence, this loop can be made parallel. Unlike the earlier loop, this
dependence is not circular: neither statement depends on itself, and although S1
depends on S2, S2 does not depend on S1. A loop is parallel if it can be written
without a cycle in the dependences, since the absence of a cycle means that the
dependences give a partial ordering on the statements.
Although there are no circular dependences in the above loop, it must be
transformed to conform to the partial ordering and expose the parallelism. Two
observations are critical to this transformation:
1. There is no dependence from S1 to S2. If there were, then there would be a
   cycle in the dependences and the loop would not be parallel. Since this other
   dependence is absent, interchanging the two statements will not affect the
   execution of S2.

2. On the first iteration of the loop, statement S1 depends on the value of B[1]
   computed prior to initiating the loop.
These two observations allow us to replace the loop above with the following
code sequence:
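(Sketched below from the two observations: the statements are interchanged, and the first execution of S1 and the last execution of S2 are peeled out of the loop.)

    A[1] = A[1] + B[1];
    for (i=1; i<=99; i=i+1) {
        B[i+1] = C[i] + D[i];        /* old S2, for i = 1 to 99  */
        A[i+1] = A[i+1] + B[i+1];    /* old S1, for i = 2 to 100 */
    }
    B[101] = C[100] + D[100];

The dependence of S1 on S2 is now within a single iteration of the new loop, so it is no longer loop carried and the iterations of the transformed loop can execute in parallel.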