G.1  Introduction: Exploiting Instruction-Level Parallelism Statically
G.2  Detecting and Enhancing Loop-Level Parallelism
G.3  Scheduling and Structuring Code for Parallelism
G.4  Hardware Support for Exposing Parallelism: Predicated Instructions
G.5  Hardware Support for Compiler Speculation
G.6  The Intel IA-64 Architecture and Itanium Processor
G.7  Concluding Remarks
Appendix G  Hardware and Software for VLIW and EPIC
The EPIC approach is based on the application of massive resources.
These resources include more load-store, computational, and branch
units, as well as larger, lower-latency caches than would be required for
a superscalar processor. Thus, IA-64 gambles that, in the future, power
will not be the critical limitation, and that massive resources, along with
the machinery to exploit them, will not penalize performance with their
adverse effect on clock speed, path length, or CPI factors.
M. Hopkins [2000], in a commentary on the EPIC approach and the IA-64 architecture
G.1  Introduction: Exploiting Instruction-Level Parallelism Statically
In this appendix we discuss compiler technology for increasing the amount of
parallelism that we can exploit in a program as well as hardware support for
these compiler techniques. The next section defines when a loop is parallel, how
a dependence can prevent a loop from being parallel, and techniques for
eliminating some types of dependences. The following section discusses the topic
of scheduling code to improve parallelism. These two sections serve as an
introduction to these techniques.
We do not attempt to explain the details of ILP-oriented compiler techniques,
since that would take hundreds of pages, rather than the 20 we have allotted.
Instead, we view this material as providing general background that will enable
the reader to have a basic understanding of the compiler techniques used to
exploit ILP in modern computers.
Hardware support for these compiler techniques can greatly increase their
effectiveness, and Sections G.4 and G.5 explore such support. The IA-64
represents the culmination of the compiler and hardware ideas for exploiting
parallelism statically and includes support for many of the concepts proposed by
researchers during more than a decade of research into the area of compiler-based
instruction-level parallelism. Section G.6 is a description and performance
analysis of the Intel IA-64 architecture and its second-generation implementation,
Itanium 2.
The core concepts that we exploit in statically based techniques—finding par-
allelism, reducing control and data dependences, and using speculation—are the
same techniques we saw exploited in Chapter 2 using dynamic techniques. The
key difference is that the techniques in this appendix are applied at compile time
by the compiler, rather than at run time by the hardware. The advantages of
compile time techniques are primarily two: they do not burden run time execution
with any inefficiency, and they can take into account a wider range of the
program than a run time approach might be able to incorporate. As an example of the
latter, the next section shows how a compiler might determine that an entire loop
can be executed in parallel; hardware techniques might or might not be able to
find such parallelism. The major disadvantage of static approaches is that they
can use only compile time information. Without run time information, compile
time techniques must often be conservative and assume the worst case.
G.2  Detecting and Enhancing Loop-Level Parallelism
Loop-level parallelism is normally analyzed at the source level or close to it,
while most analysis of ILP is done once instructions have been generated by the
compiler. Loop-level analysis involves determining what dependences exist
among the operands in a loop across the iterations of that loop. For now, we will
consider only data dependences, which arise when an operand is written at some
point and read at a later point. Name dependences also exist and may be removed
by renaming techniques like those we explored in Chapter 2.
The analysis of loop-level parallelism focuses on determining whether data
accesses in later iterations are dependent on data values produced in earlier
iterations; such a dependence is called a loop-carried dependence. Most of the
examples we considered in Section 2.2 have no loop-carried dependences and,
thus, are loop-level parallel. To see that a loop is parallel, let us first look
at the source representation:

    for (i=1000; i>0; i=i-1)
        x[i] = x[i] + s;

In this loop, there is a dependence between the two uses of x[i], but this
dependence is within a single iteration and is not loop carried. There is a
dependence between successive uses of i in different iterations, which is loop
carried, but this dependence involves an induction variable and can be easily
recognized and eliminated. We saw examples of how to eliminate dependences
involving induction variables during loop unrolling in Section 2.2, and we will
look at additional examples later in this section.
Because finding loop-level parallelism involves recognizing structures such
as loops, array references, and induction variable computations, the compiler can
do this analysis more easily at or near the source level, as opposed to the
machine-code level. Let's look at a more complex example.

Example  Consider a loop like this one:

    for (i=1; i<=100; i=i+1) {
        A[i+1] = A[i] + C[i];    /* S1 */
        B[i+1] = B[i] + A[i+1];  /* S2 */
    }

Assume that A, B, and C are distinct, nonoverlapping arrays. (In practice, the
arrays may sometimes be the same or may overlap. Because the arrays may be
passed as parameters to a procedure, which includes this loop, determining
whether arrays overlap or are identical often requires sophisticated,
interprocedural analysis of the program.) What are the data dependences among
the statements S1 and S2 in the loop?

Answer  There are two different dependences:

1. S1 uses a value computed by S1 in an earlier iteration, since iteration i
   computes A[i+1], which is read in iteration i+1. The same is true of S2 for
   B[i] and B[i+1].

2. S2 uses the value, A[i+1], computed by S1 in the same iteration.
These two dependences are different and have different effects. To see how
they differ, let’s assume that only one of these dependences exists at a time.
Because the dependence of statement S1 is on an earlier iteration of S1, this
dependence is loop carried. This dependence forces successive iterations of this
loop to execute in series.
The second dependence (S2 depending on S1) is within an iteration and is not
loop carried. Thus, if this were the only dependence, multiple iterations of the
loop could execute in parallel, as long as each pair of statements in an iteration
were kept in order. We saw this type of dependence in an example in Section 2.2,
where unrolling was able to expose the parallelism.
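As a sketch of that idea (an illustration added here, echoing the Section 2.2 treatment rather than reproducing it), unrolling the earlier loop on x[i], which has no loop-carried dependence apart from the induction variable, makes the available parallelism explicit:

    for (i=1000; i>0; i=i-4) {
        x[i]   = x[i]   + s;   /* the four updates touch distinct   */
        x[i-1] = x[i-1] + s;   /* elements, so they are independent */
        x[i-2] = x[i-2] + s;   /* of one another and can be issued  */
        x[i-3] = x[i-3] + s;   /* in parallel                       */
    }

Only the single update of the induction variable i remains loop carried, and it can be eliminated as described in Section 2.2.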
It is also possible to have a loop-carried dependence that does not prevent
parallelism, as the next example shows.
Example  Consider a loop like this one:

    for (i=1; i<=100; i=i+1) {
        A[i] = A[i] + B[i];    /* S1 */
        B[i+1] = C[i] + D[i];  /* S2 */
    }

What are the dependences between S1 and S2? Is this loop parallel? If not, show
how to make it parallel.
Answer  Statement S1 uses the value assigned in the previous iteration by statement S2, so
there is a loop-carried dependence between S2 and S1. Despite this loop-carried
dependence, this loop can be made parallel. Unlike the earlier loop, this
dependence is not circular: neither statement depends on itself, and although S1
depends on S2, S2 does not depend on S1. A loop is parallel if it can be written
without a cycle in the dependences, since the absence of a cycle means that the
dependences give a partial ordering on the statements.
Although there are no circular dependences in the above loop, it must be
transformed to conform to the partial ordering and expose the parallelism. Two
observations are critical to this transformation:
1. There is no dependence from S1 to S2. If there were, then there would be a
   cycle in the dependences and the loop would not be parallel. Since this other
   dependence is absent, interchanging the two statements will not affect the
   execution of S2.

2. On the first iteration of the loop, statement S1 depends on the value of B[1]
   computed prior to initiating the loop.
These two observations allow us to replace the loop above with the following
code sequence:
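(Sketched below from the two observations: the statements are interchanged, and the first execution of S1 and the last execution of S2 are peeled out of the loop.)

    A[1] = A[1] + B[1];
    for (i=1; i<=99; i=i+1) {
        B[i+1] = C[i] + D[i];        /* old S2, for i = 1 to 99  */
        A[i+1] = A[i+1] + B[i+1];    /* old S1, for i = 2 to 100 */
    }
    B[101] = C[100] + D[100];

The dependence of S1 on S2 is now within a single iteration of the new loop, so it is no longer loop carried and the iterations of the transformed loop can execute in parallel.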