2-17 PARALLEL AND VECTORIZED FORTRAN
*************************************
Using vector and parallel computers requires somewhat similar
considerations. Simply put, both types of computers try to execute
more than one FORTRAN statement at the same time, and in doing so change
the statement execution order (see the examples below). We must ensure
that changing the execution order of statements will not change the
results. Vectorizing/parallelizing non-looping code is difficult,
and vectorizing/parallelizing is usually limited to DO loops that satisfy
a condition of DATA INDEPENDENCE.
Vector machines have long vector registers that can hold several
data items at the same time (up to 128) and operate on them. For
example, the register may be loaded with data items that belong to
consecutive iterations of a DO loop, so that more than one iteration
can be performed at the same time.
Parallel machines can execute different iterations of a DO loop on
different processors at the same time. The processors may have shared
main memory or they can be independent and connected by a fast network.
By the way, the term 'vector' is used here in a sense similar to that
of linear algebra: it means a sequence of scalar data items.
Bernstein conditions for single loop independence
-------------------------------------------------
We will assume that variables or array elements appearing in
the loop acquire their values only by assignment statements.
Consider every iteration of the loop as a separate unit, with
the variables and array elements that are referenced in it.
A variable or array element appearing on the right side of an
assignment statement will be called an INPUT VARIABLE of the
iteration it belongs to.
Similarly a variable or array element appearing on the left
side of an assignment will be called an OUTPUT VARIABLE of
the iteration it belongs to.
In C terminology, an input variable is an rvalue, and an
output variable is an lvalue.
If the following two conditions are true, NO DATA DEPENDENCY
IS POSSIBLE between iterations:
1) No output variable is also an input variable
in another iteration.
2) No output variable is also an output variable
in another iteration.
If this simple criterion is violated in a loop, we have to
perform some kind of analysis (see below) in order to determine
if a dependency really exists, and whether it is REMOVABLE.
The variables that violate the criterion will help us classify
the type of dependency involved.
Remark: Auxiliary variables that can be eliminated algebraically,
and are computed before they are used, are output variables
according to our definition, but can be safely ignored.
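The two conditions can be checked mechanically once we know each
iteration's input and output sets. A minimal sketch follows (in Python
rather than FORTRAN, for illustration; the function name and the
string labels for array elements are made up for this example):

```python
# Sketch: check the two Bernstein conditions for a pair of loop
# iterations, given their sets of input and output variables.
# Labels like 'A(2)' are just illustrative names for array elements.

def bernstein_independent(it1, it2):
    """it1, it2: dicts with 'in' and 'out' sets of variable names."""
    # 1) No output variable of one iteration is an input of the other.
    cond1 = not (it1['out'] & it2['in']) and not (it2['out'] & it1['in'])
    # 2) No output variable is also an output variable in the other.
    cond2 = not (it1['out'] & it2['out'])
    return cond1 and cond2

# Iterations I=2 and I=3 of:  DO I=2,4 { A(I-1) = B(I); C(I) = A(I) }
i2 = {'in': {'B(2)', 'A(2)'}, 'out': {'A(1)', 'C(2)'}}
i3 = {'in': {'B(3)', 'A(3)'}, 'out': {'A(2)', 'C(3)'}}
print(bernstein_independent(i2, i3))   # False: A(2) written by I=3, read by I=2
```

The pair of iterations above violates condition 1, so further analysis
is needed before this loop can be vectorized/parallelized.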
Vectorizing/Parallelizing compilers
-----------------------------------
Vectorizing/Parallelizing compiler technology is quite mature:
loops are analyzed automatically for data dependencies, and if a loop
is found independent the compiler generates vector code.
Compilers use algorithms such as Banerjee's inequalities or the
GCD test to analyze dependencies.
The compiler will warn you of dependencies; you should go over
them armed with the concepts discussed in this chapter, and try to
remove them.
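The GCD test mentioned above can be sketched briefly. For a store to
A(a*I + b) and a load of A(c*I + d), the same element can be touched in
two iterations only if a*i1 + b = c*i2 + d has an integer solution,
which requires gcd(a, c) to divide (d - b). A hedged sketch (in Python;
the function name is made up for this illustration):

```python
from math import gcd

# Sketch of the GCD test: returns True if a dependence between the
# store A(a*I + b) and the load A(c*I + d) is POSSIBLE, False if the
# test proves independence.  When it returns True, a finer test
# (e.g. Banerjee's inequalities) is still needed.

def gcd_test_may_depend(a, b, c, d):
    return (d - b) % gcd(a, c) == 0

# A(2*I) = ... ; ... = A(2*I+1)  ->  even vs odd indices never overlap
print(gcd_test_may_depend(2, 0, 2, 1))   # False: independence proved
# A(I-1) = ... ; ... = A(I)      ->  the test cannot rule it out
print(gcd_test_may_depend(1, -1, 1, 0))  # True: dependence possible
```

Note that the test is conservative: a True answer only means a
dependence cannot be excluded by this test alone.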
Classification of dependencies
------------------------------
The dependencies we will discuss are:
Store before load Competing stores Load before store
================= ================ ===============
DO I=2,4 DO I=2,4 DO I=2,4
A(I-1) = B(I) A(I-1) = B(I) B(I) = A(I-1)
C(I) = A(I) A(I) = C(I) A(I) = C(I)
ENDDO ENDDO ENDDO
Recurrence Sum reduction
=============== ==============
DO I=2,4 SUM = 0.0
A(I) = A(I-1) DO I=2,4
ENDDO SUM = SUM + A(I)
ENDDO
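The sum reduction in the table above is a special case: it can be
parallelized by letting each processor sum a chunk and then combining
the partial sums, at the price of reordering the additions (exact for
integers, but the last bits of a floating-point sum may change). A
minimal sketch, in Python rather than FORTRAN, with made-up data:

```python
# Sketch: parallelizing a sum reduction by partial sums.

A = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

# Serial reduction, as in the Fortran loop above
serial = 0.0
for x in A:
    serial += x

# "Parallel" reduction: two chunks summed independently, then combined
half = len(A) // 2
partial1 = sum(A[:half])
partial2 = sum(A[half:])
parallel = partial1 + partial2

print(serial, parallel)   # 36.0 36.0
```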
Removable data dependency: store before load
--------------------------------------------
A small example:
DO I=2,4
A(I-1) = B(I)
C(I) = A(I)
ENDDO
Let's unroll the loop and see the correct order for executing
the assignment statements:
t=1 A(1) = B(2)
t=2 C(2) = A(2)
t=3 A(2) = B(3)
t=4 C(3) = A(3)
t=5 A(3) = B(4)
t=6 C(4) = A(4)
Performing all iterations in parallel:
Parallel #1 Parallel #2 Parallel #3
=========== =========== ===========
t=1 A(1) = B(2) A(2) = B(3) A(3) = B(4)
t=2 C(2) = A(2) C(3) = A(3) C(4) = A(4)
We see that the execution order is changed when running the
loop in parallel, e.g. the first assignment of #2 (t=3) comes
before the second assignment of #1 (t=2).
When the second assignment of #1 executes, the value of A(2)
is already corrupted, the first assignment of #2 gave it the
value of B(3) instead of its original value. The same problem
occurs between #2 and #3, the data is overwritten before
being used and the result is wrong.
A simple modification can save the parallel code:
DO I=2,4
TEMP = A(I)
A(I-1) = B(I)
C(I) = TEMP
ENDDO
This technique is called PRE-LOADING: the endangered value
of A(I) is temporarily stored in the new auxiliary variable
TEMP, and the 'protective store' is done at the top of the loop.
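We can convince ourselves the transformation is safe by simulating
both versions. The sketch below (Python standing in for FORTRAN, with
an unused element 0 mimicking 1-based arrays, and arbitrary example
data) runs the pre-loaded loop statement by statement across all
iterations, as a synchronous parallel machine would:

```python
# Sketch: serial original loop vs pre-loaded loop run in "parallel"
# (each statement executed for all iterations before the next one).

def serial_original(A, B, C):
    A, C = A[:], C[:]
    for i in range(2, 5):          # DO I=2,4: A(I-1)=B(I); C(I)=A(I)
        A[i-1] = B[i]
        C[i] = A[i]
    return A, C

def parallel_preloaded(A, B, C):
    A, C = A[:], C[:]
    temp = {}
    for i in range(2, 5):          # phase 1, all iterations: TEMP = A(I)
        temp[i] = A[i]
    for i in range(2, 5):          # phase 2, all iterations: A(I-1) = B(I)
        A[i-1] = B[i]
    for i in range(2, 5):          # phase 3, all iterations: C(I) = TEMP
        C[i] = temp[i]
    return A, C

A = [None, 10, 11, 12, 13]         # index 0 unused, to mimic 1-based arrays
B = [None, 20, 21, 22, 23]
C = [None, 30, 31, 32, 33]
print(serial_original(A, B, C) == parallel_preloaded(A, B, C))   # True
```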
Possibly removable data dependency: two competing stores
--------------------------------------------------------
A small example:
DO I=2,4
A(I-1) = B(I)
A(I) = C(I)
ENDDO
Unrolled:
t=1 A(1) = B(2)
t=2 A(2) = C(2)
t=3 A(2) = B(3)
t=4 A(3) = C(3)
t=5 A(3) = B(4)
t=6 A(4) = C(4)
Parallelized:
Parallel #1 Parallel #2 Parallel #3
=========== =========== ===========
t=1 A(1) = B(2) A(2) = B(3) A(3) = B(4)
t=2 A(2) = C(2) A(3) = C(3) A(4) = C(4)
We see that the execution order is changed when running the
loop in parallel, e.g. the first assignment of #2 (t=3) comes
before the second assignment of #1 (t=2).
The value left in A(2) after the loop executes is wrong,
instead of B(3) it is equal to C(2). The same problem occurs
between #2 and #3, the order of the store operations is wrong
and the result is wrong.
Again a simple modification can save the parallel code:
we just SWITCH LINES inside the loop:
DO I=2,4
A(I) = C(I)
A(I-1) = B(I)
ENDDO
This simple trick makes the order of stores come out right,
IF we can assume that the parallel 'tracks' are synchronous.
Vector processors are synchronous; parallel machines may
not be.
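A quick simulation (Python standing in for FORTRAN, arbitrary example
data, 1-based arrays mimicked by an unused element 0) confirms that
the switched loop gives the same final A even when each statement is
run for all iterations at once:

```python
# Sketch: serial original loop vs switched loop run in "parallel"
# (all first statements, then all second statements).

def serial_original(A, B, C):
    A = A[:]
    for i in range(2, 5):          # DO I=2,4: A(I-1)=B(I); A(I)=C(I)
        A[i-1] = B[i]
        A[i] = C[i]
    return A

def parallel_switched(A, B, C):
    A = A[:]
    for i in range(2, 5):          # phase 1, all iterations: A(I) = C(I)
        A[i] = C[i]
    for i in range(2, 5):          # phase 2, all iterations: A(I-1) = B(I)
        A[i-1] = B[i]
    return A

A = [None, 10, 11, 12, 13]
B = [None, 20, 21, 22, 23]
C = [None, 30, 31, 32, 33]
print(serial_original(A, B, C) == parallel_switched(A, B, C))   # True
```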
Possibly removable data dependency: load before store
-----------------------------------------------------
A small example:
DO I=2,4
B(I) = A(I-1)
A(I) = C(I)
ENDDO
Unrolled:
t=1 B(2) = A(1)
t=2 A(2) = C(2)
t=3 B(3) = A(2)
t=4 A(3) = C(3)
t=5 B(4) = A(3)
t=6 A(4) = C(4)
Parallelized:
Parallel #1 Parallel #2 Parallel #3
=========== =========== ===========
t=1 B(2) = A(1) B(3) = A(2) B(4) = A(3)
t=2 A(2) = C(2) A(3) = C(3) A(4) = C(4)
We see that the execution order is changed when running the
loop in parallel, e.g. the first assignment of #2 (t=3) comes
before the second assignment of #1 (t=2).
The value of A(2) is used before it is assigned its new
value C(2). The same problem occurs between #2 and #3:
a variable is loaded before it contains the right value.
The loads are performed too early relative to the stores,
and the result is wrong.
Again a simple modification can save the parallel code:
looking closely at the code we can see that SWITCHING
LINES inside the loop will not change the result, and
will solve the dependency problem:
DO I=2,4
A(I) = C(I)
B(I) = A(I-1)
ENDDO
This simple trick makes the order of loads and stores come
out right, IF we can assume that the parallel 'tracks' are
synchronous. Vector processors are synchronous; parallel
machines may not be.
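As before, a simulation (Python standing in for FORTRAN, arbitrary
example data) confirms that the switched loop reproduces the serial
result even when each statement runs for all iterations at once:

```python
# Sketch: serial original loop vs switched loop run in "parallel".

def serial_original(A, B, C):
    A, B = A[:], B[:]
    for i in range(2, 5):          # DO I=2,4: B(I)=A(I-1); A(I)=C(I)
        B[i] = A[i-1]
        A[i] = C[i]
    return A, B

def parallel_switched(A, B, C):
    A, B = A[:], B[:]
    for i in range(2, 5):          # phase 1, all iterations: A(I) = C(I)
        A[i] = C[i]
    for i in range(2, 5):          # phase 2, all iterations: B(I) = A(I-1)
        B[i] = A[i-1]
    return A, B

A = [None, 10, 11, 12, 13]         # index 0 unused, to mimic 1-based arrays
B = [None, 20, 21, 22, 23]
C = [None, 30, 31, 32, 33]
print(serial_original(A, B, C) == parallel_switched(A, B, C))   # True
```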
Irremovable data dependency: recurrence
---------------------------------------
A small example:
DO I=2,4
A(I) = A(I-1)
ENDDO
Unrolled:
t=1 A(2) = A(1)
t=2 A(3) = A(2)
t=3 A(4) = A(3)
Parallelized:
Parallel #1 Parallel #2 Parallel #3
=========== =========== ===========
t=1 A(2) = A(1) A(3) = A(2) A(4) = A(3)
Well, nothing can be done in this case: the dependency is essential,
and cannot be removed by any simple manipulation.
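A simulation makes the failure concrete: running all iterations at
once reads the OLD neighbour values, so the result differs from the
serial loop, and no reordering of statements inside the loop can fix
it. A sketch (Python standing in for FORTRAN, made-up data):

```python
# Sketch: serial recurrence vs a naive all-at-once "parallel" run.

def serial(A):
    A = A[:]
    for i in range(2, 5):          # DO I=2,4: A(I) = A(I-1)
        A[i] = A[i-1]
    return A

def naive_parallel(A):
    A = A[:]
    old = A[:]                     # every track reads pre-loop values
    for i in range(2, 5):
        A[i] = old[i-1]
    return A

A = [None, 10, 11, 12, 13]         # index 0 unused, to mimic 1-based arrays
print(serial(A))          # [None, 10, 10, 10, 10]
print(naive_parallel(A))  # [None, 10, 10, 11, 12]
```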
Bernstein conditions for double loop independence
-------------------------------------------------
If we have either a vector machine or a parallel one, treating
two-level loop nests is simple: we just vectorize/parallelize one of
the loops. With a machine that is both vector and parallel we would
try to vectorize the inner loop and parallelize the outer one.
Dependency analysis is more complicated in this case ....
Other loops: FORALL, DOALL
--------------------------
Shared memory systems
---------------------
... interleaved memory ...
Memory contention
-----------------
Distributed memory systems and message passing
----------------------------------------------
Bibliography
------------
A nice introduction to hardware and software:
Advanced Computer Architecture:
Parallelism, Scalability, Programmability
Kai Hwang
McGraw-Hill 1993
Library of Congress QA76.9 A73 H87
ISBN 0-07-113342-9