2-17 PARALLEL AND VECTORIZED FORTRAN
*************************************
Using vector and parallel computers requires somewhat similar
considerations. Simply put, both types of computers try to execute
more than one FORTRAN statement at the same time, and in doing so change
the statement execution order (see the examples below). We must ensure
that changing the execution order of statements will not change the
results. Vectorizing/parallelizing non-looping code is difficult,
and vectorizing/parallelizing is usually limited to DO loops that satisfy
a condition of DATA INDEPENDENCE.
Vector machines have long vector registers that can hold several
data items at the same time (up to 128) and operate on them. For
example, the register may be loaded with data items that belong to
consecutive iterations of a DO loop, so that more than one iteration
can be performed at the same time.
Parallel machines can execute different iterations of a DO loop on
different processors at the same time. The processors may have shared
main memory or they can be independent and connected by a fast network.
By the way, the term 'vector' is used here in a sense similar to that
of linear algebra: it means a sequence of scalar data items.
Bernstein conditions for single loop independence
-------------------------------------------------
We will assume that variables or array elements appearing in
the loop acquire their values only by assignment statements.
Consider every iteration of the loop as a separate unit, with
the variables and array elements that are referenced in it.
A variable or array element appearing on the right side of an
assignment statement will be called an INPUT VARIABLE of the
iteration it belongs to.
Similarly a variable or array element appearing on the left
side of an assignment will be called an OUTPUT VARIABLE of
the iteration it belongs to.
In C terminology, an input variable is an rvalue, and an
output variable is an lvalue.
If the following two conditions are true, NO DATA DEPENDENCY
IS POSSIBLE between iterations:
1) No output variable is also an input variable
in another iteration.
2) No output variable is also an output variable
in another iteration.
If this simple criterion is violated in a loop, we have to
perform some kind of analysis (see below) in order to determine
if a dependency really exists, and whether it is REMOVABLE.
The variables that violate the criterion will help us classify
the type of dependency involved.
Remark: Auxiliary variables that can be eliminated algebraically,
and are computed before they are used, are output variables
according to our definition, but can be safely ignored.
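The two conditions can be checked mechanically once we know each
iteration's input and output sets. A minimal sketch follows (in Python
rather than FORTRAN, for illustration; the function name and the
string labels for array elements are made up for this example):

```python
# Sketch: check the two Bernstein conditions for a pair of loop
# iterations, given their sets of input and output variables.
# Labels like 'A(2)' are just illustrative names for array elements.

def bernstein_independent(it1, it2):
    """it1, it2: dicts with 'in' and 'out' sets of variable names."""
    # 1) No output variable of one iteration is an input of the other.
    cond1 = not (it1['out'] & it2['in']) and not (it2['out'] & it1['in'])
    # 2) No output variable is also an output variable in the other.
    cond2 = not (it1['out'] & it2['out'])
    return cond1 and cond2

# Iterations I=2 and I=3 of:  DO I=2,4 { A(I-1) = B(I); C(I) = A(I) }
i2 = {'in': {'B(2)', 'A(2)'}, 'out': {'A(1)', 'C(2)'}}
i3 = {'in': {'B(3)', 'A(3)'}, 'out': {'A(2)', 'C(3)'}}
print(bernstein_independent(i2, i3))   # False: A(2) written by I=3, read by I=2
```

The pair of iterations above violates condition 1, so further analysis
is needed before this loop can be vectorized/parallelized.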
Vectorizing/Parallelizing compilers
-----------------------------------
Vectorizing/Parallelizing compiler technology is quite mature:
loops are analyzed automatically for data dependencies, and if a loop
is found independent the compiler generates vector code.
Compilers use algorithms such as Banerjee's inequalities or the
GCD test to analyze dependencies.
The compiler will warn you of dependencies; you should go over
them armed with the concepts discussed in this chapter, and try to
remove them.
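The GCD test mentioned above can be sketched briefly. For a store to
A(a*I + b) and a load of A(c*I + d), the same element can be touched in
two iterations only if a*i1 + b = c*i2 + d has an integer solution,
which requires gcd(a, c) to divide (d - b). A hedged sketch (in Python;
the function name is made up for this illustration):

```python
from math import gcd

# Sketch of the GCD test: returns True if a dependence between the
# store A(a*I + b) and the load A(c*I + d) is POSSIBLE, False if the
# test proves independence.  When it returns True, a finer test
# (e.g. Banerjee's inequalities) is still needed.

def gcd_test_may_depend(a, b, c, d):
    return (d - b) % gcd(a, c) == 0

# A(2*I) = ... ; ... = A(2*I+1)  ->  even vs odd indices never overlap
print(gcd_test_may_depend(2, 0, 2, 1))   # False: independence proved
# A(I-1) = ... ; ... = A(I)      ->  the test cannot rule it out
print(gcd_test_may_depend(1, -1, 1, 0))  # True: dependence possible
```

Note that the test is conservative: a True answer only means a
dependence cannot be excluded by this test alone.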
Classification of dependencies
------------------------------
The dependencies we will discuss are:
Store before load Competing stores Load before store
================= ================ ===============
DO I=2,4 DO I=2,4 DO I=2,4
A(I-1) = B(I) A(I-1) = B(I) B(I) = A(I-1)
C(I) = A(I) A(I) = C(I) A(I) = C(I)
ENDDO ENDDO ENDDO
Recurrence Sum reduction
=============== ==============
DO I=2,4 SUM = 0.0
A(I) = A(I-1) DO I=2,4
ENDDO SUM = SUM + A(I)
ENDDO
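The sum reduction in the table above is a special case: it can be
parallelized by letting each processor sum a chunk and then combining
the partial sums, at the price of reordering the additions (exact for
integers, but the last bits of a floating-point sum may change). A
minimal sketch, in Python rather than FORTRAN, with made-up data:

```python
# Sketch: parallelizing a sum reduction by partial sums.

A = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

# Serial reduction, as in the Fortran loop above
serial = 0.0
for x in A:
    serial += x

# "Parallel" reduction: two chunks summed independently, then combined
half = len(A) // 2
partial1 = sum(A[:half])
partial2 = sum(A[half:])
parallel = partial1 + partial2

print(serial, parallel)   # 36.0 36.0
```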
Removable data dependency: store before load
--------------------------------------------
A small example:
DO I=2,4
A(I-1) = B(I)
C(I) = A(I)
ENDDO
Let's unroll the loop and see the correct order for executing
the assignment statements:
t=1 A(1) = B(2)
t=2 C(2) = A(2)
t=3 A(2) = B(3)
t=4 C(3) = A(3)
t=5 A(3) = B(4)
t=6 C(4) = A(4)
Performing all iterations in parallel:
Parallel #1 Parallel #2 Parallel #3
=========== =========== ===========
t=1 A(1) = B(2) A(2) = B(3) A(3) = B(4)
t=2 C(2) = A(2) C(3) = A(3) C(4) = A(4)
We see that the execution order is changed when running the
loop in parallel, e.g. the first assignment of #2 (t=3) comes
before the second assignment of #1 (t=2).
When the second assignment of #1 executes, the value of A(2)
is already corrupted, the first assignment of #2 gave it the
value of B(3) instead of its original value. The same problem
occurs between #2 and #3, the data is overwritten before
being used and the result is wrong.
A simple modification can save the parallel code:
DO I=2,4
TEMP = A(I)
A(I-1) = B(I)
C(I) = TEMP
ENDDO
This technique is called PRE-LOADING: the endangered value
of A(I) is temporarily stored in the new auxiliary variable
TEMP, and the 'protective store' is done at the top of the loop.
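We can convince ourselves the transformation is safe by simulating
both versions. The sketch below (Python standing in for FORTRAN, with
an unused element 0 mimicking 1-based arrays, and arbitrary example
data) runs the pre-loaded loop statement by statement across all
iterations, as a synchronous parallel machine would:

```python
# Sketch: serial original loop vs pre-loaded loop run in "parallel"
# (each statement executed for all iterations before the next one).

def serial_original(A, B, C):
    A, C = A[:], C[:]
    for i in range(2, 5):          # DO I=2,4: A(I-1)=B(I); C(I)=A(I)
        A[i-1] = B[i]
        C[i] = A[i]
    return A, C

def parallel_preloaded(A, B, C):
    A, C = A[:], C[:]
    temp = {}
    for i in range(2, 5):          # phase 1, all iterations: TEMP = A(I)
        temp[i] = A[i]
    for i in range(2, 5):          # phase 2, all iterations: A(I-1) = B(I)
        A[i-1] = B[i]
    for i in range(2, 5):          # phase 3, all iterations: C(I) = TEMP
        C[i] = temp[i]
    return A, C

A = [None, 10, 11, 12, 13]         # index 0 unused, to mimic 1-based arrays
B = [None, 20, 21, 22, 23]
C = [None, 30, 31, 32, 33]
print(serial_original(A, B, C) == parallel_preloaded(A, B, C))   # True
```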
Possibly removable data dependency: two competing stores
--------------------------------------------------------
A small example:
DO I=2,4
A(I-1) = B(I)
A(I) = C(I)
ENDDO
Unrolled:
t=1 A(1) = B(2)
t=2 A(2) = C(2)
t=3 A(2) = B(3)
t=4 A(3) = C(3)
t=5 A(3) = B(4)
t=6 A(4) = C(4)
Parallelized:
Parallel #1 Parallel #2 Parallel #3
=========== =========== ===========
t=1 A(1) = B(2) A(2) = B(3) A(3) = B(4)
t=2 A(2) = C(2) A(3) = C(3) A(4) = C(4)
We see that the execution order is changed when running the
loop in parallel, e.g. the first assignment of #2 (t=3) comes
before the second assignment of #1 (t=2).
The value left in A(2) after the loop executes is wrong,
instead of B(3) it is equal to C(2). The same problem occurs
between #2 and #3, the order of the store operations is wrong
and the result is wrong.
Again a simple modification can save the parallel code:
we just SWITCH LINES inside the loop:
DO I=2,4
A(I) = C(I)
A(I-1) = B(I)
ENDDO
This simple trick makes the order of stores come out right,
IF we can assume that the parallel 'tracks' are synchronous.
Vector processors are synchronous; parallel machines may
not be.
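A quick simulation (Python standing in for FORTRAN, arbitrary example
data, 1-based arrays mimicked by an unused element 0) confirms that
the switched loop gives the same final A even when each statement is
run for all iterations at once:

```python
# Sketch: serial original loop vs switched loop run in "parallel"
# (all first statements, then all second statements).

def serial_original(A, B, C):
    A = A[:]
    for i in range(2, 5):          # DO I=2,4: A(I-1)=B(I); A(I)=C(I)
        A[i-1] = B[i]
        A[i] = C[i]
    return A

def parallel_switched(A, B, C):
    A = A[:]
    for i in range(2, 5):          # phase 1, all iterations: A(I) = C(I)
        A[i] = C[i]
    for i in range(2, 5):          # phase 2, all iterations: A(I-1) = B(I)
        A[i-1] = B[i]
    return A

A = [None, 10, 11, 12, 13]
B = [None, 20, 21, 22, 23]
C = [None, 30, 31, 32, 33]
print(serial_original(A, B, C) == parallel_switched(A, B, C))   # True
```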
Possibly removable data dependency: load before store
-----------------------------------------------------
A small example:
DO I=2,4
B(I) = A(I-1)
A(I) = C(I)
ENDDO
Unrolled:
t=1 B(2) = A(1)
t=2 A(2) = C(2)
t=3 B(3) = A(2)
t=4 A(3) = C(3)
t=5 B(4) = A(3)
t=6 A(4) = C(4)
Parallelized:
Parallel #1 Parallel #2 Parallel #3
=========== =========== ===========
t=1 B(2) = A(1) B(3) = A(2) B(4) = A(3)
t=2 A(2) = C(2) A(3) = C(3) A(4) = C(4)
We see that the execution order is changed when running the
loop in parallel, e.g. the first assignment of #2 (t=3) comes
before the second assignment of #1 (t=2).
The value of A(2) is used before it is assigned its new
value C(2). The same problem occurs between #2 and #3:
a variable is loaded before it contains the right value.
The loads are performed too early relative to the stores,
and the result is wrong.
Again a simple modification can save the parallel code:
looking closely at the code we can see that SWITCHING
LINES inside the loop will not change the result, and
will solve the dependency problem:
DO I=2,4
A(I) = C(I)
B(I) = A(I-1)
ENDDO
This simple trick makes the order of loads and stores come
out right, IF we can assume that the parallel 'tracks' are
synchronous. Vector processors are synchronous; parallel
machines may not be.
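As before, a simulation (Python standing in for FORTRAN, arbitrary
example data) confirms that the switched loop reproduces the serial
result even when each statement runs for all iterations at once:

```python
# Sketch: serial original loop vs switched loop run in "parallel".

def serial_original(A, B, C):
    A, B = A[:], B[:]
    for i in range(2, 5):          # DO I=2,4: B(I)=A(I-1); A(I)=C(I)
        B[i] = A[i-1]
        A[i] = C[i]
    return A, B

def parallel_switched(A, B, C):
    A, B = A[:], B[:]
    for i in range(2, 5):          # phase 1, all iterations: A(I) = C(I)
        A[i] = C[i]
    for i in range(2, 5):          # phase 2, all iterations: B(I) = A(I-1)
        B[i] = A[i-1]
    return A, B

A = [None, 10, 11, 12, 13]         # index 0 unused, to mimic 1-based arrays
B = [None, 20, 21, 22, 23]
C = [None, 30, 31, 32, 33]
print(serial_original(A, B, C) == parallel_switched(A, B, C))   # True
```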
Irremovable data dependency: recurrence
---------------------------------------
A small example:
DO I=2,4
A(I) = A(I-1)
ENDDO
Unrolled:
t=1 A(2) = A(1)
t=2 A(3) = A(2)
t=3 A(4) = A(3)
Parallelized:
Parallel #1 Parallel #2 Parallel #3
=========== =========== ===========
t=1 A(2) = A(1) A(3) = A(2) A(4) = A(3)
Well, nothing can be done in this case: the dependency is essential,
and cannot be removed by any simple manipulation.
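A simulation makes the failure concrete: running all iterations at
once reads the OLD neighbour values, so the result differs from the
serial loop, and no reordering of statements inside the loop can fix
it. A sketch (Python standing in for FORTRAN, made-up data):

```python
# Sketch: serial recurrence vs a naive all-at-once "parallel" run.

def serial(A):
    A = A[:]
    for i in range(2, 5):          # DO I=2,4: A(I) = A(I-1)
        A[i] = A[i-1]
    return A

def naive_parallel(A):
    A = A[:]
    old = A[:]                     # every track reads pre-loop values
    for i in range(2, 5):
        A[i] = old[i-1]
    return A

A = [None, 10, 11, 12, 13]         # index 0 unused, to mimic 1-based arrays
print(serial(A))          # [None, 10, 10, 10, 10]
print(naive_parallel(A))  # [None, 10, 10, 11, 12]
```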
Bernstein conditions for double loop independence
-------------------------------------------------
If we have either a vector machine or a parallel one, treating
two-level loop nests is simple: we just vectorize/parallelize one of
the loops. With a machine that is both vector and parallel we would
try to vectorize the inner loop and parallelize the outer one.
Dependency analysis is more complicated in this case ....
Other loops: FORALL, DOALL
--------------------------
Shared memory systems
---------------------
... interleaved memory ...
Memory contention
-----------------
Distributed memory systems and message passing
----------------------------------------------
Bibliography
------------
A nice introduction to hardware and software:
Advanced Computer Architecture:
Parallelism, Scalability, Programmability
Kai Hwang
McGraw-Hill 1993
Library of Congress QA76.9 A73 H87
ISBN 0-07-113342-9