1-9 CODE OPTIMIZATION - USER TECHNIQUES
***************************************
(Thanks to Craig Burley for the excellent comments.
Thanks to Timothy Prince for the important comments)
FORTRAN is used for heavy number crunching, and decreasing the
running time of a program is often desirable. However, speed
shouldn't be made the primary goal except in special cases.
While doing manual optimizations, it's a good idea to check the code
from a numerical point of view as well. At the very least, avoid
optimizations that may degrade accuracy.
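For example, here is a minimal sketch of how an optimization that
merely reorders arithmetic can change results: summing the same
series in both directions in single precision gives slightly
different sums (the loop bound is illustrative).

      PROGRAM ACCUR
      INTEGER I
      REAL SFWD, SBWD
C     Sum the harmonic series forwards (large terms first)
      SFWD = 0.0
      DO 10 I = 1, 1000000
         SFWD = SFWD + 1.0 / REAL(I)
   10 CONTINUE
C     Sum the same series backwards (small terms first)
      SBWD = 0.0
      DO 20 I = 1000000, 1, -1
         SBWD = SBWD + 1.0 / REAL(I)
   20 CONTINUE
C     The results differ slightly, because floating-point
C     addition is not associative
      WRITE (*,*) SFWD, SBWD
      END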
The need for profiling
----------------------
Since programs commonly spend most of their execution time in a few
small parts of the code (usually some loops), it is recommended to
perform "profiling" first, i.e. an analysis of the relative execution
time spent in different parts of the program.
Profiling is also important when you apply optimizations: compilers
and computer hardware are complicated systems, and the result of
applying an optimization may be hard to guess.
Optimizations that can't be done automatically
----------------------------------------------
There are some optimizations you can do that a compiler can't,
for example:
1) Using a fast algorithm
2) Using unformatted I/O
3) Using higher optimization levels
4) Performing "aggressive" manual optimization
5) Writing in assembly language
Using a fast algorithm
----------------------
A good first optimization is choosing a fast algorithm.
Other types of optimization probably can't make a naive quadratic
sort (an N**2 algorithm) compete with heapsort (N*logN) on long
sequences, or make a Runge-Kutta integrator compete with a
Rosenbrock integrator on very stiff problems.
Make yourself familiar with basic good algorithms. See the chapter on
literature for references to some classic books on algorithmics.
Using unformatted I/O
---------------------
Unformatted I/O is often better because:
1) The CPU-intensive translation between internal and
   external representation is avoided.
2) The radix conversion and roundoff errors associated
with the translation are avoided, so the accuracy
of computations involving numbers written to, and
then read back from a file increases.
3) It is more efficient because unformatted data takes
   less space, hence more data can be passed in each
   I/O operation, and fewer (slow) operations are needed.
4) The resulting data files are smaller.
The disadvantages of unformatted I/O are that data files are not
portable to other systems, and cannot be edited and studied directly
by a text editor such as EVE, Emacs or vi.
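A minimal sketch of unformatted I/O on a sequential file (the file
name and array contents are illustrative):

      PROGRAM UNFIO
      INTEGER I
      REAL A(1000)
      DO 10 I = 1, 1000
         A(I) = REAL(I)
   10 CONTINUE
C     FORM='UNFORMATTED' stores the raw internal representation,
C     no translation to text is performed
      OPEN (UNIT=20, FILE='DATA.DAT', FORM='UNFORMATTED',
     &      STATUS='UNKNOWN')
      WRITE (20) A
      REWIND (UNIT=20)
C     Reading the record back reproduces the array exactly
      READ (20) A
      CLOSE (UNIT=20)
      END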
Classification of optimization types
------------------------------------
Optimizations that are performed automatically by a compiler, or
manually by the programmer, can be classified by various
characteristics.
The "scope" of the optimization:
1) Local optimizations - Performed in a part of one procedure.
   1) Common sub-expression elimination (e.g. of the
      sub-expressions that occur when translating array
      indices to memory addresses)
2) Using registers for temporary results, and if
possible for variables.
3) Replacing multiplication and division by shift
and add operations.
2) Global optimizations - Performed with the help of data-flow
   analysis (see below) and split-lifetime analysis.
   1) Code motion (hoisting) out of loops
   2) Value propagation
   3) Strength reduction (see the sketch at the end
      of this section)
3) Inter-procedural optimizations - Performed across procedure
   boundaries, e.g. procedure inlining (see below)
What is improved in the optimization:
1) Space optimizations - Reduces the size of the executable/object.
1) Constant pooling
2) Dead-code elimination.
2) Speed optimizations - Most optimizations belong to this category
There are important optimizations not covered above, e.g. the various
loop transformations:
1) Loop unrolling - Full or partial transformation of
a loop into straight code
2) Loop blocking (tiling) - Minimizes cache misses by
   replacing each array-processing loop with two loops,
   dividing the "iteration space" into smaller "blocks"
3) Loop interchange - Changing the nesting order of loops;
   this may make it possible to perform other transformations
4) Loop distribution - Replace a loop by two (or more)
equivalent loops
5) Loop fusion - Make one loop out of two (or more)
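The sketch promised above: strength reduction written out by hand.
The compiler replaces a multiplication inside a loop with a cheaper
running addition (the variable names here are illustrative):

C     Before strength reduction - a multiplication on every
C     iteration:
      DO 10 I = 1, N
         IBYTE(I) = 8 * I
   10 CONTINUE

C     After strength reduction - the multiplication becomes
C     a running sum:
      K = 0
      DO 20 I = 1, N
         K = K + 8
         IBYTE(I) = K
   20 CONTINUE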
Automatic/Manual Optimizations
------------------------------
A good compiler can automatically perform many code optimizations,
for example:
1) Alignment of data
2) Proper loop nesting
3) Procedure inlining
4) Loop unrolling
5) Loop blocking (tiling)
6) Local optimizations
Manually performing these optimizations is recommended, as long as
program clarity is not adversely affected.
Alignment of data
-----------------
See the chapter on data alignment for information on the influence
of data alignment on performance.
Proper loop nesting
-------------------
Operating systems (except DOS) don't have to load all of your program
into memory during execution: the memory your program needs for code
and data is partitioned into pages, and the pages are read from and
written to the disk as necessary.
Computers use memory caches to reduce the cost of memory references.
The cache is a small and fast memory unit; when you reference a memory
location, the cache automatically saves the contents of the memory
locations nearby. Later, if you reference a memory location that
happens to be in the cache, it can be fetched much faster.
The way memory management and caches work has important implications
for array processing: it determines the proper nesting order of loops
that process multi-dimensional arrays.
FORTRAN arrays with more than one index are stored in memory in the
'reverse' order relative to C. For example, a two-dimensional array
is stored column after column, i.e. A(1,1), A(2,1), A(3,1), ...
occupy consecutive memory locations. Such a storage scheme is called
a 'column major' or 'column first' scheme.
When you have large matrices that you process element after element,
you may save a lot of CPU-consuming paging activity, and a lot of
'cache misses', if you nest your loops properly:
DO 100 J=1,50
DO 200 I=1,50
A(I,J) = 0.5 * A(I,J)
200 CONTINUE
100 CONTINUE
The inner loop runs over the leftmost matrix index, so when you run
the program, memory references to the matrix don't jump back and
forth; they advance one step at a time over the matrix. In this way
page faults are minimized and the cache is utilized efficiently.
The same considerations apply, for the same reasons, to IMPLIED-DO
loops in I/O statements (see the sketch below).
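For example, in an implied-DO write of the matrix above, the leftmost
index should be the innermost (fastest-running) one:

C     The leftmost index I runs fastest, so the elements are
C     written in the order they are stored in memory
      WRITE (20) ((A(I,J), I = 1, 50), J = 1, 50)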
Procedure inlining
------------------
This is an efficient language-independent optimization technique.
Done manually, it makes your program look horrible, but many compilers
can perform it automatically. Note that this technique enlarges the
size of the executable.
Procedure inlining eliminates the overhead of procedure calls:
instead of keeping a single copy of a procedure's code and repeatedly
inserting the code required to call it, the compiler inserts the
procedure's code itself every place it is needed. In this way the
execution of the CALLING SEQUENCEs is eliminated, at the small price
of enlarging the executable file. This transformation is even more
effective on highly pipelined CPUs.
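A sketch of what inlining does, performed here by hand (the function
and variable names are illustrative):

C     The called function:
      REAL FUNCTION SQUARE(X)
      REAL X
      SQUARE = X * X
      END

C     Before inlining - a call on every iteration:
      DO 10 I = 1, N
         Y(I) = SQUARE(X(I))
   10 CONTINUE

C     After inlining - the body replaces the call, and no
C     CALLING SEQUENCE is executed:
      DO 20 I = 1, N
         Y(I) = X(I) * X(I)
   20 CONTINUE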
Loop unrolling
--------------
This is another efficient language-independent optimization technique.
Note that it also enlarges the size of the executable.
There is overhead inherent in every loop: some upon initializing the
loop, and some on every iteration (see the chapter on DO loops). The
relative overhead is larger when the loop body does little work, e.g.
in initialization loops. Eliminating the loop and writing the code
separately for each loop index can significantly increase speed.
On every iteration of a DO loop some operations must be performed:
	1) Decrementing the iteration counter
	2) Checking whether further iterations are needed
	3) If #2 is positive, updating the loop variable
	   and jumping to the start of the loop;
	   if #2 is negative, exiting the loop
Saving loop iterations saves all these "bookkeeping" operations.
Loop unrolling is even more effective on highly pipelined CPUs;
without it the instruction cache may be thrashed on every iteration.
You may use partial loop unrolling, a method in which you process
several loop indices in one loop iteration (see the example below).
The following example program tests partial loop unrolling on VMS:
PROGRAM CPUCLK
C ------------------------------------------------------------------
INCLUDE '($JPIDEF)'
C ------------------------------------------------------------------
INTEGER
* TIME1, TIME2,
* I
C ------------------------------------------------------------------
REAL
* TMP
C ------------------------------------------------------------------
CALL LIB$GETJPI(JPI$_CPUTIM,,,TIME1)
C ------------------------------------------------------------------
TMP = 0.0
DO I = 1, 1000000
TMP = TMP + 1.0 / REAL(I)
ENDDO
WRITE(*,*) ' Sum of original loop: ', TMP
C ------------------------------------------------------------------
CALL LIB$GETJPI(JPI$_CPUTIM,,,TIME2)
WRITE(*,*) ' Time of original loop: ', TIME2 - TIME1
CALL LIB$GETJPI(JPI$_CPUTIM,,,TIME1)
C ------------------------------------------------------------------
TMP = 0.0
DO I = 1, 1000000, 4
TMP = TMP + 1.0 / REAL(I) + 1.0 / REAL(I+1)
& + 1.0 / REAL(I+2) + 1.0 / REAL(I+3)
ENDDO
WRITE(*,*) ' Sum of unrolled loop: ', TMP
C ------------------------------------------------------------------
CALL LIB$GETJPI(JPI$_CPUTIM,,,TIME2)
WRITE(*,*) ' Time of unrolled loop: ', TIME2 - TIME1
C ------------------------------------------------------------------
END
This example program tests partial loop unrolling (by a factor of 4).
The only result-related I/O is the WRITE of each partial sum (added
so the compiler won't optimize all the computations away).
Note that the partial unrolling of the loop made it possible to save
some operations: we don't have to add every term of the series to the
partial-sum accumulator TMP separately. Instead of:
TMP = TMP + 1.0 / REAL(I)
TMP = TMP + 1.0 / REAL(I+1)
TMP = TMP + 1.0 / REAL(I+2)
TMP = TMP + 1.0 / REAL(I+3)
with 4 assignments, we have:
TMP = TMP + 1.0 / REAL(I) + 1.0 / REAL(I+1)
& + 1.0 / REAL(I+2) + 1.0 / REAL(I+3)
with the same number of arithmetic operations, but only one assignment.
If TMP is not kept in a register this could be significant, since
main memory is about an order of magnitude slower than CPU registers.
The same saving of "memory references" can occur in a nest of loops.
Let's generalize the previous example a bit:
      INTEGER M, N, I, J
      PARAMETER (M = ???, N = ???)
      REAL A(M)
DO J = 1, 2*N
DO I = 1, M
A(I) = A(I) + 1.0 / REAL(I+J)
END DO
END DO
For every array element A(I), 2*N stores are performed, one for each
iteration of the outer loop. Let's unroll the outer loop by a factor
of two, and perform "two iterations in one go":
DO J = 1, 2*N, 2
DO I = 1, M
A(I) = A(I) + 1.0 / REAL(I+J)
END DO
DO I = 1, M
A(I) = A(I) + 1.0 / REAL(I+J+1)
END DO
END DO
We can perform "loop fusion" on the two new inner loops:
DO J = 1, 2*N, 2
DO I = 1, M
A(I) = A(I) + 1.0 / REAL(I+J)
A(I) = A(I) + 1.0 / REAL(I+J+1)
END DO
END DO
"Fusing" the two inner statements we get:
DO J = 1, 2*N, 2
DO I = 1, M
A(I) = A(I) + 1.0 / REAL(I+J) + 1.0 / REAL(I+J+1)
END DO
END DO
We can interchange the two loops to minimize paging activity
and cache misses:
DO I = 1, M
DO J = 1, 2*N, 2
A(I) = A(I) + 1.0 / REAL(I+J) + 1.0 / REAL(I+J+1)
END DO
END DO
Here only N stores are performed to each array element.
Two remarks:
1) Interchanging the two loops from the beginning, and then
   unrolling the J-loop, would have been a shorter route.
   However, loop interchange is not always possible.
2) In any case we pay the price of making the loop nest
   more complicated.
Loop interchange
----------------
Changing the nesting order of loops is a powerful technique.
In a previous section we saw that a proper nesting order can improve
the localization of memory references. In other cases it may enable
further loop transformations.
Let's examine if it's possible to interchange simple loop nests of
the form:
INTEGER M, N, IOFF, JOFF, I, J
PARAMETER (M = ???, N = ???, IOFF = ???, JOFF = ???)
REAL A(M,N)
DO I = 1, M
DO J = 1, N
A(I,J) = some-function-of(A(I+IOFF, J+JOFF))
END DO
END DO
We have here a "self-modifying array": the loop nest scans the array
in the "wrong" way, i.e. row after row. Of course, the ranges of the
loops should be adjusted so the array references don't go out of
bounds.
Will the interchanged nest:
DO J = 1, N
DO I = 1, M
A(I,J) = some-function-of(A(I+IOFF, J+JOFF))
END DO
END DO
give the same results?
Possible values of the (IOFF, JOFF) parameters can be partitioned
into 4 regions by the lines IOFF = 0 and JOFF = 0:
IOFF = 0
|
Absolute | Reversing
past | zone
-----------+----------- JOFF = 0
Reversing | Absolute
zone | future
|
Having IOFF and JOFF in one of the "absolute" regions, the referenced
element A(I+IOFF, J+JOFF) is either already updated under both loop
orders (the "absolute past") or not yet updated under both (the
"absolute future"), so interchanging the loops gives the same results.
In the two "reversing zones" the referenced element lies in the past
for one loop order and in the future for the other, so interchanging
the loops there changes the results.
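To make the "reversing zone" case concrete, here is a minimal sketch
with the illustrative offsets IOFF = -1, JOFF = +1; the array bounds
are padded so no reference goes out of bounds:

      PROGRAM REVZON
      INTEGER M, N, I, J
      PARAMETER (M = 4, N = 4)
      REAL A(0:M, 1:N+1), B(0:M, 1:N+1)
C     Give both copies identical initial values
      DO J = 1, N+1
         DO I = 0, M
            A(I,J) = REAL(I + J)
            B(I,J) = REAL(I + J)
         END DO
      END DO
C     Original nest (I outer): A(I-1,J+1) was already updated
C     in an earlier iteration of the I-loop - the "past"
      DO I = 1, M
         DO J = 1, N
            A(I,J) = A(I-1,J+1) + 1.0
         END DO
      END DO
C     Interchanged nest (J outer): B(I-1,J+1) belongs to a
C     later iteration of the J-loop - the "future"
      DO J = 1, N
         DO I = 1, M
            B(I,J) = B(I-1,J+1) + 1.0
         END DO
      END DO
C     The two nests give different results, so interchanging
C     is invalid in the reversing zones
      WRITE (*,*) A(2,2), B(2,2)
      END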
Loop blocking (tiling)
----------------------
Cache misses and paging activity can be minimized by partitioning
large matrices into small enough rectangular blocks. This is done
by partitioning the "iteration space" into blocks.
A good example is matrix multiplication of square matrices:
      INTEGER N, I, J, K
      PARAMETER (N = ???)
      REAL A(N,N), B(N,N), C(N,N)
DO J = 1, N
DO I = 1, N
A(I,J) = 0.0
END DO
END DO
DO I = 1, N
DO J = 1, N
DO K = 1, N
A(I,J) = A(I,J) + B(I,K) * C(K,J)
END DO
END DO
END DO
This loop nest implements A = B * C. The inner K-loop computes the
dot product of the I-th row of matrix B with the J-th column of
matrix C.
On every execution of the K-loop we get references all over B and C.
We can block an ordinary loop:
DO I = 1, N
.........
END DO
by replacing it with:
DO IT = 1, N, IS
DO I = IT , MIN(N, IT+IS-1)
.........................
END DO
END DO
The outer loop goes over the "blocks", and the inner loop traverses
each block in its turn. The block size IS should be chosen so that a
block fits in the cache or in a page (whichever is smaller).
Performing blocking on all three loops:
DO IT = 1, N, IS
DO I = IT, MIN(N, IT+IS-1)
DO JT = 1, N, JS
DO J = JT, MIN(N, JT+JS-1)
DO KT = 1, N, KS
DO K = KT, MIN(N, KT+KS-1)
A(I,J) = A(I,J) + B(I,K) * C(K,J)
END DO
END DO
END DO
END DO
END DO
END DO
Here loop interchanging can be performed to yield:
DO IT = 1, N, IS
DO JT = 1, N, JS
DO KT = 1, N, KS
DO I = IT, MIN(N, IT+IS-1)
DO J = JT, MIN(N, JT+JS-1)
DO K = KT, MIN(N, KT+KS-1)
A(I,J) = A(I,J) + B(I,K) * C(K,J)
END DO
END DO
END DO
END DO
END DO
END DO
The size of each 3D block is IS * JS * KS; it should fit in the cache
or in a memory page.
This horrible loop nest has more loop initialization and control
overhead, but paging activity is minimized and the reuse of data
already in the cache is maximized.
Data Flow Analysis (DFA)
------------------------
Compilers can 'follow' the computations in a program to some degree,
and detect the points at which the variables get modified. This is
called DATA-FLOW ANALYSIS. The information gathered in this way is
used to identify safe code optimizations automatically.
Theoretical considerations limit data-flow analysis: complete DFA
seems at least as hard as the famous halting problem, a problem known
to be undecidable, i.e. one for which no general algorithm exists. So
it's probably impossible to have a general DFA analyzer - a program
that can perform complete data-flow analysis on every possible
program.
Real DFA has additional limitations that theoretical DFA may ignore:
there are some things the compiler cannot know, e.g. what value a
user will supply interactively at run time. Even more stringent
limits are usually imposed in practice to keep compilation times
reasonably short, so compilers don't even try to perform complete
DFA.
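For example, here is a sketch of the kind of inference DFA makes
possible (assuming a compiler that performs value propagation and
dead-code elimination):

      PROGRAM DFADEM
      INTEGER N
      N = 100
C     DFA can prove that N is still 100 at this point,
C     propagate that value into the test, find that the
C     condition is always false, and remove the branch
C     as dead code
      IF (N .GT. 1000) THEN
         WRITE (*,*) 'This branch is dead code'
      END IF
      END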
In the following sections you will see that good programming
practices usually help the compiler optimize your code. Efficient
code doesn't have to be ugly; on the contrary, ugly code doesn't
optimize well.
Programming practices that may help compiler optimization
---------------------------------------------------------
1) Using VARIABLE FORMATS instead of RUN-TIME FORMATS.
Variable formats are non-standard in all Fortrans, and not
widely supported either, but are a pretty cool extension.
A variable format is compiled to some degree; a run-time
format is processed by the format-control interpreter at run time.
See the chapter on formats for more information.
2) Using small simple statement functions (compiler dependent)
3) Avoiding implied-DO loops in I/O statements. You can use
   the array name instead (if the array is not passed in the
   assumed-size method); see the sketch after this list.
   Many compilers do detect cases where the implied-DO construct
   clearly identifies an ordered "chunk" of memory, just as if an
   EQUIVALENCEd array were involved, and optimize it accordingly.
   Some (like DEC ones, I believe) can even handle those that are
   "chunked" with a constant "stride" between elements
   (e.g. "WRITE (10) (A(I),I=1,10000,2)").
4) Using the system's low-level I/O routines (non-standard!)
5) Using larger I/O blocks, and pre-allocating disk space for
   a file (instead of repeatedly extending it) (non-standard!)
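A sketch of item 3 above (the unit number, file name and array size
are illustrative):

      INTEGER I, N
      PARAMETER (N = 10000)
      REAL A(N)
      OPEN (UNIT=10, FILE='CHUNK.DAT', FORM='UNFORMATTED')
      DO 10 I = 1, N
         A(I) = REAL(I)
   10 CONTINUE
C     Implied-DO form: the compiler must recognize that the
C     loop covers one contiguous "chunk" of memory
      WRITE (10) (A(I), I = 1, N)
C     Equivalent, and easier to optimize: the whole array is
C     passed by name in a single request
      WRITE (10) A
      END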
Programming practices that may hinder compiler optimizations
------------------------------------------------------------
1) Partitioning the program into procedures: the compiler may not
   perform data-flow analysis on variables and arrays that are
   passed between compilation units.
2) Declaring too many variables and arrays: the data-flow analysis
   may be limited to some predetermined number of variables.
   Such a problem may occur in very large routines or main
   programs.
3) Using unnecessary common blocks: the compiler may not perform
   data-flow analysis on variables belonging to a common block,
   especially if there is a call to another routine in which the
   same common block is declared.
4) Variables that are used in Variable Format Expressions may be
   excluded from the analysis.
5) Equivalenced variables are too much bother, so the compiler
   may not analyze them individually.
6) Using control constructs like the ASSIGNED GOTO may make some
   compilers give up further analysis.
Programming practices that inhibit compiler optimizations
---------------------------------------------------------
1) Partitioning the program into several files: the compiler can't
   perform data-flow analysis on variables and arrays that are
   passed to/from such external compilation units.
2) Declaring variables as volatile makes the compiler exclude
   them from the analysis. Volatile variables are used in
   procedures that are executed when an interrupt or exception
   seizes control (e.g. floating-point underflow handlers).
A performance detrimental FORTRAN statement
-------------------------------------------
On delimited-variable-size files, and on counted-variable-length files
that don't have a suffix count field (VMS), backspacing may be very
inefficient (see the files and records chapter).
On such files, BACKSPACE goes back to the beginning of the file and
then reads all the previous records.