1-10 CODE OPTIMIZATION - COMPILER
*********************************
(Thanks to Craig Burley for the excellent comments.
Thanks to Timothy Prince for the note on architectures with
Instruction Level Parallelism)
Optimization techniques used by compilers may inspire good and
efficient programming practices and are interesting in their
own right.
Before trying to perform 'hand optimization' please note the
following points:
1) Remember that the compiler performs such optimizations anyway,
so the benefit of doing them manually may be small or even negative!
2) Profile First!! (A chapter on profiling will be added)
Programmers who are learning this arcane art should certainly
play around with all the techniques on "make-believe" code,
but should NOT waste their time (and, especially, risk
introducing bugs) by optimizing any _real_ code until after
they've gotten _elegant, correct_ code working and profiled
to discover what areas need optimizing (and have test suites
to use to validate new optimized versions).
Performing operations at compile-time (if possible)
---------------------------------------------------
Computations and type conversions involving constants, as well as the
computation of addresses of array elements with constant indexes, can
already be performed by the compiler.
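For example (an illustrative snippet, not from the original text), all
the arithmetic in the following assignment can be done at compile time:
      PROGRAM FOLD
      REAL X
      REAL ARRAY(10)
! The division, the addition and the conversion of the integer
! constant 2 to REAL can all be done by the compiler, which may
! store only the final constant 47.0.
      X = 360.0 / 8.0 + 2
! The address of ARRAY(3) is also a compile-time constant.
      ARRAY(3) = X
      WRITE (*,*) X, ARRAY(3)
      END PROGRAM FOLD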
Value propagation
-----------------
Tracing the VALUE of a variable that is known at compile time, the
compiler can replace uses of that variable with the value itself, and
then apply further optimizations to the resulting constant expressions.
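A small illustrative example (the variable names are ours, not from
the original text):
      PROGRAM VALPROP
      REAL ARRAY(100)
      INTEGER N, M
      N = 10
! The compiler can trace the value of N, deduce that M is always 20,
! and treat ARRAY(M) as ARRAY(20), i.e. as a fixed address.
      M = N * 2
      ARRAY(M) = 0.0
      WRITE (*,*) ARRAY(M)
      END PROGRAM VALPROP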
Inlining small functions
------------------------
Repeatedly inserting the function code instead of calling it saves the
calling overhead and enables further optimizations. Inlining large
functions, on the other hand, will make the executable too large.
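A sketch of a function small enough to be a natural candidate for
inlining (the function and the names are illustrative only):
      PROGRAM INLDEMO
      REAL P, Q, X
      P = 1.0
      Q = 3.0
! At this call site the compiler may substitute the body of AVG2,
! effectively producing  X = 0.5 * (P + Q)  with no call overhead,
! and then optimize the surrounding code further.
      X = AVG2(P, Q)
      WRITE (*,*) X
      CONTAINS
      REAL FUNCTION AVG2(A, B)
      REAL A, B
      AVG2 = 0.5 * (A + B)
      END FUNCTION AVG2
      END PROGRAM INLDEMO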
Code hoisting
-------------
Moving as many computations as possible outside loops saves computing
time. In the following example (2.0 * PI) is an invariant expression
that there is no reason to recompute 100 times.
DO I = 1, 100
ARRAY(I) = 2.0 * PI * I
ENDDO
Introducing a temporary variable 't', the loop can be transformed to:
t = 2.0 * PI
DO I = 1, 100
ARRAY(I) = t * I
ENDDO
Dead store elimination
----------------------
If the compiler detects variables that are never used, it may safely
ignore many of the operations that compute their values.
Such operations can't be ignored if there are (non-intrinsic) function
calls involved; those functions have to be called because of their
possible side effects. Remember that before Fortran 95, Fortran didn't
have the concept of a "pure" function.
Programs used as performance tests, which perform no 'real' computations,
should be written so that they are not 'completely optimized out';
writing the 'results' to screen/file may be enough to fool the compiler.
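A sketch of a timing loop protected in this way (illustrative code,
not from the original text):
      PROGRAM BENCH
      INTEGER I
      REAL S
      S = 0.0
      DO I = 1, 1000000
         S = S + SIN(REAL(I))
      ENDDO
! Writing the 'result' keeps S live, so the compiler cannot discard
! the loop above as a computation whose value is never used.
      WRITE (*,*) 'checksum = ', S
      END PROGRAM BENCH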
Strength reduction
------------------
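Strength reduction replaces a costly operation with a cheaper one that
produces the same result, e.g. an exponentiation with a multiplication
(X**2 becomes X*X), or a multiplication by the loop index with a running
addition. A minimal sketch (illustrative code, not from the original text):
      PROGRAM STRENGTH
      REAL ARRAY(100)
      INTEGER I, K
! Instead of evaluating K = 4 * I on every iteration, the compiler
! may keep a running sum and simply add 4 each time round the loop.
      K = 0
      DO I = 1, 100
         K = K + 4
         ARRAY(I) = REAL(K)
      ENDDO
      WRITE (*,*) ARRAY(100)
      END PROGRAM STRENGTH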
Taking advantage of the machine architecture
--------------------------------------------
A simple example; the subject is clearly too machine dependent and
highly technical for more than that:
Register operations are much faster than memory operations, so all
compilers try to keep heavily used data, such as temporary variables
and array indexes, in registers.
To facilitate such 'register scheduling' the largest sub-expressions
may be computed before the smaller ones.
Eliminating common sub-expressions
----------------------------------
This is an old optimization trick that compilers are able to
perform quite well:
X = A * LOG(Y) + (LOG(Y) ** 2)
We introduce an explicit temporary variable t:
t = LOG(Y)
X = A * t + (t ** 2)
We saved one 'heavy' function call by eliminating the common
sub-expression LOG(Y); now we can also save the exponentiation with:
X = (A + t) * t
which is much better.
The compiler may do all of this automatically, so don't waste too
much energy on such transformations.
A classic example - computing the value of a polynomial
-------------------------------------------------------
Eliminating Common Subexpressions may inspire good algorithms like
the classic 'Horner's rule' for computing the value of a polynomial.
y = A + B*x + C*(x**2) + D*(x**3) (canonical form)
It is more efficient (i.e. executes faster) to perform the two
exponentiations by converting them to multiplications; in this way
we get 3 additions and 5 multiplications in all.
The following forms are more efficient to compute: they require fewer
operations, and the operations that are saved are the 'heavy' ones
(a multiplication takes much more CPU time than an addition).
Stage #1:
y = A + (B + C*x + D*(x**2))*x
Stage #2 (and last):
y = A + (B + (C + D*x)*x)*x
The last form requires 3 additions and only 3 multiplications!
The algorithm hinted at here can be implemented with one loop to compute
a polynomial of arbitrary order (see the sketch below). It may also be
numerically better than direct computation of the canonical form.
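A sketch of the one-loop implementation, assuming the coefficients are
stored in an array COEF(0:N) with the constant term first (the names
COEF, N and HORNER are ours, not from the original text):
      REAL FUNCTION HORNER(COEF, N, X)
      INTEGER N, I
      REAL COEF(0:N), X
! Start from the highest-order coefficient and repeatedly multiply
! by X and add the next one: N multiplications and N additions for
! a polynomial of order N.
      HORNER = COEF(N)
      DO I = N - 1, 0, -1
         HORNER = HORNER * X + COEF(I)
      ENDDO
      END FUNCTION HORNER
For the cubic above, with COEF(0) = A, COEF(1) = B, COEF(2) = C and
COEF(3) = D, the call Y = HORNER(COEF, 3, X) evaluates the polynomial.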
Note:
On architectures with Instruction Level Parallelism the fastest way
is: A + B*x + (x**2)*(C + D*x). Adding parentheses, as in
A + (B*x + (x**2)*(C + D*x)), will improve accuracy in the case where
A is the largest term.
+-------------------------------------------------+
| CHECK THE CODE FROM A NUMERICAL POINT OF VIEW |
+-------------------------------------------------+