Copyright (C) 1998 Timothy C. Prince
Freely distributable with acknowledgment
Typical profilers take periodic samples in which they determine which subroutine is executing at sample time. Profilers like gprof also count how many times each subroutine is invoked by each calling subroutine. They are indispensable in locating the sections of a program where performance might be improved, and in assessing the effects of changes.
Processors with hardware event counters may allow the choice of various counters to trigger sampling. Thus it may be possible to estimate the number of events such as cache or TLB misses occurring in each subroutine. If one "knows" how much time should be spent in each subroutine, timing profiles will show up locations where performance is not up to expectation.
Priority can usefully be placed on clarity of expression and verifiable results throughout the code, but overall performance is enhanced by minimizing the bulk of code which is seldom executed, while local speed is important only where measurable time is spent in execution.
Static profiling data are available on a few systems. One of the more
useful types reports on the code generated for each inner loop.
Even the attempts made by profilers like gprof to apportion the time spent in a subroutine among its callers are not reliable, as they depend on the assumption that the subroutine takes the same time on each call.
If a profiler is not able to produce the required information, one may resort to inserting timing calls in the source code. The standard timing call cpu_time() is just a renaming of the Cray-style call second(). While the definition of date_and_time() provides for timing with millisecond resolution, sub-second resolution is not always implemented, one of the problems being that it is not supported by Standard C. Resolutions of the order of 10 to 20 milliseconds, as provided by typical "wall clock" hardware, are barely adequate for test runs taking several minutes or more. Certain operating systems introduce unpredictable delays of several milliseconds in timer calls, so timing differences of 10% may not be significant.
Evidently, a sufficient body of profiled test cases with results which
are considered to be correct is a valuable starting point to assess effectiveness
of modifications made in the hope of improving speed, accuracy, or generality.
Many compilers require certain options to be set in preparation for
profiling when compiling individual subroutines or libraries. These
options insert calls to a system subroutine which counts the number of
times each routine is called. These profiler calls should record the
number of calls made by each calling subroutine, but that is likely to
fail for functions called by indirect reference. On some systems, the
profiling option for separate compilations can be left on without
detrimental effect, as the linker may be run with profiling off and
will then insert a quick-return stub for the counter call.
Some profilers require certain seemingly unrelated options to be set. Profilers which run in an interpretative (e.g. debugger) mode can gather call statistics, but do not reflect the effects of optimizations. Systems which have shared libraries may not profile properly unless the libraries are linked statically, preferably using a version of the library which has the profiling options set. Static library calls may differ in performance from shared library calls, as, for example, they may not produce as many instruction cache misses.
Typical Unix options for the effects discussed above:
-g, -G, +gprof to invoke compiling and linking with profiling on (not all systems need this)
-B STATIC or -Wl,-a,archive for preferring static libraries over shared ones
Depending on the compiler vendor's view of the way things should be done, profiling may occur automatically when set at link time, or it may be done on a normally linked executable program by running it under a profiling program. The latter approach is a good one, likely to be used where there are various ways of profiling, since the method may be selected for each test run without having to re-link. Typically, each type of cache exception or timing statistic needs a separate test run.
When a program is run with profiling, statistics are gathered and written into a file which may not survive abnormal termination. The evident reason for this is that the profiler must avoid taking time to set up this file except at the very beginning and end of the test run. Usually, there is a separate profiling analysis program (prof, gprof) which is run afterwards (not automatically) to generate a usable summary of the statistics. One reason for this is that there may be various options on how the summary is presented.
To profile an application, you need test cases which are realistic but
preferably do not take more time than necessary. In the time-consuming
applications with which this author is familiar, the time taken may be
controlled either by re-sizing the problem or by deciding to run short
of or beyond a satisfactorily converged solution. Re-sizing a problem
affects cache behavior, so if there is a normal-size problem for which
good performance is wanted, that size should be used. That leaves
setting the problem up so that it stops after a fixed number of steps,
probably well short of a satisfactory solution. Such applications
usually have a convergence history, which is indispensable for tuning:
it shows the rate at which the run is converging and, by re-running
identical cases, whether changes (possibly introducing bugs) have
affected the results.
Run-time profiler results should show a summary giving the percentage of total samples during which each subroutine is found to be running, and the estimated CPU time spent in each subroutine. In the early stages of tuning, one may wish to modify subroutines which take a great deal of time or appear to generate unreliable results, but most tuning modifications should be left for later.
The effect of various compiler options may be tested by comparing profiling results (and checking validity) for various runs. First, a determination of the best compiler options for each subroutine may be made. The profiler may be a great help here, as, for a first cut, a compiler option may be tried for a large group of subroutines to see which ones may or may not benefit. Separate compilation of subroutines is useful, to be able to use various compiler options, and to isolate as far as possible the effect of changes in each subroutine.
Profiling often shows unexpected side effects. Expanding the code in
one place (e.g. by more massive unrolling) may have adverse effects elsewhere.
This may happen through instruction or data cache misses. If the profiler
has the appropriate facility, this may be checked by invoking the various
event profilers, but otherwise it has to be inferred. In any case, the
effect of changing compile options or code in one subroutine has to be
evaluated both for local effect and for possible side effects. Compiling
little-used subroutines in a mode which minimizes the size of generated
code may produce an overall performance benefit, while changes which have
a slight favorable local effect may have a large negative global effect.
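As an illustration of the kind of code expansion meant above, here is a hand-unrolled loop sketched in C. Both routines compute the same sum; the unrolled version may run faster locally, but its bulkier generated code can displace other code from the instruction cache. The real effect can only be confirmed by the profiler.

```c
#include <stddef.h>

/* Straightforward loop: compact code, one element per iteration. */
double sum_rolled(const double *x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Unrolled by four: more code, four independent accumulators.  May be
   faster where this loop dominates, but enlarges the instruction
   footprint. */
double sum_unrolled4(const double *x, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i;
    for (i = 0; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    double s = s0 + s1 + s2 + s3;
    for (; i < n; i++)          /* remainder loop for leftover elements */
        s += x[i];
    return s;
}
```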