Copyright (C) 1998 Timothy C. Prince
Freely distributable with acknowledgment
Typical profilers take periodic samples in which they determine which subroutine is executing at sample time. Profilers like gprof also count how many times each subroutine is invoked by each calling subroutine. They are indispensable in locating the sections of a program where performance might be improved, and in assessing the effects of changes.
Processors with hardware event counters may allow the choice of various counters to trigger sampling. Thus it may be possible to estimate the number of events such as cache or TLB misses occurring in each subroutine. If one "knows" how much time should be spent in each subroutine, timing profiles will show up locations where performance is not up to expectation.
Priority can usefully be placed on clarity of expression and verifiable results throughout the code, but overall performance is enhanced by minimizing the bulk of code which is seldom executed, while local speed is important only where measurable time is spent in execution.
Static profiling data are available on a few systems. One of the more
useful types reports on the code generated for each inner loop.
Even the attempts made by profilers like gprof to apportion the time spent in a subroutine among its callers are not reliable, as they depend on the assumption that the subroutine takes the same time on each call.
If a profiler is not able to produce the required information, one may resort to inserting timing calls in the source code. The standard timing call cpu_time() is just a renaming of the Cray-style call second(). While the definition of date_and_time() provides for timing with millisecond resolution, sub-second resolution is not always implemented, one of the problems being that it is not supported by Standard C. Resolutions of the order of 10 to 20 milliseconds, as provided by typical "wall clock" hardware, are barely adequate for test runs taking several minutes or more. Certain operating systems introduce unpredictable delays of several milliseconds in timer calls, so timing differences of 10% may not be significant.
Evidently, a sufficient body of profiled test cases with results which
are considered to be correct is a valuable starting point to assess effectiveness
of modifications made in the hope of improving speed, accuracy, or generality.
Many compilers require certain options to be set in preparation for
profiling when compiling individual subroutines or libraries. These
options insert calls to a system subroutine which counts the number of
times each routine is called. These profiler calls should record the
number of calls made by each calling subroutine, but that is likely to
fail for functions called by indirect reference. On some systems, the
profiling option for separate compilations can be left on without
detrimental effect, as the linker may be run with profiling off and
will then insert a quick-return stub for the counter call.
Some profilers require certain seemingly unrelated options to be set. Profilers which run in an interpretative (e.g. debugger) mode can gather call statistics, but do not reflect the effects of optimizations. Systems which have shared libraries may not profile properly unless the libraries are linked statically, preferably using a version of the library which has the profiling options set. Static library calls may differ in performance from shared library calls, as, for example, they may not produce as many instruction cache misses.
Typical Unix options for the effects discussed above:
-g, -G, +gprof to invoke compiling and linking with profiling on (not all systems need this)
-B STATIC or -Wl,-a,archive for preferring static libraries over shared ones
Depending on the compiler vendor's view of the way things should be done, profiling may occur automatically when set at link time, or it may be done on a normally linked executable program by running it under a profiling program. The latter approach is a good one, likely to be used where there are various ways of profiling, since the method may be selected for each test run without having to re-link. Typically, each type of cache exception or timing statistic needs a separate test run.
When a program is run with profiling, statistics are gathered and written into a file which may not survive abnormal termination. The evident reason for this is that the profiler must avoid taking time to set up this file except at the very beginning and end of the test run. Usually, there is a separate profiling analysis program (prof, gprof) which is run afterwards (not automatically) to generate a usable summary of the statistics. One reason for this is that there may be various options on how the summary is presented.
To profile an application, you need test cases which are realistic but
preferably do not take more time than necessary. In the time-consuming
applications with which this author is familiar, the time taken may be
controlled either by re-sizing the problem or by deciding to run short
of or beyond a satisfactorily converged solution. Re-sizing a problem
affects cache behavior, so if there is a normal-size problem for which
good performance is wanted, that size should be used. That leaves
setting the problem up so that it stops after a fixed number of steps,
probably well short of a satisfactory solution. Such applications
usually have a convergence history, which is indispensable for tuning:
it shows the rate at which the run is converging and, by re-running
identical cases, whether changes (possibly introducing bugs) have
affected the results.
Run-time profiler results should show a summary giving the percentage of total samples during which each subroutine is found to be running, and the estimated CPU time spent in each subroutine. In the early stages of tuning, one may wish to modify subroutines which take a great deal of time or appear to generate unreliable results, but most tuning modifications should be left for later.
The effect of various compiler options may be tested by comparing profiling results (and checking validity) for various runs. First, a determination of the best compiler options for each subroutine may be made. The profiler may be a great help here, as, for a first cut, a compiler option may be tried for a large group of subroutines to see which ones may or may not benefit. Separate compilation of subroutines is useful, to be able to use various compiler options, and to isolate as far as possible the effect of changes in each subroutine.
Profiling often shows unexpected side effects. Expanding the code in
one place (e.g. by more massive unrolling) may have adverse effects elsewhere.
This may happen through instruction or data cache misses. If the profiler
has the appropriate facility, this may be checked by invoking the various
event profilers, but otherwise it has to be inferred. In any case, the
effect of changing compile options or code in one subroutine has to be
evaluated both for local effect and for possible side effects. Compiling
little-used subroutines in a mode which minimizes the size of generated
code may produce an overall performance benefit, while changes which have
a slight favorable local effect may have a large negative global effect.
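As an illustration of the kind of code expansion meant above, here is a hand-unrolled loop sketched in C. Both routines compute the same sum; the unrolled version may run faster locally, but its bulkier generated code can displace other code from the instruction cache. The real effect can only be confirmed by the profiler.

```c
#include <stddef.h>

/* Straightforward loop: compact code, one element per iteration. */
double sum_rolled(const double *x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Unrolled by four: more code, four independent accumulators.  May be
   faster where this loop dominates, but enlarges the instruction
   footprint. */
double sum_unrolled4(const double *x, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i;
    for (i = 0; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    double s = s0 + s1 + s2 + s3;
    for (; i < n; i++)          /* remainder loop for leftover elements */
        s += x[i];
    return s;
}
```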