Copyright (C) 1998 Timothy C. Prince
Freely distributable with acknowledgment
Alpha: a Digital Equipment/Intel architecture noted for high clock
speed and long pipelines
Operating systems: VMS, NT, Unix, Linux
Usually uses 3 levels of cache. Floating point registers are double
precision only (no single). SIGN() (f95 style) is implemented in hardware,
so that ABS(x) is the same as SIGN(x,1.).
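The ABS/SIGN identity can be checked in any language that has an IEEE-style sign transfer function. A minimal sketch in Python, assuming math.copysign as a stand-in for Fortran SIGN; abs_via_sign is a hypothetical helper name used only for illustration:

```python
import math

def abs_via_sign(x):
    # Transfer the sign of +1.0 onto x: the analogue of SIGN(x,1.)
    return math.copysign(x, 1.0)

# The identity holds for positive, negative and signed-zero arguments
for x in (3.5, -3.5, 0.0, -0.0):
    assert abs_via_sign(x) == abs(x)

# f95-style sign transfer also honours the sign of a negative zero
assert math.copysign(2.0, -0.0) == -2.0
```

The last line shows the f95-style behaviour: the sign is taken from the second argument even when that argument is a negative zero.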
HP-PA (Hewlett-Packard Precision Architecture)
Operating Systems: Unix, NT
Uses single level direct mapped separate instruction and data cache,
31 double precision registers accessible as pairs of single precision registers,
plus (PA-8000) equally large set of shadow registers, and out-of-order
execution.
Has fused multiply-add with unrounded multiplication (PA8000). Add and
multiply of PA8K have 3 cycle latency, and PA8K has 2 parallel divider/sqrt
units.
PA8000 favors strongly:
stride 1 inner loops
extensive loop fusion and 2 level unrolling
invariant-if code movement
expression ordering for multiply-add chaining and parallelism
detailed testing of best unrolling for each loop
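The items on stride 1 access, unrolling and multiply-add parallelism can be sketched on a dot product. This is only an illustration of the source-level transformation, written in Python for brevity; it says nothing about speed in Python itself. The function name dot_unrolled2 and the unroll factor of 2 are assumptions, and the best factor would have to be found by testing each loop, as noted above.

```python
def dot_unrolled2(a, b):
    # Two independent accumulators: each multiply-add chain depends only on
    # its own previous result, so the two chains can overlap in the pipeline.
    s0 = 0.0
    s1 = 0.0
    n = len(a)
    n2 = n - (n % 2)               # largest even count not exceeding n
    for i in range(0, n2, 2):      # consecutive (stride 1) elements
        s0 = s0 + a[i] * b[i]
        s1 = s1 + a[i + 1] * b[i + 1]
    if n2 < n:                     # remainder element when n is odd
        s0 = s0 + a[n2] * b[n2]
    return s0 + s1
```

Because the partial sums are combined in a different order than in a simple loop, the rounded result can differ in the last bits from the straightforward sum.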
f90 support is a bit weak, and may remain so until a unified
Intel/HP architecture comes to market.
Pentium Pro/II (Intel)
Operating systems: all Microsoft, Linux, Unix
Has small but effective 2 level cache with separate 8K instruction and
data at level 1, 64-bit access to cache depending on data alignment.
Floating point registers are 80-bit double extended to protect accuracy of
operations such as complex arithmetic, log(), and ASCII to binary conversion.
The 8 programmable registers are backed by 32 shadow registers and out-of-order
execution. Latencies are 2 cycles add, 5 multiply, with multiply
allowed to initiate only every other cycle. Many compiler vendors are active,
with excellent f90 support, fast compiling, but lagging in optimization.
Some weaknesses are slow dynamic memory allocation and sloppy timing in
Microsoft OS, particularly when using the gnu ports like cygnus.
MIPS/SGI R10000
Operating System: Unix
Large 2 level cache, separate 32K level 1 instruction and data,
31 floating point registers with no extended precision, equal set of shadow
registers and out-of-order execution. Fused multiply-add of earlier
models is replaced by IEEE-compliant single instruction multiply and add
with full rounding. Latencies are 2 cycles for multiply and for add.
Hardware cache miss and other event counters are used in the profiler,
which is available for use with any compiler. EPC and SGI both market
compilers. SGI's compilers use software pipelining rather than the more
common group unrolling. Recent compilers use pseudo-instructions
extensively in apparent readiness for future models which may have direct
paths between integer and floating point registers.
Weaknesses: SGI compilers tend to generate excessive spills rather than
using shadow registers effectively. Default compiler options are
somewhat strange; -OPT:IEEE_comparisons=ON:fold_reassociate=OFF is required
to avoid unexpected re-interpretation of source code.
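Why reassociation and relaxed comparisons amount to a re-interpretation of the source can be seen from a small example. This is a sketch in Python, assuming (from the option names only) that the concern is regrouping of floating point sums and non-IEEE treatment of comparisons; the function names are hypothetical.

```python
def sum_as_written(a, b, c):
    # Evaluation order as the programmer wrote it
    return (a + b) + c

def sum_reassociated(a, b, c):
    # Algebraically identical regrouping a compiler might choose
    return a + (b + c)

a, b, c = 1.0e20, -1.0e20, 1.0
assert sum_as_written(a, b, c) == 1.0     # the large terms cancel first
assert sum_reassociated(a, b, c) == 0.0   # the 1.0 is absorbed by -1.0e20

# IEEE comparisons: a NaN is unordered, so every ordered test is false
nan = float("nan")
assert nan != nan
assert not (nan < 1.0) and not (nan >= 1.0)
```

A compiler that assumes "not (x < y)" implies "x >= y" gives a different answer from the IEEE rules whenever a NaN is involved.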
Strengths: few latency bottlenecks in hardware, some good compilers
available along with the bad, reputation for excellent graphics (including
ready-to-run free Internet availability) and multi-processor shared memory
systems.
PowerPC, an architecture developed by IBM, Motorola and Apple
Operating systems: Unix (IBM only), MacOS and Linux (Apple only)
Normally uses 2 levels of cache, separate level 1 data and instruction
cache. Floating point register format is double precision only, with
32 program accessible registers and 4 "rename" (shadow) registers to support
out-of-order execution. When not compiling in backward compatibility
mode, there are special instructions to support MAX/MIN (used in g77) and
SIGN (not in g77). The mass produced models lack hardware sqrt()
and integer to double precision conversion. 603 style models are
slow in double precision and integer multiplication.
There are fused multiply/add instructions without intermediate rounding;
typically they are not used by compilers in the common A - B*C situation,
possibly because this use would not be compatible with directed rounding.
Choice of compilers is limited, and the future of this architecture is
uncertain.
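The effect of omitting the intermediate rounding in the A - B*C case mentioned for PowerPC can be shown numerically. The sketch below does not call any hardware instruction; it emulates the fused operation with exact rational arithmetic from Python's fractions module, rounding once at the end. The function names and the test values are assumptions chosen to make the difference visible.

```python
from fractions import Fraction

def fused_a_minus_bc(a, b, c):
    # Exact rational A - B*C, rounded once on conversion back to float
    # (emulates a fused multiply-add without intermediate rounding)
    return float(Fraction(a) - Fraction(b) * Fraction(c))

def separate_a_minus_bc(a, b, c):
    p = b * c          # product rounded to double precision
    return a - p       # difference rounded a second time

b = c = 1.0 + 2.0**-27
a = 1.0 + 2.0**-26

# The exact product is 1 + 2**-26 + 2**-54; the last term is lost when the
# product is rounded, so the separately rounded result is exactly zero.
assert separate_a_minus_bc(a, b, c) == 0.0
assert fused_a_minus_bc(a, b, c) == -(2.0**-54)
```

The two results differ only in the last bits of the product, but a difference of this kind can change the sign of a quantity that is expected to be exactly zero.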
SPARC: Sun and Sun-licensed manufacturers
Operating systems: Unix, Linux
32 double precision registers accessible as pairs of single precision
registers.
Variety of compilers available, with long-standing gnu connection.