8.3 Optimization Patterns
We now turn our attention toward patterns for optimization. The
techniques presented here serve two purposes. The first is to explain
how some of the most common optimizations work, so that you can
understand whether a specific compiler optimization flag is likely to
help performance. The second is to enable you to, in some cases,
apply these optimizations by hand. This is sometimes necessary when
the compiler cannot prove that an optimization is safe. This does
not mean that you should apply these optimizations everywhere!
Compilers are usually quite reliable. Careful use of timing
techniques in critical sections of code will help you identify
cases where the compiler could use some help.
We cannot present every optimization, so we'll focus
on the most common and important ones. Generally, these optimizations
fall into a handful of categories: optimizations on arithmetic, on
loops, and on string operations.
These optimizations will be
presented in C. If you are interested in Java tuning, please consult
Java Performance Tuning by Jack Shirazi
(O'Reilly). IBM has also published an excellent summary of this
topic, the Optimization and Tuning Guide for Fortran, C, and C++.
8.3.1 Arithmetic
One of the most fundamental
arithmetic optimizations is to replace a sequence of operations by a
sequence of equivalent, less complex operations. For example, the
compiler might replace a line of code like:
x = pow (y, 5);
with a line like:
x = y * y * y * y * y;
because in many cases repeated
multiplication is less expensive than exponentiation. This is called
strength reduction. It is by far the most
common arithmetic optimization.
Another common example of strength reduction is replacing a division
with a multiplication-by-inverse. For example, you could replace this
line of code:
x = y / z;
with a line like:
x = y * (1 / z);
and see a performance increase simply because, on many platforms,
inverting and multiplying is much faster than dividing. Note that
this transformation is only meaningful for floating-point operands
(for integers, 1 / z truncates to zero for any z greater than 1),
and even for floating-point values the result may differ in the
last bit of rounding, so apply it only where that is acceptable.
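This substitution pays off most when many values are divided by the
same divisor, because the single division can be hoisted out of the
loop. A minimal sketch (the function name scale_by is ours, chosen
for illustration):

```c
#include <stddef.h>

/* Divide every element of v by z.  Computing the reciprocal once
 * and then multiplying replaces n divisions with one division and
 * n multiplications. */
void scale_by(double *v, size_t n, double z)
{
    double inv = 1.0 / z;           /* the only division */
    for (size_t i = 0; i < n; i++)
        v[i] = v[i] * inv;
}
```

For a single division the transformation buys little; the win comes
from reusing inv across many elements.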
Another good idea in mathematics-intensive codes is to link against
your vendor's high-performance mathematics library,
if one is available. This step can produce fantastic performance
increases by calling highly tuned assembly-language routines for
basic math functions.
8.3.2 Loops
Please note that all of these loop
optimizations are performed automatically by a modern optimizing
compiler. However, compilers are required to be somewhat conservative
(the emphasis is on generating
"correct" code, rather than fast
code); using some of these techniques manually can help the compiler
find opportunities for other, more complex optimizations.
Loop fusion is the process of combining the
statements inside several loops into a single loop.
Before:

for (i = 0; i <= 1024; i++) {
    x = x * a[i] * b[i];
}
for (j = 0; j <= 1024; j++) {
    y = y * a[j] * b[j];
}

After:

for (i = 0; i <= 1024; i++) {
    x = x * a[i] * b[i];
    y = y * a[i] * b[i];
}
This technique not only halves the loop overhead, it also creates
better opportunities for the compiler to overlap instructions to take
advantage of parallelism in the underlying microprocessor. You must
exercise care, however, as this optimization has the possibility of
decreasing performance by increasing the cache miss rate (for
example, when each loop accesses a significant portion of the data
cache). It can also be complicated by data dependence: you must be
sure that no iteration of the second loop relies on a result that a
later iteration of the first loop produces.
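To make the data-dependence hazard concrete, here is a hypothetical
pair of loops (the helper names unfused and fused_wrong and the
array size are ours) in which the second loop reads an element the
first loop has not yet written at the same iteration, so naive
fusion silently computes with stale data:

```c
#include <stddef.h>

#define N 8

/* Correct, unfused version: a[] is fully written before it is read. */
void unfused(int a[N], const int b[N], int c[N])
{
    for (size_t i = 0; i < N; i++)
        a[i] = b[i] + 1;
    for (size_t i = 0; i + 1 < N; i++)
        c[i] = a[i + 1] * 2;        /* reads ahead of the write index */
}

/* Naively fused version: at iteration i, a[i + 1] has not been
 * written yet, so c[i] is computed from whatever was in a[] before. */
void fused_wrong(int a[N], const int b[N], int c[N])
{
    for (size_t i = 0; i + 1 < N; i++) {
        a[i] = b[i] + 1;
        c[i] = a[i + 1] * 2;        /* BUG: a[i + 1] is still stale */
    }
}
```

A compiler will refuse to fuse loops like these; when fusing by
hand, you must perform the same dependence check yourself.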
Floating invariant flow control moves flow
control statements that do not change between iterations of a loop
outside of the loop.
Before:

for (i = 0; i <= 1024; i++) {
    for (j = 0; j <= 1024; j++) {
        if (a[i] >= 64) {
            b[i] = -a[i];
        }
        x = x + a[j] - b[j];
    }
}

After:

for (i = 0; i <= 1024; i++) {
    if (a[i] >= 64) {
        b[i] = -a[i];
    }
    for (j = 0; j <= 1024; j++) {
        x = x + a[j] - b[j];
    }
}
Most compilers do this automatically, but they may not be able to
when the loop contains calls to other procedures, or in complex
loops where the invariance cannot easily be determined.
Loop unrolling replicates the functional part
of the loop while reducing the iteration count proportionally.
Before:

for (i = 0; i <= 1024; i++) {
    for (j = 0; j < 4; j++) {
        a[i] = b[i][j];
    }
}

After:

for (i = 0; i <= 1024; i++) {
    a[i] = b[i][0];
    a[i] = b[i][1];
    a[i] = b[i][2];
    a[i] = b[i][3];
}
This is primarily of use to generate more opportunities for
scheduling parallelism in the compiler and microprocessor.
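The example above is deliberately minimal; in practice, unrolling is
usually combined with several independent accumulators and a cleanup
loop for the leftover iterations. A sketch (sum_unrolled is our
name) of a four-way unrolled summation:

```c
#include <stddef.h>

/* Sum n ints with the loop body unrolled four ways.  The four
 * independent partial sums remove the serial dependence on a
 * single accumulator. */
long sum_unrolled(const int *v, size_t n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i;

    for (i = 0; i + 4 <= n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < n; i++)              /* cleanup: leftover iterations */
        s0 += v[i];

    return s0 + s1 + s2 + s3;
}
```

The four partial sums give the processor independent additions that
it can overlap, which is exactly the scheduling opportunity the text
describes.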
Stride refers to the relationship between the
layout of an array's elements in memory and the
order in which they are processed. For example, an array of 32-bit
integers is accessed at stride 1 if they are accessed by
monotonically increasing the value of the rightmost subscript.
Before:

for (i = 0; i <= 1024; i++) {
    for (j = 0; j <= 16; j++) {
        a[j] = b[j][i];
    }
}

After:

for (j = 0; j <= 16; j++) {
    for (i = 0; i <= 1024; i++) {
        a[j] = b[j][i];
    }
}
The lower the stride, the better the performance, because
low-stride access makes the most effective use of the cache
hierarchy.
8.3.3 Strings
The runtime of the standard C
library string functions is proportional to the lengths of their
operands. It's quite common for loops over calls to these
functions to consume a significant amount of CPU time. There are a
few simple optimizations you can apply to reduce this effect:
Instead of calling strlen( ) in a loop involving
the string itself, you can rewrite the loop so that you set a
temporary value equal to strlen( ) before the
loop, and then modify the loop as you add and remove characters.
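For example, a hypothetical helper (append_char is our name) that
grows a string one character at a time can maintain the length
itself, so strlen( ) is called once before the loop rather than on
every pass:

```c
#include <string.h>

/* Append a character to s, tracking the length in *len instead of
 * calling strlen() on every iteration.  The caller guarantees that
 * s has room for at least *len + 2 bytes. */
void append_char(char *s, size_t *len, char c)
{
    s[(*len)++] = c;
    s[*len] = '\0';
}
```

Each append is now constant time, independent of how long the
string has grown.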
If you are testing whether a string is empty, strlen(
) is probably not the fastest way to do it. If the string
is of substantial size, strlen( ) will scan every
character until it finds the terminating null. You can replace the
call with *s == '\0', which checks
only the first character and also saves a function call.
Similarly, if you want to clear a string, *s =
'\0' is much faster than strcpy(s,
"").
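Both tricks can be wrapped in trivial helpers (is_empty and clear
are our names) that touch only the first byte of the string:

```c
/* Test for and produce empty strings in constant time, without
 * scanning the string at all. */
int is_empty(const char *s)
{
    return *s == '\0';              /* look at the first byte only */
}

void clear(char *s)
{
    *s = '\0';                      /* one store, versus strcpy(s, "") */
}
```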
strcat( ) scans the full length of the destination string on
each call. If you've been keeping track of the
length of the string as you go along (as described regarding
strlen( ) previously), you have a ready-made
index to the end of the string, and can use strcpy(
) or memcpy( ) from there.
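A sketch of this approach (append_str is our name): with the
destination length tracked by the caller, each append is a single
memcpy( ) to a known offset rather than a rescan of the whole
destination:

```c
#include <string.h>

/* Append src to dst, using a tracked length instead of letting
 * strcat() rescan dst from the beginning on every call.  The
 * caller guarantees dst has room for the result. */
void append_str(char *dst, size_t *dst_len, const char *src)
{
    size_t src_len = strlen(src);
    memcpy(dst + *dst_len, src, src_len + 1);   /* +1 copies the '\0' */
    *dst_len += src_len;
}
```

Repeated appends now cost time proportional to the text being
added, not to the text already accumulated.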
You might be able to save a little bit of time with a
strcmp( ) between two strings containing natural
language by checking if the first characters of the strings are equal
before doing the call to strcmp( ). Because of
the statistically nonuniform distribution of letters in natural
languages, the chances of a match are a bit better than would be
otherwise expected. However, this optimization can be
counterproductive, so be aware of the environment
you're working in (e.g., if you are comparing
adjacent strings in sorted output, this will not buy you anything).
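For an equality test, a minimal version of this idea (str_equal is
our name) compares the first bytes inline and falls through to
strcmp( ) only when they match:

```c
#include <string.h>

/* Equality test that checks the first bytes before paying for the
 * strcmp() call.  Worthwhile only when most comparisons fail on
 * the first character. */
int str_equal(const char *a, const char *b)
{
    return *a == *b && strcmp(a, b) == 0;
}
```

The && short-circuits, so the function call is skipped whenever the
first characters differ.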