8.3 Optimization Patterns





We now turn our attention toward patterns for optimization. The techniques presented here serve two purposes. The first is to explain how some of the most common optimizations work, so that you can understand whether a specific compiler optimization flag is likely to help performance. The second is to enable you to, in some cases, apply these optimizations by hand. This is sometimes necessary, when the compiler is not able to make assumptions about whether an optimization is safe.[8] This does not mean that you should apply these optimizations everywhere! The compilers are usually pretty reliable. Careful use of timing techniques in critical code flow sections will help you identify cases where the compiler could use some help.

[8] By default, compilers are pessimistic about whether certain optimizations are permissible. Humans are often able to look at a piece of code and determine, by visual inspection and a thorough understanding of the code flow, whether a given optimization is safe.





We cannot present every optimization, so we'll focus on the most common and important ones. Generally, these optimizations fall into a handful of categories: optimizations on arithmetic, on loops, and on string operations.







These optimizations will be presented in C. If you are interested in Java tuning, please consult Java Performance Tuning by Jack Shirazi (O'Reilly). IBM has written an excellent summary text on this topic, called the Optimization and Tuning Guide for Fortran, C, and C++.







8.3.1 Arithmetic





One of the most fundamental arithmetic optimizations is to replace a sequence of operations by a sequence of equivalent, less complex operations. For example, the compiler might replace a line of code like:

x = pow (y, 5);

with a line like:

x = y * y * y * y * y;

because in many cases repeated multiplication is less expensive than exponentiation. This is called strength reduction. It is by far the most common arithmetic optimization.
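
If you are applying this by hand, you can sometimes do better than the naive expansion. Here is a minimal sketch (y2 is an illustrative temporary): repeated squaring computes the same fifth power with three multiplications instead of four:

y2 = y * y;          /* y squared: one multiplication */
x  = y2 * y2 * y;    /* (y^2)^2 * y = y^5: two more, three in total */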





Another common example of strength reduction is replacing a division with a multiplication-by-inverse. For example, you could replace this line of code:

x = y / z;

with a line like:

x = y * (1 / z);

and see a performance increase simply because, on many platforms, multiplying is much faster than dividing. The payoff is greatest when the inverse is computed once and reused many times, as in a loop; applied to a single statement, as above, you still pay for one division. Note also that this transformation can change floating-point rounding slightly, which is one reason compilers will not always make it for you.
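
Here is a minimal sketch of the profitable form, with the inverse hoisted out of a loop (the array names are illustrative); each iteration now multiplies instead of divides:

inv = 1.0 / z;                 /* one division, done once */
for (i = 0; i <= 1024; i++) {
    a[i] = b[i] * inv;         /* a cheap multiply per iteration */
}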





Another good idea in mathematics-intensive codes is to link against your vendor's high-performance mathematics library, if one is available. This step can produce fantastic performance increases by calling highly tuned assembly-language routines for basic math functions.
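
For example (the exact flags vary by vendor, so treat this command line as an assumption to check against your compiler's documentation), Sun's compilers can substitute their optimized math library via the -xlibmopt flag:

cc -xO4 -xlibmopt -o myprog myprog.c -lm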









8.3.2 Loops





Please note that all of these loop optimizations are performed automatically by a modern optimizing compiler. However, compilers are required to be somewhat conservative (the emphasis is on generating "correct" code, rather than fast code); using some of these techniques manually can help the compiler find opportunities for other, more complex optimizations.





Loop fusion is the process of combining the statements inside several loops into a single loop.







Untuned:

for (i = 0; i <= 1024; i++) {
    x = x * a[i] * b[i];
}
for (j = 0; j <= 1024; j++) {
    y = y * a[j] * b[j];
}

Tuned:

for (i = 0; i <= 1024; i++) {
    x = x * a[i] * b[i];
    y = y * a[i] * b[i];
}






This technique not only halves the loop overhead, it also creates better opportunities for the compiler to overlap instructions to take advantage of parallelism in the underlying microprocessor. You must exercise care, however, as this optimization has the possibility of decreasing performance by increasing the cache miss rate (for example, when each loop accesses a significant portion of the data cache). It also can be complicated by problems of data dependence; you must be sure that the work done in the second loop is not reliant on the work finished by the first loop.
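
As a minimal sketch of such a dependence (f( ) is a hypothetical function), fusing these two loops would be incorrect, because each iteration of the second loop reads a value the first loop produces one iteration later:

for (i = 0; i <= 1024; i++) {
    a[i] = f(i);            /* first loop fills a[] */
}
for (i = 0; i < 1024; i++) {
    b[i] = a[i + 1];        /* reads a[i + 1]: a fused loop would
                               not have computed it yet */
}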





Floating invariant flow control moves flow control statements that do not change between iterations of a loop outside of the loop.







Untuned:

for (i = 0; i <= 1024; i++) {
    for (j = 0; j <= 1024; j++) {
        if (a[i] >= 64) {
            b[i] = -a[i];
        }
        x = x + a[j] - b[j];
    }
}

Tuned:

for (i = 0; i <= 1024; i++) {
    if (a[i] >= 64) {
        b[i] = -a[i];
    }
    for (j = 0; j <= 1024; j++) {
        x = x + a[j] - b[j];
    }
}






Most compilers do this automatically, but they may not be able to if the loop contains calls to other procedures, or in complex loops where the invariance cannot easily be determined.
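
For example, in a variant of the loop above where the test becomes an opaque call (threshold( ) is a hypothetical function), the compiler must assume the call could return a different answer on every iteration, or have side effects, and so cannot float it out without help:

for (j = 0; j <= 1024; j++) {
    if (threshold(a[i])) {    /* opaque call: not provably invariant */
        b[i] = -a[i];
    }
    x = x + a[j] - b[j];
}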





Loop unrolling replicates the functional part of the loop while reducing the iteration count proportionally.







Untuned:

for (i = 0; i <= 1024; i++) {
    for (j = 0; j <= 3; j++) {
        a[i] = b[i][j];
    }
}

Tuned:

for (i = 0; i <= 1024; i++) {
    a[i] = b[i][0];
    a[i] = b[i][1];
    a[i] = b[i][2];
    a[i] = b[i][3];
}






This is primarily of use to generate more opportunities for scheduling parallelism in the compiler and microprocessor.
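
To see why, consider a reduction. In a minimal sketch like this (the names are illustrative, and n is assumed to be a multiple of 4 to keep it short), unrolling with several independent accumulators breaks the serial chain of additions, letting the processor work on several sums at once:

double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
for (i = 0; i < n; i += 4) {
    s0 += a[i];            /* the four accumulators are independent, */
    s1 += a[i + 1];        /* so these additions can proceed in      */
    s2 += a[i + 2];        /* parallel rather than waiting on one    */
    s3 += a[i + 3];        /* running total                          */
}
sum = s0 + s1 + s2 + s3;

Like the multiplication-by-inverse trick, this reorders floating-point additions, so the result may differ slightly from the straightforward loop.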





Stride refers to the relationship between the layout of an array's elements in memory and the order in which they are processed. For example, an array of 32-bit integers is accessed at stride 1 if its elements are accessed by monotonically increasing the value of the rightmost subscript.







Untuned:

for (i = 0; i <= 1024; i++) {
    for (j = 0; j <= 16; j++) {
        a[j] = b[j][i];
    }
}

Tuned:

for (j = 0; j <= 16; j++) {
    for (i = 0; i <= 1024; i++) {
        a[j] = b[j][i];
    }
}






The lower the stride, the better the software performs, due to optimal use of the cache hierarchy.
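
As a self-contained sketch (the dimensions are illustrative), these two routines sum the same matrix; the first walks memory consecutively at stride 1, while the second jumps a full row between accesses and will typically run markedly slower on a large matrix:

#include <stdio.h>

#define ROWS 1024
#define COLS 1024

static double m[ROWS][COLS];

/* Stride 1: the rightmost subscript varies fastest, so each
   access touches the next element in memory. */
double sum_by_rows(void)
{
    int i, j;
    double s = 0.0;
    for (i = 0; i < ROWS; i++)
        for (j = 0; j < COLS; j++)
            s += m[i][j];
    return s;
}

/* Stride COLS: the leftmost subscript varies fastest, so each
   access is a full row (COLS doubles) away from the last. */
double sum_by_columns(void)
{
    int i, j;
    double s = 0.0;
    for (j = 0; j < COLS; j++)
        for (i = 0; i < ROWS; i++)
            s += m[i][j];
    return s;
}

int main(void)
{
    printf("%f %f\n", sum_by_rows(), sum_by_columns());
    return 0;
}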









8.3.3 Strings





The runtime of the standard C library string functions is proportional to the lengths of their operands. It's quite common to loop over calls to these functions and consume a significant amount of CPU time. There are a few simple optimizations you can take to reduce this effect; a short sketch following the list pulls several of them together:





  • Instead of calling strlen( ) in a loop involving the string itself, you can rewrite the loop so that you set a temporary value equal to strlen( ) before the loop, and then modify the loop as you add and remove characters.

  • If you are testing whether a string is empty, strlen( ) is probably not the fastest way to do it. If the strings are of substantial size, strlen( ) will check every character until it finds the terminator character. You can replace it with *s == '\0', which checks only the first character as well as saving a function call. Similarly, if you want to clear a string, *s = '\0' is much faster than strcpy(s, "").

  • strcat( ) scans the full length of the string on each call. If you've been keeping track of the length of the string as you go along (as described regarding strlen( ) previously), you have a ready-made index to the end of the string, and can use strcpy( ) or memcpy( ) from there.

  • You might be able to save a little bit of time with a strcmp( ) between two strings containing natural language by checking if the first characters of the strings are equal before doing the call to strcmp( ). Because of the statistically nonuniform distribution of letters in natural languages, the chances of a match are a bit better than would be otherwise expected. However, this optimization can be counterproductive, so be aware of the environment you're working in (e.g., if you are comparing adjacent strings in sorted output, this will not buy you anything).
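
Here is a minimal sketch pulling the first three of these together; the names are illustrative, and the buffer is assumed large enough for what is appended:

#include <stdio.h>
#include <string.h>

/* Append src to dst without rescanning dst: dst_len tracks the
   current length, so the copy starts right at the terminator.
   The caller must ensure dst has room. */
static size_t append(char *dst, size_t dst_len, const char *src)
{
    size_t src_len = strlen(src);
    memcpy(dst + dst_len, src, src_len + 1);  /* +1 copies the '\0' */
    return dst_len + src_len;                 /* new length: no strcat( ) scan */
}

int main(void)
{
    char buf[256];
    size_t len = 0;

    *buf = '\0';                   /* clear: cheaper than strcpy(buf, "") */
    len = append(buf, len, "hello, ");
    len = append(buf, len, "world");

    if (*buf == '\0')              /* empty test: one character, no strlen( ) */
        printf("empty\n");
    else
        printf("%s (%lu characters)\n", buf, (unsigned long)len);
    return 0;
}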















