C++ Compilers' Performance
Quality of Implementation

Authors: Sergey Satskiy
Motorola Inc.
Roman Plekhanov
Motorola Inc.
Publication date: 30.09.2006
Version: 1.10

Introduction

There are a number of publications available on the Internet which compare the performance of code generated by different C++ compilers on various hardware platforms for a set of test tasks (e.g. a review on the "coyote gulch" site). Reviews of this type are also issued by compiler manufacturers to draw attention to their products. Such reviews are certainly practical; they help to take the absolute performance of the generated code into account while selecting a development toolchain.

However, once a compiler has been selected, a developer faces the question of which approach to use for implementing a particular piece of code. Since any C++ compiler is able to compile C code as well, there is always a choice: to use the new language features provided by C++ or to implement the same functionality in pure C. In other words, the question is what overheads the new C++ features introduce, or how well a certain C++ compiler implements the corresponding feature.

There are not many comparisons of the performance of C++ code and functionally equivalent C code generated by the same compiler. In fact there is only one well-known source: the "Technical Report on C++ Performance" issued by the WG21 committee. The report gives specific numbers and the test code used, while the compilers are left anonymous. The logic of the committee is understandable; practicing engineers, however, need more specific information.

The above-mentioned report is taken as the basis for this article. The set of tests was extended, the code was modified in some cases, and a more detailed analysis of the incurred overheads is provided. In most cases exact figures are given for each compiler/hardware platform pair. The reader may get an answer not only to the question "what overheads does a language feature incur?" but also to the question "how big are those overheads?".

Hardware platforms, compilers and test results representation

The table below describes tested hardware platforms, compilers and operating systems.

Hardware platform Compilers OS
Intel 32 bit gcc 2.95.3, gcc 3.3.4, gcc 4.1.1, Intel C++ compiler 9.1.038 Linux
Intel 64 bit gcc 2.96, gcc 3.3.4, gcc 4.1.1, Intel C++ compiler 9.1.038 Linux
ARM11, 32 bit Cross-compiler gcc 3.4.3 Linux
Sun UltraSPARC-II, 64 bit gcc 2.95.3, gcc 3.3.4, gcc 4.1.1 Sun OS
Table 1. Tested hardware, compilers and OS

The same multiplatform compiler can show different performance results for code generated for different hardware. The performance depends considerably on the hardware-specific optimizer and code generator. This is the reason why each compiler was tested on various hardware where applicable.

While developing the test code and its execution environment, the effort was aimed at the following goals:

  • Reduce the requirements to the installed software
  • Simplify the process of adding a new compiler to the list of tested ones
  • Simplify the process of adding a new test

In fact, the main requirement is to have the GNU make, awk and bash utilities installed.

The idea behind covering gcc 2.95 is to look at how gcc compilers have progressed over recent years. Although this version is still in use, it is related to C rather than to C++. When C++ comes up, developers prefer newer gcc releases in most cases.

The gcc series 2 version on the IA-64 platform differs from the versions of this compiler on the other platforms. The reason is that the gcc 2.95.3 configuration script does not support the platform. The Linux provider, however, supplied gcc 2.96 for this platform, so this version was used in the IA-64 tests.

The following model was used to analyse the results:

  • Each C++ feature was associated with a functional equivalent written in C
  • The time spent on executing the C and C++ versions was collected
  • The C execution time was expressed as a percentage of the C++ time

So if a number in a table is above 100, C++ revealed better performance. Where the analysis differs from the model described above, a description is provided separately for each case.

Overheads Sources

Three types of overheads might appear if C++ is used instead of C:

  1. Run-time overheads, which consist of the following:
  • CPU time spent on executing the generated code
  • RAM required to execute the generated code
  2. Compile-time overheads, which consist of the following:
  • Time spent on compiling the code
  • RAM required by the compiler
  • Disk space required to store object and possibly temporary files
  3. Disk space overheads: the size of the generated executable file and possibly dynamic libraries.

Nowadays RAM and disk space are becoming cheap resources in many cases, so the most interesting aspect of the overheads is the generated code performance. Compile-time overheads are quite often mitigated by cross-compiling on powerful host computers; there are at least various ways of avoiding compile-time problems.

Bearing in mind the considerations above, the main focus will be on the generated code performance, while size-related figures will be provided just as a reference.

C++ Features that do not Imply Significant Overheads

Namespaces and explicit type conversions are such features.

Strictly speaking, namespaces may increase the compilation time. This increase, however, is too negligible to be taken seriously.

C++ introduces four new explicit type conversion operators: static_cast, const_cast, reinterpret_cast and dynamic_cast. The first three influence the compilation stage only, while dynamic_cast may introduce run-time overheads. These overheads are related to the analysis of RTTI information and will be discussed later in a separate chapter.

Abstraction Layers Penalties

Alex Stepanov Test

Alex Stepanov, the STL inventor, developed a test to estimate abstraction layer penalties. The test does semantically the same calculation in 13 different ways and collects the execution time of each. The actual task is to calculate the sum of 2000 double values from an array. A C++ wrapper around a POD double variable is used to introduce an abstraction layer.

struct Double
{
    double    value;

    Double() {}
    Double( const double &  x ) : value( x ) {}
    operator double() { return value; }
};

double    data[ 2000 ];
Double    Data[ 2000 ];

The double_pointer and Double_pointer wrappers for pointers to double and Double are introduced similarly. The ways to fulfil the task are as follows:

0. for ( size_t  i = 0; i < 2000; ++i ) result += data[ i ];
1. accumulate( data, data + 2000, 0 );
2. accumulate( Data, Data + 2000, Double( 0 ) );
3. accumulate( double_pointer( data ), double_pointer( data + 2000 ), 0 );
4. accumulate( Double_pointer( Data ), Double_pointer( Data + 2000 ), 0 );
5. Using reverse_iterator< double *, double >
6. Using reverse_iterator< Double *, Double >
7. Using reverse_iterator< double_pointer, double >
8. Using reverse_iterator< Double_pointer, Double >
9. Using reverse_iterator< reverse_iterator< double *, double >, double >
10. Using reverse_iterator< reverse_iterator< Double *, Double >, Double >
11. Using reverse_iterator< reverse_iterator< double_pointer, double >, double >
12. Using reverse_iterator< reverse_iterator< Double_pointer, Double >, Double >

Each subsequent way increases the abstraction level. The execution time of each way is measured, and the overhead is calculated as the geometric mean of the ratios of each way's time t_i to the time t_0 of way #0:

    penalty = ( (t_1 / t_0) * (t_2 / t_0) * ... * (t_12 / t_0) )^(1/12)

Figure 1. Geometric mean of execution time ratios

The calculated value characterizes the quality of compiler optimization. The smaller the value, the better. Values greater than one mean a loss of performance with new abstraction layers.

The results are given in the tables below.

Optimisation gcc 2.95 gcc 3.3 gcc 4.1 intel 9.1
-O0 11.78 8.5 9.16 12.12
-O2 1.07 1.14 1.03 1.06
-O3 -fomit-frame-pointer 1.06 1.12 1.03 1.06
Table 2. IA-32 abstraction layers penalties
Optimisation gcc 2.96 gcc 3.3 gcc 4.1 intel 9.1
-O0 2.1 4.68 4.26 3.51
-O2 1.18 0.94 1.11 0.99
-O3 -fomit-frame-pointer 1.18 0.94 1.05 2.04
Table 3. IA-64 abstraction layers penalties
Optimisation gcc 2.95 gcc 3.3 gcc 4.1
-O0 5.43 7.79 7.42
-O2 0.53 1.25 1.12
-O3 -fomit-frame-pointer 0.53 1.25 1
Table 4. Sun abstraction layers penalties
Optimisation gcc 3.4
-O0 5.32
-O2 0.76
-O3 -fomit-frame-pointer 0.76
Table 5. ARM abstraction layers penalties

The -fomit-frame-pointer optimization flag is introduced to give a compiler a chance to use all the available CPU registers effectively.

The result is that modern compilers eliminate abstraction layer overheads well with optimization switched on. Gcc series 3 demonstrated strange results on the IA-64 and ARM platforms: performance seemingly rises with new abstraction layers. The analysis revealed that the most probable root cause is inefficient code generated for the C version, i.e. for way #0.

Intel's compiler also demonstrated an unexpected result on the IA-64 platform: higher optimization led to worse performance. The analysis of the case revealed that the C performance increased significantly while the C++ performance did not change.

The whole picture is still bright for modern C++ compilers: the generated C++ code is about as fast as its C functional equivalent.

Functors vs Function Pointers

Abstraction layers might make C++ outperform C in some cases. For example, functors may win against function pointers. Let's take sorting as a typical programming task. The C library provides the qsort function, which requires a pointer to an element-comparing function to be passed as an argument. In C++, the std::sort algorithm could be used. The algorithm accepts various ways of comparing the elements.

The tables below give the various sorting options performance results.

Optimisation Container Comparison way gcc 2.95, % gcc 3.3, % gcc 4.1, % intel 9.1, %
-O0 array Function pointer 187 191 135 169
    Standard functor 178 253 229 184
    native operator < 302 375 317 274
  std::vector Function pointer 187 107 88 84
    Standard functor 178 129 130 84
    native operator < 294 153 147 112
-O2 Array Function pointer 220 251 315 265
    standard functor 460 605 577 706
    native operator < 557 572 611 706
  std::vector Function pointer 220 245 305 302
    standard functor 460 542 577 662
    native operator < 557 572 577 662
-O3 -fomit-frame-pointer Array Function pointer 253 267 360 265
    standard functor 520 582 673 706
    native operator < 577 582 631 662
  std::vector Function pointer 247 267 348 302
    standard functor 520 521 631 706
    native operator < 577 550 673 662
Table 6. IA-32 sorting performance
Optimisation Container Comparison way gcc 2.96, % gcc 3.3, % gcc 4.1, % intel 9.1, %
-O0 Array Function pointer 93 78 61 50
    standard functor 146 116 95 76
    native operator < 158 158 130 140
  std::vector Function pointer 93 37 34 28
    standard functor 146 45 43 38
    native operator < 158 52 48 50
-O2 Array Function pointer 145 144 147 107
    standard functor 187 220 221 179
    native operator < 212 220 220 178
  std::vector Function pointer 145 139 146 107
    standard functor 188 180 200 173
    native operator < 214 180 201 176
-O3 -fomit-frame-pointer Array Function pointer 150 145 154 104
    standard functor 190 218 219 176
    native operator < 218 221 219 177
  std::vector Function pointer 150 139 152 106
    standard functor 192 180 220 173
    native operator < 216 180 219 175
Table 7. IA-64 sorting performance
Optimisation Container Comparison way gcc 2.95, % gcc 3.3, % gcc 4.1, %
-O0 Array Function pointer 74 81 68
    standard functor 115 104 94
    native operator < 160 187 179
  std::vector Function pointer 73 46 38
    standard functor 115 49 45
    native operator < 160 63 55
-O2 Array Function pointer 69 63 75
    standard functor 232 268 402
    native operator < 291 341 402
  std::vector Function pointer 68 63 72
    standard functor 232 252 353
    native operator < 281 309 368
-O3 -fomit-frame-pointer Array Function pointer 72 63 81
    standard functor 309 273 520
    native operator < 334 363 505
  std::vector Function pointer 71 63 77
    standard functor 321 269 491
    native operator < 334 327 505
Table 8. Sun sorting performance
Optimisation Container Comparison way gcc 3.4, %
-O0 Array Function pointer 186
    standard functor 180
    native operator < 293
  std::vector Function pointer 72
    standard functor 71
    native operator < 86
-O2 Array Function pointer 234
    standard functor 371
    native operator < 396
  std::vector Function pointer 236
    standard functor 359
    native operator < 371
-O3 -fomit-frame-pointer Array Function pointer 235
    standard functor 369
    native operator < 388
  std::vector Function pointer 235
    standard functor 364
    native operator < 369
Table 9. ARM sorting performance

It is easy to see that C++ wins against C significantly. The win might reach 600% on the IA-32 platform. It is also obvious that the win on the IA-32 platform is in general higher than on the other platforms. This might be evidence of the compilers' maturity on this platform.

Using function pointers as the way to compare elements is slower than the other ways in most cases. So if there is a choice between function pointers and functors, it is preferable to use functors.

It is worth noting that sorting which uses the native operator < is faster than the other options in most cases.

Templates

Instantiated templates work at run time at the same speed as equivalent non-template classes, so the run-time overheads are the same as discussed in the "Abstraction Layers Penalties" chapter.

At compile time, however, templates may introduce considerable overheads. Moreover, disk space overheads may appear due to the "code bloat" effect.

There are at least three main approaches to implementation of the C++ template instantiation:

  • "Greedy"
  • By request
  • Iterative

The instantiation process may depend considerably on the way an application or a library is built. Suppose that a classic scheme with two components, a compiler and a linker, is used. The compiler translates source files into object files that hold machine code and cross references to other object files and libraries. The linker combines the object files, resolving references, into a single executable file. C and C++ compilers handle each compilation unit independently. A straightforward approach to implementing templates would instantiate non-inlined functions in each compilation unit, so there is a chance that more than one object file will hold function bodies with the same names. The linkage stage would fail in this case.

Let's consider in detail how each of the instantiation approaches mentioned above resolves the described problem and what overheads come into the picture.

"Greedy" Instantiation Approach

The greedy instantiation allows creating duplicates in many object files; however, those duplicates have a special mark (e.g. "instantiated template that should be linked"). When the linker finds duplicates, it keeps only one, throwing away all the others. This approach has some drawbacks:

  • A compiler may spend time on generating and optimizing essentially the same function bodies while only one will be used. The object files size may also increase.
  • The generated template instantiations may vary slightly for the same specialization, so the instantiated bodies are not usually checked for binary identity. Thus it is possible that not the best duplicate will be selected from the many which came from compilation units compiled with different optimization flags.

There are advantages as well:

  • The traditional scheme, a compiler and a linker, is kept unchanged; there are no new elements. There is still a one-to-one correspondence between translation units and object files.
  • Implementation of inline functions is relatively easy because duplicates are allowed. For example, if a compiler is not able to inline all the calls of an inline function, it keeps the function body in an object file.

So the greedy instantiation overheads are increased compilation and linkage time and possibly increased size of the object and final executable files. There is also a chance of releasing code that is not optimized best.

Queried Instantiation

This approach supposes the creation and support of a special database which is used during compilation of all the translation units. The database holds information about instantiated template specializations as well as their dependencies on the source code. The bodies of the instantiated templates are usually stored in that database as well.

In the case of instantiation by request, a compiler does not perform unnecessary template instantiations; however, there are difficulties in implementing the approach:

  • The traditional translation scheme is not applicable any more. Compiling a single unit produces more than just a single object file. The linkage stage requires not only object files but also the generated instantiations of template specializations from a database.
  • Many compilers implement parallel compilation, which makes supporting the template database complicated.
  • Linkage with libraries which have template specializations becomes more complicated. If the database has no information about which template specializations are in the libraries, code duplication may come up.

The main overhead here is the disk space occupied by the template specializations database.

Iterative Instantiation

There exist various similar methods implementing an iterative instantiation scheme. Their specific feature is the use of a preliminary linker. One of them is implemented in the Comeau compiler. The automatic instantiation method works as follows:

  1. The first time the source files of a program are compiled, no template entities are instantiated. However, the generated object files contain information about things that could have been instantiated in each compilation. For any source file that makes use of a template instantiation an associated ".ii" file is created if one does not already exist.
  2. When the object files are linked together, a program called the prelinker is run. It examines the object files, looking for references and definitions of template entities, and for the added information about entities that could be instantiated.
  3. If the prelinker finds a reference to a template entity for which there is no definition anywhere in the set of object files, it looks for a file that indicates that it could instantiate that template entity. When it finds such a file, it assigns the instantiation to it. The set of instantiations assigned to a given file is recorded in the associated ".ii" file.
  4. The prelinker then executes the compiler again to recompile each file for which the ".ii" file was changed.
  5. When the compiler compiles a file, it reads the ".ii" file for that file and obeys the instantiation requests therein. It produces a new object file containing the requested template entities (and all the other things that were already in the object file).
  6. The prelinker repeats steps 3--5 until there are no more instantiations to be adjusted.
  7. The object files are linked together.

Some compilers store the information about entities that could have been instantiated in associated ".ti" files, also storing information about how the object file was compiled.

Using this approach may result in increased link time. The increase is not dramatic, though, since the full linkage is not done at the preliminary stage. Moreover, the instantiation request files can be reused for subsequent linkages, so the number of recompilations is reduced.

The iterative instantiation overheads are increased linkage time and the disk space to store the instantiation request files. Most probably the required disk space will be insignificant.

Templates Tests

The following information was collected: the compilation time and the executable file size with and without symbol information. The strip utility was used to remove the symbol information from the executables.

Two versions of the source code were tested. The first version instantiated 40 std::list containers, each holding pointers to its own type. The second version instantiated 40 std::list containers, all holding pointers to the same type. The results are in the tables below:

Optimisation Source code version Measured value gcc 2.95 gcc 3.3 gcc 4.1 intel 9.1
-O2 40 different templates Compilation time, sec 212 20 3 11
    Size before strip, KB 505 87 10 145
    Size after strip, KB 222 83 6 109
  40 the same templates Compilation time, sec 265 22 1 6
    Size before strip, KB 498 80 6 94
    Size after strip, KB 217 78 4 85
-O3 -fomit-frame-pointer 40 different templates Compilation time, sec 371 20 3 11
    Size before strip, KB 602 87 8 145
    Size after strip, KB 320 83 6 109
  40 the same templates Compilation time, sec 518 22 2 6
    Size before strip, KB 594 80 8 94
    Size after strip, KB 314 78 6 85
-Os 40 different templates Compilation time, sec 227 24 4 10
    Size before strip, KB 505 88 29 148
    Size after strip, KB 222 83 10 105
  40 the same templates Compilation time, sec 294 27 1 6
    Size before strip, KB 498 81 7 93
    Size after strip, KB 217 79 5 81
Table 10. IA-32 templates compilation time and files sizes
Optimisation Source code version Measured value gcc 2.96 gcc 3.3 gcc 4.1 intel 9.1
-O2 40 different templates Compilation time, sec 40 29 2 7
    Size before strip, KB 375 117 20 308
    Size after strip, KB 368 112 15 212
  40 the same templates Compilation time, sec 34 27 1 3
    Size before strip, KB 360 106 11 124
    Size after strip, KB 356 104 8 116
-O3 -fomit-frame-pointer 40 different templates Compilation time, sec 40 29 2 7
    Size before strip, KB 375 117 15 308
    Size after strip, KB 368 112 12 212
  40 the same templates Compilation time, sec 35 27 1 3
    Size before strip, KB 360 107 15 124
    Size after strip, KB 356 104 12 116
-Os 40 different templates Compilation time, sec 33 32 3 7
    Size before strip, KB 375 119 64 320
    Size after strip, KB 368 113 43 216
  40 the same templates Compilation time, sec 56 31 1 3
    Size before strip, KB 360 108 13 128
    Size after strip, KB 356 105 10 116
Table 11. IA-64 templates compilation time and files sizes
Optimisation Source code version Measured value gcc 2.95 gcc 3.3 gcc 4.1
-O2 40 different templates Compilation time, sec 164 98 8
    Size before strip, KB 798 77 19
    Size after strip, KB 216 71 13
  40 the same templates Compilation time, sec 160 90 2
    Size before strip, KB 785 64 8
    Size after strip, KB 206 61 5
-O3 -fomit-frame-pointer 40 different templates Compilation time, sec 165 99 10
    Size before strip, KB 797 77 10
    Size after strip, KB 216 71 7
  40 the same templates Compilation time, sec 158 92 5
    Size before strip, KB 784 64 10
    Size after strip, KB 205 61 7
-Os 40 different templates Compilation time, sec 180 108 9
    Size before strip, KB 798 78 62
    Size after strip, KB 217 72 44
  40 the same templates Compilation time, sec 173 99 2
    Size before strip, KB 785 65 9
    Size after strip, KB 206 62 6
Table 12. Sun templates compilation time and files sizes
Optimisation Source code version Measured value gcc 3.4
-O2 40 different templates Size before strip, KB 20
    Size after strip, KB 8
  40 the same templates Size before strip, KB 24
    Size after strip, KB 8
-O3 -fomit-frame-pointer 40 different templates Size before strip, KB 20
    Size after strip, KB 18
  40 the same templates Size before strip, KB 24
    Size after strip, KB 18
-Os 40 different templates Size before strip, KB 29
    Size after strip, KB 18
  40 the same templates Size before strip, KB 24
    Size after strip, KB 18
Table 13. ARM templates compilation time and files sizes

The interesting thing here is the confirmation that compilers have made a big step towards reducing compilation time and executable file size. In some cases the gcc compilation time was reduced by a factor of 100 by moving from series 2 to series 4.

The ARM platform compilation time is not given because a cross-compiler was used, so the compilation time depended on the host system speed (IA-32). Numbers for a single compiler on this platform are not really interesting; however, the table is given with the hope of extending it in the future.

Basic Class Operations

Member Functions

A member function call is roughly the same as a free function call with one additional parameter - a pointer to an object. Let's consider three options of calling a member function:

Description C++ C
Notation '->' x->g( i ); g( ps, i );
Notation '.' x.g( i ); g( &s, i );
Static member function vs free function X::f( i ); f( i );
Table 14. Functions calls options

The tests compare function calls with an integer argument, shown as "i" in the table above. The "ps" in the table is a pointer, while "s" is an object.

The test results are given in the tables below.

Optimisation Test Gcc 2.95, % gcc 3.3, % gcc 4.1, % intel 9.1, %
-O0 Notation '->' 102 98 99 98
  Notation '.' 101 98 96 101
  Static member function vs free function 105 100 100 100
-O2 Notation '->' 95 87 102 100
  Notation '.' 110 90 100 104
  Static member function vs free function 101 100 100 153
-O3 -fomit-frame-pointer Notation '->' 106 95 104 90
  Notation '.' 111 100 104 104
  Static member function vs free function 100 101 95 160
Table 15. IA-32 functions calls performance

Optimisation Test gcc 2.96, % gcc 3.3, % gcc 4.1, % intel 9.1, %
-O0 Notation '->' 81 95 95 100
  Notation '.' 81 95 95 99
  Static member function vs free function 96 100 100 100
-O2 Notation '->' 38 270 117 86
  Notation '.' 38 243 83 85
  Static member function vs free function 63 100 100 99
-O3 -fomit-frame-pointer Notation '->' 37 83 100 85
  Notation '.' 36 83 207 85
  Static member function vs free function 63 100 33 100
Table 16. IA-64 functions calls performance
Optimisation Test gcc 2.95, % gcc 3.3, % gcc 4.1, %
-O0 Notation '->' 114 113 99
  Notation '.' 114 85 100
  Static member function vs free function 100 99 99
-O2 Notation '->' 100 102 90
  Notation '.' 100 99 87
  Static member function vs free function 92 87 99
-O3 -fomit-frame-pointer Notation '->' 99 100 100
  Notation '.' 99 89 100
  Static member function vs free function 99 91 100
Table 17. Sun functions calls performance
Optimisation Test gcc 3.4, %
-O0 Notation '->' 100
  Notation '.' 99
  Static member function vs free function 100
-O2 Notation '->' 118
  Notation '.' 112
  Static member function vs free function 89
-O3 -fomit-frame-pointer Notation '->' 100
  Notation '.' 151
  Static member function vs free function 101
Table 18. ARM functions calls performance

C++ performance on the IA-32, Sun and ARM platforms does not differ from the C performance by more than 10% in most cases. The results on the IA-64 platform are not so even: the C++ performance significantly depends on a particular case and varies from an overwhelming C++ win (gcc series 4 with maximum optimization for notation '.', 207%) to a major loss (gcc series 4 with maximum optimization for static member functions, 33%).

Virtual Functions - C and C++ Versions

Virtual functions, just like non-virtual ones, can be called using the '->' and '.' notations. Since pointers to virtual functions are stored in a separate table, a virtual function call is about the same as calling a function with one additional parameter via a pointer stored in an array. The table below describes the C++ and C call options.

Description C++ C
Notation '->' x->f( i ); (p[1])(ps,i);
Notation '.' x.f( i ); (p[1])(&s,i);
Table 19. Virtual functions calls options

Here "i" is an integer parameter, "p" is an array of function pointers, "ps" is a pointer to an object and "s" is an object.

The test results are given in the tables below.

Optimisation Notation gcc 2.95, % gcc 3.3, % gcc 4.1, % intel 9.1, %
-O0 Notation '->' 92 87 114 91
  Notation '.' 104 103 101 105
-O2 Notation '->' 89 92 90 97
  Notation '.' 110 106 110 702
-O3 -fomit-frame-pointer Notation '->' 97 93 91 97
  Notation '.' 122 106 500 702
Table 20. IA-32 virtual functions calls performance
Optimisation Notation Gcc 2.96, % gcc 3.3, % gcc 4.1, % intel 9.1, %
-O0 Notation '->' 81 95 95 100
  Notation '.' 81 95 95 99
-O2 Notation '->' 96 100 100 100
  Notation '.' 38 270 117 86
-O3 -fomit-frame-pointer Notation '->' 38 243 83 85
  Notation '.' 63 100 100 99
Table 21. IA-64 virtual functions calls performance
Optimisation Notation gcc 2.95, % gcc 3.3, % gcc 4.1, %
-O0 Notation '->' 94 94 95
  Notation '.' 158 112 152
-O2 Notation '->' 77 91 95
  Notation '.' 224 207 206
-O3 -fomit-frame-pointer Notation '->' 81 85 93
  Notation '.' 205 225 1234
Table 22. Sun virtual functions calls performance
Optimisation Notation gcc 3.4, %
-O0 Notation '->' 90
  Notation '.' 125
-O2 Notation '->' 96
  Notation '.' 141
-O3 -fomit-frame-pointer Notation '->' 96
  Notation '.' 498
Table 23. ARM virtual functions calls performance

It is easy to notice that C++ almost always wins in the case of notation '.'. Sometimes C++ wins by a factor of 5 to 7. Most probably this is due to optimizer implementation specifics: in the C++ case the optimizer is presumably able to perform de-virtualization, while in the C case no similar attempt is made.

In the case of notation '->' C wins slightly, except in rare cases. Thus gcc series 4 on the IA-32 platform without optimization generated better code for C++. Intel's compiler on the IA-64 platform without optimization, in turn, demonstrated a lack of C++ performance.

Virtual and non-virtual Functions - C++ Only

The overheads of calling virtual and non-virtual C++ member functions can differ. The tables below give the results of comparing virtual and non-virtual function calls. Each table cell shows the virtual function call performance as a percentage. A number greater than 100 means that virtual function calls are faster than non-virtual ones.

Optimisation Notation gcc 2.95, % gcc 3.3, % gcc 4.1, % intel 9.1, %
-O0 Notation '->' 80 87 112 92
  Notation '.' 99 99 100 101
-O2 Notation '->' 90 85 90 6
  Notation '.' 98 100 100 100
-O3 -fomit-frame-pointer Notation '->' 81 83 16 6
  Notation '.' 100 100 95 100
Table 24. IA-32 virtual and non-virtual functions calls performance

Optimisation Notation gcc 2.96, % gcc 3.3, % gcc 4.1, % intel 9.1, %
-O0 Notation '->' 77 79 77 42
  Notation '.' 95 100 100 100
-O2 Notation '->' 150 70 59 77
  Notation '.' 258 100 85 526
-O3 -fomit-frame-pointer Notation '->' 158 60 13 77
  Notation '.' 273 100 100 699
Table 25. IA-64 virtual and non-virtual functions calls performance
Optimisation Notation gcc 2.95, % gcc 3.3, % gcc 4.1, %
-O0 Notation '->' 52 70 64
  Notation '.' 99 100 97
-O2 Notation '->' 36 41 46
  Notation '.' 100 96 114
-O3 -fomit-frame-pointer Notation '->' 38 42 7
  Notation '.' 100 111 99
Table 26. Sun virtual and non-virtual functions calls performance
Optimisation Notation gcc 3.4, %
-O0 Notation '->' 72
  Notation '.' 99
-O2 Notation '->' 64
  Notation '.' 93
-O3 -fomit-frame-pointer Notation '->' 19
  Notation '.' 119
Table 27. ARM virtual and non-virtual functions calls performance

The fact that virtual and non-virtual function calls with notation '.' perform about the same can be explained by the compilers' ability to de-virtualize the virtual function calls.

In the case of notation '->' virtual functions lose in most cases. Sometimes the loss is dramatic: Intel's compiler lost by a factor of 16 on the IA-32 platform and gcc series 4 lost by a factor of 6.

In some cases (e.g. gcc series 4 on Sun with -O3) the significant loss of virtual call performance in comparison to non-virtual calls can be explained by the fact that the compiler was able to inline the non-virtual function while the virtual function was not inlined.

Inline Functions

Inline functions are the C++ alternative to C macros. The tables below compare the performance of those alternatives for notations '->' and '.'.

Optimisation Notation gcc 2.95, % gcc 3.3, % gcc 4.1, % intel 9.1, %
-O0 Notation '->' 64 54 49 47
  Notation '.' 31 49 36 35
-O2 Notation '->' 100 123 100 95
  Notation '.' 97 98 100 104
-O3 -fomit-frame-pointer Notation '->' 97 82 100 100
  Notation '.' 102 98 102 102
Table 28. IA-32 inline functions vs macros

Optimisation Notation gcc 2.96, % gcc 3.3, % gcc 4.1, % intel 9.1, %
-O0 Notation '->' 108 64 74 68
  Notation '.' 95 52 58 58
-O2 Notation '->' 101 33 99 33
  Notation '.' 446 100 300 299
-O3 -fomit-frame-pointer Notation '->' 301 99 300 100
  Notation '.' 447 100 33 200
Table 29. IA-64 inline functions vs macros
Optimisation Notation gcc 2.95, % gcc 3.3, % gcc 4.1, %
-O0 Notation '->' 63 64 58
  Notation '.' 84 44 47
-O2 Notation '->' 99 100 99
  Notation '.' 99 100 99
-O3 -fomit-frame-pointer Notation '->' 100 99 100
  Notation '.' 100 100 100
Table 30. Sun inline functions vs macros
Optimisation Notation gcc 3.4, %
-O0 Notation '->' 48
  Notation '.' 38
-O2 Notation '->' 120
  Notation '.' 100
-O3 -fomit-frame-pointer Notation '->' 101
  Notation '.' 83
Table 31. ARM inline functions vs macros

With optimization switched off compilers do not even try to inline function calls. The first two lines in each of the tables confirm this assumption.

With optimization switched on the results differ a lot across platforms and cases. The most stable results are demonstrated by gcc series 4 on the IA-32 platform: inline functions and macros have the same performance. Intel's compiler demonstrates a loss of inline function performance for notation '->' on the IA-32 platform.

The IA-64 platform revealed both wins and losses. For example, Intel's inline functions win for notation '.', while gcc series 4 inline functions lose significantly for notation '.' with -O3 optimisation.

Inheritance and Virtual Functions

Calling virtual functions may incur additional run-time overheads in comparison to calling non-virtual ones: both a CPU overhead and a RAM overhead are possible. The overheads can vary for different inheritance cases, single and multiple, and even for different sequences of inheritance. Let's consider in detail what is going on in each case for a typical C++ implementation.

Single Inheritance

Suppose that the following type is used as a base one.

struct Base
{
    Data          d1;

    virtual void  f( void );
    void          g( void );
};

Objects of the Base type will be allocated in RAM as shown on the figure below.

Figure 2. Allocation of an object with a virtual function

A virtual function table (vtbl) for the Base type holds a pointer to the virtual function f, and the data members are extended with a pointer (vptr) to the vtbl. What exact elements are stored in the virtual function table is not important for now: those could be pointers, deltas for "this" pointer correction or something else.

Now suppose that there is type Derived which inherits from Base:

struct Derived : public Base
{
    Data          d2;

    virtual void  f( void );
    virtual void  h( void );
};

Objects of the Derived type will be allocated in RAM as shown on the figure below.

Figure 3. Allocation of an object in case of a single inheritance

The base type members are allocated first, followed by the derived type members. The vtbl is extended with one more pointer, &Derived::h, and &Base::f is replaced with &Derived::f.

It is worth noticing that when an object of type Derived is allocated (i.e. in case of single inheritance) the addresses of the Base and Derived subobjects are the same. Another interesting property is that a single copy of the vtbl can be shared by all allocated objects of the same type, which reduces run-time memory overheads and possibly disk space overheads.

Multiple Inheritance

Suppose that there are two base types: Base1 and Base2:

struct Base1
{
    Data          d1;

    virtual void  f( void );
};
struct Base2
{
    Data          d2;

    virtual void  f( void );
    virtual void  g( void );
};

Now suppose that the DerivedMultiple type derives from both Base1 and Base2:

struct DerivedMultiple : public Base1, public Base2
{
    Data          d3;

    virtual void  f( void );
    virtual void  g( void );
    virtual void  h( void );
};

Allocation of objects of the Base1 and Base2 types is similar to the allocation of Base type objects shown on figure 2. The interesting part is the allocation of DerivedMultiple type objects:

Figure 4. Allocation of an object in case of multiple inheritance

"s" marks a size occupied by the Base1 type object.

The Base1 members are allocated in memory first, followed by the Base2 members, and then by the DerivedMultiple members. The most important detail here is that the DerivedMultiple object has two addresses, marked a1 and a2 on the figure. The two addresses appear when a developer writes code similar to the following:

DerivedMultiple *    Object( new DerivedMultiple );  // Corresponds to a1
Base1 *              base1( Object );                // Corresponds to a1 as well
Base2 *              base2( Object );                // Corresponds to a2

When the virtual function f is called, however, the correct "this" pointer, i.e. a pointer to the object which was originally created, must be passed. This is the a1 pointer in the example. If only the a2 pointer is available, an additional action is required: a2 must be corrected by the size of the Base1 subobject, that is by "s". That is why the vtbls on the figure are extended with one more information element: the value for the "this" pointer correction in case of virtual function calls.

A similar situation appears when the described hierarchy is used as follows:

Base2 *  base2a( new Base2 );
Base2 *  base2b( new DerivedMultiple );

base2a->f();
base2b->f();

The Base2 * in the example above can point either to a standalone Base2 object or to a part of a DerivedMultiple object. In the first case the call of the virtual function f invokes Base2::f, in the second DerivedMultiple::f. Since base2b points to the Base2 part of the DerivedMultiple object, the "this" pointer must be corrected to make it point to the DerivedMultiple object. The correction value is s.

From the description above it is easy to see that both RAM (storing the vtbls) and CPU (analysing those tables and possibly correcting "this") run-time overheads come up when virtual functions are used. Function inlining also does not work for virtual functions in the general case.

It is worth saying that quite often a virtual function is called in a context where the compiler has all the required type information, which makes it possible to convert the virtual function call into an ordinary function call. This kind of optimization is called de-virtualisation and replaces an indirect call via a table of function pointers with a direct function call.

There are at least two approaches compilers can employ to implement virtual functions. The first stores deltas for the "this" pointer as shown on the figures above. The second generates a small piece of code called a "thunk" which corrects the "this" pointer; if no correction is required the corresponding thunk becomes empty, which optimises the virtual function call.

Test Results

As mentioned above, the function call overheads can differ for single and multiple inheritance, for virtual and non-virtual functions, and depending on the sequence of inheritance. The test results for all these cases are given below.

The diagram below shows the two type hierarchies which were used in the tests. In case of multiple inheritance the branch related to the Base1 base class will be referred to as the "first" branch in the further discussion, and the branch related to the Base2 base class as the "second" branch correspondingly.

Figure 5. Functions calls

The performance of virtual and non-virtual function calls is measured in the tests for the various inheritance cases. In case of multiple inheritance the performance is measured twice, once for each branch of inheritance.

The tables below compare function call performance in case of multiple inheritance to the case of single inheritance. A number larger than 100 means that a function call in case of multiple inheritance is faster than in case of single inheritance.

Optimisation Virtuality Inheritance branch gcc 2.95, % gcc 3.3, % gcc 4.1, % intel 9.1, %
-O0 Non-virtual Base1 105 103 94 101
    Base2 98 99 94 96
  Virtual Base1 100 100 100 99
    Base2 88 82 90 61
-O2 Non-virtual Base1 102 102 100 100
    Base2 102 103 104 100
  Virtual Base1 100 98 99 99
    Base2 78 97 94 99
-O3 -fomit-frame-pointer Non-virtual Base1 102 99 102 100
    Base2 102 118 102 100
  Virtual Base1 100 99 99 98
    Base2 86 69 83 99
Table 32. IA-32 functions calls performance in case of single and multiple inheritance
Optimisation Virtuality Inheritance branch gcc 2.96, % gcc 3.3, % gcc 4.1, % intel 9.1, %
-O0 Non-virtual Base1 103 100 99 100
    Base2 95 92 95 96
  Virtual Base1 100 100 99 99
    Base2 124 93 96 62
-O2 Non-virtual Base1 36 99 116 100
    Base2 37 99 116 99
  Virtual Base1 99 100 99 99
    Base2 85 91 90 99
-O3 -fomit-frame-pointer Non-virtual Base1 298 100 100 99
    Base2 100 300 33 99
  Virtual Base1 100 99 99 99
    Base2 86 90 90 100
Table 33. IA-64 functions calls performance in case of single and multiple inheritance
Optimisation Virtuality Inheritance branch gcc 2.95, % gcc 3.3, % gcc 4.1, %
-O0 Non-virtual Base1 100 113 100
    Base2 108 94 97
  Virtual Base1 88 98 100
    Base2 90 82 84
-O2 Non-virtual Base1 99 100 99
    Base2 99 104 91
  Virtual Base1 108 94 99
    Base2 107 64 97
-O3 -fomit-frame-pointer Non-virtual Base1 100 100 100
    Base2 100 100 100
  Virtual Base1 97 100 100
    Base2 101 70 99
Table 34. Sun functions calls performance in case of single and multiple inheritance
Optimisation Virtuality Inheritance branch gcc 3.4, %
-O0 Non-virtual Base1 100
    Base2 92
  Virtual Base1 100
    Base2 85
-O2 Non-virtual Base1 77
    Base2 100
  Virtual Base1 99
    Base2 79
-O3 -fomit-frame-pointer Non-virtual Base1 100
    Base2 100
  Virtual Base1 99
    Base2 79
Table 35. ARM functions calls performance in case of single and multiple inheritance

It is possible to notice that with modern compilers the sequence of inheritance does not affect the performance of non-virtual function calls considerably. The picture is different for virtual functions. Intel's compiler with optimization switched on demonstrated the same virtual call performance regardless of the inheritance branch, and it was about the same as the virtual call performance in case of single inheritance.

Gcc compilers of series 3 and 4 with optimization switched on demonstrated a minor difference in virtual call performance depending on the inheritance branch. The loss for the second branch was between 10% and 30% in comparison to virtual calls in case of single inheritance, while for the first branch such losses are practically negligible.

Virtual Inheritance

In case of virtual inheritance data structures become even more complicated. Let's consider the following hierarchy:

Figure 6. Type hierarchy with virtual inheritance

Mediator1 and Mediator2 inherit virtually from the TopBase. Suppose that the types are defined as follows:

struct TopBase
{
    Data          d1;

    virtual void  f( void );
};

struct Mediator1 : virtual public TopBase
{
    Data          d2;

    virtual void  f( void );
    virtual void  g( void );
};
struct Mediator2 : virtual public TopBase
{
    Data          d3;

    virtual void  f( void );
    virtual void  h( void );
};

struct DerivedVirtual : public Mediator1, public Mediator2
{
    Data          d4;

    virtual void  f( void );
    virtual void  g( void );
    virtual void  h( void );
};

Objects of the TopBase type are allocated in memory similar to the way shown on figure 2. The allocation of the Mediator1 type objects for a typical implementation is shown below. The Mediator2 objects are allocated similar to the Mediator1 ones.

Figure 7. Allocation of an object with a virtual base class

The virtual base class members are located after all the other data members. This is done to unify the run-time analysis of the vtbl regardless of which exact object type (Mediator1, Mediator2 or DerivedVirtual) was created. The Mediator1 vtbl is extended with a pointer to the virtual base subobject, pTopBase. Access to the virtual base members goes not directly but via the pTopBase pointer in the vtbl; this indirect access leads to run-time overheads.

The figure below shows the DerivedVirtual objects allocation.

Figure 8. Allocation of an object with a virtual base class in case of multiple inheritance

Data members of a virtual base object appear in memory only once, so the exact location of this portion is known only for the really allocated object. That is why a pointer to the beginning of the virtual base subobject is stored in the vtbl of each deriving type.

The approach described above provides unified access to the virtual base object members, via a pointer in the vtbl, regardless of which exact object was allocated: Mediator1, Mediator2 or DerivedVirtual. It also covers the case when a DerivedVirtual object was created and a pointer to that object was converted to a pointer to Mediator1 or Mediator2.

Some compilers hold a pointer to the beginning of a virtual base object not in the vtbl but as an additional data member.

Test Results

Virtual inheritance may lead to a loss of performance in comparison to usual inheritance. The loss may come up when calling member functions or when accessing data members of a virtual base class.

A virtual base class may have both virtual and non-virtual functions. The performance of calling those functions can be different.

The test results are grouped into those for non-virtual and those for virtual member functions. For non-virtual functions the results compare virtual and usual single inheritance; the tested function call incremented a member of a usual or a virtual base class. The tables below refer to this case of incrementing a single member of a base class as "option 1". For virtual functions the results also compare virtual and usual single inheritance; the two options of function calls illustrated below were used.

Figure 9. A virtual function call, option 2a

Figure 10. A virtual function call, option 2b

The tables below compare function call performance in case of virtual inheritance to the case of usual inheritance. A number greater than 100 means that the function call in case of virtual inheritance is faster than in case of usual inheritance.

Optimisation Function call option gcc 2.95, % gcc 3.3, % gcc 4.1, % intel 9.1, %
-O0 Option 1 91 68 74 79
  Option 2a 72 64 61 54
  Option 2b 69 54 57 50
-O2 Option 1 83 57 62 74
  Option 2a 75 54 57 62
  Option 2b 60 48 54 58
-O3 -fomit-frame-pointer Option 1 28 48 27 74
  Option 2a 76 47 48 62
  Option 2b 57 41 42 58
Table 36. IA-32 functions calls with virtual and usual inheritance
Optimisation Function call option gcc 2.96, % gcc 3.3, % gcc 4.1, % intel 9.1, %
-O0 Option 1 95 79 77 45
  Option 2a 123 75 76 35
  Option 2b 120 61 66 25
-O2 Option 1 90 66 66 77
  Option 2a 91 138 60 60
  Option 2b 139 116 50 50
-O3 -fomit-frame-pointer Option 1 33 16 100 77
  Option 2a 91 138 60 60
  Option 2b 139 116 50 49
Table 37. IA-64 functions calls with virtual and usual inheritance
Optimisation Function call option gcc 2.95, % gcc 3.3, % gcc 4.1, %
-O0 Option 1 94 96 85
  Option 2a 101 61 73
  Option 2b 97 58 65
-O2 Option 1 94 95 84
  Option 2a 93 62 81
  Option 2b 91 57 71
-O3 -fomit-frame-pointer Option 1 18 16 18
  Option 2a 92 61 86
  Option 2b 83 56 70
Table 38. Sun functions calls with virtual and usual inheritance
Optimisation Function call option gcc 3.4, %
-O0 Option 1 76
  Option 2a 58
  Option 2b 50
-O2 Option 1 71
  Option 2a 67
  Option 2b 54
-O3 -fomit-frame-pointer Option 1 22
  Option 2a 57
  Option 2b 48
Table 39. ARM functions calls with virtual and usual inheritance

In most of the cases the performance of function calls in case of virtual inheritance loses significantly against the case of usual inheritance, regardless of whether the function was virtual or non-virtual.

The fact that the "option 2a" call was almost always faster than the "option 2b" call means that the time spent depends on how many times the virtual base class data members were accessed within the function. Theoretically this dependency could be removed if compilers cached a pointer to the virtual base class object at the beginning of the function.

RTTI

The most interesting case of RTTI usage is conversion attempts which involve type hierarchy analysis. Such an analysis is performed when dynamic_cast is used. The C programming language does not explicitly support type hierarchies, so it is difficult to find a functional equivalent. For this reason only a theoretical analysis of a possible RTTI implementation and the implied overheads is given, without a performance comparison between C and C++ implementations.

Suppose that there is the type hierarchy shown below (note that the E type must be polymorphic).

Figure 11. RTTI hierarchy

Suppose also that there is the following C++ code:

E *       pE( new E );
B *       pB( pE );

D *       pD( dynamic_cast< D * >( pB ) );

It is obvious that the conversion at the last line must complete successfully. The obstacle is that the argument of the conversion is a pointer to type B, which is not a direct base type of type D. To complete the conversion successfully it is necessary to walk down the type hierarchy to type E and then to complete the conversion. The described hierarchy traversal could be supported by the RTTI implementation shown on the figure below.

Figure 12. Typical RTTI implementation

The vtbl is extended with one more pointer which helps to retrieve type information; information about all the types is stored in a separate table. Given a pointer to B (pB), the beginning of the really created object is found first. Then the table with information about all the predecessor types of type E is located, and the target type_info is compared consecutively with all the type_infos in the predecessor list. If the type_infos are equal at some stage, the conversion is possible.

So the run-time overheads on supporting RTTI are as follows:

  • RAM for storing more pointers in the vtbls
  • RAM for storing RTTI table
  • CPU time for searching the RTTI table
  • CPU time for type_info comparisons

Quite often the comparisons do not require expensive string comparisons; there are, however, compilers that use strings.

Exceptions

C++ provides exceptions for error handling. Traditional C alternatives are as follows:

  • Return code analysis
  • Calls of error handling functions
  • longjmp-based error handlers
  • Passing an additional object pointer to each called function. The object holds the current state information.

Exceptions in C++ are usually implemented using one of two approaches: a table approach and a code approach.

The table approach supposes creation of special tables at compilation time. Those tables associate program counter value ranges with the actions to be executed if an exception is generated: passing control to the corresponding catch block, calling local objects' destructors, stack unwinding, etc. The main run-time overhead of the table approach is the RAM for storing the prepared tables.

The code approach supposes generating on the fly a list of actions to be executed in case of an exception. The list of actions is similar to what is stored in the tables in the table approach. The main run-time overhead of the code approach is CPU time, while the dynamically generated action lists take much less RAM than the static action tables of the table approach.

The most popular and simple way of error handling in C is the return code analysis. C code is usually similar to the following:

int  f( void );
. . .
{
    int    ReturnValue;

    . . .

    ReturnValue = f();
    if ( ReturnValue != 0 )
    {
        /* Process error some way */
    }

    /* No errors */
}

The main run-time overhead in the code above is the CPU time spent on comparing the return code with some value and possibly on a jump. This comparison is performed regardless of whether the function f completed successfully or not. If C++ exceptions are used there is no return code and therefore no if statement, so a successful call of f incurs no run-time CPU overhead. The CPU overhead when f generates an exception, however, will most probably be higher. It is possible to measure the time of processing an exception in the C++ case and the time of checking the error code in the C case. The ratio of those times tells the minimum number of successful calls of f between failures that makes the C++ code more effective than the C equivalent in terms of CPU consumption. For example, a ratio of 220 means that if an exception is generated more rarely than once per 220 calls then the C++ code will work faster than the return-code-analysing C version.

The tables below give the test results: the ratio of the time spent on handling an exception to the time spent on the return code analysis.

Optimisation gcc 2.95 gcc 3.3 gcc 4.1 intel 9.1
-O0 265 396 394 491
-O2 N/A 497 445 854
-O3 -fomit-frame-pointer N/A 470 609 711
Table 40. Exceptions handling on IA-32
Optimisation gcc 2.96 gcc 3.3 gcc 4.1 intel 9.1
-O0 164 232 202 465
-O2 509 635 582 1445
-O3 -fomit-frame-pointer 512 646 505 1399
Table 41. Exceptions handling on IA-64
Optimisation gcc 2.95 gcc 3.3 gcc 4.1
-O0 87 88 84
-O2 101 112 121
-O3 -fomit-frame-pointer 107 108 270
Table 42. Exceptions handling on Sun
Optimisation gcc 3.4
-O0 100
-O2 102
-O3 -fomit-frame-pointer 106
Table 43. Exceptions handling on ARM

The values in the tables can be used by developers when choosing a certain way of error handling. An interesting fact is that the gcc compilers demonstrate similar results while Intel's compiler loses up to 2.5 times against gcc.

Gcc series 2 on the IA-32 platform with optimization switched on generated code which aborted at run-time, so the corresponding cells in the table contain N/A.

IOStream Library

The C++ input/output library has a reputation of being inefficient. The library performance might be affected by the synchronisation mode with the C input/output streams; the mode is on by default.

The test code performed file input/output operations on decimal, hexadecimal and floating point values for both synchronisation modes, switched on and off. The figures below show the results. The vertical axis is the time spent on the operations, so a higher bar means worse performance.

Figure 13. IA-32 input

Figure 14. IA-32 output

Figure 15. IA-64 input

Figure 16. IA-64 output

Figure 17. Sun input

Figure 18. Sun output

Figure 19. ARM input

Figure 20. ARM output

C++ I/O performance is worse than that of C in all the cases except one; in the worst case the slowdown reaches a factor of 6 (600%). The exception is the gcc series 2 C++ I/O performance; some sources, however, explain it by an implementation of the I/O streams which is incorrect in terms of the C++ standard requirements. Besides, gcc series 2 is out of date now and is not seriously considered by developers as a candidate for C++ projects.

Conclusion

Modern C++ compilers demonstrate a high quality of implementation of the new language features on all the tested platforms. C++ code practically does not lose against C and sometimes allows reaching a higher performance. The regrettable exception is the C++ input/output; there is, however, a workaround: a C++ compiler will easily compile C-style input/output code. We also certainly have hopes for the compiler developers, and the C++ input/output library will keep getting better anyway. Bearing in mind that C++ directly supports more programming paradigms than C, the latter will have fewer and fewer chances to be used in complicated projects.

Tests Automation

To simplify collecting test results on various platforms for different compilers and various sets of compilation flags, a set of scripts was developed. The scripts can easily be extended with new tests, new compilers and new combinations of compilation flags. In order to do that, some changes should be made in the files described below.

compilers.list File

The file is located at the top of the framework directory structure and holds a list of the tested compilers. An example of the file is given below:

# File format:
# first   compiler vendor
# second  c compiler path
# third   c++ compiler path

gcc4.1.1
/home/twinpeek/compilers/gcc/4.1.1/bin/gcc
/home/twinpeek/compilers/gcc/4.1.1/bin/g++

intel9.1
/home/twinpeek/compilers/intel/9.1.038/bin/icc
/home/twinpeek/compilers/intel/9.1.038/bin/icpc

The example defines names and paths of two compilers: gcc series 4 and Intel's compiler. Comment lines start with the '#' character. Empty lines are allowed.

projects.list File

The file is also located at the top of the directory structure and holds a list of the projects included in the tests. An example of the file is given below:

# File format:
# Paths to the projects

abstraction_penalty/stepanov_test
abstraction_penalty/mitigation
abstraction_penalty/templates_boat_diff
abstraction_penalty/templates_boat_same

The example defines four projects specified in the form of relative paths. Comment lines start with the '#' character. Empty lines are allowed.

optimization.info File

A number of optimization flag sets used with each of the compilers is defined separately for each project. That is why the optimization.info file is located in the home directory of each project. For example, the optimization.info file for the abstraction_penalty/stepanov_test project can look as follows:

0
1
2

Each line gives the name of an optimization flag set. Here the names are simply digits.

Optimisation Flags

The compiler optimization flags which correspond to each of the sets from the optimization.info file are looked up as follows. A file name is formed using the following rule:

<CompilerName>_opt<OptimisationKeysSetName>.flags

For example, for Intel's compiler and the last optimization flag set the following name will be formed:

intel9.1_opt2.flags

The search for this file starts in the "flags" directory of the corresponding project home directory. If the file is not found there, the search continues in the "flags" directory located at the top of the framework directory structure. The file should define two variables, for C and for C++, with the corresponding optimization flags. For example, the intel9.1_opt2.flags file can look as follows:

CFLAGS=-O3 -fomit-frame-pointer
CPPFLAGS=-O3 -fomit-frame-pointer

Such an approach makes it possible to have common optimization flags for all the projects and to tune the flags individually for a specific project.

Running

To compile all the projects and collect the test results the following command can be used:

./do_test.sh > TestResults.log

References

  1. GCC 4.0. A Review for AMD and Intel Processors. http://www.coyotegulch.com/reviews/gcc4/index.html
  2. Technical Report on C++ Performance. http://www.open-std.org/jtc1/sc22/wg21/docs/TR18015.pdf
  3. David Vandevoorde, Nicolai M. Josuttis. C++ Templates. The Complete Guide. http://www.amazon.co.uk/C++-Templates-Complete-David-Vandevoorde/dp/0201734842
  4. Bjarne Stroustrup. The C++ Programming Language, Special Edition. http://www.amazon.co.uk/C%2B%2B-Programming-Language-Special/dp/0201700735
  5. Scott Meyers. Effective C++: 55 Specific Ways to Improve Your Programs and Designs. http://www.amazon.co.uk/Effective-C%2B%2B-Specific-Professional-Computing/dp/0321334876
  6. Scott Meyers. More Effective C++: 35 New Ways to Improve Your Programs and Designs. http://www.amazon.co.uk/More-Effective-C%2B%2B-Professional-Computing/dp/020163371X
  7. Stephen C. Dewhurst. C++ Gotchas: Avoiding Common Problems in Coding and Design. http://www.amazon.co.uk/C%2B%2B-Gotchas-Addison-Wesley-Professional-Computing/dp/0321125185
  8. Jonathan L. Schilling. Optimizing Away C++ Exception Handling. http://sco.com/developers/products/ehopt.pdf

Verbatim copying and distribution of this entire article is permitted in any medium, provided this notice is preserved.

Last Updated: October 1, 2007