C++ Compilers' Performance
| Hardware platform | Compilers | OS |
|---|---|---|
| Intel 32 bit | gcc 2.95.3, gcc 3.3.4, gcc 4.1.1, Intel C++ compiler 9.1.038 | Linux |
| Intel 64 bit | gcc 2.96, gcc 3.3.4, gcc 4.1.1, Intel C++ compiler 9.1.038 | Linux |
| ARM11, 32 bit | Crosscompiler gcc 3.4.3 | Linux |
| Sun UltraSPARC-II, 64 bit | gcc 2.95.3, gcc 3.3.4, gcc 4.1.1 | Sun OS |
The same multi-platform compiler can show different performance results for code generated for different hardware. The performance depends considerably on the hardware-specific optimizer and the hardware-specific code generator. This is why each compiler was tested on every hardware platform where it was applicable.
While developing the test code and its execution environment, every effort was made to:
In fact, the main requirement is to have the GNU make, awk and bash utilities installed.
The idea behind covering gcc 2.95 is to see how the gcc compilers have progressed over recent years. Although this version is still in use, it is associated with C rather than with C++. When C++ is involved, developers prefer newer gcc releases in most cases.
The gcc series 2 version on the IA-64 platform differs from the version of this compiler used on the other platforms, because the gcc 2.95.3 configuration script does not support IA-64. The Linux provider, however, supplied gcc 2.96 for this platform, so that version was used in the tests on IA-64.
The following model was used to analyse the results:
So if a number in a table is above 100, C++ showed better performance. Where the analysis differs from the one described above, a description is provided separately for each case.
Three types of overheads might appear if C++ is used instead of C:
Nowadays RAM and disk space are cheap resources in many cases, so the most interesting aspect of the overheads is the performance of the generated code. Compile-time overheads are quite often mitigated by cross-compiling on powerful host computers, and there are various other ways of avoiding compile-time overhead problems.
Bearing in mind the considerations above, the main focus will be on the performance of the generated code, while the size-related figures will be provided just as a reference.
Namespaces and explicit type conversions are such features.
Strictly speaking, namespaces may increase compilation time. This increase, however, is too small to be discussed seriously.
C++ introduces 4 new explicit type conversion operators: static_cast, const_cast, reinterpret_cast and dynamic_cast. The first 3 affect the compilation stage only, while dynamic_cast may introduce run-time overheads. These overheads are related to the analysis of RTTI information and will be discussed later in a separate chapter.
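The difference between the four operators can be illustrated with a minimal sketch. The types Shape, Circle and Square and all function names are hypothetical and serve only as an illustration; only the dynamic_cast function needs RTTI at run time.

```cpp
#include <cstdint>

// Hypothetical polymorphic hierarchy, used only for illustration.
struct Shape { virtual ~Shape() {} };
struct Circle : Shape { double radius; Circle() : radius( 1.0 ) {} };
struct Square : Shape {};

// static_cast: the conversion is chosen entirely at compile time.
int truncate( double x ) { return static_cast< int >( x ); }

// const_cast: only changes the type seen by the compiler.
int & drop_const( const int & x ) { return const_cast< int & >( x ); }

// reinterpret_cast: reinterprets a pointer value as an integer.
std::uintptr_t address_of( const void * p ) {
    return reinterpret_cast< std::uintptr_t >( p );
}

// dynamic_cast: consults RTTI at run time and yields a null pointer
// when the object is not of the requested type.
bool is_circle( Shape * s ) { return dynamic_cast< Circle * >( s ) != 0; }
```

Calling is_circle with a pointer to a Square returns false, which is the run-time check that the other three operators never perform.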
Alex Stepanov, the STL inventor, developed a test to estimate abstraction layer penalties. The test performs semantically the same calculation in 13 different ways and the execution time of each way is collected. The actual task is to calculate the sum of 2000 double values from an array. A C++ wrapper around a POD double variable is used to introduce an abstraction layer.

    struct Double {
        double value;
        Double() {}
        Double( const double & x ) : value( x ) {}
        operator double() { return value; }
    };

    double data[ 2000 ];
    Double Data[ 2000 ];
The double_pointer and Double_pointer wrappers for pointers to double and Double are introduced similarly. The ways to perform the task are as follows:

0. `for ( size_t i = 0; i < 2000; ++i ) result += data[ i ];`
1. `accumulate( data, data + 2000, 0 );`
2. `accumulate( Data, Data + 2000, Double( 0 ) );`
3. `accumulate( double_pointer( data ), double_pointer( data + 2000 ), 0 );`
4. `accumulate( Double_pointer( Data ), Double_pointer( Data + 2000 ), 0 );`
5. Using `reverse_iterator< double *, double >`
6. Using `reverse_iterator< Double *, Double >`
7. Using `reverse_iterator< double_pointer, double >`
8. Using `reverse_iterator< Double_pointer, Double >`
9. Using `reverse_iterator< reverse_iterator< double *, double >, double >`
10. Using `reverse_iterator< reverse_iterator< Double *, Double >, Double >`
11. Using `reverse_iterator< reverse_iterator< double_pointer, double >, double >`
12. Using `reverse_iterator< reverse_iterator< Double_pointer, Double >, Double >`
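Ways 0 and 2 from the list above can be sketched as follows. This is only an illustration: it uses std::accumulate from the standard library, which is an assumption, since the benchmark may define its own accumulate template. The function names sum_plain and sum_wrapped are hypothetical.

```cpp
#include <cstddef>
#include <numeric>

// Wrapper from the text: a struct holding a single double.
struct Double {
    double value;
    Double() {}
    Double( const double & x ) : value( x ) {}
    operator double() { return value; }
};

const std::size_t N = 2000;

// Way 0: a plain loop over built-in doubles.
double sum_plain( const double * data ) {
    double result = 0;
    for ( std::size_t i = 0; i < N; ++i ) result += data[ i ];
    return result;
}

// Way 2: the same sum computed through the Double abstraction layer.
// Each addition converts both operands to double and the result back to Double.
double sum_wrapped( Double * Data ) {
    return std::accumulate( Data, Data + N, Double( 0 ) );
}
```

Both functions compute the same value; the benchmark measures how much of the wrapper's conversion work the optimizer is able to remove.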
Each subsequent way of calculation increases the abstraction level. The execution time of each way is measured and the overhead is calculated as the geometric mean of the execution time ratios.
The calculated value characterizes the optimization quality of a compiler: the smaller the value, the better. Values greater than one mean a loss of performance as new abstraction layers are added.
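A minimal sketch of this calculation is given below. It rests on an assumption that the text does not spell out: each ratio divides the execution time of way i by the execution time of way 0, the plain loop. The function name abstraction_penalty is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// times[ 0 ] is the execution time of way 0 (the plain loop);
// times[ 1 ] ... times[ n - 1 ] are the times of the more abstract ways.
// Assumption: the penalty is the geometric mean of times[ i ] / times[ 0 ].
double abstraction_penalty( const std::vector< double > & times ) {
    assert( times.size() > 1 && times[ 0 ] > 0 );
    double log_sum = 0;
    for ( std::size_t i = 1; i < times.size(); ++i )
        log_sum += std::log( times[ i ] / times[ 0 ] );
    // The exponent of the mean logarithm is the geometric mean of the ratios.
    return std::exp( log_sum / ( times.size() - 1 ) );
}
```

With times of 1, 2 and 8 the ratios are 2 and 8, and their geometric mean is 4.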
The results are given in the tables below.
IA-32 platform:

| Optimisation | gcc 2.95 | gcc 3.3 | gcc 4.1 | intel 9.1 |
|---|---|---|---|---|
| -O0 | 11.78 | 8.5 | 9.16 | 12.12 |
| -O2 | 1.07 | 1.14 | 1.03 | 1.06 |
| -O3 -fomit-frame-pointer | 1.06 | 1.12 | 1.03 | 1.06 |

IA-64 platform:

| Optimisation | gcc 2.96 | gcc 3.3 | gcc 4.1 | intel 9.1 |
|---|---|---|---|---|
| -O0 | 2.1 | 4.68 | 4.26 | 3.51 |
| -O2 | 1.18 | 0.94 | 1.11 | 0.99 |
| -O3 -fomit-frame-pointer | 1.18 | 0.94 | 1.05 | 2.04 |

Sun platform:

| Optimisation | gcc 2.95 | gcc 3.3 | gcc 4.1 |
|---|---|---|---|
| -O0 | 5.43 | 7.79 | 7.42 |
| -O2 | 0.53 | 1.25 | 1.12 |
| -O3 -fomit-frame-pointer | 0.53 | 1.25 | 1 |

ARM platform:

| Optimisation | gcc 3.4 |
|---|---|
| -O0 | 5.32 |
| -O2 | 0.76 |
| -O3 -fomit-frame-pointer | 0.76 |

The -fomit-frame-pointer optimization switch is added to give a compiler a chance to use all the available CPU registers effectively.
The results show that modern compilers eliminate abstraction layer overheads well when optimization is switched on. Gcc series 3 demonstrated strange results on the IA-64 and ARM platforms: performance seems to rise as new abstraction layers are added. The analysis revealed that the most probable root cause is that inefficient code was generated for the C version, i.e. for way #0.
The Intel compiler also demonstrated an unexpected result on the IA-64 platform: the higher optimization level led to a worse result. The analysis of the case revealed that the C performance increased significantly while the C++ performance did not change.
The overall picture is still bright for modern C++ compilers. The generated C++ code is about as efficient as its functional C equivalent.
Abstraction layers might increase C++ performance in comparison to C in some cases. For example, functors may win against function pointers. Let's take sorting as a typical programming task. The C library provides the qsort function, which requires a pointer to an element comparison function to be passed as an argument. In C++ the std::sort algorithm can be used. The algorithm accepts various ways of comparing elements.
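The sorting options compared in this chapter can be sketched as follows. The element type int and all function names are assumptions made for the illustration; the text does not state what the original test sorted.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdlib>
#include <functional>

// C style: qsort calls the comparison through a function pointer
// and passes the elements as untyped pointers.
int compare_ints( const void * a, const void * b ) {
    const int x = *static_cast< const int * >( a );
    const int y = *static_cast< const int * >( b );
    return ( x > y ) - ( x < y );
}
void sort_c( int * data, std::size_t n ) {
    std::qsort( data, n, sizeof( int ), compare_ints );
}

// C++ with a function pointer as the comparison.
bool less_ints( int a, int b ) { return a < b; }
void sort_pointer( int * data, std::size_t n ) {
    std::sort( data, data + n, less_ints );
}

// C++ with a standard functor: the comparison type is known
// at compile time, so the call can be inlined.
void sort_functor( int * data, std::size_t n ) {
    std::sort( data, data + n, std::less< int >() );
}

// C++ with the native operator <.
void sort_native( int * data, std::size_t n ) {
    std::sort( data, data + n );
}
```

All four functions produce the same ordering; they differ only in how the comparison reaches the sorting routine.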
The tables below give the performance results for the various sorting options.

IA-32 platform:

| Optimisation | Container | Comparison way | gcc 2.95, % | gcc 3.3, % | gcc 4.1, % | intel 9.1, % |
|---|---|---|---|---|---|---|
| -O0 | Array | Function pointer | 187 | 191 | 135 | 169 |
| | | Standard functor | 178 | 253 | 229 | 184 |
| | | Native operator < | 302 | 375 | 317 | 274 |
| | std::vector | Function pointer | 187 | 107 | 88 | 84 |
| | | Standard functor | 178 | 129 | 130 | 84 |
| | | Native operator < | 294 | 153 | 147 | 112 |
| -O2 | Array | Function pointer | 220 | 251 | 315 | 265 |
| | | Standard functor | 460 | 605 | 577 | 706 |
| | | Native operator < | 557 | 572 | 611 | 706 |
| | std::vector | Function pointer | 220 | 245 | 305 | 302 |
| | | Standard functor | 460 | 542 | 577 | 662 |
| | | Native operator < | 557 | 572 | 577 | 662 |
| -O3 -fomit-frame-pointer | Array | Function pointer | 253 | 267 | 360 | 265 |
| | | Standard functor | 520 | 582 | 673 | 706 |
| | | Native operator < | 577 | 582 | 631 | 662 |
| | std::vector | Function pointer | 247 | 267 | 348 | 302 |
| | | Standard functor | 520 | 521 | 631 | 706 |
| | | Native operator < | 577 | 550 | 673 | 662 |

IA-64 platform:

| Optimisation | Container | Comparison way | gcc 2.96, % | gcc 3.3, % | gcc 4.1, % | intel 9.1, % |
|---|---|---|---|---|---|---|
| -O0 | Array | Function pointer | 93 | 78 | 61 | 50 |
| | | Standard functor | 146 | 116 | 95 | 76 |
| | | Native operator < | 158 | 158 | 130 | 140 |
| | std::vector | Function pointer | 93 | 37 | 34 | 28 |
| | | Standard functor | 146 | 45 | 43 | 38 |
| | | Native operator < | 158 | 52 | 48 | 50 |
| -O2 | Array | Function pointer | 145 | 144 | 147 | 107 |
| | | Standard functor | 187 | 220 | 221 | 179 |
| | | Native operator < | 212 | 220 | 220 | 178 |
| | std::vector | Function pointer | 145 | 139 | 146 | 107 |
| | | Standard functor | 188 | 180 | 200 | 173 |
| | | Native operator < | 214 | 180 | 201 | 176 |
| -O3 -fomit-frame-pointer | Array | Function pointer | 150 | 145 | 154 | 104 |
| | | Standard functor | 190 | 218 | 219 | 176 |
| | | Native operator < | 218 | 221 | 219 | 177 |
| | std::vector | Function pointer | 150 | 139 | 152 | 106 |
| | | Standard functor | 192 | 180 | 220 | 173 |
| | | Native operator < | 216 | 180 | 219 | 175 |

Sun platform:

| Optimisation | Container | Comparison way | gcc 2.95, % | gcc 3.3, % | gcc 4.1, % |
|---|---|---|---|---|---|
| -O0 | Array | Function pointer | 74 | 81 | 68 |
| | | Standard functor | 115 | 104 | 94 |
| | | Native operator < | 160 | 187 | 179 |
| | std::vector | Function pointer | 73 | 46 | 38 |
| | | Standard functor | 115 | 49 | 45 |
| | | Native operator < | 160 | 63 | 55 |
| -O2 | Array | Function pointer | 69 | 63 | 75 |
| | | Standard functor | 232 | 268 | 402 |
| | | Native operator < | 291 | 341 | 402 |
| | std::vector | Function pointer | 68 | 63 | 72 |
| | | Standard functor | 232 | 252 | 353 |
| | | Native operator < | 281 | 309 | 368 |
| -O3 -fomit-frame-pointer | Array | Function pointer | 72 | 63 | 81 |
| | | Standard functor | 309 | 273 | 520 |
| | | Native operator < | 334 | 363 | 505 |
| | std::vector | Function pointer | 71 | 63 | 77 |
| | | Standard functor | 321 | 269 | 491 |
| | | Native operator < | 334 | 327 | 505 |

ARM platform:

| Optimisation | Container | Comparison way | gcc 3.4, % |
|---|---|---|---|
| -O0 | Array | Function pointer | 186 |
| | | Standard functor | 180 |
| | | Native operator < | 293 |
| | std::vector | Function pointer | 72 |
| | | Standard functor | 71 |
| | | Native operator < | 86 |
| -O2 | Array | Function pointer | 234 |
| | | Standard functor | 371 |
| | | Native operator < | 396 |
| | std::vector | Function pointer | 236 |
| | | Standard functor | 359 |
| | | Native operator < | 371 |
| -O3 -fomit-frame-pointer | Array | Function pointer | 235 |
| | | Standard functor | 369 |
| | | Native operator < | 388 |
| | std::vector | Function pointer | 235 |
| | | Standard functor | 364 |
| | | Native operator < | 369 |

It is easy to see that C++ wins against C significantly. The win might reach 600% on the IA-32 platform. It is also obvious that the win on the IA-32 platform is generally higher than on the other platforms. This might be evidence of the compilers' maturity on this platform.
Using function pointers to compare elements is slower than the other ways in most cases. So if there is a choice between function pointers and functors, it is preferable to use functors.
It is worth noting that sorting which uses the native operator < is faster than the other options in most cases.
Instantiated templates work at run time at the same speed as equivalent non-template classes, so the run-time overheads are the same as those discussed in the "Abstraction Layers Penalties" chapter.
At compile time, however, templates may introduce considerable overheads. Moreover, disk space overheads may also appear due to the "code bloat" effect.
There are at least three main approaches to implementing C++ template instantiation:
The instantiation process may depend considerably on the way an application or a library is built. Suppose that a classic scheme with two components, a compiler and a linker, is used. The compiler translates source code files into object files that hold machine code and contain cross-references to other object files and libraries. The linker combines the object files into a single executable file, resolving the references. C and C++ compilers handle each compilation unit independently. A straightforward approach to template implementation would instantiate non-inlined functions in each compilation unit. So there is a chance that more than one object file will hold function bodies with the same names, and the link stage will fail in this case.
Let's consider in detail how each of the instantiation approaches mentioned above resolves the described problem and what overheads come into the picture.
Greedy instantiation allows duplicates to be created in many object files; however, those duplicates carry a special mark (e.g. "instantiated template that should be linked"). As soon as the linker finds duplicates, it keeps only one and throws away all the others. This approach has some drawbacks:
There are advantages as well:
So the overheads of the greedy instantiation approach are increased compilation and link time and probably an increased size of the object files and the final executable. There is also a chance of releasing code that is not the best optimized.
This approach assumes the creation and maintenance of a special database which is used during the compilation of all the translation units. The database holds information about the instantiated template specializations as well as their dependencies on the source code. The bodies of the instantiated templates are usually stored in that database as well.
In the case of instantiation on request, a compiler does not perform unnecessary template instantiations; however, there are difficulties in implementing the approach:
The main overhead here is the disk space occupied by the template specializations database.
There exist various similar methods implementing an iterative instantiation scheme. Their distinctive feature is the use of a pre-linker. One of them is implemented in the Comeau compiler. The automatic instantiation method works as follows:
Some compilers store information about things that could have been instantiated in associated ".ti" files, which also store information about how the object file was compiled.
Using this approach may result in increased link time. This increase is not dramatic, though, since the actual linking is not done at the preliminary stage. Moreover, the instantiation request files can be reused for subsequent links, so the number of recompilations is reduced.
The overheads of iterative instantiation are increased link time and the disk space needed to store the instantiation request files. Most probably the required disk space will be insignificant.
The following information was collected: the compilation time and the executable file size with and without symbol information. The strip utility was used to remove the symbol information from the executables.
Two versions of the source code were tested. The first version instantiated 40 std::list containers, each holding pointers to its own type. The second version instantiated 40 std::list containers, all holding pointers to the same type. The results are in the tables below:
IA-32 platform:

| Optimisation | Source code version | Measured value | gcc 2.95 | gcc 3.3 | gcc 4.1 | intel 9.1 |
|---|---|---|---|---|---|---|
| -O2 | 40 different templates | Compilation time, sec | 212 | 20 | 3 | 11 |
| | | Size before strip, KB | 505 | 87 | 10 | 145 |
| | | Size after strip, KB | 222 | 83 | 6 | 109 |
| | 40 identical templates | Compilation time, sec | 265 | 22 | 1 | 6 |
| | | Size before strip, KB | 498 | 80 | 6 | 94 |
| | | Size after strip, KB | 217 | 78 | 4 | 85 |
| -O3 -fomit-frame-pointer | 40 different templates | Compilation time, sec | 371 | 20 | 3 | 11 |
| | | Size before strip, KB | 602 | 87 | 8 | 145 |
| | | Size after strip, KB | 320 | 83 | 6 | 109 |
| | 40 identical templates | Compilation time, sec | 518 | 22 | 2 | 6 |
| | | Size before strip, KB | 594 | 80 | 8 | 94 |
| | | Size after strip, KB | 314 | 78 | 6 | 85 |
| -Os | 40 different templates | Compilation time, sec | 227 | 24 | 4 | 10 |
| | | Size before strip, KB | 505 | 88 | 29 | 148 |
| | | Size after strip, KB | 222 | 83 | 10 | 105 |
| | 40 identical templates | Compilation time, sec | 294 | 27 | 1 | 6 |
| | | Size before strip, KB | 498 | 81 | 7 | 93 |
| | | Size after strip, KB | 217 | 79 | 5 | 81 |

IA-64 platform:

| Optimisation | Source code version | Measured value | gcc 2.96 | gcc 3.3 | gcc 4.1 | intel 9.1 |
|---|---|---|---|---|---|---|
| -O2 | 40 different templates | Compilation time, sec | 40 | 29 | 2 | 7 |
| | | Size before strip, KB | 375 | 117 | 20 | 308 |
| | | Size after strip, KB | 368 | 112 | 15 | 212 |
| | 40 identical templates | Compilation time, sec | 34 | 27 | 1 | 3 |
| | | Size before strip, KB | 360 | 106 | 11 | 124 |
| | | Size after strip, KB | 356 | 104 | 8 | 116 |
| -O3 -fomit-frame-pointer | 40 different templates | Compilation time, sec | 40 | 29 | 2 | 7 |
| | | Size before strip, KB | 375 | 117 | 15 | 308 |
| | | Size after strip, KB | 368 | 112 | 12 | 212 |
| | 40 identical templates | Compilation time, sec | 35 | 27 | 1 | 3 |
| | | Size before strip, KB | 360 | 107 | 15 | 124 |
| | | Size after strip, KB | 356 | 104 | 12 | 116 |
| -Os | 40 different templates | Compilation time, sec | 33 | 32 | 3 | 7 |
| | | Size before strip, KB | 375 | 119 | 64 | 320 |
| | | Size after strip, KB | 368 | 113 | 43 | 216 |
| | 40 identical templates | Compilation time, sec | 56 | 31 | 1 | 3 |
| | | Size before strip, KB | 360 | 108 | 13 | 128 |
| | | Size after strip, KB | 356 | 105 | 10 | 116 |

Sun platform:

| Optimisation | Source code version | Measured value | gcc 2.95 | gcc 3.3 | gcc 4.1 |
|---|---|---|---|---|---|
| -O2 | 40 different templates | Compilation time, sec | 164 | 98 | 8 |
| | | Size before strip, KB | 798 | 77 | 19 |
| | | Size after strip, KB | 216 | 71 | 13 |
| | 40 identical templates | Compilation time, sec | 160 | 90 | 2 |
| | | Size before strip, KB | 785 | 64 | 8 |
| | | Size after strip, KB | 206 | 61 | 5 |
| -O3 -fomit-frame-pointer | 40 different templates | Compilation time, sec | 165 | 99 | 10 |
| | | Size before strip, KB | 797 | 77 | 10 |
| | | Size after strip, KB | 216 | 71 | 7 |
| | 40 identical templates | Compilation time, sec | 158 | 92 | 5 |
| | | Size before strip, KB | 784 | 64 | 10 |
| | | Size after strip, KB | 205 | 61 | 7 |
| -Os | 40 different templates | Compilation time, sec | 180 | 108 | 9 |
| | | Size before strip, KB | 798 | 78 | 62 |
| | | Size after strip, KB | 217 | 72 | 44 |
| | 40 identical templates | Compilation time, sec | 173 | 99 | 2 |
| | | Size before strip, KB | 785 | 65 | 9 |
| | | Size after strip, KB | 206 | 62 | 6 |

ARM platform:

| Optimisation | Source code version | Measured value | gcc 3.4 |
|---|---|---|---|
| -O2 | 40 different templates | Size before strip, KB | 20 |
| | | Size after strip, KB | 8 |
| | 40 identical templates | Size before strip, KB | 24 |
| | | Size after strip, KB | 8 |
| -O3 -fomit-frame-pointer | 40 different templates | Size before strip, KB | 20 |
| | | Size after strip, KB | 18 |
| | 40 identical templates | Size before strip, KB | 24 |
| | | Size after strip, KB | 18 |
| -Os | 40 different templates | Size before strip, KB | 29 |
| | | Size after strip, KB | 18 |
| | 40 identical templates | Size before strip, KB | 24 |
| | | Size after strip, KB | 18 |

The interesting thing here is the confirmation that compilers have made a big step in reducing the compilation time and the executable file size. In some cases the gcc compilation time was reduced by a factor of 100 when moving from series 2 to series 4.
The compilation time for the ARM platform is not given because a cross-compiler was used, so the compilation time depended on the speed of the host system (IA-32). Numbers for a single compiler on this platform are not really interesting; however, the table is given in the hope of extending it in the future.
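The two source code versions used in this chapter can be sketched as follows. The sketch uses 3 containers instead of 40 to stay short, and the type and function names are hypothetical; the original test sources are not shown in the text.

```cpp
#include <cstddef>
#include <list>

// Hypothetical element types; the real test used 40 of them.
struct T1 {};
struct T2 {};
struct T3 {};

// Version 1: every list holds pointers to its own type, so each
// std::list< Tn * > is a separate template instantiation.
std::size_t use_different_types() {
    static T1 a; static T2 b; static T3 c;
    std::list< T1 * > l1;
    std::list< T2 * > l2;
    std::list< T3 * > l3;
    l1.push_back( &a );
    l2.push_back( &b );
    l3.push_back( &c );
    return l1.size() + l2.size() + l3.size();
}

// Version 2: all lists hold pointers to the same type, so the single
// instantiation std::list< T1 * > is shared by all of them.
std::size_t use_same_type() {
    static T1 a;
    std::list< T1 * > l1, l2, l3;
    l1.push_back( &a );
    l2.push_back( &a );
    l3.push_back( &a );
    return l1.size() + l2.size() + l3.size();
}
```

The two functions do the same amount of run-time work; the difference is in how many distinct instantiations the compiler has to generate.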
A member function call is roughly the same as a free function call with one additional parameter: a pointer to an object. Let's consider three ways of calling a member function:
| Description | C++ | C |
|---|---|---|
| Notation '->' | x->g( i ); | g( ps, i ); |
| Notation '.' | x.g( i ); | g( &s, i ); |
| Static member function vs free function | X::f( i ); | f( i ); |
The tests compare function calls with an integer argument, which is shown as "i" in the table above. "ps" in the table is a pointer, while "s" is an object.
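The pairs of calls from the table above can be sketched as follows. The function bodies and the total member are assumptions made for the illustration; the text only gives the call notations.

```cpp
// C++ side: member functions of a hypothetical type X.
struct X {
    int total;
    X() : total( 0 ) {}
    void g( int i ) { total += i; }           // receives "this" implicitly
    static int f( int i ) { return i + 1; }   // no "this" at all
};

// C side: the object is passed explicitly as the first parameter.
struct S { int total; };
void g( S * ps, int i ) { ps->total += i; }
int f( int i ) { return i + 1; }
```

Here `x->g( i )` and `g( ps, i )` do the same work, since the only difference is whether the object pointer is passed implicitly or explicitly.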
The test results are given in the tables below.
IA-32 platform:

| Optimisation | Test | gcc 2.95, % | gcc 3.3, % | gcc 4.1, % | intel 9.1, % |
|---|---|---|---|---|---|
| -O0 | Notation '->' | 102 | 98 | 99 | 98 |
| | Notation '.' | 101 | 98 | 96 | 101 |
| | Static member function vs free function | 105 | 100 | 100 | 100 |
| -O2 | Notation '->' | 95 | 87 | 102 | 100 |
| | Notation '.' | 110 | 90 | 100 | 104 |
| | Static member function vs free function | 101 | 100 | 100 | 153 |
| -O3 -fomit-frame-pointer | Notation '->' | 106 | 95 | 104 | 90 |
| | Notation '.' | 111 | 100 | 104 | 104 |
| | Static member function vs free function | 100 | 101 | 95 | 160 |

IA-64 platform:

| Optimisation | Test | gcc 2.96, % | gcc 3.3, % | gcc 4.1, % | intel 9.1, % |
|---|---|---|---|---|---|
| -O0 | Notation '->' | 81 | 95 | 95 | 100 |
| | Notation '.' | 81 | 95 | 95 | 99 |
| | Static member function vs free function | 96 | 100 | 100 | 100 |
| -O2 | Notation '->' | 38 | 270 | 117 | 86 |
| | Notation '.' | 38 | 243 | 83 | 85 |
| | Static member function vs free function | 63 | 100 | 100 | 99 |
| -O3 -fomit-frame-pointer | Notation '->' | 37 | 83 | 100 | 85 |
| | Notation '.' | 36 | 83 | 207 | 85 |
| | Static member function vs free function | 63 | 100 | 33 | 100 |

Sun platform:

| Optimisation | Test | gcc 2.95, % | gcc 3.3, % | gcc 4.1, % |
|---|---|---|---|---|
| -O0 | Notation '->' | 114 | 113 | 99 |
| | Notation '.' | 114 | 85 | 100 |
| | Static member function vs free function | 100 | 99 | 99 |
| -O2 | Notation '->' | 100 | 102 | 90 |
| | Notation '.' | 100 | 99 | 87 |
| | Static member function vs free function | 92 | 87 | 99 |
| -O3 -fomit-frame-pointer | Notation '->' | 99 | 100 | 100 |
| | Notation '.' | 99 | 89 | 100 |
| | Static member function vs free function | 99 | 91 | 100 |

ARM platform:

| Optimisation | Test | gcc 3.4, % |
|---|---|---|
| -O0 | Notation '->' | 100 |
| | Notation '.' | 99 |
| | Static member function vs free function | 100 |
| -O2 | Notation '->' | 118 |
| | Notation '.' | 112 |
| | Static member function vs free function | 89 |
| -O3 -fomit-frame-pointer | Notation '->' | 100 |
| | Notation '.' | 151 |
| | Static member function vs free function | 101 |

C++ performance on the IA-32, Sun and ARM platforms does not differ from the C performance by more than 10% in most cases. The results on the IA-64 platform are not so even. The C++ performance depends significantly on the particular case and varies from a clear win for C++ (gcc series 4 with maximum optimization for notation '.': 207%) to a major loss (gcc series 4 with maximum optimization for static member functions: 33%).
Virtual functions, like non-virtual ones, can be called using the notations '->' and '.'. Since pointers to virtual functions are stored in a separate table, a virtual function call is about the same as calling a function with one additional parameter via a pointer which is stored in an array. The table below describes the C++ and C call options.
| Description | C++ | C |
|---|---|---|
| Notation '->' | x->f( i ); | (p[1])(ps,i); |
| Notation '.' | x.f( i ); | (p[1])(&s,i); |
Here "i" is an integer parameter, "p" is an array of function pointers, "ps" is a pointer to an object and "s" is an object.
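The C++ and C variants from the table above can be sketched as follows. The function bodies, the Function typedef and the contents of slot 0 are assumptions made for the illustration.

```cpp
// C++ side: f is dispatched through the compiler-generated vtbl.
struct X {
    int total;
    X() : total( 0 ) {}
    virtual ~X() {}
    virtual void f( int i ) { total += i; }
};

// C side: a hand-written array of function pointers plays the role
// of the vtbl, and the object is passed explicitly.
struct S { int total; };
typedef void ( *Function )( S *, int );
void unused( S *, int ) {}
void add( S * ps, int i ) { ps->total += i; }

// Slot 1 holds the function under test, matching (p[1])(ps,i) in the table.
Function p[] = { unused, add };
```

In both variants the call target is loaded from a table at run time, unless the compiler can prove which function is meant.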
The test results are given in the tables below.
IA-32 platform:

| Optimisation | Notation | gcc 2.95, % | gcc 3.3, % | gcc 4.1, % | intel 9.1, % |
|---|---|---|---|---|---|
| -O0 | Notation '->' | 92 | 87 | 114 | 91 |
| | Notation '.' | 104 | 103 | 101 | 105 |
| -O2 | Notation '->' | 89 | 92 | 90 | 97 |
| | Notation '.' | 110 | 106 | 110 | 702 |
| -O3 -fomit-frame-pointer | Notation '->' | 97 | 93 | 91 | 97 |
| | Notation '.' | 122 | 106 | 500 | 702 |

IA-64 platform:

| Optimisation | Notation | gcc 2.96, % | gcc 3.3, % | gcc 4.1, % | intel 9.1, % |
|---|---|---|---|---|---|
| -O0 | Notation '->' | 81 | 95 | 95 | 100 |
| | Notation '.' | 81 | 95 | 95 | 99 |
| -O2 | Notation '->' | 96 | 100 | 100 | 100 |
| | Notation '.' | 38 | 270 | 117 | 86 |
| -O3 -fomit-frame-pointer | Notation '->' | 38 | 243 | 83 | 85 |
| | Notation '.' | 63 | 100 | 100 | 99 |

Sun platform:

| Optimisation | Notation | gcc 2.95, % | gcc 3.3, % | gcc 4.1, % |
|---|---|---|---|---|
| -O0 | Notation '->' | 94 | 94 | 95 |
| | Notation '.' | 158 | 112 | 152 |
| -O2 | Notation '->' | 77 | 91 | 95 |
| | Notation '.' | 224 | 207 | 206 |
| -O3 -fomit-frame-pointer | Notation '->' | 81 | 85 | 93 |
| | Notation '.' | 205 | 225 | 1234 |

ARM platform:

| Optimisation | Notation | gcc 3.4, % |
|---|---|---|
| -O0 | Notation '->' | 90 |
| | Notation '.' | 125 |
| -O2 | Notation '->' | 96 |
| | Notation '.' | 141 |
| -O3 -fomit-frame-pointer | Notation '->' | 96 |
| | Notation '.' | 498 |

It can be seen that C++ almost always wins in the case of notation '.'. Sometimes C++ wins by a factor of 5 to 7. Most probably this is due to the specifics of the optimizer implementation: for C++ the optimizer is presumably able to perform de-virtualisation, while for C no attempt is made to do a similar thing.
In the case of notation '->' C wins slightly, with rare exceptions. Thus gcc series 4 on the IA-32 platform without optimization generated better code for C++. The Intel compiler on the IA-64 platform without optimization, in turn, demonstrated a lack of C++ performance.
The overheads of calling virtual and non-virtual C++ member functions can be different. The tables below compare virtual and non-virtual function calls. Each table cell shows the performance of virtual function calls as a percentage of the performance of non-virtual calls. A number greater than 100 means that virtual function calls are faster than non-virtual ones.
IA-32 platform:

| Optimisation | Notation | gcc 2.95, % | gcc 3.3, % | gcc 4.1, % | intel 9.1, % |
|---|---|---|---|---|---|
| -O0 | Notation '->' | 80 | 87 | 112 | 92 |
| | Notation '.' | 99 | 99 | 100 | 101 |
| -O2 | Notation '->' | 90 | 85 | 90 | 6 |
| | Notation '.' | 98 | 100 | 100 | 100 |
| -O3 -fomit-frame-pointer | Notation '->' | 81 | 83 | 16 | 6 |
| | Notation '.' | 100 | 100 | 95 | 100 |

IA-64 platform:

| Optimisation | Notation | gcc 2.96, % | gcc 3.3, % | gcc 4.1, % | intel 9.1, % |
|---|---|---|---|---|---|
| -O0 | Notation '->' | 77 | 79 | 77 | 42 |
| | Notation '.' | 95 | 100 | 100 | 100 |
| -O2 | Notation '->' | 150 | 70 | 59 | 77 |
| | Notation '.' | 258 | 100 | 85 | 526 |
| -O3 -fomit-frame-pointer | Notation '->' | 158 | 60 | 13 | 77 |
| | Notation '.' | 273 | 100 | 100 | 699 |

Sun platform:

| Optimisation | Notation | gcc 2.95, % | gcc 3.3, % | gcc 4.1, % |
|---|---|---|---|---|
| -O0 | Notation '->' | 52 | 70 | 64 |
| | Notation '.' | 99 | 100 | 97 |
| -O2 | Notation '->' | 36 | 41 | 46 |
| | Notation '.' | 100 | 96 | 114 |
| -O3 -fomit-frame-pointer | Notation '->' | 38 | 42 | 7 |
| | Notation '.' | 100 | 111 | 99 |

ARM platform:

| Optimisation | Notation | gcc 3.4, % |
|---|---|---|
| -O0 | Notation '->' | 72 |
| | Notation '.' | 99 |
| -O2 | Notation '->' | 64 |
| | Notation '.' | 93 |
| -O3 -fomit-frame-pointer | Notation '->' | 19 |
| | Notation '.' | 119 |

The fact that, for notation '.', the performance of virtual and non-virtual function calls is about the same can be explained by the ability of compilers to de-virtualise the virtual function calls.
In the case of notation '->' virtual functions lose in most cases. Sometimes the loss is dramatic: the Intel compiler lost by a factor of 16 on the IA-32 platform and gcc series 4 lost by a factor of 6.
In some cases (e.g. gcc series 4 on Sun, -O3) the significant loss of virtual function call performance in comparison to non-virtual function calls can be explained by the fact that the compiler was able to inline the non-virtual function while the virtual function was not inlined.
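The difference between the two notations can be illustrated with a minimal sketch. The types and function names are hypothetical, and whether a particular compiler actually de-virtualises or inlines these calls depends on the compiler and the optimization level.

```cpp
struct Base {
    virtual ~Base() {}
    virtual int f( int i ) { return i; }
};

struct Derived : Base {
    virtual int f( int i ) { return i + 1; }
};

// Notation '.': the dynamic type of a local object is known at compile
// time, so a compiler may call Base::f directly or even inline it.
int call_by_object( int i ) {
    Base b;
    return b.f( i );
}

// Notation '->': the pointer may refer to an object of any derived type,
// so in general the call has to go through the vtbl.
int call_by_pointer( Base * p, int i ) {
    return p->f( i );
}
```

In call_by_pointer the result depends on the object that is passed in, which is why the compiler usually cannot replace the indirect call with a direct one.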
Inline functions are the C++ alternative to C macros. The tables below compare the performance of these alternatives for the notations '->' and '.'.
IA-32 platform:

| Optimisation | Notation | gcc 2.95, % | gcc 3.3, % | gcc 4.1, % | intel 9.1, % |
|---|---|---|---|---|---|
| -O0 | Notation '->' | 64 | 54 | 49 | 47 |
| | Notation '.' | 31 | 49 | 36 | 35 |
| -O2 | Notation '->' | 100 | 123 | 100 | 95 |
| | Notation '.' | 97 | 98 | 100 | 104 |
| -O3 -fomit-frame-pointer | Notation '->' | 97 | 82 | 100 | 100 |
| | Notation '.' | 102 | 98 | 102 | 102 |

IA-64 platform:

| Optimisation | Notation | gcc 2.96, % | gcc 3.3, % | gcc 4.1, % | intel 9.1, % |
|---|---|---|---|---|---|
| -O0 | Notation '->' | 108 | 64 | 74 | 68 |
| | Notation '.' | 95 | 52 | 58 | 58 |
| -O2 | Notation '->' | 101 | 33 | 99 | 33 |
| | Notation '.' | 446 | 100 | 300 | 299 |
| -O3 -fomit-frame-pointer | Notation '->' | 301 | 99 | 300 | 100 |
| | Notation '.' | 447 | 100 | 33 | 200 |

Sun platform:

| Optimisation | Notation | gcc 2.95, % | gcc 3.3, % | gcc 4.1, % |
|---|---|---|---|---|
| -O0 | Notation '->' | 63 | 64 | 58 |
| | Notation '.' | 84 | 44 | 47 |
| -O2 | Notation '->' | 99 | 100 | 99 |
| | Notation '.' | 99 | 100 | 99 |
| -O3 -fomit-frame-pointer | Notation '->' | 100 | 99 | 100 |
| | Notation '.' | 100 | 100 | 100 |

ARM platform:

| Optimisation | Notation | gcc 3.4, % |
|---|---|---|
| -O0 | Notation '->' | 48 |
| | Notation '.' | 38 |
| -O2 | Notation '->' | 120 |
| | Notation '.' | 100 |
| -O3 -fomit-frame-pointer | Notation '->' | 101 |
| | Notation '.' | 83 |

With optimization switched off, compilers do not even try to inline function calls. The first two lines in each of the tables confirm this assumption.
With optimization switched on, the results differ a lot across platforms and cases. The most stable results are demonstrated by gcc series 4 on the IA-32 platform: inline functions and macros have the same performance. The Intel compiler demonstrates a loss of inline function performance for notation '->' on the IA-32 platform.
The IA-64 platform revealed both wins and losses. For example, with the Intel compiler inline functions win for notation '.', while with gcc series 4 inline functions lose significantly for notation '.' and -O2 optimisation.
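The kind of code compared in this chapter can be sketched as follows. The type X, its member and the macro names are hypothetical; the text does not show the original test sources.

```cpp
struct X {
    int value;
    // Defined inside the class body, hence implicitly inline.
    int get() const { return value; }
};

// C-style macro alternatives for both notations; the preprocessor
// expands them textually, so no call is ever generated.
#define GET_BY_POINTER( ps ) ( ( ps )->value )
#define GET_BY_OBJECT( s )   ( ( s ).value )
```

With optimization switched on, a compiler is expected to turn `x->get()` into the same member access as `GET_BY_POINTER( x )`; without optimization the inline function remains a real call.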
Additional run-time overheads may appear when calling virtual functions in comparison to calling non-virtual ones. There could be both a CPU overhead and a RAM overhead. The overheads can vary for different inheritance cases, single and multiple, and even for a different order of inheritance. Let's consider in detail what is going on in various cases for a typical C++ implementation.
Suppose that the following type is used as a base one.

    struct Base {
        Data d1;
        virtual void f( void );
        void g( void );
    };
Objects of the Base type will be allocated in RAM as shown in the figure below.
[Figure: memory layout of a Base object]
The virtual function table (vtbl) for the Base type holds a pointer to the virtual function f, and the data members are extended with a pointer (vptr) to the vtbl. Exactly which elements are stored in the virtual function table is not important for now. They could be pointers, deltas for the "this" pointer correction or something else.
Now suppose that there is a type Derived which inherits from Base:

    struct Derived : public Base {
        Data d2;
        virtual void f( void );
        virtual void h( void );
    };
Objects of the Derived type will be allocated in RAM as shown in the figure below.
[Figure: memory layout of a Derived object]
The base type members are allocated first, followed by the derived type members. The vtbl is extended with one more pointer, &Derived::h, and &Base::f is replaced with &Derived::f.
It should be noted that when an object of type Derived is allocated (i.e. in the case of single inheritance), the addresses of the Base part and of the Derived object are the same. One more interesting feature is that it is possible to store only one copy of the vtbl for many allocated objects of the same type. That reduces run-time memory overheads and possibly disk space overheads.
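The two observations above can be checked with a minimal sketch. The definition of Data and the empty function bodies are assumptions, since the text does not give them, and the result reflects the typical implementation described here rather than a guarantee of the language.

```cpp
// Hypothetical payload type; the text does not define Data.
struct Data { int payload; };

struct Base {
    Data d1;
    virtual void f() {}
    void g() {}
};

struct Derived : public Base {
    Data d2;
    virtual void f() {}
    virtual void h() {}
};

// With single inheritance the Base part typically starts at the
// very beginning of the Derived object.
bool same_address( Derived * derived ) {
    Base * base = derived;   // implicit conversion to the base type
    return static_cast< void * >( base ) == static_cast< void * >( derived );
}

// The vptr makes a Base object larger than its data members alone.
bool has_vptr_overhead() {
    return sizeof( Base ) > sizeof( Data );
}
```

The size difference reported by has_vptr_overhead is the per-object RAM cost; the vtbl itself is shared by all objects of the same type.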
Suppose that there are two base types: Base1 and Base2:
struct Base1 { Data d1; virtual void f( void ); }; struct Base2 { Data d2; virtual void f( void ); virtual void g( void ); }; |
Now suppose that the DerivedMultiple type derives from both Base1 and Base2:

    struct DerivedMultiple : public Base1, public Base2
    {
        Data d3;
        virtual void f( void );
        virtual void g( void );
        virtual void h( void );
    };

Objects of the Base1 and Base2 types are laid out in the same way as objects of the Base type shown in figure 2. The interesting part is the layout of DerivedMultiple objects:
[Figure: memory layout of a DerivedMultiple object, showing the addresses a1 and a2 and the two vtbls]
"s" marks the size occupied by the Base1 part of the object.
The Base1 members are placed in memory first, followed by the Base2 members. The DerivedMultiple members follow the Base2 members. The most important detail here is that the DerivedMultiple object has two addresses, marked as a1 and a2 in the figure. The two addresses appear when a developer writes code similar to the following:

    DerivedMultiple * Object( new DerivedMultiple );  // Corresponds to a1
    Base1 * base1( Object );                          // Corresponds to a1 as well
    Base2 * base2( Object );                          // Corresponds to a2

When the virtual function f is called, however, the correct "this" pointer must be passed, i.e. a pointer to the object which was originally created. This is the a1 pointer in the example. If only the a2 pointer is available, an additional action is required: a2 must be corrected by the size of the Base1 part, that is by s. That is why the vtbls in the figure are extended with one more element, the value of the "this" pointer correction for virtual function calls.
A similar situation appears when the described hierarchy is used as follows:

    Base2 * base2a( new Base2 );
    Base2 * base2b( new DerivedMultiple );
    base2a->f();
    base2b->f();

The Base2 * in the example above can point either to a Base2 object or to the Base2 part of a DerivedMultiple object. The first call of the virtual function f invokes Base2::f, while the second invokes DerivedMultiple::f. Since base2b points to the Base2 part of the DerivedMultiple object, the "this" pointer must be corrected so that it points to the DerivedMultiple object. The correction value is s.
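The two addresses and the "this" correction can be made visible with a small program. This is a sketch under assumptions: `Data` is a hypothetical stand-in for the article's type, and f is changed to return its own "this" pointer purely so that the correction can be observed. The equality of the Base1 address and the object address holds on typical implementations only.

```cpp
#include <cassert>
#include <cstddef>

struct Data { int value = 0; };  // hypothetical stand-in for the article's Data type

struct Base1 {
    Data d1;
    virtual const void* f() { return this; }
};

struct Base2 {
    Data d2;
    virtual const void* f() { return this; }
    virtual void g() {}
};

struct DerivedMultiple : public Base1, public Base2 {
    Data d3;
    const void* f() override { return this; }  // final overrider for both bases
    void g() override {}
    virtual void h() {}
};

// Distance in bytes between a1 (the whole object) and a2 (its Base2 part).
std::ptrdiff_t base2_offset(DerivedMultiple& object) {
    Base2* a2 = &object;
    return reinterpret_cast<const char*>(a2) - reinterpret_cast<const char*>(&object);
}

void check_this_correction() {
    DerivedMultiple object;
    Base1* base1 = &object;  // a1 on typical implementations
    Base2* base2 = &object;  // a2, shifted past the Base1 part

    assert(static_cast<const void*>(base1) == static_cast<const void*>(&object));
    assert(base2_offset(object) > 0);

    // Called through a2, DerivedMultiple::f still receives "this" equal to a1.
    assert(base2->f() == static_cast<const void*>(&object));
}
```

The value returned by `base2_offset` plays the role of s in the figure; its exact value depends on the platform and on padding.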
From the description above it is easy to see that virtual functions bring run-time overheads in RAM (storing the vtbls) and in CPU time (looking up those tables and possibly correcting "this"). Function inlining does not work for such virtual function calls either.
It is worth saying that quite often a virtual function is called in a context where the compiler has all the required type information, which makes it possible to convert the virtual function call into an ordinary function call. This kind of optimization is called de-virtualisation; it replaces an indirect call via a table of function pointers with a direct function call.
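The contexts in which de-virtualisation is possible can be illustrated as follows. The types `Shape` and `Triangle` are hypothetical and are not part of the article's tests, and the `final` specifier requires C++11, which is newer than the compilers tested here. Whether a particular compiler actually removes the indirect call depends on the compiler and the optimisation level; the sketch only shows where it has enough information to do so.

```cpp
#include <cassert>

struct Shape {
    virtual int sides() const { return 0; }
    virtual ~Shape() = default;
};

struct Triangle final : public Shape {
    int sides() const override { return 3; }
};

// Dynamic type unknown: the call normally goes through the vtbl.
int sides_via_base(const Shape& shape) { return shape.sides(); }

// Dynamic type known: the object is created right here, so the compiler
// may call Triangle::sides directly and even inline it.
int sides_of_local_object() {
    Triangle t;
    return t.sides();
}

// Triangle is final, so no further override can exist and a direct call is safe.
int sides_via_final(const Triangle& t) { return t.sides(); }

// De-virtualisation must not change the observable result.
void check_same_results() {
    Triangle t;
    assert(sides_via_base(t) == 3);
    assert(sides_of_local_object() == 3);
    assert(sides_via_final(t) == 3);
}
```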
There are at least two approaches which compilers can employ to implement virtual functions. The first one stores deltas for the "this" pointer in the vtbl, as shown in the figures above. The second one generates a small piece of code called a "thunk" which corrects the "this" pointer. If no correction is required the corresponding thunk is empty, which speeds up the virtual function call.
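Both approaches can be emulated by hand to see what the compiler has to do. The sketch below is only an emulation with plain structs and function pointers; all the names (`VtblEntry`, `Part1`, `Part2`, `Whole` and the functions) are hypothetical, and real compilers generate this machinery internally in an implementation-specific form.

```cpp
#include <cassert>
#include <cstddef>

// One emulated vtbl element: the function plus the "this" correction.
struct VtblEntry {
    int (*fn)(void* self);   // expects the address of the whole object
    std::ptrdiff_t delta;    // added to the incoming pointer before the call
};

// Emulated layout: two base parts, each with its own vptr, then own data.
struct Part1 { const VtblEntry* vptr; int d1; };
struct Part2 { const VtblEntry* vptr; int d2; };
struct Whole { Part1 part1; Part2 part2; int d3; };

// The "virtual function" of the whole object.
int whole_f(void* self) { return static_cast<Whole*>(self)->d3; }

// Approach 1: the tables store deltas.
const VtblEntry vtbl_for_part1[] = { { whole_f, 0 } };
const VtblEntry vtbl_for_part2[] = {
    { whole_f, -static_cast<std::ptrdiff_t>(offsetof(Whole, part2)) }
};

int call_f_via_part2(Part2* part) {
    const VtblEntry& entry = part->vptr[0];
    void* corrected = reinterpret_cast<char*>(part) + entry.delta;
    return entry.fn(corrected);
}

// Approach 2: a thunk corrects the pointer and forwards the call.
int whole_f_thunk_for_part2(void* part2_address) {
    return whole_f(static_cast<char*>(part2_address) - offsetof(Whole, part2));
}

void check_both_approaches() {
    Whole object = {};
    object.part1.vptr = vtbl_for_part1;
    object.part2.vptr = vtbl_for_part2;
    object.d3 = 42;

    assert(call_f_via_part2(&object.part2) == 42);
    assert(whole_f_thunk_for_part2(&object.part2) == 42);
}
```

With deltas every call pays for the addition even when the delta is zero, while with thunks only the calls that really need a correction go through extra code.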
As mentioned above, function call overheads can differ between single and multiple inheritance. The overheads can also differ between virtual and non-virtual functions, and the order of the base classes may influence them as well. The test results for all the mentioned cases are given below.
The diagram below shows the two type hierarchies which were used in the tests. In the case of multiple inheritance, the branch related to the Base1 base class is referred to as the "first" branch in the further discussion, and the branch related to the Base2 base class is referred to as the "second" branch.
[Figure: the single inheritance and multiple inheritance type hierarchies used in the tests]
The tests measure the performance of virtual and non-virtual function calls for the various inheritance cases. In the case of multiple inheritance the performance is measured twice, once for each inheritance branch.
The tables below give the function call performance for multiple inheritance relative to single inheritance. That is, a number larger than 100 means that a function call in the case of multiple inheritance is faster than in the case of single inheritance.

Intel 32 bit (IA-32):

| Optimisation | Virtuality | Inheritance branch | gcc 2.95, % | gcc 3.3, % | gcc 4.1, % | intel 9.1, % |
|---|---|---|---|---|---|---|
| -O0 | Non-virtual | Base1 | 105 | 103 | 94 | 101 |
| -O0 | Non-virtual | Base2 | 98 | 99 | 94 | 96 |
| -O0 | Virtual | Base1 | 100 | 100 | 100 | 99 |
| -O0 | Virtual | Base2 | 88 | 82 | 90 | 61 |
| -O2 | Non-virtual | Base1 | 102 | 102 | 100 | 100 |
| -O2 | Non-virtual | Base2 | 102 | 103 | 104 | 100 |
| -O2 | Virtual | Base1 | 100 | 98 | 99 | 99 |
| -O2 | Virtual | Base2 | 78 | 97 | 94 | 99 |
| -O3 -fomit-frame-pointer | Non-virtual | Base1 | 102 | 99 | 102 | 100 |
| -O3 -fomit-frame-pointer | Non-virtual | Base2 | 102 | 118 | 102 | 100 |
| -O3 -fomit-frame-pointer | Virtual | Base1 | 100 | 99 | 99 | 98 |
| -O3 -fomit-frame-pointer | Virtual | Base2 | 86 | 69 | 83 | 99 |

Intel 64 bit (IA-64):

| Optimisation | Virtuality | Inheritance branch | gcc 2.96, % | gcc 3.3, % | gcc 4.1, % | intel 9.1, % |
|---|---|---|---|---|---|---|
| -O0 | Non-virtual | Base1 | 103 | 100 | 99 | 100 |
| -O0 | Non-virtual | Base2 | 95 | 92 | 95 | 96 |
| -O0 | Virtual | Base1 | 100 | 100 | 99 | 99 |
| -O0 | Virtual | Base2 | 124 | 93 | 96 | 62 |
| -O2 | Non-virtual | Base1 | 36 | 99 | 116 | 100 |
| -O2 | Non-virtual | Base2 | 37 | 99 | 116 | 99 |
| -O2 | Virtual | Base1 | 99 | 100 | 99 | 99 |
| -O2 | Virtual | Base2 | 85 | 91 | 90 | 99 |
| -O3 -fomit-frame-pointer | Non-virtual | Base1 | 298 | 100 | 100 | 99 |
| -O3 -fomit-frame-pointer | Non-virtual | Base2 | 100 | 300 | 33 | 99 |
| -O3 -fomit-frame-pointer | Virtual | Base1 | 100 | 99 | 99 | 99 |
| -O3 -fomit-frame-pointer | Virtual | Base2 | 86 | 90 | 90 | 100 |

Sun UltraSPARC-II, 64 bit:

| Optimisation | Virtuality | Inheritance branch | gcc 2.95, % | gcc 3.3, % | gcc 4.1, % |
|---|---|---|---|---|---|
| -O0 | Non-virtual | Base1 | 100 | 113 | 100 |
| -O0 | Non-virtual | Base2 | 108 | 94 | 97 |
| -O0 | Virtual | Base1 | 88 | 98 | 100 |
| -O0 | Virtual | Base2 | 90 | 82 | 84 |
| -O2 | Non-virtual | Base1 | 99 | 100 | 99 |
| -O2 | Non-virtual | Base2 | 99 | 104 | 91 |
| -O2 | Virtual | Base1 | 108 | 94 | 99 |
| -O2 | Virtual | Base2 | 107 | 64 | 97 |
| -O3 -fomit-frame-pointer | Non-virtual | Base1 | 100 | 100 | 100 |
| -O3 -fomit-frame-pointer | Non-virtual | Base2 | 100 | 100 | 100 |
| -O3 -fomit-frame-pointer | Virtual | Base1 | 97 | 100 | 100 |
| -O3 -fomit-frame-pointer | Virtual | Base2 | 101 | 70 | 99 |

ARM11, 32 bit:

| Optimisation | Virtuality | Inheritance branch | gcc 3.4, % |
|---|---|---|---|
| -O0 | Non-virtual | Base1 | 100 |
| -O0 | Non-virtual | Base2 | 92 |
| -O0 | Virtual | Base1 | 100 |
| -O0 | Virtual | Base2 | 85 |
| -O2 | Non-virtual | Base1 | 77 |
| -O2 | Non-virtual | Base2 | 100 |
| -O2 | Virtual | Base1 | 99 |
| -O2 | Virtual | Base2 | 79 |
| -O3 -fomit-frame-pointer | Non-virtual | Base1 | 100 |
| -O3 -fomit-frame-pointer | Non-virtual | Base2 | 100 |
| -O3 -fomit-frame-pointer | Virtual | Base1 | 99 |
| -O3 -fomit-frame-pointer | Virtual | Base2 | 79 |
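One plausible way to obtain percentages like those in the tables above is to time the reference case and the measured case and divide one by the other. The sketch below assumes this ratio-of-times reading, which is consistent with "above 100 means faster" but is not confirmed as the exact formula used for these measurements. The helper names `seconds_for` and `relative_performance` are hypothetical.

```cpp
#include <cassert>
#include <chrono>

// Wall-clock time, in seconds, of calling `call` the given number of times.
// Note: an optimizer may remove a loop whose body has no observable effect,
// so a real benchmark has to make the called code do visible work.
template <typename Call>
double seconds_for(Call call, long iterations) {
    const auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < iterations; ++i) {
        call();
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

// Relative performance in percent under the assumed model:
// time of the reference case divided by time of the measured case.
double relative_performance(double reference_seconds, double measured_seconds) {
    assert(measured_seconds > 0.0);
    return 100.0 * reference_seconds / measured_seconds;
}
```

For the tables above the reference case would be the call with single inheritance and the measured case the call with multiple inheritance, so a measured time that is half the reference time gives 200.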
The order of inheritance does not considerably affect the performance of non-virtual function calls for the modern compilers. The picture is different for virtual functions. With optimization switched on, the Intel compiler demonstrated the same virtual function call performance regardless of the inheritance branch, and it was about the same as the virtual function call performance for single inheritance.
Gcc compilers series 3 and 4 with optimization switched on demonstrated a difference in virtual function call performance depending on the inheritance branch. The performance loss for the second branch was between 10% and 30% in comparison to virtual function calls with single inheritance. For the first inheritance branch such losses are practically negligible.
In the case of virtual inheritance the data structures become even more complicated. Let's consider the following hierarchy:
[Figure: virtual inheritance hierarchy of TopBase, Mediator1, Mediator2 and DerivedVirtual]
Mediator1 and Mediator2 inherit virtually from TopBase. Suppose that the types are defined as follows:

    struct TopBase
    {
        Data d1;
        virtual void f( void );
    };

    struct Mediator1 : virtual public TopBase
    {
        Data d2;
        virtual void f( void );
        virtual void g( void );
    };

    struct Mediator2 : virtual public TopBase
    {
        Data d3;
        virtual void f( void );
        virtual void h( void );
    };

    struct DerivedVirtual : public Mediator1, public Mediator2
    {
        Data d4;
        virtual void f( void );
        virtual void g( void );
        virtual void h( void );
    };

Objects of the TopBase type are laid out in memory in the same way as shown in figure 2. The layout of Mediator1 objects for a typical implementation is shown below. Mediator2 objects are laid out in the same way as Mediator1 ones.
[Figure: memory layout of a Mediator1 object]
The virtual base class members are located after all the other data members. This is done to unify the run-time analysis of the vtbl regardless of which exact object type (Mediator1, Mediator2 or DerivedVirtual) was created. The Mediator1 vtbl is extended with a pointer to the virtual base object, pTopBase. The virtual base members are accessed not directly but via the pTopBase pointer in the vtbl. This indirect access leads to run-time overheads.
The figure below shows the layout of DerivedVirtual objects.
[Figure: memory layout of a DerivedVirtual object]
The data members of the virtual base object appear in memory only once, so the exact location of this part is known only to the object that was actually created. This is the reason why a pointer to the beginning of the virtual base object is stored in the vtbl of each deriving type.
The approach described above provides uniform access to the virtual base object members, via a pointer in the vtbl, regardless of which exact object was created: Mediator1, Mediator2 or DerivedVirtual. It also works when a DerivedVirtual object was created and a pointer to it was converted to a pointer to Mediator1 or to Mediator2.
Some compilers hold the pointer to the beginning of the virtual base object not in the vtbl but as an additional data member.
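The reason why the virtual base has to be located at run time can be shown with a short program. This is a sketch under assumptions: `Data` is a hypothetical stand-in for the article's type, the function bodies are empty placeholders, and the offset comparison relies on the typical layout described above rather than on a guarantee of the C++ standard.

```cpp
#include <cassert>
#include <cstddef>

struct Data { int value = 0; };  // hypothetical stand-in for the article's Data type

struct TopBase {
    Data d1;
    virtual void f() {}
};

struct Mediator1 : virtual public TopBase {
    Data d2;
    void f() override {}
    virtual void g() {}
};

struct Mediator2 : virtual public TopBase {
    Data d3;
    void f() override {}
    virtual void h() {}
};

struct DerivedVirtual : public Mediator1, public Mediator2 {
    Data d4;
    void f() override {}
    void g() override {}
    void h() override {}
};

// Offset of the TopBase part relative to the Mediator1 part it is reached from.
std::ptrdiff_t top_base_offset(Mediator1& mediator) {
    TopBase* top = &mediator;  // located at run time by the implementation
    return reinterpret_cast<const char*>(top) - reinterpret_cast<const char*>(&mediator);
}

void check_virtual_base_sharing() {
    DerivedVirtual object;
    Mediator1* m1 = &object;
    Mediator2* m2 = &object;

    // Both mediators lead to the same, single TopBase part.
    assert(static_cast<TopBase*>(m1) == static_cast<TopBase*>(m2));

    // Same static type (Mediator1), different created object, different
    // offset on typical implementations: a fixed offset cannot be compiled in.
    Mediator1 standalone;
    assert(top_base_offset(standalone) != top_base_offset(object));
}
```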
Virtual inheritance may lead to a loss of performance in comparison to ordinary inheritance. The loss may show up when calling member functions or accessing data members of a virtual base class.
A virtual base class may have both virtual and non-virtual functions. The performance of calling those functions can be different.
The test results are grouped into non-virtual and virtual member functions. For non-virtual functions the results compare virtual inheritance with ordinary single inheritance; the called function increments a single data member of an ordinary or a virtual base class. The tables below refer to this case as "option 1". For virtual functions the results also compare virtual inheritance with ordinary single inheritance. Two function call options, illustrated below, were used; the tables list them as "option 2a" and "option 2b".
[Figures: the two function call options used for virtual functions]
The tables below give the function call performance for virtual inheritance relative to the function call performance for ordinary inheritance. A number greater than 100 means that the function call in the case of virtual inheritance is faster than in the case of ordinary inheritance.

Intel 32 bit (IA-32):

| Optimisation | Function call option | gcc 2.95, % | gcc 3.3, % | gcc 4.1, % | intel 9.1, % |
|---|---|---|---|---|---|
| -O0 | Option 1 | 91 | 68 | 74 | 79 |
| -O0 | Option 2a | 72 | 64 | 61 | 54 |
| -O0 | Option 2b | 69 | 54 | 57 | 50 |
| -O2 | Option 1 | 83 | 57 | 62 | 74 |
| -O2 | Option 2a | 75 | 54 | 57 | 62 |
| -O2 | Option 2b | 60 | 48 | 54 | 58 |
| -O3 -fomit-frame-pointer | Option 1 | 28 | 48 | 27 | 74 |
| -O3 -fomit-frame-pointer | Option 2a | 76 | 47 | 48 | 62 |
| -O3 -fomit-frame-pointer | Option 2b | 57 | 41 | 42 | 58 |

Intel 64 bit (IA-64):

| Optimisation | Function call option | gcc 2.96, % | gcc 3.3, % | gcc 4.1, % | intel 9.1, % |
|---|---|---|---|---|---|
| -O0 | Option 1 | 95 | 79 | 77 | 45 |
| -O0 | Option 2a | 123 | 75 | 76 | 35 |
| -O0 | Option 2b | 120 | 61 | 66 | 25 |
| -O2 | Option 1 | 90 | 66 | 66 | 77 |
| -O2 | Option 2a | 91 | 138 | 60 | 60 |
| -O2 | Option 2b | 139 | 116 | 50 | 50 |
| -O3 -fomit-frame-pointer | Option 1 | 33 | 16 | 100 | 77 |
| -O3 -fomit-frame-pointer | Option 2a | 91 | 138 | 60 | 60 |
| -O3 -fomit-frame-pointer | Option 2b | 139 | 116 | 50 | 49 |

Sun UltraSPARC-II, 64 bit:

| Optimisation | Function call option | gcc 2.95, % | gcc 3.3, % | gcc 4.1, % |
|---|---|---|---|---|
| -O0 | Option 1 | 94 | 96 | 85 |
| -O0 | Option 2a | 101 | 61 | 73 |
| -O0 | Option 2b | 97 | 58 | 65 |
| -O2 | Option 1 | 94 | 95 | 84 |
| -O2 | Option 2a | 93 | 62 | 81 |
| -O2 | Option 2b | 91 | 57 | 71 |
| -O3 -fomit-frame-pointer | Option 1 | 18 | 16 | 18 |
| -O3 -fomit-frame-pointer | Option 2a | 92 | 61 | 86 |
| -O3 -fomit-frame-pointer | Option 2b | 83 | 56 | 70 |

ARM11, 32 bit:

| Optimisation | Function call option | gcc 3.4, % |
|---|---|---|
| -O0 | Option 1 | 76 |
| -O0 | Option 2a | 58 |
| -O0 | Option 2b | 50 |
| -O2 | Option 1 | 71 |
| -O2 | Option 2a | 67 |
| -O2 | Option 2b | 54 |
| -O3 -fomit-frame-pointer | Option 1 | 22 |
| -O3 -fomit-frame-pointer | Option 2a | 57 |
| -O3 -fomit-frame-pointer | Option 2b | 48 |
In most cases function calls with virtual inheritance lose significantly against ordinary inheritance, regardless of whether the function is virtual or non-virtual.
The fact that an "option 2a" call was almost always faster than an "option 2b" call means that the time spent depends on the number of times the virtual base class data members are accessed in the function. Theoretically this dependency could be removed if compilers obtained a pointer to the virtual base class object once, at the beginning of the function.
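A developer can apply the same idea by hand: locate the virtual base once and reuse the result for all subsequent accesses. The types and member names below are hypothetical and do not reproduce the tested call options; whether this brings a measurable gain depends on the compiler, since an optimizer may already perform the same transformation.

```cpp
#include <cassert>

struct TopBase {
    int first = 0;
    int second = 0;
    virtual ~TopBase() = default;
};

struct Mediator : virtual public TopBase {
    // Each statement names a member of the virtual base; a simple compiler
    // may locate the virtual base part separately for each access.
    void increment_directly() {
        ++first;
        ++second;
    }

    // The virtual base part is located once and the reference is reused.
    void increment_via_cached_reference() {
        TopBase& base = *this;
        ++base.first;
        ++base.second;
    }
};

// Both variants must produce the same result.
void check_both_variants_agree() {
    Mediator m;
    m.increment_directly();
    m.increment_via_cached_reference();
    assert(m.first == 2 && m.second == 2);
}
```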
The most interesting case of RTTI usage is conversion attempts which require analysis of type hierarchies. Such an analysis is performed when dynamic_cast is used. The C programming language does not explicitly support type hierarchies, so it is difficult to find a functional equivalent. Because of this, only a theoretical analysis of a possible RTTI implementation and the implied overheads is given here, without a performance comparison between C and C++ implementations.
Suppose that there is the type hierarchy shown below (note that the E type must be polymorphic).
[Figure: the type hierarchy used in the dynamic_cast example]
Suppose also that there is the following C++ code:

    E * pE( new E );
    B * pB( pE );
    D * pD( dynamic_cast< D * >( pB ) );

The conversion on the last line must obviously succeed. The obstacle is that the argument of the conversion is a pointer to type B, which is not a direct base type of type D. To complete the conversion successfully it is necessary to walk down the type hierarchy to the type E and only then complete the conversion. Such a hierarchy traversal could be supported by the RTTI implementation shown in the figure below.
[Figure: a possible RTTI implementation]
The vtbl is extended with one more pointer which helps to retrieve type information. Information about all the types is stored in a separate table. Given a pointer to B (pB), the first step finds the beginning of the object that was actually created. Then the table with information about all the predecessor types of type E is located. Then the target type_info is compared in turn with every type_info from the list of predecessors. If the type_infos are equal at some step then the conversion is possible.
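The conversion discussed above can be tried out on a reduced hierarchy. The sketch assumes the simplest hierarchy in which the cast has to go through the created object: E derives from both B and D, which are otherwise unrelated. This is an assumption for illustration and not necessarily the full hierarchy from the figure; the `Unrelated` type is hypothetical and is added only to show a failing conversion.

```cpp
#include <cassert>

struct B { virtual ~B() = default; };   // polymorphic, as dynamic_cast requires
struct D { virtual ~D() = default; };
struct E : public B, public D {};
struct Unrelated { virtual ~Unrelated() = default; };

void check_cross_cast() {
    E object;
    B* pB = &object;

    // B and D are not related directly; the run-time system has to find the
    // created object (E) first and then look for D among its base types.
    D* pD = dynamic_cast<D*>(pB);
    assert(pD == static_cast<D*>(&object));

    // No base of E matches the target type, so the conversion yields a null pointer.
    assert(dynamic_cast<Unrelated*>(pB) == nullptr);
}
```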
So the run-time overheads on supporting RTTI are as follows:
Quite often the comparisons do not require expensive string comparisons; however, there are compilers that do compare strings.
C++ provides exceptions for error handling. Traditional C alternatives are as follows:
Exceptions in C++ are usually implemented using one of two approaches: a table approach and a code approach.
The table approach creates special tables at compile time. Those tables associate ranges of program counter (PC) values with the actions that should be executed if an exception is thrown. Those actions could be passing control to the corresponding catch block, calling destructors of local objects, stack unwinding etc. The main run-time overhead of the table approach is the RAM needed to store the prepared tables.
The code approach builds, on the fly, a list of actions which should be executed in case of an exception. The list of actions is similar to what is stored in the tables of the table approach. The main run-time overhead of the code approach is CPU time. The dynamically generated actions take much less RAM than the static action tables of the table approach.
The most popular and simplest way of error handling in C is return code analysis. C code is usually similar to the following:

    int f( void );
    . . .
    {
        int ReturnValue;
        . . .
        ReturnValue = f();
        if ( ReturnValue != 0 )
        {
            /* Process error some way */
        }
        /* No errors */

The main run-time overhead in the code above is the CPU time spent on comparing the return code with some value, and possibly on a jump. This comparison is performed regardless of whether the f function completed successfully or not. If C++ exceptions are used there is no return code, so there is no if statement. Therefore, if the function f completes successfully there is no run-time CPU overhead. If the function f throws an exception, however, the CPU overhead will most probably be higher. It is possible to measure the time of processing an exception in the C++ case and the time of checking the error code in the C case. The ratio of those times gives the minimum number of successful calls of function f needed to make the C++ code more efficient than the C equivalent in terms of CPU consumption. For example, if the value is 220 then the C++ code works faster than the return code analysing C version as long as an exception is thrown less often than once per 220 calls.
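The two error handling styles and the break-even rule can be put side by side. The function names below are hypothetical, and the break-even helper simply encodes the rule stated above (exceptions pay off when they occur less often than once per "ratio" calls); it is a sketch and not the code used for the measurements.

```cpp
#include <cassert>
#include <stdexcept>

// C style: the result has to be checked after every call.
int f_with_code(bool fail) { return fail ? -1 : 0; }

// C++ style: nothing to check on the successful path.
void f_with_exception(bool fail) {
    if (fail) throw std::runtime_error("f failed");
}

bool call_checked(bool fail) {
    int ReturnValue = f_with_code(fail);
    if (ReturnValue != 0) {
        return false;   // process the error some way
    }
    return true;        // no errors
}

bool call_guarded(bool fail) {
    try {
        f_with_exception(fail);
    } catch (const std::runtime_error&) {
        return false;   // process the error some way
    }
    return true;        // no errors
}

// ratio: measured (exception handling time) / (return code check time).
// calls_per_exception: how many calls are made, on average, per thrown exception.
bool exceptions_are_cheaper(double ratio, double calls_per_exception) {
    assert(ratio > 0.0 && calls_per_exception > 0.0);
    return calls_per_exception > ratio;
}
```

With a ratio of 220, one failure per 1000 calls favours exceptions, while one failure per 100 calls favours return codes.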
The tables below show the test results as the ratio of the time spent on handling an exception to the time spent on return code analysis.

Intel 32 bit (IA-32):

| Optimisation | gcc 2.95 | gcc 3.3 | gcc 4.1 | intel 9.1 |
|---|---|---|---|---|
| -O0 | 265 | 396 | 394 | 491 |
| -O2 | N/A | 497 | 445 | 854 |
| -O3 -fomit-frame-pointer | N/A | 470 | 609 | 711 |

Intel 64 bit (IA-64):

| Optimisation | gcc 2.96 | gcc 3.3 | gcc 4.1 | intel 9.1 |
|---|---|---|---|---|
| -O0 | 164 | 232 | 202 | 465 |
| -O2 | 509 | 635 | 582 | 1445 |
| -O3 -fomit-frame-pointer | 512 | 646 | 505 | 1399 |

Sun UltraSPARC-II, 64 bit:

| Optimisation | gcc 2.95 | gcc 3.3 | gcc 4.1 |
|---|---|---|---|
| -O0 | 87 | 88 | 84 |
| -O2 | 101 | 112 | 121 |
| -O3 -fomit-frame-pointer | 107 | 108 | 270 |

ARM11, 32 bit:

| Optimisation | gcc 3.4 |
|---|---|
| -O0 | 100 |
| -O2 | 102 |
| -O3 -fomit-frame-pointer | 106 |
Developers can use the values in the tables when choosing a way of error handling. An interesting fact is that the gcc compilers demonstrate similar results, while the Intel compiler loses by a factor of up to 2.5 against gcc.
Gcc series 2 on the IA-32 platform with optimization switched on generated code which aborted at run-time, so the corresponding cells in the table contain N/A.
The C++ input/output library has a reputation for being inefficient. The library performance might be affected by the synchronisation mode with the C input/output streams. The mode is on by default.
The test code performed file input/output operations on decimal, hexadecimal and floating-point values for both synchronisation modes, switched on and off. The figures below show the results. The vertical axis is the time spent on the operations, so a higher bar means worse performance.
[Figures: time spent on C and C++ file input/output of decimal, hexadecimal and floating-point values, with synchronisation switched on and off]
C++ I/O performance is worse than C in all cases except one. In the worst case the slowdown reaches 600%. The exception is the C++ I/O performance of gcc series 2; however, some sources explain it by an implementation of the I/O streams which does not conform to the C++ standard requirements. Besides, gcc series 2 is out of date now and is not seriously considered by developers as a candidate for C++ projects.
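The synchronisation mode is controlled by `std::ios_base::sync_with_stdio`, which applies to the standard streams (`std::cin`, `std::cout` and so on). The sketch below shows how the mode is switched off and what formatted output of decimal, hexadecimal and floating-point values looks like with C++ streams. The helper names are hypothetical and this is not the code used for the measurements.

```cpp
#include <cassert>
#include <iostream>
#include <sstream>
#include <string>

// Should be called before the first I/O operation on the standard streams.
// Returns the previous synchronisation mode.
bool disable_stdio_synchronisation() {
    return std::ios_base::sync_with_stdio(false);
}

// Writes each number as a decimal, as a hexadecimal and as a floating-point value.
void write_values(std::ostream& out, int count) {
    for (int i = 0; i < count; ++i) {
        out << std::dec << i << ' ' << std::hex << i << ' ' << 0.5 * i << '\n';
    }
    out << std::dec;  // leave the stream in decimal mode
}

std::string formatted_sample(int count) {
    std::ostringstream out;
    write_values(out, count);
    return out.str();
}

void check_formatting() {
    assert(formatted_sample(2) == "0 0 0\n1 1 0.5\n");
}
```

`write_values` accepts any `std::ostream`, so the same function can be pointed at `std::cout` or at a `std::ofstream` when comparing timings.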
Modern C++ compilers demonstrate a high quality of implementation of the new language features on all the tested platforms. C++ code practically does not lose against C and sometimes reaches higher performance. The regrettable exception is C++ input/output; however, there is a workaround: a C++ compiler will easily compile C-style input/output code. We also put our hopes in the compiler developers, so the C++ input/output library should keep getting better. Bearing in mind that C++ directly supports more programming paradigms than C, the latter will have fewer and fewer chances to be used in complicated projects.
To automate collecting test results on various platforms for different compilers and various compilation keys, a set of scripts was developed. The scripts can easily be extended with new tests, new compilers and new combinations of compilation keys. To do that, some changes should be made in the files described below.
The file is located at the top of the framework directory structure and holds a list of the tested compilers. An example of the file is given below:

    # File format:
    # first compiler vendor
    # second c compiler path
    # third c++ compiler path
    gcc4.1.1 /home/twinpeek/compilers/gcc/4.1.1/bin/gcc /home/twinpeek/compilers/gcc/4.1.1/bin/g++
    intel9.1 /home/twinpeek/compilers/intel/9.1.038/bin/icc /home/twinpeek/compilers/intel/9.1.038/bin/icpc

The example defines the names of and paths to two compilers: gcc series 4 and the Intel compiler. Comment lines start with the '#' character. Empty lines are allowed.
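The file format described above is simple enough to be read with a few lines of code. The framework itself relies on make, awk and bash; the C++ sketch below is only an illustration of the format, and it assumes one compiler per line with three whitespace-separated fields. The names `CompilerEntry` and `parse_compilers` are hypothetical.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

struct CompilerEntry {
    std::string name;          // first field: compiler name
    std::string c_compiler;    // second field: path to the C compiler
    std::string cxx_compiler;  // third field: path to the C++ compiler
};

// Reads the compilers list, skipping comment lines ('#') and empty lines.
std::vector<CompilerEntry> parse_compilers(std::istream& in) {
    std::vector<CompilerEntry> entries;
    std::string line;
    while (std::getline(in, line)) {
        const std::string::size_type first = line.find_first_not_of(" \t\r");
        if (first == std::string::npos || line[first] == '#') {
            continue;
        }
        std::istringstream fields(line);
        CompilerEntry entry;
        if (fields >> entry.name >> entry.c_compiler >> entry.cxx_compiler) {
            entries.push_back(entry);
        }
    }
    return entries;
}

void check_parse_example() {
    std::istringstream input(
        "# File format:\n"
        "\n"
        "gcc4.1.1 /home/twinpeek/compilers/gcc/4.1.1/bin/gcc "
        "/home/twinpeek/compilers/gcc/4.1.1/bin/g++\n");
    const std::vector<CompilerEntry> entries = parse_compilers(input);
    assert(entries.size() == 1);
    assert(entries[0].name == "gcc4.1.1");
}
```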
The file is also located at the top of the directory structure and holds a list of the projects which are included in the tests. An example of the file is given below:

    # File format:
    # Paths to the projects
    abstraction_penalty/stepanov_test
    abstraction_penalty/mitigation
    abstraction_penalty/templates_boat_diff
    abstraction_penalty/templates_boat_same

The example defines four projects which are specified as relative paths. Comment lines start with the '#' character. Empty lines are allowed.
The sets of optimization keys used with each of the compilers are defined separately for each of the projects. That is why an optimization.info file is located in the home directory of each project. For example, the optimization.info file for the abstraction_penalty/stepanov_test project can look as follows:

    0
    1
    2

Each line gives the name of an optimization keys set. Here digits are used as the names.
The compiler optimization keys which correspond to each of the sets from the optimization.info file are looked up as follows. A file name is formed using the following rule:

    <CompilerName>_opt<OptimisationKeysSetName>.flags

For example, for the Intel compiler and the last optimization keys set the following name is formed:

    intel9.1_opt2.flags

The search for this file starts in the "flags" directory of the corresponding project home directory. If the file is not found there, the search continues in the "flags" directory located at the top of the framework directory structure. The file should define two variables, one for C and one for C++, with the corresponding optimization keys. For example, the intel9.1_opt2.flags file can look as follows:

    CFLAGS=-O3 -fomit-frame-pointer
    CPPFLAGS=-O3 -fomit-frame-pointer

Such an approach makes it possible both to have common optimization keys for all the projects and to tune optimization keys individually for a specific project.
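The naming rule and the search order can be summarised in code. The framework performs this lookup in its scripts; the C++ sketch below only restates the described logic, and the function names as well as the convention of returning an empty string when nothing is found are assumptions made for illustration.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// <CompilerName>_opt<OptimisationKeysSetName>.flags
std::string flags_file_name(const std::string& compiler_name,
                            const std::string& keys_set_name) {
    return compiler_name + "_opt" + keys_set_name + ".flags";
}

bool file_exists(const std::string& path) {
    return std::ifstream(path.c_str()).good();
}

// Looks in the project's "flags" directory first and then in the framework's
// top-level "flags" directory. Returns an empty string if the file is missing.
std::string find_flags_file(const std::string& framework_root,
                            const std::string& project_dir,
                            const std::string& compiler_name,
                            const std::string& keys_set_name) {
    const std::string name = flags_file_name(compiler_name, keys_set_name);
    const std::string candidates[] = {
        project_dir + "/flags/" + name,     // project-specific keys
        framework_root + "/flags/" + name   // keys common to all projects
    };
    for (const std::string& candidate : candidates) {
        if (file_exists(candidate)) {
            return candidate;
        }
    }
    return std::string();
}

void check_name_rule() {
    assert(flags_file_name("intel9.1", "2") == "intel9.1_opt2.flags");
}
```

Because the project directory is checked first, a project-specific file overrides the common one with the same name.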
To compile all the projects and collect the test results, the following command can be used:

    ./do_test.sh > TestResults.log
