Tips to Write Optimized and Effective CASA Code in C++

The aim of this page is to collect tips for writing optimized and effective CASA code in C++. Use the template below to add items.

Template
(use the "Edit wiki text" button and copy and paste the contents below to add a new tip)

---+++ (title of tips)
---++++ Description
(key points of optimization)
---++++ Code Examples
(good and bad code examples) 
---++++ Benchmarks
(results of benchmark tests)
---++++ Related Tips
(references to related tips that have synergistic effects or solve the same issue in different ways)
---++++ References
(links to reference materials)
<hr/>

C++ General


Avoid implicit and/or explicit type conversions unless necessary

Description

When different data types are mixed in an expression, the compiler inserts type-conversion code to reconcile the data types according to the C++ type conversion rules.

For instance, suppose you are adding a float value (f_in) to a double (d_in) and assigning the result to a float variable (f_out), i.e.,
f_out = f_in + d_in

In this case, compilers generate code equivalent to
f_out = static_cast<float>(static_cast<double>(f_in) + d_in)

that is, f_in is first converted to double and added to d_in. The result of the summation is a double, which is converted back to float when it is assigned to f_out. Two type conversions between float and double are invoked in addition to the summation itself. If the purpose of the calculation is to get the result in float, neither conversion is necessary at all. Such unnecessary type conversions, either implicit or explicit, degrade performance, especially in loops.

Use variables of the proper types in expressions to avoid extra type conversions.

In the code example below, the two extra type conversions described above take place in the bad example (the offset_converted_to_double function), because the last term on the right-hand side, 1.0/3.0, is treated as a double. A single-precision float literal, 1.0f/3.0f, must be used explicitly to make sure the term is treated as a float. The benchmarking results show that the unnecessary type conversions in the bad example double the execution time.

Note, however, that some type conversions must themselves be done with care: converting from a data type with a wider value range and higher precision to a narrower, lower-precision one may cause a loss of precision and may even overflow. Examples of conversions that need such care are those from double to float, from a 64-bit type to a 32-bit one, and from float to int32_t.

Code Examples

Sample code: type_conversion.cc
Compile command: c++ -std=c++11 -m64 -Wall -g -O3 -o type_conversion type_conversion.cc
Execution: ./type_conversion
// An extract from type_conversion.cc (full source code is attached)

// a bad Example with unnecessary, implicit type conversions
void offset_converted_to_double(size_t num_data, float const *in_data, float *out_data) {
        for (size_t i = 0; i < num_data; ++i) {
                out_data[i] = in_data[i] + 1.0/3.0; // implicit conversions take place here
        }
}
// a good Example (no type conversion in equation)
void offset_without_conversion(size_t num_data, float const *in_data, float *out_data) {
        for (size_t i = 0; i < num_data; ++i) {
                out_data[i] = in_data[i] + 1.0f/3.0f;
        }
}

Benchmarks

num_data = 5,242,880,000 equivalent

The average elapsed time of 3 executions
  • Bad example: 3.60 sec
  • Good example: 1.75 sec

The code is compiled with gcc 4.8.1, executed on a host with 8 cores (@3.20GHz) and 65GB memory

Related Tips

N/A

References

N/A


Removal of conditional branch inside the loop without code duplication

Description

Sometimes we need to put a conditional branch inside a loop. However, this is undesirable in terms of performance because (1) the conditional branch can prevent the loop from being vectorized, and (2) a branch misprediction can cause significant degradation of performance. If we can move the conditional branch outside the loop, we avoid these issues, but we usually end up writing a similar loop several times: although this improves performance, it is undesirable in terms of maintainability since it results in code duplication.

The code example below presents a solution for such cases: the conditional branch is moved outside the loop, and a template function is used to avoid the code duplication. There is another implementation that evaluates numerical coefficients depending on the condition, but the latter can only be used in this specific case, while the former is more generic.

Note that recent compilers remove the conditional branch from the loop when possible; because of this compiler optimization, the bad example below sometimes shows similar performance in practice. Note also that whether the optimization fires depends on the code, so we should explicitly move the branch out of the loop to guarantee the better performance. The example code for the performance measurement is written so that the compiler doesn't perform this optimization.

There is another interesting feature in the example code. In the Good Example, we pass a struct as a template argument to alter the behavior. This technique can also be applied where a function pointer would otherwise be passed as an argument: the struct-plus-template equivalent is faster when the functions are defined inline and are accessible from the compilation unit.

Code Examples

Executable source file: branching_outside_loop.cc
Compiler: gcc 4.8.2
Compile command: g++ -march=native -std=c++11 -O3 -o branching_outside_loop branching_outside_loop.cc
Run command: ./branching_outside_loop

Bad Example: worse performance

void branching_inside_loop(size_t n, float data[], bool flag) {
    for (size_t i = 0; i < n; ++i) {
        if (flag) {
            data[i] = 2.0f * data[i];
        }
        else {
            data[i] = -0.5f * data[i] + 1.0f;
        }
    }
}

Better Example: good performance but worse maintainability

void branching_outside_loop(size_t n, float data[], bool flag) {
    if (flag) {
        for (size_t i = 0; i < n; ++i) {
            data[i] = 2.0f * data[i];
        }
    }
    else {
        for (size_t i = 0; i < n; ++i) {
            data[i] = -0.5f * data[i] + 1.0f;
        }
    }
}

Another Example: good maintainability but function call overhead

float IfTrue(float x) {
    return 2.0f * x;
}

float IfFalse(float x) {
    return -0.5f * x + 1.0f;
}

void convert(size_t n, float data[], float (*func)(float)) {
    for (size_t i = 0; i < n; ++i) {
        data[i] = func(data[i]);
    }
}

void branching_outside_loop2(size_t n, float data[], bool flag) {
    convert(n, data, flag ? IfTrue : IfFalse);
}

Good Example: good performance, good maintainability

struct IfTrue {
    static float func(float x) {
        return 2.0f * x;
    }
};

struct IfFalse {
    static float func(float x) {
        return -0.5f * x + 1.0f;
    }
};

template<typename T>
void convert(size_t n, float data[]) {
    for (size_t i = 0; i < n; ++i) {
        data[i] = T::func(data[i]);
    }
}

void branching_outside_loop_template(size_t n, float data[], bool flag) {
    if (flag) {
        convert<IfTrue>(n, data);
    }
    else {
        convert<IfFalse>(n, data);
    }
}

Another Good Example: good performance, good maintainability, but limited use

void branching_outside_loop_ternary(size_t n, float data[], bool flag) {
    float const a = flag ? 2.0f : -0.5f;
    float const b = flag ? 0.0f : 1.0f;
    for (size_t i = 0; i < n; ++i) {
        data[i] = a * data[i] + b;
    }
}

Benchmarks

[Bad Example: worse performance]
examining branching_inside_loop ... elapsed time: 7.06173 sec.

[Better Example: good performance but worse maintainability] 
examining branching_outside_loop ... elapsed time: 0.941991 sec.

[Good Example: good performance, good maintainability]
examining branching_outside_loop_template ... elapsed time: 0.941192 sec.

Related Tips

Static polymorphism using curiously recurring template pattern

Speculative execution

References

http://en.wikipedia.org/wiki/Branch_predictor


Static polymorphism using curiously recurring template pattern

Description

Unlike dynamic polymorphism with virtual functions, CRTP lets you implement static polymorphism, which requires neither object instantiation nor function-call overhead.

Polymorphism is a key feature of modern programming languages: it allows a single interface to have a variety of behaviors. Virtual functions are the common way to implement it. A typical example is a base class that only defines an interface with pure virtual functions, while its subclasses implement the actual functionality by overriding them. At runtime, we can switch the behavior simply by assigning an instance of the desired subclass to a base-class pointer. Polymorphism using virtual functions is dynamic, since the method to be executed is usually decided at runtime.

However, dynamic polymorphism has some weak points. One is that classes must always be instantiated, which adds memory-management cost and a risk of memory leaks to the code (although smart pointers nowadays make managing instance lifecycles easy). Another weakness is performance: since memory allocation is a relatively expensive operation, repeated new/delete in the code can cause a performance penalty. Furthermore, virtual functions are incompatible with inlining: in general, a virtual function is not inlined even if it is defined as inline, which adds method-call overhead, because the compiler cannot know at compile time which method should be inlined.

To overcome those issues, another way to realize polymorphism is described here. The technique is called the Curiously Recurring Template Pattern (CRTP). It uses templates to implement polymorphism: the base class is a template class that takes its subclass as a template argument. As with dynamic polymorphism, the base class defines an interface. The interface method is implemented using the subclass's method, which is accessible because the subclass is passed as a template argument. In contrast to virtual functions, polymorphism by CRTP is static: the templates are instantiated and the functions are inlined when the code is compiled, and no instance is created at runtime. See the code example below for details.

The advantage of static polymorphism over dynamic polymorphism is that it can be implemented with only static methods, so no memory management is needed. In terms of performance, static polymorphism is compatible with inlining, since the compiler knows which method should be inlined. On the other hand, static polymorphism tends to increase library size.

The following benchmark compares dynamic polymorphism with static polymorphism. It also demonstrates that inline methods are not actually inlined under dynamic polymorphism: even when the method body is empty, the dynamic case takes measurable time to complete, while the elapsed time for the static case is almost zero.

Code Examples

Dynamic polymorphism:
#include <iostream>

using namespace std;

class Base {
public:
  virtual void execute() = 0;
};

class DerivedA : public Base {
public:
  virtual void execute() {
    cout << "This is A" << endl;
  }
};

class DerivedB : public Base {
public:
  virtual void execute() {
    cout << "This is B" << endl;
  }
};

int main(int argc, char *argv[]) 
{
  // dynamic
  DerivedA a;
  Base *p = &a;
  p->execute();

  return 0;
}

Static polymorphism:
#include <iostream>

using namespace std;

template<class Impl>
class Base {
public:
  static void execute() {
    Impl::doexecute();
  }
};

class DerivedA : public Base<DerivedA>
{
public:
  static void doexecute() {
    cout << "This is A" << endl;
  }
};

class DerivedB : public Base<DerivedB>
{
public:
  static void doexecute() {
    cout << "This is B" << endl;
  }
};

// static
int main(int argc, char *argv[]) 
{
  typedef DerivedA OneOfBase;
  OneOfBase::execute();

  return 0;
}

Benchmarks

Benchmark code: https://safe.nrao.edu/wiki/pub/Software/CASA/TipsToOptimizeCppCodes/CRTP_test.ver3.tgz

operation    polymorphism  secs
subtraction  dynamic       0.03
subtraction  static        0.007
do nothing   dynamic       0.03
do nothing   static        0.0

$ ls
Executor.cc  Makefile  Polymorphism.h  main.cc
$ gcc --version # compiler should support c++11 features
gcc (GCC) 4.8.3
Copyright (C) 2013 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ make
gcc -O3 --std=c++11 -c Executor.cc -o Executor.o 
gcc -O3 --std=c++11 -c main.cc -o main.o 
gcc -O3 --std=c++11 -o a.out main.o Executor.o -lstdc++
$ ./a.out 1
Perform subtraction
--- Dynamic Polymorphism ---
elapsed time is 0.030323 sec
result = 0
--- Static Polymorphism ---
elapsed time is 0.00686097 sec
result = 0
$ ./a.out 4
Perform nothing
--- Dynamic Polymorphism ---
elapsed time is 0.030355 sec
result = 1
--- Static Polymorphism ---
elapsed time is 0 sec
result = 1
Related Tips

Removal of conditional branch inside the loop without code duplication

Tag Dispatching http://en.wikibooks.org/wiki/More_C++_Idioms/Tag_Dispatching

References

http://en.wikipedia.org/wiki/Curiously_recurring_template_pattern


Expression template as a way for speed-up and less memory access

Description

Expression templates are a C++ template metaprogramming technique, first utilised in the Blitz++ library in the mid-1990s and now commonly used in linear algebra libraries including Eigen, Boost.uBLAS, and so on. If you just want to benefit from expression templates for linear algebra calculations, you can use these libraries and don't need to code the technique yourself. Yet the great idea behind it is worth noting.

The essence of this technique is to keep modularity while preventing intermediate objects, yielding faster execution with less memory access. Other techniques such as lambda expressions can also prevent creating intermediate objects, but the C++ template technique, with some syntactic sugar, is a powerful tool that achieves modularity as well in practice.

Given an expression, the process conducted by an expression template is twofold: constructing a syntax tree (a tree structure of operation elements), then the actual evaluation. It should be emphasised that the evaluation is executed all at once at assignment. This avoids generating extra temporary objects and extra loops, reducing computation time and memory access/use.

To see how expression templates work, let us use a popular example: a simple element-wise operation between vector objects. Suppose w, x, y and z are vectors of the same length, and compute w with the expression 'w = x * y + z'. If the vector class is defined in the ordinary way (see Code 1), the actual behaviour is as follows. Note that two extra temporary vector objects are generated to store the result of each element-wise operation, and that there are actually 3 loops over the vector index (at lines 2, 4 and 5).

Vector tmp1 = Vector();
tmp1 = x * y;
Vector tmp2 = Vector();
tmp2 = tmp1 + z;
w = tmp2;

Generating extra temporary objects may seem to be just a small overhead while they are small. As the Benchmarks section below shows, however, it severely affects performance for more complex expressions and/or larger vectors.

Code 2 shows the vector class VectET after introducing expression templates. The member functions add() and mul() no longer execute any calculation; instead they return a tiny object, BExpr, that records each operation unit. Since BExpr contains only the type of the operation and references to the operands, the cost of instantiation is very small and doesn't depend on the vector length (thanks to optimisation by the C++ compiler, BExpr may even disappear from the final executable code). Another change in add() and mul() is that they are now template functions: since the template parameters L and R for the operands can be other BExpr as well as VectET, this is the key feature that allows constructing a syntax tree of the given expression.

In the construction phase, the expression 'x * y + z', written as

VectET::add(VectET::mul(x, y), z)

is interpreted like:

VectET::add(BExpr<VectET, OpMul, VectET>(x, y), z)

then

BExpr<BExpr<VectET, OpMul, VectET>, OpAdd, VectET>(BExpr<VectET, OpMul, VectET>(x, y), z)

In the evaluation phase (it starts when operator '=' is called), the above syntax tree is evaluated as

for (int i = 0; i < SIZE; ++i) {
    w[i] = BExpr<BExpr<VectET, OpMul, VectET>, OpAdd, VectET>(BExpr<VectET, OpMul, VectET>(x, y), z)[i];
}
then
for (int i = 0; i < SIZE; ++i) {
    w[i] = BExpr<VectET, OpMul, VectET>(x, y)[i] + z[i];
}
and finally,
for (int i = 0; i < SIZE; ++i) {
    w[i] = x[i] * y[i] + z[i];
}

This is the final shape that is actually executed. No temporary objects are generated, and there is just one loop over the vector index, whilst two temporary objects and 3 loops are needed with the ordinary Vect class in Code 1. If you (surely) find the above notation annoying, overload the familiar operators '+' and '*'; then you can use this vector class simply as 'w = x * y + z' (see the attached code).

Recursive evaluation of the syntax-tree elements is reminiscent of the operation() method in the Composite design pattern by GoF. Though BExpr (as the Composite class) and VectET (as the Leaf class) have no common superclass, they are connected to each other via template parameters and share a common operator[] for evaluation.

For the purpose of accelerating linear algebra calculations in your code, again, you are advised to simply use Eigen or another library that uses expression templates internally. However, focusing on its delayed-evaluation aspect, the expression template technique can also be applied to purposes other than numerical computation. Some of the Boost libraries, including Boost.Xpressive and Boost.Lambda, do so extensively. You can learn from their code if interested.

Lastly, a note on casa::Array. Its element-wise arithmetic is implemented in casacore/casa/Arrays/ArrayMath.tcc as follows:

template<class T>
Array<T> operator+(const Array<T>& left, const Array<T>& right) {
    return arrayTransformResult(left, right, std::plus<T>());
}
template<typename T, typename BinaryOperator>
Array<T> arrayTransformResult(const Array<T>& left, const Array<T>& right, BinaryOperator op) {
    Array<T> res(left.shape());
    arrayContTransform(left, right, res, op);
    return res;
}

where arrayTransformResult is implemented in the very style of the Vect class in Code 1. Hence casa::Array seems to have room for performance improvement via expression templates.

Code Examples

Code 1: vector class w/o expression templates
#define SIZE 30000000
class Vect {
public:
  enum { N = SIZE };
private:
  double *p_;
public:
  Vect() {
    if (debug) puts("cstr");
    p_ = new double[N];
  }
  ~Vect() { delete[] p_; }
  Vect& operator=(const Vect& rhs) {
    if (debug) puts("copy");
    std::copy(rhs.p_, rhs.p_ + N, p_);
    return *this;
  }
  double& operator[](size_t i) { return p_[i]; }
  double operator[](size_t i) const { return p_[i]; }
  static Vect add(const Vect& lhs, const Vect& rhs) {
    Vect dest;
    for (size_t i = 0; i < N; ++i) { dest.p_[i] = lhs.p_[i] + rhs.p_[i]; }
    return dest;
  }
  static Vect mul(const Vect& lhs, const Vect& rhs) {
    Vect dest;
    for (size_t i = 0; i < N; ++i) { dest.p_[i] = lhs.p_[i] * rhs.p_[i]; }
    return dest;
  }
};

Code 2: vector class using expression templates
#define SIZE 30000000
template<class L, class OP, class R>
class BExpr {
  const L& lhs_;
  const R& rhs_;
public:
  BExpr(const L& lhs, const R& rhs) : lhs_(lhs), rhs_(rhs) {}
  double operator[](size_t i) const { return OP::eval(lhs_[i], rhs_[i]); }
};

struct OpAdd {
  static double eval(double lhs, double rhs) { return lhs + rhs; }
};
struct OpMul {
  static double eval(double lhs, double rhs) { return lhs * rhs; }
};

class VectET {
public:
  enum { N = SIZE };
private:
  double *p_;
public:
  VectET() {
    if (debug) puts("cstr-ET");
    p_ = new double[N];
  }
  ~VectET() { delete[] p_; }
  template<class L, class OP, class R>
  VectET& operator=(const BExpr<L, OP, R>& rhs) {
    if (debug) puts("copy-ET-bexpr");
    for (size_t i = 0; i < N; ++i) { p_[i] = rhs[i]; }
    return *this;
  }
  double& operator[](size_t i) { return p_[i]; }
  double operator[](size_t i) const { return p_[i]; }
  template<class L, class R>
  static BExpr<L, OpAdd, R> add(const L& lhs, const R& rhs) {
    return BExpr<L, OpAdd, R>(lhs, rhs);
  }
  template<class L, class R>
  static BExpr<L, OpMul, R> mul(const L& lhs, const R& rhs) {
    return BExpr<L, OpMul, R>(lhs, rhs);
  }
};

Benchmarks

Benchmark code: https://safe.nrao.edu/wiki/pub/Software/CASA/TipsToOptimizeCppCodes/et.cc

Compiling command: g++ -march=native -std=c++11 -O3 [-fno-tree-vectorize] -o et et.cc

If you add the option '-DDEBUG' when compiling, you can see that the extra temporary vector objects are not created in the expression-template cases.

The following results are obtained with an old machine equipped with Intel Xeon E5420 CPU and 3.9GB RAM.
vector length  w/o ET (sec.)  w/ ET (sec.)  acceleration ratio
10^4           1.54x10^-4     5.13x10^-5    3.00
10^5           2.18x10^-3     1.02x10^-3    2.14
10^6           0.0324         0.0149        2.18
10^7           0.317          0.156         2.03
10^8           30.3           1.50          20.1

In this case of a simple expression 'x*y+z', computation becomes more than twice as fast by introducing expression templates. Note that the difference in performance gets even greater when the vector length is below 10^5 or as large as 10^8. In the former case, all needed objects may fit in the L2 cache (6MB) for VectET but not for Vect (w/o ET). In the latter case, Vect (w/o ET) becomes very slow because of thrashing to disk, whilst VectET keeps its linear performance.

Related Tips

N/A

References

A document by Todd Veldhuizen, the inventor of expression templates: Techniques for Scientific C++

Composite pattern: https://en.wikipedia.org/wiki/Composite_pattern

I/O


CPU

SoA data layout rather than AoS for SIMD vectorization

Description

SIMD vectorization accelerates your program significantly when the data layout is suitable for the SIMD instructions.

SIMD, as its name suggests, performs an operation on multiple data at a time, as shown in Fig 1. Because SIMD generally performs the same operation on every element, we need to lay out the data suitably for SIMD to get the best performance.
  • Fig 1: C[i] = A[i] X B[i]
    SIMD.png

Since the operation on each element is performed vertically, as shown in Fig 1, it is called "vertical computation". In contrast, computation between elements of the same array, like A[0] + A[1], is called "horizontal computation". SIMD is basically designed for the vertical computation model.

Let's look into multiplication of complex numbers as an example; the same is true for vectors (x,y,z), colors (r,g,b,a) and so on. With the traditional data layout called AoS (Array of Structures), the multiplication is executed as in Fig 2. To avoid overlapping lines in the figure, the operations for the real part are shown on element [0] and the operations for the imaginary part on element [1]; in practice, both sets of operations are performed on every element.

  • Fig 2: AoS layout
    aos-aos.png

Non-vertical arrows indicate the need for horizontal computation and/or rearrangement of data, both of which degrade speed. These horizontal computations and rearrangements often cancel the speed advantage of SIMD instructions, and compilers fall back to scalar instructions rather than SIMD in such cases.

With the SIMD-friendly SoA (Structure of Arrays) data layout, the multiplication is executed as in Fig 3.

  • Fig 3: SoA layout
    soa-soa.png

There are only vertical arrows, which means vertical computation alone suffices for the multiplication. SoA therefore gives better performance than AoS, about 6X faster in the example below.

With SoA, an arbitrary combination of operands and operations can be computed with vertical computation, which is why SoA is strongly recommended for SIMD. SoA is suitable not only for CPU SIMD such as SSE and AVX but also for GPGPU. It is worth changing the internal data layout of CASA from AoS to SoA, and it is also recommended to change the data layouts of the MeasurementSet and BDF file formats to SoA in a future release.

Code Examples

Executable source file : aos_soa.cc

Low performance example using AoS:
complex<float> A[ARRAY_SIZE], B[ARRAY_SIZE], out[ARRAY_SIZE];
for (size_t i = 0; i < ARRAY_SIZE; ++i) {
    out[i].real(A[i].real() * B[i].real() - A[i].imag() * B[i].imag());
    out[i].imag(A[i].real() * B[i].imag() + A[i].imag() * B[i].real());
}
High performance example using SoA:
typedef struct soa_complex {
    float real[ARRAY_SIZE];
    float imag[ARRAY_SIZE];
} SoAComplex;

SoAComplex A, B, out;
for (size_t i = 0; i < ARRAY_SIZE; ++i) {
    out.real[i] = A.real[i] * B.real[i] - A.imag[i] * B.imag[i];
    out.imag[i] = A.real[i] * B.imag[i] + A.imag[i] * B.real[i];
}

Benchmarks

Elapsed time (in sec. for 10^6 loops, array size 1024, +- one sigma over 3 measurements)
SoA: 0.243 +-0.002
AoS: 1.358 +-0.001

Related Tips

N/A

References

http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html (4.5.1 Data Structure Layout and 6.5.1 Data Arrangement)


Memory Access

Optimizing memory access is very important because memory is a very slow device compared to the CPU. If memory bandwidth is saturated, vectorization, multithreading and other techniques don't work well. Optimizing memory access is the bottom line of HPC dealing with large data like CASA.

Memory access for multidimensional arrays should be sequential

Description

Nested "for" loops are a popular way to access multidimensional array elements. It seems quite trivial, but access to multidimensional arrays should be sequential; if not, access becomes enormously slower, and furthermore, additional speed-up via vectorization cannot be expected.

Code Examples

Executable source file : md_array.cc
The above sample code executes a simple addition of two 4D arrays of size 128*128*4*1024.
Compiler version: gcc-4.8.2
Command to compile: see the first line of the sample code. Adding the option '-fno-tree-vectorize' disables vectorization.

Example of bad case:
template<typename T, size_t M, size_t N, size_t O, size_t P>
struct ALIGN RevOrder4D {
    static constexpr size_t ROWS = M;
    static constexpr size_t COLS = N;
    static constexpr size_t CHNS = O;
    static constexpr size_t POLS = P;
    ALIGN T data[P][O][N][M];
    T &operator ()(size_t row, size_t col, size_t chn, size_t pol) { return data[pol][chn][col][row]; }
};
template<typename L>
L* add4D() {
    ALIGN static L a;
    ALIGN static L b;
    ALIGN static L c;
    for (size_t i = 0; i < L::ROWS; ++i) {
        for (size_t j = 0; j < L::COLS; ++j) {
            for (size_t k = 0; k < L::CHNS; ++k) {
                for (size_t l = 0; l < L::POLS; ++l) {
                    c(i, j, k, l) = a(i, j, k, l) + b(i, j, k, l);
                }
            }
        }
    }
    return &c;
}
int main() {  //bad case (reversed loop order)
    typedef RevOrder4D<float, 128, 128, 1024, 4> RevOrder;
    add4D<RevOrder>();
}

example of good case:
template<typename T, size_t M, size_t N, size_t O, size_t P>
struct ALIGN InOrder4D {
    static constexpr size_t ROWS = M;
    static constexpr size_t COLS = N;
    static constexpr size_t CHNS = O;
    static constexpr size_t POLS = P;
    ALIGN T data[M][N][O][P];
    T &operator ()(size_t row, size_t col, size_t chn, size_t pol) { return data[row][col][chn][pol]; }
};
template<typename L>
L* add4D() {
    ALIGN static L a;
    ALIGN static L b;
    ALIGN static L c;
    for (size_t i = 0; i < L::ROWS; ++i) {
        for (size_t j = 0; j < L::COLS; ++j) {
            for (size_t k = 0; k < L::CHNS; ++k) {
                for (size_t l = 0; l < L::POLS; ++l) {
                    c(i, j, k, l) = a(i, j, k, l) + b(i, j, k, l);
                }
            }
        }
    }
    return &c;
}
int main() {  //good case (proper loop order)
    typedef InOrder4D<float, 128, 128, 1024, 4> InOrder;
    add4D<InOrder>();
}

Benchmarks

Elapsed time (in second)
   Bad cases: 0.749+-0.004 (non-vectorized), 0.745+-0.005 (vectorized)
   Good cases: 0.044+-0.001 (non-vectorized), 0.030+-0.001 (vectorized)

Results

  • nesting loops in the proper order is ~17 times faster than in fully reversed order (without vectorization effects)
  • vectorization brings a further ~50% speed-up in the good case, while no benefit in the bad case.
  • the combined use of proper loop order and vectorization brings ~25 times speed-up for this case.

gcc has an option '-floop-interchange' to perform loop interchange transformations on loops; see the gcc documentation for details. The option doesn't work for complicated cases, so manual optimization is still important.
Related Tips

memory alignment, expression template

References


Avoiding an unnecessary initialization against an array of a class which has a default constructor

Description

Each element of an array of a class that has a default constructor is initialized by that constructor even when it is unnecessary, for example because the array is going to be filled with other values immediately after creation. This unnecessary initialization increases the number of memory accesses. Therefore we should avoid the initialization to improve performance, as long as skipping it is safe.
However, you should not skip the initialization if the class cannot be considered POD (Plain Old Data), i.e. if:

  • the instance has a pointer member to be initialized
  • the class has a virtual function, including a virtual destructor
  • the destructor of the class has real work to do.

For example, when making a copy (array A) of an array (array B), there are 3 memory accesses per element without this tip (initialize A, read from B, write to A). With this tip, only 2 memory accesses (read B, write A) are required.

I suggest introducing another constructor to casa::Array that doesn't initialize the array used for data storage, because casa::Array is often filled with other values immediately after its construction.

Code Examples

Executable source file : array_init.cc
BAD example:
complex<float> *CopyArray(size_t num_data, complex<float> *input_data) {
    complex<float> *output_data = new complex<float> [num_data]; //invoking default constructor
    for (size_t i = 0; i < num_data; ++i) {
        output_data[i] = input_data[i];
    }
    return output_data;
}

GOOD example:
complex<float> *CopyArray(size_t num_data, complex<float> *input_data) {
    struct Dummy {
        char x[sizeof(complex<float> )];
    };
    complex<float> *output_data = (complex<float>*) new Dummy[num_data];
    for (size_t i = 0; i < num_data; ++i) {
        output_data[i] = input_data[i];
    }
    return output_data;
}

Benchmarks

Elapsed time (in sec. for 100 loops)

num_data : 10^7
Bad case : 1.839+-0.001
Good case: 1.260+-0.002

Related Tips

This example stores 'complex' as an Array of Structures (AoS), so changing AoS to Structure of Arrays (SoA) will improve the performance further (see the AoS/SoA tip).

References

Avoid frequent update of output parameter

Description

Because updating an output parameter of type T * or T & causes memory access, it is better to introduce a local variable to reduce memory access.

Since there is no way for the compiler to tell whether there is any other reference to the output parameter, it generates memory access instruction(s) for every update of the parameter, and this memory access makes your program slower. The compiler can tell whether there is any other reference to a local variable, so it assigns the variable to a CPU register if possible. As shown in the Benchmarks below, GoodCase is about 3 times faster than BadCase with gcc.

Introducing a local variable has another advantage: without it, the caller of the function may observe an incomplete or unreliable intermediate value of the output parameter, especially when the function throws an exception. The value of an output parameter should be either unchanged or correct.

Code Examples

void BadCase(size_t elements, float data[], size_t *count) {
   *count = 0;
   for (size_t i = 0; i < elements; ++i) {
      if (data[i] == 0.0f) {
         ++(*count);
      }
   }
}

void GoodCase(size_t elements, float data[], size_t *count) {
   size_t cnt = 0;
   for (size_t i = 0; i < elements; ++i) {
      if (data[i] == 0.0f) {
         ++cnt;
      }
   }
   *count = cnt;
}

Benchmarks

E3-1275 v3, gcc 4.8.2
secs
BadCase 4.08706
GoodCase 1.47541

Related Tips

__restrict or __restrict__

References


Compiler Option

Profile-guided optimization (PGO)

Description

Recent programming languages like Java, which are equipped with a JIT (Just-In-Time) compiler, optimize your program at runtime using runtime profile information. Unfortunately, a C++ compiler normally cannot use runtime profile information for further optimization, so the program is sometimes not well optimized for the typical case. PGO makes it possible to feed profile information back into compilation.

See References for details.

There is a risk that the program could become slower if it is used differently from the typical case used to collect the profile information.

Code Examples

Many programs can be made faster by PGO. Benchmark program: pgo.cc

Benchmarks

E3-1275 v3, gcc 4.8.2
secs
w/o PGO 0.413506
with PGO 0.231144

Without PGO:
g++ -O3 -march=native -std=c++11 pgo.cc -o pgo
./pgo 90
Hit ratio: 90
0.413506 sec

With PGO:
g++ -O3 -march=native -std=c++11 -pg -fprofile-generate pgo.cc -o pgo
./pgo 90 # collect profile information
Hit ratio: 90
1.555125 sec
g++ -O3 -march=native -std=c++11 -pg -fprofile-use pgo.cc -o pgo
./pgo 90
Hit ratio: 90
0.231144 sec
Related Tips

Link Time Optimization (LTO)

References

http://en.wikipedia.org/wiki/Profile-guided_optimization

https://gcc.gnu.org/news/profiledriven.html

http://clang.llvm.org/docs/UsersManual.html#profile-guided-optimization


CASA Specific Tips

Initialization control of casa::Block and casa::Array

Description

Casacore has long had, and still has, inconsistent initialization behavior for elements of casa::Array. For example,
Array<Float> f(IPosition(1, 10)); // Elements in f are NOT initialized.
Array<Complex> c(IPosition(1, 10)); // Elements in c are initialized to (0,0i) by the default constructor of Complex.
This inconsistency comes from the inconsistent behavior of operator new[].

The implicit initialization of elements in Array by the default constructor is unnecessary in most cases and makes CASA slow, because most Array instances are created to hold the result of some calculation or data read from a file. It is rare that you really want a zero-initialized Array. As described in "Avoiding an unnecessary initialization against an array of a class which has a default constructor", there is a way to skip the unnecessary initialization, and casacore was improved to let you control the initialization of the elements.

A class casa::ArrayInitPolicy was introduced to express the initialization policy. It can have two values: ArrayInitPolicy::INIT (do initialize) and ArrayInitPolicy::NO_INIT (don't initialize). Constructors and methods (resize, copy, Block::remove) that take an argument of type ArrayInitPolicy were added to Block, Array, Cube, Matrix, and Vector. With these constructors and methods, you can control initialization through that argument. The following example shows how to use the argument and how to express your intent about the initialization.

{
    Array<Int> dirty(IPosition(1, 10), ArrayInitPolicy::NO_INIT); // Elements are NOT initialized.
    std::generate_n(dirty.data(), dirty.nelements(), [] {return std::rand() % 10;});
    Array<Int> zero(IPosition(1, 10), ArrayInitPolicy::INIT); // Elements are initialized to 0 by the default constructor of Int.
    std::for_each(dirty.begin(), dirty.end(), [&zero](Int pos) { zero.data()[pos]++;});
    Array<Int> zero2(IPosition(1, 10), Int(0)); // Same as zero except copy constructor is used.
    Array<Int> ambiguous(IPosition(1, 10)); // NG: ambiguous intent and unrecommended manner
    Array<Int> empty;
    empty.resize(IPosition(1, 10), False, ArrayInitPolicy::NO_INIT); // Elements are NOT initialized
    empty.resize(IPosition(1, 20), False, ArrayInitPolicy::INIT); // Elements are initialized
}
{
    Array<Complex> dirty(IPosition(1, 10), ArrayInitPolicy::NO_INIT); // Elements are NOT initialized.
    Array<Complex> zero(IPosition(1, 10), ArrayInitPolicy::INIT); // Elements are initialized to (0,0i) by the default constructor of Complex.
    Array<Complex> zero2(IPosition(1, 10), Complex()); // Same as zero except copy constructor is used.
    Array<Complex> ambiguous(IPosition(1, 10)); // NG: ambiguous intent and unrecommended manner
    Array<Complex> empty;
    empty.resize(IPosition(1, 10), False, ArrayInitPolicy::NO_INIT); // Elements are NOT initialized
    empty.resize(IPosition(1, 20), False, ArrayInitPolicy::INIT); // Elements are initialized
}

Note that, for some element types such as std::string and/or std::vector, you must be very careful when specifying ArrayInitPolicy::NO_INIT, because operations on unconstructed instances of such types cause catastrophic problems. You are responsible for constructing the elements before assignment, reference, or destruction when you specify ArrayInitPolicy::NO_INIT.

{ // OK
    Array<std::string> ary(IPosition(1, 1), ArrayInitPolicy::NO_INIT);
    ::new (ary.data()) std::string("Hello");  // explicitly constructs the element.
}
{ // NG
    Array<std::string> ary(IPosition(1, 1), ArrayInitPolicy::NO_INIT);
} // crashes because destructor of std::string is called on unconstructed object.

Code Examples

If the initialization of elements via the default constructor is essentially meaningless and you can ensure that the elements are initialized by yourself before they are used, specify ArrayInitPolicy::NO_INIT as below.

BAD example:
    Array<Complex> an_array(IPosition(1, N));
    storeSomething(&an_array);

GOOD example:
    Array<Complex> an_array(IPosition(1, N), ArrayInitPolicy::NO_INIT);
    storeSomething(&an_array);

Benchmarks

  secs
BAD example 2.09943
GOOD example 0.95276

void storeSomething(Array<Complex> *an_array) {
    size_t n = an_array->nelements();
    auto elements = an_array->data();
    Complex v;
    for (size_t i = 0; i < n; ++i) {
        v.real(v.real() + 1.0f);
        elements[i] = v;
    }
}
// ...
    constexpr size_t N = 1024*1024*128;
    { // BAD example
        double start = GetCurrentTime();
        Array<Complex> an_array(IPosition(1, N));
        storeSomething(&an_array);
        double end = GetCurrentTime();
        std::cout << end - start << std::endl;
    }
    { // GOOD example
        double start = GetCurrentTime();
        Array<Complex> an_array(IPosition(1, N), ArrayInitPolicy::NO_INIT);
        storeSomething(&an_array);
        double end = GetCurrentTime();
        std::cout << end - start << std::endl;
    }

Related Tips

Avoiding an unnecessary initialization against an array of a class which has a default constructor

References

N/A

Memory Alignment and Allocator for casa::Block and casa::Array

Description

Casacore has introduced allocators for the memory management of casa::Block and casa::Array (including Vector, Matrix, and Cube) to provide flexibility in memory management. By default, Block<T> and Array<T> use casa::DefaultAllocator<T> rather than new[]/delete[] to allocate, deallocate, construct, and destruct elements. Allocators use a std::allocator-compatible interface internally. Unlike new[]/delete[], std::allocator decouples allocation/deallocation from construction/destruction, which makes it possible to skip the unnecessary initialization described above.

DefaultAllocator guarantees the alignment of allocated memory. The alignment is 32 bytes by default, and you can change it by providing a CMake parameter, for example "-DCASA_DEFAULT_ALIGNMENT:STRING=64", when building casacore. You can check the alignment by referring to DefaultAllocator<int>::type::alignment.

Since DefaultAllocator is the default allocator for Block and Array, you can expect Block::storage() and Array::data() to return an aligned address in most cases. However, these functions may return an unaligned address because of Block<T>::replaceStorage, Array<T>::takeStorage, and constructors taking a user-provided storage pointer; storage allocated by new[] may be handed to Block/Array via such interfaces. NewDelAllocator is provided only for backward compatibility, to handle storage allocated by new[]. To eliminate the possibility that Block::storage() and/or Array::data() return an unaligned address, it is strongly recommended to allocate such storage via DefaultAllocator, as below.

{ // unrecommended
    int *ptr = new int[10];
    std::iota(ptr, &ptr[10], 0);
    Array<int> ai(IPosition(1, 10), ptr, TAKE_OVER); // NewDelAllocator<int> will be used to release ptr.
}
{ // recommended
    Array<int> ai(IPosition(1, 10), ArrayInitPolicy::NO_INIT);
    std::iota(ai.data(), ai.data() + ai.nelements(), 0);
}
{ // recommended
    typedef DefaultAllocator<int> Alloc;
    Alloc::type allocator;  // allocator provides std::allocator compatible interface.
    int *ptr = allocator.allocate(10);
    std::iota(ptr, &ptr[10], 0);
    Array<int> ai(IPosition(1, 10), ptr, TAKE_OVER, Alloc::value);
}
{ // recommended
    typedef DefaultAllocator<std::string> Alloc;
    Alloc::type allocator;
    std::string *ptr = allocator.allocate(10);
    std::for_each(ptr, &ptr[10],
            [&allocator](std::string &e) { allocator.construct(&e, "-"); });
    Array<std::string> astr(IPosition(1, 10), ptr, TAKE_OVER, Alloc::value);
}

As long as you are sure that Block::storage() and Array::data() return an aligned address, you can accelerate memory access via __builtin_assume_aligned, or you can pass the storage, without copying data, to a third-party library that requires an aligned address.

Code Examples

N/A

Benchmarks

N/A

Related Tips

Initialization control of casa::Block and casa::Array

References

N/A


Fast Libraries


Topic attachments
Attachment | Size | Date | Who | Comment
CRTP_test.ver3.tgz | 2 K | 2015-04-05 | TakeshiNakazato | benchmark code for static polymorphism by CRTP
SIMD.png | 12 K | 2015-02-24 | KohjiNakamura | SIMD operation
aos-aos.png | 11 K | 2015-02-24 | KohjiNakamura | AoS operation
aos_soa.cc | 2 K | 2015-02-25 | KohjiNakamura | sample code of SoA/AoS
array_init.cc | 2 K | 2014-10-16 | ShinnosukeKawakami | sample code of avoiding an unnecessary initialization against an array of a class which has a default constructor
branching_outside_loop.cc | 3 K | 2014-11-19 | TakeshiNakazato | sample code of removing conditional branch from the loop without code duplication
colors.css | 22 bytes | 2014-10-17 | KohjiNakamura | colors css
et.cc | 4 K | 2016-03-06 | WataruKawasaki | sample code of expression template
layout.css | 355 bytes | 2014-10-17 | KohjiNakamura | layout css
md_array.cc | 1 K | 2014-10-16 | WataruKawasaki | sample code of accessing multidimensional arrays
out_param.cc | 1 K | 2015-01-28 | KohjiNakamura | sample code of least update of output parameter
pgo.cc | 1 K | 2014-11-27 | KohjiNakamura | sample code of Profile-Guided Optimization
soa-soa.png | 12 K | 2015-02-24 | KohjiNakamura | SoA operation
type_conversion.cc | 2 K | 2014-10-16 | KanaSugimoto | sample code of the tip "Avoid implicit and/or explicit type conversions unless necessary"
Topic revision: r58 - 2016-03-18, WataruKawasaki