A developing page on interesting optimization/compiler notes, as well as odd issues.

General Optimizations:
• Vectorization
• FP precision (i.e: double vs. float)
• Integer/Fixed point
• use of const; Arrays instead of pointers
• Loop unrolling
• Remove loop conditionals
• Ordering of conditionals
• Use static function decorator (C/C++)
• Use of anonymous namespace (C++)

## Endianess Notes

The issue of Endianess has plagued us all since the dawn of computers. There are some good reasons for one vs. another, but that's another subject. How do we handle 'foreign endian' data?

First we need to recognize that the conversion is dependent upon the size of the data type. As it turns out, we don't care if its floating point or not. So the obvious ones: The ARPA net style to/from network byte order functions, requiring #include <arpa/inet.h> Note: 'Network order' is defined as big-endian.
• uint16_t ntos(uint16_t) - convert 16 bit/two byte data from/to big-endian
• uint16_t htons(uint16_t)
• uint32_t htonl(uint32_t) - convert 32 bit data from/to big-endian
• uint32_t ntohl(uint32_t)
Note that there is no ntohLL (64 bit). For this we must look elsewhere.

The 'endian' (included on most all linux-boxen and BSD systems) are listed as non-standard, but include the 64 bit variants and explict 'to little endian' functions. There require #include<endian.h> (sys/endian.h on some BSD systems). 16 bit functions:
• uint16_t htobe16(uint16_t) // replacement for htons()
• uint16_t htole16(uint16_t); // no equivalent in ARPA routines
• uint16_t be16toh(uint16_t); // replacement for ntohs()
• uint16_t le16toh(uint16_t); // no equivalent
32 bit Functions:
• uint32_t htobe32(uint32_t); // replaces htonl()
• uint32_t htole32(uint32_t); // no equivalent
• uint32_t be32toh(uint32_t); // replaces ntohl()
• uint32_t le32toh(uint32_t); // no equivalent
64 bit Functions: (none available in ARPA)
• uint64_t htobe64(uint64_t);
• uint64_t htole64(uint64_t);
• uint64_t be64toh(uint64_t);
• uint64_t le64toh(uint64_t);

Easy. But what if we have non-standard data type widths? (40 bit integers for example) Well this takes some thought, but since these types must be defined by bitfields in a structure, we can try some things. For a real-example, SPEAD packets for us looks something like this:

 SPEAD packet Item 'pointer' mode (1bit) identifier (23 bits) 40 bits of data or address MSB LSB

So how to convert this from big-endian to little endian with the least obfustication? First one might be inclined to perform a bunch of mask-shift-or operations, but its actually much easier. Just perform an 8 byte swap/pivot using be64toh(), and define a structure like so:
```struct _ItemPointer
{
unsigned item_identifier:23;   ///< 23 bit identifier
unsigned item_address_mode:1;  ///<  1 bit flag defining immediate (1) or relative (0)
};
```
Why does this work? The secret is a combination of realizing that swapping data and bit field definitions allows the 'correct' things to happen. In detail we take an example 8 bytes and follow it through:
Original BE data:
 A B C D E F G
Swapped using betoh64():
 G F E D C B A
Interpreted with bit fields, GFEDC is the 40 bit field in the correct ordering, C,B, and 7 bits of A are the 23 bit field, and the most significant bit of A is the 1-bit field.

### Type Punning Elimination

Not really an endianess issue, I ran into the following code, which intends to take a char *, cast the pointer to an int, then index into it, byte swap and return an integer:
```// This is ugly, and results in compiler warnings
#define BYTE_ARR_TO_UINT(array, idx) (ntohl(((unsigned int*)(array))[idx]))
```
Perhaps very terse, but a bit opaque. A better version is this:
```// This is actually faster than the macro expansion of above, plus it conforms with
// C99 standards.
union __XXY
{
const int*  iptr;
const char* cptr;
};
static inline unsigned int BYTE_ARR_TO_UINT(const char *p1, int idx)
{
union __XXY x;
x.cptr = p1;
return ntohl(x.iptr[idx]);
}
```
All that compiles into a single x86 one 64 bit inlined instruction (idx was a constant, so addressing was determined at compile time). Actually both reduce into a single instruction, but the second version is C99 compliant and causes no compiler warnings.

## Floating Point Considerations -- Odd Numerical Cases

### Direct vs. Indirect Trigs in OOF

I've noticed that it is possible to break the OOF program in some interesting ways. Consider the following:
```#ifdef BROKEN
double cos_phi, sin_phi;
cos_phi = cos(phi[i]);
sin_phi = sin(phi[i]);
fin[i*2]  = amp[i] * cos_phi;
fin[i*2+1]= amp[i] * sin_phi;
#else
fin[i*2]  = amp[i] * cos(phi[i]);
fin[i*2+1]= amp[i] * sin(phi[i]);
#endif
```

When the token 'BROKEN' is defined, we get a different set of answers than when BROKEN is undefined. Even more interesting is that optimization is set at -O2, so the compiler would probably optimize the temporaries out making the code equivalent.

The assember output differences are minor, (differing mostly in mov instructions and stack addresses) but in both cases call the same sin() and cos() functions. One other difference is register use, where rax is used instead of half of xmm1. (Both are 64 bit registers.)

A wild guess would be that in one case the 80 bit intermediate value is somehow being used directly, but the memory referenced value being truncated to 64 bits. I don't believe this but I don't have a better answer either. Compiler optimization level doesn't seem to change the result.

### The Large and the Small

Floating point works by splitting a value into a binary representation holding a value, sign and exponent (e.g. 4.5 x 103, except in binary). The big and small refers to performing an operation on two values: large and small. It must be recalled that floating point representation on a number line are not linear. The difference between LSB's of a large number may represent a huge delta, whereas the difference in LSB's of a very small number will be extremely small. The problem arises when two values are combined. Given the table below, what should the sum of 6x1018 + 3 be ?

Magnitude Space Between Values
6e-12 8.07794e-28
6e-10 1.03398e-25
6e-08 1.32349e-23
6e-06 8.47033e-22
0.0006 1.0842e-19
0.06 6.93889e-18
6 8.88178e-16
600 1.13687e-13
60000 7.27596e-12
6e+06 9.31323e-10
6e+08 1.19209e-07
6e+10 7.62939e-06
6e+12 0.000976562
6e+14 0.125
6e+16 8
6e+18 1024
6e+20 131072

A good way to do floating point comparisons is illustrated with this example:
```#include <cmath>
#include <limits>
#include <iomanip>
#include <iostream>
#include <type_traits>

template<class T>
typename std::enable_if<!std::numeric_limits<T>::is_integer, bool>::type
almost_equal(T x, T y, int ulp)
{
// the machine epsilon has to be scaled to the magnitude of the larger value
// and multiplied by the desired precision in ULPs (units in the last place)
return std::abs(x-y) <=   std::numeric_limits<T>::epsilon()
* std::max(std::abs(x), std::abs(y))
* ulp;
}
...
// Example use:
if (almost_equal(myval, refval, n_decimals))
```

### Order of Operations

The ordering of operations will impact the results -- whether that is a problem is of course application specific. Similar to the reasons in 'Large and Small', combinatorial order comes into play. For Example:
```w = x + y + z;
should be the same as
w = x + (y + z);
```
However this may not be the case!

### FFTW SSE vs. non-SSE in OOF (Unsolved)

A second case, which perhaps is more plausable, is the numerical impact of using the FFTW library configured for SSE vs. no-SSE. This too has a similar impact as the code above, causing differences in the fitted Zernike coefficients.

I have searched for any notes/reports of numerical issues concerning the use of SSE, and have not found any. I must conclude what we have here is a 'butterfly effect', where extremely small differences send the fitting routines down an alternate trajectory.

## The Pointer Aliasing and the `restrict` Keyword

Not often used, but very important is the type quailifier `restrict` (C) or `__restrict` (C++). What does this do? The restrict keyword is a way to tell the compiler some information about a pointer. In essence it says "I promise that this pointer is the only way to access the memory it points to." What does this imply?
• If two pointers in a function are both restrict pointers, then they will never point to overlapping memory regions.
• There is no dependence between reads/stores of the different restricted pointers.
With this 'promise', the compiler can enable various optimizations. Consider the code:
 ```struct XYZ { float x; float y; float z; float w; }; void foo(XYZ *pos, XYZ *vel, XYZ *acl, size_t n) { float dt = 0.1; for (i=0; i

The problem here is that the compiler has no way of knowing if pos, vel and acl overlap, or interact in some way. So in pseudo-code the resulting code does this during each loop iteration: Note: In both examples prior to loop xmm15 is loaded with four copies of dt
• load acl[i] (x,y,z,w) from memory into acl[i] -> xmm1
• vector multiply (x,y,z,w) by dt xmm1 * xmm15 -> xmm1
• load vel[i] (x,y,z,w) from memory into vel[i] -> xmm2
• vector add the result xmm2 + xmm1 -> xmm2
• store vel[i] (x,y,z,w) to memory from xmm2 -> vel[i]
• wait for store to complete
• load pos[i] (x,y,z,w) from memory into pos[i] -> xmm3
• vector add vel[i] (x,y,z,w) xmm2 + xmm3 -> xmm3
• store pos[i] (x,y,z,w) to memory xmm3 -> pos[i]

The problem here is that a dependenct between the store of vel and load of pos. The complier doesn't know if the store of vel doesn't impact the value of pos. Adding the restrict qualifiers to pos,vel and acl the dependence is broken and now the compiler can do:
• load acl[i] (x,y,z,w) from memory into acl[i] -> xmm1
• vector multiply by dt xmm1 * xmm15 -> xmm1
• load vel[i] (x,y,z,w) from memory into vel[i] -> xmm2
• vector add the result xmm2 + xmm1 -> xmm2
• load pos[i] (x,y,z,w) from memory into pos[i] -> xmm3
• vector add vel[i] (x,y,z,w) xmm2 + xmm3 -> xmm3
• store pos[i] (x,y,z,w) to memory xmm3 -> pos[i]
• store vel[i] (x,y,z,w) to memory from xmm2 -> vel[i]

## Compiler Vectorization

We all have heard about multi-core, and the push for parallelism. However, it seems a key capability is often overlooked. Each x86 core has the ability to operate on two doubles or four floats [four doubles and eight floats in the case of an AVX capabile system] as a single unit. A good approach then is to parallelize outer loops, and vectorize inner loops as a rule of thumb. How is the question. Fortunately we can get some help from the compiler if we organize and write code it can recognize as a vector operation.

But we must recognize some constraints in vectorization. The first is (as noted above) issues like overlapping memory and dependencies. We can calm the compiler with _restrict.

A second constraint is alignment considerations. SSE vector's must be aligned on 16-byte boundries, so some care must be taken when allocating heap objects. An example of how to allocate N floats is:
```{
float *a = (float *)memalign(16, sizeof(float)* N);
}

OR we can a global scope, define a new 'new':
void * operator new[](size_t s, size_t align)
{
// cout << "allocated " << s << " bytes at aligment of " << align << endl;
return memalign(align, s);
}
```
One could also argue that you could write your own allocator for STL. I would not recommend that for a number of reasons, but YMMV.

We can also do this with objects, but we should note:
• The alignment specifier in this case applies to each object, not the base of the array of objects
• If object contains a vtable, your code is borked, and your mother dresses you funny. (i.e: don't do this either!)
We can see how this works (and doesn't work) in the example below:
```void * operator new[](size_t s, size_t align)
{
cout << "allocated " << s << " bytes at aligment of " << align << endl;
return memalign(align, s);
}

class X
{
public:
X() : a(2.0) {}
virtual int foo() { return(a*a); }
float a;
float b;
};

class Y : public X
{
public:
Y() : X() { cout << "New Y at " << (void *)this << endl; }
virtual int foo() { return a; }
};

int main(int, char **)
{
float *a = new(16) float;
float *b = new(256) float;
Y     *c = new(32) Y;

cout << "C.a = " << c.foo() << endl;
for (int i=0; i<10; ++i)
{
cout << "address of a["<<i<<"]="<< (void*)&a[i] << endl;
cout << "address of b["<<i<<"]="<< (void*)&b[i] << endl;
}
delete [] a;
delete [] b;
delete [] c;
```

### Automatic

At -O2 and -O3 optimization levels, the tree vectorizer (tree refering to the intermediate code representation [as opposed to GIMPLE], not a vectorization algorithm) is enabled. Simple loops are vectorized automagically. In gcc-4.4, the flag '-ftree-vectorizer-verbose=5' will print out some info indicating loop vectorization.

The compiler can auto-vectorize, but it needs help. Loops which have data-dependent termination conditions are typically problemmatic, and will fail vectorization. Another item to keep in mind is that SSE data types have 16-bytes alignment requirements. Other alignments can be used, it will result in slower results. In fact the compiler will often emit two versions of a loop: one for the aligned inputs/outputs case, and another for the unaligned case (usually a non-SSE implementation).

There are two additional points which may help. If you examine the generated code, you will often see duplicate implementations (one for aligned one for unaligned) or possibly one for overlapping arrays one for not etc. There are two things we can add: the restrict keyword, (__restrict in C++) and the builtin_assume_aligned directives. The first case makes it easier for the compiler to tell if the arrays overlap. The second tells the compiler we promise to pass in aligned arrays.

```    void foob(double * restrict a, double * restrict b)
{
size_t i, j;
// use the 'aligned' pointers in data references
double *x = __builtin_assume_aligned(a, 16);
double *y = __builtin_assume_aligned(b, 16);
for (i=0; i<1024; ++i)
{
y[i] = x[i]*x[i];
}
}
```

#### Vectorized Loop Examples:

This section almost seems too obvious, be here goes:
```void foo (float _restrict *a, float _restrict *b, float _restrict *c)
{
int i;
for (i=0; i<256; i+=4)
{
a[i]   = b[i]   + c[i];
a[i+1] = b[i+1] + c[i+1];
a[i+2] = b[i+2] + c[i+2];
a[i+3] = b[i+3] + c[i+3];
}
}
// The compiler is smart enough to see this and vectorize it too:

void foo (float _restrict *a, float _restrict *b, float _restrict *c)
{
int i;
for (i=0; i<256; ++i)
{
a[i] = b[i] + c[i];
}
}
```
Easy enough, and quite intuitive

Now if we use a multi-dimensional array, it should still vectorize. In all cases where the vectorized loop contains a non-integral number of iterations, the compiler also will 'peel-out' the end cases, and implement those with scalar instructions.
```// double loop w/ multi-dimensional array example
// Note: column major iteration, row major would not work
void bar (const float x)
{
// #pragma omp parallel for
for (int i=0; i<M; i++)
{
for (int j=0; j<N; j++)
{
g[i][j] = x;
}
}
}
```
Now what if we uncomment the openMP pragma, and add the -fopenmp compiler flag, what interactions are there between the two? The answer is (in this case) not very much. Openmp splits the outer iteration into smaller outer loops, retaining the inner.

If we change this slightly:
```void foobar (const float x)
{
#pragma omp parallel for
for (int i=0; i<M*N; i++)
{
for (int j=0; j<N; j++)
{
h[i] = x;
}
}
}
```
Again the compiler detects a vectorization opportunity. Note in all these examples, not only does it vectorize, it also does some loop unrolling. (Repeating the chunk of SSE code several times and modifying the loop count as needed.

#### Failed to Vectorize Examples:

A classic example is the copy loop, which depends on the data itself:
```while (*q)
*p++ = *q++;
```
The reason being it is too costly to compare all elements of the vector prior to executing it.

Another problem crops up when there is control flow inside of a loop, which depends on the iteration variables:
```void foo(const double *_y, double* _x)
{
const size_t N = 1024;
const double *y = (const double *)__builtin_assume_aligned(_y, 16);
double *x = (double *)__builtin_assume_aligned(_x, 16);
for (size_t i=0; i<N; i+=2)
{
double s = y[i]*y[i] - 100.0;
if (s > 0)
{
x[i]   = y[i];
x[i+1] = y[i+1];
}
else
{
x[i]   = y[i]*2;
x[i+1] = y[i+1]*2;
}
}
}
```

### Using Intrinsics

When the compiler won't do it, intrinsics are better than assembly, but require an understanding of SSE instructions. Note: although I use the xmm[1-3] etc., these are just variable names -- their is no connection with the actual registers.
```      // Original Expression:
fin[i*2]  = amp[i] * cossin_phi[COS];
fin[i*2+1]= amp[i] * cossin_phi[SIN];
// SSE implementation:
// First load cos and sin values into the upper/lower 64 bit halves of a SSE register ie:
// xmm1.low = cos(phi);
// xmm1.hi  = sin(phi);
// Now load the the amp[i] values into both hi/low (duplicate value)
// xmm2.low = amp[i]
// xmm2.hi  = amp[i]
const __m128d xmm2 = _mm_load1_pd(&amp[i]); // load 1 double into both halves
// Now multiply the two registers (2 doubles x 2 doubles)
// fin[i*2]   = xmm2.low * xmm1.low
// fin[i*2+1] = xmm2.hi  * xmm1.hi
// Write the 2 resulting values back into memory
_mm_store_pd(&fin[i*2], _mm_mul_pd(xmm2,xmm1));
```

Often for code readability we implement various operators like (),+,-,etc. This is good. What is not is how we write them. Consider the following implementation of a operator + for a vector class:
```Vector3 operator+( Vector3 v )
{
Vector3 retnv;
retnv.m_x = m_x + v.m_x;
retnv.m_y = m_y + v.m_y;
retnv.m_z = m_z + v.m_z;
return retnv;
}
```
This requires 3 calls to the constructor. One for the initialization of 'retnv'; one to make a const copy of retnv; and one to actually create the returned value. But knowning this we can short-circuit some of these calls. We don't need 'retnv' initialized at all because we are going to assign it anyway. If we write this a bit differently like:
```Vector3 operator+( Vector3 v )
{
return Vector3(m_x + v.m_x, m_y + v.m_y, m_z + v.m_z);
}
</vector>
Now we are down to one constructor call, which actually does some useful work for us. Note that for this to work, we need to provide the proper constructor.

---++ Compiler Flags
The breath of compiler flags is quite long, and several options 'turn-on' other options. The quick script below should help figure some of this out:
<verbatim>
# usage: gccflags 'quoted options here'
# or:    gccflags gcc47 'quoted options here'
#
gccflags()
{
if [ \$# -lt 2 ]; then
comp_test="gcc"
comp_test_opt=\$1
else
comp_test=\$1
comp_test_opt=\$2
fi
echo "int main() { return 0; }" | \$comp_test  "\$comp_test_opt" -v -Q -x c - 2>&1
rm -f a.out
}
</verbatim>

So for example, with a 4.4 gcc/g++ %RED%on colossus%ENDCOLOR% using -O2 we get:
<verbatim>
GNU C (GCC) version 4.4.6 20120305 (Red Hat 4.4.6-4) (x86_64-redhat-linux)
compiled by GNU C version 4.4.6 20120305 (Red Hat 4.4.6-4), GMP version 4.3.1, MPFR version 2.4.1.
GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
options passed:  -v - -mtune=generic -O2
options enabled:  -falign-labels -falign-loops -fargument-alias
-fasynchronous-unwind-tables -fauto-inc-dec -fbranch-count-reg
-fcaller-saves -fcommon -fcprop-registers -fcrossjumping
-fcse-follow-jumps -fdefer-pop -fdelete-null-pointer-checks
-fdwarf2-cfi-asm -fearly-inlining -feliminate-unused-debug-types
-fexpensive-optimizations -fforward-propagate -ffunction-cse -fgcse
-fgcse-lm -fguess-branch-probability -fident -fif-conversion
-fif-conversion2 -findirect-inlining -finline
-finline-functions-called-once -finline-small-functions -fipa-cp
-fipa-pure-const -fipa-reference -fira-share-save-slots
-fmath-errno -fmerge-constants -fmerge-debug-strings
-fmove-loop-invariants -fomit-frame-pointer -foptimize-register-move
-foptimize-sibling-calls -fpeephole -fpeephole2 -freg-struct-return
-fregmove -freorder-blocks -freorder-functions -frerun-cse-after-loop
-fsched-interblock -fsched-spec -fsched-stalled-insns-dep
-fschedule-insns2 -fsigned-zeros -fsplit-ivs-in-unroller
-ftoplevel-reorder -ftrapping-math -ftree-builtin-call-dce -ftree-ccp
-ftree-ch -ftree-coalesce-vars -ftree-copy-prop -ftree-copyrename
-ftree-cselim -ftree-dce -ftree-dominator-opts -ftree-dse -ftree-fre
-ftree-loop-im -ftree-loop-ivcanon -ftree-loop-optimize
-ftree-parallelize-loops= -ftree-pre -ftree-reassoc -ftree-scev-cprop
-ftree-sink -ftree-sra -ftree-switch-conversion -ftree-ter
-ftree-vect-loop-version -ftree-vrp -funit-at-a-time -funwind-tables
-fvar-tracking -fvar-tracking-assignments -fvect-cost-model
-fzero-initialized-in-bss -m128bit-long-double -m64 -m80387
-maccumulate-outgoing-args -malign-stringops -mfancy-math-387
-mfp-ret-in-387 -mfused-madd -mglibc -mieee-fp -mmmx -mno-sse4 -mpush-args
-mred-zone -msse -msse2 -mtls-direct-seg-refs
</verbatim>
Note that the '-mtune' arg is pre-configured in. This option examines the current host processor and uses that information to optimize for cache sizes etc. <BR>Therefore: %RED%NOTE!: This output will depend upon the specifics of the host it is run on.%ENDCOLOR%

---++ NUMA (Non-Uniform Memory Access API)
These are calls to manipulate NUMA from inside your program:
* numa_num_possible_nodes() -- Number of NUMA nodes
* numa_bind()  -- Sets task affinity and memory domain preference
* numa_set_localalloc() -- Sets a policy which attempts to bind physical memory to the allocating node (as opposed to first touch node)
* numa_alloc() -- Like alloc(), but is slower and always rounds up to multiple of the pagesize.
* numa_distance() -- Returns 'distance' or 'cost' of accessing remote node memory. Returns cost in multiples of 10.

---+++ NUMA API  System calls
* set_mempolicy()
* Sets the general NUMA policy (process)
* Does not include mbind'ed memory
* Does not include current memory.
* Policies are  Default, bind, preferred, interleave
* Takes a list of core id's
* mbind(ptr, len, policy, coremask, mode)
* More specific than set_mempolicy()
* Can move existing memory NUMA domain
* move_pages(pid,count,pages, nodes, status, flags)
* Can move existing memory

Notes:
* In all cases, granularity is by multiples of pagesize (usually 4096 bytes)
* NUMA policies are inherited by child processes. (This is how numactl and likwid work)

* sched_setaffinity(pid, cpusetsize, cpuset) -- binds a process to a set of cores

Notes:
* Affinity settings are inherited by child process/threads
* Child can only further reduce (narrow) the set of available cores to run on, Unless it has the CAP_SYS_NICE priviledge (usually root or specially configured processes)

---+++ NUMA Command Line Tools
These tools are for setting NUMA policies outside of a program:
* numactl  included in most distributions
* likwid  open source project. Has many of the functions numactl has, but easier to use

Likwid Examples:
* likwid topology -- shows NUMA topology
* likwid-pin -- allows pinning a process to a set of cores

---++ C++ STL Containers & Memory Footprint
Typical STL container mechanisms
* Array, Vector  contiguous memory chunk
* Map  Red/Black tree of nodes
* List  list
* Deque  tree of blocks of nodes
* Hashmap(C++11)  Hash table/vector
* Sets  RB trees

C++ Containers -- Attributes
Some thoughts about each data structure:
* std::list is relatively slow to iterate through the collection due to its poor spatial locality.
* std::vector and std::deque perform better than std::list with *small sized elements*
* std::list efficiently handles very large elements
* std::deque performs better than a std::vector for inserting at random positions (especially at the front, which is constant time)
* std::deque and std::vector are not the best choice for data types with high cost of copy/assignment

So given this, Which one to use? This draw here is some rules of thumb on usage of each data structure:
* Number crunching: use std::vector or std::deque (also std::array or std::valarray)
* Linear search: use std::vector or std::deque
* Random Insert/Remove: use std::deque (if object size is very small [typically < 64B], otherwise use std::list)
* Large object size: use std::list (not if intended for searching)
* If fast searching is important: use Hashmaps (especially on large datasets)

-- Main.JoeBrandt - 2013-01-15```
Topic revision: r15 - 2013-11-08, JoeBrandt
Copyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding NRAO Public Wiki? Send feedback