I'm trying to figure out how much I should care about alignment. Here I'm testing some arithmetic using two different buffers. If I set this up right, in the 'wacky' buffer the integer is stored at the 29th byte arbitrarily. In the 'normal' buffer the integer is stored at the 29th 4-byte integer, like any sane array would. I am printing out the results of my tests. The wacky integers are slower, but it actually doesn't matter if I pick 29 or 0 or 1, the performance ratio is about the same. The ratio also doesn't change if compiler optimizations are turned on or off. Is this an accurate representation of the performance cost of doing this? I might be completely confused or missing something here but I'd appreciate if someone could point me in the right direction.
Here's the code:
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <time.h>
typedef uint8_t *wacky_int32_t;
#define WACKY_OFFSET 29
wacky_int32_t wacky_int32(int32_t number)
{
uint8_t *buffer = (uint8_t *)calloc(sizeof(int32_t) WACKY_OFFSET, sizeof(uint8_t));
memcpy(buffer WACKY_OFFSET, &number, sizeof(int32_t));
return(buffer);
}
static inline int32_t unwacky_int32(wacky_int32_t number)
{
return(*(int32_t *)(number WACKY_OFFSET));
}
long perfcount()
{
struct timespec ts;
clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
return(ts.tv_nsec);
}
int main(int argc, char **argv)
{
int testcount = 0;
while(testcount < 24) {
int32_t randa = rand();
int32_t randb = rand();
int32_t randc = rand();
int32_t randd = 1 rand();
wacky_int32_t a = wacky_int32(randa);
wacky_int32_t b = wacky_int32(randb);
wacky_int32_t c = wacky_int32(randc);
wacky_int32_t d = wacky_int32(randd);
int32_t *a2 = (int32_t *)calloc(WACKY_OFFSET 1, sizeof(int32_t));
int32_t *b2 = (int32_t *)calloc(WACKY_OFFSET 1, sizeof(int32_t));
int32_t *c2 = (int32_t *)calloc(WACKY_OFFSET 1, sizeof(int32_t));
int32_t *d2 = (int32_t *)calloc(WACKY_OFFSET 1, sizeof(int32_t));
a2[WACKY_OFFSET] = randa;
b2[WACKY_OFFSET] = randb;
c2[WACKY_OFFSET] = randc;
d2[WACKY_OFFSET] = randd;
long start = perfcount();
int32_t ans = (unwacky_int32(a) (unwacky_int32(b) * unwacky_int32(c))) % unwacky_int32(d);
long wackytime = perfcount() - start;
free(a); free(b); free(c); free(d);
start = perfcount();
int32_t ans2 = (a2[WACKY_OFFSET] (b2[WACKY_OFFSET] * c2[WACKY_OFFSET])) % d2[WACKY_OFFSET];
long normaltime = perfcount() - start;
free(a2); free(b2); free(c2); free(d2);
printf("[wacky mode ] ans = %-16d time = %-16ld\n[normal mode] ans = %-16d time = %-16ld\n\n",
ans, wackytime, ans2, normaltime);
}
return 0;
}
Here's some output:
[wacky mode ] ans = -139296355 time = 370
[normal mode] ans = -139296355 time = 124
[wacky mode ] ans = 1254173191 time = 134
[normal mode] ans = 1254173191 time = 127
[wacky mode ] ans = -428008505 time = 95
[normal mode] ans = -428008505 time = 91
[wacky mode ] ans = 1411083651 time = 90
[normal mode] ans = 1411083651 time = 91
[wacky mode ] ans = -250251228 time = 88
[normal mode] ans = -250251228 time = 69
[wacky mode ] ans = 1670511475 time = 90
[normal mode] ans = 1670511475 time = 76
[wacky mode ] ans = -142905250 time = 93
[normal mode] ans = -142905250 time = 75
[wacky mode ] ans = 402377226 time = 107
[normal mode] ans = 402377226 time = 76
[wacky mode ] ans = -680962320 time = 93
[normal mode] ans = -680962320 time = 73
[wacky mode ] ans = -992960967 time = 98
[normal mode] ans = -992960967 time = 72
[wacky mode ] ans = 20339958 time = 95
[normal mode] ans = 20339958 time = 72
[wacky mode ] ans = -1090114074 time = 95
[normal mode] ans = -1090114074 time = 78
[wacky mode ] ans = 170467638 time = 95
[normal mode] ans = 170467638 time = 76
[wacky mode ] ans = 102978457 time = 88
[normal mode] ans = 102978457 time = 73
[wacky mode ] ans = 96879004 time = 95
[normal mode] ans = 96879004 time = 78
[wacky mode ] ans = 941108877 time = 94
[normal mode] ans = 941108877 time = 76
[wacky mode ] ans = -3164800 time = 92
[normal mode] ans = -3164800 time = 72
[wacky mode ] ans = -73124107 time = 88
[normal mode] ans = -73124107 time = 73
[wacky mode ] ans = 759564988 time = 94
[normal mode] ans = 759564988 time = 76
[wacky mode ] ans = 103176158 time = 92
[normal mode] ans = 103176158 time = 78
[wacky mode ] ans = 1234836399 time = 94
[normal mode] ans = 1234836399 time = 79
[wacky mode ] ans = 498712444 time = 89
[normal mode] ans = 498712444 time = 74
[wacky mode ] ans = 207578849 time = 97
[normal mode] ans = 207578849 time = 76
[wacky mode ] ans = 1447403380 time = 91
[normal mode] ans = 1447403380 time = 70
CodePudding user response:
The consequence of reading unaligned integers depends very much on the compiler and hardware, i.e. it is implementation dependent and not directly covered by the C standard. It is, however, indirectly covered since the constructs needed to make unaligned access fall in the infamous "undefined behavior" category (which does not exclude compilers from defining a behavior). Thus, the following is a - probably incomplete - account of what may be observed on a given system.
On a 8 bit CPU it will typically not matter at all (the second bullet below may apply, though). On a 16 bit CPU it will typically have some consequence to do a read which is not aligned to the byte width of the CPU. The possible consequences include:
- Performance penalty since the operation involves more memory fetches than aligned read.
- Performance penalty or faulty behavior if the operation breaks assumptions/requirements of the memory cache system.
- Causing an otherwise atomic operation to become non-atomic (which can be a serious issue if the variable is shared between execution threads, including interrupts).
- Triggering a CPU error (which will typically terminate the program or enter an error state).
It should be mentioned that unaligned access can happen due to subtle reasons. I have experienced that compiler optimization of initialization of a structure (only with char members) caused a hard fault because the struct was not aligned as the compiler expected.
About the performance measurements in the question: the time spans are very short and may be affected by system interrupts running in between the time readings. To get reliable results, an analysis of average and deviations over many iterations should be made.
That being said, the results seem to show a performance penalty of unaligned access within what could be expected.
CodePudding user response:
Generally, you should care. Use memcpy both ways:
static inline int32_t unwacky_int32(wacky_int32_t number)
{
int32_t r;
memcpy(&r, number WACKY_OFFSET, sizeof(int32_t));
return r;
}
The compiler should be able to optimize that to produce optimal code, so there's no performace worry, so there is really no reason write non-portable (doesn't work on some CPUs, generates bus error due to alignment) code.
CodePudding user response:
The issue with unaligned access is not that it might be slower.
The issue with unaligned access is not that it might be unpredictably slower.
No, the issue with unaligned access is not that it might not work at all.
Unaligned access is undefined behavior, and undefined behavior is, usually, poison.
And you absolutely can not derive a useful conclusion about undefined behavior by trying it and observing that it seems to work on your machine (today).
If you're trying to write a useful program, you want one that works everywhere, every day, on anyone's machine.
"Works on my machine" is not a useful certification.
