This morning I saw a commit in our group with nearly this content:
- std::fill(v1.begin(), v1.end(), 0);
- std::fill(v2.begin(), v2.end(), 0);
- std::fill(v3.begin(), v3.end(), 0);
+ for(int i = 0; i < N; ++i) {
+ v1[i] = v2[i] = v3[i] = 0;
+ }
with comment "small optimization".
I've been a bit wondered "Is it really give any speedup?", but, anyway, decided to check. Results are expected, all but the last one: strange thing, I can't do this code work fast w/o hand optimization even w/ -march=native. Here is the source code and my benchmark digits:
#include <algorithm>
#include <vector>
#include <stdio.h>
using namespace std;
#define N 1000
inline void i32_fill(int *start, int n, int c)
{
int d1, d2;
asm volatile(
"cld\n\t"
"rep\t\n"
"stosl\n\t"
:"=&D"(d1), "=&c"(d2)
:"0"(start), "1"(n), "a"(c)
:"cc", "memory"
);
}
int main()
{
int crc = 0;
vector v1(N), v2(N), v3(N), v4(N);
for ( int loop = 0; loop < 10000000; loop++ )
{
/* Uncomment any of the following methods:
* I especially did not place 'em one after each other keeping
* in mind you may ask "wasn't it because of caching?"
*/
/*
std::fill(v1.begin(), v1.end(), 1);
std::fill(v2.begin(), v2.end(), 2);
std::fill(v3.begin(), v3.end(), 3);
std::fill(v4.begin(), v4.end(), 4);
*/
/*
for ( int i = 0; i < N; i++) v1[i] = 1;
for ( int i = 0; i < N; i++) v2[i] = 2;
for ( int i = 0; i < N; i++) v3[i] = 3;
for ( int i = 0; i < N; i++) v4[i] = 4;
*/
/*
for ( int i = 0; i < N; i++) {
v1[i] = 1;
v2[i] = 2;
v3[i] = 3;
v4[i] = 4;
}*/
/*
i32_fill(&v1[0], N, 1);
i32_fill(&v2[0], N, 2);
i32_fill(&v3[0], N, 3);
i32_fill(&v4[0], N, 4);
*/
crc += v1[N-1] + v2[N-1] + v3[N-1] + v4[N-1];
}
printf("CRC=%d\n", crc);
return 0;
}
First of all, there is a great difference (over 4 times) using -O2 & -O3 for all but i32_fill. Here is test' results for:
$ g++ -O3 -march=native fill.cpp && ./time a.out
| Method | Time |
|---|---|
| std::fill | 0m7.090s |
| N x loop | 0m7.018s |
| loop x N | 0m5.119s |
| i32_fill | 0m4.072s |
So, the question is - is here a better solution than i32_fill and why "N x loop" wasn't optimized so much by compiler? BTW, memset(3) uses "stosq then stosb" approach and it's built-in gcc.
This comment has been removed by the author.
ReplyDeleteYou are doing it wrong (c)
ReplyDeleteFirst of all, haven't you ever heard that -march=native for general-purpose application is a some kind of sqrt(evil)? For my cpu 'native' turns to:
-march=core2 -mcx16 -msahf -mno-movbe -mno-aes -mno-pclmul -mno-popcnt -mno-abm -mno-lwp -mno-fma -mno-fma4 -mno-xop -mno-bmi -mno-tbm -mno-avx -mno-sse4.2 -msse4.1 --param l1-cache-size=32 --param l1-cache-line-size=64 --param l2-cache-size=6144 -mtune=core2
Are you sure that you really need all this staff?
For example, cache-tuning options are efficient only when you are using single-tasking environment. Have you got one? ;-P
The only orthodox faithful way is "-march=${YOUR_ARCH}", but not "-march=native".
Don't forget about optimization options beyond -O N.
GCC 4.6 with "-O3 -march=core2" compiles std::fill using SSE2's MOVDQA (Move Aligned Double Quadword):
movdqa .LC0(%rip), %xmm0
It's a little bit faster than your fastest implementation: 8.161s vs 8.497s.