This morning I saw a commit in our group with nearly this content:
    - std::fill(v1.begin(), v1.end(), 0);
    - std::fill(v2.begin(), v2.end(), 0);
    - std::fill(v3.begin(), v3.end(), 0);
    + for(int i = 0; i < N; ++i) {
    +     v1[i] = v2[i] = v3[i] = 0;
    + }

with the comment "small optimization".
I wondered whether it really gives any speedup and decided to check. The results are as expected, all but the last one: strangely, I can't make this code run fast without hand optimization, even with -march=native. Here is the source code and my benchmark numbers:
    #include <algorithm>
    #include <vector>
    #include <stdio.h>

    using namespace std;

    #define N 1000

    inline void i32_fill(int *start, int n, int c)
    {
        int d1, d2;
        asm volatile(
            "cld\n\t"
            "rep\t\n"
            "stosl\n\t"
            :"=&D"(d1), "=&c"(d2)
            :"0"(start), "1"(n), "a"(c)
            :"cc", "memory"
        );
    }

    int main()
    {
        int crc = 0;
        vector<int> v1(N), v2(N), v3(N), v4(N);

        for ( int loop = 0; loop < 10000000; loop++ ) {
            /* Uncomment any of the following methods:
             * I deliberately did not place them one after another, keeping
             * in mind you may ask "wasn't it because of caching?" */

            /* std::fill(v1.begin(), v1.end(), 1);
               std::fill(v2.begin(), v2.end(), 2);
               std::fill(v3.begin(), v3.end(), 3);
               std::fill(v4.begin(), v4.end(), 4); */

            /* for ( int i = 0; i < N; i++) v1[i] = 1;
               for ( int i = 0; i < N; i++) v2[i] = 2;
               for ( int i = 0; i < N; i++) v3[i] = 3;
               for ( int i = 0; i < N; i++) v4[i] = 4; */

            /* for ( int i = 0; i < N; i++) {
                   v1[i] = 1;
                   v2[i] = 2;
                   v3[i] = 3;
                   v4[i] = 4;
               } */

            /* i32_fill(&v1[0], N, 1);
               i32_fill(&v2[0], N, 2);
               i32_fill(&v3[0], N, 3);
               i32_fill(&v4[0], N, 4); */

            crc += v1[N-1] + v2[N-1] + v3[N-1] + v4[N-1];
        }

        printf("CRC=%d\n", crc);
        return 0;
    }
First of all, there is a big difference (more than 4x) between -O2 and -O3 for every method except i32_fill. Here are the test results for:
$ g++ -O3 -march=native fill.cpp && time ./a.out
Method | Time |
---|---|
std::fill | 0m7.090s |
N x loop | 0m7.018s |
loop x N | 0m5.119s |
i32_fill | 0m4.072s |
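If you would rather time the variants inside one process than with the shell's time, something like the following could be used. It is only a minimal sketch: `seconds_for` and `time_std_fill` are hypothetical helper names that do not appear in the benchmark above, and the checksum plays the same role as `crc` (it keeps the compiler from throwing the fills away).

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the post): runs `body` `reps` times
// and returns the elapsed wall-clock time in seconds.
template <typename F>
double seconds_for(int reps, F body) {
    assert(reps > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i) {
        body();
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

// Times the std::fill variant on four vectors of n ints each.
// `checksum` accumulates the last elements, like `crc` in the post.
inline double time_std_fill(int reps, std::size_t n, long long& checksum) {
    assert(n > 0);
    std::vector<int> v1(n), v2(n), v3(n), v4(n);
    return seconds_for(reps, [&] {
        std::fill(v1.begin(), v1.end(), 1);
        std::fill(v2.begin(), v2.end(), 2);
        std::fill(v3.begin(), v3.end(), 3);
        std::fill(v4.begin(), v4.end(), 4);
        checksum += v1[n - 1] + v2[n - 1] + v3[n - 1] + v4[n - 1];
    });
}
```

Calling `time_std_fill(10000000, 1000, sum)` would mirror the loop count and N used above; the other methods can be timed the same way by swapping the lambda body.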
So the question is: is there a better solution than i32_fill, and why wasn't "N x loop" optimized as well by the compiler? BTW, memset(3) uses the "stosq then stosb" approach, and it's a GCC built-in.
You are doing it wrong (c)
First of all, haven't you ever heard that -march=native for a general-purpose application is some kind of sqrt(evil)? For my CPU, 'native' expands to:
-march=core2 -mcx16 -msahf -mno-movbe -mno-aes -mno-pclmul -mno-popcnt -mno-abm -mno-lwp -mno-fma -mno-fma4 -mno-xop -mno-bmi -mno-tbm -mno-avx -mno-sse4.2 -msse4.1 --param l1-cache-size=32 --param l1-cache-line-size=64 --param l2-cache-size=6144 -mtune=core2
Are you sure that you really need all this stuff?
For example, cache-tuning options are effective only in a single-tasking environment. Have you got one? ;-P
The only orthodox, faithful way is "-march=${YOUR_ARCH}", not "-march=native".
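One way to see what a given -march value actually turns on is to look at the feature macros the compiler predefines. The sketch below assumes GCC or Clang on x86 (other compilers may not define these macros), and `enabled_simd_macros` is a hypothetical name used only for illustration.

```cpp
#include <cassert>  // assert, handy when checking the helper below
#include <string>

// Hypothetical helper: lists the x86 SIMD feature macros that the
// current compiler flags (-march=..., -msse4.1, ...) have enabled.
inline std::string enabled_simd_macros() {
    std::string out;
#ifdef __SSE2__
    out += "SSE2 ";
#endif
#ifdef __SSE4_1__
    out += "SSE4.1 ";
#endif
#ifdef __SSE4_2__
    out += "SSE4.2 ";
#endif
#ifdef __AVX__
    out += "AVX ";
#endif
#ifdef __AVX2__
    out += "AVX2 ";
#endif
    if (out.empty()) {
        out = "none";  // e.g. a non-x86 target or a very old baseline
    }
    return out;
}
```

Printing the result once for -march=native and once for an explicit -march value should show whether the two builds differ in the instruction sets that matter for a fill loop.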
Don't forget about optimization options beyond -O N.
GCC 4.6 with "-O3 -march=core2" compiles std::fill using SSE2's MOVDQA (Move Aligned Double Quadword):
movdqa .LC0(%rip), %xmm0
It's a little bit faster than your fastest implementation: 8.161s vs 8.497s.
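For comparison, here is roughly what an explicit SSE2 fill could look like in source form. This is a hand-written sketch, not the code GCC generates for std::fill: `sse2_fill` is a hypothetical name, it assumes a 4-byte int, and it falls back to std::fill when SSE2 is not available.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics
#endif

// Hypothetical helper: fills n ints starting at p with `value`.
inline void sse2_fill(int* p, std::size_t n, int value) {
    assert(p != nullptr || n == 0);
#if defined(__SSE2__)
    // Scalar stores until p is 16-byte aligned, as the aligned store requires.
    while (n > 0 && (reinterpret_cast<std::uintptr_t>(p) & 15u) != 0) {
        *p++ = value;
        --n;
    }
    // Four ints per aligned 128-bit store (this is what maps to MOVDQA).
    const __m128i v = _mm_set1_epi32(value);
    for (; n >= 4; n -= 4, p += 4) {
        _mm_store_si128(reinterpret_cast<__m128i*>(p), v);
    }
    // Scalar stores for the remaining 0..3 elements.
    while (n > 0) {
        *p++ = value;
        --n;
    }
#else
    std::fill(p, p + n, value);  // portable fallback without SSE2
#endif
}
```

It would be called like i32_fill, e.g. `sse2_fill(&v1[0], N, 1)`. Whether it beats the compiler's own vectorized std::fill is something only a measurement on the target machine can tell.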