Saturday, January 14, 2012

slow std::fill behavior

This morning I saw a commit in our group with roughly this content:

- std::fill(v1.begin(), v1.end(), 0);
- std::fill(v2.begin(), v2.end(), 0);
- std::fill(v3.begin(), v3.end(), 0);
+ for(int i = 0; i < N; ++i) {
+     v1[i] = v2[i] = v3[i] = 0;
+ }
with the comment "small optimization".

I wondered a bit, "does it really give any speedup?", but decided to check anyway. The results were as expected, all except the last one: the strange thing is that I can't make this code run fast without hand optimization, even with -march=native. Here is the source code and my benchmark numbers:

#include <algorithm>
#include <vector>
#include <stdio.h>
using namespace std;

#define N 1000

inline void i32_fill(int *start, int n, int c)
{
    int d1, d2;
    asm volatile(
         "cld\n\t"
         "rep\t\n"
         "stosl\n\t"
        :"=&D"(d1), "=&c"(d2)
        :"0"(start), "1"(n), "a"(c)
        :"cc", "memory"
    );
}

int main()
{
    int crc = 0;
    vector<int> v1(N), v2(N), v3(N), v4(N);

    for ( int loop = 0; loop < 10000000; loop++ )
    {
        /* Uncomment only one of the following methods at a time.
         * I deliberately did not run them one after another in the same
         * binary, in case you ask "wasn't it because of caching?"
         */

        /*
        std::fill(v1.begin(), v1.end(), 1);
        std::fill(v2.begin(), v2.end(), 2);
        std::fill(v3.begin(), v3.end(), 3);
        std::fill(v4.begin(), v4.end(), 4);
        */

        /*
        for ( int i = 0; i < N; i++) v1[i] = 1;
        for ( int i = 0; i < N; i++) v2[i] = 2;
        for ( int i = 0; i < N; i++) v3[i] = 3;
        for ( int i = 0; i < N; i++) v4[i] = 4;
        */

        /*
        for ( int i = 0; i < N; i++) {
            v1[i] = 1;
            v2[i] = 2;
            v3[i] = 3;
            v4[i] = 4;
        }*/

       /* 
        i32_fill(&v1[0], N, 1);
        i32_fill(&v2[0], N, 2);
        i32_fill(&v3[0], N, 3);
        i32_fill(&v4[0], N, 4);
        */
        crc += v1[N-1] + v2[N-1] + v3[N-1] + v4[N-1];
    }

    printf("CRC=%d\n", crc);
    return 0;
}

First of all, there is a big difference (over 4x) between -O2 and -O3 for all methods except i32_fill. Here are the test results:

$ g++ -O3 -march=native fill.cpp && time ./a.out
Method       Time
std::fill    0m7.090s
N x loop     0m7.018s
loop x N     0m5.119s
i32_fill     0m4.072s

So, the question is: is there a better solution than i32_fill, and why wasn't "N x loop" optimized as well by the compiler? BTW, memset(3) uses the "stosq, then stosb" approach and is a gcc built-in.
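For the zero-fill case from the original commit (and only that case), another option would be to call memset directly on the vectors' storage. This is just a sketch, with a hypothetical i32_zero helper: it relies on an int with value 0 being all-zero bytes, so it does not apply to the non-zero values used in the benchmark above.

#include <string.h>

/* Zero-fill via memset: correct only because an int with value 0
 * is all-zero bytes; this does not work for non-zero fill values. */
inline void i32_zero(int *start, int n)
{
    memset(start, 0, n * sizeof(int));
}

On glibc this ends up in the library's hand-tuned memset, so it has a good chance of matching the rep/stos version for zeroing.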

2 comments:

  1. This comment has been removed by the author.

  2. You are doing it wrong (c)
    First of all, haven't you ever heard that -march=native for a general-purpose application is some kind of sqrt(evil)? For my CPU, 'native' turns into:
    -march=core2 -mcx16 -msahf -mno-movbe -mno-aes -mno-pclmul -mno-popcnt -mno-abm -mno-lwp -mno-fma -mno-fma4 -mno-xop -mno-bmi -mno-tbm -mno-avx -mno-sse4.2 -msse4.1 --param l1-cache-size=32 --param l1-cache-line-size=64 --param l2-cache-size=6144 -mtune=core2
    Are you sure you really need all this stuff?
    For example, the cache-tuning options only pay off in a single-tasking environment. Have you got one? ;-P
    The only orthodox, faithful way is "-march=${YOUR_ARCH}", not "-march=native".
    Don't forget about optimization options beyond -O<N>.
    GCC 4.6 with "-O3 -march=core2" compiles std::fill using SSE2's MOVDQA (Move Aligned Double Quadword):
    movdqa .LC0(%rip), %xmm0
    It's a little bit faster than your fastest implementation: 8.161s vs 8.497s.
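    Roughly, that vectorized fill is equivalent to a hand-written SSE2 version along these lines (a sketch only, with a hypothetical sse2_fill helper; it assumes the pointer is 16-byte aligned and the count is a multiple of 4):

    #include <emmintrin.h>

    /* Broadcast the fill value into an XMM register, then do aligned
     * 16-byte stores (MOVDQA). Assumes p is 16-byte aligned and n is
     * a multiple of 4. */
    inline void sse2_fill(int *p, int n, int c)
    {
        __m128i v = _mm_set1_epi32(c);
        for (int i = 0; i < n; i += 4)
            _mm_store_si128((__m128i *)(p + i), v);
    }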
