Wednesday, July 25, 2012

pstack for amd64

If you have ever printed a stack trace with gdb(1), you may have noticed that it's slow. That's OK while debugging, but surely not suitable for anything close to real time. The reason is that gdb extracts every symbol from every file the observed executable is linked against.

There is a nice replacement for this, pstack(1), but it works only for x86 binaries. Here is an attempt to do the same for x86_64. It uses libunwind to unwind the stack frames, and then a Perl script (omfg) to extract symbols and debug info and to pretty-print the output.

There may be a nice use case on servers: automatically printing the backtrace of a monitored process that has stalled, right before killing it. In most cases that's enough to dig into the problem, even w/o debug symbols. Or ... just to see how to unwind a remote stack with libunwind, since I didn't find any example :-)

Example

Anyway, here is an example of its output:
$ ./pstack64 20794

20794:./a.out
#0 0x00007fb66ee42020 in /lib/x86_64-linux-gnu/libc-2.15.so: nanosleep@@GLIBC_2.2.5
#1 0x00007fb66ee41edc in /lib/x86_64-linux-gnu/libc-2.15.so: __sleep (/build/buildd/eglibc-2.15/posix/../sysdeps/unix/sysv/linux/sleep.c:138)
#2 0x0000000000400561 in /tmp/a.out: fn
#3 0x0000000000400571 in /tmp/a.out: a
#4 0x0000000000400581 in /tmp/a.out: main
#5 0x00007fb66eda576d in /lib/x86_64-linux-gnu/libc-2.15.so: __libc_start_main (/build/buildd/eglibc-2.15/csu/libc-start.c:258)
#6 0x0000000000400489 in /tmp/a.out: _start

This program doesn't use any dynamic libraries except libc. Here is an example of another program, written in C++ and using Qt:

$ ./pstack64 9190
9190:keepassx
#0 0x00007f9c71a3eb03 in /lib/x86_64-linux-gnu/libc-2.15.so: __GI___poll (/build/buildd/eglibc-2.15/io/../sysdeps/unix/sysv/linux/poll.c:87)
#1 0x00007f9c70c2a036 in /lib/x86_64-linux-gnu/libglib-2.0.so.0.3200.3: -
#2 0x00007f9c70c2a164 in /lib/x86_64-linux-gnu/libglib-2.0.so.0.3200.3: g_main_context_iteration
#3 0x00007f9c726cf3bf in /usr/lib/x86_64-linux-gnu/libQtCore.so.4.8.1: QEventDispatcherGlib::processEvents(QFlags)
#4 0x00007f9c72c6ad5e in /usr/lib/x86_64-linux-gnu/libQtGui.so.4.8.1: -
#5 0x00007f9c7269ec82 in /usr/lib/x86_64-linux-gnu/libQtCore.so.4.8.1: QEventLoop::processEvents(QFlags)
#6 0x00007f9c7269eed7 in /usr/lib/x86_64-linux-gnu/libQtCore.so.4.8.1: QEventLoop::exec(QFlags)
#7 0x00007f9c726a3f67 in /usr/lib/x86_64-linux-gnu/libQtCore.so.4.8.1: QCoreApplication::exec()
#8 0x000000000041be9e in /usr/bin/keepassx: -
#9 0x00007f9c7197976d in /lib/x86_64-linux-gnu/libc-2.15.so: __libc_start_main (/build/buildd/eglibc-2.15/csu/libc-start.c:258)
#10 0x000000000041caa1 in /usr/bin/keepassx: -
The keepassx binary is stripped, so we can't see its functions; dashes are printed instead. As you can see, C++ names are demangled. BTW, I was surprised how straightforward that is with c++filt, which comes with GNU binutils. To make a short story longer ( :-) ), I'll put a piece of the README here:

Benefits

  • It's easy to read :-)
  • It shows symbols much faster than `gdb -batch` since it performs a lazy lookup
  • It works well with both executables and shared objects
  • It falls back to a dynamic symbol lookup if nothing is found in the (debug) symbol table

Drawbacks

  • It strongly depends on GNU binutils and is therefore Linux-only
  • It doesn't support threads (even if you pick the right LWP)

Permissions to trace

Since libunwind uses ptrace(2) here, it's worth noting that recent Linux distributions forbid tracing "foreign" processes by default. For example, see /etc/sysctl.d/10-ptrace.conf in *Ubuntu, or simply run pstack64 with sudo.
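
A tool can check this restriction before it tries to attach. The sketch below is not part of pstack64. It assumes the kernel exposes the Yama setting in /proc/sys/kernel/yama/ptrace_scope (the value that 10-ptrace.conf configures on Ubuntu), and the helper names ptrace_scope_read and ptrace_scope_describe are made up for illustration.

```c
#include <stdio.h>

/* Hypothetical helper: human-readable meaning of a Yama ptrace_scope value. */
const char *ptrace_scope_describe(int scope)
{
    switch (scope) {
    case 0:  return "classic: any process of the same user may be traced";
    case 1:  return "restricted: only descendants may be traced";
    case 2:  return "admin-only: CAP_SYS_PTRACE is required";
    case 3:  return "no attach: attaching with ptrace is disabled";
    default: return "unknown";
    }
}

/* Hypothetical helper: read the current value from the given file.
 * Returns -1 if the file is missing (for example, a kernel without Yama)
 * or does not start with a number. */
int ptrace_scope_read(const char *path)
{
    int scope = -1;
    FILE *f = fopen(path, "r");

    if (!f)
        return -1;
    if (fscanf(f, "%d", &scope) != 1)
        scope = -1;
    fclose(f);
    return scope;
}
```

Calling ptrace_scope_read("/proc/sys/kernel/yama/ptrace_scope") and getting anything above 0 is a hint that attaching to an unrelated process will probably need sudo.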

Separated debug-info

It's worth mentioning that many Linux distributions provide so-called "separate debug info": copies of shared libraries, or even executables, that contain the DWARF records. Since DWARF debug info doesn't affect other sections and doesn't require any transformation of the executable code, it can easily be excluded (stripped) from an object file. Because size is critical, shared libraries are usually stripped after compilation, and /usr/lib/ contains nothing but the symbols for the dynamic loader (pstack64 extracts those too, anyway). But the original, unstripped version may be installed as well.

For example, Ubuntu has the libc6-dbg package, which provides /usr/lib/debug/lib/x86_64-linux-gnu/libc-2.15.so. The thing is, it can easily be used for lookups in place of the runtime libc, since all virtual addresses (or section offsets) are valid for the debug version too.

BTW, it's interesting to glance at this short introduction to DWARF.

Sunday, July 1, 2012

memcached: dump to disk

Preface


Memcached is a well-known, excellent in-memory store. But what if you need to dump its content to disk? This may be needed, for example, in the following case: you have memcached with a hit rate of around 50%, and your service's average load is about 70% in rush hour. If your cache server reboots, you lose the cache, the number of requests reaching the backend suddenly doubles, and your users suffer from timeouts until the cache warms up again.
Sound realistic to you? Then try this fork of memcached: memcached-dd.
As the README says, usage is straightforward: just add the `-F file' option to the command line. Memcached will read this `file' at startup and write to file.tmp when it receives SIGUSR2. Then (after a successful write and sync) it will rename file.tmp -> file, so `file' should never be seen truncated. For example:
$ memcached -F /tmp/memcache.dump -m 64 -p 11211 -l 127.0.0.1
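
The write, sync, rename sequence is what keeps `file' intact even if the process dies in the middle of a dump. Here is a minimal sketch of that pattern in C. It is not taken from the memcached-dd sources, and the function name save_atomically is made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write buf to "<path>.tmp", flush it to disk, then
 * rename it over <path>. Returns 0 on success, -1 on error. */
int save_atomically(const char *path, const void *buf, size_t len)
{
    char tmp[4096];
    int n = snprintf(tmp, sizeof tmp, "%s.tmp", path);

    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;                          /* path too long */

    FILE *f = fopen(tmp, "wb");
    if (!f)
        return -1;

    int ok = fwrite(buf, 1, len, f) == len  /* write everything, */
          && fflush(f) == 0                 /* push stdio buffers to the kernel, */
          && fsync(fileno(f)) == 0;         /* and kernel buffers to the disk */
    if (fclose(f) != 0)
        ok = 0;

    /* rename(2) replaces the old file in one step, so a reader sees either
     * the complete old dump or the complete new one */
    if (!ok || rename(tmp, path) != 0) {
        remove(tmp);
        return -1;
    }
    return 0;
}
```

If anything fails before the rename, the previous dump is left untouched and only the temporary file is removed.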

Some notes to be clear

  • The dump runs in a separate thread and doesn't block memcached itself
  • If you use a TTL for your data, restored data will have the same TTL as at the time of the dump
  • All expired and flushed (flush_all command) content is left behind
  • There is no built-in scheduling of dumps; that is better done with crontab and/or your own scripts
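
Since scheduling is left to you, all a cron job or a watchdog has to do is deliver SIGUSR2 to the running daemon. As a rough sketch, this is what that looks like from C when the PID comes from the file given with -P. The function name request_dump is made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical helper: read a PID from pidfile and send it SIGUSR2.
 * Returns 0 on success, -1 if the file is unreadable or kill(2) fails. */
int request_dump(const char *pidfile)
{
    long pid = 0;
    FILE *f = fopen(pidfile, "r");

    if (!f)
        return -1;
    int got = fscanf(f, "%ld", &pid);
    fclose(f);

    /* reject 0 and negative values: kill(2) would treat them as
     * "my process group" or "everything I may signal" */
    if (got != 1 || pid <= 0)
        return -1;

    return kill((pid_t)pid, SIGUSR2);
}
```

A call such as request_dump("/tmp/memcached.pid") only asks for the dump; the signal carries no reply, so the caller cannot tell from it when the dump has finished.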

Example

I'll show the usage with a Perl script. I assume you have downloaded and built memcached-dd; see the INSTALLATION section in the README if in doubt. This Perl script will load fake data into memcached:
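use strict;
use warnings;
use Cache::Memcached;

my $memd = Cache::Memcached->new({
    'servers' => [ "127.0.0.1:11211" ]
});

$| = 1;    # unbuffered stdout, so the progress counter shows up immediately

# store key_0 .. key_10000 (10001 items): 100 "x" characters plus the index
for (my $i = 0; $i <= 10000; $i++) {
    $memd->set( "key_$i", "x"x100 . " [$i]" );
    # my $val = $memd->get( "xkey_$i");

    if ($i % 1000 == 0) {
        print "\r$i...";
    }
}
print "\n";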
With this load.pl script in place, launch memcached-dd:
$ ./memcached -P /tmp/memcached.pid -F /tmp/memcached.dump -m 128 -p 11211 -l 127.0.0.1
# now load the data into memcached:
$ perl ./load.pl
# and now, assuming memcached has this data, dump it:
$ kill -USR2 `cat /tmp/memcached.pid`
1Mb dumped: 10001 items (0 expired during dump, 0 nuked by flush)
Moving temprorary /tmp/memcached.dump.tmp -> /tmp/memcached.dump

OK, now you have the file /tmp/memcached.dump with all 10001 records dumped. You can reload it at any time by launching memcached-dd with the same -F (assuming you have killed memcached):

$ ./memcached -P /tmp/memcached.pid -F /tmp/memcached.dump -m 128 -p 11211 -l 127.0.0.1
Now check the keys with netcat:
$ echo get key_1 | nc 127.0.0.1 11211
VALUE key_1 0 104
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx [1]
END

As you can see, the content was successfully restored from the dump. I hope this helps with your particular use case. If you have any problems with memcached-dd, feel free to post an issue.

Thursday, February 16, 2012

Nice C-static-assert

A buddy gave me this example of a static assert in C.
#define BUILD_BUG_ON_ZERO(e) (sizeof(struct { int:-!!(e); }))
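
How it works: `!!(e)` turns the condition into 0 or 1, and the minus sign makes that a bit-field width of 0 or -1. A negative width is a compile error, while a zero width is accepted and, with GCC (which allows a struct without named members as an extension), the sizeof evaluates to 0. So the macro can be added to any constant expression without changing its value. Below is a small sketch of how it might be used. MUST_BE_ARRAY, ARRAY_LEN, INT_IS_WIDE_ENOUGH and table_len are illustrative names, loosely modelled on the way the Linux kernel guards its array-size macro, and the code relies on GCC extensions.

```c
#include <stddef.h>

#define BUILD_BUG_ON_ZERO(e) (sizeof(struct { int:-!!(e); }))

/* Illustrative: break the build when "a" is a pointer rather than an array.
 * __builtin_types_compatible_p and __typeof__ are GCC extensions
 * (clang supports them too). */
#define MUST_BE_ARRAY(a) \
    BUILD_BUG_ON_ZERO(__builtin_types_compatible_p(__typeof__(a), \
                                                   __typeof__(&(a)[0])))

/* Illustrative: element count of a real array; adds 0 when the check passes. */
#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]) + MUST_BE_ARRAY(a))

/* Illustrative file-scope check: the condition describes the bug, so the
 * build breaks if int is narrower than 4 bytes. */
enum { INT_IS_WIDE_ENOUGH = BUILD_BUG_ON_ZERO(sizeof(int) < 4) };

size_t table_len(void)
{
    static const int table[] = { 2, 3, 5, 7, 11 };

    return ARRAY_LEN(table);    /* ARRAY_LEN(&table[0]) would not compile */
}
```

In code that can rely on C11, _Static_assert covers the statement-like cases; the macro form is still handy where an expression is required.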

Wednesday, February 8, 2012

Linux needs to reboot

A lock is not removed after the process has finished:
$ ls -i dummy_daemon.pid 
3146845 dummy_daemon.pid

$ cat /proc/locks  | grep --color 3146845
1: FLOCK  ADVISORY  WRITE 13676 08:01:3146845 0 EOF

$ ps -p 3146845
# NOTHING like no entry in /proc/3146845
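
A note for reading the /proc/locks line above: according to proc(5), the fifth field is the PID of the process that took the lock and the sixth is major:minor:inode. So 3146845 is the inode, and the PID to look up with ps would be 13676. An flock(2) lock also belongs to the open file description, so it may outlive that PID if, say, a forked child still holds the descriptor. Here is a small sketch that pulls both numbers out of such a line; parse_lock_line is a made-up name.

```c
#include <stdio.h>

/* Hypothetical helper: extract the owner PID and the inode from one line of
 * /proc/locks. Returns 0 on success, -1 if the line does not match. */
int parse_lock_line(const char *line, long *pid, unsigned long *inode)
{
    /* layout: ordinal: type advisory|mandatory access pid maj:min:inode start end
     * everything marked with '*' is parsed but not stored */
    int n = sscanf(line, "%*d: %*s %*s %*s %ld %*x:%*x:%lu", pid, inode);

    return n == 2 ? 0 : -1;
}
```

Feeding it the line shown above gives pid 13676 and inode 3146845.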

Trying to run the program again: it calls exit(1) with no message. Here is the backtrace:

#0  __GI_exit (status=1) at exit.c:100
#1  0x00007ffff660f314 in __libc_start_main (main=0x55cd31 <main(int, char**)>, argc=2, ubp_av=0x7fffffffe4e8, init=<optimized out>, fini=<optimized out>, 
    rtld_fini=<optimized out>, stack_end=0x7fffffffe4d8) at libc-start.c:258
#2  0x000000000055c9a9 in _start ()

Ha!
No stable software has ever been written :-)
Just reboot it.

Sunday, January 29, 2012

Slow reading from std::cin

I was into my third hour of optimizing a program. Everything went OK, and I finally checked the time(1) output: 9 seconds in userspace. Well done, nothing critical left to optimize. Then I ran it again, but in another mode: without a filename as a parameter (it's supposed to read from stdin in this case). This time time(1) showed me 21 s! What?! More than twice as slow as the previous run! Really, does it depend on stdin?! I also tried another variant to check (is it my system?):
$ cat big_file | time myprogram ... /dev/stdin
9.18user 0.11system 0:09.32elapsed 99%CPU (0avgtext+0avgdata 6608maxresident)k
0inputs+8624outputs (0major+471minor)pagefaults 0swaps
Hmm... now it's fine. But now the input is read through std::ifstream, not from std::cin. I checked my suspicion with google-perftools, and here is what I saw: nearly 50% of the time is spent in std::getline(). Here are the top 10 functions according to google-perftools:
     908  41.6%  41.6%      908  41.6% gogo::BlacklistFilter::exists <<< WORK
     340  15.6%  57.2%      357  16.4% _IO_getc
     271  12.4%  69.6%     1095  50.2% std::getline
     221  10.1%  79.7%      239  11.0% _IO_acquire_lock_fct
     172   7.9%  87.6%      193   8.8% _IO_ungetc
      54   2.5%  90.1%      287  13.2% __gnu_cxx::stdio_sync_filebuf::uflow
      48   2.2%  92.3%       48   2.2% std::__once_callable
      43   2.0%  94.3%       43   2.0% _IO_sputbackc
      37   1.7%  96.0%       37   1.7% std::_Rb_tree_black_count    <<< WORK
      27   1.2%  97.2%      275  12.6% __gnu_cxx::stdio_sync_filebuf::underflow
Fantastic! Barely 45% is related to the actual work; the rest is caused by std::getline. A strange distribution, especially the _IO_acquire_lock_fct function. The name is self-explanatory, so I easily found this method: ios_base::sync_with_stdio. Putting std::cin.sync_with_stdio(false); into the program, before any input is read, tamed it. The CPU time went back to normal and I'm happy again.
Note that this behavior isn't related to -pthread or any other compiler flag. Reading from std::cin simply becomes 30 to 40 times slower, whether you use std::getline or not.
Surely that's not a potion that makes every program faster, but I'll keep this std::ios weakness in mind.
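
The lock traffic in that profile has a counterpart in plain C. With synchronization on, the stream hands every character to stdio (hence _IO_getc and _IO_ungetc in the listing), and each of those calls takes the FILE lock. The C way of paying for the lock once rather than per character is flockfile(3) together with getc_unlocked(3). The sketch below only illustrates that idea and is not the fix used above; count_lines is a made-up name.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>

/* Hypothetical helper: count '\n' characters in a stream, locking it once
 * instead of once per getc() call. */
long count_lines(FILE *in)
{
    long lines = 0;
    int c;

    flockfile(in);                           /* take the FILE lock once */
    while ((c = getc_unlocked(in)) != EOF)   /* no per-character locking */
        if (c == '\n')
            lines++;
    funlockfile(in);

    return lines;
}
```

Calling count_lines(stdin) reads the whole input; replacing getc_unlocked with getc gives the per-character locking variant to compare against under time(1).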

Sunday, January 22, 2012

Swap words in text w/o additional memory

Yesterday I saw an interview problem from our C team:

Reverse the order of words in a text without using additional memory

For example, the string
"Fedora Project promotes internet freedom" should be turned into
"freedom internet promotes Project Fedora" [nice semantic palindrome, isn't it? :-)].

It doesn't look hard, but the only solution I found is to reverse the whole text and then reverse each word.

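#include <stdio.h>
#include <string.h>

/* reverse characters in range [beg; end) */
static inline void inv(char *beg, char *end)
{
    if (beg == end)     /* empty range, nothing to do */
        return;
    end--;
    while (beg < end) {
        char c = *beg;
        *beg++ = *end;
        *end-- = c;
    }
}

int main(int argc, char *argv[])
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s \"some text\"\n", argv[0]);
        return 1;
    }

    char *s = argv[1], *beg = s, *end = strchr(s, '\0');

    /* pass 1: reverse the whole string */
    inv(beg, end);

    /* pass 2: reverse each word back in place */
    while (*beg) {
        end = strchr(beg, ' ') ? : strchr(beg, '\0');
        inv(beg, end);
        if (*end == '\0')   /* last word: don't step past the terminator */
            break;
        beg = end + 1;
    }

    printf("%s\n", s);
    return 0;
}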
The problem is that we have to scan the string twice. Maybe there is a better solution (better as an algorithm, not necessarily faster)?
BTW, here is a nice construct that is used everywhere in the Linux kernel (a GNU C extension):
x = y ? : z; // same as "if(y) x = y; else x = z;"
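
One detail of the `y ? : z` shorthand that the if/else spelling hides: GCC evaluates the left operand only once, which matters when it has side effects. Here is a small sketch of the difference; next_id, pick_short and pick_long are made-up names, and the code needs a compiler that accepts this GNU extension (GCC or clang).

```c
/* Hypothetical function with a visible side effect: bumps the counter
 * and returns its new value. */
static int next_id(int *counter)
{
    return ++*counter;
}

/* GNU shorthand: next_id() runs once and its value is reused as the result. */
int pick_short(int *counter)
{
    return next_id(counter) ? : -1;
}

/* Standard spelling: next_id() runs twice, once for the test and once
 * for the result. */
int pick_long(int *counter)
{
    return next_id(counter) ? next_id(counter) : -1;
}
```

Starting from a counter of 0, pick_short returns 1 and leaves the counter at 1, while pick_long returns 2 and leaves it at 2.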

Monday, January 16, 2012

Long Live ... SSD!

I pretty much assume you're using Linux :-)

Some advice to make your SSD live longer

  • Enable passing the TRIM command from the filesystem to the disk firmware (ext4 has the 'discard' option, see man 8 mount)
  • Set the 'noatime' and 'nodiratime' options (again, see man 8 mount)
  • Raise /proc/sys/vm/dirty_writeback_centisecs to 6000 (60 seconds) so that the kernel flusher threads (pdflush in older kernels) write less often
Do you know any more?

Surely, do not apply this blindly: there are many explanations of why the OS has to treat solid-state drives differently from ordinary HDDs:
  1. Illustrated process of rewriting block
  2. wear-leveling, or how solid state drives (or even USB sticks) remap data blocks.
  3. How to configure TRIM in Ubuntu and other distros, with a little benchmarking

UP:

It's also worth changing the default I/O scheduler to "noop". This boosts synchronous operations when read requests are scattered. Many people think the gain comes from saving the CPU cycles burned by the "too smart" default schedulers, CFQ or deadline. But I think most of the effect is not in CPU time. Instead, it reduces the average time an I/O request waits before it is actually sent to the device, because the "noop" scheduler does not hold requests back for sorting (it is a plain FIFO queue that only merges adjacent requests).

I saw a big difference in the "await" column of iostat(1); according to the manpage it is "The average time (in milliseconds) for I/O requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them". For my workload it dropped by a factor of 8!

To switch your disk's scheduler to noop, run:

$ echo noop | sudo tee /sys/block/sda/queue/scheduler # my SSD is sda
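
To verify that the switch took effect, read the same sysfs file back: the kernel lists the available schedulers on one line and puts the active one in square brackets, for example "noop deadline [cfq]". Here is a small sketch of picking the active name out of such a line; active_scheduler is a made-up helper name.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy the active scheduler (the name in [brackets])
 * out of a line such as "noop deadline [cfq]".
 * Returns 0 on success, -1 if no bracketed name fits into out. */
int active_scheduler(const char *line, char *out, size_t outlen)
{
    const char *open = strchr(line, '[');
    const char *close = open ? strchr(open, ']') : NULL;

    if (!close)
        return -1;

    size_t len = (size_t)(close - open - 1);   /* characters between [ and ] */
    if (len + 1 > outlen)
        return -1;

    memcpy(out, open + 1, len);
    out[len] = '\0';
    return 0;
}
```

The line itself can be read with fgets() from /sys/block/sda/queue/scheduler, with sda replaced by your own device.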

And, certainly, we have to do this on every system boot. The most native way to do this in Ubuntu, as far as I found, is via sysfsutils:

$ sudo apt-get install sysfsutils

Then add the following line to /etc/sysfs.conf:

block/sda/queue/scheduler = noop