Monday, December 17, 2012

crxprof: handy profiler

Some weeks ago we faced a strange situation: the program that performs indexing for our search engine started to slow down badly, and it was very interesting to see what exactly was going on right then. We launched `gdb` and tried to `finish` particular stack frames, but saw nothing unusual: stack frames finished and new ones started, so there was nothing stalling the process. It was almost OK, just very slow, much slower than usual.
After that I tried to recall a profiler able to work with an already running executable. There are some (popular) Linux profilers:
  • gprof
  • Callgrind
  • OProfile
  • Google perftools

gprof

UNIX gprof is a pretty old profiler, written at Berkeley during performance work on BSD. It requires recompiling your source code to inject checkpoints at the beginning and at the end of every function, so your code ends up looking like this:
void fn()
{
  enter_fn();
  
  ... /* actual code of fn() */

  leave_fn();
}
Surely, the time difference between enter_fn() and leave_fn() is the time spent in fn(), and gprof knows exactly how many times fn() was called. But the drawbacks are obvious: it has to be integrated at compile time, and it adds appreciable overhead: the less code fn() contains, the larger the relative cost of the checkpointing. And of course it doesn't work with an already running process.

Callgrind

Callgrind is a part of Valgrind, a great instrumentation framework for building dynamic analysis tools. Callgrind profiles by instrumenting instructions such as function calls and returns. It slows the profiled program down significantly, 5x to 20x, and it is usually hard to use on big data sets, not to mention in production. But it has a simple call-graph format and there is a nice program to visualize it: KCachegrind.

OProfile

OProfile is a system-wide profiler for Linux, capable of profiling all running code at low overhead. Before Linux 2.6.31 it consisted of a kernel driver and a user-space daemon gathering sample data; now (since 0.9.8) it profiles via Linux Kernel Performance Events. System-wide profiling requires root privileges. OProfile is a sampling profiler (it gathers the program counter at a specific frequency). It is really cheap for a flat profile, but needs more work for a call graph (see the notes about unwinding below).

Google perftools

Google CPU profiler is a part of the Google perftools set, which also contains tcmalloc (an allocator designed specifically for multithreaded environments) and a heap checker. The profiler works by collecting samples using ITIMER_PROF as the timer. Using ITIMER_PROF means samples are collected only while the program is actually executing, because you are usually not interested in time spent in sleep(3) or epoll_wait(2).
Each time SIGPROF fires, it collects a backtrace using libunwind. After your program finishes successfully (via exit(3)), you get the raw profile data, which can be converted to many formats using google-pprof.
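To illustrate the mechanism, here is a minimal sketch (not the actual perftools code): the kernel delivers SIGPROF only while the process consumes CPU time, so a handler installed on it naturally samples only the "hot" parts of the program.

/* Minimal sketch of ITIMER_PROF-based sampling: SIGPROF is delivered
 * only while the process consumes CPU time, so sleeps are never sampled.
 * A real profiler would record a backtrace in the handler. */
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

static volatile sig_atomic_t samples;

static void on_sigprof(int sig)
{
    (void)sig;
    samples++;                      /* here perftools collects a backtrace */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigprof;
    sigaction(SIGPROF, &sa, NULL);

    /* 100 Hz timer that counts only CPU time of this process */
    struct itimerval tv = { { 0, 10000 }, { 0, 10000 } };
    setitimer(ITIMER_PROF, &tv, NULL);

    for (volatile long i = 0; i < 300000000L; i++)   /* burn some CPU */
        ;

    printf("collected %d samples\n", (int)samples);
    return 0;
}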
Google profiler, just like any other tool from perftools, can be used either by linking it in explicitly or at runtime via the LD_PRELOAD facility. So it can be used with any program, but it is still not suitable for already running ones due to its design.
There are some more disadvantages: google perftools doesn't survive fork(2), and your program must not finish abnormally (via a signal). That makes it hard to profile daemons: they are usually built around a master-workers scheme and assume an endless event loop.

Crxprof

crxprof is a simple profiler designed to profile already running programs. It collects call chains and can visualise them on request (ENTER) or after the traced program completes. It also saves the call graph in callgrind format, making it easy to examine in KCachegrind. It works extremely fast and doesn't require any additional commands to convert raw data, simply because it doesn't write any internal format :-).
It works mostly like the Google CPU profiler, but performs profiling externally via ptrace(2). Like the Google profiler, it uses libunwind to unwind the stack. To avoid post-processing the raw data (for example, the heavy addr2line(1) step the Google profiler relies on), it also uses libbfd to resolve symbols.
No special support is required: you can use crxprof with any program you are able to (s)trace.
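Conceptually, the sampling loop looks roughly like the sketch below (a simplified illustration, not the actual crxprof source: the real tool stays attached and unwinds the whole remote stack with libunwind, while here we only read the instruction pointer of the stopped process; x86-64 assumed).

/* Simplified external sampling via ptrace(2): stop the target,
 * read its registers, let it run again. The real crxprof unwinds
 * the full remote stack instead of just taking the IP. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/ptrace.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s pid\n", argv[0]);
        return 1;
    }
    pid_t pid = (pid_t)atoi(argv[1]);

    for (int i = 0; i < 100; i++) {            /* ~100 samples, 10 ms apart */
        ptrace(PTRACE_ATTACH, pid, 0, 0);      /* stop the target */
        waitpid(pid, NULL, 0);

        struct user_regs_struct regs;
        ptrace(PTRACE_GETREGS, pid, 0, &regs);
        printf("sample %3d: ip = %#llx\n", i, (unsigned long long)regs.rip);

        ptrace(PTRACE_DETACH, pid, 0, 0);      /* let it run again */
        usleep(10000);
    }
    return 0;
}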
You can download crxprof from GitHub. Since it was made for me and my colleagues, I suppose some features may be missing for your particular use case. Feel free to ask.

Building

To build crxprof, follow the usual Unix build sequence:
autoreconf -fiv
./configure
make
sudo make install
If you have libunwind installed in a non-standard place, point to it via:
./configure --with-libunwind=/path/to/libunwind
You may also skip installation since ./crxprof is the only file you need. I also recommend static linking, so you can copy this single file to "fresh" servers.

Profiling

To get the job done, launch crxprof like this:
crxprof pid
That's all! Press ENTER to print the profile, ^C to exit. crxprof will also exit (showing the profile) when the traced program dies.

Options

As with most UNIX programs, you can get actual help using
$ crxprof --help
But I'll post this usage() here anyway. It's very compact:
Usage: ./crxprof [options] pid
Options are:
 -t|--threshold N:  visualize nodes that takes at least N% of time (default: 5)
 -d|--dump FILE:    save callgrind dump to given FILE
 -f|--freq FREQ:    set profile frequency to FREQ Hz (default: 100)
 -m|--max-depth N:  show at most N levels while visualising (default: no limit)
 -r|--realtime:     use realtime profile instead of CPU
 -h|--help:         show this help

 --full-stack:      print full stack while visualising
 --print-symbols:   just print funcs and addrs (and quit)

Real example

To make a real but not complicated example, I will use this program.
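The test program itself is only linked from the original post; a hypothetical stand-in with the same call pattern (main() calling strong_function() and a(), and strong_function() calling a() too) could look roughly like this:

/* Hypothetical test program matching the profile below: strong_function()
 * burns about twice as much CPU itself as a single call to a(). */
static volatile unsigned long sink;

static void a(void)
{
    for (unsigned long i = 0; i < 400000000UL; i++)
        sink += i;
}

static void strong_function(void)
{
    for (unsigned long i = 0; i < 800000000UL; i++)   /* ~50% of total, self */
        sink += i;
    a();
}

int main(void)
{
    for (;;) {                 /* loop forever so crxprof can attach */
        strong_function();
        a();
    }
}

Just run crxprof, asking it to dump the call graph to a file (assuming 32366 is the PID of the test program):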
$ crxprof --dump /tmp/test.calls 32366
Reading symbols (list of function)
reading symbols from /home/dkrot/test/a.out (exe)
reading symbols from /lib/x86_64-linux-gnu/libc-2.15.so (dynlib)
reading symbols from /lib/x86_64-linux-gnu/ld-2.15.so (dynlib)
Attaching to process: 32366
Starting profile (interval 10ms)
Press ENTER to show profile, ^C to quit
2248 snapshot interrputs got (0 dropped)
main (100% | 0% self)
 \_ strong_function (75% | 49% self)
   \_ a (25% | 25% self)
 \_ a (24% | 24% self)
Profile saved to /tmp/test.calls (Callgrind format)
^C--- Exit since ^C pressed



Using this visualisation we can easily see what's going on:
  • main() calls strong_function() (and this is the most CPU-consuming path)
  • strong_function() calls a()
  • main() also calls a()
  • strong_function() consumes half of the CPU time itself
  • a() consumes the rest of the CPU time, being called from 2 different places
  • main() doesn't consume anything by itself

This visualisation is built on the principle of "biggest subtrees first", so it's handy to use crxprof right in the terminal. But for a GUI representation and deeper analysis you can use the saved dump file (/tmp/test.calls):

$ kcachegrind /tmp/test.calls
And get something like this picture. KCachegrind summarises the information and shows that a() consumes 50% self time. This differs from the terminal visualisation: I found separate accounting of each call path more appropriate for compact text output.



Unwinding stack

Stack unwinding is needed to collect backtraces, and without backtraces it's impossible to show a call graph. A flat profile is usually not that interesting: you can't eliminate all the malloc calls even if they take significant time, and they aren't what you care about anyway. Usually you want to know "who called malloc" so you can work around that particular call chain. That's why a flat profile is of limited use.

Pretty old mechanism

Basically, a stack frame consists of the arguments, the instruction pointer to return to (the caller's IP) and local variables. To make addressing easier, a special register, BP (base pointer), is used.
In this scheme it's easy to unwind the stack using the previous base pointer saved on the stack. The problem is that setting up a stack frame is sometimes wasteful: if your function consists of just 10 instructions, the overhead is significant. Therefore, some distributions compile their core libraries without frame pointers (gcc -fomit-frame-pointer); for example, eglibc in the Debian distribution is built without frame pointers. Local variables and parameters can still be accessed via the stack pointer (SP), freeing one more register for general use.
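When frame pointers are present, taking a backtrace is just walking the saved-BP chain. A minimal sketch (x86-64, GCC builtins; it only works for code compiled with frame pointers, e.g. -O0 -fno-omit-frame-pointer):

/* Walk the frame-pointer chain: each frame starts with the saved
 * caller's BP, followed by the return address. */
#include <stdio.h>

void backtrace_fp(void)
{
    void **bp = __builtin_frame_address(0);

    while (bp) {
        void *ret = bp[1];              /* caller's instruction pointer */
        if (!ret)
            break;
        printf("called from %p\n", ret);
        bp = (void **)bp[0];            /* previous frame's base pointer */
    }
}

void f2(void) { backtrace_fp(); }
void f1(void) { f2(); }
int  main(void) { f1(); return 0; }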
But the interesting thing is that frame pointers themselves are not used by debuggers: they use exception handling frames instead.

Exception handling frames

Exception handling frames were introduced for languages that support exceptions, such as C++. They consist of records describing where the return IP and saved registers are located relative to the current position. Each record covers a specific region of code and says "here is where the stack frame is when you are here". So, to extract the caller's IP you have to unpack these rather unusual records depending on where exactly you are now (the current IP). It's one of the reasons why exception handling in C++ is slow: it should be used strictly for exceptional situations, not as something that occurs 100000 times per second.
On Linux, exception frames are represented in the ELF file by two sections:
  • .eh_frame: the exception frames themselves
  • .eh_frame_hdr: an index over .eh_frame suitable for fast lookup
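libunwind, which both the Google profiler and crxprof rely on, consumes exactly this information. A minimal in-process backtrace looks like this (assuming libunwind is installed; build with -lunwind):

/* In-process backtrace via libunwind: unwinding is driven by the
 * .eh_frame data, so it works even without frame pointers. */
#define UNW_LOCAL_ONLY
#include <libunwind.h>
#include <stdio.h>

void print_backtrace(void)
{
    unw_context_t ctx;
    unw_cursor_t cursor;
    unw_getcontext(&ctx);
    unw_init_local(&cursor, &ctx);

    while (unw_step(&cursor) > 0) {
        unw_word_t ip, off;
        char name[128] = "?";
        unw_get_reg(&cursor, UNW_REG_IP, &ip);
        unw_get_proc_name(&cursor, name, sizeof(name), &off);
        printf("%#lx  %s+%#lx\n", (unsigned long)ip, name, (unsigned long)off);
    }
}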

Thursday, December 6, 2012

HBase: finding balance

The disadvantage of abstractions

There is an interesting thing about abstractions. They are good for making independent parts of a system, and it's fine while everything works as you expect. But suddenly it breaks, and you start to realize you have to dive into the problem to formulate your expectations precisely. Because, out of the blue, you have to split your problem into "micro-expectations": expectations of a lower level than just "make this world happy". Sometimes abstractions just don't work. Sometimes you have to unfold the black box and start hurling bricks.

Hey, this is not a lecture on gnosiology! It's just a discourse about Java. And about Hadoop. Adherents of Java are always trying to create abstract things with the statement "it should just work". But when it breaks, you are left with huge expectations and no idea how to deal with them. And the Java style makes the situation worse: it's just harder to "unfold" the black box because of the sophistication of its creator. All the details are carefully hidden; otherwise the books would not look so polished, and the idea itself would not seem so clean.

Still, it's not a lecture :-) Just look at the following problem in HBase.

Data locality in HBase

The whole idea of map-reduce (in terms of performance) is data locality and independence: jobs work with their own data. You gain maximum performance if your data is spread evenly across the cluster and each job works with local data, because access to local HDDs is much cheaper than a remote HDD plus network transmission.

The strong side of abstraction is that HBase itself is just an idea built upon HDFS, and therefore it has to play by HDFS' rules.

When HBase starts, its regions are balanced across the region servers (RS). But how does data locality work in this case? Regions are just a couple of files in HDFS, and HBase has no secret interfaces to HDFS. HDFS simply follows these rules when creating new blocks:

  • Try to put the first replica on the requesting server
  • Put the second replica as near as possible: in the same cluster, even in the same rack
  • Put the third replica as far as possible, just for backup: in another rack, or even another cluster

And it really works. But then you restart your HBase cluster: because of an error, or just as a precaution, it doesn't matter. Anyway, your cluster starts to work slower than before. Why? There is a "law of Windows": everything works perfectly after a restart/reinstall! Why doesn't portable Java follow this rule?!

The problem is that by default HBase doesn't remember the previous block map. It simply starts with a completely different distribution of regions; the RSes know nothing about the previous state, and you can see higher network load in your monitoring. Hadoop rearranges your blocks slowly. So slowly that it's better to rewrite them all to recreate data locality artificially.

I really don't know how to force HBase to remember this state. But here is a simple script to measure locality. Just launch

./count_locality.sh tablename

Its output is the data locality of each RS in your cluster. Locality of 98%-100% is perfect; locality lower than 50 percent is certainly bad.