tag:blogger.com,1999:blog-8978810755403384402024-03-13T19:42:43.029+02:00dkrotx-prgdkrotxhttp://www.blogger.com/profile/08519715678786396335noreply@blogger.comBlogger16125tag:blogger.com,1999:blog-897881075540338440.post-9963253398609318092013-03-23T17:32:00.000+03:002013-03-24T19:37:57.243+03:00mkstatic: join binary and its libraries together<html>
<link href="http://alexgorbatchev.com/pub/sh/current/styles/shCore.css" rel="stylesheet" type="text/css"></link>
<link href="http://alexgorbatchev.com/pub/sh/current/styles/shThemeEclipse.css" rel="stylesheet" type="text/css"></link>
<script src="http://alexgorbatchev.com/pub/sh/current/scripts/shCore.js" type="text/javascript"></script>
<script src="http://alexgorbatchev.com/pub/sh/current/scripts/shBrushCpp.js" type="text/javascript"></script>
<script src="http://alexgorbatchev.com/pub/sh/current/scripts/shBrushPlain.js" type="text/javascript"></script>
<script src="http://alexgorbatchev.com/pub/sh/current/scripts/shBrushBash.js" type="text/javascript"></script>
<script language="javascript">
SyntaxHighlighter.config.bloggerMode = true;
SyntaxHighlighter.all();
</script>
<body>
<p>You have just built your program on your own notebook, and then noticed that many of its libraries are missing on the other host: the host where you want to try your work out. Nothing unusual.
The first approach is to install them (lucky you if you have root access). Another approach is to recompile your program, linking all the libraries statically; that works only if you have static counterparts for your dynamic libraries. So it doesn't work everywhere, especially not on a production server.</p>
<p>But what if you just want to launch your program on another machine and don't want to bother with the tons of libraries it depends on? To do this job quickly you may use the <a href="https://github.com/dkrotx/mkstatic">mkstatic</a> Perl script. It creates a .static package which contains your binary and all its libraries.</p>
<h1>No magic, again :-(</h1>
<p>Surely, it doesn't do anything you can't do manually (if you are familiar with <a href="http://linux.die.net/man/8/ld-linux">ld.so</a>). But I think you wouldn't do this job as accurately. So, here is what <b>mkstatic</b> does for you:
<ul>
<li>it collects all dependencies of your binary file (and remembers symlinks to libraries)</li>
<li>it creates a .tgz archive which is embedded in a shell script. This archive includes the binary, its libraries and bootstrapping code</li>
<li>you may use binary.static as a usual binary file</li>
</ul>
I believe the last point is the most important, because all the mess is hidden from you: it works just like your original binary, with exactly the same usage, and requires nothing from the target host.</p>
<h2>Known limitations</h2>
<p>A keen reader may guess: "Hey, this won't work for all binaries!" Right: you can't [easily] copy your Chromium distribution to an empty machine this way, simply because it depends not only on libraries but also on a lot of data (drivers for your Xorg, font configs), and mkstatic doesn't know anything about them. But you can use mkstatic, for instance, with Midnight Commander :-). Or with any executable of yours which uses "data-free" libraries (libogg, libboost, libstdc++, etc).</p>
<h1>Let's test it</h1>
<p>Surely, it's better to test mkstatic on two machines: one which has the whole bunch of libraries, and a fresh one. But I'll show you how I tested this thing.</p>
First of all, you have to build the .static package. Let's use the Midnight Commander binary as an example:
<pre class="brush:bash; gutter: false">$ ./mkstatic -o /tmp/mc.static `which mc`
executable package is ready: /tmp/mc.static</pre>
I'm using xubuntu-12.04 (Precise Pangolin). Like any Debian-like distribution, it provides the <a href="http://wiki.debian.org/Debootstrap">debootstrap</a> utility. So, launch:
<pre class="brush:bash; gutter: false">$ sudo debootstrap precise precise-chroot http://mirror.yandex.ru/ubuntu/
$ sudo cp /tmp/mc.static precise-chroot/tmp/
$ sudo chroot precise-chroot /bin/bash # now you're in test environment
# /tmp/mc.static
</pre>
<p>Voilà! Midnight Commander is working in your chroot environment, though you don't have <i>libgpm.so</i> there. You may say that Midnight Commander is pretty simple. Surely! But you may use mkstatic with much heavier binaries like mencoder, which requires about 100 libraries. Or even Skype! Any program consisting of a single executable binary is mkstatic-friendly. Try it!</p>
<p>As usual, there is a manual page in the package. See <b>mkstatic --man</b> for details.</p>
P.S. If you are just interested in the self-extracting archive approach, have a look at <a href="http://megastep.org/makeself/">makeself</a>. It's widely used for binary installers in the Unix world (Nvidia drivers, VirtualBox, etc).
</body>
</html>dkrotxhttp://www.blogger.com/profile/08519715678786396335noreply@blogger.com0tag:blogger.com,1999:blog-897881075540338440.post-25067884034807262422012-12-17T00:35:00.000+03:002013-03-25T14:32:57.383+03:00crxprof: handy profiler<html>
<head>
<link href="http://alexgorbatchev.com/pub/sh/current/styles/shCore.css" rel="stylesheet" type="text/css"></link>
<link href="http://alexgorbatchev.com/pub/sh/current/styles/shThemeEclipse.css" rel="stylesheet" type="text/css"></link>
<script src="http://alexgorbatchev.com/pub/sh/current/scripts/shCore.js" type="text/javascript"></script>
<script src="http://alexgorbatchev.com/pub/sh/current/scripts/shBrushCpp.js" type="text/javascript"></script>
<script src="http://alexgorbatchev.com/pub/sh/current/scripts/shBrushPlain.js" type="text/javascript"></script>
<script src="http://alexgorbatchev.com/pub/sh/current/scripts/shBrushBash.js" type="text/javascript"></script>
<script language="javascript">
SyntaxHighlighter.config.bloggerMode = true;
SyntaxHighlighter.config.clipboardSwf = 'http://alexgorbatchev.com/pub/sh/current/scripts/clipboard.swf';
SyntaxHighlighter.all();
</script>
</head>
<body>
A few weeks ago we faced a strange situation: the program that performs indexing for our search engine started to lag, and we wanted to know what exactly was going on <i>right now</i>. We launched `gdb` and tried to `finish` a particular stack frame, but saw nothing unusual: stack frames finished and started again. So nothing had stalled the process. It was almost OK, just very slow, much slower than usual.<br />
After that I tried to recall a profiler that is able to work with an <i>already launched</i> executable. There are several popular Linux profilers:
<ul>
<li>gprof
</li>
<li>Callgrind
</li>
<li>OProfile
</li>
<li>Google perftools
</li>
</ul>
<h2>
gprof</h2>
gprof is a pretty old UNIX profiler, written at Berkeley for BSD in the early 1980s by Susan Graham, Peter Kessler and Marshall McKusick. It requires recompiling your source code to inject checkpoints into every function. Conceptually, your code will look as follows:
<pre class="brush:c; gutter: false">void fn()
{
enter_fn();
... /* actual code of fn() */
leave_fn();
}
</pre>
Surely, the time difference between enter_fn() and leave_fn() is the time spent in fn(). And gprof knows exactly <i>how many</i> times you called fn(). But the drawbacks are obvious: it has to be integrated at compile time, and it adds appreciable overhead: the less work your fn() does, the larger the share taken by checkpointing. And surely it doesn't work with an already launched process.<br />
<h2>
Callgrind</h2>
Callgrind is a part of Valgrind, a great instrumentation framework for building dynamic analysis tools. Callgrind profiles by instrumenting instructions such as function calls and returns. It slows down the profiled program significantly, by 5x to 20x. So it's usually hard to use on big data sets, not to mention in production. But it has a simple <a href="http://valgrind.org/docs/manual/cl-format.html">format</a> for the call graph, and there is a nice program to visualize it: KCachegrind.<br />
<h2>
OProfile</h2>
OProfile is a system-wide profiler for Linux systems, capable of profiling all running code at low overhead. Before Linux 2.6.31 it consisted of a kernel driver and a user-space daemon for gathering sample data. Now (since OProfile 0.9.8) it profiles via Linux Kernel Performance Events. System-wide profiling requires root privileges. OProfile is a sampling profiler (it samples the program counter at a specific frequency). It is really cheap for a flat profile, but costs more for a call graph (see <a href="#unwinding">notes about unwinding</a>).<br />
<h2>
Google perftools</h2>
<a href="http://gperftools.googlecode.com/svn/trunk/doc/cpuprofile.html">Google profiler</a> is a part of the Google perftools set. The set contains tcmalloc (an allocator designed specially for multithreaded environments), a heap checker and a CPU profiler. The CPU profiler works by collecting samples using ITIMER_PROF as the timer. Using <a href="http://linux.die.net/man/2/setitimer">ITIMER_PROF</a> makes it possible to collect samples only while the process is actually executing, because usually you are not interested in time spent in sleep(3) or epoll_wait(2).<br />
Each time SIGPROF occurs, it collects a backtrace using <a href="http://www.nongnu.org/libunwind/">libunwind</a>. After your program finishes successfully (via exit(3)), you get the raw profile data, which can be converted to many formats using google-pprof.<br />
Google profiler, just like any other tool from perftools, can be used either explicitly linked or at runtime via the LD_PRELOAD facility. So it can be used for any program, but it's still not suitable for already launched ones due to its design.<br />
There are some more disadvantages: Google perftools doesn't follow fork(2), and your program must not terminate abnormally (via a signal). That makes it hard to profile daemons: they are usually built upon a master-workers scheme and assume an endless event loop.<br />
<br />
<h1>
Crxprof</h1>
crxprof is a simple profiler designed to profile already launched programs. It collects call chains and can visualise them on request (ENTER) or after the traced program completes. It also saves the call graph in callgrind format, making it easy to examine with KCachegrind. It works <i>extremely fast</i> and doesn't require any additional commands to convert raw data. Simply because it doesn't write any internal format :-).<br />
It works mostly like the Google CPU profiler, but performs profiling externally via <a href="http://man7.org/linux/man-pages/man2/ptrace.2.html">ptrace(2)</a>. Like the Google profiler, it uses libunwind to unwind the stack. To avoid some post-processing of raw data (for example, the heavy <a href="http://linux.die.net/man/1/addr2line">addr2line(1)</a> step the Google profiler needs) it also uses <a href="http://en.wikipedia.org/wiki/Binary_File_Descriptor_library">libbfd</a>.<br />
No special support is required: you can use crxprof with any program you are able to (s)trace.<br />
<a href="https://github.com/dkrotx/crxprof">You can download crxprof from github</a>. Since it's been made for me and my colleagues, I suppose there may be some features missing for your particular use-case. Feel free to ask.<br />
<h2>
Building</h2>
To build crxprof you may follow the usual Unix build sequence:
<pre class="brush:bash; gutter: false">
autoreconf -fiv
./configure
make
sudo make install
</pre>
If you have libunwind installed in a non-standard location, specify it via:<br />
<pre class="brush:bash; gutter: false">./configure --with-libunwind=/path/to/libunwind
</pre>
You may also skip the installation, since ./crxprof is the only file you need. I also recommend static linkage, so that you can copy this file to "fresh" servers.
<h2>
Profiling</h2>
To get the job done, launch crxprof like this:
<pre class="brush:bash; gutter: false">crxprof pid
</pre>
That's all! Press ENTER to print the profile, ^C to exit. crxprof will also exit (showing profile info) when the program dies.
<h2>Options</h2>
As with most UNIX programs, you can get up-to-date help using
<pre class="brush:bash; gutter: false">$ crxprof --help</pre>
But I'll post this usage() here anyway. It's very compact:
<pre class="brush:plain; gutter: false">
Usage: ./crxprof [options] pid
Options are:
-t|--threshold N: visualize nodes that takes at least N% of time (default: 5)
-d|--dump FILE: save callgrind dump to given FILE
-f|--freq FREQ: set profile frequency to FREQ Hz (default: 100)
-m|--max-depth N: show at most N levels while visualising (default: no limit)
-r|--realtime: use realtime profile instead of CPU
-h|--help: show this help
--full-stack: print full stack while visualising
--print-symbols: just print funcs and addrs (and quit)
</pre>
<h2>Real example</h2>
For a real but uncomplicated example, I will use <a href="https://docs.google.com/open?id=0B_-I_KI_Fo89bUV3MnZKeUhGX3M">this program</a>. Just run crxprof, asking it to dump the call graph to a file. (Assuming 32366 is the PID of the test program.)
<pre>
$ crxprof --dump /tmp/test.calls 32366
<font color="#ff7070">Reading symbols (list of function)
reading symbols from /home/dkrot/test/a.out (exe)
reading symbols from /lib/x86_64-linux-gnu/libc-2.15.so (dynlib)
reading symbols from /lib/x86_64-linux-gnu/ld-2.15.so (dynlib)
Attaching to process: 32366
Starting profile (interval 10ms)
Press ENTER to show profile, ^C to quit</font>
<font color="#ff7070">2248 snapshot interrputs got (0 dropped)</font>
main (100% | 0% self)
\_ strong_function (75% | 49% self)
\_ a (25% | 25% self)
\_ a (24% | 24% self)
Profile saved to /tmp/test.calls (Callgrind format)
^C--- Exit since ^C pressed
</pre>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/--jcQ8evK-i4/UM4h8Iaq5cI/AAAAAAAAAH0/DC21_kTEwgw/s1600/profile.png" imageanchor="1" style="clear:right; float:right; margin-left:1em; margin-bottom:1em"><img border="0" height="320" width="313" src="http://3.bp.blogspot.com/--jcQ8evK-i4/UM4h8Iaq5cI/AAAAAAAAAH0/DC21_kTEwgw/s320/profile.png" /></a></div>
<br /><br /><br />
Using this visualisation we can easily see what's going on:
<ul>
<li><b>main()</b> calls <b>strong_function()</b> (and this is the most expensive path)
<li><b>strong_function()</b> calls <b>a()</b>
<li><b>main()</b> also calls <b>a()</b>
<li><b>strong_function()</b> consumes half of the CPU time itself
<li><b>a()</b> consumes the rest of the CPU time, being called from 2 different places
<li><b>main()</b> doesn't consume anything by itself
</ul>
<p>
This visualisation follows the "Biggest Subtrees First" principle, so it's handy to use crxprof in a terminal. But for a GUI representation and deeper analysis you can use the saved dump file (/tmp/test.calls):
<pre>
$ kcachegrind /tmp/test.calls
</pre>
And get something like this picture. KCachegrind summarises the information and shows that <b>a()</b> consumes 50% self-time. This differs from the terminal visualisation: I found separate accounting more appropriate for compact text output.</p>
<br /><br />
<a name="unwinding"></a>
<h1>Unwinding stack</h1>
Stack unwinding is needed to collect a backtrace. Without a backtrace it's impossible to show a call graph. And usually a flat profile is not that interesting: you can't eliminate all malloc-s if they take significant time, and that's not what you are interested in anyway. Usually you want to know "who called malloc" in order to work on that particular call chain. That's why a flat profile is mostly useless.
<h2>
Pretty old mechanism</h2>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-4KTplwKTWhc/UM4-VrLCFWI/AAAAAAAAAII/IMBlLBTPooY/s1600/Screen%2BShot%2B2012-12-17%2Bat%2B1.30.48%2BAM.png" imageanchor="1" style="clear:right; float:right; margin-left:1em; margin-bottom:1em"><img border="0" height="224" width="320" src="http://1.bp.blogspot.com/-4KTplwKTWhc/UM4-VrLCFWI/AAAAAAAAAII/IMBlLBTPooY/s320/Screen%2BShot%2B2012-12-17%2Bat%2B1.30.48%2BAM.png" /></a></div>
Basically, a stack frame consists of arguments, the instruction pointer to return to (caller IP) and local variables. To make addressing easier, a special register, BP (base pointer), is used.<br />
In this scheme it's easy to unwind the stack using the previous base pointer saved on the stack. But the problem is that setting up a stack frame is sometimes wasteful: if your function consists of just 10 instructions, the overhead will be significant. Therefore some distributions compile their core libraries without a frame pointer (gcc -fomit-frame-pointer). Local variables and parameters can still be accessed via the stack pointer (SP), which frees one more register for general use. For example, eglibc in the Debian distribution is built without frame pointers.<br />
<br />
But the interesting thing is that frame pointers themselves are not used by debuggers: they use exception handling frames.<br />
<h2>
Exception handling frames</h2>
<a href="http://refspecs.linuxfoundation.org/LSB_3.0.0/LSB-Core-generic/LSB-Core-generic/ehframechpt.html">Exception handling frames</a> were introduced for languages that support exceptions, such as C++. They consist of records describing the relative positions of the IP and parameters. Each record covers a specific region of code and tells "where the stack frame is located when you are here". So, to extract the caller's IP you have to unpack these unusual records <i>depending on where exactly you are now</i> (the current IP). That's one of the reasons why exception handling in C++ is slow: it is meant to be used for exceptional cases, not for something that occurs 100000 times per second.<br />
On Linux, exception frames are represented within an ELF file by these sections:
<br />
<ul>
<li><b>.eh_frame</b>: the exception frames themselves
</li>
<li><b>.eh_frame_hdr</b>: index over .eh_frame suitable for lookup
</li>
</ul>
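<p>For a quick feel of table-based unwinding inside your own process, glibc offers backtrace(3). A minimal sketch, assuming glibc on Linux, where backtrace() is typically implemented on top of the unwinder that reads these tables; the function names <b>stack_depth</b>, <b>depth_via_inner</b>, <b>depth_via_outer</b> and <b>print_backtrace</b> are made up for the example. Unwinding a <i>remote</i> process, as crxprof does, needs libunwind and ptrace and is not shown here.</p>

```c
#include <execinfo.h>

#define MAX_FRAMES 64

/* Returns the number of return addresses found on the current call stack. */
__attribute__((noinline)) int stack_depth(void)
{
    void *frames[MAX_FRAMES];
    return backtrace(frames, MAX_FRAMES);
}

/* Two extra call levels; the volatile temporaries keep the compiler
 * from turning these calls into tail calls (which would hide the frames). */
__attribute__((noinline)) int depth_via_inner(void)
{
    volatile int d = stack_depth();
    return d;
}

__attribute__((noinline)) int depth_via_outer(void)
{
    volatile int d = depth_via_inner();
    return d;
}

/* Prints the raw backtrace (module and address per frame) to stderr. */
void print_backtrace(void)
{
    void *frames[MAX_FRAMES];
    int n = backtrace(frames, MAX_FRAMES);
    backtrace_symbols_fd(frames, n, 2); /* 2 = stderr */
}
```

<p>Calling depth_via_outer() should report a deeper stack than calling stack_depth() directly from the same place, no matter whether the code was compiled with frame pointers or not.</p>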
</body>dkrotxhttp://www.blogger.com/profile/08519715678786396335noreply@blogger.com1tag:blogger.com,1999:blog-897881075540338440.post-43946013945878590022012-12-06T00:13:00.001+03:002012-12-06T00:13:25.065+03:00HBase: finding balance <head>
<!-- SYNTAX HIGHLIGHTER BEGINS -->
<link href='http://alexgorbatchev.com/pub/sh/current/styles/shCore.css' rel='stylesheet' type='text/css'/>
<link href='http://alexgorbatchev.com/pub/sh/current/styles/shThemeEclipse.css' rel='stylesheet' type='text/css'/>
<script src='http://alexgorbatchev.com/pub/sh/current/scripts/shCore.js' type='text/javascript'></script>
<script src='http://alexgorbatchev.com/pub/sh/current/scripts/shBrushPerl.js' type='text/javascript'></script>
<script src='http://alexgorbatchev.com/pub/sh/current/scripts/shBrushBash.js' type='text/javascript'></script>
<script language='javascript'>
SyntaxHighlighter.config.bloggerMode = true;
SyntaxHighlighter.config.clipboardSwf = 'http://alexgorbatchev.com/pub/sh/current/scripts/clipboard.swf';
SyntaxHighlighter.all();
</script>
<!-- SYNTAX HIGHLIGHTER ENDS -->
</head>
<h1>The disadvantage of abstractions</h1>
<p>Here is the interesting thing about abstractions. It's good to build a system from independent parts, and it's fine while it works as you expect. But suddenly it breaks, and you start to realize that you have to dive into the problem to formulate your expectations precisely. Because, out of the blue, you have to split your problem to get "micro-expectations": expectations of a lower level than just "make this world happy". Sometimes abstractions just don't work. Sometimes you have to open the black box and start to hurl bricks.</p>
<p>Hey, it's not a lecture on gnosiology! It's just a discourse about Java. And about Hadoop. Adherents of Java are always trying to create abstract things with the statement "it should just work". But when it breaks, you are left with huge expectations and no idea how to deal with it. And the Java style worsens the situation: it's just harder to open this black box because of the sophistication of its creator. All details are carefully hidden. Otherwise the books wouldn't be so clear, and the idea itself wouldn't look so clean.</p>
<p>Still, it's not a lecture :-) Just look at the following problem in HBase</p>
<h1>Data locality in HBase</h1>
<p>The whole idea of map-reduce (in terms of performance) is data locality and independence. Jobs work with their own data. You gain maximum performance if your data is spread evenly across your cluster. Each job works with local data, 'cause access to local HDDs is much cheaper than a remote HDD plus network transmission.</p>
<p>The strong side of the abstraction is that HBase itself is just an idea built upon HDFS. And therefore it has to play by HDFS' rules.</p>
<p>When HBase starts, its regions are balanced across the region servers (RS). But how does data locality work in this case? Regions are just a bunch of files in HDFS, and HBase has no secret interfaces to HDFS. HDFS simply applies this rule while creating new blocks:</p>
<UL>
<li>Try to put the first copy of a block onto the requesting server
<li>Put the second copy as near as possible: in the same cluster, even in the same rack
<li>Put the third copy as far as possible, just for backup: in another rack, or even another cluster
</UL>
<p>And it really works. But then you restart your HBase cluster. Because of an error, or just for prevention, it doesn't matter! Anyway, your cluster starts to work slower than before. Why? It's the "law of Windows": everything works perfectly after a restart/re-install! Why doesn't portable Java follow this rule?!</p>
<p>The problem is: by default HBase doesn't store the block map. It simply starts with a completely different distribution of regions. RSes have no idea about the previous state. And you can see a higher network load in your monitoring. Hadoop rearranges your blocks slowly. So slowly that it's better to rewrite them all to recreate data locality artificially.</p>
<p>I really don't know how to force HBase to remember this state. But <a href="https://github.com/dkrotx/hbase_locality">here is a simple script</a> to measure locality. Just launch
<pre class="brush:bash; gutter: false">
./count_locality.sh tablename
</pre>
<p>Its output is the data locality of each RS in your cluster. Locality of 98% - 100% is perfect. Locality lower than 50 percent is certainly bad.</p>dkrotxhttp://www.blogger.com/profile/08519715678786396335noreply@blogger.com0tag:blogger.com,1999:blog-897881075540338440.post-89336399600084477872012-08-12T17:12:00.002+03:002012-08-12T19:58:27.876+03:00Speedup file reading on linux<html>
<head>
<!-- SYNTAX HIGHLIGHTER BEGINS -->
<link href='http://alexgorbatchev.com/pub/sh/current/styles/shCore.css' rel='stylesheet' type='text/css'/>
<link href='http://alexgorbatchev.com/pub/sh/current/styles/shThemeEclipse.css' rel='stylesheet' type='text/css'/>
<script src='http://alexgorbatchev.com/pub/sh/current/scripts/shCore.js' type='text/javascript'></script>
<script src='http://alexgorbatchev.com/pub/sh/current/scripts/shBrushCpp.js' type='text/javascript'></script>
<script src='http://alexgorbatchev.com/pub/sh/current/scripts/shBrushPlain.js' type='text/javascript'></script>
<script src='http://alexgorbatchev.com/pub/sh/current/scripts/shBrushBash.js' type='text/javascript'></script>
<script language='javascript'>
SyntaxHighlighter.config.bloggerMode = true;
SyntaxHighlighter.defaults['gutter'] = false;
SyntaxHighlighter.defaults['toolbar'] = false;
SyntaxHighlighter.config.clipboardSwf = 'http://alexgorbatchev.com/pub/sh/current/scripts/clipboard.swf';
SyntaxHighlighter.all();
</script>
<!-- SYNTAX HIGHLIGHTER ENDS -->
</head>
<body>
<p>(Actually, this is a pretty old post from my previous address.)</p>
<p>It's about how to speed up reading a pile of files from disk drives. Evidently, such an operation is needed quite often: parsing a bunch of files, copying them over a fast Ethernet connection, or moving them from one partition to another.</p>
<p>There are several things to speed up, and I'm sure you know about the dependence on I/O buffer size, but one of the main gains is achieved by "prereading" the data and doing it in the right order, things you usually can't control from userspace. Since the main obstacle while reading files is non-linear movement of the disk heads, we should read in an order as close to the native physical one as possible.</p>
<p>This native order may be retrieved by calling <b>ioctl(FIBMAP)</b> on an open file descriptor, but there are some limits: the third argument of the `ioctl' call is a pointer to an integer (the logical block on input, translated to the physical block on output), so obviously the range of physical block numbers that can be returned is not very large. It may hit the limit on XFS and other huge filesystems. There is also a big disadvantage of FIBMAP: it requires superuser privileges (don't know why). Instead of the old ioctl, newer Linux kernels provide another one: <b>FS_IOC_FIEMAP</b>. This variant is much more flexible, universal and limit-safe. It also requires no superuser privileges. This call lets you view a file as a list of physical extents (even for filesystems that allocate data by bitmaps, see flags). You can find more information in the kernel documentation.</p>
<p>Here is a sample of how to retrieve the first physical block with the methods mentioned above:</p>
<pre class="brush:c; ruler: false; gutter: false; highlight: [26, 30]">
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

/* Returns the physical position of the file's first block, or ~0ULL on error.
 * Units differ: FIEMAP reports a byte offset, FIBMAP a block number. */
uint64_t
get_physblock(const char *f)
{
    int fd = open(f, O_RDONLY);
    uint64_t block = ~0ULL;
    if (fd >= 0) {
#ifdef FS_IOC_FIEMAP
        union {
            struct fiemap fm;
            char buf[sizeof(struct fiemap) + sizeof(struct fiemap_extent) * 1];
        } u;
        memset(&u, 0, sizeof u);  /* also sets fm_start = 0 and fm_flags = 0 */
        u.fm.fm_length = 1;       /* one byte mapping from logical offset=0 */
        u.fm.fm_extent_count = 1; /* buffer for one extent provided */
        if (ioctl(fd, FS_IOC_FIEMAP, &u.fm) != -1 && u.fm.fm_mapped_extents == 1)
            block = u.fm.fm_extents[0].fe_physical;
#else
        int blk = 0; /* first logical block */
        if (-1 != ioctl(fd, FIBMAP, &blk))
            block = blk;
#endif
        close(fd);
    }
    return block;
}
</pre>
<h2>Test</h2>
<p>Clearly, relying on the physical block ID, you can reorder the files to read. In addition, there is a <a href="http://www.kernel.org/doc/man-pages/online/pages/man2/readahead.2.html">readahead(2)</a> syscall which can be used to "preread" file data into the VFS cache. It differs from reading with read(2) since it has no "copy_to_userspace" overhead.
There is not much more to say, so here are the test results. The testing principle is quite simple: read the Linux sources file by file. First read them `as is', then apply readahead, and finally add preordering. The FS cache between the cases may be purged by
<pre class="brush:shell; ruler: false; gutter: false">
$ echo 2 >/proc/sys/vm/drop_caches
</pre>
The test results are as follows:
<table border="1">
<tr>
<th>Method used</th><th>Time elapsed (sec)</th>
</tr>
<tr>
<td>as-is</td><td>33</td>
</tr>
<tr>
<td>readahead</td><td>26</td>
</tr>
<tr>
<td>reorder + readahead</td><td>14</td>
</tr>
</table>
<p>It's not difficult to see that applying readahead, especially with preordering, gives a big benefit. So this method may be used for caching: there are several implementations used in popular Linux distributions, for example the <b>readahead</b> package, used by default in Fedora and Ubuntu.</p>
<p>You can read details about fiemap on the <a href="http://lwn.net/Articles/297696/">LWN page</a>.</p>
<p>I hope this short note convinces you to use these tricky calls when FS reading speed matters. Surely, this method is worthwhile only for reading lots of files, not for a couple of huge files, since those are already self-ordered. As for me, I used this approach when developing a library which creates dictionaries for classifying phrases: there were over 60000 input files, and until then reading these files consumed most of the time.</p>
</body>
</html>dkrotxhttp://www.blogger.com/profile/08519715678786396335noreply@blogger.com0tag:blogger.com,1999:blog-897881075540338440.post-41636646703811441422012-07-25T10:19:00.000+03:002012-08-12T19:45:04.251+03:00pstack for amd64<head>
<!-- SYNTAX HIGHLIGHTER BEGINS -->
<link href='http://alexgorbatchev.com/pub/sh/current/styles/shCore.css' rel='stylesheet' type='text/css'/>
<link href='http://alexgorbatchev.com/pub/sh/current/styles/shThemeEclipse.css' rel='stylesheet' type='text/css'/>
<script src='http://alexgorbatchev.com/pub/sh/current/scripts/shCore.js' type='text/javascript'></script>
<script src='http://alexgorbatchev.com/pub/sh/current/scripts/shBrushCpp.js' type='text/javascript'></script>
<script src='http://alexgorbatchev.com/pub/sh/current/scripts/shBrushPlain.js' type='text/javascript'></script>
<script src='http://alexgorbatchev.com/pub/sh/current/scripts/shBrushBash.js' type='text/javascript'></script>
<script language='javascript'>
SyntaxHighlighter.config.bloggerMode = true;
SyntaxHighlighter.defaults['gutter'] = false;
SyntaxHighlighter.defaults['toolbar'] = false;
SyntaxHighlighter.config.clipboardSwf = 'http://alexgorbatchev.com/pub/sh/current/scripts/clipboard.swf';
SyntaxHighlighter.all();
</script>
<!-- SYNTAX HIGHLIGHTER ENDS -->
</head>
<p>If you have ever printed a stack with gdb(1), you may have noticed it's slow. That's OK while debugging, but surely not suitable for anything close to realtime. That's because gdb extracts every symbol of all the files the observed executable is linked to.</p>
<p>There is a nice replacement for this, pstack(1), but it works only for x86 binaries. <a href="https://github.com/dkrotx/pstack64">Here</a> is an attempt to do the same for x86_64. It uses <a href="http://www.nongnu.org/libunwind/">libunwind</a> to unwind stack frames, and then a Perl script (omfg) to extract symbols and debug info and to produce pretty output.</p>
<p>There may be a nice use case on servers: automatically printing the backtrace of monitored processes which are stuck, right before killing them. In most cases it's enough to dig into the problem, even w/o debug symbols. Or ... just to see how to unwind a remote stack with libunwind, since I didn't find any example :-)</p>
<h2>Example</h2>
Anyway, here is an example of its output:
<pre class="brush:plain">
$./pstack64 20794
20794:./a.out
#0 0x00007fb66ee42020 in /lib/x86_64-linux-gnu/libc-2.15.so: nanosleep@@GLIBC_2.2.5
#1 0x00007fb66ee41edc in /lib/x86_64-linux-gnu/libc-2.15.so: __sleep (/build/buildd/eglibc-2.15/posix/../sysdeps/unix/sysv/linux/sleep.c:138)
#2 0x0000000000400561 in /tmp/a.out: fn
#3 0x0000000000400571 in /tmp/a.out: a
#4 0x0000000000400581 in /tmp/a.out: main
#5 0x00007fb66eda576d in /lib/x86_64-linux-gnu/libc-2.15.so: __libc_start_main (/build/buildd/eglibc-2.15/csu/libc-start.c:258)
#6 0x0000000000400489 in /tmp/a.out: _start
</pre>
<p>This program doesn't use any dynamic libraries except libc. Here is an example of another program, written in C++ and using Qt:</p>
<pre class="brush:plain">
./pstack64 9190
9190:keepassx
#0 0x00007f9c71a3eb03 in /lib/x86_64-linux-gnu/libc-2.15.so: __GI___poll (/build/buildd/eglibc-2.15/io/../sysdeps/unix/sysv/linux/poll.c:87)
#1 0x00007f9c70c2a036 in /lib/x86_64-linux-gnu/libglib-2.0.so.0.3200.3: -
#2 0x00007f9c70c2a164 in /lib/x86_64-linux-gnu/libglib-2.0.so.0.3200.3: g_main_context_iteration
#3 0x00007f9c726cf3bf in /usr/lib/x86_64-linux-gnu/libQtCore.so.4.8.1: QEventDispatcherGlib::processEvents(QFlags)
#4 0x00007f9c72c6ad5e in /usr/lib/x86_64-linux-gnu/libQtGui.so.4.8.1: -
#5 0x00007f9c7269ec82 in /usr/lib/x86_64-linux-gnu/libQtCore.so.4.8.1: QEventLoop::processEvents(QFlags)
#6 0x00007f9c7269eed7 in /usr/lib/x86_64-linux-gnu/libQtCore.so.4.8.1: QEventLoop::exec(QFlags)
#7 0x00007f9c726a3f67 in /usr/lib/x86_64-linux-gnu/libQtCore.so.4.8.1: QCoreApplication::exec()
#8 0x000000000041be9e in /usr/bin/keepassx: -
#9 0x00007f9c7197976d in /lib/x86_64-linux-gnu/libc-2.15.so: __libc_start_main (/build/buildd/eglibc-2.15/csu/libc-start.c:258)
#10 0x000000000041caa1 in /usr/bin/keepassx: -
</pre>
The Keepassx binary is stripped, so we can't see its procedures; dashes are printed instead. As you can see, C++ names are demangled. BTW, I was surprised how straightforward this is with c++filt, which comes with GNU binutils.
To make a short story longer ( :-) ), I'll put a piece of README here:
<br /><br />
<h3>Benefits</h3>
<ul>
<li>It's easy to read :-)</li>
<li>It shows symbols much faster than `gdb -batch` since it performs a lazy lookup.</li>
<li>It works well with executables and shared objects, falling back to dynamic symbol lookup if nothing is found in the (debug) symbol table.</li>
</ul>
<br /><br />
<h3>Drawbacks</h3>
<ul>
<li>It strongly depends on GNU binutils and is therefore Linux-only.</li>
<li>It doesn't support threads (even if you pick the right LWP).</li>
</ul>
<br /><br />
<h2>Permissions to trace</h2>
Since unwind uses ptrace(2), it's worth noting that recent Linux distributions forbid tracing "foreign" processes by default. For example, see /etc/sysctl.d/10-ptrace.conf in *Ubuntu, or simply run pstack64 with sudo.
<br /><br />
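<p>On Ubuntu that file sets the <code>kernel.yama.ptrace_scope</code> sysctl. As a rough sketch (not part of pstack64), the current value can be read from procfs before trying to attach. The helper name <code>read_ptrace_scope</code> and its default path argument are assumptions made for illustration:</p>

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: returns the Yama ptrace_scope value
// (0 = classic ptrace permissions, 1 = only descendants may be traced,
//  higher values are stricter), or -1 if the file is missing or unreadable,
// e.g. on a kernel built without Yama.
int read_ptrace_scope(const std::string &path = "/proc/sys/kernel/yama/ptrace_scope")
{
    std::ifstream in(path.c_str());
    int scope = -1;
    if (!(in >> scope))
        return -1;
    return scope;
}
```

<p>A non-zero result is a hint that tracing a process you did not start will need sudo.</p>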
<h2>Separated debug-info</h2>
<p>It's worth saying that many Linux distributions provide so-called "separated debug-info": copies of dynamic libraries, or even executables, that contain the DWARF records. Since DWARF debug info doesn't affect other sections and does not require any transformation of the executable code, it can easily be excluded (stripped) from an object file. Because size is critical, shared libraries are usually stripped after compilation, so /usr/lib/ contains nothing but the symbols for the dynamic loader (which pstack64 extracts too, anyway). But the unstripped original may be installed as well.</p>
<p>For example, Ubuntu has <i>libc6-dbg</i>, which provides <i>/usr/lib/debug/lib/x86_64-linux-gnu/libc-2.15.so</i>. The thing is, it can easily be used instead of the runtime libc, since all virtual addresses (or section offsets) are valid for the debug version too.</p>
<p>BTW, it's interesting to glance on this short introduction to <a href="http://dwarfstd.org/doc/Debugging%20using%20DWARF-2012.pdf">DWARF</a>.</p>dkrotxhttp://www.blogger.com/profile/08519715678786396335noreply@blogger.com0tag:blogger.com,1999:blog-897881075540338440.post-21056405439893967062012-07-01T21:36:00.000+03:002012-08-12T17:39:37.674+03:00memcached: dump to disk<head>
<!-- SYNTAX HIGHLIGHTER BEGINS -->
<link href='http://alexgorbatchev.com/pub/sh/current/styles/shCore.css' rel='stylesheet' type='text/css'/>
<link href='http://alexgorbatchev.com/pub/sh/current/styles/shThemeEclipse.css' rel='stylesheet' type='text/css'/>
<script src='http://alexgorbatchev.com/pub/sh/current/scripts/shCore.js' type='text/javascript'></script>
<script src='http://alexgorbatchev.com/pub/sh/current/scripts/shBrushPerl.js' type='text/javascript'></script>
<script src='http://alexgorbatchev.com/pub/sh/current/scripts/shBrushBash.js' type='text/javascript'></script>
<script language='javascript'>
SyntaxHighlighter.config.bloggerMode = true;
SyntaxHighlighter.config.clipboardSwf = 'http://alexgorbatchev.com/pub/sh/current/scripts/clipboard.swf';
SyntaxHighlighter.all();
</script>
<!-- SYNTAX HIGHLIGHTER ENDS -->
</head>
<h2>Preface</h2>
<br />
<a href="http://memcached.org/">Memcached</a> is a well-known, excellent in-memory store. But what if you need to dump its content to disk? This may be needed, for example, in the following case: you have memcached with a hit rate of around 50%, and your service's average load is about 70% in rush hour. If your cache server reboots, you'll lose the cache, the number of requests reaching the service will suddenly double, and your users will suffer from timeouts for that time.<br />
Sound realistic to you? Then try this fork of memcached: <a href="https://github.com/dkrotx/memcached-dd">memcached-dd</a>.<br />
As the README says, usage is straightforward: just add the `-F file' option to the command line. Memcached will read this `file' at start and write to file.tmp when <b>SIGUSR2</b> is received. Then (after a successful write and sync), it will rename file.tmp -> file. So `file' itself is never left truncated.
For example:
<pre class="brush:shell; ruler: false; gutter: false">
$ memcached -F /tmp/memcache.dump -m 64 -p 11211 -l 127.0.0.1
</pre>
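<p>The write-to-temporary-then-rename step is what keeps `file' intact if a dump is interrupted. A minimal POSIX sketch of that pattern follows; it is not memcached-dd's actual code, and the function name <code>atomic_write</code> is an assumption made for illustration:</p>

```cpp
#include <cstdio>      // std::rename, std::remove
#include <string>
#include <fcntl.h>     // open
#include <unistd.h>    // write, fsync, close

// Hypothetical helper: write `data` to "<path>.tmp", flush it to disk,
// then rename it over `path`. Readers see either the old or the new file.
bool atomic_write(const std::string &path, const std::string &data)
{
    const std::string tmp = path + ".tmp";
    int fd = open(tmp.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return false;

    size_t done = 0;
    while (done < data.size()) {
        ssize_t n = write(fd, data.data() + done, data.size() - done);
        if (n < 0) {               // a real implementation would retry on EINTR
            close(fd);
            std::remove(tmp.c_str());
            return false;
        }
        done += static_cast<size_t>(n);
    }

    bool ok = (fsync(fd) == 0);    // make sure the data reached the disk
    ok = (close(fd) == 0) && ok;
    if (!ok || std::rename(tmp.c_str(), path.c_str()) != 0) {
        std::remove(tmp.c_str());
        return false;
    }
    return true;
}
```

<p>The rename is the only step that touches the final name, which is why a crash in the middle leaves the previous dump usable.</p>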
<h2>Some notes to be clear</h2>
<ul>
<li>The dump is performed in a separate thread and doesn't block memcached itself</li>
<li>If you use a TTL for your data, the restored data <i>will have the same</i> TTL as at the time of the dump.</li>
<li>All expired and flushed (flush_all command) content is left behind</li>
<li>There is no built-in scheduling for dumps; it's better to handle that with crontab and/or your own scripts</li>
</ul>
<div>
<br /></div>
<h2>Example</h2>
<div>
I'll show the usage with a Perl script. Assume you have downloaded and built memcached-dd; see the INSTALLATION section in the README if in doubt. This Perl script will load fake data into memcached:</div>
<pre class="brush:perl; gutter: false">use Cache::Memcached;
$memd = new Cache::Memcached {
'servers' => [ "127.0.0.1:11211" ]
};
$| = 1;
for (my $i = 0; $i <= 10000; $i++) {
$memd->set( "key_$i", "x"x100 . " [$i]" );
# my $val = $memd->get( "xkey_$i");
if ($i % 1000 == 0) {
print "\r$i...";
}
}
</pre>
Having this <b>load.pl</b> script, launch memcached-dd:
<pre class="brush:bash; gutter: false; ">
$ ./memcached -P /tmp/memcached.pid -F /tmp/memcached.dump -m 128 -p 11211 -l 127.0.0.1
# now load the data into memcached:
$ perl ./load.pl
# and now, assuming memcached has this data, dump it:
$ kill -USR2 `cat /tmp/memcached.pid`
1Mb dumped: 10001 items (0 expired during dump, 0 nuked by flush)
Moving temprorary /tmp/memcached.dump.tmp -> /tmp/memcached.dump
</pre>
<p>OK, now you have the file /tmp/memcached.dump with all 10001 records dumped. You may reload it anytime by launching memcached-dd with the same -F (assuming you killed memcached):</p>
<pre class="brush:bash; gutter: false; ">
$ ./memcached -P /tmp/memcached.pid -F /tmp/memcached.dump -m 128 -p 11211 -l 127.0.0.1
</pre>
Now check these keys with <b>netcat</b>:
<pre class="brush:bash; gutter: false; ">
$ echo get key_1 | nc 127.0.0.1 11211
VALUE key_1 0 104
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx [1]
END
</pre>
<p>As you can see, the content was successfully restored from the dump.
Hope this helps with your particular use case. If you have any problems with memcached-dd, feel free to <a href="https://github.com/dkrotx/memcached-dd/issues" target="_blank">post an issue</a>.</p>dkrotxhttp://www.blogger.com/profile/08519715678786396335noreply@blogger.com0tag:blogger.com,1999:blog-897881075540338440.post-89770187468501615962012-02-16T13:08:00.000+03:002012-08-12T17:49:07.349+03:00Nice C-static-assert<html>
<head>
<!-- SYNTAX HIGHLIGHTER BEGINS -->
<link href='http://alexgorbatchev.com/pub/sh/current/styles/shCore.css' rel='stylesheet' type='text/css'/>
<link href='http://alexgorbatchev.com/pub/sh/current/styles/shThemeEclipse.css' rel='stylesheet' type='text/css'/>
<script src='http://alexgorbatchev.com/pub/sh/current/scripts/shCore.js' type='text/javascript'></script>
<script src='http://alexgorbatchev.com/pub/sh/current/scripts/shBrushCpp.js' type='text/javascript'></script>
<script language='javascript'>
SyntaxHighlighter.config.bloggerMode = true;
SyntaxHighlighter.config.clipboardSwf = 'http://alexgorbatchev.com/pub/sh/current/scripts/clipboard.swf';
SyntaxHighlighter.all();
</script>
<!-- SYNTAX HIGHLIGHTER ENDS -->
</head>
<body>
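A buddy showed me this example of a static assert in C. If <code>e</code> is non-zero, <code>-!!(e)</code> is -1, the unnamed bit-field gets a negative width, and compilation fails; if <code>e</code> is zero, the width is 0 and the code compiles.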
<pre class="brush:c; gutter: false">
#define BUILD_BUG_ON_ZERO(e) (sizeof(struct { int:-!!(e); }))
</pre>
</body>
</html>dkrotxhttp://www.blogger.com/profile/08519715678786396335noreply@blogger.com0tag:blogger.com,1999:blog-897881075540338440.post-12672222270149893122012-02-08T19:58:00.000+03:002012-08-12T19:50:14.367+03:00Linux needs to reboot<head>
<!-- SYNTAX HIGHLIGHTER BEGINS -->
<link href='http://alexgorbatchev.com/pub/sh/current/styles/shCore.css' rel='stylesheet' type='text/css'/>
<link href='http://alexgorbatchev.com/pub/sh/current/styles/shThemeEclipse.css' rel='stylesheet' type='text/css'/>
<script src='http://alexgorbatchev.com/pub/sh/current/scripts/shCore.js' type='text/javascript'></script>
<script src='http://alexgorbatchev.com/pub/sh/current/scripts/shBrushCpp.js' type='text/javascript'></script>
<script src='http://alexgorbatchev.com/pub/sh/current/scripts/shBrushPlain.js' type='text/javascript'></script>
<script src='http://alexgorbatchev.com/pub/sh/current/scripts/shBrushBash.js' type='text/javascript'></script>
<script language='javascript'>
SyntaxHighlighter.config.bloggerMode = true;
SyntaxHighlighter.defaults['gutter'] = false;
SyntaxHighlighter.defaults['toolbar'] = false;
SyntaxHighlighter.config.clipboardSwf = 'http://alexgorbatchev.com/pub/sh/current/scripts/clipboard.swf';
SyntaxHighlighter.all();
</script>
<!-- SYNTAX HIGHLIGHTER ENDS -->
</head>
The lock is not removed after the process has finished:
<pre class="brush:shell">
$ ls -i dummy_daemon.pid
3146845 dummy_daemon.pid
$ cat /proc/locks | grep --color 3146845
1: FLOCK ADVISORY WRITE 13676 08:01:3146845 0 EOF
$ ps -p 3146845
# NOTHING like no entry in /proc/3146845
</pre>
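<p>In the /proc/locks line above, the fifth field (13676) is the PID of the lock holder, and the last part of the MAJOR:MINOR:INODE triple (3146845) is the inode that <code>ls -i</code> printed. A small sketch that pulls both out of such a line, assuming the field layout of the sample (lines for blocked waiters carry an extra "->" token and are not handled); the struct and function names are made up for illustration:</p>

```cpp
#include <cstdlib>
#include <sstream>
#include <string>

struct LockInfo {
    long pid;             // PID of the process holding the lock
    unsigned long inode;  // inode of the locked file
};

// Hypothetical helper: parse one line of /proc/locks.
// Returns false if the line does not look like the sample above.
bool parse_lock_line(const std::string &line, LockInfo &out)
{
    std::istringstream in(line);
    std::string id, kind, mode, access, dev;
    long pid;
    if (!(in >> id >> kind >> mode >> access >> pid >> dev))
        return false;

    // dev looks like "08:01:3146845", i.e. MAJOR:MINOR:INODE
    const std::string::size_type colon = dev.rfind(':');
    if (colon == std::string::npos || colon + 1 >= dev.size())
        return false;

    out.pid = pid;
    out.inode = std::strtoul(dev.c_str() + colon + 1, NULL, 10);
    return true;
}
```

<p>With the PID in hand, <code>ps -p 13676</code> (or a look at /proc/13676) tells whether the holder is still alive.</p>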
<p>
Trying to run the program: it performs exit(1) without any comment. Here is the backtrace:</p>
<pre class="brush:plain">
#0 __GI_exit (status=1) at exit.c:100
#1 0x00007ffff660f314 in __libc_start_main (main=0x55cd31 <main(int, char**)>, argc=2, ubp_av=0x7fffffffe4e8, init=<optimized out>, fini=<optimized out>,
rtld_fini=<optimized out>, stack_end=0x7fffffffe4d8) at libc-start.c:258
#2 0x000000000055c9a9 in _start ()
</pre><br />
Ha!<br />
No stable software has ever been written :-)<br />
just reboot it /dkrotxhttp://www.blogger.com/profile/08519715678786396335noreply@blogger.com0tag:blogger.com,1999:blog-897881075540338440.post-4420345152625624442012-01-29T23:12:00.001+03:002012-01-29T23:14:21.255+03:00Slow reading from std::cinI had been optimizing a program for three hours. Everything was OK, and I finally checked the time(1) output: 9 seconds in userspace. Well done, nothing critical left to optimize. Then I ran it again, but in another way: without a filename as a parameter (it's supposed to read from stdin in this case). time(1) showed me 21s this time! What?! It was more than twice as slow as the previous run! Really, does it depend on stdin?! I also picked another variant to check (is it my system?): <br />
<pre>$ cat big_file | time myprogram ... /dev/stdin
9.18user 0.11system 0:09.32elapsed 99%CPU (0avgtext+0avgdata 6608maxresident)k
0inputs+8624outputs (0major+471minor)pagefaults 0swaps
</pre>Hmm... now it's fine. But now the reading goes through std::ifstream, not std::cin. I checked my suspicion with <a href="http://google-perftools.googlecode.com/svn/trunk/doc/cpuprofile.html">google-perftools</a>, and here is what I saw: about 50% of the time was spent in std::getline(). Here are the top 10 functions according to google-perftools:<br />
<pre> 908 41.6% 41.6% 908 41.6% gogo::BlacklistFilter::exists <i><<< WORK</i>
340 15.6% 57.2% 357 16.4% _IO_getc
271 12.4% 69.6% 1095 50.2% std::getline
221 10.1% 79.7% 239 11.0% <b>_IO_acquire_lock_fct</b>
172 7.9% 87.6% 193 8.8% _IO_ungetc
54 2.5% 90.1% 287 13.2% __gnu_cxx::stdio_sync_filebuf::uflow
48 2.2% 92.3% 48 2.2% std::__once_callable
43 2.0% 94.3% 43 2.0% _IO_sputbackc
37 1.7% 96.0% 37 1.7% std::_Rb_tree_black_count <i><<< WORK</i>
27 1.2% 97.2% 275 12.6% __gnu_cxx::stdio_sync_filebuf::underflow
</pre>Fantastic! Barely 45% is related to the actual work; the rest is caused by std::getline. A strange distribution, especially the <b>_IO_acquire_lock_fct</b> function. The name is self-explanatory, so I easily found this method: <a href="http://www.cplusplus.com/reference/iostream/ios_base/sync_with_stdio/">ios_base::sync_with_stdio</a>. Putting <code>std::cin.sync_with_stdio(false);</code> before the first read tamed my program. The CPU time came back and I'm happy again.<br />
Note that this behavior <i>isn't related</i> to <b>-pthread</b> or any other compiler flag. Reading from std::cin simply becomes 30 to 40 times slower, whether you use std::getline or not.<br />
Surely, this is not a potion that makes every program faster, but I'll keep this weakness of std::ios in mind.dkrotxhttp://www.blogger.com/profile/08519715678786396335noreply@blogger.com0tag:blogger.com,1999:blog-897881075540338440.post-19541386845915414892012-01-22T15:36:00.000+03:002012-08-12T17:56:08.793+03:00Swap words in text w/o additional memory<!-- SYNTAX HIGHLIGHTER BEGINS -->
<link href='http://alexgorbatchev.com/pub/sh/current/styles/shCore.css' rel='stylesheet' type='text/css'/>
<link href='http://alexgorbatchev.com/pub/sh/current/styles/shThemeEclipse.css' rel='stylesheet' type='text/css'/>
<script src='http://alexgorbatchev.com/pub/sh/current/scripts/shCore.js' type='text/javascript'></script>
<script src='http://alexgorbatchev.com/pub/sh/current/scripts/shBrushCpp.js' type='text/javascript'></script>
<script language='javascript'>
SyntaxHighlighter.config.bloggerMode = true;
SyntaxHighlighter.config.clipboardSwf = 'http://alexgorbatchev.com/pub/sh/current/scripts/clipboard.swf';
SyntaxHighlighter.all();
</script>
<!-- SYNTAX HIGHLIGHTER ENDS -->
Yesterday I saw an interview problem from our C team:<br />
<h3>Reverse the word order in a text without using additional memory</h3>For example, the string<br />
<i>"Fedora Project promotes internet freedom"</i> should be transformed into<br />
<i>"freedom internet promotes Project Fedora"</i> [nice semantic palindrome, isn't it? :-)].<br />
<br />
It doesn't look hard, but the only solution I found is to reverse the whole text and then reverse each word.<br />
<pre class="brush:c; ruler: false">
#include <stdio.h>
#include <string.h>
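/* invert characters in range [beg; end) */
static inline void inv(char *beg, char *end)
{
    while (beg < end) {
        char c = *beg;
        *beg++ = *--end;
        *end = c;
    }
}
int main(int argc, char *argv[])
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s \"some text\"\n", argv[0]);
        return 1;
    }
    char *s = argv[1], *beg = s, *end = strchr(s, '\0');
    inv(beg, end); /* first pass: reverse the whole string */
    while (*beg)
    {
        /* second pass: reverse each word back ("?:" is a GNU C extension) */
        end = strchr(beg, ' ') ? : strchr(beg, '\0');
        inv(beg, end);
        beg = *end ? end + 1 : end; /* don't step past the terminating '\0' */
    }
    printf("%s\n", s);
    return 0;
}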
</pre>
The problem is that we have to scan the string twice. Maybe there is a better solution (algorithmically, not necessarily in speed)?<br />
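<p>The same two-pass idea can be sketched in C++ with std::reverse, which makes it easy to try on a few inputs. This is only a restatement of the approach above, not a better algorithm, and the function name is an assumption:</p>

```cpp
#include <algorithm>
#include <string>

// Reverse the order of space-separated words in place:
// reverse the whole string, then reverse each word back.
void reverse_words(std::string &s)
{
    std::reverse(s.begin(), s.end());
    std::string::size_type beg = 0;
    while (beg < s.size()) {
        std::string::size_type end = s.find(' ', beg);
        if (end == std::string::npos)
            end = s.size();
        std::reverse(s.begin() + beg, s.begin() + end);
        beg = end + 1;  // skip the space (or step past the end and stop)
    }
}
```

<p>Every character is still touched twice, once per pass, so the question above stays open.</p>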
BTW, here is a nice construction that is used everywhere in the Linux kernel:
<pre class="brush:c; ruler: false; gutter: false; ">
x = y ? : z; // GNU extension: same as "x = y ? y : z", but y is evaluated only once
</pre>dkrotxhttp://www.blogger.com/profile/08519715678786396335noreply@blogger.com0tag:blogger.com,1999:blog-897881075540338440.post-81636073815637339422012-01-16T00:55:00.000+03:002014-10-27T11:50:13.853+02:00Long Live ... SSD!<i>I assume you're using Linux :-)</i><br />
<h3>Some advice to make your SSD live longer</h3><ul><li>Enable the TRIM command from the filesystem to the disk firmware (ext4 has the 'discard' option, see man 8 mount)</li>
<li>Set the 'noatime' <b>and</b> 'nodiratime' options (again, see man 8 mount)</li>
<li>Raise <code>/proc/sys/vm/dirty_writeback_centisecs</code> up to 6000 (60 seconds; the unit is hundredths of a second) to make pdflush write rarely</li>
</ul>Do you know any more?<br />
<br />
Of course, do not apply this blindly: there are many explanations of why an OS has to treat solid-state drives differently from a usual HDD:<br />
<ol><li><a href="http://www.windowsitpro.com/article/john-savills-windows-faqs/q-i-heard-solid-state-disks-ssds-suffer-from-a-decline-in-write-performance-as-they-re-used-why-">Illustrated process of rewriting block</a></li>
<li><a href="http://www.eettaiwan.com/STATIC/PDF/200808/EETOL_2008IIC_Spansion_AN_13.pdf">wear-leveling</a>, or how solid state drives (or even USB sticks) remap data blocks.</li>
<li><a href="http://sites.google.com/site/lightrush/random-1/howtoconfigureext4toenabletrimforssdsonubuntu">How to configure TRIM in Ubuntu and other distros</a>, with a little benchmarking.</li>
</ol>
<br />
UP:<br />
<p>
It's also worth changing the default I/O scheduler to "noop". This will boost synchronous operations in the case of scattered read requests. Many people think it saves the CPU cycles spent by the "too smart" default schedulers, CFQ or deadline. But I think most of the effect is not in CPU time. Instead, it reduces the average time an I/O request waits before it is actually sent to the device, because the "noop" scheduler <i>does not buffer</i> I/O requests.</p>
<p>I saw a big difference in the "await" column of iostat(1); according to the manpage it is "The average time (in milliseconds) for I/O requests issued to the device to be served. This includes the time <i>spent by the requests in queue</i> and the time spent servicing them". For my workload it decreased 8 times!</p>
<p>To switch your disk's scheduler to noop, run:<br /><br />
<code>
$ echo noop | sudo tee /sys/block/sda/queue/scheduler # my SSD is sda<br />
</code>
<br />
And, of course, we have to do this on every system boot. The most native way to do this in Ubuntu, as I found, is via sysfsutils, the package that provides /etc/sysfs.conf:<br /><br />
<code>
$ sudo apt-get install sysfsutils
</code>
<br /><br />
Then add the following line to /etc/sysfs.conf:<br /><br />
<code>
block/sda/queue/scheduler = noop
</code>dkrotxhttp://www.blogger.com/profile/08519715678786396335noreply@blogger.com0tag:blogger.com,1999:blog-897881075540338440.post-8009873852938159702012-01-14T14:59:00.000+03:002012-08-12T18:08:54.873+03:00slow std::fill behavior<head>
<!-- SYNTAX HIGHLIGHTER BEGINS -->
<link href='http://alexgorbatchev.com/pub/sh/current/styles/shCore.css' rel='stylesheet' type='text/css'/>
<link href='http://alexgorbatchev.com/pub/sh/current/styles/shThemeEclipse.css' rel='stylesheet' type='text/css'/>
<script src='http://alexgorbatchev.com/pub/sh/current/scripts/shCore.js' type='text/javascript'></script>
<script src='http://alexgorbatchev.com/pub/sh/current/scripts/shBrushCpp.js' type='text/javascript'></script>
<script src='http://alexgorbatchev.com/pub/sh/current/scripts/shBrushBash.js' type='text/javascript'></script>
<script src='http://alexgorbatchev.com/pub/sh/current/scripts/shBrushDiff.js' type='text/javascript'></script>
<script language='javascript'>
SyntaxHighlighter.config.bloggerMode = true;
SyntaxHighlighter.config.clipboardSwf = 'http://alexgorbatchev.com/pub/sh/current/scripts/clipboard.swf';
SyntaxHighlighter.all();
</script>
<!-- SYNTAX HIGHLIGHTER ENDS -->
</head>
<p>This morning I saw a commit in our group with nearly this content:
<pre class="brush:diff; ruler: false; gutter: false;">
- std::fill(v1.begin(), v1.end(), 0);
- std::fill(v2.begin(), v2.end(), 0);
- std::fill(v3.begin(), v3.end(), 0);
+ for(int i = 0; i < N; ++i) {
+ v1[i] = v2[i] = v3[i] = 0;
+ }
</pre>
with the comment "small optimization".
<p>I wondered a bit, "Does it really give any speedup?", and decided to check anyway. The results are as expected, all but the last one: strangely, I can't make this code run fast without hand optimization, even with <i>-march=native</i>. Here is the source code and my benchmark numbers:</p>
<pre class="brush:cpp; gutter: true;">
#include <algorithm>
#include <vector>
#include <stdio.h>
using namespace std;
#define N 1000
inline void i32_fill(int *start, int n, int c)
{
int d1, d2;
asm volatile(
"cld\n\t"
"rep\t\n"
"stosl\n\t"
:"=&D"(d1), "=&c"(d2)
:"0"(start), "1"(n), "a"(c)
:"cc", "memory"
);
}
int main()
{
int crc = 0;
vector<int> v1(N), v2(N), v3(N), v4(N);
for ( int loop = 0; loop < 10000000; loop++ )
{
/* Uncomment any one of the following methods.
* I deliberately did not run them one after another,
* in case you ask "wasn't it because of caching?"
*/
/*
std::fill(v1.begin(), v1.end(), 1);
std::fill(v2.begin(), v2.end(), 2);
std::fill(v3.begin(), v3.end(), 3);
std::fill(v4.begin(), v4.end(), 4);
*/
/*
for ( int i = 0; i < N; i++) v1[i] = 1;
for ( int i = 0; i < N; i++) v2[i] = 2;
for ( int i = 0; i < N; i++) v3[i] = 3;
for ( int i = 0; i < N; i++) v4[i] = 4;
*/
/*
for ( int i = 0; i < N; i++) {
v1[i] = 1;
v2[i] = 2;
v3[i] = 3;
v4[i] = 4;
}*/
/*
i32_fill(&v1[0], N, 1);
i32_fill(&v2[0], N, 2);
i32_fill(&v3[0], N, 3);
i32_fill(&v4[0], N, 4);
*/
crc += v1[N-1] + v2[N-1] + v3[N-1] + v4[N-1];
}
printf("CRC=%d\n", crc);
return 0;
}
</pre>
<p>First of all, there is a big difference (over 4 times) between -O2 and -O3 for all methods but i32_fill. Here are the test results for: </p>
<pre class="brush:shell; gutter: false;">
$ g++ -O3 -march=native fill.cpp && time ./a.out
</pre>
<table border="1">
<tr><th>Method</th><th>Time</th></tr>
<tr><td>std::fill</td><td>0m7.090s</td></tr>
<tr><td>N x loop</td><td>0m7.018s</td></tr>
<tr><td>loop x N</td><td>0m5.119s</td></tr>
<tr><td>i32_fill</td><td>0m4.072s</td></tr>
</table>
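<p>To reproduce such measurements without commenting blocks in and out, each variant can be timed in-process. A minimal sketch using std::chrono (C++11); the helper name <code>time_ms</code> is an assumption, and the numbers in the table above were not produced with it:</p>

```cpp
#include <algorithm>
#include <chrono>
#include <vector>

// Hypothetical helper: run `body` `loops` times and return elapsed milliseconds.
template <typename F>
double time_ms(int loops, F body)
{
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < loops; ++i)
        body();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// One of the variants from the benchmark, wrapped as a function.
void fill_std(std::vector<int> &v, int value)
{
    std::fill(v.begin(), v.end(), value);
}
```

<p>Usage would look like <code>double ms = time_ms(10000000, [&]{ fill_std(v1, 1); });</code>. As with the crc variable in the program above, the filled data should be read afterwards so the compiler cannot drop the work.</p>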
<p>So, the question is: is there a better solution than i32_fill, and why wasn't "N x loop" optimized as well by the compiler? BTW, memset(3) uses the "stosq then stosb" approach, and it's a gcc built-in.</p>dkrotxhttp://www.blogger.com/profile/08519715678786396335noreply@blogger.com2tag:blogger.com,1999:blog-897881075540338440.post-4838282334521158142011-11-08T06:28:00.001+03:002012-08-12T18:17:26.986+03:00sun (oracle) java-plugin in debian<head>
<!-- SYNTAX HIGHLIGHTER BEGINS -->
<link href='http://alexgorbatchev.com/pub/sh/current/styles/shCore.css' rel='stylesheet' type='text/css'/>
<link href='http://alexgorbatchev.com/pub/sh/current/styles/shThemeEclipse.css' rel='stylesheet' type='text/css'/>
<script src='http://alexgorbatchev.com/pub/sh/current/scripts/shCore.js' type='text/javascript'></script>
<script src='http://alexgorbatchev.com/pub/sh/current/scripts/shBrushBash.js' type='text/javascript'></script>
<script language='javascript'>
SyntaxHighlighter.config.bloggerMode = true;
SyntaxHighlighter.config.clipboardSwf = 'http://alexgorbatchev.com/pub/sh/current/scripts/clipboard.swf';
SyntaxHighlighter.all();
</script>
<!-- SYNTAX HIGHLIGHTER ENDS -->
</head>
Debian testing (currently Wheezy) doesn't provide the Sun JRE anymore. Sad but <a href="http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=646524">true</a>.<br />
The thing is: the icedtea plugin doesn't work correctly with Raiffeisen bank ("Raiffeisen Connect"). So I decided to install the Sun JRE manually.<br />
<br />
The steps are quite straightforward:<br />
1) Download the JRE from the Oracle website: <a href="http://java.com/en/download/index.jsp">http://java.com/en/download/index.jsp</a>. They are so "marketing-shifted": they offer the JRE as the "Java for your desktop computer" menu item :-) Anyway, download the x64 or x32 .bin file (depending on your `uname -m`).<br />
<br />
2) Unpack this file into the /usr/java/ directory:
<pre class="brush:shell; ruler: false; gutter: false;">
$ sudo mkdir /usr/java/ && cd $_
$ sudo sh ~/Downloads/jre-6u29-linux-x64.bin
</pre>
<br />
3) Use update-alternatives if you are using iceweasel (firefox), like me:<br />
<pre class="brush:shell; ruler: false; gutter: false;">sudo update-alternatives --install /usr/lib/mozilla/plugins/libjavaplugin.so mozilla-javaplugin.so /usr/java/jre1.6.0_29/lib/amd64/libnpjp2.so 1600
</pre>
If you are not familiar with `update-alternatives`, read the manual first to understand this.<br />
<br />
That's all. I'm sure it's pretty much the same with x32 and/or Chrome/Opera.<br />
<br />
UP: it is not quite the same: chromium doesn't use 'alternatives', so simply make a symlink in its plugin directory (/usr/lib/chromium-browser/plugins/)dkrotxhttp://www.blogger.com/profile/08519715678786396335noreply@blogger.com0tag:blogger.com,1999:blog-897881075540338440.post-47688091534721163542011-10-27T23:38:00.000+03:002012-08-12T18:23:40.594+03:00rsync & rsync via SSH<head>
<!-- SYNTAX HIGHLIGHTER BEGINS -->
<link href='http://alexgorbatchev.com/pub/sh/current/styles/shCore.css' rel='stylesheet' type='text/css'/>
<link href='http://alexgorbatchev.com/pub/sh/current/styles/shThemeEclipse.css' rel='stylesheet' type='text/css'/>
<script src='http://alexgorbatchev.com/pub/sh/current/scripts/shCore.js' type='text/javascript'></script>
<script src='http://alexgorbatchev.com/pub/sh/current/scripts/shBrushCpp.js' type='text/javascript'></script>
<script src='http://alexgorbatchev.com/pub/sh/current/scripts/shBrushBash.js' type='text/javascript'></script>
<script language='javascript'>
SyntaxHighlighter.config.bloggerMode = true;
SyntaxHighlighter.config.clipboardSwf = 'http://alexgorbatchev.com/pub/sh/current/scripts/clipboard.swf';
SyntaxHighlighter.all();
</script>
<!-- SYNTAX HIGHLIGHTER ENDS -->
</head>
As you know, rsync can work with its native rsyncd server or via SSH. If the source or destination contains '::', it addresses an rsync daemon module (mountpoint). Otherwise it works via SSH, using an encrypted channel.<br />
Anyway, we use both of them. And there are many cases when you have expressions like:<br />
<pre class="brush:shell; gutter: false">
#!/usr/bin/env bash
URL1=xxx.yyyy.com::point/file1.txt # native rsync
URL2=xxx.yyyy.com:some_folder/file2.txt # rsync-ssh
#...
rsync --contimeout 10 --timeout 40 $URL1 file1.txt
rsync -e "ssh -o BatchMode=yes -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ConnectTimeout=10 -o ServerAliveInterval=40" $URL2 dir/
# same things many times
</pre>
<p>
As you can see, we have to use different syntax for native rsync and rsync over SSH: --contimeout is available only in the native case. Really bad news, especially if I don't want to spell out 'rsync' and its options each time. <b>Do I have to know which type of URI I'll get in a specific call?!</b> No! It's easy to rewrite it as follows:
<pre class="brush:shell; gutter: true">
SSH_OPTS="-o BatchMode=yes -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ConnectTimeout=10 -o ServerAliveInterval=40"
# select rsync or rsync-ssh syntax automatically
function RSYNC
{
local looks_like_rsync=
for a in "$@"
do
if [[ $a =~ :: ]]; then
looks_like_rsync=yes
break
fi
done
if [[ -n $looks_like_rsync ]]; then
rsync --contimeout=10 --timeout=40 -tv "$@"
return $?
fi
rsync -tv -e "ssh $SSH_OPTS" "$@"
}
</pre>dkrotxhttp://www.blogger.com/profile/08519715678786396335noreply@blogger.com0tag:blogger.com,1999:blog-897881075540338440.post-71895390602347402542011-10-20T00:20:00.000+03:002012-08-12T19:23:02.749+03:00population count (POPCNT)<head>
<!-- SYNTAX HIGHLIGHTER BEGINS -->
<link href='http://alexgorbatchev.com/pub/sh/current/styles/shCore.css' rel='stylesheet' type='text/css'/>
<link href='http://alexgorbatchev.com/pub/sh/current/styles/shThemeEclipse.css' rel='stylesheet' type='text/css'/>
<script src='http://alexgorbatchev.com/pub/sh/current/scripts/shCore.js' type='text/javascript'></script>
<script src='http://alexgorbatchev.com/pub/sh/current/scripts/shBrushCpp.js' type='text/javascript'></script>
<script src='http://alexgorbatchev.com/pub/sh/current/scripts/shBrushBash.js' type='text/javascript'></script>
<script language='javascript'>
SyntaxHighlighter.config.bloggerMode = true;
SyntaxHighlighter.defaults['gutter'] = false;
SyntaxHighlighter.defaults['toolbar'] = false;
SyntaxHighlighter.config.clipboardSwf = 'http://alexgorbatchev.com/pub/sh/current/scripts/clipboard.swf';
SyntaxHighlighter.all();
</script>
<!-- SYNTAX HIGHLIGHTER ENDS -->
</head>
I'm surprised that some people don't know basic bit tricks. For example, counting the number of '1' bits in an int may be done as follows:
<pre class="brush:c">
int get_nbits(unsigned x) { /* unsigned: ">>" on a negative int would never reach zero */
int cnt = 0;
while(x) {
cnt += x & 1;
x >>= 1;
}
return cnt;
}
</pre>
<p>
It's crude and inefficient. Are there any ways to make it faster? Here are a couple of them (and yes, suggest more of 'em in the comments).</p>
<h3>boost 1)</h3>Look here: suppose we have a number which is a power of 2. Such numbers have a great property: if you perform '&' with number-1, you always get zero. This is because number-1 has "ones" exactly in the places where the power of two has zeros, and a zero in the place of its single one.</p>
Look at this: the number 16 and 16-1=15:
<pre>
10000
& 01111
-------
0
</pre>
So what does it mean? For an arbitrary number, x & (x-1) always <b>eliminates</b> the lowest 'one' bit. And the example above may be easily rewritten as:
<pre class="brush:c">
int count_bits(unsigned x) {
int cnt = 0;
while (x) {
x &= (x-1);
cnt++;
}
return cnt;
}
</pre>
<p>So now we perform no more loop iterations than there are "ones" in the number. OK, but it's still not efficient.</p>
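<p>To see the trick at work, here is a tiny trace helper (the name is made up for illustration) that records the value of x after each <code>x &= x - 1</code> step:</p>

```cpp
#include <vector>

// Returns the successive values of x while its lowest set bit is
// cleared one at a time; the number of steps equals the number of '1' bits.
std::vector<unsigned> clear_lowest_trace(unsigned x)
{
    std::vector<unsigned> steps;
    while (x) {
        x &= x - 1;          // clears the lowest set bit
        steps.push_back(x);
    }
    return steps;
}
```

<p>For x = 44 (binary 101100) the trace is 40 (101000), 32 (100000), 0: three steps for three bits.</p>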
<h3>boost 2)</h3>Why should we count bits using N loop iterations when we can precompute them instead? Using a precomputed table for bytes, we may rewrite our example as follows:
<pre class="brush:c">
static unsigned char nbits[256] = {
/* 0 - 15 */ 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
/* 16 - 31 */ 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
/* 32 - 47 */ 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
/* 48 - 63 */ 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
/* 64 - 79 */ 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
/* 80 - 95 */ 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
/* 96 - 111 */ 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
/* 112 - 127 */ 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
/* 128 - 143 */ 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
/* 144 - 159 */ 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
/* 160 - 175 */ 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
/* 176 - 191 */ 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
/* 192 - 207 */ 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
/* 208 - 223 */ 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
/* 224 - 239 */ 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
/* 240 - 255 */ 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
};
int count_bits(unsigned x)
{
return nbits[x & 0xFFU] + nbits[(x >> 8) & 0xFFU] + nbits[(x >> 16) & 0xFFU] + nbits[x >> 24]; /* assumes a 32-bit unsigned */
}
</pre><br />
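<p>The table doesn't have to be typed by hand. A sketch that fills it at startup, relying on the fact that i has as many set bits as i >> 1, plus one if its lowest bit is set. The names are illustrative, and std::uint32_t is used to make the 32-bit assumption of the lookup explicit:</p>

```cpp
#include <cstdint>

static unsigned char nbits_gen[256];

// Fill the per-byte table: bits(i) = bits(i >> 1) + (lowest bit of i).
void init_nbits()
{
    nbits_gen[0] = 0;
    for (int i = 1; i < 256; ++i)
        nbits_gen[i] = static_cast<unsigned char>(nbits_gen[i >> 1] + (i & 1));
}

// Same byte-wise lookup as above, but using the generated table.
int count_bits_table(std::uint32_t x)
{
    return nbits_gen[x & 0xFFu] + nbits_gen[(x >> 8) & 0xFFu] +
           nbits_gen[(x >> 16) & 0xFFu] + nbits_gen[x >> 24];
}
```

<p>init_nbits() has to run once before the first lookup; the hand-written table trades that call for a static initializer.</p>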
<h3>boost 3) Using POPCNT from the Intel SSE4.2 instruction set</h3>You may find a brief description and links <a href="http://en.wikipedia.org/wiki/SSE4#POPCNT_and_LZCNT">here</a>.<br />
<br />
<h3>boost 3.2</h3>GCC has a set of built-in functions which expand to CPU instructions if available, or fall back to software (but efficient) implementations. As a final example, I prefer to use this code:<br />
<pre class="brush:cpp">
//
// Uses the 'POPCNT' instruction if the target supports it (e.g. built with
// -mpopcnt), or gcc's software fallback for 'popcount' otherwise
//
template<typename T> int popcnt_modern(T);
template<> int popcnt_modern(unsigned x) { return __builtin_popcount(x); }
template<> int popcnt_modern(unsigned long x) { return __builtin_popcountl(x); }
template<> int popcnt_modern(unsigned long long x) { return __builtin_popcountll(x); }
</pre>
<p>However, it seems that gcc generates inefficient code unless your CPU provides SSE4.2. It's hard to give numbers here, but trust me: the code with the byte table is slightly faster.</p>dkrotxhttp://www.blogger.com/profile/08519715678786396335noreply@blogger.com0tag:blogger.com,1999:blog-897881075540338440.post-29696932748139072382011-10-18T22:41:00.000+03:002012-08-12T19:06:44.431+03:00eval is your friend<head>
<!-- SYNTAX HIGHLIGHTER BEGINS -->
<link href='http://alexgorbatchev.com/pub/sh/current/styles/shCore.css' rel='stylesheet' type='text/css'/>
<link href='http://alexgorbatchev.com/pub/sh/current/styles/shThemeEclipse.css' rel='stylesheet' type='text/css'/>
<script src='http://alexgorbatchev.com/pub/sh/current/scripts/shCore.js' type='text/javascript'></script>
<script src='http://alexgorbatchev.com/pub/sh/current/scripts/shBrushCpp.js' type='text/javascript'></script>
<script src='http://alexgorbatchev.com/pub/sh/current/scripts/shBrushBash.js' type='text/javascript'></script>
<script language='javascript'>
SyntaxHighlighter.config.bloggerMode = true;
SyntaxHighlighter.defaults['gutter'] = false;
SyntaxHighlighter.defaults['toolbar'] = false;
SyntaxHighlighter.config.clipboardSwf = 'http://alexgorbatchev.com/pub/sh/current/scripts/clipboard.swf';
SyntaxHighlighter.all();
</script>
<!-- SYNTAX HIGHLIGHTER ENDS -->
</head>
<p>I like to use expressions like {1..100} in bash: they make tedious FOR loops much easier. But you can't use variables in them, not even readonly ones, because bash performs brace expansion before variable expansion. So you <b>cannot</b> simply write this:</p>
<pre class="brush:bash">
readonly START_ID=200
readonly END_ID=250
for i in {{1..100},{$START_ID..$END_ID}}; do
# do something using $i
done
</pre>
<p>
But (thank God) there is a pretty workaround: eval. So the construction above may be rewritten as
<br />
<pre class="brush:bash">
for i in $( eval echo {{1..100},{$START_ID..$END_ID}} ); do
# ...
done
</pre>dkrotxhttp://www.blogger.com/profile/08519715678786396335noreply@blogger.com0