Wide Finder: Many Cores

This is the third post about the Wide Finder project; see the previous articles for context.

Today's numbers

So last time I got some embarassing numbers caused by inefficient communication between the master and worker processes. I switched to pickle/unpickle using temporary files (the overhead is very low) and things are much better. I also wrote a good deal of timing code, to see how much time is spent on each part of the algorithm. This information is proving invaluable.

On to some results. I've ran the test on a 10 million lines file - 1.9GB worth of text - on Tim's T2000.

[30 threads, 30 jobs]

real    1m56.242s
user    28m56.106s
sys     0m15.322s

Notice that the user+system time is a little over 30 minutes; wallclock time is almost 2 minutes. About one minute of wallclock time is spent by the 30 map jobs doing their thing, each crunching a 64MB piece of the file. Another minute is spent by the master thread, reducing the results to a central dictionary. This second half of time depends on the number of results to be reduced and on the size of individual results. I experimented a bit: using fewer (larger) pieces seems to always be faster, and the time needed to do the reduce operation is far longer than pickle/write_to_file/read_file/unpickle. Also, this reduce time does not increase as more results are reduced - the size of the central dictionary does not seem to influence any single reduce operation. Finally, producing the sorted report takes 8 seconds. I'll have to implement a faster algorythm to get the top 10 results - sorting the whole dictionary is not needed.

Another weird thing is that I had to modify the pprocess.py library to work on Solaris. The problem seems to be that, unlike Linux, when using poll to check for data on sockets, Solaris doesn't say that a socket was closed, so I had to make pprocess manually close and forget about used sockets. This is not great, and my quick-and-dirty fix breaks some of the functionality of pprocess, but it works.

Here is the code so far: wide_finder.tgz

Next steps

One thing that bothers me is that map tasks parse their data at about 1.3 megabytes per second, which adds up to about 40MB/s - far less than the 150MB/s that this machine is supposed to achieve. I'm going to run some profiling code on an individual job, maybe I can speed it up.

There seems to be an error in the data file, it crashed my program mid-way while running on the full 45GB dataset. I need to add some error checking and useful reporting before I can run my code on that file again.

I'm not entirely sure if one thread is enough to reduce the results of 30 mapper threads, especially if I can speed those up. Right now, slicing the dataset in 64MB chunks, it's load is about 100%, and with larger chunks it will probably be lower, but we'll have to see.

Created:
3 Jun 2008, 12:55

Reader Comments

Ray Waldin [4 Jun 2008, 00:48]

Very nice. I'm going to borrow a couple of ideas from your code. Specifically, the fact that you never parse the status integer and instead just do a switch on string cases, and also you don't parsing the bytes integer except in cases where it's used. Double duh on my part for not doing both of those already!

Also i think i know why your run may have crashed. There are some log entries that don't have the HTTP/1.0 part of the request field, or what would be in group 8 of your regex match. That makes your bytes variable contain the referral and you're blindly parsing it as an int. I bet that's what it is anyway.

I'll write up a post about the ideas i'm "borrowing" from you shortly :)

alex [4 Jun 2008, 11:22]

Borrow away; anyway I think I got them from Tim's code, but I'm not sure.

Thanks for the heads up about the weird log entries, that's probably what bit me. I'll have another go tonight and see what I get.

« previous
(eLiberatica 2008)
next »
(Wide Finder: Looking for Bottlenecks)