Wide Finder: Looking for Bottlenecks

I've been doing some reading about the UltraSparc T1, the processor in our test server - it's an interesting piece of silicon. The cores are in-order (so single thread performance is going to be pretty bad) but they make up for that by having enough waiting threads to keep the execution units busy. There are 8 cores with 4 hardware threads each, so when one thread is waiting for data from memory, one of the other 3 threads can execute in its place.

I ran the code on the whole dataset with a bigger chunk size, 150MB - bigger chunks should be better. I didn't let it finish, both because it was taking too long, and because I knew it would eventually crash (see previous entry), but I wanted to get some numbers on a big dataset. The numbers were indeed interesting.

load average: 20.99, 21.32, 18.67

Turns out I'm only using 20 hardware threads for processing and one for the "reduce" operation. The bottleneck is in "reduce", it can only gather partial data 20 times faster than the "map" functions produce it. Last time that number was 30, I'm not sure what's going on, but I don't want to tweak that too much because I don't expect the ratio to get much better. Instead I want to utilize the inherent parallelism in the given problem and do 5 simultaneous "reduce" operations, one for each dictionary of reported data. So I'm looking at something like 30 "map" threads, 5 "reduce" threads and one master thread that only dispatches messages (in the form of files).

This is going to complicate my setup a little. The biggest problem is in the hack I did to pprocess.py to get it working on Solaris, because now I need to use some of that functionality I broke. Paul Boddie - the author of pprocess - said he would help, so how long can it take? :) Update: fix received, tested, works great, thanks Paul!

One problem I had in the previous installment was that the "map" code, even when maxing out all cores, only consumed data at a rate of about 40MB/s. I've tried a number of things to improve that number. The processors (actually, the OS sees them as processors but they are the 32 hardware threads) are busy at 100% so increasing the number of jobs to, say, 60 concurrent processes was obviously not useful. So I tried to optimize the "map" function itself.

Almost half of the time was taken up by the first regular expression, the one that parsed every line of the file. I replaced that with a split using ' ' (space) as a separator - it's functionally identical for the given data. I also used the defaultdict object instead of manually checking if an entry in a dict needs to be created. These two changes shaved about 25% off the execution time of the "map" function on my MacBook. Beyond that I don't think I can optimize the code much more, unless I'm missing something obvious. Also, the server has Python 2.4.4 and defaultdict was introduced in 2.5 - bummer.

Created:
6 Jun 2008, 20:22

Reader Comments

Paul Boddie [8 Jun 2008, 22:42]

I've mailed a possible solution to Alex regarding his problems with pprocess. Effectively, sockets created using socketpair appear to interact with poll in a broken way on Solaris, unless a stream of never-ending input conditions is considered sane behaviour when the other end of a socket is closed.

Tim [12 Jun 2008, 22:05]

Why don't you download and build the latest Python?

alex [13 Jun 2008, 08:21]

I will. Actually I'll do something even better: I'll build Stackless Python, that has some of the Erlang-like cheap threads and message passing built-in.

Dan Hatfield [28 Oct 2008, 01:50]

I'd be curious to see how Python via Psycho compared...

alex [28 Oct 2008, 09:05]

Hmm, Psyco seems just the thing to speed up the Python wide finder, once all the cores are put to work.

« previous
(Wide Finder: Many Cores)
next »
(The T-shirt)