Wide Finder: Going Parallel
So I fiddled with the Wide Finder script a bit. First, I abstracted out the parsing and reporting logic: now I have a LogCounter class that can understand, and count, log entries. It can also merge the results of running several instances of the class on different parts of the log file. So "all" I have to do is run instances of LogCounter in different threads/processes, and merge the results.
Now I need a way to read into the log file at random places - a combination of seek() and readline() that can work around landing in the middle of a line when starting a chunk. Ten lines of python code did the trick, verified by more than 40 lines of test code. Seems to work well on the test data too, using this function didn't introduce any errors in the report.
Now for the parallel magic. I've settled on pprocess because this library supports message passing between processes using Python's pickle/unpickle (serialization built into the language - Python FTW!). Threading in Python is not an option because of the Global Interpreter Lock, so the only way to usefully parallelize this task is with multiple processes. Anyway, the interface is straightforward: a simple way of using pprocess is to wrap a worker function in a pprocess function. Call it normally and the library takes care of process spawning and communicatioin. I use a slightly more advanced method: I queue all the jobs asynchronously and then collect them in a loop.
I wonder how far that approach will take me. The main bottleneck is probably going to be passing partial statistics back to the main thread. Pickle is about as fast as it can get, but there's the overhead of inter-process communication, and those partial results might get pretty big. So, let's run some numbers.
[single-thread implementation]
real 0m1.269s
user 0m1.218s
sys 0m0.050s
[using pprocess, 2 cores, 4 jobs queued]
real 0m4.898s
user 0m2.949s
sys 0m4.487s
Ok, so there's something seriously wrong with that. User time more than doubled, and system time is enormous. I checked the pickled data size, and it was less than 300KB. This was on my MacBook, and I suspected MacOS was not that great at inter-process communication in general, so I ran the same tests on a quad-core Opteron (but I told pprocess to only use 2 cores).
[single-thread implementation]
real 0m1.431s
user 0m1.428s
sys 0m0.004s
[using pprocess, 2 cores, 4 jobs queued]
real 0m1.486s
user 0m1.844s
sys 0m0.732s
That's more like it. System time is still worrying, but at least it's not downright scary. So let's see what happens when we scale it up.
[using pprocess, 4 cores, 4 jobs queued]
real 0m1.225s
user 0m1.764s
sys 0m0.756s
[using pprocess, 4 cores, 16 jobs queued]
real 0m1.398s
user 0m1.836s
sys 0m0.964s
For the first time, the parallel version was faster the sequential version in wallclock time, that's encouraging. :) System time is still large, but not that large, and splitting the tasks in smaller pieces actually slows things down. So, it looks like the dataset is too small - it's a 100K line, 22MB file after all. Time to move on to larger things.
I'll upload the code soon, once I have it cleaned up and tidy.
Update: I ran the code on Tim's T2000 machine, and the results are really bad. IPC (inter-process communication) seems to be a problem there too. I've poked around in the pprocess.py library and it looks like it's not really meant to pass around anything large between tasks, there's no buffering of data. I was passing hundreds of kilobytes' worth of partial results from each task. So I'll have to rethink my methods of passing around partial data, maybe using temporary files?
Here are some numbers from the Sparc T2000. The third run, without IPC, is incomplete because it gathers no results from the tasks - it's just meant to emphasize the IPC problem.
[single-thread implementation]
real 0m8.005s
user 0m7.899s
sys 0m0.104s
[using pproces, 4 cores, 20 jobs queued]
real 0m30.943s
user 0m13.638s
sys 0m29.693s
[using pprocess, 4 cores, 20 jobs queued, no results]
real 0m2.657s
user 0m9.644s
sys 0m0.382s
Notice that the single-thread implementation on a T2000 core takes 8 seconds - for the same dataset, a MacBook core needs 1.3 seconds to do the job. So the T2000 has to make up for that by heavy parallelization.