A different kind of filesystem
I have a fair amount of data that needs archiving. It's on the order of a few tens of gigabytes, small enough to fit on a hard disk, but I want it backed up and available on at least two computers in different locations (home and office). Snapshots and rollbacks would be nice; also, I'd rather have peer-to-peer synching than a central master copy.
If you're like me, this is starting to sound like a DVCS. Trouble is, version control systems are designed to have a working copy outside the repository, so my data would essentially be duplicated; storing large files would also be tricky.
DVCS for data storage
So what about using a DVCS to build a filesystem on top? FUSE is perfect for the front-end, and a bare Git repository can store the data. This has several unique advantages:
- storage is abstracted as blobs, trees and commits;
- we get snapshots and rollbacks for free;
- efficient synching between repositories, also for free.
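To make the "blobs" point concrete, here is a minimal sketch of how Git derives a blob's ID from its content. It assumes the default SHA-1 object format, and the helper name `git_blob_id` is made up for illustration; it is not part of Git, Dulwich or SpaghettiFS.

```python
import hashlib


def git_blob_id(data: bytes) -> str:
    """Return the ID Git gives a blob with this content (SHA-1 object format)."""
    # Git hashes a short header ("blob <size>\0") followed by the raw bytes.
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


# The ID depends only on the content, so identical data is stored once
# and two repositories can tell cheaply which objects the other is missing.
first = git_blob_id(b"some archived bytes")
second = git_blob_id(b"some archived bytes")
print(first == second)
```

Because objects are addressed by content, snapshots that share most of their data also share most of their objects.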
So, after a quick and fruitless search for existing implementations, I set off writing my own, using Python, Dulwich and fusepy.
SpaghettiFS
The first working code stored files as blobs, and folders as trees, just like a normal Git repository - it worked, but was inefficient for large files. Now files are split into small blocks, linked from a tree that is essentially an inode. Folder entries reference the inodes, just like in a typical filesystem.
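The block-and-inode idea can be sketched roughly as follows. This is a toy model under stated assumptions, not SpaghettiFS's actual layout: a dictionary stands in for the bare Git repository, the "inode" is just an ordered list of block keys, and `ObjectStore`, `write_file`, `read_file`, `overwrite` and the block size are all hypothetical names and values.

```python
import hashlib

BLOCK_SIZE = 64 * 1024  # hypothetical block size, chosen only for illustration


class ObjectStore:
    """Toy content-addressed store standing in for a bare Git repository."""

    def __init__(self):
        self.objects = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha1(data).hexdigest()
        self.objects[key] = data
        return key

    def get(self, key: str) -> bytes:
        return self.objects[key]


def write_file(store, data, block_size=BLOCK_SIZE):
    """Split data into blocks; the returned list of keys plays the inode's role."""
    return [store.put(data[i:i + block_size])
            for i in range(0, len(data), block_size)]


def read_file(store, inode):
    """Reassemble a file by concatenating its blocks in order."""
    return b"".join(store.get(key) for key in inode)


def overwrite(store, inode, offset, payload, block_size=BLOCK_SIZE):
    """Overwrite bytes in place; assumes a non-empty payload that fits in the file.

    Only the blocks that overlap the written range are stored again.
    """
    new_inode = list(inode)
    first = offset // block_size
    last = (offset + len(payload) - 1) // block_size
    for index in range(first, last + 1):
        block = bytearray(store.get(inode[index]))
        block_start = index * block_size
        lo = max(offset, block_start)
        hi = min(offset + len(payload), block_start + len(block))
        block[lo - block_start:hi - block_start] = payload[lo - offset:hi - offset]
        new_inode[index] = store.put(bytes(block))
    return new_inode
```

With whole-file blobs, changing one byte of a large file means storing the entire file again; with blocks, only the touched blocks and the small inode change.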
So it works. Several gigabytes of my data already live happily in such a filesystem. Synching is nearly painless. Reading and writing are still slow, but usable, and there's plenty of opportunity for improvement. Some POSIX filesystem features (symlinks, rename, permissions) have yet to be implemented.
Check out the code on GitHub, feel free to use the issue tracker, and please let me know if you find SpaghettiFS useful.
Test-driven development
There's an interesting discussion going on about TDD: Tim Bray, Uncle Bob, Peter Seibel. I'm a fan of TDD, though no big expert, and most of the code I'm proud to have written was test-driven. So here's my two cents.
Under the hood
That's where test-driven coding shines. Problems are well-defined and clearly bounded, you want to write simple APIs to keep things modular, so you have natural test points for the code. That's where the tricky refactoring happens too, because there's lots of external code using such a module, so it's not obvious what could break. Bugs have room to hide.
TDD helps separate the concerns of design and implementation: you design (think about what to build, write tests) and then implement (make those tests pass). You get regression tests basically for free, and you can add features incrementally.
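As a small illustration of that split, the sketch below writes the test first and the implementation second. The `slugify` function is a made-up example, not taken from any of the linked posts.

```python
import re


# Step 1 (design): the test states what the function should do
# before any implementation exists.
def test_slugify():
    assert slugify("Hello World") == "hello-world"
    assert slugify("  TDD: a primer ") == "tdd-a-primer"
    assert slugify("") == ""


# Step 2 (implementation): the simplest code that makes the test pass.
def slugify(title: str) -> str:
    words = re.findall(r"[a-z0-9]+", title.lower())
    return "-".join(words)


# The same test stays around afterwards as a regression test.
test_slugify()
```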
Bodywork
There's also front-end work. You're not exposing a simple and clear API; rather, you're designing user interaction. You want to get all the little details right, keep the interface simple and intuitive and consistent.
TDD doesn't work so well here. For one thing, tests at this level are harder to write, so they take more time. Also, you can't separate design from implementation, because the design evolves as you experiment; in other words, the implementation is the design. There's not much external code depending on front-end code, so you're unlikely to break something inadvertently; any mistake you make is reflected immediately in the user interface. Still, some level of testing is useful, to make sure things don't break in obvious ways, e.g. checking that a rendered web page does contain the expected message somewhere in the HTML.
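A smoke test of that kind might look like the sketch below. The `render_greeting` function is a hypothetical stand-in for whatever actually renders the page; the test only checks that the message is there, not how the page is laid out.

```python
from html import escape


def render_greeting(name: str) -> str:
    """Hypothetical view function standing in for a real template."""
    return ('<html><body><p class="msg">Welcome back, {}!</p></body></html>'
            .format(escape(name)))


def test_greeting_page_shows_message():
    page = render_greeting("Ana & Bob")
    # Loose check: the message appears somewhere in the HTML,
    # so markup and styling can keep changing without breaking the test.
    assert "Welcome back, Ana &amp; Bob!" in page


test_greeting_page_shows_message()
```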
Use your judgement
TDD is a great tool, and like any tool, it can be over-used. Like Uncle Bob says, keep a small roll of duct tape around. But writing dependable library code without TDD, or at least very good test coverage, that's just asking for trouble.
Durus
I've recently been using this nifty little object database, Durus, on several small projects. It's dead simple: the only documentation you need to get started and grok the concepts is their presentation from PyCon 2005. Having worked with ZODB helps – Durus draws heavily from that.
So Durus gives you two basic building blocks. You have the DB connection and you have the Persistent class. The DB connection is fairly obvious: open a database file (or connect to a server), retrieve data, commit, rollback, close. Persistent is where the magic lives: inherit from Persistent and your object will be a first-class database container. When you make changes, Persistent will be notified automatically; commit or abort and the data gets saved or reverted. There's one caveat: Persistent can't detect changes inside mutable attributes (like a list or dict), but there are simple ways around this.
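The caveat about mutable attributes is easier to see in a toy model. The sketch below is not Durus's implementation; `Tracked` and `Playlist` are hypothetical classes, and the sketch assumes change detection hooks into attribute assignment, which is one common way to build it.

```python
class Tracked:
    """Toy stand-in for a persistent base class (not Durus's own code)."""

    def __init__(self):
        object.__setattr__(self, "changed", False)

    def __setattr__(self, name, value):
        # Any attribute assignment marks the object as changed.
        object.__setattr__(self, name, value)
        object.__setattr__(self, "changed", True)

    def mark_saved(self):
        object.__setattr__(self, "changed", False)


class Playlist(Tracked):
    def __init__(self):
        super().__init__()
        self.tracks = []  # a plain, mutable list
        self.mark_saved()


playlist = Playlist()

# In-place mutation: the list changes, but __setattr__ never runs.
playlist.tracks.append("song.ogg")
missed = playlist.changed

# One workaround: rebind the attribute so the assignment is seen.
playlist.tracks = playlist.tracks + ["other.ogg"]
detected = playlist.changed

print(missed, detected)
```

The other usual workarounds are to flag the change by hand after mutating, or to use a container that does the flagging itself.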
Several power tools come in the package. There's PersistentList and PersistentDict (which behave like list and dict, but detect when you make changes); BTree, which is like PersistentDict, except it's much better for handling a large number of entries; and ComputedAttribute, which is a fancy way of saying "don't persist this piece of information, I want to compute it myself". There's also machinery to run a server with multiple clients, but I haven't used any of that.
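The idea behind a change-detecting container can be sketched as a thin wrapper that reports every in-place modification. This is a toy model, not how PersistentList is written: `NotifyingList` and its callback are hypothetical, and only a few list methods are covered.

```python
class NotifyingList:
    """Toy list wrapper that reports in-place changes through a callback."""

    def __init__(self, on_change, items=()):
        self._items = list(items)
        self._on_change = on_change

    # Mutating operations notify the owner.
    def append(self, item):
        self._items.append(item)
        self._on_change()

    def __setitem__(self, index, value):
        self._items[index] = value
        self._on_change()

    # Read-only operations do not.
    def __getitem__(self, index):
        return self._items[index]

    def __len__(self):
        return len(self._items)

    def __iter__(self):
        return iter(self._items)


changes = []
tracks = NotifyingList(lambda: changes.append("changed"))
tracks.append("song.ogg")      # reported
tracks[0] = "other.ogg"        # reported
first_track = tracks[0]        # a read, not reported
print(len(changes))
```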
Mostly Durus just keeps out of your way. The data model is basically Python's data model. No need to map table columns to object fields; no serializing to a foreign format (JSON, XML); just plain Pickle. You still need to have some level of understanding of how the bits are being stored, and object databases are weird animals at that, but it's much more fun than hand-optimizing SQL queries.
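To show what "just plain Pickle" buys you, here is a minimal round trip using only the standard library. It doesn't touch Durus at all, and the record contents are made up; the point is that ordinary Python values go in and come back with their types intact, with no schema or mapping layer in between.

```python
import pickle
from datetime import date

# An ordinary nested Python structure, including types JSON has no notion of.
record = {
    "title": "backup log",
    "created": date(2010, 3, 14),
    "sizes": (10, 20),
    "tags": {"home", "office"},
}

payload = pickle.dumps(record)    # bytes, ready to be written to storage
restored = pickle.loads(payload)  # an equal but separate copy

print(restored == record, type(restored["sizes"]).__name__)
```

The flip side is that the stored bytes are tied to Python and to the shape of your classes, which is part of why it still pays to understand how the bits are being stored.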
