Heterobak

Heterobak is a backup program quite different from most. It was created with the intent of backing up a very large (multi-terabyte) array onto a large number of small hard drives for offsite storage. This was a simple cost-cutting idea: I happened to have a box full of 250GB, 500GB and 1TB drives already.

My requirements are a little out of the ordinary for a backup program, but not exceptional:
- The backup media is old: hard drives that have seen years of active use. The software must therefore verify that all data copied can be read back.
- The drives are of a variety of sizes and brands, so the software must be able to effectively utilise mismatched media.
- Individual files or directories must be easy to restore, as well as the entire backed-up area.
- Due to the small size of the drives to be used as backup media, many will be needed. Carrying so many would be cumbersome, so once written a drive's physical presence must not be required until the time comes to restore data from it. Write, stick in cupboard, forget.
- Media rotation must be minimal, not regular: An occasional trip to put another drive in the off-site store is unavoidable, but there must be no need to 'cycle' media at fixed periods.
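
The read-back requirement above can be illustrated with a minimal Python sketch. This is not heterobak's actual code: `copy_and_verify` and `sha256_of` are hypothetical names, and comparing SHA-256 digests is an assumed way of checking the copy. A real implementation writing to ageing drives would also need to make sure the read comes from the disk rather than from the operating system's cache, which this sketch does not attempt.

```python
import hashlib
import shutil


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(src, dst):
    """Copy src to dst, then read dst back and compare digests.

    Hypothetical helper: raises IOError if the copy does not read back
    identically, otherwise returns the shared digest.
    """
    expected = sha256_of(src)
    shutil.copyfile(src, dst)
    actual = sha256_of(dst)
    if actual != expected:
        raise IOError("read-back mismatch for %s" % dst)
    return expected
```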

The basis of heterobak is separation of data from the index that says where the data belongs. Each backup pass produces an index file which lists the path, size and hash of every file in the area to be protected, but not the actual file contents. The contents are stored as objects on one or more removable hard drives, which are shared between all backup jobs. This means that each new backup job stores only the unique files which have not been previously backed up, but can be restored as a self-contained job without having to go through the full-differential-incremental chain: the best of both full and differential backup types. It also allows file-level deduplication to be handled implicitly. Versioning of files is possible if you keep the old index files, but that isn't an intended application.
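
As a rough illustration of what an index pass collects, here is a short Python sketch. It is not heterobak's own code or index format: `build_index` and `hash_file` are hypothetical names, the entries are plain tuples rather than a file on disk, and SHA-256 is assumed as the hash because that is what the storage container uses.

```python
import hashlib
import os


def hash_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_index(root):
    """Walk root and return (relative_path, size, sha256) for every file.

    Only metadata is recorded; file contents are not part of the index.
    """
    entries = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the walk order deterministic
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            entries.append((rel, os.path.getsize(full), hash_file(full)))
    return entries
```

Two files with identical contents get the same hash in the index, which is what makes the file-level deduplication implicit.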

These 'drives' are physical hard drives in my use, but the software is flexible. They could be any form of removable media, or files on a network share, or tapes.

The name 'heterobak' refers to the hard drives used as backup media. Heterobak was created in order to utilise a large pile of sizeable but surplus hard drives, mostly 500GB or 1TB - each far less than the data to be backed up, but collectively of sufficient capacity. There is no need for the containers to be the same size, or the same type - if you wished you could split your backup over removable hard drives, regular files, USB sticks, zip disks and floppies. They can be different sizes or brands of drive, and thus the name.

Only one of the backup volumes need be connected. The others may be safely stored disconnected or offsite. All heterobak needs to keep is a 'volume list file' for each volume, listing the hashes of all objects stored in that volume, so it can tell what doesn't need adding to the current one.

The big advantage of this approach is in media efficiency. There is no fixed schedule of media rotation. You simply need one volume connected (this can be an entire removable drive, or a file on a conventional filesystem) whenever you run a backup job. When this eventually gets filled you remove it, put it in a cupboard somewhere, and connect a new one. Filling can take a long time, as only new unique files get added. Old volumes can be retired by simply deleting their corresponding volume list file - any objects thus lost and still required will be added to the current volume during the next backup pass, allowing for media rotation.
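
The decision about which objects still need writing can be sketched in a few lines of Python. This is an illustration under assumptions, not heterobak's implementation: the list file format (one hex hash per line) and the names `load_volume_lists` and `objects_to_store` are hypothetical, and index entries are taken to be (path, size, hash) tuples.

```python
def load_volume_lists(paths):
    """Return the set of hashes named in the given list files.

    Assumed format: one hex digest per line, blank lines ignored.
    """
    known = set()
    for path in paths:
        with open(path, "r", encoding="ascii") as handle:
            for line in handle:
                line = line.strip()
                if line:
                    known.add(line)
    return known


def objects_to_store(index_entries, known_hashes):
    """Return {hash: path} for objects that no listed volume holds.

    Each hash appears once, so duplicate files are stored only once.
    """
    pending = {}
    for rel_path, _size, digest in index_entries:
        if digest not in known_hashes and digest not in pending:
            pending[digest] = rel_path
    return pending
```

Retiring a volume then amounts to leaving its list file out of `paths`: any object that only that volume held turns up in the pending set on the next pass.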

Restoring a backup requires the index file, plus each of the backup containers. The required objects are taken from each in turn and, using the index file, copied back to their appropriate locations as restored files. The contents of the backup media can be restored one-by-one, or concurrently - it just depends how many SATA ports and USB docks you have.
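
A restore pass over one container might look roughly like the following Python sketch. It is only an illustration: `restore_from_volume` is a hypothetical name, the index is assumed to be a list of (path, size, hash) tuples, and the container is stood in for by a plain dictionary mapping hash to bytes rather than a real volume.

```python
import os


def restore_from_volume(index_entries, volume, target_root, restored=None):
    """Restore every indexed file whose object is present in volume.

    volume is a stand-in mapping of hex digest -> file contents.
    Returns the set of relative paths restored so far, so the same set
    can be passed in again when the next container is connected.
    """
    restored = set() if restored is None else restored
    for rel_path, _size, digest in index_entries:
        if rel_path in restored or digest not in volume:
            continue
        dest = os.path.join(target_root, rel_path)
        os.makedirs(os.path.dirname(dest) or ".", exist_ok=True)
        with open(dest, "wb") as handle:
            handle.write(volume[digest])
        restored.add(rel_path)
    return restored
```

Calling it once per container, passing the returned set back in each time, restores the job volume by volume; the restore is complete when every path in the index is in the set.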

Heterobak uses a storage container developed for the purpose, the 'minimal object file system.' Files - referred to as 'meatballs' within MOFS - are retrievable by SHA256 hash. You can store the backup containers as files or on block devices, including dm-crypt loopback devices.

Heterobak was created as a personal project, for the personal use of one person. It's been tested backing up a very large collection of accumulated documents, family photos, silly internet pictures and archived video from YouTube - but you'd be crazy to depend upon it for any critical business use.