The math-free guide to extreme lossless compression.
So you have some data, and you want to make it smaller without losing detail. Easy enough - just zip it up, right?
But you want to make it a lot smaller, without needing to study all manner of esoteric technologies, and you need to be able to get this back in ten years without having to get ancient software running. Processor time is cheap now, but storage isn't - so what do you do when you really need to squeeze everything you possibly can into limited space?
So here's a very quick guide telling you how to achieve the best lossless compression ratio, when time is no object. If you use these techniques, you can get compressed file sizes much smaller than simply putting things in a ZIP will achieve - but at the expense of requiring more effort, and a great deal more processor time to compress.
What not to use.
Lossless compression has a long history, and a few elements of that history still hang around. These were good once - a lot of them were historically revolutionary - but can't compete with more modern algorithms. So the first thing to do is stop using any of these.
Zip: The venerable format, a classic. Revolutionised the world of compression. Ultra-portable. But with a slight problem: The algorithms you can be sure it'll support are ancient, and will be easily outperformed by more modern developments. It also lacks support for solid encoding (allowing a file to use previous files to improve compression). Some newer software (Winzip 12.1+) supports newer algorithms, but there is no assurance that older decompressors will take these, and they still aren't the best around.
gzip: It's a handy utility, open, and good enough for most purposes - but the compression is no longer the best around. Technology moved on.
bzip2: Better compression than gzip on most things, but still not the best. It does have the advantage of running quite fast, as the algorithm is very easily parallelizable - there's a multithreaded version, pbzip2, which scales almost linearly with the number of processor cores. That can make it one of the faster compressors for the ratio on a multicore system, so if speed is important you may still want to choose this over something like 7z. Even though 7z will achieve a better compression ratio, it could take hours to work through a few gigabytes of data.
RAR: Replaced ZIP for many people. In technological terms, this is a great improvement. Compression ratio is higher on just about any file, plus it supports a lot more features. It's also well-known for supporting very strong encryption. A good format certainly, but again it has a critical flaw: it's proprietary, and cross-platform support is lacking. It's also not the absolute best around, though it's not far off. In terms of compression alone, it's usually inferior to 7z - but not by much, and only because 7z allows the use of some 'ridiculously slow' settings to edge ahead.
ACE: Slightly better compression than RAR, but just as proprietary, only with even less documentation and cross-platform support. Still inferior to 7z, so stay away. Rarely seen these days.
General compression software.
7zip/7z. All the time. There's little to debate on this - 7z wins by every standard.
- It's an open standard, with open-source software. It'll always be readable.
- Feature support is all there. Spanning archives, solid compression, encryption, ridiculously huge files - it'll do it all.
- Compression, on defaults or 'high', is on a par with RAR - but that's without the use of a very large dictionary size. Set that, and it beats RAR easily.
7zip has a hidden advantage over RAR: The ability to set a very high dictionary size. This can substantially boost compression of large files at the expense of processor and memory requirements. If you look in the 7z man page, you will see an example of 'extreme' settings given: "7z a -t7z -m0=lzma2 -mx=9 -mfb=64 -md=32m -ms=on archive.7z dir1" These settings aren't actually that extreme: The key is that -md=32m. That parameter sets the dictionary size, and turning it up higher will improve compression on large files substantially (Up to the size of the input - anything past that is useless). 256m can be a good choice - settings beyond that tend to cause even high-spec PCs to run out of memory, but if your PC isn't up to it you might have to make do with a lower value.
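Putting that together, the man-page example with the dictionary raised to 256 MB becomes the following - purely a settings tweak, so adjust -md downwards if your machine runs short of memory:

```shell
# Maximum-effort 7z settings: LZMA2, level 9, 256 MB dictionary,
# solid archive. Expect this to need several gigabytes of RAM.
7z a -t7z -m0=lzma2 -mx=9 -mfb=64 -md=256m -ms=on archive.7z dir1
```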
The -ms=on enables 'solid compression.' That just means the compressor will handle all the files within the archive as one solid lump of data, so it'll be better able to find redundancies between files. Compression goes up, though it also eliminates all hope of recovering part of the contents if the archive is later corrupted.
There is some debate about the relative merits of the LZMA vs LZMA2 algorithm. As the names imply, they are related. In terms of compression, they are too close to call either one superior - one rarely has an advantage of more than one percent over the other, and which wins depends upon the input. LZMA2 does have a performance advantage, as it is much better able to benefit from multi-core environments.
Another option is xz. This uses the same algorithm (LZMA2) as 7z, but without the elaborate container format - it's nothing but a compressed stream and minimal header. Much like gzip/bzip2 - if you want to store folder structures you usually use tar to achieve that, combining them to give a .tar.xz file. Why not just use 7z? Because xz can be used for streaming stdin to stdout, which 7z can't. There are times this can come in handy - for example, you can use tar to read files to backup, pipe this into xz to compress it, pipe that into gnupg to encrypt it, and write that to the destination - saving yourself an intermediate file. The extreme memory-is-cheap compression settings for xz are '-e --lzma2=preset=9,dict=256M' - as with 7z, adjust the dict= number according to the capabilities of your hardware. Note that the xz container doesn't support LZMA1.
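The streaming backup described above might look like this - the paths and the gpg recipient are placeholders:

```shell
# Stream a directory through tar, xz and gnupg without creating any
# intermediate files. 'backup@example.org' is a placeholder key ID.
tar cf - /home/user/documents \
  | xz -e --lzma2=preset=9,dict=256MiB \
  | gpg --encrypt --recipient backup@example.org \
  > documents.tar.xz.gpg
```

Decryption and extraction is the same pipeline in reverse: gpg to decrypt, xz -d to decompress, tar to unpack.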
Advice on specific types.
| File type | Is really | Appropriate procedure |
| --- | --- | --- |
| .docx .xlsx .pptx | A .zip container with a very specific file layout inside | If it's huge, it probably contains large embedded images. Examine it and determine the appropriate action. If the size really is due to large amounts of text, you can use the 'stretchzip' script then compress the result with 7z. |
| .cbz | A .zip full of images, not too fussy about type. Usually JPEGs, sometimes PNG, rarely others. Alphabetical order determines page order. | Examine. If it contains .bmp, convert to PNG. Rezip with the 'store' compression type - zip compression won't gain anything on JPEG or PNG, so you rely on the image compression of the constituent files. You can use the 'stretchzip' script. |
| .cbr .cb7 .cba .cbt | .rar, .7z, .ace and .tar respectively, with the same file layout as .cbz | Extract, convert to .cbz, and treat as above. The cb7, cba and cbt forms are obscure and ill-supported, so they should be converted to (store) cbz. |
| .jar | A .zip, but very fussy about every detail of packing | Don't try to poke around inside this unless you know exactly what you are doing - you'd just break it. A zip extraction tool can extract .jar files, but cannot create them. |
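The 'stretchzip' script mentioned in the table isn't reproduced here, but its job is simple enough to sketch: unpack a zip-based container and repack it with 'store' (no compression), so the outer 7z gets to do all the compressing. A minimal stand-in, assuming python3 is available - the function name is mine, not the original script:

```shell
# Repack a zip-based container (.docx, .cbz, ...) in place using 'store'
# mode. An assumed reconstruction of the stretchzip idea, not the original.
stretchzip() {
    python3 - "$1" <<'EOF'
import os, sys, zipfile

src = sys.argv[1]
tmp = src + '.tmp'
with zipfile.ZipFile(src) as zin, \
     zipfile.ZipFile(tmp, 'w', zipfile.ZIP_STORED) as zout:
    for item in zin.infolist():
        data = zin.read(item.filename)
        item.compress_type = zipfile.ZIP_STORED  # force 'store'
        zout.writestr(item, data)
os.replace(tmp, src)  # larger than before, but far more compressible
EOF
}
```

Run it as `stretchzip report.docx`, then add the result to a solid 7z archive.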
Sometimes you can just run 7z with -md=256m and enjoy your nicely compacted data, but there are two issues with this:
- Some data can be compressed far, far better with software designed specifically for that type of data. A bitmap image, for example, will compress well if placed in a zip or 7z file - but it'll compress far better still if converted to a PNG file, a format designed for images.
- Some files include their own compression capabilities. As a rule, you cannot compress data that is already compressed - but the file's own compression is likely inferior to what you can use. These files need to be expanded into a non-compressed form first - they'll get larger, but this will be more than offset by how much smaller they'll get when placed into a 7z archive.
For type-specific lossless compression, there isn't much to debate. The leading software is well agreed-upon.
- Images: PNG
- Audio: FLAC. Monkey's Audio can get a slightly better rate, but the improvement is too small to outweigh the format's other disadvantages.
- Raw video: Huffyuv, or x264 in its lossless qp=0 mode. Note that x264 is only bit-for-bit lossless if you keep the source colourspace - a colourspace conversion along the way introduces rounding errors, though even then it's very close indeed. Also, either of these is going to mean re-encoding and remuxing, which can sometimes be a tricky task in itself if you're dealing with a file containing multiple audio streams, subtitles, chapter markers, an embedded thumbnail or metadata.
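If you'd rather drive x264 through ffmpeg, a lossless re-encode looks something like this - the filenames are placeholders, and -qp 0 selects the lossless mode:

```shell
# Re-encode raw/Huffyuv video losslessly with x264 and remux into MKV.
# '-c:a copy' passes the audio stream through untouched.
ffmpeg -i capture.avi -c:v libx264 -qp 0 -preset veryslow -c:a copy capture.mkv
```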
If you have video that is simply too big to store without loss, you may find my guide to x264 extreme encoding useful. It lists various filters and settings you can use to get your video compact without losing much quality by throwing processor time at it. If you use those instructions with a very high quality setting (crf=5 or so) you can achieve near-perfect quality at a fraction of the storage requirements of an entirely lossless compressor.
I've written a program called BLDD that can do very well on disc images and virtual machines, or on tar files containing many repeating or similar files. It's not a true compressor, it's a block-level deduplicator designed to supplement something like 7z or xz - use BLDD first, then 7z, so you end up with an 'image.bldd.7z' file. It's also effective on very large TAR archives, as that format always stores content files starting on 512-byte block boundaries. On suitable input, it can be very capable indeed. It is highly effective at sorting out the unallocated-but-still-there sectors in filesystem images. It's a true lossless program, bit-for-bit, so handy if you need forensic backups.
As for files with their own compression, there are a few to look out for.
The most common offenders here are a number of Office file types which actually use .zip as a container - by informal convention, these usually have extensions ending in x such as .docx or .xlsx. You can extract these before reforming them into a new zip using the 'store' dummy-compressor mode. I've provided a useful little script for that. The resulting file will be larger than the input, but this should be more than offset by greater compressibility once you put them inside a .7z. You can very easily identify any ZIP file by just opening it up in a text editor - if the first two characters are 'PK' then it is probably, though not certainly, a ZIP file. You should be able to see some of the contained filenames a little further down, which is a very handy trick for telling which software to open an unknown file with.
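Checking those magic bytes can be scripted too. A tiny helper (the name is mine, not a standard tool):

```shell
# Return success if the file starts with the ZIP magic bytes 'PK'.
# Probably a zip container - though not certainly, as noted above.
probably_zip() {
    [ "$(head -c 2 "$1")" = "PK" ]
}
```

Use it as `probably_zip mystery.bin && echo "looks like a zip container"`.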
Another offender, less often encountered, is the .DMG file. These will be familiar to any OSX users, but rarely seen outside of that OS. The format is actually quite strange, being essentially a filesystem image repurposed as an archive container, but it does use compression. You can convert these into .bin with a utility called dmg2bin (what else?), but even that image format has barely any cross-platform support - so you may be better off simply extracting the contents and placing them into something like .7z instead.
Digital-format books, usually comics or graphic novels, are sometimes in .cbz or .cbr files - or occasionally .cb7, .cba or .cbt, though these are rare things indeed. All of these are actually container formats with a changed extension, in the same way that Office documents are really zip files. The CB z, r, 7, a and t are zip, rar, 7z, ace and tar respectively. If you change the extension you can look inside - the pages are sorted into alphabetical order by the viewing software for display. Usually the individual pages are either .JPG or .PNG files. There's not much you can do to make those smaller without loss, but on occasion you might find one containing BMP or TIFF which can be converted to PNG for a substantial reduction in size. You can also convert them into store-only CBZ, using the same script provided for Office documents. Zip compression is of no use on PNG or JPG files anyway, and converting them into store files means that duplicated pages between issues or different packagings of the same issue can be more easily identified, potentially leading to improved compression when storing a large collection inside a solid archive. Incidentally, if you ever do find one of the rare .cba files, you should convert it to .cbz - ACE is an obscure and obsolete format, and there is no guarantee it'll be at all easy to open in the future.
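Put together, rescuing a BMP-filled .cbr might look like this - it assumes your 7z build can read rar archives and that ImageMagick's mogrify and Info-ZIP's zip are installed, and the filenames are placeholders:

```shell
# Unpack the .cbr, convert any BMP pages to PNG, then repack
# everything as a store-only (uncompressed) .cbz.
7z x comic.cbr -opages
mogrify -format png pages/*.bmp && rm pages/*.bmp
(cd pages && zip -q -0 -X -r ../comic.cbz .)
```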
Some inputs also contain a lot of 'junk' data that you don't want, but is difficult to separate. Usually this is found on filesystem or drive images - there will still be data in unallocated sectors. If you are trying to compress a backup of an entire volume, this unallocated-but-still-present data is going to seriously increase the size of your compressed backup. You can fix this by simply zeroing unallocated space before backup begins. Which software to use for this depends on the filesystem. I threw together a tiny program to do just that for NTFS. Obviously this isn't an option if you're archiving backups for forensic purposes - for those, BLDD is your best option for compacting them and still getting bit-for-bit identical output.
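On a mounted filesystem, the zeroing itself needs no special tools: write zeros until the volume fills, then delete the file. A sketch - the optional size cap is my addition, mainly there so the function can be exercised safely:

```shell
# Zero out free space on the volume mounted at $1 by filling it with
# a file of zeros, then deleting that file. Pass an optional MiB cap
# as $2 (omit it to fill the whole volume; dd's 'disk full' error is
# expected and ignored).
zero_free_space() {
    dd if=/dev/zero of="$1/zero.fill" bs=1M ${2:+count=$2} 2>/dev/null || true
    sync
    rm -f "$1/zero.fill"
}
```

Run it as `zero_free_space /mnt/backup-volume` just before taking the image.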