The math-free guide to extreme lossless compression.
So you have some data, and you want to make it smaller without losing detail. Easy enough - just zip it up, right?
But you want to make it a lot smaller, without needing to study all manner of esoteric technologies, and you need to be able to get this back in ten years without having to get ancient software running. Processor time is cheap now, but storage isn't - so what do you do when you really need to squeeze everything you possibly can into limited space?
This guide is a compilation of techniques and guides to software which can be used to shrink data down using lossless compression. This is often a good idea - a one-time cost of processing time to shrink a file is a good trade for an ongoing saving in storage and transmission. As well as general-purpose compression utilities, there are also short guides to specific file types and software which can be used to compact these files in a transparent manner so that the smaller, compacted file may be used as a direct replacement for the larger original.
What not to use.
Lossless compression has a long history, and a few elements of that history still hang around. These were good once - a lot of them were historically revolutionary - but can't compete with more modern algorithms. So the first thing to do is stop using any of these.
Zip: The venerable format, a classic. Revolutionised the world of compression. Ultra-portable, supported on just about every OS and usually out of the box. But with a slight problem: The algorithms you can be sure it'll support are ancient, and will be easily outperformed by more modern developments. The principal algorithm used is DEFLATE, and though some newer software (Winzip 12.1+) supports newer algorithms, there is no assurance that other software will decompress these. Even those algorithms are still not the best available today. While ZIP is certainly a very capable format and will likely have a vital role to play for decades to come, if you are looking for the best possible compression then it simply isn't the best any more.
gzip: It's a handy utility, open, and good-enough for most purposes - but the compression is no longer the best around. Technology moved on. It uses the same DEFLATE algorithm as ZIP, and so will perform equally well.
bzip2: Better compression than gzip on most things, but still not the best. It does have the advantage of running quite fast, as the algorithm is very easily parallelizable - there's a multithreaded version, pbzip2, whose speed increases almost linearly with the number of processor cores. That can make it one of the faster compressors for the ratio on a multicore system, so if speed and compression ratio are both concerns you may still want to choose this over something like 7z/xz. Even though 7z will achieve a better compression rate, it could take hours to work through a few gigabytes of data.
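The reason bzip2 parallelizes so easily is that it works on independent blocks. A minimal sketch of the idea, using Python's standard bz2 module rather than pbzip2 itself: cut the input into chunks, compress each chunk as its own bzip2 stream on a separate thread, and concatenate the results. The function name and chunk size are choices made for this illustration only.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 900 * 1000  # illustrative: the largest block size bzip2 uses (level 9)


def parallel_bz2_compress(data: bytes, workers: int = 4) -> bytes:
    """Compress fixed-size chunks independently and concatenate the streams."""
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)] or [b""]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps the chunks in their original order.
        streams = list(pool.map(bz2.compress, chunks))
    # Concatenated bzip2 streams still form a valid .bz2 file.
    return b"".join(streams)
```

The output can be read back with bz2.decompress(), which accepts multi-stream input. The price of independent chunks is a slightly worse ratio than one continuous run would give.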
RAR: Replaced ZIP for many people. In technological terms, this is a great improvement. Compression ratio is higher than ZIP's on just about any input, plus it supports a lot more features. It's also well-known for supporting very strong encryption. A good format certainly, but again has a critical flaw: It's proprietary, and cross-platform support is lacking. It's also not the absolute best around, though it's not far off. In terms of compression alone, it's usually very slightly inferior to 7z - but not by far, and only because 7z allows the use of some 'ridiculously slow' settings to edge ahead.
ACE: Slightly better compression than RAR, but just as proprietary, only with even less documentation and cross-platform support. Still inferior to 7z, so stay away. Rarely seen these days, having been almost entirely displaced by RAR.
General compression software.
7zip/7z or xz. All the time. There's little to debate on this - they win by every standard. The two formats are closely related, both using the same compression. 7z is an archive format similar to ZIP in application, while xz is a pure stream compressor that can be a drop-in replacement for gzip or bzip2.
- It's an open standard, with open-source software. It'll always be readable.
- Feature support is all there. Spanning archives, solid compression, encryption, ridiculously huge files - it'll do it all.
- Compression, on defaults or 'high', is on a par with RAR - but that's without the use of a very large dictionary size. Set that, and it beats RAR easily.
7zip has a hidden advantage over RAR: The ability to set a very high dictionary size. This can substantially boost compression of large files at the expense of processor and memory requirements. If you look in the 7z man page, you will see an example of 'extreme' settings given: "7z a -t7z -m0=lzma2 -mx=9 -mfb=64 -md=32m -ms=on archive.7z dir1" These settings aren't actually that extreme: The key is that -md=32m. That parameter sets the dictionary size, and turning it up higher will improve compression on large files substantially (Up to the size of the input - anything past that is useless). 256m can be a good choice - settings beyond that tend to cause even high-spec PCs to run out of memory, but if your PC isn't up to it you might have to make do with a lower value.
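To see why the dictionary size matters, here is a small sketch using Python's standard lzma module, which implements the same LZMA2 algorithm (it is not 7z itself). Everything is scaled down so it runs in a moment: the input is 256 KiB of random bytes followed by an exact copy, compressed once with a 64 KiB dictionary and once with a 1 MiB dictionary. The helper name xz_size and the test data are invented for this example.

```python
import lzma
import os


def xz_size(data: bytes, dict_size: int) -> int:
    """Size of `data` as an .xz stream using the given LZMA2 dictionary size."""
    filters = [{"id": lzma.FILTER_LZMA2, "preset": 6, "dict_size": dict_size}]
    return len(lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters))


# 256 KiB of incompressible bytes, followed by an exact copy of itself.
block = os.urandom(256 * 1024)
data = block + block

small = xz_size(data, 64 * 1024)    # dictionary cannot reach back to the first copy
large = xz_size(data, 1024 * 1024)  # dictionary covers the whole input

print("64 KiB dictionary:", small, "bytes")
print("1 MiB dictionary: ", large, "bytes")
```

The compressor can only refer back as far as the dictionary reaches, so the small dictionary never notices that the second half repeats the first. This is also why raising the dictionary beyond the size of the input gains nothing.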
The -ms=on enables 'solid compression.' That just means that the compressor will handle all the files within the archive as one solid lump of data, so it'll be better able to find redundancies between files. Compression goes up, though it also eliminates all hope of recovering part of the contents if the archive is later corrupted.
There is some debate about the relative merits of the LZMA vs LZMA2 algorithm. As the names imply, they are related. In terms of compression, they are too close to call either one superior - one rarely has an advantage of more than one percent over the other, and which wins depends upon the input. LZMA2 does have a performance advantage, as it is much better able to benefit from multi-core environments.
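The effect of solid compression is easy to reproduce in miniature. This sketch, which uses Python's standard lzma module and made-up data rather than 7z, builds eight 'files' that share most of their content. It compresses them one at a time, then as a single concatenated lump.

```python
import lzma
import os

# Eight hypothetical files: 64 KiB of shared content plus 1 KiB unique to each.
shared = os.urandom(64 * 1024)
files = [shared + os.urandom(1024) for _ in range(8)]

# Non-solid: every file is compressed on its own, so the shared part is paid for 8 times.
separate = sum(len(lzma.compress(f)) for f in files)

# Solid: one stream over all files, so the shared part is stored once.
solid = len(lzma.compress(b"".join(files)))

print("separate:", separate, "bytes")
print("solid:   ", solid, "bytes")
```

The same structure explains the recovery problem: in the solid stream, every file after the first is encoded as references into earlier data, so damage early on takes the rest with it.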
Another option is xz. This uses the same algorithm (LZMA2) as 7z, but without the elaborate container format - it's nothing but a compressed stream and minimal header. Much like gzip/bzip2 - if you want to store folder structures you usually use tar to achieve that, combining them to give a .tar.xz file. Why not just use 7z? Because xz can be used for streaming stdin to stdout, which 7z can't. There are times this can come in handy - for example, you can use tar to read files to back up, pipe this into xz to compress it, pipe that into gnupg to encrypt it, and write that to the destination - saving yourself an intermediate file. The extreme memory-is-cheap compression settings for xz are '-e --lzma2=preset=9,dict=256M' - as with 7z, adjust the dict= number according to the capabilities of your hardware. Note that the xz container doesn't support LZMA, only LZMA2.
A special mention goes to rzip. Designed expressly for finding long-distance redundancies, rzip is not quite a general-purpose compression program - it is effective on very large files with dispersed redundancies separated by up to 900MB. This can make it well-suited for compressing backups of user areas. It's something of a niche program, but it's highly capable within that niche.
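The streaming property can be sketched with Python's standard tarfile module, whose "w|xz" mode writes a .tar.xz strictly front to back without ever seeking. That is the same property that lets tar and xz sit in a pipeline. The function name and the in-memory 'files' are inventions for this example; a real backup would read from disk.

```python
import io
import tarfile


def stream_tar_xz(members: dict, sink) -> None:
    """Write `members` (name -> bytes) to `sink` as a .tar.xz stream, never seeking."""
    # The '|' in the mode selects tarfile's pure streaming writer.
    with tarfile.open(fileobj=sink, mode="w|xz") as tar:
        for name, payload in members.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
```

Because nothing ever seeks, `sink` can be any writable binary stream, including sys.stdout.buffer, at which point the output can be piped straight into the next program with no intermediate file.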
Sometimes you can just run 7z with -md=256m and enjoy your nicely compacted data, but there are some issues with this:
- Some data can be compressed far, far better with software designed specifically for that type of data. A bitmap image, for example, will compress well if placed in a zip or 7z file - but it'll compress far better still if converted to a PNG file, a format designed for images.
- Some files include their own compression capabilities. As a rule, you cannot compress data that is already compressed - but the file's own compression is likely inferior to what you can use. For this, see the instructions further down regarding the specific file type you are working with.
- Sometimes you want to compress a file in a way that can be readily accessed, such as an image file to use as a website resource, so an archive container is not suitable.
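Why a format designed for images beats a general-purpose compressor comes down largely to prediction. A rough sketch with Python's standard zlib module: take one long row of smooth, slightly noisy greyscale pixels, compress it raw with DEFLATE, then compress it again after applying the PNG 'Sub' filter, which stores each pixel as the difference from its left-hand neighbour. The pixel data is synthetic and exists only for this illustration.

```python
import random
import zlib

# A smooth, slightly noisy signal standing in for one long row of greyscale pixels.
rng = random.Random(1)
value = 128
row = bytearray()
for _ in range(100_000):
    value = (value + rng.randint(-2, 2)) % 256
    row.append(value)

# PNG 'Sub' filter for 8-bit greyscale: each byte minus its left neighbour, modulo 256.
filtered = bytes((row[i] - (row[i - 1] if i else 0)) % 256 for i in range(len(row)))

raw_size = len(zlib.compress(bytes(row), 9))
filtered_size = len(zlib.compress(filtered, 9))

print("raw pixels:     ", raw_size, "bytes")
print("filtered pixels:", filtered_size, "bytes")
```

Both runs use exactly the same DEFLATE compressor. The only difference is that the filtered data has been rearranged into a handful of small values that the compressor can model cheaply, and a general-purpose archiver has no way to know that trick applies.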
For type-specific lossless compression, there isn't much to debate. The leading software is well agreed-upon.
- Images: PNG, typically - but see below on how to use it better. There are actually formats available that compress better than PNG, such as WebP, but adoption is slow and support not certain.
- Audio: FLAC. Monkey's Audio can get a slightly better rate, but the improvement is too small to outweigh the format's other disadvantages.
- Raw video: x264 on the 'lossless' cq=0 mode. Note that the result may not be bit-for-bit identical to the source: any colourspace conversion needed on the way in introduces rounding errors, though the output is very close indeed. Also, this is going to mean remuxing, which can sometimes be a tricky task in itself if you're dealing with a file containing multiple audio streams, subtitles, chapter markers, an embedded thumbnail or metadata. Optimal use of lossy video compression is beyond the scope of this guide.
I've written a program called BLDD that can do very well on disc images and virtual machines, or on tar files containing many repeating or similar files. It's not a true compressor, it's a block-level deduplicator designed to supplement something like 7z or xz - use BLDD first, then xz, so you end up with an 'image.bldd.xz' file. It's also effective on very large TAR archives, as this format stores content files always starting on 512-byte block boundaries. On suitable input, it can be very capable indeed. It is highly effective at deduplicating out the unallocated-but-still-there sectors in filesystem images. It's a true lossless program, bit-for-bit, so handy if you need forensic backups. It was specifically designed for archiving disk images.
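For readers unfamiliar with the idea, block-level deduplication in general can be sketched in a few lines of Python. This is not BLDD's actual algorithm or file format, neither of which is described here. It is just the generic technique: cut the input into fixed 512-byte blocks, keep one copy of each distinct block, and record the order in which they appeared. The function names are hypothetical.

```python
import hashlib

BLOCK = 512  # illustrative: matches the sector / tar block size mentioned above


def dedup(data: bytes):
    """Split into fixed-size blocks; return (unique_blocks, index_list)."""
    unique, seen, index = [], {}, []
    for off in range(0, len(data), BLOCK):
        block = data[off:off + BLOCK]
        key = hashlib.sha256(block).digest()  # assumption: hash equality means block equality
        if key not in seen:
            seen[key] = len(unique)
            unique.append(block)
        index.append(seen[key])
    return unique, index


def rebuild(unique, index) -> bytes:
    """Reverse dedup(): look each block up by its index and join them."""
    return b"".join(unique[i] for i in index)
```

On a disk image full of identical or zeroed sectors, the list of unique blocks is far shorter than the index, and the remaining unique data is then handed to a real compressor such as xz.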
Already compressed files
Files which include internal compression come with their own considerations. As a general rule you cannot simply compress that which is already compressed - but you can decompress it, and then compress it again in a better manner.
A rare offender, seldom encountered, is the .DMG file. These will be familiar to any OSX users, but rarely seen outside of that OS. The format is actually quite strange, being essentially a filesystem image repurposed as an archive container, but it does use compression. I can't really advise on how to handle these, as I have no experience with OSX.
The Magic Recompressor: minuimus.pl
Look at all the text below, describing ways to make all sorts of different formats more compact. Wouldn't it be convenient if you could skip all that? Well, you can! I have taken all of that research and condensed it down into a single perl script, minuimus.pl. Simply point it at a file, and it shall endeavor to make the file smaller. Internally, it's applying such utilities as the AdvanceCOMP suite, jpegoptim and qpdf, while automating the process of selecting and applying the correct utility for each file. Better still, it can also identify files which are internally constructed as ZIP files and automatically extract them, and then compress each of those files contained within in the correct manner too. It's really quite powerful. After applying minuimus.pl, a file will be left either unaltered, or smaller but functionally identical and interchangeable.
If you don't want to use that though, and would rather go through the process manually, just read on.
CBZ, CBR, and other comic-book files.
Digital-format books, usually comics or graphic novels, are sometimes in .cbz or .cbr files - or occasionally .cb7, .cba or .cbt, though these are rare things indeed. All of these are actually container formats with a changed extension. The CB z, r, 7, a and t are zip, rar, 7z, ace and tar respectively. If you change the extension you can look inside - the pages are sorted into alphabetical order by the viewing software for display. Usually the individual pages are either .JPG or .PNG files. This is good: You can extract these files, follow the instructions further down regarding compressing JPEG and PNG, and repack them back into a .cbz file. It's also a good idea to convert some of the rarer variations to CBZ as a matter of routine - not a lot of people use the old ACE container now, so converting to CBZ is more future-proof.
ZIP and the ZIP-a-likes
ZIP files can use a wide variety of compression methods internally, but most of them are very seldom encountered. The only ones you are likely to find are the null-transform ('store') and DEFLATE. DEFLATE is actually quite dated, which means the best means for making a ZIP file smaller is to turn it into something other than a ZIP. Usually a .7z. It is a shame that the more advanced ZIP compression modes are seldom used, but there is a reason for this: Not all decompressors support them, so it would be foolish to make a ZIP file which not everyone could decompress, so such files are not made, and so there is no reason for decompressors to support them. This is why it is rare to find a ZIP file using any compression algorithm other than DEFLATE.
If you have a ZIP file and want to compress it better, the best way is usually to extract it and compress it as a .7z instead. But, if you do need to keep a zip a zip and still make it smaller, there is a way: The advzip utility, from the AdvanceCOMP suite. This uses Zopfli, a higher-compressing (but much slower) DEFLATE encoder. It'll make the zip file smaller, usually, but the file will be otherwise completely identical - the contents are not affected at all, just packed with greater efficiency. Not as good as .7z, but there is a very good reason for using advzip in this way: Not all ZIP files end with .zip.
Many formats are actually 'zip in disguise.' This includes all of the Microsoft Office formats ending in x - .docx, .xlsx, etc - as well as .jar, .epub, .cbz, and a few others. Most of these are based upon the Open Packaging Conventions, a specification that defines a general way to design a new filetype based upon a ZIP container. As these are all zip files by another extension, advzip will work on them all. It is not always the best way, because sometimes it might be even more space-efficient to extract and process the individual files within.
All formats based on the Open Packaging Conventions are pure, unmodified ZIP files and may be opened, altered, and saved using standard zip utilities. This class includes the Office Open XML formats - .docx, .pptx, .xlsx, etc. This may not be true of all zip-based formats. The EPUB format, for example, requires special treatment for the 'mimetype' file, which must be stored first within the archive and uncompressed. All of the ZIP-based files may be safely processed with advzip, as it preserves file ordering.
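If you want to check whether a given file is one of these disguised ZIPs, the extension is no help, but the contents are. A small sketch using Python's standard zipfile module: it ignores the file name, tests for a ZIP structure, and reports which compression method each member uses. The function name is made up for this example.

```python
import zipfile

# ZIP method numbers as exposed by the zipfile module.
METHODS = {
    zipfile.ZIP_STORED: "store",
    zipfile.ZIP_DEFLATED: "deflate",
    zipfile.ZIP_BZIP2: "bzip2",
    zipfile.ZIP_LZMA: "lzma",
}


def describe_container(source):
    """Return None if `source` is not a ZIP, else (name, method, packed, unpacked) per member."""
    if not zipfile.is_zipfile(source):
        return None
    with zipfile.ZipFile(source) as zf:
        return [
            (i.filename, METHODS.get(i.compress_type, str(i.compress_type)),
             i.compress_size, i.file_size)
            for i in zf.infolist()
        ]
```

`source` may be a path such as 'report.docx' or an open binary file object. Seeing mostly 'deflate' and 'store' in the output is the usual result.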
If you are using the 'zip' utility within linux or similar, remember to use the -X option when adding a file to an archive. If you do not specify -X, default behavior is to store the UNIX owner IDs and permissions in the zip extra data field, where it just consumes space needlessly.
EPUB is one of those formats that uses a ZIP container, though it isn't part of the Open Packaging Conventions. It's almost an ordinary ZIP file, but with one special requirement to be aware of: The required file called 'mimetype' must be the very first file in the zip archive, and it must be stored uncompressed. You can most easily compact an epub by just running advzip on it, but if the epub has images you may be able to achieve even smaller size by also compressing each of these resources individually. Conveniently I have produced a perl script, advepub.pl which will extract an EPUB file, apply optipng and jpegoptim followed by advzip, and produce a nicely compacted - usually substantially smaller - file.
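The 'mimetype' rule is simple enough to show directly. This is a sketch with Python's standard zipfile module and is not what advepub.pl does: it rewrites an EPUB so that 'mimetype' comes first and is stored, and everything else is DEFLATE-compressed at the highest zlib level. zlib is weaker than the Zopfli encoder in advzip, so expect smaller gains. The function name is hypothetical.

```python
import zipfile


def repack_epub(src, dst) -> None:
    """Rewrite an EPUB with 'mimetype' first and stored, other members DEFLATE level 9."""
    with zipfile.ZipFile(src) as zin, zipfile.ZipFile(dst, "w") as zout:
        # A bare ZipInfo carries no extra field, which keeps the first entry clean.
        zout.writestr(zipfile.ZipInfo("mimetype"), zin.read("mimetype"),
                      compress_type=zipfile.ZIP_STORED)
        for name in zin.namelist():
            if name == "mimetype" or name.endswith("/"):
                continue  # already written, or a directory entry
            zout.writestr(name, zin.read(name),
                          compress_type=zipfile.ZIP_DEFLATED, compresslevel=9)
```

Both arguments may be paths or binary file objects. Note that this simple version does not carry over the original timestamps, which is harmless for reading but worth knowing about.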
The PDF format, like many others, is actually an object container and uses DEFLATE compression internally. Unfortunately it is not based on something common and easily manipulated like a ZIP. The internals of PDF are an eldritch horror of arcane and unique encodings. It is possible, in theory, to make a PDF file smaller by using Zopfli to re-compress these portions - but in practice there is no reliable software around to carry out this task. I did attempt to write my own, but due to the sheer awfulness of the PDF container I never got this working reliably or entirely safely. I actually advise against ever publishing anything in PDF format unless consistently defined page layout is truly essential, but if you really must make a PDF smaller, you could try the compressor. It might work, but no promises.
Other than that dubious software, the next-best option is to try to massage the file a little smaller with qpdf. You could try 'qpdf --decode-level=none --object-streams=generate --linearize <infile> <outfile>' - that will compact some of the structures and remove any objects marked as unused, which often does make a PDF file a little smaller. Not a lot smaller, but it's worth a try. Sometimes it makes them larger.
JPEG, jpegoptim, and why everyone uses jpeg wrong.
A special note is required on the JPEG format, and an accident of history.
When the JPEG format was produced (By the organisation also known as JPEG), the cutting edge of compression technology was a technique known as arithmetic coding. It's still in common use today, because it works, and it works well. Unfortunately at the time, it was also patented. By IBM, a company infamous for highly aggressive patent enforcement. So aggressive that it rendered arithmetic encoding, a technology that would outperform any rival with ease, essentially unusable - the only conditions under which IBM would allow it to be used involved prohibitive pricing and conditions that prevented mass-adoption.
Mostly because of this - also in part because arithmetic coding was uncomfortably slow on hardware of the era - JPEG was written to support two different compression modes: Arithmetic coding, and the unpatented but worse-performing Huffman coding. Hardly anyone ever used arithmetic coding, as doing so would incur the wrath of IBM.
The patent is long expired, but almost every JPEG today is still Huffman coded. Why? The classic chicken and egg problem. Most JPEG reading software, including major web browsers, does not support arithmetic coding. There's no reason to, no-one ever produces JPEG files that use it. And no-one is going to produce those files because most software would be unable to read them.
What this means in practical terms is that there is a simple, trivial means to process a JPEG that makes it around 10% smaller with absolutely no reduction in quality - and most programs won't be able to open it. If your aim is only to archive JPEG images though, that's not a problem. You can just turn them back afterwards. Converting a JPEG from Huffman to arithmetic does not touch the lossy part of the compression process, the DCT, so it's a lossless transformation.
If you want other people to be able to open your images though, you can't use that, so you'll have to settle for the second best option. The second best is jpegoptim. This works because the Huffman coder used in most programs that generate JPEG files is sub-optimal - it sacrifices compression in favor of performance. Jpegoptim can decompress this data, then recompress it using a slower encoder. The resulting files are still smaller (but not by quite as much as arithmetic coding would allow) and they still decompress in a bit-identical manner, and even faster than before jpegoptim did its thing. In short, jpegoptim should be standard practice before publishing any JPEG image to the world. Expect reductions on the order of five percent, though some files will compress much better. As progressive JPEG files tend to compress better than baseline, the command 'jpegoptim --all-progressive <file.jpeg>' performs best.
If you do wish to convert jpegs to arithmetic coding, the utility is jpegtran, and the command is 'jpegtran -arithmetic -copy all -outfile <out.jpg> <in.jpg>'. If you want to test compatibility, here is a test file that uses arithmetic coding.
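To check which entropy coder a given JPEG uses, you can read it straight out of the file header: the start-of-frame marker says so. A minimal Python sketch follows, with the marker values taken from the JPEG standard. The function name is invented, and the rare hierarchical (differential) frame types are deliberately left out to keep it short.

```python
import struct

# Start-of-frame markers: value -> (process, entropy coder).
SOF = {
    0xC0: ("baseline", "huffman"),
    0xC1: ("extended sequential", "huffman"),
    0xC2: ("progressive", "huffman"),
    0xC3: ("lossless", "huffman"),
    0xC9: ("extended sequential", "arithmetic"),
    0xCA: ("progressive", "arithmetic"),
    0xCB: ("lossless", "arithmetic"),
}


def jpeg_coding(data: bytes):
    """Return (process, entropy coder) from the first SOF marker, or None if not found."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG (missing SOI marker)")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("expected a marker at offset %d" % pos)
        marker = data[pos + 1]
        if marker == 0xFF:          # padding byte before a marker
            pos += 1
            continue
        if marker in SOF:
            return SOF[marker]
        if marker in (0xD9, 0xDA):  # end of image or start of scan: no frame header seen
            return None
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        pos += 2 + length           # skip marker plus segment (length includes itself)
    return None
```

Calling jpeg_coding(open(path, 'rb').read()) on a file before and after conversion shows the coder change from 'huffman' to 'arithmetic' while the image data itself stays the same.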
Is there any equivalent of jpegoptim for PNG files? Yes, there is! There are four: pngcrush, optipng, advpng and advdef. They don't all work in the same way, so it's actually possible to get higher performance than any alone by using them in combination.
Advpng and advdef are both part of the AdvanceCOMP suite, and they work in much the same manner as jpegoptim: They decompress the data within a PNG, then recompress it using a compressor that achieves a higher ratio at the expense of (much) slower encoding speed. Specifically, this compressor is Zopfli. Optipng is a little more sophisticated, and will attempt various lossless operations such as palette or bit depth conversion to see if these can make the image more compact.
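The decompress-then-recompress approach can be sketched in Python with nothing but the standard library. This is an illustration of the technique and not a replacement for advdef: it uses zlib at level 9 where the real tools use Zopfli, so the gains are smaller. It gathers the IDAT chunks, inflates them, deflates the result again, and passes every other chunk through untouched. All function names are hypothetical.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def read_chunks(png: bytes):
    """Yield (type, payload) for each chunk in a PNG byte string."""
    if png[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG")
    pos = 8
    while pos < len(png):
        length, ctype = struct.unpack(">I4s", png[pos:pos + 8])
        yield ctype, png[pos + 8:pos + 8 + length]
        pos += 12 + length  # 4 length + 4 type + payload + 4 CRC


def make_chunk(ctype: bytes, payload: bytes) -> bytes:
    """Serialise one chunk, computing its CRC over type and payload."""
    crc = zlib.crc32(ctype + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + ctype + payload + struct.pack(">I", crc)


def recompress_idat(png: bytes) -> bytes:
    """Re-deflate the image data; return the original if the result is not smaller."""
    chunks = list(read_chunks(png))
    pixels = zlib.decompress(b"".join(p for t, p in chunks if t == b"IDAT"))
    new_idat = zlib.compress(pixels, 9)
    out, written = [PNG_SIGNATURE], False
    for ctype, payload in chunks:
        if ctype != b"IDAT":
            out.append(make_chunk(ctype, payload))  # everything else passes through
        elif not written:
            out.append(make_chunk(b"IDAT", new_idat))  # one IDAT replaces the whole run
            written = True
    result = b"".join(out)
    return result if len(result) < len(png) else png
```

The filtered pixel data that comes out of the inflate step is fed back in unchanged, so the decoded image is identical. Only the DEFLATE stream wrapped around it differs.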
As pngcrush/optipng and advpng/advdef work in completely different ways, they can actually work together to achieve even smaller files than either could alone. Optipng first, then advdef - or advpng, but only if you are sure the image is not animated.
That is the one thing to be cautious of here: Animated PNG. The animated PNG extension to PNG was specifically made to be backwards compatible: Any software written for PNGs but not animated PNG will simply ignore all the animation data, and treat the file as a static PNG image with only the first frame decoded. That is exactly what advpng or pngcrush will do, which means if you put an animated PNG file in you will get a non-animated (though certainly smaller) version out. For this reason you should never use advpng or pngcrush on PNG files which may be animated. This problem does not apply to optipng or advdef though: Optipng detects an animated PNG and skips the file completely without touching it, and advdef simply passes anything it does not recognise through unaltered including animation data so animations will survive perfectly.
If you want to identify animated PNGs in a script, run 'advpng -l <file>' and examine the output for the 'acTL' fragment characteristic of animated PNGs.
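If you would rather not parse the output of another program, the same check can be made directly on the file. A short Python sketch: walk the PNG chunk list and report whether an 'acTL' chunk turns up before the first 'IDAT', which is where the APNG specification requires it to be. The function name is made up, and chunk CRCs are not verified.

```python
import struct


def is_animated_png(png: bytes) -> bool:
    """True if an acTL chunk appears before the first IDAT chunk."""
    if png[:8] != b"\x89PNG\r\n\x1a\n":
        raise ValueError("not a PNG")
    pos = 8
    while pos + 8 <= len(png):
        length, ctype = struct.unpack(">I4s", png[pos:pos + 8])
        if ctype == b"acTL":
            return True
        if ctype in (b"IDAT", b"IEND"):
            return False  # image data reached without seeing animation control
        pos += 12 + length  # 4 length + 4 type + payload + 4 CRC
    return False
```

A script can use this to route animated files to optipng and advdef only, and leave advpng and pngcrush for the files it reports as static.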
The GIF format dates from 1987. It might charitably be called 'venerable.'
The compression algorithm used, LZW, is the best that 1985 had to offer. It was mostly superseded by DEFLATE, both because DEFLATE usually achieves slightly better compression and because the LZW algorithm was patented (By Unisys) for many years. Further, GIF doesn't use LZW in the optimal manner for images - it can divide the image into blocks, but then simply feeds pixel data directly into the compressor without any use of adaptive prediction. It can exploit inter-frame redundancy to some extent, but poorly. Because of this, the most effective way to make a non-animated GIF file smaller is usually to convert it into a PNG, and then submit that PNG to optipng followed by advpng: PNG is simply a superior format. Converting an animated GIF to PNG is a rather more complicated affair though, as most image converters do not support animated GIF.
If you are committed to GIF though, there are two programs that might help you. Gifsicle, and flexigif. These programs work in different manners, and they do stack in performance - applying gifsicle followed by flexigif can achieve greater savings than either can alone. They may be regarded as to GIF what optipng and advpng are to PNG: Gifsicle works by reconstructing parts of the GIF structure in a more-space-efficient manner, while flexigif works by decompressing the raw pixel streams and recompressing them using a more space-efficient but processor-intensive LZW encoder. Apply gifsicle followed by flexigif and your GIF is about as small as it can possibly get without converting it to a PNG. Both utilities are animation-safe.
There is a small catch to the use of flexigif: Time. Flexigif is slow. Really, really slow. Even by the standards of compression software, which is known for extreme processor time requirements - a 100K GIF I tested took 22 minutes. For this reason the minuimus.pl script will only use flexigif on files less than 100KiB in size.