Many years ago, I created this guide to 'extreme' compression using x264 - attaining compact, high-quality files through the use of the most heavily optimised configuration, without regard to processing time or to the amount of human attention required.

In the years since, technology and expectations have advanced and rendered much of this guide obsolete. HTML5 video, then a new technology, has become widely established. Where once almost all video was downloaded, it is now as likely to be streamed. Most importantly, device profiles matter a lot more, as viewing video on comparatively low-powered devices such as tablets and phones has become just as important as viewing on PCs and laptops.

As I write this new introduction, h265, a substantially more advanced codec, is poised to take over from h264. I have decided against updating this guide to modern technologies and conventions, as this would be a near-total rewrite. Certain portions of it, however, still hold value, and so I am not going to take it down entirely either. Instead I am archiving it into IPFS, to serve as both a reference and a historical view into the video technology of the 2000s.

Please bear this in mind when using anything you read below: Much of it is written with old technology in mind, and no longer conforms to current best practices. These instructions are aimed at creating the most compact possible file for download, and push the playback device so hard that many portable devices cannot play the result back. Nothing below should be followed blindly, but only taken as a suggestion.

- Codebird, 2017, with regard to Codebird, 2007.

Preparing the source
Filtering and restoration
Basic encoding settings
Advanced encoding settings
Audio, muxing and metadata
Filtering and Restoration

Once your video is progressive, you may need to restore it. If the video is from a well-mastered DVD, blu-ray or other high-quality source, this usually won't be an issue - it'll already be near-perfect. If it's from somewhere of lower quality such as VHS, over-compressed broadcast or historical footage then some effort is desperately needed.

Outside of production, there are two reasons for filtering: restoration and optimisation. These are usually accomplished in exactly the same manner. Restoration aims to improve a video to make it more closely resemble the ideal to a human viewer - this can be just the elimination of artifacts introduced by previous compression, but on poor-quality video or video of great age it can be far more elaborate. Optimisation involves altering video in a manner that makes it easier for algorithms to process. Fortunately video encoders focus on preserving characteristics important to human perception, and thus optimisation and restoration are in practice almost identical in their approach.

A particularly important step when working with video transferred from film is the removal of jitter, the unsteady bouncing around of the image caused by warping of the film and vibration within the equipment. While a temporal smooth can slightly help when the jitter is on a very small scale, usually a filter designed for the task will give superior results. Correction of jitter is essential for the correct operation of temporal smoothing and many other forms of filtering, as well as allowing for improved encoding as motion compensation can be performed more accurately and using more predictable vectors on a stationary image.

Restoration may include the undoing of undesired edits such as television channel logos ('DOGs'), inept format/color conversions or censorship. It usually includes the removal of noise, which may include artifacts from digital processing (Compression artifacts) or previous analog processing (Radio interference during transmission, multipath ghosting, film grain or sensor noise).

A much-desired side benefit of this treatment is the reduction of the information content of the video, which then renders it far easier to compress - free of the interference of noise and jitter the encoder is able to be far more precise. For this reason correct filtering of the video is a key element in the preparation of a video for extreme encoding.

A full discussion of restoration is beyond the scope of this article, but there are a number of filters which, properly used, can achieve somewhat passable results in restoration and significant gains in optimisation.

If your source video has been degraded by poor compression before, such as a low-bitrate broadcast, the use of the temporal smooth and smartsmooth filters in VirtualDub can be highly effective. These work especially well on animation - a tendency you will see mentioned often in this guide. Use caution with these though, as turning the strength up too high can easily do more harm than good, causing the loss of fine details. Once this information is lost, it can never be recovered entirely - so be sure that you don't overdo it.

Animation is the restorer's and compressor's dream. The fantastically low information content per frame compared to live-action video and the extreme inter-frame redundancy make it especially easy to de-noise and compress.

If you are working with animation, the dup filter for AviSynth is an excellent tool. This doesn't just work better on animation: It's almost useless on anything that isn't animated. The only exceptions are in documentary programs, where a graphic or picture might hold on screen for several seconds, and as an aid in removing duplicated frames. This filter should always be used after the temporal and smart smoothers from VirtualDub have been used, and only sparingly - generally, a 2% threshold is dangerously high, and anything above that should never be used. Typically a value from 0.2% to 1% is about right, but it depends on source video. Whatever value you find works, use a little less.

You should also lose any black letterbox bars by cropping the image - there is never any reason to keep them. Ever. This may improve the encode quality a little, but not greatly - pure black boxes do not require a great many bits to store. More importantly, they slow both encoding and decoding down and can also pose an inconvenience for viewers who may be using many different resolutions of display and player programs. Perhaps even worse, they increase the memory requirements to decode, which can mean some embedded players will refuse to decode entirely. It is a useful thing to know that the MPEG and MKV containers are capable of handling video with a non-square pixel aspect ratio, a feature lacking in AVI. The handling of aspect ratios is the responsibility of the playback device or software, but with MPEG or MKV you can tell the player what aspect ratio to display.

Smoothing filters
Example of an unusual repeating color band.
This is a form of banding artifact that can appear in areas of perfectly flat color, and is particularly common in animated video from the iTunes store. A weak edge-preserving smooth is a good way to be rid of this unsightly banding without any damage to the image detail. For something like this, a suitable filter might be '-vf smartblur=7:0.7:1' on mencoder. Exact settings depend on the video - tweak and find what works.

You may want to apply a smoothing filter to your video before encoding. This is both to improve visual quality, and to improve compressibility. Understand that this is not always a good idea - some video is just too clean to benefit from filtering, and some cannot be effectively filtered without the loss of details. In this case, just skip this step.

In general there are two situations in which applying a smoothing denoise would be of benefit:
- Any 2D animation in which compression artifacts are visible. A non-temporal, edge-preserving smooth is a miracle worker here. Throw in a mild temporal smooth to clean up the backgrounds, and the video will be as clear as when it was drawn. Even if you can't see any noise at all, run these filters on a very mild setting and you might see the video compress significantly better.
- Any video with film grain or its modern equivalent, CCD noise. Both of these are characteristic of cheap cameras, but are also at times deliberately created as a style. Like the lens flare, what was a limit of technology became part of the language of cinematography. Most of the time though, film grain really is unwanted noise. CCD noise is usually seen when filming with a digital camera in very low-light conditions. These two forms of noise have very similar mathematical properties and so can be handled in exactly the same way: A combination of spatial and temporal smooth. There's a filter built into mencoder and available for avisynth called hqdn3d that is designed for exactly this - it works very well.

Smoothing filters are used to eliminate noise of high spatial frequency, such as film grain, analog noise or some forms of compression artifacts. They are some of the most powerful de-noising filters available, especially on animation. There are many different smoothing filters available, all of which can be classified as one of two types:
- The spatial smoothing filters operate on one frame at a time, using only the information from within that frame. These include the 'smartsmooth' family of filters. These are sometimes referred to as 'edge-preserving' smooths.
- The temporal smoothers expand upon the spatial smoothers by also using the information from each frame in the smoothing of other nearby frames. Some can even do this backwards, filtering each frame based upon its successors. This makes them much more powerful than purely spatial smoothers, at the expense of greatly increased resource usage. Forward-looking filters are also non-causal, which can preclude their use in low-latency applications - but if you are reading this guide, you probably aren't looking for low-latency applications. Temporal smoothers are sometimes referred to as '3d' smoothers.
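To make the 'edge-preserving' idea concrete, here is a minimal sketch of a spatial smooth on a single row of pixels. It is an illustration only, not the algorithm of any particular smartsmooth filter: the function name, the radius and the threshold are all assumptions made for the example. Each pixel is averaged only with neighbours of similar value, so noise on flat areas is flattened while a hard edge is left alone.

```python
def smart_smooth_row(row, radius=1, threshold=10):
    """Edge-preserving smooth of one row of 8-bit pixel values (hypothetical helper)."""
    out = []
    for i, centre in enumerate(row):
        # Neighbourhood of the pixel, clipped at the ends of the row.
        window = row[max(0, i - radius): i + radius + 1]
        # Only neighbours close in value take part, so edges are not smeared.
        similar = [v for v in window if abs(v - centre) <= threshold]
        out.append(round(sum(similar) / len(similar)))
    return out


# A noisy dark area next to a noisy bright area.
row = [100, 104, 98, 200, 204, 198]
print(smart_smooth_row(row))  # [102, 101, 101, 202, 201, 201]
```

The noise on each side is reduced, but the step between the dark and bright areas survives. Raise the threshold past the height of that step and the filter degenerates into a plain blur.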

Which to use depends upon the characteristics of the video. A smoothing filter on a very high setting can be exceptionally good for removing compression artifacts from animation, while a smoothing filter on a very low setting will be more suited for removing film grain.

Smoothing filters are distinct from the related blurring filters in that their transformations are limited in some manner to reduce the loss of desired details, while blurring filters perform far simpler mathematical operations typically using a convolution matrix. Put in simple terms, a smoothing filter is an elaboration upon a blurring filter. A smoothing filter with too high a strength set or too low a threshold will approach the properties of a blur filter - that means the loss of fine details, smudging of edges and, in the case of a temporal smoother, the appearance of 'ghost' images as portions of one frame are displayed upon another.
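The ghosting described above is easy to reproduce in a toy temporal smoother. The sketch below is my own simplification, not the code of any real filter: frames are flat lists of 8-bit values, and each pixel is averaged with the same pixel in the previous and next frame only when the difference is within a threshold (an assumed parameter).

```python
def temporal_smooth(frames, threshold):
    """Thresholded temporal smooth over a list of frames (hypothetical helper)."""
    smoothed = []
    for i, frame in enumerate(frames):
        # Previous and next frame where they exist; using the next frame
        # is what makes the filter non-causal.
        neighbours = [frames[j] for j in (i - 1, i + 1) if 0 <= j < len(frames)]
        out = []
        for p, value in enumerate(frame):
            samples = [value]
            for other in neighbours:
                # Skip neighbours that differ too much: that is probably motion.
                if abs(other[p] - value) <= threshold:
                    samples.append(other[p])
            out.append(round(sum(samples) / len(samples)))
        smoothed.append(out)
    return smoothed


# Pixel 0 is noisy static background; pixel 1 changes abruptly in the last frame.
frames = [[100, 20], [104, 20], [98, 200]]
print(temporal_smooth(frames, 8))    # [[102, 20], [101, 20], [101, 200]]
print(temporal_smooth(frames, 255))  # [[102, 20], [101, 80], [101, 110]]
```

With a sensible threshold only the background noise is averaged away. With the threshold wide open the moving pixel is dragged toward its old value: that is the 'ghost'.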

Your choice of exactly which filter to use will be dictated to some extent by your toolchain - many filters are only available as plugins for a specific piece of software. Fortunately the algorithms are largely common between different filters with only minor variations.

If you are also using filters for other purposes such as color correction, dejittering or flicker reduction, the order of filtering often matters. Temporal filters especially do not work correctly on unsteady images, so if dejittering is required the temporal smooth must go afterwards. It is possible to use a temporal smooth as a de-jitter filter, but only in the case of very slight (sub-pixel) jitter.

Internal smoother (Internal)
Smart Smoother
temporal smoother (Internal)
temporal cleaner

mplayer/mencoder (Access using -vf option on mencoder):

Other potentially useful filters

Aside from smoothing, there are some other filters that may be of use if the video is of poor quality. None of this applies if you're working from a good source, but they can be valuable when trying to encode decades-old degraded film footage.

If the image bounces around at random between frames, you need to fix this. This bounce seriously impedes temporal filters and the encoder motion estimation process. For this you will want a de-jitter or anti-shake filter. Two terms, same meaning. This noise is caused by either film bouncing around in the camera, being warped in storage or bouncing in the telecine device. As this is a mechanical source of noise, it'll only ever be encountered in video that was once on photographic film - though some modern video may deliberately use it as part of an 'old timey' video feel filter.

If the image flickers in brightness between frames, you'll want to fix this too. This random brightness change renders the use of P/B frames far less effective, greatly reducing compressibility. You can approximate flicker removal by using a temporal smooth turned up very high, but the filter usually has to be turned up so high it'll cause ghosting of the image too - this isn't an issue with a filter made to clean up flicker.
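As a rough idea of what a dedicated flicker filter can do differently, the sketch below adjusts the overall brightness of each frame toward the average brightness of its neighbours, rather than blending pixels between frames. This is an assumption-laden illustration, not the method of any specific deflicker filter: the function name, the window radius and the simple global gain are all mine.

```python
def deflicker(frames, radius=2):
    """Pull each frame's mean brightness toward a local average (hypothetical helper)."""
    means = [sum(f) / len(f) for f in frames]
    out = []
    for i, frame in enumerate(frames):
        # Window of nearby frames, clipped at the start and end of the clip.
        lo, hi = max(0, i - radius), min(len(frames), i + radius + 1)
        target = sum(means[lo:hi]) / (hi - lo)
        # One gain for the whole frame: no pixels are mixed between frames,
        # so this step cannot produce ghosting.
        gain = target / means[i] if means[i] else 1.0
        out.append([min(255, round(v * gain)) for v in frame])
    return out


# The middle frame is 20 levels brighter than its neighbours.
frames = [[100, 100], [120, 120], [100, 100]]
print(deflicker(frames, radius=1))  # [[110, 110], [107, 107], [110, 110]]
```

The brightness spread between frames drops from 20 levels to 3, and no image content has moved from one frame to another.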

If you have a stable, perfectly stationary image you can use an averaging filter. It'll take out all the noise including speckling. This has only a very niche application - cleaning up title cards or intertitles in silent movies. It sometimes works too well: The transition from noisy, flickery video to a perfectly static intertitle can be jarring.

Duplicate detection and copying or averaging is a good idea on animation. There exists an excellent filter for this. It works by exploiting a common technique of animation: Static shots are achieved by simply showing one cel over and over, and even when moving it is very common (especially on cheaper productions) to show each frame twice, three times or even more to speed production. This duplicate removal filter finds these frames (So long as they are adjacent) then averages them and replaces each with the averaged copy - which makes it a very powerful denoiser. It should usually be the last filter in the pipeline when working with animation. Just don't turn it up too high - 1% is typical, and 2% dangerous, as it'll start removing tiny background movements that you may not notice.
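The idea behind duplicate averaging can be sketched in a few lines. This is not the code of the filter itself, and its real difference metric may well differ: here I assume a mean absolute difference expressed as a percentage of the 8-bit range, and frames are flat lists of pixel values. Runs of adjacent frames that stay within the threshold are replaced by their average.

```python
def _average(run):
    """Return len(run) copies of the per-pixel average of the frames in run."""
    mean = [round(sum(px) / len(run)) for px in zip(*run)]
    return [list(mean) for _ in run]


def merge_duplicates(frames, threshold_percent=1.0):
    """Replace each run of adjacent near-identical frames with their average."""
    def difference_percent(a, b):
        # Mean absolute difference as a percentage of the 0-255 range (an assumption).
        return 100.0 * sum(abs(x - y) for x, y in zip(a, b)) / (255.0 * len(a))

    result, run = [], []
    for frame in frames:
        # Compare with the first frame of the run so slow drift cannot accumulate.
        if run and difference_percent(run[0], frame) > threshold_percent:
            result.extend(_average(run))
            run = []
        run.append(frame)
    if run:
        result.extend(_average(run))
    return result


# Three noisy copies of one cel, then a genuinely new frame.
frames = [[100, 100], [102, 98], [101, 101], [200, 50]]
print(merge_duplicates(frames))
# [[101, 100], [101, 100], [101, 100], [200, 50]]
```

The frame count is unchanged, but the three noisy copies become bit-identical, which costs the encoder almost nothing to store. It also shows why a high threshold is dangerous: any real movement smaller than the threshold is averaged away along with the noise.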

One of the hardest tasks is the removal of flecks - those annoying black and white spots that appear on film. There are many different filters to perform this task and, as a general rule, they all do a very bad job. This is because the task of identifying what is and isn't a spot isn't a nice easy mathematical operation - it's a task that could only be performed by an algorithm with a generalized visual understanding of the subject. That means either a human, or a quite advanced artificial intelligence. If you can solve this one, you'll get a scientific paper out of it. Still, even these limited filters can be of some limited value - and you can always go through every frame by hand editing out the particularly annoying flecks, if you've enough time.

If your source video is composed only of greys - old black-and-white recordings - then it is a good idea to make sure they really are purely grey. Most filters and encoders are not designed for greyscale, and can introduce a very slight chroma element - not enough to be perceptible to the viewer, but enough semi-random chroma noise to have a detrimental effect on motion estimation during encoding. This is easily avoided: Just run the video through a true greyscale filter first. It'll have no perceptible effect, but the encoder will benefit very slightly. If you're thinking of setting x264 to a 4:0:0 mode where it stores no chroma, don't: x264 doesn't support this mode, as the developers consider the benefits minimal (An all-zeros chroma channel will be compressed down to practically nothing anyway), and most decoders also lack the capability. You can also set no-chroma-me when encoding: It won't improve encoding quality, but it'll make encoding considerably faster by disabling motion estimation efforts on the (empty) chroma channels.
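What a 'true greyscale' step amounts to is simple. The sketch below assumes 8-bit planar YUV held as plain lists, where 128 is the neutral chroma value; the function name is hypothetical. The luma plane is untouched and both chroma planes are forced to neutral, wiping out any stray colour noise.

```python
NEUTRAL_CHROMA = 128  # mid-point of 8-bit U and V, meaning 'no colour'


def force_greyscale(y_plane, u_plane, v_plane):
    """Keep luma as it is and replace both chroma planes with neutral grey."""
    return (
        list(y_plane),
        [NEUTRAL_CHROMA] * len(u_plane),
        [NEUTRAL_CHROMA] * len(v_plane),
    )


# Chroma that has drifted a level or two away from neutral.
y, u, v = force_greyscale([16, 235, 120, 60], [127], [130])
print(u, v)  # [128] [128]
```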

I do apologize for not releasing all of my own filters. This is for a very practical reason: I lost the source code to some, and others are proof-of-concept scripts that only work on a giant folder filled with bitmaps. This is impractical. I may complete and release these at some future date.

Resolution, aspect ratio and frame rate

Just what resolution is 720p anyway?

From a purely technological or standards standpoint, the definition of 720p is simple: It's a resolution of 1280x720, with an aspect ratio of 16:9. Where television and broadcast are concerned, that's perfect: It's widescreen TV. Problems come when movies are involved, as cinematic content is rarely produced at 16:9 - it generally, for historical reasons, comes at a wider aspect ratio. 2.39:1 is common, or 2.35:1 for older films. This doesn't fit nicely into 720p, because the aspect ratio doesn't match. A similar problem presents for 1080p displays: True 1080p is limited to a 16:9 ratio.

There are two general solutions to this. The first is to use 1280x720 with a non-square-pixel ratio, and use a specifier in the container to inform the playback device. The second is to scale to a horizontal resolution of 1280 but less than 720 vertical, then pad with black bars pre-encode, which allows for easy display on 16:9 hardware (this is the usual solution in broadcasting). This can lead to some confusion when a recipient notices a file labeled as 720p actually contains less than 720 lines of vertical resolution or their 'widescreen' TV still letterboxes movies. The same situation occurred with the old DVD format - so severe was the situation with cropping that many films had less than three hundred lines of usable vertical resolution in order to maintain the correct aspect ratio, until widescreen TVs recognising suitable flags became commonplace enough to allow for effective use of the non-square-pixel approach.

There is another common solution - or at least, common in some circles. Though rarely seen elsewhere, it is a common practice in the movie piracy community to use more unconventional resolutions. This works because, though consumer media equipment is often limited to the strictly standard resolutions, pirates usually use PCs or more flexible playback devices that are capable of resizing a video as appropriate for a particular display resolution. The practice is seen outside of piracy too - Apple, for example, produce trailers in 1920x816. Often these result from simply cropping the image out of a letterboxed blu-ray source, and optionally resizing down. For a handy guide to the correct resolution, see this chart:

| Target display | 16:9 | 2.35:1 | 2.39:1 |
|---|---|---|---|
| 720 vert | 1280x720 | 1696x720 | 1720x720 |
| 1920 hor | 1920x1080 | 1920x816 | 1920x804 |

Note, however, that these are common resolutions by informal convention, not formal standard, aside from the 16:9. They may not be used in broadcast media or on blu-ray discs: For these purposes, you may have to just use the letterbox. The advantage of these non-standard resolutions is in allowing the playback device to make the final decision as to how to display the video. The playback device is in a better place to decide this: You cannot know if it is a 4:3 TV, 16:9 TV, 5:4 monitor, 16:10 laptop, or even a tablet turned sideways.
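If you need a size the chart does not list, the arithmetic is straightforward: fix one dimension, derive the other from the aspect ratio, and round it to a multiple the encoder is comfortable with. The sketch below is an illustration with a hypothetical helper name. The default modulus of 16 is my assumption, and the chart's entries do not all appear to use the same rounding, so treat the output as a starting point rather than a rule.

```python
def fit_resolution(aspect, width=None, height=None, modulus=16):
    """Return (width, height) for aspect, given exactly one fixed dimension."""
    if (width is None) == (height is None):
        raise ValueError("give exactly one of width or height")

    def snap(value):
        # Round to the nearest multiple of modulus.
        return int(round(value / modulus)) * modulus

    if width is not None:
        return width, snap(width / aspect)
    return snap(height * aspect), height


print(fit_resolution(2.35, width=1920))             # (1920, 816)
print(fit_resolution(2.39, width=1920, modulus=4))  # (1920, 804)
print(fit_resolution(2.35, height=720))             # (1696, 720)
print(fit_resolution(2.39, height=720, modulus=8))  # (1720, 720)
```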

Through all the filtering and processing stages, you should be working at the full resolution of your source. This is because you don't want to lose any information that can be of use in noise removal until you are done with that stage - most filters just work a lot better if applied before scaling, if substantially slower. Before going on to the encoding stage you should consider if there is really any benefit to be had from high resolution.

This depends what you are encoding, and how it is to be viewed. If you are preparing video you expect some viewers to watch on a large television, then a full 1080p may be justified. Even here, though, most viewers couldn't tell it from 720p. It shouldn't need stating that there is never a reason to enlarge video before encoding - that is an operation to be performed at the decoding end.

Around this stage you may also discover the non-square pixel issue. This was once the nightmare of video distribution: Non-square-pixel video was commonly pulled from VCDs, SVCDs and even DVDs as a result of widescreen video being stored in a non-widescreen format or vice versa, not to mention the many different types of widescreen aspect ratio. It is tempting to simply rescale video to make the pixels square, but this is a bad idea - it makes the video much more difficult to compress well. The only situation in which this should ever be done is when a subsequent filter is incapable of handling non-square pixels well. A much better solution is available now: The MKV container, like MPEG (and unlike AVI) is able to store an aspect ratio in the file metadata to use when playing back. Your video may look a little stretched during editing when you're using the AVI, but so long as you make sure to set the correct aspect ratio before the final mux it'll play correctly when you're done. WebM, a simplified subset of MKV, can also store an aspect ratio, as can MP4 - but I am not sure if browsers respect this information. Fortunately, they do respect height and width specified in the HTML5 video tag.
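The relationship between stored resolution, pixel aspect ratio and display aspect ratio is simple arithmetic, sketched below with exact fractions. The helper names are hypothetical, and the calculation makes a simplifying assumption: it treats the full stored width as the picture, which ignores the finer points of broadcast standards, so real-world values can differ slightly.

```python
from fractions import Fraction


def pixel_aspect_ratio(stored_width, stored_height, display_aspect):
    """Pixel shape needed for a stored frame to display at display_aspect."""
    return Fraction(display_aspect) * Fraction(stored_height, stored_width)


def square_pixel_width(stored_width, par):
    """Width the player stretches the picture to when it honours the flag."""
    return round(stored_width * par)


# A 720x576 frame flagged as 16:9 widescreen.
par = pixel_aspect_ratio(720, 576, Fraction(16, 9))
print(par)                           # 64/45
print(square_pixel_width(720, par))  # 1024
```

The point of storing the ratio in the container is that the 720 stored columns are encoded as they are, and the stretch to 1024 happens only at playback.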

Another consideration is frame rate. The mere suggestion of decimating is enough to make some editors cringe, but there are times when it is to net benefit. This applies only to very low-motion video, such as the classic 'talking head' shot. On video like this, you can decimate by two (30fps becomes 15, 25fps becomes 12.5) and the lower frame rate will be barely noticeable - but the frame count is halved. That means you need half the bitrate, and the player needs half as much processing capability. That is a very important point if you're targeting mobiles, or using encoding parameters which require a lot of processor time to decode.
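Decimation by a whole factor is nothing more than keeping every Nth frame and dividing the frame rate to match. A minimal sketch, with a hypothetical helper that works on any list of frames:

```python
def decimate(frames, fps, factor=2):
    """Keep every factor-th frame; return the kept frames and the new frame rate."""
    return frames[::factor], fps / factor


# Ten frames at 25fps, decimated by two.
kept, rate = decimate(list(range(10)), 25, factor=2)
print(kept)  # [0, 2, 4, 6, 8]
print(rate)  # 12.5
```

The duration of the video is unchanged: half as many frames, each shown for twice as long.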

I've provided an example: A video supplied at 25, 12.5 and 5fps. Only the more observant viewers will be able to notice the reduced frame rate at 12.5fps - but go much below that, and it starts to become severely noticeable. There is such a thing as a 'variable frame rate' video which shows frames for different durations, and in theory could achieve some fantastic things in terms of quality and size - but support for this technology is as yet immature, and not widely deployed in a reliable manner, so I cannot at this time advise its general use.

This only applies to low-motion things like web video, though. If you're encoding a film or TV program, just stick with the source frame rate.

Frame rates in excess of 25FPS are the subject of a holy war. Some people insist that such high frame rates are essential for preserving detail in high-motion scenes, especially sports broadcasts. Others insist with equal confidence that perception of such rapid events in detail is beyond the capabilities of human perception. This is one reason many sports are broadcast interlaced rather than progressive - it offers the improved temporal resolution of a doubled frame rate without being so highly demanding upon bandwidth and equipment as a full progressive broadcast at that effective rate and resolution would be.

Three scalers demonstrated on animation at a reduction to 2/3 size, as would be typical in reducing 1080p to 720p. The bilinear's poor performance can be readily seen, with details lost - this scaler should never be used unless speed is required. Both bicubic and Lanczos perform better, close enough that neither can really be said to be the 'best' filter. I personally favor Lanczos for animation and bicubic for live-action video. Note the anisotropy at small scales of the bilinear and bicubic filters, most visible in the center 'flower' below the balcony. This is actually a pixel-scale effect caused by a moire-like pattern between native pixels and sampling points and is one of the principal drawbacks of those filters. Lanczos better preserves the roundness of the shape, and is thus commonly preferred for animation.

If the video is for web use, you may want to offer different resolutions. You'll need to offer at least two codecs anyway, due to the ongoing patent issues. Once more editors cringe at the idea of losing detail, but the end use of the video must be kept in mind. If the video is an action-packed movie for watching on a large television, every pixel matters - but if the video is of a lecturer at a podium, or a news report intended for mobile devices, there is simply no benefit to a full 1080p.

Conveniently, many television productions made in the pre-HD era were created with the limitations of home SD television in mind. When an important document is shown, it will be in huge font and closeup. Computers always have graphical interfaces where dialog boxes take up two-thirds of the screen for their 72-point message. Actors' faces are shown tight, to ensure every detail of expression is captured. That makes them well suited by nature to viewing at low resolution on mobile devices.

If you do decide to scale the video down, then you also need to pick a scaler. Which one is best, like so much, depends on the video. There is one constant in choice of scaler: Bilinear is always the wrong choice. Bilinear is, though, the default on mencoder. That's fine if you're just playing video in realtime - bilinear is the default because it is also the fastest - but for encoding, if using the mencoder built-in scale filter, you'll need to also specify the -sws argument to pick a filter that is not so terrible quality-wise. For reduction, the usual options are either Lanczos (-sws 9) or the bicubic spline (-sws 10). Bicubic (-sws 2) is also good on live-action or CGI but tends to blur edges on animation. You'll want to run a few tests on these, to see what looks best. Make sure to watch at native resolution though, or else your player's own scaler will just make it impossible to judge fairly.
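For running those comparison tests, it can help to generate the command line rather than type it each time. The sketch below only builds the argument list and does not run anything. The -sws numbers are the ones given above; the file names, the helper name and the bare '-ovc x264 -oac copy' settings are placeholders of my own, to be replaced with your real encoding options.

```python
# Scaler numbers for the mencoder -sws option, as listed in the text above.
SWS_SCALERS = {"bicubic": 2, "lanczos": 9, "bicubic_spline": 10}


def mencoder_scale_args(source, output, width, height, scaler="lanczos"):
    """Build a mencoder argument list that scales with the chosen scaler."""
    return [
        "mencoder", source,
        "-vf", f"scale={width}:{height}",
        "-sws", str(SWS_SCALERS[scaler]),
        "-ovc", "x264", "-oac", "copy",  # placeholder encoder settings
        "-o", output,
    ]


# One test encode per scaler, for side-by-side comparison.
for name in SWS_SCALERS:
    print(" ".join(mencoder_scale_args("source.avi", f"test_{name}.avi", 1280, 720, name)))
```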