file-summary

File-summary is a utility that generates a few graphs summarising the statistical properties of a file. These are enough to tell a lot about a file's nature, though not its precise structure - something of particular use when demonstrating the effect of various transformations or forms of compression.

File-summary graphs are plotted at a per-pixel level of detail - they are intended for getting an overall feel for a file, not for precise reading - and show the most important statistics:
- A transition matrix showing the probability that each byte value is followed by each other byte value. Grey indicates the 1/256 probability expected of truly random data, shades of green a greater probability, red a lesser one, and black zero.
- A simple bar chart of the relative frequencies of each byte value.
- A chart of the ratios of 1s to 0s for each bit.
- Numerical displays of the file's size, total value of all bytes, mean value, standard deviation, average run length and total value of bytes after delta encoding.
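
As a rough illustration of what these statistics mean, here is a minimal Python sketch that computes them for a byte string. It is not the program's actual code (file-summary itself is written against Magick++), and several definitions are assumptions on my part: the function name `summarise` is hypothetical, the bit chart is taken to be the fraction of bytes with each bit set, "average run length" is taken to be the mean length of maximal runs of identical bytes, and delta encoding is taken to keep the first byte and replace each later byte with its difference from the previous one, modulo 256.

```python
from collections import Counter
from math import sqrt


def summarise(data: bytes) -> dict:
    """Compute file-summary-style statistics for a byte string (a sketch)."""
    n = len(data)
    if n == 0:
        raise ValueError("cannot summarise empty input")

    # Relative frequency of each byte value (the bar chart).
    counts = Counter(data)
    frequency = [counts.get(value, 0) / n for value in range(256)]

    # P(next byte == b | current byte == a) for every pair seen (the matrix).
    # Pairs that never occur are simply absent, i.e. probability zero.
    pair_counts = Counter(zip(data, data[1:]))
    row_totals = Counter(data[:-1])
    transition = {
        (a, b): count / row_totals[a] for (a, b), count in pair_counts.items()
    }

    # Assumption: fraction of bytes with each bit set, bit 0 least significant.
    bit_ones = [sum((byte >> bit) & 1 for byte in data) / n for bit in range(8)]

    total = sum(data)
    mean = total / n
    std_dev = sqrt(sum((byte - mean) ** 2 for byte in data) / n)

    # Assumption: a run is a maximal stretch of identical consecutive bytes.
    runs = 1 + sum(1 for a, b in zip(data, data[1:]) if a != b)
    average_run_length = n / runs

    # Assumption: first byte kept, then each byte minus its predecessor mod 256.
    delta = [data[0]] + [(b - a) % 256 for a, b in zip(data, data[1:])]

    return {
        "size": n,
        "total": total,
        "mean": mean,
        "std_dev": std_dev,
        "average_run_length": average_run_length,
        "delta_total": sum(delta),
        "frequency": frequency,
        "transition": transition,
        "bit_ones": bit_ones,
    }
```

For example, `summarise(b"aaabbb")` reports an average run length of 3.0 and a transition probability of 1.0 for 'b' followed by 'b'.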

File-summary is distributed as source only. You'll need the Magick++ development package installed to compile it. It should be compilable under Windows, but you'll need to alter the hard-coded directory in which it looks for a required font file.

The easiest way to explain the graphs is through some examples:

file-summary example
Purely random values. The near-pure grey matrix and roughly level bar chart match in their visual dullness the dullness of the file. There is no obvious pattern to be found here.


file-summary example
A GnuPG-encrypted file. Reassuringly indistinguishable from random.


file-summary example
A JPEG. Being compressed data, this looks a lot like random noise - but not quite the same, as can be seen in the frequency graph, where the levels are not close to equal.


file-summary example
Text, typical English, UTF-8 or ASCII. This shows a lot more color. If you look very closely you can see a few interesting things in the matrix, like the very high probability that 'q' will be followed by 'u' or that '0' will be followed by 'n' or 'h'. The latter confused me - at first I thought it was a flaw in the software, but on inspecting the file I discovered that it had originated with an OCR program that frequently misrecognised 'O' as '0' - and the pairs 'on' and 'oh' appear a lot in English.


file-summary example
The same text, after processing with the Burrows-Wheeler transform (BWT). Notice the frequency graph is unaltered, as you'd expect from the BWT, but the other statistics have changed greatly. The average run length has doubled. The longer runs can be seen visually too, as the green diagonal line in the matrix. The file is also a little longer - the BWT demonstration program I used adds a small header specifying the offset required to recreate the input.
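
The effect described here can be reproduced on a small scale. The following Python sketch is a naive BWT that sorts all rotations of the input, fine for short demonstrations but far too slow for real files. It is not the demonstration program mentioned above, and the function names are my own. It returns the transformed bytes together with the offset of the original rotation, which is the value such a header has to carry so the input can be recovered.

```python
from typing import Tuple


def bwt(data: bytes) -> Tuple[bytes, int]:
    """Naive Burrows-Wheeler transform: sort every rotation of the input."""
    n = len(data)
    if n == 0:
        return b"", 0
    doubled = data + data  # doubled[i:i + n] is the rotation starting at i
    order = sorted(range(n), key=lambda i: doubled[i:i + n])
    last_column = bytes(doubled[i + n - 1] for i in order)
    # The row holding the unrotated input is the offset needed to invert.
    return last_column, order.index(0)


def inverse_bwt(last_column: bytes, offset: int) -> bytes:
    """Recover the input from the last column and the stored offset."""
    n = len(last_column)
    # A stable sort of positions by byte value links each sorted row to the
    # row that is the same text rotated left by one byte.
    order = sorted(range(n), key=lambda i: last_column[i])
    out = bytearray()
    pos = offset
    for _ in range(n):
        pos = order[pos]
        out.append(last_column[pos])
    return bytes(out)


def average_run_length(data: bytes) -> float:
    """Mean length of maximal runs of identical bytes (assumed definition)."""
    if not data:
        return 0.0
    runs = 1 + sum(1 for a, b in zip(data, data[1:]) if a != b)
    return len(data) / runs
```

With the input `b"banana"`, `bwt` returns `(b"nnbaaa", 3)`. The output is a permutation of the input, so the byte frequencies are identical, while the average run length rises from 1.0 to 2.0.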


file-summary example
The BWT above followed, as the BWT often is, by move-to-front encoding. Even a brief glance at the frequency graph shows this is going to be favorable to compression - it's composed almost entirely of low-valued bytes.
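
To see why the output is dominated by low values, here is a minimal Python sketch of move-to-front encoding over the 256 byte values. The function names are my own, and this is a generic textbook formulation rather than the code used to produce the file above. Each byte is replaced by its current position in a table, and that byte is then moved to the front, so a repeated byte encodes as 0 and a recently seen byte encodes as a small number.

```python
def move_to_front(data: bytes) -> bytes:
    """Replace each byte with its rank in a recency-ordered table."""
    table = list(range(256))  # starts in plain numerical order
    out = bytearray()
    for byte in data:
        rank = table.index(byte)
        out.append(rank)
        # Move the byte just seen to the front of the table.
        table.pop(rank)
        table.insert(0, byte)
    return bytes(out)


def inverse_move_to_front(ranks: bytes) -> bytes:
    """Undo move_to_front by replaying the same table updates."""
    table = list(range(256))
    out = bytearray()
    for rank in ranks:
        byte = table.pop(rank)
        out.append(byte)
        table.insert(0, byte)
    return bytes(out)
```

On a short run-heavy input of the kind a BWT produces, `move_to_front(b"nnbaaa")` gives the values 110, 0, 99, 99, 0, 0: every repeat within a run becomes a zero. On a long text the table quickly fills its front positions with the common letters, so the large initial ranks seen in this tiny example become rare.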