A study of a large number of PDFs from general circulation.

This analysis considers the composition of stream objects found within a large number of PDF files collected from general public circulation. It is intended to provide an understanding of the real-world usage of PDF files in order to aid in the refinement of minuimus_pdf_helper, a component of the Minuimus file optimisation utility. The principal interest is in determining the most commonly encountered filters, combinations of filters, and unusual edge cases which need to be taken into consideration.

The test data consists largely of ebooks downloaded without regard to copyright, so the exact contents will not be disclosed. The bulk of the data was collected by running a Google search for 'pdf intitle:"index of"' and feeding the results into a web-crawling script. It includes a very large number of files from the-eye.eu, a collection titled the "Great Science Textbooks Library," collections of service manuals for laptops and radios, scientific papers, and a number of smaller collections. Additional books were donated as the personal collections of two eBook pirates. These files were indexed by hash before analysis to identify and remove duplicates. The combined and deduplicated collection contained approximately 70,000 individual PDF files.
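
The deduplication step can be illustrated with a short sketch. The hash algorithm actually used is not stated here, so SHA-256 is an assumption, and the function names file_digest and unique_files are hypothetical.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()  # assumption: any strong hash would serve equally well
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def unique_files(paths):
    """Keep the first path seen for each distinct content hash."""
    seen = {}
    for path in paths:
        seen.setdefault(file_digest(path), path)
    return list(seen.values())
```

Hashing the whole file only catches byte-identical copies; two PDFs of the same book produced by different software would both be kept.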

To analyse these PDF files, qpdf was used to extract the dictionary of all stream objects. qpdf also normalises the use of whitespace when extracting these dictionaries, simplifying later analysis. These dictionary objects are the only data used in later stages of this analysis: The PDFs themselves, and the contents of the stream, are not used beyond this point.

The prepared data for analysis consisted of a text file listing the dictionary objects, one object per line, with a total of 38,566,986 stream objects.


--- The /F parameter and external file references ---

14,679 of the objects used the /F key, which references an external file. As none of these PDFs came with the referenced external file, it is not possible for any PDF containing one of these objects to be correctly rendered. These include files such as "/F (Macintosh HD:My Projects_Anritsu:Projects:Publication Projects:User's Guide Covers:00986-00079 MT8212B UG:Pictures:productlineicon.eps)", "/F (\\\\Kc2\\data\\Issue205\\2708005_Lacoste\\2708005Fig2.tif)", "/F (Bandersnatch:Users:dpg:Desktop:0071373241_iovine_app:Iovine_137324-1", and "/F (iMac Data:DAW:in Progress:BDHISMIS:Buddhism Art:01F_Monks_Protest.tif)".

Many of these objects reference paths using the : separator - characteristic of the classic Mac OS, which was superseded by Mac OS X around 2001. Some of these files are clearly very old. They also leak information which, at the time, may have been useful to an attacker - internal hostnames and usernames.

--- Filters ---

/FlateDecode    : 24182372
/DCTDecode      :  4061142
/JPXDecode      :  3312194
/JBIG2Decode    :  2719344
/CCITTFaxDecode :  2472466
None            :  1834314
/ASCII85Decode  :    43546
/LZWDecode      :     7755
/ASCIIHexDecode :      614
/RunLengthDecode:        0

The lack of RunLengthDecode is quite remarkable. In 70,000 PDF files, produced using a diverse range of software, there is not even a single use of RunLengthDecode. Within this sample the filter is completely unused - it performs so poorly that perhaps no PDF creation software has ever utilised it!

The total of all these filters exceeds the number of objects. This is because many objects utilise more than one filter. /ASCII85Decode is commonly used to wrap /LZWDecode data: This combination accounts for 2,231 objects. /ASCII85Decode and /DCTDecode are used together in 15,980 objects. The /ASCII85Decode filter was used in 43,546 objects, but in only 725 of those was it used alone. This may be a legacy of PDF's early origins: It was originally possible to use these filters to create printable-text-only PDF files which would survive the conversion of character sets and line endings, back in an era when converting between computer systems often corrupted these characters.
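
A minimal sketch of how such tallies might be produced from the one-dictionary-per-line file follows. It is not the script used for this analysis: the regular expressions and the names filters_of and tally are hypothetical, and it assumes /Filter appears at most once per line, as either a single name or an array of names.

```python
import re
from collections import Counter

# Matches the value of /Filter: either an array or a single name.
FILTER_RE = re.compile(r"/Filter\s+(\[[^\]]*\]|/\S+)")
# Matches one PDF name token inside that value.
NAME_RE = re.compile(r"/[^\s\[\]<>/()]+")


def filters_of(dict_line):
    """Return the filter chain of one stream dictionary as a tuple of names."""
    m = FILTER_RE.search(dict_line)
    if not m:
        return ()
    return tuple(NAME_RE.findall(m.group(1)))


def tally(lines):
    """Count individual filters and whole filter chains across all lines."""
    per_filter = Counter()  # a two-filter object adds two to this total
    per_chain = Counter()   # each object adds exactly one here
    for line in lines:
        chain = filters_of(line)
        per_chain[chain] += 1
        per_filter.update(chain or ("None",))
    return per_filter, per_chain
```

Counting per filter is what lets the totals exceed the number of objects; counting per chain is what exposes combinations such as /ASCII85Decode wrapping /LZWDecode.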

JPXDecode filters are also surprisingly common, almost as common as DCTDecode - PDF's term for JPEG. JPEG2000 was not successful when released - a combination of patent concerns and a lack of software support hindered adoption, and it soon fell into obscurity. Even today the only major web browser to support JPEG2000 is Safari - there is no support in IE, Edge, Chrome, Firefox, or the Android browser. Yet in PDF, JPEG2000 is used almost as commonly as JPEG. PDF may be the one place where JPEG2000 still lives on as a commonly-used and widely-supported technology. Even some web browsers which do not support JPEG2000 files still include a decoder as part of their PDF-viewing functionality.


--- Subtypes ---

None          : 19269541
/Image        : 15568284
/XML          :  1683930
/Form         :  1319297
/Type1C       :   606793
/CIDFontType0C:   118490
/image#2fjpeg :      414
/OpenType     :      151
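
The /image#2fjpeg entry is written using the PDF name escape, in which # is followed by two hexadecimal digits. #2f is the code for the / character, so the name reads as image/jpeg, which looks like a MIME type used as a subtype. A minimal sketch of the decoding, with the hypothetical function name decode_pdf_name:

```python
import re


def decode_pdf_name(raw):
    """Expand #xx hex escapes in a PDF name token such as '/image#2fjpeg'."""
    # Each escape is replaced by the character with that code; codes above
    # 0x7f are mapped to the matching Latin-1 character in this sketch.
    return re.sub(r"#([0-9A-Fa-f]{2})", lambda m: chr(int(m.group(1), 16)), raw)


print(decode_pdf_name("/image#2fjpeg"))  # prints /image/jpeg
```

Any tool that compares /Subtype values as raw text would treat the escaped and unescaped spellings of the same name as different.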


--- Special mentions of unusual objects ---

There are some objects found in these PDF files which stand out as so unusual, they merit further discussion. Only one example is needed of each oddity, though in all cases many like it were found. Most of these are noteworthy as special cases which might be mis-processed by a parser that does not correctly handle situations such as empty arrays or nested dictionaries.


LZW-compressed JPEG in an unusual color space, with a very small height or width:
<< /BitsPerComponent 8 /ColorSpace /DeviceCMYK /Filter [ /LZWDecode /DCTDecode ] /Height 2 /Length 329 /Name /X /Subtype /Image /Type /XObject /Width 249 >>

Objects in which the filter is given as a one-entry array:
<< /BitsPerComponent 8 /ColorSpace /DeviceRGB /DecodeParms [ null ] /Filter [ /DCTDecode ] /Height 2338 /Length 313182 /Subtype /Image /Type /XObject /Width 1605 >>

Empty DecodeParms:
<< /BitsPerComponent 8 /ColorSpace 18965 0 R /DecodeParms << >> /Filter /DCTDecode /Height 76 /Length 5252 /Name /X /Subtype /Image /Type /XObject /Width 398 >>
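
The three preceding variations (a filter chain, a one-entry array with a null parameter, and an empty parameter dictionary) can be reduced to one shape before further processing. The sketch below assumes the dictionary has already been parsed into Python values, with names as strings, arrays as lists, null as None and dictionaries as dicts; normalise_filters is a hypothetical name, not part of minuimus_pdf_helper.

```python
def normalise_filters(stream_dict):
    """Return a list of (filter_name, params_dict) pairs for one stream dictionary."""
    filters = stream_dict.get("/Filter")
    if filters is None:
        filters = []
    elif not isinstance(filters, list):
        filters = [filters]  # a bare name becomes a one-entry chain

    parms = stream_dict.get("/DecodeParms")
    if parms is None:
        parms = []
    elif not isinstance(parms, list):
        parms = [parms]

    # Pad with None so a short or missing /DecodeParms lines up with /Filter.
    parms = parms + [None] * (len(filters) - len(parms))
    # null and empty dictionaries both mean "no parameters".
    return [(name, p or {}) for name, p in zip(filters, parms)]
```

With this in place, `[ /DCTDecode ]` with `[ null ]` and `/DCTDecode` with `<< >>` both come out as a single /DCTDecode entry with no parameters.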

ASCII-encoded LZW:
<< /Filter [ /ASCII85Decode /LZWDecode ] /Length 1082 0 R >>

Hex-encoded font data:
<< /Filter /ASCIIHexDecode /Length 565 /Subtype /CIDFontType0C >>

Tiny, 1x1-pixel images. Strangely, these seem to have an extra byte:
<< /BitsPerComponent 1 /Height 1 /ImageMask true /Length 2 /Subtype /Image /Type /XObject /Width 1 >>
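
The "extra byte" can be checked with simple arithmetic: unfiltered image data occupies width x components x bits-per-component bits per row, with each row padded up to a whole byte. A small sketch, using the hypothetical function name expected_image_bytes:

```python
def expected_image_bytes(width, height, bits_per_component, components=1):
    """Size in bytes of unfiltered image data, each row padded to a whole byte."""
    bits_per_row = width * bits_per_component * components
    bytes_per_row = (bits_per_row + 7) // 8  # round up to the next full byte
    return bytes_per_row * height


print(expected_image_bytes(1, 1, 1))  # prints 1, against a declared /Length of 2
```

The same arithmetic gives 20016 for the unfiltered 1251x16 indexed image listed further down (one 8-bit palette index per pixel), which matches its /Length exactly.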

Tiny, 1x1-pixel images encoded using 'compression' filters that have more overhead than the size of the image:
<< /BitsPerComponent 1 /DecodeParms << /Columns 1 /K -1 >> /Filter /CCITTFaxDecode /Height 1 /ImageMask true /Length 5 /Subtype /Image /Type /XObject /Width 1 >>

Use of JBIG2Decode, an inherently two-dimensional compression, on an image only one pixel wide. Also note the does-nothing /Decode [ 0.0 1.0 ] - an identity function.
<< /BitsPerComponent 1 /ColorSpace /DeviceGray /Decode [ 0.0 1.0 ] /Filter /JBIG2Decode /Height 59 /Length 30 /Subtype /Image /Type /XObject /Width 1 >>

CCITTFax using /Decode to invert the colors after decompression:
<< /BitsPerComponent 1 /ColorSpace /DeviceGray /Decode [ 1 0 ] /DecodeParms << /Columns 800 /K -1 >> /Filter /CCITTFaxDecode /Height 133 /Length 821 /Name /im5 /Subtype /Image /Type /XObject /Width 800 >>
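
The effect of /Decode in the two preceding examples follows from the linear mapping that PDF applies to each sample: a raw value in the range 0 to 2^n - 1 is mapped onto the interval given by the /Decode pair. A minimal sketch, with the hypothetical function name apply_decode:

```python
def apply_decode(sample, bits_per_component, d_min, d_max):
    """Map a raw sample onto the [d_min, d_max] interval of a /Decode pair."""
    max_sample = (1 << bits_per_component) - 1
    return d_min + sample * (d_max - d_min) / max_sample


# /Decode [ 1 0 ] on a 1-bit image swaps the two values.
print(apply_decode(0, 1, 1, 0), apply_decode(1, 1, 1, 0))  # prints 1.0 0.0
# /Decode [ 0.0 1.0 ] leaves them unchanged.
print(apply_decode(0, 1, 0.0, 1.0), apply_decode(1, 1, 0.0, 1.0))  # prints 0.0 1.0
```

An optimiser that re-encodes such an image must either carry /Decode over unchanged or apply the inversion to the pixel data itself; dropping the key silently would invert the picture.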

Particularly complicated use of /Decode, /Domain and /Encode, here in a sampled function (/FunctionType 0) rather than an image:
<< /BitsPerSample 8 /Decode [ 0 1 0 1 0 1 0 1 ] /Domain [ 0 1 ] /Encode [ 0 63 ] /Filter /FlateDecode /FunctionType 0 /Length 201 /Range [ 0 1 0 1 0 1 0 1 ] /Size [ 64 ] >>
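
How these keys interact can be sketched for a function with one input, as in the dictionary above: the input is clipped to /Domain, mapped through /Encode to a position in the sample table, the neighbouring samples are linearly interpolated, and each output is mapped through /Decode and clipped to /Range. The function names and the sample table below are hypothetical; the real table is the stream data, which is not part of this analysis.

```python
def interpolate(x, x_min, x_max, y_min, y_max):
    """Linear map of x from [x_min, x_max] onto [y_min, y_max]."""
    return y_min + (x - x_min) * (y_max - y_min) / (x_max - x_min)


def eval_sampled_1d(x, samples, domain, encode, decode, range_, bits_per_sample):
    """Evaluate a one-input sampled function; samples[i] is a tuple of raw outputs."""
    x = min(max(x, domain[0]), domain[1])
    e = interpolate(x, domain[0], domain[1], encode[0], encode[1])
    e = min(max(e, 0), len(samples) - 1)
    lo = int(e)
    hi = min(lo + 1, len(samples) - 1)
    frac = e - lo
    max_sample = (1 << bits_per_sample) - 1
    out = []
    for j, (s_lo, s_hi) in enumerate(zip(samples[lo], samples[hi])):
        s = s_lo + frac * (s_hi - s_lo)  # linear interpolation between samples
        r = interpolate(s, 0, max_sample, decode[2 * j], decode[2 * j + 1])
        out.append(min(max(r, range_[2 * j]), range_[2 * j + 1]))
    return out


# Hypothetical 64-entry table with four 8-bit outputs per entry.
table = [(4 * i, 255 - 4 * i, 0, 255) for i in range(64)]
unit = [0, 1, 0, 1, 0, 1, 0, 1]
print(eval_sampled_1d(0, table, [0, 1], [0, 63], unit, unit, 8))  # prints [0.0, 1.0, 0.0, 1.0]
```

With /Size [ 64 ], four outputs and /BitsPerSample 8, the table occupies 256 bytes before compression, which is consistent with the 201-byte FlateDecode stream declared above.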

The zero-length stream:
<< /Length 0 >>

An unusual form of color space: An indexed image. Eight-bit palette, stored in object 3353 0. Also a new variation upon the theme of do-nothing DecodeParms - and on an object with no Filter, too:
<< /BitsPerComponent 8 /ColorSpace [ /Indexed /DeviceRGB 255 3353 0 R ] /DecodeParms [ << >> ] /Height 16 /Length 20016 /Subtype /Image /Type /XObject /Width 1251 >>

Dictionary in dictionary in dictionary:
<< /BBox [ 0 0 612 792 ] /Filter /FlateDecode /FormType 1 /Length 77 /Matrix [ 1 0 0 1 0 0 ] /Resources << /ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ] /XObject << /Xf4 5995 0 R /img0 12785 0 R /img1 12786 0 R /img2 12787 0 R >> >> /Subtype /Form /Type /XObject >>
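
Nested dictionaries and arrays like these are easiest to handle with a small recursive parser rather than pattern matching. The sketch below assumes the whitespace-normalised form shown throughout this section, where every token is separated by spaces, and it does not handle literal strings such as the /F values quoted earlier, which would need a real lexer. The names parse_tokens and parse_dict_line are hypothetical, and indirect references are represented as (object, generation) tuples purely for illustration.

```python
def parse_tokens(tokens, pos=0):
    """Parse one PDF object starting at tokens[pos]; return (value, next_pos)."""
    tok = tokens[pos]
    if tok == "<<":
        result, pos = {}, pos + 1
        while tokens[pos] != ">>":
            key = tokens[pos]
            value, pos = parse_tokens(tokens, pos + 1)  # value may itself nest
            result[key] = value
        return result, pos + 1
    if tok == "[":
        result, pos = [], pos + 1
        while tokens[pos] != "]":
            value, pos = parse_tokens(tokens, pos)
            result.append(value)
        return result, pos + 1
    # Indirect reference: two non-negative integers followed by R.
    is_ref = (
        tok.isdigit()
        and pos + 2 < len(tokens)
        and tokens[pos + 1].isdigit()
        and tokens[pos + 2] == "R"
    )
    if is_ref:
        return (int(tok), int(tokens[pos + 1])), pos + 3
    if tok in ("true", "false"):
        return tok == "true", pos + 1
    if tok == "null":
        return None, pos + 1
    if tok.startswith("/"):
        return tok, pos + 1
    return (float(tok) if "." in tok else int(tok)), pos + 1


def parse_dict_line(line):
    """Parse one line of the prepared data into nested Python values."""
    value, _ = parse_tokens(line.split())
    return value
```

Applied to the dictionary above, `parse_dict_line(line)["/Resources"]["/XObject"]["/img0"]` yields `(12785, 0)`, and the empty `<< >>` and `[ ]` cases parse to an empty dict and an empty list without special handling.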