PDF sucks, and this is why. The PDF format is one of the most widely known in the world, and it is universally supported. It also sucks. The problem is not that the PDF format itself is a bad format (though it is), but that it is routinely used in ways quite unlike those for which it was originally designed. Features which were once essential are now liabilities.

The first version of the PDF format was published by Adobe in 1993. This was a very different era: before the widespread adoption of Unicode, before UTF-8 existed at all, before JSON or XML, before the Web. The internet existed, but only as an obscure technology accessible to academics and a few government agencies - the first commercial ISP to accept the general public as customers had opened less than a year earlier.

The PDF format was intended to solve a specific problem of the time: accurately and precisely representing the printed page. This was a serious problem in the publishing industry, one of the first industries to rely heavily on computers. Hardware of the era could not handle high-resolution bitmap graphics, so documents had to be represented in some form of markup or DTP format and rendered as needed. This created the problem: even a slight difference between renderers could result in documents not matching, or in wasted print time. If an editor produced a perfect document and sent it to print, but the printer's ROM did not contain the custom font used, the document would print with text missing. Even a slight difference in the exact width of a font would cause columns to mis-align.

PDF solved this by combining the existing PostScript technology with a container that bundled a document with a page index and all required resources. The ability to exactly represent a printed page is the reason PDF became so popular - and the origin of some of its most severe limitations.

Problem 1: No reflow.

PDF does not reflow. This is by design.
Not reflowing was the very purpose for which PDF was written: to represent a page for printing, with absolute precision and without ambiguity, independent of the display device or environment. That is great for printing, but it makes PDFs almost unusable on small-screen devices. PDF does now have a reflow capability, but it was crudely bolted onto a format expressly designed not to support it. A PDF must be specifically authored for reflow; very few are, and even for those that are, it does not work very well.

Problem 2: Limited accessibility.

Much the same issues apply as with reflowing. When PDF was designed, accessibility was not a major concern in the computing industry, and decisions made at the time would later come to greatly impair the accessibility of the format. Later efforts to extend the format to rectify these shortcomings were only partially effective.

A PDF page, to give a simplified description, consists of a PostScript-derived program which generates that page, together with a set of resources (largely images and fonts) which that program can reference. The order in which the program draws the page bears no relationship to the reading order. It may draw from top to bottom, or piece by piece, or even individual paragraphs or lines in arbitrary order. This completely breaks screen readers. A viewer can guess - assume text reads from top to bottom, say - but it is only a guess, and the presence of multiple columns or image captions defeats it. Worse, many PDFs are little more than wrapped-up image data: a two-color bitmap of scanned text, which may or may not contain a hidden text layer. Simply try to copy-paste a large segment of text from a PDF file and you will see how difficult that can be. Nor can text be enlarged and reflowed for large print.
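To make the reading-order problem concrete, here is a toy sketch. The content stream below is invented for illustration (real PDF content streams are far more complex), but it uses genuine PDF text operators: BT/ET bracket a text object, Td sets the position, Tj shows a string. The runs are deliberately drawn in an order unrelated to reading order, and the top-to-bottom sort is exactly the kind of guess a screen reader or text extractor is forced to make:

```python
import re

# A toy PDF-style content stream: three text runs drawn in an order
# unrelated to reading order (caption first, then body, then heading).
content_stream = """
BT 72 300 Td (Figure 1: a caption) Tj ET
BT 72 500 Td (Body text continues here.) Tj ET
BT 72 700 Td (Chapter 3) Tj ET
"""

# Each BT...ET block positions text with 'x y Td' and shows it with Tj.
runs = re.findall(r"BT\s+([\d.]+)\s+([\d.]+)\s+Td\s+\((.*?)\)\s*Tj\s+ET",
                  content_stream)

# The order the viewer paints the page - what naive extraction yields.
drawing_order = [text for _, _, text in runs]

# A naive reading-order heuristic: sort top-to-bottom (PDF y grows upward).
# Multiple columns or floating captions defeat this immediately.
reading_order = [text for _, y, text in
                 sorted(runs, key=lambda r: -float(r[1]))]

print(drawing_order)   # caption, body, heading - the painted order
print(reading_order)   # heading, body, caption - the guessed order
```

Nothing in the file itself says which order is correct; that information was simply never recorded.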
The solution, again, is the PDF 1.4 'tagged PDF' feature - but few PDF files are authored with this capability, and those that are often use it incorrectly.

Problem 3: The format is horrific, leading to numerous mis-renderings and strange viewer-specific bugs.

If you have a large collection of PDF files from diverse sources, try an experiment: process all of them through qpdf or pdf2djvu and observe the warnings. There will be many warnings. The PDF specification, first published in 1993, has since undergone seven major revisions. Having written software to process PDF files without relying on one of the established libraries, I found it to be the single most horrible format I have ever had the misfortune to encounter.

If PDF were redesigned today, it would be built around a binary container - perhaps ZIP, more likely something that permits easier previewing - which included a convenient dictionary of objects at a fixed offset. The objects within would be represented in a standard format such as XML or JSON, for which many parsing libraries exist. Text would be stored as UTF-8, and images using whatever compression methods were in common use at the time of design; PNG and SVG would likely be supported. Pages would be stored as XML or JSON tables defining the page size and listing the objects to be placed on the page with their associated parameters. None of these technologies existed in 1993. To process a PDF file means thinking like a programmer from the early nineties, when file formats were far more uniquely engineered and libraries less common. The joys of PDF include strings that switch between ASCII and UTF-16 mid-string, line breaks that may be either CR or CR-LF in some places but not in others, dictionary objects that can only be properly read by walking a linked list, arbitrary whitespace, and string objects of undefined length with no fixed termination marker.
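As a sketch of the machinery involved, here is a minimal one-page PDF built by hand. The object layout, page text and filename are arbitrary choices for this illustration, but the structure - numbered objects, a stream whose /Length must match its byte count exactly, and a cross-reference table of hand-counted byte offsets - is the real thing, and it shows why so many generators produce subtly broken files:

```python
# The page's content stream; /Length below must equal its exact byte count.
stream = b"BT /F1 24 Tf 72 720 Td (Hello, PDF) Tj ET\n"

objects = [
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n",
    b"2 0 obj\n<< /Type /Pages /Kids [3 0 R] /Count 1 >>\nendobj\n",
    b"3 0 obj\n<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792]"
    b" /Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>\nendobj\n",
    b"4 0 obj\n<< /Length %d >>\nstream\n%sendstream\nendobj\n"
    % (len(stream), stream),
    b"5 0 obj\n<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>\nendobj\n",
]

pdf = b"%PDF-1.4\n"
offsets = []                       # byte offset of each object body
for obj in objects:
    offsets.append(len(pdf))
    pdf += obj

# The cross-reference table: fixed 20-byte entries, one per object, each
# holding the exact byte offset recorded above. Get one offset wrong and
# a reader must fall back to 'repairing' the file - hence the warnings.
xref_pos = len(pdf)
pdf += b"xref\n0 %d\n0000000000 65535 f \n" % (len(objects) + 1)
for off in offsets:
    pdf += b"%010d 00000 n \n" % off
pdf += (b"trailer\n<< /Size %d /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n"
        % (len(objects) + 1, xref_pos))

with open("hello.pdf", "wb") as f:  # a valid single-page PDF
    f.write(pdf)
```

Writing this is easy precisely because the writer controls every offset; a parser has to cope with every other writer's mistakes.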
From all of this mess, specification-violating PDF files are commonplace. Reading software accommodates this - mostly - by being tolerant of such malformities. Thus the warnings. Creating a PDF file is actually quite easy; parsing one is a nightmare.

Problem 4: Inefficient image representation.

The PDF format allows seven different forms of compression for image resources: five lossless (LZW, DEFLATE, RLE, CCITT fax and JBIG2) and two lossy (JPEG and JPEG 2000). On the lossy side, PDF does not lack at all. There is support for JPEG, of course; this works quite well, and the embedded data is a perfectly standard JPEG file that can easily be extracted. PDF also supports JPEG 2000, intended as a successor to JPEG, and this is a very common means of lossy image compression in PDFs. The JPEG 2000 standard fell out of use elsewhere for reasons relating to patents rather than any technological problem, but lives on within PDF files.

Lossless is another matter. The RLE compression is so inefficient as to merit no discussion. The CCITT and JBIG2 compressors work only on bi-color images; JBIG2 does a fairly effective job there, which is why many printed documents are scanned directly to bi-color PDF files. For more general-purpose compression, there is only the obsolete LZW and the widely-used DEFLATE. DEFLATE works, but it is dated - it still uses Huffman coding, and cannot match the efficiency of a more modern method such as LZMA. PDF also lacks effective support for any specialised lossless image compression beyond bi-color, though it can filter data before DEFLATE in a manner similar to PNG, achieving almost-equivalent performance. This means that lossless images in PDF files are, in general, huge - and so are the PDF files. The lossless compression is so inefficient that much PDF authoring software simply converts images to near-lossless JPEG instead, to keep file sizes manageable.

Alternatives.
One reason PDF remains in such common usage is that it still does an excellent job of achieving its original design goal: accurate representation of a printed page across diverse environments. If you want to print off a form or some vital paperwork, nothing quite matches it. PDF even ensures a printed document comes out at precisely the intended size, perfect for labels. No other format can quite achieve this; OpenXPS comes close.

All of the non-PDF alternatives share one common drawback: inconvenience. Because of PDF's popularity, many web browsers now include built-in PDF viewing and will open a PDF in a browser tab as if it were a web page. The alternatives all require the user to download and open the document. That takes only seconds, but seconds can be critical for user experience. If the document is intended for download, though, there are many other options.

The most obvious choices are the documents from which the PDFs are made. Many PDFs originate as office documents - either Microsoft Office or OpenDocument: efficient, accessible, editable formats which have none of the drawbacks of PDF. Since Microsoft Office transitioned from entirely proprietary formats to Office Open XML, these files can be opened easily by a number of open-source office suites.

For electronic books, the EPUB format may be a viable alternative. It was expressly designed for this purpose and has none of the drawbacks of PDF. It is supported on all ebook reading devices except the Kindle (though it can be converted to the Kindle's proprietary format with some additional effort), and readers exist for all major operating systems.

For strictly-accurate representation of pages, there are not many good alternatives. The OpenXPS format is one: like PDF, it is designed for page representation, and has many of the same advantages and disadvantages.
The main advantages over PDF are much smaller file sizes, due to more modern compression (mostly PNG and JPEG XR), and a much more modern, easier-to-implement file format that poses far less difficulty for programmers and is less prone to the out-of-spec files and compatibility issues that plague PDF. It lacks many of PDF's extensions for features such as cryptographic signing, interactive forms, DRM and encryption - though the last may be an advantage, as PDF's encryption has severe weaknesses and should be considered insecure. Accessibility-wise, though, it is no better than PDF: accessibility depends entirely upon how the file was authored.

Sometimes simplicity is the best solution. If the PDF contains nothing but text, why not just use text? The simplest of all formats, plain text, is still viable: universally compatible, compact, and highly accessible. I have encountered numerous PDFs which are obviously nothing more than pages of plain, unformatted text - there is little reason for such files to exist, and I do not understand why people create them.

The worst offenders are PDFs which contain nothing but images, one per page. These 'container' PDFs are very commonly encountered, often as the products of document scanning solutions. They exist to take advantage of PDF's universal support and efficient bi-color image compression: not every user has a convenient TIFF reader installed, but almost everyone has a PDF reader, and PDF offers a convenient way to embed multiple page-images in a single file. This is predictably wasteful of space. For a one-page document, you are better off with the image itself in a more efficient format. Better still is to properly process the image via OCR into text and separated graphical elements, but this is far from a straightforward procedure.
An obscure but potentially useful format is CBZ - though calling it a 'format' is a stretch. A CBZ file is nothing but a ZIP containing a set of images, given equal-length filenames containing a page number so that they display in asciibetical order. It is a convention rather than a formal specification, originally devised by collectors of digitised comic books. Many CBZ viewers exist, but the obscurity of the format renders it unsuitable for business use. It remains an ideal format for publishing comics. As a pure image container it is inherently inaccessible - the best a viewer can do is magnify part of the image, same as with PDF.
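Building one is trivial, which is rather the point. A sketch, using placeholder bytes where real page scans would go (the filenames and page count are arbitrary choices for this example):

```python
import io
import zipfile

# Placeholder page data; in a real CBZ these would be PNG or JPEG files.
pages = [b"<image data for page %d>" % n for n in range(1, 13)]

buf = io.BytesIO()
# ZIP_STORED: the images are already compressed, so recompressing is wasted effort.
with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
    # Zero-padded, equal-length names so page 10 sorts after page 9
    # under a plain asciibetical sort ("page-002" < "page-010").
    for n, data in enumerate(pages, start=1):
        zf.writestr("page-%03d.png" % n, data)

with open("comic.cbz", "wb") as f:
    f.write(buf.getvalue())

# A viewer needs nothing more than a sorted listing to find reading order.
with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as zf:
    names = zf.namelist()
print(names == sorted(names))  # True
```

That is the entire 'specification': any ZIP tool can create or unpack one, which is both the format's charm and the reason no business will standardise on it.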