XML, PDF, Meet Mr. Barnes and Mr. Noble

By Kas Thomas

Kenneth Brooks, Jr. of Barnes and Noble, Inc. recently published a paper that, for me, crystallized a lot of inklings and half-thoughts (of which there are many floating around my head these days) on XML, PDF, and the state of the eBook art. The paper is titled "XML and PDF in digital printing: irreconcilable differences?" and it's an eye-opener. You can see a copy of the Brooks paper at http://www.gca.org/papers/xmleurope2000/pdf/s01-05.pdf. It's only 5 pages long and will repay careful study if you're curious about PDF's role in eBooks.

Brooks tells how, last year, Barnes and Noble embarked on a Manhattan Project of sorts to convert existing p-book content into eBook deliverables, using today's (not tomorrow's) technology. Unable to find vendors who could do the necessary conversions at a reasonable cost and with the required accuracy of fewer than two typos per 100,000 characters, BN set up its own OCR operation in Mexico City, with post-processing in Manila. As Brooks describes it:

    In Mexico City, the books are scanned using either a 300 dpi process or a 600 dpi process depending on the ultimate formats required. If a publisher is requesting that the title go directly into POD [print on demand], the 600 dpi process is used to give appropriate resolution through the print engines. The PDF in this case is simply a package of 600 dpi scanned page images. If eBook formats are desired the 300 dpi process is used. In the eBook workflow the scans are then transmitted to the Manila conversion operation as TIFF images. It's in Manila that much of the interesting work takes place. The files are zoned, driving images to an image cleanup process and text into OCR. OCR text streams are then cleaned up using a heavy dose of AI supplemented with manual editing to produce high quality RTFs, with a small amount of styling applied.

The crux of the system, however, is the way in which XML is exploited. Early on, BN's Manila crew converted RTF (rich text format) to HTML, OEB, MS Reader, or whatever final target format was required. This proved messy, needless to say, so a new (automated) system was implemented in which zoned RTF gets converted to XML using a custom DTD. At this point the content is in what amounts to a proprietary BN format (XML that only BN can use). To get from XML to the target format (such as OEB or HTML), custom XSL stylesheets are brought into play. But since XSL and other XML rendering mechanisms haven't proven themselves capable of generating the quality of typography needed on the fly, says Brooks, there is really no alternative but to store two versions of the file: an XML version to generate the various non-paged outputs and a PDF version for paged outputs.
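
To make the single-source idea concrete, here is a minimal sketch in Python of one XML file feeding several non-paged outputs. It is a stand-in, not the BN system: Brooks describes custom XSL stylesheets, whereas this sketch uses plain functions from Python's standard library, and the element names (book, title, chapter, heading, para) are invented for illustration rather than taken from BN's DTD.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical house markup. These element names are assumptions made for
# this sketch; they do not come from the Barnes and Noble DTD.
SOURCE = """<book>
  <title>Sample Title</title>
  <chapter>
    <heading>Chapter One</heading>
    <para>First paragraph &amp; more.</para>
    <para>Second paragraph.</para>
  </chapter>
</book>"""


def to_html(book):
    """Render the parsed book as a minimal HTML fragment."""
    parts = ["<h1>%s</h1>" % escape(book.findtext("title", ""))]
    for chapter in book.iter("chapter"):
        parts.append("<h2>%s</h2>" % escape(chapter.findtext("heading", "")))
        for para in chapter.iter("para"):
            parts.append("<p>%s</p>" % escape(para.text or ""))
    return "\n".join(parts)


def to_plain_text(book):
    """Render the same parsed book as unstyled text."""
    lines = [book.findtext("title", "").upper(), ""]
    for chapter in book.iter("chapter"):
        lines.append(chapter.findtext("heading", ""))
        for para in chapter.iter("para"):
            lines.append(para.text or "")
        lines.append("")
    return "\n".join(lines).rstrip()


# One renderer per target format, playing the role of one stylesheet each.
RENDERERS = {"html": to_html, "text": to_plain_text}


def convert(xml_source, target):
    """Parse the XML once and hand the tree to the chosen renderer."""
    return RENDERERS[target](ET.fromstring(xml_source))


if __name__ == "__main__":
    print(convert(SOURCE, "html"))
    print(convert(SOURCE, "text"))
```

Note what is missing: nothing in either renderer knows where a page ends. That is the gap the stored PDF version fills.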

In other words, to get paginated, professionally typeset book pages, at the end of the day you have to use PDF.

At this point, some XML-savvy soul is probably thinking, "Hey, don't you know you can solve the typesetting and layout problem with SVG (Scalable Vector Graphics, the XML grammar for high-end web graphics)?" Which is true. You can, in fact, achieve wonderful typesetting effects with SVG and subpixel-accurate positioning of page elements, etc. The only problem is, SVG (like other flavors of XML and HTML) doesn't really know anything about pagination. You have to invent your own scheme for page breaks, media boxes, bleeds, and so forth. You also have to implement your own scheme for dealing with column balancing within pages and across page breaks, widow and orphan control across page breaks, hyphenation, justification, reflow around graphics, and all the other nightmare items that people gladly pay $699 to solve when they buy InDesign or Quark.
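
To give a feel for how much you end up inventing, here is a toy sketch in Python of just one item on that list: widow and orphan control. Everything about it is an assumption made for illustration. Paragraphs are reduced to line counts, every line is the same height, and the function name paginate and its parameters are hypothetical. I use "orphan" for the opening lines of a paragraph stranded at the foot of a page and "widow" for the closing lines stranded at the top of the next; real composition engines handle far more than this.

```python
def paginate(paragraph_lengths, page_height, min_lines=2):
    """Greedy pagination with simple widow and orphan control.

    paragraph_lengths: number of typeset lines in each paragraph.
    page_height: number of lines that fit on one page.
    min_lines: smallest piece of a split paragraph allowed at the foot
               of a page (orphan) or at the top of the next (widow).
    Returns a list of pages; each page is a list of
    (paragraph_index, line_count) chunks.
    """
    if page_height < 1:
        raise ValueError("page_height must be at least 1")

    pages, current, free = [], [], page_height
    for index, remaining in enumerate(paragraph_lengths):
        while remaining > 0:
            if remaining <= free:
                take = remaining          # the rest of the paragraph fits
            else:
                take = free               # the paragraph must be split
                # Widow control: carry at least min_lines to the next page.
                if remaining - take < min_lines:
                    take = remaining - min_lines
                # Orphan control: leave at least min_lines on this page.
                if take < min_lines:
                    take = 0
            if take == 0:
                if current:
                    # Break early and retry on a fresh page.
                    pages.append(current)
                    current, free = [], page_height
                    continue
                # Page too short to honor the rules: place what fits.
                take = min(remaining, free)
            current.append((index, take))
            remaining -= take
            free -= take
            if free == 0:
                pages.append(current)
                current, free = [], page_height
    if current:
        pages.append(current)
    return pages


if __name__ == "__main__":
    # Three paragraphs of 5, 3, and 4 lines on 7-line pages.
    print(paginate([5, 3, 4], page_height=7))
```

With 7-line pages, the 3-line second paragraph cannot be split without stranding a single line, so the sketch leaves two lines blank at the foot of page one and starts the paragraph on page two. Column balancing, hyphenation, justification, and reflow around graphics are all still unsolved at this point.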

Mass-Market eContent

If the great downfall of markup languages is their implicit assumption that every document is one page long, the cardinal disappointment of PDF must be its lack of ability to reflow text. It's interesting to consider who has the harder problem: Adobe or Microsoft. Adobe needs to find a way to make PDF pages reflow. Microsoft (with its MS Reader format) needs to find a way to accommodate complex page layouts and have them paginate (and print) correctly. I'd say it's pretty obvious that Adobe has less work to do (and will no doubt be able to offer some kind of solution soon).

Microsoft, on the other hand, still doesn't get it. Their eBook ninjas are operating on the basis of a top-down business model in which the biggest fruit must be shaken from the tree first. Hence the push to get a Windows version of MS Reader into PC and laptop users' hands, and the hasty deal-making with large purveyors of (mass-market) content. The idea is, let's accomplish a premier coup by being first into the mass market with the first true eBook bestsellers, available only on Pocket PC and Windows. (And we'll worry about mopping up the rest of the market, the crumbs on the floor as it were, later.) The Microsoft model is always to be the first sperm cell to enter the egg, thus sealing off the egg against other sperm cells.

The reason this won't work is that, first of all, it's an elitist, exclusionary model (blocking out Linux and Mac users, for example) with built-in hardware dependencies. Microsoft knows a lot about that sort of market, but unfortunately for the Gates/Ballmer axis, that's not where the market is headed.

The epochal, paradigm-reordering, one might even say seismic importance of the Age of eContent is its deployment model, which is (compared to print publishing) fast, low-cost, unconstrained by length/size, and open to all authors, with low or no overhead in terms of warehousing or transportation costs, returns, spoilage, shrinkage, etc. Works that wouldn't otherwise see the light of day will be published, and bought, as eContent. And it's the aggregate sales of these back-aisle titles via the Web that will be important in the long run.

Because Microsoft is obsessed with what it perceives as bestseller content, its MS Reader format faces a tough migration path to richly formatted back-aisle content. It's stone-simple to put a Stephen King novel into a markup-language format, because it's all text, and all one-column-wide text, at that. The layout requirements could not be simpler. HTML 1.0 can handle it.

But can you put Peachpit's Visual QuickStart guides or their Photoshop "WOW" book into a markup-language format (or MS Reader) and still have it look, feel, and read like a VQS or WOW book? I don't think so. Not today. Certainly not on a Pocket PC.

Microsoft is planning to steal the eBook market by providing the technology behind the Tom Clancy and Anne Rice blockbuster bestsellers of tomorrow. Which is kind of like trying to capture the shoe market by flooding stores with millions of size-9-and-a-half mahogany brown loafers. That may be the most popular size shoe, and mahogany brown may be the most popular color for loafers, and the overall market for loafers may be huge. But if that's the only size, color, and style of footwear you're offering, you shouldn't think you're going to run away, so to speak, with the shoe market.

XML, the Ultimate Smelly Sock

If the Barnes and Noble experience is any indication, the e-publishing world is in for a rude shock when it tries to convert backlist p-content to e-content. The majority of printed content today is visually rich, and getting richer. The movement to electronic media will only explode the richness, as authors and publishers discover (for example) that it no longer costs more to use unlimited colors, bleeds, odd trim sizes, etc. PDF is well positioned to take advantage of this sudden rush to richness. XML, on the other hand, despite all the happy-talk and rosy predictions, is (in and of itself) no solution to anything. Getting from XML to a composited page (with or without XSL) is no easy task. In fact, people will be working on it for years to come. Many millions of dollars will be spent trying to create, in XML, what has already been done in PDF.

Conclusion No. 1: PDF still has an important role to play in eContent deployment. Right now, and for years to come.

Conclusion No. 2: The Redmond contingent would do well to consider whether it's the bottom of the food chain that's more important, or the top.

Ask any biologist.