Shaun'blog: An interesting article on pdf2word

PDF-to-Word Conversion: Why It’s So Hard to Do

When a file is converted to PDF, it loses its meaning. On the surface all the information is there, and to your eyes it looks exactly the same, but underneath, all the method, structure and intelligence used when designing the original document has been lost^†. This is the heart of the challenge faced when converting PDF files back to formats like DOC (Microsoft Word), RTF and HTML, and it is not dissimilar to the challenge faced when OCRing paper-based documents.

Once you have your PDF file, the original layout and meaning formed from text-based building blocks (words, lines and line breaks, paragraphs, columns, tables, headers/footers and outlines) are long gone. The content of a PDF just describes how and where on the page each object should be displayed.

This is a far cry from where you would be if you went back to the original file in Microsoft Word, Open Office, Google Docs, Adobe InDesign, or whatever. These kinds of word processing and desktop publishing applications follow similar principles, which is why converting files between them, while certainly not perfect, is a much simpler process.

How files are normally designed and edited in word processing applications

Most word processing applications use the same sort of principles for formatting and giving meaning to content. For the sake of this article, I’ll use Microsoft Word as the example. Here are a few of the main ones:

Paragraphs let you work with text that reflows across lines and can be quickly reformatted using styles to adjust spacing, indent, size and more.
Columns let you build more complex page layouts and, in many cases, make content easier to follow by grouping related material together.
Tables let you lay out tabular information not suited to the more linear formatting offered by paragraphs and columns.
Headers & footers let you repeat content more consistently across multiple pages.

PDF to Word is like the OCR process

If you’re familiar with optical character recognition (OCR) and converting paper to electronic form, you might have already grasped some of the complexities we’re dealing with. Apart from recognizing fonts and how they should be displayed on the page, the challenges are much the same for both, as all meaning and structure are gone from the contents.

The loss of the text stream

Take a look at the screenshot below. The first three lines show how the text is displayed on the page in a PDF. The second set shows how many separate objects that text is broken into inside the PDF. For each small text object, the PDF includes co-ordinates that simply describe where it should be positioned on the page and how it should be displayed.

Text objects in PDF
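
To make the fragmentation concrete, here is a rough sketch of what a few text objects can look like and how their text and co-ordinates might be pulled out. The operators `BT`, `ET`, `Tf`, `Td` and `Tj` are standard PDF text operators, but the content stream itself is invented and heavily simplified: real streams are usually compressed and also use operators such as `TJ` and `Tm`, which this toy regular expression does not handle.

```python
import re

# Invented, simplified content stream: one word split across three text objects.
content_stream = """
BT /F1 12 Tf 72 720 Td (Con) Tj ET
BT /F1 12 Tf 92.2 720 Td (vert) Tj ET
BT /F1 12 Tf 118 720 Td (PDF) Tj ET
"""

# Matches "<x> <y> Td (<text>) Tj" only; enough for this toy input.
pattern = re.compile(r"([\d.]+) ([\d.]+) Td \((.*?)\) Tj")

# Each entry is (text, x, y): a fragment and where it is placed on the page.
text_objects = [
    (text, float(x), float(y))
    for x, y, text in pattern.findall(content_stream)
]

for obj in text_objects:
    print(obj)
```

Nothing in the stream says that "Con" and "vert" belong to the same word; the only clue is how close together they are positioned.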

The first challenge for exporting text back out of PDF files comes when the streams of text from the original word processor get broken up into these seemingly random chunks. From here we must start to discern what their relationship is to the content around them. This process begins by sucking out all the text from the PDF.

Rediscovering words one object at a time

To begin with, the PDF-to-Word converter must put the words back together. We can look at each text object, its properties and its distance from surrounding text objects, and start to see where words and the whitespace between them might exist.
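
As a minimal sketch of that idea, the snippet below joins fragments on a single line and inserts a space only where the horizontal gap is large relative to the font size. The `TextObject` class, its fields and the `space_ratio` threshold are all assumptions made for illustration, not the data model or tuning of any particular converter.

```python
from dataclasses import dataclass


@dataclass
class TextObject:
    """Hypothetical stand-in for one positioned text fragment."""
    text: str
    x: float          # left edge, in points
    width: float      # rendered width, in points
    font_size: float  # in points


def join_objects_on_line(objects, space_ratio=0.25):
    """Rebuild the text of one line from its fragments.

    space_ratio is an assumed tuning knob: a gap wider than this
    fraction of the font size is treated as a space between words.
    """
    pieces = []
    prev = None
    for obj in sorted(objects, key=lambda o: o.x):
        if prev is not None:
            gap = obj.x - (prev.x + prev.width)
            if gap > space_ratio * prev.font_size:
                pieces.append(" ")
        pieces.append(obj.text)
        prev = obj
    return "".join(pieces)


fragments = [
    TextObject("Con", 72.0, 20.0, 12.0),
    TextObject("vert", 92.2, 22.0, 12.0),
    TextObject("PDF", 118.0, 21.0, 12.0),
]
print(join_objects_on_line(fragments))  # Convert PDF
```

A single fixed ratio is a simplification; justified text and tight kerning both stretch and shrink gaps, so the threshold would likely need to adapt to each line.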

Good conversion starts at the accurate detection of line ends

The key to recreating an editable Word file is accurately detecting where each line ends. If you take a look at the example below, it’s pretty easy to see where each of the lines ends, new paragraphs start, and columns are placed alongside each other, but inside the PDF there is nothing that notes these facts.

If you look at the example below, getting it wrong could easily result in one line of text in the left column merging with a completely unrelated line of text from the right column, which is not a hard mistake to make when you’re making your decision based on the amount of whitespace between text objects.

Line breaks in PDF
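
The sketch below shows one simple way to approach this: group words that share roughly the same baseline, then split each group wherever the horizontal gap is wide enough to be a column gutter rather than a word space. The `Word` class and both thresholds (`y_tolerance`, `column_gap`) are hypothetical and chosen only to suit the sample data.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical word box; y is the baseline measured from the top of the page."""
    text: str
    x0: float  # left edge
    x1: float  # right edge
    y: float


def detect_lines(words, y_tolerance=2.0, column_gap=18.0):
    """Group words into lines without merging text across a column gutter."""
    # Step 1: bucket words whose baselines are within y_tolerance of each other.
    rows = []
    for word in sorted(words, key=lambda w: w.y):
        if rows and abs(word.y - rows[-1][0].y) <= y_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])

    # Step 2: within a row, a gap wider than column_gap ends the current line.
    lines = []
    for row in rows:
        row.sort(key=lambda w: w.x0)
        current = [row[0]]
        for prev, word in zip(row, row[1:]):
            if word.x0 - prev.x1 > column_gap:
                lines.append(current)
                current = [word]
            else:
                current.append(word)
        lines.append(current)
    return [" ".join(w.text for w in line) for line in lines]


words = [
    Word("the", 72, 90, 100), Word("quick", 94, 124, 100),
    Word("Prices", 320, 356, 100.5), Word("rose", 360, 384, 100.5),
    Word("brown", 72, 104, 114), Word("again", 320, 350, 114),
]
print(detect_lines(words))
# ['the quick', 'Prices rose', 'brown', 'again']
```

With a fixed gap threshold, a narrow gutter or a wide tab stop can still fool this check, which is exactly the mistake described above.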

Recreating editable paragraphs and detecting them accurately is next

Detecting line ends correctly not only saves you from merging columns of content together, it does the even more important task of starting to rebuild the structure of the text content. Once you can see a series of horizontal lines you can start deducing where a paragraph with reflowing text might need to be, and once you have that you can start re-creating a Microsoft Word file that is highly editable.

Of course, it’s never that simple (if it were I wouldn’t be writing about it!). Unfortunately, paragraph-based content can be presented in many different ways, which makes accurate detection and reproduction even more difficult. Examples include:

Different text styles and colours.
Drop caps at the beginning of chapters or sections.
Indents applied to multiple lines to indicate new (and different) paragraphs of content such as quotes.
First-line indents to indicate the start (and end) of paragraphs.
Alignment (left, right, centered or justified) that changes or stays the same across lines, which can indicate separate paragraphs and content blocks.
Changes in line spacing, which can show that the content was originally part of a separate paragraph and should therefore be treated differently.

Paragraphs in PDF

A few examples of the different kinds of paragraphs a document might contain.
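
Two of the cues listed above, changes in line spacing and first-line indents, can be sketched as follows. The `Line` class and the thresholds `gap_factor` and `indent_threshold` are assumptions for illustration. The sketch deliberately ignores the other cues, so an indented block quote, for example, would be wrongly split into one paragraph per line.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Line:
    """Hypothetical detected line; top is measured from the top of the page."""
    text: str
    left: float
    top: float


def group_paragraphs(lines, gap_factor=1.5, indent_threshold=6.0):
    """Start a new paragraph on extra vertical space or a first-line indent."""
    if not lines:
        return []
    lines = sorted(lines, key=lambda l: l.top)
    gaps = [b.top - a.top for a, b in zip(lines, lines[1:])]
    typical_gap = median(gaps) if gaps else 0.0
    body_left = min(l.left for l in lines)  # assumed left margin of body text

    paragraphs = [[lines[0]]]
    for prev, line in zip(lines, lines[1:]):
        extra_space = (line.top - prev.top) > gap_factor * typical_gap
        indented = (line.left - body_left) > indent_threshold
        if extra_space or indented:
            paragraphs.append([line])
        else:
            paragraphs[-1].append(line)
    return [" ".join(l.text for l in p) for p in paragraphs]


sample = [
    Line("First paragraph starts", 84, 100),
    Line("and continues here.", 72, 114),
    Line("Second paragraph with", 84, 128),
    Line("a first-line indent.", 72, 142),
    Line("Third after extra space.", 72, 170),
]
for paragraph in group_paragraphs(sample):
    print(paragraph)
```

Once lines are merged like this, the text can reflow in Word instead of being pinned to its original line ends.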

Laying out the page with columns

Just as we examine the relationships and patterns between lines of text to rediscover paragraphs, we can start to figure out where columns might exist by looking at all the text and paragraphs on the page. For example, if we see a series of paragraphs whose left edges all sit on the same vertical axis, and each paragraph uses similar text styles and spacing, we may well have found ourselves a column of content.

For it to all come together well, we need to take a holistic approach and not assume too much before looking at all the page content. Once you think you have the overall text layout for the page, you need to take all the necessary page measurements so you can make use of them in Microsoft Word.

Using the Column settings in Word, you need to specify the number of columns, their widths and spacing between them, and then insert column- and line-breaks to ensure the text is placed in the right column and in the right area of the page.
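
A minimal sketch of the measurement step, assuming paragraphs have already been detected as bounding boxes: cluster the boxes by left edge, then derive the column count, widths and spacing that would need to be entered in Word. The `Block` class and the `tolerance` value are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical paragraph bounding box, in points."""
    left: float
    right: float


def column_settings(blocks, tolerance=3.0):
    """Return (column count, column widths, spacing between columns)."""
    columns = []  # each entry is [left edge, right edge]
    for block in sorted(blocks, key=lambda b: b.left):
        if columns and abs(block.left - columns[-1][0]) <= tolerance:
            # Same left edge as the current column: widen it if needed.
            columns[-1][1] = max(columns[-1][1], block.right)
        else:
            columns.append([block.left, block.right])
    widths = [right - left for left, right in columns]
    spacings = [nxt[0] - cur[1] for cur, nxt in zip(columns, columns[1:])]
    return len(columns), widths, spacings


page_blocks = [
    Block(72.0, 288.0), Block(72.5, 280.0),
    Block(324.0, 540.0), Block(324.0, 530.0),
]
print(column_settings(page_blocks))  # (2, [216.0, 216.0], [36.0])
```

A heading that spans the full page width would distort these numbers, so blocks like that would need to be set aside before measuring, which is one reason for looking at the whole page first.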

Truly editable comes with advanced table detection

Tabulated content (i.e. tables) is similar to columns, only more complex. You’re dealing with both columns and rows, and with varying amounts of information from which to discern tables accurately. Quality table detection borders on a black art, as each table you encounter is different, forcing you to run through a large range of processes before working out whether the content is a table or not.

If you look at the table below you can see how cells and tables can be formatted in different ways. There are more obvious signposts such as cell background coloring and borders to show it’s a table, but as you look to the bottom of the table you’ll see those indicators are gone.

Table text in PDF files

To get to that level of detection you have to run many processes to find a pattern and accurately identify the presence of a table.
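
One such process might look for runs of consecutive rows whose cells start at the same horizontal positions, which works even when borders and shading are absent. This is only a sketch of a single heuristic under assumed thresholds (`tolerance`, `min_rows`, `min_columns`); it ignores borders, shading, merged cells and cells whose text wraps over several lines.

```python
def find_table_rows(rows, tolerance=2.0, min_rows=3, min_columns=2):
    """Find runs of rows with matching cell positions.

    rows: list of rows from top to bottom, each a list of the left-edge
    x positions of the text chunks on that row.
    Returns (start, end) index pairs, end exclusive.
    """
    def aligned(a, b):
        return (
            len(a) == len(b)
            and len(a) >= min_columns
            and all(abs(x - y) <= tolerance for x, y in zip(a, b))
        )

    tables = []
    start = 0
    for i in range(1, len(rows) + 1):
        # A run ends at the last row or where alignment breaks.
        if i == len(rows) or not aligned(rows[i - 1], rows[i]):
            if i - start >= min_rows:
                tables.append((start, i))
            start = i
    return tables


row_positions = [
    [72.0],                # ordinary paragraph line
    [72.0, 200.0, 330.0],  # three aligned rows follow
    [72.5, 200.0, 329.0],
    [72.0, 201.0, 330.0],
    [72.0],                # back to paragraph text
]
print(find_table_rows(row_positions))  # [(1, 4)]
```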

Separating section/chapter-level content from page-level content

When determining the correct page margins to use when converting the PDF to Word, header and footer content often gets in the way and causes layout and editability problems. If you can detect this content and keep it separate, the normal page content is much more likely to lay out well.

You do this by scanning across multiple pages for similar content positioned in the same places at the bottom and/or top of the page, and when you find it, keeping it separate from the normal content. This advanced technique keeps the pages cleaner and allows you to incorporate this content into the actual header and footer areas of a Microsoft Word document.
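
The scan can be sketched roughly as follows, assuming each page is a list of `(text, y_from_top)` pairs. The page height of 792 points (US Letter), the 72-point margin band and the `min_share` threshold are all assumptions; digits are masked so that running page numbers still count as the same content.

```python
import re
from collections import Counter


def find_repeating_furniture(pages, page_height=792.0, margin=72.0, min_share=0.6):
    """Return (masked text, rounded y) keys that repeat near page edges."""
    counts = Counter()
    for page in pages:
        seen = set()
        for text, y in page:
            if y <= margin or y >= page_height - margin:
                # Mask digits so "Page 3" and "Page 4" compare as equal.
                masked = re.sub(r"\d+", "#", text.strip())
                seen.add((masked, round(y)))
        counts.update(seen)  # count each key at most once per page
    needed = min_share * len(pages)
    return {key for key, n in counts.items() if n >= needed}


pages = [
    [("Annual Report", 36.0), ("Intro text", 120.0), ("Page 1", 760.0)],
    [("Annual Report", 36.0), ("More text", 120.0), ("Page 2", 760.0)],
    [("Annual Report", 36.2), ("Closing", 300.0), ("Page 3", 760.0)],
]
print(sorted(find_repeating_furniture(pages)))
# [('Annual Report', 36), ('Page #', 760)]
```

Anything returned here would be moved into the Word header or footer and excluded when working out page margins for the body text.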

The other ways to lay out the text …

There are other, easier ways to produce an OK visual rendition of the page layout. PDF-to-Word converters sometimes do it using:

Line breaks. Instead of spending a lot of time working out whether lines above and below each other are part of the same paragraph, just force a line break at the end of every line.
Text boxes. Insert paragraphs of text into text boxes to get the position on the page the same as the original.
Tabs for page columns. In between columns just insert a tab to get the positioning approximately right.
Tabs for table columns. Again, in between each table column, insert a tab to position it.

Unfortunately, all these have an inherent problem because they only deal with the presentation and positioning of the content on the page. When it comes to making any changes to the text, the nightmare begins:

For line breaks, you must manually remove each one to reflow the text into proper paragraphs.
For text boxes, you’re isolated from the rest of the on-page text and many formatting tools won’t work.
For tabs in columns and tables, you make text editing about as awkward as it can be as the text flow moves left to right across the page and ignores the fact that the content in some cells and columns originally flowed downwards over multiple lines.

These techniques are rarely the right answer.

Winning the battle between visual accuracy and editability

The biggest challenge when converting PDF to Word (and other formats) is to retain the visual appearance of the document, while adding back to it a meaningful structure that makes it possible to easily edit and re-purpose the content. A constant battle exists because techniques to improve visual accuracy can easily force the converter into creating less editable content, and vice versa. Moreover, no two documents are formatted and laid out exactly the same, meaning converters must be as flexible as possible.

Achieving 100% accurate PDF to Word conversion is an impossible feat. Instead, we must aspire to being as accurate as we possibly can, with as many documents as possible. That’s where the biggest difference lies between the tools available to convert PDF to Word.

^† It is possible to create PDF files with embedded structure information in them; however, most PDF files don’t have this structure.

Note for Nitro Pro users: If you’re a Nitro Pro user and have been waiting for us to improve our PDF to Word performance, you’ll be pleased to know that with the 6.0 release we’ll be including a whole new PDF-to-Word conversion engine. In fact, you can try it out now as our new free PDF to Word online service uses the same core technology. To get access to the beta, just enter nitro as your invite code.

Friday, March 13, 2009
