
subject: Structured Content: PDF to HTML


A while back I included the following as one of the areas of interest of the
PDF/D Consortium:

Structured Documents and Single Sourcing: improving round-trips to document
software

What did I mean by Structured Documents? For years Solid Documents has been
converting PDF files to Word documents with a focus on retaining format and
layout so that customers can repurpose the content. While this is a great
solution for a large number of customers, it is not the only interesting type
of reconstruction.

PDF is by nature a "document" format: the layout is in the form of pages.
Content also needs to exist in alternate formats, such as a continuously
flowing stream. Use cases for continuously flowing content include:

* conversion to HTML to reflow for form factors other than "pages"

* conversion to content management systems where structure is more important
than layout and formatting

* conversion for alternate readers for people with disabilities (text to
speech, etc.)

Reconstruction for these use cases focuses more on the structure of the
document than on the layout and formatting. For example, we need to take
unstructured PDF files and recognize columns, tables, lists, headers and
footers, etc. This allows us to organize the content in a logical structure.
Ultimately, we'll recognize topics and sections too so that we can produce
logical hierarchies from plain old non-tagged PDF files.
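
To make the idea of recovering structure a little more concrete, here is a
minimal sketch in Python. It is not how Solid Documents does reconstruction; it
only assumes that text lines have already been extracted from a PDF in reading
order, and it groups them into paragraphs and bullet lists before emitting
HTML that can reflow. The function name lines_to_html and the bullet markers
it looks for are hypothetical choices made for this illustration.

```python
import html

# Assumed bullet markers; each one is exactly two characters long.
BULLETS = ("* ", "- ", "\u2022 ")


def lines_to_html(lines):
    """Group extracted text lines into <p> and <ul> blocks (illustrative only)."""
    out = []    # finished HTML fragments
    para = []   # lines of the paragraph currently being built
    items = []  # items of the list currently being built

    def flush_para():
        if para:
            out.append("<p>" + html.escape(" ".join(para)) + "</p>")
            para.clear()

    def flush_list():
        if items:
            lis = "".join("<li>" + html.escape(i) + "</li>" for i in items)
            out.append("<ul>" + lis + "</ul>")
            items.clear()

    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes whatever block is open.
            flush_para()
            flush_list()
        elif line.startswith(BULLETS):
            # A bullet marker starts a new list item.
            flush_para()
            items.append(line[2:].strip())
        elif items:
            # A non-bullet line directly after an item is a wrapped continuation.
            items[-1] += " " + line
        else:
            # Anything else is ordinary paragraph text.
            para.append(line)

    flush_para()
    flush_list()
    return "\n".join(out)


if __name__ == "__main__":
    sample = [
        "Use cases",
        "include:",
        "",
        "* conversion to HTML",
        "for small screens",
        "* text to speech",
    ]
    print(lines_to_html(sample))
```

Real reconstruction also has to work from geometry (columns, tables, headers
and footers), which this sketch ignores entirely; it only shows why a logical
structure, once recognized, is easy to reflow.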

One great example of where conventional PDF pages are not the most appropriate
way to read a document is the small screen of a handheld device. For example,
the typical Blackberry has a 3"x2" screen with a resolution of something like
320x240 pixels.

In this diagram the little rectangles represent the viewing area on a
Blackberry when viewing a document laid out on 8.5"x11" pages.
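
As a rough back-of-the-envelope check, the Python sketch below counts how many
screen-sized rectangles are needed to cover one page. It assumes the page is
shown at its actual physical size and that the 3"x2" screen is held in
landscape; the function name screens_per_page is hypothetical, and the result
may not match the number of rectangles in the diagram if a different zoom
level was used there.

```python
import math


def screens_per_page(page_w_in, page_h_in, screen_w_in, screen_h_in):
    """Return (columns, rows, total) of screen-sized tiles covering one page.

    Assumes the page is viewed at 1:1 physical scale, so each tile is exactly
    one screen of content and partial tiles at the edges still count.
    """
    cols = math.ceil(page_w_in / screen_w_in)
    rows = math.ceil(page_h_in / screen_h_in)
    return cols, rows, cols * rows


if __name__ == "__main__":
    cols, rows, total = screens_per_page(8.5, 11, 3, 2)
    print(f"{cols} across x {rows} down = {total} screens per page")
```

Under those assumptions a reader has to pan across 3 columns and 6 rows, or 18
separate views, to read a single letter-sized page.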
