subject:
Structured Content: PDF to HTML
A while back I included the following as one of the areas of interest of the
PDF/D Consortium:
Structured Documents and Single Sourcing: improving round-trips to document
software
What did I mean by Structured Documents? For years Solid Documents has been
converting PDF files to Word documents with a focus on retaining format and
layout to allow customers to repurpose the content. While this is a great
solution for a large number of customers, it is not the only type of
reconstruction that is interesting.
PDF is by nature a "document" format: the layout is in the form of pages.
Content also needs to exist in alternate formats
like a continuously flowing stream. Use cases for continuously flowing content
include:
* conversion to HTML to reflow for form factors other than "pages"
* conversion to content management systems where structure is more important
than layout and formatting
* conversion for alternate readers for people with disabilities (text to speech,
etc.)
Reconstruction for these use cases focuses more on the structure of the document
than on the layout and formatting. For
example, we need to take unstructured PDF files and recognize columns, tables,
lists, headers and footers, etc. This allows
us to organize the content in a logical structure. Ultimately, we'll recognize
topics and sections too so that we can produce
logical hierarchies from plain old non-tagged PDF files.
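To make the idea of recognizing columns a little more concrete, here is a minimal sketch in Python. It is not how Solid Documents does it; it is just one simple heuristic, assuming the text blocks and their page coordinates have already been extracted from the PDF. The `Block` type, the function names, and the `min_gap` threshold are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A hypothetical extracted text block with page coordinates in points."""
    x0: float   # left edge
    x1: float   # right edge
    top: float  # distance from the top of the page
    text: str


def group_into_columns(blocks, min_gap=18.0):
    """Split blocks into columns wherever a horizontal gap wider than min_gap appears."""
    columns = []
    right_edge = None  # rightmost edge seen so far in the current column
    for block in sorted(blocks, key=lambda b: b.x0):
        if right_edge is None or block.x0 - right_edge > min_gap:
            columns.append([])  # gap found: start a new column
            right_edge = block.x1
        else:
            right_edge = max(right_edge, block.x1)
        columns[-1].append(block)
    # Within each column, read from top to bottom.
    return [sorted(col, key=lambda b: b.top) for col in columns]


def reading_order(blocks, min_gap=18.0):
    """Return block texts column by column, left to right."""
    return [b.text for col in group_into_columns(blocks, min_gap) for b in col]


page = [
    Block(320, 560, 100, "C"),
    Block(50, 290, 300, "B"),
    Block(50, 290, 100, "A"),
    Block(320, 560, 300, "D"),
]
print(reading_order(page))
```

A heuristic this simple breaks down as soon as a full-width heading or table spans both columns, which is part of why real structure recognition is hard.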
One great example of where conventional PDF pages are not the most appropriate
way to read a document is the small screen of a handheld device. For example,
the typical Blackberry has a 3"x2" screen with a resolution of roughly 320x240
pixels.
In this diagram the little rectangles represent the viewing area on a Blackberry
when viewing a document laid out on 8.5"x11"
pages.
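A quick back-of-the-envelope calculation shows how many of those viewing areas it takes to cover a page. This sketch assumes the page is shown at its actual physical size (one inch of page per inch of screen), which is my assumption for illustration; the function name is made up too.

```python
import math


def viewports_per_page(page_w_in, page_h_in, screen_w_in, screen_h_in):
    """Count the screen-sized tiles needed to cover one page at 1:1 physical scale."""
    cols = math.ceil(page_w_in / screen_w_in)
    rows = math.ceil(page_h_in / screen_h_in)
    return cols, rows, cols * rows


# 8.5"x11" page viewed through a 3"x2" screen
cols, rows, total = viewports_per_page(8.5, 11.0, 3.0, 2.0)
print(cols, rows, total)  # 3 6 18
```

Under that assumption a reader has to pan across 3 screens horizontally and 6 vertically, 18 in all, to see a single page.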
by: same