COMPUTER-ASSISTED
REPORTING
Fred Vallance-Jones
Liberate your
PDF files
Luckily there are ways
for journalists to gain access to and manipulate data that institutions
would rather keep under lock and key
I love them. I love them
not.
That just about sums
up my relationship with Adobe Acrobat files. PDF files, as they
are also known, have popped up everywhere on the Internet. They
are the format of choice for companies and governments making
documents available to the public because the documents look just
like the printed versions. And they can’t be altered. I love them
because they look just like the printed versions. I hate them
because they can’t be altered.
As soon as you want
to put some information into a spreadsheet or database, you discover
it’s locked up as securely as Anne Boleyn in the Tower of London.
Countless journalists
using Adobe’s free Acrobat Reader have shared the frustration
of trying to copy and paste a PDF table, only to find it lands
in Excel as an unreadable jumble. Luckily, we don’t have to lose
our heads over it. Third-party software vendors have developed
products to get us over the wall.
The best one I have
seen of late is called Jade. A free demo is available from BCL
Computers at
www.bcl-computers.com. Jade flawlessly transforms multiple
pages of PDF tables into delimited text. Any spreadsheet or database
program can then import the text file. The program will also export
ordinary text to use in a word processor. In my limited tests,
Jade even worked on PDF files with Acrobat’s security features
turned on. Jade costs about $350. It is a plug-in for the full
version of Adobe Acrobat, so you will need that program as well. That
puts the full package at about $700, but it is well worth it if
you will be working with a lot of PDF files.
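For readers who script their data work, a delimited text file can also be read without a spreadsheet. What follows is a minimal sketch in Python, assuming a tab-delimited file with a header row. The sample data, the column names and the `read_delimited` helper are invented for illustration and are not output from Jade.

```python
import csv
import io

def read_delimited(text, delimiter="\t"):
    """Parse delimited text into a list of dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [dict(row) for row in reader]

# Hypothetical sample standing in for a table exported as delimited text.
sample = "Department\tBudget\nParks\t1200000\nTransit\t5400000\n"

rows = read_delimited(sample)
total = sum(int(row["Budget"]) for row in rows)
print(rows[0]["Department"], total)
```

If the export uses commas or another separator, pass it as the `delimiter` argument.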
Before Jade, the reigning
champion in this field was Redwing. It comes from Datawatch (www.datawatch.com)
and also extracts data flawlessly. There is no demo version to
try out, but you can buy it online. Like Jade, Redwing is an Acrobat
plug-in and will extract as many pages as you like, either as
tables or text. It will also export directly to Excel format and
can bulk-extract entire documents into Datawatch’s Monarch program.
If you want to find
out what Monarch does, check out the Datawatch website. Unlike
Jade, Redwing will not work with Acrobat version 5. It also won’t
work when security is turned on in the document. For both of these
reasons, I do not recommend it as heartily as I once did. At $350
(U.S.) it is also considerably more expensive than Jade.
If you are looking
for a high-end tool that doesn’t require the full version of Acrobat,
you may want to try OmniPage Pro 12 by Scansoft. It takes a different
approach to liberating data from PDF files. OmniPage Pro is an
optical character recognition (OCR) program and its primary use
is to convert text from scanned documents into something you can
edit with a word processor. But starting with version 11, the
folks at Scansoft built in the ability to work with PDF files.
I was so impressed with this that I demonstrated OmniPage Pro
at the last CAJ national conference in Ottawa. While Jade and
Redwing work by extracting text from the actual document, OmniPage
Pro treats the PDF file as an image and tries to recognize each
character optically. This has a couple of great advantages. It
can never be defeated by Acrobat’s security features, and will
work on PDF files that were made from scanned images rather than
from documents created in office software. No other extraction
method can do this. As with Redwing, OmniPage Pro will save tables
directly to Excel format. The main downside is that you need to
proofread the output file to make sure the program recognized
all of the text correctly. That can be a bit tedious if there are
a great many pages. The program is also expensive, at $900 for
the full version. If you own a scanner with a stripped-down OCR
program, you may be eligible for the upgrade version, which costs
about $190.
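Part of the proofreading can be automated. OCR software commonly confuses look-alike characters, such as the letter O for a zero or a lowercase l for the digit 1, so one rough check is to flag any cell in a numeric column that does not look like a number. The sketch below is in Python and assumes the table has already been loaded as a list of rows. The `suspect_cells` helper and the sample figures are hypothetical, and the check supplements reading the output rather than replacing it.

```python
import re

# Accepts plain or comma-grouped numbers, with optional sign, dollar sign and decimals.
NUMERIC = re.compile(r"^-?\$?\d{1,3}(,\d{3})*(\.\d+)?$|^-?\$?\d+(\.\d+)?$")

def suspect_cells(rows, numeric_columns):
    """Return (row_index, column_index, value) for cells that do not look numeric."""
    flagged = []
    for r, row in enumerate(rows):
        for c in numeric_columns:
            value = row[c].strip()
            if not NUMERIC.match(value):
                flagged.append((r, c, value))
    return flagged

# Hypothetical OCR output: the second and third budget figures contain letters.
rows = [["Parks", "1,200,000"], ["Transit", "5,4OO,000"], ["Roads", "3l00000"]]
flagged = suspect_cells(rows, numeric_columns=[1])
print(flagged)
```

A check like this catches letters that have crept into figures, but it cannot catch a digit misread as another digit.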
But if even that is
too expensive, there are ways of extracting data from PDF files
that will appeal to the budget-minded. However, be warned. You
do get what you pay for. A lot of CAR people still recommend a
cheap extension to the Acrobat Reader called Aerial. You can find
a free demo for downloading at http://www.infodata.com/aerial.asp.
Version 2.0 works with Acrobat Reader versions 3 and 4. It does
not work with Acrobat Reader 5. Aerial users must copy and paste
tables by way of the Windows clipboard. Since Acrobat Reader only
allows you to highlight text on one page at a time, you have to
copy and paste each table separately. For this reason, Aerial
is of limited use if you have a lot of tables to extract. You
will also have to proofread the results because Aerial can make
mistakes in formatting or even in the text itself.
If you want to do this
stuff for free, there are two ways I know of. Version 5 of the
free Acrobat Reader comes with a column select tool. You can use
it to copy and paste one column of a table at a time into Excel
or Quattro Pro. If you only have one or two tables to copy, and
there aren’t too many columns, this works just fine. Naturally, you
will have to make sure you line up the columns correctly in your
spreadsheet.
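The lining-up step can be checked with a few lines of code. This is a rough sketch in Python, assuming each copied column has been saved as a list of cell values. The `combine_columns` helper and the sample values are made up for illustration. It stitches the columns into rows and reports whether every column has the same number of cells, since a missing cell is the usual reason rows stop matching.

```python
from itertools import zip_longest

def combine_columns(columns):
    """Combine separately copied columns into rows and report whether lengths match."""
    lengths = {len(col) for col in columns}
    aligned = len(lengths) <= 1
    # Pad short columns with empty strings so no data is silently dropped.
    rows = [list(row) for row in zip_longest(*columns, fillvalue="")]
    return rows, aligned

# Hypothetical columns, each pasted separately from the same table.
names = ["Parks", "Transit", "Roads"]
budgets = ["1200000", "5400000", "3100000"]

rows, aligned = combine_columns([names, budgets])
print(aligned, rows[1])
```

If `aligned` comes back `False`, one of the columns lost or gained a cell during copying and should be checked against the original before the data is used.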
The other no-cost
method is to mail the document as an attachment to pdf2txt@adobe.com. The
server will mail back the converted file almost immediately. Unfortunately,
tables will still come back jumbled and will need quite a bit
of cleaning up before you can make use of them in Excel or Access.
Even with text, results can vary. Still, if you want a quick and
dirty text conversion, it’s not a bad way to get it.
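Some of the cleaning up can be scripted. The sketch below, in Python, rests on one big assumption: that the columns in the converted text are separated by runs of two or more spaces, while the words inside a cell are separated by single spaces. Converted files will not always follow that pattern, so treat this as a starting point. The helper names and the sample text are invented for illustration.

```python
import re

def split_columns(line):
    """Split one line of text into cells on runs of two or more spaces."""
    return [cell for cell in re.split(r" {2,}", line.strip()) if cell]

def text_to_rows(text):
    """Turn converted text into a list of rows, skipping blank lines."""
    return [split_columns(line) for line in text.splitlines() if line.strip()]

# Hypothetical converted text with space-padded columns.
sample = (
    "Department        Budget\n"
    "Parks and Rec     1,200,000\n"
    "\n"
    "Transit           5,400,000\n"
)

rows = text_to_rows(sample)
for row in rows:
    print(row)
```

Rows that come back with too many or too few cells are the ones to fix by hand.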
There are other tools
on the market for working with PDF files. Adobe provides a list
of third-party software on its Web site. You may also want to
visit www.pdfzone.com. You might even learn to love PDF files all the time.
Fred Vallance-Jones
is a specialist in computer-assisted reporting at The Hamilton
Spectator.