GETTING AWAY WITH MURDER.
Fall 2002

Media Magazine

Publisher
Nick Russell


Editor
David McKie

Books Editor
Gillian Steward

Legal Advisor
Peter Jacobsen
(Paterson McDougall)

Magazine Designer
Ric Kadubiec


Editorial Board
Chris Cobb
Wendy McLellan
Sean Moore
Catherine Ford
J.T. Grossmith
Linda Goyette
John Gushue
Carolyn Ryan

Advertising Sales
John Dickins
(613) 526-8061
Fax: (613) 521-3904
E-mail: caj@igs.net

Administrative Director
John Dickins
(613) 526-8061
Fax: (613) 521-3904
E-mail: caj@igs.net

COMPUTER-ASSISTED REPORTING

Fred Vallance-Jones

Liberate your PDF files

Luckily, there are ways for journalists to gain access to and manipulate data that institutions would rather keep under lock and key.

I love them. I love them not.

That just about sums up my relationship with Adobe Acrobat files. PDF files, as they are also known, have popped up everywhere on the Internet. They are the format of choice for companies and governments making documents available to the public because the documents look just like the printed versions. And they can’t be altered. I love them because they look just like the printed versions. I hate them because they can’t be altered.

As soon as you want to put some information into a spreadsheet or database, you discover it’s locked up as securely as Anne Boleyn in the Tower of London.

Countless journalists using Adobe’s free Acrobat Reader have shared the frustration of trying to copy and paste a PDF table, only to find it lands in Excel as an unreadable jumble. Luckily, we don’t have to lose our heads over it. Third-party software vendors have developed products to get us over the wall.

The best one I have seen of late is called Jade. A free demo is available from BCL Computers at www.bcl-computers.com. Jade flawlessly transforms multiple pages of PDF tables into delimited text. Any spreadsheet or database program can then import the text file. The program will also export ordinary text to use in a word processor. In my limited tests, Jade even worked on PDF files with Acrobat’s security features turned on. Jade costs about $350. It is a plug-in for the full version of Adobe Acrobat, so you will need that program as well. That puts the full package at about $700, but it is well worth it if you will be working with a lot of PDF files.
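For readers who prefer to script the import step rather than use a spreadsheet's import wizard, here is a minimal sketch of reading delimited text in Python. It assumes a tab-delimited export. The sample data and the function name read_delimited are made up for illustration and do not come from Jade or any other product.

```python
import csv
import io


def read_delimited(text, delimiter="\t"):
    """Parse delimited text into a list of rows (each row is a list of strings)."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    # Drop completely empty lines, which can appear between pages of a table.
    return [row for row in reader if any(cell.strip() for cell in row)]


# Hypothetical sample standing in for a table exported as tab-delimited text.
sample = "Department\tBudget\nParks\t1200000\n\nTransit\t3400000\n"

rows = read_delimited(sample)
header, records = rows[0], rows[1:]

# Once the table is in rows, simple checks such as a column total are easy.
total = sum(int(record[1]) for record in records)
print(header, total)
```

If the export uses commas instead of tabs, passing delimiter="," should be the only change needed.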

Before Jade, the reigning champion in this field was Redwing. It comes from Datawatch (www.datawatch.com) and also extracts data flawlessly. There is no demo version to try out, but you can buy it online. Like Jade, Redwing is an Acrobat plug-in and will extract as many pages as you like, either as tables or text. It will also export directly to Excel format and can bulk-extract entire documents into Datawatch’s Monarch program.

If you want to find out what Monarch does, check out the Datawatch website. Unlike Jade, Redwing will not work with Acrobat version 5. It also won’t work when security is turned on in the document. For both of these reasons, I do not recommend it as heartily as I once did. At $350 (U.S.) it is also considerably more expensive than Jade.

If you are looking for a high-end tool that doesn’t require the full version of Acrobat, you may want to try OmniPage Pro 12 by Scansoft. It takes a different approach to liberating data from PDF files. OmniPage Pro is an optical character recognition (OCR) program and its primary use is to convert text from scanned documents into something you can edit with a word processor. But starting with version 11, the folks at Scansoft built in the ability to work with PDF files. I was so impressed with this that I demonstrated OmniPage Pro at the last CAJ national conference in Ottawa.

While Jade and Redwing work by extracting text from the actual document, OmniPage Pro treats the PDF file as an image, and tries to recognize each character optically. This has a couple of great advantages. It can never be defeated by Acrobat’s security features, and will work on PDF files that were made from scanned images rather than from documents created in office software. No other extraction method can do this. As with Redwing, OmniPage Pro will save tables directly to Excel format. The main downside is that you need to proofread the output file to make sure the program recognized all of the text correctly. That can be a bit tedious if there are a great many pages. The program is also expensive, at $900 for the full version. If you own a scanner with a stripped-down OCR program, you may be eligible for the upgrade version, which costs about $190.
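A script can take some of the tedium out of proofreading OCR output, though it is no substitute for reading the pages. The sketch below assumes the table has already been saved as rows of text, and it flags cells in columns that should be numeric but contain something else, such as a letter O standing in for a zero. The function name, the pattern and the sample rows are illustrative assumptions, not features of OmniPage Pro.

```python
import re

# A plain numeric cell: optional sign, optional dollar sign, digits with
# optional commas and one decimal point, and an optional percent sign.
NUMERIC = re.compile(r"^[-+]?\$?[\d,]*\.?\d+%?$")


def flag_suspect_cells(rows, numeric_columns):
    """Return (row_index, column_index, value) for cells that do not look numeric."""
    suspects = []
    for r, row in enumerate(rows):
        for c in numeric_columns:
            if c < len(row):
                cell = row[c].strip()
                # Empty cells are skipped; anything else must match the pattern.
                if cell and not NUMERIC.match(cell):
                    suspects.append((r, c, cell))
    return suspects


# Hypothetical OCR output: a letter "O" and a lowercase "l" have crept in.
ocr_rows = [
    ["Parks", "1,200"],
    ["Transit", "34O0"],
    ["Roads", "l50"],
]

for row_index, col_index, value in flag_suspect_cells(ocr_rows, numeric_columns=[1]):
    print(f"Check row {row_index}, column {col_index}: {value!r}")
```

A check like this only catches cells that stop looking like numbers. A 3 misread as an 8 will sail through, so the flagged list is a starting point for proofreading, not the end of it.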

But if even that is too expensive, there are ways of extracting data from PDF files that will appeal to the budget-minded. However, be warned. You do get what you pay for. A lot of CAR people still recommend a cheap extension to the Acrobat Reader called Aerial. You can find a free demo for downloading at http://www.infodata.com/aerial.asp. Version 2.0 works with Acrobat Reader versions 3 and 4. It does not work with Acrobat Reader 5. Aerial users must copy and paste tables by way of the Windows clipboard. Since Acrobat Reader only allows you to highlight text on one page at a time, you have to copy and paste each table separately. For this reason, Aerial is of limited use if you have a lot of tables to extract. You will also have to proofread the results because Aerial can make mistakes in formatting or even in the text itself.

If you want to do this stuff for free, there are two ways I know of. Version 5 of the free Acrobat Reader comes with a column select tool. You can use it to copy and paste one column of a table at a time into Excel or Quattro Pro. If you only have one or two tables to copy, and there aren’t too many columns, this works just fine. Naturally, you will have to make sure you line up the columns correctly in your spreadsheet.
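If you paste each column into a text file instead of a spreadsheet, a few lines of Python can stitch the columns back together and warn you when they do not line up. This is a sketch under that assumption. The function name combine_columns and the sample values are hypothetical.

```python
from itertools import zip_longest


def combine_columns(columns):
    """Join separately copied columns into rows.

    Returns (rows, aligned). aligned is False when the columns have
    different lengths, which usually means a cell was missed while copying.
    """
    lengths = [len(column) for column in columns]
    aligned = len(set(lengths)) <= 1
    # zip_longest pads short columns with "" so no data is silently dropped.
    rows = [list(row) for row in zip_longest(*columns, fillvalue="")]
    return rows, aligned


# Hypothetical clipboard contents, one pasted column per string.
names = "Parks\nTransit\nRoads".splitlines()
amounts = "1200\n3400".splitlines()  # one value is missing

rows, aligned = combine_columns([names, amounts])
if not aligned:
    print("Warning: columns have different lengths; check the source table.")
for row in rows:
    print("\t".join(row))
```

Equal lengths do not prove the columns are lined up correctly, so it is still worth spot-checking a few rows against the original PDF.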

The other no-cost method is to mail the document as an attachment to pdf2txt@adobe.com. The server will mail back the converted file almost immediately. Unfortunately, tables will still come back jumbled and will need quite a bit of cleaning up before you can make use of them in Excel or Access. Even with text, results can vary. Still, if you want a quick and dirty text conversion, it’s not a bad way to get it.
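Some of that cleaning up can be scripted. The sketch below assumes, and it is only an assumption, that the converted text keeps at least two spaces between columns. It splits each line on those gaps and sets aside any line that does not yield the expected number of fields, so you can fix those by hand. The function name and the sample text are made up for illustration.

```python
import re


def split_text_table(text, expected_fields):
    """Split lines on runs of two or more spaces.

    Returns (good, bad): rows with the expected number of fields,
    and rows that need manual attention.
    """
    good, bad = [], []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) == expected_fields:
            good.append(fields)
        else:
            bad.append(fields)
    return good, bad


# Hypothetical converted text; the second line lost a column gap.
converted = (
    "City of Hamilton   Parks     1200\n"
    "City of Ottawa     Transit 3400\n"
)

good, bad = split_text_table(converted, expected_fields=3)
print("Clean rows:", good)
print("Rows to fix by hand:", bad)
```

If the converted file does not preserve column gaps at all, this approach will not work and the rows will have to be rebuilt another way.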

There are other tools on the market for working with PDF files. Adobe provides a list of third-party software on its Web site. You may also want to visit www.pdfzone.com. You might even learn to love PDF files all the time.


Fred Vallance-Jones is a specialist in computer-assisted reporting at The Hamilton Spectator.