ABBYY FineScanner for Archival Research

Since starting grad school, I’ve tried out, and cast aside, quite a few tools for reproducing and organizing archival documents. Digital cameras, portable scanners, FileMaker, Evernote, RefWorks: these are just some of the detritus lining the long and winding research road I’ve traveled these past four (!!) years. Now, at the halfway point of my time in Brazil, I can finally say that I have found my logistical footing. Three pieces of software have emerged as the pillars of my archival process: ABBYY FineScanner for document capture, DEVONthink Pro Office for organization and note-taking, and Zotero for secondary bibliography.

While Zotero is both widely used and straightforward, the other two programs may be less familiar. With the Northern-summer research season fast upon us, I’d like to share my experience with these pieces of software, in the hopes of saving researchers at earlier stages of their projects from the hassles of constant platform-shifting that have plagued mine. In this post, I’ll talk a bit about document capture; in a later installment, I’ll describe my approach to organization and note-taking.

For my first three years of grad school, archival document capture meant taking digital photos of individual pages. I’d also photograph the boxes and folders that contained these documents, in the order that I reviewed them, all the while taking notes in an archive-specific Word document. The result would be two interlinked narratives of my research, one textual and the other photographic, which I could then draw upon to assemble the individual photos into whole-document pdfs.

This approach carried a single distinct advantage: it enabled me to copy large quantities of documents quickly. But the downsides were massive. For one, no matter how hard I tried, I’d consistently wind up with about one unreadably blurry picture in 50. Even in the best photo-quality case, the process of turning images of individual pages into pdfs was painstaking and extraordinarily time-consuming. Indeed, I still have a large backlog of archival photos awaiting such processing, months or even years after they were taken. Finally, the resulting pdfs were quite large, even when I reduced the component photos — and sending them through a time-consuming OCR converter made them bigger still. Even as I’d find ways to tweak this process, my fundamental dissatisfaction remained.

All the while, I knew from historian friends that there was another way: I could turn my phone into a handheld scanner using one of the many apps on the market. Even with this knowledge, however, for years I found reasons to resist making the switch. I valued the flexibility and speed of digital photos, I thought. Wouldn’t scanning on the spot slow me down? Plus, I’d invested a bunch of money in a nice digital camera; was I really going to abandon it in favor of my phone’s lower-quality one?

In January, though, my first smartphone went kaput. When its replacement impressed me with its higher photo quality, I realized that it was finally time to give scanning apps a chance. I dug up this PCMag breakdown of the major options, and ABBYY FineScanner immediately caught my eye. The full-featured version is pricey (on the order of $20 for a year or $60 for a lifetime of OCR-equipped scanning), but I’d used ABBYY products extensively in Columbia’s Digital Humanities Center and had always been satisfied. So I decided to take the plunge.

By the end of my first day using the app in the archive, I knew that my research life would never be the same. There could be no doubt, of course, that FineScanner makes for a much slower capture process than simple photo-taking. The basic trick is this: you take a photo of a page, and no matter the angle of the photo, the app will identify and crop the document into a flat, undistorted image. This is what an original photo looks like at the cropping stage:

While this “autocrop” functionality works pretty well, I still need to double-check every page, and in the end I have to manually crop quite a lot of them (depending on the document, the proportion ranges from 10 to 100% of the pages I scan). Then, in order to turn the photos into searchable pdfs, I upload them to the ABBYY server — something doable either between document scans at the archive or once I get home, depending on the size of the documents and the pace of the day.

These slight hassles, though, are vastly outweighed by the utility of the final product. FineScanner is able to recognize nearly all of the documents I send it, meaning that the individual photos I take come back to me as completely searchable, centered and undistorted pdfs, which I can then upload directly to the cloud or send via email or text. Here’s the page from above, post-processing:

The app can recognize nearly 200 languages, and while I can’t vouch for most of them, my experiences with English, Portuguese, Spanish, and French have been excellent. And miraculously, the recognized pdfs that emerge are tiny — generally about 10% of the size of the smallest unrecognized pdfs I used to make. (Fifty-page pdfs, for instance, usually weigh in around 2.5 MB.)

The advantage of the OCR phone-scanning approach is clearest in comparison. Whereas before, I would end a day at the archive with several hundred individual photographs and the dreadful knowledge that hours of processing awaited at some sure-to-be-later time, now my days end with tiny, fully searchable pdfs ready to be organized and consulted on demand.

If you’ve made it to the end of this post, there’s a good chance that you belong to the tiny minority of people whose lives can be changed by a well-crafted OCR-optimized portable-scanning smartphone app. And if indeed you do, I’d recommend giving ABBYY FineScanner a spin.