Archival Research | Paul R. Katz

In a post about the wonders of ABBYY FineScanner back in May, I promised to write about another pillar of my archival process, the database management program, DevonThink Pro Office. Like ABBYY FineScanner, it’s quite pricey ($149.95 after a 150-hour test-drive), but coming up on 15 months together, I can’t imagine my life without it.

I should say at the outset that I can claim no particular expertise with regard to this program. I have no doubt that someone with more technical skill could wring much more from it than I can. I should also note that the program is only available for Mac — I know, I know — so if you haven’t been sucked into the Apple vortex, this post won’t be of much use to you. But my fellow Mac-owning archival researchers looking to build a digital database may find something of value in the ensuing description of the DevonThink process I’ve come to rely on over the past year.

When I fire up the program and open my Dissertation database, I’m met with the menu you see below to the left. At the top are a few items: Inbox, the default repository for new files I drag into the program; Tags, which I don’t really use; Mobile Sync, a reception point for items that come in through the DevonThink ToGo mobile app; and Evernote, which receives clippings I make with the Evernote app. (You’ll find a bit more on these last two at the bottom of the post.) All of these came with the program or with apps I connected to it, as did the four items at the bottom of the list (i.e., All Images, All PDF Documents, Duplicates, and Orphaned Files). The stuff in between, though, is user-generated.

The header labeled Archives is where I put the documents I scan and the notes I take on them, organized by country and then by archive. Books/Articles is where I take notes on secondary sources; it’s also organized geographically. For Others holds documents unrelated to my own project that may be of interest to friends and colleagues. Internet (Clippings/Links) is where I sort stray news articles and websites of interest. Logistics is home to information about the infrastructure of academic life: fellowships, grants, conference funding, seminars, and the like. Notebook is where I take notes and organize documents in ways that cut across multiple archives. Random/Interesting is self-explanatory, and Teaching Aids is where I put things that may be helpful for teaching all of this when I’m back home.

When I’m at the archive itself, the Archives header is, unsurprisingly, where most of the action is. Let’s imagine I’m spending the day at Argentina’s National Library. The “Biblioteca Nacional” folder has three subfolders, which correspond to the three divisions I’ve used so far: Archivo, Historia Oral, and Libros. As I work through an archival collection, I’ll create a subheader for the collection, then one for each of its boxes that I consult, and finally for each archival folder of interest.

Let’s say I’m working with the Silvio Frondizi Subcollection, on which my recent post, Revolutionary Human Rights, was based. More specifically, I’m looking through a folder from Box 7, labeled “Movimiento Nacional contra la Represión y la Tortura 1/2” (see below). When I come across a document I want to take note of, I’ll create a Rich Text File (RTF) in the corresponding folder, titled first with the date as closely as I know or can approximate it, and then either its title or a phrase that more effectively conveys its use. (I mark my own date approximations with question marks.) In the body of the RTF, I’ll include any document or page numbers that I may need for later citation followed by whatever thoughts have come into my mind. In cases where I have general observations about a folder, or a box, or an entire archival collection, I’ll create a separate RTF file in the corresponding place in the database titled “0 Overall” and take notes there. (The initial 0 is a way to make sure the file jumps to the top of the alpha-numeric heap.)

The contents of the folder, “Movimiento Nacional contra la Represión y la Tortura 1/2.” At right, above the line, an alphabetized list of the files it contains. Below it, a space to scroll through them. At left, nesting drop-down menus organized by Country, then Archive, Collection, Box, and Folder.
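The naming convention described above can be sketched in a few lines of Python. This is only an illustration, not anything DevonThink itself runs: the helper `note_title` is a hypothetical name, every title other than “1972? Ellos son torturados” and “0 Overall” is invented for the example, and plain string sorting is assumed to be a fair stand-in for how the program alphabetizes a folder.

```python
def note_title(date: str, label: str, approximate: bool = False) -> str:
    """Build a note title: the date first, then a descriptive label."""
    # A trailing "?" marks a date that is the researcher's own approximation.
    stamp = f"{date}?" if approximate else date
    return f"{stamp} {label}"


# Hypothetical folder contents; only the first title and "0 Overall"
# come from the post itself.
titles = [
    note_title("1972", "Ellos son torturados", approximate=True),
    note_title("1973", "Press release"),
    "0 Overall",
    note_title("1971", "Founding statement"),
]

# Plain string sorting (an assumption about the ordering): the leading "0"
# puts the overview note first, and the dated notes follow chronologically.
for title in sorted(titles):
    print(title)
```

Putting the date at the front means the alphabetized list doubles as a rough chronology, which is the point of the convention.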

If a document is worth copying and I am permitted to take photos, I’ll scan it with my cell phone and convert it into an OCR-recognized PDF, which I will then label with the same name as the related RTF I’ve just created in DevonThink. Then, when I get home, I can upload the PDFs from my phone and easily sort them into their corresponding DevonThink folders. As a final step, I’ll right-click on the PDF, choose “Copy Item Link,” and paste a permalink to the PDF into the RTF (see below). That turns the RTF into an all-purpose base of operations, which I can then use as a building block for subsequent indexing.

*Copying the item link to the PDF for “1972? Ellos son torturados….” I will then paste this permalink into the identically named RTF.*
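Because each PDF carries the same name as its RTF, it is easy to check that nothing has gone astray after a day’s upload. The sketch below is a hypothetical helper of my own devising, not a DevonThink feature: it assumes the files have been exported or listed as plain file names, and the name `pair_notes_with_scans` and the sample list are made up for illustration.

```python
from pathlib import Path


def pair_notes_with_scans(filenames):
    """Match notes (.rtf) and scans (.pdf) that share the same base name."""
    notes = {Path(name).stem for name in filenames
             if Path(name).suffix.lower() == ".rtf"}
    scans = {Path(name).stem for name in filenames
             if Path(name).suffix.lower() == ".pdf"}
    return {
        "paired": sorted(notes & scans),
        "notes_without_scan": sorted(notes - scans),
        "scans_without_note": sorted(scans - notes),
    }


# Hypothetical folder listing; only the first title and "0 Overall"
# appear in the post.
listing = [
    "1972? Ellos son torturados.rtf",
    "1972? Ellos son torturados.pdf",
    "0 Overall.rtf",
    "1973 Untitled flyer.pdf",
]

report = pair_notes_with_scans(listing)
for category, names in report.items():
    print(category, names)
```

A note without a scan is often expected (overview notes, or documents I wasn’t allowed to photograph); a scan without a note is the case worth chasing down.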

What kind of indexing? Sometimes an archival collection is already organized in ways that make sense for my research. The Silvio Frondizi Subcollection, for instance, groups documents chronologically and by organization or project, which is exactly how I want them. On the level of the collection itself, then, there’s no need for further reshuffling.

But other collections aren’t arranged in ways that are helpful to my work. This is particularly true of police and military archives, which typically operate through master indexes of names but are physically organized into vast collections based on other considerations, such as reporting unit or jurisdiction. I want to preserve this original system of organization, both because I will need to specify where I found the documents that I ultimately cite, and because each security organ’s proprietary system is a window onto the repressive logics I am trying to understand. But relying exclusively on these original systems would greatly hobble my ability to draw connections across the archive and to conceptualize it in ways that correspond to my arguments.

In these cases, I create archive-specific indexes that meet my own thematic needs. Take the political police files held at the Arquivo Público do Estado de São Paulo (APESP), where I worked for hundreds of hours from March till May, and which I drew on for this earlier post about torture and São Paulo’s armed Left. After finishing at APESP, I created an RTF titled “0 APESP Index.” The index features a couple dozen topics grouped under five major headings: Police/Military, Armed Groups, Anti-Torture/Human Rights Groups/Campaigns, Links to Other Countries, and Torture Topics. Within each of these categories, I added as many subheadings as necessary: phrases like “Resisting Torture” and “Testimonies” in the case of the “Torture Topics” grouping, for instance. I then went through the full list of RTF files that I created at APESP one by one, right-clicking, copying each of their item links, and pasting these links into the “0 APESP Index” RTF in whatever slots seemed right (see below). Helpfully, even if I move the linked RTFs around, or modify their content or titles, the links will still work.

*The APESP index, at right. To the left, the organizational system used by São Paulo’s political police.*
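For readers who think in data structures, the index is essentially a two-level grouping of links: heading, then subheading, then the notes filed there. Here is a minimal sketch under stated assumptions. The function names, the note titles, and the link identifiers are all hypothetical, and the `x-devonthink-item://` link format is my understanding of what “Copy Item Link” produces rather than something verified here. I build the real index by hand inside an RTF, not with a script.

```python
from collections import defaultdict


def build_index(entries):
    """Group (heading, subheading, title, link) tuples into a nested index."""
    index = defaultdict(lambda: defaultdict(list))
    for heading, subheading, title, link in entries:
        index[heading][subheading].append((title, link))
    return index


def render_index(index):
    """Render the nested index as indented plain text."""
    lines = []
    for heading in sorted(index):
        lines.append(heading)
        for subheading in sorted(index[heading]):
            lines.append(f"  {subheading}")
            for title, link in sorted(index[heading][subheading]):
                lines.append(f"    {title} <{link}>")
    return "\n".join(lines)


# Headings and subheadings come from the post; the note titles and
# link identifiers are invented placeholders.
entries = [
    ("Torture Topics", "Testimonies", "1970 Statement",
     "x-devonthink-item://EXAMPLE-0001"),
    ("Torture Topics", "Resisting Torture", "1970 Statement",
     "x-devonthink-item://EXAMPLE-0001"),
    ("Armed Groups", "General", "1969? Surveillance report",
     "x-devonthink-item://EXAMPLE-0002"),
]

print(render_index(build_index(entries)))
```

Note that the same link can sit in more than one slot, as the first two entries do. Since each entry points at a note rather than copying it, filing a document under several topics costs nothing.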

(Because I take reasonably thorough notes while in the archive, holding a future index of just this sort in mind, the whole indexing process is quite a bit less arduous than it might sound. In this instance, it took about an hour and a half to catalogue the 150-or-so PDF files that I’d created at APESP. To my mind, it’s a worthwhile investment given the organizational and analytic power it unlocks. To be fair, though, this sort of stuff is fun to me to an extent that sometimes even I find disturbing.)

Archive-specific indexes aren’t the only sort I use DevonThink to build. The second kind are the thematic indexes which fill the Notebook portion of my database. Here, I keep running compilations of links to documents that I come across related to specific organizations, individuals, places, or themes. For instance, the armed Peronist group Montoneros is of particular interest. When I come across a document that pertains to this group, I copy-and-paste its item link into the “Montoneros” RTF in my Notebook (see below). It is my hope that, as I move into the writing stage, these indexes will serve as proto-outlines and also help me with the macro organization of the dissertation and subsidiary articles.

*A thematic index, for Argentina’s Montoneros.*

This description of my process hasn’t touched on many of the features that set DevonThink apart, so allow me to mention them briefly. With DevonThink, you can:

— Import photos and merge them instantly into multi-page PDFs, which can then be OCR-converted
— Take notes on documents and PDFs
— “Replicate” files so that identical copies sit in various places at once, yet an edit to any is an edit to all
— Sync to your phone or tablet using DevonThink ToGo (a product with which I’m less satisfied than with DevonThink Pro Office)
— Import directly from Evernote (which has far better web-clipping capabilities than DevonThink ToGo)
— Develop customized workflows using Automator
— Create “smart groups” based on tags, keywords, or full text
— Enjoy powerful search functionality including concordance

This last feature alone is, to me, worth DevonThink’s purchase price. While no OCR is perfectly searchable, on net it works pretty well, especially when supplemented by the keyword-driven notes I take in the linked RTFs. The result is that when I have only the inkling of a document in mind, I can almost always find it quickly. Full-database searches, moreover, at times yield parallels and connections that I wouldn’t have anticipated. I’d never create a thematic index without doing one first.
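To make the idea concrete, here is a toy version of the two things the search gives me: a concordance (how often each word appears across the database) and a full-text lookup. It is a rough sketch only and certainly not how DevonThink implements either feature. The function names and the two sample notes are hypothetical, and real OCR text would be far messier.

```python
import re
from collections import Counter


def concordance(notes):
    """Count how often each word appears across all note texts."""
    counts = Counter()
    for text in notes.values():
        counts.update(re.findall(r"\w+", text.lower()))
    return counts


def search(notes, term):
    """Return the titles of notes whose text contains the term."""
    term = term.lower()
    return sorted(title for title, text in notes.items()
                  if term in text.lower())


# Hypothetical note contents, keyed by note title.
notes = {
    "1972? Ellos son torturados":
        "Pamphlet on tortura and represión. Keywords: torture, testimony.",
    "0 Overall":
        "Folder groups documents by organization. Keywords: torture, campaign.",
}

print(concordance(notes).most_common(3))
print(search(notes, "testimony"))
```

The keyword line in each note is what rescues a search when the OCR has mangled the original: the term I type only has to match my own notes, not the scanned text.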

In closing, I want to stress that the process I’ve described here is not something I could have created out of whole cloth at the outset, even having consulted the numerous academy-specific posts I found online. (Though this one in particular, by historian Rachel Leow, did serve as an extremely helpful jumping-off point.) Rather, it’s a method that could only have grown, trial-and-error style, out of my intensive use of the program during a sustained period of primary research, and I’m sure it will continue to change as my work advances. If you end up going the DevonThink route, I’m sure your system will look different from mine; indeed, that’s the idea!

I hope these words and screenshots prove useful to someone. If you’re that person, or if you have any questions or have found anything here to be unclear, please do let me know!

Since starting grad school, I’ve tried out, and cast aside, quite a few tools for reproducing and organizing archival documents. Digital cameras, portable scanners, FileMaker, Evernote, RefWorks: these are just some of the detritus lining the long and winding research road I’ve traveled these past four (!!) years. Now, at the halfway point of my time in Brazil, I can finally say that I have found my logistical footing. Three pieces of software have emerged as the pillars of my archival process: ABBYY FineScanner for document capture, DevonThink Pro Office for organization and note-taking, and Zotero for secondary bibliography.

While Zotero is both widely used and straightforward, the other two programs may be less familiar. With the Northern-summer research season fast upon us, I’d like to share my experience with these pieces of software, in the hopes of saving researchers at earlier stages of their projects from the hassles of constant platform-shifting that have plagued mine. In this post, I’ll talk a bit about document capture; in a later installment, I’ll describe my approach to organization and note-taking.

For my first three years of grad school, archival document capture meant taking digital photos of individual pages. I’d also photograph the boxes and folders that contained these documents, in the order that I reviewed them, all the while taking notes in an archive-specific Word document. The result would be two interlinked narratives of my research, one textual and the other photographic, which I could then draw upon to assemble the individual photos into whole-document pdfs.

This approach carried a single distinct advantage: it enabled me to copy large quantities of documents quickly. But the downsides were massive. For one, no matter how hard I tried, I’d consistently wind up with about one unreadably blurry picture in 50. Even in the best photo-quality case, the process of turning images of individual pages into pdfs was painstaking and extraordinarily time-consuming. Indeed, I still have a large backlog of archival photos awaiting such processing, months or even years after they were taken. Finally, the resulting pdfs were quite large, even when I reduced the component photos — and sending them through a time-consuming OCR converter made them bigger still. Even as I’d find ways to tweak this process, my fundamental dissatisfaction remained.

All the while, I knew from historian friends that there was another way: I could turn my phone into a handheld scanner using one of the many apps on the market. Even with this knowledge, however, for years I found reasons to resist making the switch. I valued the flexibility and speed of digital photos, I thought. Wouldn’t scanning on the spot slow me down? Plus, I’d invested a bunch of money in a nice digital camera; was I really going to abandon it in favor of my phone’s lower-quality one?

In January, though, my first smartphone went kaput. When its replacement impressed me with its higher photo quality, I realized that it was finally time to give scanning apps a chance. I dug up this PC Mag breakdown of the major options, and ABBYY FineScanner immediately caught my eye. The full-featured version is pricey (on the order of $20 for a year or $60 for a lifetime of OCR-equipped scanning), but I’d used ABBYY products extensively in Columbia’s Digital Humanities Center and had always been satisfied. So I decided to take the plunge.

By the end of my first day using the app in the archive, I knew that my research life would never be the same. There could be no doubt, of course, that FineScanner makes for a much slower capture process than simple photo-taking. The basic trick is this: you take a photo of a page, and no matter the angle of the photo, the app will identify and crop the document into a flat, undistorted image. This is what an original photo looks like at the cropping stage:

While this “autocrop” functionality works pretty well, I still need to double-check every page, and in the end I have to manually crop quite a lot of them (depending on the document, the proportion ranges from 10 to 100% of the pages I scan). Then, in order to turn the photos into searchable pdfs, I upload them to the ABBYY server — something doable either between document scans at the archive or once I get home, depending on the size of the documents and the pace of the day.

These slight hassles, though, are vastly outweighed by the utility of the final product. FineScanner is able to recognize nearly all of the documents I send it, meaning that the individual photos I take come back to me as completely searchable, centered and undistorted pdfs, which I can then upload directly to the cloud or send via email or text. Here’s the page from above, post-processing:

The app can recognize nearly 200 languages, and while I can’t vouch for most of them, my experiences with English, Portuguese, Spanish, and French have been excellent. And miraculously, the recognized pdfs that emerge are tiny — generally about 10% of the size of the smallest unrecognized pdfs I used to make. (Fifty-page pdfs, for instance, usually weigh in around 2.5 MB.)
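Those two figures imply a couple of others, which a quick back-of-the-envelope calculation makes explicit. The inputs below come straight from the paragraph above; the decimal convention of 1 MB = 1,000 KB is an assumption, and since both inputs are rough, the results are ballpark numbers only.

```python
PAGES = 50            # "Fifty-page pdfs"
RECOGNIZED_MB = 2.5   # "usually weigh in around 2.5 MB"
SIZE_RATIO = 0.10     # "generally about 10% of the size"

# Average size of one recognized page, assuming 1 MB = 1,000 KB.
per_page_kb = RECOGNIZED_MB * 1000 / PAGES

# Size a comparable fifty-page pdf would have had under the old photo workflow.
implied_old_mb = RECOGNIZED_MB / SIZE_RATIO

print(f"about {per_page_kb:.0f} KB per recognized page")
print(f"about {implied_old_mb:.0f} MB for the old-style equivalent")
```

In other words, each recognized page comes to roughly 50 KB, against something like 25 MB for a fifty-page file assembled from photos.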

The advantage of the OCR phone-scanning approach is clearest in comparison. Whereas before, I would end a day at the archive with several hundred individual photographs and the dreadful knowledge that hours of processing awaited at some sure-to-be-later time, now my days end with tiny, fully-searchable pdfs ready to be organized and consulted on demand.

If you’ve made it to the end of this post, there’s a good chance that you belong to the tiny minority of people whose lives can be changed by a well-crafted OCR-optimized portable-scanning smartphone app. And if indeed you do, I’d recommend giving ABBYY FineScanner a spin.
