Using Amazon Mechanical Turk to Transcribe Historical Handwritten Documents

The developing “information age” is continually unraveling new ways of discovering, presenting and sharing information. Most new academic material is digitally formatted upon its creation and is thus easy to find and query. However, there remains a good deal of material from times prior to the “i...

Full description

Bibliographic Details
Main Authors: Andrew S.I.D. Lang, Joshua Rio-Ross
Format: Article
Language:English
Published: Code4Lib 2011-10-01
Series:Code4Lib Journal
Online Access:http://journal.code4lib.org/articles/6004
Description
Summary:The developing “information age” is continually unraveling new ways of discovering, presenting and sharing information. Most new academic material is digitally formatted upon its creation and is thus easy to find and query. However, there remains a good deal of material from times prior to the “information age” that has yet to be converted to digital form. Much of this material can be found in library collections—whether academic, public or private—and thus remains available only to a limited number of locals or willing-and-able sojourners. Using OCR technology, most typeset documents can be digitized and made available online; and there are several projects underway to do exactly this. However, there remains little to be done for handwritten materials. Those who own collections of handwritten documents are increasingly wanting to make the content thereof available to the general public. Unfortunately, traditional transcription models typically prove to be expensive or inefficient and pdf snapshots are not searchable. We have developed a model for digital transcription using Google Docs and Amazon's Mechanical Turk. Using this model, one can use an online workforce to efficiently transcribe handwritten texts and perform quality control at a cost much lower than professional transcription services. To illustrate the model we used Amazon’s Mechanical Turk to transcribe and then proofread the Frederick Douglass Diary which we have made available on a public searchable wiki. The total cost of transcription and proofreading for the 72 page diary was less than $25.00 with some pages being transcribed and proofread for as little as $0.04. Our results show that using Amazon’s Mechanical Turk holds great promise for providing an affordable transcription method for hand-written historical documents making them easily sharable and fully searchable.
ISSN:1940-5758