Using Amazon Mechanical Turk to Transcribe Historical Handwritten Documents

The developing “information age” is continually unraveling new ways of discovering, presenting and sharing information. Most new academic material is digitally formatted upon its creation and is thus easy to find and query. However, there remains a good deal of material from times prior to the “i...

Bibliographic Details
Main Authors: Andrew S.I.D. Lang, Joshua Rio-Ross
Format: Article
Language: English
Published: Code4Lib 2011-10-01
Series: Code4Lib Journal
Online Access: http://journal.code4lib.org/articles/6004
id doaj-857cf980e8e34b73bb272c9f22fb1b5f
record_format Article
spelling doaj-857cf980e8e34b73bb272c9f22fb1b5f 2020-11-25T03:45:00Z eng Code4Lib Code4Lib Journal 1940-5758 2011-10-01 15 Using Amazon Mechanical Turk to Transcribe Historical Handwritten Documents Andrew S.I.D. Lang Joshua Rio-Ross http://journal.code4lib.org/articles/6004
collection DOAJ
language English
format Article
sources DOAJ
author Andrew S.I.D. Lang
Joshua Rio-Ross
spellingShingle Andrew S.I.D. Lang
Joshua Rio-Ross
Using Amazon Mechanical Turk to Transcribe Historical Handwritten Documents
Code4Lib Journal
author_facet Andrew S.I.D. Lang
Joshua Rio-Ross
author_sort Andrew S.I.D. Lang
title Using Amazon Mechanical Turk to Transcribe Historical Handwritten Documents
title_short Using Amazon Mechanical Turk to Transcribe Historical Handwritten Documents
title_full Using Amazon Mechanical Turk to Transcribe Historical Handwritten Documents
title_fullStr Using Amazon Mechanical Turk to Transcribe Historical Handwritten Documents
title_full_unstemmed Using Amazon Mechanical Turk to Transcribe Historical Handwritten Documents
title_sort using amazon mechanical turk to transcribe historical handwritten documents
publisher Code4Lib
series Code4Lib Journal
issn 1940-5758
publishDate 2011-10-01
description The developing “information age” is continually uncovering new ways of discovering, presenting and sharing information. Most new academic material is digitally formatted upon its creation and is thus easy to find and query. However, there remains a good deal of material from times prior to the “information age” that has yet to be converted to digital form. Much of this material can be found in library collections, whether academic, public or private, and thus remains available only to a limited number of locals or willing-and-able sojourners. Using OCR technology, most typeset documents can be digitized and made available online, and there are several projects underway to do exactly this. However, there is little that can be done in this way for handwritten materials. Those who own collections of handwritten documents increasingly want to make their content available to the general public. Unfortunately, traditional transcription models typically prove to be expensive or inefficient, and PDF snapshots are not searchable. We have developed a model for digital transcription using Google Docs and Amazon’s Mechanical Turk. Using this model, one can use an online workforce to efficiently transcribe handwritten texts and perform quality control at a cost much lower than professional transcription services. To illustrate the model, we used Amazon’s Mechanical Turk to transcribe and then proofread the Frederick Douglass Diary, which we have made available on a public, searchable wiki. The total cost of transcription and proofreading for the 72-page diary was less than $25.00, with some pages being transcribed and proofread for as little as $0.04. Our results show that using Amazon’s Mechanical Turk holds great promise for providing an affordable transcription method for handwritten historical documents, making them easily sharable and fully searchable.
url http://journal.code4lib.org/articles/6004
work_keys_str_mv AT andrewsidlang usingamazonmechanicalturktotranscribehistoricalhandwrittendocuments
AT joshuarioross usingamazonmechanicalturktotranscribehistoricalhandwrittendocuments
_version_ 1724512029496049664