Machine learning, template matching, and the International Tracing Service digital archive: Automating the retrieval of death certificate reference cards from 40 million document scans

Scattered throughout the International Tracing Service (ITS) digital archive, one of the largest and most heterogeneous collections of Holocaust-related material, are hundreds of thousands of reference cards to official death certificates recording a fraction of individuals who perished within conce...

Bibliographic Details
Main Author: Lee, B.C.G. (Author)
Format: Article
Language: English
Published: Oxford University Press 2019
Online Access: View Fulltext in Publisher
LEADER 04106nam a2200409Ia 4500
001 10.1093-llc-fqy063
008 220511s2019 CNT 000 0 und d
020 |a 20557671 (ISSN) 
099 |a CNI card of Boleslaw Pilarczyk, 0.1/62236476/ITS Digital Archive, USHMM
099 |a CNI card of Bronislaw Izienicki, 0.1/63631590/ITS Digital Archive, USHMM
099 |a CNI card of Bronislaw Nohomowitsch, 0.1/56390786/ITS Digital Archive, USHMM
099 |a CNI card of Irene Komeczna, 0.1/28004926/ITS Digital Archive, USHMM
099 |a CNI card of Iwan Kowalenko, 0.1/28494243/ITS Digital Archive, USHMM
099 |a CNI card of Josef Konecny, 0.1/28005119/ITS Digital Archive, USHMM
099 |a CNI card of Josef Konecny, 0.1/28005270/ITS Digital Archive, USHMM
099 |a CNI card of Josef Koneczny, 0.1/28005215/ITS Digital Archive, USHMM
099 |a CNI card of Josef Konieczny, 0.1/28005232/ITS Digital Archive, USHMM
099 |a CNI card of Josef Konieczny, 0.1/28005297/ITS Digital Archive, USHMM
099 |a CNI card of Josef Konierzny, 0.1/28005160/ITS Digital Archive, USHMM
099 |a CNI card of Jozef Konieczny, 0.1/28005136/ITS Digital Archive, USHMM
099 |a CNI card of Jozef Konieczuy, 0.1/28005273/ITS Digital Archive, USHMM
099 |a CNI card of Kasimir Konieczny, 0.1/28004756/ITS Digital Archive, USHMM
099 |a CNI card of Margarete Konieczny, 0.1/28004353/ITS Digital Archive, USHMM
099 |a CNI card of Mikola Kowal, 0.1/126185862/ITS Digital Archive, USHMM
099 |a CNI card of Rita Schorr, 0.1/38326699/ITS Digital Archive, USHMM
099 |a CNI card of Rudolf Konietzny, 0.1/28003608/ITS Digital Archive, USHMM
099 |a CNI card of Simon Kopolowitsch, 0.1/62174890/ITS Digital Archive, USHMM
099 |a CNI card of Srul Jaszczynski, 0.1/63631590/ITS Digital Archive, USHMM
099 |a CNI card of Stanislaw Konecny, 0.1/28003281/ITS Digital Archive, USHMM
099 |a CNI card of Wassili Zborowka, 0.1/91360004/ITS Digital Archive, USHMM
099 |a CNI card of Zdenek Konecny, 0.1/28002393/ITS Digital Archive, USHMM
245 1 0 |a Machine learning, template matching, and the International Tracing Service digital archive: Automating the retrieval of death certificate reference cards from 40 million document scans 
260 0 |b Oxford University Press  |c 2019 
856 |z View Fulltext in Publisher  |u https://doi.org/10.1093/llc/fqy063 
520 3 |a Scattered throughout the International Tracing Service (ITS) digital archive, one of the largest and most heterogeneous collections of Holocaust-related material, are hundreds of thousands of reference cards to official death certificates recording a fraction of individuals who perished within concentration camps. These cards represent the most comprehensive collection of digital material pertaining to these death certificates issued by Sonderstandesamt Arolsen, a German civil registry office. However, the reference cards can only be found dispersed throughout the Central Name Index (CNI), ITS's 46+ million-card finding aid that is indexed only by name. Consequently, aggregating the death certificate reference cards for research requires an intractable manual search. I adopt template matching and machine learning to automate the retrieval of these cards from the ITS digital archive. I demonstrate the efficacy of my method on a test set of 22,117 hand-classified cards, reporting 100% precision and 100% recall. Running this algorithm on 39,967,358 scans of cards from the CNI, I identify 312,183 death certificate reference cards in 13.75 days of elapsed real runtime on a personal computer with only a single, $600 Intel processor. Finally, I demonstrate that this approach can be generalized to many different card types within the CNI, showing great promise for application to other archives. © The Author(s) 2018. Published by Oxford University Press on behalf of EADH.
700 1 |a Lee, B.C.G.  |e author 
773 |t Digital Scholarship in the Humanities