Computers can do almost anything. They can even read books with the help of Optical Character Recognition (OCR), which converts images of printed pages into searchable and editable digital text. OCR has become more accurate over the years. Text collections like Google Books (over 25 million book titles), Project Gutenberg (57,000 free eBooks) and Delpher (in the Netherlands) are growing rapidly and have definitely contributed to the dissemination of our cultural heritage.
Computers are able to read not only printed texts but also handwriting, often with a little human help. Software like Monk, Transkribus or DigiPal was developed to automate the reading of handwritten texts from the Middle Ages until today. Monk even claims to process Chinese manuscripts and hieroglyphs.
Human help is often organized in crowdsourcing projects. The word ‘crowdsourcing’ originated in 2005 as a blend of ‘outsourcing’ and ‘crowd’, because businesses were beginning to use digital platforms to outsource work to individuals. The idea was rapidly picked up by the cultural heritage sector, sometimes denoted as GLAM (galleries, libraries, archives and museums), which had already relied on the help of volunteers for a long time.
In the Middle Ages and the early modern period we sometimes find the word ‘coll.’ written in the margin of a text, as an abbreviation of the Latin ‘collatio’, meaning ‘comparison’. Official documents in particular, such as wills or accounts, were copied by two different people, and the copies were then compared with each other and with the original text.
A modern way of handling the computer-aided transcription of difficult handwritten texts is called ‘double-keying’, which could be seen as a modern form of ‘collatio’. Two different operators provide two independent transcriptions of a text, and the two versions are compared in order to detect transcription errors. This method usually results in very high accuracy rates. In addition, the text can be enriched with structural annotations, creating ‘smart data’.
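The comparison step of double-keying can be illustrated with a small Python sketch. It is only an illustration under stated assumptions, not the procedure used by any particular project: the function name find_disagreements and the sample line are invented here, and the comparison works word by word using difflib from the Python standard library. Every place where the two transcriptions differ is reported, so that a human expert can check it against the original.

```python
from difflib import SequenceMatcher


def find_disagreements(transcription_a: str, transcription_b: str):
    """Return the word spans where two independent transcriptions differ.

    Hypothetical helper for illustration: each result is a pair
    (words from keyer A, words from keyer B) for one disagreement.
    """
    words_a = transcription_a.split()
    words_b = transcription_b.split()
    # autojunk=False keeps difflib from ignoring frequent words in long texts.
    matcher = SequenceMatcher(a=words_a, b=words_b, autojunk=False)

    disagreements = []
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if tag != "equal":
            disagreements.append(
                (" ".join(words_a[a_start:a_end]), " ".join(words_b[b_start:b_end]))
            )
    return disagreements


# Invented sample line, keyed independently by two operators.
keyer_one = "Jan Pietersz van Poortugaal schepen"
keyer_two = "Jan Pietersz van Portugaal schepen"

for version_a, version_b in find_disagreements(keyer_one, keyer_two):
    print(f"Check against the original: '{version_a}' versus '{version_b}'")
```

An empty result means the two operators agree on every word. That does not prove the transcription is correct, since both may have made the same mistake, but such shared errors are much rarer than single ones.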
The combination of crowdsourcing and double-keying provides a relatively fast and reliable way to digitize difficult archival texts. Automated transcription with the help of OCR will also improve in the near future, but names of persons and places will remain very difficult for a machine to read. Maybe that should be left to paleography experts, at least for the moment.
Google Books: https://books.google.nl/
Project Gutenberg: https://www.gutenberg.org/
Haaf, Susanne, Frank Wiegand, and Alexander Geyken. “Measuring the correctness of double-keying: error classification and quality control in a large corpus of TEI-annotated historical text”. Journal of the Text Encoding Initiative, 2013. https://journals.openedition.org/jtei/739#tocto1n10.
Schöch, Christof. “Big? Smart? Clean? Messy? Data in the Humanities”. Journal of Digital Humanities 2, no. 3 (2013). http://journalofdigitalhumanities.org/2-3/big-smart-clean-messy-data-in-the-humanities/.
Schöch, Christof. “Digitale Wissensproduktion”. In Digital Humanities. Eine Einführung, 206–12, 2017.
Terras, Melissa. “Crowdsourcing in the digital humanities”. In A new companion to digital humanities, 420–38, 2016.
Photo by Bas Lems, taken from: Stadsarchief Rotterdam, toegang 464, Ambacht en gemeente Poortugaal, inv.nr. 262, folio 113v.