The Virtual Record Treasury of Ireland

Nominated Award: Best Application of AI to achieve Social Good

Website of Company: http://www.virtualtreasury.ie

On June 30th, 1922, the Record Treasury of the Public Record Office of Ireland (PROI), containing Ireland’s documentary heritage dating back to the thirteenth century, was destroyed in an explosion and fire at the Four Courts, Dublin, during the opening engagement of the Irish Civil War. Beyond 2022 is an all-island and international collaborative research programme to recreate digitally this lost national treasure. Several of the most important memory institutions worldwide are joining us in this shared mission to reconstruct Ireland’s lost history. Together with over 70 other Participating Institutions across the island of Ireland, Britain, the USA and Australia, we are working to recover what was lost in that terrible fire one hundred years ago.

On the 27th June 2022, we launched the Virtual Record Treasury of Ireland online: http://www.virtualtreasury.ie. Many millions of words from destroyed documents have been recovered, digitally linked and reassembled from copies, transcripts and other records scattered among the collections of our archival partners. This rich array of replacement items has been brought together and made freely available within the website. Beyond 2022 / the Virtual Treasury of Ireland has developed deep learning models that enable the automatic transcription and indexing of early modern and modern English handwriting. This technology empowers users around the world, for the first time, to engage with and interrogate ancient historical documents.

The Virtual Treasury models can transcribe most texts created between the years 1600-1800, including notarial (the formal style used in legal documents), secretary (the ‘educated’ style of which William Shakespeare’s original manuscripts are an example) and the more cursive styles of handwriting that emerged as education democratised. Although more handwritten manuscripts than printed texts survive for the early modern period, only a handful of people globally have the paleographical training to read them. Libraries and archives around the world have been painstakingly creating digital images of their manuscript collections for more than thirty years, but even when made available online these texts are still accessible only to a small number of scholars. Transcription projects are slow and very costly, and until now only a tiny percentage of historical manuscripts has made the transition to searchable, machine-readable text.

The Solution
This has all changed.

On 28 June 2022 the Virtual Treasury of Ireland went online containing over 50,000 pages of searchable, handwritten documents. In global terms, it is the largest single release of transcribed handwritten documents ever, and it doubled, at a stroke, the quantity of transcribed original sources for Irish history available online. The Virtual Treasury models, some made publicly available on the Transkribus platform, have been used by over 70 libraries and archives around the world, increasing access to records that include Caribbean piracy cases from the seventeenth century and the original letter books of Benjamin Franklin. In an Irish context, through the VRTI, the detail of the Elizabethan and Cromwellian conquests of Ireland is laid bare for the first time, together with a census of the Irish population from the eighteenth century.

Reason for Nomination

The State of Play
The combined collections of historical manuscripts held in the anglophone world run to hundreds of millions of items. Less than 5% have been imaged and only a handful of the total is fully searchable. Although the availability of searchable historical texts digitised from printed sources has expanded dramatically over the past decade due to the reliability of OCR, the pace of digitisation of historical manuscripts has declined as doubts spread about the usefulness of these projects.

The Problem
Manual transcription by trained paleographers is very slow and extremely costly. In addition, there is an ‘opportunity cost’, as the few people capable of transcribing these documents are usually historians who would otherwise make sense of them and add to our overall corpus of historical knowledge. Most historians trained to this level are in academic or archival employment, meaning that even if there is a pressing social or pedagogical need for certain documents to be transcribed, there is an acute shortage of labour with the necessary skills.

The Solution
The Virtual Treasury of Ireland has developed a suite of deep-learning models capable of transcribing most handwritten historical texts. The models are designed to produce an accurate first draft, with a character error rate of less than 3%, which is good enough to make the document searchable and reduces the onerous task of full manual transcription to the relatively trivial task of editing for publication. Fifty texts were selected, written in different hands and across a temporal spread, 1600-1850, with an emphasis on texts rich in nominal content and expressions to improve the accuracy of eventual entity extraction. In total, the models run to some 2.2 million words of transcription taken from 7,500 page images of documents. A general model has been developed with an error rate of 8%, together with five style-specific or source-specific models that achieve the desired error rate over the majority of documents specific to the needs of the Virtual Treasury.
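
For readers unfamiliar with the metric, character error rate is commonly computed as the edit distance between a model's output and a trusted reference transcription, divided by the length of the reference. The Python sketch below illustrates that common definition only; it is not the evaluation code used by the project or by Transkribus, and the function names are hypothetical.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn `hypothesis` into `reference`."""
    # prev[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_ch in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_ch in enumerate(hypothesis, start=1):
            substitution = prev[j - 1] + (0 if ref_ch == hyp_ch else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalised by the length of the reference text."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

On this definition, a rate below 3% means fewer than three character edits per hundred characters of reference text, which is why a first draft at that level is usually searchable without correction.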

How did we do it?
The texts were selected from a range of completed digital humanities projects and texts provided by the Irish Manuscripts Commission. These were supplemented by original transcriptions created by the Virtual Treasury research team and important contributions from Irish government agencies, most notably the Property Registration Authority of Ireland. The transcriptions were mapped to segmented images using the Transkribus platform. Transkribus is an output of the €8.3m H2020 READ project that completed in 2019. The Transkribus platform can perform accurate segmentation of complex images of historical texts, identify lines and words and transform the bitmap segments into word-based vector diagrams representing the geometry of the cursive handwriting that makes up each word. The matched geometries and transcriptions that comprise the training data were transformed into deep learning models using both PyLaia (an open-source deep learning tool developed by the University of Valencia) and an implementation of TensorFlow developed for READ by Naver Labs. A generic model of one million words has been made publicly available on the Transkribus platform and the style-specific models are shared with other academic users on a request basis.

How difficult was it?
The Virtual Treasury models are entirely novel: no deep-learning solutions for the transcription of early texts in English previously existed. Texts are formed in both cursive and randomly cursive styles, and are replete with entry and exit strokes, under- and super-scripts, and abbreviations often devised at the whim of an individual scribe. Some letters can take alternative forms depending on their position within a word, and there are no standard nominal spellings. Texts were selected based on an inverted ‘Scrabble’ approach, whereby letters were scored based on the frequency of their appearance and style. The selection of texts was designed to ensure that both upper- and lower-case letters appeared evenly throughout the language models, and the frequency of these appearances broadly informed the language models to level out the confidence with which the algorithm proposed each matching word or letter. Once this approach was demonstrated internally to be working satisfactorily, it was then a matter of scaling up by selecting texts that would expand both the range of handwriting styles the models could cope with and the range of nominal terms that are transcribed accurately enough to enable automated entity extraction with or without NLP.
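
The selection procedure is described only in outline, so the Python sketch below is one possible reading of it rather than the project's actual method. It assumes that each letter is scored by its rarity in the texts chosen so far, and that the next text picked is the one whose letters carry the highest average score, which tends to even out coverage of upper- and lower-case letters. The names `ALPHABET`, `rarity_scores` and `pick_next_text` are hypothetical.

```python
import string
from collections import Counter

# Assumption: only unaccented upper- and lower-case Latin letters are scored.
ALPHABET = string.ascii_letters


def letter_counts(text: str) -> Counter:
    """Count every alphabetic character in a transcription."""
    return Counter(ch for ch in text if ch.isalpha())


def rarity_scores(counts: Counter, alphabet: str = ALPHABET) -> dict:
    """Score each letter so that rarely seen letters score highest.
    Add-one smoothing gives unseen letters the top score."""
    total = sum(counts[ch] for ch in alphabet) + len(alphabet)
    return {ch: total / (counts[ch] + 1) for ch in alphabet}


def pick_next_text(selected_counts: Counter, candidates: dict) -> str:
    """Return the name of the candidate text whose letters are, on average,
    the most under-represented in the texts selected so far."""
    scores = rarity_scores(selected_counts)

    def average_gain(text: str) -> float:
        counts = letter_counts(text)
        scored = sum(scores[ch] * n for ch, n in counts.items() if ch in scores)
        return scored / max(1, sum(counts.values()))

    return max(candidates, key=lambda name: average_gain(candidates[name]))
```

Calling `pick_next_text` repeatedly, and adding each chosen text's counts to `selected_counts`, would give a simple greedy selection loop under these assumptions.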

The next challenge?
The TensorFlow implementation on Transkribus can process roughly 300 page images an hour, replacing one month of work by a human transcriber. The 50,000 pages processed so far have created an avalanche of new historical material, much of which was previously unknown to historians. The first 50,000 pages took two years; the next 50,000 will probably take two months, representing 14 person-years of human effort.
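
The figures in this section can be checked with a few lines of arithmetic. The Python sketch below assumes, as stated above, that the models process about 300 page images an hour and that 300 pages correspond to roughly one month of work by a human transcriber; the 12-month person-year and all names are illustrative assumptions.

```python
PAGES_PER_MACHINE_HOUR = 300   # stated throughput of the models
PAGES_PER_PERSON_MONTH = 300   # stated equivalence: 300 pages is about one month of human work
MONTHS_PER_PERSON_YEAR = 12    # assumption for converting months to years


def machine_hours(pages: int) -> float:
    """Hours of automated processing needed for `pages` page images."""
    return pages / PAGES_PER_MACHINE_HOUR


def person_years(pages: int) -> float:
    """Human transcription effort that `pages` page images would represent."""
    return pages / PAGES_PER_PERSON_MONTH / MONTHS_PER_PERSON_YEAR


if __name__ == "__main__":
    pages = 50_000
    print(f"{machine_hours(pages):.0f} machine hours")
    print(f"{person_years(pages):.1f} person-years")
```

For 50,000 pages this gives just under 14 person-years, in line with the figure quoted above. The machine hours cover automated processing only, not image preparation or editorial review.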

Additional Information:

Societal Impact
Equalizing access to historical sources is crucial to the health of modern societies. In a world where political leaders simply rewrite histories to serve their own agendas, democratising access to historical sources has never been more important. Across the anglophone world, fewer children are being taught to read handwriting. Given that only a small proportion of our written heritage has been converted from manuscript to printed form, this means that the emerging generation will be increasingly unable to access their past. This phenomenon has ramifications beyond the traditional study of history and literature and includes the ability to conduct genealogical research, to access property records and to read recovered datasets, for example meteorological records. Conversely, children demonstrate high rates of literacy when asked to interpret printed text, either on paper or when presented electronically. Converting existing handwritten text to electronic print will preserve literacy levels among the next generation of historical researchers at their current high rate and reduce the bias against reading the earliest, and most difficult to read, material.

Economic Impact
In general terms, historians and literary scholars dislike expressing the value of their work in monetary terms. The fact is, however, that assuming a scholar would be doing well to produce the first draft of a transcribed page in less than an hour, the Virtual Treasury models can produce €25 worth of transcription at the prevailing Irish post-doctoral rate for roughly 11 cents in a matter of seconds. This dramatic fiscal improvement means that not only can existing transcription projects be completed far faster than originally envisaged, but additional projects can be commissioned that were not previously thought possible.
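
To put the per-page figures in project terms, the comparison can be scaled to a whole collection. The Python sketch below uses only the two figures quoted above (about €25 per manually drafted page and roughly 11 cents per machine-drafted page) and works in integer cents to avoid rounding noise; the function names are hypothetical and the result ignores editing and review costs.

```python
MANUAL_CENTS_PER_PAGE = 2_500   # about 25 euro per page, as quoted above
MACHINE_CENTS_PER_PAGE = 11     # roughly 11 cents per page, as quoted above


def first_draft_cost_euro(pages: int, cents_per_page: int) -> float:
    """Cost in euro of producing a first-draft transcription of `pages` pages."""
    return pages * cents_per_page / 100


def cost_ratio() -> float:
    """How many times cheaper the machine draft is than the manual one."""
    return MANUAL_CENTS_PER_PAGE / MACHINE_CENTS_PER_PAGE


if __name__ == "__main__":
    pages = 50_000
    print(first_draft_cost_euro(pages, MANUAL_CENTS_PER_PAGE))
    print(first_draft_cost_euro(pages, MACHINE_CENTS_PER_PAGE))
    print(round(cost_ratio()))
```

On these assumptions, a first draft of 50,000 pages costs about €1.25 million manually against about €5,500 by machine, a difference of more than two hundredfold.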

Technological/Scientific impact
Before early modern English there was medieval Latin and a range of Gaelic texts written in several variants. There is even Hiberno-Latin. Using generic machine learning platforms, the methodologies employed in the development of the Virtual Treasury models can be applied to any of these ancient languages. All exhibit similar difficulties with styles, abbreviations and variants that can be effectively overcome with robust geometric and language deep-learning models developed along similar lines, with the added advantage that recovered texts in these ancient languages can be at least partially translated into languages understood by modern readers.

Historical tabular data, that is, handwritten entries recorded in hand-drawn tables, is also now easily recoverable for the first time. With the problem of handwriting recognition effectively solved, our current work is to develop geometric models that segment images into their constituent cells and output the text as .xml or .csv tables, a further advance from image to text that promises to further revolutionise historical research.
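
The final step of such a pipeline, turning recognised cells into a .csv table, can be sketched briefly. The Python example below assumes that an earlier segmentation stage has already assigned each cell a row index, a column index and its transcribed text. The `Cell` type and `cells_to_csv` function are hypothetical illustrations, not part of Transkribus or of the project's own tooling.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """One recognised table cell (hypothetical structure)."""
    row: int    # zero-based row index from the segmentation stage
    col: int    # zero-based column index from the segmentation stage
    text: str   # transcription of the cell's handwriting


def cells_to_csv(cells: list) -> str:
    """Arrange cells on a rectangular grid and serialise it as CSV text.
    Cells that were not detected are left as empty fields."""
    if not cells:
        return ""
    n_rows = max(c.row for c in cells) + 1
    n_cols = max(c.col for c in cells) + 1
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for c in cells:
        grid[c.row][c.col] = c.text
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(grid)
    return buffer.getvalue()
```

Using the standard `csv` module means that cell text containing commas or quotation marks, which is common in names and place descriptions, is escaped correctly in the output.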