Teklia are proud to announce that we have completed the work to process all the registers in the BALSAC project, which has been digitizing historical data on the Quebec population.

What is BALSAC?

The BALSAC project has been building a database of demographic events (birth, marriage and death records) of the Quebec population. The database is built by linking together the demographic events from transcribed parish registers. It comprises about 2 million scanned images of (mostly) handwritten parish records from the second half of the 19th century to the beginning of the 20th century. Some examples of the pages needing processing appear below.

Some example pages from the BALSAC project.

How has Teklia helped?

As one might expect, transcribing 2 million images manually is a lot of work. Fortunately, machine learning methods have become good enough in recent years to automate this process.

At Teklia, our goal was to develop a system capable of processing all 2 million images: automatically transcribing the pages, segmenting them into acts (records), detecting entities, and finally exporting the results as XML so that they can be imported into the BALSAC database. The BALSAC team will integrate the exported records into the database and, relying on nominative information, link them in order to reconstruct individual biographies and family histories.

To digitize a database of this size, we had to improve Arkindex by training several machine learning models and incorporating them into it.

Overview of the Machine Learning process

In the workflow figure below, we can see the dependencies between different machine learning steps. We've also outlined each step below the figure.

The machine learning workflow graph on Arkindex for the BALSAC project.
  1. UFCN Line Historical - (Document Layout Analysis) Before doing anything else, we need to detect the text lines on the image. To do that, we use a Doc-UFCN model trained on historical data, including annotated data from BALSAC. You can read more about Doc-UFCN in a recent blog post.
  2. Kaldi Balsac - (Handwritten Text Recognition) We trained a Kaldi model on annotated BALSAC pages to transcribe text lines into machine-readable text.
  3. U-FCN Balsac act (cut lecture faite) - (Document Layout Analysis) We trained a model on BALSAC pages to segment a page into acts, or records (act_start, act_center, act_end, act). An act can start on a page, end on a page or be fully contained on a page. In addition, the model predicts whether an act is about a birth, marriage or death, based on common keywords for each type. Teklia's model is based on the Doc-UFCN library, but extra textual information is included to improve the results. You can read more about our method in the paper we published: Including Keyword Position in Image-based Models for Act Segmentation of Historical Registers.
  4. Paragraph Creator - (Document Layout Analysis) We created a tool to concatenate lines together into paragraphs so the named entity recognition model would have more context and could produce better results.
  5. Flair French Balsac - (Named Entity Recognition) We trained a Flair model on annotated BALSAC entities to detect person names, dates, locations and professions.
  6. Scikit Portrait Outlier Detector Balsac - (Outlier Detection) We created an outlier detection model with scikit-learn, based on the detected text lines, to find pages that do not contain acts (or that look out of the ordinary). This was needed because the annotated dataset only contained pages with acts.
  7. Finally, we combined the predictions of different tools together to export each register as an XML file of acts.
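
The dependencies between these steps can be written down as a small graph and sorted into a valid execution order. The sketch below is only illustrative: the step names are hypothetical labels, and the edges are our reading of the step descriptions above, not a copy of the actual Arkindex workflow graph.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it is ASSUMED to depend on.
# These edges are inferred from the prose description of the steps,
# not taken from the real Arkindex workflow configuration.
WORKFLOW = {
    "line_detection": set(),
    "text_recognition": {"line_detection"},
    "act_segmentation": {"text_recognition"},
    "paragraph_creation": {"text_recognition"},
    "entity_recognition": {"paragraph_creation"},
    "outlier_detection": {"line_detection"},
    "xml_export": {"act_segmentation", "entity_recognition", "outlier_detection"},
}


def execution_order(workflow):
    """Return one valid order in which the steps can run."""
    return list(TopologicalSorter(workflow).static_order())


order = execution_order(WORKFLOW)
print(" -> ".join(order))
```

Whatever order the sorter picks, line detection comes first and the XML export comes last, which matches the description above.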

Results

After completing the transcription and insertion into the BALSAC database, we were happy with the results. They are exemplified and summarized below.

Example

A page from BALSAC with detected lines and detected acts

Text from the first full act recognized by our tool:

Le cinq juin mil neuf cent cinq , nous

prêtre curé soussigné avons baptisé Marie

Rose née le même jour , du légitime

mariage de Jean Boisvert et de

Adele Cloutier de cette paroisse . Pouvoir

François Cloutier marraine Arthémise

Beausoleil . Lesquels n ' ont su signer avec nous

Ed . Lessier ptre
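
To illustrate what the Paragraph Creator step does with lines like these, here is a minimal sketch that joins the recognized lines of the act in reading order. The function name `lines_to_paragraph` is hypothetical, and the real tool presumably also uses the position of each line on the page, which this sketch ignores.

```python
# Recognized text lines of the act above, in reading order.
act_lines = [
    "Le cinq juin mil neuf cent cinq , nous",
    "prêtre curé soussigné avons baptisé Marie",
    "Rose née le même jour , du légitime",
    "mariage de Jean Boisvert et de",
    "Adele Cloutier de cette paroisse . Pouvoir",
    "François Cloutier marraine Arthémise",
    "Beausoleil . Lesquels n ' ont su signer avec nous",
    "Ed . Lessier ptre",
]


def lines_to_paragraph(lines):
    """Join non-empty lines with single spaces (hypothetical helper)."""
    return " ".join(line.strip() for line in lines if line.strip())


paragraph = lines_to_paragraph(act_lines)
print(paragraph)
```

Feeding the whole paragraph to the named entity recognition model gives it context that a single line lacks: the name "Marie Rose", for instance, is split across two lines.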

Statistics

While processing images with our workflow, we found that our models sometimes performed worse on double pages. To deal with that, we simply cut every double-page image in half, so the number of processed pages is higher than the number of source images (almost 2 million).
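
A naive split of this kind can be sketched as follows. The `Box` type and the `split_double_page` function are hypothetical names used for illustration, and the sketch assumes the cut is made at the exact horizontal midpoint of the image.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned rectangle in pixel coordinates (hypothetical type)."""
    x: int
    y: int
    width: int
    height: int


def split_double_page(width: int, height: int) -> tuple[Box, Box]:
    """Cut a double-page image at its horizontal midpoint.

    Assumes the binding sits in the middle of the image, which is
    only approximately true for real scans.
    """
    mid = width // 2
    left = Box(0, 0, mid, height)
    right = Box(mid, 0, width - mid, height)
    return left, right


left, right = split_double_page(4001, 3000)
print(left, right)
```

The two boxes always cover the full image without overlap; with an odd width, the right half is one pixel wider.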

In the dataset, we had 44 742 registers from 1 985 parishes. In total, we had 2 635 038 single-page images that contained 5 591 535 acts, 15 668 671 paragraphs and 111 270 929 text lines.

Count of elements by type on Arkindex of the BALSAC project:

Type            Count
act             5 591 535
double_page     638 467
page            2 635 038
paragraph       15 668 671
parish          1 985
register        44 742
text_line       111 270 929
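
These counts can be cross-checked against the "almost 2 million" source images. The short calculation below assumes that every double_page element was cut into exactly two page elements and that every other page element comes from a single-page image; the variable names are ours.

```python
PAGES = 2_635_038        # "page" elements after splitting
DOUBLE_PAGES = 638_467   # "double_page" elements (one per double-page image)

# Assumption: each double page produced exactly two page elements.
pages_from_double = 2 * DOUBLE_PAGES
single_page_images = PAGES - pages_from_double

# One source image per double page, plus the single-page images.
source_images = single_page_images + DOUBLE_PAGES

print(f"{source_images:,}")  # 1,996,571
```

Under these assumptions, the result is just under 2 million images, in line with the figure quoted above.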

Count of entities by type on Arkindex of the BALSAC project:

Type            Count
date            8 655 893
location        5 429 807
person          30 986 429
profession      4 624 291

The number of entities and the number of acts are consistent with our expectations: each act has about 1-2 dates, about 1 location and about 5-6 persons. We have to remember, however, that some of those acts are incomplete (act_start, act_center, act_end): they either started on a previous page or continue on the next one.
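
The per-act averages follow directly from the two tables above. A minimal check, using only the published counts (the function name is ours):

```python
ACTS = 5_591_535

ENTITIES = {
    "date": 8_655_893,
    "location": 5_429_807,
    "person": 30_986_429,
    "profession": 4_624_291,
}


def entities_per_act(entities, acts):
    """Average number of entities of each type per act."""
    return {name: count / acts for name, count in entities.items()}


ratios = entities_per_act(ENTITIES, ACTS)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}")
# date: 1.55
# location: 0.97
# person: 5.54
# profession: 0.83
```

These averages include the incomplete acts, so the figures for complete acts alone are likely somewhat higher.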

We are happy to have been able to process all these pages and deliver the exported results. It is an important step in showing what is possible with Arkindex, and by further developing our tool on the BALSAC dataset, we've made Arkindex that much better for importing future historical datasets.