The International Workshop on Document Analysis Systems is a major event for researchers in automatic document processing. It takes place every two years, and the 2022 edition is organised by the University of La Rochelle in France. TEKLIA is pleased to be an official sponsor of the conference.
In addition to its sponsorship, TEKLIA will present two scientific papers at the conference, held from 22 to 25 May 2022, on its work in automatic handwriting recognition and natural language processing:
First, a paper presenting a comprehensive comparison of handwritten text recognition (HTR) libraries on a new annotated corpus of historical documents in Norwegian. This is joint work with the National Library of Norway and Lumex AS, and it received funding from the Research Council of Norway.
A Comprehensive Comparison of Open-Source Libraries for Handwritten Text Recognition in Norwegian by Martin Maarand, Yngvil Beyer, Andre Kåsen, Knut T. Fosseide and Christopher Kermorvant.
In this paper, we introduce a database of historical handwritten documents in Norwegian, the first of its kind, allowing the development of handwritten text recognition (HTR) models in Norwegian. In order to evaluate the performance of state-of-the-art HTR models on this new database, we conducted a systematic survey of open-source HTR libraries published between 2019 and 2021, identified ten libraries and selected four of them to train HTR models. We trained twelve models in different configurations and compared their performance on both random and scripter-based data splitting. The best recognition results were obtained by the PyLaia and Kaldi libraries, which have different and complementary characteristics, suggesting that they should be combined to further improve the results.
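To illustrate the difference between the two splitting strategies mentioned in the abstract, here is a minimal sketch, not the code used in the paper. The sample layout (a dictionary with hypothetical `writer` and `text` keys), the function names and the 20% test fraction are all assumptions made for this example.

```python
import random
from collections import defaultdict


def random_split(samples, test_fraction=0.2, seed=0):
    """Shuffle all lines and cut: the same writer may land in train and test."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def writer_split(samples, test_fraction=0.2, seed=0):
    """Hold out whole writers: no writer appears in both train and test."""
    by_writer = defaultdict(list)
    for sample in samples:
        by_writer[sample["writer"]].append(sample)
    writers = sorted(by_writer)
    rng = random.Random(seed)
    rng.shuffle(writers)
    n_test = max(1, round(len(writers) * test_fraction))
    test_writers = set(writers[:n_test])
    train = [s for s in samples if s["writer"] not in test_writers]
    test = [s for s in samples if s["writer"] in test_writers]
    return train, test


# Toy data (hypothetical): 50 text lines written by 5 writers.
samples = [{"writer": f"w{i % 5}", "text": f"line {i}"} for i in range(50)]
train, test = writer_split(samples)
```

A writer-based split is the stricter test: the model is evaluated on handwriting it has never seen, whereas a random split can place lines from the same hand on both sides.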
Our second paper is a comparison of NER libraries for historical documents:
A Comprehensive Study of Libraries for Named Entity Recognition, by Claire Bizon Monroc, Blanche Miret, Marie-Laurence Bonhomme and Christopher Kermorvant.
In this paper, we propose an evaluation of several state-of-the-art open-source natural language processing (NLP) libraries for named entity recognition (NER) on handwritten historical documents: spaCy, Stanza and Flair. The comparison is carried out on three low-resource multilingual datasets of handwritten historical documents: HOME (a multilingual corpus of medieval charters), Balsac (a corpus of parish records from Quebec), and Esposalles (a corpus of marriage records in Catalan). We study the impact of the document recognition processes (text line detection and handwriting recognition) on the performance of the NER. We show that current off-the-shelf NER libraries yield state-of-the-art results, even on low-resource languages or multilingual documents using multilingual models. We show, in an end-to-end evaluation, that text line detection errors have a greater impact than handwriting recognition errors. Finally, we also report state-of-the-art results on the public Esposalles dataset.
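To give an idea of how a recognition error can propagate to the NER score in an end-to-end evaluation, here is a minimal sketch of an entity-level precision, recall and F1 computation. It is an assumption for illustration and not necessarily the metric used in the paper: entities are compared by exact match on a hypothetical (text, label) pair, and the example names are invented.

```python
def entity_scores(reference, predicted):
    """Exact-match precision, recall and F1 over sets of (text, label) pairs.

    Using sets means repeated identical entities are counted once,
    which is a simplification made for this sketch.
    """
    ref, pred = set(reference), set(predicted)
    true_positives = len(ref & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical example: entities from the manual transcription...
reference = [("Joan Pere", "PER"), ("Barcelona", "LOC")]
# ...and entities found in automatically recognised text, where one
# character error ("Pcre") makes the person name no longer match.
predicted = [("Joan Pcre", "PER"), ("Barcelona", "LOC")]

precision, recall, f1 = entity_scores(reference, predicted)
```

With exact matching, a single misrecognised character is enough to turn a correct entity into both a false positive and a false negative.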