Automatic information extraction for cataloguing
The Bibliothèque Sainte-Geneviève (BSG) is one of the most important French university libraries and the heir to the library of the Abbey of Sainte-Geneviève.
The BSG asked Teklia to automatically extract information from a subject card file and a printed catalogue, both of which were available only as images.
The aim was to enrich the library's digital catalogue with the indexing information contained in these two printed resources.
This retrospective conversion project is part of a wider effort to catalogue the library's collection of over two million items by subject.
Processing the subject file
The subject file consists of 550,000 cards, already digitised and therefore available as images.
These cards allow thematic access to the works thanks to the following information: title, author's name, subject, bibliographical reference, description and call number.
The processing of these records began with a first step of Automatic Text Recognition (ATR), which allowed the transcription of both typed and handwritten text.
In a second step, Callico was used to create annotations, providing a training corpus for a named entity extraction system that identifies subjects and call numbers.
Once the system was trained, the 550,000 records were processed to extract their call number and associated subject (N2).
Although artificial intelligence, through ATR and a named entity recognition (NER) model, allows for the mass processing of documents, there are still errors and special cases that need to be handled manually.
When processing the subject file, the error rate of the automatic transcription was low, but some characters needed to identify the documents, such as Greek letters or superscript letters, were poorly recognised.
TEKLIA therefore set up rules, in collaboration with librarians, to recover misrecognised identification numbers for which no match was found in the database. Works in several volumes or removed from the collection also had to be handled.
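As an illustration only, one rule of this kind could look like the sketch below. It is not TEKLIA's actual rule set: the function name, the similarity cutoff and the "accept only an unambiguous match" policy are assumptions made for the example. It uses Python's standard difflib module to look for a known call number that is very close to a misrecognised one.

```python
from difflib import get_close_matches


def repair_call_number(recognised, known_call_numbers, cutoff=0.9):
    """Return the known call number closest to `recognised`, or None.

    `cutoff` is a hypothetical threshold; a real project would tune it
    with the librarians on validated examples.
    """
    # Collapse whitespace and ignore case before comparing.
    normalised = " ".join(recognised.upper().split())
    if normalised in known_call_numbers:
        return normalised
    candidates = get_close_matches(normalised, known_call_numbers, n=2, cutoff=cutoff)
    # Accept only an unambiguous match; anything else goes to manual review.
    return candidates[0] if len(candidates) == 1 else None
```

Returning None when zero or several candidates are found keeps doubtful cases for manual handling instead of guessing.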
In the end, more than 85% of the 550,000 records could be processed fully automatically. Of the remaining 15%, some records were deliberately excluded (title pages, duplicate records, crushed documents), others could not be processed because of the format of the data (series with more than 5 volumes, records with several call numbers, sometimes illegible handwritten records, etc.) and still others could not be processed because of OCR errors.
Processing of the Poirée-Lamouroux catalogue
The Catalogue abrégé de la Bibliothèque Sainte-Geneviève, compiled by Elie Poirée and Georges Lamouroux at the end of the 19th century, lists most of the works held by the library at the time.
The three volumes of the work are divided into sections delimited by titles, and into subsections indicated by numbers.
A table of correspondences associates each subsection number with a sub-subject. The following illustration shows subsection 37 of the Science section.
TEKLIA developed a system that analyses the section and subsection structure in order to associate each detected reference with the corresponding thematic subsection. The complete processing is detailed, step by step, on an extract from the previous page.
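The idea of attaching each reference to its subsection can be sketched as a simple pass over the lines in reading order. This is a minimal sketch under assumptions, not the system described here: the line kinds ("section", "subsection", "reference"), the function name and the shape of the correspondence table are all hypothetical.

```python
def assign_subsections(lines, correspondences):
    """Attach the current section and subsection to every reference.

    `lines` is a list of (kind, value) pairs in reading order, where kind
    is "section", "subsection" or "reference" (hypothetical labels).
    `correspondences` maps (section, subsection number) to a sub-subject.
    """
    section = None
    subsection = None
    assigned = []
    for kind, value in lines:
        if kind == "section":
            # A new section title resets the running subsection.
            section, subsection = value, None
        elif kind == "subsection":
            subsection = int(value)
        elif kind == "reference":
            assigned.append({
                "reference": value,
                "section": section,
                "subsection": subsection,
                # None when the table has no entry for this subsection.
                "subject": correspondences.get((section, subsection)),
            })
    return assigned
```

A reference whose subsection is missing from the table keeps a subject of None, so it can be flagged for review.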
The first step in the processing is text recognition: the position of the lines is detected, then the lines are transcribed using an ATR model. In the example, the detected lines are highlighted in green.
ROCAFORT (Jacques). L'éducation morale
au lycée. 1899. [R. 8º Sup. 4040.]
SAMSON (Mme Jules). Une éducation dans
la famille ... 3e éd. S. d. [R. 8º Sup. 3598.]
The catalogue is divided into sections that group together records with a common theme. Because the experts need this information, the first task was to group the detected lines into sections.
A section may extend over several successive pages of the same volume. The text of each section was reconstructed by concatenating the text of the lines that make it up.
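The concatenation step can be sketched as follows. The field names ("section", "page", "order", "text") are assumptions made for this example and do not reflect the actual data model; the only point illustrated is that lines are ordered by page and position before being joined, so that a section spanning several pages is rebuilt correctly.

```python
from collections import defaultdict


def rebuild_section_texts(lines):
    """Concatenate line texts per section, in page and reading order.

    Each line is a dict with hypothetical keys: "section" (identifier),
    "page" (page number), "order" (position on the page) and "text".
    """
    grouped = defaultdict(list)
    for line in sorted(lines, key=lambda item: (item["page"], item["order"])):
        grouped[line["section"]].append(line["text"])
    # Join with a single space; lines keep their transcribed content as-is.
    return {section: " ".join(texts) for section, texts in grouped.items()}
```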
Once these sections were annotated on Arkindex, the experts were able to manually assign numbers to them using an internal identifier. While the first section (at the top of the page) corresponds to Educational Sciences, the second (at the bottom of the page) corresponds to Philosophy of Science.
After the references were annotated on a small sample of pages in Callico, a model was trained to detect them. This model was then applied to all sections of the volumes.
The last step was to match the detected shelf marks with those in the library database.
For this purpose, we used additional information present in each reference: the author, the title of the work and the date of publication, when present, greatly improve the accuracy of the match.
However, the abbreviations and wording may differ between the database and the printed catalogue.
The following image shows the breakdown of the text with these different parts.
In the example, the correspondence can be made with:
- The book with the reference number 8 R SUP 4040, written by Rocafort in 1899, entitled L'éducation morale au lycée
- The book with the reference number 8 R SUP 3598, written by Samson, entitled Une éducation dans la famille, conseils pratiques d'une mère
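A minimal sketch of this matching step is given below. It only assumes what the example shows: a printed shelf mark such as "[R. 8º Sup. 4040.]" corresponds to the database form "8 R SUP 4040". The regular expression, the function names and the record fields ("call_number", "title") are hypothetical, and the pattern may not cover other shelf mark formats used in the catalogue. The title is used only to choose between records that share the same call number.

```python
import re
from difflib import SequenceMatcher

# Printed form assumed from the example: "[R. 8º Sup. 4040.]"
SHELF_MARK = re.compile(
    r"\[\s*([A-Za-z]+)\.\s*(\d+)\s*[º°]\s*(Sup\.)?\s*(\d+)\.?\s*\]"
)


def normalise_shelf_mark(printed):
    """Turn "[R. 8º Sup. 4040.]" into "8 R SUP 4040"; None if no match."""
    found = SHELF_MARK.search(printed)
    if found is None:
        return None
    letter, size, supplement, number = found.groups()
    parts = [size, letter.upper()]
    if supplement:
        parts.append("SUP")
    parts.append(number)
    return " ".join(parts)


def title_similarity(a, b):
    """Similarity in [0, 1] between two titles, ignoring case."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def match_reference(printed_line, title, records):
    """Return the record with the same call number and the closest title."""
    shelf_mark = normalise_shelf_mark(printed_line)
    candidates = [r for r in records if r["call_number"] == shelf_mark]
    if not candidates:
        return None
    return max(candidates, key=lambda r: title_similarity(title, r["title"]))
```

A similarity score rather than strict equality is used for titles because, as noted above, the wording can differ between the printed catalogue and the database.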
In the end, from the 5,869 pages of the catalogue, 29,497 call numbers were automatically extracted and identified in the digital catalogue.
A significant cost reduction
TEKLIA's models made it possible to automate more than 90% of the processing of the subject file and the catalogue. While manual validation is still required, the human effort is significantly reduced, with a major impact on the cost of retrospective conversion.
Timothée RONY, Documentary Policy Department, Bibliothèque Sainte-Geneviève
"The use of AI allowed us to process a considerable amount of data (550,000 records in the subject file and almost 6,000 pages in the printed catalogue). It would have been unthinkable for us to rely on manual processing and we were looking for a service provider who could help us with this retro-conversion project, which had a dual objective: to enrich the records of our online catalogue with very valuable indexing data that was previously difficult to use because it could only be queried in printed form; and to support us in our project to evaluate and map our collections, by partially automating the counting operations. From our point of view, the initial objectives have been fully met. Despite a very heterogeneous data format, difficulties linked to the complexity of our call number and indexing system and the sometimes late expression of new needs as our tests progressed, Teklia has always been very available and willing to adapt the working method to the difficulties and problems that arose. In this respect, we are fully satisfied with our collaboration with the Teklia teams."
Do not hesitate to contact us via our contact form to set up a similar project in your institution.
Sainte Geneviève Library - www.bsg.univ-paris3.fr/iguana/www.main.cls.