Callico, Teklia's annotation platform

HTR and OCR models, like any other machine learning model, must be trained on high-quality datasets in order to process documents properly. Building these datasets is a crucial step in any project, but it can be quite off-putting: before any recognition can take place, the documents must first be annotated correctly to produce the set of elements on which the model will be trained. To meet its clients’ needs, Teklia has developed and launched a tool that lets contributors work on various annotation campaigns. Let us introduce you to Callico, a comprehensive web annotation platform for your document recognition projects.

Why another annotation platform?

Unlike other automatic transcription and document recognition platforms, Teklia’s solutions, Callico and Arkindex, are built to process large volumes of documents and data for both text and image recognition. Very few annotation tools make it possible to work on the document image and on the transcription at the same time. Moreover, existing annotation platforms are mainly designed for experts and are difficult to use at scale with novice users. With this in mind, Teklia has designed a much more accessible annotation tool, in which any annotator can focus on simple tasks and work at a faster pace.

Keep an eye on your annotation campaigns.

Callico is a collaborative platform designed for completing annotation tasks on various types of documents. The key to a successful annotation campaign is quality, which is why Teklia gives its clients and partners the opportunity to supervise the annotation process and ensure that the project ends up with high-quality ground truth. Callico allows you to create your own campaign directly from Arkindex and assign tasks to any member of your team, as well as to anyone who would like to volunteer. Once your collaborators have received their assignments, they can only work on one task at a time, which makes it easy for you to check the status of each task and follow the project’s progress.

Five simple campaign modes for better document preparation.

This crowdsourced annotation platform is meant to be used by a large number of contributors, who might not be familiar with document annotation. Teklia has therefore created five modes, each designed for a specific need, so that the website is intuitive and can be used as quickly as possible without any expertise.

Let's go through the available modes:

  • Segmentation Mode, which consists of drawing zones, or bounding boxes, on a document: for instance, around lines of text in handwritten documents or around articles in newspapers.

callico_segm

  • Transcription Mode, which is essential when creating datasets for training HTR and OCR models. It simply consists of transcribing the handwritten or typed text that appears on the document.

callico_transc_print

  • Named-Entities Mode, which consists of annotating portions of text according to their category, such as the name of a person or a place.

callico_entity

  • Metadata Mode, which enables the annotator to add metadata to the document using a preset form.

callico_form

  • Classification Mode, which allows users to assign a class to a document or to part of a document.

callico_class

This limited set of actions allows contributors to focus on specific tasks and makes the platform very simple for annotators to use. The platform deliberately offers few options in order to keep tasks as easy as possible for users. Once the annotations have been completed and validated, the resulting datasets can easily be sent to Arkindex to train models for tasks such as HTR, OCR, NER or document classification.
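To make the idea of a small, fixed set of modes more concrete, here is a minimal sketch of how a campaign's tasks could be modeled. It is purely illustrative and is not Callico's actual data model or API: the names `CampaignMode`, `Task` and `ready_for_export` are hypothetical, and the only facts taken from the description above are the five modes and the rule that annotations are validated before being used for training.

```python
from dataclasses import dataclass
from enum import Enum


class CampaignMode(Enum):
    """The five annotation modes described above (identifiers are hypothetical)."""
    SEGMENTATION = "segmentation"
    TRANSCRIPTION = "transcription"
    NAMED_ENTITIES = "named_entities"
    METADATA = "metadata"
    CLASSIFICATION = "classification"


@dataclass
class Task:
    """One unit of work assigned to one annotator (hypothetical structure)."""
    document_id: str
    mode: CampaignMode
    annotator: str
    validated: bool = False


def ready_for_export(tasks):
    """Keep only validated tasks, since only those should feed a training dataset."""
    return [task for task in tasks if task.validated]


tasks = [
    Task("doc-1", CampaignMode.TRANSCRIPTION, "alice", validated=True),
    Task("doc-2", CampaignMode.SEGMENTATION, "bob"),
]
exportable = ready_for_export(tasks)
```

Restricting each task to exactly one mode mirrors the design choice described above: an annotator only ever has one kind of action to perform on a given task.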

A tool with a lot more features on the roadmap

The features described above are only a starting point: Teklia's ambitions for Callico go considerably further.

Callico is currently in beta testing with several of Teklia's partners, including the National Library of Norway, Bibliothèque de la Sorbonne (BIS), the National Archives of the Netherlands and Groupe Ouest-France.

Going Open-Source

Our goal is to release a stable version of Callico as open source. That way, the solution will be available to anybody who wants to create annotation campaigns. Soon it will also be possible to sign up for free and become an annotator.

On-boarding for volunteers

With the move to open source in mind, Teklia's team is considering ways to guide and select the annotators who want to contribute, for instance by creating tutorials and by validating volunteers through annotation tests. If an applicant's responses to the tests are correct, they would then be approved to work on the tasks they are assigned later.
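As a rough illustration of what such a validation test could look like, the sketch below compares a volunteer's answers with reference answers and approves the volunteer only if all of them match. This is an assumption about how the idea might be implemented, not a description of Callico: the function names and the whitespace normalization are hypothetical choices made for the example.

```python
def normalize(text: str) -> str:
    """Collapse repeated whitespace so trivial spacing differences are ignored (assumed rule)."""
    return " ".join(text.split())


def passes_annotation_test(responses: dict, expected: dict) -> bool:
    """Return True only if every expected answer was given correctly.

    `responses` and `expected` both map a test item identifier to a text answer.
    A missing response counts as an incorrect one.
    """
    return all(
        normalize(responses.get(item, "")) == normalize(answer)
        for item, answer in expected.items()
    )


# Hypothetical reference answers for a two-item transcription test.
expected = {"line-1": "Paris", "line-2": "Marie Dupont"}
approved = passes_annotation_test(
    {"line-1": " Paris ", "line-2": "Marie  Dupont"}, expected
)
```

A real onboarding flow might tolerate a small error rate instead of requiring every answer to be correct; the strict rule here simply follows the wording above.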

An annotation tool for IIIF

For now, Callico is linked exclusively to Arkindex for importing documents and exporting data. Teklia’s goal, however, is to make it possible to import from IIIF collections, thereby expanding the scope of work to larger documents and to a wider variety of sectors. To make this possible, various features must be developed, including additional annotation modes, since the range of documents to annotate will grow with the use of IIIF collections.

Image credits:

  • Bibliothèque nationale de France, ark:/12148/bpt6k638782t
  • Bibliothèque interuniversitaire de la Sorbonne, Archives parlementaires
  • National Library of Norway, NorHand dataset
  • National Archives of the Netherlands/Nationaal Archief