We worked throughout the summer to bring you one of the most important Arkindex updates to date!

Machine Learning workflows with Git

We are proud to announce the release of our new system to run Machine Learning workflows using Arkindex.

Since the initial announcement in release 0.13.2, we've been adding more and more features to that system in order to make it usable for our ML researchers.

As an Arkindex user, you can now:

  1. connect your Git repository with any ML tool, as long as the repository is hosted on GitLab
  2. store your models using Git Large File Storage
  3. build your Docker images to process Arkindex elements, inside Arkindex (no dependency on a remote build or storage system)
  4. share your workers with other users of the Arkindex instance
  5. get immediate updates after a simple git push from your computer

We've also added a required worker_version field for creating new ML results (elements, transcriptions, classifications and entities). This allows us to efficiently track which worker created those results and to display that information in the frontend.
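As a rough illustration of what this means for a client, here is a minimal Python sketch that refuses to build an ML result payload without a worker version. Only the worker_version field name comes from this release; the helper name ml_result_payload, the other payload keys, and the identifier value are hypothetical and do not describe the actual Arkindex API schema.

```python
from typing import Any, Dict


def ml_result_payload(data: Dict[str, Any], worker_version: str) -> Dict[str, Any]:
    """Return a copy of `data` tagged with the worker version that produced it.

    Hypothetical helper: it only mirrors the rule that ML results
    must carry a worker_version.
    """
    if not worker_version:
        raise ValueError("worker_version is required to create ML results")
    return {**data, "worker_version": worker_version}


# The "text" key and the identifier below are made up for the example.
payload = ml_result_payload({"text": "Dear Sir,"}, "hypothetical-version-id")
print(payload)
```

Keeping the check in one helper means every result a worker produces is tagged the same way before it is sent.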

To start building a new worker, you can use our open-source base worker, which gives you a template that includes:

  • a Demo worker that uses our helper library to connect to Arkindex
  • a Dockerfile template to build your tool
  • GitLab CI configured with linting and unit tests

This new system is still in beta; more documentation will be available in the next release.

Easy Transkribus import

It's now possible to easily import your Transkribus collections into Arkindex!

Arkindex will create a new Project for a Transkribus collection and will import:

  • pages and folder hierarchy
  • transcriptions as Paragraph and Line elements
  • entities as Person, Location, and Roles between them (only for French names right now)

To learn more, read our official documentation.

Update on the Transcription system

We are simplifying our transcription system over several consecutive releases. We want to support multiple transcriptions per element, but without their own zones or types (they'll inherit this data from the Element).

This is the first release with changes that may impact you:

  • We added a recursive parameter to ListElementTranscriptions to list the transcriptions of all of an element's children (so you'll easily get all the paragraphs & lines from a page)
  • The CreateElementTranscriptions endpoint now outputs the created transcriptions
  • The whole system still supports both transcription styles (with or without zones)
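To make the effect of the recursive parameter concrete, here is a small in-memory Python sketch. It does not call Arkindex: the Element class and the list_transcriptions function are toy stand-ins, and both the ordering and the inclusion of the element's own transcriptions in recursive mode are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Element:
    """Toy stand-in for an Arkindex element (not the real model)."""
    name: str
    transcriptions: List[str] = field(default_factory=list)
    children: List["Element"] = field(default_factory=list)


def list_transcriptions(element: Element, recursive: bool = False) -> Iterator[str]:
    """Yield the element's own transcriptions, then its descendants' if recursive."""
    yield from element.transcriptions
    if recursive:
        for child in element.children:
            yield from list_transcriptions(child, recursive=True)


# A page containing one paragraph, itself containing two lines.
line_1 = Element("line 1", ["Dear Sir,"])
line_2 = Element("line 2", ["I write to you"])
paragraph = Element("paragraph", ["Dear Sir, I write to you"], [line_1, line_2])
page = Element("page", [], [paragraph])

print(list(list_transcriptions(page)))                  # nothing on the page itself
print(list(list_transcriptions(page, recursive=True)))  # paragraph and line texts
```

Without the recursive flag, a client would have to walk the page's children itself and issue one request per paragraph and line.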

Element types management

You can now manage your Element types in the frontend. This new interface is accessible on the Edit page of every corpus you manage.

<embed alt="Element type management interface" embedtype="image" format="fullwidth" id="105"/>

You'll be able to:

  • rename existing types
  • update the folder state of a type
  • add new types

Miscellaneous changes

Django 3.1 update

We have upgraded from Django 2.2 to Django 3.1, which brings bugfixes and some nice features (native ASGI support, JSON fields on all databases).

Deprecations:

  • Bulk transcriptions endpoints are no longer available; use CreateElementTranscriptions instead.
  • It's no longer possible to start a Git import from the frontend: with the improved Git revision support, we now rely only on automatic GitLab hooks.
  • The UploadDataFile endpoint is deprecated: use CreateDataFile, then upload your file directly to the S3 URL it provides (this gives much better overall performance).
  • The PageXmlTranscriptionsImport endpoint is deprecated, as our Transkribus import now creates the elements directly using standard endpoints.
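For the move from UploadDataFile to CreateDataFile, the second step (sending the bytes straight to storage) might look like the Python sketch below. It uses only the standard library and prepares the request without sending it. The build_s3_upload helper, the s3_put_url field name, and the example URL are assumptions made for illustration; check the CreateDataFile response on your instance for the actual field.

```python
import urllib.request


def build_s3_upload(s3_put_url: str, content: bytes, content_type: str) -> urllib.request.Request:
    """Prepare (but do not send) a direct PUT of the file bytes to storage."""
    return urllib.request.Request(
        s3_put_url,
        data=content,
        method="PUT",
        headers={"Content-Type": content_type},
    )


# Hypothetical CreateDataFile response; the field name and URL are assumptions.
created = {
    "id": "hypothetical-id",
    "s3_put_url": "https://s3.example.com/bucket/file?signature=abc",
}
request = build_s3_upload(created["s3_put_url"], b"hello", "text/plain")
print(request.get_method(), request.full_url)

# To actually upload, send the prepared request:
# urllib.request.urlopen(request)
```

Because the file goes directly to storage, the Arkindex backend no longer has to relay the upload, which is presumably where the performance gain comes from.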

Bugfixes

There are too many to list, but we fixed a lot of backend and frontend bugs, and upgraded some libraries to avoid security issues.