We are happy to announce that a new Arkindex release is available.

Beyond the usual new features, improvements and bugfixes, this release makes Arkindex open-source! Two editions of Arkindex are now available, under different licenses: the community edition, which is the open-source version, available to anyone for open-source projects, and the enterprise edition, available only to paying customers, with extended features and support. Find out more in the dedicated blogpost.

You can explore Arkindex and try out the newest features on our demo instance, demo.arkindex.org.

Machine learning training on Arkindex

New documentation on how to train machine learning models using Arkindex is now available in the Arkindex documentation.

The handling of dataset sets in Arkindex has been overhauled: there is now a dedicated dataset set model in the backend, instead of a field containing a list of names on the dataset objects. The only visible change for users is in dataset process creation: you no longer select a dataset and then tick checkboxes to pick its sets. Instead, you select a dataset and then choose its sets from a dropdown menu.

A new option to select all sets from a dataset at once has also been added.

Dataset set selection interface in dataset processes, with "all sets" option

When creating a dataset, you can now require that an element belongs to at most one set within the dataset, which prevents data leakage between sets.

Dataset creation modal, with set element uniqueness checkbox
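
To illustrate what the uniqueness constraint guards against, here is a minimal sketch that detects elements shared between sets. It does not use the Arkindex API: the `find_leaked_elements` helper, the set names and the element identifiers are all hypothetical, and a dataset is modelled as a plain dictionary.

```python
from collections import defaultdict


def find_leaked_elements(sets: dict[str, list[str]]) -> dict[str, list[str]]:
    """Return every element found in more than one set, with the sets containing it."""
    membership = defaultdict(set)
    for set_name, element_ids in sets.items():
        for element_id in element_ids:
            membership[element_id].add(set_name)
    return {
        element_id: sorted(names)
        for element_id, names in membership.items()
        if len(names) > 1
    }


# Hypothetical dataset: "page-3" is in both the training and the test set.
dataset = {
    "train": ["page-1", "page-2", "page-3"],
    "dev": ["page-4"],
    "test": ["page-3", "page-5"],
}
leaks = find_leaked_elements(dataset)
print(leaks)  # {'page-3': ['test', 'train']}
```

With the uniqueness option enabled, a dataset in this state should not be possible in the first place, so a check like this one would always return an empty result.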

Other updates regarding machine learning training on Arkindex include:

  • It is now possible to set any model version (from a model the user has access to) as the parent of another model version, to allow for the fine-tuning of existing models.
  • Values in user configuration fields of type list, and subtype string, are no longer being trimmed, to allow using whitespace characters as values (which is useful for some machine learning models).
  • The CreateDockerWorkerVersion API endpoint now validates that user configurations passed in its configuration field are valid, and especially checks that default values are of the correct types.
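
The last two points can be pictured with a small sketch of default value validation. This is not Arkindex's implementation: the field schema, the type names other than `list` and `string`, and the `validate_default` helper are assumptions made for the example.

```python
# Hypothetical mapping from configuration type names to Python types.
PYTHON_TYPES = {"string": str, "int": int, "float": float, "bool": bool}


def _matches(value, expected):
    # bool is a subclass of int in Python, so reject it explicitly for other types.
    if isinstance(value, bool) and expected is not bool:
        return False
    return isinstance(value, expected)


def validate_default(name, field):
    """Return a list of error messages for one field; the list is empty when valid."""
    if "default" not in field:
        return []
    default = field["default"]
    if field["type"] == "list":
        expected = PYTHON_TYPES[field["subtype"]]
        if not isinstance(default, list):
            return [f"{name}: default must be a list"]
        if not all(_matches(item, expected) for item in default):
            return [f"{name}: every default item must be of type {field['subtype']}"]
        return []
    if not _matches(default, PYTHON_TYPES[field["type"]]):
        return [f"{name}: default must be of type {field['type']}"]
    return []


user_configuration = {
    # Whitespace-only strings are kept as-is: values are not trimmed.
    "separators": {"type": "list", "subtype": "string", "default": [" ", "\n"]},
    # Invalid: the default is a string although the field is declared as an int.
    "batch_size": {"type": "int", "default": "8"},
}
errors = [
    error
    for name, field in user_configuration.items()
    for error in validate_default(name, field)
]
print(errors)  # ['batch_size: default must be of type int']
```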

Command Line Interface

The arkindex elements ml-split command now creates datasets and not folders, and uses the updated dataset set API endpoints (see above). See the command documentation for details.

The arkindex elements list command has been removed: it duplicated the CSV export while being slower and less robust, since it relied on API calls instead of an SQL export.

The Alto XML file import has been updated: it is now more robust and can publish more metadata, as well as confidence scores on transcriptions.

The METS file upload has been updated as well. The main change is that the arkindex upload mets command can now publish elements from related Alto XML files in addition to the structural information contained in the METS file, so there is no need to run the Alto XML import command separately.

Machine learning processes and results

It is now possible to delete machine learning results in a project by entering a worker run UUID in the worker results deletion modal, using the "Advanced mode".

Worker results deletion modal, advanced mode

You can also filter elements by worker run with the filter bar when browsing a project, then delete them with the "Delete filtered elements" action.

A new RestartTask API endpoint has been created, and the code used to restart tasks in a process is now cleaner: it clones the restarted task and moves any tasks that depended on it onto the cloned task, without modifying the original task at all.
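
The clone-and-reattach behaviour described above can be sketched on an in-memory task graph. This is only an illustration, not the backend code: the `Task` class, its `run` retry counter and the `restart_task` function are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # compare tasks by identity, not by field values
class Task:
    name: str
    run: int = 0  # hypothetical retry counter
    state: str = "unscheduled"
    parents: list["Task"] = field(default_factory=list)


def restart_task(tasks: list[Task], original: Task) -> Task:
    """Clone `original`, then re-point the tasks that depended on it to the clone."""
    clone = Task(name=original.name, run=original.run + 1, parents=list(original.parents))
    for task in tasks:
        if original in task.parents:
            task.parents = [clone if parent is original else parent for parent in task.parents]
    tasks.append(clone)
    return clone


import_task = Task("import", state="completed")
ocr = Task("ocr", state="failed", parents=[import_task])
export = Task("export", parents=[ocr])
tasks = [import_task, ocr, export]

clone = restart_task(tasks, ocr)
print(export.parents[0] is clone)  # True
print(ocr.state, ocr.run)  # failed 0
```

The original task keeps its state and its parents, so the history of the failed run stays intact, while the dependent task now waits on the clone.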

When a worker supports GPU execution and is used in a process where the user enabled the GPU option, the corresponding task now requires a GPU only if one is available.

Process advanced settings, with GPU usage option
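
As a rough sketch, the GPU rule above amounts to a conjunction of three conditions. The function and parameter names are hypothetical, and how Arkindex actually determines that a GPU is available is not covered here.

```python
def task_requires_gpu(worker_supports_gpu: bool, process_use_gpu: bool, gpu_available: bool) -> bool:
    """A task requires a GPU only when all three conditions hold."""
    return worker_supports_gpu and process_use_gpu and gpu_available


# GPU option enabled on a GPU-capable worker, but no GPU available:
# the task does not require a GPU.
print(task_requires_gpu(True, True, False))  # False
print(task_requires_gpu(True, True, True))   # True
```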

Thumbnails generation

As part of our ongoing effort to turn some internal Arkindex tasks into workers, thumbnail generation for the elements of a project is now a worker that can be used in a process like any other. You will find it in the list of available workers when configuring a process. As a result, the "generate thumbnails" checkbox in the advanced settings of a process has been removed.

Process using the thumbnails generation worker

Project export from a read-only database

To make exporting large projects possible, you can now select a source database when creating an export, so that data is exported from a database other than the default one.

Exporting a large number of elements takes time, and large project exports could fail when changes were made to the project while the export was running. You can now set up a snapshot database, a copy of your current database that is refreshed once a day for example, and select this read-only database as the data source when creating a new export. The export then runs without issue even if elements are added to or removed from the project in the meantime.

This feature is only available in the Enterprise Edition of Arkindex.

Source selection in the project exports management modal

Cleanup and bugfixes

  • When worker results are deleted using a model version filter, corresponding worker activities are now correctly destroyed as well.
  • As part of the removal of the obsolete entity roles from Arkindex, the display of "entity links" in the frontend has been removed.
  • When using a password reset link that has expired, either because it is too old or because it has already been used, the frontend now displays a notification upon password reset form submission, instead of behaving as if the password reset had worked.
  • As part of our ongoing work to remove Git support from Arkindex, the "repository" mode for processes no longer exists, and Docker socket access support has been removed.