We are pleased to announce the release of Arkindex version 0.14.3, now enabled on our demo instance.

Performance

The main focus of this release was stability and performance, especially making workflows a lot faster.

Listing elements for a process

The ListProcessElements endpoint was first implemented in the last release, and we have kept working on it in this one, making it up to 30 times faster than before. This will speed up the first task of all new processes working on elements: the init_elements task directly uses that endpoint.

We have heavily optimized the SQL queries, reduced the amount of data loaded, skipped many intermediary requests and result parsing steps, and built a much more resilient and faster pagination algorithm.

This whole optimization process will be repeated on other critical endpoints until we have a really stable and efficient platform.

Deletion

We've enhanced the element deletion capability of Arkindex: you can now delete folders and their sub-elements.

:warning: If you delete a folder, all its children will also be deleted immediately.

Some large folders may fail to delete for now, but we are actively working on an asynchronous solution to allow deletion of any dataset within Arkindex.

List newly created elements

An element's details page will now automatically reload its sub-element list when new sub-elements are created by a workflow.

For example, when you navigate from a workflow status toward an element, if the workflow has created many lines, they will now appear without refreshing the page.

Other achievements

  • Optimize the ListWorkerVersions endpoint (5x factor)
  • Replace the expensive children_count with has_children in ListChildrenElements: this will greatly speed up the tree displayed on an element's details page.
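
To illustrate why a has_children flag is cheaper than a children_count, here is a minimal sketch using SQLite from the Python standard library. The table and column names (element_path, parent_id, child_id) are hypothetical and do not reflect the actual Arkindex schema; the point is only that an existence check can stop at the first matching row, while a count has to visit every child.

```python
import sqlite3

# Hypothetical schema for illustration only: one row per parent/child link.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE element_path (parent_id TEXT, child_id TEXT)")
conn.execute("CREATE INDEX idx_parent ON element_path (parent_id)")
conn.executemany(
    "INSERT INTO element_path VALUES (?, ?)",
    [("folder", f"page-{i}") for i in range(1000)],
)


def children_count(parent_id):
    # Counts every child row: the cost grows with the number of children.
    row = conn.execute(
        "SELECT COUNT(*) FROM element_path WHERE parent_id = ?", (parent_id,)
    ).fetchone()
    return row[0]


def has_children(parent_id):
    # EXISTS only needs one matching row to answer.
    row = conn.execute(
        "SELECT EXISTS(SELECT 1 FROM element_path WHERE parent_id = ?)",
        (parent_id,),
    ).fetchone()
    return bool(row[0])
```

A tree view usually only needs to know whether to draw an "expand" control next to a node, so the boolean is enough and the exact count can be skipped.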

API endpoints

New endpoint to create transcriptions

A new endpoint is available to create a list of transcriptions in one call: CreateTranscriptions.

You can create transcriptions on multiple (sub-)elements at once, making your Machine Learning Worker a lot faster.

:warning: It can only be used by workers (as it requires a worker_version ID).

Breaking changes

  1. Arkindex does not support the source field anymore on ML Results creation endpoints (in favor of the new worker_version).
  2. The ListElements endpoint now requires a corpus. Its URL is now /api/v1/corpus/{corpus}/elements/ instead of /api/v1/elements/?corpus={corpus}.
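
For scripts that build the ListElements URL by hand, the second breaking change could be handled with a small helper like the sketch below. The two path templates come from the change described above; the function names are hypothetical and are not part of any Arkindex client.

```python
from urllib.parse import quote


def list_elements_path(corpus_id):
    """Build the ListElements path in the new corpus-scoped form."""
    if not corpus_id:
        # The corpus is now mandatory, so fail early rather than build a bad URL.
        raise ValueError("a corpus ID is required to list elements")
    return f"/api/v1/corpus/{quote(str(corpus_id), safe='')}/elements/"


def legacy_list_elements_path(corpus_id):
    """Former query-string form, shown for comparison only."""
    return f"/api/v1/elements/?corpus={quote(str(corpus_id), safe='')}"
```

Centralizing the path in one function means callers only need to change in a single place if the route moves again.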

Fixes

  • We have fixed a number of stale read issues (due to PostgreSQL clustering), mainly affecting machine learning result creation.
  • Avoid duplicating a process' elements in the pagination
  • Avoid duplicated processes when filtering with unscheduled state

Features

Project creation

We've changed the project creation visibility state: only Arkindex instance administrators can create public projects now.

This means that new projects created by users will only be visible to them.

In the next release, we'll introduce a new feature to share corpora across users on an instance.

File imports

The file import process has been updated:

  • If you import a PDF file with multiple pages, a folder with its name will be created, containing all of its pages;
  • If you import images or a single-page PDF, no folder will be created: each file will simply be imported as a Page, with the name of the original file;
  • The URL import is now available again;
  • You can still select element types, but they are set in the Advanced settings collapsed section.
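
The folder rule in the list above can be summarized as a small decision function. This is only a sketch of the described behavior, not the actual import code: ImportPlan, plan_import and the page_count argument are hypothetical names, and it ignores the custom element types that can be chosen in the Advanced settings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ImportPlan:
    folder_name: Optional[str]  # None when no folder is created
    page_count: int  # number of Page elements to create


def plan_import(filename, page_count=1):
    """Decide what an imported file turns into, following the rules above.

    `page_count` is only meaningful for PDF files; images always count as one.
    """
    is_pdf = filename.lower().endswith(".pdf")
    if is_pdf and page_count > 1:
        # Multi-page PDF: one folder named after the file, holding every page.
        return ImportPlan(folder_name=filename, page_count=page_count)
    # Image or single-page PDF: a single Page named after the file, no folder.
    return ImportPlan(folder_name=None, page_count=1)
```

Whether the folder name keeps the file extension is not specified here, so the sketch simply reuses the file name unchanged.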

Transcriptions

All transcriptions are now visible in the details panel. Transcriptions are no longer available from the tree on the left side of the screen: from now on, they are all set on an element.

As a lot of new elements are automatically created to support the transcriptions, we've increased the number of automatically loaded sub-elements.

<embed alt="transcriptions" embedtype="image" format="fullwidth" id="27"/>

Workflows

Secrets

The workflow system now supports storing secret values (such as third-party credentials). Those secret values are set by the Arkindex instance administrators, and are only usable by the tasks. No user can ever see those confidential payloads.

Machine Learning engineers can easily require secret values through a simple declaration in their worker's .arkindex.yml configuration:

workers:
  - name: My Worker
    secrets:
      - project/X/google.json

This configuration will automatically retrieve the secret named project/X/google.json and make it available to the worker (so it can connect to a third-party service).

More information will be available soon on this website to implement that feature in your workers.

Garbage collector

The agent running our tasks now supports an automated garbage collector that deletes unused Docker payloads (containers and images that are no longer used). This allows us to keep more of our system's disk space available.

Base worker

A few updates are available in version 0.1.9 of arkindex-base-worker:

  • Reload known ML classes when an error is received on creation,
  • Add helpers to retrieve and cache worker versions.

Fixes

  • Fix a bug for complex graphs (Deduplicate parents),
  • Prevent UTF-8 decoding errors on task logs,
  • Fix a bug to prevent infinitely pending tasks,
  • Fix a bug that prevented some Stop actions from taking effect, leaving unwanted tasks running.

API client

Our open source Python API client has been updated to version 1.0.4, and now offers resilient pagination support (retries on errors, tolerance of backend downtime, and configurable handling of missing data).
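
To give an idea of what resilient pagination means in practice, here is a minimal, generic sketch. It is not the API client's implementation: paginate, fetch_page, total_pages, retries, delay and allow_missing_data are hypothetical names chosen for this example, and fetch_page stands in for a real HTTP call.

```python
import time


def paginate(fetch_page, total_pages, retries=3, delay=0.0, allow_missing_data=False):
    """Yield results page by page, retrying each page when it fails.

    `fetch_page(n)` is a hypothetical callable returning the list of results
    for page `n`, and raising an exception when the backend is unavailable.
    """
    for page in range(1, total_pages + 1):
        for attempt in range(1, retries + 1):
            try:
                results = fetch_page(page)
            except Exception:
                if attempt < retries:
                    # Wait a little longer after each failure, then retry.
                    time.sleep(delay * attempt)
                    continue
                if not allow_missing_data:
                    raise
                # Give up on this page and move on: its results are missing.
                break
            else:
                yield from results
                break
```

With allow_missing_data left at False, a page that keeps failing stops the iteration with the last error. Setting it to True trades completeness for availability, which may be acceptable for exploratory scripts but not for exports that must be exhaustive.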

More information is available on the PyPI homepage of the project.