We have been working all summer to bring you one of the most important updates to Arkindex!
Machine Learning workflows with Git
We are proud to announce the release of our new system to run Machine Learning workflows using Arkindex.
Since the initial announcement in release 0.13.2, we’ve been adding more and more features to that system in order to make it usable for our ML researchers.
It’s now possible for any Arkindex user to:
- connect your Git repository with any ML tool, as long as it’s hosted on Gitlab
- store your models using Git Large File Storage
- build your Docker images to process Arkindex elements, inside Arkindex (no dependency on a remote build or storage system)
- share your workers with other users of the Arkindex instance
- get immediate updates after a simple git push from your computer
We’ve also added a required worker_version field for creating new ML results (elements, transcriptions, classifications and entities). This allows us to efficiently track which worker created those results, and to display that information in the frontend.
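To illustrate what that tracking makes possible, here is a minimal sketch that groups ML results by the worker version that produced them. Only the worker_version field comes from this release; the other keys, the sample values and the helper name group_by_worker_version are assumptions made for the example, not the actual Arkindex payloads.

```python
from collections import defaultdict


def group_by_worker_version(results):
    """Group ML results by the worker version that created them."""
    grouped = defaultdict(list)
    for result in results:
        # "worker_version" is the required field described above;
        # every other key in these dicts is illustrative only.
        grouped[result["worker_version"]].append(result)
    return dict(grouped)


# Hypothetical results, as a frontend or script might receive them.
results = [
    {"kind": "transcription", "text": "Bonjour", "worker_version": "version-a"},
    {"kind": "classification", "label": "letter", "worker_version": "version-b"},
    {"kind": "transcription", "text": "Paris", "worker_version": "version-a"},
]

grouped = group_by_worker_version(results)
for version, items in grouped.items():
    print(version, len(items))
```

Because every result carries its worker version, this kind of grouping needs no extra lookup to know which worker produced what.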
If you want to start building a new worker, you can use our open source base worker, which gives you a nice template with:
- a Demo worker that uses our helper library to connect to Arkindex,
- a Dockerfile template to build your tool,
- Gitlab CI configured with linting and unit tests.
This whole new system is still in beta; more documentation will be available for the next release.
Easy Transkribus import
It’s now possible to easily import your Transkribus collections into Arkindex!
Arkindex will create a new Project for a Transkribus collection and will import:
- pages and folder hierarchy
- transcriptions as Paragraph and Line elements
- entities as Person, Location, and Roles between them (only for French names right now)
To learn more, read our official documentation on this website.
Update on the Transcription system
We are simplifying our transcription system over several consecutive releases. We want to support multiple transcriptions per element, but without any zone or types (they’ll inherit the data from the Element).
This is the first release with changes that may impact you:
- We added a ListElementTranscriptions endpoint in order to list all the child transcriptions of an element (so you’ll easily get all the paragraphs & lines from a page)
- The CreateElementTranscriptions endpoint now outputs the created transcriptions
- The whole system still supports both transcription styles (with or without zones)
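As a rough picture of where this is heading, the sketch below models transcriptions that carry only text, with the type and zone read from the element they belong to, and a helper that collects all child transcriptions of a page. It is an in-memory illustration, not the Arkindex API: the class names, fields, zone strings and the list_child_transcriptions helper are all assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Transcription:
    # No zone and no type here: both are inherited from the element.
    text: str


@dataclass
class Element:
    type: str
    zone: str  # illustrative polygon string, not a real Arkindex format
    transcriptions: List[Transcription] = field(default_factory=list)
    children: List["Element"] = field(default_factory=list)


def list_child_transcriptions(element):
    """Return (element type, zone, text) for every transcription below element."""
    rows = []
    for child in element.children:
        for transcription in child.transcriptions:
            rows.append((child.type, child.zone, transcription.text))
        # Recurse so that lines nested inside paragraphs are included too.
        rows.extend(list_child_transcriptions(child))
    return rows


# A page with one paragraph made of two lines; the second line has two
# transcriptions, since an element may hold several of them.
line_1 = Element("line", "0,0 100,10", [Transcription("Hello")])
line_2 = Element("line", "0,10 100,20", [Transcription("world"), Transcription("World")])
paragraph = Element("paragraph", "0,0 100,20", [Transcription("Hello world")], [line_1, line_2])
page = Element("page", "0,0 100,100", children=[paragraph])

rows = list_child_transcriptions(page)
for row in rows:
    print(row)
```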
Element types management
You can now manage your Element types in the frontend. This new interface is accessible on the Edit page of every corpus you manage.
You’ll be able to:
- rename existing types
- update the folder state of a type
- add new types
Django 3.1 update
We have upgraded from Django 2.2 to Django 3.1, bringing us bugfixes and some nice features (native ASGI support, JSON fields in all databases).
- Bulk transcriptions endpoints are not available anymore, use
- It’s not possible to start a git import from the frontend: with the better git revision support available, we only rely on automatic Gitlab hooks.
- The UploadDataFile endpoint is deprecated; you’ll need to use CreateDataFile and the provided S3 URL to upload your file directly (this brings much better overall performance).
- The PageXmlTranscriptionsImport endpoint is deprecated, as our Transkribus import directly creates the elements using standard endpoints.
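The two-step upload that replaces UploadDataFile could be sketched as follows: first create the DataFile, then send the bytes straight to the URL that comes back. This is only a sketch under assumptions: the create_data_file callable stands in for whatever client you use to call CreateDataFile, and the id and s3_put_url response keys are hypothetical names chosen for the example.

```python
import urllib.request


def put_bytes(url, payload):
    """Send payload to a pre-signed URL with an HTTP PUT and return the status code."""
    request = urllib.request.Request(url, data=payload, method="PUT")
    with urllib.request.urlopen(request) as response:
        return response.status


def upload_data_file(create_data_file, name, payload, put=put_bytes):
    """Create a DataFile, then upload its content directly to storage."""
    # Step 1: register the file. `create_data_file` is a stand-in for a call
    # to the CreateDataFile endpoint; its arguments are assumptions.
    datafile = create_data_file(name=name, size=len(payload))
    # Step 2: upload the bytes to the provided URL, bypassing the backend.
    # "s3_put_url" and "id" are hypothetical response keys.
    status = put(datafile["s3_put_url"], payload)
    return datafile["id"], status
```

Passing the uploader as the put argument keeps the function easy to try without network access, for instance with a fake that just records what it was asked to send.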
We also fixed a lot of backend and frontend bugs, too many to list them all, and upgraded some libraries in order to avoid security issues.