We're proud to announce that our Python client for the Transkribus API, transkribus-client, has been open-sourced under an MIT license for personal and commercial use.

This Python 3.6+ client interacts with your Transkribus account's REST API, enabling you to browse collections, documents, pages, and transcripts, download images, and parse PAGE XML files. It has been powering the Transkribus import for Arkindex for over a year now, so it's safe to say the client is well tested. By open-sourcing this client, we hope to give back to the open-source community and make it even easier to browse, export, and parse existing Transkribus collections!

Why we wrote transkribus-client

At Teklia, we write software that uses machine learning and natural language processing to automatically convert images of handwritten or printed documents into digital, searchable media. Our flagship product, Arkindex, was designed to hold collections of those documents and make them easier to search and index.

To develop accurate automatic transcription models, we need to train them against existing transcriptions. Our clients have sometimes already used tools such as Transkribus to manually transcribe a sample of their documents, and we need to import these transcriptions to train and fine-tune our models.

A Transkribus document imported into Arkindex

Because of the volume of data we were importing from Transkribus, we developed a Python client that works with the rest of the Arkindex suite to import Transkribus data for use with our software. After writing it, we realized it might be more generally useful to the archival and digitization community, so we began the work of converting the code into a form suitable for open-sourcing.

Using transkribus-client

You can use the transkribus-client library to browse collections, documents, and individual pages from the Transkribus API; export a collection to a zip file for download; or parse an already-downloaded PAGE XML file to further process its contents. You can hard-code your Transkribus credentials in the calls to transkribus-client or read them from environment variables, which keeps them out of your code when running on a server. It should also be straightforward to write a command-line tool or a more full-featured client on top of the library, for batch or other processing of a Transkribus collection.
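To give a sense of what further processing of a downloaded PAGE XML file can look like, here is a minimal sketch that uses only Python's standard library. It does not show the transkribus-client API (see the README for that). The function names, the sample document, and the choice of the 2013-07-15 PAGE schema namespace are assumptions made for illustration, so adjust them to match the files you actually export.

```python
import xml.etree.ElementTree as ET

# Namespace of the 2013-07-15 PAGE schema. Other schema versions use a
# different date suffix, so this is an assumption about the input file.
PAGE_NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

# A tiny hand-written PAGE document, used here only as sample input.
SAMPLE = """\
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="page_0001.jpg" imageWidth="2000" imageHeight="3000">
    <TextRegion id="r1">
      <Coords points="100,100 1900,100 1900,400 100,400"/>
      <TextLine id="r1l1">
        <Coords points="100,100 1900,100 1900,200 100,200"/>
        <TextEquiv><Unicode>First line of text</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="r1l2">
        <Coords points="100,220 1900,220 1900,320 100,320"/>
        <TextEquiv><Unicode>Second line of text</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
"""


def parse_points(points):
    """Turn a PAGE 'points' attribute into a list of (x, y) integer tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def extract_lines(page_xml):
    """Return the id, polygon, and text of every TextLine in a PAGE document."""
    root = ET.fromstring(page_xml)
    lines = []
    for line in root.iterfind(".//pc:TextLine", PAGE_NS):
        # Only direct children are looked up, so region-level coordinates
        # and word-level text are not picked up by mistake.
        coords = line.find("pc:Coords", PAGE_NS)
        text = line.find("pc:TextEquiv/pc:Unicode", PAGE_NS)
        lines.append({
            "id": line.get("id"),
            "polygon": parse_points(coords.get("points")) if coords is not None else [],
            "text": (text.text or "") if text is not None else "",
        })
    return lines


if __name__ == "__main__":
    for line in extract_lines(SAMPLE):
        print(line["id"], line["text"])
```

To run this against a real export, read the file contents and pass them to extract_lines in place of SAMPLE.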

See the README provided with the project for more details on its use and configuration, as well as how to build, test, and contribute back.

Getting transkribus-client

The client and its library are available on PyPI for easy download, or you can browse the source on GitLab. Instructions for building, running, and hacking on transkribus-client are included in the project's README.

If you want to get involved with the client, we welcome bug reports, suggestions, and patches! We hope to release more of our tools in the future to give back to the FOSS community, which makes so much of our work possible.