Once you've collected a number of documents into your Arkindex project, exporting that data into another format for sharing or publishing can be very helpful. The SQLite database generated by Arkindex's Export button is flexible and lets you build views on your data beyond what the Arkindex API offers, but it's still not the most convenient option for common use cases.
That's why, in version 0.2.0 of the open source Arkindex command-line tool, arkindex-cli, we've included a new arkindex export command. This command allows you to convert an Arkindex export to a series of PDF or ALTO files.
arkindex export CLI tool
Assuming that pip is installed on your system and that the defaults fit your project, using the new Arkindex export feature involves just a few steps:
- Export a project from Arkindex to an SQLite database, and download it.
- Install the CLI, and connect to your Arkindex instance:
pip install arkindex-cli[export]
arkindex login --host arkindex.example.com
- Finally, run the exports:
arkindex export /path/to/export.sqlite pdf
arkindex export /path/to/export.sqlite alto
You can run arkindex export --help to check out the various settings available for fine-tuning an export. For more details, see the export command documentation.
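If you run these exports regularly, the steps above can be wrapped in a small script. The following is a minimal Python sketch, not part of arkindex-cli: the helper names (list_tables, export_commands, run_exports) are hypothetical, and it only relies on the two command lines shown above. It first checks that the downloaded file opens as an SQLite database, then runs each export in turn.

```python
import sqlite3
import subprocess
from contextlib import closing
from pathlib import Path


def list_tables(export_path):
    """Return the names of the tables found in an SQLite file, sorted."""
    # Open read-only so the downloaded export is never modified.
    uri = Path(export_path).resolve().as_uri() + "?mode=ro"
    with closing(sqlite3.connect(uri, uri=True)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


def export_commands(export_path, formats=("pdf", "alto")):
    """Build one `arkindex export` command line per output format."""
    return [["arkindex", "export", str(export_path), fmt] for fmt in formats]


def run_exports(export_path):
    """Check that the export is readable, then run each export command."""
    if not list_tables(export_path):
        raise ValueError(f"{export_path} contains no tables")
    for command in export_commands(export_path):
        # check=True stops at the first export that exits with an error.
        subprocess.run(command, check=True)
```

Calling run_exports("/path/to/export.sqlite") would then produce the PDF and ALTO outputs in turn, provided the arkindex command is installed and on your PATH.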
The PDF export will generate one PDF file for each folder element, containing all of its pages and the recognized text lines. The text lines are selectable and searchable on each page using standard PDF tools.
With this enhanced PDF export feature, the common PDF import-OCR-export workflow (importing a PDF, performing text recognition on it, and exporting a PDF with selectable text) becomes extremely simple for Arkindex users.
For greater visibility of the identified lines of text in the exported PDF, run the export with the --debug argument. The resulting PDF will highlight the text lines.
The ALTO export includes as many ALTO XML tags as are compatible with the Arkindex structure, including which workers were used to transcribe each line and the confidence scores they reported.
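To give an idea of how such a file can be consumed, here is a minimal Python sketch that reads the text and confidence of each line from an ALTO document. It assumes the export stores transcriptions in the standard ALTO String element, with the text in CONTENT and the confidence in WC; this post does not confirm which attributes Arkindex actually writes, and the SAMPLE document below is hand-written for illustration, not real export output.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any XML namespace prefix from a tag name."""
    return tag.rsplit("}", 1)[-1]


def read_lines(alto_xml):
    """Return a (text, confidences) pair for each TextLine in an ALTO document."""
    root = ET.fromstring(alto_xml)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        strings = [child for child in element if local_name(child.tag) == "String"]
        text = " ".join(s.get("CONTENT", "") for s in strings)
        # WC is optional in ALTO, so skip strings that do not carry it.
        confidences = [float(s.get("WC")) for s in strings if s.get("WC") is not None]
        lines.append((text, confidences))
    return lines


# Hand-written sample, assumed to resemble an exported file.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="Hello world" WC="0.93"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

print(read_lines(SAMPLE))
```

Matching on the local tag name keeps the sketch independent of the ALTO schema version declared in the file's namespace.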
Teklia's commitment to open source
Teklia is committed to contributing back to the open source and academic community that has supported and collaborated with us in our work. To that end, the Arkindex CLI is open source. You can browse the code, report bugs, or submit patches on GitLab.