Since its founding, Teklia has worked with actors in the research community. In doing so, we've worked a good deal with IIIF servers, which are frequently deployed to manage archival collections and old documents.
The International Image Interoperability Framework (IIIF) specification ensures interoperability between archival image servers and various external tools (e.g. to visualize images), as well as among existing servers. This interoperability allowed us to run our machine learning models on images hosted in an external IIIF server.
We have participated in projects involving up to several million digitized documents, mainly for Handwritten Text Recognition (HTR) and Named Entity Recognition (NER). To handle this much data, we developed a tool that can supervise the processing of large image datasets and browse the extracted information to draw conclusions from the data. This tool is named Arkindex and is available for a trial at https://demo.arkindex.org/.
In machine-learning workflows, it's common to retrieve many images from a server at a variety of resolutions, with scaling, cropping, and other processing applied to the image. Sometimes only the image metadata is needed. The IIIF specification supports all of these options, including downloading only a region of an image. These operations can put a high load on an IIIF-compliant server, especially at the volume of images required in a machine-learning environment.
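To make these options concrete, here is a minimal sketch of how such requests can be expressed as IIIF Image API URLs, which follow the pattern `{base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}`, with metadata served from `{base}/{identifier}/info.json`. The helper names, the base URL, and the image identifier below are hypothetical and only serve the illustration; this is not the code we use in Arkindex.

```python
from urllib.parse import quote


def iiif_image_url(base_url, identifier, region="full", size="max",
                   rotation="0", quality="default", fmt="jpg"):
    """Build an IIIF Image API request URL (hypothetical helper)."""
    # The identifier must be percent-encoded, including any slashes.
    encoded_id = quote(identifier, safe="")
    # Note: size "max" exists in Image API 2.1 and 3.0; older 2.0 servers
    # expect "full" for the size parameter instead.
    return (f"{base_url.rstrip('/')}/{encoded_id}"
            f"/{region}/{size}/{rotation}/{quality}.{fmt}")


def iiif_info_url(base_url, identifier):
    """Build the URL of the image metadata document (info.json)."""
    encoded_id = quote(identifier, safe="")
    return f"{base_url.rstrip('/')}/{encoded_id}/info.json"


# Assumed server prefix and identifier, for illustration only.
BASE = "https://iiif.example.org/iiif/2"

full_image = iiif_image_url(BASE, "books/page1.tif")
# Crop an 800x600 region at (100, 200), then scale it to 400 pixels wide.
crop = iiif_image_url(BASE, "books/page1.tif",
                      region="100,200,800,600", size="400,")
metadata = iiif_info_url(BASE, "books/page1.tif")
```

Because the region and size are part of the URL, each crop or resize is a separate request that the server must decode and process, which is why the volume of requests matters so much here.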
To figure out the best (read: fastest) IIIF server for our needs, we decided to build a project based on Docker that would allow us to compare the behavior of the most popular open source IIIF implementations. We tested the Cantaloupe, Loris, IIPSrv, and RAIS server implementations, with file formats including JPEG-2000 (lossless and 10x compressed), JPEG q=90%, and lossless TIFF.
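As a rough illustration of what such a comparison involves, the sketch below times repeated requests and keeps the median duration per URL. It is a simplified, hypothetical harness and not our actual benchmark code: the function names are invented for this example, and the fetch function is passed in as a parameter so that any HTTP client can be used.

```python
import statistics
import time
import urllib.request


def http_fetch(url):
    """Download the full response body using the standard library."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def time_requests(fetch, urls, repeats=3):
    """Return the median wall-clock time in seconds for each URL.

    `fetch` is any callable taking a URL; it is injected so the
    harness is independent of the HTTP client being used.
    """
    results = {}
    for url in urls:
        durations = []
        for _ in range(repeats):
            start = time.perf_counter()
            fetch(url)
            durations.append(time.perf_counter() - start)
        # The median is less sensitive to a single slow request
        # than the mean.
        results[url] = statistics.median(durations)
    return results
```

Calling `time_requests(http_fetch, urls)` with the same list of image URLs against each server gives one comparable timing per request type. A real comparison would also need to account for server-side caching and concurrent requests, which this sketch ignores.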
We presented this project during the 2021 IIIF Annual Conference, on June 23. The video is available on YouTube.
We've also published an article featuring the main results of the study, which is available on our blog. To summarize our results, we found that the JPEG q=90% format, when processed with libjpeg-turbo, was the fastest filetype to transfer, and while Cantaloupe was fastest for full-size images, Loris was faster when considering crops and resizes.
Using these findings, we recently deployed a cluster of IIIF servers, which we hope will bring a significant performance improvement to our current workflow. Furthermore, since most of the organizations we work with promote the use of interoperable and open source/free software to host their data, we're publishing our findings in the hope that they can help others in this research space. You can view our benchmark source code in our GitLab repository to run the tests yourself.