
As part of TEKLIA's involvement in SYNTHESYS+, a European project that aims to build an accessible, integrated European resource for natural sciences researchers and to provide online access to digital collections and data from 115 museums in 21 countries, we evaluated four candidate data formats for exporting the results of layout analysis, OCR, HTR and NER processing.

The data we need to export comprises:

  • layout information (text lines),
  • OCR and HTR transcriptions,
  • named entities,
  • metadata on the source images and on these objects.

The four candidates were ALTO, Page XML, TEI XML and JSON/JSON-LD.

In the end, JSON seems to be the best format to represent and share this data during the course of the project: it is convenient, well-documented, machine-readable and already used in practice to output the results of automatic document processing, including NER.

ALTO

ALTO is an XML schema developed for the representation of layout information and OCR results.

It also supports the description of named entities. To include named entities in an ALTO document, the NamedEntityTag elements have to be defined in a Tags node, and a TAGREFS attribute on each String node then points to the relevant entity.

<alto>
<Description>
<MeasurementUnit>pixel</MeasurementUnit>
</Description>
<Tags>
<NamedEntityTag ID="NHMUK" LABEL="ORGANIZATION" DESCRIPTION="Natural History Museum, London"/>
</Tags>
<Layout>
<Page ID="some_page_id" HEIGHT="2318" WIDTH="1500">
<TextLine ID="some_text_line_id">
<Shape>
<Polygon POINTS="12,9 12,42 226,42 226,9 12,9"/>
</Shape>
<String CONTENT="THE" TAGREFS="NHMUK"/>
<String CONTENT="NATURAL" TAGREFS="NHMUK"/>
<String CONTENT="HISTORY" TAGREFS="NHMUK"/>
<String CONTENT="MUSEUM," TAGREFS="NHMUK"/>
<String CONTENT="LONDON" TAGREFS="NHMUK"/>
<String CONTENT="1938"/>
</TextLine>
</Page>
</Layout>
</alto>

This works well with OCR text, as OCR engines generally output word positions and sometimes even character positions, which means the different String nodes can be given coordinates and so be located on the images.

However, handwritten text recognition is generally done at line level rather than at character level, so we do not output precise word coordinates and can at best provide estimated positions. Having to split the text into words just so that they can be tagged with references to named entities is inconvenient.

Furthermore, ALTO's vocabulary is neither large nor permissive, so we would have to extend it to be able to include a lot of metadata.

It should also be noted that ALTO is not a format used by projects dealing with named entity recognition, and that its documentation is rather sparse.
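To illustrate the TAGREFS mechanism described above, here is a minimal Python sketch that resolves the references and merges consecutive String nodes back into entity mentions. It is not part of any ALTO tooling: the function name `merge_tagged_strings` is hypothetical, the sample omits the XML namespace and the Shape node, and each TAGREFS value is assumed to hold a single ID.

```python
import xml.etree.ElementTree as ET
from itertools import groupby

# Shortened, namespace-free copy of the excerpt above (assumption made for
# brevity: real ALTO files declare an XML namespace).
ALTO_SAMPLE = """<alto>
  <Tags>
    <NamedEntityTag ID="NHMUK" LABEL="ORGANIZATION"
                    DESCRIPTION="Natural History Museum, London"/>
  </Tags>
  <Layout>
    <Page ID="some_page_id" HEIGHT="2318" WIDTH="1500">
      <TextLine ID="some_text_line_id">
        <String CONTENT="THE" TAGREFS="NHMUK"/>
        <String CONTENT="NATURAL" TAGREFS="NHMUK"/>
        <String CONTENT="HISTORY" TAGREFS="NHMUK"/>
        <String CONTENT="MUSEUM," TAGREFS="NHMUK"/>
        <String CONTENT="LONDON" TAGREFS="NHMUK"/>
        <String CONTENT="1938"/>
      </TextLine>
    </Page>
  </Layout>
</alto>"""


def merge_tagged_strings(alto_xml):
    """Group consecutive String nodes that share the same TAGREFS value."""
    root = ET.fromstring(alto_xml)
    # Index the tag definitions by their ID so references can be resolved.
    tags = {tag.get("ID"): tag for tag in root.iter("NamedEntityTag")}
    entities = []
    for line in root.iter("TextLine"):
        strings = line.findall("String")
        for ref, group in groupby(strings, key=lambda s: s.get("TAGREFS")):
            if ref is None:  # untagged words such as "1938"
                continue
            tag = tags[ref]  # assumes a single ID per TAGREFS value
            entities.append({
                "line": line.get("ID"),
                "text": " ".join(s.get("CONTENT") for s in group),
                "label": tag.get("LABEL"),
                "description": tag.get("DESCRIPTION"),
            })
    return entities


entities = merge_tagged_strings(ALTO_SAMPLE)
```

Even in this small case, the entity has to be rebuilt from five separate word nodes, which is the extra work that word-level tagging imposes on line-level HTR output.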

Page XML

Page XML is another XML standard for the description of layout information, OCR and HTR results. It is the format used by Transkribus, but it is not widely used otherwise.

It is a much richer and more flexible format but, just like ALTO, it is least convenient when it comes to representing named entities, which is probably why it is not used much by NER projects either.

<TextLine id="line_1562363578373_234" custom="readingOrder {index:1;} organization {offset:0; length:10;_id:britmus; _normalized:British\u0020Museum;}">
<Coords points="8090,1042 6545,1042 6545,822 8090,822"/>
<TextEquiv>
<Unicode>Brit. Mus. Coll.</Unicode>
</TextEquiv>
</TextLine>

The entities are put inside a custom attribute on the TextLine node, in a manner that is neither nicely human-readable nor easy to deal with in an automated process.
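As an illustration of what automated processing of this attribute involves, the following Python sketch parses the custom string from the excerpt above. The grammar it assumes (a tag name followed by `key:value;` pairs in braces, with `\uXXXX` escapes in values) is inferred from this single example only, and the function names are hypothetical.

```python
import re

# Values copied from the TextLine excerpt above. The raw string keeps the
# backslash escape exactly as it appears in the XML attribute.
CUSTOM = (r"readingOrder {index:1;} organization {offset:0; length:10;"
          r"_id:britmus; _normalized:British\u0020Museum;}")
TEXT = "Brit. Mus. Coll."

TAG_RE = re.compile(r"(\w+)\s*\{([^}]*)\}")      # name {key:value; ...}
ESCAPE_RE = re.compile(r"\\u([0-9a-fA-F]{4})")   # literal \uXXXX sequences


def parse_custom(custom):
    """Return a list of (tag name, properties) pairs."""
    tags = []
    for name, body in TAG_RE.findall(custom):
        props = {}
        for item in body.split(";"):
            if ":" not in item:
                continue  # skip the empty chunk after the last semicolon
            key, value = item.split(":", 1)
            # Decode escapes such as \u0020 into the characters they stand for.
            value = ESCAPE_RE.sub(lambda m: chr(int(m.group(1), 16)),
                                  value.strip())
            props[key.strip()] = value
        tags.append((name, props))
    return tags


def entity_spans(custom, text):
    """Keep only the tags that carry an offset and a length into the text."""
    spans = []
    for name, props in parse_custom(custom):
        if "offset" in props and "length" in props:
            start = int(props["offset"])
            end = start + int(props["length"])
            spans.append({
                "type": name,
                "text": text[start:end],
                "normalized": props.get("_normalized"),
            })
    return spans


spans = entity_spans(CUSTOM, TEXT)
```

A consumer therefore needs a small ad hoc parser on top of the XML parser, and this sketch would still break on values containing a semicolon or a closing brace.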

TEI

TEI is a well-documented and very permissive XML standard for the digital representation of textual documents. However, its strength is also its weakness, as this flexibility considerably impedes interoperability. With every user and project adapting the format to their particular needs and documents, TEI documents are not easily machine-readable, and it is hard to process them without knowing how they were created.

As in ALTO, the entities are defined separately from the text, here in the teiHeader, and the tags on the relevant text are then linked to the corresponding entity using ids.

<teiHeader>
<fileDesc>
[...]
<sourceDesc>
[...]
<listPerson>
[...]
</listPerson>
<listPlace>
[...]
</listPlace>
<listOrg>
<org xml:id="britmus">
<orgName type="preferred">British Museum</orgName>
<orgName ref="#britmus00">Brit. Mus.</orgName>
</org>
[...]
</listOrg>
[...]
</sourceDesc>
</fileDesc>
</teiHeader>
<text>
<body>
<p>
<orgName xml:id="britmus00">Brit. Mus.</orgName> Coll.
</p>
</body>
</text>
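The id-based linking in this excerpt can be resolved with a few lines of Python. The sketch below follows the linking convention of the example as written (a variant name in the header points to the mention in the text through `ref`); the enclosing `TEI` root element, the omission of the TEI namespace and the function name are assumptions made to keep it self-contained.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the reserved XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Condensed copy of the excerpt above, with the elided parts left out.
TEI_SAMPLE = """<TEI>
  <teiHeader>
    <fileDesc>
      <sourceDesc>
        <listOrg>
          <org xml:id="britmus">
            <orgName type="preferred">British Museum</orgName>
            <orgName ref="#britmus00">Brit. Mus.</orgName>
          </org>
        </listOrg>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p><orgName xml:id="britmus00">Brit. Mus.</orgName> Coll.</p>
    </body>
  </text>
</TEI>"""


def resolve_org_mentions(tei_xml):
    """Link each orgName mention in the text to its org in the header."""
    root = ET.fromstring(tei_xml)
    # Map every mention id referenced in the header to its organisation.
    mention_to_org = {}
    for org in root.iter("org"):
        preferred = org.find("orgName[@type='preferred']")
        name = preferred.text if preferred is not None else None
        for variant in org.findall("orgName[@ref]"):
            mention_id = variant.get("ref").lstrip("#")
            mention_to_org[mention_id] = (org.get(XML_ID), name)
    resolved = []
    for mention in root.find("text").iter("orgName"):
        org = mention_to_org.get(mention.get(XML_ID))
        if org is not None:
            resolved.append({
                "mention": mention.text,
                "org_id": org[0],
                "preferred_name": org[1],
            })
    return resolved


mentions = resolve_org_mentions(TEI_SAMPLE)
```

The code only works because it encodes how this particular file links its entities; a TEI document produced under other conventions would need different lookups.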

JSON/JSON-LD

JSON-LD is a machine-readable format for linking data.

There is no standard in JSON/JSON-LD for the representation of textual documents, so we would create one for the project; but using JSON means using a format that is readable even with no knowledge of the specific rules that governed the creation of a file.

In addition, Microsoft and Google use JSON to output NER results, so it would make sense to define a format that resembles theirs.

Named entities can be easily represented in JSON, and we can use JSON-LD to link these entities to knowledge bases (such as DBpedia in the following example, which is only a dummy example and not proper JSON-LD).

{
"@id": "some_page_id",
"name": "PAGE_NAME",
"image": {
"@id": "some_image_id",
"url": "https://iiif.teklia.com/main/iiif/2/some_image.jpg",
"width": 1500,
"height": 2318
},
"metadata": [],
"subelements": [
{
"@id": "some_text_line_id",
"name": "text_line_name",
"type": "text_line",
"zone": {
"polygon": [[12,9],[12,42],[226,42],[226,9],[12,9]],
"image": {
"@id": "some_image_id",
"url": "https://iiif.teklia.com/main/iiif/2/some_image.jpg"
}
},
"transcriptions": [
{
"@id": "some_transcription_id",
"text": "THE NATURAL HISTORY MUSEUM, LONDON",
"confidence": 1,
"source": {
"worker_slug": "manual",
"worker_version": null
},
"entities": [
{
"text": "THE NATURAL HISTORY MUSEUM, LONDON",
"@id": "http://dbpedia.org/resource/Natural_History_Museum,_London",
"type": "organization",
"offset": 0,
"length": 34,
"source": {
"worker_slug": "manual",
"worker_version": null
}
}
]
}
],
"metadata": [
{
"type": "text",
"name": "script_type",
"value": "typewritten"
}
]
}
]
}
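As a rough illustration of how little code is needed to consume the example above, the following Python sketch walks the structure and recovers each entity from its offset and length. It uses a trimmed copy of the example, and the function name `iter_entities` is hypothetical; the field names are those of the dummy example, not of a defined standard.

```python
import json

# Trimmed copy of the example above, keeping only the fields used below.
PAGE_JSON = """
{
  "@id": "some_page_id",
  "subelements": [
    {
      "@id": "some_text_line_id",
      "type": "text_line",
      "transcriptions": [
        {
          "@id": "some_transcription_id",
          "text": "THE NATURAL HISTORY MUSEUM, LONDON",
          "entities": [
            {
              "@id": "http://dbpedia.org/resource/Natural_History_Museum,_London",
              "type": "organization",
              "offset": 0,
              "length": 34
            }
          ]
        }
      ]
    }
  ]
}
"""


def iter_entities(page):
    """Yield one flat record per entity found in a page dictionary."""
    for element in page.get("subelements", []):
        for transcription in element.get("transcriptions", []):
            text = transcription["text"]
            for entity in transcription.get("entities", []):
                # The entity span is addressed by character offset and length.
                start = entity["offset"]
                end = start + entity["length"]
                yield {
                    "element": element["@id"],
                    "type": entity["type"],
                    "text": text[start:end],
                    "link": entity["@id"],
                }


entities = list(iter_entities(json.loads(PAGE_JSON)))
```

Because entities refer to the transcription by character offsets, the line does not have to be split into words, which suits line-level HTR output.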

It may be that, at the end of the project, another format will be defined in order to share the corpus of museum specimen data, but to exchange data between participants, processes and tools over its course, JSON/JSON-LD appears to be the best option.