Charter in Czech from Národní archive, ACK 1527

When working on Named Entity Recognition for complex documents, e.g. historical documents or documents from the legal or medical domains, choosing an evaluation protocol can become tricky. Even when the NER task is performed on the same text as the gold standard, one has to take into account various NER prediction scenarios and can apply various scoring levels. Here is a great article by David S. Batista detailing all the possible ways of scoring the results of a NER system on clean input, and presenting an implementation of a metric with four different scoring levels.

Adding a text recognition task, OCR, handwritten text recognition or speech recognition, before the NER task means that named entities have to be detected on noisy text. In this case the gold-standard and NER prediction texts can differ greatly, depending on the quality of the recognizer, and one can no longer simply compare NE tags on corresponding tokens. On top of the two dimensions of NER evaluation, entity boundary and entity tag, comes the question of the text itself: the number of characters and/or tokens may differ between the two versions, as may the recognized characters themselves, so that entity boundary evaluation is no longer a simple binary decision.

With the aim of comparing the performance of our NER models on HTR-detected content, we designed a metric for NER evaluation on noisy text. The complete Python code is available in this repository:

https://gitlab.com/teklia/nerval

Metric description

The inputs should be in BIO format, or its variant BIOES.

The main idea of this metric is to use string alignment at character level, which we do with the Python package edlib.

The automatic transcription is first aligned with the ground truth at character level, by minimising the Levenshtein distance between them. Each entity in the ground truth is then matched with an overlapping entity of the same label in the aligned transcription, or with an empty character string if no match is found. If the edit distance between the two entities is less than 30% of the ground truth entity length, the predicted entity is considered recognised. For the purpose of matching detected entities to existing databases, we estimated that a 70% match between the entity texts was a fair threshold.
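
As an illustration, here is a minimal sketch of this alignment step using edlib directly, on the running example of the next section (the variable names are ours, and the exact positions of the gap characters may differ between equally optimal alignments):

import edlib

annotation = "Tolkien was a writer ."
prediction = "Tolkieene xas writear ,."

# Global (Needleman-Wunsch) alignment minimising the Levenshtein distance;
# task="path" is needed to recover the alignment path, not just the distance.
result = edlib.align(annotation, prediction, mode="NW", task="path")
nice = edlib.getNiceAlignment(result, annotation, prediction)

print(nice["query_aligned"])   # annotation with '-' gap characters inserted
print(nice["target_aligned"])  # prediction with '-' gap characters inserted
print(result["editDistance"])  # Levenshtein distance between the two texts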

Details

  • From the BIO files given as input, retrieve the text content and extend the word-level tagging to a character-level tagging (a sketch of this step is given after the prediction example below)
    • spaces are added between each word
    • a space between two words with the same tag gets that tag, otherwise O
    • the information about the beginning of an entity is dropped

For instance, the following annotation file:

Tolkien B-PER
was O
a O
writer B-OCC
. O

produces the following list of tags, one per character (spaces included):

['PER','PER','PER','PER','PER','PER','PER',
 'O',
 'O', 'O', 'O',
 'O',
 'O',
 'O',
 'OCC','OCC','OCC','OCC','OCC','OCC',
 'O',
 'O']

And the prediction file could be:

Tolkieene B-PER
xas O
writear B-OCC
,. O

producing:

['PER','PER','PER','PER','PER','PER','PER','PER','PER',
 'O',
 'O', 'O', 'O',
 'O',
 'OCC','OCC','OCC','OCC','OCC','OCC','OCC',
 'O',
 'O','O']
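
The word-level to character-level tag expansion described above can be sketched in a few lines of Python; the function below is illustrative and not taken from the nerval code base:

def char_level_tags(bio_lines):
    """Turn word-level BIO annotations into one tag per character,
    including the spaces inserted between words."""
    words, labels = [], []
    for line in bio_lines:
        word, tag = line.split()
        words.append(word)
        # Drop the B-/I- prefix: only the entity label is kept.
        labels.append("O" if tag == "O" else tag.split("-", 1)[1])

    char_tags = []
    for i, (word, label) in enumerate(zip(words, labels)):
        if i > 0:
            # The space between two words gets their common tag, otherwise 'O'.
            char_tags.append(label if label == labels[i - 1] else "O")
        char_tags.extend([label] * len(word))
    return char_tags

annotation_bio = ["Tolkien B-PER", "was O", "a O", "writer B-OCC", ". O"]
print(char_level_tags(annotation_bio))  # the 22 tags listed above for the annotation
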
  • Character-level alignment between the annotation and the prediction adds '-' characters to both strings so that they have the same length

With the following input texts:

annotation : Tolkien was a writer .
prediction : Tolkieene xas writear ,.

the alignment result is:

annotation : Tolkie-n- was a writer- -.
prediction : Tolkieene xas --writear ,.
  • Adapt the character-level tags to the aligned strings (a sketch of this step follows the example below)
    • '-' characters in the aligned strings get the same tag as the previous proper character in the string
             PPPPPPPPPOOOOOOOCCCCCCCOOO
annotation : Tolkie-n- was a writer- -.
prediction : Tolkieene xas --writear ,.
             PPPPPPPPPOOOOOOOCCCCCCCOOO
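
A possible implementation of this tag propagation, assuming one tag per real character as produced in the previous step (the function name is ours):

def align_tags(aligned_text, char_tags):
    """char_tags contains one tag per real (non '-') character of aligned_text;
    each '-' gap inherits the tag of the previous real character ('O' at the start)."""
    tags_iter = iter(char_tags)
    aligned_tags, last_tag = [], "O"
    for char in aligned_text:
        if char != "-":
            last_tag = next(tags_iter)
        aligned_tags.append(last_tag)
    return aligned_tags

ann_tags = ["PER"] * 7 + ["O"] * 7 + ["OCC"] * 6 + ["O"] * 2
display = {"PER": "P", "OCC": "C", "O": "O"}
print("".join(display[t] for t in align_tags("Tolkie-n- was a writer- -.", ann_tags)))
# PPPPPPPPPOOOOOOOCCCCCCCOOO
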
  • Search for a matching entity in the prediction for each entity in the annotation (an illustrative sketch of this procedure is given after the examples below)
    • Going through the annotation character by character, when a new entity tag (not 'O') is encountered, that character is considered the beginning of an entity to be matched.
    • Looking at the character at the same position in the prediction string, if the entity tags match on these two characters, the tags are back-tracked in the prediction string to find the beginning of the entity, that is, the first occurrence of said entity tag.
    • Otherwise, if the entity tags do not match on this first character, the beginning of a matching entity is searched for in the prediction, up to the end of the entity in the annotation.
    • Both in the annotation and in the prediction, a detected entity ends with the last contiguous occurrence of the tag of its first character.

Here are examples of several situations, with the delimitation of the matched entities in each case.

Matched entities are delimited by | characters.

annotation : OOOOOOO|PPPPPPPPPPPPPPPPP|OOOOOO
prediction : OOOO|PPPPPPPPPPP|OOOOOOOOOOOOOOO

annotation : OOOOOOO|PPPPPPPPPPPPPPPPP|OOOOOO
prediction : OOOOOOOOOOOOOO|PPPPPPPPPPPPPP|OO

annotation : OOOOOOO|PPPPPPPPPPPPPPPPP|OOOOOO
prediction : OOOO|PPPPPPPPPPP|OOOOPPPPOOOOOOO

annotation : OOOOOOO|PPPPPPPPPPPPPPPPP|OOOOOO
prediction : OOOOOOO|P|OPPPPPPPPPPPPPPOOOOOOO

annotation : OOOOOOO|PPPPPPPPPPPPPPPPP|OOOOOO
prediction : OOOOOOOOOOOOOOOOOOOOOOOOOOPPPPOO

For this last example, no match is found in the prediction.
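
Our reading of this matching procedure can be sketched as follows for a single annotation entity; this is an illustrative reimplementation, not the nerval code itself:

def match_entity(ann_tags, pred_tags, start):
    """ann_tags / pred_tags: one tag per character of the aligned strings.
    start: index where an entity begins in the annotation.
    Returns (annotation_span, prediction_span) as (begin, end) index pairs,
    prediction_span being None when no matching entity is found."""
    label = ann_tags[start]

    # The annotation entity ends with the last contiguous occurrence of its tag.
    ann_end = start
    while ann_end + 1 < len(ann_tags) and ann_tags[ann_end + 1] == label:
        ann_end += 1

    if pred_tags[start] == label:
        # Same tag at the same position: back-track to the beginning of the entity.
        pred_start = start
        while pred_start > 0 and pred_tags[pred_start - 1] == label:
            pred_start -= 1
    else:
        # Otherwise, search forward up to the end of the annotation entity.
        pred_start = next(
            (i for i in range(start, ann_end + 1) if pred_tags[i] == label), None
        )
        if pred_start is None:
            return (start, ann_end), None

    # The prediction entity also ends with the last contiguous occurrence of the tag.
    pred_end = pred_start
    while pred_end + 1 < len(pred_tags) and pred_tags[pred_end + 1] == label:
        pred_end += 1
    return (start, ann_end), (pred_start, pred_end)

print(match_entity(list("OOOOOOO" + "P" * 17 + "OOOOOO"),
                   list("OOOOOOO" + "P" + "O" + "P" * 14 + "OOOOOOO"), 7))
# ((7, 23), (7, 7)) -- the fourth example above
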
  • Get a score on the two matched strings (a sketch of this test follows the examples below):
    • Compute the Levenshtein distance between the two strings, ignoring the '-' characters
    • If edit_distance / length(annotation_entity) < 0.3, the entity is considered recognised
edit_distance("Tolkien", "Tolkieene") = 2
len("Tolkien") = 7
2/7 = 0.29 < 0.3
OK

edit_distance("writer", "writear") = 1
len("writer") = 6
1/6 = 0.17 < 0.3
OK
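
This threshold test can be sketched with edlib as well (any edit-distance implementation would do; the helper name is ours):

import edlib

def is_recognised(ann_entity, pred_entity, threshold=0.3):
    # The '-' gap characters introduced by the alignment are ignored.
    ann_text = ann_entity.replace("-", "")
    pred_text = pred_entity.replace("-", "")
    distance = edlib.align(ann_text, pred_text)["editDistance"]
    return distance / len(ann_text) < threshold

print(is_recognised("Tolkie-n-", "Tolkieene"))  # True, 2/7 = 0.29
print(is_recognised("writer-", "writear"))      # True, 1/6 = 0.17
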
  • Final scores, Precision, Recall and F1-score, are given for each entity type, at entity level (a sketch of the computation follows the example below). The total ("ALL") is a micro-average across entity types.
PER :
P = 1/1
R = 1/1
F1 = 2*1*1/(1+1)

OCC :
P = 1/1
R = 1/1
F1 = 2*1*1/(1+1)

ALL :
P = 2/2
R = 2/2
F1 = 2*1*1/(1+1)
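
For completeness, here is how these final scores could be computed from per-label counts of matched (recognised), predicted and annotated entities; the counts below correspond to the example above:

def scores(matched, predicted, annotated):
    precision = matched / predicted if predicted else 0.0
    recall = matched / annotated if annotated else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

counts = {"PER": (1, 1, 1), "OCC": (1, 1, 1)}  # matched, predicted, annotated
for label, (m, p, a) in counts.items():
    print(label, scores(m, p, a))

# Micro-average over all entity types ("ALL"): sum the counts, then score.
m, p, a = (sum(c[i] for c in counts.values()) for i in range(3))
print("ALL", scores(m, p, a))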

Illustration: Charter in Czech from Národní archive, ACK 1527