Treating OCR Output as a Language (TOOL) – Improving OCR Output with Seq2Seq Translation

Link:

https://doi.org/10.15439/2025F1103

Author:

Year of publication:

2025

Media type:

Text

Description:

Optical Character Recognition (OCR) systems are widely used to digitize text, but they often produce noisy output, especially for historical, poor-quality, or multilingual material. Despite advances in OCR technology, post-processing remains a major bottleneck. We propose TOOL (Treating OCR Output as a Language), a new approach that frames OCR correction as a machine translation task. By treating noisy OCR text as a language in its own right, TOOL employs sequence-to-sequence models such as Marian to translate it into clean, standardized text. The method is scalable, model-independent, and language-flexible. We demonstrate the approach by translating "OCR German" into Standard German for texts from around 1871 to the present day, improving token-level accuracy by training on matched pairs of OCR output and base text.
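
The idea of matched training pairs and token-level accuracy can be illustrated with a minimal sketch. It is not taken from the publication: the function names `make_pairs` and `token_accuracy`, the `source`/`target` keys, and the accuracy definition (share of reference tokens recovered in order) are assumptions chosen for illustration, and the actual TOOL pipeline and metric may differ.

```python
from difflib import SequenceMatcher


def make_pairs(ocr_lines, clean_lines):
    """Pair each OCR line with its base-text line (hypothetical helper).

    Assumes both lists are already aligned line by line; pairs in which
    either side is empty after stripping are skipped.
    """
    pairs = []
    for ocr, clean in zip(ocr_lines, clean_lines):
        ocr, clean = ocr.strip(), clean.strip()
        if ocr and clean:
            pairs.append({"source": ocr, "target": clean})
    return pairs


def token_accuracy(hypothesis, reference):
    """Share of reference tokens that reappear, in order, in the hypothesis.

    This is an assumed, simplified metric: tokens are whitespace-separated
    and aligned with difflib's longest-matching-block heuristic.
    """
    hyp_tokens = hypothesis.split()
    ref_tokens = reference.split()
    if not ref_tokens:
        return 0.0
    matcher = SequenceMatcher(a=ref_tokens, b=hyp_tokens, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref_tokens)


if __name__ == "__main__":
    # Invented example lines, not data from the publication.
    pairs = make_pairs(["Dcr Hund bellt"], ["Der Hund bellt"])
    for pair in pairs:
        print(pair["source"], "->", pair["target"])
        print(token_accuracy(pair["source"], pair["target"]))
```

Pairs in this shape could serve as source and target sentences for fine-tuning a sequence-to-sequence model, and the same accuracy function could then be applied to the raw OCR line and to the model's output to compare them against the base text.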

License:

info:eu-repo/semantics/openAccess

Source system:

Forschungsinformationssystem der UHH

Internal metadata

Source record: oai:www.edit.fis.uni-hamburg.de:publications/a0fc15d2-fd3c-4039-9b7c-006415070746