Treating OCR Output as a Language (TOOL) – Improving OCR Output with Seq2Seq Translation

Link:
Author:
Year of publication:
2025
Media type:
Text
Description:
  • Optical Character Recognition (OCR) systems are widely used to digitize text, but they often produce noisy results, especially with historical, poor-quality, or multilingual data. Despite advances in OCR technology, post-processing remains a major bottleneck. We propose TOOL (Treating OCR Output as a Language), a new approach that frames OCR correction as a machine translation task. By treating noisy OCR text as a language in its own right, TOOL employs sequence-to-sequence models such as Marian to translate it into clean, standardized text. This method is scalable, model-independent, and language-flexible. We demonstrate the approach by translating "OCR German" into Standard German for texts from around 1871 to the present day, improving token-level accuracy by training on matched pairs of OCR output and base text.
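The description mentions token-level accuracy over matched pairs of OCR output and base text but does not say how it is computed. The following is a minimal sketch of one possible measure, not the authors' evaluation procedure. The function name `token_accuracy`, the whitespace tokenization, the alignment via Python's `difflib`, and the sample sentence pair are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def token_accuracy(hypothesis: str, reference: str) -> float:
    """Share of reference tokens that reappear, in order, in the hypothesis.

    Assumption: tokens are whitespace-separated and must match exactly.
    """
    hyp_tokens = hypothesis.split()
    ref_tokens = reference.split()
    if not ref_tokens:
        # An empty reference is only matched by an empty hypothesis.
        return 1.0 if not hyp_tokens else 0.0
    # Align the two token sequences and count the tokens in matching blocks.
    matcher = SequenceMatcher(a=ref_tokens, b=hyp_tokens, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref_tokens)


if __name__ == "__main__":
    # Hypothetical matched pair: (noisy OCR output, base text).
    ocr_line = "Dic Stadt ift fehr alt"
    base_line = "Die Stadt ist sehr alt"
    print(f"accuracy before correction: {token_accuracy(ocr_line, base_line):.2f}")
```

Under this sketch, the same function would be applied to the model's translated output and the base text, so that accuracy before and after correction can be compared on the same pairs.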
License:
  • info:eu-repo/semantics/openAccess
Source system:
Forschungsinformationssystem der UHH (Research Information System of the University of Hamburg)

Internal metadata
Source record
oai:www.edit.fis.uni-hamburg.de:publications/a0fc15d2-fd3c-4039-9b7c-006415070746