For full digitisation services, please visit our parent site: maxcommunications.co.uk

Themis

Changing the way we digitise

THEMIS is both a production and a project management tool. Fully hosted and managed by Max, THEMIS ensures that all stages of a digitisation project are managed from within a single platform. It also allows our customers to approach digitisation projects differently if appropriate: for instance, using THEMIS it is possible to capture and ingest the digitised images prior to cataloguing and indexing the material.

Automated data extraction using THEMIS

For structured data, we have developed automated algorithms to extract specific fields from templated documents based on their spatial relationship to “marker” text floats. Text floats are the blocks of text, together with their bounding box coordinates, that OCR programs such as Tesseract produce. For example, if a series of printed forms has the words “Invoice Number” at the top of a column, individual text blocks that fall within a set range below this “marker” text float can be identified as invoice numbers.
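
As an illustration only, the marker-based approach described above might be sketched as follows in Python. This is not the THEMIS implementation: the `TextFloat` class, the `fields_below_marker` function and the `max_drop` and `x_tolerance` parameters are hypothetical names chosen for this example, and the coordinates are invented sample data in the shape an OCR engine's bounding boxes typically take.

```python
from dataclasses import dataclass


@dataclass
class TextFloat:
    """A block of OCR text with its bounding box (hypothetical structure)."""
    text: str
    left: int
    top: int
    width: int
    height: int


def fields_below_marker(floats, marker, max_drop=400, x_tolerance=20):
    """Return the text of floats that sit in the column beneath the marker float.

    max_drop and x_tolerance are illustrative pixel thresholds, not THEMIS settings.
    """
    marker_float = next(
        (f for f in floats if f.text.strip().lower() == marker.lower()), None
    )
    if marker_float is None:
        return []

    # Horizontal band covered by the marker, widened slightly for skewed scans.
    col_left = marker_float.left - x_tolerance
    col_right = marker_float.left + marker_float.width + x_tolerance
    marker_bottom = marker_float.top + marker_float.height

    hits = [
        f for f in floats
        if f is not marker_float
        and col_left <= f.left <= col_right
        and marker_bottom <= f.top <= marker_bottom + max_drop
    ]
    # Read the column top to bottom.
    return [f.text for f in sorted(hits, key=lambda f: f.top)]


# Invented sample data for a two-column form.
floats = [
    TextFloat("Invoice Number", 100, 50, 180, 20),
    TextFloat("Amount", 400, 50, 90, 20),
    TextFloat("ABC123", 105, 90, 80, 20),
    TextFloat("45.00", 405, 90, 60, 20),
    TextFloat("DEF456", 104, 130, 80, 20),
]
invoice_numbers = fields_below_marker(floats, "Invoice Number")
print(invoice_numbers)  # ['ABC123', 'DEF456']
```

In practice the thresholds would need tuning per template, since column widths and row spacing vary between document series.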

Once data has been separated into fields, content-specific heuristic checks can be made against the expected format. For example, analysis can show that an invoice number should follow a specific format, e.g. XXX000, and THEMIS can mark records that don’t match this for review and QA.
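
A format check of this kind could, for instance, be expressed as a regular expression. The sketch below assumes that “XXX000” means three uppercase letters followed by three digits; that reading, the `INVOICE_PATTERN` name and the `flag_for_review` function are assumptions made for this example rather than a description of how THEMIS performs the check.

```python
import re

# Assumed reading of "XXX000": three uppercase letters, then three digits.
INVOICE_PATTERN = re.compile(r"[A-Z]{3}[0-9]{3}")


def flag_for_review(records, pattern=INVOICE_PATTERN):
    """Return the records that do not match the expected format in full."""
    return [r for r in records if not pattern.fullmatch(r.strip())]


# Invented sample values, including two typical OCR slips
# (a digit read in place of a letter, and a letter O read in place of a zero).
records = ["ABC123", "AB0123", "XYZ9O7", "QRS456"]
needs_review = flag_for_review(records)
print(needs_review)  # ['AB0123', 'XYZ9O7']
```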

THEMIS & OCR

THEMIS offers the most efficient means of viewing, assessing and editing imported OCR’ed data. It provides essential project management information, including accuracy rates, remedial work statistics (with volumes) and trend analysis for continuous improvement.

As part of our solution we have developed a number of methodologies for QA and post-production of OCR’ed material. We recognise that the percentage accuracy reports of programs such as ABBYY FineReader and Tesseract only give a figure for presumed correct interpretations based on the number of definite fails. Therefore, to improve OCR results, we develop content-specific heuristic programming strategies and routines. We also utilise expandable dictionaries of proper nouns and jargon, especially for unstructured content.
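
To give a flavour of how an expandable dictionary can support OCR clean-up, here is a minimal Python sketch. It uses fuzzy matching from the standard library's `difflib` module, which is a technique chosen for this example and not necessarily the one THEMIS uses; the `lexicon` contents, the `suggest_corrections` function and the 0.8 similarity cutoff are likewise assumptions.

```python
import difflib

# A small, expandable set of proper nouns and jargon (invented sample entries).
lexicon = {"Tesseract", "THEMIS", "digitisation"}


def suggest_corrections(tokens, lexicon, cutoff=0.8):
    """Map each unknown token to its closest lexicon entry, if one is close enough.

    Tokens already in the lexicon, and tokens with no near match, are left out.
    """
    candidates = sorted(lexicon)  # sorted for a stable order
    suggestions = {}
    for token in tokens:
        if token in lexicon:
            continue
        matches = difflib.get_close_matches(token, candidates, n=1, cutoff=cutoff)
        if matches:
            suggestions[token] = matches[0]
    return suggestions


tokens = ["Tessaract", "THEMIS", "ABBY", "scanner"]
before = suggest_corrections(tokens, lexicon)

# Expanding the dictionary lets a further misreading be caught.
lexicon.add("ABBYY")
after = suggest_corrections(tokens, lexicon)

print(before)  # {'Tessaract': 'Tesseract'}
print(after)   # {'Tessaract': 'Tesseract', 'ABBY': 'ABBYY'}
```

Suggestions like these would normally be queued for human review and not applied automatically, since a near match is not proof of an OCR error.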