Tel:  020 8309 5445

PROJECT MANAGEMENT SOFTWARE

THEMIS: structuring unstructured data

THEMIS is both a production and a project management software platform. Fully hosted and managed by Max, THEMIS ensures that all stages of a digitisation project are managed from within a single platform. It also allows our customers to approach digitisation projects differently, if appropriate: for instance, using THEMIS it is possible to capture and ingest the digitised images prior to cataloguing and indexing the material.




Structuring Unstructured Data

THEMIS allows the entire digitisation process to be managed from a single platform. To find out what THEMIS can do for your organisation please click on the link below.

Preserving the past for the future

SOTERIA offers comprehensive digital preservation software and storage services for archival collections. To find out more please click on the link below.

Bespoke UIs and Functionality to Match Your Organisation's Needs

THEMIS can be customised, in both its UI and its underlying functionality, to meet your organisation's individual use case. To find out more about what THEMIS can do for your organisation please click on the link below.

THEMIS: Max's In-House Project Management Platform

For the majority of our indexing and transcription projects we use our in-house project management platform, THEMIS. THEMIS is built on a relational database for secure internal and external access, and it can accommodate any existing catalogue information while allowing further metadata to be added post-digitisation.

To discuss THEMIS or any of Max's services please feel free to telephone us on 020 8309 5445 or to contact us via our contact page.

Automated data extraction using THEMIS

For structured data, we have developed automated algorithms that extract specific fields from templated documents based on their spatial relationship to “marker” text floats. Text floats are the blocks of text, together with their bounding-box coordinates, that OCR programs such as Tesseract produce. For example, if a series of printed forms has the words “Invoice Number” at the top of a column, individual text blocks that fall within a set range below this marker float can be identified as invoice numbers.
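As an illustration only, the marker-based approach described above can be sketched in Python. This is not the THEMIS implementation: the TextBlock class, the blocks_below_marker function, the max_drop and x_tolerance parameters and the sample coordinates are all hypothetical, and the sketch assumes the OCR output has already been grouped into phrase-level blocks with pixel coordinates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextBlock:
    """A hypothetical text float: OCR text plus its bounding box in pixels."""
    text: str
    left: int
    top: int
    width: int
    height: int

    @property
    def right(self) -> int:
        return self.left + self.width

    @property
    def bottom(self) -> int:
        return self.top + self.height


def blocks_below_marker(blocks, marker_text, max_drop=500, x_tolerance=10):
    """Return blocks that sit in the column below the marker, top to bottom."""
    wanted = marker_text.strip().lower()
    marker = next((b for b in blocks if b.text.strip().lower() == wanted), None)
    if marker is None:
        return []  # marker not found on this page
    hits = [
        b for b in blocks
        if b is not marker
        and b.top >= marker.bottom                 # starts below the marker
        and b.top - marker.bottom <= max_drop      # but not too far below
        and b.left < marker.right + x_tolerance    # overlaps the marker
        and b.right > marker.left - x_tolerance    # horizontally
    ]
    return sorted(hits, key=lambda b: b.top)


# Made-up sample page: two column headings and three values.
blocks = [
    TextBlock("Invoice Number", 100, 50, 160, 20),
    TextBlock("Amount", 400, 50, 80, 20),
    TextBlock("ABC123", 105, 90, 70, 20),
    TextBlock("DEF456", 105, 130, 70, 20),
    TextBlock("19.99", 405, 90, 60, 20),
]

invoice_numbers = [b.text for b in blocks_below_marker(blocks, "Invoice Number")]
print(invoice_numbers)  # ['ABC123', 'DEF456']
```

In practice the vertical range and horizontal tolerance would need tuning for each template, since scan skew and resolution shift the coordinates from page to page.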

Once the data has been separated into fields, content-specific heuristic checks can be made against the expected format. For example, analysis may show that an invoice number should follow a specific format, e.g. XXX000, and THEMIS can mark records that do not match it for review and QA.
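A format check of this kind might look like the following minimal sketch. It reads XXX000 as three capital letters followed by three digits, which is an assumption rather than something stated above; the flag_for_review function, the invoice_number field name and the sample records are likewise hypothetical.

```python
import re

# Assumed reading of "XXX000": three capital letters, then three digits.
INVOICE_PATTERN = re.compile(r"[A-Z]{3}[0-9]{3}")


def flag_for_review(records, field="invoice_number", pattern=INVOICE_PATTERN):
    """Return the records whose field does not match the expected format."""
    flagged = []
    for record in records:
        value = str(record.get(field) or "").strip()
        if not pattern.fullmatch(value):
            flagged.append(record)
    return flagged


# Made-up sample records, two of them with typical OCR slips.
records = [
    {"id": 1, "invoice_number": "ABC123"},
    {"id": 2, "invoice_number": "AB0123"},  # digit read in place of a letter
    {"id": 3, "invoice_number": "XYZ98O"},  # letter O read in place of zero
]

needs_review = [r["id"] for r in flag_for_review(records)]
print(needs_review)  # [2, 3]
```

Records that fail the check are only flagged, not corrected, so a reviewer makes the final decision during QA.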

THEMIS & OCR

THEMIS offers an efficient means of viewing, assessing and editing imported OCRed data. It provides essential project management information, including accuracy rates, remedial work statistics (with volumes) and trend analysis for continuous improvement.

As part of our solution we have developed a number of methodologies for QA and post-production of OCRed material. We recognise that the percentage accuracy reports of programs such as ABBYY FineReader and Tesseract only give a figure for presumed correct interpretations based on the number of definite fails. Therefore, to improve OCR results, we develop content-specific heuristic programming strategies and routines. We also utilise expandable dictionaries of proper nouns and jargon, especially for unstructured content.
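One way an expandable dictionary can support OCR correction is approximate matching of each token against a list of known terms. The sketch below uses the difflib module from the Python standard library as a stand-in technique. It is not the THEMIS routine, and the suggest_correction function, the 0.8 cutoff and the sample place names are assumptions chosen for illustration.

```python
from difflib import get_close_matches


def suggest_correction(token, lexicon, cutoff=0.8):
    """Return the token if known, else the closest lexicon entry, else None."""
    if token in lexicon:
        return token
    # sorted() gives get_close_matches a stable list to search.
    matches = get_close_matches(token, sorted(lexicon), n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Made-up dictionary of proper nouns; it can be extended as a project runs.
lexicon = {"Bexley", "Sidcup", "Greenwich"}
lexicon.add("Erith")

print(suggest_correction("Bex1ey", lexicon))  # Bexley
print(suggest_correction("Sidcup", lexicon))  # Sidcup
print(suggest_correction("zzzz", lexicon))    # None
```

A suggestion of this kind is best treated as a candidate for review, because a low cutoff will happily map a genuine but unlisted name onto the nearest listed one.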
