THEMIS is both a production and a project management software platform. Fully hosted and managed by Max, THEMIS ensures that all stages of a digitisation project are managed from within a single platform. It also allows our customers to approach digitisation projects differently, if appropriate: for instance, using THEMIS it is possible to capture and ingest the digitised images prior to cataloguing and indexing the material.
Preserving the past for the future
SOTERIA offers comprehensive digital preservation software and storage services for archival collections. To find out more please click on the link below.
Structuring Unstructured Data
THEMIS allows the entire digitisation process to be managed from a single platform. To find out what THEMIS can do for your organisation please click on the link below.
Bespoke UIs and Functionality to Match Your Organisation's Needs
THEMIS can be customised both in terms of its UI and underlying functionality to meet your organisation's individual use-case. To find out more about what THEMIS can do for your organisation please click on the link below.
For the majority of our indexing and transcription projects we use our in-house project management platform, THEMIS. THEMIS is built on a relational database with secure internal and external access, and it can accommodate any existing catalogue information while allowing further metadata to be added after digitisation.
To discuss THEMIS or any of Max's services please feel free to telephone us on 020 8309 5445 or to contact us via our contact page.
Automated data extraction using THEMIS
For structured data, we have developed automated algorithms that extract specific fields from templated documents based on their spatial relationship to “marker” text floats. Text floats are the blocks of text, together with their bounding-box coordinates, that OCR programs such as Tesseract produce. For example, if a series of printed forms has the words “Invoice Number” at the top of a column, individual text blocks that fall within a set range below this marker float can be identified as invoice numbers.
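The marker-based approach described above can be illustrated with a minimal Python sketch. It is not the THEMIS implementation: the `TextFloat` class, the `fields_below_marker` function, and the `max_drop` and `x_tolerance` parameters are all hypothetical names chosen for illustration, and the coordinates are invented sample data in pixels.

```python
from dataclasses import dataclass


@dataclass
class TextFloat:
    """A block of OCR text with its bounding box (hypothetical structure)."""
    text: str
    left: int
    top: int
    width: int
    height: int

    @property
    def right(self) -> int:
        return self.left + self.width

    @property
    def bottom(self) -> int:
        return self.top + self.height


def fields_below_marker(floats, marker_text, max_drop=400, x_tolerance=10):
    """Return the text of floats lying in the column beneath the marker float.

    max_drop and x_tolerance are assumed tuning values, not THEMIS settings.
    """
    marker = next(
        (f for f in floats if f.text.strip().lower() == marker_text.lower()),
        None,
    )
    if marker is None:
        return []
    hits = [
        f for f in floats
        if f is not marker
        # vertically: starts below the marker, within the allowed range
        and marker.bottom <= f.top <= marker.bottom + max_drop
        # horizontally: overlaps the marker's column, with some tolerance
        and f.left < marker.right + x_tolerance
        and f.right > marker.left - x_tolerance
    ]
    return [f.text for f in sorted(hits, key=lambda f: f.top)]


# Invented sample data standing in for OCR output from one templated form.
floats = [
    TextFloat("Invoice Number", 100, 50, 140, 20),
    TextFloat("Amount", 300, 50, 70, 20),
    TextFloat("ABC123", 105, 90, 70, 20),
    TextFloat("45.00", 305, 90, 50, 20),
    TextFloat("XYZ789", 105, 130, 70, 20),
]
invoice_numbers = fields_below_marker(floats, "Invoice Number")
print(invoice_numbers)  # ['ABC123', 'XYZ789']
```

In practice the range and tolerance would be tuned per template, since column widths and row spacing vary between form designs.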
Once data has been separated into fields, content-specific heuristic checks can be made against the expected format. For example, analysis may show that an invoice number should follow a specific format, e.g. XXX000, and THEMIS can then mark records that don’t match it for review and QA.
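A format check of this kind might look like the following sketch. It assumes that XXX000 means three capital letters followed by three digits, which is only one possible reading of the example; the `flag_for_review` function and the record layout are hypothetical.

```python
import re

# Assumed reading of "XXX000": three capital letters, then three digits.
INVOICE_PATTERN = re.compile(r"[A-Z]{3}\d{3}")


def flag_for_review(records, field="invoice_number", pattern=INVOICE_PATTERN):
    """Return the records whose field does not fully match the expected format."""
    return [r for r in records if not pattern.fullmatch(r.get(field) or "")]


# Invented sample records, as they might look after field extraction.
records = [
    {"id": 1, "invoice_number": "ABC123"},
    {"id": 2, "invoice_number": "AB0123"},  # OCR read a letter as a digit
    {"id": 3, "invoice_number": "XYZ78"},   # a character was dropped
    {"id": 4, "invoice_number": ""},        # nothing was extracted
]
needs_review = flag_for_review(records)
print([r["id"] for r in needs_review])  # [2, 3, 4]
```

Flagging rather than auto-correcting keeps the decision with a human reviewer, which matches the review and QA step described above.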
THEMIS & OCR
THEMIS offers an efficient means of viewing, assessing and editing imported OCR’ed data. It provides essential project management information including, among other things, accuracy rates, remedial work statistics (including volumes), and trend analysis for continuous improvement.
As part of our solution we have developed a number of methodologies for QA and post-production of OCR’ed material. We recognise that the percentage accuracy reports of programs such as ABBYY FineReader and Tesseract only give a figure for presumed-correct interpretations, based on the number of definite failures. To improve OCR results, we therefore develop content-specific heuristic programming strategies and routines. We also use expandable dictionaries of proper nouns and jargon, especially for unstructured content.
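One way an expandable dictionary could support OCR post-production is sketched below, using fuzzy matching from Python's standard `difflib` module. This is an illustrative assumption and not a description of how THEMIS works: the `suggest_corrections` function, the `cutoff` value and the sample terms are all hypothetical.

```python
import difflib


def suggest_corrections(tokens, dictionary, cutoff=0.8):
    """Map unrecognised tokens to their closest dictionary term, if any.

    cutoff is an assumed similarity threshold between 0 and 1.
    """
    known = {word.lower(): word for word in dictionary}
    suggestions = {}
    for token in tokens:
        if token.lower() in known:
            continue  # already a recognised term
        matches = difflib.get_close_matches(
            token.lower(), list(known), n=1, cutoff=cutoff
        )
        if matches:
            suggestions[token] = known[matches[0]]
    return suggestions


# Invented project dictionary of proper nouns and jargon.
dictionary = {"Greenwich", "Woolwich", "probate"}
dictionary.add("Bexley")  # the dictionary can grow as new terms are found

suggestions = suggest_corrections(
    ["Greenwlch", "Bexley", "pr0bate", "table"], dictionary
)
print(suggestions)
```

Suggestions of this sort would normally be presented to an operator for confirmation, since a close match is not always the correct reading.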