Compart - Document- and Output-Management


Data: Eyes and Ears of the AI - Digitization in Customer Communication


Automated Document Processing

The global volume of data continues to grow strongly. Above all, unstructured data in the form of photos, audio files and videos as well as presentations and text documents will grow disproportionately - according to the market research institute IDC by an average of 62 percent annually. By 2022, this data type is expected to account for around 93 percent of the total volume.¹

Unstructured data includes, according to a Gartner definition, "all content that does not correspond to a specific, predefined data model. It's usually human-generated and person-related content that doesn't fit well into databases." But it often contains valuable customer and behavioural information, the evaluation of which can be important for well-founded decisions.

In addition, in-depth analysis of unstructured data forms the basis for better and expanded services, which can even lead to completely new business models. IDC expects that by 2020, companies that analyze all relevant data will achieve a productivity gain of $430 billion over less analytically oriented competitors.²


Reading time: 6 min

  • Digitalization means Automation
  • Different treatment of Data Required
  • Requirements for AI-driven Document Management

Currently, companies are still looking for truly efficient solutions to convert unstructured data into structured data. They face a number of challenges, ranging from the question of geographic location, the type of data storage and governance, to securing and analyzing this information in local and cloud environments. So it is hardly surprising that the MIT Sloan Group classifies 80 percent of all data as untrustworthy, inaccessible or not analyzable. IDC estimates that by 2020 up to 37 percent of the information in the "digital universe" could be valuable if analyzed.³

Digitization Means Automation

One thing is certain: Structured and analyzable data are the basic prerequisite for automated document processing. This refers to the extensive automation and standardization of processes, so that "human intervention" is less and less necessary ("dark processing"). Routine tasks such as service invoicing, confirmation of address and tariff changes or appointment scheduling are already handled by software solutions, voice assistants and chatbots based on AI algorithms (self-learning systems).

What's more, even content with a high creative share, such as technical essays and the like, will sooner or later be generated by AI systems. Already today there are programs that can produce simple Wikipedia articles with simple syntax and grammar. You define certain reference points such as structure and keywords (for a text about a city, for example, the number of inhabitants, year of foundation, town twinning and geographical data). The system then retrieves the necessary data from Wikidata, fills in the corresponding stored text modules, which follow a simple grammar (subject - predicate - object), and merges everything into a finished text.
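The template approach described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the city facts, the key names and the three templates are invented for the example, and a real system would retrieve the facts from Wikidata rather than from a hard-coded dictionary.

```python
# Hypothetical facts for one city; a real system would fetch these from Wikidata.
CITY_FACTS = {
    "name": "Musterstadt",
    "population": 120000,
    "founded": 1250,
    "twin_town": "Exampleville",
}

# Each stored text module follows a simple subject - predicate - object pattern.
# The first entry of each pair is the reference point that must be present.
TEMPLATES = [
    ("population", "{name} has {population} inhabitants."),
    ("founded", "{name} was founded in {founded}."),
    ("twin_town", "{name} is twinned with {twin_town}."),
]


def generate_text(facts: dict) -> str:
    """Fill every template whose reference point is present in the facts."""
    sentences = [tpl.format(**facts) for key, tpl in TEMPLATES if key in facts]
    return " ".join(sentences)


article = generate_text(CITY_FACTS)
print(article)
```

Reference points that are missing from the facts are simply skipped, so the same templates work for sparsely described cities.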

Many still remember the appearance of Google CEO Sundar Pichai at the I/O developer conference in May 2018, when he introduced the voice assistant "Duplex": The system is able to make telephone calls independently without the person called noticing that they are dealing with an "artificial intelligence".⁴

With other processes, such as the cancellation of an insurance policy or the release of an invoice for more than 50,000 euros, it is certain - partly due to regulatory requirements - that a clerk will continue to look at them in the future. But it is only a matter of time before such sensitive areas are also automated. The more reliable the systems become, the higher the threshold for automated processing can ultimately be set. However, this requires correct handling of the data.
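The idea of a threshold that separates dark processing from manual review can be illustrated as follows. This is a sketch only: the `Invoice` class, the `route` function and the limit constant are hypothetical names, and the 50,000 euro value is simply taken from the example in the text.

```python
from dataclasses import dataclass

# Assumed limit, taken from the 50,000 euro example in the text.
AUTO_RELEASE_LIMIT_EUR = 50_000


@dataclass
class Invoice:
    invoice_id: str
    amount_eur: float


def route(invoice: Invoice, limit: float = AUTO_RELEASE_LIMIT_EUR) -> str:
    """Return 'dark' for fully automated release, 'clerk' for manual review."""
    return "dark" if invoice.amount_eur <= limit else "clerk"


invoices = [Invoice("A-1", 1200.0), Invoice("A-2", 75000.0)]
decisions = {inv.invoice_id: route(inv) for inv in invoices}
print(decisions)
```

Raising the `limit` argument as confidence in the system grows moves more cases into dark processing without changing any other code.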

Harald Grumser, founder of Compart AG, puts it in a nutshell: "Digital processes need access to the content of documents, and artificial intelligence also needs eyes and ears. It is therefore becoming increasingly important to obtain the data required for automated communication right from the start, to provide it with a structure and to store it correctly."


Documents Are the Human-Readable Representation of Data


This applies in particular to automated document processing and output management as the interface between classical (paper-based) and electronic communication. Typically, digital data is converted into analog data on the output side (e.g. when printing, but also when transforming text content into audio files ("text-to-speech")). In the inbox (input management), exactly the opposite happens: Analog data is converted into electronic documents (e.g. when scanning, but also when converting audio/video files into readable content) - albeit not necessarily in very high quality.

The challenge now is to transform the information and data generated in all areas of inbound and outbound communication into a structured form and store it in the right "data pots" so that it is available for all processes of document and output management - from the capture of incoming messages (input management) to the creation and processing of documents and their output.

It is irrelevant on which digital or analog medium a document is sent or displayed: It is always about the data, because a document is ultimately only its respective representation in a form readable by humans - whereby a distinction must be made here between non-coded and coded documents.

In this context, two major trends should be mentioned, which are becoming more and more important and have almost replaced other developments:

  • XML (Extensible Markup Language) as a markup language for complex, hierarchical data, and
  • JSON (JavaScript Object Notation) as a compact data format (similar to XML, only simpler), which today is mainly used in web services. (see also the glossary).

Both technologies have proven themselves for the description and definition of structured data and will certainly play an even greater role.
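To make the comparison concrete, the following sketch writes one and the same record as JSON and as XML using only the Python standard library. The customer record and its field names are invented for illustration.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical customer record used only for illustration.
record = {"customer": {"id": "4711", "name": "Erika Mustermann", "city": "Berlin"}}

# JSON: nested objects map directly onto dictionaries.
json_text = json.dumps(record)

# XML: the same hierarchy expressed as nested elements plus an attribute.
customer = ET.Element("customer", id=record["customer"]["id"])
ET.SubElement(customer, "name").text = record["customer"]["name"]
ET.SubElement(customer, "city").text = record["customer"]["city"]
xml_text = ET.tostring(customer, encoding="unicode")

print(json_text)
print(xml_text)
```

Both outputs carry the same information; the XML variant additionally has to decide what becomes an attribute and what becomes a child element.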

Data Must Be Checked, Transferred and Stored Correctly

To ensure that the structured data is actually available for automated document processing, it is important that it is stored correctly. Here, non-relational databases such as NoSQL (including the subcategories Graph Database and RDF) now offer new possibilities. Their great advantage over relational databases is that they can manage data even in very complex contexts and thus enable very specific queries (see also the "Glossary").

One of the best known applications for this is Wikidata, the knowledge database of the online encyclopedia Wikipedia, in which tens of millions of facts are now stored. If, for example, you want to know how many Bundesliga players who were born in Berlin are married to Egyptian women, you will certainly find what you are looking for here. Admittedly, this is a very unusual example, but one that makes the significance of the subject clear. The aim is to derive new connections and knowledge from structured data by means of algorithms (ontologies). This is where artificial intelligence (AI) comes into play, which can then be used to formulate complex queries (see the "Glossary").
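How such a multi-step query works on linked facts can be shown with a tiny in-memory fact base. This is not the Wikidata API or its query language; the facts, identifiers and predicate names below are invented, and the sketch only demonstrates the principle of following relations from one data object to the next.

```python
# Hypothetical facts as (subject, predicate, object) triples.
FACTS = {
    ("player_a", "plays_in", "Bundesliga"),
    ("player_a", "born_in", "Berlin"),
    ("player_a", "spouse", "person_x"),
    ("person_x", "citizen_of", "Egypt"),
    ("player_b", "plays_in", "Bundesliga"),
    ("player_b", "born_in", "Hamburg"),
    ("player_b", "spouse", "person_y"),
    ("person_y", "citizen_of", "Egypt"),
}


def objects(subject: str, predicate: str) -> set:
    """All objects linked to the subject by the given predicate."""
    return {o for s, p, o in FACTS if s == subject and p == predicate}


def subjects(predicate: str, obj: str) -> set:
    """All subjects linked to the object by the given predicate."""
    return {s for s, p, o in FACTS if p == predicate and o == obj}


def berlin_players_with_egyptian_spouse() -> set:
    # Step 1: intersect two simple conditions on the player.
    candidates = subjects("plays_in", "Bundesliga") & subjects("born_in", "Berlin")
    # Step 2: follow the spouse relation and test a property of the spouse.
    return {
        player
        for player in candidates
        if any("Egypt" in objects(spouse, "citizen_of")
               for spouse in objects(player, "spouse"))
    }


result = berlin_players_with_egyptian_spouse()
print(result)
```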

A further important topic in this context is that structured data must be checked when it is stored - something that is often not done today. XML Schema, for example, is a proven method for verifying the correctness and completeness of an XML file. Errors caused by unchecked data can be very serious.
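The kind of check meant here can be approximated with a small hand-written validator. Note the assumption: this is a stand-in that uses only the Python standard library, not a real XML Schema validation, and the `<invoice>` structure and rule table are invented for the example. In practice a schema-aware validator would take the place of `check_invoice`.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule set: required child elements of <invoice> and their types.
REQUIRED = {"number": str, "amount": float, "currency": str}


def check_invoice(xml_text: str) -> list:
    """Return a list of problems; an empty list means the document passed."""
    problems = []
    root = ET.fromstring(xml_text)
    if root.tag != "invoice":
        problems.append(f"unexpected root element <{root.tag}>")
    for name, expected_type in REQUIRED.items():
        node = root.find(name)
        if node is None or not (node.text or "").strip():
            problems.append(f"missing <{name}>")
            continue
        try:
            expected_type(node.text)  # completeness is not enough, types count too
        except ValueError:
            problems.append(f"<{name}> is not a valid {expected_type.__name__}")
    return problems


good = "<invoice><number>R-1</number><amount>99.50</amount><currency>EUR</currency></invoice>"
bad = "<invoice><number>R-2</number><amount>ninety</amount></invoice>"
print(check_invoice(good))
print(check_invoice(bad))
```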

Consistent data verification is therefore essential. Last but not least, data must also be converted from one format into another using rules. There are many options for this today; one of the best known is certainly the programming language XSLT (see also the "Glossary"). But there are also other rule languages.
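Rule-based conversion can be illustrated without XSLT itself. The sketch below is a simplified stand-in written in Python: a table of mapping rules (invented for this example) says which element of a source XML document ends up in which field of a target JSON document.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical mapping rules: source XML path -> target JSON field.
RULES = {
    "header/number": "invoiceNumber",
    "header/date": "issueDate",
    "total/amount": "totalAmount",
}


def transform(xml_text: str, rules: dict = RULES) -> str:
    """Apply the mapping rules and return the target document as JSON text."""
    root = ET.fromstring(xml_text)
    target = {}
    for path, field in rules.items():
        node = root.find(path)
        if node is not None and node.text is not None:
            target[field] = node.text.strip()
    return json.dumps(target, sort_keys=True)


source = (
    "<invoice><header><number>R-1</number><date>2019-09-23</date></header>"
    "<total><amount>99.50</amount></total></invoice>"
)
converted = transform(source)
print(converted)
```

Because the rules live in data rather than in code, a new target format only needs a new rule table, which is the same design idea that an XSLT stylesheet follows.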

Instead of Destroying Content....

Anyone who wants to further increase the degree of automation of document processing in customer communication, in the sense of the next stage of digitization, must ensure structured, consistent and centrally available data. For automated document processing and output management, this means preserving the content of documents as completely as possible right from the start instead of destroying it - as is often observed in the electronic inbox of companies, for example.

The problem here: In many companies, incoming e-mails are still rasterized, i.e. converted into an image format, only to make parts of the document content interpretable again afterwards by means of OCR technology. This is the "deepest document Middle Ages". It wastes resources unnecessarily, especially when you consider that email attachments today can be quite complex documents with tens of pages.

Above all, however, this media discontinuity is tantamount to a "data meltdown": electronic documents (e-mails), which in themselves could be read and processed by IT systems, are first converted into TIFF, PNG or JPG files. Content thus turns into "pixel clouds". In other words, the actual content is first converted into raster images and then laboriously made "readable" again using Optical Character Recognition (OCR). This is accompanied by the loss of semantic structural information, which is necessary for later reuse.

How nice would it be, for example, if you could convert e-mail attachments of any type into structured PDF files immediately after receipt? This would lay the foundation for long-term, revision-proof archiving; after all, the conversion from PDF to PDF/A is only a small step.

...Preserve It as the Basis for Further Automation

Consider the following example: A leading German insurance group receives tens of thousands of e-mails daily via a central electronic mailbox, both from end customers and from external and internal sales partners. Immediately after receipt, the system automatically "triggers" the following processes:

  • Conversion of the actual e-mail ("body") to PDF/A
  • Individual conversion of the e-mail attachment (e.g. various Office formats, image files such as TIFF, JPG, etc.) to PDF/A
  • Merging of the e-mail body with the corresponding attachments and generation of a single PDF/A file per business transaction
  • At the same time, all important information is read from the file (extracted) and stored centrally for downstream processes (e.g. generation of reply letters on an AI basis, case-closing processing, archiving).

Everything runs automatically and without media discontinuity. The clerk receives the document in a standardized format, without having to worry about preparation (classification, making legible).
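The extraction step in the list above can be sketched with Python's standard `email` package. This is a minimal illustration, not the insurer's actual system: the sample message, addresses and file name are invented, and the conversion to PDF/A is deliberately left out because it requires a dedicated conversion component. The point is that body, attachments and metadata can be read directly from the message without any detour via images and OCR.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

SAMPLE_PDF = b"%PDF-1.4 sample"  # placeholder bytes, not a real PDF


def build_sample() -> bytes:
    """Create a hypothetical incoming e-mail with one attachment."""
    msg = EmailMessage()
    msg["From"] = "customer@example.com"
    msg["To"] = "inbox@example.com"
    msg["Subject"] = "Change of address, policy 12345"
    msg.set_content("Please update my address.")
    msg.add_attachment(SAMPLE_PDF, maintype="application",
                       subtype="pdf", filename="form.pdf")
    return msg.as_bytes()


def extract(raw: bytes) -> dict:
    """Read metadata, body text and attachment details from the raw message."""
    msg = message_from_bytes(raw, policy=policy.default)
    body = msg.get_body(preferencelist=("plain",))
    return {
        "sender": str(msg["From"]),
        "subject": str(msg["Subject"]),
        "body": body.get_content().strip() if body else "",
        "attachments": [
            (part.get_filename(), part.get_content_type(), len(part.get_content()))
            for part in msg.iter_attachments()
        ],
    }


info = extract(build_sample())
print(info)
```

The resulting dictionary is the kind of structured record that could be stored centrally for downstream processes such as reply generation or archiving.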

The insurer could still "split" the workflow into dark and interactive processing. During dark processing, every incoming e-mail plus attachment is automatically converted into a PDF/A file, transferred to the clerk and finally archived.

Interactive processing, on the other hand, involves the "intelligent" compilation of e-mail documents of different file formats into an electronic dossier (customer file/process). The clerk first opens the e-mail and the attachment in the mail client (Outlook, Lotus Notes, etc.) or in the specialist case-processing application and decides what needs to be edited. The normal workflow then applies as with dark processing: conversion - forwarding - processing - archiving.

The interactive variant is particularly useful if not all documents have to be archived. Modern input management systems are now capable of automatically recognizing all common formats of e-mail attachments and converting them into a predefined standard format (e.g. PDF/A or PDF/UA). In addition, they extract all necessary data from the documents at the same time and store it centrally.

Such scenarios can be implemented, for example, with systems such as DocBridge® Conversion Hub, whose linchpin is a central conversion instance. Its core is a kind of "dispatcher", which analyses every incoming message (e-mail, fax, SMS, messenger service, letter/paper) and automatically decides which format is optimal for the document in question and how further processing is to take place. DocBridge® Conversion Hub also includes an OCR (Optical Character Recognition) function for extracting content and metadata.
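The dispatcher principle can be illustrated with a simple routing table. This sketch says nothing about how DocBridge® Conversion Hub is actually implemented: the MIME types, the route names and the fallback are assumptions chosen purely to show the idea of deciding on a conversion route per incoming item.

```python
# Hypothetical routing table: MIME type -> name of a conversion route.
ROUTES = {
    "application/pdf": "pdf_to_pdfa",
    "image/tiff": "ocr_then_pdfa",
    "image/jpeg": "ocr_then_pdfa",
    "text/plain": "text_to_pdfa",
    "audio/ogg": "speech_to_text",
}
DEFAULT_ROUTE = "manual_review"  # assumed fallback for unknown formats


def dispatch(mime_type: str) -> str:
    """Pick the conversion route for one incoming item."""
    return ROUTES.get(mime_type.lower().strip(), DEFAULT_ROUTE)


incoming = ["application/pdf", "IMAGE/TIFF", "application/x-unknown"]
plan = [dispatch(m) for m in incoming]
print(plan)
```

Only raster images are sent through OCR here; content that is already machine-readable keeps its structure.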


¹ CIO online, 09/23/2019 ("AI paves the way to unstructured information").
² and ³ Ibid.
⁴ The example of an agreement for a hairdresser's appointment showed the new dimension of intelligent speech systems such as "Duplex": Previous systems can usually be recognized as "robots" within a few words (unnatural sounding voice, wrong emphasis, choppy sentences, wrong or no response to requests). Not so AI tools of the new generation: They are quite able to capture content with complex syntax and "talk" so skilfully with people that they do not notice who or what their counterpart is.

Background: Data, Documents and Processes in Customer Communication - Key Message

In principle: A document is the human-readable representation of data, whereby a distinction must be made between non-coded and coded variants.

1. Non-Coded Documents

  • are pictures, voice recordings and videos
  • can be converted into encoded data, but this is complex and usually multi-level, for example using text recognition methods such as OCR (Optical Character Recognition) with subsequent structure recognition, in order to finally obtain the raw data ("real data recognition").

2. Coded Documents Are

  • the pure representation of data in a form readable by humans (e.g. sales chart, ZUGFeRD or XInvoice)
  • content with a high creative share (e.g. German essay, contract, Wikipedia article) which has a structure so that data can be extracted from it (e.g. contracts with a fixed structure such as paragraphs, chapters or Wikipedia articles from which all data contained in the text can also be extracted via the stored Wikidata database). Data can even be extracted from very prosaic texts, even if they are very complex.

3. Data

  • Simple data are tables, e.g. relational database, Excel spreadsheet
  • Complex data usually has a structure, for example in XML or JSON format
    Databases such as NoSQL (including Graph Database and RDF) offer the possibility to centrally manage structured data in complex contexts. They thus form the basis for very complex queries (keyword Big Data). A prominent example of this is Wikidata, the knowledge database of Wikipedia (see also glossary).
  • Metadata is data about data and always a question of location (who created the document when and where?). In principle, it plays no role for automated document processing.
  • Data must be checked, for example using XML schemas (see also Glossary)
  • Data must be transferred, e.g. by rules such as XSLT (see also Glossary).

4. Processes

  • in customer communication mean nothing other than creating and changing data or documents as their representation
  • can take place with human interaction, then one usually speaks of business processes.
  • can also be established without manual intervention, in which case they are purely (automated) technical processes ("dark processing"). A typical example of this is invoices that are created and sent automatically at time X - either electronically (ZUGFeRD or XInvoice), as an email attachment, download file (customer portal) or by traditional mail.
  • Automation means exchanging data, not documents.

Automated Document Processing
DocBridge® Conversion Hub

The high-performance, scalable and seamlessly integrated DocBridge® Conversion Hub platform goes beyond conventional document conversion software in terms of scope and intention. Probably the most important advantage of the solution developed by Compart is the almost unlimited format variety: There is practically no document type that cannot be processed by DocBridge® Conversion Hub.

Technological foundation of an automated, digitized input management creates the prerequisite for AI-controlled document processing

This reflects Compart's profound know-how as a specialist for data streams in document and output management. Even voice messages, for example from messenger services (WhatsApp, Viber, Line), images in various formats, e-mails including attachments and content in very proprietary or outdated formats can be processed with the solution. DocBridge® Conversion Hub acts as a kind of "funnel" that receives, recognizes and prepares every incoming electronic document, regardless of its format, i.e. converts it into a readable and analyzable format. At the same time, it extracts the relevant data and thus lays the foundations for automated further processing, including in AI-based processes.

The starting point for the development of DocBridge® Conversion Hub was that insurance companies, banks and utilities, but also the public sector are increasingly confronted with electronic documents in their inbox. Analyzing and classifying these quickly, automatically and as far as possible without media discontinuity is the greatest challenge today in input management for companies with a high volume of documents.

Further Details of this Platform

  • Acceptance and assignment of incoming (electronic) content of any type (e-mails plus attachments, scans, faxes, digitally generated files including Office formats such as Word or Apache OpenOffice, voice and messenger messages, images generated by mobile devices)
  • Determination of the optimal conversion route
  • Typical digital inbox activities including text analysis (e.g. opening/reading e-mails, "unpacking" e-mail attachments)
  • Scalable and high-performance conversion of incoming documents into archivable, barrier-free, protected and searchable formats (e.g. PDF/A, PDF/UA, XInvoice) in accordance with rules - depending on requirements
  • Extraction and central storage of data based on freely definable rules and criteria as a basis for automated further processing of documents
  • Mapping of various conversion/processing routes - "Layer" as upstream dispatcher (How should the document be prepared?)
  • High efficiency in electronic input processing, independent of the input channel (no need for certain activities such as printing and scanning)
  • Establishment of location-independent, scalable conversion strategies using a single platform
  • Also integration of decentralized Office documents (individual correspondence)
  • No loss of content
  • Fast and targeted information research even in complex, large files
  • Basis for the centralization/consolidation of heterogeneous archive systems
  • Possibility of interlocking with OM processes (output management)
  • High compliance (e.g. audit-proof long-term archiving according to PDF/A-3, accessibility according to WCAG, protection of sensitive data by redacting or anonymizing certain document contents, GDPR)
  • Significantly lower risk of errors in document processing
  • Reduced costs (e.g. by eliminating redundant or superfluous conversion solutions including maintenance, user training and license renewals)
  • High performance and scalability ("cushioning" of peak loads)
  • Parallel processing
  • Automated load distribution (optimal utilization)
  • Bidirectional communication
  • Platform independence
  • Seamless integration into existing document and output management structures of companies
  • Monitoring via Web UI
  • Configuration via GUI
  • Deployment concept
  • Priority control
  • Integration in cloud architectures possible
  • High reliability

Glossary and Deeper Knowledge


Wikidata

Wikidata is a freely accessible and jointly maintained knowledge database which, among other things, aims to support the online encyclopedia Wikipedia. The project was launched in 2012 by Wikimedia Deutschland e.V., a non-profit organisation for the dissemination of free knowledge, and provides a common source of certain types of data for Wikimedia projects (e.g. birth dates, universal data) that can be used in all Wikimedia articles.

Wikidata structures the knowledge of the world in language-independent data objects that can be enriched with various information. People as well as machines and IT systems can access this treasure trove of data and generate new knowledge. Wikibase, the software behind Wikidata, is also available as free and open software for all people.

One of the many examples of an open data project created with Wikibase is Lingua Libre. The directory of free audio voice recordings aims to preserve the sound of the world's languages and the pronunciation of their words in the form of structured data and make it available to all people. The project originated in France, where the initiators were keen to promote endangered regional languages. One advantage of Lingua Libre is that interested users can add to the records - with just a few words, proverbs or entire sentences. So even people who are not familiar with phonetic transcription can hear how individual words are pronounced at the click of a mouse. With the launch of the Lingua Libre Wikibase installation in 2018, around 100,000 audio files in 46 languages were added to the directory.

Meanwhile up to 1200 recordings per hour can be recorded via the online application and uploaded directly into the free media archive Wikimedia Commons. Via the connection to Wikidata, the recorded sounds enrich Wikimedia projects such as Wikipedia or the free dictionary Wiktionary in particular - but they also support linguistics specialists in their research.

Since its launch, Wikidata has recorded a comparatively strong growth in content pages, with over 60 million data objects now available (as of September 2019).

Relational Databases

Relational databases are used for electronic data management in computer systems and are based on a table-based relational database model. The basis of their concept is the relation. It represents a mathematical description of a table and is a well-defined term in the mathematical sense. Operations on these relations are determined by relational algebra.

The associated database management system is called RDBMS (Relational Database Management System). The SQL (Structured Query Language) language, whose theoretical basis is relational algebra, is predominantly used for querying and manipulating the data. The relational database model was first proposed in 1970 by Edgar F. Codd and is still an established standard for databases despite some criticism.
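A short example with SQLite, which ships with Python, shows the relational model in practice: data lives in tables, and SQL combines them with a join. The table layout and sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE document (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer(id),
    kind TEXT
);
INSERT INTO customer VALUES (1, 'Erika Mustermann'), (2, 'Max Mustermann');
INSERT INTO document VALUES (10, 1, 'invoice'), (11, 1, 'contract'), (12, 2, 'invoice');
""")

# Relational algebra in action: join the two relations, then group and count.
rows = conn.execute("""
SELECT c.name, COUNT(d.id)
FROM customer c JOIN document d ON d.customer_id = c.id
GROUP BY c.id
ORDER BY c.name
""").fetchall()
print(rows)
conn.close()
```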


Ontologies

Ontologies in computer science are mostly linguistically expressed and formally ordered representations of a set of concepts and the relations existing between them in a certain subject area. They are used to exchange "knowledge" in digital and formal form between application programs and services. Knowledge includes both general knowledge and knowledge about very specific topics and processes.

Ontologies serve as a means of structuring and exchanging data in order to

  • merge already existing knowledge
  • search and edit existing knowledge
  • generate new instances from types of knowledge

Ontologies contain inference and integrity rules, i.e. rules on conclusions and on ensuring their validity. They have experienced an upswing with the idea of the semantic web in recent years and are thus part of the representation of knowledge in the field of artificial intelligence. In contrast to a taxonomy, which forms only a hierarchical subdivision, an ontology represents a network of information with logical relations.

NoSQL ("Not only SQL")

NoSQL ("Not only SQL") refers to databases that follow a non-relational approach and thus break with the long history of relational databases. These data stores do not require fixed table schemata and try to avoid joins (result tables). They scale horizontally. In the academic environment they are often referred to as "structured storage".

On Architecture and Demarcation

Relational databases typically suffer from performance problems with data-intensive applications such as indexing large volumes of documents, high-load websites, and streaming media applications. Relational databases are only efficient if they are optimized for frequent but small transactions or for large batch transactions with infrequent write access. However, they cannot cope well with high data requirements and frequent data changes at the same time.

NoSQL, on the other hand, handles many simultaneous read/write requests quite well. NoSQL implementations usually support distributed databases with redundant data storage on numerous servers, for example using a distributed hash table. This allows the systems to be easily expanded and to withstand server failures.
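The principle of spreading data redundantly over several servers can be sketched as follows. This is a deliberately simplified model, not a real distributed hash table: the node names and replica count are assumptions, the "servers" are plain dictionaries, and simple modulo hashing is used where production systems typically rely on more elaborate schemes such as consistent hashing.

```python
import hashlib

SERVERS = ["node-0", "node-1", "node-2", "node-3"]  # hypothetical cluster
REPLICAS = 2                                         # copies kept per key

# Each "server" is modelled as an in-memory dictionary.
stores = {name: {} for name in SERVERS}


def owners(key: str) -> list:
    """Hash the key to a start node and take the following nodes as replicas."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(SERVERS)
    return [SERVERS[(start + i) % len(SERVERS)] for i in range(REPLICAS)]


def put(key: str, value: str) -> None:
    for node in owners(key):
        stores[node][key] = value


def get(key: str, failed=frozenset()):
    """Read from the first replica that is still reachable."""
    for node in owners(key):
        if node not in failed and key in stores[node]:
            return stores[node][key]
    return None


put("doc-1", "policy 12345")
primary = owners("doc-1")[0]
print(get("doc-1", failed={primary}))  # still answered by the second replica
```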

RDF (Resource Description Framework)

RDF (Resource Description Framework) describes a technical approach on the Internet to formulate logical statements about arbitrary things (resources). Originally, RDF was designed by the World Wide Web Consortium (W3C) as a standard for describing metadata.

Meanwhile RDF is regarded as a fundamental building block of the "semantic web". RDF is similar to the classical methods for modeling concepts (UML class diagrams, entity relationship model). In the RDF model, each statement consists of the three units subject, predicate, and object, whereby a resource is described in more detail as a subject with another resource or a value (literal) as an object.

With another resource as a predicate, these three units form a triple. In order to have globally unique identifiers for resources, these are formed, by convention, in the same way as URLs. The identifiers for commonly used descriptions (e.g. for metadata) are known to RDF developers and can therefore be used worldwide for the same purpose, which among other things enables programs to display the data meaningfully for humans.
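A triple can be modelled in a few lines. The sketch below is an illustration only: the `Triple` type and the `example.org` document identifier are invented, the predicate is the widely used Dublin Core "creator" term, and the output follows the simple line-based N-Triples notation without handling the escaping of special characters in literals.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str            # resource identifier
    predicate: str          # resource identifier naming the relation
    obj: str                # another resource identifier or a literal value
    obj_is_literal: bool = False


def to_ntriples(t: Triple) -> str:
    """Serialize one triple as a line in N-Triples style (no escaping)."""
    obj = f'"{t.obj}"' if t.obj_is_literal else f"<{t.obj}>"
    return f"<{t.subject}> <{t.predicate}> {obj} ."


statement = Triple(
    subject="http://example.org/doc/42",
    predicate="http://purl.org/dc/elements/1.1/creator",
    obj="Erika Mustermann",
    obj_is_literal=True,
)
line = to_ntriples(statement)
print(line)
```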


XML (Extensible Markup Language)

The Extensible Markup Language (XML) is a markup language used to represent hierarchically structured data in the format of a text file that can be read by both humans and machines.

XML is also used for the platform- and implementation-independent exchange of data between computer systems, especially via the Internet, and was published by the World Wide Web Consortium (W3C) on February 10, 1998. The current version is the fifth edition dated November 26, 2008. XML is a meta language on the basis of which application-specific languages are defined by structural and content restrictions. These restrictions are expressed either by a Document Type Definition (DTD) or by an XML Schema. Examples of XML languages are: RSS, MathML, GraphML, XHTML, XAML, Scalable Vector Graphics (SVG), GPX, but also XML Schema itself.


XSLT (XSL Transformation)

XSL Transformation, XSLT for short, is a programming language for transforming XML documents. It is part of the Extensible Stylesheet Language (XSL) and represents a Turing complete language.

XSLT was published as a recommendation by the World Wide Web Consortium (W3C) on November 16, 1999. XSLT is based on the logical tree structure of an XML document and is used to define conversion rules. XSLT programs, so-called XSLT stylesheets, are themselves structured according to the rules of the XML standard.

The stylesheets are read by special software, the XSLT processors, which use these instructions to convert one or more XML documents into the desired output format. XSLT processors are also integrated in many modern web browsers, such as Opera (version 9 or higher), Firefox and Internet Explorer version 5 (version 6 or higher with full XSLT 1.0 support). Along with XSL-FO and XPath, XSLT is one part of XSL.

Semantic Web

The "Semantic Web" extends the Internet in such a way as to make data more exchangeable between computers and easier for them to use; for example, the term "Bremen" can be supplemented in a web document with information as to whether a ship, family or city name is meant here. This additional information explicates the otherwise unstructured data. Standards for the publication and use of machine-readable data (especially RDF) are used for implementation.

While people can infer such information from the given context (from the whole text, from the kind of publication or the category within it, from pictures etc.) and unconsciously build up such links, machines must first be taught this context; for this purpose the contents are linked with further information.

The "Semantic Web" conceptually describes a "Giant Global Graph". All things of interest are identified and, provided with a unique address, created as "nodes", which in turn are connected to each other by "edges" (also uniquely named). Individual documents on the Web then describe a series of edges, and the totality of all these edges corresponds to the global graph.

JSON (JavaScript Object Notation)

JSON (JavaScript Object Notation) is a compact data format in an easily readable text form for the purpose of data exchange between applications. Every valid JSON document is intended to be valid JavaScript and could therefore be interpreted by eval(). However, due to small differences in the set of Unicode characters allowed, it is possible to create JSON objects that are not accepted by a standard-compliant JavaScript interpreter. Apart from that, however, JSON is independent of the programming language. Parsers exist in practically all common languages.

JSON was originally specified by Douglas Crockford. Currently it is specified by two standards, RFC 8259 and ECMA-404.

JSON is used for the transmission and storage of structured data; it serves as a data format for data transmission. Especially for web applications and mobile apps it is often used in combination with JavaScript, Ajax or WebSockets to transfer data between the client and the server.
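The following sketch shows how a dedicated parser, here Python's standard `json` module, maps JSON values onto native types and rejects input that is not valid JSON. The sample document is invented for illustration.

```python
import json

text = '{"customer": {"id": 4711, "tags": ["vip", "paperless"]}, "active": true, "note": null}'

# Objects become dicts, arrays become lists, true/false/null become True/False/None.
data = json.loads(text)

# Serializing and parsing again yields the same structure.
round_trip = json.loads(json.dumps(data))

# A strict parser refuses syntax that JavaScript would accept, e.g. single quotes.
try:
    json.loads("{'customer': 1}")
    rejected = False
except json.JSONDecodeError:
    rejected = True

print(data["customer"]["tags"], data["active"], data["note"], rejected)
```

Using a parser instead of evaluating the text as program code also avoids executing anything that an untrusted sender might have embedded in the message.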


Source: Wikipedia