Complex documentation processing

The HungaroDoc system is an end-to-end documentation processing system that integrates the reading, understanding, and distribution of documentation into a single procedure.

The goal is to ease the work of everyone who currently has to perform these processes manually.

For organizations, institutions and companies, the task to be performed is twofold:

  • Saving and processing documents that contain valuable information accumulated over the years. Much of this older documentation is unstructured and paper-based, and voice-based documentation can also be of great importance. The task is not only to digitize these documents, but also to group and index them by important features (entities), so that a given document can later be retrieved from a database and processed on the basis of those features.
  • Processing of continuously generated electronic documentation in an integrated workflow.

The HungaroDoc system is a self-developed, Hungarian-language processing and archiving system for structured, unstructured, and voice-based documentation. It is designed for this special task and can perform the entire documentation management process with high-efficiency RPA (Robotic Process Automation) tools and minimal human intervention.

Documents include voice-based materials that become written documents following a speech-to-text procedure.

HungaroDoc can work together with other intelligent platforms and is easy to integrate with them. The system uses the Kofax Capture and Kofax Transformation platforms, as well as customer-specific Hungarian-language speech-to-text platforms.

The process contains manual steps only where necessary, for example to compile and group the unstructured documents to be scanned in a logical order, and to validate and check the completed task.

An integral part of the HungaroDoc system is the HungaroNer natural language processing (NLP) system.

Complex documentation processing

Natural language processing

HungaroNer is Hungarocom's self-developed Hungarian natural language processing system. It has four main applications:

  • Named entity recognition (NER)
  • Sentiment analysis
  • Tagging
  • Network analysis
Named Entity Recognition is a natural language processing (NLP) technique that automatically extracts information from unstructured documents. Such information (entities) can be names, geographical locations, addresses, etc.

In sentiment analysis, we determine the emotional polarity of a document, that is, the positive, negative, or indifferent emotions of the text.

During Tagging, the input text is processed and interpreted by the system, and then the words are classified into predefined classes, taking into account word relationships and word usage.

In Network analysis, the system starts from specific entities and, after extracting them from the input documents, presents the relationships between them. The network of connections is determined according to search criteria that specify which entities should be checked for relationships with which other entities.

Natural language processing of the loaded documents is performed by a self-developed text-processing robot, which also handles inflected word forms (lemmatization).

The system can be programmed quickly and efficiently to execute customer-specific tasks. HungaroNer can also be used in conjunction with software robots (RPA).

The figure shows the structure of the system.
  • Names
    The system handles name entities according to Hungarian language rules. The HungaroNer name recognition algorithm handles personal names according to the following structure. The method correctly handles institution names that contain a personal name: in "Neumann János Egyetem", for example, "Neumann János" is not treated as a personal name. The algorithm does not confuse personal names with the names of locations, institutions, brands, etc. The system also handles suffixed forms of names.
  • Dates
    The algorithm recognizes dates written in a variety of formats.
  • Addresses
    The system recognizes all types of addresses written according to Hungarian orthography.
  • Phone numbers
    The system ensures that a given phone number cannot be confused with other numbers, e.g. standards or other numbering.
  • Email addresses
    The system recognizes all standard format e-mail addresses.
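As a rough illustration of pattern-based extraction for some of the entity types listed above, the following minimal sketch uses regular expressions. The patterns, the `PATTERNS` table, and the `extract_entities` function are assumptions made for this example; they are not HungaroNer's actual rules, which are not described here, and they cover only a few common formats.

```python
import re

# Illustrative patterns only (assumptions, not HungaroNer's actual rules).
PATTERNS = {
    # Hungarian-style numeric dates such as "2021.03.15." or "2021. 03. 15."
    "date": re.compile(r"\b\d{4}\.\s?\d{1,2}\.\s?\d{1,2}\.?"),
    # Simplified e-mail pattern; the full address syntax is broader.
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    # Phone numbers that start with the +36 or 06 prefix. Requiring the
    # prefix keeps standard numbers such as "60601-1" from matching.
    "phone": re.compile(
        r"(?<!\w)(?:\+36|06)[\s-]?\d{1,2}[\s-]?\d{3}[\s-]?\d{3,4}\b"
    ),
}


def extract_entities(text):
    """Return (label, matched text, start offset) tuples in document order."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((label, match.group(0), match.start()))
    return sorted(found, key=lambda item: item[2])


sample = "Kelt: 2021.03.15. Tel.: +36 30 123 4567, e-mail: info@example.hu"
for label, value, _ in extract_entities(sample):
    print(label, value)
```

Names and addresses are deliberately left out: recognizing them reliably needs linguistic rules and dictionaries rather than simple patterns.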
Named Entity Recognition

The figure shows the GUI of HungaroNer. Any other type of GUI can be designed according to customer needs.

The left field lists the processed documents, the middle field displays the processed document itself, and the windows of the right field contain the extracted entities.

Different entities appear in the text body with different color markings. Recognized entities with or without suffix can be marked separately.
Recognized entities

For the input document, the weighted mean occurrence values of the terms expressing emotion are determined separately for the positive and negative polarities, and are then corrected by taking into account the incidence rates of the two polarities.

Emotional polarity is a measurable value in the range of -1 to +1, expressing judgment from very negative to indifferent to very positive.

The algorithm handles and evaluates negations (e.g., friendly vs. not friendly), intensifiers (e.g., annoying vs. terribly annoying), and negated intensifiers (e.g., terribly annoying vs. not so terribly annoying).
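The handling of negation and intensification can be sketched as follows. This is a minimal sketch under stated assumptions: the toy English lexicon, the weights, and the `polarity` function are hypothetical and chosen only to mirror the examples above. It uses a plain mean of term scores and does not reproduce HungaroNer's weighting or its correction by incidence rates.

```python
# Toy lexicon and rules: assumptions for illustration, not HungaroNer's data.
LEXICON = {"friendly": 0.6, "annoying": -0.5}
INTENSIFIERS = {"terribly": 1.6, "very": 1.3}
NEGATIONS = {"not"}
FILLERS = {"so"}  # words skipped between a negation and an intensifier


def polarity(text):
    """Return a polarity in [-1, 1]; 0.0 if no emotional term is found."""
    tokens = text.lower().split()
    scores = []
    for i, token in enumerate(tokens):
        if token not in LEXICON:
            continue
        score = LEXICON[token]
        j = i - 1
        boost = 1.0
        if j >= 0 and tokens[j] in INTENSIFIERS:
            boost = INTENSIFIERS[tokens[j]]
            j -= 1
        while j >= 0 and tokens[j] in FILLERS:
            j -= 1
        negated = j >= 0 and tokens[j] in NEGATIONS
        if negated and boost != 1.0:
            # Negated intensifier: soften the term instead of flipping it.
            score = score / boost
        elif negated:
            score = -score
        else:
            score = score * boost
        scores.append(max(-1.0, min(1.0, score)))
    return sum(scores) / len(scores) if scores else 0.0


for phrase in ("friendly", "not friendly", "annoying",
               "terribly annoying", "not so terribly annoying"):
    print(f"{phrase!r}: {polarity(phrase):+.2f}")
```

In this sketch a negated intensifier weakens the term rather than reversing it, so "not so terribly annoying" stays mildly negative. That is one plausible reading of the behavior described, not a confirmed detail of the system.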

Further corrections are made for texts containing extremely negative, obscene terms.

Emotional polarity determination has so far been applied only to Hungarian-language documents.

The screen shows the analyzed text together with its emotional polarity.

This feature is useful when documents need to be handled according to their emotional polarity, for example in customer service.
Sentiment analysis, emotional polarity
Tagging is the grouping, sorting, and transmission of documents by content. Typical applications:

  • directing documents based on content to the appropriate administrator
  • thematic selection of articles and news contents
  • identification of problematic cases that deserve a quick reaction and special attention

Tagging is based on knowledge bases that contain appropriate keywords, word connections, and phrases. The search for a given category uses knowledge bases with different weighting factors, which makes fine-tuning and different approaches possible. Categories can be created freely and modified easily. Multi-categorization is also possible, so a document can be classified into more than one category.

The classification is based on a point value. When determining the point value, the system takes into account the length of the input document as well as the useful size of the knowledge base. Both are freely adjustable through the parameters of a specific formula, so that the hit probability for a given text can be specified and made independent of the length of the document.
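The point-value idea can be sketched as follows. The specific formula used by HungaroNer is not given here, so the normalization below, the parameters `length_exp` and `kb_exp`, the threshold, and the example knowledge bases are all assumptions made for illustration. The sketch also matches single words only, whereas the knowledge bases described above also contain word connections and phrases.

```python
import re

# Hypothetical knowledge bases: term -> weight (illustrative values only).
KNOWLEDGE_BASES = {
    "billing": {"invoice": 2.0, "payment": 1.5, "refund": 2.5},
    "complaint": {"complaint": 3.0, "refund": 1.0, "unacceptable": 2.0},
}


def category_score(text, kb, length_exp=1.0, kb_exp=0.5):
    """Weighted hit sum, normalized by document length and knowledge-base size.

    length_exp and kb_exp are assumed tuning parameters; with length_exp=1.0
    the score does not change when the same text is repeated.
    """
    words = re.findall(r"\w+", text.lower())
    if not words or not kb:
        return 0.0
    raw = sum(kb.get(word, 0.0) for word in words)
    return raw / (len(words) ** length_exp * len(kb) ** kb_exp)


def tag(text, threshold=0.1):
    """Return every category whose score reaches the threshold."""
    scores = {
        name: category_score(text, kb) for name, kb in KNOWLEDGE_BASES.items()
    }
    return sorted(name for name, score in scores.items() if score >= threshold)


message = "I want a refund for this invoice, the delay is unacceptable"
print(tag(message))
```

Because each category is scored independently against the threshold, a document can receive several tags, which corresponds to the multi-categorization described above.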

The network analysis function of HungaroNer is used to discover relations between selected entities. HungaroNer performs this task as a built-in function. The figure shows an illustrative example of how to construct a graph representing a network of connections.

The task is to discover the relationships between the names and addresses that appear in the input documents. In the first step, the entities are extracted from the documents; in this example, these are the name and address entities.

The result of network analysis is a graph of nodes and edges that makes the relationships clearly visible. In the window that appears, we first select the nodes, that is, the entities whose relations we are interested in, and then the edges, that is, the entities through which we are looking for connections between the nodes. The system then builds the connection graph, and after the View Graph menu item is selected, the connection network is displayed.
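The name and address example can be sketched as a small graph construction in which names are nodes and shared addresses are edges. This is a minimal sketch, assuming the entities have already been extracted; the `build_graph` function, the input layout, and the sample data are hypothetical and do not describe HungaroNer's internal representation.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(documents):
    """Connect two names whenever they occur with the same address.

    documents: list of dicts with 'names' and 'addresses' lists holding
    entities that were already extracted (assumed input layout).
    Returns {(name_a, name_b): set of shared addresses}.
    """
    names_by_address = defaultdict(set)
    for doc in documents:
        for address in doc["addresses"]:
            names_by_address[address].update(doc["names"])

    edges = defaultdict(set)
    for address, names in names_by_address.items():
        # Every pair of names sharing this address gets an edge.
        for name_a, name_b in combinations(sorted(names), 2):
            edges[(name_a, name_b)].add(address)
    return dict(edges)


docs = [
    {"names": ["Kovács Anna"], "addresses": ["1011 Budapest, Fő utca 1."]},
    {"names": ["Nagy Béla"], "addresses": ["1011 Budapest, Fő utca 1."]},
    {"names": ["Szabó Éva"], "addresses": ["6000 Kecskemét, Izsáki út 10."]},
]
for (name_a, name_b), shared in build_graph(docs).items():
    print(name_a, "<->", name_b, "via", sorted(shared))
```

Names that share no address with anyone, such as the third one in the sample, simply have no edges in the result.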
Network analysis