Complex documentation processing
The goal is to make this work easier for everyone who currently has to carry out these processes manually.
For organizations, institutions and companies, the task to be performed is twofold:
- Saving and processing documents that contain valuable information accumulated over the years. Much of the older documentation is unstructured and paper-based, and the processing of voice-based documentation can also be of great importance. The task is not only to digitize these documents but also to group and index them by important features (entities), so that a given document can later be easily retrieved from a database and processed on that basis.
- Processing of continuously generated electronic documentation in an integrated workflow.
The HungaroDoc system is a self-developed, Hungarian-language processing and archiving system for structured, unstructured, and voice-based documentation. Designed for this special task, it can perform the entire documentation management process with high-efficiency RPA (Robotic Process Automation) tools and minimal human intervention.
Documents include voice-based materials that become written documents following a speech-to-text procedure.
HungaroDoc can work together with other intelligent platforms, which are easy to integrate. The system uses the Kofax Capture and Kofax Transformation platforms, as well as customer-specific Hungarian-language speech-to-text platforms.
The process includes manual steps only where necessary, for example to compile and group the unstructured documents to be scanned in a logical order, and to validate and check the completed task.
An integral part of the HungaroDoc system is the HungaroNer natural language processing (NLP) system.
Natural language processing
- Named entity recognition (NER)
- Sentiment analysis
- Tagging
- Network analysis
In sentiment analysis, we determine the emotional polarity of a document, that is, whether the text expresses positive, negative, or indifferent emotions.
In tagging, the system processes and interprets the input text and then classifies it into predefined classes, taking word relationships and word usage into account.
In network analysis, the system starts from specific entities and, after extracting them from the input documents, presents the relationships between them. It determines the network of connections according to search criteria that specify which entities should be checked for relationships with which other entities.
Natural language processing of the loaded documents is performed by a self-developed word processing robot, which also handles inflected word forms (lemmatization).
The system can be programmed quickly and efficiently to carry out customer-specific tasks. HungaroNer can also be used together with software robots (RPA).
The figure shows the structure of the system.
It handles named entities according to Hungarian language rules. The HungaroNer name recognition algorithm handles personal names according to the following structure. The method correctly handles institution names that contain a personal name: in "Neumann János Egyetem", for example, "Neumann János" is not treated as a personal name. The algorithm does not confuse personal names with the names of locations, institutions, brands, and so on. The system also handles suffixed forms of names.
The algorithm recognizes dates written in a variety of formats.
The system recognizes all types of addresses written according to the Hungarian spelling.
- Phone numbers
The system ensures that a given phone number cannot be confused with other numbers, e.g. standards or other numbering.
- Email addresses
The system recognizes all standard format e-mail addresses.
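As a rough illustration of pattern-based recognition for dates, phone numbers, and e-mail addresses, the following minimal sketch uses regular expressions. The patterns, the `extract_entities` function, and the sample prefixes are assumptions made for this example; they are not the HungaroNer rules and cover only a few common Hungarian formats.

```python
import re

# Hypothetical patterns for illustration only; not the HungaroNer rule set.
HU_MONTHS = (
    "január|február|március|április|május|június|"
    "július|augusztus|szeptember|október|november|december"
)

# Year first, as in Hungarian usage: "2021. március 5.", "2021.03.05.", "2021. 03. 05."
DATE_RE = re.compile(
    r"\b\d{4}\.\s?(?:\d{1,2}\.\s?|(?:" + HU_MONTHS + r")\s)\d{1,2}\.?",
    re.IGNORECASE,
)

# Requiring the +36 or 06 prefix keeps numbers such as "ISO 9001:2015" out.
PHONE_RE = re.compile(
    r"(?<!\w)(?:\+36|06)[\s\-/]?\(?\d{1,2}\)?[\s\-/]?\d{3}[\s\-]?\d{3,4}(?!\d)"
)

# Simplified e-mail pattern; it does not cover every address the standards allow.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_entities(text):
    """Return the matched substrings per entity type, in order of appearance."""
    return {
        "date": DATE_RE.findall(text),
        "phone": PHONE_RE.findall(text),
        "email": EMAIL_RE.findall(text),
    }


sample = (
    "Kovács Péter 2021. március 5-én írt a peter.kovacs@example.hu címről, "
    "telefon: +36 30 123 4567. Szabvány: MSZ EN ISO 9001:2015."
)
entities = extract_entities(sample)
```

Anchoring the phone pattern to a country or trunk prefix is one simple way to avoid treating standard numbers as phone numbers; a production system would need far richer context rules.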
The left field lists the processed documents. The middle field displays a processed document, and the windows of the right field contain the entities extracted from it.
Different entities appear in the text body with different color markings. Recognized entities with or without a suffix can be marked separately.
Emotional polarity is a measurable value in the range of -1 to +1, expressing judgment from very negative to indifferent to very positive.
The algorithm handles and evaluates negations (e.g., friendly vs. not friendly), intensifiers (e.g., annoying vs. terribly annoying), and negated intensifiers (e.g., terribly annoying vs. not so terribly annoying).
Further corrections will be made for texts containing extremely negative obscene terms.
Emotional polarity is determined only for Hungarian-language documents.
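The handling of negation and intensification described above can be sketched with a small lexicon-based scorer. This is a toy illustration under stated assumptions: the English lexicon, the weights, the damping factor, and the `polarity` function are all hypothetical and do not reproduce the HungaroNer algorithm, which works on Hungarian text.

```python
# Toy lexicon and weights: hypothetical values chosen for illustration only.
LEXICON = {"friendly": 0.6, "helpful": 0.5, "annoying": -0.5, "awful": -0.9}
INTENSIFIERS = {"terribly": 1.5, "very": 1.3}
NEGATORS = {"not", "never"}
NEGATION_DAMPING = 0.5  # "not friendly" is treated as mildly negative


def polarity(text: str) -> float:
    """Return a polarity score in [-1, 1]; 0.0 when no lexicon word is found."""
    scores = []
    negated = False
    boost = 1.0
    for raw in text.lower().split():
        word = raw.strip(".,!?;:\"'()")
        if word in NEGATORS:
            negated = True
        elif word in INTENSIFIERS:
            boost *= INTENSIFIERS[word]
        elif word in LEXICON:
            value = LEXICON[word]
            if negated and boost > 1.0:
                # Negated intensifier: weaker than the plain word, same sign.
                value = value / boost
            elif negated:
                # Plain negation: flip the sign and dampen the strength.
                value = -value * NEGATION_DAMPING
            else:
                value = value * boost
            scores.append(value)
            # Modifiers apply only to the next sentiment word, then reset.
            negated, boost = False, 1.0
    if not scores:
        return 0.0
    mean = sum(scores) / len(scores)
    return max(-1.0, min(1.0, mean))
```

With these assumed weights, "terribly annoying" scores lower than "annoying", while "not so terribly annoying" stays negative but closer to zero. The final clamp keeps every result inside the -1 to +1 range.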
The screen displays the analyzed text together with its emotional polarity.
This feature is useful when documents need to be handled according to their emotional polarity, for example in customer service.
- directing documents based on content to the appropriate administrator
- thematic selection of articles and news contents
- identification of problematic cases that deserve a quick reaction and special attention
Tagging is based on knowledge bases that contain appropriate keywords, word combinations, and phrases. The search for a given category uses knowledge bases with different weighting factors. Weighted knowledge bases make fine-tuning and different approaches possible. Categories can be created freely and modified easily. Multi-categorization is also possible, so a document can be classified into more than one category.
The classification is based on a point value. When determining the point value, the system takes into account the length of the input document as well as the useful size of the knowledge base. Both are freely adjustable through the parameters of a specific formula, so that the hit probability for a given text can be tuned and made independent of the length of the document.
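The scoring idea can be sketched as follows. The source does not give the actual formula, so the normalization used here (weighted hits divided by powers of the document length and of the knowledge base size) is an assumption, as are the names `KnowledgeBase`, `score`, `classify`, and the parameters `alpha`, `beta`, and `threshold`.

```python
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class KnowledgeBase:
    category: str
    terms: Dict[str, float]  # keyword or phrase -> weight (hypothetical values)


def tokenize(text: str) -> List[str]:
    return re.findall(r"\w+", text.lower())


def count_phrase(tokens: List[str], phrase: List[str]) -> int:
    """Count occurrences of a (possibly multi-word) phrase in the token list."""
    n = len(phrase)
    if n == 0:
        return 0
    return sum(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


def score(text: str, kb: KnowledgeBase, alpha: float = 0.5, beta: float = 0.5) -> float:
    """Assumed formula: weighted hits / (doc_length**alpha * kb_size**beta)."""
    tokens = tokenize(text)
    if not tokens or not kb.terms:
        return 0.0
    points = sum(
        weight * count_phrase(tokens, tokenize(term))
        for term, weight in kb.terms.items()
    )
    return points / (len(tokens) ** alpha * len(kb.terms) ** beta)


def classify(text: str, knowledge_bases: List[KnowledgeBase],
             threshold: float = 0.5, **params) -> List[str]:
    """Return every category whose score reaches the threshold, best first."""
    scored = [(score(text, kb, **params), kb.category) for kb in knowledge_bases]
    return [category for s, category in sorted(scored, reverse=True) if s >= threshold]


kbs = [
    KnowledgeBase("billing", {"invoice": 2.0, "payment": 1.0, "late fee": 3.0}),
    KnowledgeBase("support", {"error": 2.0, "not working": 3.0}),
]
text = "The invoice shows a late fee but the payment page is not working"
categories = classify(text, kbs)
```

In this sketch the sample text reaches the threshold in both categories, which illustrates multi-categorization. Setting `alpha=1.0` turns the score into a per-token density, so repeating the same text does not change the result.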
The task is to discover the relationships between the names and addresses that appear in the input documents. In the first step we extract the entities from the documents; in this example, these are the name and address entities.
The result of network analysis is a graph of nodes and edges that clearly reveals the relationships. In the window that appears, we first select the nodes, that is, the entities whose relationships we are interested in, and then the edges, that is, the entities through which we look for connections between the nodes. The system then builds the connection graph, and selecting the View Graph menu item displays the connection network.
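One way to picture this step is a small co-occurrence graph in which names are nodes and a shared address forms an edge. The sketch below assumes the entities have already been extracted per document; the `build_graph` function, the input layout, and the sample data are hypothetical and only illustrate the idea.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(documents, node_type, edge_type):
    """Connect node entities that share at least one edge entity.

    documents: list of dicts mapping an entity type to a set of entity strings.
    Returns a dict mapping (node_a, node_b) to the set of shared edge entities.
    """
    # Collect, for every edge entity (e.g. an address), the nodes seen with it.
    nodes_by_edge = defaultdict(set)
    for doc in documents:
        for edge_value in doc.get(edge_type, ()):
            nodes_by_edge[edge_value].update(doc.get(node_type, ()))

    # Every pair of nodes sharing an edge entity becomes an edge of the graph.
    graph = defaultdict(set)
    for edge_value, nodes in nodes_by_edge.items():
        for a, b in combinations(sorted(nodes), 2):
            graph[(a, b)].add(edge_value)
    return dict(graph)


# Made-up sample data for illustration.
docs = [
    {"name": {"Kovács Péter", "Nagy Anna"}, "address": {"1051 Budapest, Fő utca 1."}},
    {"name": {"Szabó Éva"}, "address": {"1051 Budapest, Fő utca 1."}},
    {"name": {"Tóth Béla"}, "address": {"6000 Kecskemét, Izsáki út 10."}},
]
graph = build_graph(docs, node_type="name", edge_type="address")
```

Here the three people linked to the same Budapest address end up pairwise connected, even across documents, while the person with a unique address has no edges. Swapping `node_type` and `edge_type` would instead link addresses through shared names.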