SAP AI: Document Information Extraction

Introduction

Extracting specific fields and other essential information from documents by hand is labor-intensive. SAP’s Document Information Extraction service automates much of this work. Users create a service instance in their SAP Business Technology Platform (BTP) account and upload documents through the user interface, the Swagger API interface, or the Python SDK. Several algorithms then work together to perform the extraction. The service is available on a free tier and can be customized if the pre-trained document types do not fit a specific use case.

How It Works

The extraction process begins with the All Information Docs feature, which reads barcodes or QR codes on the document and gives those values priority over values found by other means. Next, SAP’s internal OCR (Optical Character Recognition) algorithm extracts the raw text. The OCR uses a Convolutional Neural Network (CNN) to detect the text lines in the document, and a transformer decoder then reads the text of each detected line. Users can also access the raw OCR results directly via the Docs API.
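
The priority rule described above can be illustrated with a minimal sketch. This is not the service's implementation: the function name, the field names, and the idea of representing each source as a plain dictionary are assumptions made for illustration only.

```python
def merge_field_values(barcode_values, ocr_values):
    """Combine two field dictionaries; barcode/QR values win on conflict.

    Both arguments are hypothetical {field_name: value} dictionaries.
    """
    merged = dict(ocr_values)      # start from the values found via OCR
    merged.update(barcode_values)  # barcode/QR values overwrite on conflict
    return merged


# Hypothetical example: the OCR misread a zero as the letter "O".
ocr_values = {"documentNumber": "INV-1O23", "currencyCode": "EUR"}
barcode_values = {"documentNumber": "INV-1023"}

result = merge_field_values(barcode_values, ocr_values)
print(result)
```

Fields that appear only in the OCR output are kept unchanged, so the barcode acts as an override rather than a replacement for the whole result.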

The Doc Reader model extracts the most important header fields from the document. It is trained exclusively on document images and their key-value string pairs, so it can draw on the large amount of historical data that manual information extraction has already produced. The architecture of Doc Reader comprises three main modules: the encoder, the attention layer, and the decoder.

Encoder: The encoder is a feed-forward neural network made up of several convolutional blocks, creating the memory used by Doc Reader. This memory is further enriched with one-hot encoded spatial information.
Attention Layer: The attention layer utilizes the spatially aware memory along with the previous character, the input key, and the previous attention weights to generate a new attention area.
Decoder: The decoder then uses this information in conjunction with the previous state to output characters.
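
To make the idea of a spatially aware memory and an attention area more concrete, here is a small pure-Python sketch. It is a deliberate simplification and not Doc Reader's actual architecture: the function names are hypothetical, the encoder output is replaced by a tiny hand-made feature grid, and a single query vector stands in for the combination of previous character, input key, and previous attention weights.

```python
import math


def add_one_hot_positions(features):
    """Append one-hot row and column codes to every cell of a feature grid.

    features[row][col] is a list of floats (a stand-in for encoder output).
    """
    height, width = len(features), len(features[0])
    memory = []
    for r, row in enumerate(features):
        memory_row = []
        for c, cell in enumerate(row):
            row_code = [1.0 if i == r else 0.0 for i in range(height)]
            col_code = [1.0 if j == c else 0.0 for j in range(width)]
            memory_row.append(list(cell) + row_code + col_code)
        memory.append(memory_row)
    return memory


def attend(memory, query):
    """Softmax attention over all cells; returns (weights, context vector)."""
    scores = [[sum(m * q for m, q in zip(cell, query)) for cell in row]
              for row in memory]
    top = max(max(row) for row in scores)  # subtract max for stability
    exps = [[math.exp(s - top) for s in row] for row in scores]
    total = sum(sum(row) for row in exps)
    weights = [[e / total for e in row] for row in exps]
    context = [0.0] * len(query)
    for row_w, row_m in zip(weights, memory):
        for w, cell in zip(row_w, row_m):
            for k, value in enumerate(cell):
                context[k] += w * value
    return weights, context


# A 2 x 3 grid with one (zero-valued) feature per cell.
features = [[[0.0] for _ in range(3)] for _ in range(2)]
memory = add_one_hot_positions(features)

# Hypothetical query that favors row 1 and column 2.
query = [0.0, 0.0, 5.0, 0.0, 0.0, 5.0]
weights, context = attend(memory, query)
```

Because the position codes are part of the memory, the query can select a region of the page even when the visual features are identical, which is the role the spatial information plays in the description above.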

Additionally, the service employs the Chargrid model to process line items. This model takes the OCR results as input and builds a representation of the document in which each character is a distinct channel. This differs from standard image CNNs, whose input has only one or three color channels. A fully convolutional neural network then processes this representation strictly feed-forward, with one encoder and two decoders. The decoders consist of convolutional blocks that reverse the encoder's downsampling: the first performs semantic segmentation, while the second outputs the bounding boxes for the line items.
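
The one-channel-per-character input can be sketched in a few lines of Python. This is an illustrative simplification under stated assumptions: the function name, the tiny alphabet, and the use of one grid cell per character (rather than the area covered by each character on the page) are all hypothetical choices, not details confirmed by the source.

```python
def build_char_grid(ocr_chars, height, width, alphabet):
    """Return grid[channel][row][col] with one 0/1 channel per character.

    ocr_chars is a list of (character, row, col) tuples, a hypothetical
    stand-in for positioned OCR output.
    """
    index = {ch: i for i, ch in enumerate(alphabet)}
    grid = [[[0] * width for _ in range(height)]
            for _ in range(len(alphabet))]
    for ch, row, col in ocr_chars:
        channel = index.get(ch)
        if channel is not None:  # characters outside the alphabet are skipped
            grid[channel][row][col] = 1
    return grid


# Hypothetical OCR output for the text "42€" on a 1 x 4 grid.
ocr_chars = [("4", 0, 0), ("2", 0, 1), ("€", 0, 2)]
alphabet = "0123456789€"
grid = build_char_grid(ocr_chars, height=1, width=4, alphabet=alphabet)

print(len(grid))  # number of channels equals the alphabet size
```

With this layout the network input has as many channels as there are characters in the alphabet, in place of the one or three color channels of an ordinary image.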

For those interested in learning more about SAP's Document Information Extraction service, a wealth of resources, including blog posts and tutorials, can be found on the SAP Community homepage.

Keywords

SAP
Document Information Extraction
BTP
OCR
Convolutional Neural Network
Doc Reader
Key-Value Pairs
Attention Layer
Semantic Segmentation
Bounding Boxes

FAQ

Q1: What is SAP's Document Information Extraction service?
A1: It is a service that automates the extraction of specific fields and important information from documents, simplifying a traditionally manual and labor-intensive task.

Q2: How can I use the Document Information Extraction service?
A2: You can create an instance in your BTP account and upload your documents using the user interface, Swagger interface, or Python SDK.

Q3: Is there a cost associated with using this service?
A3: The service is available on a free tier, allowing users to explore its capabilities without incurring costs.

Q4: What kind of algorithms are used in this service?
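A4: The service combines several algorithms. Its OCR uses a convolutional neural network (CNN) to detect text lines and a transformer decoder to read the text of each line. The Doc Reader model extracts header fields, and a fully convolutional network processes line items.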

Q5: Can I customize the pre-trained document types?
A5: Yes, SAP’s Document Information Extraction service is customizable if the pre-trained document types do not fit your specific use case.